On Sunday 24 June 2007 03:32, Frank Sweetser wrote:
> Kern Sibbald wrote:
> > I may want to look at tcpdump output later, but at the current time, I
> > don't understand what the error is as far as Bacula sees it. Normally the
> > socket is closed by the side sending the data, so there should be no race
> > conditions. From what I recall from your bug report, the error occurred
> > in the middle of a transmission, but now you are talking about closing
> > the socket, and that I don't understand. At the point that Bacula is
> > closing the socket, to the best of my knowledge it ignores any errors.
> >
> > So, could you start by explaining this from the top down rather than
> > from the bottom up?
Hello Frank,

> I am, as you've noticed, far more used to dealing with the network from the
> bottom up rather than from the application point of view, but I'll give it
> my best shot. I'll start with the verifiable facts I've found so far, and
> then see if I can explain my working theory - my apologies in advance if
> this is a bit long-winded.
>
> The original symptoms I ran into were that a subset of my nightly backup
> jobs began failing. The error reports all declared that the remote host
> had unexpectedly closed the network connection. We do fairly extensive
> monitoring, but none of the logs showed any network anomalies. Host
> monitoring showed that neither host ever went off the network, and switch
> logs did not show the relevant ethernet ports going down. Network flow
> logs showed that the TCP stream closed down ungracefully, via RST packets
> instead of FIN, which agreed with the Bacula error messages.
>
> The next thing I did was get full network captures of a failed job, taken
> from both the SD/Dir system and the FD system. Each end had an identical
> view of the packets sent and received, which ruled out the possibility of
> an intermediate network glitch dropping packets.
>
> At this point, since only a few jobs were failing, and those
> intermittently, I started looking at packet captures of more jobs to try
> to find the difference between failing and successful ones. Much to my
> surprise, all of the jobs in my configuration - both successful and
> failed - showed at least one direction closing down ungracefully with a
> RST. After more experimentation, and a few packet captures from others, I
> found that the RST only occurred in configurations with TLS enabled.
> Unencrypted communications always closed down cleanly with a pair of
> FIN/ACK handshakes.
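[Editor's note: for readers following the thread, the two close styles Frank
observed on the wire are easy to reproduce. This is a generic Python sketch,
not Bacula code; `peer_view` and the port choices are illustrative. It shows
what the peer sees after a normal close (FIN, i.e. a clean EOF) versus an
abortive close (RST, forced here via SO_LINGER with a zero timeout).]

```python
import socket
import struct
import threading
import time

def peer_view(graceful):
    """Return how a client perceives the server's close: FIN vs RST."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    port = srv.getsockname()[1]

    def serve():
        conn, _ = srv.accept()
        if not graceful:
            # SO_LINGER with a zero timeout makes close() abort the
            # connection with a RST instead of the normal FIN handshake.
            conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                            struct.pack("ii", 1, 0))
        conn.close()
        srv.close()

    t = threading.Thread(target=serve)
    t.start()
    client = socket.socket()
    client.connect(("127.0.0.1", port))
    time.sleep(0.2)  # give the server's close time to reach us
    try:
        data = client.recv(1024)
        result = "clean EOF (FIN)" if data == b"" else "unexpected data"
    except ConnectionResetError:
        result = "connection reset (RST)"
    client.close()
    t.join()
    return result

print(peer_view(True))   # graceful close
print(peer_view(False))  # abortive close
```

In a packet capture, the first case ends with the FIN/ACK exchange Frank saw
on unencrypted jobs; the second ends with the bare RST he saw on TLS jobs.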
> After a little Google research, the only likely scenarios I could find for
> the TCP connection closing down ungracefully were either miscommunication
> between the applications at either end (i.e., one end tried to continue
> sending and receiving after the other end closed the socket), or calling
> close() on a socket that still had data pending in the queue. If I
> understood the paper correctly, this refers to data that the other end has
> not yet read and allowed the TCP layer to ACK back to the sending host.
>
> We now descend into educated guesswork. My working theory at this point is
> based on the latter scenario, and the fact you mentioned that the TLS code
> is non-blocking, while the unencrypted code is blocking. I don't know
> enough C++ to know if this is what's actually going on, but I'm picturing
> a timeline of events that, depending on timing between the OS and the
> applications, looks something like this:
>
> FD                 | OS                   | OS               | SD
> -------------------+----------------------+------------------+---------------
> send() final data  |                      |                  |
> blocks and         |                      |                  |
> completion status  |                      |                  |
>                    |                      |                  |
>                    | put data on send     |                  |
>                    | queue                |                  |
>                    |                      |                  |
>                    | send 1st half of     |                  |
>                    | queue                |                  |
>                    |                      |                  |
>                    |                      | recv 1st half    |
>                    |                      |                  |
>                    |                      |                  | recv 1st half
>                    |                      |                  |
> close() socket     |                      |                  |
>                    |                      |                  |
>                    | data still in queue, |                  |
>                    | so RST               |                  |
>                    |                      |                  |
>                    |                      | recv RST, close  |
>                    |                      | socket           |
>                    |                      |                  |
>                    |                      |                  | recv returns
>                    |                      |                  | nothing, SD
>                    |                      |                  | only gets part
>                    |                      |                  | of expected
>                    |                      |                  | data
>
> So the end result is that because the TLS code is non-blocking, it's
> possible for Bacula to get ahead of the OS and call close() on the socket
> before the other end has received all of the data in the transmit queue.
> Since the unencrypted code is blocking, the send call is guaranteed not to
> return until the other end has gotten all of the data and the queue is
> empty, making it safe to close the socket.
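[Editor's note: the "close() with data pending" scenario from the timeline
above can be demonstrated in a few lines of Python. This is a generic sketch;
the FD/SD role names in the comments are only by analogy to Frank's diagram,
and `close_with_unread_data` is an illustrative name, not anything in Bacula.
Closing a socket while unread data sits in its receive queue makes the kernel
answer with a RST rather than a FIN.]

```python
import socket
import time

def close_with_unread_data():
    """Show that closing a socket with unread data queued elicits a RST."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)

    sender = socket.socket()             # plays the role of the FD
    sender.connect(srv.getsockname())
    receiver, _ = srv.accept()           # plays the role of the SD

    sender.sendall(b"final job status")  # data the receiver never reads
    time.sleep(0.2)                      # let it land in the receive queue
    receiver.close()                     # unread data pending -> kernel RST
    srv.close()
    time.sleep(0.2)                      # let the RST reach the sender

    try:
        sender.recv(1024)
        return "graceful close"
    except ConnectionResetError:
        return "connection reset (RST)"
    finally:
        sender.close()
```

Had the receiver read the queued bytes before calling close(), the same
sequence would have ended with an ordinary FIN and a clean EOF at the sender.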
> Since Bacula seems fairly forgiving of any socket-related errors that
> occur after it has completed the job, the RST glitch would only cause a
> given job to fail if it had not been able to transmit the final results.
> Any other pending data, such as a heartbeat or other similar no-op, would
> cause the RST to occur but be ignored by Bacula. It's only if part of the
> job is still in the queue when the socket is closed that a job would fail.
>
> Does this make sense, or is there something I've missed here?

Yes, *everything* makes sense, and this is one of the best analyses and
pieces of "educated guesswork" that I have ever seen! Can I hire you? :-)

I think it is rather trivial to fix, but I want to look at the code a bit as
there are several possibilities. In the meantime, a couple of questions:

Can you reproduce it relatively easily? (i.e. if I fix it, will you know in a
reasonable time?)

What version of Bacula are you running? (so I can send you a new file or a
patch)

Regards,

Kern

_______________________________________________
Bacula-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-devel
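[Editor's note: as a postscript for readers who land on this thread with the
same symptom - whatever fix was ultimately applied to Bacula, the textbook
way to avoid this class of RST is a half-close followed by a drain: shut down
the sending direction, read until the peer's EOF, and only then close. A
generic sketch, not the actual Bacula patch:]

```python
import socket

def graceful_close(sock):
    """Half-close, then drain, then close - avoids the RST-on-close race.

    Generic sketch of the standard shutdown pattern; not Bacula's fix.
    """
    sock.shutdown(socket.SHUT_WR)  # send our FIN: "no more data from us"
    while sock.recv(4096):         # discard anything still in flight,
        pass                       # until the peer's FIN arrives (recv -> b"")
    sock.close()                   # both queues drained; close is clean
```

Because the drain loop does not return until the peer has sent its own FIN,
neither side can close while data is still queued, so no RST is generated.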
