Re: [Bacula-devel] [bacula 0000888]: Spurious connection drops killing backups

Frank Sweetser Sat, 23 Jun 2007 18:33:01 -0700

Kern Sibbald wrote:
> I may want to look at tcpdump output later, but at the current time, I don't 
> understand what the error is as far as Bacula sees it.  Normally the socket 
> is closed by the side sending the data, so there should be no race 
> conditions.  From what I recall from your bug report the error occurred in 
> the middle of a transmission, but now you are talking about closing the 
> socket, and that I don't understand.  At the point that Bacula is closing the 
> socket, to the best of my knowledge it ignores any errors.
> 
> So, could you start by explaining this from the top down rather than from the 
> bottom up?


I am, as you've noticed, far more used to dealing with the network from the
bottom up rather than from the application point of view, but I'll give it my
best shot.  I'll start with the verifiable facts I've found so far, and then
see if I can explain my working theory - my apologies in advance if this is a
bit long winded.

The original symptoms I ran into were that a subset of my nightly backup jobs
began failing.  The error reports all declared that the remote host had
unexpectedly closed the network connection.  We do fairly extensive
monitoring, but none of the logs showed any network anomalies.  Host
monitoring showed neither host ever went off the network, and switch logs did
not show the relevant ethernet ports going down.  Network flow logs showed
that the TCP stream closed down ungracefully, via RST packets instead of FIN,
which agreed with the Bacula error messages.

The next thing I did was get full network captures of a failed job, taken from
 both the SD/Dir system and the FD system.  Each end had an identical view of
packets sent and received, which ruled out the possibility of an intermediate
network glitch dropping packets.

At this point, since only a few jobs were failing, and those intermittently, I
started looking at packet captures of more jobs to try to find the difference
between failing and successful ones.  Much to my surprise, all of the jobs in
my configuration - both successful and failed - showed at least one direction
closing down ungracefully with a RST.  After more experimentation, and a few
packet captures from others, I found that the RST only occured in
configurations with TLS enabled.  Unencrypted communications always closed
down cleanly with a pair of FIN/ACK handshakes.

After a little Google research, the only likely scenarios I could find for the
TCP connection closing down ungracefully were either miscommunication between
the applications at either end (ie, one end tried to continue sending and
receiving after the other end closed the socket), or calling close() on a
socket that still had data pending in the queue.  If I understood the paper
correctly, this is referring to data that the other end has not yet read and
allowed the TCP layer to ACK back to the sending host.

We now descend into educated guesswork.  My working theory at this point is
based on the latter scenario, and the fact you mentioned that the TLS code is
non-blocking, while the unencrypted code is blocking.  I don't know enough C++
to know if this is what's actually going on, but I'm picturing a timeline of
events that, depending on timing between the OS and the applications, look
something like this:

FD                 | OS                   | OS               | SD
-------------------+----------------------+------------------+---------------
send() final data  |                      |                  |
blocks and         |                      |                  |
completion status  |                      |                  |
                   |                      |                  |
                   | put data on send     |                  |
                   | queue                |                  |
                   |                      |                  |
                   | send 1st half of     |                  |
                   | queue                |                  |
                   |                      |                  |
                   |                      | recv 1st half    |
                   |                      |                  |
                   |                      |                  | recv 1st half
                   |                      |                  |
close() socket     |                      |                  |
                   |                      |                  |
                   | data still in queue, |                  |
                   | so RST               |                  |
                   |                      |                  |
                   |                      | recv RST, close  |
                   |                      | socket           |
                   |                      |                  |
                   |                      |                  |  recv returns
                   |                      |                  |  nothing, SD
                   |                      |                  |  only gets part
                   |                      |                  |  of expected
                   |                      |                  |  data

So the end result is that because the TLS code is non-blocking, it's possible
for Bacula to get ahead of the OS and call close() on the socket before the
other end has called received all of the data in the transmit queue.  Since
the unencrypted code is blocking, it is guaranteed that the send call won't
return until the othe rend has gotten all of the data, and the queue is empty,
making it safe to close the socket.

Since Bacula seems fairly forgiving of any socket related errors that occur
after it has completed the job, the RST glitch would only cause a given job to
fail if it had not been able to transmit the final results.  Any other pending
data, such as a heartbeat or other similar no-op, would cause the RST to occur
but be ignored by Bacula.  It's only if part of the job is still in the queue
when the socket is closed that a job would fail.

Does this make sense, or is there something I've missed here?

-- 
Frank Sweetser fs at wpi.edu  |  For every problem, there is a solution that
WPI Network Engineer          |  is simple, elegant, and wrong. - HL Mencken
    GPG fingerprint = 6174 1257 129E 0D21 D8D4  E8A3 8E39 29E3 E2E8 8CEC

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
Bacula-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-devel

Re: [Bacula-devel] [bacula 0000888]: Spurious connection drops killing backups

Reply via email to