On Sunday 24 June 2007 03:32, Frank Sweetser wrote:
> Kern Sibbald wrote:
> > I may want to look at tcpdump output later, but at the current time, I
> > don't understand what the error is as far as Bacula sees it. Normally the
> > socket is closed by the side sending the data, so there should be no race
> > conditions. From what I recall from your bug report, the error occurred
> > in the middle of a transmission, but now you are talking about closing
> > the socket, and that I don't understand. At the point that Bacula is
> > closing the socket, to the best of my knowledge it ignores any errors.
> >
> > So, could you start by explaining this from the top down rather than
> > from the bottom up?
Hello Frank,

> I am, as you've noticed, far more used to dealing with the network from the
> bottom up rather than from the application point of view, but I'll give it
> my best shot. I'll start with the verifiable facts I've found so far, and
> then see if I can explain my working theory - my apologies in advance if
> this is a bit long-winded.
>
> The original symptoms I ran into were that a subset of my nightly backup
> jobs began failing. The error reports all declared that the remote host
> had unexpectedly closed the network connection. We do fairly extensive
> monitoring, but none of the logs showed any network anomalies. Host
> monitoring showed that neither host ever went off the network, and switch
> logs did not show the relevant ethernet ports going down. Network flow
> logs showed that the TCP stream closed down ungracefully, via RST packets
> instead of FIN, which agreed with the Bacula error messages.
>
> The next thing I did was get full network captures of a failed job, taken
> from both the SD/Dir system and the FD system. Each end had an identical
> view of the packets sent and received, which ruled out the possibility of
> an intermediate network glitch dropping packets.
>
> At this point, since only a few jobs were failing, and those
> intermittently, I started looking at packet captures of more jobs to try
> to find the difference between failing and successful ones. Much to my
> surprise, all of the jobs in my configuration - both successful and
> failed - showed at least one direction closing down ungracefully with a
> RST. After more experimentation, and a few packet captures from others, I
> found that the RST only occurred in configurations with TLS enabled.
> Unencrypted communications always closed down cleanly with a pair of
> FIN/ACK handshakes.
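[Editor's note: for readers following the thread, the two close styles Frank
observed on the wire are easy to reproduce. This is a generic Python sketch,
not Bacula code; `peer_view` and the port choices are illustrative. It shows
what the peer sees after a normal close (FIN, i.e. a clean EOF) versus an
abortive close (RST, forced here via SO_LINGER with a zero timeout).]

```python
import socket
import struct
import threading
import time

def peer_view(graceful):
    """Return how a client perceives the server's close: FIN vs RST."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    port = srv.getsockname()[1]

    def serve():
        conn, _ = srv.accept()
        if not graceful:
            # SO_LINGER with a zero timeout makes close() abort the
            # connection with a RST instead of the normal FIN handshake.
            conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                            struct.pack("ii", 1, 0))
        conn.close()
        srv.close()

    t = threading.Thread(target=serve)
    t.start()
    client = socket.socket()
    client.connect(("127.0.0.1", port))
    time.sleep(0.2)  # give the server's close time to reach us
    try:
        data = client.recv(1024)
        result = "clean EOF (FIN)" if data == b"" else "unexpected data"
    except ConnectionResetError:
        result = "connection reset (RST)"
    client.close()
    t.join()
    return result

print(peer_view(True))   # graceful close
print(peer_view(False))  # abortive close
```

In a packet capture, the first case ends with the FIN/ACK exchange Frank saw
on unencrypted jobs; the second ends with the bare RST he saw on TLS jobs.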
> After a little Google research, the only likely scenarios I could find for
> the TCP connection closing down ungracefully were either miscommunication
> between the applications at either end (i.e., one end tried to continue
> sending and receiving after the other end closed the socket), or calling
> close() on a socket that still had data pending in the queue. If I
> understood the paper correctly, this refers to data that the other end has
> not yet read and allowed the TCP layer to ACK back to the sending host.
>
> We now descend into educated guesswork. My working theory at this point is
> based on the latter scenario, and the fact you mentioned that the TLS code
> is non-blocking, while the unencrypted code is blocking. I don't know
> enough C++ to know if this is what's actually going on, but I'm picturing
> a timeline of events that, depending on timing between the OS and the
> applications, looks something like this:
>
> FD                 | OS                   | OS               | SD
> -------------------+----------------------+------------------+---------------
> send() final data  |                      |                  |
> blocks and         |                      |                  |
> completion status  |                      |                  |
>                    |                      |                  |
>                    | put data on send     |                  |
>                    | queue                |                  |
>                    |                      |                  |
>                    | send 1st half of     |                  |
>                    | queue                |                  |
>                    |                      |                  |
>                    |                      | recv 1st half    |
>                    |                      |                  |
>                    |                      |                  | recv 1st half
>                    |                      |                  |
> close() socket     |                      |                  |
>                    |                      |                  |
>                    | data still in queue, |                  |
>                    | so RST               |                  |
>                    |                      |                  |
>                    |                      | recv RST, close  |
>                    |                      | socket           |
>                    |                      |                  |
>                    |                      |                  | recv returns
>                    |                      |                  | nothing, SD
>                    |                      |                  | only gets part
>                    |                      |                  | of expected
>                    |                      |                  | data
>
> So the end result is that because the TLS code is non-blocking, it's
> possible for Bacula to get ahead of the OS and call close() on the socket
> before the other end has received all of the data in the transmit queue.
> Since the unencrypted code is blocking, the send call is guaranteed not to
> return until the other end has gotten all of the data and the queue is
> empty, making it safe to close the socket.
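[Editor's note: the "close() with data pending" scenario from the timeline
above can be demonstrated in a few lines of Python. This is a generic sketch;
the FD/SD role names in the comments are only by analogy to Frank's diagram,
and `close_with_unread_data` is an illustrative name, not anything in Bacula.
Closing a socket while unread data sits in its receive queue makes the kernel
answer with a RST rather than a FIN.]

```python
import socket
import time

def close_with_unread_data():
    """Show that closing a socket with unread data queued elicits a RST."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)

    sender = socket.socket()             # plays the role of the FD
    sender.connect(srv.getsockname())
    receiver, _ = srv.accept()           # plays the role of the SD

    sender.sendall(b"final job status")  # data the receiver never reads
    time.sleep(0.2)                      # let it land in the receive queue
    receiver.close()                     # unread data pending -> kernel RST
    srv.close()
    time.sleep(0.2)                      # let the RST reach the sender

    try:
        sender.recv(1024)
        return "graceful close"
    except ConnectionResetError:
        return "connection reset (RST)"
    finally:
        sender.close()
```

Had the receiver read the queued bytes before calling close(), the same
sequence would have ended with an ordinary FIN and a clean EOF at the sender.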
> Since Bacula seems fairly forgiving of any socket-related errors that
> occur after it has completed the job, the RST glitch would only cause a
> given job to fail if it had not been able to transmit the final results.
> Any other pending data, such as a heartbeat or other similar no-op, would
> cause the RST to occur but be ignored by Bacula. It's only if part of the
> job is still in the queue when the socket is closed that a job would fail.
>
> Does this make sense, or is there something I've missed here?

Yes, *everything* makes sense, and this is one of the best analyses and
pieces of "educated guesswork" that I have ever seen! Can I hire you? :-)

I think it is rather trivial to fix, but I want to look at the code a bit as
there are several possibilities. In the meantime, a couple of questions:

Can you reproduce it relatively easily? (i.e. if I fix it, will you know in a
reasonable time?)

What version of Bacula are you running? (so I can send you a new file or a
patch)

Regards,

Kern

_______________________________________________
Bacula-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-devel
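[Editor's note: as a postscript for readers who land on this thread with the
same symptom - whatever fix was ultimately applied to Bacula, the textbook
way to avoid this class of RST is a half-close followed by a drain: shut down
the sending direction, read until the peer's EOF, and only then close. A
generic sketch, not the actual Bacula patch:]

```python
import socket

def graceful_close(sock):
    """Half-close, then drain, then close - avoids the RST-on-close race.

    Generic sketch of the standard shutdown pattern; not Bacula's fix.
    """
    sock.shutdown(socket.SHUT_WR)  # send our FIN: "no more data from us"
    while sock.recv(4096):         # discard anything still in flight,
        pass                       # until the peer's FIN arrives (recv -> b"")
    sock.close()                   # both queues drained; close is clean
```

Because the drain loop does not return until the peer has sent its own FIN,
neither side can close while data is still queued, so no RST is generated.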
