Hello,
I've been thinking about possible causes of "spurious" connection drops and
how to debug them.
As I have noted a number of times, the most likely culprit is a bad network
(in particular switches or ethernet cards that have bad firmware -- several
cases such as this are documented in the manual). Another cause of
disconnects is HP printers, which illegally use port 9100. That in itself is
OK, but under certain conditions they will sometimes broadcast on higher port
numbers such as 9101, 9102, and 9103, which are registered to and used by
Bacula.
If you think it might be HP network printers (mine peacefully co-exists with
Bacula), you can always change the HP port number, or move the Bacula ports,
or turn off the printer(s) overnight for a few days while your backups run.
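
If you decide to move the Bacula ports, a minimal sketch for the File daemon
might look like the following. The port number 9112 and the client name
rufus-fd are just placeholders for illustration, and all the other directives
in each resource stay as they are. The port has to be changed on both sides:
in the daemon's own conf file and in the matching resource in the Director's
conf file.

  # bacula-fd.conf (on the client)
  FileDaemon {
    # hypothetical name
    Name = rufus-fd
    # hypothetical port, the default is 9102
    FDport = 9112
    # ... other directives unchanged ...
  }

  # bacula-dir.conf (on the Director)
  Client {
    Name = rufus-fd
    # must match the FDport set in bacula-fd.conf
    FDPort = 9112
    # ... other directives unchanged ...
  }

The SD can be handled the same way with SDport (default 9103) in its Storage
resource and SDPort in the Director's Storage resource.
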
However, another possibility is that Bacula (say the FD or the SD) detects an
internal error or a logic error in the data received on the comm line. In
that case, it is very likely it will abruptly hang up.
In doing so, the daemon will always generate an error message (assuming there
is no bug). The difficulty is in delivering the message, because normally all
messages from the FD and SD are delivered back to the Director and then
dispatched according to the Director's message rules. It isn't always
possible to deliver the message (timing problems, or the error concerned the
connection with the Director itself), and in that case, the message will be
lost and a "spurious" hang up will be the only visible sign.
So, how to fix this? There are several ways, all of which involve changing
the SD's and FD's Messages resources to direct the error messages to a file,
to email, or to the system log in addition to sending them to the Director.
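
As a rough sketch, the three kinds of destination could look like the lines
below in an FD or SD Messages resource. The log path uses a <working-dir>
placeholder for the daemon's working directory, the mail address and the
bsmtp path are assumptions for illustration only, and in practice you would
probably pick just one of the three rather than all of them.

  Messages {
    Name = Standard
    director = rufus-dir = all, !skipped, !restored
    # 1. append to a file
    append = "<working-dir>/log" = all, !skipped, !restored
    # 2. send to the system log
    syslog = all, !skipped, !restored
    # 3. send by email (assumes bsmtp is installed at this path on the machine)
    mailcommand = "/usr/sbin/bsmtp -h localhost -f bacula -s \"Bacula daemon message\" %r"
    mail = root@localhost = all, !skipped, !restored
  }
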
I'd suggest in cases where there are unexplained drops, you direct all
messages to a file on both the FD and the SD. For example, a typical FD
Messages resource looks like this:
  # Send all messages except skipped and restored files back to Director
  Messages {
    Name = Standard
    director = rufus-dir = all, !skipped, !restored
  }
I'd suggest changing it to:
  Messages {
    Name = Standard
    director = rufus-dir = all, !skipped, !restored
    append = "<working-dir>/log" = all, !skipped, !restored
  }
where you change <working-dir> to be the path of the working directory used by
the particular daemon.
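
For the SD, the same change might look like the sketch below. I am assuming
here that the SD's Messages resource is also named Standard and starts out
sending everything to the Director; check your own bacula-sd.conf, since
yours may differ. The file name sd-log is only a suggestion, so that the FD
and SD do not write to the same file when they share a working directory on
one machine.

  Messages {
    Name = Standard
    director = rufus-dir = all
    # hypothetical file name, kept separate from the FD's log
    append = "<working-dir>/sd-log" = all
  }
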
Then when a spurious connection drop occurs, there may well be a message in
the log explaining the reason for the drop. If you implement the above, to
avoid filling your disk with log messages, be sure to remove the append line
again sometime later, or implement the logrotate code that is distributed in
<bacula-source>/scripts/logrotate.
Best regards,
Kern
_______________________________________________
Bacula-devel mailing list
https://lists.sourceforge.net/lists/listinfo/bacula-devel