Hi,

On 6/2/2007 7:43 AM, Frank Sweetser wrote:
> A couple of weeks ago, a problem started cropping up.  Jobs started failing
> with what look like network errors:
> 
> 02-Jun 01:10 lorien-sd: gkar-daily.2007-06-02_01.05.02 Fatal error:
> append.c:259 Network error on data channel. ERR=Input/output error
> 02-Jun 01:10 lorien-sd: Job write elapsed time = 00:03:16, Transfer rate =
> 4.157 M bytes/second
> 02-Jun 01:10 lorien-sd: gkar-daily.2007-06-02_01.05.02 Error: bnet.c:280 Read
> expected 65536 got 16384 from client:130.215.39.18:36643
> 02-Jun 01:10 lorien-dir: gkar-daily.2007-06-02_01.05.02 Fatal error: Network
> error with FD during Backup: ERR=No data available
> 02-Jun 01:10 lorien-dir: gkar-daily.2007-06-02_01.05.02 Fatal error: No Job
> status returned from FD.
> 02-Jun 01:10 lorien-dir: gkar-daily.2007-06-02_01.05.02 Error: Bacula 2.0.3
> (06Mar07): 02-Jun-2007 01:10:40
> 
> 
> However, I can find no evidence of any actual network problem between the
> machine running the fd and the one running both the sd and dir:
> 
>  - The network monitoring system shows no outages, and none of the switches
> and routers in between show anything out of the ordinary in the logs.
> 
>  - There is no external firewall between the two system.  Both ends are linux
> 2.6 with iptables, with non-stateful rules for all bacula traffic.
> 
>  - IP flow logs show that both ends of the FD -> SD TCP connection
> ungracefully closed down the stream with a RST after a very short idle period
> of about 10 seconds.
> 
>  - I've already tried swapping to a different NIC on the server to rule out a
> dying network card.
> 
>  - The failure occurs on different machines, ruling out something specific to
> one client, though it usually appears to affect the same one.  More
> specifically, it always seems to die around the same time - about ten minutes
> after the batch of nightly jobs start.  I have things configured to run four
> concurrent jobs, and the failures will cancel anywhere from one to four jobs.
>  When multiple jobs die, they all do so at the same time.  I can influence
> which clients get picked on by shuffling around priorities.

Well, this one looks difficult.

I suggest monitoring the memory usage of your server. I have seen the
DIR (usually) or the SD (rarely) use up all available memory, which
might leave the kernel unable to allocate memory for the network
stack.
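
To make the memory suggestion concrete, here is a minimal sketch of how the resident memory of the daemons could be sampled on Linux. It assumes the processes are named bacula-dir and bacula-sd and that /proc/<pid>/status exposes Name: and VmRSS: lines; the function names are made up for illustration and are not part of Bacula.

```python
import os
import re
import time


def parse_status(status_text):
    """Extract (process name, resident set size in KiB) from /proc/<pid>/status text."""
    name = re.search(r"^Name:\s+(\S+)", status_text, re.MULTILINE)
    rss = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return (name.group(1) if name else None,
            int(rss.group(1)) if rss else None)


def daemon_memory(names=("bacula-dir", "bacula-sd"), proc_root="/proc"):
    """Return {(name, pid): rss_kib} for running processes with the given names.

    The daemon names are an assumption; adjust them to your installation.
    """
    usage = {}
    if not os.path.isdir(proc_root):
        return usage  # not a Linux-style /proc, nothing to report
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "status")) as handle:
                name, rss = parse_status(handle.read())
        except OSError:
            continue  # process exited while we were scanning
        if name in names and rss is not None:
            usage[(name, int(entry))] = rss
    return usage


if __name__ == "__main__":
    # One timestamped sample per invocation; run it from cron to build a history.
    print(time.strftime("%Y-%m-%d %H:%M:%S"), daemon_memory())
```

Run every minute or so during the nightly backup window, this would show whether memory use climbs in the ten minutes before the jobs die.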

If memory exhaustion is the cause, the system's log files should show
something, I suppose.
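
As a rough sketch of how the system log could be searched for signs of memory pressure: the patterns below and the /var/log/messages path are assumptions and vary between kernels and distributions, so adjust both to what your system actually logs.

```python
import re

# Assumed patterns for kernel memory-pressure messages; not an exhaustive list.
MEMORY_PATTERNS = re.compile(
    r"out of memory|oom-killer|page allocation failure", re.IGNORECASE
)


def memory_pressure_lines(lines):
    """Return the log lines that match one of the assumed patterns."""
    return [line.rstrip("\n") for line in lines if MEMORY_PATTERNS.search(line)]


def scan_log(path="/var/log/messages"):
    """Scan a log file; the default path is an assumption about the syslog setup."""
    # errors="replace" keeps stray non-UTF-8 bytes from aborting the scan
    with open(path, errors="replace") as handle:
        return memory_pressure_lines(handle)
```

Comparing the timestamps of any matches with the 01:10 failures in the job log would show whether the two coincide.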

A workaround would be not to start all your jobs at once, but to run
them in batches. Lowering job concurrency will not help, because a job
waiting for an available slot to run also uses memory.

Also, you could try upgrading to the current development version as I 
believe Kern worked on that problem. You should check the change log.

Hope you get this fixed,

Arno

>  - Running the failed job - either by itself or queued up with a bunch of
> other ones - always appear to work as expected.
> 
> The part *really* driving me bonkers is that I can find no evidence of any
> changes that coincide with the problem starting.  Bacula version, kernel
> version, hardware, network - nothing was changed.
> 
> If anyone has any suggestions where I could start looking, I'd love to hear 
> them.
> 

-- 
IT-Service Lehmann                    [EMAIL PROTECTED]
Arno Lehmann                  http://www.its-lehmann.de

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
