Re: [Bacula-users] Mysteriously failing jobs

2007-06-08 Thread Frank Sweetser
Arno Lehmann wrote: Or, alternatively, using tcpdump to find out whether the sequence numbers get out of sync somewhere, which would cause an RST on both ends. Okay, I got a tcpdump capture and a logfile from running the fd with -d1000. I'm a little rusty at debugging TCP issues by hand, but I couldn't find anything that

Re: [Bacula-users] Mysteriously failing jobs

2007-06-06 Thread Frank Sweetser
Well, I had a failure last night while I was monitoring memory usage. I had a script snagging the output of ps -o rss for both bacula-sd and bacula-dir every 60 seconds. Based on that, memory usage for both jumped only by a few megs when the jobs started. The dir was around 20M, and the sd
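The sampling described above could be sketched roughly as follows. This is not the script used in the thread, only a minimal Python illustration: the function names are hypothetical, and it assumes a Linux procps `ps` that supports `-C` to select processes by command name and `-o rss=` to print the resident set size in KiB without a header.

```python
import subprocess
import time


def parse_rss_kib(ps_output: str) -> int:
    """Sum the RSS values (KiB) printed by `ps -o rss=`, one number per line."""
    return sum(int(field) for field in ps_output.split())


def sample_rss_kib(process_name: str) -> int:
    """Total RSS in KiB of all processes with this command name (0 if none).

    Assumes procps `ps`; `-C` is not available on every platform.
    """
    result = subprocess.run(
        ["ps", "-o", "rss=", "-C", process_name],
        capture_output=True,
        text=True,
        check=False,  # ps exits non-zero when no process matches
    )
    return parse_rss_kib(result.stdout)


def monitor(names=("bacula-sd", "bacula-dir"), interval=60, samples=None):
    """Print one timestamped line per daemon every `interval` seconds.

    Runs forever when `samples` is None, otherwise stops after that many rounds.
    """
    count = 0
    while samples is None or count < samples:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for name in names:
            print(f"{stamp} {name} rss_kib={sample_rss_kib(name)}", flush=True)
        count += 1
        if samples is None or count < samples:
            time.sleep(interval)
```

Calling `monitor()` and redirecting its output to a file gives a log that can be lined up against the job start and failure times afterwards.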

Re: [Bacula-users] Mysteriously failing jobs

2007-06-06 Thread Arno Lehmann
Hello, On 6/6/2007 3:38 PM, Frank Sweetser wrote: Well, I had a failure last night while I was monitoring memory usage. I had a script snagging the output of ps -o rss for both bacula-sd and bacula-dir every 60 seconds. Based on that, memory usage for both jumped only by a few megs when

Re: [Bacula-users] Mysteriously failing jobs

2007-06-06 Thread Frank Sweetser
Arno Lehmann wrote: I'm not a good debugger user, but strace might be the next thing to try... like capturing all socket operations, or something. Perhaps you get to know if the error is caused by the OS on one end. Knowing how verbose strace can be, I'm a little hesitant to jump right to

Re: [Bacula-users] Mysteriously failing jobs

2007-06-04 Thread Arno Lehmann
Hi, On 6/2/2007 7:43 AM, Frank Sweetser wrote: A couple of weeks ago, a problem started cropping up. Jobs started failing with what look like network errors: 02-Jun 01:10 lorien-sd: gkar-daily.2007-06-02_01.05.02 Fatal error: append.c:259 Network error on data channel. ERR=Input/output

Re: [Bacula-users] Mysteriously failing jobs

2007-06-04 Thread Frank Sweetser
Arno Lehmann wrote: Well, this one looks difficult. At least it's not just me, then =) I suggest monitoring the memory usage of your server. I experienced problems with (usually) the DIR or (seldom) the SD using up all available memory. Which probably might affect the kernel so that it

Re: [Bacula-users] Mysteriously failing jobs

2007-06-04 Thread Arno Lehmann
Hi, On 6/4/2007 6:17 PM, Frank Sweetser wrote: Arno Lehmann wrote: Well, this one looks difficult. At least it's not just me, then =) I suggest monitoring the memory usage of your server. I experienced problems with (usually) the DIR or (seldom) the SD using up all available memory.

Re: [Bacula-users] Mysteriously failing jobs

2007-06-04 Thread Frank Sweetser
Arno Lehmann wrote: If you need a minimal Nagios plugin - I wrote some shell script for that purpose once :-) Oddly enough, nothing actually crashes - a handful of jobs fail, but all subsequent ones go through just fine. A workaround would be to not start all your jobs at once but run them

[Bacula-users] Mysteriously failing jobs

2007-06-01 Thread Frank Sweetser
A couple of weeks ago, a problem started cropping up. Jobs started failing with what look like network errors: 02-Jun 01:10 lorien-sd: gkar-daily.2007-06-02_01.05.02 Fatal error: append.c:259 Network error on data channel. ERR=Input/output error 02-Jun 01:10 lorien-sd: Job write elapsed time =
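To see which jobs are affected and when, lines like the "Fatal error" message quoted above can be filtered out of a larger log. The Python sketch below is only an illustration: the pattern is inferred from the single line shown in this post, not from any Bacula log format specification, and the function names are hypothetical.

```python
import re

# Pattern inferred from the one quoted line; other Bacula messages may differ.
FATAL_RE = re.compile(
    r"^(?P<day>\d{2}-[A-Z][a-z]{2}) (?P<time>\d{2}:\d{2}) "
    r"(?P<daemon>\S+): (?P<job>\S+) Fatal error: "
    r"(?P<source>[\w.]+:\d+) (?P<message>.*)$"
)


def parse_fatal(line):
    """Return a dict of fields for a 'Fatal error' line, or None otherwise."""
    match = FATAL_RE.match(line.strip())
    return match.groupdict() if match else None


def failed_jobs(lines):
    """Collect (day, time, job, message) tuples for every fatal line."""
    found = []
    for line in lines:
        fields = parse_fatal(line)
        if fields:
            found.append(
                (fields["day"], fields["time"], fields["job"], fields["message"])
            )
    return found
```

Feeding it the lines of a daemon log (for example, an open file object) returns one tuple per failed job, which makes it easier to check whether the failures cluster around the time all jobs start.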