Arno Lehmann wrote:
Or, alternatively, using tcpdump to find if the sequence numbers get out
of sync somewhere, which would cause a RST on both ends.
Okay, I got a tcpdump and logfile of -d1000 on the fd. I'm a little rusty
debugging TCP issues by hand, but I couldn't find anything that
Well, I had a failure last night while I was monitoring memory usage. I had a
script snagging the output of ps -o rss for both bacula-sd and bacula-dir
every 60 seconds. Based on that, memory usage for both jumped only by a few
megs when the jobs started. The dir was around 20M, and the sd
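For anyone wanting to repeat the memory monitoring described above, here is a minimal sketch of such a sampler. It is not the script used in this thread. It assumes a Linux procps `ps` that accepts `-C <name>` and prints RSS in KiB; the helper names (`parse_rss`, `read_rss`, `format_sample`) and the output format are made up for illustration.

```python
import subprocess
import time
from datetime import datetime

DAEMONS = ("bacula-sd", "bacula-dir")  # process names mentioned in the thread
INTERVAL_SECONDS = 60                  # sampling interval mentioned in the thread


def parse_rss(ps_output: str) -> int:
    """Sum the RSS values (KiB) printed by `ps -o rss=`, one value per line."""
    return sum(int(value) for value in ps_output.split())


def read_rss(name: str) -> int:
    """Return the total RSS in KiB of all processes named `name` (0 if none).

    Assumption: procps `ps` on Linux, which supports selecting by command
    name with -C. Other platforms may need a different selection option.
    """
    result = subprocess.run(
        ["ps", "-C", name, "-o", "rss="],
        capture_output=True,
        text=True,
        check=False,  # ps exits non-zero when no process matches
    )
    return parse_rss(result.stdout)


def format_sample(timestamp: str, readings: dict) -> str:
    """Render one log line, e.g. '<timestamp> bacula-sd=123KiB ...'."""
    fields = " ".join(f"{name}={kib}KiB" for name, kib in readings.items())
    return f"{timestamp} {fields}"


if __name__ == "__main__":
    # Sample forever; redirect stdout to a file to keep a history.
    while True:
        now = datetime.now().isoformat(timespec="seconds")
        readings = {name: read_rss(name) for name in DAEMONS}
        print(format_sample(now, readings), flush=True)
        time.sleep(INTERVAL_SECONDS)
```

Redirecting the output to a file and comparing the timestamps against the failed job's log entries shows whether memory use moved around the time of the error.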
Hello,
On 6/6/2007 3:38 PM, Frank Sweetser wrote:
Well, I had a failure last night while I was monitoring memory usage. I had a
script snagging the output of ps -o rss for both bacula-sd and bacula-dir
every 60 seconds. Based on that, memory usage for both jumped only by a few
megs when
Arno Lehmann wrote:
I'm not a good debugger user, but strace might be the next thing to
try... like capturing all socket operations, or something. Perhaps you
get to know if the error is caused by the OS on one end.
Knowing how verbose strace can be, I'm a little hesitant to jump right to
Hi,
On 6/2/2007 7:43 AM, Frank Sweetser wrote:
A couple of weeks ago, a problem started cropping up. Jobs started failing
with what look like network errors:
02-Jun 01:10 lorien-sd: gkar-daily.2007-06-02_01.05.02 Fatal error:
append.c:259 Network error on data channel. ERR=Input/output
Arno Lehmann wrote:
Well, this one looks difficult.
At least it's not just me, then =)
I suggest monitoring the memory usage of your server. I experienced
problems with (usually) the DIR or (seldom) the SD using up all
available memory. Which might affect the kernel so that it
Hi,
On 6/4/2007 6:17 PM, Frank Sweetser wrote:
Arno Lehmann wrote:
Well, this one looks difficult.
At least it's not just me, then =)
I suggest monitoring the memory usage of your server. I experienced
problems with (usually) the DIR or (seldom) the SD using up all
available memory.
Arno Lehmann wrote:
If you need a minimal Nagios plugin - I wrote some shell script for that
purpose once :-)
Oddly enough, nothing actually crashes - a handful of jobs fail, but all
subsequent ones go through just fine.
A workaround would be to not start all your jobs at once but run them
A couple of weeks ago, a problem started cropping up. Jobs started failing
with what look like network errors:
02-Jun 01:10 lorien-sd: gkar-daily.2007-06-02_01.05.02 Fatal error:
append.c:259 Network error on data channel. ERR=Input/output error
02-Jun 01:10 lorien-sd: Job write elapsed time =