Hi,

25.10.2007 15:08, GDS.Marshall wrote:
> On Wed, 24 October, 2007 8:17 pm, Arno Lehmann wrote:
>> Hi,
>>
>> 24.10.2007 12:33, GDS.Marshall wrote:
>>> Hello,
>>>
>>>> Hi,
>>>>
>>>> 22.10.2007 21:26, GDS.Marshall wrote:
>>>>> version 2.2.4 patched from sourceforge
>>>>> Linux kernel 2.6.x
>>>>>
>>>>> I am running 10+ FDs, one SD, and one Director.  I am having problems
>>>>> with one of my FDs; the others are fine.
...
>>> FD+DIR   FD   FD
>>>   |      |     |
>>>  GSW---------------.... Gig Switch
>>>   |
>>>  FSW---------------.... Fast Switch
>>>   |
>>>   SD
>> And the problem connection is between the hosts to the left... ok.
> That is correct.
> 
>> ...
>>>>> 22-Oct 18:56 backupserver-sd: Spooling data ...
>>>>> 22-Oct 18:56 fileserver-fd: fileserver-backup.2007-10-22_18.54.33 Fatal error: backup.c:892 Network send error to SD. ERR=Success
>>>> So the connection breaks shortly after data starts being transferred,
>>>> right?
>>> Correct; 2193816 is always written.
>> Funny. Disk full on the SD, perhaps? It might be worth looking at the
>> system logs on both machines.
> No, that was one of the first things I checked.  The SD spool is a
> dedicated logical volume of 740 GB (more than two tapes' worth of data).
> All FDs write to the same spool.  When the job runs from the schedule, it
> is not on its own; however, when I have been running it by hand, it is
> the only job running.

So we can be more or less sure it's got nothing to do with the scheduling process.

...
>> Good enough... Regarding network problems, you could try enabling the
>> heartbeat function in the FD and/or SD. To find the cause of the
>> problem, tcpdump or wireshark might help.
> I read about heartbeat in connection with the 3Com issue, and switched it on
> for both the FD and SD.  I have not tried tcpdump or wireshark yet, but will
> give them a go.

Use the filtering options extensively - otherwise, you will be 
overloaded by the output :-)
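
For reference, an RST is just one bit in the TCP header: bit 0x04 of the
flags byte at offset 13. With tcpdump, a filter along the lines of
'tcp port 9103 and tcp[tcpflags] & tcp-rst != 0' should show only the
resets, assuming your SD listens on the default port 9103. If you end up
looking at raw headers yourself, here is a minimal sketch of decoding
them; the helper and its name are made up for illustration only.

```python
import struct
from typing import NamedTuple

# Flag bits in byte 13 of the TCP header (RFC 793 layout).
TCP_RST = 0x04
TCP_ACK = 0x10


class TcpSummary(NamedTuple):
    src_port: int
    dst_port: int
    rst: bool


def summarize_tcp_header(header: bytes) -> TcpSummary:
    """Return ports and RST state from a raw TCP header (hypothetical helper)."""
    if len(header) < 20:
        raise ValueError("a TCP header is at least 20 bytes long")
    # Bytes 0-3: source and destination port, network byte order.
    src_port, dst_port = struct.unpack("!HH", header[:4])
    # Byte 13: flags (CWR ECE URG ACK PSH RST SYN FIN, high to low bit).
    flags = header[13]
    return TcpSummary(src_port, dst_port, bool(flags & TCP_RST))


# Example: a 20-byte header from port 9103 to port 40000 with RST+ACK set.
example = struct.pack(
    "!HHIIBBHHH",
    9103, 40000,        # source port, destination port
    0, 0,               # sequence and acknowledgement numbers
    5 << 4,             # data offset = 5 words, no options
    TCP_RST | TCP_ACK,  # flags byte
    0, 0, 0,            # window, checksum, urgent pointer
)
print(summarize_tcp_header(example))
# prints: TcpSummary(src_port=9103, dst_port=40000, rst=True)
```

Keep in mind that a box in the middle which resets a connection normally
uses one endpoint's address as the source, so comparing captures taken on
both the FD and the SD host tells you more than a single capture.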

>> If you see RST packets on the connection between FD and SD, the only
>> question is who generates them...
>>
>> ...
>>>> This is where it failed, I think. A higher debug level might reveal more,
>>>> but this doesn't tell me anything important.
>>> I am probably going to get flamed for this,
>> Not by me :-)
>>
>>> but what value should I use? Currently it is set to 200. I do not want
>>> to set it too high and swamp the mailing list with data, but neither do
>>> I want to waste the mailing list's time by setting it too low....
>> Really a difficult question :-)
>>
>> The best approach might be to run with debug level 400, save the
>> resulting logs, and only post the part around the failure first. If
>> someone needs more detail, you could post the complete log to a web site.
> 
> Okay, will give 400 a go.
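
Once you have the log, a few lines of Python can cut out the part around
the failure for posting. Just a sketch: the function name, the default
context sizes and the search string are my own assumptions. The job
report above says "Fatal error: backup.c:892 ...", so that is used as the
default marker; adjust it to whatever line marks the failure in the
debug output.

```python
from typing import List, Sequence


def excerpt_around(lines: Sequence[str], marker: str = "Fatal error",
                   before: int = 50, after: int = 20) -> List[str]:
    """Return the lines surrounding the first line that contains `marker`.

    Hypothetical helper: up to `before` lines of context precede the match
    and up to `after` lines follow it. Returns [] if nothing matches.
    """
    for i, line in enumerate(lines):
        if marker in line:
            start = max(0, i - before)
            return list(lines[start:i + after + 1])
    return []


# Usage sketch; "fd-debug.log" is a placeholder for wherever the debug
# output was saved:
# with open("fd-debug.log", errors="replace") as f:
#     print("".join(excerpt_around(f.readlines())))
```
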
> 
>> ...
>>>>> backupserver ~ #
>>>> With the information from above, I suspect a network problem. Does the
>>>> Client Run Before Job script run for a very long time? In such a
>>>> situation, a firewall or router might close the connection between SD
>>>> and FD because it seems to be idle.
>>> The run before job might take half an hour max.  There is no firewall or
>>> router in the setup.
>> Hmm... half an hour should not trigger an RST due to idling too long.
>> Do your other FDs on the network segment with the DIR have
>> long-running scripts, too, or do they transfer data almost immediately
>> after the backup jobs are started?
> This is the only one with a script.  Surely once it has started to transfer
> data, the RST will not take place, as the connection is no longer idle (just
> a thought).

Well, it might happen that some device or software silently drops the
connection earlier, and only sends RST packets once data arrives on a
connection that it (according to its own assumptions) considers invalid.
I believe this behaviour is often found in routers.

You could try to run that same job with a dummy "Client Run Before" 
script which immediately exits, just to see what happens then.

If this case works and the heartbeat doesn't help, then it's time for
some network debugging, I think.

Arno

-- 
Arno Lehmann
IT-Service Lehmann
www.its-lehmann.de

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users