On Wed, 24 October, 2007 8:17 pm, Arno Lehmann wrote:
> Hi,
>
> 24.10.2007 12:33, GDS.Marshall wrote:
>> Hello,
>>
>>> Hi,
>>>
>>> 22.10.2007 21:26, GDS.Marshall wrote:
>>>> version 2.2.4 patched from sourceforge
>>>> Linux kernel 2.6.x
>>>>
>>>> I am running 10+ FD's, one SD, and one Director.  I am having problems
>>>> with one of my FD's, the others are fine.  Not sure if it makes any
>>>> difference, but the FD is on the same machine as the Director.
>>>> I have no issues with the network, I see no errors on either the
>>>> interface of the FD or the SD.  All FD's are plugged into the same
>>>> netgear switch.  The SD is plugged into a different netgear switch
>>>> which is then plugged into the FD's switch.
>>> Are the FD and SD running on the same host (your description says that
>>> DIR and problem FD are on the same machine, but not if the DIR and SD
>>> are on that same machine, too)?
>> No, the SD is on its own machine
>>
>> FD+DIR   FD   FD
>>   |      |     |
>>  GSW---------------.... Gig Switch
>>   |
>>  FSW---------------.... Fast Switch
>>   |
>>   SD
>
> And the problem connection is between the hosts to the left... ok.
That is correct.

>
> ...
>>>> 22-Oct 18:56 backupserver-sd: Spooling data ...
>>>> 22-Oct 18:56 fileserver-fd: fileserver-backup.2007-10-22_18.54.33 Fatal error: backup.c:892 Network send error to SD. ERR=Success
>>> So the connection breaks shortly after data starts being transferred,
>>> right?
>> Correct, 2193816 is always written.
>
> Funny. Disk full on the SD, perhaps? Might be worth a look into the
> system log on both the machines.
No, that was one of the first things I checked.  The SD spool is a
dedicated logical volume of 740 GB (more than two tapes' worth of data).
All FD's write to the same spool.  When the schedule runs the job, other
jobs run alongside it; when I have been running it by hand, it is the only
job running.

>
>>> It's a little bit surprising to see an error text of Success here... I
>>> always thought that sort of things only happened on windows ;-)
>> ROTFL.  The FD, Dir, SD are on linux machines, we have not ventured to
>> the Windows FD yet.
>>
>>>
>>>> I know it says "Network send error", however, I have checked the
>>>> network, and can not find a problem with any of the equipment.
>>> Do you have a firewall running on that host?
>> No firewalls running on any of the bacula hosts, and the switch is not a
>> 3com.
>
> Good enough... regarding network problems, you could try to enable the
> heartbeat function in the FD and / or SD. To find the cause of the
> problem, tcpdump or wireshark might help.
I read about heartbeat in connection with the 3com issue, and switched it
on for both the FD and SD.  I have not tried tcpdump or wireshark yet, will
give it a go.

>
> If you see RST packets on the connection between FD and SD it's only
> the question who generates them...
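Once a capture exists, the "who generates them" question comes down to
counting RST packets per sender. A minimal sketch, assuming the text output
format of recent tcpdump versions (older releases print the flags
differently); the sample lines, addresses and ports are made up, and
count_rst_senders is a hypothetical helper, not part of Bacula or tcpdump:

```python
import re
from collections import Counter

# Matches the text output of recent tcpdump versions, e.g.
# "18:56:03.1 IP 10.0.0.2.9103 > 10.0.0.1.40000: Flags [R.], seq ..."
# (assumed format; older tcpdump releases print flags differently).
LINE_RE = re.compile(r"IP6? (\S+) > (\S+): Flags \[([^\]]*)\]")


def count_rst_senders(lines):
    """Return a Counter mapping source 'host.port' to the number of RSTs it sent."""
    senders = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        # "R" in the flags field marks a reset; "." is just the ACK bit.
        if match and "R" in match.group(3):
            senders[match.group(1)] += 1
    return senders


# Hypothetical capture lines; addresses and ports are invented for the example.
sample = [
    "18:56:01.000001 IP 10.0.0.1.40000 > 10.0.0.2.9103: Flags [P.], seq 1:1449, ack 1, win 229, length 1448",
    "18:56:03.000002 IP 10.0.0.2.9103 > 10.0.0.1.40000: Flags [R.], seq 1, ack 1449, win 0, length 0",
]
rst_senders = count_rst_senders(sample)
print(rst_senders)
```

If every RST comes from the same end of the connection, that end (or
something sitting in front of it) is the place to look next.
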
>
> ...
>>> Here it's failed, I think. A higher debug level might reveal more, but
>>> this doesn't tell me anything important.
>>
>> I am probably going to get flamed for this,
>
> Not by me :-)
>
>> but what value?  Currently it
>> is set to 200.  I do not want to put it too high and swamp the mailing
>> list with data, but neither do I want to waste the mailing list's time
>> by making it too low....
>
> Really a difficult question :-)
>
> The best approach might be to run with debug level 400, save the
> resulting logs, and only post the part around the failure first. If
> someone needs more detail, you could post the complete log to a web site.

Okay, will give 400 a go.
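For posting only the part around the failure, something like the following
could cut the excerpt out of a saved trace. It is a sketch only: lines_around
is a hypothetical helper, the default search string "Fatal error" is taken
from the job report quoted earlier, and the trace below is filler rather than
real daemon output.

```python
def lines_around(lines, needle="Fatal error", before=20, after=5):
    """Return the lines surrounding every line that contains `needle`."""
    keep = set()
    for index, line in enumerate(lines):
        if needle in line:
            start = max(0, index - before)
            stop = min(len(lines), index + after + 1)
            keep.update(range(start, stop))
    # Sorting the collected indices keeps the original order and
    # merges overlapping windows without duplicating lines.
    return [lines[i] for i in sorted(keep)]


# Hypothetical trace: 100 filler lines with one failure at index 50.
trace = [f"debug line {n}" for n in range(100)]
trace[50] = "fileserver-fd: Fatal error: backup.c:892 Network send error to SD. ERR=Success"
excerpt = lines_around(trace, before=3, after=2)
print("\n".join(excerpt))
```

With a real log, the list of lines would come from reading the saved trace
file, and the window sizes can be widened if someone asks for more context.
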

>
> ...
>>>> backupserver ~ #
>>> With the information from above, I suspect a network problem. Does the
>>> client's run-before-job script run for a very long time? In such a
>>> situation, a firewall/router might close the connection between SD and
>>> FD because it seems to be idle.
>> The run before job might take half an hour max.  There is no firewall or
>> router in the setup.
>
> Hmm... half an hour should not trigger a RST due to idling too long.
> Do your other FDs on the network segment with the DIR have
> long-running scripts, too, or do they transfer data almost immediately
> after the backup jobs are started?
This is the only one with a script.  Surely if it has started to transfer
data, the RST will not take place as it is no longer idle (just a
thought).
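
To illustrate the idle-timeout theory: a connection that carries no traffic
can be kept from looking dead by having the kernel send periodic probes. The
sketch below uses plain TCP keepalive on a socket. This is not how Bacula's
heartbeat works (that is a message sent by the daemons themselves); it is
only the kernel-level analogue of the same idea. enable_keepalive and its
timing values are assumptions chosen for the example.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, probes=5):
    """Turn on TCP keepalive; tuning options are set only where the platform has them."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    for name, value in (("TCP_KEEPIDLE", idle),       # seconds idle before the first probe
                        ("TCP_KEEPINTVL", interval),  # seconds between probes
                        ("TCP_KEEPCNT", probes)):     # failed probes before giving up
        option = getattr(socket, name, None)  # Linux provides all three
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
print(keepalive_on)
```

Probes like these only matter while nothing else is being sent, which fits
the point above: once data is flowing, the connection is not idle.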

Spencer


_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
