Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer

2010-04-21 Thread Matija Nalis
On Sun, Apr 18, 2010 at 11:46:33AM -0500, Jon Schewe wrote:
> > http://wiki.bacula.org/doku.php?id=faq#my_backup_starts_but_dies_after_a_while_with_connection_reset_by_peer_error
> >
> > [1] It actually tries that at one point in src/lib/bsock.c if
> > TCP_KEEPIDLE support is detected, but it fails to detect it
> > properly because  is not included.
> >
> > However, even after fixing that (and missing semicolon in 
> > 'int opt = heart_beat' line), it still doesn't look like it sets
> > TCP_KEEPIDLE correctly on FD->SD connection, so maybe this
> > codepath is not used there. 
> >
> > Anyway I gave up debugging there and just set the system
> > defaults. But I just though I'd mention that in case someone
> > else wants to continue chasing the bug.
> >
> >   
> Hmm, this sounds like a bug that should be fixed and once it is fixed
> may remove a bunch of problems with firewalls.

FYI, I've posted a patch to the bacula-devel mailing list which fixes
the existing TCP_KEEPIDLE support. That support could be extended (not
all parts of Bacula use that function), but it might be enough.

If someone is willing to try it, let me (or better, the whole list)
know how it fares and whether it fixes the timeouts without the user
needing to resort to changing system defaults.

--
___
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users


Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer

2010-04-18 Thread Jon Schewe
On 04/16/2010 08:30 AM, Matija Nalis wrote:
> On Mon, Apr 12, 2010 at 03:59:49PM -0500, Jon Schewe wrote:
>   
>> On 4/12/10 9:40 AM, Matija Nalis wrote:
>> 
>>> It is especially problem with bigger databases and MySQL instead of
>>> PostgreSQL, see http://bugs.bacula.org/view.php?id=1472, where it can
>>> take even several hours! (note that while it talks about "restore"
>>> speed, it is also related to accurate backups which employ similar
>>> SQL queries)
>>>
>>>   
>> Must be what it is then. I've been thinking about switching to postgres,
>> but haven't because the opensuse packages for bacula are only for mysql.
>> This may motivate me more.
>> 
> You should probably switch soon, before you get to like your
> database,,, Exporting bacula mysql tables for import in PostgreSQL
> can be very painful and problematic; it is much better to just drop
> the database and create fresh one.
>
>   
I'll keep that in mind as I go forward.

>> The backup finished, so it seems that in version 3.0.3 bacula does NOT
>> set the socket option SO_KEEPALIVE.
>> 
> Hmm, yeah, I've check the code casually, and it indeed looks like the
> heartbeats are not setting SO_KEEPALIVE timeouts (note that it does
> set SO_KEEPALIVE on the socket, otherwise the advice above wouldn't
> work -- it just doesn't do TCP_KEEPIDLE on that[1] to specify
> user-defined timeouts and instead uses system defaults). 
>
> The heartbeats look like are doing other things though (application-level, 
> not socket-level), but as you saw they are not perfect for fixing network 
> idleness problems - and so you also MUST set system defaults.
>
> I've updated the FAQ at:
> http://wiki.bacula.org/doku.php?id=faq#my_backup_starts_but_dies_after_a_while_with_connection_reset_by_peer_error
>
>
> [1] It actually tries that at one point in src/lib/bsock.c if
> TCP_KEEPIDLE support is detected, but it fails to detect it
> properly because  is not included.
>
> However, even after fixing that (and missing semicolon in 
> 'int opt = heart_beat' line), it still doesn't look like it sets
> TCP_KEEPIDLE correctly on FD->SD connection, so maybe this
> codepath is not used there. 
>
> Anyway I gave up debugging there and just set the system
> defaults. But I just though I'd mention that in case someone
> else wants to continue chasing the bug.
>
>   
Hmm, this sounds like a bug that should be fixed; once it is, that
may remove a bunch of problems with firewalls.




Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer

2010-04-16 Thread Matija Nalis
On Mon, Apr 12, 2010 at 03:59:49PM -0500, Jon Schewe wrote:
> On 4/12/10 9:40 AM, Matija Nalis wrote:
> > It is especially problem with bigger databases and MySQL instead of
> > PostgreSQL, see http://bugs.bacula.org/view.php?id=1472, where it can
> > take even several hours! (note that while it talks about "restore"
> > speed, it is also related to accurate backups which employ similar
> > SQL queries)
> >
> Must be what it is then. I've been thinking about switching to postgres,
> but haven't because the opensuse packages for bacula are only for mysql.
> This may motivate me more.

You should probably switch soon, before you get to like your
database... Exporting Bacula MySQL tables for import into PostgreSQL
can be very painful and problematic; it is much better to just drop
the database and create a fresh one.

> The backup finished, so it seems that in version 3.0.3 bacula does NOT
> set the socket option SO_KEEPALIVE.

Hmm, yeah, I've checked the code casually, and it indeed looks like the
heartbeats are not setting SO_KEEPALIVE timeouts (note that it does
set SO_KEEPALIVE on the socket, otherwise the advice above wouldn't
work -- it just doesn't set TCP_KEEPIDLE on it[1] to specify
user-defined timeouts, and instead uses the system defaults).

The heartbeats look like they are doing other things though
(application-level, not socket-level), but as you saw they are not
perfect for fixing network idleness problems - and so you also MUST set
the system defaults.

I've updated the FAQ at:
http://wiki.bacula.org/doku.php?id=faq#my_backup_starts_but_dies_after_a_while_with_connection_reset_by_peer_error


[1] It actually tries that at one point in src/lib/bsock.c if
TCP_KEEPIDLE support is detected, but it fails to detect it
properly because the required header is not included.

However, even after fixing that (and the missing semicolon in the
'int opt = heart_beat' line), it still doesn't look like it sets
TCP_KEEPIDLE correctly on the FD->SD connection, so maybe this
code path is not used there.

Anyway, I gave up debugging there and just set the system
defaults. But I just thought I'd mention that in case someone
else wants to continue chasing the bug.
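
For anyone chasing this, here is a minimal sketch of what a per-socket
keepalive override looks like in general. It is written in Python purely
for illustration and is not Bacula's C code; the helper name
enable_keepalive and the 60-second value are assumptions made for the
example. It shows the distinction discussed above: SO_KEEPALIVE only
switches keepalive on (using system defaults), while TCP_KEEPIDLE, where
the platform exposes it, sets the idle time for that one socket.

```python
import socket


def enable_keepalive(sock, idle_seconds):
    """Turn on TCP keepalive; set a per-socket idle time if the platform allows.

    Returns True when the per-socket idle time was applied, False when only
    SO_KEEPALIVE was set and the system-wide default idle time stays in effect.
    """
    # SO_KEEPALIVE alone makes the kernel probe using the system defaults.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # TCP_KEEPIDLE is not available everywhere, so check before using it.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
        return True
    return False


# Hypothetical usage on an unconnected socket (no network traffic involved).
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
per_socket = enable_keepalive(sock, 60)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
idle = (
    sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE)
    if per_socket
    else None
)
sock.close()
print("keepalive on:", keepalive_on, "per-socket idle:", idle)
```

If the second setsockopt call is skipped or never reached, "netstat -to"
would be expected to keep showing the system default timer, which matches
what was observed in this thread.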

-- 
Matija Nalis
Odjel racunalno-informacijskih sustava i servisa
  
Hrvatska akademska i istrazivacka mreza - CARNet 
Josipa Marohnica 5, 1 Zagreb
tel. +385 1 6661 616, fax. +385 1 6661 766
www.CARNet.hr



Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer

2010-04-12 Thread Jon Schewe
On 4/12/10 9:40 AM, Matija Nalis wrote:
> On Mon, Apr 12, 2010 at 09:23:51AM -0500, Jon Schewe wrote:
>   
>> On 4/12/10 9:00 AM, Matija Nalis wrote:
>> 
>>> Good, let us know how it fares.
>>>   
>>>   
>> It seems to be running, but I've run into a problem with bconsole. Once
>> I started the job, if I run bconsole and then "status dir", the console
>> hangs. If I strace the bconsole process it's stuck in a select call.
>>
>> 
>>> strace -p 18452
>>>   
>> Process 18452 attached - interrupt to quit
>> select(4, [3], NULL, NULL, {9, 461287}) = 0 (Timeout)
>> read(3, 0x655d80, 5)= -1 EAGAIN (Resource
>> temporarily unavailable)
>> 
> That should not be related to SO_KEEPALIVE - it should be completly
> transparent to the applications if the network is working (and even
> when it is not working, it should differ only in always terminating
> the connection instead of sometimes terminating connection and
> sometimes hanging idefinitely).
>
> Anyway, it may be few issues with directory hanging. Most common is
> you are too eager. For example, is SQL server is busy, "status dir"
> will hang until it completes.
>
>   
> It is especially problem with bigger databases and MySQL instead of
> PostgreSQL, see http://bugs.bacula.org/view.php?id=1472, where it can
> take even several hours! (note that while it talks about "restore"
> speed, it is also related to accurate backups which employ similar
> SQL queries)
>
>   
Must be what it is then. I've been thinking about switching to postgres,
but haven't because the opensuse packages for bacula are only for mysql.
This may motivate me more.

The backup finished, so it seems that in version 3.0.3 bacula does NOT
set the socket option SO_KEEPALIVE.

-- 
Jon Schewe | http://mtu.net/~jpschewe
If you see an attachment named signature.asc, this is my digital
signature. See http://www.gnupg.org for more information.




Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer

2010-04-12 Thread Matija Nalis
On Mon, Apr 12, 2010 at 09:23:51AM -0500, Jon Schewe wrote:
> On 4/12/10 9:00 AM, Matija Nalis wrote:
> > (SO_KEEPALIVE will work even with only one side of connection having
> > it enabled).
> >   
> So I should only need the heartbeat on that client's setup as well,
> right? Getting rid of extra heart beats would be nice.

Yes, it should be enough. Note that there is no real need to get rid
of the extra heartbeats; they are not really expensive (so the biggest
gain is "cleaner" config files).

> > Good, let us know how it fares.
> >   
> It seems to be running, but I've run into a problem with bconsole. Once
> I started the job, if I run bconsole and then "status dir", the console
> hangs. If I strace the bconsole process it's stuck in a select call.
>
> >strace -p 18452
> Process 18452 attached - interrupt to quit
> select(4, [3], NULL, NULL, {9, 461287}) = 0 (Timeout)
> read(3, 0x655d80, 5)= -1 EAGAIN (Resource
> temporarily unavailable)

That should not be related to SO_KEEPALIVE - it should be completely
transparent to the applications if the network is working (and even
when it is not working, it should differ only in always terminating
the connection instead of sometimes terminating the connection and
sometimes hanging indefinitely).

Anyway, there may be a few reasons for the director hanging. The most
common is that you are too eager. For example, if the SQL server is
busy, "status dir" will hang until the query completes.

This is especially a problem with bigger databases and with MySQL
rather than PostgreSQL, see http://bugs.bacula.org/view.php?id=1472,
where it can take even several hours! (Note that while the report talks
about "restore" speed, it also applies to accurate backups, which employ
similar SQL queries.)

You can check whether that is the case with "show processlist" in MySQL
(if you are running MySQL for the database, of course), or simply wait.

Or you might be unlucky enough to hit a real director bug in 5.0.1,
see http://bugs.bacula.org/view.php?id=1528, but that is unlikely.




Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer

2010-04-12 Thread Jon Schewe
On 4/12/10 9:00 AM, Matija Nalis wrote:
> On Mon, Apr 12, 2010 at 08:45:36AM -0500, Jon Schewe wrote:
>   
>> On 4/12/10 8:39 AM, Matija Nalis wrote:
>> 
>>> echo 60 > /proc/sys/net/ipv4/tcp_keepalive_time
>>>
>>> (or edit /etc/sysctl.d/* or /etc/sysctl.conf to retain value across
>>> reboots). Can you try what "netstat -to" says after you lower that
>>> limit and rerun backups ? 
>>>
>>>   
>> Now I see the timer down where I expect it. Should I only need this on
>> the client?
>> 
> If only that client is having timeout timeout problems, than yes (as
> I understand your Director and SD are on same server, so you should
> not have timeout issues there as no networking is involved).
>
> (SO_KEEPALIVE will work even with only one side of connection having
> it enabled).
>
>   
So I should only need the heartbeat on that client's setup as well,
right? Getting rid of extra heart beats would be nice.

>>> If "netstat -to" then reports smaller timers (60 or less), than it
>>> should fix your problem, so you can try turning accurate back to yes.
>>>
>>> Does that help ?
>>>   
>> It's running, I'll know in a couple of hours.
>> 
> Good, let us know how it fares.
>
>   
It seems to be running, but I've run into a problem with bconsole. Once
I started the job, if I run bconsole and then "status dir", the console
hangs. If I strace the bconsole process it's stuck in a select call.

>strace -p 18452
Process 18452 attached - interrupt to quit
select(4, [3], NULL, NULL, {9, 461287}) = 0 (Timeout)
read(3, 0x655d80, 5)                    = -1 EAGAIN (Resource temporarily unavailable)
select(4, [3], NULL, NULL, {10, 0})     = 0 (Timeout)
read(3, 0x655d80, 5)                    = -1 EAGAIN (Resource temporarily unavailable)
select(4, [3], NULL, NULL, {10, 0}
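
As a rough illustration of what that trace pattern means, here is a small
Python sketch (not bconsole's actual code; the names wait_for_reply,
console and director are just labels chosen for the example). A select
call with a timeout returns nothing readable while the peer is silent,
and only yields data once the peer finally writes something.

```python
import select
import socket


def wait_for_reply(sock, timeout):
    """Return received bytes, or None if nothing arrived within `timeout` seconds."""
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        # Same situation as "select(...) = 0 (Timeout)" in the trace.
        return None
    return sock.recv(5)


# A local socket pair stands in for the console <-> director connection.
console, director = socket.socketpair()

silent = wait_for_reply(console, 0.1)    # peer is busy, nothing to read yet
director.sendall(b"hello")
answered = wait_for_reply(console, 0.1)  # peer has now replied

console.close()
director.close()
print(silent, answered)
```

In other words, repeated timeouts like the ones above suggest the console
is simply waiting for the other side to answer, rather than the
connection itself being broken.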


-- 
Jon Schewe | http://mtu.net/~jpschewe
If you see an attachment named signature.asc, this is my digital
signature. See http://www.gnupg.org for more information.




Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer

2010-04-12 Thread Matija Nalis
On Mon, Apr 12, 2010 at 08:45:36AM -0500, Jon Schewe wrote:
> On 4/12/10 8:39 AM, Matija Nalis wrote:
> > echo 60 > /proc/sys/net/ipv4/tcp_keepalive_time
> >
> > (or edit /etc/sysctl.d/* or /etc/sysctl.conf to retain value across
> > reboots). Can you try what "netstat -to" says after you lower that
> > limit and rerun backups ? 
> > 
> Now I see the timer down where I expect it. Should I only need this on
> the client?

If only that client is having timeout problems, then yes (as
I understand it, your Director and SD are on the same server, so you
should not have timeout issues there as no networking is involved).

(SO_KEEPALIVE will work even with only one side of connection having
it enabled).

> > If "netstat -to" then reports smaller timers (60 or less), than it
> > should fix your problem, so you can try turning accurate back to yes.
> >
> > Does that help ?
>
> It's running, I'll know in a couple of hours.

Good, let us know how it fares.

-- 
Matija Nalis
Odjel racunalno-informacijskih sustava i servisa
  
Hrvatska akademska i istrazivacka mreza - CARNet 
Josipa Marohnica 5, 1 Zagreb
tel. +385 1 6661 616, fax. +385 1 6661 766
www.CARNet.hr



Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer

2010-04-12 Thread Jon Schewe
On 4/12/10 2:47 AM, Graham Keeling wrote:
> On Sun, Apr 11, 2010 at 09:32:43AM -0500, Jon Schewe wrote:
>   
>> I got it to work again last night. Changing the firewall time outs
>> didn't help. What fixed it was turning off Accurate backups.
>> 
> Ah, so possibly bacula spent long enough stuck doing an accurate query in the
> catalog that the firewall connection timed out.
> Are you using mysql and bacula-5.0.1?
>
>   
I'm using mysql and bacula 3.0.3.

-- 
Jon Schewe | http://mtu.net/~jpschewe
If you see an attachment named signature.asc, this is my digital
signature. See http://www.gnupg.org for more information.




Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer

2010-04-12 Thread Jon Schewe
On 4/12/10 8:39 AM, Matija Nalis wrote:
> On Mon, Apr 12, 2010 at 07:59:53AM -0500, Jon Schewe wrote:
>   
>> /proc/sys/net/ipv4/tcp_keepalive_time:7200
>> 
>>> netstat -to
>>>   
>> Client:
>> tcp0  0 client:9102   server:54043  ESTABLISHED
>> keepalive (7196.36/0/0)
>> 
> That's strange. It should've been the timeouts you specified in
> config files, not 7200 seconds (two hours) which is system default.
>
> It looks like bacula does not use TCP_KEEPIDLE setsockopt(2) on your
> system. You might want to report a bug on http://bugs.bacula.org/
>
> IMHO, it should work there. Or if not, it should probably throw a
> warning if you try to use it and it is not supported or fails.
>
> Apart from fixing bacula, you can override system default, for
> example (on both server and client) do :
>
> echo 60 > /proc/sys/net/ipv4/tcp_keepalive_time
>
> (or edit /etc/sysctl.d/* or /etc/sysctl.conf to retain value across
> reboots). Can you try what "netstat -to" says after you lower that
> limit and rerun backups ? 
>   
Now I see the timer down where I expect it. Should I only need this on
the client?
> If "netstat -to" then reports smaller timers (60 or less), than it
> should fix your problem, so you can try turning accurate back to yes.
>
> Does that help ?
>   
It's running, I'll know in a couple of hours.

-- 
Jon Schewe | http://mtu.net/~jpschewe
If you see an attachment named signature.asc, this is my digital
signature. See http://www.gnupg.org for more information.




Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer

2010-04-12 Thread Matija Nalis
On Mon, Apr 12, 2010 at 07:59:53AM -0500, Jon Schewe wrote:
> /proc/sys/net/ipv4/tcp_keepalive_time:7200
> > netstat -to
> Client:
> tcp0  0 client:9102   server:54043  ESTABLISHED
> keepalive (7196.36/0/0)

That's strange. It should've been the timeouts you specified in the
config files, not 7200 seconds (two hours), which is the system default.

It looks like Bacula does not use the TCP_KEEPIDLE setsockopt(2) option
on your system. You might want to report a bug at http://bugs.bacula.org/

IMHO, it should work there. Or if not, it should probably throw a
warning if you try to use it and it is not supported or fails.

Apart from fixing Bacula, you can override the system default; for
example (on both server and client) run:

echo 60 > /proc/sys/net/ipv4/tcp_keepalive_time

(or edit /etc/sysctl.d/* or /etc/sysctl.conf to retain the value across
reboots). Can you check what "netstat -to" says after you lower that
limit and rerun the backups?

If "netstat -to" then reports smaller timers (60 or less), then it
should fix your problem, so you can try turning accurate back to yes.

Does that help ?
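
To make the reasoning behind lowering the value concrete, here is a small
arithmetic sketch in Python. The function names are made up for this
example, and the 600-second firewall idle timeout is purely a
hypothetical figure (the real value depends on the firewall). The
interval and probe counts (75 and 9) are the system defaults reported
elsewhere in this thread.

```python
def first_probe_beats_firewall(keepalive_time, firewall_idle_timeout):
    """True if the first keepalive probe goes out before the firewall gives up."""
    return keepalive_time < firewall_idle_timeout


def dead_peer_detection_time(keepalive_time, interval, probes):
    """Approximate seconds of silence before the kernel drops a dead connection."""
    return keepalive_time + interval * probes


# Assumed, hypothetical firewall idle timeout in seconds.
FIREWALL_IDLE_TIMEOUT = 600

for keepalive_time in (7200, 60):
    ok = first_probe_beats_firewall(keepalive_time, FIREWALL_IDLE_TIMEOUT)
    total = dead_peer_detection_time(keepalive_time, 75, 9)
    print(f"keepalive_time={keepalive_time}: probe in time={ok}, "
          f"dead peer detected after ~{total}s")
```

With the default of 7200 the first probe would only be sent after two
hours of idleness, long after such a firewall would have dropped the
connection; with 60 the probes keep the connection looking active.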



Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer

2010-04-12 Thread Jon Schewe
On 4/12/10 7:21 AM, Matija Nalis wrote:
> On Mon, Apr 12, 2010 at 05:41:51AM -0500, Jon Schewe wrote:
>   
>>> Strange. Are you running GNU/Linux system on all the machines 
>>> (FD, SD, DIR) ? IIRC, it might not be supported on other systems,
>>> and/or it may need additional tuning on them.
>>>
>>>   
>>>   
>> I'm running opensuse Linux for the director and storage daemon and
>> Debian Linux for the file daemon.
>> 
> that is strange... 
> can you check what are your default SO_KEEPALIVE values with:
>
> grep '' /proc/sys/net/ipv4/tcp_keepalive_*
>
>   
Server:
/proc/sys/net/ipv4/tcp_keepalive_intvl:75
/proc/sys/net/ipv4/tcp_keepalive_probes:9
/proc/sys/net/ipv4/tcp_keepalive_time:7200

Client:
/proc/sys/net/ipv4/tcp_keepalive_intvl:75
/proc/sys/net/ipv4/tcp_keepalive_probes:9
/proc/sys/net/ipv4/tcp_keepalive_time:7200

bacula 3.0.3 on both systems

> and what bacula is using for running connections - start backup first,
> then check if keepalive is enabled (and with what timers) with:
>
> netstat -to
>   
Client:
tcp        0      0 client:9102         server:54043        ESTABLISHED keepalive (7196.36/0/0)
tcp        0      0 client:43628        server:9103         ESTABLISHED keepalive (7197.26/0/0)

Server (behind NAT):
tcp        0      0 192.168.42.2:9103   client:43628        ESTABLISHED keepalive (7199.10/0/0)
tcp        0      0 127.0.0.2:9103      127.0.0.2:33218     ESTABLISHED keepalive (7197.84/0/0)
tcp        0      0 127.0.0.2:36664     127.0.0.2:9101      TIME_WAIT   timewait (56.31/0/0)
tcp        0      0 192.168.42.2:54043  client:9102         ESTABLISHED keepalive (7198.18/0/0)
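
In case it helps anyone compare many connections at once, here is a
minimal Python sketch for pulling the timer column out of "netstat -to"
lines like the ones above. The regular expression and the parse_timer
name are my own assumptions based on the output format shown here, not
on any documented netstat specification.

```python
import re

# Matches a trailing "name (seconds/retransmits/probes)" timer column.
TIMER_RE = re.compile(r"(\w+)\s+\((\d+(?:\.\d+)?)/(\d+)/(\d+)\)\s*$")


def parse_timer(line):
    """Return (timer_name, seconds_left) from one netstat -to line, or None."""
    match = TIMER_RE.search(line)
    if match is None:
        return None
    return match.group(1), float(match.group(2))


sample = "tcp 0 0 client:9102 server:54043 ESTABLISHED keepalive (7196.36/0/0)"
print(parse_timer(sample))
```

A keepalive timer close to 7200 seconds, as in every ESTABLISHED line
above, indicates the connection is using the system default rather than
a per-socket value.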

-- 
Jon Schewe | http://mtu.net/~jpschewe
If you see an attachment named signature.asc, this is my digital
signature. See http://www.gnupg.org for more information.




Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer

2010-04-12 Thread Matija Nalis
On Mon, Apr 12, 2010 at 05:41:51AM -0500, Jon Schewe wrote:
> > Strange. Are you running GNU/Linux system on all the machines 
> > (FD, SD, DIR) ? IIRC, it might not be supported on other systems,
> > and/or it may need additional tuning on them.
> >
> >   
> I'm running opensuse Linux for the director and storage daemon and
> Debian Linux for the file daemon.

That is strange...
Can you check what your default SO_KEEPALIVE values are with:

grep '' /proc/sys/net/ipv4/tcp_keepalive_*

and what Bacula is using for running connections - start a backup first,
then check whether keepalive is enabled (and with what timers) with:

netstat -to
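
For scripting the first of those checks, here is a small Python sketch
that reads the same files the grep command prints. The function name
read_keepalive_settings and its base parameter are assumptions for this
example; on systems without /proc/sys/net/ipv4 it simply returns an
empty dictionary.

```python
import glob
import os


def read_keepalive_settings(base="/proc/sys/net/ipv4"):
    """Return {file name: integer value} for every tcp_keepalive_* file under base."""
    settings = {}
    for path in sorted(glob.glob(os.path.join(base, "tcp_keepalive_*"))):
        with open(path) as handle:
            settings[os.path.basename(path)] = int(handle.read().strip())
    return settings


# Print in the same "name:value" shape that the grep command produces.
for name, value in read_keepalive_settings().items():
    print(f"{name}:{value}")
```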



Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer

2010-04-12 Thread Jon Schewe
On 04/12/2010 04:17 AM, Matija Nalis wrote:
> On Fri, Apr 09, 2010 at 07:30:19PM -0500, Jon Schewe wrote:
>   
>> I have heartbeat intervals set at the following:
>> bacula-dir.conf:
>> client {
>>   Heartbeat interval = 15 Seconds
>> }
>> storage {
>>   Heartbeat interval = 1 minutes
>> }
>>
>> bacula-sd.conf
>> storage {
>>   Heartbeat interval = 1 minute
>> }
>>
>> bacula-fd.conf
>> FileDaemon {
>>   Heartbeat Interval = 5 seconds
>> }
>> 
> Strange. Are you running GNU/Linux system on all the machines 
> (FD, SD, DIR) ? IIRC, it might not be supported on other systems,
> and/or it may need additional tuning on them.
>
>   
I'm running opensuse Linux for the director and storage daemon and
Debian Linux for the file daemon.




Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer

2010-04-12 Thread Matija Nalis
On Fri, Apr 09, 2010 at 07:30:19PM -0500, Jon Schewe wrote:
> I have heartbeat intervals set at the following:
> bacula-dir.conf:
> client {
>   Heartbeat interval = 15 Seconds
> }
> storage {
>   Heartbeat interval = 1 minutes
> }
> 
> bacula-sd.conf
> storage {
>   Heartbeat interval = 1 minute
> }
> 
> bacula-fd.conf
> FileDaemon {
>   Heartbeat Interval = 5 seconds
> }

Strange. Are you running GNU/Linux system on all the machines 
(FD, SD, DIR) ? IIRC, it might not be supported on other systems,
and/or it may need additional tuning on them.


I've updated the docs at http://tinyurl.com/y8wapdu


-- 
Matija Nalis
Odjel racunalno-informacijskih sustava i servisa
  
Hrvatska akademska i istrazivacka mreza - CARNet 
Josipa Marohnica 5, 1 Zagreb
tel. +385 1 6661 616, fax. +385 1 6661 766
www.CARNet.hr



Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer

2010-04-12 Thread Graham Keeling
On Sun, Apr 11, 2010 at 09:32:43AM -0500, Jon Schewe wrote:
> I got it to work again last night. Changing the firewall time outs
> didn't help. What fixed it was turning off Accurate backups.

Ah, so possibly bacula spent long enough stuck doing an accurate query in the
catalog that the firewall connection timed out.
Are you using mysql and bacula-5.0.1?




Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer

2010-04-11 Thread Jon Schewe
I got it to work again last night. Changing the firewall timeouts
didn't help. What fixed it was turning off Accurate backups.




Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer

2010-04-10 Thread Jon Schewe
I increased the connection timeout and started another job and got this:

10-Apr 08:11 jon-dir JobId 5334: Start Backup JobId 5334, Job=mtu.2010-04-10_08.11.11_32
10-Apr 08:11 jon-dir JobId 5334: Using Device "FileStorage"
10-Apr 08:11 mtu-fd JobId 5334: shell command: run ClientRunBeforeJob "/etc/bacula/before-full-backup.sh"
10-Apr 08:11 jon-dir JobId 5334: Sending Accurate information.
10-Apr 10:51 jon-dir JobId 0: Error: bsock.c:379 Wrote 77 bytes to client:127.0.0.2:36131, but only 0 accepted.
10-Apr 10:51 jon-dir JobId 0: Error: bsock.c:379 Wrote 77 bytes to client:127.0.0.2:36131, but only 0 accepted.
10-Apr 10:51 jon-dir JobId 0: Error: openssl.c:86 TLS shutdown failure.: ERR=error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry
10-Apr 10:51 jon-dir JobId 0: Error: openssl.c:86 TLS shutdown failure.: ERR=error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry
10-Apr 10:53 mtu-fd JobId 5334: Fatal error: Bad response from stored to open command
10-Apr 10:53 jon-dir JobId 5334: Error: Bacula jon-dir 3.0.3 (18Oct09): 10-Apr-2010 10:53:23




Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer

2010-04-10 Thread Jon Schewe
On 04/09/2010 02:33 AM, jerry lowry wrote:
> On 4/10/2010 3:30 AM, Jon Schewe wrote:
>   
>> On 04/08/2010 07:04 AM, Matija Nalis wrote:
>>
>> 
>>> On Wed, Apr 07, 2010 at 02:15:14PM +0100, Prashant Ramhit wrote:
>>>
>>>  
>>>   
>>>> 06-Apr 12:54 client-fd JobId 299: Fatal error: backup.c:892 Network
>>>> send error to SD. ERR=Connection reset by peer
>>>>
>>>> Is it possible to tell me how to enable more debug on client and
>>>> storage so that i can find more clues to this issue.
>>>>
>>>>
>>>>
>>> You can use "-d number" to increase debug level; but in your case it
>>> should be pretty clear -- something (usually router or firewall)
>>> between SD and FD (or even local firewalls on themselves) is killing
>>> TCP connection (usually because it was idle for too long).
>>>
>>> See http://tinyurl.com/y8wapdu
>>> it adding "Heartbeat Interval" helps you.
>>>
>>>
>>>  
>>>   
>> I have heartbeat intervals set at the following:
>> bacula-dir.conf:
>> client {
>>Heartbeat interval = 15 Seconds
>> }
>> storage {
>>Heartbeat interval = 1 minutes
>> }
>>
>> bacula-sd.conf
>> storage {
>>Heartbeat interval = 1 minute
>> }
>>
>> bacula-fd.conf
>> FileDaemon {
>>Heartbeat Interval = 5 seconds
>> }
>>
>>
>> 
> Hi,  are you backing up through a firewall.  I had this same problem and 
> it tuned out that the firewall has a setup limit on how long a job will 
> last.  Reset the limit and all my backups work as planned.
>
>
>   
Yes, I'm behind a firewall running dd-wrt. Do I just need to increase
the connection timeout? Why doesn't the heartbeat take care of this?




Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer

2010-04-09 Thread jerry lowry
On 4/10/2010 3:30 AM, Jon Schewe wrote:
> On 04/08/2010 07:04 AM, Matija Nalis wrote:
>
>> On Wed, Apr 07, 2010 at 02:15:14PM +0100, Prashant Ramhit wrote:
>>
>>  
>>> 06-Apr 12:54 client-fd JobId 299: Fatal error: backup.c:892 Network send 
>>> error to SD. ERR=Connection reset by peer
>>>
>>> Is it possible to tell me how to enable more debug on client and
>>> storage so that i can find more clues to this issue.
>>>
>>>
>> You can use "-d number" to increase debug level; but in your case it
>> should be pretty clear -- something (usually router or firewall)
>> between SD and FD (or even local firewalls on themselves) is killing
>> TCP connection (usually because it was idle for too long).
>>
>> See http://tinyurl.com/y8wapdu
>> and check whether adding "Heartbeat Interval" helps you.
>>
>>
>>  
> I have heartbeat intervals set at the following:
> bacula-dir.conf:
> client {
>   Heartbeat interval = 15 Seconds
> }
> storage {
>   Heartbeat interval = 1 minutes
> }
>
> bacula-sd.conf
> storage {
>   Heartbeat interval = 1 minute
> }
>
> bacula-fd.conf
> FileDaemon {
>   Heartbeat Interval = 5 seconds
> }
>
Hi, are you backing up through a firewall? I had this same problem and
it turned out that the firewall has a configured limit on how long a job
will last. I reset the limit and all my backups now work as planned.

jerry



Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer

2010-04-09 Thread Jon Schewe
On 04/08/2010 07:04 AM, Matija Nalis wrote:
> On Wed, Apr 07, 2010 at 02:15:14PM +0100, Prashant Ramhit wrote:
>   
>> 06-Apr 12:54 client-fd JobId 299: Fatal error: backup.c:892 Network send 
>> error to SD. ERR=Connection reset by peer
>>
>> Is it possible to tell me how to enable more debug on client and
>> storage so that i can find more clues to this issue.
>> 
> You can use "-d number" to increase debug level; but in your case it
> should be pretty clear -- something (usually router or firewall)
> between SD and FD (or even local firewalls on themselves) is killing
> TCP connection (usually because it was idle for too long).
>
> See http://tinyurl.com/y8wapdu
> and check whether adding "Heartbeat Interval" helps you.
>
>   
I have heartbeat intervals set at the following:
bacula-dir.conf:
client {
  Heartbeat interval = 15 Seconds
}
storage {
  Heartbeat interval = 1 minutes
}

bacula-sd.conf
storage {
  Heartbeat interval = 1 minute
}

bacula-fd.conf
FileDaemon {
  Heartbeat Interval = 5 seconds
}
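
To put those values side by side, here is a small Python sketch that
converts them to seconds and compares them with a firewall idle timeout.
The 300 second timeout is purely a placeholder (not a value from this
thread or from any particular firewall), and the helper name is made up
for the example.

```python
UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600}

def interval_seconds(text):
    """Convert a value such as '15 Seconds' or '1 minutes' to seconds."""
    count, unit = text.split()
    return int(count) * UNIT_SECONDS[unit.lower().rstrip("s")]

# Values copied from the configuration above.
heartbeats = {
    "bacula-dir.conf client": "15 Seconds",
    "bacula-dir.conf storage": "1 minutes",
    "bacula-sd.conf storage": "1 minute",
    "bacula-fd.conf FileDaemon": "5 seconds",
}

# Placeholder only: replace with the idle timeout of the device in the path.
ASSUMED_IDLE_TIMEOUT = 300

seconds = {where: interval_seconds(value) for where, value in heartbeats.items()}
too_slow = [where for where, secs in seconds.items()
            if secs >= ASSUMED_IDLE_TIMEOUT]

for where, secs in seconds.items():
    print(f"{where}: {secs}s")
print("at or above the assumed timeout:", too_slow or "none")
```

With any idle timeout longer than a minute, none of these intervals is
the limiting factor, so the open question is whether heartbeats are
actually being sent on the connection that gets reset.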




Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer

2010-04-08 Thread Matija Nalis
On Wed, Apr 07, 2010 at 02:15:14PM +0100, Prashant Ramhit wrote:
> 06-Apr 12:54 client-fd JobId 299: Fatal error: backup.c:892 Network send 
> error to SD. ERR=Connection reset by peer
>
> Is it possible to tell me how to enable more debug on client and
> storage so that i can find more clues to this issue.

You can use "-d number" to increase debug level; but in your case it
should be pretty clear -- something (usually router or firewall)
between SD and FD (or even local firewalls on themselves) is killing
TCP connection (usually because it was idle for too long).

See http://tinyurl.com/y8wapdu
and check whether adding "Heartbeat Interval" helps you.
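
As a rough illustration of the mechanism (this is not Bacula's code): a
connection only survives an idle timeout on a firewall if some traffic
crosses it often enough. One OS-level way to get that is TCP keepalive.
The sketch below assumes Python and the Linux option names TCP_KEEPIDLE,
TCP_KEEPINTVL and TCP_KEEPCNT; the function name and the default values
are made up for the example.

```python
import socket

def enable_keepalive(sock, idle_seconds=60, interval_seconds=15, probes=4):
    """Turn on TCP keepalive probes for an otherwise idle connection."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket tuning options are platform-specific (Linux names here);
    # skip any that this platform's socket module does not expose.
    for name, value in (("TCP_KEEPIDLE", idle_seconds),
                        ("TCP_KEEPINTVL", interval_seconds),
                        ("TCP_KEEPCNT", probes)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock

# Demonstrate on an unconnected socket; a real client would connect next.
sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
print("keepalive enabled:", keepalive_on)
sock.close()
```

The idle time has to be shorter than the firewall's idle timeout for the
probes to do any good.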



[Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer

2010-04-07 Thread Prashant Ramhit




Hi All,
My backup is failing on one client.
The client has only one FileSet, and its size is 400 GB.

The error is as follows:

Messages: 
06-Apr 12:16 server-sd JobId 299: Spooling data again ...
06-Apr 12:38 server-sd JobId 299: User specified spool size reached.
06-Apr 12:38 server-sd JobId 299: Writing spooled data to Volume. Despooling 12,422,998,992 bytes ...
06-Apr 12:43 server-sd JobId 299: Despooling elapsed time = 00:04:50, Transfer rate = 42.83 M bytes/second
06-Apr 12:43 server-sd JobId 299: Spooling data again ...
06-Apr 12:54 client-fd JobId 299: Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer
  Volume Session Time:    1270457469
  Last Volume Bytes:      216,986,112,000 (216.9 GB)
  Non-fatal FD errors:    0
  SD Errors:              0
  FD termination status:  Error
  SD termination status:  Error
  Termination:            *** Backup Error ***

Is it possible to tell me how to enable more debug output on the client
and storage daemons so that I can find more clues to this issue?
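
For reference, the numbers in the log can be checked with a few lines of
Python. This is only arithmetic on the values shown above; the helper
name is made up, and it says nothing about the cause of the reset.

```python
from datetime import datetime

def minutes_between(start, end):
    """Minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60

# Despooling: byte count and elapsed time as printed in the log.
despool_bytes = 12_422_998_992
despool_seconds = 4 * 60 + 50            # "00:04:50"
rate_mb_per_s = despool_bytes / despool_seconds / 1e6

# Spooling passes: the one that completed and the one that failed.
spool_minutes = minutes_between("12:16", "12:38")
failed_after_minutes = minutes_between("12:43", "12:54")

print(f"despool rate: {rate_mb_per_s:.1f} MB/s")
print(f"completed spooling pass: {spool_minutes:.0f} min")
print(f"error after resuming spooling: {failed_after_minutes:.0f} min")
```

The rate comes out at roughly 42.8 MB/s, in line with the reported
figure; the earlier spooling pass took 22 minutes, and the error arrived
11 minutes into the next one.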

Many thanks,
Prashant Ramhit


