Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer
On Sun, Apr 18, 2010 at 11:46:33AM -0500, Jon Schewe wrote:
> > http://wiki.bacula.org/doku.php?id=faq#my_backup_starts_but_dies_after_a_while_with_connection_reset_by_peer_error
> >
> > [1] It actually tries that at one point in src/lib/bsock.c if
> > TCP_KEEPIDLE support is detected, but it fails to detect it
> > properly because the required header is not included.
> >
> > However, even after fixing that (and the missing semicolon in the
> > 'int opt = heart_beat' line), it still doesn't look like it sets
> > TCP_KEEPIDLE correctly on the FD->SD connection, so maybe this
> > codepath is not used there.
> >
> > Anyway, I gave up debugging there and just set the system
> > defaults. But I just thought I'd mention that in case someone
> > else wants to continue chasing the bug.
>
> Hmm, this sounds like a bug that should be fixed and once it is fixed
> may remove a bunch of problems with firewalls.

FYI, I've put up a patch which fixes the current support on the
bacula-devel mailing list. That support could be extended (as not all
parts of bacula use that function), but it might be enough.

If someone is willing to try it, let me (or better, the whole list)
know how it fares and whether it fixes the timeouts without the user
needing to resort to changing system defaults.

--
___
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer
On 04/16/2010 08:30 AM, Matija Nalis wrote:
> On Mon, Apr 12, 2010 at 03:59:49PM -0500, Jon Schewe wrote:
>> On 4/12/10 9:40 AM, Matija Nalis wrote:
>>> It is especially a problem with bigger databases and MySQL instead of
>>> PostgreSQL, see http://bugs.bacula.org/view.php?id=1472, where it can
>>> take even several hours! (note that while it talks about "restore"
>>> speed, it is also related to accurate backups which employ similar
>>> SQL queries)
>>
>> Must be what it is then. I've been thinking about switching to postgres,
>> but haven't because the opensuse packages for bacula are only for mysql.
>> This may motivate me more.
>
> You should probably switch soon, before you get to like your
> database... Exporting bacula mysql tables for import in PostgreSQL
> can be very painful and problematic; it is much better to just drop
> the database and create a fresh one.

I'll keep that in mind as I go forward.

>> The backup finished, so it seems that in version 3.0.3 bacula does NOT
>> set the socket option SO_KEEPALIVE.
>
> Hmm, yeah, I've checked the code casually, and it indeed looks like the
> heartbeats are not setting SO_KEEPALIVE timeouts (note that it does
> set SO_KEEPALIVE on the socket, otherwise the advice above wouldn't
> work -- it just doesn't do TCP_KEEPIDLE on that[1] to specify
> user-defined timeouts and instead uses system defaults).
>
> The heartbeats look like they are doing other things though
> (application-level, not socket-level), but as you saw they are not
> perfect for fixing network idleness problems - and so you also MUST
> set system defaults.
>
> I've updated the FAQ at:
> http://wiki.bacula.org/doku.php?id=faq#my_backup_starts_but_dies_after_a_while_with_connection_reset_by_peer_error
>
> [1] It actually tries that at one point in src/lib/bsock.c if
> TCP_KEEPIDLE support is detected, but it fails to detect it
> properly because the required header is not included.
>
> However, even after fixing that (and the missing semicolon in the
> 'int opt = heart_beat' line), it still doesn't look like it sets
> TCP_KEEPIDLE correctly on the FD->SD connection, so maybe this
> codepath is not used there.
>
> Anyway, I gave up debugging there and just set the system
> defaults. But I just thought I'd mention that in case someone
> else wants to continue chasing the bug.

Hmm, this sounds like a bug that should be fixed and once it is fixed
may remove a bunch of problems with firewalls.
Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer
On Mon, Apr 12, 2010 at 03:59:49PM -0500, Jon Schewe wrote:
> On 4/12/10 9:40 AM, Matija Nalis wrote:
> > It is especially a problem with bigger databases and MySQL instead of
> > PostgreSQL, see http://bugs.bacula.org/view.php?id=1472, where it can
> > take even several hours! (note that while it talks about "restore"
> > speed, it is also related to accurate backups which employ similar
> > SQL queries)
>
> Must be what it is then. I've been thinking about switching to postgres,
> but haven't because the opensuse packages for bacula are only for mysql.
> This may motivate me more.

You should probably switch soon, before you get to like your
database... Exporting bacula mysql tables for import in PostgreSQL
can be very painful and problematic; it is much better to just drop
the database and create a fresh one.

> The backup finished, so it seems that in version 3.0.3 bacula does NOT
> set the socket option SO_KEEPALIVE.

Hmm, yeah, I've checked the code casually, and it indeed looks like the
heartbeats are not setting SO_KEEPALIVE timeouts (note that it does
set SO_KEEPALIVE on the socket, otherwise the advice above wouldn't
work -- it just doesn't do TCP_KEEPIDLE on that[1] to specify
user-defined timeouts and instead uses system defaults).

The heartbeats look like they are doing other things though
(application-level, not socket-level), but as you saw they are not
perfect for fixing network idleness problems - and so you also MUST
set system defaults.

I've updated the FAQ at:
http://wiki.bacula.org/doku.php?id=faq#my_backup_starts_but_dies_after_a_while_with_connection_reset_by_peer_error

[1] It actually tries that at one point in src/lib/bsock.c if
TCP_KEEPIDLE support is detected, but it fails to detect it
properly because the required header is not included.

However, even after fixing that (and the missing semicolon in the
'int opt = heart_beat' line), it still doesn't look like it sets
TCP_KEEPIDLE correctly on the FD->SD connection, so maybe this
codepath is not used there.

Anyway, I gave up debugging there and just set the system
defaults. But I just thought I'd mention that in case someone
else wants to continue chasing the bug.

--
Matija Nalis
Odjel racunalno-informacijskih sustava i servisa
Hrvatska akademska i istrazivacka mreza - CARNet
Josipa Marohnica 5, 1 Zagreb
tel. +385 1 6661 616, fax. +385 1 6661 766
www.CARNet.hr
Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer
On 4/12/10 9:40 AM, Matija Nalis wrote:
> On Mon, Apr 12, 2010 at 09:23:51AM -0500, Jon Schewe wrote:
>> On 4/12/10 9:00 AM, Matija Nalis wrote:
>>> Good, let us know how it fares.
>>
>> It seems to be running, but I've run into a problem with bconsole. Once
>> I started the job, if I run bconsole and then "status dir", the console
>> hangs. If I strace the bconsole process it's stuck in a select call.
>>
>> >strace -p 18452
>> Process 18452 attached - interrupt to quit
>> select(4, [3], NULL, NULL, {9, 461287}) = 0 (Timeout)
>> read(3, 0x655d80, 5) = -1 EAGAIN (Resource temporarily unavailable)
>
> That should not be related to SO_KEEPALIVE - it should be completely
> transparent to the applications if the network is working (and even
> when it is not working, it should differ only in always terminating
> the connection instead of sometimes terminating the connection and
> sometimes hanging indefinitely).
>
> Anyway, there may be a few issues that make the director hang. The
> most common is that you are too eager. For example, if the SQL server
> is busy, "status dir" will hang until it completes.
>
> It is especially a problem with bigger databases and MySQL instead of
> PostgreSQL, see http://bugs.bacula.org/view.php?id=1472, where it can
> take even several hours! (note that while it talks about "restore"
> speed, it is also related to accurate backups which employ similar
> SQL queries)

Must be what it is then. I've been thinking about switching to postgres,
but haven't because the opensuse packages for bacula are only for mysql.
This may motivate me more.

The backup finished, so it seems that in version 3.0.3 bacula does NOT
set the socket option SO_KEEPALIVE.

--
Jon Schewe | http://mtu.net/~jpschewe
If you see an attachment named signature.asc, this is my digital
signature. See http://www.gnupg.org for more information.
Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer
On Mon, Apr 12, 2010 at 09:23:51AM -0500, Jon Schewe wrote:
> On 4/12/10 9:00 AM, Matija Nalis wrote:
> > (SO_KEEPALIVE will work even with only one side of the connection
> > having it enabled).
>
> So I should only need the heartbeat on that client's setup as well,
> right? Getting rid of extra heart beats would be nice.

Yes, it should be enough. Note that there is no real need to get rid of
extra heartbeats, they are not really expensive (so the biggest gain is
"cleaner" config files).

> > Good, let us know how it fares.
>
> It seems to be running, but I've run into a problem with bconsole. Once
> I started the job, if I run bconsole and then "status dir", the console
> hangs. If I strace the bconsole process it's stuck in a select call.
>
> >strace -p 18452
> Process 18452 attached - interrupt to quit
> select(4, [3], NULL, NULL, {9, 461287}) = 0 (Timeout)
> read(3, 0x655d80, 5) = -1 EAGAIN (Resource temporarily unavailable)

That should not be related to SO_KEEPALIVE - it should be completely
transparent to the applications if the network is working (and even
when it is not working, it should differ only in always terminating
the connection instead of sometimes terminating the connection and
sometimes hanging indefinitely).

Anyway, there may be a few issues that make the director hang. The most
common is that you are too eager. For example, if the SQL server is
busy, "status dir" will hang until it completes.

It is especially a problem with bigger databases and MySQL instead of
PostgreSQL, see http://bugs.bacula.org/view.php?id=1472, where it can
take even several hours! (note that while it talks about "restore"
speed, it is also related to accurate backups which employ similar
SQL queries)

If you are running MySQL for the database, you can check whether that
is the case with "show processlist" in MySQL (or simply wait).

Or you might be unlucky enough to hit a real director bug in 5.0.1, see
http://bugs.bacula.org/view.php?id=1528, but that is unlikely.
Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer
On 4/12/10 9:00 AM, Matija Nalis wrote:
> On Mon, Apr 12, 2010 at 08:45:36AM -0500, Jon Schewe wrote:
>> On 4/12/10 8:39 AM, Matija Nalis wrote:
>>> echo 60 > /proc/sys/net/ipv4/tcp_keepalive_time
>>>
>>> (or edit /etc/sysctl.d/* or /etc/sysctl.conf to retain the value
>>> across reboots). Can you try what "netstat -to" says after you lower
>>> that limit and rerun backups?
>>
>> Now I see the timer down where I expect it. Should I only need this on
>> the client?
>
> If only that client is having timeout problems, then yes (as I
> understand it, your Director and SD are on the same server, so you
> should not have timeout issues there as no networking is involved).
>
> (SO_KEEPALIVE will work even with only one side of the connection
> having it enabled).

So I should only need the heartbeat on that client's setup as well,
right? Getting rid of extra heart beats would be nice.

>>> If "netstat -to" then reports smaller timers (60 or less), then it
>>> should fix your problem, so you can try turning accurate back to yes.
>>>
>>> Does that help?
>>
>> It's running, I'll know in a couple of hours.
>
> Good, let us know how it fares.

It seems to be running, but I've run into a problem with bconsole. Once
I started the job, if I run bconsole and then "status dir", the console
hangs. If I strace the bconsole process it's stuck in a select call.

>strace -p 18452
Process 18452 attached - interrupt to quit
select(4, [3], NULL, NULL, {9, 461287}) = 0 (Timeout)
read(3, 0x655d80, 5) = -1 EAGAIN (Resource temporarily unavailable)
select(4, [3], NULL, NULL, {10, 0}) = 0 (Timeout)
read(3, 0x655d80, 5) = -1 EAGAIN (Resource temporarily unavailable)
select(4, [3], NULL, NULL, {10, 0}

--
Jon Schewe | http://mtu.net/~jpschewe
Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer
On Mon, Apr 12, 2010 at 08:45:36AM -0500, Jon Schewe wrote:
> On 4/12/10 8:39 AM, Matija Nalis wrote:
> > echo 60 > /proc/sys/net/ipv4/tcp_keepalive_time
> >
> > (or edit /etc/sysctl.d/* or /etc/sysctl.conf to retain the value
> > across reboots). Can you try what "netstat -to" says after you lower
> > that limit and rerun backups?
>
> Now I see the timer down where I expect it. Should I only need this on
> the client?

If only that client is having timeout problems, then yes (as I
understand it, your Director and SD are on the same server, so you
should not have timeout issues there as no networking is involved).

(SO_KEEPALIVE will work even with only one side of the connection
having it enabled).

> > If "netstat -to" then reports smaller timers (60 or less), then it
> > should fix your problem, so you can try turning accurate back to yes.
> >
> > Does that help?
>
> It's running, I'll know in a couple of hours.

Good, let us know how it fares.

--
Matija Nalis
Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer
On 4/12/10 2:47 AM, Graham Keeling wrote:
> On Sun, Apr 11, 2010 at 09:32:43AM -0500, Jon Schewe wrote:
>> I got it to work again last night. Changing the firewall time outs
>> didn't help. What fixed it was turning off Accurate backups.
>
> Ah, so possibly bacula spent long enough stuck doing an accurate query
> in the catalog that the firewall connection timed out.
> Are you using mysql and bacula-5.0.1?

I'm using mysql and bacula 3.0.3.

--
Jon Schewe | http://mtu.net/~jpschewe
Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer
On 4/12/10 8:39 AM, Matija Nalis wrote:
> On Mon, Apr 12, 2010 at 07:59:53AM -0500, Jon Schewe wrote:
>> /proc/sys/net/ipv4/tcp_keepalive_time:7200
>>
>>> netstat -to
>>
>> Client:
>> tcp 0 0 client:9102 server:54043 ESTABLISHED keepalive (7196.36/0/0)
>
> That's strange. It should've been the timeouts you specified in the
> config files, not 7200 seconds (two hours), which is the system default.
>
> It looks like bacula does not use the TCP_KEEPIDLE setsockopt(2) on your
> system. You might want to report a bug on http://bugs.bacula.org/
>
> IMHO, it should work there. Or if not, it should probably throw a
> warning if you try to use it and it is not supported or fails.
>
> Apart from fixing bacula, you can override the system default, for
> example (on both server and client) do:
>
> echo 60 > /proc/sys/net/ipv4/tcp_keepalive_time
>
> (or edit /etc/sysctl.d/* or /etc/sysctl.conf to retain the value across
> reboots). Can you try what "netstat -to" says after you lower that
> limit and rerun backups?

Now I see the timer down where I expect it. Should I only need this on
the client?

> If "netstat -to" then reports smaller timers (60 or less), then it
> should fix your problem, so you can try turning accurate back to yes.
>
> Does that help?

It's running, I'll know in a couple of hours.

--
Jon Schewe | http://mtu.net/~jpschewe
Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer
On Mon, Apr 12, 2010 at 07:59:53AM -0500, Jon Schewe wrote:
> /proc/sys/net/ipv4/tcp_keepalive_time:7200
>
> > netstat -to
>
> Client:
> tcp 0 0 client:9102 server:54043 ESTABLISHED keepalive (7196.36/0/0)

That's strange. It should've been the timeouts you specified in the
config files, not 7200 seconds (two hours), which is the system default.

It looks like bacula does not use the TCP_KEEPIDLE setsockopt(2) on your
system. You might want to report a bug on http://bugs.bacula.org/

IMHO, it should work there. Or if not, it should probably throw a
warning if you try to use it and it is not supported or fails.

Apart from fixing bacula, you can override the system default, for
example (on both server and client) do:

echo 60 > /proc/sys/net/ipv4/tcp_keepalive_time

(or edit /etc/sysctl.d/* or /etc/sysctl.conf to retain the value across
reboots). Can you try what "netstat -to" says after you lower that
limit and rerun backups?

If "netstat -to" then reports smaller timers (60 or less), then it
should fix your problem, so you can try turning accurate back to yes.

Does that help?
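For anyone who wants to see what the per-socket alternative to the
system-wide override looks like, here is a minimal sketch in Python (not
Bacula's C code, and not the patch discussed in this thread). It assumes
a Linux-like platform where the socket module exposes TCP_KEEPIDLE; the
function name enable_keepalive and the 60-second value are only
illustrative.

```python
import socket


def enable_keepalive(sock, idle_seconds=60):
    """Enable SO_KEEPALIVE and, if the platform offers it, a per-socket idle timer.

    Returns True when TCP_KEEPIDLE was set, False when only SO_KEEPALIVE
    was enabled and the system-wide tcp_keepalive_time still applies.
    """
    # Without this, the kernel sends no keepalive probes at all.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # TCP_KEEPIDLE is not available everywhere, so look it up defensively.
    keepidle = getattr(socket, "TCP_KEEPIDLE", None)
    if keepidle is None:
        return False

    # Per-socket override of the idle time before the first probe is sent.
    sock.setsockopt(socket.IPPROTO_TCP, keepidle, idle_seconds)
    return True


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        per_socket = enable_keepalive(s, 60)
        print("per-socket idle timer set:", per_socket)
```

If the function returns False, the connection still has keepalive
enabled but falls back to the system default, which matches the
7200-second timers reported above.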
Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer
On 4/12/10 7:21 AM, Matija Nalis wrote:
> On Mon, Apr 12, 2010 at 05:41:51AM -0500, Jon Schewe wrote:
>>> Strange. Are you running GNU/Linux system on all the machines
>>> (FD, SD, DIR) ? IIRC, it might not be supported on other systems,
>>> and/or it may need additional tuning on them.
>>
>> I'm running opensuse Linux for the director and storage daemon and
>> Debian Linux for the file daemon.
>
> that is strange...
> can you check what are your default SO_KEEPALIVE values with:
>
> grep '' /proc/sys/net/ipv4/tcp_keepalive_*

Server:
/proc/sys/net/ipv4/tcp_keepalive_intvl:75
/proc/sys/net/ipv4/tcp_keepalive_probes:9
/proc/sys/net/ipv4/tcp_keepalive_time:7200

Client:
/proc/sys/net/ipv4/tcp_keepalive_intvl:75
/proc/sys/net/ipv4/tcp_keepalive_probes:9
/proc/sys/net/ipv4/tcp_keepalive_time:7200

bacula 3.0.3 on both systems

> and what bacula is using for running connections - start backup first,
> then check if keepalive is enabled (and with what timers) with:
>
> netstat -to

Client:
tcp 0 0 client:9102 server:54043 ESTABLISHED keepalive (7196.36/0/0)
tcp 0 0 client:43628 server:9103 ESTABLISHED keepalive (7197.26/0/0)

Server (behind NAT):
tcp 0 0 192.168.42.2:9103 client:43628 ESTABLISHED keepalive (7199.10/0/0)
tcp 0 0 127.0.0.2:9103 127.0.0.2:33218 ESTABLISHED keepalive (7197.84/0/0)
tcp 0 0 127.0.0.2:36664 127.0.0.2:9101 TIME_WAIT timewait (56.31/0/0)
tcp 0 0 192.168.42.2:54043 client:9102 ESTABLISHED keepalive (7198.18/0/0)

--
Jon Schewe | http://mtu.net/~jpschewe
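Checking those timer columns by eye gets tedious with many connections,
so here is a small Python sketch that pulls the timer out of a
"netstat -to" line. It assumes the net-tools layout seen in this thread,
where the line ends in "name (remaining/x/y)"; only the timer name and
the remaining seconds are interpreted, and the helper name parse_timer
is made up for illustration.

```python
import re

# Matches the trailing timer column, e.g. "keepalive (7196.36/0/0)".
TIMER_RE = re.compile(
    r"(?P<kind>\w+)\s+\((?P<remaining>[\d.]+)/(?P<second>\d+)/(?P<third>\d+)\)\s*$"
)


def parse_timer(line):
    """Return (timer_name, seconds_remaining) for a netstat -to line, or None."""
    match = TIMER_RE.search(line)
    if match is None:
        return None
    return match.group("kind"), float(match.group("remaining"))


if __name__ == "__main__":
    sample = "tcp 0 0 client:9102 server:54043 ESTABLISHED keepalive (7196.36/0/0)"
    kind, remaining = parse_timer(sample)
    # A keepalive timer far above 60 seconds suggests the system default
    # is in use rather than a short per-connection or sysctl value.
    print(kind, remaining, "looks like system default:", remaining > 60)
```

Lines without a timer column (for example the netstat header) simply
yield None, so the function can be mapped over the whole output.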
Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer
On Mon, Apr 12, 2010 at 05:41:51AM -0500, Jon Schewe wrote:
> > Strange. Are you running GNU/Linux system on all the machines
> > (FD, SD, DIR) ? IIRC, it might not be supported on other systems,
> > and/or it may need additional tuning on them.
>
> I'm running opensuse Linux for the director and storage daemon and
> Debian Linux for the file daemon.

that is strange...
can you check what are your default SO_KEEPALIVE values with:

grep '' /proc/sys/net/ipv4/tcp_keepalive_*

and what bacula is using for running connections - start backup first,
then check if keepalive is enabled (and with what timers) with:

netstat -to
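As a rough aid for interpreting the three values that grep prints, the
Python sketch below reads them and works out how long a silent
connection can sit before the kernel gives up on it. It relies on the
Linux behaviour that the first probe goes out after tcp_keepalive_time
idle seconds and that the connection is dropped after
tcp_keepalive_probes unanswered probes spaced tcp_keepalive_intvl
seconds apart. The function names are illustrative only.

```python
from pathlib import Path

NAMES = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")


def read_keepalive_defaults(base="/proc/sys/net/ipv4"):
    """Read the system-wide keepalive settings as a dict of integers."""
    return {name: int((Path(base) / name).read_text().strip()) for name in NAMES}


def worst_case_detection_seconds(time_s, intvl_s, probes):
    """Idle time before the first probe plus the time for all probes to fail."""
    return time_s + intvl_s * probes


if __name__ == "__main__":
    # With the usual Linux defaults (7200, 75, 9) the first probe is only
    # sent after two hours, far later than many firewalls keep idle state.
    print(worst_case_detection_seconds(7200, 75, 9))
    if all((Path("/proc/sys/net/ipv4") / n).exists() for n in NAMES):
        current = read_keepalive_defaults()
        print(current)
```

What matters for a firewall idle timeout is mainly the first number:
probes only keep the connection's state alive if tcp_keepalive_time is
shorter than the firewall's idle limit.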
Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer
On 04/12/2010 04:17 AM, Matija Nalis wrote:
> On Fri, Apr 09, 2010 at 07:30:19PM -0500, Jon Schewe wrote:
>> I have heartbeat intervals set at the following:
>> bacula-dir.conf:
>> client {
>>   Heartbeat interval = 15 Seconds
>> }
>> storage {
>>   Heartbeat interval = 1 minutes
>> }
>>
>> bacula-sd.conf
>> storage {
>>   Heartbeat interval = 1 minute
>> }
>>
>> bacula-fd.conf
>> FileDaemon {
>>   Heartbeat Interval = 5 seconds
>> }
>
> Strange. Are you running GNU/Linux system on all the machines
> (FD, SD, DIR) ? IIRC, it might not be supported on other systems,
> and/or it may need additional tuning on them.

I'm running opensuse Linux for the director and storage daemon and
Debian Linux for the file daemon.
Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer
On Fri, Apr 09, 2010 at 07:30:19PM -0500, Jon Schewe wrote:
> I have heartbeat intervals set at the following:
> bacula-dir.conf:
> client {
>   Heartbeat interval = 15 Seconds
> }
> storage {
>   Heartbeat interval = 1 minutes
> }
>
> bacula-sd.conf
> storage {
>   Heartbeat interval = 1 minute
> }
>
> bacula-fd.conf
> FileDaemon {
>   Heartbeat Interval = 5 seconds
> }

Strange. Are you running GNU/Linux system on all the machines
(FD, SD, DIR) ? IIRC, it might not be supported on other systems,
and/or it may need additional tuning on them.

I've updated the docs at http://tinyurl.com/y8wapdu

--
Matija Nalis
Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer
On Sun, Apr 11, 2010 at 09:32:43AM -0500, Jon Schewe wrote:
> I got it to work again last night. Changing the firewall time outs
> didn't help. What fixed it was turning off Accurate backups.

Ah, so possibly bacula spent long enough stuck doing an accurate query
in the catalog that the firewall connection timed out.
Are you using mysql and bacula-5.0.1?
Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer
I got it to work again last night. Changing the firewall time outs
didn't help. What fixed it was turning off Accurate backups.
Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer
I increased the connection timeout and started another job and got this:

10-Apr 08:11 jon-dir JobId 5334: Start Backup JobId 5334, Job=mtu.2010-04-10_08.11.11_32
10-Apr 08:11 jon-dir JobId 5334: Using Device "FileStorage"
10-Apr 08:11 mtu-fd JobId 5334: shell command: run ClientRunBeforeJob "/etc/bacula/before-full-backup.sh"
10-Apr 08:11 jon-dir JobId 5334: Sending Accurate information.
10-Apr 10:51 jon-dir JobId 0: Error: bsock.c:379 Wrote 77 bytes to client:127.0.0.2:36131, but only 0 accepted.
10-Apr 10:51 jon-dir JobId 0: Error: bsock.c:379 Wrote 77 bytes to client:127.0.0.2:36131, but only 0 accepted.
10-Apr 10:51 jon-dir JobId 0: Error: openssl.c:86 TLS shutdown failure.: ERR=error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry
10-Apr 10:51 jon-dir JobId 0: Error: openssl.c:86 TLS shutdown failure.: ERR=error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry
10-Apr 10:53 mtu-fd JobId 5334: Fatal error: Bad response from stored to open command
10-Apr 10:53 jon-dir JobId 5334: Error: Bacula jon-dir 3.0.3 (18Oct09): 10-Apr-2010 10:53:23
Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer
On 04/09/2010 02:33 AM, jerry lowry wrote:
> On 4/10/2010 3:30 AM, Jon Schewe wrote:
>> On 04/08/2010 07:04 AM, Matija Nalis wrote:
>>> On Wed, Apr 07, 2010 at 02:15:14PM +0100, Prashant Ramhit wrote:
>>>> 06-Apr 12:54 client-fd JobId 299: Fatal error: backup.c:892 Network
>>>> send error to SD. ERR=Connection reset by peer
>>>>
>>>> Is it possible to tell me how to enable more debug on client and
>>>> storage so that i can find more clues to this issue.
>>>
>>> You can use "-d number" to increase debug level; but in your case it
>>> should be pretty clear -- something (usually router or firewall)
>>> between SD and FD (or even local firewalls on themselves) is killing
>>> the TCP connection (usually because it was idle for too long).
>>>
>>> See http://tinyurl.com/y8wapdu
>>> if adding "Heartbeat Interval" helps you.
>>
>> I have heartbeat intervals set at the following:
>> bacula-dir.conf:
>> client {
>>   Heartbeat interval = 15 Seconds
>> }
>> storage {
>>   Heartbeat interval = 1 minutes
>> }
>>
>> bacula-sd.conf
>> storage {
>>   Heartbeat interval = 1 minute
>> }
>>
>> bacula-fd.conf
>> FileDaemon {
>>   Heartbeat Interval = 5 seconds
>> }
>
> Hi, are you backing up through a firewall? I had this same problem and
> it turned out that the firewall has a setup limit on how long a job
> will last. Reset the limit and all my backups work as planned.

Yes, I'm behind a firewall running dd-wrt. Do I just need to increase
the connection timeout? Why doesn't the heartbeat take care of this?
Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer
On 4/10/2010 3:30 AM, Jon Schewe wrote:
> On 04/08/2010 07:04 AM, Matija Nalis wrote:
>> On Wed, Apr 07, 2010 at 02:15:14PM +0100, Prashant Ramhit wrote:
>>> 06-Apr 12:54 client-fd JobId 299: Fatal error: backup.c:892 Network send
>>> error to SD. ERR=Connection reset by peer
>>>
>>> Is it possible to tell me how to enable more debug on client and
>>> storage so that i can find more clues to this issue.
>>
>> You can use "-d number" to increase debug level; but in your case it
>> should be pretty clear -- something (usually router or firewall)
>> between SD and FD (or even local firewalls on themselves) is killing
>> the TCP connection (usually because it was idle for too long).
>>
>> See http://tinyurl.com/y8wapdu
>> if adding "Heartbeat Interval" helps you.
>
> I have heartbeat intervals set at the following:
> bacula-dir.conf:
> client {
>   Heartbeat interval = 15 Seconds
> }
> storage {
>   Heartbeat interval = 1 minutes
> }
>
> bacula-sd.conf
> storage {
>   Heartbeat interval = 1 minute
> }
>
> bacula-fd.conf
> FileDaemon {
>   Heartbeat Interval = 5 seconds
> }

Hi, are you backing up through a firewall? I had this same problem and
it turned out that the firewall has a setup limit on how long a job will
last. Reset the limit and all my backups work as planned.

jerry
Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer
On 04/08/2010 07:04 AM, Matija Nalis wrote:
> On Wed, Apr 07, 2010 at 02:15:14PM +0100, Prashant Ramhit wrote:
>> 06-Apr 12:54 client-fd JobId 299: Fatal error: backup.c:892 Network send
>> error to SD. ERR=Connection reset by peer
>>
>> Is it possible to tell me how to enable more debug on client and
>> storage so that I can find more clues to this issue.
>
> You can use "-d number" to increase debug level; but in your case it
> should be pretty clear -- something (usually router or firewall)
> between SD and FD (or even local firewalls on themselves) is killing
> TCP connection (usually because it was idle for too long).
>
> See http://tinyurl.com/y8wapdu
> if adding "Heartbeat Interval" helps you.

I have heartbeat intervals set as follows:

bacula-dir.conf:
  client {
    Heartbeat interval = 15 Seconds
  }
  storage {
    Heartbeat interval = 1 minutes
  }

bacula-sd.conf:
  storage {
    Heartbeat interval = 1 minute
  }

bacula-fd.conf:
  FileDaemon {
    Heartbeat Interval = 5 seconds
  }
Re: [Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer
On Wed, Apr 07, 2010 at 02:15:14PM +0100, Prashant Ramhit wrote:
> 06-Apr 12:54 client-fd JobId 299: Fatal error: backup.c:892 Network send
> error to SD. ERR=Connection reset by peer
>
> Is it possible to tell me how to enable more debug on client and
> storage so that I can find more clues to this issue.

You can use "-d number" to increase the debug level, but in your case the cause should be pretty clear: something (usually a router or firewall) between the SD and the FD (or even a local firewall on one of the hosts themselves) is killing the TCP connection, usually because it was idle for too long.

See http://tinyurl.com/y8wapdu to check whether adding "Heartbeat Interval" helps you.
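The mechanism described above is that a firewall or router drops a TCP connection it considers idle. Apart from Bacula's "Heartbeat Interval" directive, the operating system can send TCP keepalive probes on a socket. The Python sketch below is not Bacula code; it only illustrates, assuming a Linux-like platform, how a program might enable SO_KEEPALIVE and shorten the keepalive timers for one socket. The function name enable_keepalive and the timer values are hypothetical and chosen for illustration.

```python
import socket


def enable_keepalive(sock, idle_seconds=60, interval_seconds=10, probes=5):
    """Enable TCP keepalive on sock and set per-socket timers where available.

    Returns a dict describing which options were actually applied.
    """
    # SO_KEEPALIVE itself is widely portable; without the timers below the
    # system-wide defaults decide when the first probe is sent.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = {"SO_KEEPALIVE": True}

    # The per-socket timer options are platform specific (these names are
    # the Linux ones), so each one is looked up and skipped if missing.
    timers = (
        ("TCP_KEEPIDLE", idle_seconds),       # idle time before first probe
        ("TCP_KEEPINTVL", interval_seconds),  # time between probes
        ("TCP_KEEPCNT", probes),              # failed probes before giving up
    )
    for name, value in timers:
        option = getattr(socket, name, None)
        if option is None:
            continue
        try:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
            applied[name] = value
        except OSError:
            # The platform defines the constant but refused the value.
            pass
    return applied


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
applied = enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
print(applied, keepalive_on)
```

Whether keepalive probes are enough depends on the device in the path: the idle time chosen here has to be shorter than that device's idle timeout, which this sketch does not know.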
[Bacula-users] Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer
Hi All,

My backup is failing on a client. The client has only one FileSet, and its size is 400 GB. The error messages are as follows:

  06-Apr 12:16 server-sd JobId 299: Spooling data again ...
  06-Apr 12:38 server-sd JobId 299: User specified spool size reached.
  06-Apr 12:38 server-sd JobId 299: Writing spooled data to Volume. Despooling 12,422,998,992 bytes ...
  06-Apr 12:43 server-sd JobId 299: Despooling elapsed time = 00:04:50, Transfer rate = 42.83 M bytes/second
  06-Apr 12:43 server-sd JobId 299: Spooling data again ...
  06-Apr 12:54 client-fd JobId 299: Fatal error: backup.c:892 Network send error to SD. ERR=Connection reset by peer

  Volume Session Time:    1270457469
  Last Volume Bytes:      216,986,112,000 (216.9 GB)
  Non-fatal FD errors:    0
  SD Errors:              0
  FD termination status:  Error
  SD termination status:  Error
  Termination:            *** Backup Error ***

Could you tell me how to enable more debug output on the client and storage daemons so that I can find more clues to this issue?

Many thanks,
Prashant Ramhit
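The job log above carries minute-resolution timestamps, so one rough way to look for clues is to measure the time between consecutive messages. The Python sketch below does this for the six log lines quoted in the message. The helper name message_gaps is hypothetical, and the year 2010 is an assumption taken from the mail dates, since the log itself omits the year.

```python
from datetime import datetime

# The six job messages exactly as quoted in the report above.
LOG_LINES = [
    "06-Apr 12:16 server-sd JobId 299: Spooling data again ...",
    "06-Apr 12:38 server-sd JobId 299: User specified spool size reached.",
    "06-Apr 12:38 server-sd JobId 299: Writing spooled data to Volume. "
    "Despooling 12,422,998,992 bytes ...",
    "06-Apr 12:43 server-sd JobId 299: Despooling elapsed time = 00:04:50, "
    "Transfer rate = 42.83 M bytes/second",
    "06-Apr 12:43 server-sd JobId 299: Spooling data again ...",
    "06-Apr 12:54 client-fd JobId 299: Fatal error: backup.c:892 Network "
    "send error to SD. ERR=Connection reset by peer",
]


def message_gaps(lines, year=2010):
    """Return the whole minutes elapsed between consecutive log messages.

    Each line is expected to start with 'DD-Mon HH:MM'. The year is not in
    the log, so it is supplied by the caller (assumed 2010 here).
    """
    stamps = []
    for line in lines:
        day_month, clock = line.split()[:2]
        stamps.append(
            datetime.strptime(f"{year}-{day_month} {clock}", "%Y-%d-%b %H:%M")
        )
    return [
        int((later - earlier).total_seconds() // 60)
        for earlier, later in zip(stamps, stamps[1:])
    ]


gaps = message_gaps(LOG_LINES)
print(gaps)
```

For the quoted lines this gives gaps of 22, 0, 5, 0 and 11 minutes; the last figure is the time between the final SD message and the fatal error reported by the FD. The gaps alone do not show which connection was idle during that time.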