Re: [BackupPC-users] Debugging a SIGPIPE error killing my backups
Are you sure that rsync is installed on the host machine?

-Rob

Les Mikesell wrote:
> John Rouillard wrote:
>> 2007-10-25 10:54:00 Aborting backup up after signal PIPE
>> 2007-10-25 10:54:01 Got fatal error during xfer (aborted by signal=PIPE)
>
> This means the problem is on the other end of the link - or at least
> that the ssh driving it exited.
>
>> lastlog got digests fdb1c560d9ba822ab4ffa635d4b5f67f vs fdb1c560d9ba822ab4ffa635d4b5f67f
>> create 400 0/0 65700 lastlog
>> Can't write 33932 bytes to socket
>> Sending csums, cnt = 16, phase = 1
>> Read EOF: Connection reset by peer
>
> The process on the remote side is gone at this point.
>
>> If I am reading this right, the last file handled before the signal
>> is /var/log/lastlog, which is 2GB (65K approx). When the signal
>> occurs, I guess /var/log/ldap is the file in progress. The ldap file
>> is 22GB in size:
>>
>> [EMAIL PROTECTED] log]$ ls -l ldap
>> -rw------- 1 root root 22978928497 Oct 25 18:46 ldap
>>
>> Could the size be the issue?
>
> Yes, it sounds very likely that whatever is sending the file on the
> remote side can't handle files larger than 2 gigs.
>
>> I have:
>>
>>   $Conf{ClientTimeout} = 72000;
>>
>> which is 20 hours, and the sigpipe is occurring before then.
>
> You'd see sigalarm instead of sigpipe if you had a timeout.
>
>> Also, is there a way to tail the xfer logs in real time while the
>> daemon is controlling the backup, so I don't have to wait for the
>> backup to finish?
>
> You aren't going to see a problem in the log file - the other end is
> crashing.

-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
_______________________________________________
BackupPC-users mailing list
BackupPC-users@lists.sourceforge.net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/
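The "can't handle files larger than 2 gigs" theory above is easy to exercise locally before blaming the remote rsync. A minimal sketch, using a sparse file so it costs no real disk space (all paths here are illustrative, not from the thread):

```shell
# Make a sparse file just over 2GB and confirm the local tools can see
# and report its full size; run the same thing on the remote host to
# rule out large-file problems in its userland.
workdir=$(mktemp -d)
truncate -s 3G "$workdir/bigfile"   # sparse: allocates no real blocks
find "$workdir" -size +2G           # should list bigfile
ls -l "$workdir/bigfile"            # full size: 3221225472 bytes
rm -rf "$workdir"
```

The same `find -size +2G` run against the backup roots (e.g. /var/log) also shows which real files are big enough to trip a 2GB limit.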
Re: [BackupPC-users] Debugging a SIGPIPE error killing my backups
On Fri, Oct 26, 2007 at 11:13:14AM -0500, Les Mikesell wrote:
> John Rouillard wrote:
>>>> $Conf{ClientTimeout} = 72000;
>>>> which is 20 hours and the sigpipe is occurring before then.
>>>
>>> You'd see sigalarm instead of sigpipe if you had a timeout.
>>
>> Something like this I assume:
>>
>>   [...]
>>   create d 755 0/1 12288 src/fastforward-0.51
>>   finish: removing in-process file .
>>   Child is aborting
>>   Done: 17 files, 283 bytes
>>   Got fatal error during xfer (aborted by signal=ALRM)
>>   Backup aborted by user signal
>
> Yes, that one is a timeout on the backuppc side.
>
>> Also, I straced the rsync process on the remote system while it was
>> hung (I assume on whatever occurred after the src/fastforward-0.51
>> directory) and got:
>>
>>   [EMAIL PROTECTED] ~]$ ps -ef | grep 6909
>>   root    6909  6908  0 Oct25 ?       00:00:00 /usr/bin/rsync --server --sender --numeric-ids --perms --owner --group -D --links --hard-links --times --block-size=2048 --recursive --one-file-system --checksum-seed=32761 --ignore-times . /usr/local/
>>   rouilj 10603 10349  0 05:36 pts/0   00:00:00 grep 6909
>>   [EMAIL PROTECTED] ~]$ strace -p 6909
>>   attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted
>>   [EMAIL PROTECTED] ~]$ sudo strace -p 6909
>>   Process 6909 attached - interrupt to quit
>>   select(1, [0], [], NULL, {42, 756000}) = 0 (Timeout)
>>   select(1, [0], [], NULL, {60, 0})      = 0 (Timeout)
>>   select(1, [0], [], NULL, {60, 0})      = 0 (Timeout)
>>   select(1, [0], [], NULL, {60, 0} <unfinished ...>
>>   Process 6909 detached
>>
>> And similar results on the server side process. Maybe a deadlock
>> somewhere? The ssh pipe appeared open. I set it up to forward traffic
>> and was able to pass traffic from the server to the client.
>>
>> Are these 2 different scenarios (the sigalarm and sigpipe)?

Yes, I just started a new thread on the hang/sigalarm problem.

> I don't think I've ever seen a real deadlock on a unix/linux rsync,
> although I always got them on windows when trying to run rsync under
> sshd (and I'd appreciate knowing the right versions to use if that
> works now).

Well, it's not really rsync <-> rsync, right? It's File::RsyncP <-> rsync.

> The sigpipe scenario sounded like the remote rsync crashed or quit
> (perhaps not being able to handle files > 2 gigs). This looks like
> something different.
>
> Can you start the remote strace before the hang so you have a chance
> of seeing the file and activity in progress when the hang occurs?

I can try.

As far as the sigpipe issue goes, it looks like there is a missing
email in this thread. I was able to run an rsync of the 22GB file that
was the active transfer at the time of the SIGPIPE without problem.
I'll repost that missing email.

--
				-- rouilj
John Rouillard
System Administrator
Renesys Corporation
603-643-9300 x 111
Re: [BackupPC-users] Debugging a SIGPIPE error killing my backups
John Rouillard wrote:
>>> $Conf{ClientTimeout} = 72000;
>>> which is 20 hours and the sigpipe is occurring before then.
>>
>> You'd see sigalarm instead of sigpipe if you had a timeout.
>
> Something like this I assume:
>
>   [...]
>   create d 755 0/1 12288 src/fastforward-0.51
>   finish: removing in-process file .
>   Child is aborting
>   Done: 17 files, 283 bytes
>   Got fatal error during xfer (aborted by signal=ALRM)
>   Backup aborted by user signal

Yes, that one is a timeout on the backuppc side.

> Also, I straced the rsync process on the remote system while it was
> hung (I assume on whatever occurred after the src/fastforward-0.51
> directory) and got:
>
>   [EMAIL PROTECTED] ~]$ ps -ef | grep 6909
>   root    6909  6908  0 Oct25 ?       00:00:00 /usr/bin/rsync --server --sender --numeric-ids --perms --owner --group -D --links --hard-links --times --block-size=2048 --recursive --one-file-system --checksum-seed=32761 --ignore-times . /usr/local/
>   rouilj 10603 10349  0 05:36 pts/0   00:00:00 grep 6909
>   [EMAIL PROTECTED] ~]$ strace -p 6909
>   attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted
>   [EMAIL PROTECTED] ~]$ sudo strace -p 6909
>   Process 6909 attached - interrupt to quit
>   select(1, [0], [], NULL, {42, 756000}) = 0 (Timeout)
>   select(1, [0], [], NULL, {60, 0})      = 0 (Timeout)
>   select(1, [0], [], NULL, {60, 0})      = 0 (Timeout)
>   select(1, [0], [], NULL, {60, 0} <unfinished ...>
>   Process 6909 detached
>
> And similar results on the server side process. Maybe a deadlock
> somewhere? The ssh pipe appeared open. I set it up to forward traffic
> and was able to pass traffic from the server to the client.
>
> Are these 2 different scenarios (the sigalarm and sigpipe)?

I don't think I've ever seen a real deadlock on a unix/linux rsync,
although I always got them on windows when trying to run rsync under
sshd (and I'd appreciate knowing the right versions to use if that
works now).

The sigpipe scenario sounded like the remote rsync crashed or quit
(perhaps not being able to handle files > 2 gigs). This looks like
something different.

Can you start the remote strace before the hang so you have a chance of
seeing the file and activity in progress when the hang occurs?

--
  Les Mikesell
  [EMAIL PROTECTED]
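The suggestion to start the strace before the hang might look something like this on the client; the pgrep pattern, syscall filter, and log path are assumptions for illustration, not from the thread:

```shell
# Attach strace to the remote rsync server process as soon as the
# backup starts, logging timestamped syscalls to a file so the last
# files touched before a hang are preserved for inspection.
pid=$(pgrep -f 'rsync --server --sender' | head -n 1)
sudo strace -f -tt -e trace=file,desc \
    -o /tmp/rsync-trace.log -p "$pid" &

# Later, once the transfer stalls, the tail of the log shows what was
# in progress:
tail -n 20 /tmp/rsync-trace.log
```

With -tt in place, the gap between the last file activity and the select() loop also tells you roughly when the stall began.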
Re: [BackupPC-users] Debugging a SIGPIPE error killing my backups
This is a resend. The original went missing apparently.

On Thu, Oct 25, 2007 at 02:10:12PM -0500, Les Mikesell wrote:
> John Rouillard wrote:
>> 2007-10-25 10:54:00 Aborting backup up after signal PIPE
>> 2007-10-25 10:54:01 Got fatal error during xfer (aborted by signal=PIPE)
>
> This means the problem is on the other end of the link - or at least
> that the ssh driving it exited.

Hmm, ok, what happens if I add -v to the remote rsync args? Will the
extra output in the rsync stream screw things up? Maybe I can use:

  rsync -v ... 2> /tmp/rsync.log

to get debugging at the rsync level without sending the debugging
output to BackupPC.

I'll also try adding -o ServerAliveInterval=30 and -vvv to see if that
improves the reliability of the ssh session and generates output,
since -v sends debugging output to stderr and I can grab that with:

  ssh -v concord 2> /tmp/log

Does BackupPC need to use stderr to the remote system for anything?

>> lastlog got digests fdb1c560d9ba822ab4ffa635d4b5f67f vs fdb1c560d9ba822ab4ffa635d4b5f67f
>> create 400 0/0 65700 lastlog
>> Can't write 33932 bytes to socket
>> Sending csums, cnt = 16, phase = 1
>> Read EOF: Connection reset by peer
>
> The process on the remote side is gone at this point.

I'll buy that, but I expect some death message. A dying gasp, if you will.

>> If I am reading this right, the last file handled before the signal
>> is /var/log/lastlog, which is 2GB (65K approx). When the signal
>> occurs, I guess /var/log/ldap is the file in progress. The ldap file
>> is 22GB in size:
>>
>> [EMAIL PROTECTED] log]$ ls -l ldap
>> -rw------- 1 root root 22978928497 Oct 25 18:46 ldap
>>
>> Could the size be the issue?
>
> Yes, it sounds very likely that whatever is sending the file on the
> remote side can't handle files larger than 2 gigs.

I just did:

  sudo rsync -e ssh ops02.mht1:/var/log/ldap .

and it completed without a problem. All 22 GB of the file transferred
fine 8-(.

However, now I have the same sigpipe issue on another host, one that
had been backing up fine (3 full and 3 incremental) until now:

  incr backup started back to 2007-10-25 17:28:40 (backup #6) for directory /var/spool/nagios
  Running: /usr/bin/ssh -q -x -l backup ops01.mht1.renesys.com sudo /usr/bin/rsync --server --sender --numeric-ids --perms --owner --group -D --links --hard-links --times --block-size=2048 --recursive --one-file-system --checksum-seed=32761 . /var/spool/nagios/
  Xfer PIDs are now 24197
  Rsync command pid is 24197
  Got remote protocol 28
  Negotiated protocol version 28
  Checksum caching enabled (checksumSeed = 32761)
  Got checksumSeed 0x7ff9
  Got file list: 11 entries
  Child PID is 24213
  Xfer PIDs are now 24197,24213
  Sending csums, cnt = 11, phase = 0
  create d 2775 306/200 4096 .
  create p 660 306/521 0 nagios.cmd
  create d 2775 306/200 4096 tmp
  tmp/host-perfdata got digests 46a0099d178d1b97aa39e454ae083d3f vs 46a0099d178d1b97aa39e454ae083d3f
  Skipping tmp/service-perfdata..bz2 (same attr)
  Skipping tmp/service-perfdata.0001.gz (same attr)
  Skipping tmp/service-perfdata.4.gz (same attr)
  Skipping tmp/service-perfdata.5.gz (same attr)
  Sending csums, cnt = 0, phase = 1
  create 664 306/200 916165956 tmp/host-perfdata
  tmp/nagios_daemon_pids got digests 7bfc0cffe0f114dd6eea7514c44422cd vs 7bfc0cffe0f114dd6eea7514c44422cd
  create 664 306/200 6 tmp/nagios_daemon_pids
  tmp/old_list got digests 0e258a7527fe053eea032e6d58f1de7c vs 0e258a7527fe053eea032e6d58f1de7c
  create 664 306/200 48 tmp/old_list
  Read EOF: Tried again: got 0 bytes
  Can't write 4 bytes to socket
  finish: removing in-process file tmp/service-perfdata
  delete 644 306/200 343155581 tmp/service-perfdata.0001.gz
  delete 664 306/200 343250131 tmp/service-perfdata.5.gz
  delete 644 306/200 186949772 tmp/service-perfdata..bz2
  delete 664 306/200 341890997 tmp/service-perfdata.4.gz
  delete 644 306/200 1427879157 tmp/service-perfdata
  Child is aborting
  Done: 4 files, 916199608 bytes
  Got fatal error during xfer (aborted by signal=PIPE)
  Backup aborted by user signal

Is there anything I can do to get better diagnostics? If rsync
--server --sender exits with an error, how well does the File::RsyncP
module do at grabbing stderr (or stdout, which it would see as a
breaking of the protocol) and sending it back to the xfer log? Is
there a flag/option I can set in File::RsyncP? (Time to perldoc
File::RsyncP, I guess.)

>> Also, is there a way to tail the xfer logs in real time while the
>> daemon is controlling the backup? So I don't have to wait for the
>> backup to finish?
>
> You aren't going to see a problem in the log file - the other end is
> crashing.

Well, I have two backups still running (3+ hours later) and I am
trying to find out what file they are stuck on. Nothing that I can see
should be hanging the rsync this long compared to when I run an rsync
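One way to act on the "rsync -v ... 2> /tmp/rsync.log" idea above is a tiny wrapper script on the client, so any stderr chatter lands in a local log instead of the protocol stream. This is only a sketch: the wrapper name and log path are invented, and whether the server-side rsync actually sends its -v output to stderr rather than multiplexing it into the protocol is exactly the open question in this thread.

```shell
# Create a wrapper that runs the real rsync but appends its stderr to
# a log file on the client. The BackupPC client command (e.g.
# $Conf{RsyncClientPath} in the 3.x config) would then point at this
# wrapper instead of /usr/bin/rsync.
cat > rsync-logged <<'EOF'
#!/bin/sh
# stderr goes to a local log; stdout (the rsync protocol) is untouched
exec /usr/bin/rsync "$@" 2>>/tmp/rsync-debug.log
EOF
chmod +x rsync-logged
```

If a dying gasp exists at all, it should then survive in /tmp/rsync-debug.log on the client even after the process is gone.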
Re: [BackupPC-users] Debugging a SIGPIPE error killing my backups
John Rouillard wrote:
>>> And similar results on the server side process. Maybe a deadlock
>>> somewhere? The ssh pipe appeared open. I set it up to forward
>>> traffic and was able to pass traffic from the server to the client.
>>>
>>> Are these 2 different scenarios (the sigalarm and sigpipe)?
>
> Yes, I just started a new thread on the hang/sigalarm problem.
>
>> I don't think I've ever seen a real deadlock on a unix/linux rsync,
>> although I always got them on windows when trying to run rsync under
>> sshd (and I'd appreciate knowing the right versions to use if that
>> works now).
>
> Well, it's not really rsync <-> rsync, right? It's File::RsyncP <-> rsync.

In the windows case where I've had problems, rsync <-> rsync behaves
the same (hanging when sshd is involved, working with rsync in daemon
mode).

>> The sigpipe scenario sounded like the remote rsync crashed or quit
>> (perhaps not being able to handle files > 2 gigs). This looks like
>> something different.
>>
>> Can you start the remote strace before the hang so you have a chance
>> of seeing the file and activity in progress when the hang occurs?
>
> I can try.
>
> As far as the sigpipe issue goes, it looks like there is a missing
> email in this thread. I was able to run an rsync of the 22GB file
> that was the active transfer at the time of the SIGPIPE without
> problem. I'll repost that missing email.

If rsync works but File::RsyncP doesn't, I'd suspect a problem in your
perl or OS install where you might have some component not built for
large file support - perhaps the compression libraries. But SIGPIPE
still sounds like the problem is at the other end.

--
  Les Mikesell
  [EMAIL PROTECTED]
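The large-file-support suspicion above can be checked directly, since perl records how it was built. A quick sketch; the commented outputs are what a typical LFS-enabled build prints:

```shell
# Ask perl how it was configured; a build without large file support
# here would explain File::RsyncP choking where plain rsync works.
perl -V:uselargefiles   # uselargefiles='define'; on an LFS-enabled build
perl -V:lseeksize       # lseeksize=8; means 64-bit file offsets
```

Running the same two commands on both the BackupPC server and the client narrows down which side lacks large-file support, if either does.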
Re: [BackupPC-users] Debugging a SIGPIPE error killing my backups
John Rouillard wrote:
> 2007-10-25 10:54:00 Aborting backup up after signal PIPE
> 2007-10-25 10:54:01 Got fatal error during xfer (aborted by signal=PIPE)

This means the problem is on the other end of the link - or at least
that the ssh driving it exited.

> lastlog got digests fdb1c560d9ba822ab4ffa635d4b5f67f vs fdb1c560d9ba822ab4ffa635d4b5f67f
> create 400 0/0 65700 lastlog
> Can't write 33932 bytes to socket
> Sending csums, cnt = 16, phase = 1
> Read EOF: Connection reset by peer

The process on the remote side is gone at this point.

> If I am reading this right, the last file handled before the signal
> is /var/log/lastlog, which is 2GB (65K approx). When the signal
> occurs, I guess /var/log/ldap is the file in progress. The ldap file
> is 22GB in size:
>
> [EMAIL PROTECTED] log]$ ls -l ldap
> -rw------- 1 root root 22978928497 Oct 25 18:46 ldap
>
> Could the size be the issue?

Yes, it sounds very likely that whatever is sending the file on the
remote side can't handle files larger than 2 gigs.

> I have:
>
>   $Conf{ClientTimeout} = 72000;
>
> which is 20 hours, and the sigpipe is occurring before then.

You'd see sigalarm instead of sigpipe if you had a timeout.

> Also, is there a way to tail the xfer logs in real time while the
> daemon is controlling the backup? So I don't have to wait for the
> backup to finish?

You aren't going to see a problem in the log file - the other end is
crashing.

--
  Les Mikesell
  [EMAIL PROTECTED]
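On the "tail the xfer logs in real time" question: BackupPC writes the per-host XferLOG compressed, so a plain tail -f shows garbage, but the bundled BackupPC_zcat tool can decompress it on demand. A sketch; the install paths and the host name are guesses that vary by packaging:

```shell
# Decompress the in-progress transfer log for one host and look at the
# end of it; rerun (or wrap in watch) to follow the backup as it runs.
sudo -u backuppc /usr/share/backuppc/bin/BackupPC_zcat \
    /var/lib/backuppc/pc/ops02/XferLOG.z | tail -n 20
```

This at least shows the last file the transfer reached, even if, as noted above, the actual failure message is on the remote end.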