Re: [BackupPC-users] Debugging a SIGPIPE error killing my backups

2007-11-01 Thread Rob Owens
Are you sure that rsync is installed on the host machine?
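
A quick way to check from the BackupPC server is to reuse the ssh invocation
that shows up in the xfer log later in this thread (the login and host name
are the ones from the thread, so adjust for your site):

  ssh -q -x -l backup ops01.mht1.renesys.com 'sudo rsync --version | head -1'

If that prints a version banner, rsync is present and runnable under sudo.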

-Rob

Les Mikesell wrote:
 John Rouillard wrote:
 
   2007-10-25 10:54:00 Aborting backup up after signal PIPE
   2007-10-25 10:54:01 Got fatal error during xfer (aborted by signal=PIPE)
 
 This means the problem is on the other end of the link - or at least 
 that the ssh driving it exited.
 
   lastlog got digests fdb1c560d9ba822ab4ffa635d4b5f67f vs
   fdb1c560d9ba822ab4ffa635d4b5f67f
 create   400   0/0   65700 lastlog
   Can't write 33932 bytes to socket
   Sending csums, cnt = 16, phase = 1
   Read EOF: Connection reset by peer
 
 The process on the remote side is gone at this point.
 
 If I am reading this right, the last file handled before the signal is
 /var/log/lastlog which is < 2GB (65K approx). When the signal occurs,
 I guess /var/log/ldap is the file in progress.

 The ldap file is 22GB in size:

   [EMAIL PROTECTED] log]$ ls -l ldap 
   -rw---  1 root root 22978928497 Oct 25 18:46 ldap

 Could the size be the issue?
 
 Yes, it sounds very likely that whatever is sending the file on the 
 remote side can't handle files larger than 2 gigs.
 
 I have:

$Conf{ClientTimeout} = 72000;

 which is 20 hours and the sigpipe is occurring before then.
 
 You'd see sigalarm instead of sigpipe if you had a timeout.
 
 Also is there a way to tail the xfer logs in realtime while the daemon
 is controlling the backup? So I don't have to wait for the backup to
 finish?
 
 You aren't going to see a problem in the log file - the other end is 
 crashing.
 



Re: [BackupPC-users] Debugging a SIGPIPE error killing my backups

2007-10-26 Thread John Rouillard
On Fri, Oct 26, 2007 at 11:13:14AM -0500, Les Mikesell wrote:
 John Rouillard wrote:
 
   $Conf{ClientTimeout} = 72000;
 
 which is 20 hours and the sigpipe is occurring before then.
 You'd see sigalarm instead of sigpipe if you had a timeout.
 
 Something like this I assume:
 
 [...]
 create d 755   0/1   12288 src/fastforward-0.51
   finish: removing in-process file .
   Child is aborting
   Done: 17 files, 283 bytes
   Got fatal error during xfer (aborted by signal=ALRM)
   Backup aborted by user signal
 
 Yes, that one is a timeout on the backuppc side.
 
 Also I straced the rsync process on the remote system while it was hung
 (I assume on whatever occurred after the src/fastforward-0.51 directory)
 and got:
 
   [EMAIL PROTECTED] ~]$ ps -ef | grep 6909
    root      6909  6908  0 Oct25 ?        00:00:00 /usr/bin/rsync
   --server --sender --numeric-ids --perms --owner --group -D --links
   --hard-links --times --block-size=2048 --recursive --one-file-system
   --checksum-seed=32761 --ignore-times . /usr/local/
    rouilj   10603 10349  0 05:36 pts/0    00:00:00 grep 6909
   [EMAIL PROTECTED] ~]$ strace -p 6909
   attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted
   [EMAIL PROTECTED] ~]$ sudo strace -p 6909
   Process 6909 attached - interrupt to quit
   select(1, [0], [], NULL, {42, 756000})  = 0 (Timeout)
   select(1, [0], [], NULL, {60, 0})   = 0 (Timeout)
   select(1, [0], [], NULL, {60, 0})   = 0 (Timeout)
    select(1, [0], [], NULL, {60, 0} <unfinished ...>
   Process 6909 detached
 
 And similar results on the server side process. Maybe a deadlock
 somewhere? The ssh pipe appeared open. I set it up to forward traffic
 and was able to pass traffic from the server to the client.
 
 Are these 2 different scenarios (the sigalarm and sigpipe)?

Yes, I just started a new thread on the hang/sigalarm problem.
 I don't 
 think I've ever seen a real deadlock on a unix/linux rsync although I 
 always got them on windows when trying to run rsync under sshd (and I'd 
 appreciate knowing the right versions to use if that works now).

Well, it's not really rsync <-> rsync, right? It's File::RsyncP <-> rsync.

 The 
 sigpipe scenario sounded like the remote rsync crashed or quit (perhaps 
 not being able to handle files > 2 gigs).  This looks like something 
 different.  Can you start the remote strace before the hang so you have 
 a chance of seeing the file and activity in progress when the hang occurs?

I can try. As for the sigpipe issue, it looks like there is a missing
email in this thread. I was able to run an rsync of the 22GB file that
was the active transfer at the time of the SIGPIPE without a
problem. I'll repost that missing email.

-- 
-- rouilj

John Rouillard
System Administrator
Renesys Corporation
603-643-9300 x 111



Re: [BackupPC-users] Debugging a SIGPIPE error killing my backups

2007-10-26 Thread Les Mikesell
John Rouillard wrote:

   $Conf{ClientTimeout} = 72000;

 which is 20 hours and the sigpipe is occurring before then.
 You'd see sigalarm instead of sigpipe if you had a timeout.
 
 Something like this I assume:
 
[...]
 create d 755   0/1   12288 src/fastforward-0.51
   finish: removing in-process file .
   Child is aborting
   Done: 17 files, 283 bytes
   Got fatal error during xfer (aborted by signal=ALRM)
   Backup aborted by user signal

Yes, that one is a timeout on the backuppc side.

 Also I straced the rsync process on the remote system while it was hung
 (I assume on whatever occurred after the src/fastforward-0.51 directory)
 and got:
 
   [EMAIL PROTECTED] ~]$ ps -ef | grep 6909
    root      6909  6908  0 Oct25 ?        00:00:00 /usr/bin/rsync
   --server --sender --numeric-ids --perms --owner --group -D --links
   --hard-links --times --block-size=2048 --recursive --one-file-system
   --checksum-seed=32761 --ignore-times . /usr/local/
    rouilj   10603 10349  0 05:36 pts/0    00:00:00 grep 6909
   [EMAIL PROTECTED] ~]$ strace -p 6909
   attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted
   [EMAIL PROTECTED] ~]$ sudo strace -p 6909
   Process 6909 attached - interrupt to quit
   select(1, [0], [], NULL, {42, 756000})  = 0 (Timeout)
   select(1, [0], [], NULL, {60, 0})   = 0 (Timeout)
   select(1, [0], [], NULL, {60, 0})   = 0 (Timeout)
    select(1, [0], [], NULL, {60, 0} <unfinished ...>
   Process 6909 detached
 
 And similar results on the server side process. Maybe a deadlock
 somewhere? The ssh pipe appeared open. I set it up to forward traffic
 and was able to pass traffic from the server to the client.

Are these 2 different scenarios (the sigalarm and sigpipe)?  I don't 
think I've ever seen a real deadlock on a unix/linux rsync although I 
always got them on windows when trying to run rsync under sshd (and I'd 
appreciate knowing the right versions to use if that works now).  The 
sigpipe scenario sounded like the remote rsync crashed or quit (perhaps 
not being able to handle files > 2 gigs).  This looks like something 
different.  Can you start the remote strace before the hang so you have 
a chance of seeing the file and activity in progress when the hang occurs?
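
Something along these lines ought to do it on the client (an untested
sketch; the pgrep pattern is just a guess at what will uniquely match the
sender):

  pgrep -f 'rsync --server --sender'     # find the remote rsync's pid
  sudo strace -f -tt -s 256 -o /tmp/rsync-strace.log -p PID

-f follows any forked children, -tt timestamps each call, and -s 256 keeps
enough of each string to show which file it was touching when it stopped
making progress.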

--
   Les Mikesell
[EMAIL PROTECTED]





Re: [BackupPC-users] Debugging a SIGPIPE error killing my backups

2007-10-26 Thread John Rouillard
This is a resend. The original went missing apparently.

On Thu, Oct 25, 2007 at 02:10:12PM -0500, Les Mikesell wrote:
 John Rouillard wrote:
   2007-10-25 10:54:00 Aborting backup up after signal PIPE
   2007-10-25 10:54:01 Got fatal error during xfer (aborted by signal=PIPE)
 
 This means the problem is on the other end of the link - or at least 
 that the ssh driving it exited.

Hmm, ok, what happens if I add -v to the remote rsync args? Will the
extra output in the rsync stream screw things up? Maybe I can use:

  rsync  -v ... 2> /tmp/rsync.log

to get debugging at the rsync level without sending the debugging
output to BackupPC.

I'll also try adding -o ServerAliveInterval=30 and -vvv to see if that
improves the reliability of the ssh session and generates output,
since -v sends debugging output to stderr and I can grab that with:

 ssh -v concord 2> /tmp/log

Does BackupPC need the stderr coming back from the remote system for anything?
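
If it doesn't, one untested way to capture the -v/-vvv chatter would be a
tiny wrapper on the client (the path and name here are made up):

  #!/bin/sh
  # /usr/local/bin/rsync-logged: run the real rsync, but append its stderr
  # to a local file so it never mixes into the protocol stream on stdout
  exec /usr/bin/rsync "$@" 2>>/tmp/rsync-backuppc.err

and then point the rsync path BackupPC uses at it, e.g.

  $Conf{RsyncClientPath} = '/usr/local/bin/rsync-logged';

assuming $Conf{RsyncClientCmd} still refers to it via $rsyncPath.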
 
   lastlog got digests fdb1c560d9ba822ab4ffa635d4b5f67f vs
   fdb1c560d9ba822ab4ffa635d4b5f67f
 create   400   0/0   65700 lastlog
   Can't write 33932 bytes to socket
   Sending csums, cnt = 16, phase = 1
   Read EOF: Connection reset by peer
 
 The process on the remote side is gone at this point.

I'll buy that, but I'd expect some death message. A dying gasp, if you
will.
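
If rsync segfaulted or got killed on the client, the kernel or syslog there
may have caught that gasp even if BackupPC didn't; something like this
(log locations are the usual defaults, adjust as needed):

  dmesg | grep -iE 'rsync|segfault|out of memory'
  sudo grep -i rsync /var/log/messages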
 
 If I am reading this right, the last file handled before the signal is
 /var/log/lastlog which is < 2GB (65K approx). When the signal occurs,
 I guess /var/log/ldap is the file in progress.
 
 The ldap file is 22GB in size:
 
   [EMAIL PROTECTED] log]$ ls -l ldap 
   -rw---  1 root root 22978928497 Oct 25 18:46 ldap
 
 Could the size be the issue?
 
 Yes, it sounds very likely that whatever is sending the file on the 
 remote side can't handle files larger than 2 gigs.

I just did a sudo rsync -e ssh ops02.mht1:/var/log/ldap . and it
completed without a problem. All 22 GB of the file transferred fine
8-(. However, now I have the same sigpipe issue on another host that
has been backing up fine (3 fulls and 3 incrementals) until now:

  incr backup started back to 2007-10-25 17:28:40 (backup #6) for
  directory /var/spool/nagios
  Running: /usr/bin/ssh -q -x -l backup ops01.mht1.renesys.com sudo
  /usr/bin/rsync --server --sender --numeric-ids --perms --owner --group
  -D --links --hard-links --times --block-size=2048 --recursive
  --one-file-system --checksum-seed=32761 . /var/spool/nagios/
  Xfer PIDs are now 24197
  Rsync command pid is 24197
  Got remote protocol 28
  Negotiated protocol version 28
  Checksum caching enabled (checksumSeed = 32761)
  Got checksumSeed 0x7ff9
  Got file list: 11 entries
  Child PID is 24213
  Xfer PIDs are now 24197,24213
  Sending csums, cnt = 11, phase = 0
create d 2775   306/200    4096 .
create p 660   306/521   0 nagios.cmd
create d 2775   306/200    4096 tmp
  tmp/host-perfdata got digests 46a0099d178d1b97aa39e454ae083d3f vs
  46a0099d178d1b97aa39e454ae083d3f
  Skipping tmp/service-perfdata..bz2 (same attr)
  Skipping tmp/service-perfdata.0001.gz (same attr)
  Skipping tmp/service-perfdata.4.gz (same attr)
  Skipping tmp/service-perfdata.5.gz (same attr)
  Sending csums, cnt = 0, phase = 1
create   664   306/200   916165956 tmp/host-perfdata
  tmp/nagios_daemon_pids got digests 7bfc0cffe0f114dd6eea7514c44422cd vs
  7bfc0cffe0f114dd6eea7514c44422cd
create   664   306/200   6 tmp/nagios_daemon_pids
  tmp/old_list got digests 0e258a7527fe053eea032e6d58f1de7c vs
  0e258a7527fe053eea032e6d58f1de7c
create   664   306/200  48 tmp/old_list
  Read EOF: 
  Tried again: got 0 bytes
  Can't write 4 bytes to socket
  finish: removing in-process file tmp/service-perfdata
delete   644   306/200   343155581 tmp/service-perfdata.0001.gz
delete   664   306/200   343250131 tmp/service-perfdata.5.gz
delete   644   306/200   186949772 tmp/service-perfdata..bz2
delete   664   306/200   341890997 tmp/service-perfdata.4.gz
delete   644   306/200  1427879157 tmp/service-perfdata
  Child is aborting
  Done: 4 files, 916199608 bytes
  Got fatal error during xfer (aborted by signal=PIPE)
  Backup aborted by user signal

Is there anything I can do to get better diagnostics? If rsync
--server --sender exits with an error, how well does the File::RsyncP
module do at grabbing stderr (or stdout, which it would see as a break
in the protocol) and sending it back to the xfer log?

Is there a flag/option I can set in File::RsyncP?

(Time to perldoc File::RsyncP I guess.)
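
One knob that should help: as far as I can tell BackupPC passes
$Conf{XferLogLevel} through to File::RsyncP as its log level, so bumping it
(the useful ceiling may vary by version) puts the per-file protocol chatter
into the XferLOG:

  $Conf{XferLogLevel} = 6;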

 Also is there a way to tail the xfer logs in realtime while the daemon
 is controlling the backup? So I don't have to wait for the backup to
 finish?
 
 You aren't going to see a problem in the log file - the other end is 
 crashing.

Well I have two backups still running (3+ hours later) and I am trying
to find out what file they are stuck on. Nothing that I can see should
be hanging the rsync this long compared to when I run an rsync

Re: [BackupPC-users] Debugging a SIGPIPE error killing my backups

2007-10-26 Thread Les Mikesell
John Rouillard wrote:

 And similar results on the server side process. Maybe a deadlock
 somewhere? The ssh pipe appeared open. I set it up to forward traffic
 and was able to pass traffic from the server to the client.
 Are these 2 different scenarios (the sigalarm and sigpipe)?
 
 Yes, I just started a new thread on the hang/sigalarm problem.
 I don't 
 think I've ever seen a real deadlock on a unix/linux rsync although I 
 always got them on windows when trying to run rsync under sshd (and I'd 
 appreciate knowing the right versions to use if that works now).
 
 Well, it's not really rsync <-> rsync, right? It's File::RsyncP <-> rsync.

In the Windows case where I've had problems, rsync <-> rsync behaves the 
same (hanging when sshd is involved, working with rsync in daemon mode).

 The 
 sigpipe scenario sounded like the remote rsync crashed or quit (perhaps 
 not being able to handle files > 2 gigs).  This looks like something 
 different.  Can you start the remote strace before the hang so you have 
 a chance of seeing the file and activity in progress when the hang occurs?
 
 I can try. As for the sigpipe issue, it looks like there is a missing
 email in this thread. I was able to run an rsync of the 22GB file that
 was the active transfer at the time of the SIGPIPE without a
 problem. I'll repost that missing email.

If rsync works but File::RsyncP doesn't, I'd suspect a problem in your 
perl or OS install where you might have some component not built for 
large file support - perhaps the compression libraries.  But SIGPIPE 
still sounds like the problem is at the other end.
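
A couple of quick checks for that (untested, and the exact wording of the
capabilities line depends on the rsync build):

  rsync --version | grep -i '64-bit files'    # on the client
  perl -V:uselargefiles                       # on the BackupPC server
  perl -MFile::RsyncP -le 'print $File::RsyncP::VERSION'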

-- 
   Les Mikesell
[EMAIL PROTECTED]





Re: [BackupPC-users] Debugging a SIGPIPE error killing my backups

2007-10-25 Thread Les Mikesell
John Rouillard wrote:

   2007-10-25 10:54:00 Aborting backup up after signal PIPE
   2007-10-25 10:54:01 Got fatal error during xfer (aborted by signal=PIPE)

This means the problem is on the other end of the link - or at least 
that the ssh driving it exited.

   lastlog got digests fdb1c560d9ba822ab4ffa635d4b5f67f vs
   fdb1c560d9ba822ab4ffa635d4b5f67f
 create   400   0/0   65700 lastlog
   Can't write 33932 bytes to socket
   Sending csums, cnt = 16, phase = 1
   Read EOF: Connection reset by peer

The process on the remote side is gone at this point.

 If I am reading this right, the last file handled before the signal is
 /var/log/lastlog which is < 2GB (65K approx). When the signal occurs,
 I guess /var/log/ldap is the file in progress.
 
 The ldap file is 22GB in size:
 
   [EMAIL PROTECTED] log]$ ls -l ldap 
   -rw---  1 root root 22978928497 Oct 25 18:46 ldap
 
 Could the size be the issue?

Yes, it sounds very likely that whatever is sending the file on the 
remote side can't handle files larger than 2 gigs.

 I have:
 
$Conf{ClientTimeout} = 72000;
 
 which is 20 hours and the sigpipe is occurring before then.

You'd see sigalarm instead of sigpipe if you had a timeout.

 Also is there a way to tail the xfer logs in realtime while the daemon
 is controlling the backup? So I don't have to wait for the backup to
 finish?

You aren't going to see a problem in the log file - the other end is 
crashing.
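
That said, if you just want to watch progress, the in-progress transfer log
for a host is written (compressed, when compression is on) under its pc
directory, so something like this should work - paths assume a Debian-style
layout and vary by install:

  sudo -u backuppc /usr/share/backuppc/bin/BackupPC_zcat \
      /var/lib/backuppc/pc/HOSTNAME/XferLOG.z | tail -50

With compression off it's a plain XferLOG file and an ordinary tail -f does
the job.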

-- 
Les Mikesell
  [EMAIL PROTECTED]

