Trying to split theads here. In a prior discussion on the hang issue, on Fri, Oct 26, 2007 at 11:13:14AM -0500, Les Mikesell wrote: > John Rouillard wrote: > >>> [rouilj] > >>> $Conf{ClientTimeout} = 72000; > >>> > >>>which is 20 hours and the sigpipe is occurring > >>>before then. You'd see sigalarm instead of sigpipe > >>>if you had a timeout. > > > >Something like this I assume: > > > [...] > > create d 755 0/1 12288 src/fastforward-0.51 > > finish: removing in-process file . > > Child is aborting > > Done: 17 files, 283 bytes > > Got fatal error during xfer (aborted by signal=ALRM) > > Backup aborted by user signal > > Yes, that one is a timeout on the backuppc side. > > >Also I straced the rsync process on the remote > >system while it was hung (I assume on whatever > >occurred after the src/fastforward-0.51) directory > >and got: > > > > [EMAIL PROTECTED] ~]$ ps -ef | grep 6909 > > root 6909 6908 0 Oct25 ? 00:00:00 /usr/bin/rsync > > --server --sender --numeric-ids --perms --owner --group -D --links > > --hard-links --times --block-size=2048 --recursive --one-file-system > > --checksum-seed=32761 --ignore-times . /usr/local/ > > rouilj 10603 10349 0 05:36 pts/0 00:00:00 grep 6909 > > [EMAIL PROTECTED] ~]$ strace -p 6909 > > attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted > > [EMAIL PROTECTED] ~]$ sudo strace -p 6909 > > Process 6909 attached - interrupt to quit > > select(1, [0], [], NULL, {42, 756000}) = 0 (Timeout) > > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > > select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) > > select(1, [0], [], NULL, {60, 0} <unfinished ...> > > Process 6909 detached > > > >And similar results on the server side > >process. Maybe a deadlock somewhere? The ssh pipe > >appeared open. I set it up to forward traffic and > >was able to pass traffic from the server to the > >client. > > Are these 2 different scenarios (the sigalarm and > sigpipe)? I don't think I've ever seen a real > deadlock on a unix/linux rsync although I always got > them on windows when trying to run rsync under sshd > (and I'd appreciate knowing the right versions to use > if that works now). The sigpipe scenario sounded like > the remote rsync crashed or quit (perhaps not being > able to handle files >2gigs). This looks like > something different. Can you start the remote strace > before the hang so you have a chance of seeing the > file and activity in progress when the hang occurs?
Ask and you will receive. Here is part of an strace of the rsync on the client machine. Continuation of long lines are indented by 2 spaces, line number in parens. Starts: execve("/usr/bin/rsync", ["/usr/bin/rsync", "--server", (line 1) "--sender", "--numeric-ids", "--perms", "--owner", "--group", "-D", "--links", "--hard-links", "--times", "--block-size=2048", "--recursive", "--one-file-system", "--checksum-seed=32761", "--ignore-times", ...], [/* 16 vars */]) = 0 uname({sys="Linux", node="vpn01.fp.psm1.renesys.com", ...}) = 0 some initialization and then: lstat64("/usr/local/.", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 chdir("/usr/local") = 0 (line 63) stat64(".", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 lstat64(".", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 mmap2(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7ef7000 mmap2(NULL, 135168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7ed6000 open(".", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 3 fstat64(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 fcntl64(3, F_SETFD, FD_CLOEXEC) = 0 getdents64(3, /* 12 entries */, 4096) = 320 lstat64("share", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 open("share", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 4 fstat64(4, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 fcntl64(4, F_SETFD, FD_CLOEXEC) = 0 getdents64(4, /* 4 entries */, 4096) = 96 lstat64("share/info", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 open("share/info", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 5 fstat64(5, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 fcntl64(5, F_SETFD, FD_CLOEXEC) = 0 getdents64(5, /* 2 entries */, 4096) = 48 getdents64(5, /* 0 entries */, 4096) = 0 close(5) = 0 lstat64("share/man", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 open("share/man", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 5 fstat64(5, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 fcntl64(5, F_SETFD, FD_CLOEXEC) = 0 getdents64(5, /* 12 entries */, 4096) = 288 it starts walking through the directory tree. [fl]stats all over the place. Then file writes: open("/usr/local/bin/addcr", O_RDONLY|O_LARGEFILE) = 3 (line 635) fstat64(3, {st_mode=S_IFREG|0755, st_size=4264, ...}) = 0 read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\2\0\3\0\1\0\0\0h\203\4"..., 4264) = 4264 select(2, NULL, [1], NULL, {60, 0}) = 1 (out [1], left {60, 0}) write(1, "\374\17\0\7\2\0\0\0\0\0\0\0\0\10\0\0\2\0\0\0\0\0\0\0\250"..., 4096) = 4096 close(3) = 0 open("/usr/local/bin/argv0", O_RDONLY|O_LARGEFILE) = 3 fstat64(3, {st_mode=S_IFREG|0755, st_size=9768, ...}) = 0 read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\2\0\3\0\1\0\0\0$\204\4"..., 9768) = 9768 select(2, NULL, [1], NULL, {60, 0}) = 1 (out [1], left {60, 0}) write(1, "\374\17\0\7\1\0\0\0\3\0\0\0X\233\4\10X\v\0\0\4\0\0\0\0"..., 4096) = 4096 select(2, NULL, [1], NULL, {60, 0}) = 1 (out [1], left {60, 0}) write(1, "\374\17\0\7\1\0\0\203\372$\270l\230\4\10\17\204\7\1\0\0"..., 4096) = 4096 close(3) = 0 ... up to: open("/usr/local/src/fastforward-0.51/wait.h", (line 2781) O_RDONLY|O_LARGEFILE) = 3 fstat64(3, {st_mode=S_IFREG|0644, st_size=295, ...}) = 0 read(3, "#ifndef WAIT_H\n#define WAIT_H\n\ne"..., 295) = 295 close(3) = 0 open("/usr/local/src/fastforward-0.51/wait_pid.c", O_RDONLY|O_LARGEFILE) = 3 fstat64(3, {st_mode=S_IFREG|0644, st_size=709, ...}) = 0 read(3, "#include <sys/types.h>\n#include "..., 709) = 709 close(3) = 0 open("/usr/local/src/fastforward-0.51/wait_pid.o", O_RDONLY|O_LARGEFILE) = 3 fstat64(3, {st_mode=S_IFREG|0644, st_size=896, ...}) = 0 read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\1\0\3\0\1\0\0\0\0\0\0\0"..., 896) = 896 select(2, NULL, [1], NULL, {60, 0}) = 1 (out [1], left {59, 999000}) write(1, "\374\17\0\7.sa_handler = 0;\n sa.sa_fla"..., 4096) = 4096 close(3) = 0 open("/usr/local/src/fastforward-0.51/warn-auto.sh", O_RDONLY|O_LARGEFILE) = 3 fstat64(3, {st_mode=S_IFREG|0644, st_size=64, ...}) = 0 read(3, "#!/bin/sh\n# WARNING: This file w"..., 64) = 64 close(3) = 0 select(2, NULL, [1], NULL, {60, 0}) = 1 (out [1], left (line 2799) {60, 0}) write(1, "\300\0\0\7\0waitpid\0__errno_location\0er"..., 196) = 196 select(2, NULL, [1], NULL, {60, 0}) = 1 (out [1], left {60, 0}) write(1, "\4\0\0\7\377\377\377\377", 8) = 8 select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) select(1, [0], [], NULL, {60, 0}) = 0 (Timeout) and toast city. What's wierd is the select 2 after the close. The mode I see a lot is: open file, stat file read from file on fd 3 (if needed I assume file is different from one on server select on 2, write on fd 2 select on fd 2 write on fd 1 ) close fd 3 back to open of file before the hang, I see a close, select, write, select, write, then selects forever (1.5 hours and counting). Running a sed to eliminate paired open/closes I see only one other instance of the selects outside of a file open/close pair. It occurs at line 629 in the trace output: chdir("/home/backup") = 0 (line 616) open("/etc/mtab", O_RDONLY) = 3 fstat64(3, {st_mode=S_IFREG|0644, st_size=469, ...}) = 0 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7ed5000 read(3, "/dev/mapper/VolGroup00-LogVol01 "..., 4096) = 469 close(3) = 0 munmap(0xb7ed5000, 4096) = 0 open("/proc/meminfo", O_RDONLY) = 3 fstat64(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7ed5000 read(3, "MemTotal: 2074828 kB\nMemFre"..., 1024) = 670 close(3) = 0 munmap(0xb7ed5000, 4096) = 0 select(2, NULL, [1], NULL, {60, 0}) = 1 (out [1], left (line 629**) {60, 0}) write(1, "\374\17\0\7tdio.h\314\3\0\0\206\262a5\1\0\0\0\272\25\n"..., 4096) = 4096 select(2, NULL, [1], NULL, {60, 0}) = 1 (out [1], left {60, 0}) write(1, "\1\0\0\7\0", 5) = 5 select(1, [0], [], NULL, {60, 0}) = 1 (in [0], left {59, 684000}) read(0, "\2\0\0\0\0\0\0\0\0\10\0\0\2\0\0\0\0\0\0\0\3\0\0\0\0\0\0"..., 8184) = 6724 open("/usr/local/bin/addcr", O_RDONLY|O_LARGEFILE) = 3 fstat64(3, {st_mode=S_IFREG|0755, st_size=4264, ...}) = 0 read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\2\0\3\0\1\0\0\0h\203\4"..., 4264) = 4264 select(2, NULL, [1], NULL, {60, 0}) = 1 (out [1], left {60, 0}) write(1, "\374\17\0\7\2\0\0\0\0\0\0\0\0\10\0\0\2\0\0\0\0\0\0\0\250"..., 4096) = 4096 close(3) = 0 I can make the whole file available on the web if you or anybody else want's it. Contact me off list, no sense spamming people. -- -- rouilj John Rouillard System Administrator Renesys Corporation 603-643-9300 x 111 ------------------------------------------------------------------------- This SF.net email is sponsored by: Splunk Inc. Still grepping through log files to find problems? Stop. Now Search log events and configuration files using AJAX and a browser. Download your FREE copy of Splunk now >> http://get.splunk.com/ _______________________________________________ BackupPC-users mailing list BackupPC-users@lists.sourceforge.net List: https://lists.sourceforge.net/lists/listinfo/backuppc-users Wiki: http://backuppc.wiki.sourceforge.net Project: http://backuppc.sourceforge.net/