Trying to split theads here. In a prior discussion on the hang issue,
on Fri, Oct 26, 2007 at 11:13:14AM -0500, Les Mikesell wrote:
> John Rouillard wrote:
> >>> [rouilj]
> >>>  $Conf{ClientTimeout} = 72000;
> >>>
> >>>which is 20 hours and the sigpipe is occurring
> >>>before then.  You'd see sigalarm instead of sigpipe
> >>>if you had a timeout.
> >
> >Something like this I assume:
> >
> [...]
> >    create d 755       0/1       12288 src/fastforward-0.51
> >  finish: removing in-process file .
> >  Child is aborting
> >  Done: 17 files, 283 bytes
> >  Got fatal error during xfer (aborted by signal=ALRM)
> >  Backup aborted by user signal
>
> Yes, that one is a timeout on the backuppc side.
>
> >Also I straced the rsync process on the remote
> >system while it was hung (I assume on whatever
> >occurred after the src/fastforward-0.51) directory
> >and got:
> >
> >  [EMAIL PROTECTED] ~]$ ps -ef | grep 6909
> >  root      6909  6908  0 Oct25 ?        00:00:00 /usr/bin/rsync
> >  --server --sender --numeric-ids --perms --owner --group -D --links
> >  --hard-links --times --block-size=2048 --recursive --one-file-system
> >  --checksum-seed=32761 --ignore-times . /usr/local/
> >  rouilj   10603 10349  0 05:36 pts/0    00:00:00 grep 6909
> >  [EMAIL PROTECTED] ~]$ strace -p 6909
> >  attach: ptrace(PTRACE_ATTACH, ...): Operation not permitted
> >  [EMAIL PROTECTED] ~]$ sudo strace -p 6909
> >  Process 6909 attached - interrupt to quit
> >  select(1, [0], [], NULL, {42, 756000})  = 0 (Timeout)
> >  select(1, [0], [], NULL, {60, 0})       = 0 (Timeout)
> >  select(1, [0], [], NULL, {60, 0})       = 0 (Timeout)
> >  select(1, [0], [], NULL, {60, 0} <unfinished ...>
> >  Process 6909 detached
> >
> >And similar results on the server side
> >process. Maybe a deadlock somewhere? The ssh pipe
> >appeared open. I set it up to forward traffic and
> >was able to pass traffic from the server to the
> >client.
>
> Are these 2 different scenarios (the sigalarm and
> sigpipe)?  I don't think I've ever seen a real
> deadlock on a unix/linux rsync although I always got
> them on windows when trying to run rsync under sshd
> (and I'd appreciate knowing the right versions to use
> if that works now). The sigpipe scenario sounded like
> the remote rsync crashed or quit (perhaps not being
> able to handle files >2gigs).  This looks like
> something different.  Can you start the remote strace
> before the hang so you have a chance of seeing the
> file and activity in progress when the hang occurs?


Ask and you will receive. Here is part of an strace of the
rsync on the client machine. Continuation of long lines are
indented by 2 spaces, line number in parens.

Starts:

execve("/usr/bin/rsync", ["/usr/bin/rsync", "--server",  (line 1)
  "--sender", "--numeric-ids", "--perms", "--owner",
  "--group", "-D", "--links", "--hard-links", "--times",
  "--block-size=2048", "--recursive",
  "--one-file-system", "--checksum-seed=32761",
  "--ignore-times", ...], [/* 16 vars */]) = 0 
uname({sys="Linux", node="vpn01.fp.psm1.renesys.com", ...}) = 0

some initialization and then:

lstat64("/usr/local/.", {st_mode=S_IFDIR|0755, st_size=4096,
  ...}) = 0
chdir("/usr/local")                     = 0  (line 63)
stat64(".", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
lstat64(".", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
mmap2(NULL, 266240, PROT_READ|PROT_WRITE,
  MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7ef7000
mmap2(NULL, 135168, PROT_READ|PROT_WRITE,
  MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7ed6000
open(".", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 3
fstat64(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
fcntl64(3, F_SETFD, FD_CLOEXEC)         = 0
getdents64(3, /* 12 entries */, 4096)   = 320
lstat64("share", {st_mode=S_IFDIR|0755, st_size=4096, ...})
  = 0
open("share", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) =
  4
fstat64(4, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
fcntl64(4, F_SETFD, FD_CLOEXEC)         = 0
getdents64(4, /* 4 entries */, 4096)    = 96
lstat64("share/info", {st_mode=S_IFDIR|0755, st_size=4096,
  ...}) = 0
open("share/info",
  O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 5
fstat64(5, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
fcntl64(5, F_SETFD, FD_CLOEXEC)         = 0
getdents64(5, /* 2 entries */, 4096)    = 48
getdents64(5, /* 0 entries */, 4096)    = 0
close(5)                                = 0
lstat64("share/man", {st_mode=S_IFDIR|0755, st_size=4096,
  ...}) = 0
open("share/man",
  O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 5
fstat64(5, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
fcntl64(5, F_SETFD, FD_CLOEXEC)         = 0
getdents64(5, /* 12 entries */, 4096)   = 288

it starts walking through the directory tree.
[fl]stats all over the place.

Then file writes:

open("/usr/local/bin/addcr", O_RDONLY|O_LARGEFILE) = 3 (line 635)
fstat64(3, {st_mode=S_IFREG|0755, st_size=4264, ...}) = 0
read(3,
  "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\2\0\3\0\1\0\0\0h\203\4"...,
  4264) = 4264
select(2, NULL, [1], NULL, {60, 0})     = 1 (out [1], left
  {60, 0})
write(1,
  "\374\17\0\7\2\0\0\0\0\0\0\0\0\10\0\0\2\0\0\0\0\0\0\0\250"...,
  4096) = 4096
close(3)                                = 0
open("/usr/local/bin/argv0", O_RDONLY|O_LARGEFILE) = 3
fstat64(3, {st_mode=S_IFREG|0755, st_size=9768, ...}) = 0
read(3,
  "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\2\0\3\0\1\0\0\0$\204\4"...,
  9768) = 9768
select(2, NULL, [1], NULL, {60, 0})     = 1 (out [1], left
  {60, 0})
write(1,
  "\374\17\0\7\1\0\0\0\3\0\0\0X\233\4\10X\v\0\0\4\0\0\0\0"...,
  4096) = 4096
select(2, NULL, [1], NULL, {60, 0})     = 1 (out [1], left
  {60, 0})
write(1,
  "\374\17\0\7\1\0\0\203\372$\270l\230\4\10\17\204\7\1\0\0"...,
  4096) = 4096
close(3)                                = 0
...

up to:

open("/usr/local/src/fastforward-0.51/wait.h", (line 2781)
  O_RDONLY|O_LARGEFILE) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=295, ...}) = 0
read(3, "#ifndef WAIT_H\n#define WAIT_H\n\ne"..., 295) = 295
close(3)                                = 0
open("/usr/local/src/fastforward-0.51/wait_pid.c",
  O_RDONLY|O_LARGEFILE) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=709, ...}) = 0
read(3, "#include <sys/types.h>\n#include "..., 709) = 709
close(3)                                = 0
open("/usr/local/src/fastforward-0.51/wait_pid.o",
  O_RDONLY|O_LARGEFILE) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=896, ...}) = 0
read(3,
  "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\1\0\3\0\1\0\0\0\0\0\0\0"...,
  896) = 896
select(2, NULL, [1], NULL, {60, 0})     = 1 (out [1], left
  {59, 999000})
write(1, "\374\17\0\7.sa_handler = 0;\n  sa.sa_fla"...,
  4096) = 4096
close(3)                                = 0
open("/usr/local/src/fastforward-0.51/warn-auto.sh",
  O_RDONLY|O_LARGEFILE) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=64, ...}) = 0
read(3, "#!/bin/sh\n# WARNING: This file w"..., 64) = 64
close(3)                                = 0
select(2, NULL, [1], NULL, {60, 0})     = 1 (out [1], left  (line 2799)
  {60, 0})
write(1, "\300\0\0\7\0waitpid\0__errno_location\0er"...,
  196) = 196
select(2, NULL, [1], NULL, {60, 0})     = 1 (out [1], left
  {60, 0})
write(1, "\4\0\0\7\377\377\377\377", 8) = 8
select(1, [0], [], NULL, {60, 0})       = 0 (Timeout)
select(1, [0], [], NULL, {60, 0})       = 0 (Timeout)
select(1, [0], [], NULL, {60, 0})       = 0 (Timeout)
select(1, [0], [], NULL, {60, 0})       = 0 (Timeout)
select(1, [0], [], NULL, {60, 0})       = 0 (Timeout)
select(1, [0], [], NULL, {60, 0})       = 0 (Timeout)
select(1, [0], [], NULL, {60, 0})       = 0 (Timeout)
select(1, [0], [], NULL, {60, 0})       = 0 (Timeout)

and toast city.

What's wierd is the select 2 after the close. The mode I see
a lot is:

open file,
stat file
read from file on fd 3
(if needed I assume file is different from one on server
  select on 2,
  write on fd 2
  select on fd 2
  write on fd 1
)
close fd 3
back to open of file

before the hang, I see a close, select, write, select,
write, then selects forever (1.5 hours and counting).

Running a sed to eliminate paired open/closes I see only one
other instance of the selects outside of a file open/close
pair. It occurs at line 629 in the trace output:

chdir("/home/backup")                   = 0       (line 616)
open("/etc/mtab", O_RDONLY)             = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=469, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE,
  MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7ed5000
read(3, "/dev/mapper/VolGroup00-LogVol01 "..., 4096) = 469
close(3)                                = 0
munmap(0xb7ed5000, 4096)                = 0
open("/proc/meminfo", O_RDONLY)         = 3
fstat64(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE,
  MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7ed5000
read(3, "MemTotal:      2074828 kB\nMemFre"..., 1024) = 670
close(3)                                = 0
munmap(0xb7ed5000, 4096)                = 0
select(2, NULL, [1], NULL, {60, 0})     = 1 (out [1], left  (line 629**)
  {60, 0})
write(1,
  "\374\17\0\7tdio.h\314\3\0\0\206\262a5\1\0\0\0\272\25\n"...,
  4096) = 4096
select(2, NULL, [1], NULL, {60, 0})     = 1 (out [1], left
  {60, 0})
write(1, "\1\0\0\7\0", 5)               = 5
select(1, [0], [], NULL, {60, 0})       = 1 (in [0], left
  {59, 684000})
read(0,
  "\2\0\0\0\0\0\0\0\0\10\0\0\2\0\0\0\0\0\0\0\3\0\0\0\0\0\0"...,
  8184) = 6724
open("/usr/local/bin/addcr", O_RDONLY|O_LARGEFILE) = 3
fstat64(3, {st_mode=S_IFREG|0755, st_size=4264, ...}) = 0
read(3,
  "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\2\0\3\0\1\0\0\0h\203\4"...,
  4264) = 4264
select(2, NULL, [1], NULL, {60, 0})     = 1 (out [1], left
  {60, 0})
write(1,
  "\374\17\0\7\2\0\0\0\0\0\0\0\0\10\0\0\2\0\0\0\0\0\0\0\250"...,
  4096) = 4096
close(3)                                = 0

I can make the whole file available on the web if you or anybody else
want's it. Contact me off list, no sense spamming people.

-- 
                                -- rouilj

John Rouillard
System Administrator
Renesys Corporation
603-643-9300 x 111

-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
_______________________________________________
BackupPC-users mailing list
BackupPC-users@lists.sourceforge.net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/

Reply via email to