On Fri, Feb 22, 2002 at 12:10:52PM -0500, David Birnbaum wrote:
> On Fri, 22 Feb 2002, Dave Dykstra wrote:
> 
> > [adding the mailing list to the Cc after David sent me trusses and other
> > info about his hang.  I have trimmed down the tracing info to the relevant
> > pieces.]
> >
> > On Tue, Feb 19, 2002 at 11:51:27PM -0500, David Birnbaum wrote:
> > > I wasn't able to get a netstat this time.  However, I should be able to
> > > grab it again if these truss's don't help.  I'm thinking I will run a
> > > snoop the next time and capture the packet traffic to see what's happening
> > > at the TCP level - at least we should be able to see who/what shut down
> > > first, because from the truss, it looks like the kernel shut down the
> > > whole thing at once.  It's tough to time the netstat because it takes a
> > > while for it to die.
> >
> > The client is doing nothing complex, it's just merrilly blasting away down
> > file descriptor 4 and never reading it.  It then hangs and the only reason
> > it wakes up is because the connection has been torn down.
> >
> > The server process looks very odd.  It is doing a 196K read from a file, and
> > it takes 9 minutes to return from that, and when it wakes up from that it
> > finds that an EPIPE from the connection to the rsync client.  Why would a
> > 196K read take 9 minutes?  It looks like it is going over NFS, so maybe the
> > problem is with NFS.  What's the network between the NFS server and the
> > rsync server?  Is it possible to run rsync on the NFS server itself, or is
> > that a network appliance or something?  It could be that the TCP
> > implementation on the client side wasn't willing to wait for a response for
> > 9 minutes and that's why it shut things down with an EPIPE.
> >
> > At the same time, the server child process is doing a 262K read from another
> > file over the same NFS connection and it takes the same 9 minutes and
> > eventually suceeds.  It only dies because the other server process had died
> > and it gets a signal.
> 
> What makes you think it's going over NFS?  Everything being backed up is
> local, and the backup disk is local...as best as I can tell.  The lsof
> stuff all points to local devices as well.

I just thought /export was likely to be a remote disk, and I couldn't imagine
it taking 9 minutes to read a local disk.

Hold on... I misread the trace.  I just did some simple tests and I found
that truss timestamps are actually the time that an operation completes,
not the time they start.  So the long delays are in the operations
following the reads, not the reads themselves.  That changes everything.

New analysis: the client (the sender) is waiting on a poll to write to the
network.  The server main process (the generator) is waiting on a poll to
write to the network in the other direction and possibly also to read from
the socketpair inter-process communication from it's child process (the
receiver).  If you use truss "-vpoll" it should tell you exactly what's
being waited on.  The server child process (the receiver) is waiting on
a 928 byte write to the file being built on the disk!  It gets stranger and
stranger.  Is the disk almost full or something?



> The network is a 100MB switched network, and while the backup server is
> busy at night doing the backups, when I was running the truss neither
> server was particularly busy.
> 
> The plot thickens...it's worked two days in a row now.  But, the next time
> it fails I will get a snoop and see if TCP lends any clues.  Is it
> possible that there's some sort of memory clearing, cleanup, or other
> event going on that could cause the occassional lapse like this?  We
> certainly haven't seen any other networking problems that would explain
> this.

It still looks like a disk problem and not a network problem, so I don't
think a snoop is going to help you.

- Dave Dykstra





> Content-Description: client lsof
> COMMAND   PID USER   FD   TYPE        DEVICE   SIZE/OFF    NODE NAME
...
> rsync   20039 root    4u  IPv4 0x3000dbd3810 0t64514407     TCP 
>corona.chelsea.net:51816->hite.chelsea.net:rsync (ESTABLISHED)
> rsync   20039 root    5r  VREG        32,359   52273152  243313 
>/export/httpd/DOMAINS/stats.chelsea.net/data/NetTracker/cyberounds.com/view.dat
>
> Content-Description: client truss
...
> 3504.6000     write(4, " w s   9 8 ;   V Z _ I E".., 1021)    = 1021
> 3504.6003     poll(0xFFBE5CA0, 1, 60000)                      = 1
> 3504.6005     write(4, "03FCFFFF", 4)                         = 4
> 3504.6010     poll(0xFFBE5CA0, 1, 60000)                      = 1
> 3504.6012     write(4, "02FCFFFF", 4)                         = 4
> 3504.6016     poll(0xFFBE5CA0, 1, 60000)                      = 1
> 3504.6019     write(4, "01FCFFFF", 4)                         = 4
> 3504.6034     poll(0xFFBE5CA0, 1, 60000)                      = 1
> 3504.6036     write(4, "E306\0\0", 4)                         = 4
> 3504.6041     poll(0xFFBE5D18, 1, 60000)                      = 1
> 3504.6046     write(4, " s t 1 0 3 . t n t 4 . r".., 1763)    = 1763
> poll(0xFFBE5CA0, 1, 60000)    (sleeping...)
> 3564.6045     poll(0xFFBE5CA0, 1, 60000)                      = 0
> poll(0xFFBE5CA0, 1, 60000)    (sleeping...)
> 3624.6045     poll(0xFFBE5CA0, 1, 60000)                      = 0
> poll(0xFFBE5CA0, 1, 60000)    (sleeping...)
> 3684.6045     poll(0xFFBE5CA0, 1, 60000)                      = 0
> poll(0xFFBE5CA0, 1, 60000)    (sleeping...)
> 3744.6046     poll(0xFFBE5CA0, 1, 60000)                      = 0
> poll(0xFFBE5CA0, 1, 60000)    (sleeping...)
> 3804.6046     poll(0xFFBE5CA0, 1, 60000)                      = 0
> poll(0xFFBE5CA0, 1, 60000)    (sleeping...)
> 3864.6044     poll(0xFFBE5CA0, 1, 60000)                      = 0
> poll(0xFFBE5CA0, 1, 60000)    (sleeping...)
> 3924.6041     poll(0xFFBE5CA0, 1, 60000)                      = 0
> poll(0xFFBE5CA0, 1, 60000)    (sleeping...)
> 3984.6041     poll(0xFFBE5CA0, 1, 60000)                      = 0
> poll(0xFFBE5CA0, 1, 60000)    (sleeping...)
> 4037.3040     poll(0xFFBE5CA0, 1, 60000)                      = 1
> 4037.3046     write(4, "FEFBFFFF", 4)                         Err#32 EPIPE
> 4037.3055         Received signal #13, SIGPIPE [caught]
> 4037.3064     sigaction(SIGUSR1, 0xFFBE5838, 0xFFBE58B8)      = 0
> 4037.3071     sigaction(SIGUSR2, 0xFFBE5838, 0xFFBE58B8)      = 0
> 4037.3074     write(2, " r s y n c   e r r o r :".., 66)      = 66
> 4037.3084     mmap(0x00000000, 8192, PROT_READ|PROT_WRITE|PROT_EXEC, 
>MAP_PRIVATE|MAP_ANON, -1, 0) = 0xFF310000
> 4037.3088     munmap(0xFF310000, 8192)                        = 0
> 4037.3091     llseek(0, 0, SEEK_CUR)                          = 583129
> 4037.3094     _exit(20)
>
>
> Content-Description: server rsync lsof
> COMMAND   PID USER   FD   TYPE        DEVICE   SIZE/OFF    NODE NAME
...
> rsync   27948 root    5u  IPv4 0x30000a59710 0t64477429     TCP 
>hite.chelsea.net:rsync->corona.chelsea.net:51816 (ESTABLISHED)
> rsync   27948 root    6uw VREG         85,30          0  541128 /export 
>(/dev/md/dsk/d30)
> rsync   27948 root    7r  VREG        32,144   19062784 5347011 /export/backup -- 
>export/httpd/DOMAINS/stats.chelsea.net/data/NetTracker/cyberounds.com/view.idx
>
> Content-Description: server truss
...
> 3497.9255     
>lstat64("export/httpd/DOMAINS/stats.chelsea.net/data/NetTracker/resistance-forum-www.thebody.com/page.dat",
> 0xFFBEE370) = 0
> 3497.9262     
>open64("export/httpd/DOMAINS/stats.chelsea.net/data/NetTracker/resistance-forum-www.thebody.com/page.dat",
> O_RDONLY) = 7
> 3497.9869     read(7, "\002FFB7\0\0\0\0\002E0EF".., 196536)   = 196536
> poll(0xFFBEC6E0, 2, 60000)    (sleeping...)
> 4037.2043     poll(0xFFBEC6E0, 2, 60000)                      = 0
> 4037.2049     poll(0xFFBEC6E0, 2, 60000)                      = 1
> 4037.2052     write(5, "FC0F\0078E hB40698 .84EF".., 4096)    Err#32 EPIPE
> 4037.2816         Received signal #13, SIGPIPE [caught]
> 4037.2843     sigaction(SIGUSR1, 0xFFBEC2B0, 0xFFBEC330)      = 0
> 4037.2848     sigaction(SIGUSR2, 0xFFBEC2B0, 0xFFBEC330)      = 0
> 4037.2851     getpid()                                        = 27948 [27130]
> 4037.3504     kill(28192, SIGUSR1)                            = 0
> 4037.3508     getpid()                                        = 27948 [27130]
> 4037.3514     poll(0xFFBEABB0, 2, 60000)                      = 1
> 4037.3518     write(5, " B\0\0\b r s y n c   e r".., 70)      Err#32 EPIPE
> 4037.3523         Received signal #13, SIGPIPE [default]
>       *** process killed ***
>
>
> Content-Description: server child lsof
> COMMAND   PID USER   FD   TYPE        DEVICE   SIZE/OFF    NODE NAME
...
> rsync   28192 root    5u  IPv4 0x30000a59710 0t64477429     TCP 
>hite.chelsea.net:rsync->corona.chelsea.net:51816 (ESTABLISHED)
> rsync   28192 root    6u  VREG         85,30          0  541128 /export 
>(/dev/md/dsk/d30)
> rsync   28192 root    7r  VREG        32,144   51363840 5347010 /export/backup -- 
>export/httpd/DOMAINS/stats.chelsea.net/data/NetTracker/cyberounds.com/view.dat
> rsync   28192 root    8u  unix        105,36        0t0  306297 
>/devices/pseudo/tl@0:ticots->(socketpair: 0x18d400000000) (0x300016fd448)
> rsync   28192 root    9u  VREG        32,144   37569724 5347009 /export/backup -- 
>export/httpd/DOMAINS/stats.chelsea.net/data/NetTracker/cyberounds.com/.view.dat.Q9aae3
> rsync   28192 root   10u  unix        105,39        0t0  306297 
>/devices/pseudo/tl@0:ticots->(socketpair: 0x18d600000000) (0x300016fc368)
>
> Content-Description: server child truss
...
> 3504.2667     poll(0xFFBECF20, 1, 60000)                      = 1
> 3504.2682     read(5, "D2\t\0\0", 4)                          = 4
> 3504.2699     poll(0xFFBECF98, 1, 60000)                      = 1
> 3504.2714     read(5, " I E   6 . 0 ;   W i n d".., 2514)     = 2514
> 3504.2738     write(9, " I E   6 . 0 ;   W i n d".., 1024)    = 1024
> 3504.2755     write(9, " . 0 ;   W i n d o w s  ".., 1024)    = 1024
> 3504.2771     write(9, " w e r P C )\0 M9BCE9A 1".., 466)     = 466
> 3504.2787     poll(0xFFBECF20, 1, 60000)                      = 1
> 3504.2800     read(5, "E4E6FFFF", 4)                          = 4
> 3504.2822     llseek(7, 5898240, SEEK_SET)                    = 5898240
> 3517.5204     read(7, "\0 Y  \0\0\0\0\0\0 E16C5".., 262144)   = 262144
> 4037.2227     write(9, " d o w s   9 5 )\0 T\v C".., 928)     = 928
> 4037.2233     poll(0xFFBECF20, 1, 60000)                      = 1
> 4037.2238     read(5, "E3E6FFFF", 4)                          = 4
> 4037.2244     write(9, " 4 . 0   ( c o m p a t i".., 928)     = 928
> 4037.2258     poll(0xFFBECF20, 1, 60000)                      = 1
> 4037.2262     read(5, "E2E6FFFF", 4)                          = 4
> 4037.2269     write(9, " c o m : M o z i l l a /".., 928)     = 928
> 4037.2274     poll(0xFFBECF20, 1, 60000)                      = 1
> 4037.2278     read(5, "0E1A\0\0", 4)                          = 4
> 4037.2283     poll(0xFFBECF98, 1, 60000)                      = 1
> 4037.2288     read(5, " M o z i l l a / 4 . 0  ".., 6670)     = 6670
> 4037.2304     write(9, " M o z i l l a / 4 . 0  ".., 1024)    = 1024
> 4037.3011     write(9, " . c o m : M o z i l l a".., 1024)    = 1024
> 4037.3017     write(9, " 2 4 7 - 2 9 - 0 . c l i".., 1024)    = 1024
> 4037.3022     write(9, " 1 ;   Q 3 1 2 4 6 1 )\0".., 1024)    = 1024
> 4037.3028     write(9, " M S I E C r a w l e r )".., 1024)    = 1024
> 4037.3079     llseek(9, 1, SEEK_CUR)                          = 585779
> 4037.3085     write(9, " WB7 lA0 1 2 - 2 5 5 - 1".., 1023)    = 1023
> 4037.3125     write(9, " 0 7 2 1 2 0 0 0 h . 0 8".., 526)     = 526
> 4037.3636     poll(0xFFBECF20, 1, 60000)                      = 1
> 4037.3642     read(5, " %DFFFFF", 4)                          = 4
> 4037.3650     llseek(7, 7733248, SEEK_SET)                    = 7733248
> 4037.6213     read(7, "\01A @\0\01A  \0\0 6109E".., 262144)   = 262144
> 4037.6229         Received signal #16, SIGUSR1 [caught]
>       siginfo: SIGUSR1 pid=27948 uid=0
> 4037.6234     sigaction(SIGUSR1, 0xFFBED188, 0xFFBED208)      = 0
> 4037.6238     sigaction(SIGUSR2, 0xFFBED188, 0xFFBED208)      = 0
> 4037.6913     
>unlink("export/httpd/DOMAINS/stats.chelsea.net/data/NetTracker/oral-forum-www.thebody.com/.visitorid.idx.Lkbae3")
> = 0
> 4037.6921     getpid()                                        = 28192 [1]
> 4037.6957     write(10, " A\0\0\b r s y n c   e r".., 69)     Err#32 EPIPE
> 4037.6977         Received signal #13, SIGPIPE [caught]
> 4037.6982     sigaction(SIGUSR1, 0xFFBEC8B0, 0xFFBEC930)      = 0
> 4037.7016     sigaction(SIGUSR2, 0xFFBEC8B0, 0xFFBEC930)      = 0
> 4037.7021     write(10, " A\0\0\b r s y n c   e r".., 69)     Err#32 EPIPE
> 4037.7026         Received signal #13, SIGPIPE [default]
>       *** process killed ***

-- 
To unsubscribe or change options: http://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.tuxedo.org/~esr/faqs/smart-questions.html

Reply via email to