On Fri, May 25, 2001 at 12:14:17PM -0500, Phil Howard wrote:
> I switched to 2.4.6 a while back, but have only been making heavy
> use of rsync the past couple of months, and have been running into
> a few problems that may be bugs. I looked at the bug tracker, but
> it was too cumbersome to use effectively. I don't know if these
> are real bugs or just configuration mistakes. Maybe you can tell
> me.
>
> The host OS is Linux 2.4.X (X is 0 on some and 2 on others) and
> Slackware 7.1 in all cases.
>
> Here are the things I'm running into that I did not have in 2.4.4:
>
> 1 =============================================================================
> Write failed: Cannot allocate memory
> unexpected EOF in read_timeout
> unexpected EOF in read_timeout
>
> I've seen this happen when there was over 256 meg available space
> between ram and swap, so it should not have failed as a result of
> not being able to get a reasonable amount from the system. This
> also occurs randomly; if I wipe the files back out and run it all
> again, it often does not happen the next time, or if it does, it
> is not at the same point. Also, the number of files being copied
> is smallish (not more than 100), and it happens even when no file
> is larger than about 4 meg and the total transfer is no larger than
> 20 meg (so even if it pre-loaded every file into ram, there should
> be enough space). This happens even if the target directory starts
> empty.
>
> This is done through ssh and I do not recall it happening when using
> an rsync daemon.
This must be coming from ssh, not rsync. There is no string "Write failed"
in the rsync source code, but there is in both ssh 1.2.27 and openssh 2.9p1.
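The check is easy to repeat: grep each source tree for the message. Here is the same idea in miniature, using throwaway stand-in trees (the file names and source lines below are illustrative, not the real layout of either package):

```shell
# Stand-in source trees; the real check would grep the actual unpacked
# rsync and ssh source directories for the error string.
dir=$(mktemp -d)
mkdir "$dir/rsync-src" "$dir/ssh-src"
printf '%s\n' 'rprintf(FERROR,"unexpected EOF in read_timeout");' > "$dir/rsync-src/io.c"
printf '%s\n' 'error("Write failed: %.100s", strerror(errno));'   > "$dir/ssh-src/client.c"

# Which tree contains the message?
hits=$(grep -rl 'Write failed' "$dir")
echo "$hits"     # only the ssh stand-in matches
rm -rf "$dir"
```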
> 2 =============================================================================
> When synchronizing a very large number of files, all files in a large
> partition, rsync frequently hangs. It's about 50% of the time, but
> seems to be a function of how much work there was to be done. That
> is, if I run it soon after it just ran, it tends to not hang, but if
> I run it after quite some time (and lots of stuff to synchronize) it
> tends to hang. It appears to have completed all the files, but I
> don't get any stats. There are 3 rsync processes sitting idle with
> no files open in the source or target trees.
>
> At last count there were 368827 files and 8083 symlinks in 21749
> directories.
>
> df shows:
> /dev/hda4 42188460 38303916 3884544 91% /home
> /dev/hdb4 42188460 38301972 3886488 91% /mnt/hdb/home
>
> df -i shows:
> /dev/hda4 2662400 398419 2263981 15% /home
> /dev/hdb4 2662400 398462 2263938 15% /mnt/hdb/home
>
> The df numbers are not exact because change is constantly happening
> on this active server. Drives hda and hdb are identical and are
> partitioned alike.
>
> The command line is echoed from the script that runs it:
>
> rsync -axv --stats --delete /home/. /mnt/hdb/home/. 1>'/home/root/backup-hda-to-hdb/home.log' 2>&1
Use the -W option to disable the rsync algorithm. We really ought to make
that the default when both the source and destination are local.
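Concretely, the line in the script above would become (same paths and redirection, with -W added to skip the delta algorithm and copy whole files):

```shell
rsync -axvW --stats --delete /home/. /mnt/hdb/home/. \
    1>'/home/root/backup-hda-to-hdb/home.log' 2>&1
```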
> The log file shows a file list gone all the way to the last file
> and lsof done after the hang shows:
>
> rsync 26651 root cwd DIR 3,2 4096 24 /root
> rsync 26651 root rtd DIR 3,2 4096 2 /
> rsync 26651 root txt REG 3,10 187443 8758 /usr/local/bin/rsync
> rsync 26651 root mem REG 3,2 79276 4239 /lib/ld-2.1.3.so
> rsync 26651 root mem REG 3,2 1013224 4249 /lib/libc-2.1.3.so
> rsync 26651 root mem REG 3,2 40360 4274 /lib/libnss_compat-2.1.3.so
> rsync 26651 root mem REG 3,2 75500 4272 /lib/libnsl-2.1.3.so
> rsync 26651 root 0u CHR 136,14 16 /dev/pts/14
> rsync 26651 root 1w REG 3,4 56898 1778435 /home/root/backup-hda-to-hdb/home.log
> rsync 26651 root 2w REG 3,4 56898 1778435 /home/root/backup-hda-to-hdb/home.log
> rsync 26651 root 4u unix 0xcb0f4040 135813770 socket
> rsync 26651 root 5u unix 0xcb0f4cc0 135813771 socket
> rsync 26652 root cwd DIR 3,68 4096 2 /mnt/hdb/home
> rsync 26652 root rtd DIR 3,2 4096 2 /
> rsync 26652 root txt REG 3,10 187443 8758 /usr/local/bin/rsync
> rsync 26652 root mem REG 3,2 79276 4239 /lib/ld-2.1.3.so
> rsync 26652 root mem REG 3,2 1013224 4249 /lib/libc-2.1.3.so
> rsync 26652 root mem REG 3,2 40360 4274 /lib/libnss_compat-2.1.3.so
> rsync 26652 root mem REG 3,2 75500 4272 /lib/libnsl-2.1.3.so
> rsync 26652 root 1u unix 0xcb93b9a0 135813772 socket
> rsync 26652 root 2w REG 3,4 56898 1778435 /home/root/backup-hda-to-hdb/home.log
> rsync 26652 root 3u unix 0xca8edcc0 135814161 socket
> rsync 26652 root 5u unix 0xcc9969a0 135814163 socket
> rsync 26704 root cwd DIR 3,68 4096 2 /mnt/hdb/home
> rsync 26704 root rtd DIR 3,2 4096 2 /
> rsync 26704 root txt REG 3,10 187443 8758 /usr/local/bin/rsync
> rsync 26704 root mem REG 3,2 79276 4239 /lib/ld-2.1.3.so
> rsync 26704 root mem REG 3,2 1013224 4249 /lib/libc-2.1.3.so
> rsync 26704 root mem REG 3,2 40360 4274 /lib/libnss_compat-2.1.3.so
> rsync 26704 root mem REG 3,2 75500 4272 /lib/libnsl-2.1.3.so
> rsync 26704 root 0u unix 0xce9cd9a0 135813769 socket
> rsync 26704 root 2w REG 3,4 56898 1778435 /home/root/backup-hda-to-hdb/home.log
> rsync 26704 root 6u unix 0xcecaccc0 135814164 socket
>
> If I kill 26704, nothing particularly happens. If I kill 26652,
> then I get "unexpected EOF in read_timeout" (not surprising), which
> tells me 26651 was waiting for something from 26652 which was not
> happening. Maybe 26652 was waiting for something from 26651 first.
> A deadly embrace? It seems possible.
No, the receiving side of an rsync transaction splits itself into two
processes for the sake of pipelining: one to generate checksums and one to
accept updates. When you're sending and receiving on the same machine,
you've got one sender and two receivers.
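A toy shell sketch of that split (the stage names are hypothetical, not rsync internals): one process walks the basis files and emits checksums while a second process consumes the stream as it arrives, so the two stages overlap instead of running in sequence.

```shell
# Toy model of the receiver-side split: a "generator" process emits
# per-file checksums while a separate "receiver" process consumes them.
dir=$(mktemp -d)
echo hello > "$dir/a"
echo world > "$dir/b"

out=$(
  # generator: runs in its own process, writing name + checksum
  ( cd "$dir" && for f in a b; do
        printf '%s %s\n' "$f" "$(cksum < "$f" | cut -d' ' -f1)"
    done ) |
  # receiver: a second process reading the stream as it is produced
  while read -r name sum; do
      echo "would rebuild $name from basis checksum $sum"
  done
)
echo "$out"
rm -rf "$dir"
```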
> I'm also curious why 26704 has no fd 1.
I don't know. When I tried it all 3 processes had an fd 1.
> 3 =============================================================================
> @ERROR: max connections (16) reached - try again later
>
> This occurs after just one connection is active. It behaves as if
> I had specified "max connections = 1". On another server I set it
> to 40, and it showed:
>
> @ERROR: max connections (40) reached - try again later
>
> so it obviously is parsing and keeping the value I configure, but it
> isn't using it correctly.
>
> Also, if I ^C the client, then I get this error every time until I
> restart the daemon (running in standalone daemon mode, not inetd).
> So it seems like it counts clients wrong. But I can't get more
> than 1 even right after restarting the server, so there's a little
> more to it than that somewhere.
I don't know; I've never used max connections. It could indeed be a bug.
The code looks pretty tricky. It's trying to lock pieces of the file
/var/run/rsyncd.lock in order for independent processes to coordinate.
Are you running as root (the lsof above suggests you are)? If not, you
probably need to specify another file that your daemon has access to in the
"lock file" option. Otherwise it would probably help for you to run some
straces.
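The coordination idea can be sketched with per-slot lock files (a stand-in only: the real daemon locks byte ranges of the single lock file rather than separate files, and flock(1) here substitutes for that record locking):

```shell
# Sketch of the "max connections" coordination idea -- not rsync's code.
# rsyncd locks one byte range per connection in the shared lock file;
# here each range is modelled as a separate flock(1) lock file.
max=2
dir=$(mktemp -d)

# two clients are already connected: each holds a slot's lock for a while
flock -n "$dir/slot.1" sleep 3 &
flock -n "$dir/slot.2" sleep 3 &
sleep 1   # give the holders time to acquire their locks

# a third client scans for a free slot; all are taken, so it is refused
msg=""
for slot in 1 2; do
    if flock -n "$dir/slot.$slot" true; then
        msg="got slot $slot"
        break
    fi
done
[ -n "$msg" ] || msg="@ERROR: max connections ($max) reached - try again later"
echo "$msg"
wait
rm -rf "$dir"
```

A stuck slot lock would also explain the symptom above: if a killed client's lock were never released, the daemon would keep refusing connections until restarted.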
- Dave Dykstra