Nathan Moore wrote:
Hi,
I've been using rsync as a primitive backup tool on a small cluster of SL45
and SL5 machines. Lately, there is an intermittent error when I run rsync
to back up a large (20GB) directory of mixed file types. The error isn't a
loud failure, but rather that the file transfer simply stalls and the node
the files are being copied from locks up (the lockup is complete - the
"server" node is unavailable via NIS, ssh, or console login - it has to be
power-cycled).
Is there a known bug in rsync? Is there a way to troubleshoot my "server"
machine?
Volume of data isn't the only measure of "large" - the number of files
is important too.
Some time ago (Debian Woody + RHL 7.3) I had a problem with rsync timing
out when backing up (most of) my Woody filesystem over ADSL.
I took up the issue on the rsync list, where the folk were very helpful.
The thread was "reliability and robustness problems", from about Oct 04.
By default, rsync does not time out, so one really needs to specify a
timeout value. Then, having set one, I found it timing out too readily.
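For instance (only a sketch - the paths, host name and 600-second value
are placeholders; --timeout is in seconds):

    rsync -a --timeout=600 /home/ backuphost:/backup/home/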
It also used an enormous amount of RAM: it's the only program I know
that can cause Linux to use swap (many times real RAM) and not cause
thrashing.
As best I can figure it, rsync was building a filtered view of the
target files area and (no doubt) the source files area, and neither side
talks to the other while this is happening. I think this was taking an
hour or so, but this _was_ a few years ago.
The rsync gurus opined that it was better to back up this way than to
back up a single file, but my experience suggests otherwise; I now create
a filtered filesystem image and use rsync to update that.
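Roughly along these lines (a sketch of one way to do it only - the
exclude file, paths and host name are invented, and the archive's own
path should be in the exclude list):

    # build a single, uncompressed archive of the filtered filesystem
    tar --one-file-system --exclude-from=/etc/backup.exclude \
        -cf /var/tmp/rootfs.tar /
    # let rsync's delta transfer keep the remote copy of that one file current
    rsync --timeout=600 /var/tmp/rootfs.tar backuphost:/backup/rootfs.tar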
While rsync is building its lists of what to transfer, systems at both
ends can get rather busy, particularly if something else is running
interference on the use of RAM.
This, of course, can cause a bit of distress to both computers, but if
they really are locked up as opposed to being seriously overtaxed, then
you have either a kernel bug or a hardware problem. Nothing rsync can do
should cause the system to actually lock up.
I think I would start by directing syslog (kernel messages at least) to
another box, or to a printer on the parallel port. Look for signs of the
OOM killer at work.
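With the stock sysklogd on an SL box (an assumption on my part - adjust
for whatever syslog daemon you actually run; "loghost" stands in for a
machine that stays up), that is roughly:

    # on the node that locks up: forward kernel messages off-box
    echo 'kern.*    @loghost' >> /etc/syslog.conf
    /etc/init.d/syslog restart
    # on loghost: run syslogd with -r so it accepts remote messages
    # (SYSLOGD_OPTIONS="-r" in /etc/sysconfig/syslog, then restart it)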
You might also do something as crude as adapting and running this:
while :
do
    ps xar | logger -i    # log the currently running processes via syslog
    sleep 1m              # once a minute
done
while making sure the logged messages go to Somewhere Else.
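logger writes to the "user" facility by default, so forward that as well
(same assumptions as above - sysklogd, with "loghost" as a placeholder):

    user.*    @loghost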
--
Cheers
John
Please do not reply off-list