Nathan Moore wrote:
Hi,

I've been using rsync as a primitive backup tool on a small cluster of SL45
and SL5 machines.  Lately, there is an intermittent error when I run rsync
to back up a large (20GB) directory of mixed file types.  The error isn't a
loud failure, but rather just that the file transfer stalls and the node the
files are being copied from locks up (the lockup is complete - the "server"
node is unavailable via NIS, ssh, or console login - it has to be
power-cycled).

Is there a known bug in rsync?  Is there a way to troubleshoot my "server"
machine?

Volume of data isn't the only measure of "large" - the number of files is important too.
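
To see which kind of "large" you are dealing with, a quick count helps. This is only a rough sketch: count_tree is a name I made up, and it will miscount if any file names contain newlines.

```shell
#!/bin/sh
# count_tree DIR: print the number of regular files under DIR.
# (count_tree is a hypothetical helper name, not part of rsync.)
count_tree() {
    # find prints one path per line; wc -l counts the lines;
    # tr strips the padding some wc implementations add.
    find "$1" -type f | wc -l | tr -d ' '
}
```

Run it as count_tree /path/to/source, and compare with du -sh on the same directory to get the volume alongside the file count.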

Some time ago (Debian Woody + RHL 7.3) I had a problem with rsync timing out when backing up (most of) my Woody filesystem over ADSL.

I took up the issue on the rsync list where the folk were very helpful. The thread was "reliability and robustness problems" about Oct 04.

By default, rsync does not time out, so one really needs to specify a timeout value.

Then, once I had set one, I found it timing out too readily.
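
For what it's worth, setting the timeout looks something like the sketch below. The wrapper name, the RSYNC override variable and the 600-second default are all my own placeholders, not recommendations; --timeout=SECONDS itself is a standard rsync option.

```shell
#!/bin/sh
# backup_with_timeout SRC DEST [SECONDS]
# Hypothetical wrapper: run rsync with an explicit I/O timeout.
# RSYNC may be set to use a different rsync binary (an assumption
# added here purely for convenience).
backup_with_timeout() {
    src=$1
    dest=$2
    secs=${3:-600}    # placeholder default; tune for your link and tree
    # --timeout makes rsync give up if no data moves for $secs seconds.
    ${RSYNC:-rsync} -a --timeout="$secs" "$src" "$dest"
}
```

Given the experience above, err on the generous side with the value: the list-building phase can be long and quiet on a big tree. rsync exits with status 30 when it gives up on a data timeout, so a calling script can tell that case apart from other failures.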

It also used an enormous amount of RAM: it's the only program I know that can cause Linux to use swap (many times the real RAM) without causing thrashing.

As best I can figure it, rsync was building a filtered view of the target files area and (no doubt) the source files area, and neither side talks to the other while this is happening. I think this was taking an hour or so, but this _was_ a few years ago.

The rsync gurus opined that it was better to back up this way than to back up a single file, but my experience suggests otherwise; I now create a filtered filesystem image and use rsync to update that.

While rsync is building its lists of what to transfer, systems at both ends can get rather busy, particularly if something else is competing for RAM.


This, of course, can cause a bit of distress to both computers, but if they really are locked up as opposed to being seriously overtaxed, then you have either a kernel bug or a hardware problem. Nothing rsync can do should cause the system to actually lock up.

I think I would start by directing syslog (kernel messages at least) to another box, or to a printer on the parallel port. Look for signs of the OOM killer at work.
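
Once the messages are landing somewhere that survives the lockup, something like this will pick out the interesting lines. It is a sketch only: find_oom is a made-up name, and the patterns are the usual kernel wordings, which may differ on your kernel.

```shell
#!/bin/sh
# find_oom LOGFILE: print lines that look like OOM-killer activity.
# (find_oom is a hypothetical helper name; adjust the patterns if
# your kernel words its messages differently.)
find_oom() {
    grep -i -E 'oom-killer|out of memory' "$1"
}
```

With the stock syslogd, forwarding kernel messages is a line such as "kern.* @loghost" in /etc/syslog.conf on the machine that locks up, where loghost is a placeholder for whichever box collects the logs; the receiving syslogd has to be started with -r to accept remote messages. Then run find_oom against the log file on the collecting box.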

You might also do something as crude as adapting and running this:
 while :
  do
    ps xar | logger -i
    sleep 1m
  done
while making sure the logged messages go to Somewhere Else.



--

Cheers
John


Please do not reply off-list
