[BUG] rsync 2.4.3, unexpected EOF in read_timeout, and hangs

Neil Schellenberger Fri, 30 Jun 2000 07:32:05 -0700

Hi folks,

I'm also suffering from the dreaded "unexpected EOF in read_timeout"
problem with rsync 2.4.3.  My platform is:

  SunOS 5.6 Generic_105181-10 sun4u sparc SUNW,Ultra-250
  gcc version 2.95.2 19991024 (release)
  GNU assembler version 2.9.1 (sparc-sun-solaris2.5.1), using BFD version 2.9.1
  GNU ld version 2.9.1 (with BFD 2.9.1)

So far, I haven't been able to pin down exactly what is triggering it.
I'm running the server as a permanently running rsync daemon (i.e. not
from inetd); the clients are typically run from a crontab.  This
means, I think, that neither the ssh nor the rsh comms code is being
executed, removing it from contention as the sole cause for this
problem.

It "seems" like larger mirrors have the problem more than smaller
ones, but I've had consecutive runs of various size mirrors die at
different times (as well as run to completion without problem).  The
crontab jobs "seem" to suffer the problem less often than interactive
ones.  It also seems that the more concurrent rsyncs I run, the more
often I hit the "hang" problem.  I'm not CPU, bandwidth, or memory
limited on the client, so those aren't the issue there (I don't have
access to the server to check it).  (I'm sorry that I don't have
more concrete analysis, but I'm a tad short of tuits at the
moment....)

Then I tried setting --timeout values to try to bypass the problem.
This had the interesting effect of apparently decreasing the frequency
of the EOF issue (although this could just be my imagination).  It
did, however, start to yield lots and lots of failed mirrors owing to
"io timeout after n second" errors, even when the connection and
mirroring were proceeding just fine.

While digging through the source code, I uncovered what I assume is an
unintentional feature.  It seems that --timeout will effectively limit
the overall time to transfer a file, even if that transfer is actually
proceeding without difficulty (and is just long).  Perhaps this is
intentional, but I would imagine that the --timeout option was
envisioned more as an "idle timer"?  (Otherwise, I propose that
the documentation be amended to highlight this behaviour.)

The problem stems from the fact that the client runs as two processes,
a parent managing the overall mirror, and a child which does the
actual transferring of the files.  The child is simply a fork()ed copy
of the parent and inherits the value of io_timeout from the parent.
This is used in the child to do "idle" monitoring of the connection.
Unfortunately, the parent is also (indirectly via read_int() etc.)
using io_timeout between it and the child.  So while the child is off
transferring a large file, the parent's idle timer is ticking away.
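
To make the interaction concrete, here is a minimal sketch (not
rsync's actual code) of a parent that waits on a pipe from a forked
child using the same timeout value the child inherited.  The helper
names wait_for_byte and parent_waits_for_child are made up for
illustration, and the child's "transfer" is just a sleep().  As in
the description above, a timeout of 0 is taken to mean "no timer".

```c
#include <assert.h>
#include <signal.h>
#include <sys/select.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper, not rsync's read_timeout(): wait for one byte
 * on fd behind a select() guard.  timeout_secs == 0 means "wait
 * forever".  Returns 1 on data, 0 on timeout, -1 on error or EOF. */
int wait_for_byte(int fd, int timeout_secs)
{
    fd_set fds;
    struct timeval tv;
    char c;
    int n;

    assert(timeout_secs >= 0);
    FD_ZERO(&fds);
    FD_SET(fd, &fds);
    tv.tv_sec = timeout_secs;
    tv.tv_usec = 0;

    n = select(fd + 1, &fds, NULL, NULL, timeout_secs ? &tv : NULL);
    if (n == 0)
        return 0;               /* timer expired with nothing to read */
    if (n < 0)
        return -1;
    return read(fd, &c, 1) == 1 ? 1 : -1;
}

/* Fork a child that is "busy transferring" for busy_secs and then
 * reports completion on a pipe.  The parent waits on that pipe using
 * parent_timeout.  Returns what the parent saw: 1 = child finished,
 * 0 = parent timed out first, -1 = error. */
int parent_waits_for_child(int parent_timeout, int busy_secs)
{
    int p[2], result, status;
    pid_t pid;

    if (pipe(p) < 0)
        return -1;
    pid = fork();
    if (pid < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    if (pid == 0) {             /* child: stands in for the receiver */
        close(p[0]);
        sleep(busy_secs);       /* a long but healthy transfer */
        if (write(p[1], "x", 1) != 1)
            _exit(1);
        _exit(0);
    }
    close(p[1]);
    result = wait_for_byte(p[0], parent_timeout);
    close(p[0]);
    kill(pid, SIGKILL);         /* sketch only: make sure the child is gone */
    waitpid(pid, &status, 0);
    return result;
}
```

With these definitions, parent_waits_for_child(1, 3) returns 0: the
parent gives up after one second even though the child would have
finished normally.  parent_waits_for_child(0, 1) returns 1, because
the parent's timer is disabled and only the child's own idle
monitoring would apply.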

By adding more instrumentation, I was also able to catch the EOF
problem in the act on the client side.  Using poll(2), I retrieved the
revents mask after the select() but before the read().  It was
POLLIN|POLLRDNORM; there was no POLLHUP.  Dunno exactly what this
means; perhaps the server is getting fooled into sending some zero
length packets?  Maybe the server parent/child processes are suffering
from some variant of the timeout problem (e.g. since they share
file descriptors, maybe one is writing/closing something it shouldn't
be)?
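
For anyone who wants to repeat the experiment, the instrumentation
was along the lines of the sketch below.  This is a reconstruction
under assumptions, not the code that was actually run: the function
name read_with_poll_trace and its parameters are invented here, and
it calls poll() directly instead of sitting between rsync's select()
and read().  It records the revents mask just before the read so that
a zero-length read can be matched against what poll() reported.

```c
#include <assert.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical debugging helper: ask poll(2) about fd just before
 * reading from it, then do the read.  Stores the revents mask in
 * *revents_out and returns the read() result (0 = EOF), or -1 if
 * poll() failed or nothing became ready within timeout_ms. */
ssize_t read_with_poll_trace(int fd, void *buf, size_t len,
                             int timeout_ms, short *revents_out)
{
    struct pollfd pfd;
    int n;
    ssize_t got;

    assert(revents_out != NULL);
    pfd.fd = fd;
    pfd.events = POLLIN;
#ifdef POLLRDNORM
    pfd.events |= POLLRDNORM;
#endif
    pfd.revents = 0;

    n = poll(&pfd, 1, timeout_ms);
    *revents_out = pfd.revents;
    if (n <= 0)
        return -1;              /* poll error or nothing ready in time */

    got = read(fd, buf, len);
    fprintf(stderr, "fd %d: revents=0x%x%s%s%s read=%ld\n", fd,
            (unsigned)(unsigned short)pfd.revents,
            (pfd.revents & POLLIN)  ? " POLLIN"  : "",
            (pfd.revents & POLLHUP) ? " POLLHUP" : "",
            (pfd.revents & POLLERR) ? " POLLERR" : "",
            (long)got);
    return got;
}
```

A return value of 0 with only POLLIN set in the trace is the case
described above.  Which flags accompany an end-of-file differs
between platforms and descriptor types, so the trace is only a hint
about what the peer did, not proof.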

Enough gunge, here's a proposed patch for the io_timeout issue.  Not
terribly elegant, but it seems to get the job done.  Comments
welcomed.

--- ../main.c.orig      Wed Jun 28 09:39:26 2000
+++ ../main.c   Thu Jun 29 16:18:00 2000
@@ -279,11 +279,13 @@
        int status=0;
        int recv_pipe[2];
        int error_pipe[2];
+       int io_timeout_save = -1;
        extern int preserve_hard_links;
        extern int delete_after;
        extern int recurse;
        extern int delete_mode;
        extern int remote_version;
+       extern int io_timeout;
 
        if (preserve_hard_links)
                init_hard_links(flist);
@@ -325,7 +327,7 @@
                close(recv_pipe[1]);
                io_flush();
                /* finally we go to sleep until our parent kills us
-                  with a USR2 signal. We sleepp for a short time as on
+                  with a USR2 signal. We sleep for a short time as on
                   some OSes a signal won't interrupt a sleep! */
                while (1) sleep(1);
        }
@@ -339,6 +341,9 @@
 
        io_set_error_fd(error_pipe[0]);
 
+       io_timeout_save = io_timeout;
+       io_timeout = 0;         /* child is managing timeouts */
+
        generate_files(f_out,flist,local_name,recv_pipe[0]);
 
        read_int(recv_pipe[0]);
@@ -349,6 +354,8 @@
        }
        io_flush();
 
+       io_timeout = io_timeout_save;
+
        kill(pid, SIGUSR2);
        wait_process(pid, &status);
        return status;

Regards,
Neil

-- 
Neil Schellenberger             | Voice : (613) 599-2300 ext. 8445
CrossKeys Systems Corporation   | Fax   : (613) 599-2330
350 Terry Fox Drive             | E-Mail: [EMAIL PROTECTED]
Kanata, Ont., Canada, K2K 2W5   | URL   : http://www.crosskeys.com/
    + Greg Moore (1975-1999), Gentleman racer and great Canadian +