Thanks for committing the crash-bug fix, Martin.

I'm curious if you have any comments on my fixes for the 2 protocol-hangs
I discovered?  The most common problem that people encounter is the bug I
fixed second, which is fixed by a very simple patch:

Index: main.c
--- main.c      29 May 2001 14:37:54 -0000      1.127
+++ main.c      22 Jun 2001 06:25:47 -0000
@@ -504,15 +504,15 @@
                        rprintf(FINFO,"file list sent\n");

                send_files(flist,f_out,f_in);
+               if (remote_version >= 24) {
+                       /* final goodbye message */
+                       read_int(f_in);
+               }
                if (pid != -1) {
                        if (verbose > 3)
                                rprintf(FINFO,"client_run waiting on %d\n",pid);
                        io_flush();
                        wait_process(pid, &status);
-               }
-               if (remote_version >= 24) {
-                       /* final goodbye message */
-                       read_int(f_in);
                }
                report(-1);
                exit_cleanup(status);

This code only gets executed when we are the local receiver process that
has started up a "client" (perhaps via remote shell, perhaps locally)
that consists of the generator/receiver process pair.  What happens is
that when we (the sender) have processed all the files and we think that
there's nothing left to do, we flush our output (to the receiver) and
wait for the client to exit.  However, there can still be a stream of
messages coming in from the generator process that conveys errors and/or
verbose output.  The current code does not read the data from the
generator until after it waits around for the client pid.  If the data
coming from the generator is large enough, it will block and rsync will
hang (and perhaps die of an EOF error if the generator or ssh timeout).

I believe that my simple patch is the right way to solve this problem.
It causes the sender to enter a read loop waiting for the -1 from the
generator program (which also calls io_flush() as a side-effect).  While
it is waiting, the code continues to read all the error/verbose output
that the generator sends, preventing a hang.  After the generator sends
us the -1, we know that everything is indeed done, and we can then call
io_flush() again and wait for the client pid.

Since this code is only called in the case of a local sender process, it
appears to handle the protocol properly (not needing to flush any more
data to the receiver while it waits for the -1 to arrive), so I think it
is the right fix.

The first hang bug I fixed is much less likely to occur, but is possible
when the user is sending a very large quantity of files and the receiver
fills up the "redo" pipe to the generator before the generator finishes
sending all the first-phase data.  I think that the fellow who was
having trouble sending a very large number of files probably ran into
this.

The fix for this is much more complex because the generator process is
busily cramming data down the connection to the sender (which often
fills up temporarily) but the generator must also keep the incoming
"redo" data from the receiver unblocked, even in the middle of the write
routines in io.c.  The majority of my no-hang patch fixes this case, and
I have tested it quite a bit (since my --move-files patch makes much
more intensive use of this receiver -> generator data path).  I'm
currently using my no-hang patches in a production environment (that
also required the --move-files option), and things appear to be working
well.

If you have any questions or issues with these changes, please let me
know.  The full no-hang patch is available here:

    http://www.clari.net/~wayne/rsync-nohang.patch

Don't forget to run "make proto" after applying this.

Once you're done with that, feel free to check out my --move-files
patch, if you like:

    http://www.clari.net/~wayne/rsync-move-files.patch

..wayne..


Reply via email to