(I am adding the rsync mailing list to the Cc)

> Wout van Albada <[EMAIL PROTECTED]> wrote on the ssh mailing list:
> > 
> > Hi,
> > 
> > I think I encountered a serious bug in ssh 1.2.27. There seems to be
> > a race condition where the ssh daemon (sshd) drops data when it has
> > to send it over a slow line. I sent this bug report to
> > [EMAIL PROTECTED]
> > and [EMAIL PROTECTED] on 27/03 but have heard nothing from either so
> > far.
> > 
> > I'll try to clarify what happens:
> > 
> > There are two machines, server and client. Both machines run Solaris.
> > The client makes an ssh connection to the server to download a file:
> > 
> > server% ls -l /tmp/DATA
> > -rw-r--r--   1 wout  staff     200000 Mar 23 11:20 DATA
> > 
> > client% ssh server cat /tmp/DATA > /tmp/DATA
> > client% ls -l /tmp/DATA
> > -rw-r--r--   1 wout  staff     194560 Mar 24 17:10 /tmp/DATA
> > 
> > This would copy a file '/tmp/DATA' from server to /tmp/DATA on client.
> > In this particular case file DATA was 200000 bytes. The size has
> > to be larger than the buffers used inside sshd.
> > 
> > When the command is run, most data is sent over the line as it should
> > be. However, when the 'cat' process dies, sshd receives a SIGCHLD and
> > then fails to read the data left in the pipe to the 'cat' program.
> > 
> > To be more precise, sshd only reads the data left in the pipe to 'cat'
> > if it has space for it in the outgoing buffer (the buffer that is used
> > to store data going back to the client).
> > 
> > So the following happens (all in serverloop.c):
> > 
> >  1. For a while sshd reads data from the 'cat' command. This data is
> >     transmitted to the client, where it is put in /tmp/DATA.
> >  2. cat writes the final data to the pipe to sshd and exits.
> >  3. sshd receives a SIGCHLD and sets child_terminated and
> >     child_just_terminated to 1.
> >  4. sshd falls out of the select() (line 413) it was in
> >     (it usually receives the signal during the select() call).
> >     select() returns -1 because it was interrupted by the signal.
> >  5. sshd empties readset and writeset (lines 415-422 serverloop.c).
> >  6. The if statements on lines 426 and 446 fail.
> >  7. sshd does its usual stuff and then calls
> >     wait_until_can_do_something().
> >  8. The call to packet_not_very_much_data_to_write() on line 353
> >     returns false (because the outgoing buffer contains more than
> >     16384 bytes). This causes fdout and fderr not to be set in the
> >     readset file descriptor set (lines 355-358).
> >  9. select() on line 413 is called again and returns 0 (due to the
> >     slow network connection to the client). This time the if statement
> >     on line 426 succeeds (child_just_terminated has been set to 0
> >     earlier).
> > 10. Descriptors fdout, fderr and fdin are closed (lines 432-442),
> >     so the data still available on fdout is never read.
> > 
> > The change I made to fix this is in a patch (diff on original
> > serverloop.c and modified serverloop.c) you will find attached
> > to this mail. It changes lines 432-439. Instead of blindly closing
> > the fdout and fderr descriptors when select() returns 0, it only
> > closes them if the fdout_eof and fderr_eof flags have been set,
> > respectively. The bug was that the code in lines 426-443 assumed
> > that select() always provides information on fdout and fderr, which
> > is not the case as they had not been set in the readset.
> > 
> > For completeness, I also attach the 'sshd -d' output for a faulty
> > session (original sshd 1.2.27, data is lost) and output for a session
> > after having applied my patch.
> > 
> > Please let me know what you make of this.
> > 
> > Wout van Albada
> > Software Engineer
> > 
> > [EMAIL PROTECTED]
> > 
> > --- serverloop.c.ORIG   Sun Mar 26 13:20:14 2000
> > +++ serverloop.c        Sun Mar 26 13:25:15 2000
> > @@ -429,14 +429,14 @@
> >        if (cleanup_context)
> >          pty_cleanup_proc(cleanup_context);
> > 
> > -      if (fdout != -1)
> > +      if (fdout != -1 && fdout_eof) {
> >          close(fdout);
> > -      fdout = -1;
> > -      fdout_eof = 1;
> > -      if (fderr != -1)
> > +       fdout = -1;
> > +      }
> > +      if (fderr != -1 && fderr_eof) {
> >          close(fderr);
> > -      fderr = -1;
> > -      fderr_eof = 1;
> > +        fderr = -1;
> > +      }
> >        if (fdin != -1)
> >          close(fdin);
> >        fdin = -1;
...
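
To make the failure mode in steps 8-10 concrete, here is a minimal sketch
in C. It is not code from serverloop.c: poll_child_output(), close_if_eof()
and drain_pipe() are hypothetical names invented for illustration, and the
zero timeout is an assumption to keep the example short. It only shows that
select() returning 0 says nothing about a descriptor that was left out of
the readset, and that a close guarded by an EOF flag (the idea in Wout's
patch) does not discard the data still in the pipe.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical: poll the child's stdout the way step 8 describes.
 * fdout is only added to the readset when the outgoing buffer has room,
 * so a return value of 0 does NOT mean "no data pending on fdout". */
int poll_child_output(int fdout, int buffer_has_room)
{
    fd_set readset;
    struct timeval tv = {0, 0};   /* assumption: do not wait at all */
    int maxfd = -1;

    FD_ZERO(&readset);
    if (buffer_has_room && fdout != -1) {
        FD_SET(fdout, &readset);
        maxfd = fdout;
    }
    return select(maxfd + 1, &readset, NULL, NULL, &tv);
}

/* Hypothetical: close *fd only once EOF has actually been seen on it.
 * Returns 1 if the descriptor was closed, 0 if it was left open. */
int close_if_eof(int *fd, int eof_seen)
{
    assert(fd != NULL);
    if (*fd != -1 && eof_seen) {
        close(*fd);
        *fd = -1;
        return 1;
    }
    return 0;
}

/* Hypothetical: read until EOF, returning the number of bytes read.
 * *eof_seen is set only when read() reports end of file. */
ssize_t drain_pipe(int fd, int *eof_seen)
{
    char buf[4096];
    ssize_t total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            total += n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;             /* interrupted, e.g. by SIGCHLD: retry */
        if (n == 0)
            *eof_seen = 1;
        break;
    }
    return total;
}
```

With a pipe that still holds the child's last bytes, poll_child_output(fd, 0)
returns 0 even though a read would succeed. An unconditional close at that
point loses those bytes, while close_if_eof() leaves the descriptor open
until drain_pipe() has seen end of file.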

On Mon, Jun 12, 2000 at 08:02:43AM -0700, Rick Moen wrote on ssh mailing list:
> begin  Ville Herva quotation:
> 
> > At least rsync-2.4.x has known problems when run over an ssh pipe. See rsync
> > mailing list archive [http://rsync.samba.org/listproc/rsync/] for details.
> 
> It's a select() deadlock.  Ton Hospel posted to the GCC mailing list a 
> GPLed wrapper for SSH that fixes it.  I keep a copy at 
> http://linuxmafia.com/pub/linux/security/ssh-rsync-wrapper

Here's the status of the most recent releases of rsync:
    2.4.3 - sets O_NONBLOCK on stdin and stdout.  There haven't been
        reports that it still hangs ssh, but there have been numerous
        reports that it gets rsync protocol errors ("unexpected tag" is the
        one most often reported).  I wonder if ssh can't completely handle
        being in non-blocking mode, and I wonder if Wout's patch would solve
        those problems. 
    2.4.2 - similar to 2.4.3 except that it didn't work with rsh, so it was
        short-lived.
    2.4.1 - switched to using socketpairs instead of pipes and removed the
        complicated buffering scheme that worked around ssh hangs.
        Numerous ssh hangs were reported, at least on Solaris.
    2.4.0 - similar to 2.4.1 except it had a serious bug, so it was
        very short-lived.
    2.3.2 - uses pipes, not socketpairs, and has complicated buffering
        scheme that seems to work pretty well to avoid ssh hangs.  Still
        the preferred version for most people.
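
For readers unfamiliar with what "sets O_NONBLOCK" involves, here is a
minimal sketch in C. It is not taken from the rsync source:
set_nonblocking() and write_all() are hypothetical helpers written for this
mail. The point is that once a descriptor is non-blocking, every writer on
it has to cope with short writes and EAGAIN, which may be what a program on
the other end of a shared descriptor is not prepared for.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: turn on O_NONBLOCK for fd.
 * Returns 0 on success, -1 on error (errno set by fcntl). */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags == -1)
        return -1;
    if (flags & O_NONBLOCK)
        return 0;                 /* already non-blocking */
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Hypothetical helper: write all len bytes to a non-blocking fd.
 * When the descriptor would block, wait in select() until it is
 * writable again instead of giving up or spinning.
 * Returns len on success, -1 on error. */
ssize_t write_all(int fd, const char *buf, size_t len)
{
    size_t done = 0;

    assert(buf != NULL);
    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);

        if (n > 0) {
            done += (size_t)n;    /* possibly a short write: keep going */
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            fd_set wfds;

            FD_ZERO(&wfds);
            FD_SET(fd, &wfds);
            if (select(fd + 1, NULL, &wfds, NULL, NULL) < 0 && errno != EINTR)
                return -1;
            continue;
        }
        return -1;                /* real error, or write() returned 0 */
    }
    return (ssize_t)done;
}
```

A caller would run set_nonblocking() once on each descriptor and then send
all output through write_all(). Code that instead assumes write() either
blocks or transfers everything can drop or reorder data when it hits
EAGAIN.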
        

Wout, what version of rsync were you using when you developed your patch?
My guess is that it would be most necessary for 2.4.1, and it sounds like
it will do a better job than turning on O_NONBLOCK.

I also wonder if OpenSSH has the same problem.

- Dave Dykstra
