Ah ha...
Looks like I may give rsync-2.4.3 a try and see how it goes over
ssh. I am one of the people who was having problems with clients and
servers both using rsync-2.4.1 over ssh-1.2.27 (it was hanging; we are
using various revisions of FreeBSD and BSDI).
Going back to rsync-2.3.1 on the clients cleared up this problem,
with rsync-2.4.1 still being run on the server...
I'll give rsync-2.4.3 a try on both the client and server side, with
ssh-1.2.27 and see how it goes. I'll post a followup on my results
whenever I'm done to indicate whether or not it worked.
-Chris Tracy
(Telerama Internet -/- Network Administrator -/- www.telerama.com)
On Wed, 14 Jun 2000, Dave Dykstra wrote:
> (I am adding the rsync mailing list to the Cc)
>
> > Wout van Albada <[EMAIL PROTECTED]> wrote on the ssh mailing list:
> > >
> > > Hi,
> > >
> > > I think I encountered a serious bug in ssh 1.2.27. There seems to be
> > > a race condition where the ssh daemon (sshd) drops data when it has
> > > to send it over a slow line. I sent this bug report to
> > > [EMAIL PROTECTED]
> > > and [EMAIL PROTECTED] on 27/03 but have heard nothing from either so
> > > far.
> > >
> > > I'll try to clarify what happens:
> > >
> > > There are two machines, server and client. Both machines run Solaris.
> > > The client makes an ssh connection to the server to download a file:
> > >
> > > server% ls -l /tmp/DATA
> > > -rw-r--r-- 1 wout staff 200000 Mar 23 11:20 DATA
> > >
> > > client% ssh server cat /tmp/DATA > /tmp/DATA
> > > client% ls -l /tmp/DATA
> > > -rw-r--r-- 1 wout staff 194560 Mar 24 17:10 /tmp/DATA
> > >
> > > This would copy a file '/tmp/DATA' from server to /tmp/DATA on client.
> > > In this particular case file DATA was 200000 bytes. The size has
> > > to be larger than the buffers used inside sshd.
> > >
> > > When the command is run, most data is sent over the line as it should
> > > be. However, when the 'cat' process dies, sshd receives a SIGCHLD and
> > > then fails to read the data left in the pipe to the 'cat' program.
> > >
> > > To be more precise, sshd only reads the data left in the pipe to 'cat'
> > > if it has space for it in the outgoing buffer (the buffer that is used
> > > to store data going back to the client).
> > >
> > > So the following happens (all in serverloop.c):
> > >
> > > 1. For a while sshd reads data from the 'cat' command. This data is
> > > transmitted to the client, where it is put in /tmp/DATA.
> > > 2. cat writes the final data to the pipe to sshd and exits.
> > > 3. sshd receives a SIGCHLD and sets child_terminated and
> > > child_just_terminated to 1.
> > > 4. sshd falls out of the select() (line 413) it was in
> > > (it usually receives the signal during the select() call).
> > > select() returns -1 because it was interrupted by the signal.
> > > 5. sshd empties readset and writeset (lines 415-422 serverloop.c).
> > > 6. The if statements on lines 426 and 446 fail.
> > > 7. sshd does its usual stuff and then calls
> > > wait_until_can_do_something().
> > > 8. The call to packet_not_very_much_data_to_write() on line 353
> > > returns false (because the outgoing buffer contains more than
> > > 16384 bytes). This causes fdout and fderr not to be set in the
> > > readset file descriptor set (lines 355-358).
> > > 9. select() on line 413 returns 0 again (due to slow network
> > > connection to client). This time the if statement on line
> > > 426 succeeds (child_just_terminated has been set to 0 earlier).
> > > 10. Descriptors fdout, fderr and fdin are closed (lines 432-442),
> > > so the data still available on fdout is never read.
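
The sequence above hinges on one fact: a child that has exited can still have unread data sitting in its pipe. Here is a minimal sketch of that, not taken from serverloop.c. The names `drain_until_eof` and `bytes_left_after_child_exit` are made up for illustration, and the byte count is assumed to be small enough to fit in the pipe buffer so the child can exit before anything is read.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Read fd until read() reports EOF. Returns bytes read, or -1 on error. */
static long drain_until_eof(int fd)
{
    char buf[4096];
    long total = 0;

    assert(fd >= 0);
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            total += n;
            continue;
        }
        if (n == 0)
            return total;          /* EOF: only now is the pipe empty */
        if (errno == EINTR)
            continue;              /* interrupted by a signal, retry */
        return -1;
    }
}

/*
 * Hypothetical demo: the child writes nbytes into a pipe and exits.
 * The parent reaps the child FIRST, then counts what is still in the
 * pipe. nbytes must fit in the pipe buffer, otherwise the child blocks
 * in write() and the waitpid() below never returns.
 */
static long bytes_left_after_child_exit(size_t nbytes)
{
    int p[2];
    int status;
    long left;
    pid_t pid;

    if (pipe(p) != 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    if (pid == 0) {                /* child: plays the role of 'cat' */
        char c = 'x';
        size_t i;
        close(p[0]);
        for (i = 0; i < nbytes; i++)
            if (write(p[1], &c, 1) != 1)
                _exit(1);
        _exit(0);
    }

    close(p[1]);                   /* parent keeps only the read end */
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) {
            close(p[0]);
            return -1;
        }
    }
    left = drain_until_eof(p[0]);  /* child is gone, data is not */
    close(p[0]);
    return left;
}
```

Calling `bytes_left_after_child_exit(2048)` should report all 2048 bytes as still readable after the child has been reaped, which is the data the unpatched close path would throw away.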
> > >
> > > The change I made to fix this is in a patch (diff on original
> > > serverloop.c and modified serverloop.c) you will find attached
> > > to this mail. It changes lines 432-439. Instead of blindly closing
> > > the fdout and fderr descriptors when select() returns 0, it only
> > > closes them if the fdout_eof and fderr_eof flags have been set,
> > > respectively. The bug was that the code in lines 426-443 assumed
> > > that select() always provides information on fdout and fderr, which
> > > is not the case as they had not been set in the readset.
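
The point about the readset can be checked in isolation. The sketch below is an illustration only, with hypothetical helper names `poll_fd` and `select_result_with_pending_data` and an arbitrary 10 ms timeout. It puts one byte in a pipe and then calls select() either with or without the read end in the readset.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Call select() with a timeout. If watch is zero, the readset is left
 * empty, which mirrors what happens when fdout is not added to the set.
 */
static int poll_fd(int fd, int watch, long timeout_usec)
{
    fd_set readset;
    struct timeval tv;

    assert(fd >= 0 && fd < FD_SETSIZE);
    FD_ZERO(&readset);
    if (watch)
        FD_SET(fd, &readset);
    tv.tv_sec = timeout_usec / 1000000;
    tv.tv_usec = timeout_usec % 1000000;
    return select(fd + 1, &readset, NULL, NULL, &tv);
}

/* Write one byte into a pipe, then ask select() about the read end. */
static int select_result_with_pending_data(int watch)
{
    int p[2];
    int r;

    if (pipe(p) != 0)
        return -1;
    if (write(p[1], "x", 1) != 1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    r = poll_fd(p[0], watch, 10000);   /* 10 ms timeout (arbitrary) */
    close(p[0]);
    close(p[1]);
    return r;
}
```

With the descriptor left out of the readset, select() times out and returns 0 even though a byte is waiting, so a return value of 0 says nothing about descriptors that were never watched.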
> > >
> > > For completeness, I also attach the 'sshd -d' output for a faulty
> > > session (original sshd 1.2.27, data is lost) and output for a session
> > > after having applied my patch.
> > >
> > > Please let me know what you make of this.
> > >
> > > Wout van Albada
> > > Software Engineer
> > >
> > > [EMAIL PROTECTED]
> > >
> > > --- serverloop.c.ORIG Sun Mar 26 13:20:14 2000
> > > +++ serverloop.c Sun Mar 26 13:25:15 2000
> > > @@ -429,14 +429,14 @@
> > >        if (cleanup_context)
> > >          pty_cleanup_proc(cleanup_context);
> > >
> > > -      if (fdout != -1)
> > > +      if (fdout != -1 && fdout_eof) {
> > >          close(fdout);
> > > -      fdout = -1;
> > > -      fdout_eof = 1;
> > > -      if (fderr != -1)
> > > +        fdout = -1;
> > > +      }
> > > +      if (fderr != -1 && fderr_eof) {
> > >          close(fderr);
> > > -      fderr = -1;
> > > -      fderr_eof = 1;
> > > +        fderr = -1;
> > > +      }
> > >        if (fdin != -1)
> > >          close(fdin);
> > >        fdin = -1;
> ..
>
> On Mon, Jun 12, 2000 at 08:02:43AM -0700, Rick Moen wrote on ssh mailing list:
> > begin Ville Herva quotation:
> >
> > > At least rsync-2.4.x has known problems when run over an ssh pipe. See the
> > > rsync mailing list archive [http://rsync.samba.org/listproc/rsync/] for details.
> >
> > It's a select() deadlock. Ton Hospel posted to the GCC mailing list a
> > GPLed wrapper for SSH that fixes it. I keep a copy at
> > http://linuxmafia.com/pub/linux/security/ssh-rsync-wrapper
>
> Here's the status of the most recent releases of rsync:
> 2.4.3 - sets O_NONBLOCK on stdin and stdout. There haven't been
> reports that it still hangs ssh, but there have been numerous
> reports that it gets rsync protocol errors ("unexpected tag" is the
> one most often reported). I wonder if ssh can't completely handle
> being in non-blocking mode, and I wonder if Wout's patch would solve
> those problems.
> 2.4.2 - similar to 2.4.3 except that it didn't work with rsh, so it was
> short-lived.
> 2.4.1 - switched to using socketpairs instead of pipes and removed
> complicated buffering scheme that worked around ssh hangs.
> Numerous hangs of ssh on Solaris at least were reported.
> 2.4.0 - similar to 2.4.1 except it had a serious bug, so it was
> very short-lived.
> 2.3.2 - uses pipes, not socketpairs, and has complicated buffering
> scheme that seems to work pretty well to avoid ssh hangs. Still
> the preferred version for most people.
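
For reference, this is roughly what "sets O_NONBLOCK" means at the system-call level. It is a generic POSIX sketch and not rsync's actual code; `set_nonblocking` and `empty_read_would_block` are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Add O_NONBLOCK to the descriptor's file status flags. 0 on success. */
static int set_nonblocking(int fd)
{
    int flags;

    assert(fd >= 0);
    flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/*
 * Returns 1 if reading an empty non-blocking pipe fails immediately
 * with EAGAIN/EWOULDBLOCK instead of blocking, 0 if not, -1 on error.
 */
static int empty_read_would_block(void)
{
    int p[2];
    char c;
    ssize_t n;
    int result;

    if (pipe(p) != 0)
        return -1;
    if (set_nonblocking(p[0]) != 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    n = read(p[0], &c, 1);         /* nothing has been written yet */
    result = (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK));
    close(p[0]);
    close(p[1]);
    return result;
}
```

File status flags such as O_NONBLOCK belong to the open file description, so a process that inherits the same description sees the flag too. A program on the other end that assumes blocking reads and writes would then get EAGAIN where it expects to wait, which may be related to the protocol errors mentioned for 2.4.3, though that is a guess.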
>
>
> Wout, what version of rsync were you using when you developed your patch?
> My guess is that it would be most necessary for 2.4.1, and it sounds like
> it will do a better job than turning on O_NONBLOCK.
>
> I also wonder if OpenSSH has the same problem.
>
> - Dave Dykstra
>