Re: Linux 2.2.16 through 2.2.18preX TCP hang bug triggered by rsync

2001-01-26 Thread Dave Dykstra

Replying to Alexey's message from the mailing list archive:

> Hello! 
> 
> I take my words back. Manfred is right, this requirement is not a MUST. 
> 
> Real problem is much worse, and it is wholly on the shame of solaris. 
> Tcpdump shows at least two different bugs there. 
> 
>   2060 16:31:42.879337 eth0 < dynamic.ih.lucent.com.39406 > static.8664: . 67580:67580(0) ack 1582261 win 1460 (DF)
>   2061 16:31:42.907940 eth0 > static.8664 > dynamic.ih.lucent.com.39406: . 1583721:1583721(0) ack 67580 win 1460 (DF)
> 
> All is OK until now. Solaris's state should be: 
> 
> SND.NXT=SND.UNA=67580 
> SND.WND=1460 
> RCV.NXT=1582261 
> 
>   2062 16:31:42.908620 eth0 < dynamic.ih.lucent.com.39406 > static.8664: . 67580:67581(1) ack 1583721 win 0 (DF)
> 
> Solaris sends one byte. 
> 
> SND.NXT++ 
> RCV.NXT=1583721 
> 
>   2063 16:31:43.098761 eth0 > static.8664 > dynamic.ih.lucent.com.39406: . 1583721:1583721(0) ack 67581 win 1460 (DF)
> 
> We ACK it. 
> 
>   2064 16:31:43.100993 eth0 < dynamic.ih.lucent.com.39406 > static.8664: P 67581:68456(875) ack 1583721 win 0 (DF)
>   2065 16:31:43.101524 eth0 < dynamic.ih.lucent.com.39406 > static.8664: P 68456:69041(585) ack 1583721 win 0 (DF)
> 
> Solaris sends two segments, filling all the window. 
> 
> SND.NXT=69041 
> 
>   2066 16:31:43.108759 eth0 > static.8664 > dynamic.ih.lucent.com.39406: . 1583720:1583720(0) ack 69041 win 0 (DF)
> 
> We send zero window probe. SEG.SEQ=1583720. 
> 
> Solaris accepts ACK from it!!! (bug #1) But does not accept window. 


Why is it a bug to accept the ACK from it?  RFC 793, page 69, says:

    If the RCV.WND is zero, no segments will be acceptable, but
    special allowance should be made to accept valid ACKs, URGs and
    RSTs.

Why shouldn't this be considered a valid ACK?
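
For readers following along, here is a minimal Python sketch of the two
RFC 793 checks at issue: the ACK acceptability test
(SND.UNA < SEG.ACK =< SND.NXT) and the send-window update test based on
SND.WL1 and SND.WL2.  The SND.WL1/SND.WL2 values are assumptions inferred
from packet 2063 (the trace does not show them), and the helper names are
made up for illustration.  It says nothing about how Solaris is actually
implemented; it only shows that applying these two rules literally to
packet 2066 would accept the ACK but not the window.

```python
# Sketch of the RFC 793 ACK and window-update checks in 32-bit sequence space.
MOD = 2 ** 32

def seq_lt(a, b):
    """True if a strictly precedes b in 32-bit modular sequence space."""
    return a != b and ((b - a) % MOD) < 2 ** 31

def seq_leq(a, b):
    """True if a equals or precedes b in 32-bit modular sequence space."""
    return a == b or seq_lt(a, b)

def ack_is_acceptable(snd_una, seg_ack, snd_nxt):
    # RFC 793: SND.UNA < SEG.ACK =< SND.NXT
    return seq_lt(snd_una, seg_ack) and seq_leq(seg_ack, snd_nxt)

def window_should_update(snd_wl1, snd_wl2, seg_seq, seg_ack):
    # RFC 793: SND.WL1 < SEG.SEQ, or SND.WL1 = SEG.SEQ and SND.WL2 =< SEG.ACK
    return seq_lt(snd_wl1, seg_seq) or (
        snd_wl1 == seg_seq and seq_leq(snd_wl2, seg_ack)
    )

# Solaris sender state just before packet 2066, per the quoted analysis.
snd_una, snd_nxt = 67581, 69041
# ASSUMPTION: last window update came from packet 2063 (seq 1583721, ack 67581).
snd_wl1, snd_wl2 = 1583721, 67581

# Packet 2066: the Linux zero window probe.
seg_seq, seg_ack = 1583720, 69041

ack_ok = ack_is_acceptable(snd_una, seg_ack, snd_nxt)
win_ok = window_should_update(snd_wl1, snd_wl2, seg_seq, seg_ack)
print("ACK acceptable:", ack_ok)
print("window updated:", win_ok)
```

Under those assumed values the ACK test passes and the window test fails,
which matches the "accepts ACK but does not accept window" description in
the quoted text.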


> So, now it thinks that SND.UNA=SND.NXT=69041 
> SND.WND=1460
> 
> State is corrupted. 
> 
> This is hard bug. But it is still not fatal. Actually, such corruptions 
> (but by different reasons) are common with stacks, which borrowed code 
> from BSD. Look into tcp-impl, Subj: "Send window update algorithm ..." 
> They are recoverable, provided stack is sane. 
> 
>   2067 16:31:43.110623 eth0 < dynamic.ih.lucent.com.39406 > static.8664: P 69041:69628(587) ack 1583721 win 0 (DF)
> 
> Solaris send some crap out of window, because of corrupted state. 
> No problems. 
> 
>   2068 16:31:43.110679 eth0 > static.8664 > dynamic.ih.lucent.com.39406: . 1583721:1583721(0) ack 69041 win 0 (DF)
> 
> We tell "No pasaran", of course. 
> 
> According to rules, Solaris must shrink window now. 
> This is the only way to recover corrupted state. 
> 
>   2069 16:31:43.111641 eth0 < dynamic.ih.lucent.com.39406 > static.8664: P 69628:70501(873) ack 1583721 win 0 (DF)
> 
> It does not. And this is point after which recovery is impossible. 
> Fatal bug#2. 
> 
> To resume: it is impossible to help to this from Linux side. 
> We may accept ACK&WIN from out-of-window segments, and this 
> will help in this case _occasionally_. But Solaris is still 
> deemed to lockup randomly with such sawdust in the head. 

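As a quick cross-check of the byte counts in the quoted trace, the
following Python sketch adds up the data Solaris sent before and after
the probe (packet 2066).  The sequence ranges are taken from the tcpdump
lines; the constant 1460 is the last non-zero window Linux advertised
(packets 2061 and 2063).  The variable and function names are only
illustrative.

```python
# (start, end) sequence ranges of the Solaris data segments in the trace.
before_probe = [(67581, 68456), (68456, 69041)]  # packets 2064 and 2065
after_probe = [(69041, 69628), (69628, 70501)]   # packets 2067 and 2069

WINDOW = 1460  # last non-zero window advertised by Linux

def bytes_sent(segments):
    """Total payload bytes covered by a list of (start, end) ranges."""
    return sum(end - start for start, end in segments)

def is_contiguous(segments):
    """True if each segment starts where the previous one ended."""
    return all(a[1] == b[0] for a, b in zip(segments, segments[1:]))

sent_before = bytes_sent(before_probe)
sent_after = bytes_sent(after_probe)

print("before probe:", sent_before, "bytes, contiguous:", is_contiguous(before_probe))
print("after probe: ", sent_after, "bytes, contiguous:", is_contiguous(after_probe))
print("each burst equals one full window:", sent_before == WINDOW == sent_after)
```

Both bursts add up to exactly 1460 bytes, which is consistent with the
reading that Solaris advanced SND.UNA to 69041 while still believing the
window was 1460.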

I agree that Solaris is wrong to keep sending data while the Linux receive
window is 0, and I'm trying to file a bug report with Sun.  I found no
mention of such a problem in the patches available in their online support
center for any release of Solaris (I've seen it on Solaris 2.6 and 2.7 but
haven't tried others), so it may take quite a while to get the attention of
the right people.

Doesn't it seem likely, however, that the bug is being triggered by the
zero window probe, which subtracts one from the sequence number?  I
couldn't find any mention of that practice in the RFC; perhaps you can
point me to it.  Why doesn't the probe use the correct sequence number
instead of backing up by one?  Perhaps a workaround is for Linux not to
send the zero window probe with the deliberately incorrect sequence number.
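
For context on the sequence-number question, here is a small Python
sketch of the segment acceptability test from RFC 793 (the four-case
table on page 69), applied to the probe (packet 2066) and to the ordinary
ACK that follows it (packet 2068).  The receiver values are Solaris's
RCV.NXT from the quoted analysis and the zero window it advertises in the
trace; the function names are illustrative only.  RFC 793 says an
unacceptable segment should be answered with an ACK, so a zero-length
segment one below RCV.NXT would draw a reply; whether that is the reason
Linux sends it this way is not established in this thread.

```python
MOD = 2 ** 32  # TCP sequence numbers wrap at 32 bits

def in_window(seq, rcv_nxt, rcv_wnd):
    """True if RCV.NXT =< seq < RCV.NXT + RCV.WND in modular arithmetic."""
    return ((seq - rcv_nxt) % MOD) < rcv_wnd

def segment_acceptable(seg_seq, seg_len, rcv_nxt, rcv_wnd):
    """RFC 793 acceptability test for an incoming segment."""
    if seg_len == 0:
        if rcv_wnd == 0:
            return seg_seq == rcv_nxt
        return in_window(seg_seq, rcv_nxt, rcv_wnd)
    if rcv_wnd == 0:
        return False  # data never fits in a zero window
    last = (seg_seq + seg_len - 1) % MOD
    return in_window(seg_seq, rcv_nxt, rcv_wnd) or in_window(last, rcv_nxt, rcv_wnd)

# Solaris receive state during the hang, from the trace.
RCV_NXT, RCV_WND = 1583721, 0

probe_2066 = segment_acceptable(1583720, 0, RCV_NXT, RCV_WND)
ack_2068 = segment_acceptable(1583721, 0, RCV_NXT, RCV_WND)

print("packet 2066 (seq 1583720) acceptable:", probe_2066)
print("packet 2068 (seq 1583721) acceptable:", ack_2068)
```

By this test the probe is unacceptable and the later ACK is acceptable,
so the two segments take different paths through the receiver even though
they carry the same acknowledgment number and window.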

- Dave Dykstra
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Linux 2.2.16 through 2.2.18preX TCP hang bug triggered by rsync

2001-01-23 Thread Dave Dykstra

I'm sorry I didn't give you a more specific version number: the "X" in the
2.2.18preX kernel version we tried is 17.

- Dave



Linux 2.2.16 through 2.2.18preX TCP hang bug triggered by rsync

2001-01-23 Thread Dave Dykstra

Ville Herva <[EMAIL PROTECTED]> suggested I post this bug report here and that
possibly David Miller or Alexey Kuznetsov could help out.  I found the
problem back at the end of October and narrowed it down as much as I could
but didn't know where to report it until now.  For complete details please
see the rsync mailing list archive at
http://lists.samba.org/pipermail/rsync/2000-October/003004.html
and some of the preceding and following messages.  In particular, the next
message
http://lists.samba.org/pipermail/rsync/2000-October/003005.html
is an interpretation of the TCP dump by my co-worker which implicates the
Linux side.  Also, in
http://lists.samba.org/pipermail/rsync/2000-October/002985.html
Andrew Tridgell refers to a TCP patch that went into Linux kernel 2.2.17,
which "Stephen" told him about, but I don't know which Stephen he meant;
that fix didn't help anyway.

The first message above refers to a set of data that could possibly be used
to reproduce the problem, but unfortunately nobody else has reported to me
that they have successfully reproduced it.  I only saw the failures when
using rsync to pull to a particular Solaris 7 workstation; it happened
when pulling from two different Linux machines with three different
kernels, but from no other type of machine.  Another message
http://lists.samba.org/pipermail/rsync/2000-October/002981.html
gives a more complete rsync command for reproducing the problem.  The
original report at
http://lists.samba.org/pipermail/rsync/2000-October/002964.html
says that I first noticed the problem on Linux kernel 2.2.16-3smp.

- Dave Dykstra