Hello,

this is a rare corner case encountered by one of HP's partners on 2.4.20
on IA64. Inspecting the sources of the latest 2.6.20.1
(net/ipv4/tcp_output.c), we can see that the bug is still there.
Here is a description of the bug and the suggested fix.

The problem occurs when the remote host (not necessarily Linux - in our
case it was Solaris) does not implement silly window syndrome (SWS)
avoidance on the sender side. If a Linux connection socket has
rcvbuf < mtu, we can potentially advertise a small rcv_wnd for a long
time (SWS).

The problem is due to SWS avoidance as implemented in
__tcp_select_window(). Everything works fine when rcvbuf > mtu. But if
we use a small rcvbuf (set by SO_RCVBUF), we can go into SWS mode. Let us
for simplicity look only at the case when we don't have window scaling
(WS) enabled. If we have free_space above full_space/2, we reach the
following section:

        /* Don't do rounding if we are using window scaling, since the
         * scaled window will not line up with the MSS boundary anyway.
         */
        window = tp->rcv_wnd;
        if (tp->rx_opt.rcv_wscale) {
                <snip>
        } else {
                /* Get the largest window that is a nice multiple of mss.
                 * Window clamp already applied above.
                 * If our current window offering is within 1 mss of the
                 * free space we just keep it. This prevents the divide
                 * and multiply from happening most of the time.
                 * We also don't do any window rounding when the free space
                 * is too small.
                 */
(1)             if (window <= free_space - mss || window > free_space)
                        window = (free_space/mss)*mss;
        }

        return window;

What happens if we have a small tp->rcv_wnd and rcvbuf <= mss? In this
case condition (1) is almost always false, and as a result we return the
unmodified 'window' set to tp->rcv_wnd. If tp->rcv_wnd is small, it can
be reused over and over again.

For the case rcvbuf <= mss, __tcp_select_window() returns:

  0            if we have free_space < full_space/2     OK
  mss          if rcvbuf is empty                       OK
  tp->rcv_wnd  in any other case                        Bad

If there is no SWS avoidance on the sender side, we can see Linux
advertising the same small rcv_wnd over and over again. The problem here
is that we never advertise one-half the receiver's buffer space as
described e.g.
in "TCP/IP Illustrated" by Stevens (v.1, Chapter 22.3):

  "The normal algorithm is for the receiver not to advertise a larger
  window than it is currently advertising (which can be 0) until the
  window can be increased by either one full-sized segment (i.e. the
  MSS being received)
  or by one-half the receiver's buffer space, whichever is smaller"
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The fix.
--------

We have not been able to reproduce the problem inside HP, as it is
unclear what conditions are needed to bring the system into SWS mode
(this needs very special event timing). The HP customer was seeing it
every 2-3 days while running a custom application (Solaris<->Linux) with
low priority on a busy host that was running other custom applications
with SCHED_RR. After going into SWS mode, his application stayed in it
until restarted.

We provided the customer with a fix for 2.4.20 only (the version used by
the customer in production) by adding another test and returning
rcvbuf/2 when needed:

--- net/ipv4/tcp_output.c.orig  Wed May 3 20:40:43 2006
+++ net/ipv4/tcp_output.c       Tue Jan 30 14:24:56 2007
@@ -641,6 +641,7 @@
  * Note, we don't "adjust" for TIMESTAMP or SACK option bytes.
  * Regular options like TIMESTAMP are taken into account.
  */
+static const char *SWS_id_string="@#SWS-fix-2";
 u32 __tcp_select_window(struct sock *sk)
 {
        struct tcp_opt *tp = &sk->tp_pinfo.af_tcp;
@@ -682,6 +683,9 @@
        window = tp->rcv_wnd;
        if (window <= free_space - mss || window > free_space)
                window = (free_space/mss)*mss;
+       /* A fix for small rcvbuf [EMAIL PROTECTED] */
+       else if (mss == full_space && window < full_space/2)
+               window = full_space/2;
 
        return window;
 }

The customer has confirmed that this resolves the problem and decreases
CPU usage by his custom application, even when there is no SWS.

This is a rare corner case and most users will never meet it. But as the
fix is trivial, I think it makes sense to include it in upstream sources.

Regards,
Alex

-- 
------------------------------------------------------------------
Alexandre Sidorenko             email: [EMAIL PROTECTED]
Global Solutions Engineering:   Unix Networking
Hewlett-Packard (Canada)
------------------------------------------------------------------