Tcp developers: (Alan Cox: you probably could fix in a minute) I've been told that this is THE PLACE to contact linux kernel developers. Ok, I've got a repeatable bug that I've reported elsewhere to no avail. Hope this is the place: Includes 2 perl scripts to reproduce it and info about what I see not working with tcpdump traces. (i'm not a subscriber - to reach me use my email address) thnx et from my prior posts: -------------------- Hi: I have been trying to figure out why linux tcp is failing to ack properly in some situations. I can easily reproduce this error using 2 perl scripts. (see below) I create a socket server in one script, a client in another, start sending, suspend the receiver and wait 4 minutes. The socket will get disconnected. It should not do this, it should send an ack with a window of 0, which it fails to do. Both the client and the server can be on the same system to easily see the error. To any developer who might be listening - please help me find and fix this problem. Thanks Eric ---- debugging info follows ------------------ If I run the client from a windows system, linux behaves properly, sending acks/window 0, and when I unsuspend the receiver, all re-starts up in a few seconds. When going linux to linux, it fails almost always. When it does fail, I see the sender trying to send a large block (1-2k bytes) and when no ack comes back the sender goes into it's retransmit loop waiting 1,2,4... seconds. I also see the retransmit count going up with: /proc/net/tcp When it fails, the last ack sent back shows a window size of more than 15000. Here is a sample: 21:48:23.376528 lo > 127.0.0.1.5000 > 127.0.0.1.1052: . 1:1(0) ack 1255213 win 18158 <nop,nop,timestamp 21258536 21258536> (DF) 21:48:23.379304 lo > 127.0.0.1.1052 > 127.0.0.1.5000: P 1255213:1255256(43) ack 1 win 31072 <nop,nop,timestamp 21258536 21258536> (DF) 21:48:23.384049 lo > 127.0.0.1.1052 > 127.0.0.1.5000: P 1255256:1257234(1978) ack 1 win 31072 <nop,nop,timestamp 21258537 21258536 No further acks here. This last send occurs until the retransmit count is exhausted. Then the socket is closed. Once in a while, it does not fail, when it works, I see the sender (tcpdump trace) sending only a 1 byte record. And I see all the acks with a window getting smaller until it reaches zero. At that point, the sender is probing by sending 1 byte records. Here is a sample where it does work: 22:13:34.161580 lo > 127.0.0.1.5000 > 127.0.0.1.1053: . 1:1(0) ack 5444248 win 11616 <nop,nop,timestamp 21409615 21409615> (DF) 22:13:34.161979 lo > 127.0.0.1.5000 > 127.0.0.1.1053: . 1:1(0) ack 5451992 win 7744 <nop,nop,timestamp 21409615 21409615> (DF) 22:13:34.162322 lo > 127.0.0.1.5000 > 127.0.0.1.1053: . 1:1(0) ack 5459736 win 3872 <nop,nop,timestamp 21409615 21409615> (DF) 22:13:34.162553 lo > 127.0.0.1.5000 > 127.0.0.1.1053: . 1:1(0) ack 5463608 win 3872 <nop,nop,timestamp 21409615 21409615> (DF) 22:13:34.179785 lo > 127.0.0.1.5000 > 127.0.0.1.1053: . 1:1(0) ack 5467480 win 0 <nop,nop,timestamp 21409617 21409615> (DF) 22:13:34.379759 lo > 127.0.0.1.1053 > 127.0.0.1.5000: . 5467479:5467479(0) ack 1 win 31072 <nop,nop,timestamp 21409637 21409617> (DF) 22:13:34.379856 lo > 127.0.0.1.5000 > 127.0.0.1.1053: . 1:1(0) ack 5467480 win 0 <nop,nop,timestamp 21409637 21409615> (DF) 22:13:34.779762 lo > 127.0.0.1.1053 > 127.0.0.1.5000: . 5467479:5467479(0) ack 1 win 31072 <nop,nop,timestamp 21409677 21409637> (DF) 22:13:34.779852 lo > 127.0.0.1.5000 > 127.0.0.1.1053: . 1:1(0) ack 5467480 win 0 <nop,nop,timestamp 21409677 21409615> (DF) 22:13:35.579755 lo > 127.0.0.1.1053 > 127.0.0.1.5000: . 5467479:5467479(0) ack 1 win 31072 <nop,nop,timestamp 21409757 21409677> (DF) 22:13:35.579842 lo > 127.0.0.1.5000 > 127.0.0.1.1053: . 1:1(0) ack 5467480 win 0 <nop,nop,timestamp 21409757 21409615> (DF) server resumed here. 22:18:33.572608 lo > 127.0.0.1.5000 > 127.0.0.1.1053: . 1:1(0) ack 5467480 win 3872 <nop,nop,timestamp 21439556 21409615> (DF) 22:18:33.572714 lo > 127.0.0.1.1053 > 127.0.0.1.5000: P 5467480:5471352(3872) ack 1 win 31072 <nop,nop,timestamp 21439556 21439556> (DF) 22:18:33.572910 lo > 127.0.0.1.5000 > 127.0.0.1.1053: . 1:1(0) ack 5471352 win 3872 <nop,nop,timestamp 21439556 21439556> (DF) 22:18:33.573007 lo > 127.0.0.1.1053 > 127.0.0.1.5000: P 5471352:5475224(3872) ack 1 win 31072 <nop,nop,timestamp 21439556 21439556 ------------------------------------------------------ To reproduce: create the 2 perl files, badclient.pl and badserver.pl shown below. >From two terminal windows: >perl badserver.pl 5000 >perl badclient.pl You will see for the server (hit control-z): [1028]$ perl badserver.pl 5000 waiting for connection on port 5000 accept connection from 127.0.0.1, 1052 hello there /badclient.pl/ // // 1 hello there /badclient.pl/ // // 1001 hello there /badclient.pl/ // // 2001 hello there /badclient.pl/ // // 3001 ---type control z --- [1]+ Stopped perl badserver.pl 5000 You will see nothing for the client. ---badclient.pl---- use IO::Socket; $client = IO::Socket::INET->new("localhost:5000") or die $@; $a = shift; $b = shift; $i=1; while(1) { print $client "hello there /$0/ /$a/ /$b/ $i\n"; $i++; } close($client); ---badserver.pl--- use IO::Socket; use Socket; $port = shift or die "usage: perl server port_number\n"; $server = IO::Socket::INET->new( LocalPort =>$port, Type => SOCK_STREAM, Reuse => 1, Listen => 10, ) or die "cannot create server : $@\n"; print "waiting for connection on port $port\n"; while ($client = $server->accept()) { $other = getpeername($client) or die "cannot get peer $!\n"; ($port,$iaddr) = unpack_sockaddr_in($other); $ip_addr = inet_ntoa($iaddr); print "accept connection from $ip_addr, $port\n"; $i=0; while( $_ = <$client> ) { if($i%1000 == 0){print $_;} $i++; } print "done with this connection\n"; } close($server); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/