Hello all. I read with interest (and fair ignorance ;-) ) the thread about delayed ACKs in the TCP/IP stack.
Looking at the results of tbench, it looked like something I wanted in my 4.2 kernel. So I patched my kernel accordingly, and ran the tests:

---8<---
Pre-patch:
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 1 localhost
Throughput 1.15675 MB/sec (NB=1.44593 MB/sec 11.5675 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 2 localhost
Throughput 2.18475 MB/sec (NB=2.73094 MB/sec 21.8475 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 3 localhost
Throughput 3.20828 MB/sec (NB=4.01035 MB/sec 32.0828 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 1 sheol
Throughput 1.14315 MB/sec (NB=1.42894 MB/sec 11.4315 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 2 sheol
Throughput 2.12477 MB/sec (NB=2.65596 MB/sec 21.2477 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 3 sheol
Throughput 3.16156 MB/sec (NB=3.95195 MB/sec 31.6156 MBit/sec)

Post-patch:
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 1 localhost
Throughput 13.8458 MB/sec (NB=17.3073 MB/sec 138.458 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 2 localhost
Throughput 12.8562 MB/sec (NB=16.0703 MB/sec 128.562 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 3 localhost
Throughput 12.1043 MB/sec (NB=15.1304 MB/sec 121.043 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 1 sheol
Throughput 9.62885 MB/sec (NB=12.0361 MB/sec 96.2885 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 2 sheol
Throughput 8.7068 MB/sec (NB=10.8835 MB/sec 87.068 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 3 sheol
Throughput 8.89676 MB/sec (NB=11.1209 MB/sec 88.9676 MBit/sec)
--->8---

I didn't bother running through my 100Mb switch - only 10Mb NICs on the other side. Similar results going to the "other" NIC in this box (it's my NAT/FW/GW).
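To read the gain as a ratio rather than as raw figures, here is a minimal sketch in C that divides each post-patch throughput by the matching pre-patch one. The MB/sec numbers are copied from the tbench output above; the struct, the table and the function names are made up for illustration and are not part of tbench or dbench.

```c
#include <assert.h>
#include <stddef.h>

/* One tbench run from the results above; names are illustrative only. */
struct tbench_run {
	const char *target;	/* host argument given to tbench */
	int clients;		/* client count argument given to tbench */
	double pre_mbs;		/* pre-patch throughput, MB/sec */
	double post_mbs;	/* post-patch throughput, MB/sec */
};

static const struct tbench_run runs[] = {
	{ "localhost", 1, 1.15675, 13.8458 },
	{ "localhost", 2, 2.18475, 12.8562 },
	{ "localhost", 3, 3.20828, 12.1043 },
	{ "sheol",     1, 1.14315, 9.62885 },
	{ "sheol",     2, 2.12477, 8.7068  },
	{ "sheol",     3, 3.16156, 8.89676 },
};

#define NRUNS (sizeof(runs) / sizeof(runs[0]))

/* Post-patch throughput divided by pre-patch throughput for one run. */
double
speedup(const struct tbench_run *r)
{
	assert(r != NULL && r->pre_mbs > 0.0);
	return r->post_mbs / r->pre_mbs;
}

/* Smallest speedup across all runs in the table. */
double
smallest_speedup(void)
{
	double min = speedup(&runs[0]);
	size_t i;

	for (i = 1; i < NRUNS; i++) {
		double s = speedup(&runs[i]);
		if (s < min)
			min = s;
	}
	return min;
}
```

With these figures, speedup(&runs[0]) comes out at roughly 12 for the single-client localhost run, and the smallest ratio in the table (three clients to sheol) is a little under 3.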
Machine particulars:

FreeBSD sheol.localdomain 4.2-RELEASE FreeBSD 4.2-RELEASE #33: Thu Dec 6 10:20:08 CST 2001 [EMAIL PROTECTED]:/usr/src/sys/compile/SHEOL i386

Copyright (c) 1992-2000 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
	The Regents of the University of California. All rights reserved.
FreeBSD 4.2-RELEASE #33: Thu Dec 6 10:20:08 CST 2001
    [EMAIL PROTECTED]:/usr/src/sys/compile/SHEOL
Timecounter "i8254" frequency 1193182 Hz
CPU: Pentium III/Pentium III Xeon/Celeron (764.35-MHz 686-class CPU)
  Origin = "GenuineIntel" Id = 0x686 Stepping = 6
  Features=0x383f9ff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE>
...
dc0: <ADMtek AN985 10/100BaseTX> port 0x3000-0x30ff mem 0xf4100000-0xf41003ff irq 11 at device 13.0 on pci1

dc0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
	inet 192.168.16.2 netmask 0xffffff00 broadcast 192.168.16.255
	inet6 fe80::203:6dff:fe11:63d2%dc0 prefixlen 64 scopeid 0x1
	ether 00:03:6d:11:63:d2
	media: autoselect (100baseTX <full-duplex>) status: active
	supported media: autoselect 100baseTX <full-duplex> 100baseTX 10baseT/UTP <full-duplex> 10baseT/UTP none

If Matt or any other qualified hackers can make the time to double-check my patches, I'd appreciate it. Matt's first patch didn't apply (no NewReno in 4.2REL), and the third patch (to tcp_input.c) required a little more work (I changed tests for 'tcp_delack_enabled' to 'DELAY_ACK()'). I'd just like some assurance I got it right.

All in all, kudos to Matt for this. In day-to-day use, I can "feel" the improvement, and everything seems as solid as ever!

Dave

-- 
  ______________________                         ______________________
  \__________________   \    D. J. HAWKEY JR.   /   __________________/
     \________________/\     [EMAIL PROTECTED]    /\________________/
                      http://www.visi.com/~hawkeyd/

---8<---
--- /usr/src/sys/kern/uipc_socket.c.42REL	Fri Nov 17 13:47:27 2000
+++ /usr/src/sys/kern/uipc_socket.c	Thu Dec 6 07:26:28 2001
@@ -913,6 +913,14 @@
 		    !sosendallatonce(so) && !nextrecord) {
 			if (so->so_error || so->so_state & SS_CANTRCVMORE)
 				break;
+			/*
+			 * The window might have closed to zero, make
+			 * sure we send an ack now that we've drained
+			 * the buffer or we might end up blocking until
+			 * the idle takes over (5 seconds).
+			 */
+			if (pr->pr_flags & PR_WANTRCVD && so->so_pcb)
+				(*pr->pr_usrreqs->pru_rcvd)(so, flags);
 			error = sbwait(&so->so_rcv);
 			if (error) {
 				sbunlock(&so->so_rcv);
--- /usr/src/sys/netinet/tcp_input.c.42REL	Wed Aug 16 01:14:23 2000
+++ /usr/src/sys/netinet/tcp_input.c	Thu Dec 6 10:05:53 2001
@@ -164,6 +164,17 @@
 #endif
 
 /*
+ * Indicate whether this ack should be delayed. We can delay the ack if
+ *	- delayed acks are enabled and
+ *	- there is no delayed ack timer in progress and
+ *	- our last ack wasn't a 0-sized window. We never want to delay
+ *	  the ack that opens up a 0-sized window.
+ */
+#define DELAY_ACK(tp) \
+	(tcp_delack_enabled && !callout_pending(tp->tt_delack) && \
+	(tp->t_flags & TF_RXWIN0SENT) == 0)
+
+/*
  * Insert segment which inludes th into reassembly queue of tcp with
  * control block tp. Return TH_FIN if reassembly now includes
  * a segment with FIN. The macro form does the common case inline
@@ -177,7 +188,7 @@
 	if ((th)->th_seq == (tp)->rcv_nxt && \
 	    LIST_EMPTY(&(tp)->t_segq) && \
 	    (tp)->t_state == TCPS_ESTABLISHED) { \
-		if (tcp_delack_enabled) \
+		if (DELAY_ACK(tp)) \
 			callout_reset(tp->tt_delack, tcp_delacktime, \
 			    tcp_timer_delack, tp); \
 		else \
@@ -817,7 +828,7 @@
 #endif
 		tp = intotcpcb(inp);
 		tp->t_state = TCPS_LISTEN;
-		tp->t_flags |= tp0->t_flags & (TF_NOPUSH|TF_NOOPT);
+		tp->t_flags |= tp0->t_flags & (TF_NOPUSH|TF_NOOPT|TF_NODELAY);
 
 		/* Compute proper scaling value from buffer space */
 		while (tp->request_r_scale < TCP_MAX_WINSHIFT &&
@@ -961,7 +972,7 @@
 				m_adj(m, drop_hdrlen);	/* delayed header drop */
 				sbappend(&so->so_rcv, m);
 				sorwakeup(so);
-				if (tcp_delack_enabled) {
+				if (DELAY_ACK(tp)) {
 					callout_reset(tp->tt_delack, tcp_delacktime,
 					    tcp_timer_delack, tp);
 				} else {
@@ -1144,7 +1155,7 @@
 		 * segment. Otherwise must send ACK now in case
 		 * the other side is slow starting.
 		 */
-		if (tcp_delack_enabled && ((thflags & TH_FIN) ||
+		if (DELAY_ACK(tp) && ((thflags & TH_FIN) ||
 		    (tlen != 0 &&
 #ifdef INET6
 		      ((isipv6 && in6_localaddr(&inp->in6p_faddr))
@@ -1289,7 +1300,7 @@
 		 * If there's data, delay ACK; if there's also a FIN
 		 * ACKNOW will be turned on later.
 		 */
-		if (tcp_delack_enabled && tlen != 0)
+		if (DELAY_ACK(tp) && tlen != 0)
 			callout_reset(tp->tt_delack, tcp_delacktime,
 			    tcp_timer_delack, tp);
 		else
@@ -2117,7 +2128,7 @@
 			 * Otherwise, since we received a FIN then no
 			 * more input can be expected, send ACK now.
 			 */
-			if (tcp_delack_enabled && (tp->t_flags & TF_NEEDSYN))
+			if (DELAY_ACK(tp) && (tp->t_flags & TF_NEEDSYN))
 				callout_reset(tp->tt_delack, tcp_delacktime,
 				    tcp_timer_delack, tp);
 			else
--- /usr/src/sys/netinet/tcp_output.c.42REL	Tue Sep 12 23:27:06 2000
+++ /usr/src/sys/netinet/tcp_output.c	Thu Dec 6 10:05:53 2001
@@ -266,28 +266,38 @@
 	win = sbspace(&so->so_rcv);
 
 	/*
-	 * Sender silly window avoidance. If connection is idle
-	 * and can send all data, a maximum segment,
-	 * at least a maximum default-size segment do it,
-	 * or are forced, do it; otherwise don't bother.
-	 * If peer's buffer is tiny, then send
-	 * when window is at least half open.
-	 * If retransmitting (possibly after persist timer forced us
-	 * to send into a small window), then must resend.
+	 * Sender silly window avoidance. We transmit under the following
+	 * conditions when len is non-zero:
+	 *
+	 *	- We have a full segment
+	 *	- This is the last buffer in a write()/send() and we are
+	 *	  either idle or running NODELAY
+	 *	- we've timed out (e.g. persist timer)
+	 *	- we have more then 1/2 the maximum send window's worth of
+	 *	  data (receiver may be limited the window size)
+	 *	- we need to retransmit
 	 */
 	if (len) {
 		if (len == tp->t_maxseg)
 			goto send;
-		if (!(tp->t_flags & TF_MORETOCOME) &&
-		    (idle || tp->t_flags & TF_NODELAY) &&
-		    (tp->t_flags & TF_NOPUSH) == 0 &&
-		    len + off >= so->so_snd.sb_cc)
+		/*
+		 * NOTE! on localhost connections an 'ack' from the remote
+		 * end may occur synchronously with the output and cause
+		 * us to flush a buffer queued with moretocome. XXX
+		 *
+		 * note: the len + off check is almost certainly unnecessary.
+		 */
+		if (!(tp->t_flags & TF_MORETOCOME) &&	/* normal case */
+		    (idle || (tp->t_flags & TF_NODELAY)) &&
+		    len + off >= so->so_snd.sb_cc &&
+		    (tp->t_flags & TF_NOPUSH) == 0) {
 			goto send;
-		if (tp->t_force)
+		}
+		if (tp->t_force)			/* typ. timeout case */
 			goto send;
 		if (len >= tp->max_sndwnd / 2 && tp->max_sndwnd > 0)
 			goto send;
-		if (SEQ_LT(tp->snd_nxt, tp->snd_max))
+		if (SEQ_LT(tp->snd_nxt, tp->snd_max))	/* retransmit case */
 			goto send;
 	}
 
@@ -694,6 +704,20 @@
 	if (win > (long)TCP_MAXWIN << tp->rcv_scale)
 		win = (long)TCP_MAXWIN << tp->rcv_scale;
 	th->th_win = htons((u_short) (win>>tp->rcv_scale));
+
+	/*
+	 * Adjust the RXWIN0SENT flag - indicate that we have advertised
+	 * a 0 window. This may cause the remote transmitter to stall. This
+	 * flag tells soreceive() to disable delayed acknowledgements when
+	 * draining the buffer. This can occur if the receiver is attempting
+	 * to read more data then can be buffered prior to transmitting on
+	 * the connection.
+	 */
+	if (win == 0)
+		tp->t_flags |= TF_RXWIN0SENT;
+	else
+		tp->t_flags &= ~TF_RXWIN0SENT;
+
 	if (SEQ_GT(tp->snd_up, tp->snd_nxt)) {
 		th->th_urp = htons((u_short)(tp->snd_up - tp->snd_nxt));
 		th->th_flags |= TH_URG;
--- /usr/src/sys/netinet/tcp_var.h.42REL	Wed Aug 16 01:14:23 2000
+++ /usr/src/sys/netinet/tcp_var.h	Thu Dec 6 10:05:53 2001
@@ -95,6 +95,7 @@
 #define	TF_SENDCCNEW	0x08000		/* send CCnew instead of CC in SYN */
 #define	TF_MORETOCOME	0x10000		/* More data to be appended to sock */
 #define	TF_LQ_OVERFLOW	0x20000		/* listen queue overflow */
+#define	TF_RXWIN0SENT	0x40000		/* sent a receiver win 0 in response */
 	int	t_force;		/* 1 if forcing out a byte */
 
 	tcp_seq	snd_una;		/* send unacknowledged */
--->8---
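
The interaction between the DELAY_ACK() test and the TF_RXWIN0SENT flag can be tried outside the kernel. What follows is a minimal userland sketch under stated assumptions, not kernel code: struct conn_model, its delack_timer_pending field (standing in for callout_pending(tp->tt_delack)) and the two function names are hypothetical. Only TF_RXWIN0SENT and its value, the tcp_delack_enabled name, and the three conditions themselves are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TF_RXWIN0SENT	0x40000	/* value from the tcp_var.h hunk */

/* Hypothetical stand-in for the parts of struct tcpcb the patch consults. */
struct conn_model {
	unsigned int t_flags;		/* TF_* flag bits */
	bool delack_timer_pending;	/* models callout_pending(tp->tt_delack) */
};

/* Models the kernel global of the same name. */
bool tcp_delack_enabled = true;

/*
 * Same three conditions as the DELAY_ACK(tp) macro: delayed acks are
 * enabled, no delayed-ack timer is already pending, and the last window
 * we advertised was not zero.
 */
bool
delay_ack(const struct conn_model *tp)
{
	assert(tp != NULL);
	return tcp_delack_enabled && !tp->delack_timer_pending &&
	    (tp->t_flags & TF_RXWIN0SENT) == 0;
}

/*
 * Mirrors the tcp_output.c hunk: remember whether the window just
 * advertised to the peer was zero.
 */
void
note_advertised_window(struct conn_model *tp, long win)
{
	assert(tp != NULL);
	if (win == 0)
		tp->t_flags |= TF_RXWIN0SENT;
	else
		tp->t_flags &= ~TF_RXWIN0SENT;
}
```

In this model, after note_advertised_window(&c, 0) the call delay_ack(&c) returns false, so the ack that reopens the window would go out immediately. Once a non-zero window has been advertised again, delay_ack(&c) returns true as long as no timer is pending.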