Hello all.

I read with interest (and fair ignorance ;-) ) the thread about delayed
ACKs in the TCP/IP stack.

Looking at the results of tbench, it looked like something I wanted in
my 4.2 kernel. So I patched my kernel accordingly, and ran the tests:

---8<---

Pre-patch:

[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 1 localhost
Throughput 1.15675 MB/sec (NB=1.44593 MB/sec  11.5675 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 2 localhost
Throughput 2.18475 MB/sec (NB=2.73094 MB/sec  21.8475 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 3 localhost
Throughput 3.20828 MB/sec (NB=4.01035 MB/sec  32.0828 MBit/sec)

[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 1 sheol    
Throughput 1.14315 MB/sec (NB=1.42894 MB/sec  11.4315 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 2 sheol
Throughput 2.12477 MB/sec (NB=2.65596 MB/sec  21.2477 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 3 sheol
Throughput 3.16156 MB/sec (NB=3.95195 MB/sec  31.6156 MBit/sec)

Post-patch:

[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 1 localhost
Throughput 13.8458 MB/sec (NB=17.3073 MB/sec  138.458 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 2 localhost
Throughput 12.8562 MB/sec (NB=16.0703 MB/sec  128.562 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 3 localhost
Throughput 12.1043 MB/sec (NB=15.1304 MB/sec  121.043 MBit/sec)

[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 1 sheol    
Throughput 9.62885 MB/sec (NB=12.0361 MB/sec  96.2885 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 2 sheol
Throughput 8.7068 MB/sec (NB=10.8835 MB/sec  87.068 MBit/sec)
[sheol] /usr/home/hawkeyd/projects/dbench$ ./tbench 3 sheol
Throughput 8.89676 MB/sec (NB=11.1209 MB/sec  88.9676 MBit/sec)

--->8---

I didn't bother running through my 100Mb switch, since there are only 10Mb
NICs on the other side. Results were similar going to the "other" NIC in
this box (it's my NAT/FW/GW).
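
To put a number on the difference, the figures above can be reduced to
ratios. What follows is only a minimal sketch in C; the function names
speedup() and report() are made up for illustration and are not part of
tbench or dbench:

```c
#include <assert.h>
#include <stdio.h>

/* Ratio of post-patch to pre-patch throughput (both in MB/sec). */
double speedup(double before, double after)
{
	assert(before > 0.0);	/* a zero baseline would make the ratio meaningless */
	return after / before;
}

/* Print one comparison row: label, both throughputs, and the ratio. */
void report(const char *label, double before, double after)
{
	printf("%-20s %8.4f -> %8.4f MB/sec  (%.1fx)\n",
	    label, before, after, speedup(before, after));
}
```

Calling report("tbench 1 localhost", 1.15675, 13.8458) and
report("tbench 1 sheol", 1.14315, 9.62885) gives roughly 12x and 8.4x for
the single-client runs. With three clients the ratios drop to roughly 3.8x
and 2.8x, because pre-patch throughput grew with the client count while
post-patch throughput stayed about flat.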

Machine particulars:

  FreeBSD sheol.localdomain 4.2-RELEASE FreeBSD 4.2-RELEASE #33: Thu Dec  6 10:20:08 CST 2001     [EMAIL PROTECTED]:/usr/src/sys/compile/SHEOL  i386


  Copyright (c) 1992-2000 The FreeBSD Project.
  Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
  FreeBSD 4.2-RELEASE #33: Thu Dec  6 10:20:08 CST 2001
    [EMAIL PROTECTED]:/usr/src/sys/compile/SHEOL
  Timecounter "i8254"  frequency 1193182 Hz
  CPU: Pentium III/Pentium III Xeon/Celeron (764.35-MHz 686-class CPU)
    Origin = "GenuineIntel"  Id = 0x686  Stepping = 6
    Features=0x383f9ff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE>
  ...
  dc0: <ADMtek AN985 10/100BaseTX> port 0x3000-0x30ff mem 0xf4100000-0xf41003ff irq 11 at device 13.0 on pci1


  dc0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        inet 192.168.16.2 netmask 0xffffff00 broadcast 192.168.16.255
        inet6 fe80::203:6dff:fe11:63d2%dc0 prefixlen 64 scopeid 0x1 
        ether 00:03:6d:11:63:d2 
        media: autoselect (100baseTX <full-duplex>) status: active
        supported media: autoselect 100baseTX <full-duplex> 100baseTX 10baseT/UTP <full-duplex> 10baseT/UTP none


If Matt or any other qualified hackers can make the time to double-check
my patches, I'd appreciate it. Matt's first patch didn't apply (no NewReno
in 4.2REL), and the third patch (to tcp_input.c) required a little more work
(I changed tests for 'tcp_delack_enabled' to 'DELAY_ACK()'). I'd just like
some assurance I got it right.
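
For anyone reviewing, here is how I read the logic I substituted, written
as a small user-space model in C. This is only a sketch under stated
assumptions: "struct conn", its "delack_pending" field, and
note_advertised_window() are hypothetical stand-ins (for struct tcpcb,
callout_pending(tp->tt_delack), and the new code in tcp_output.c), not
kernel interfaces. The flag value is the one from the tcp_var.h hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TF_RXWIN0SENT	0x40000u	/* value from the tcp_var.h hunk */

/* Hypothetical stand-in for the few struct tcpcb fields involved. */
struct conn {
	unsigned int	t_flags;
	bool		delack_pending;	/* models callout_pending(tp->tt_delack) */
};

static bool delack_enabled = true;	/* models tcp_delack_enabled */

/*
 * Models DELAY_ACK(tp): delay only if delayed ACKs are enabled, no
 * delayed-ACK timer is already pending, and the last window we
 * advertised was not zero.
 */
bool delay_ack(const struct conn *tp)
{
	assert(tp != NULL);
	return delack_enabled && !tp->delack_pending &&
	    (tp->t_flags & TF_RXWIN0SENT) == 0;
}

/* Models the tcp_output.c hunk: track whether we advertised a 0 window. */
void note_advertised_window(struct conn *tp, long win)
{
	assert(tp != NULL);
	if (win == 0)
		tp->t_flags |= TF_RXWIN0SENT;
	else
		tp->t_flags &= ~TF_RXWIN0SENT;
}
```

In this model, advertising a zero window makes delay_ack() return false
until a non-zero window is advertised again, so the ACK that reopens the
window is never delayed. If that does not match what Matt's patch intends,
then my substitution is wrong.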

All in all, kudos to Matt for this. In day-to-day use, I can "feel" the
improvement, and everything seems as solid as ever!

Dave

-- 
  ______________________                         ______________________
  \__________________   \    D. J. HAWKEY JR.   /   __________________/
     \________________/\     [EMAIL PROTECTED]    /\________________/
                      http://www.visi.com/~hawkeyd/

---8<---

--- /usr/src/sys/kern/uipc_socket.c.42REL       Fri Nov 17 13:47:27 2000
+++ /usr/src/sys/kern/uipc_socket.c     Thu Dec  6 07:26:28 2001
@@ -913,6 +913,14 @@
                    !sosendallatonce(so) && !nextrecord) {
                        if (so->so_error || so->so_state & SS_CANTRCVMORE)
                                break;
+                       /*
+                        * The window might have closed to zero, make
+                        * sure we send an ack now that we've drained
+                        * the buffer or we might end up blocking until
+                        * the idle takes over (5 seconds).
+                        */
+                       if (pr->pr_flags & PR_WANTRCVD && so->so_pcb)
+                               (*pr->pr_usrreqs->pru_rcvd)(so, flags);
                        error = sbwait(&so->so_rcv);
                        if (error) {
                                sbunlock(&so->so_rcv);


--- /usr/src/sys/netinet/tcp_input.c.42REL      Wed Aug 16 01:14:23 2000
+++ /usr/src/sys/netinet/tcp_input.c    Thu Dec  6 10:05:53 2001
@@ -164,6 +164,17 @@
 #endif
 
 /*
+ * Indicate whether this ack should be delayed.  We can delay the ack if
+ *      - delayed acks are enabled and
+ *      - there is no delayed ack timer in progress and
+ *      - our last ack wasn't a 0-sized window.  We never want to delay
+ *        the ack that opens up a 0-sized window.
+ */
+#define DELAY_ACK(tp) \
+       (tcp_delack_enabled && !callout_pending(tp->tt_delack) && \
+       (tp->t_flags & TF_RXWIN0SENT) == 0)
+
+/*
  * Insert segment which inludes th into reassembly queue of tcp with
  * control block tp.  Return TH_FIN if reassembly now includes
  * a segment with FIN.  The macro form does the common case inline
@@ -177,7 +188,7 @@
        if ((th)->th_seq == (tp)->rcv_nxt && \
            LIST_EMPTY(&(tp)->t_segq) && \
            (tp)->t_state == TCPS_ESTABLISHED) { \
-               if (tcp_delack_enabled) \
+               if (DELAY_ACK(tp)) \
                        callout_reset(tp->tt_delack, tcp_delacktime, \
                            tcp_timer_delack, tp); \
                else \
@@ -817,7 +828,7 @@
 #endif
                        tp = intotcpcb(inp);
                        tp->t_state = TCPS_LISTEN;
-                       tp->t_flags |= tp0->t_flags & (TF_NOPUSH|TF_NOOPT);
+                       tp->t_flags |= tp0->t_flags & (TF_NOPUSH|TF_NOOPT|TF_NODELAY);
 
                        /* Compute proper scaling value from buffer space */
                        while (tp->request_r_scale < TCP_MAX_WINSHIFT &&
@@ -961,7 +972,7 @@
                        m_adj(m, drop_hdrlen);  /* delayed header drop */
                        sbappend(&so->so_rcv, m);
                        sorwakeup(so);
-                       if (tcp_delack_enabled) {
+                       if (DELAY_ACK(tp)) {
                                callout_reset(tp->tt_delack, tcp_delacktime,
                                    tcp_timer_delack, tp);
                        } else {
@@ -1144,7 +1155,7 @@
                         * segment.  Otherwise must send ACK now in case
                         * the other side is slow starting.
                         */
-                       if (tcp_delack_enabled && ((thflags & TH_FIN) ||
+                       if (DELAY_ACK(tp) && ((thflags & TH_FIN) ||
                            (tlen != 0 &&
 #ifdef INET6
                              ((isipv6 && in6_localaddr(&inp->in6p_faddr))
@@ -1289,7 +1300,7 @@
                         * If there's data, delay ACK; if there's also a FIN
                         * ACKNOW will be turned on later.
                         */
-                       if (tcp_delack_enabled && tlen != 0)
+                       if (DELAY_ACK(tp) && tlen != 0)
                                 callout_reset(tp->tt_delack, tcp_delacktime,  
                                     tcp_timer_delack, tp);  
                        else
@@ -2117,7 +2128,7 @@
                         *  Otherwise, since we received a FIN then no
                         *  more input can be expected, send ACK now.
                         */
-                       if (tcp_delack_enabled && (tp->t_flags & TF_NEEDSYN))
+                       if (DELAY_ACK(tp) && (tp->t_flags & TF_NEEDSYN))
                                 callout_reset(tp->tt_delack, tcp_delacktime,  
                                     tcp_timer_delack, tp);  
                        else


--- /usr/src/sys/netinet/tcp_output.c.42REL     Tue Sep 12 23:27:06 2000
+++ /usr/src/sys/netinet/tcp_output.c   Thu Dec  6 10:05:53 2001
@@ -266,28 +266,38 @@
        win = sbspace(&so->so_rcv);
 
        /*
-        * Sender silly window avoidance.  If connection is idle
-        * and can send all data, a maximum segment,
-        * at least a maximum default-size segment do it,
-        * or are forced, do it; otherwise don't bother.
-        * If peer's buffer is tiny, then send
-        * when window is at least half open.
-        * If retransmitting (possibly after persist timer forced us
-        * to send into a small window), then must resend.
+        * Sender silly window avoidance.   We transmit under the following
+        * conditions when len is non-zero:
+        *
+        *      - We have a full segment
+        *      - This is the last buffer in a write()/send() and we are
+        *        either idle or running NODELAY
+        *      - we've timed out (e.g. persist timer)
+        *      - we have more then 1/2 the maximum send window's worth of
+        *        data (receiver may be limited the window size)
+        *      - we need to retransmit
         */
        if (len) {
                if (len == tp->t_maxseg)
                        goto send;
-               if (!(tp->t_flags & TF_MORETOCOME) &&
-                   (idle || tp->t_flags & TF_NODELAY) &&
-                   (tp->t_flags & TF_NOPUSH) == 0 &&
-                   len + off >= so->so_snd.sb_cc)
+               /*
+                * NOTE! on localhost connections an 'ack' from the remote
+                * end may occur synchronously with the output and cause
+                * us to flush a buffer queued with moretocome.  XXX
+                *
+                * note: the len + off check is almost certainly unnecessary.
+                */
+               if (!(tp->t_flags & TF_MORETOCOME) &&   /* normal case */
+                   (idle || (tp->t_flags & TF_NODELAY)) &&
+                   len + off >= so->so_snd.sb_cc &&
+                   (tp->t_flags & TF_NOPUSH) == 0) {
                        goto send;
-               if (tp->t_force)
+               }
+               if (tp->t_force)                        /* typ. timeout case */
                        goto send;
                if (len >= tp->max_sndwnd / 2 && tp->max_sndwnd > 0)
                        goto send;
-               if (SEQ_LT(tp->snd_nxt, tp->snd_max))
+               if (SEQ_LT(tp->snd_nxt, tp->snd_max))   /* retransmit case */
                        goto send;
        }
 
@@ -694,6 +704,20 @@
        if (win > (long)TCP_MAXWIN << tp->rcv_scale)
                win = (long)TCP_MAXWIN << tp->rcv_scale;
        th->th_win = htons((u_short) (win>>tp->rcv_scale));
+
+       /*
+        * Adjust the RXWIN0SENT flag - indicate that we have advertised
+        * a 0 window.  This may cause the remote transmitter to stall.  This
+        * flag tells soreceive() to disable delayed acknowledgements when
+        * draining the buffer.  This can occur if the receiver is attempting
+        * to read more data then can be buffered prior to transmitting on
+        * the connection.
+        */
+       if (win == 0)
+               tp->t_flags |= TF_RXWIN0SENT;
+       else
+               tp->t_flags &= ~TF_RXWIN0SENT;
+
        if (SEQ_GT(tp->snd_up, tp->snd_nxt)) {
                th->th_urp = htons((u_short)(tp->snd_up - tp->snd_nxt));
                th->th_flags |= TH_URG;


--- /usr/src/sys/netinet/tcp_var.h.42REL        Wed Aug 16 01:14:23 2000
+++ /usr/src/sys/netinet/tcp_var.h      Thu Dec  6 10:05:53 2001
@@ -95,6 +95,7 @@
 #define        TF_SENDCCNEW    0x08000         /* send CCnew instead of CC in SYN */
 #define        TF_MORETOCOME   0x10000         /* More data to be appended to sock */
 #define        TF_LQ_OVERFLOW  0x20000         /* listen queue overflow */
+#define        TF_RXWIN0SENT   0x40000         /* sent a receiver win 0 in response */
        int     t_force;                /* 1 if forcing out a byte */
 
        tcp_seq snd_una;                /* send unacknowledged */

--->8---

