I've fixed a couple of additional problems. 

    * tbench() assumes that accept() propagates the NODELAY TCP option.
      It doesn't in FreeBSD.  Er, it didn't in FreeBSD... my patch fixes
      this.

    * If the transmitter sees a 0 window it stalls waiting for an ack.
      However, if delayed acks are turned on, the receiver will not
      acknowledge a drain of the buffer immediately; it will delay the ack.
      The transmitter stays stalled until that delayed ack finally goes
      out, which causes severe issues on localhost connections.
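
    For code that has to run on a stack that does not propagate the
    option across accept(), a minimal user-space sketch of a workaround
    is to set TCP_NODELAY explicitly on each accepted socket.  The helper
    names set_nodelay(), get_nodelay() and accept_nodelay() are
    hypothetical and are not part of tbench or of the patch.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <unistd.h>

/* Turn on TCP_NODELAY for fd.  Returns 0 on success, -1 on error. */
static int set_nodelay(int fd)
{
    int on = 1;

    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}

/* Returns 1 if TCP_NODELAY is set on fd, 0 if it is clear, -1 on error. */
static int get_nodelay(int fd)
{
    int val = 0;
    socklen_t len = sizeof(val);

    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, &len) < 0)
        return -1;
    return val != 0;    /* normalize: some stacks return a flag value */
}

/*
 * Hypothetical helper: accept a connection and set TCP_NODELAY on it
 * explicitly instead of relying on inheritance from the listen socket.
 * Returns the new descriptor, or -1 on error.
 */
static int accept_nodelay(int listen_fd)
{
    int fd = accept(listen_fd, NULL, NULL);

    if (fd < 0)
        return -1;
    if (set_nodelay(fd) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

    Calling get_nodelay() on a freshly accepted descriptor is also a
    quick way to check whether a given kernel inherits the option.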

    I've included my patch as it currently stands.  This patch is
    against -stable.  With this patch tbench should work properly with 
    delayed acks turned on (as well as newreno).
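
    The delayed-ack part of the patch can be modeled in user space as
    follows.  This is only a sketch of the logic in the diff: struct
    conn_model, note_advertised_window() and should_delay_ack() are
    hypothetical stand-ins for the tcpcb, the flag update in
    tcp_output() and the DELAY_ACK() macro, and delack_enabled stands in
    for tcp_delack_enabled.

```c
#include <stdbool.h>

#define TF_RXWIN0SENT 0x40000   /* same value the patch adds to tcp_var.h */

/* Hypothetical, stripped-down stand-in for the relevant tcpcb state. */
struct conn_model {
    unsigned flags;             /* only TF_RXWIN0SENT is modeled */
    bool delack_timer_pending;  /* stand-in for callout_pending(tt_delack) */
};

static bool delack_enabled = true;  /* stand-in for tcp_delack_enabled */

/* Record whether the window we just advertised was 0. */
static void note_advertised_window(struct conn_model *c, long win)
{
    if (win == 0)
        c->flags |= TF_RXWIN0SENT;
    else
        c->flags &= ~TF_RXWIN0SENT;
}

/*
 * Delay the ack only if delayed acks are enabled, no delayed-ack timer
 * is already pending, and our last ack did not advertise a 0 window.
 */
static bool should_delay_ack(const struct conn_model *c)
{
    return delack_enabled && !c->delack_timer_pending &&
        (c->flags & TF_RXWIN0SENT) == 0;
}
```

    After note_advertised_window(&c, 0) the predicate returns false, so
    the ack that reopens the window goes out immediately.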

    There are still a couple of unresolved issues.  I noticed that when
    connecting locally TCP is non-optimal... when sending a 4100 byte
    data block it sends two 1460 byte packets (maxseg), then one
    1176 byte packet and one 4 byte packet.  The 1176 byte packet is
    sent in response to a received ack, causing the last bit of info
    to be written out using a small packet.  This only occurs on localhost
    connections due to the way the stack works.
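
    The packet sizes above can be reproduced with a little arithmetic.
    The sketch below ASSUMES that 4096 of the 4100 bytes were queued when
    the ack arrived (inferred only from 2 * 1460 + 1176 = 4096) and that
    a flush sends everything queued in maxseg-sized pieces.  It is a
    simplified model, not the kernel's logic, and flush_segments() is a
    hypothetical name.

```c
#include <stddef.h>

/*
 * Split `queued` bytes into segments of at most `maxseg` bytes, appending
 * each segment size to out[] starting at index n.  Returns the new count.
 * Stops early if out[] (capacity cap) fills up.
 */
static size_t flush_segments(size_t queued, size_t maxseg,
                             size_t *out, size_t n, size_t cap)
{
    while (queued > 0 && n < cap) {
        size_t seg = queued < maxseg ? queued : maxseg;

        out[n++] = seg;
        queued -= seg;
    }
    return n;
}
```

    Flushing 4096 bytes and then the remaining 4 yields 1460, 1460, 1176
    and 4, while flushing all 4100 bytes at once would yield three
    segments: 1460, 1460 and 1180.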

    I will be committing these to -current now and to -stable tomorrow.

    tbench results:

        test1           (from test1) - uses TCP's 16K receive & xmit buffers
        localhost       (from test1) - uses localhost's 48K buffers
        test2           (from test1) - uses TCP's 16K receive & xmit buffers
                                        (100BaseTX full duplex switch)

    delayed acks turned on (default)
    new reno turned on (default)

    ./tbench 1 test1
    Throughput 23.3951 MB/sec (NB=29.2439 MB/sec  233.951 MBit/sec)  1 procs
    ./tbench 1 localhost
    Throughput 29.6299 MB/sec (NB=37.0374 MB/sec  296.299 MBit/sec)  1 procs
    ./tbench 2 localhost
    Throughput 42.963 MB/sec (NB=53.7038 MB/sec  429.63 MBit/sec)  2 procs
    ./tbench 3 localhost
    Throughput 43.9328 MB/sec (NB=54.9161 MB/sec  439.328 MBit/sec)  3 procs

    ./tbench 1 test2
    Throughput 6.43315 MB/sec (NB=8.04144 MB/sec  64.3315 MBit/sec)  1 procs
    ./tbench 2 test2
    Throughput 8.94636 MB/sec (NB=11.183 MB/sec  89.4636 MBit/sec)  2 procs
    ./tbench 3 test2
    Throughput 9.82137 MB/sec (NB=12.2767 MB/sec  98.2137 MBit/sec)  3 procs

    With delayed acks turned off:

    ./tbench 1 test1
    Throughput 19.8444 MB/sec (NB=24.8055 MB/sec  198.444 MBit/sec)  1 procs
    ./tbench 1 localhost
    Throughput 26.1442 MB/sec (NB=32.6802 MB/sec  261.442 MBit/sec)  1 procs
    ./tbench 2 localhost
    Throughput 37.1861 MB/sec (NB=46.4826 MB/sec  371.861 MBit/sec)  2 procs
    ./tbench 3 localhost
    Throughput 37.5582 MB/sec (NB=46.9477 MB/sec  375.582 MBit/sec)  3 procs

    ./tbench 1 test2
    Throughput 6.32798 MB/sec (NB=7.90998 MB/sec  63.2798 MBit/sec)  1 procs
    ./tbench 2 test2
    Throughput 8.4896 MB/sec (NB=10.612 MB/sec  84.896 MBit/sec)  2 procs
    ./tbench 3 test2
    Throughput 9.57453 MB/sec (NB=11.9682 MB/sec  95.7453 MBit/sec)  3 procs
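
    To compare the two sets of runs, the relative gain from delayed acks
    can be computed from the MB/sec figures above.  This is a small
    sketch; struct run, runs[], pct_gain() and print_gains() are
    hypothetical names, and the numbers are copied from the results as
    posted.

```c
#include <stdio.h>

struct run {
    const char *cmd;
    double delack_on;   /* MB/sec with delayed acks on */
    double delack_off;  /* MB/sec with delayed acks off */
};

static const struct run runs[] = {
    { "tbench 1 test1",     23.3951, 19.8444 },
    { "tbench 1 localhost", 29.6299, 26.1442 },
    { "tbench 2 localhost", 42.963,  37.1861 },
    { "tbench 3 localhost", 43.9328, 37.5582 },
    { "tbench 1 test2",     6.43315, 6.32798 },
    { "tbench 2 test2",     8.94636, 8.4896  },
    { "tbench 3 test2",     9.82137, 9.57453 },
};

/* Percentage by which throughput with delayed acks exceeds without. */
static double pct_gain(double with_delack, double without_delack)
{
    return (with_delack - without_delack) / without_delack * 100.0;
}

/* Print one line per run with the relative gain. */
static void print_gains(void)
{
    size_t i;

    for (i = 0; i < sizeof(runs) / sizeof(runs[0]); i++)
        printf("%-20s %+.1f%%\n", runs[i].cmd,
            pct_gain(runs[i].delack_on, runs[i].delack_off));
}
```

    By this measure delayed acks are worth roughly 18% on the
    single-process test1 run and under 2% on the single-process test2
    run.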

                                        -Matt

Index: netinet/tcp_input.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_input.c,v
retrieving revision 1.107.2.18
diff -u -r1.107.2.18 tcp_input.c
--- netinet/tcp_input.c 2001/11/12 22:11:24     1.107.2.18
+++ netinet/tcp_input.c 2001/12/02 07:47:01
@@ -158,10 +158,15 @@
 #endif
 
 /*
- * Indicate whether this ack should be delayed.
+ * Indicate whether this ack should be delayed.  We can delay the ack if
+ *     - delayed acks are enabled and
+ *     - there is no delayed ack timer in progress and
+ *     - our last ack wasn't a 0-sized window.  We never want to delay
+ *       the ack that opens up a 0-sized window.
  */
 #define DELAY_ACK(tp) \
-       (tcp_delack_enabled && !callout_pending(tp->tt_delack))
+       (tcp_delack_enabled && !callout_pending(tp->tt_delack) && \
+       (tp->t_flags & TF_RXWIN0SENT) == 0)
 
 static int
 tcp_reass(tp, th, tlenp, m)
@@ -840,7 +845,7 @@
 #endif
                        tp = intotcpcb(inp);
                        tp->t_state = TCPS_LISTEN;
-                       tp->t_flags |= tp0->t_flags & (TF_NOPUSH|TF_NOOPT);
+                       tp->t_flags |= tp0->t_flags & (TF_NOPUSH|TF_NOOPT|TF_NODELAY);
 
                        /* Compute proper scaling value from buffer space */
                        while (tp->request_r_scale < TCP_MAX_WINSHIFT &&
Index: netinet/tcp_output.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_output.c,v
retrieving revision 1.39.2.11
diff -u -r1.39.2.11 tcp_output.c
--- netinet/tcp_output.c        2001/11/30 21:34:28     1.39.2.11
+++ netinet/tcp_output.c        2001/12/02 07:37:29
@@ -116,7 +116,9 @@
        u_char opt[TCP_MAXOLEN];
        unsigned ipoptlen, optlen, hdrlen;
        int idle, sendalot;
+#if 0
        int maxburst = TCP_MAXBURST;
+#endif
        struct rmxp_tao *taop;
        struct rmxp_tao tao_noncached;
 #ifdef INET6
@@ -268,28 +270,38 @@
        win = sbspace(&so->so_rcv);
 
        /*
-        * Sender silly window avoidance.  If connection is idle
-        * and can send all data, a maximum segment,
-        * at least a maximum default-size segment do it,
-        * or are forced, do it; otherwise don't bother.
-        * If peer's buffer is tiny, then send
-        * when window is at least half open.
-        * If retransmitting (possibly after persist timer forced us
-        * to send into a small window), then must resend.
+        * Sender silly window avoidance.   We transmit under the following
+        * conditions when len is non-zero:
+        *
+        *      - We have a full segment
+        *      - This is the last buffer in a write()/send() and we are
+        *        either idle or running NODELAY
+        *      - we've timed out (e.g. persist timer)
+        *      - we have more then 1/2 the maximum send window's worth of
+        *        data (receiver may be limited the window size)
+        *      - we need to retransmit
         */
        if (len) {
                if (len == tp->t_maxseg)
                        goto send;
-               if (!(tp->t_flags & TF_MORETOCOME) &&
-                   (idle || tp->t_flags & TF_NODELAY) &&
-                   (tp->t_flags & TF_NOPUSH) == 0 &&
-                   len + off >= so->so_snd.sb_cc)
+               /*
+                * NOTE! on localhost connections an 'ack' from the remote
+                * end may occur synchronously with the output and cause
+                * us to flush a buffer queued with moretocome.  XXX
+                *
+                * note: the len + off check is almost certainly unnecessary.
+                */
+               if (!(tp->t_flags & TF_MORETOCOME) &&   /* normal case */
+                   (idle || (tp->t_flags & TF_NODELAY)) &&
+                   len + off >= so->so_snd.sb_cc &&
+                   (tp->t_flags & TF_NOPUSH) == 0) {
                        goto send;
-               if (tp->t_force)
+               }
+               if (tp->t_force)                        /* typ. timeout case */
                        goto send;
                if (len >= tp->max_sndwnd / 2 && tp->max_sndwnd > 0)
                        goto send;
-               if (SEQ_LT(tp->snd_nxt, tp->snd_max))
+               if (SEQ_LT(tp->snd_nxt, tp->snd_max))   /* retransmit case */
                        goto send;
        }
 
@@ -688,6 +700,20 @@
        if (win > (long)TCP_MAXWIN << tp->rcv_scale)
                win = (long)TCP_MAXWIN << tp->rcv_scale;
        th->th_win = htons((u_short) (win>>tp->rcv_scale));
+
+       /*
+        * Adjust the RXWIN0SENT flag - indicate that we have advertised
+        * a 0 window.  This may cause the remote transmitter to stall.  This
+        * flag tells soreceive() to disable delayed acknowledgements when
+        * draining the buffer.  This can occur if the receiver is attempting
+        * to read more data then can be buffered prior to transmitting on
+        * the connection.
+        */
+       if (win == 0)
+               tp->t_flags |= TF_RXWIN0SENT;
+       else
+               tp->t_flags &= ~TF_RXWIN0SENT;
+
        if (SEQ_GT(tp->snd_up, tp->snd_nxt)) {
                th->th_urp = htons((u_short)(tp->snd_up - tp->snd_nxt));
                th->th_flags |= TH_URG;
Index: netinet/tcp_var.h
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_var.h,v
retrieving revision 1.56.2.8
diff -u -r1.56.2.8 tcp_var.h
--- netinet/tcp_var.h   2001/08/22 00:59:13     1.56.2.8
+++ netinet/tcp_var.h   2001/12/01 21:40:46
@@ -95,6 +95,7 @@
 #define        TF_SENDCCNEW    0x08000         /* send CCnew instead of CC in SYN */
 #define        TF_MORETOCOME   0x10000         /* More data to be appended to sock */
 #define        TF_LQ_OVERFLOW  0x20000         /* listen queue overflow */
+#define TF_RXWIN0SENT  0x40000         /* sent a receiver win 0 in response */
        int     t_force;                /* 1 if forcing out a byte */
 
        tcp_seq snd_una;                /* send unacknowledged */
