Hi,

bluhm and I made some network performance measurements and did some kernel profiling.
Setup:	Linux (iperf) -10gbit-> OpenBSD (relayd) -10gbit-> Linux (iperf)

We figured out that the kernel uses a huge amount of processing time for sending ACKs back to the sender on the receiving interface.  After receiving a data segment, we send out two ACKs.  The first one is sent in tcp_input() directly after receiving.  The second ACK is sent out after the userland or the sosplice task has read some data out of the socket buffer.

The first ACK in tcp_input() is sent after receiving every other data segment, as described in RFC 1122:

    4.2.3.2  When to Send an ACK Segment

	A TCP SHOULD implement a delayed ACK, but an ACK should not be
	excessively delayed; in particular, the delay MUST be less than
	0.5 seconds, and in a stream of full-sized segments there SHOULD
	be an ACK for at least every second segment.

This advice is based on the paper "Congestion Avoidance and Control":

    4  THE GATEWAY SIDE OF CONGESTION CONTROL

	The 8 KBps senders were talking to 4.3+BSD receivers which would
	delay an ack for at most one packet (because of an ack's clock
	role, the authors believe that the minimum ack frequency should
	be every other packet).

Sending the first ACK (on every other packet) costs us too much processing time.  Thus, we run into a full socket buffer earlier.  The first ACK just acknowledges the received data, but does not update the window.  The second ACK, caused by the socket buffer reader, also acknowledges the data and additionally updates the window.  So, the second ACK is much more valuable for fast packet processing than the first one.

The performance improvement is 33% with splicing and 20% without splicing:

		splicing	relaying
current		3.1 GBit/s	2.6 GBit/s
w/o first ack	4.1 GBit/s	3.1 GBit/s

As far as I understand the implementations of other operating systems: Linux has implemented a custom TCP_QUICKACK socket option to turn this kind of feature on and off.  FreeBSD and NetBSD still depend on it when using the New Reno implementation.
The following diff turns off the direct ACK on every other segment.  We have been running this diff in production on our own machines at genua and on our products for several months now.  We haven't noticed any problems, either with interactive network sessions (ssh) or with bulk traffic.

Another solution could be a sysctl(3) or an additional socket option, similar to Linux, to control this behavior per socket or system wide.  Also, a counter to ACK every 3rd, 4th, ... data segment could mitigate the problem.

bye,
Jan

Index: netinet/tcp_input.c
===================================================================
RCS file: /cvs/src/sys/netinet/tcp_input.c,v
retrieving revision 1.365
diff -u -p -r1.365 tcp_input.c
--- netinet/tcp_input.c	19 Jun 2020 22:47:22 -0000	1.365
+++ netinet/tcp_input.c	5 Nov 2020 23:00:34 -0000
@@ -165,8 +165,8 @@ do { \
 #endif
 
 /*
- * Macro to compute ACK transmission behavior.  Delay the ACK unless
- * we have already delayed an ACK (must send an ACK every two segments).
+ * Macro to compute ACK transmission behavior.  Delay the ACK until
+ * a read from the socket buffer or the delayed ACK timer causes one.
  * We also ACK immediately if we received a PUSH and the ACK-on-PUSH
  * option is enabled or when the packet is coming from a loopback
  * interface.
@@ -176,8 +176,7 @@ do { \
 	struct ifnet *ifp = NULL;				\
 	if (m && (m->m_flags & M_PKTHDR))			\
 		ifp = if_get(m->m_pkthdr.ph_ifidx);		\
-	if (TCP_TIMER_ISARMED(tp, TCPT_DELACK) ||		\
-	    (tcp_ack_on_push && (tiflags) & TH_PUSH) ||		\
+	if ((tcp_ack_on_push && (tiflags) & TH_PUSH) ||		\
	    (ifp && (ifp->if_flags & IFF_LOOPBACK)))		\
 		tp->t_flags |= TF_ACKNOW;			\
 	else							\