Hi,

We are experiencing a limit on the maximum amount of data in flight which does not seem to be related to the receive or write buffers. Does anybody know of an issue that would cap a single TCP flow at around 1 MByte (or sometimes 2 MByte) of data in flight?
It seems to be a strict and stable limit, independent of the congestion control algorithm (tested with Cubic, Reno and DCTCP). On a link of 200 Mbps and 200 ms RTT, a single TCP flow utilizes only 20% (sometimes 40%, see conditions below) of the link, with no drops experienced at all (the bottleneck is not in the AQM or the RTT emulation, as they support more throughput when multiple flows are active).

Some configuration changes we already tried on both client and server (kernel 3.18.9):

  net.ipv4.tcp_no_metrics_save = 1
  net.ipv4.tcp_rmem = 4096 87380 6291456
  net.ipv4.tcp_wmem = 4096 16384 4194304

SERVER# ss -i
tcp  ESTAB  0  1049728  10.187.255.211:46642  10.187.16.194:ssh
     dctcp wscale:7,7 rto:408 rtt:204.333/0.741 ato:40 mss:1448 cwnd:1466
     send 83.1Mbps unacked:728 rcv_rtt:212 rcv_space:29200

CLIENT# ss -i
tcp  ESTAB  0  288  10.187.16.194:ssh  10.187.255.211:46642
     dctcp wscale:7,7 rto:404 rtt:203.389/0.213 ato:40 mss:1448 cwnd:78
     send 4.4Mbps unacked:8 rcv_rtt:204 rcv_space:1074844

When we increase the write and receive memory further (they were already well above 1 or 2 MByte), the limit steps to double (40% utilization; 2 MByte in flight):

  net.ipv4.tcp_no_metrics_save = 1
  net.ipv4.tcp_rmem = 4096 8000000 16291456
  net.ipv4.tcp_wmem = 4096 8000000 16291456

SERVER# ss -i
tcp  ESTAB  0  2068976  10.187.255.212:54637  10.187.16.112:ssh
     cubic wscale:8,8 rto:404 rtt:202.622/0.061 ato:40 mss:1448 cwnd:1849
     ssthresh:1140 send 105.7Mbps unacked:1457 rcv_rtt:217.5 rcv_space:29200

CLIENT# ss -i
tcp  ESTAB  0  648  10.187.16.112:ssh  10.187.255.212:54637
     cubic wscale:8,8 rto:404 rtt:201.956/0.038 ato:40 mss:1448 cwnd:132
     send 7.6Mbps unacked:18 rcv_rtt:204 rcv_space:2093044

Increasing further (x10) does not help anymore:

  net.ipv4.tcp_no_metrics_save = 1
  net.ipv4.tcp_rmem = 4096 80000000 162914560
  net.ipv4.tcp_wmem = 4096 80000000 162914560

As all these parameters autotune, it is hard to find out which one is limiting. In the examples above, unacked does not want to go higher, while the congestion window on the server is big enough.
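For reference, a quick back-of-the-envelope check of the bandwidth-delay product of our path (my own arithmetic, not part of the traces above), which shows why the observed caps translate to exactly 20% and 40% utilization:

```python
# Bandwidth-delay product of the path described above (200 Mbps, 200 ms RTT):
# this is how much data must be in flight to fill the link with one flow.
link_bits_per_s = 200e6
rtt_s = 0.200
bdp_bytes = link_bits_per_s / 8 * rtt_s

mss = 1448  # as reported by ss above
print(int(bdp_bytes))        # 5000000 bytes, i.e. ~5 MByte
print(int(bdp_bytes / mss))  # 3453 MSS-sized segments of cwnd needed

# A hard cap of 1 MByte in flight therefore yields ~1/5 = 20% utilization,
# and 2 MByte yields ~40% -- matching the figures we observe.
```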
rcv_space could be limiting, but it tunes up if we change the server to the higher buffers (switching to 2 MByte in flight). We also tried tcp_limit_output_bytes, setting it both bigger (x10) and smaller (/10), without effect. We put it in /etc/sysctl.conf and rebooted, to make sure it was effective.

Some more detailed tests that had an effect on the 1 or 2 MByte limit:

- It seems that with TSO off, if we configure a bigger wmem buffer, an ongoing flow is suddenly able to immediately double its bytes-in-flight limit. We configured further, up to more than 10x the buffer, but no further increase helps, and the only limits we saw are 1 MByte and 2 MByte (no intermediate values, whatever the parameters). When setting tcp_wmem smaller again, the 2 MByte limit stays on the ongoing flow; we have to restart the flow to make the buffer reduction to 1 MByte effective.

- With TSO on, only the 2 MByte limit is effective, independent of the wmem buffer. We have to restart the flow to make a TSO change effective.

Koen.
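P.S. To see which limit is being hit while a flow runs, we track unacked * mss from the "ss -i" output. A small sketch of that bookkeeping (the parsing helper is my own, not a standard tool):

```python
import re

# Illustrative helper: pull mss and unacked out of one "ss -i" detail line
# and report the bytes currently in flight (unacked segments * mss).
def bytes_in_flight(ss_line):
    mss = int(re.search(r"mss:(\d+)", ss_line).group(1))
    unacked = re.search(r"unacked:(\d+)", ss_line)
    return (int(unacked.group(1)) if unacked else 0) * mss

# The first server trace above:
line = ("dctcp wscale:7,7 rto:408 rtt:204.333/0.741 ato:40 mss:1448 "
        "cwnd:1466 send 83.1Mbps unacked:728 rcv_rtt:212 rcv_space:29200")
print(bytes_in_flight(line))  # 1054144 bytes -- the ~1 MByte limit
```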