Hello,

I apologize if this is the wrong list for this bug report; I did not find a more specific list in the MAINTAINERS file. I think this is a kernel issue and not an issue with my distro, but if you disagree I can redirect this report as appropriate.
I am upgrading some Linux 4.2 servers to Linux 4.4 (Ubuntu Xenial), and during testing I am very occasionally observing TCP segment retransmits on the loopback device, leading to 200ms latency spikes. I don't observe the issue on non-loopback devices, and I believe I've narrowed it down to an issue with qdiscs on loopback. It seems that when a queuing discipline other than noqueue is attached to the loopback device in 4.4+ kernels, packets will (very occasionally) get dropped completely, leading to a retransmit. I'm not sure how this can happen, and I've been trying to figure out what's going on, but if anyone has any pointers or suggestions I'd very much appreciate them.

I've attached the script I'm using to reproduce the bug and an example ab run that I believe shows the bug. In particular, the max timings of 200ms in the ab output and the TCP segment retransmits (and sometimes RSTs) in the tcpdump output are what indicate the issue to me.

I have tested on 3.13, 4.2 and 4.4 kernels, and only 4.4 shows the issue. Furthermore, non-loopback interfaces don't appear to have the bug. So I ran git diff v4.2..v4.4 drivers/net/loopback.c, and the only commit that seems to touch loopback.c is e65db2b7. I'm attempting to revert the change and recompile to see if that commit triggers the bug, but I don't understand why that change would break things in this way, so that's just a guess.

I'm continuing to try to debug this, but I figured it would be a good idea to report it here in case someone with more familiarity knows what's going on. Please let me know if there is any additional information I can provide or tests I can run.

Thank you,
-Joey Lynch
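
For anyone who wants to check for the retransmits without tcpdump, one option is to sample the kernel's TCP counters before and after a run. The following is only a minimal sketch, not part of the attached repro script: it assumes the usual /proc/net/snmp layout (a "Tcp:" header line followed by a "Tcp:" value line) and that the RetransSegs field is present. The helper names are made up for illustration. Note that RetransSegs is a host-wide counter, so it only isolates loopback traffic if nothing else on the machine is retransmitting during the test.

```python
from pathlib import Path


def parse_tcp_counters(snmp_text):
    """Return the Tcp: counters from /proc/net/snmp-style text as a dict."""
    tcp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Tcp:")]
    if len(tcp_lines) != 2:
        raise ValueError("expected one Tcp: header line and one Tcp: value line")
    header, values = tcp_lines
    # Skip the leading "Tcp:" token on both lines, then pair names with values.
    return {name: int(value) for name, value in zip(header[1:], values[1:])}


def retransmitted_segments(before_text, after_text):
    """Number of TCP segments retransmitted between two snapshots."""
    before = parse_tcp_counters(before_text)
    after = parse_tcp_counters(after_text)
    return after["RetransSegs"] - before["RetransSegs"]


def read_snmp(path="/proc/net/snmp"):
    """Read the current counter snapshot (the path is the assumed default)."""
    return Path(path).read_text()
```

Calling read_snmp() once before and once after the ab run, then passing both strings to retransmitted_segments(), gives the number of segments retransmitted during the run. A non-zero delta on an otherwise idle host would be consistent with the drops described above.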
This is ApacheBench, Version 2.3 <$Revision: 1528965 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 127.0.0.1 (be patient)
Completed 10000 requests
Completed 20000 requests
Completed 30000 requests
Completed 40000 requests
Completed 50000 requests
Completed 60000 requests
Completed 70000 requests
Completed 80000 requests
Completed 90000 requests
Completed 100000 requests
Finished 100000 requests

Server Software:        nginx/1.11.3
Server Hostname:        127.0.0.1
Server Port:            80

Document Path:          /
Document Length:        612 bytes

Concurrency Level:      100
Time taken for tests:   10.636 seconds
Complete requests:      100000
Failed requests:        0
Total transferred:      84500000 bytes
HTML transferred:       61200000 bytes
Requests per second:    9401.97 [#/sec] (mean)
Time per request:       10.636 [ms] (mean)
Time per request:       0.106 [ms] (mean, across all concurrent requests)
Transfer rate:          7758.46 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    2   1.9      2       8
Processing:     2    9   2.8      9     206
Waiting:        1    8   3.1      8     205
Total:          6   11   1.6     11     206

Percentage of the requests served within a certain time (ms)
  50%     11
  66%     11
  75%     11
  80%     11
  90%     12
  95%     12
  98%     14
  99%     14
 100%    206 (longest request)
Attachment: repro.sh (Bourne shell script)
This is ApacheBench, Version 2.3 <$Revision: 1528965 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 127.0.0.1 (be patient)
Completed 10000 requests
Completed 20000 requests
Completed 30000 requests
Completed 40000 requests
Completed 50000 requests
Completed 60000 requests
Completed 70000 requests
Completed 80000 requests
Completed 90000 requests
Completed 100000 requests
Finished 100000 requests

Server Software:        nginx/1.11.3
Server Hostname:        127.0.0.1
Server Port:            80

Document Path:          /
Document Length:        612 bytes

Concurrency Level:      100
Time taken for tests:   7.423 seconds
Complete requests:      100000
Failed requests:        0
Total transferred:      84500000 bytes
HTML transferred:       61200000 bytes
Requests per second:    13471.01 [#/sec] (mean)
Time per request:       7.423 [ms] (mean)
Time per request:       0.074 [ms] (mean, across all concurrent requests)
Transfer rate:          11116.21 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.5      0       5
Processing:     1    7   0.9      7      12
Waiting:        1    7   0.9      7      12
Total:          4    7   0.6      7      12

Percentage of the requests served within a certain time (ms)
  50%      7
  66%      7
  75%      7
  80%      7
  90%      7
  95%      8
  98%     10
  99%     11
 100%     12 (longest request)
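
The two ab runs above differ mainly in the tail: the longest request is 206 ms in the first and 12 ms in the second. When repeating the test many times, that tail can be pulled out of the ab output mechanically. This is a rough sketch under stated assumptions: the function names are hypothetical, and the 200 ms default threshold is my own choice, picked because the spikes sit right at Linux's 200 ms minimum retransmission timeout. It only looks at the "100% ... (longest request)" line that ab prints.

```python
import re


def longest_request_ms(ab_output):
    """Return the '100%' (longest request) latency in ms from ab output."""
    # Not anchored to line start, so it also works on whitespace-flattened output.
    match = re.search(r"100%\s+(\d+)\s+\(longest request\)", ab_output)
    if match is None:
        raise ValueError("no '100% ... (longest request)' line found")
    return int(match.group(1))


def looks_like_rto_stall(ab_output, threshold_ms=200):
    """True if the slowest request took at least threshold_ms (assumed cutoff)."""
    return longest_request_ms(ab_output) >= threshold_ms
```

Feeding it the two outputs above would flag the first run and not the second. A slow request can of course have other causes, so this is only a quick filter before looking at tcpdump.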