On Tue, Aug 16, 2016 at 5:46 AM, Richard Cochran <richardcoch...@gmail.com> wrote:
> Jouni,
>
> If I understand the test correctly, then the slightly different kernel
> timer behavior is ok, but the test isn't quite right.  Let me explain
> what I mean.
>
> First off, reading test_ap_wps.py, the point of the test is to see if
> ten simultaneous connections are possible.  I guess the server
> implements a hard coded limit on the number of clients.  (BTW where is
> the server loop?)
>
> You said that the server also sets 'backlog' to ten.  The backlog
> controls the size of the queue holding incoming connections that are
> in the SYN_RCVD or ESTABLISHED state but have not yet been
> accept(2)-ed by the server.  This is *not* the same as the number of
> possible simultaneous connections.
>
> On Sat, Aug 13, 2016 at 12:12:26PM +0300, Jouni Malinen wrote:
>> Yes, it looks like a TCP connect() timeout. I use a significantly
>> reduced timeout in the test scripts since they are run unattended and
>> are supposed to terminate in a reasonable amount of time. That said,
>
> I did not find where the client sets the one second timeout.  Where
> does this happen?
>
>> If I increase that 20 to 50, I get more of such about 1.03 second
>> results at i=17, i=34, i=48..
>
> Can you provide the timings when the test runs on the older kernel?
>
>> Looking more at what exactly is happening at the TCP layer, this is
>> likely related to the server behavior since listen() backlog is set to
>> 10 and if there are 10 parallel connections, the last one is
>> immediately closed before reading anything.
>
> To clarify, when the backlog is exceeded, the new connection is not
> closed.  Instead, the SYN is simply ignored, and the client is expected
> to re-transmit the SYN in the normal TCP fashion.
>
>> Looking at a sniffer capture (*), the three-way TCP connection goes
>> through fine for the first 15 connect() calls, but the 15th one does
>> not get a response to SYN. This SYN is the frame 47 in the capture
>> file with srcport == 60802. There is no SYN,ACK for it. The about one
>> second unexpected time for connect() comes from this, i.e., the
>> connection is completed only after the client side does TCP
>> retransmission of the SYN (frame #77) a second later and the server
>> side replies with RST,ACK (frame #78).
>
> This is the expected behavior.
>
>> So it looks like the issue is in one of the SYN,ACK frames getting
>> completely lost..
>
> No, the frame is not missing.  It was never sent because the backlog
> was exceeded.
>
> Here is what I suspect is happening.  By sending 20 SYN frames to a
> port with a backlog of 10, it saturates the queue.  One SYN is ignored
> by the kernel, and a race begins between the connect() timeout and the
> SYN re-transmission.  If the client's re-transmitted SYN and then the
> server's SYN,ACK return before the connect timeout, then the call to
> connect() succeeds.  With the new timer wheel, the result of the race
> is different.
>
> There are a couple of ways to deal with this.  One is to increase the
> backlog on the server side.  Another is to increase the connect()
> timeout to a multiple of the re-transmission interval.
>
> Thoughts?
>

I am coming late to the party, but yes, the test looks flaky: it relies
on very precise SYN retransmit timing when the listen backlog on the
server side is full.
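
For illustration only, here is a rough sketch of Richard's second
suggestion (sizing the connect() timeout from the SYN re-transmission
interval instead of racing against it). It is not taken from
test_ap_wps.py. The names connect_timeout() and connect_with_budget()
and the 0.5 second margin are made up for this example, and the one
second initial interval with doubling on each retry is an assumption
based on the capture above and usual TCP behavior.

```python
import socket

# Assumption: the first SYN retransmission happens about 1 second after
# the original SYN (as seen in the capture) and the interval doubles on
# each further retry.
INITIAL_RTO = 1.0


def connect_timeout(retransmits=1, initial_rto=INITIAL_RTO, margin=0.5):
    """Return a connect() timeout that leaves room for `retransmits`
    SYN retransmissions plus a safety margin (hypothetical helper)."""
    # With doubling backoff, the k-th retransmission (k = 1, 2, ...) goes
    # out initial_rto * (2**k - 1) seconds after the first SYN:
    # 1 s, 3 s, 7 s, ... for initial_rto = 1.
    last_retransmit = initial_rto * (2 ** retransmits - 1)
    return last_retransmit + margin


def connect_with_budget(host, port, retransmits=1):
    """Connect with a timeout that tolerates `retransmits` ignored SYNs
    (hypothetical helper, not part of the existing test scripts)."""
    return socket.create_connection(
        (host, port), timeout=connect_timeout(retransmits))
```

With these assumptions, connect_timeout(1) gives 1.5 seconds, which is
enough for one dropped SYN; a test that may hit the full backlog twice
in a row would want connect_timeout(2), i.e. 3.5 seconds.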