Hi Eric,

Ran a few more tests yesterday with packet captures, including a capture on the client. It turns out that the client stops ACKing entirely at some point in the conversation - the last advertised client window is not even close to zero (it's actually ~348K). So there's complete radio silence from the client for some reason, even though it does send back ACKs early on in the conversation. So yes, as far as the server is concerned, the client is completely gone, and the tcp_retries2 limit is rightfully hit eventually once enough server retransmissions have gone unanswered for long enough.
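
For a sense of scale, here is a rough sketch of how long the server keeps retransmitting before it gives up. It assumes pure exponential backoff starting at the Linux minimum RTO of 200 ms and capped at 120 s, with tcp_retries2 at its default of 15. The function name and parameters are made up for illustration; the real RTO is derived from the measured RTT, so treat the result as a lower bound rather than what the kernel will actually do on a given connection.

```python
def retransmit_budget(retries=15, rto_min=0.2, rto_max=120.0):
    """Approximate seconds before the sender gives up on an unresponsive peer.

    Assumption: the RTO starts at rto_min and doubles after every
    unanswered retransmission, capped at rto_max.
    """
    total = 0.0
    rto = rto_min
    # retries + 1 intervals: the wait before each retransmission, plus the
    # final wait after the last retransmission goes unanswered.
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max)
    return total


print(round(retransmit_budget(), 1))  # 924.6
```

So with default settings the server should go roughly 15 minutes without a single ACK before the connection is torn down.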

What's odd though is the packet capture on the client shows the server retrans packets arriving, so it's not like the segments don't reach the client. I'll keep investigating, but if you (or anyone else reading this) know of circumstances that might cause this, I'd appreciate any tips on where/what to look at.

Thanks

On Wed, Nov 1, 2017 at 7:06 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> On Wed, 2017-11-01 at 22:22 +0000, Vitaly Davidovich wrote:
>> Eric,
>>
>> Yes I agree. However the thing I’m still puzzled about is the client
>> application is not reading/draining the recvq - ok, the client tcp
>> stack should start advertising a 0 window size. Does a 0 window size
>> count against the tcp_retries2? Is that what you were alluding to in
>> your first reply?
>
> Every time we receive an (valid) ACK, with a win 0 or not, the counter
> of attempts is cleared, given the opportunity for the sender to send 15
> more probes.
>
>> If it *does* count towards the retries limit then a RST doesn’t seem
>> like a bad idea. The client is responding with segments but the user
>> app there just isn’t draining the data. Presumably that RST has a
>> good chance of reaching the client and then unblocking the read()
>> there with a peer reset error. Or am I missing something?
>>
>> If it doesn’t count towards the limit then I need to figure out why
>> the 0 window size segments weren’t being sent by the client.
>
> Yes please :)
>
>> I will try to double check that the client was indeed advertising 0
>> window size. There’s nothing special about that machine - it’s a
>> 4.1.35 kernel as well. I wouldn’t expect the tcp stack there to be
>> unresponsive just because the user app is sleeping.
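
A toy model of the counter behavior Eric describes in the quoted reply: any valid ACK, zero window or not, clears the attempt counter, so only an unbroken run of unanswered probes or retransmissions can exhaust the limit. This is a deliberate simplification and not kernel code; the function name, the event labels, and the exact comparison against the limit are assumptions made for illustration.

```python
def survives(events, max_attempts=15):
    """Return True if the connection is still alive after `events`.

    events: iterable of "ack" (any valid ACK, zero window or not) or
    "timeout" (a probe or retransmission that went unanswered).
    """
    attempts = 0
    for ev in events:
        if ev == "ack":
            attempts = 0  # any valid ACK clears the counter
        elif ev == "timeout":
            attempts += 1
            if attempts > max_attempts:
                return False  # sender gives up on the connection
        else:
            raise ValueError(f"unknown event: {ev!r}")
    return True


# A client that keeps advertising a zero window still ACKs, so it survives.
print(survives((["timeout"] * 15 + ["ack"]) * 3))  # True
# A client that goes completely silent eventually exhausts the limit.
print(survives(["timeout"] * 16))  # False
```

The second case is what the captures show here: no ACKs at all, so nothing ever resets the counter.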