> > When I stop and start the Java application, all the new outbound
> > connections still get stuck in SYN_SENT state.
>
> Is it so that they don't timeout at all? You can collect some of their
> state from /proc/net/tcp (shows at least timers and attempt counters)....
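For reference, a minimal sketch of how the SYN_SENT sockets could be pulled
out of /proc/net/tcp together with their timer and retransmit columns; the
state value (02 for SYN_SENT) and the field positions are assumptions based
on the usual /proc/net/tcp layout, not something verified on this particular
kernel:

    # Show only SYN_SENT sockets (st == 02) with their timer (tr:tm->when)
    # and retransmit-count (retrnsmt) columns; field positions follow the
    # usual /proc/net/tcp layout and may need adjusting.
    awk 'NR > 1 && $4 == "02" { print $2, $3, $6, $7 }' /proc/net/tcp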
The outbound connections do timeout. I've watched them send tcp_syn_retries
SYN packets before eventually timing out.

> Are you sure that you just don't get unlucky at some point of time and
> all 200 available threads are just temporarily stuck and your application
> is just very slowly progressing then?

Yeah, I'm sure that it isn't an unlucky point of time. If I restart the
application when this problem occurs, all the outbound connections still fail.

> > For a long time, the only thing that would resolve this was rebooting
> > the entire machine. Once I did this, the outbound connections could
> > be made successfully.
>
> To the very same hosts? Or to another set of hosts?

Yes, to the same exact set of hosts.

> > And the problem almost instantaneously resolved itself and outbound
> > connection attempts were successful.
>
> New or the pending ones?

I'm fairly sure that sockets that were already in SYN_SENT state when I
turned tcp_sack off started to work, as the count of sockets in SYN_SENT
state drops very rapidly.

> > In my case, it worked fine for about 38 hours before hitting a
> > wall where no outbound connections could be made.
>
> How accurate is that number? Is the lockup somehow related to daytime cycle?

It is 38 hours +/- half an hour or so. It isn't related to the time of day;
it happens throughout the day and night depending on when the server was
restarted.

A new development in this area: after the first 38 hours of system uptime
the problem occurred, so I disabled tcp_sack; the problem cleared itself up
and outbound connections were successful again. After a couple of hours I
re-enabled tcp_sack, and the next SYN_SENT issue didn't occur until more
than 50 hours later (roughly 90 hours after system start). It's as if the
first time it occurs and I turn tcp_sack off, it doesn't just reset the
clock for another 38 hours, but gives even more time until the problem
occurs again.

> > Is there a kernel buffer or some data structure that tcp_sack uses
> > that gets filled up after an extended period of operation?
>
> SACK has pretty little meaning in context of SYNs, there's only the
> sackperm(itted) TCP option which is sent along with the SYN/SYN-ACK.
>
> The SACK scoreboard is currently included to the skbs (has been like
> this for very long time), so no additional data structures should be
> there because of SACK...

I've been seeing this problem for about 4 years, so could it be related to
the scoreboard implementation somehow?

> /proc/net/tcp couple of times in a row, try something like this:
>
> for i in $(seq 1 40); do cat /proc/net/tcp; echo "-----"; sleep 10; done

I can set this up to run the next time the problem occurs.

> > I'm running kernel 2.6.18 on RedHat, but have had this problem occur
> > on earlier kernel versions (all 2.4 and 2.6).
>
> I've done some fixes to SACK processing since 2.6.18 (not sure if RedHat
> has backported them). Though they're not that critical nor should anything
> in them affect the SYN_SENT state.

Ok, unless there is direct evidence that there is a fix to this problem in
a later kernel, I won't be able to upgrade. If there is a RedHat-provided
patch, I can probably apply that.
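For completeness, the SACK toggle I apply when the lockup hits is essentially
the following; the exact sysctl invocations are reconstructed here for
illustration rather than pasted from a script:

    # Check how many SYNs are sent before connect() gives up (typically 5).
    cat /proc/sys/net/ipv4/tcp_syn_retries

    # Work around the lockup: disable SACK and let the SYN_SENT backlog drain...
    sysctl -w net.ipv4.tcp_sack=0

    # ...then re-enable it a couple of hours later.
    sysctl -w net.ipv4.tcp_sack=1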