Hi All, Apologies for the length of this post/question. This covers a problem that I have been attempting to debug for the past few months ...
I am having an on-going problem with JMeter where I get connection time out failures once load passes a threshold on our load generation machines. The error rate is 0.5-1% of total number of transactions. Packet captures show JMeter sometimes not closing connections correctly and then re-using the port sometime later but before the other end has cleaned up the half-open connection. (I am also open to the possibility that this is a symptom rather than a cause). Further details: In this particular test scenario, I am testing a load-balanced application that is fronted by a VIP hosted on an F5. >From the trace mentioned above, at loads above some threshold (yet to be determined) when the F5 initiates the connection tear-down the JMeter machine responds with an ACK but never sends a FIN to close its side of the connection. This leaves a half-open connection on the F5 but a closed connection on the JMeter machine. Some time later after the timed-wait period (2 mins) is passed but before the idle connection clean up job runs on the F5 (5 mins); a connection is initiated on the same local port. The F5 sees this as further traffic on a connection it thought it closed and re-sends its FIN. This confuses the JMeter side that thinks it is standing up a new connection and re-sends its SYN. This repeats with exponential back-off 7 times as per the TCP spec. After this the connection is reported as failing with a connection time of ~127 seconds. To prove this is the probable cause I was able to temporarily get the clean up timeout on the F5 reduced to 90 seconds. In this scenario with the same applied load we saw no connection timeout errors. The load machine is RedHat 7.7 running as far as I am aware, the standard defaults (the host was built and maintained by a 3rd party provider). I have tried JMeter 5.0 (the current version we use at work) and 5.2.1, again running vanilla here other than setting a large JVM heap size. No changes to HTTPClient config, JVM args etc. The test is triggered from Jenkins although I doubt this has any impact as I am just starting JMeter as I would on the command line. Jenkins just helps with pulling the results back through many jumps. I have tried on the HotSpot 1.8.0_51 and openJDK 1.8.0_232 JVMs, both give the same behaviour. CPU, memory and GC activity all look low to normal during the test. The JMeter host is a virtual machine with 4 cpus and 16GB ram. And to discount issues with the host application, I have run the same load across multiple machines (on the same VLAN etc) and everything works as expected. During the test at the point where the errors are being logged, there are still over 21k ephemeral ports available. The process is peaking at ~1500 threads although I'm not sure if this is being reported correctly by RedHat. I was expecting around 6-7x that number as I have configured the script to download embedded resources with 6 threads. The machine is peaking at ~7500 open files which matches roughly the number of open connections reported by ss as expected. I see in the documentation for Apache HttpClient there is a section on connection management that seems to describe the behaviour displayed. http://hc.apache.org/httpcomponents-client-4.3.x/tutorial/html/connmgmt.html#d5e405 "One of the major shortcomings of the classic blocking I/O model is that the network socket can react to I/O events only when blocked in an I/O operation. When a connection is released back to the manager, it can be kept alive however it is unable to monitor the status of the socket and react to any I/O events. If the connection gets closed on the server side, the client side connection is unable to detect the change in the connection state (and react appropriately by closing the socket on its end)." It suggests running a separate thread to periodically clean up the connection pools by calling HttpClientConnectionManager.closeExpiredConnections();. I can see this currently gets called but only at the start of a new thread iteration. This leads to some obvious questions: - is this behaviour anyone has seen before? - is this a bug with JMeter/HTTP Client? - is this maybe a bug with Java? or potentially RedHat reusing ports too aggressively? - Any suggestions for a work-around or further diagnosis? Cheers, Owen
