On Wed, 16 Jun 2021 13:53:38 GMT, Daniel Fuchs <dfu...@openjdk.org> wrote:

> Hi, 
> 
> Please find below a test-only change to fix some intermittent failures 
> observed with the httpclient/websocket tests:
> these tests intermittently and randomly fail with ENOBUFS ("No buffer space 
> available").
> 
> Some machines in our CI seem to allow a higher level of concurrency while 
> being (maybe) configured with lower system resources (such as available 
> buffer space for the TCP stack).
> 
> Some of the httpclient/websocket tests attempt to fill the socket buffers in 
> order to assert some conditions when the buffers are full and writing is 
> paused. When the test process terminates, this leaves behind TCP sockets in 
> the TIME_WAIT state that still hold system buffer resources in case 
> retransmission is needed. When several such tests are run this ends up 
> causing random "No buffer space available" errors on other tests (including 
> these tests themselves) running concurrently or shortly after on the same 
> machine.
> 
> This change implements a few tricks to alleviate the situation:
>  - configure the tests with smaller send buffers on the client side and 
> receive buffers on the server side, in order to limit how much buffer space 
> is consumed by the test.
>  - when the not-reading server is closed, and before the accepted socket is 
> closed, read all available data off the socket buffer in order to free up the 
> buffer space that the test has consumed before closing the socket.
>  - in some tests that create a large number of HttpClients, limit the number 
> of clients created in shared client mode, and add a call to System.gc() and a 
> small pause to give the GC time to collect the old clients which are no 
> longer referenced.
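
For illustration, the first two mitigations in the list above (smaller buffers, and draining the accepted socket before closing it) could look roughly like the sketch below. This is not the code from the PR: the class name `DrainBeforeClose`, the `drain` helper, the 8 KiB buffer size, and the 500 ms idle timeout are all assumptions made up for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical sketch; names and sizes are illustrative, not taken from the PR.
class DrainBeforeClose {
    // Assumed small buffer size; the OS treats it as a hint and may adjust it.
    static final int SMALL_BUFFER = 8 * 1024;

    // Reads and discards whatever is queued on the socket, stopping at EOF or
    // after the socket has been idle for idleTimeoutMillis.
    static long drain(Socket socket, int idleTimeoutMillis) throws IOException {
        socket.setSoTimeout(idleTimeoutMillis);
        InputStream in = socket.getInputStream();
        byte[] scratch = new byte[8192];
        long total = 0;
        try {
            int n;
            while ((n = in.read(scratch)) != -1) {
                total += n;
            }
        } catch (SocketTimeoutException idle) {
            // No more data arrived within the timeout: treat the buffer as empty.
        }
        return total;
    }

    static long demo() throws IOException {
        try (ServerSocket server = new ServerSocket()) {
            // Set before bind so that accepted sockets inherit the smaller size.
            server.setReceiveBufferSize(SMALL_BUFFER);
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            try (Socket client = new Socket()) {
                client.setSendBufferSize(SMALL_BUFFER);
                client.connect(server.getLocalSocketAddress());
                try (Socket accepted = server.accept()) {
                    // The "not-reading server" never reads while the test runs...
                    client.getOutputStream().write(new byte[4096]);
                    client.getOutputStream().flush();
                    client.shutdownOutput();
                    // ...but empties the receive buffer just before closing.
                    return drain(accepted, 500);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("drained " + demo() + " bytes");
    }
}
```

The idle timeout keeps the drain from blocking forever when the peer has not closed its side; a real test would pick a value suited to its own timing.
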
>  
 >  With these changes, I have run the HttpClient tests 200 times on the 
> problematic machines without observing any failures (where previously there 
> were at least a couple of failures per 50 runs). I also ran tier1 once, and 
> tier2 twice and the results came back clean.
>  
>  I am therefore claiming success (even if it might prove temporary ;-) )
>  
> If these failures come back to haunt the CI again after this fix, a further 
> remediation policy could be to put the httpclient/websocket directory in 
> exclusive test execution mode (in TEST.root) - this seems to work too - but 
> cleaning up garbage in the tests themselves seems preferable.

LGTM

-------------

Marked as reviewed by michaelm (Reviewer).

PR: https://git.openjdk.java.net/jdk17/pull/79
