I'm not sure if this is easy to solve for ZooKeeper, but I'm going to make a suggestion here that I've seen work on other projects.
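Here's a rough Java sketch of what I mean, with the reasoning spelled out after it. DemoServer and findFreePortRacy are names I made up for illustration; they are not ZooKeeper classes, and I'm assuming the real server can be handed port 0 and can expose the port it ended up on. The only real API used is java.net.ServerSocket and java.net.Socket.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

class PortZeroSketch {

    /** Hypothetical stand-in for the server under test; not a ZooKeeper class. */
    static final class DemoServer implements AutoCloseable {
        private final ServerSocket socket;

        DemoServer(int requestedPort) throws IOException {
            // Port 0 asks the OS for any free port. The socket stays bound for
            // the server's whole lifetime, so no other test can grab the port.
            socket = new ServerSocket();
            socket.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), requestedPort));
        }

        /** The port the OS actually assigned, which is what the test needs. */
        int getPort() {
            return socket.getLocalPort();
        }

        @Override
        public void close() throws IOException {
            socket.close();
        }
    }

    /** The racy pattern: the port is up for grabs again once this returns. */
    static int findFreePortRacy() throws IOException {
        try (ServerSocket probe = new ServerSocket(0)) {
            return probe.getLocalPort();
        }
    }

    public static void main(String[] args) throws IOException {
        try (DemoServer server = new DemoServer(0)) {
            int port = server.getPort();
            // The listening socket is already bound, so the connect completes
            // through the accept backlog even though nothing calls accept().
            try (Socket client = new Socket(InetAddress.getLoopbackAddress(), port)) {
                System.out.println("connected to port " + port);
            }
        }
    }
}
```

The point of the design is that the socket that discovers the port is the same socket the server listens on, so there is never a window where the port is unbound.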
The "find a free port" code is inherently racy. I've seen that solution used in many other projects, and it never works that well. A much safer way to do it is to pass port 0 directly to the server you're trying to start. As you know from the port reservation code, this will cause the OS to use a free port. The port selected by the OS can be retrieved by the test by calling e.g. https://docs.oracle.com/javase/7/docs/api/java/net/ServerSocket.html#getLocalPort().

You go from this flow:

1. Open ServerSocket on 0
2. Get port via ServerSocket.getLocalPort
3. Close socket <- This is the dangerous bit, as after this, the port is again up for grabs for other tests to acquire
4. Pass chosen port to server
5. Server opens ServerSocket on port

To:

1. Pass 0 to server
2. Server opens ServerSocket on 0, getting a random port
3. Test calls some method on the server that ends up in a call to ServerSocket.getLocalPort

This won't work for cases where you're testing e.g. the client starting before the server, and connecting once the server has booted, but for the common use case of starting a server, and then starting a client connected to that server, this should work.

On 2021/02/05 15:58:00, Christopher <c...@apache.org> wrote:
> FWIW, the maven-surefire-plugin should be able to retry temporarily
> failing tests, but it isn't working with JUnit5 until the plugin is
> updated to a newer version. However, the newer version of the plugin
> didn't work with the versions of JUnit that ZK is using. When I tried
> to update maven-surefire-plugin, things got worse due to JUnit4/JUnit5
> stuff that I don't understand.
>
> I have created a PR to trigger the tests to run on JDK8 instead of
> JDK11, to demonstrate that the tests are flaky there (or to prove me
> wrong), but it doesn't need to be merged, as it's just a test:
> https://github.com/apache/zookeeper/pull/1595
>
> On Fri, Feb 5, 2021 at 10:48 AM Christopher <ct...@apache.org> wrote:
> >
> > These tests are flaky on JDK8, too, when I tried. It's my
> > understanding that's why they were not being run on Travis previously
> > (still aren't). Most of the tests that I see failing are due to the
> > "Address already in use" bind error. This may be more likely in a
> > virtualized environment like GitHub Actions (vs. non-virtualized
> > Jenkins), but I don't think it is unique to that environment... just
> > maybe more likely for whatever reason.
> >
> > I have spent quite a bit of time looking into this, and I think the
> > main cause is that the port reservation stuff isn't working the way it
> > should. It tries to bind to a port, and then it closes the
> > ServerSocket that was used to find an available port. What it should
> > do instead is return the ServerSocket itself for use in the calling
> > code, after it has successfully bound, rather than returning an
> > integer. But, that might be a big change, and there might be a simpler
> > fix.
> >
> > On Fri, Feb 5, 2021 at 10:25 AM Enrico Olivelli <eo...@gmail.com> wrote:
> > >
> > > Hi,
> > > I see that the new test workflow with Java 11 is very flaky.
> > >
> > > This is an example
> > > https://github.com/apache/zookeeper/pull/1592/checks?check_run_id=1830428694
> > >
> > > I would like to not consider it blocker for merging pull requests.
> > >
> > > That said, we should investigate further, the tests that are failing were
> > > not flaky on JDK8
> > >
> > > Thoughts ?
> > > Enrico