Hi Chris,

I can't find the test-server package in the code. Has it been merged? Could you explain it in more detail?
Thanks
-----------------------
Xinyu Tan

On 2023/08/04 16:32:10 Christofer Dutz wrote:
> Hi Xinyu,
>
> No need to apologize … I'm happy that you have an idea of what's going wrong.
>
> I don't know if you saw it, but I proposed a test-server module, which starts
> the two parts on random free ports and reports them back to the test starting
> it … this way we'd simply use free ports every time a server is started.
>
> Could this help?
>
> Chris
>
> From: Xinyu Tan <tanxi...@apache.org>
> Date: Friday, 4 August 2023 at 17:56
> To: dev@iotdb.apache.org <dev@iotdb.apache.org>
> Subject: Re: Fixing flaky tests?
> Hi Chris,
>
> I deeply apologize for the instability of replicateUsingWALTest. The test
> failures occur because the consensus module tests frequently start and stop
> the thrift server, which can leave some tests unable to bind the socket
> during startup, causing them to fail.
>
> As for the root cause, we suspect that TCP connections, once closed, remain
> in the TIME_WAIT state for about 4 minutes before the corresponding port
> becomes available for reuse. Although we have confirmed that the thrift
> server marks the socket as reusable during startup
> (https://github.com/apache/thrift/blob/master/lib/java/src/main/java/org/apache/thrift/transport/TNonblockingServerSocket.java#L103),
> this setting does not seem to take effect in some CI environments.
>
> As a result, we added logic in https://github.com/apache/iotdb/pull/10530 to
> block and wait for 60 seconds if the socket cannot be bound. However, test
> failures still occurred, and we suspected that waiting for more than 4
> minutes might be necessary. Consequently, in
> https://github.com/apache/iotdb/pull/10540, we increased the wait timeout to
> 300 seconds. Regrettably, test failures still occasionally happen.
>
> As a result, in https://github.com/apache/iotdb/pull/10723, we introduced
> logic to dynamically detect available ports, hoping that switching to
> different ports would reduce the probability of failure. However, even after
> confirming that a port is available, failures still occur during the actual
> startup of the thrift server. I have now started another attempt
> (https://github.com/apache/iotdb/pull/10789), but I am uncertain whether it
> will be effective.
>
> Through this series of efforts we have managed to significantly reduce the
> probability of hitting the issue in CI, but unfortunately the problem still
> occasionally recurs. This issue is truly frustrating and disheartening. I
> wonder whether the community has any better solutions that could help me…
>
> Thanks
> —————————————
> Xinyu Tan
>
> On 2023/08/04 14:13:38 Christofer Dutz wrote:
> > Hi all,
> >
> > Over the past few days I've been building IoTDB on several OSes and have
> > noticed some tests repeatedly failing the build, but succeeding as soon as
> > I run them again. To sum it up, it's mostly these tests:
> >
> > ————— IoTDB: Core: Consensus
> >
> > RatisConsensusTest.removeMemberFromGroup:148->doConsensus:258 NullPointer
> > Cann…
> >
> > RatisConsensusTest.addMemberToGroup:116->doConsensus:258 NullPointer Cannot
> > in...
> >
> > ReplicateTest.replicateUsingWALTest:257->initServer:147 » IO
> > org.apache.iotdb....
> >
> > ————— IoTDB: Core: Node Commons
> >
> > These keep failing because of left-over IoTDB server instances.
> >
> > I would be happy to tackle the regularly failing Node Commons tests by
> > implementing the test runner I mentioned before, which will start and run
> > IoTDB inside the VM running the tests, so the instance is shut down as
> > soon as the test finishes. This should eliminate that problem. However, I
> > have no idea whether anyone is working on the RatisConsensusTest and the
> > ReplicateTest.
> >
> > Chris
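For context, the "dynamically detect available ports" approach discussed in the thread (PR 10723) typically looks like the sketch below: bind to port 0 so the OS picks a free ephemeral port, then release it and hand the number to the server under test. The class and method names here are hypothetical, not taken from the IoTDB code:

```java
import java.io.IOException;
import java.net.ServerSocket;

public class FreePortFinder {
    // Ask the OS for a free ephemeral port by binding to port 0,
    // then close the socket and report the port number back.
    public static int findFreePort() throws IOException {
        try (ServerSocket socket = new ServerSocket(0)) {
            return socket.getLocalPort();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("free port: " + findFreePort());
    }
}
```

Note the inherent race this pattern has: between closing the probe socket and the thrift server binding the same port, another process (or another test) can grab it. That window is consistent with the observation above that "even after confirming the port's availability, failures occur during the actual startup of the thrift server."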
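The block-and-wait mitigation described above (PRs 10530/10540) amounts to retrying the bind until TIME_WAIT clears or a deadline expires. A minimal sketch of that idea, with hypothetical names and a deliberately simple fixed backoff (the actual PRs may differ):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class RetryBind {
    // Retry binding the given port until it succeeds or the
    // deadline expires, e.g. while the port sits in TIME_WAIT.
    public static ServerSocket bindWithRetry(int port, long timeoutMillis)
            throws IOException, InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            ServerSocket socket = new ServerSocket();
            try {
                socket.setReuseAddress(true); // must be set before bind()
                socket.bind(new InetSocketAddress(port));
                return socket;
            } catch (IOException e) {
                socket.close(); // don't leak the failed socket
                if (System.currentTimeMillis() >= deadline) {
                    throw e; // give up once the timeout is exceeded
                }
                Thread.sleep(1000); // back off before the next attempt
            }
        }
    }

    public static void main(String[] args) throws Exception {
        try (ServerSocket s = bindWithRetry(0, 5_000)) {
            System.out.println("bound to port " + s.getLocalPort());
        }
    }
}
```

As the thread notes, SO_REUSEADDR is supposed to make a TIME_WAIT port bindable immediately, which is why its apparent ineffectiveness in some CI environments is puzzling; the retry loop only papers over that, which matches the observation that failures kept occurring even with a 300-second wait.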
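Chris's test-server proposal sidesteps the check-then-bind race entirely: the server itself binds to port 0, keeps that socket open, and reports the actual port back to the test that started it, so no other process can steal the port in between. A minimal sketch of the shape such a module could take; the class name and API are hypothetical, since the actual test-server module was not yet merged at the time of this thread:

```java
import java.io.IOException;
import java.net.ServerSocket;

public class TestServer {
    private final ServerSocket serverSocket;

    // Bind to port 0 and *keep* the socket, so there is no window
    // in which another process can grab the port.
    public TestServer() throws IOException {
        this.serverSocket = new ServerSocket(0);
    }

    // Report the actual port back to the test that started the server.
    public int getPort() {
        return serverSocket.getLocalPort();
    }

    public void close() throws IOException {
        serverSocket.close();
    }
}
```

The key design point is that the port is never probed and released: whoever will serve the traffic is also the one that binds, so "random free port" and "port actually held" are the same socket.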