Hi, Chris

I can't find the test-server module in the code. Has it been merged? Could you 
explain it in more detail?

Thanks
-----------------------
Xinyu Tan

On 2023/08/04 16:32:10 Christofer Dutz wrote:
> Hi Xinyu,
> 
> No need to apologize … I’m happy that you have an idea of what’s going wrong.
> 
> I don’t know if you saw it, but I proposed a test-server module, which starts 
> the two parts on random free ports and reports them back to the test that 
> starts it … this way we’d simply use a free port every time a server is started.
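> 
> Roughly, the trick is just to bind port 0 and let the OS hand out a free 
> ephemeral port (a minimal sketch of the idea; pickFreePort is a hypothetical 
> helper, not the module's actual API):
> 
>     import java.io.IOException;
>     import java.net.ServerSocket;
> 
>     // Ask the OS for any currently free ephemeral port and report it back.
>     static int pickFreePort() throws IOException {
>         try (ServerSocket socket = new ServerSocket(0)) {
>             return socket.getLocalPort();
>         }
>     }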
> 
> Could this help?
> 
> Chris
> 
> From: Xinyu Tan <tanxi...@apache.org>
> Date: Friday, 4 August 2023 at 17:56
> To: dev@iotdb.apache.org <dev@iotdb.apache.org>
> Subject: Re: Fixing flaky tests?
> Hi Chris,
> 
> I deeply apologize for the instability of replicateUsingWALTest. The test 
> failures occur because the consensus module tests start and stop the thrift 
> server so frequently that some tests cannot bind their socket during startup, 
> which makes them fail.
> 
> Regarding the root cause, we suspect that when a TCP connection is closed it 
> remains in the TIME_WAIT state for about 4 minutes before the corresponding 
> port becomes available for reuse. Although we have confirmed that the thrift 
> server marks the socket as reusable during startup 
> (https://github.com/apache/thrift/blob/master/lib/java/src/main/java/org/apache/thrift/transport/TNonblockingServerSocket.java#L103),
> this setting does not seem to take effect in some CI environments.
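> 
> For reference, that "reusable" flag is the standard SO_REUSEADDR socket 
> option, which in plain Java looks roughly like this (a simplified sketch of 
> the same idea, not the thrift code itself):
> 
>     import java.net.InetSocketAddress;
>     import java.net.ServerSocket;
> 
>     ServerSocket serverSocket = new ServerSocket(); // create unbound
>     // Allow binding even while old connections on the port sit in TIME_WAIT.
>     serverSocket.setReuseAddress(true);
>     serverSocket.bind(new InetSocketAddress(port)); // port: the desired port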
> 
> As a result, in https://github.com/apache/iotdb/pull/10530 we added logic to 
> block and wait for up to 60 seconds if the socket cannot be bound. However, 
> test failures still occurred, and we suspected that waiting for more than 4 
> minutes might be necessary, so in https://github.com/apache/iotdb/pull/10540 
> we increased the wait timeout to 300 seconds. Regrettably, test failures 
> still happened occasionally. In https://github.com/apache/iotdb/pull/10723 we 
> therefore introduced logic to dynamically detect available ports, hoping that 
> switching to different ports would reduce the probability of failure. 
> However, even after a port's availability has been confirmed, the thrift 
> server can still fail to bind it during actual startup. I have now started 
> another attempt (https://github.com/apache/iotdb/pull/10789), but I am 
> uncertain whether it will be effective.
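> 
> Put together, the retry logic is roughly the following (a simplified sketch 
> of the approach, not the exact code from those PRs):
> 
>     import java.io.IOException;
>     import java.net.InetSocketAddress;
>     import java.net.ServerSocket;
> 
>     // Probe candidate ports until one can be bound, retrying for up to
>     // timeoutMs in case every candidate is still stuck in TIME_WAIT.
>     static int waitForBindablePort(int startPort, long timeoutMs)
>             throws IOException, InterruptedException {
>         long deadline = System.currentTimeMillis() + timeoutMs;
>         int port = startPort;
>         while (System.currentTimeMillis() < deadline) {
>             try (ServerSocket probe = new ServerSocket()) {
>                 probe.setReuseAddress(true);
>                 probe.bind(new InetSocketAddress(port));
>                 return port; // bindable right now
>             } catch (IOException e) {
>                 port++;             // try the next candidate port
>                 Thread.sleep(1000); // back off before retrying
>             }
>         }
>         throw new IOException("no bindable port within " + timeoutMs + " ms");
>     }
> 
> Note that such a check is inherently racy: the port can be taken again 
> between the probe and the thrift server's own bind, which matches the 
> failures we still see after confirming a port's availability.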
> 
> Through this series of efforts we have managed to significantly reduce the 
> probability of hitting this issue in CI, but unfortunately it still recurs 
> occasionally. This issue is truly frustrating and disheartening. I wonder if 
> the community has any better solutions that could help me…
> 
> Thanks
> —————————————
> Xinyu Tan
> 
> On 2023/08/04 14:13:38 Christofer Dutz wrote:
> > Hi all,
> >
> > So, in the past days I’ve been building IoTDB on several OSes and have 
> > noticed some tests repeatedly failing the build, but succeeding as soon 
> > as I run them again.
> > To sum it up, it’s mostly these tests:
> >
> > ————— IoTDB: Core: Consensus
> >
> > RatisConsensusTest.removeMemberFromGroup:148->doConsensus:258 NullPointer Cann…
> >
> > RatisConsensusTest.addMemberToGroup:116->doConsensus:258 NullPointer Cannot in...
> >
> > ReplicateTest.replicateUsingWALTest:257->initServer:147 » IO org.apache.iotdb....
> >
> > ————— IoTDB: Core: Node Commons
> >
> > Keeps on failing because of left-over IoTDB server instances.
> >
> > I would be happy to tackle the regularly failing Node Commons tests by 
> > implementing the Test-Runner I mentioned before (see the sketch below), 
> > which will start and run IoTDB inside the VM running the tests, so the 
> > instance is shut down as soon as the test finishes. This should eliminate 
> > that problem. However, I have no idea whether anyone is working on the 
> > RatisConsensusTest and the ReplicateTest.
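> >
> > Roughly what I have in mind (a hypothetical sketch; all names here are 
> > placeholders, not an existing API):
> >
> >     import org.junit.After;
> >     import org.junit.Before;
> >
> >     public class InProcessServerTest {
> >
> >         // Hypothetical harness that runs IoTDB inside the test VM.
> >         private InProcessServerHarness harness;
> >
> >         @Before
> >         public void setUp() throws Exception {
> >             harness = new InProcessServerHarness();
> >             harness.start(); // server lives inside this VM
> >         }
> >
> >         @After
> >         public void tearDown() throws Exception {
> >             harness.stop(); // runs even if the test fails, so nothing is left over
> >         }
> >     }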
> >
> > Chris
> >
> 
