Hi Xinyu,

No need to apologize … I’m happy that you have an idea of what’s going wrong.

I don’t know if you saw it, but I proposed a test-server module, which starts 
the two parts on random free ports and reports them back to the test starting 
it … this way we’d simply use free ports every time a server is started.
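
To make the idea concrete, here is a minimal sketch using only plain java.net. 
The class name FreePortServer is hypothetical and this is not the proposed 
module itself; it only illustrates binding to port 0 so that the OS picks a 
free port, which the test then reads back from the socket that is already 
bound:

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical helper for illustration only; not part of IoTDB or Thrift.
final class FreePortServer implements AutoCloseable {
    private final ServerSocket socket;

    FreePortServer() throws IOException {
        socket = new ServerSocket();
        // Port 0 asks the OS to assign any currently free port.
        socket.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
    }

    // The port the OS actually assigned; this is what gets reported to the test.
    int port() {
        return socket.getLocalPort();
    }

    @Override
    public void close() throws IOException {
        socket.close();
    }

    public static void main(String[] args) throws IOException {
        // Two "parts" started side by side, each on its own free port.
        try (FreePortServer a = new FreePortServer();
             FreePortServer b = new FreePortServer()) {
            System.out.println("ports: " + a.port() + ", " + b.port());
        }
    }
}
```

Because the socket stays bound from the moment the port is chosen, there is no 
gap between picking a port and using it.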

Could this help?

Chris




From: Xinyu Tan <tanxi...@apache.org>
Date: Friday, 4 August 2023 at 17:56
To: dev@iotdb.apache.org <dev@iotdb.apache.org>
Subject: Re: Fixing flaky tests?
Hi Chris,

I deeply apologize for the instability of replicateUsingWALTest. The test 
failures are caused by the frequent starting and stopping of the thrift server 
in the consensus module tests, which can leave some tests unable to bind to 
the socket during startup, so that they fail.

Regarding the root cause of this issue, we suspect that TCP connections, when 
disconnected, remain in a TIME_WAIT state for about 4 minutes before the 
corresponding port becomes available for reuse. Although we have confirmed that 
the thrift server sets the socket as reusable during startup 
(https://github.com/apache/thrift/blob/master/lib/java/src/main/java/org/apache/thrift/transport/TNonblockingServerSocket.java#L103),
 it seems that this setting does not work in some CI environments.
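
For reference, a minimal sketch of what "setting the socket as reusable" means 
at the java.net level. This is not the Thrift code linked above, and the class 
name ReusableBind is hypothetical; it only shows that SO_REUSEADDR has to be 
set on an unbound socket before bind for it to have any effect:

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical helper for illustration only; not the Thrift implementation.
final class ReusableBind {
    static ServerSocket open(int port) throws IOException {
        // Create the socket unbound so the option can be set first.
        ServerSocket socket = new ServerSocket();
        // SO_REUSEADDR: setting it after bind has undefined behaviour.
        socket.setReuseAddress(true);
        socket.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), port));
        return socket;
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket socket = open(0)) {
            System.out.println("bound to " + socket.getLocalPort()
                + ", reuseAddress=" + socket.getReuseAddress());
        }
    }
}
```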

As a first step, we added logic in https://github.com/apache/iotdb/pull/10530 
to block and wait for 60 seconds if the socket cannot be bound. However, test 
failures still occurred, and we suspected that waiting for more than 4 minutes 
might be necessary. Consequently, in 
https://github.com/apache/iotdb/pull/10540, we increased the wait timeout to 
300 seconds. Regrettably, test failures still occasionally happened. Next, in 
https://github.com/apache/iotdb/pull/10723, we introduced logic to dynamically 
detect available ports, hoping that switching to different ports would reduce 
the probability of failure. However, the current situation is that even after 
confirming a port's availability, failures still occur during the actual 
startup of the thrift server. I have now started another attempt 
(https://github.com/apache/iotdb/pull/10789), but I am uncertain whether it 
will be effective.
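
To illustrate the block-and-wait approach described above, here is a minimal 
sketch with plain java.net. It is not the code from any of the linked PRs; the 
class name BindWithRetry and its parameters are hypothetical. It retries the 
bind until it succeeds or a deadline passes:

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical helper for illustration only; not the logic from the PRs.
final class BindWithRetry {
    static ServerSocket bind(int port, long timeoutMillis, long pauseMillis)
            throws IOException, InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            ServerSocket socket = new ServerSocket();
            try {
                socket.setReuseAddress(true);
                socket.bind(
                    new InetSocketAddress(InetAddress.getLoopbackAddress(), port));
                return socket;
            } catch (IOException e) {
                socket.close();
                boolean timedOut = System.nanoTime() - deadline >= 0;
                // Only "address in use" is worth retrying; give up on timeout.
                if (!(e instanceof BindException) || timedOut) {
                    throw e;
                }
                Thread.sleep(pauseMillis);
            }
        }
    }
}
```

A call such as BindWithRetry.bind(port, 300_000, 1_000) would correspond to the 
300-second wait, polling once per second.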

Through this series of efforts, we have managed to significantly reduce the 
probability of encountering issues in the CI, but unfortunately, the problem 
still occasionally recurs. This issue is truly frustrating and disheartening. 
I wonder if the community has any better solutions that could help me…

Thanks
—————————————
Xinyu Tan

On 2023/08/04 14:13:38 Christofer Dutz wrote:
> Hi all,
>
> So, in the past days I’ve been building IoTDB on several OSes and have 
> noticed some tests repeatedly failing the build, but succeeding as soon as 
> I run them again.
> To sum it up, it’s mostly these tests:
>
> ————— IoTDB: Core: Consensus
>
> RatisConsensusTest.removeMemberFromGroup:148->doConsensus:258 NullPointer 
> Cann…
>
>
> RatisConsensusTest.addMemberToGroup:116->doConsensus:258 NullPointer Cannot 
> in...
>
>
>
> ReplicateTest.replicateUsingWALTest:257->initServer:147 » IO 
> org.apache.iotdb....
>
>
> ————— IoTDB: Core: Node Commons
>
> Keeps on failing because of left-over iotdb server instances.
>
> I would be happy to tackle the regularly failing Node Commons tests by 
> implementing the Test-Runner that I mentioned before, which will start and 
> run IoTDB inside the VM running the tests, so the instance will be shut down 
> as soon as the test is finished. This should eliminate that problem. However, 
> I have no idea if anyone is working on the RatisConsensusTest and the 
> ReplicateTest.
>
> Chris
>
