Hi Xinyu,

Thanks for that … if this really reduces the number of failed tests on a PR, 
that would be a huge improvement :-)

Chris


From: Xinyu Tan <tanxi...@apache.org>
Date: Wednesday, 30 August 2023 at 08:29
To: dev@iotdb.apache.org <dev@iotdb.apache.org>
Subject: Re: AW: Fixing flaky tests?
Hello all,

Through the efforts of William and myself, we have significantly improved 
the continuous integration stability of the consensus layer. In the ten most 
recent unit test runs, there have been no failures.

Here are the specific improvements we have implemented:

For IoTConsensus, when a socket bind failure occurs in the CI environment, 
the corresponding test is no longer executed (pr: 
https://github.com/apache/iotdb/pull/10991).
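
To illustrate the idea of probing a port before running a test, here is a minimal Java sketch. It is not the code from the PR; the class name BindProbe and both helper methods are hypothetical, and it only uses java.net.ServerSocket from the JDK.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical helper, not taken from the IoTDB code base.
class BindProbe {

  /** Returns true if a server socket can currently be bound to the given loopback port. */
  static boolean canBind(int port) {
    try (ServerSocket socket = new ServerSocket()) {
      socket.bind(new InetSocketAddress("127.0.0.1", port));
      return true;
    } catch (IOException e) {
      // The port is in use or otherwise unavailable.
      return false;
    }
  }

  /** Asks the OS for a currently free ephemeral port by binding to port 0. */
  static int findFreePort() {
    try (ServerSocket socket = new ServerSocket(0)) {
      return socket.getLocalPort();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

A test could call canBind(port) first and skip itself when it returns false. Note that the probe releases the port again before the real server starts, so another process can still grab it in between; the check reduces failures but cannot rule them out.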

For RatisConsensus, we have introduced logic to retry linearizable reads 
after failures, so that the tests can proceed successfully (pr: 
https://github.com/apache/iotdb/pull/10942).
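
The retry approach can be sketched in a generic way as follows. This is only an assumption about the general shape of such logic, not the implementation in the PR; RetrySketch, retry, maxAttempts and delayMillis are hypothetical names, and the read itself is represented by a plain Supplier.

```java
import java.util.function.Supplier;

// Hypothetical helper, not taken from the IoTDB or Ratis code base.
class RetrySketch {

  /**
   * Calls the action until it succeeds or maxAttempts is reached,
   * sleeping delayMillis between attempts. Rethrows the last failure.
   */
  static <T> T retry(Supplier<T> action, int maxAttempts, long delayMillis) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    RuntimeException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return action.get();
      } catch (RuntimeException e) {
        last = e;
        if (attempt < maxAttempts) {
          try {
            Thread.sleep(delayMillis);
          } catch (InterruptedException ie) {
            // Preserve the interrupt flag and stop retrying.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while retrying", ie);
          }
        }
      }
    }
    throw last;
  }
}
```

A test would wrap the read in a lambda, for example retry(() -> doRead(), 5, 100), where doRead stands for whatever call performs the read.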

We welcome any feedback and suggestions from all of you!

Thanks
---------------------------------
Xinyu Tan

On 2023/08/07 06:07:12 Christofer Dutz wrote:
> Haven't built that yet.... Was proposing to do so, and was waiting for some 
> discussion on it.
>
> I'll probably get started after my holidays next week.
>
> Chris
>
> Sent from Outlook for Android<https://aka.ms/AAb9ysg>
> ________________________________
> From: Xinyu Tan <tanxi...@apache.org>
> Sent: Monday, August 7, 2023 4:50:02 AM
> To: dev@iotdb.apache.org <dev@iotdb.apache.org>
> Subject: Re: AW: Fixing flaky tests?
>
> Hi, Chris
>
> I can't find the test-server package in the code. Has it been merged? Could 
> you explain it in more detail?
>
> Thanks
> -----------------------
> Xinyu Tan
>
> On 2023/08/04 16:32:10 Christofer Dutz wrote:
> > Hi Xinyu,
> >
> > No need to apologize … I’m happy that you have an idea on what’s going 
> > wrong.
> >
> > I don’t know if you saw it, but I proposed a test-server module, which 
> > starts the two parts on random free ports and reports them back to the test 
> > starting it … this way we’d simply use free ports every time a server is 
> > started.
> >
> > Could this help?
> >
> > Chris
> >
> >
> >
> >
> > From: Xinyu Tan <tanxi...@apache.org>
> > Date: Friday, 4 August 2023 at 17:56
> > To: dev@iotdb.apache.org <dev@iotdb.apache.org>
> > Subject: Re: Fixing flaky tests?
> > Hi Chris,
> >
> > I deeply apologize for the instability of replicateUsingWALTest. The test 
> > failures are occurring due to the frequent start and stop of the thrift 
> > server in the consensus module tests, which can lead to some tests being 
> > unable to bind to the socket during startup and resulting in failures.
> >
> > Regarding the root cause of this issue, we suspect that TCP connections, 
> > when disconnected, remain in a TIME_WAIT state for about 4 minutes before 
> > the corresponding port becomes available for reuse. Although we have 
> > confirmed that the thrift server sets the socket as reusable during startup 
> > (https://github.com/apache/thrift/blob/master/lib/java/src/main/java/org/apache/thrift/transport/TNonblockingServerSocket.java#L103),
> >  it seems that this setting does not work in some CI environments.
> >
> > As a result, we added logic in https://github.com/apache/iotdb/pull/10530 
> > to block and wait for 60 seconds if the socket cannot be bound. However, 
> > test failures may still occur, and we suspect that waiting for more than 4 
> > minutes might be necessary. Consequently, in 
> > https://github.com/apache/iotdb/pull/10540, we increased the timeout 
> > waiting period to 300 seconds. Regrettably, test failures still 
> > occasionally happen. As a result, in 
> > https://github.com/apache/iotdb/pull/10723, we introduced logic to 
> > dynamically detect available ports, hoping that switching to different 
> > ports could reduce the probability of failure. However, the current 
> > situation is that even after confirming the port's availability, failures 
> > occur during the actual startup of the thrift server. Now, I have started 
> > another attempt (https://github.com/apache/iotdb/pull/10789), but I am 
> > uncertain whether it will be effective.
> >
> > Through this series of efforts, we have managed to significantly reduce the 
> > probability of encountering issues in the CI, but unfortunately, the 
> > problem still occasionally reoccurs. This issue is truly frustrating and 
> > disheartening. I wonder if the community has any better solutions that 
> > could help me…
> >
> > Thanks
> > —————————————
> > Xinyu Tan
> >
> > On 2023/08/04 14:13:38 Christofer Dutz wrote:
> > > Hi all,
> > >
> > > So, in the past few days I've been building IoTDB on several OSes and have 
> > > noticed some tests repeatedly failing the build, but succeeding as 
> > > soon as I run them again.
> > > To sum it up it’s mostly these tests:
> > >
> > > ————— IoTDB: Core: Consensus
> > >
> > > RatisConsensusTest.removeMemberFromGroup:148->doConsensus:258 NullPointer 
> > > Cann…
> > >
> > >
> > > RatisConsensusTest.addMemberToGroup:116->doConsensus:258 NullPointer 
> > > Cannot in...
> > >
> > >
> > >
> > > ReplicateTest.replicateUsingWALTest:257->initServer:147 » IO 
> > > org.apache.iotdb....
> > >
> > >
> > > ————— IoTDB: Core: Node Commons
> > >
> > > Keeps on failing because of left-over iotdb server instances.
> > >
> > > I would be happy to tackle the regularly failing Node Commons tests by 
> > > implementing the Test-Runner that I mentioned before, which will start 
> > > and run IoTDB inside the VM running the tests, so the instance will be 
> > > shut down as soon as the test is finished. This should eliminate that 
> > > problem. However, I have no idea if anyone is working on the 
> > > RatisConsensusTest and the ReplicateTest.
> > >
> > > Chris
> > >
> >
>
