AW: AW: Fixing flaky tests?

2023-08-30 Thread Christofer Dutz
Hi Xinyu,

Thanks for that. If this really reduces the number of failed tests on a PR, 
that would be a huge improvement :-)

Chris



Re: AW: Fixing flaky tests?

2023-08-30 Thread Xinyu Tan
Hello all,

Thanks to work by William and me, the consensus layer is now significantly 
more stable in continuous integration. The ten most recent unit test runs had 
no failures.

Here are the specific improvements we have implemented:

For IoTConsensus, when a socket bind failure occurs in the CI environment, 
the corresponding test is no longer executed (PR: 
https://github.com/apache/iotdb/pull/10991).
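
As a rough illustration of the idea (not the code from the PR), a test could 
probe whether its port can be bound before running. `BindProbe` and `canBind` 
are hypothetical names. Assuming JUnit 4 is the test framework, the result 
could be passed to `org.junit.Assume.assumeTrue(...)` so that the test is 
reported as skipped rather than failed. The probe is only a hint: the port can 
still be taken between the check and the real server startup.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

/** Hypothetical helper for illustration; not taken from the IoTDB code base. */
final class BindProbe {
  private BindProbe() {}

  /** Returns true if a server socket can currently be bound to the given port. */
  static boolean canBind(int port) {
    try (ServerSocket socket = new ServerSocket()) {
      socket.bind(new InetSocketAddress(port));
      return true;
    } catch (IOException e) {
      // Typically a BindException such as "Address already in use".
      return false;
    }
  }

  public static void main(String[] args) throws IOException {
    // Hold a port ourselves, then probe it while it is still in use.
    try (ServerSocket holder = new ServerSocket(0)) {
      int port = holder.getLocalPort();
      System.out.println("port " + port + " bindable while held: " + canBind(port));
    }
  }
}
```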

For RatisConsensus, we have added logic to retry linearizable reads after a 
failure, so that the tests can proceed (PR: 
https://github.com/apache/iotdb/pull/10942).
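
To make the retry idea concrete, here is a minimal sketch of a bounded retry 
loop. It is not the code from the PR: `RetryingRead` and `readWithRetry` are 
hypothetical names, and the linearizable read is modeled as a plain 
`Callable`, which is an assumption for illustration only.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration; not taken from the IoTDB code base. */
final class RetryingRead {
  private RetryingRead() {}

  /**
   * Calls {@code read} up to {@code maxAttempts} times, sleeping
   * {@code backoffMillis} between attempts, and returns the first
   * successful result.
   */
  static <T> T readWithRetry(Callable<T> read, int maxAttempts, long backoffMillis) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return read.call();
      } catch (Exception e) {
        last = e; // remember the failure and try again
      }
      if (attempt < maxAttempts) {
        try {
          Thread.sleep(backoffMillis);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt(); // preserve the interrupt flag
          break;
        }
      }
    }
    throw new IllegalStateException("read failed after retries", last);
  }
}
```

A test would wrap only the read itself in `readWithRetry`, so that a real and 
persistent failure still surfaces once the attempts are exhausted.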

We welcome any feedback and suggestions from all of you!

Thanks
-
Xinyu Tan


Re: AW: Fixing flaky tests?

2023-08-07 Thread Christofer Dutz
I haven't built that yet. I was proposing to do so and was waiting for some 
discussion on it.

I'll probably get started after my holidays next week.

Chris



Re: Fixing flaky tests?

2023-08-06 Thread William Song
Sure, will take a look.
William




Re: AW: Fixing flaky tests?

2023-08-06 Thread Xinyu Tan
Hi, Chris

I can't find the test-server package in the code. Has it been merged? Could you 
explain it in more detail?

Thanks
---
Xinyu Tan



Re: Fixing flaky tests?

2023-08-06 Thread Xinyu Tan
Hi William,

In my PR (https://github.com/apache/iotdb/pull/10789), a NullPointerException 
(NPE) occurred in the 'oneMemberGroupChange' test 
(https://github.com/apache/iotdb/actions/runs/5764037692/job/15640048487?pr=10789).
You may want to investigate the cause.

Thanks
--
Xinyu Tan



Re: Fixing flaky tests?

2023-08-04 Thread 谭新宇
Hi Chris,

I deeply apologize for the instability of replicateUsingWALTest. The failures 
occur because the consensus module tests start and stop the Thrift server 
frequently, so some tests cannot bind the server socket during startup and 
fail.

As for the root cause, we suspect that TCP connections, once closed, remain in 
the TIME_WAIT state for about 4 minutes before the corresponding port becomes 
available for reuse. We have confirmed that the Thrift server marks the socket 
as reusable during startup 
(https://github.com/apache/thrift/blob/master/lib/java/src/main/java/org/apache/thrift/transport/TNonblockingServerSocket.java#L103),
but this setting does not seem to take effect in some CI environments.

We therefore added logic in https://github.com/apache/iotdb/pull/10530 to 
block and wait for up to 60 seconds if the socket cannot be bound. Tests still 
failed, and we suspected that waiting for more than 4 minutes might be 
necessary, so in https://github.com/apache/iotdb/pull/10540 we increased the 
timeout to 300 seconds. Regrettably, tests still failed occasionally. Next, in 
https://github.com/apache/iotdb/pull/10723, we introduced logic to detect 
available ports dynamically, hoping that switching to different ports would 
reduce the probability of failure. However, even after a port has been 
confirmed as available, the actual startup of the Thrift server can still 
fail. I have now started another attempt 
(https://github.com/apache/iotdb/pull/10789), but I am not sure whether it 
will be effective.

With this series of changes we have significantly reduced the probability of 
hitting the problem in CI, but unfortunately it still recurs occasionally. 
This issue is truly frustrating and disheartening. Does the community have any 
better solutions that could help?

Thanks
—
Xinyu Tan





AW: Fixing flaky tests?

2023-08-04 Thread Christofer Dutz
Hi Xinyu,

No need to apologize. I’m happy that you have an idea of what’s going wrong.

I don’t know if you saw it, but I proposed a test-server module, which starts 
the two parts on random free ports and reports them back to the test that 
started it. This way we’d simply use free ports every time a server is started.
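
A minimal sketch of the "random free port, reported back" idea, using only the 
JDK. The proposed test-server module has not been built yet, so everything 
here is an assumption: `EphemeralServer`, `start` and `port` are hypothetical 
names, and a bare `ServerSocket` stands in for a real server. Binding to port 
0 lets the OS pick a free port, and the socket stays open, so there is no gap 
between finding the port and using it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;

/** Hypothetical stand-in for a server started by a test; illustration only. */
final class EphemeralServer implements AutoCloseable {
  private final ServerSocket socket;

  private EphemeralServer(ServerSocket socket) {
    this.socket = socket;
  }

  /** Starts listening on an OS-assigned free port on the loopback interface. */
  static EphemeralServer start() {
    try {
      return new EphemeralServer(new ServerSocket(0, 50, InetAddress.getLoopbackAddress()));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Reports the port that was actually bound, for the test to connect to. */
  int port() {
    return socket.getLocalPort();
  }

  @Override
  public void close() {
    try {
      socket.close();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // Two independent servers; each reports its own port back to the caller.
    try (EphemeralServer first = start(); EphemeralServer second = start()) {
      System.out.println("first part listening on port " + first.port());
      System.out.println("second part listening on port " + second.port());
    }
  }
}
```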

Could this help?

Chris






Re: Fixing flaky tests?

2023-08-04 Thread Xinyu Tan
Hi Chris,

I deeply apologize for the instability of replicateUsingWALTest. The failures 
occur because the consensus module tests start and stop the Thrift server 
frequently, so some tests cannot bind the server socket during startup and 
fail.

As for the root cause, we suspect that TCP connections, once closed, remain in 
the TIME_WAIT state for about 4 minutes before the corresponding port becomes 
available for reuse. We have confirmed that the Thrift server marks the socket 
as reusable during startup 
(https://github.com/apache/thrift/blob/master/lib/java/src/main/java/org/apache/thrift/transport/TNonblockingServerSocket.java#L103),
but this setting does not seem to take effect in some CI environments.
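
For readers unfamiliar with the "reusable" setting: in Java it corresponds to 
SO_REUSEADDR, which has to be enabled before the socket is bound. The sketch 
below uses only the JDK and is not the Thrift code linked above; 
`ReusableServerSocket` and `open` are hypothetical names. How long TIME_WAIT 
lasts, and how far SO_REUSEADDR helps, depends on the operating system.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

/** Hypothetical helper for illustration; not the Thrift implementation. */
final class ReusableServerSocket {
  private ReusableServerSocket() {}

  /** Opens a listening socket with SO_REUSEADDR enabled before binding. */
  static ServerSocket open(int port) {
    ServerSocket socket = null;
    try {
      socket = new ServerSocket();   // created unbound
      socket.setReuseAddress(true);  // must be set before bind() to be reliable
      socket.bind(new InetSocketAddress(port));
      return socket;
    } catch (IOException e) {
      if (socket != null) {
        try {
          socket.close();
        } catch (IOException ignored) {
          // nothing more to do; the bind failure is reported below
        }
      }
      throw new UncheckedIOException(e);
    }
  }
}
```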

We therefore added logic in https://github.com/apache/iotdb/pull/10530 to 
block and wait for up to 60 seconds if the socket cannot be bound. Tests still 
failed, and we suspected that waiting for more than 4 minutes might be 
necessary, so in https://github.com/apache/iotdb/pull/10540 we increased the 
timeout to 300 seconds. Regrettably, tests still failed occasionally. Next, in 
https://github.com/apache/iotdb/pull/10723, we introduced logic to detect 
available ports dynamically, hoping that switching to different ports would 
reduce the probability of failure. However, even after a port has been 
confirmed as available, the actual startup of the Thrift server can still 
fail. I have now started another attempt 
(https://github.com/apache/iotdb/pull/10789), but I am not sure whether it 
will be effective.
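
The "block and wait" step can be sketched roughly as follows. This is not the 
code from the PRs above: `BindWithTimeout` and `bindWithin` are hypothetical 
names, and the timeout and retry interval are parameters chosen for 
illustration. The method returns the bound socket itself, so the caller does 
not have to re-bind a port that was merely checked earlier.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ServerSocket;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for illustration; not taken from the IoTDB code base. */
final class BindWithTimeout {
  private BindWithTimeout() {}

  /**
   * Tries to bind {@code port} until it succeeds or {@code timeoutMillis}
   * has elapsed, sleeping {@code retryIntervalMillis} between attempts.
   */
  static ServerSocket bindWithin(int port, long timeoutMillis, long retryIntervalMillis) {
    final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    while (true) {
      try {
        return new ServerSocket(port); // binds and listens, or throws
      } catch (IOException e) {
        if (System.nanoTime() - deadline >= 0) {
          // Out of time: report the last bind failure to the caller.
          throw new UncheckedIOException("could not bind port " + port + " in time", e);
        }
      }
      try {
        Thread.sleep(retryIntervalMillis);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt(); // preserve the interrupt flag
        throw new IllegalStateException("interrupted while waiting for port " + port, ie);
      }
    }
  }
}
```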

With this series of changes we have significantly reduced the probability of 
hitting the problem in CI, but unfortunately it still recurs occasionally. 
This issue is truly frustrating and disheartening. Does the community have any 
better solutions that could help?

Thanks
—
Xinyu Tan



Re: Fixing flaky tests?

2023-08-04 Thread William Song
Hi Chris,

I will take a look at RatisConsensusTest. If the tests fail again, feel free 
to mention me directly in the PR so that I can see the complete stack trace.

William




Fixing flaky tests?

2023-08-04 Thread Christofer Dutz
Hi all,

Over the past few days I’ve been building IoTDB on several OSes and have 
noticed some tests repeatedly failing the build but succeeding as soon as I 
run them again.
In summary, it’s mostly these tests:

— IoTDB: Core: Consensus

RatisConsensusTest.removeMemberFromGroup:148->doConsensus:258 NullPointer Cann…

RatisConsensusTest.addMemberToGroup:116->doConsensus:258 NullPointer Cannot in...

ReplicateTest.replicateUsingWALTest:257->initServer:147 » IO org.apache.iotdb


— IoTDB: Core: Node Commons

These tests keep failing because of leftover IoTDB server instances.

I would be happy to tackle the regularly failing Node Commons tests by 
implementing the Test-Runner I mentioned before, which will start and run 
IoTDB inside the VM running the tests, so the instance will be shut down as 
soon as the test is finished. This should eliminate that problem. However, I 
have no idea whether anyone is working on the RatisConsensusTest and the 
ReplicateTest.

Chris