AW: AW: Fixing flaky tests?
Hi Xinyu,

thanks for that … if this really reduces the number of failed tests on a PR, that would be a huge improvement :-)

Chris

In reply to: Xinyu Tan, Wednesday, 30 August 2023, 08:29
Re: AW: Fixing flaky tests?
Hello all,

Thanks to the efforts of William and myself, the continuous integration stability of the consensus layer has improved significantly. The ten most recent unit test runs had no failures. The specific improvements are:

For IoTConsensus, when a socket bind failure occurs in the CI environment, the corresponding test is no longer executed (PR: https://github.com/apache/iotdb/pull/10991).

For RatisConsensus, we introduced logic that retries linearizable reads after failures, so that the tests can proceed (PR: https://github.com/apache/iotdb/pull/10942).

We welcome any feedback and suggestions from all of you!

Thanks
-
Xinyu Tan

In reply to: Christofer Dutz, 2023/08/07 06:07:12
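The retry idea mentioned for RatisConsensus can be sketched in general terms as below. This is only an illustration of the pattern, not the code from PR 10942: `RetrySketch` and `retry` are hypothetical names, and the attempt count and back-off are assumed values.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper for illustration; not taken from the IoTDB code base. */
final class RetrySketch {

  /**
   * Calls action until it succeeds or maxAttempts is reached, sleeping
   * backoffMillis between attempts. Rethrows the last failure if all attempts fail.
   */
  static <T> T retry(Callable<T> action, int maxAttempts, long backoffMillis) throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return action.call();
      } catch (Exception e) {
        last = e;
        if (attempt < maxAttempts) {
          Thread.sleep(backoffMillis);
        }
      }
    }
    throw last;
  }

  public static void main(String[] args) throws Exception {
    AtomicInteger calls = new AtomicInteger();
    // Simulated read that fails on the first two calls and then succeeds.
    String value = retry(() -> {
      if (calls.incrementAndGet() < 3) {
        throw new IllegalStateException("not ready yet");
      }
      return "ok";
    }, 5, 10L);
    System.out.println(value + " after " + calls.get() + " attempts");
  }
}
```

A bounded attempt count keeps a genuinely broken read from hanging the test, while transient failures get a few chances to clear.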
Re: AW: Fixing flaky tests?
Haven't built that yet. I was proposing to do so, and was waiting for some discussion on it.

I'll probably get started after my holidays next week.

Chris

In reply to: Xinyu Tan, Monday, August 7, 2023 4:50:02 AM
Re: Fixing flaky tests?
Sure, will take a look.

William

In reply to: Xinyu Tan, 7 August 2023, 10:47
Re: AW: Fixing flaky tests?
Hi, Chris

I can't find the test-server package in the code. Has it been merged? Could you explain it in more detail?

Thanks
---
Xinyu Tan

In reply to: Christofer Dutz, 2023/08/04 16:32:10
Re: Fixing flaky tests?
Hi William,

In my PR (https://github.com/apache/iotdb/pull/10789), there was an NPE (NullPointerException) in the 'oneMemberGroupChange' test (https://github.com/apache/iotdb/actions/runs/5764037692/job/15640048487?pr=10789). You may want to investigate the cause of this issue.

Thanks
--
Xinyu Tan

In reply to: William Song, 2023/08/04 14:59:51
AW: Fixing flaky tests?
Hi Xinyu,

No need to apologize … I’m happy that you have an idea of what’s going wrong.

I don’t know if you saw it, but I proposed a test-server module, which starts the two parts on random free ports and reports them back to the test starting it … this way we’d simply use free ports every time a server is started.

Could this help?

Chris

In reply to: Xinyu Tan, Friday, 4 August 2023, 17:56
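The "random free ports, reported back to the test" idea can be sketched as below. This is not the proposed test-server module: `FreePortServerSketch` is a hypothetical stand-in built on a plain `java.net.ServerSocket`, and whether the thrift server used by IoTDB can be started on port 0 or handed a pre-bound socket is an open assumption here.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;

/** Hypothetical stand-in for one server part; not the proposed test-server module. */
final class FreePortServerSketch implements AutoCloseable {
  private final ServerSocket socket;

  FreePortServerSketch() throws IOException {
    // Port 0 asks the operating system to pick a free ephemeral port.
    // The socket stays open, so there is no separate check-then-bind step.
    this.socket = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
  }

  /** Reports the port that was actually bound back to the caller (the test). */
  int port() {
    return socket.getLocalPort();
  }

  @Override
  public void close() throws IOException {
    socket.close();
  }

  public static void main(String[] args) throws IOException {
    // Two parts started side by side, each on its own OS-assigned port.
    try (FreePortServerSketch partA = new FreePortServerSketch();
        FreePortServerSketch partB = new FreePortServerSketch()) {
      System.out.println("part A on port " + partA.port() + ", part B on port " + partB.port());
    }
  }
}
```

Because each instance holds its socket open until `close()`, two instances running at the same time cannot be given the same port.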
Re: Fixing flaky tests?
Hi Chris,

I deeply apologize for the instability of replicateUsingWALTest. The failures are caused by the frequent starting and stopping of the thrift server in the consensus module tests, which can leave some tests unable to bind to the socket during startup.

Regarding the root cause, we suspect that TCP connections, when disconnected, remain in the TIME_WAIT state for about 4 minutes before the corresponding port becomes available for reuse. Although we have confirmed that the thrift server sets the socket as reusable during startup (https://github.com/apache/thrift/blob/master/lib/java/src/main/java/org/apache/thrift/transport/TNonblockingServerSocket.java#L103), it seems that this setting does not work in some CI environments.

We therefore added logic in https://github.com/apache/iotdb/pull/10530 to block and wait for 60 seconds if the socket cannot be bound. Test failures could still occur, and we suspected that waiting for more than 4 minutes might be necessary, so in https://github.com/apache/iotdb/pull/10540 we increased the timeout to 300 seconds. Regrettably, test failures still happen occasionally. Next, in https://github.com/apache/iotdb/pull/10723, we introduced logic to dynamically detect available ports, hoping that switching to different ports would reduce the probability of failure. However, even after confirming that a port is available, failures still occur during the actual startup of the thrift server. I have now started another attempt (https://github.com/apache/iotdb/pull/10789), but I am uncertain whether it will be effective.

Through this series of efforts, we have significantly reduced the probability of encountering issues in the CI, but unfortunately the problem still recurs occasionally. This issue is truly frustrating and disheartening. I wonder if the community has any better solutions that could help me…

Thanks
--
Xinyu Tan

In reply to: Christofer Dutz, 2023/08/04 14:13:38
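The block-and-wait approach described in this message (retry the bind until a timeout expires, with the socket marked reusable) can be sketched roughly as follows. This is not the code from the linked PRs: `BindRetrySketch` and `bindWithRetry` are hypothetical names, a plain `java.net.ServerSocket` stands in for the thrift server, and the timeout and pause values are placeholders.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

/** Hypothetical helper for illustration; not taken from the IoTDB code base. */
final class BindRetrySketch {

  /**
   * Tries to bind the given port on the loopback address until timeoutMillis
   * has elapsed, pausing pauseMillis between attempts.
   */
  static ServerSocket bindWithRetry(int port, long timeoutMillis, long pauseMillis)
      throws IOException, InterruptedException {
    long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
    while (true) {
      ServerSocket socket = new ServerSocket(); // created unbound
      try {
        socket.setReuseAddress(true); // has to be set before bind() to have an effect
        socket.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), port));
        return socket;
      } catch (IOException e) {
        socket.close();
        boolean retryable = e instanceof BindException;
        if (!retryable || System.nanoTime() - deadline >= 0) {
          throw e; // give up: not a bind problem, or the timeout has expired
        }
        Thread.sleep(pauseMillis);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    int port;
    // Port 0 lets the operating system pick a free port for the first bind.
    try (ServerSocket first = bindWithRetry(0, 1_000, 50)) {
      port = first.getLocalPort();
    }
    // Rebind the port that was just released, waiting up to one second.
    try (ServerSocket second = bindWithRetry(port, 1_000, 50)) {
      System.out.println("rebound port " + second.getLocalPort());
    }
  }
}
```

A retry loop like this only helps if the port eventually becomes free; it does nothing when another process holds the port for the whole timeout.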
Re: Fixing flaky tests?
Hi Chris,

I will take a look at RatisConsensusTest. In case the tests fail next time, feel free to mention me directly in the PR. This way, I can view the complete error stack.

William

In reply to: Christofer Dutz, 4 August 2023, 17:13
Fixing flaky tests?
Hi all,

So, in the past days I’ve been building IoTDB on several OSes and have noticed some tests repeatedly failing the build, but succeeding as soon as I run them again.
To sum it up, it’s mostly these tests:

-- IoTDB: Core: Consensus

RatisConsensusTest.removeMemberFromGroup:148->doConsensus:258 NullPointer Cann…
RatisConsensusTest.addMemberToGroup:116->doConsensus:258 NullPointer Cannot in...
ReplicateTest.replicateUsingWALTest:257->initServer:147 » IO org.apache.iotdb

-- IoTDB: Core: Node Commons

Keeps on failing because of left-over iotdb server instances.

I would be happy to tackle the regularly failing Node Commons tests by implementing the Test-Runner that I mentioned before, which will start and run IoTDB inside the VM running the tests, so the instance will be shut down as soon as the test is finished. This should eliminate that problem. However, I have no idea if anyone is working on the RatisConsensusTest and the ReplicateTest.

Chris
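The lifecycle idea behind such a Test-Runner (the instance lives inside the test JVM and is stopped when the test ends) can be sketched as below. This is an assumption-laden illustration, not the proposed Test-Runner: `InProcessInstanceSketch` is a hypothetical name and a sleeping thread stands in for a real IoTDB instance.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for an instance started inside the test JVM. */
final class InProcessInstanceSketch implements AutoCloseable {
  private final CountDownLatch started = new CountDownLatch(1);
  private final Thread worker;

  InProcessInstanceSketch() throws InterruptedException {
    worker = new Thread(() -> {
      started.countDown();
      try {
        Thread.sleep(Long.MAX_VALUE); // stands in for the server's main loop
      } catch (InterruptedException e) {
        // Interrupted by close(): fall through so the thread ends.
      }
    }, "in-process-instance");
    worker.setDaemon(true); // never keeps the JVM alive after the tests
    worker.start();
    if (!started.await(5, TimeUnit.SECONDS)) {
      throw new IllegalStateException("instance did not start");
    }
  }

  boolean isRunning() {
    return worker.isAlive();
  }

  @Override
  public void close() throws InterruptedException {
    worker.interrupt();
    worker.join(5_000); // wait for the instance to stop before the test returns
  }

  public static void main(String[] args) throws InterruptedException {
    InProcessInstanceSketch handle;
    // try-with-resources stops the instance even if the test body throws.
    try (InProcessInstanceSketch instance = new InProcessInstanceSketch()) {
      handle = instance;
      System.out.println("running during test: " + instance.isRunning());
    }
    System.out.println("running after test: " + handle.isRunning());
  }
}
```

Tying shutdown to the test's own scope is what prevents left-over instances from interfering with later tests.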