It turns out that there are quite a few bugs in zero-copy. For a list of bugs, see https://github.com/apache/ratis/pull/1156
Note that the list is likely incomplete. After the fix, it can pass all tests (with a few retries). Since the fix is quite big and non-trivial, I am currently splitting it into several JIRAs. After merging them, I will continue debugging zero-copy.

Tsz-Wo

On Tue, Sep 10, 2024 at 5:07 PM Tsz Wo Sze <[email protected]> wrote:

> Hi Wei-Chiu,
>
> Thanks for reporting this!
>
> The failure of TestRaftWithGrpc is related to RATIS-2129 and zero-copy;
> see https://issues.apache.org/jira/browse/RATIS-2151 "TestRaftWithGrpc
> may fail after RATIS-2129".
>
> - It does not fail before RATIS-2129 with zero-copy.
> - It does not fail after RATIS-2129 without zero-copy.
>
> However,
> - It fails frequently after RATIS-2129 with zero-copy.
>
> Tsz-Wo
>
>
> On Tue, Sep 10, 2024 at 4:57 PM Wei-Chiu Chuang
> <[email protected]> wrote:
>
>> Hi it looks like TestRaftWithGrpc is failing consistently.
>>
>> Looking at git history, https://github.com/apache/ratis/commits/master/
>> the failure has been there since
>> RATIS-2129 <https://issues.apache.org/jira/browse/RATIS-2129>. Low
>> replication performance because LogAppender is often blocked by RaftLog's
>> readLock. (#1141 <https://github.com/apache/ratis/pull/1141>)
>> <https://github.com/apache/ratis/commit/781d61d37411b374f104eb0806e1e2c4090fb35e>
>>
>> Here is one example:
>>
>> Error:
>> org.apache.ratis.grpc.TestRaftWithGrpc.testUpdateViaHeartbeat(Boolean)[2]
>> Time elapsed: 6.801 s <<< ERROR!
>> <https://github.com/apache/ratis/actions/runs/10786817671/job/29914349737#step:5:1002>
>> java.lang.IllegalStateException: allLeaks.size = 15
>>   at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:77)
>>   at org.apache.ratis.util.LeakDetector.assertNoLeaks(LeakDetector.java:107)
>>   at org.apache.ratis.server.impl.MiniRaftCluster.shutdown(MiniRaftCluster.java:869)
>>   at org.apache.ratis.grpc.MiniRaftClusterWithGrpc.shutdown(MiniRaftClusterWithGrpc.java:93)
>>   at org.apache.ratis.server.impl.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:149)
>>   at org.apache.ratis.server.impl.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:121)
>>   at org.apache.ratis.grpc.TestRaftWithGrpc.testUpdateViaHeartbeat(TestRaftWithGrpc.java:76)
>>
>> Not sure if it's a production code issue or test issue.
