It turns out that there are quite a few bugs in zero-copy.  For a list of
bugs, see https://github.com/apache/ratis/pull/1156

Note that the list is likely incomplete.  After the fix, it can pass all
tests (with a few retries).

Since the fix is quite big and non-trivial, I am currently splitting it
into several JIRAs.  After merging them, I will continue debugging zero-copy.

Tsz-Wo



On Tue, Sep 10, 2024 at 5:07 PM Tsz Wo Sze <[email protected]> wrote:

> Hi Wei-Chiu,
>
> Thanks for reporting this!
>
> The failure of TestRaftWithGrpc is related to RATIS-2129 and zero-copy;
> see https://issues.apache.org/jira/browse/RATIS-2151 "TestRaftWithGrpc
> may fail after RATIS-2129".
>
> - It does not fail before RATIS-2129 with zero-copy.
> - It does not fail after RATIS-2129 without zero-copy.
>
> However,
> - It fails frequently after RATIS-2129 with zero-copy.
>
> Tsz-Wo
>
>
> On Tue, Sep 10, 2024 at 4:57 PM Wei-Chiu Chuang
> <[email protected]> wrote:
>
>> Hi, it looks like TestRaftWithGrpc is failing consistently.
>>
>> Looking at git history, https://github.com/apache/ratis/commits/master/
>> the failure has been there since RATIS-2129
>> <https://issues.apache.org/jira/browse/RATIS-2129>, "Low replication
>> performance because LogAppender is often blocked by RaftLog's readLock"
>> (#1141 <https://github.com/apache/ratis/pull/1141>):
>> https://github.com/apache/ratis/commit/781d61d37411b374f104eb0806e1e2c4090fb35e
>>
>>
>> Here is one example:
>>
>> Error:
>> org.apache.ratis.grpc.TestRaftWithGrpc.testUpdateViaHeartbeat(Boolean)[2]
>> Time elapsed: 6.801 s <<< ERROR!
>> java.lang.IllegalStateException: allLeaks.size = 15
>>   at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:77)
>>   at org.apache.ratis.util.LeakDetector.assertNoLeaks(LeakDetector.java:107)
>>   at org.apache.ratis.server.impl.MiniRaftCluster.shutdown(MiniRaftCluster.java:869)
>>   at org.apache.ratis.grpc.MiniRaftClusterWithGrpc.shutdown(MiniRaftClusterWithGrpc.java:93)
>>   at org.apache.ratis.server.impl.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:149)
>>   at org.apache.ratis.server.impl.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:121)
>>   at org.apache.ratis.grpc.TestRaftWithGrpc.testUpdateViaHeartbeat(TestRaftWithGrpc.java:76)
>>
>> (log: https://github.com/apache/ratis/actions/runs/10786817671/job/29914349737#step:5:1002)
>>
>> Not sure if it's a production code issue or a test issue.
>>
>
