Hi Duong,

There are still zero-copy bugs, since TestRaftWithGrpc keeps failing;
see https://issues.apache.org/jira/browse/RATIS-2184

The failure is likely caused by either
- bugs in zero-copy, or
- bugs in CodeInjectionForTesting with zero-copy – some tests may inject
code but not clean it up correctly (e.g. a test has blocked some code and
never unblocks it).
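
To make the second possibility concrete, here is a minimal sketch of a test
helper that always unblocks and removes what it injected.  The registry
below (INJECTIONS, execute, runWithBlock) is a hypothetical stand-in
written for illustration; it is not the CodeInjectionForTesting API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for an injection registry; NOT the Ratis API.
final class InjectionCleanupSketch {
  static final Map<String, Runnable> INJECTIONS = new ConcurrentHashMap<>();

  // Called by production code at an injection point; runs injected code, if any.
  static void execute(String point) {
    final Runnable code = INJECTIONS.get(point);
    if (code != null) {
      code.run();
    }
  }

  // Injects blocking code at the given point for the duration of testBody only.
  static void runWithBlock(String point, Runnable testBody) {
    final CountDownLatch unblock = new CountDownLatch(1);
    INJECTIONS.put(point, () -> {
      try {
        unblock.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    try {
      testBody.run();
    } finally {
      // Runs even when testBody throws: release blocked threads, then remove the injection.
      unblock.countDown();
      INJECTIONS.remove(point);
    }
  }
}
```

If a test skips the finally step, a thread stuck at the injection point may
keep holding a retained object, which would then show up as a leak at
shutdown.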

To fix the remaining zero-copy bugs, we may enable advanced tracing as
below and run the test until it fails.  Then, check the retain-release
trace to see

   1. why an object is NOT released properly, or
   2. why an object is released without being retained.  (This case
   seems to be already fixed.)

+++ b/ratis-grpc/src/test/java/org/apache/ratis/grpc/MiniRaftClusterWithGrpc.java
@@ -52,7 +52,7 @@ public class MiniRaftClusterWithGrpc extends MiniRaftCluster.RpcBase {

   static {
     // TODO move it to MiniRaftCluster for detecting non-gRPC cases
-    ReferenceCountedLeakDetector.enable(false);
+    ReferenceCountedLeakDetector.enable(true);
   }
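
As an illustration of what a retain-release trace can answer, here is a
minimal, self-contained sketch.  TracedRefSketch and its methods are
hypothetical names made up for this example; it is not how
ReferenceCountedLeakDetector is implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical traced reference counter; NOT the Ratis implementation.
final class TracedRefSketch {
  private final String name;
  private int count = 0;
  private final List<String> trace = new ArrayList<>();

  TracedRefSketch(String name) {
    this.name = name;
  }

  synchronized void retain() {
    count++;
    record("retain");
  }

  synchronized void release() {
    if (count == 0) {
      // Question 2: released without being retained.
      record("release-without-retain");
      throw new IllegalStateException(name + " released without being retained; trace=" + trace);
    }
    count--;
    record("release");
  }

  // Question 1: a positive count at shutdown means some retain was never released.
  synchronized boolean isLeaked() {
    return count > 0;
  }

  synchronized List<String> getTrace() {
    return new ArrayList<>(trace);
  }

  private void record(String op) {
    // stack[0] = getStackTrace, [1] = record, [2] = retain/release, [3] = the caller
    final StackTraceElement[] stack = Thread.currentThread().getStackTrace();
    final String caller = stack.length > 3 ? stack[3].toString() : "unknown";
    trace.add(op + " -> count=" + count + " at " + caller);
  }
}
```

For a leaked object, the trace lists every retain and release with its
call site, so the retain that has no matching release can be found by
pairing the entries.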


Tsz-Wo


On Fri, Nov 1, 2024 at 6:31 PM Duong Nguyen <[email protected]> wrote:

> Hi Nicholas,
>
> Thanks a lot for identifying and fixing those issues.
> I just see that RATIS-2164, RATIS-2151, RATIS-2173 have already been
> fixed. Does it mean we can consider releasing a version from Ratis master
> branch, e.g. 3.2.0?
>
> Thanks,
> Duong
>
> On 2024/10/09 00:38:16 Tsz Wo Sze wrote:
> > It turns out that there are quite a few bugs in zero-copy.  For a list of
> > bugs, see https://github.com/apache/ratis/pull/1156
> >
> > Note that the list is likely incomplete.  After the fix, it can pass all
> > tests (with a few retries).
> >
> > Since the fix is quite big and non-trivial, I am currently splitting it
> > into several JIRAs.  After merging them, I will continue debugging
> > zero-copy.
> >
> > Tsz-Wo
> >
> >
> >
> > On Tue, Sep 10, 2024 at 5:07 PM Tsz Wo Sze <[email protected]> wrote:
> >
> > > Hi Wei-Chiu,
> > >
> > > Thanks for reporting this!
> > >
> > > The failure of TestRaftWithGrpc is related to RATIS-2129 and zero-copy;
> > > see https://issues.apache.org/jira/browse/RATIS-2151 "TestRaftWithGrpc
> > > may fail after RATIS-2129".
> > >
> > > - It does not fail before RATIS-2129 with zero-copy.
> > > - It does not fail after RATIS-2129 without zero-copy.
> > >
> > > However,
> > > - It fails frequently after RATIS-2129 with zero-copy.
> > >
> > > Tsz-Wo
> > >
> > >
> > > On Tue, Sep 10, 2024 at 4:57 PM Wei-Chiu Chuang
> > > <[email protected]> wrote:
> > >
> > >> Hi, it looks like TestRaftWithGrpc is failing consistently.
> > >>
> > >> Looking at git history, https://github.com/apache/ratis/commits/master/
> > >> the failure has been there since
> > >> RATIS-2129 <https://issues.apache.org/jira/browse/RATIS-2129>. Low
> > >> replication performance because LogAppender is often blocked by
> > >> RaftLog's readLock. (#1141 <https://github.com/apache/ratis/pull/1141>)
> > >> <https://github.com/apache/ratis/commit/781d61d37411b374f104eb0806e1e2c4090fb35e>
> > >>
> > >>
> > >> Here is one example:
> > >>
> > >> Error: org.apache.ratis.grpc.TestRaftWithGrpc.testUpdateViaHeartbeat(Boolean)[2]
> > >> Time elapsed: 6.801 s <<< ERROR!
> > >> java.lang.IllegalStateException: allLeaks.size = 15
> > >>   at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:77)
> > >>   at org.apache.ratis.util.LeakDetector.assertNoLeaks(LeakDetector.java:107)
> > >>   at org.apache.ratis.server.impl.MiniRaftCluster.shutdown(MiniRaftCluster.java:869)
> > >>   at org.apache.ratis.grpc.MiniRaftClusterWithGrpc.shutdown(MiniRaftClusterWithGrpc.java:93)
> > >>   at org.apache.ratis.server.impl.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:149)
> > >>   at org.apache.ratis.server.impl.MiniRaftCluster$Factory$Get.runWithNewCluster(MiniRaftCluster.java:121)
> > >>   at org.apache.ratis.grpc.TestRaftWithGrpc.testUpdateViaHeartbeat(TestRaftWithGrpc.java:76)
> > >>
> > >> Log: https://github.com/apache/ratis/actions/runs/10786817671/job/29914349737#step:5:1002
> > >>
> > >> Not sure if it's a production code issue or test issue.
> > >>
> > >
> >
>
