We have also seen this issue a few times in our production clusters (4.1). My teammates added a thread dump here:
<https://issues.apache.org/jira/browse/CASSANDRA-19564?focusedCommentId=17893987&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17893987>

One of my theories is that the MemtableReclaimMemory thread has been unable to make progress:
"MemtableReclaimMemory:1" - Thread t@56
java.lang.Thread.State: WAITING
at [email protected]/jdk.internal.misc.Unsafe.park(Native Method)
at
[email protected]/java.util.concurrent.locks.LockSupport.park(LockSupport.java:323)
at
app//org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:289)
at
app//org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:282)
at
app//org.apache.cassandra.utils.concurrent.Awaitable$Defaults.awaitUninterruptibly(Awaitable.java:186)
at
app//org.apache.cassandra.utils.concurrent.Awaitable$AbstractAwaitable.awaitUninterruptibly(Awaitable.java:259)
at
app//org.apache.cassandra.utils.concurrent.OpOrder$Group.await(OpOrder.java:267)
at
app//org.apache.cassandra.utils.concurrent.OpOrder$Barrier.await(OpOrder.java:425)
at
app//org.apache.cassandra.db.ColumnFamilyStore$Flush$1.runMayThrow(ColumnFamilyStore.java:1357)
at
app//org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at
app//org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:133)
at
[email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at
[email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at
app//io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at [email protected]/java.lang.Thread.run(Thread.java:829)
One of the possible bugs I could see is this line
<https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/db/ReadExecutionController.java#L141C77-L141C97>:

    indexController = new ReadExecutionController(command, indexCfs.readOrdering.start(),
                                                  indexCfs.metadata(), null, null,
                                                  NO_SAMPLING, false);

If "indexCfs.readOrdering.start()" succeeded but the constructor "new ReadExecutionController" threw, then we are not closing the group started on "indexCfs.readOrdering", which means it will remain inaccurate forever.
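If that theory holds, a rewrite along these lines would avoid leaking the group. This is only a sketch against the 4.1 source (not compiled; the local variable name is mine):

    // Sketch of a possible fix: take the group first so it can be
    // closed if the ReadExecutionController constructor throws.
    OpOrder.Group indexReadGroup = indexCfs.readOrdering.start();
    try
    {
        indexController = new ReadExecutionController(command, indexReadGroup,
                                                      indexCfs.metadata(), null, null,
                                                      NO_SAMPLING, false);
    }
    catch (Throwable t)
    {
        indexReadGroup.close(); // release the read op order group on failure
        throw t;
    }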
Jaydeep
On Wed, Nov 6, 2024 at 5:51 AM Bowen Song via user <[email protected]> wrote:
> I think I'm getting really close now. This seems to have something to do
> with the "read-hotness-tracker:1" thread. The thread dump is:
>
> "read-hotness-tracker:1" daemon prio=5 tid=93 WAITING
> at jdk.internal.misc.Unsafe.park(Native Method)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:323)
> at
> org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:289)
> local variable:
> org.apache.cassandra.utils.concurrent.WaitQueue$Standard$SignalWithListener#2086139
> at
> org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:282)
> local variable:
> org.apache.cassandra.utils.concurrent.WaitQueue$Standard$SignalWithListener#2086139
> at
> org.apache.cassandra.utils.concurrent.Awaitable$Defaults.awaitUninterruptibly(Awaitable.java:186)
> local variable:
> org.apache.cassandra.utils.concurrent.WaitQueue$Standard$SignalWithListener#2086139
> at
> org.apache.cassandra.utils.concurrent.Awaitable$AbstractAwaitable.awaitUninterruptibly(Awaitable.java:259)
> at
> org.apache.cassandra.db.commitlog.AbstractCommitLogService.awaitSyncAt(AbstractCommitLogService.java:324)
> <------ here
> local variable:
> org.apache.cassandra.db.commitlog.PeriodicCommitLogService#1
> local variable: com.codahale.metrics.Timer$Context#2086795
> at
> org.apache.cassandra.db.commitlog.PeriodicCommitLogService.maybeWaitForSync(PeriodicCommitLogService.java:42)
> local variable:
> org.apache.cassandra.db.commitlog.PeriodicCommitLogService#1
> at
> org.apache.cassandra.db.commitlog.AbstractCommitLogService.finishWriteFor(AbstractCommitLogService.java:284)
> local variable:
> org.apache.cassandra.db.commitlog.PeriodicCommitLogService#1
> at org.apache.cassandra.db.commitlog.CommitLog.add(CommitLog.java:307)
> local variable: org.apache.cassandra.db.commitlog.CommitLog#1
> local variable: org.apache.cassandra.io.util.DataOutputBuffer$1$1#61
> local variable:
> org.apache.cassandra.db.commitlog.CommitLogSegment$Allocation#1
> at
> org.apache.cassandra.db.CassandraKeyspaceWriteHandler.addToCommitLog(CassandraKeyspaceWriteHandler.java:100)
> at
> org.apache.cassandra.db.CassandraKeyspaceWriteHandler.beginWrite(CassandraKeyspaceWriteHandler.java:54)
> <------ and here
> local variable:
> org.apache.cassandra.utils.concurrent.OpOrder$Group#8162
> at org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:628)
> local variable: org.apache.cassandra.db.Keyspace#8
> local variable: org.apache.cassandra.db.Mutation#54491
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:512)
> at org.apache.cassandra.db.Mutation.apply(Mutation.java:228)
> at org.apache.cassandra.db.Mutation.apply(Mutation.java:248)
> at
> org.apache.cassandra.cql3.statements.ModificationStatement.executeInternalWithoutCondition(ModificationStatement.java:675)
> at
> org.apache.cassandra.cql3.statements.ModificationStatement.executeLocally(ModificationStatement.java:666)
> at
> org.apache.cassandra.cql3.QueryProcessor.executeInternal(QueryProcessor.java:447)
> at
> org.apache.cassandra.db.SystemKeyspace.persistSSTableReadMeter(SystemKeyspace.java:1488)
> local variable: java.lang.String#42806
> local variable: java.lang.String#47659
> local variable:
> org.apache.cassandra.io.sstable.UUIDBasedSSTableId#1389
> local variable: org.apache.cassandra.metrics.RestorableMeter#752
> at
> org.apache.cassandra.io.sstable.format.SSTableReader$GlobalTidy$1.run(SSTableReader.java:2170)
> at
> org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:124)
> local variable:
> org.apache.cassandra.concurrent.ExecutionFailure$1#122029
> local variable:
> org.apache.cassandra.utils.WithResources$None$$Lambda$223#1
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> local variable: java.util.concurrent.Executors$RunnableAdapter#934
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
> local variable:
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask#931
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
> local variable:
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask#931
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> local variable:
> org.apache.cassandra.concurrent.ScheduledThreadPoolExecutorPlus#6
> local variable: java.util.concurrent.ThreadPoolExecutor$Worker#7
> local variable: io.netty.util.concurrent.FastThreadLocalThread#7
> local variable:
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask#931
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> local variable: java.util.concurrent.ThreadPoolExecutor$Worker#7
> at
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> local variable: io.netty.util.concurrent.FastThreadLocalRunnable#7
> at java.lang.Thread.run(Thread.java:829)
> local variable: io.netty.util.concurrent.FastThreadLocalThread#7
>
>
> In the CassandraKeyspaceWriteHandler.beginWrite() method:
>
>     public WriteContext beginWrite(Mutation mutation, boolean makeDurable) throws RequestExecutionException
>     {
>         OpOrder.Group group = null;
>         try
>         {
>             group = Keyspace.writeOrder.start(); // <--- here
>
>             // write the mutation to the commitlog and memtables
>             CommitLogPosition position = null;
>             if (makeDurable)
>             {
>                 position = addToCommitLog(mutation); // <--- and here
>             }
>             return new CassandraWriteContext(group, position);
>         }
>         catch (Throwable t)
>         {
>             if (group != null)
>             {
>                 group.close();
>             }
>             throw t;
>         }
>     }
>
> The Keyspace.writeOrder.start() call opens the OpOrder group that the
> MemtableFlushWriter thread's barrier is waiting on, and the same thread
> then calls addToCommitLog(mutation), which indirectly calls
> AbstractCommitLogService.awaitSyncAt():
>
>     void awaitSyncAt(long syncTime, Context context)
>     {
>         do
>         {
>             WaitQueue.Signal signal = context != null ? syncComplete.register(context, Context::stop) : syncComplete.register();
>             if (lastSyncedAt < syncTime)
>                 signal.awaitUninterruptibly(); // <--- here
>             else
>                 signal.cancel();
>         }
>         while (lastSyncedAt < syncTime);
>     }
>
> This thread then got stuck on the signal.awaitUninterruptibly() call.
>
> Now I know what is blocking the memtable flushing, but what I haven't
> been able to figure out is why it got stuck waiting for that signal.
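>
> For what it's worth, the register-then-re-check shape of awaitSyncAt()
> is the classic defence against lost wakeups. A self-contained analogue
> using plain java.util.concurrent primitives (an illustration only, not
> Cassandra's WaitQueue; the class and method names are mine) looks like
> this:
>
>     import java.util.concurrent.locks.Condition;
>     import java.util.concurrent.locks.ReentrantLock;
>
>     public class SyncWaiter
>     {
>         private final ReentrantLock lock = new ReentrantLock();
>         private final Condition syncComplete = lock.newCondition();
>         private long lastSyncedAt = 0; // guarded by lock
>
>         // Waiter side, mirroring the shape of awaitSyncAt(syncTime, ...):
>         // re-check the condition after "registering" (taking the lock),
>         // so a wakeup delivered in between cannot be lost.
>         public void awaitSyncAt(long syncTime) throws InterruptedException
>         {
>             lock.lock();
>             try
>             {
>                 while (lastSyncedAt < syncTime)
>                     syncComplete.await();
>             }
>             finally
>             {
>                 lock.unlock();
>             }
>         }
>
>         // Syncer side: the state must advance before (or atomically with)
>         // the signal, otherwise a waiter that just re-checked lastSyncedAt
>         // can block forever.
>         public void markSynced(long syncedAt)
>         {
>             lock.lock();
>             try
>             {
>                 lastSyncedAt = Math.max(lastSyncedAt, syncedAt);
>                 syncComplete.signalAll();
>             }
>             finally
>             {
>                 lock.unlock();
>             }
>         }
>     }
>
> In this shape, a waiter can only block forever if the syncing side stops
> advancing lastSyncedAt or stops signalling, which points at the commit
> log sync thread itself.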
>
> I would appreciate it if anyone can offer some insight here.
>
>
> On 05/11/2024 17:48, Bowen Song via user wrote:
>
> I will give it a try and see what I can find. I plan to go down the rabbit
> hole tomorrow. Will keep you updated.
> On 05/11/2024 17:34, Jeff Jirsa wrote:
>
> On Nov 5, 2024, at 4:12 AM, Bowen Song via user <[email protected]> wrote:
>
> Writes on this node start to time out and fail. But if left untouched,
> it's only going to get worse, and eventually lead to a JVM OOM and crash.
>
> By inspecting the heap dump created at the OOM, we can see that both of
> the MemtableFlushWriter threads are stuck on line 1190
> <https://github.com/apache/cassandra/blob/8d91b469afd3fcafef7ef85c10c8acc11703ba2d/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1190>
> of ColumnFamilyStore.java:
>
>     // mark writes older than the barrier as blocking progress, permitting them to exceed our memory limit
>     // if they are stuck waiting on it, then wait for them all to complete
>     writeBarrier.markBlocking();
>     writeBarrier.await(); // <----------- stuck here
>
> And the MemtablePostFlush thread is stuck on line 1094
> <https://github.com/apache/cassandra/blob/8d91b469afd3fcafef7ef85c10c8acc11703ba2d/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1094>
> in the same file.
>
>     try
>     {
>         // we wait on the latch for the commitLogUpperBound to be set, and so that waiters
>         // on this task can rely on all prior flushes being complete
>         latch.await(); // <----------- stuck here
>     }
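>
> To make the barrier mechanics concrete, here is a minimal sketch using
> Cassandra's OpOrder directly (API assumed from the 4.1 tree; the demo
> class, thread name and timeouts are mine). A Barrier.await() cannot
> return while any Group started before barrier.issue() remains open,
> which is exactly what writeBarrier.await() above is waiting for:
>
>     import org.apache.cassandra.utils.concurrent.OpOrder;
>
>     public class BarrierDemo
>     {
>         public static void main(String[] args) throws Exception
>         {
>             OpOrder order = new OpOrder();
>
>             // A writer starts an operation group and (deliberately, here)
>             // never closes it -- the same effect as a write stuck in
>             // awaitSyncAt() or a leaked readOrdering group.
>             OpOrder.Group stuckOp = order.start();
>
>             // The flush path issues a barrier and waits for every group
>             // started before the barrier to complete.
>             OpOrder.Barrier barrier = order.newBarrier();
>             barrier.issue();
>             barrier.markBlocking();
>
>             Thread flusher = new Thread(barrier::await, "fake-flusher");
>             flusher.start();
>
>             flusher.join(2000);
>             System.out.println("blocked: " + flusher.isAlive()); // true
>
>             stuckOp.close(); // closing the group releases the barrier
>             flusher.join(2000);
>             System.out.println("blocked: " + flusher.isAlive()); // false
>         }
>     }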
> Our top suspect is CDC interacting with repair, since this started to
> happen shortly after we enabled CDC on the nodes, and each time it
> happened a repair was running. But we have not been able to reproduce
> it in a testing cluster, and we don't know what the next step to
> troubleshoot this issue should be. So I'm posting it to the mailing
> lists, hoping someone may know something about it or can point me in
> the right direction.
>
>
> Wouldn’t be completely surprised if CDC or repair somehow holds a
> barrier. I’ve also seen similar behavior pre-3.0 with “very long
> running read commands” that hold a barrier on the memtable and prevent
> its release.
>
> You’ve got the heap dump (great, way better than most people
> debugging). Are you able to navigate through it and look for references
> to that memtable, or other things holding a barrier?
>