[ https://issues.apache.org/jira/browse/CASSANDRA-18733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Caleb Rackliffe updated CASSANDRA-18733: ---------------------------------------- Fix Version/s: 4.1.x 5.0 5.x > Waiting indefinitely on ReceivedMessage response in StreamSession#receive() > can cause deadlock > ---------------------------------------------------------------------------------------------- > > Key: CASSANDRA-18733 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18733 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Repair, Consistency/Streaming > Reporter: Caleb Rackliffe > Assignee: Caleb Rackliffe > Priority: Normal > Fix For: 4.1.x, 5.0, 5.x > > > I've observed in a recent stack trace from a node running 4.1 what looks like > a deadlock around the {{StreamSession}} monitor lock when > {{StreamSession#receive()}} waits via {{syncUninteruptibly()}} for a response > to a control message. > {noformat} > "Messaging-EventLoop-3-10" #320 daemon prio=5 os_prio=0 cpu=57979617.98ms > elapsed=5587916.03s tid=0x00007f056e88ae00 nid=0x80ec waiting for monitor > entry [0x00007f056d277000] > java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:524) > - waiting to lock <0x00000006816fae70> (a > org.apache.cassandra.streaming.StreamSession) > at > org.apache.cassandra.streaming.StreamSession.onError(StreamSession.java:690) > at > org.apache.cassandra.streaming.async.StreamingMultiplexedChannel.onMessageComplete(StreamingMultiplexedChannel.java:264) > at > org.apache.cassandra.streaming.async.StreamingMultiplexedChannel.lambda$sendMessage$1(StreamingMultiplexedChannel.java:233) > at > org.apache.cassandra.streaming.async.StreamingMultiplexedChannel$$Lambda$2029/0x00000008007a0c40.operationComplete(Unknown > Source) > at > org.apache.cassandra.utils.concurrent.ListenerList.notifyListener(ListenerList.java:134) > at > org.apache.cassandra.utils.concurrent.ListenerList.notifyListener(ListenerList.java:148) > at > org.apache.cassandra.utils.concurrent.ListenerList$GenericFutureListenerList.notifySelf(ListenerList.java:190) > at > org.apache.cassandra.utils.concurrent.ListenerList.lambda$notifyExclusive$0(ListenerList.java:124) > at > org.apache.cassandra.utils.concurrent.ListenerList$$Lambda$950/0x0000000800666040.accept(Unknown > Source) > at > org.apache.cassandra.utils.concurrent.IntrusiveStack.forEach(IntrusiveStack.java:195) > at > org.apache.cassandra.utils.concurrent.ListenerList.notifyExclusive(ListenerList.java:124) > at > org.apache.cassandra.utils.concurrent.ListenerList.notify(ListenerList.java:96) > at > org.apache.cassandra.utils.concurrent.AsyncFuture.trySet(AsyncFuture.java:104) > at > org.apache.cassandra.utils.concurrent.AbstractFuture.tryFailure(AbstractFuture.java:148) > at > org.apache.cassandra.utils.concurrent.AsyncPromise.tryFailure(AsyncPromise.java:139) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:1009) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:870) > at > io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764) > at > io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071) > at > io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) > at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384) > at > io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) > at > io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) > at > io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) > at java.lang.Thread.run(java.base@11.0.16/Thread.java:829) > {noformat} > It seems that while {{receive()} is holding the monitor lock on > {{StreamSession}}, the callback that executes on a different thread for the > control message it sends carries an error. This error, when handled in > {{onError()}}, then calls {{closeSession()}}, which tries to acquire the > monitor lock already held in {{receive()}}. > {noformat} > "Stream-Deserializer-/100.70.229.6:7000-de724029" #1919000 daemon prio=5 > os_prio=0 cpu=224.66ms elapsed=1604976.92s tid=0x00007f0561c66500 nid=0xe2a2 > waiting on condition [0x00007f0830947000] > java.lang.Thread.State: WAITING (parking) > at jdk.internal.misc.Unsafe.park(java.base@11.0.16/Native Method) > at > java.util.concurrent.locks.LockSupport.park(java.base@11.0.16/LockSupport.java:323) > at > org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:289) > at > org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:282) > at > org.apache.cassandra.utils.concurrent.Awaitable$AsyncAwaitable.await(Awaitable.java:306) > at > org.apache.cassandra.utils.concurrent.AsyncFuture.await(AsyncFuture.java:154) > at > org.apache.cassandra.utils.concurrent.AsyncPromise.await(AsyncPromise.java:244) > at > org.apache.cassandra.net.AsyncChannelPromise.await(AsyncChannelPromise.java:127) > at > org.apache.cassandra.net.AsyncChannelPromise.await(AsyncChannelPromise.java:34) > at > org.apache.cassandra.utils.concurrent.Awaitable$Defaults.awaitUninterruptibly(Awaitable.java:186) > at > org.apache.cassandra.utils.concurrent.AbstractFuture.awaitUninterruptibly(AbstractFuture.java:482) > at > org.apache.cassandra.utils.concurrent.AsyncPromise.awaitUninterruptibly(AsyncPromise.java:254) > at > org.apache.cassandra.net.AsyncChannelPromise.awaitUninterruptibly(AsyncChannelPromise.java:133) > at > org.apache.cassandra.net.AsyncChannelPromise.awaitUninterruptibly(AsyncChannelPromise.java:34) > at > org.apache.cassandra.utils.concurrent.Future.syncUninterruptibly(Future.java:94) > at > org.apache.cassandra.utils.concurrent.AsyncPromise.syncUninterruptibly(AsyncPromise.java:186) > at > org.apache.cassandra.net.AsyncChannelPromise.syncUninterruptibly(AsyncChannelPromise.java:121) > at > org.apache.cassandra.net.AsyncChannelPromise.syncUninterruptibly(AsyncChannelPromise.java:34) > at > org.apache.cassandra.streaming.StreamSession.receive(StreamSession.java:1054) > at > org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:628) > - locked <0x00000006816fae70> (a > org.apache.cassandra.streaming.StreamSession) > at > org.apache.cassandra.streaming.StreamDeserializingTask.run(StreamDeserializingTask.java:76) > at > io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) > at java.lang.Thread.run(java.base@11.0.16/Thread.java:829) > {noformat} > The most straightforward way to fix this might be to put an upper bound on > the wait in {{receive()}}, along w/ some logging at WARN if it times out. > This would at least allow us to make progress eventually and close the > session properly. The syncUninterruptibly() was intended to avoid a race > during shutdown, IIRC, and unless the timeout is comically low, it shouldn't > compromise that. > The problem can usually be fixed by bouncing affected nodes, who will usually > present w/ an increasing backlog of unrepaired data and a log message that > looks something like this, in addition to a number of streaming errors: > {noformat} > INFO 2023-08-04T10:45:54,845 [NettyStreaming-Outbound-/<ip:port>] > org.apache.cassandra.streaming.async.StreamingMultiplexedChannel:359 - > [Stream #10465c10-3108-11ee-af8a-3f74fd21ad9d] waiting to acquire a permit to > begin streaming <data directory>-nb-101920-big-Data.db. This message logs > every 3 minutes > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org