[ 
https://issues.apache.org/jira/browse/CASSANDRA-18733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17770373#comment-17770373
 ] 

Andres de la Peña commented on CASSANDRA-18733:
-----------------------------------------------

None of the above CircleCI runs contains the repeated runs for the modified 
test file ({{StreamDisconnectedWhileReceivingTest.java}}). I think this is due 
to a bug in the non-public script used to generate the CircleCI config file.

Those repeated runs can be generated with the project's 
[{{.circleci/generate.sh}}|https://github.com/apache/cassandra/blob/trunk/.circleci/generate.sh]
 script.

The absence of repeated runs is easy to detect by looking at the CI results in 
the workflow view: if the patch changes any tests, there should be jobs named 
with the {{_repeat}} suffix. One can also check whether the pushed 
{{.circleci/config.yml}} file contains the names of the relevant tests.
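For example, a rough check of the pushed config (the {{_repeat}} job-name 
suffix is the project convention mentioned above, but the two-space 
indentation pattern below is an assumption about the generated file's layout, 
not a documented format):
{noformat}
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Lists the job names in .circleci/config.yml that carry the _repeat suffix;
// if nothing is printed, the repeated runs were probably not generated.
public class ListRepeatJobs
{
    public static void main(String[] args) throws Exception
    {
        String config = Files.readString(Path.of(".circleci/config.yml"));
        Matcher m = Pattern.compile("^  (\\S+_repeat):", Pattern.MULTILINE).matcher(config);
        boolean found = false;
        while (m.find())
        {
            System.out.println(m.group(1));
            found = true;
        }
        if (!found)
            System.out.println("No *_repeat jobs found; the repeated runs were probably not generated.");
    }
}
{noformat}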

> Waiting indefinitely on ReceivedMessage response in StreamSession#receive() 
> can cause deadlock
> ----------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-18733
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18733
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Repair, Consistency/Streaming
>            Reporter: Caleb Rackliffe
>            Assignee: Jon Meredith
>            Priority: Normal
>             Fix For: 4.1.4, 5.0, 5.0-alpha1, 5.1
>
>
> I've observed in a recent stack trace from a node running 4.1 what looks like 
> a deadlock around the {{StreamSession}} monitor lock when 
> {{StreamSession#receive()}} waits via {{syncUninterruptibly()}} for a response 
> to a control message.
> {noformat}
> "Messaging-EventLoop-3-10" #320 daemon prio=5 os_prio=0 cpu=57979617.98ms 
> elapsed=5587916.03s tid=0x00007f056e88ae00 nid=0x80ec waiting for monitor 
> entry  [0x00007f056d277000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at 
> org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:524)
>         - waiting to lock <0x00000006816fae70> (a 
> org.apache.cassandra.streaming.StreamSession)
>         at 
> org.apache.cassandra.streaming.StreamSession.onError(StreamSession.java:690)
>         at 
> org.apache.cassandra.streaming.async.StreamingMultiplexedChannel.onMessageComplete(StreamingMultiplexedChannel.java:264)
>         at 
> org.apache.cassandra.streaming.async.StreamingMultiplexedChannel.lambda$sendMessage$1(StreamingMultiplexedChannel.java:233)
>         at 
> org.apache.cassandra.streaming.async.StreamingMultiplexedChannel$$Lambda$2029/0x00000008007a0c40.operationComplete(Unknown
>  Source)
>         at 
> org.apache.cassandra.utils.concurrent.ListenerList.notifyListener(ListenerList.java:134)
>         at 
> org.apache.cassandra.utils.concurrent.ListenerList.notifyListener(ListenerList.java:148)
>         at 
> org.apache.cassandra.utils.concurrent.ListenerList$GenericFutureListenerList.notifySelf(ListenerList.java:190)
>         at 
> org.apache.cassandra.utils.concurrent.ListenerList.lambda$notifyExclusive$0(ListenerList.java:124)
>         at 
> org.apache.cassandra.utils.concurrent.ListenerList$$Lambda$950/0x0000000800666040.accept(Unknown
>  Source)
>         at 
> org.apache.cassandra.utils.concurrent.IntrusiveStack.forEach(IntrusiveStack.java:195)
>         at 
> org.apache.cassandra.utils.concurrent.ListenerList.notifyExclusive(ListenerList.java:124)
>         at 
> org.apache.cassandra.utils.concurrent.ListenerList.notify(ListenerList.java:96)
>         at 
> org.apache.cassandra.utils.concurrent.AsyncFuture.trySet(AsyncFuture.java:104)
>         at 
> org.apache.cassandra.utils.concurrent.AbstractFuture.tryFailure(AbstractFuture.java:148)
>         at 
> org.apache.cassandra.utils.concurrent.AsyncPromise.tryFailure(AsyncPromise.java:139)
>         at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:1009)
>         at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:870)
>         at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
>         at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
>         at 
> io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
>         at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>         at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>         at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
>         at 
> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>         at 
> io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>         at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>         at java.lang.Thread.run(java.base@11.0.16/Thread.java:829)
> {noformat}
> It seems that while {{receive()}} is holding the monitor lock on 
> {{StreamSession}}, the callback for the control message it sends executes on a 
> different thread and carries an error. When that error is handled in 
> {{onError()}}, it calls {{closeSession()}}, which tries to acquire the monitor 
> lock already held in {{receive()}} (see the simplified sketch after the trace 
> below).
> {noformat}
> "Stream-Deserializer-/100.70.229.6:7000-de724029" #1919000 daemon prio=5 
> os_prio=0 cpu=224.66ms elapsed=1604976.92s tid=0x00007f0561c66500 nid=0xe2a2 
> waiting on condition  [0x00007f0830947000]
>    java.lang.Thread.State: WAITING (parking)
>         at jdk.internal.misc.Unsafe.park(java.base@11.0.16/Native Method)
>         at 
> java.util.concurrent.locks.LockSupport.park(java.base@11.0.16/LockSupport.java:323)
>         at 
> org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:289)
>         at 
> org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:282)
>         at 
> org.apache.cassandra.utils.concurrent.Awaitable$AsyncAwaitable.await(Awaitable.java:306)
>         at 
> org.apache.cassandra.utils.concurrent.AsyncFuture.await(AsyncFuture.java:154)
>         at 
> org.apache.cassandra.utils.concurrent.AsyncPromise.await(AsyncPromise.java:244)
>         at 
> org.apache.cassandra.net.AsyncChannelPromise.await(AsyncChannelPromise.java:127)
>         at 
> org.apache.cassandra.net.AsyncChannelPromise.await(AsyncChannelPromise.java:34)
>         at 
> org.apache.cassandra.utils.concurrent.Awaitable$Defaults.awaitUninterruptibly(Awaitable.java:186)
>         at 
> org.apache.cassandra.utils.concurrent.AbstractFuture.awaitUninterruptibly(AbstractFuture.java:482)
>         at 
> org.apache.cassandra.utils.concurrent.AsyncPromise.awaitUninterruptibly(AsyncPromise.java:254)
>         at 
> org.apache.cassandra.net.AsyncChannelPromise.awaitUninterruptibly(AsyncChannelPromise.java:133)
>         at 
> org.apache.cassandra.net.AsyncChannelPromise.awaitUninterruptibly(AsyncChannelPromise.java:34)
>         at 
> org.apache.cassandra.utils.concurrent.Future.syncUninterruptibly(Future.java:94)
>         at 
> org.apache.cassandra.utils.concurrent.AsyncPromise.syncUninterruptibly(AsyncPromise.java:186)
>         at 
> org.apache.cassandra.net.AsyncChannelPromise.syncUninterruptibly(AsyncChannelPromise.java:121)
>         at 
> org.apache.cassandra.net.AsyncChannelPromise.syncUninterruptibly(AsyncChannelPromise.java:34)
>         at 
> org.apache.cassandra.streaming.StreamSession.receive(StreamSession.java:1054)
>         at 
> org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:628)
>         - locked <0x00000006816fae70> (a 
> org.apache.cassandra.streaming.StreamSession)
>         at 
> org.apache.cassandra.streaming.StreamDeserializingTask.run(StreamDeserializingTask.java:76)
>         at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>         at java.lang.Thread.run(java.base@11.0.16/Thread.java:829)
> {noformat}
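> As a simplified illustration of that lock inversion, here is a minimal, 
> self-contained sketch. The class and method names are illustrative only, not 
> Cassandra's, and the one behaviour carried over from the traces is the 
> ordering: the thread handling the failed write runs the failure listener, 
> which needs the session monitor, before the thread blocked in {{receive()}} 
> is ever signalled.
> {noformat}
> import java.util.concurrent.CompletableFuture;
> 
> public class StreamSessionDeadlockSketch
> {
>     private final CompletableFuture<Void> controlMessageAck = new CompletableFuture<>();
> 
>     // Analogue of messageReceived() -> receive(): blocks on the control-message
>     // future while holding this object's monitor (like syncUninterruptibly()).
>     synchronized void receiveAndWait()
>     {
>         controlMessageAck.join();
>     }
> 
>     // Analogue of onError() -> closeSession(): needs the monitor held above.
>     synchronized void closeSession()
>     {
>         // Never reached in this demo: the monitor is held by the thread parked
>         // in receiveAndWait(), which in turn waits for this ack to complete.
>         controlMessageAck.completeExceptionally(new RuntimeException("session closed"));
>     }
> 
>     public static void main(String[] args) throws InterruptedException
>     {
>         StreamSessionDeadlockSketch session = new StreamSessionDeadlockSketch();
> 
>         // "Stream-Deserializer" analogue: takes the monitor, then parks on the ack.
>         Thread deserializer = new Thread(session::receiveAndWait, "deserializer-analogue");
>         deserializer.start();
>         Thread.sleep(200); // let it acquire the monitor and park
> 
>         // "Messaging-EventLoop" analogue: the failed write's listener tries to
>         // close the session, i.e. to enter the monitor the deserializer holds.
>         Thread eventLoop = new Thread(session::closeSession, "event-loop-analogue");
>         eventLoop.start();
>         Thread.sleep(2000);
> 
>         // Expected: deserializer-analogue WAITING, event-loop-analogue BLOCKED,
>         // mirroring the two thread states in the traces above.
>         System.out.println(deserializer.getName() + " " + deserializer.getState());
>         System.out.println(eventLoop.getName() + " " + eventLoop.getState());
>         System.exit(0); // both helper threads are permanently stuck
>     }
> }
> {noformat}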
> The most straightforward way to fix this might be to put an upper bound on 
> the wait in {{receive()}}, along with some logging at WARN if it times out. 
> This would at least allow us to make progress eventually and close the 
> session properly. The {{syncUninterruptibly()}} was intended to avoid a race 
> during shutdown, IIRC, and unless the timeout is comically low, it shouldn't 
> compromise that.
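> A rough sketch of that bounded-wait idea, using plain {{java.util.concurrent}} 
> types rather than Cassandra's internal {{Future}}/{{Promise}} classes (the 
> timeout value and all names here are hypothetical, not the committed patch):
> {noformat}
> import java.util.concurrent.CompletableFuture;
> import java.util.concurrent.ExecutionException;
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.TimeoutException;
> 
> public class BoundedControlMessageWait
> {
>     // Hypothetical upper bound; anything far above normal ack latency would do.
>     private static final long CONTROL_ACK_TIMEOUT_SECONDS = 60;
> 
>     // Waits for the control-message ack, but only up to a fixed bound, so a
>     // failed send whose error handling needs the caller's monitor cannot leave
>     // the caller blocked forever.
>     static void awaitWithUpperBound(CompletableFuture<Void> sendFuture)
>     {
>         try
>         {
>             sendFuture.get(CONTROL_ACK_TIMEOUT_SECONDS, TimeUnit.SECONDS);
>         }
>         catch (TimeoutException e)
>         {
>             // Log at WARN and fall through: the session can still make progress
>             // and be closed cleanly instead of deadlocking on the monitor.
>             System.err.println("WARN: timed out after " + CONTROL_ACK_TIMEOUT_SECONDS
>                                + "s waiting for control message ack; continuing");
>         }
>         catch (InterruptedException e)
>         {
>             Thread.currentThread().interrupt();
>         }
>         catch (ExecutionException e)
>         {
>             throw new RuntimeException("control message send failed", e.getCause());
>         }
>     }
> }
> {noformat}
> The point is only that the wait is bounded; a real patch would use the 
> streaming code's own future type and a configurable timeout.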
> The problem can usually be worked around by bouncing the affected nodes, 
> which typically present with an increasing backlog of unrepaired data and, in 
> addition to a number of streaming errors, a log message that looks something 
> like this:
> {noformat}
> INFO  2023-08-04T10:45:54,845 [NettyStreaming-Outbound-/<ip:port>] org.apache.cassandra.streaming.async.StreamingMultiplexedChannel:359 - [Stream #10465c10-3108-11ee-af8a-3f74fd21ad9d] waiting to acquire a permit to begin streaming <data directory>-nb-101920-big-Data.db. This message logs every 3 minutes
> {noformat}


