[ 
https://issues.apache.org/jira/browse/RATIS-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17859945#comment-17859945
 ] 

Haibo Sun commented on RATIS-2116:
----------------------------------

[~szetszwo], yes, I am planning to submit a pull request to fix it. Please 
assign this JIRA ticket to me. Thank you!

> Follower state synchronization is blocked
> -----------------------------------------
>
>                 Key: RATIS-2116
>                 URL: https://issues.apache.org/jira/browse/RATIS-2116
>             Project: Ratis
>          Issue Type: Bug
>    Affects Versions: 3.0.0, 2.5.1, 3.0.1
>            Reporter: Haibo Sun
>            Priority: Major
>         Attachments: debug.log
>
>
> Using version 2.5.1, we found that in some cases the follower's state 
> synchronization becomes permanently blocked.
> Scenario: When the task queue of the SegmentedRaftLogWorker has the pattern 
> (WriteLog, WriteLog, ..., PurgeLog), the last WriteLog from 
> RaftServerImpl.appendEntries does not immediately flush data and complete its 
> result future, because a PurgeLog task is still pending in the queue. Instead, 
> the result future is enqueued to be completed when a later WriteLog flushes 
> data. However, the "nioEventLoopGroup-3-1" thread is already blocked waiting 
> for that future, so it never adds a new WriteLog to the task queue of the 
> SegmentedRaftLogWorker. The result is a deadlock, and state synchronization 
> stops.
> I confirmed this by adding debug logs; detailed information is attached 
> below. The issue is easy to reproduce by increasing the frequency of 
> TakeSnapshot and PurgeLog operations. I also checked the code in the master 
> branch, and the issue still exists there.
>  
> *jstack:*
> {code:java}
> "nioEventLoopGroup-3-1" #58 prio=10 os_prio=0 tid=0x00007fc58400b800 nid=0x5493a waiting on condition [0x00007fc5b4f28000]
>    java.lang.Thread.State: WAITING (parking)
>     at sun.misc.Unsafe.park0(Native Method)
>     - parking to wait for <0x00007fd86a4685e8> (a java.util.concurrent.CompletableFuture$Signaller)
>     at sun.misc.Unsafe.park(Unsafe.java:1025)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:176)
>     at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
>     at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
>     at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
>     at java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1934)
>     at org.apache.ratis.server.impl.RaftServerImpl.appendEntries(RaftServerImpl.java:1379)
>     at org.apache.ratis.server.impl.RaftServerProxy.appendEntries(RaftServerProxy.java:649)
>     at org.apache.ratis.netty.server.NettyRpcService.handle(NettyRpcService.java:231)
>     at org.apache.ratis.netty.server.NettyRpcService$InboundHandler.channelRead0(NettyRpcService.java:95)
>     at org.apache.ratis.netty.server.NettyRpcService$InboundHandler.channelRead0(NettyRpcService.java:91)
>     at org.apache.ratis.thirdparty.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
>     at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
>     at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>     at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>     at org.apache.ratis.thirdparty.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>     at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
>     at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>     at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>     at org.apache.ratis.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346)
>     at org.apache.ratis.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318)
>     at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
>     at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>     at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>     at org.apache.ratis.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>     at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
>     at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>     at org.apache.ratis.thirdparty.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>     at org.apache.ratis.thirdparty.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
>     at org.apache.ratis.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
>     at org.apache.ratis.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
>     at org.apache.ratis.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
>     at org.apache.ratis.thirdparty.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
>     at org.apache.ratis.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
>     at org.apache.ratis.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>     at org.apache.ratis.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.lang.Thread.run(Thread.java:882)
> {code}
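
The scenario quoted above can be illustrated with a small stand-alone model. This is only a sketch under assumptions: it does not use any Ratis classes, and the names FlushDeadlockSketch, Task, Kind, lastWriteCompletes and the flushAfterAnyTask switch are all hypothetical. It models a single worker that completes WriteLog futures only when a WriteLog is the last task in the queue, and a caller that waits on the last WriteLog future. The switch shows one conceivable way to avoid the hang in this toy model; it is not meant to describe the actual fix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Simplified model only; none of these names are Ratis classes.
public class FlushDeadlockSketch {
  enum Kind { WRITE_LOG, PURGE_LOG }

  static final class Task {
    final Kind kind;
    final CompletableFuture<Void> future = new CompletableFuture<>();
    Task(Kind kind) { this.kind = kind; }
  }

  /** Returns true if the last WriteLog future completes within the timeout. */
  static boolean lastWriteCompletes(boolean flushAfterAnyTask) throws Exception {
    BlockingQueue<Task> queue = new LinkedBlockingQueue<>();
    Task w1 = new Task(Kind.WRITE_LOG);
    Task w2 = new Task(Kind.WRITE_LOG);
    Task purge = new Task(Kind.PURGE_LOG);
    // Enqueue (WriteLog, WriteLog, PurgeLog) before the worker starts,
    // so the queue pattern is deterministic.
    queue.add(w1);
    queue.add(w2);
    queue.add(purge);

    Thread worker = new Thread(() -> {
      List<Task> unflushed = new ArrayList<>(); // writes whose futures are still pending
      try {
        while (true) {
          Task t = queue.take();
          if (t.kind == Kind.WRITE_LOG) {
            unflushed.add(t);
          } else {
            t.future.complete(null); // the purge itself finishes normally
          }
          // Flush only when nothing else is queued; in the problematic mode,
          // only a WriteLog can trigger the flush.
          boolean flushNow = queue.isEmpty()
              && (t.kind == Kind.WRITE_LOG || flushAfterAnyTask);
          if (flushNow) {
            unflushed.forEach(w -> w.future.complete(null));
            unflushed.clear();
          }
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    worker.setDaemon(true);
    worker.start();

    try {
      // Stand-in for the blocking join() in the request-handling thread;
      // a timeout is used so the sketch terminates instead of hanging.
      w2.future.get(500, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException e) {
      return false;
    } finally {
      worker.interrupt();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("flush only after WriteLog: completed = " + lastWriteCompletes(false));
    System.out.println("flush after any task:      completed = " + lastWriteCompletes(true));
  }
}
```

In this model the first call times out, because the caller is the only source of new WriteLog tasks and it is itself waiting, while the second call completes because the trailing PurgeLog also triggers the flush.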



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
