[ https://issues.apache.org/jira/browse/RATIS-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17860367#comment-17860367 ]
Haibo Sun commented on RATIS-2116:
----------------------------------

[~szetszwo] https://github.com/apache/ratis/pull/1116

> Follower state synchronization is blocked
> -----------------------------------------
>
>                 Key: RATIS-2116
>                 URL: https://issues.apache.org/jira/browse/RATIS-2116
>             Project: Ratis
>          Issue Type: Bug
>    Affects Versions: 3.0.0, 2.5.1, 3.0.1
>            Reporter: Haibo Sun
>            Priority: Major
>         Attachments: debug.log
>
>
> Using version 2.5.1, we have found that in some cases the follower's state
> synchronization is permanently blocked.
> Scenario: When the task queue of the SegmentedRaftLogWorker has the pattern
> (WriteLog, WriteLog, ..., PurgeLog), the last WriteLog of
> RaftServerImpl.appendEntries does not immediately flush data and complete the
> result future, because there is a pending PurgeLog task in the queue. Instead,
> the result future is enqueued to be completed after a later WriteLog flushes
> data. However, the "nioEventLoopGroup-3-1" thread is already blocked waiting
> on that future, so it will not add a new WriteLog to the task queue of the
> SegmentedRaftLogWorker. This leads to a deadlock and causes the state
> synchronization to stop.
> I confirmed this by adding debug logs; detailed information is attached
> below. The issue can be reproduced easily by increasing the frequency of
> TakeSnapshot and PurgeLog operations. In addition, I checked the code in the
> master branch, and the issue still exists there.
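
The scenario described above can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions: `ToyLogWorker`, `Kind`, `Task`, and `lastWriteCompletes` are hypothetical names invented for illustration and are not Ratis classes or APIs. The model assumes that a write flushes only when no task is pending behind it, and that a purge never flushes, which is a simplification of the behavior reported in this issue.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

public class DeferredFlushSketch {
  enum Kind { WRITE, PURGE }

  static final class Task {
    final Kind kind;
    final CompletableFuture<Void> future = new CompletableFuture<>();
    Task(Kind kind) { this.kind = kind; }
  }

  /** Hypothetical toy worker; NOT the real SegmentedRaftLogWorker. */
  static final class ToyLogWorker {
    private final Queue<Task> queue = new ArrayDeque<>();
    private final List<Task> unflushed = new ArrayList<>();

    Task submit(Kind kind) {
      Task t = new Task(kind);
      queue.add(t);
      return t;
    }

    /** Processes every queued task in order, as a single worker thread would. */
    void drain() {
      Task t;
      while ((t = queue.poll()) != null) {
        if (t.kind == Kind.WRITE) {
          unflushed.add(t);
          // Assumption: flush only when nothing is pending behind this write.
          if (queue.isEmpty()) {
            flush();
          }
        } else {
          // Assumption: a purge completes itself but never triggers a flush.
          t.future.complete(null);
        }
      }
    }

    private void flush() {
      for (Task w : unflushed) {
        w.future.complete(null);
      }
      unflushed.clear();
    }
  }

  /** Returns whether the last WRITE in the pattern has a completed future after draining. */
  static boolean lastWriteCompletes(Kind... pattern) {
    ToyLogWorker worker = new ToyLogWorker();
    Task lastWrite = null;
    for (Kind k : pattern) {
      Task t = worker.submit(k);
      if (k == Kind.WRITE) {
        lastWrite = t;
      }
    }
    worker.drain();
    return lastWrite != null && lastWrite.future.isDone();
  }

  /** Returns whether a later WRITE releases a write that was stuck behind a PURGE. */
  static boolean laterWriteUnblocks() {
    ToyLogWorker worker = new ToyLogWorker();
    worker.submit(Kind.WRITE);
    Task stuck = worker.submit(Kind.WRITE);
    worker.submit(Kind.PURGE);
    worker.drain();
    boolean wasStuck = !stuck.future.isDone();
    worker.submit(Kind.WRITE); // the write that a blocked caller can never enqueue
    worker.drain();
    return wasStuck && stuck.future.isDone();
  }

  public static void main(String[] args) {
    System.out.println(lastWriteCompletes(Kind.WRITE, Kind.WRITE, Kind.PURGE)); // false
    System.out.println(lastWriteCompletes(Kind.WRITE, Kind.WRITE));             // true
    System.out.println(laterWriteUnblocks());                                   // true
  }
}
```

In this model the stuck future is released only by a later write. If the only thread that could submit that write is itself blocked in `join()` on the stuck future, no later write ever arrives, which matches the deadlock reported here.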
>
> *jstack:*
> {code:java}
> "nioEventLoopGroup-3-1" #58 prio=10 os_prio=0 tid=0x00007fc58400b800 nid=0x5493a waiting on condition [0x00007fc5b4f28000]
>    java.lang.Thread.State: WAITING (parking)
>     at sun.misc.Unsafe.park0(Native Method)
>     parking to wait for <0x00007fd86a4685e8> (a java.util.concurrent.CompletableFuture$Signaller)
>     at sun.misc.Unsafe.park(Unsafe.java:1025)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:176)
>     at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
>     at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
>     at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
>     at java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1934)
>     at org.apache.ratis.server.impl.RaftServerImpl.appendEntries(RaftServerImpl.java:1379)
>     at org.apache.ratis.server.impl.RaftServerProxy.appendEntries(RaftServerProxy.java:649)
>     at org.apache.ratis.netty.server.NettyRpcService.handle(NettyRpcService.java:231)
>     at org.apache.ratis.netty.server.NettyRpcService$InboundHandler.channelRead0(NettyRpcService.java:95)
>     at org.apache.ratis.netty.server.NettyRpcService$InboundHandler.channelRead0(NettyRpcService.java:91)
>     at org.apache.ratis.thirdparty.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
>     at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
>     at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>     at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>     at org.apache.ratis.thirdparty.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>     at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
>     at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>     at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>     at org.apache.ratis.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346)
>     at org.apache.ratis.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318)
>     at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
>     at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>     at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
>     at org.apache.ratis.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>     at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
>     at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
>     at org.apache.ratis.thirdparty.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>     at org.apache.ratis.thirdparty.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
>     at org.apache.ratis.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
>     at org.apache.ratis.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
>     at org.apache.ratis.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
>     at org.apache.ratis.thirdparty.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
>     at org.apache.ratis.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
>     at org.apache.ratis.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>     at org.apache.ratis.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.lang.Thread.run(Thread.java:882)
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)