symious opened a new pull request, #7046: URL: https://github.com/apache/ozone/pull/7046
## What changes were proposed in this pull request?

We hit the following issue: the Datanode command handler was executing a close container request, but the timeout logic is incorrect, so the handler thread blocked and all subsequent requests from SCM were blocked with it.

The jstack output shows the following:

```
"Command processor thread" #215 daemon prio=5 os_prio=0 tid=0x00007fcef3262000 nid=0xa56 waiting on condition [0x00007fcf63f9d000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x00007fd4ab6dcd38> (a java.util.concurrent.CompletableFuture$Signaller)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
        at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
        at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
        at java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947)
        at org.apache.ratis.server.impl.RaftServerImpl.executeSubmitClientRequestAsync(RaftServerImpl.java:816)
        at org.apache.ratis.server.impl.RaftServerProxy.lambda$submitClientRequestAsync$7(RaftServerProxy.java:436)
        at org.apache.ratis.server.impl.RaftServerProxy$$Lambda$827/1961332062.apply(Unknown Source)
        at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:995)
        at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2137)
        at org.apache.ratis.server.impl.RaftServerProxy.submitClientRequestAsync(RaftServerProxy.java:436)
        at org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.submitRequest(XceiverServerRatis.java:611)
        at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.CloseContainerCommandHandler.handle(CloseContainerCommandHandler.java:105)
        at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.CommandDispatcher.handle(CommandDispatcher.java:103)
        at org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.lambda$initCommandHandlerThread$3(DatanodeStateMachine.java:593)
        at org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine$$Lambda$270/1788388131.run(Unknown Source)
        at java.lang.Thread.run(Thread.java:748)
```

The direct cause is that the timeout never takes effect: in Ratis, `executeSubmitClientRequestAsync` performs a `join()`, which parks the calling thread inside `submitClientRequestAsync`, so the timeout on the outer `CompletableFuture` is never reached.

## What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-11291

## How was this patch tested?
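
## Illustration: a join() inside the submit call defeating the outer timeout

The blocking behavior described under "What changes were proposed" can be reproduced in isolation. The following is a minimal sketch, not Ozone or Ratis code: `BlockingSubmitDemo`, `submitAsync`, `timeoutAfterSubmit`, and `timeoutAroundSubmit` are hypothetical names, and the second method is only one possible workaround, not necessarily what this patch does. It assumes, as the trace suggests, that `thenCompose` is invoked on an already-completed future, so the lambda (and its `join()`) runs on the caller's thread.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class BlockingSubmitDemo {

  /** Hypothetical stand-in for an "async" submit that join()s internally. */
  public static CompletableFuture<String> submitAsync(long workMillis) {
    CompletableFuture<String> work = CompletableFuture.supplyAsync(() -> {
      sleep(workMillis);
      return "closed";
    });
    // thenCompose on an already-completed future runs the lambda on the
    // calling thread, so the join() below parks the caller until work is done.
    return CompletableFuture.completedFuture("ready").thenCompose(ignored -> {
      work.join();
      return work;
    });
  }

  /** Problem pattern: the timeout starts only after submitAsync has returned. */
  public static String timeoutAfterSubmit(long workMillis, long timeoutMillis) {
    return await(submitAsync(workMillis), timeoutMillis);
  }

  /** One possible workaround: run the submit itself on another thread. */
  public static String timeoutAroundSubmit(long workMillis, long timeoutMillis) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      CompletableFuture<String> outer = CompletableFuture.supplyAsync(
          () -> submitAsync(workMillis).join(), executor);
      return await(outer, timeoutMillis);
    } finally {
      executor.shutdownNow();
    }
  }

  /** Waits for the future, returning "timeout" if the deadline passes first. */
  static String await(CompletableFuture<String> future, long timeoutMillis) {
    try {
      return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      return "timeout";
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    }
  }

  static void sleep(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) {
    // Work takes 300 ms, timeout is 50 ms.
    System.out.println(timeoutAfterSubmit(300, 50));  // prints "closed"
    System.out.println(timeoutAroundSubmit(300, 50)); // prints "timeout"
  }
}
```

In the first call the 50 ms timeout never fires, because the caller is already parked inside `submitAsync` for the full duration of the work. In the second call the caller receives a future immediately and the timeout applies as intended. Note that the worker thread stays parked in `join()` until the work finishes, so this workaround frees the caller but does not cancel the underlying request.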