symious opened a new pull request, #7046:
URL: https://github.com/apache/ozone/pull/7046

   ## What changes were proposed in this pull request?
   We hit the following issue: while the Datanode command handler was executing a 
close-container request, the timeout logic did not take effect, so the command 
processor thread hung and blocked all subsequent commands from SCM.
   
   The jstack shows as follows:
   ```
   "Command processor thread" #215 daemon prio=5 os_prio=0 tid=0x00007fcef3262000 nid=0xa56 waiting on condition [0x00007fcf63f9d000]
      java.lang.Thread.State: WAITING (parking)
           at sun.misc.Unsafe.park(Native Method)
           - parking to wait for  <0x00007fd4ab6dcd38> (a java.util.concurrent.CompletableFuture$Signaller)
           at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
           at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
           at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
           at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
           at java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947)
           at org.apache.ratis.server.impl.RaftServerImpl.executeSubmitClientRequestAsync(RaftServerImpl.java:816)
           at org.apache.ratis.server.impl.RaftServerProxy.lambda$submitClientRequestAsync$7(RaftServerProxy.java:436)
           at org.apache.ratis.server.impl.RaftServerProxy$$Lambda$827/1961332062.apply(Unknown Source)
           at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:995)
           at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2137)
           at org.apache.ratis.server.impl.RaftServerProxy.submitClientRequestAsync(RaftServerProxy.java:436)
           at org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.submitRequest(XceiverServerRatis.java:611)
           at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.CloseContainerCommandHandler.handle(CloseContainerCommandHandler.java:105)
           at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.CommandDispatcher.handle(CommandDispatcher.java:103)
           at org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.lambda$initCommandHandlerThread$3(DatanodeStateMachine.java:593)
           at org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine$$Lambda$270/1788388131.run(Unknown Source)
           at java.lang.Thread.run(Thread.java:748)
   ```
   The direct cause is that the timeout never takes effect: in Ratis, 
executeSubmitClientRequestAsync calls join(), which blocks the calling thread 
before the outer CompletableFuture is returned, so the timeout applied to that 
outer future is never reached.
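
   The failure mode described above can be reproduced in isolation. The sketch 
below is only an illustration and is not the Ratis or Ozone code, nor 
necessarily the change made in this PR: `submitAsync`, `callerBlockedWhenDirect` 
and `timesOutWhenOffloaded` are hypothetical names. It assumes an "async" method 
that calls join() on an inner future before returning, and shows that a timed 
get() on the returned future only helps if the blocking call runs on another 
thread.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BlockingJoinSketch {

  // Hypothetical stand-in for an "async" API that blocks internally:
  // it join()s on an inner future before it returns the outer one.
  static CompletableFuture<String> submitAsync(CompletableFuture<String> inner) {
    String reply = inner.join(); // blocks the calling thread
    return CompletableFuture.completedFuture(reply);
  }

  // Calls submitAsync directly and applies the timeout to its result.
  // Returns true if the caller is still stuck well after the timeout.
  static boolean callerBlockedWhenDirect(long timeoutMs)
      throws InterruptedException {
    CompletableFuture<String> neverDone = new CompletableFuture<>();
    Thread caller = new Thread(() -> {
      try {
        // get(timeout) is never reached: submitAsync blocks first.
        submitAsync(neverDone).get(timeoutMs, TimeUnit.MILLISECONDS);
      } catch (Exception ignored) {
        // not expected in this sketch
      }
    });
    caller.start();
    caller.join(timeoutMs * 3);      // wait well past the timeout
    boolean stillBlocked = caller.isAlive();
    neverDone.complete("late");      // release the caller thread
    caller.join();
    return stillBlocked;
  }

  // Runs the blocking call on a separate executor, so the timed get()
  // on the outer future can release the caller.
  // Returns true if the caller was released by a TimeoutException.
  static boolean timesOutWhenOffloaded(long timeoutMs) throws Exception {
    CompletableFuture<String> neverDone = new CompletableFuture<>();
    ExecutorService pool = Executors.newSingleThreadExecutor();
    try {
      CompletableFuture<String> outer = CompletableFuture
          .supplyAsync(() -> submitAsync(neverDone), pool)
          .thenCompose(f -> f);
      outer.get(timeoutMs, TimeUnit.MILLISECONDS);
      return false;
    } catch (TimeoutException e) {
      return true;
    } finally {
      neverDone.complete("late");    // unblock the worker thread
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("direct call blocked: " + callerBlockedWhenDirect(100));
    System.out.println("offloaded call timed out: " + timesOutWhenOffloaded(100));
  }
}
```

   In this sketch the worker thread stays blocked after the caller times out, 
so offloading only protects the command processor thread; it does not cancel 
the underlying request.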
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-11291
   
   ## How was this patch tested?
   
   (Please explain how this patch was tested. Ex: unit tests, manual tests, 
workflow run on the fork git repo.)
   (If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this.)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
