[jira] [Updated] (YARN-10642) Race condition: AsyncDispatcher can get stuck by the changes introduced in YARN-8995

Peter Bacsko (Jira) Mon, 08 Mar 2021 03:59:07 -0800


     [ 
https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Peter Bacsko updated YARN-10642:
--------------------------------
    Attachment: YARN-10642-branch-3.2.001.patch

> Race condition: AsyncDispatcher can get stuck by the changes introduced in 
> YARN-8995
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-10642
>                 URL: https://issues.apache.org/jira/browse/YARN-10642
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.2.1
>            Reporter: zhengchenyu
>            Assignee: zhengchenyu
>            Priority: Critical
>             Fix For: 3.4.0
>
>         Attachments: MockForDeadLoop.java, YARN-10642-branch-3.2.001.patch, 
> YARN-10642-branch-3.3.001.patch, YARN-10642.001.patch, YARN-10642.002.patch, 
> YARN-10642.003.patch, YARN-10642.004.patch, YARN-10642.005.patch, 
> deadloop.png, debugfornode.png, put.png, take.png
>
>
> In our cluster, ResouceManager stuck twice within twenty days. Yarn client 
> can't submit application. I got jstack info at second time, then found the 
> reason.
> I analyze all the jstack, I found many thread stuck because can't get 
> LinkedBlockingQueue.putLock. (Note: Sorry for limited space , omit the 
> analytical process)
> The reason is that one thread hold the putLock all the time, 
> printEventQueueDetails will called forEachRemaining, then hold putLock and 
> readLock. The AsyncDispatcher will stuck.
> {code}
> Thread 6526 (IPC Server handler 454 on default port 8030):
>   State: RUNNABLE
>   Blocked count: 29988
>   Waited count: 2035029
>   Stack:
>     
> java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926)
>     java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>     
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>     java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>     java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>     java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
>     
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270)
>     
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
>     
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.handleProgress(DefaultAMSProcessor.java:408)
>     
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:215)
>     
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
>     
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
>     
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:432)
>     
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
>     
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>     
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
>     org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>     org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1040)
>     org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:958)
>     java.security.AccessController.doPrivileged(Native Method)
> {code}
> I analyze LinkedBlockingQueue's source code. I found forEachRemaining in 
> LinkedBlockingQueue.LBQSpliterator may stuck, when forEachRemaining and take 
> are called in different thread. 
> YARN-8995 introduce printEventQueueDetails method, 
> "eventQueue.stream().collect" will called forEachRemaining method.
> Let's see why？ "put.png" shows that how to put("a"), "take.png" shows that 
> how to take()。Specical Node: The removed Node will point itself for help gc!!!
> The key point code is in forEachRemaining, we see LBQSpliterator use 
> forEachRemaining to visit all Node. But when got item value from Node, will 
> release the lock. If at this time, take() will be called. 
> The variable 'p' in forEachRemaining may point a Node which point itself, 
> then forEachRemaining will be in dead loop. You can see it in "deadloop.png"
> Let's see a simple uni-test, Let's forEachRemaining called more slow than 
> take, the problem will reproduction。uni-test is MockForDeadLoop.java.
> I debug MockForDeadLoop.java, and see a Node point itself. You can see pic 
> "debugfornode.png"
> Environment:
>   OS: CentOS Linux release 7.5.1804 (Core) 
>   JDK: jdk1.8.0_281



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Updated] (YARN-10642) Race condition: AsyncDispatcher can get stuck by the changes introduced in YARN-8995

Reply via email to