[ 
https://issues.apache.org/jira/browse/YARN-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14951161#comment-14951161
 ] 

Anubhav Dhoot commented on YARN-4247:
-------------------------------------

Looking at the jstack, here is the deadlock between FS and RMAppAttemptImpl.
The first thread holds the lock on FSAppAttempt and is waiting on the 
RMAppAttemptImpl read lock.
The second thread, in RMAppAttemptImpl.getApplicationResourceUsageReport, has 
taken the read lock and is waiting on the FSAppAttempt lock.
This blocks other threads, such as the AsyncDispatcher thread (the third 
thread below), so the RM stops processing events and then crashes with OOM 
because of the backlog of events.
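
A minimal sketch of why the first thread can park on a lock that the second 
thread holds only for reading. It assumes the AsyncDispatcher thread queued 
for the write lock is what makes a new reader wait, which is how I read the 
trace rather than something confirmed in the code. The class and method names 
are hypothetical, this is not YARN code, and a timed tryLock is used so the 
sketch returns instead of hanging.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-alone illustration; none of these names exist in YARN.
class QueuedWriterSketch {

    /**
     * Returns true if a new reader gets the read lock within timeoutMs while
     * another thread holds the read lock and a writer is parked in the queue.
     */
    static boolean newReaderAcquires(long timeoutMs) throws InterruptedException {
        // Default constructor gives the nonfair policy (NonfairSync in the trace).
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

        // Stand-in for the getApplicationReport handler: read lock already held.
        lock.readLock().lock();

        // Stand-in for the AsyncDispatcher thread: parks waiting for the write lock.
        Thread writer = new Thread(() -> {
            try {
                lock.writeLock().lockInterruptibly();
                lock.writeLock().unlock();
            } catch (InterruptedException e) {
                // Interrupted during cleanup below; the lock was never acquired.
            }
        }, "writer");
        writer.setDaemon(true);
        writer.start();
        while (writer.getState() != Thread.State.WAITING) {
            Thread.sleep(5); // wait until the writer is parked in the queue
        }

        // Stand-in for the allocate handler: a new reader arriving behind the writer.
        boolean[] acquired = new boolean[1];
        Thread reader = new Thread(() -> {
            try {
                if (lock.readLock().tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
                    acquired[0] = true;
                    lock.readLock().unlock();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "reader");
        reader.start();
        reader.join(); // join makes the write to acquired[0] visible here

        // Clean up: cancel the queued writer, then release our read lock.
        writer.interrupt();
        writer.join();
        lock.readLock().unlock();
        return acquired[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("new reader acquired read lock: " + newReaderAcquires(200));
    }
}
```

In this setup the new reader is expected to time out. If that reader also 
held a monitor that the first reader needed, as with FSAppAttempt here, none 
of the three threads could make progress.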

{noformat}
"IPC Server handler 49 on 8030" #239 daemon prio=5 os_prio=0 
tid=0x0000000001093000 nid=0x8206 waiting on condition [0x00007f930b2da000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x000000071719e0f0> (a 
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
        at 
java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.*RMAppAttemptImpl*.getMasterContainer(RMAppAttemptImpl.java:747)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.isWaitingForAMContainer(SchedulerApplicationAttempt.java:482)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:938)
        - locked <0x0000000715932d98> (a 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.*FSAppAttempt*)
        at 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:529)
        - locked <0x00000007171a5328> (a 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService$AllocateResponseLock)
        at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
        at 
org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)


"IPC Server handler 9 on 8032" #253 daemon prio=5 os_prio=0 
tid=0x0000000000e2e800 nid=0x8214 waiting for monitor entry [0x00007f930a4cd000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceUsageReport(SchedulerApplicationAttempt.java:570)
        - waiting to lock <0x0000000715932d98> (a 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.getAppResourceUsageReport(AbstractYarnScheduler.java:241)
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics.getAggregateAppResourceUsage(RMAppAttemptMetrics.java:114)
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.*RMAppAttemptImpl*.getApplicationResourceUsageReport(RMAppAttemptImpl.java:798)
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.createAndGetApplicationReport(RMAppImpl.java:655)
        at 
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:330)
        at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:170)
        at 
org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:401)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)

"AsyncDispatcher event handler" #58 prio=5 os_prio=0 tid=0x00007f9345dcd800 
nid=0xd70c waiting on condition [0x00007f9313b6d000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x000000071719e0f0> (a 
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
        at 
java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:765)
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:109)
        at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:840)
        at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:821)
        at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:174)
        at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
        at java.lang.Thread.run(Thread.java:745)
{noformat}

> Deadlock in FSAppAttempt and RMAppAttemptImpl causes RM to stop processing 
> events
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-4247
>                 URL: https://issues.apache.org/jira/browse/YARN-4247
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler, resourcemanager
>            Reporter: Anubhav Dhoot
>            Assignee: Anubhav Dhoot
>            Priority: Blocker
>
> We see this deadlock in our testing where events do not get processed and we 
> see this in the logs before the RM dies of OOM 
> {noformat} 
> 2015-10-08 04:48:01,918 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 1488000 
> 2015-10-08 04:48:01,918 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 1488000 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
