[ https://issues.apache.org/jira/browse/TEZ-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496654#comment-14496654 ]
Bikas Saha commented on TEZ-2310:
---------------------------------

Thanks for the verification [~daijy].
[~sseth] [~rajesh.balamohan] [~hitesh] Please review. The change sends notifications to listeners on a separate thread. We could potentially send several of these concurrently via a thread pool, but for now I am sticking to a single thread. I will open a separate jira to do the same for task status updates.

> AM Deadlock in VertexImpl
> -------------------------
>
>                 Key: TEZ-2310
>                 URL: https://issues.apache.org/jira/browse/TEZ-2310
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Daniel Dai
>            Assignee: Bikas Saha
>             Fix For: 0.7.0
>
>         Attachments: TEZ-2310-0.patch, TEZ-2310.1.patch
>
>
> See the following deadlock in testing:
> Thread #1:
> {code}
> Daemon Thread [App Shared Pool - #3] (Suspended)
>     owns: VertexManager$VertexManagerPluginContextImpl (id=327)
>     owns: ShuffleVertexManager (id=328)
>     owns: VertexManager (id=329)
>     waiting for: VertexManager$VertexManagerPluginContextImpl (id=326)
>     VertexManager$VertexManagerPluginContextImpl.onStateUpdated(VertexStateUpdate) line: 344
>     StateChangeNotifier$ListenerContainer.sendStateUpdate(VertexStateUpdate) line: 138
>     StateChangeNotifier$ListenerContainer.access$100(StateChangeNotifier$ListenerContainer, VertexStateUpdate) line: 122
>     StateChangeNotifier.sendStateUpdate(TezVertexID, VertexStateUpdate) line: 116
>     StateChangeNotifier.stateChanged(TezVertexID, VertexStateUpdate) line: 106
>     VertexImpl.maybeSendConfiguredEvent() line: 3385
>     VertexImpl.doneReconfiguringVertex() line: 1634
>     VertexManager$VertexManagerPluginContextImpl.doneReconfiguringVertex() line: 339
>     ShuffleVertexManager.schedulePendingTasks(int) line: 561
>     ShuffleVertexManager.schedulePendingTasks() line: 620
>     ShuffleVertexManager.handleVertexStateUpdate(VertexStateUpdate) line: 731
>     ShuffleVertexManager.onVertexStateUpdated(VertexStateUpdate) line: 744
>     VertexManager$VertexManagerEventOnVertexStateUpdate.invoke() line: 527
>     VertexManager$VertexManagerEvent$1.run() line: 612
>     VertexManager$VertexManagerEvent$1.run() line: 607
>     AccessController.doPrivileged(PrivilegedExceptionAction<T>, AccessControlContext) line: not available [native method]
>     Subject.doAs(Subject, PrivilegedExceptionAction<T>) line: 415
>     UserGroupInformation.doAs(PrivilegedExceptionAction<T>) line: 1548
>     VertexManager$VertexManagerEventOnVertexStateUpdate(VertexManager$VertexManagerEvent).call() line: 607
>     VertexManager$VertexManagerEventOnVertexStateUpdate(VertexManager$VertexManagerEvent).call() line: 596
>     ListenableFutureTask<V>(FutureTask<V>).run() line: 262
>     ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) line: 1145
>     ThreadPoolExecutor$Worker.run() line: 615
>     Thread.run() line: 745
> {code}
> Thread #2:
> {code}
> Daemon Thread [App Shared Pool - #2] (Suspended)
>     owns: VertexManager$VertexManagerPluginContextImpl (id=326)
>     owns: PigGraceShuffleVertexManager (id=344)
>     owns: VertexManager (id=345)
>     Unsafe.park(boolean, long) line: not available [native method]
>     LockSupport.park(Object) line: 186
>     ReentrantReadWriteLock$NonfairSync(AbstractQueuedSynchronizer).parkAndCheckInterrupt() line: 834
>     ReentrantReadWriteLock$NonfairSync(AbstractQueuedSynchronizer).doAcquireShared(int) line: 964
>     ReentrantReadWriteLock$NonfairSync(AbstractQueuedSynchronizer).acquireShared(int) line: 1282
>     ReentrantReadWriteLock$ReadLock.lock() line: 731
>     VertexImpl.getTotalTasks() line: 952
>     VertexManager$VertexManagerPluginContextImpl.getVertexNumTasks(String) line: 162
>     PigGraceShuffleVertexManager(ShuffleVertexManager).updateSourceTaskCount() line: 435
>     PigGraceShuffleVertexManager(ShuffleVertexManager).onVertexStarted(Map<String,List<Integer>>) line: 353
>     VertexManager$VertexManagerEventOnVertexStarted.invoke() line: 541
>     VertexManager$VertexManagerEvent$1.run() line: 612
>     VertexManager$VertexManagerEvent$1.run() line: 607
>     AccessController.doPrivileged(PrivilegedExceptionAction<T>, AccessControlContext) line: not available [native method]
>     Subject.doAs(Subject, PrivilegedExceptionAction<T>) line: 415
>     UserGroupInformation.doAs(PrivilegedExceptionAction<T>) line: 1548
>     VertexManager$VertexManagerEventOnVertexStarted(VertexManager$VertexManagerEvent).call() line: 607
>     VertexManager$VertexManagerEventOnVertexStarted(VertexManager$VertexManagerEvent).call() line: 596
>     ListenableFutureTask<V>(FutureTask<V>).run() line: 262
>     ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) line: 1145
>     ThreadPoolExecutor$Worker.run() line: 615
>     Thread.run() line: 745
> {code}
> What happens is that thread #1 holds a writeLock (VertexImpl:1628) and tries
> to enter a synchronized block (ShuffleVertexManager.onVertexStateUpdated). In
> the meantime, thread #2 is already in the synchronized block
> (ShuffleVertexManager.onVertexStarted) and tries to get a readLock
> (VertexImpl:952). Holding a lock and then entering a synchronized block can
> be dangerous.
> I attach a patch which avoids that, and the deadlock goes away. I am not sure
> whether that is the right fix or whether there are other patterns like this.
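
For readers following the thread, the idea in the comment (deliver listener notifications from a separate single thread, so the reporting thread never runs listener code while it holds its own lock) can be sketched roughly as below. This is a minimal illustration under assumptions, not the attached patch or the actual Tez StateChangeNotifier: the class name `AsyncNotifierSketch`, its methods, and the use of plain `String` updates are all hypothetical.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch, not Tez code: listeners are invoked on a dedicated
// thread, never on the thread that reports the state change.
final class AsyncNotifierSketch {
    private final List<Consumer<String>> listeners = new CopyOnWriteArrayList<>();

    // A single worker thread delivers updates in submission order.
    private final ExecutorService dispatcher = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "state-change-notifier");
        t.setDaemon(true);
        return t;
    });

    void register(Consumer<String> listener) {
        listeners.add(listener);
    }

    // Safe to call while holding a lock: this only enqueues the update and
    // returns. Listener code runs later on the dispatcher thread.
    void stateChanged(String update) {
        dispatcher.execute(() -> {
            for (Consumer<String> listener : listeners) {
                listener.accept(update);
            }
        });
    }

    // Stops accepting updates and waits for queued deliveries to finish.
    // Returns false on timeout or if the waiting thread is interrupted.
    boolean shutdownAndWait(long timeoutMillis) {
        dispatcher.shutdown();
        try {
            return dispatcher.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With this shape, a caller can invoke `stateChanged` while holding a write lock, and a listener that needs the matching read lock simply waits on the dispatcher thread until the caller releases it. A single dispatcher thread keeps updates in order; a larger pool could deliver them concurrently, but would no longer guarantee that ordering.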