[ 
https://issues.apache.org/jira/browse/TEZ-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498980#comment-14498980
 ] 

Hitesh Shah commented on TEZ-2310:
----------------------------------

+1. Please open a JIRA to fail the DAG instead of triggering the internal 
error in the handler exception scenario.

> AM Deadlock in VertexImpl
> -------------------------
>
>                 Key: TEZ-2310
>                 URL: https://issues.apache.org/jira/browse/TEZ-2310
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Daniel Dai
>            Assignee: Bikas Saha
>         Attachments: TEZ-2310-0.patch, TEZ-2310.1.patch, TEZ-2310.2.patch
>
>
> See the following deadlock in testing:
> Thread #1:
> {code}
> Daemon Thread [App Shared Pool - #3] (Suspended)
>       owns: VertexManager$VertexManagerPluginContextImpl  (id=327)
>       owns: ShuffleVertexManager  (id=328)
>       owns: VertexManager  (id=329)
>       waiting for: VertexManager$VertexManagerPluginContextImpl  (id=326)
>       VertexManager$VertexManagerPluginContextImpl.onStateUpdated(VertexStateUpdate) line: 344
>       StateChangeNotifier$ListenerContainer.sendStateUpdate(VertexStateUpdate) line: 138
>       StateChangeNotifier$ListenerContainer.access$100(StateChangeNotifier$ListenerContainer, VertexStateUpdate) line: 122
>       StateChangeNotifier.sendStateUpdate(TezVertexID, VertexStateUpdate) line: 116
>       StateChangeNotifier.stateChanged(TezVertexID, VertexStateUpdate) line: 106
>       VertexImpl.maybeSendConfiguredEvent() line: 3385
>       VertexImpl.doneReconfiguringVertex() line: 1634
>       VertexManager$VertexManagerPluginContextImpl.doneReconfiguringVertex() line: 339
>       ShuffleVertexManager.schedulePendingTasks(int) line: 561
>       ShuffleVertexManager.schedulePendingTasks() line: 620
>       ShuffleVertexManager.handleVertexStateUpdate(VertexStateUpdate) line: 731
>       ShuffleVertexManager.onVertexStateUpdated(VertexStateUpdate) line: 744
>       VertexManager$VertexManagerEventOnVertexStateUpdate.invoke() line: 527
>       VertexManager$VertexManagerEvent$1.run() line: 612
>       VertexManager$VertexManagerEvent$1.run() line: 607
>       AccessController.doPrivileged(PrivilegedExceptionAction<T>, AccessControlContext) line: not available [native method]
>       Subject.doAs(Subject, PrivilegedExceptionAction<T>) line: 415
>       UserGroupInformation.doAs(PrivilegedExceptionAction<T>) line: 1548
>       VertexManager$VertexManagerEventOnVertexStateUpdate(VertexManager$VertexManagerEvent).call() line: 607
>       VertexManager$VertexManagerEventOnVertexStateUpdate(VertexManager$VertexManagerEvent).call() line: 596
>       ListenableFutureTask<V>(FutureTask<V>).run() line: 262
>       ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) line: 1145
>       ThreadPoolExecutor$Worker.run() line: 615
>       Thread.run() line: 745
> {code}
> Thread #2:
> {code}
> Daemon Thread [App Shared Pool - #2] (Suspended)
>       owns: VertexManager$VertexManagerPluginContextImpl  (id=326)
>       owns: PigGraceShuffleVertexManager  (id=344)
>       owns: VertexManager  (id=345)
>       Unsafe.park(boolean, long) line: not available [native method]
>       LockSupport.park(Object) line: 186
>       ReentrantReadWriteLock$NonfairSync(AbstractQueuedSynchronizer).parkAndCheckInterrupt() line: 834
>       ReentrantReadWriteLock$NonfairSync(AbstractQueuedSynchronizer).doAcquireShared(int) line: 964
>       ReentrantReadWriteLock$NonfairSync(AbstractQueuedSynchronizer).acquireShared(int) line: 1282
>       ReentrantReadWriteLock$ReadLock.lock() line: 731
>       VertexImpl.getTotalTasks() line: 952
>       VertexManager$VertexManagerPluginContextImpl.getVertexNumTasks(String) line: 162
>       PigGraceShuffleVertexManager(ShuffleVertexManager).updateSourceTaskCount() line: 435
>       PigGraceShuffleVertexManager(ShuffleVertexManager).onVertexStarted(Map<String,List<Integer>>) line: 353
>       VertexManager$VertexManagerEventOnVertexStarted.invoke() line: 541
>       VertexManager$VertexManagerEvent$1.run() line: 612
>       VertexManager$VertexManagerEvent$1.run() line: 607
>       AccessController.doPrivileged(PrivilegedExceptionAction<T>, AccessControlContext) line: not available [native method]
>       Subject.doAs(Subject, PrivilegedExceptionAction<T>) line: 415
>       UserGroupInformation.doAs(PrivilegedExceptionAction<T>) line: 1548
>       VertexManager$VertexManagerEventOnVertexStarted(VertexManager$VertexManagerEvent).call() line: 607
>       VertexManager$VertexManagerEventOnVertexStarted(VertexManager$VertexManagerEvent).call() line: 596
>       ListenableFutureTask<V>(FutureTask<V>).run() line: 262
>       ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) line: 1145
>       ThreadPoolExecutor$Worker.run() line: 615
>       Thread.run() line: 745
> {code}
> What happens is that thread #1 holds a writeLock (VertexImpl:1628) and then 
> tries to enter a synchronized block (ShuffleVertexManager.onVertexStateUpdated). 
> Meanwhile, thread #2 is already in the synchronized block 
> (ShuffleVertexManager.onVertexStarted) and tries to get a 
> readLock (VertexImpl:952). Holding a lock and then entering a synchronized 
> block can be dangerous. 
> I have attached a patch that avoids this, and the deadlock goes away. I am not 
> sure whether that is the right fix or whether there are other patterns like this.
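
The lock-ordering problem described above can be illustrated with a minimal Java sketch. This is not the attached patch and not Tez code: `Vertex`, `Manager`, and `runScenario` are hypothetical stand-ins, and the only assumption carried over from the description is the shape of the cycle (a write lock on one side, a synchronized method on the other). The sketch shows one common way to avoid the cycle: change state under the write lock, but deliver the listener notification only after the write lock has been released.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockOrderSketch {

    /** Hypothetical stand-in for a vertex guarded by a read/write lock. */
    static class Vertex {
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        private int totalTasks = 4;
        private volatile Manager listener;

        void setListener(Manager m) { listener = m; }

        int getTotalTasks() {
            lock.readLock().lock();
            try { return totalTasks; } finally { lock.readLock().unlock(); }
        }

        void doneReconfiguring(int newTaskCount) {
            Manager toNotify;
            lock.writeLock().lock();
            try {
                totalTasks = newTaskCount;
                toNotify = listener; // only record whom to notify
            } finally {
                lock.writeLock().unlock();
            }
            // The synchronized callback runs with no vertex lock held, so a
            // thread inside the manager's monitor can still take the read lock.
            if (toNotify != null) {
                toNotify.onStateUpdated(this);
            }
        }
    }

    /** Hypothetical stand-in for a manager whose callbacks are synchronized. */
    static class Manager {
        private final Vertex source;
        private int lastSeen;

        Manager(Vertex source) { this.source = source; }

        // Holds the manager monitor, then takes the vertex read lock.
        synchronized void onVertexStarted() { lastSeen = source.getTotalTasks(); }

        synchronized void onStateUpdated(Vertex v) { lastSeen = v.getTotalTasks(); }
    }

    /** Runs both paths concurrently; returns true if both finish in time. */
    static boolean runScenario() {
        Vertex v = new Vertex();
        Manager m = new Manager(v);
        v.setListener(m);
        CountDownLatch done = new CountDownLatch(2);
        Thread t1 = new Thread(() -> {
            for (int i = 0; i < 10_000; i++) { v.doneReconfiguring(i); }
            done.countDown();
        });
        Thread t2 = new Thread(() -> {
            for (int i = 0; i < 10_000; i++) { m.onVertexStarted(); }
            done.countDown();
        });
        t1.setDaemon(true);
        t2.setDaemon(true);
        t1.start();
        t2.start();
        try {
            return done.await(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(runScenario() ? "completed" : "timed out");
    }
}
```

If `onStateUpdated` were instead called inside the `writeLock` section, thread 1 could hold the write lock while waiting for the manager monitor, and thread 2 could hold the monitor while waiting for the read lock, which is the cycle shown in the two traces.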



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
