[
https://issues.apache.org/jira/browse/TEZ-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Bikas Saha updated TEZ-2310:
----------------------------
Comment: was deleted
(was: {color:red}-1 overall{color}. Here are the results of testing the latest
attachment
http://issues.apache.org/jira/secure/attachment/12725390/TEZ-2310.1.patch
against master revision 11b5843.
{color:green}+1 @author{color}. The patch does not contain any @author
tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new
or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the
total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:red}-1 findbugs{color}. The patch appears to introduce 1 new
Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase
the total number of release audit warnings.
{color:red}-1 core tests{color}. The following test timeouts occurred in :
org.apache.tez.dag.app.dag.impl.TestVertexImpl
Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/463//testReport/
Findbugs warnings:
https://builds.apache.org/job/PreCommit-TEZ-Build/463//artifact/patchprocess/newPatchFindbugsWarningstez-dag.html
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/463//console
This message is automatically generated.)
> AM Deadlock in VertexImpl
> -------------------------
>
> Key: TEZ-2310
> URL: https://issues.apache.org/jira/browse/TEZ-2310
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Daniel Dai
> Assignee: Bikas Saha
> Fix For: 0.7.0
>
> Attachments: TEZ-2310-0.patch
>
>
> See the following deadlock in testing:
> Thread #1:
> {code}
> Daemon Thread [App Shared Pool - #3] (Suspended)
> owns: VertexManager$VertexManagerPluginContextImpl (id=327)
> owns: ShuffleVertexManager (id=328)
> owns: VertexManager (id=329)
> waiting for: VertexManager$VertexManagerPluginContextImpl (id=326)
> VertexManager$VertexManagerPluginContextImpl.onStateUpdated(VertexStateUpdate) line: 344
> StateChangeNotifier$ListenerContainer.sendStateUpdate(VertexStateUpdate) line: 138
> StateChangeNotifier$ListenerContainer.access$100(StateChangeNotifier$ListenerContainer, VertexStateUpdate) line: 122
> StateChangeNotifier.sendStateUpdate(TezVertexID, VertexStateUpdate) line: 116
> StateChangeNotifier.stateChanged(TezVertexID, VertexStateUpdate) line: 106
> VertexImpl.maybeSendConfiguredEvent() line: 3385
> VertexImpl.doneReconfiguringVertex() line: 1634
> VertexManager$VertexManagerPluginContextImpl.doneReconfiguringVertex() line: 339
> ShuffleVertexManager.schedulePendingTasks(int) line: 561
> ShuffleVertexManager.schedulePendingTasks() line: 620
> ShuffleVertexManager.handleVertexStateUpdate(VertexStateUpdate) line: 731
> ShuffleVertexManager.onVertexStateUpdated(VertexStateUpdate) line: 744
> VertexManager$VertexManagerEventOnVertexStateUpdate.invoke() line: 527
> VertexManager$VertexManagerEvent$1.run() line: 612
> VertexManager$VertexManagerEvent$1.run() line: 607
> AccessController.doPrivileged(PrivilegedExceptionAction<T>, AccessControlContext) line: not available [native method]
> Subject.doAs(Subject, PrivilegedExceptionAction<T>) line: 415
> UserGroupInformation.doAs(PrivilegedExceptionAction<T>) line: 1548
> VertexManager$VertexManagerEventOnVertexStateUpdate(VertexManager$VertexManagerEvent).call() line: 607
> VertexManager$VertexManagerEventOnVertexStateUpdate(VertexManager$VertexManagerEvent).call() line: 596
> ListenableFutureTask<V>(FutureTask<V>).run() line: 262
> ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) line: 1145
> ThreadPoolExecutor$Worker.run() line: 615
> Thread.run() line: 745
> {code}
> Thread #2:
> {code}
> Daemon Thread [App Shared Pool - #2] (Suspended)
> owns: VertexManager$VertexManagerPluginContextImpl (id=326)
> owns: PigGraceShuffleVertexManager (id=344)
> owns: VertexManager (id=345)
> Unsafe.park(boolean, long) line: not available [native method]
> LockSupport.park(Object) line: 186
> ReentrantReadWriteLock$NonfairSync(AbstractQueuedSynchronizer).parkAndCheckInterrupt() line: 834
> ReentrantReadWriteLock$NonfairSync(AbstractQueuedSynchronizer).doAcquireShared(int) line: 964
> ReentrantReadWriteLock$NonfairSync(AbstractQueuedSynchronizer).acquireShared(int) line: 1282
> ReentrantReadWriteLock$ReadLock.lock() line: 731
> VertexImpl.getTotalTasks() line: 952
> VertexManager$VertexManagerPluginContextImpl.getVertexNumTasks(String) line: 162
> PigGraceShuffleVertexManager(ShuffleVertexManager).updateSourceTaskCount() line: 435
> PigGraceShuffleVertexManager(ShuffleVertexManager).onVertexStarted(Map<String,List<Integer>>) line: 353
> VertexManager$VertexManagerEventOnVertexStarted.invoke() line: 541
> VertexManager$VertexManagerEvent$1.run() line: 612
> VertexManager$VertexManagerEvent$1.run() line: 607
> AccessController.doPrivileged(PrivilegedExceptionAction<T>, AccessControlContext) line: not available [native method]
> Subject.doAs(Subject, PrivilegedExceptionAction<T>) line: 415
> UserGroupInformation.doAs(PrivilegedExceptionAction<T>) line: 1548
> VertexManager$VertexManagerEventOnVertexStarted(VertexManager$VertexManagerEvent).call() line: 607
> VertexManager$VertexManagerEventOnVertexStarted(VertexManager$VertexManagerEvent).call() line: 596
> ListenableFutureTask<V>(FutureTask<V>).run() line: 262
> ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) line: 1145
> ThreadPoolExecutor$Worker.run() line: 615
> Thread.run() line: 745
> {code}
> What happens is that thread #1 holds a writeLock (VertexImpl:1628) and then
> enters a synchronized block (ShuffleVertexManager.onVertexStateUpdated). In the
> meantime, thread #2 is already in the synchronized block
> (ShuffleVertexManager.onVertexStarted) and tries to acquire a
> readLock (VertexImpl:952). Holding a lock and then entering a synchronized
> block might be dangerous.
> I have attached a patch that avoids this, and with it the deadlock goes away.
> I am not sure whether that is the right fix, or whether there are other
> patterns like this.
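
For illustration only, here is a minimal sketch of one common way to avoid this kind of lock-ordering problem: change state under the write lock, then invoke synchronized listeners only after the lock has been released. This is not the attached patch and not Tez code. SketchVertex, SketchListener, SketchManager and reconfigure are hypothetical names made up for this example; only the java.util.concurrent classes are real.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical listener interface, standing in for a state-update callback.
interface SketchListener {
    void onConfigured(SketchVertex vertex);
}

// Hypothetical vertex guarded by a read/write lock (not the real VertexImpl).
class SketchVertex {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<SketchListener> listeners = new ArrayList<>();
    private int numTasks;

    void addListener(SketchListener listener) {
        lock.writeLock().lock();
        try {
            listeners.add(listener);
        } finally {
            lock.writeLock().unlock();
        }
    }

    int getTotalTasks() {
        lock.readLock().lock();
        try {
            return numTasks;
        } finally {
            lock.readLock().unlock();
        }
    }

    void reconfigure(int newNumTasks) {
        List<SketchListener> snapshot;
        lock.writeLock().lock();
        try {
            numTasks = newNumTasks;
            // Copy the listeners so they can be called after unlocking.
            snapshot = new ArrayList<>(listeners);
        } finally {
            lock.writeLock().unlock();
        }
        // Callbacks run without the write lock, so a synchronized listener
        // is never entered while this thread still holds the vertex lock.
        for (SketchListener listener : snapshot) {
            listener.onConfigured(this);
        }
    }

    boolean isWriteLockedByCurrentThread() {
        return lock.isWriteLockedByCurrentThread();
    }
}

// Hypothetical manager whose callback is synchronized, like the plugin context
// in the traces above.
class SketchManager implements SketchListener {
    boolean sawWriteLockHeld = true;
    int observedTasks = -1;

    @Override
    public synchronized void onConfigured(SketchVertex vertex) {
        sawWriteLockHeld = vertex.isWriteLockedByCurrentThread();
        observedTasks = vertex.getTotalTasks();
    }
}

class NotifyOutsideLockDemo {
    public static void main(String[] args) {
        SketchVertex vertex = new SketchVertex();
        SketchManager manager = new SketchManager();
        vertex.addListener(manager);
        vertex.reconfigure(4);
        System.out.println("tasks=" + manager.observedTasks
                + " writeLockHeld=" + manager.sawWriteLockHeld);
    }
}
```

With this ordering, a thread that holds a manager's monitor and waits for the read lock can only be blocked for as long as the short write-locked section runs, because the writer never waits on a monitor while holding the lock. Whether that fits the real VertexImpl code paths is an assumption that would need checking against the actual source.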