[ https://issues.apache.org/jira/browse/TEZ-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14537013#comment-14537013 ]
Jeff Zhang edited comment on TEZ-2421 at 5/10/15 3:26 AM:
----------------------------------------------------------

[~bikassaha] I guess you mean the following scenario:

bq. Thread 1 has V1 readlock acquired and tries to acquire readlock on V2. Thread 2 wants to acquire writelock on V1 and is blocked because thread 1 has the readlock. Thread 3 has writelock on V2 and is trying to acquire readlock on V1 which is blocked due to the pending writelock on Thread 2.

|| Thread || Owned || Try to acquire ||
| App Shared Pool - #1 (T1) | | Writelock of Vertex |
| TaskSchedulerAppCaller - #0 (T2) | Readlock of Vertex/Task | Readlock of TaskAttempt |
| Dispatcher thread:Central (T3) | Writelock of TaskAttempt | Readlock of Vertex |

Still not sure why T3 can't continue: since T1 hasn't got the writelock of Vertex yet, it should not block T3, right?

BTW, the patch may still cause an issue in recovery. During recovery, the following code in TaskAttempt will still try to acquire the readlock of Vertex (Vertex#createRemoteTaskSpec) and can produce the scenario above. But it should be fixable after TEZ-1019.

{code}
TaskSpec createRemoteTaskSpec() throws AMUserCodeException {
  TaskSpec baseTaskSpec = task.getBaseTaskSpec();
  if (baseTaskSpec == null) {
    // since recovery does not follow normal transitions, TaskEventScheduleTask
    // is not being honored by the recovery code path. Using this to workaround
    // until recovery is fixed. Calling the non-locking internal method of the vertex
    // to get the taskSpec directly. Since everything happens on the central dispatcher
    // during recovery this is deadlock free for now. TEZ-1019 should remove the need for this.
    baseTaskSpec = ((VertexImpl) vertex).createRemoteTaskSpec(getID().getTaskID().getId());
  }
  return new TaskSpec(getID(), baseTaskSpec.getDAGName(), baseTaskSpec.getVertexName(),
      baseTaskSpec.getVertexParallelism(), baseTaskSpec.getProcessorDescriptor(),
      baseTaskSpec.getInputs(), baseTaskSpec.getOutputs(), baseTaskSpec.getGroupInputs());
}
{code}

> Deadlock in AM because attempt and vertex locking each other out
> ----------------------------------------------------------------
>
>                 Key: TEZ-2421
>                 URL: https://issues.apache.org/jira/browse/TEZ-2421
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Bikas Saha
>            Assignee: Bikas Saha
>            Priority: Blocker
>         Attachments: TEZ-2421.1.patch, TEZ-2421.2.patch, TEZ-2421.3.patch, TEZ-2421.4.patch
>
>
> Ideally locks should be taken one way - either going down or up. Preferably
> not going up because most such data can be passed in during object
> construction.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
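The "blocked due to the pending writelock" step in the quoted scenario can be reproduced in isolation: with a {{ReentrantReadWriteLock}}, a reader that arrives after a writer has queued will block even though the lock is only read-held at that moment. The sketch below is a minimal standalone illustration, not Tez code; the class and method names are made up, and it assumes a fair lock for determinism (an unfair lock can also block a late reader when the queue head is a writer).

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustration only (not Tez code): a late reader (T3) blocks behind a
// writer (T2) that is queued while an earlier reader (T1) holds the lock.
public class PendingWriterBlocksReader {

    // Returns whether T3 managed to take the readlock while T2 was queued.
    static boolean lateReaderAcquires() throws InterruptedException {
        ReentrantReadWriteLock v1 = new ReentrantReadWriteLock(true); // fair mode

        v1.readLock().lock();                       // T1 holds the V1 readlock

        Thread t2 = new Thread(() -> {              // T2 queues for the V1 writelock
            v1.writeLock().lock();
            v1.writeLock().unlock();
        });
        t2.start();
        while (!v1.hasQueuedThreads()) {            // wait until T2 is actually parked
            Thread.onSpinWait();
        }

        // T3: the timed tryLock honors fairness, so it waits behind the
        // queued writer and times out instead of acquiring the readlock.
        boolean acquired = v1.readLock().tryLock(200, TimeUnit.MILLISECONDS);
        if (acquired) {
            v1.readLock().unlock();
        }

        v1.readLock().unlock();                     // T1 releases; T2 can now proceed
        t2.join();
        return acquired;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("T3 acquired readlock: " + lateReaderAcquires());
    }
}
```

So even though T1 only holds a readlock, T3's readlock request is parked behind T2's pending writelock, which is why T3 cannot continue in the table above.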