[ https://issues.apache.org/jira/browse/TEZ-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14537013#comment-14537013 ]
Jeff Zhang edited comment on TEZ-2421 at 5/10/15 3:26 AM:
----------------------------------------------------------

[~bikassaha] I guess you mean the following scenario:

bq. Thread 1 has V1 readlock acquired and tries to acquire readlock on V2. Thread 2 wants to acquire writelock on V1 and is blocked because thread 1 has the readlock. Thread 3 has writelock on V2 and is trying to acquire readlock on V1 which is blocked due to the pending writelock on Thread 2.

|| Thread || Owned || Try to acquire ||
| App Shared Pool - #1 (T1) | | Writelock of Vertex |
| TaskSchedulerAppCaller - #0 (T2) | Readlock of Vertex/Task | Readlock of TaskAttempt |
| Dispatcher thread:Central (T3) | Writelock of TaskAttempt | Readlock of Vertex |

Still not sure why T3 can't continue: since T1 hasn't got the writelock of Vertex yet, it should not block T3, right?

BTW, the patch may still cause an issue in recovery. During recovery, the following code in TaskAttempt will still try to acquire the readlock of Vertex (Vertex#createRemoteTaskSpec) and can produce the scenario above. But it should be fixable after TEZ-1019.

{code}
TaskSpec createRemoteTaskSpec() throws AMUserCodeException {
  TaskSpec baseTaskSpec = task.getBaseTaskSpec();
  if (baseTaskSpec == null) {
    // since recovery does not follow normal transitions, TaskEventScheduleTask
    // is not being honored by the recovery code path. Using this to workaround
    // until recovery is fixed. Calling the non-locking internal method of the vertex
    // to get the taskSpec directly. Since everything happens on the central dispatcher
    // during recovery this is deadlock free for now. TEZ-1019 should remove the need for this.
    baseTaskSpec = ((VertexImpl) vertex).createRemoteTaskSpec(getID().getTaskID().getId());
  }
  return new TaskSpec(getID(), baseTaskSpec.getDAGName(), baseTaskSpec.getVertexName(),
      baseTaskSpec.getVertexParallelism(), baseTaskSpec.getProcessorDescriptor(),
      baseTaskSpec.getInputs(), baseTaskSpec.getOutputs(), baseTaskSpec.getGroupInputs());
}
{code}

> Deadlock in AM because attempt and vertex locking each other out
> ----------------------------------------------------------------
>
>                 Key: TEZ-2421
>                 URL: https://issues.apache.org/jira/browse/TEZ-2421
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Bikas Saha
>            Assignee: Bikas Saha
>            Priority: Blocker
>         Attachments: TEZ-2421.1.patch, TEZ-2421.2.patch, TEZ-2421.3.patch, TEZ-2421.4.patch
>
>
> Ideally locks should be taken one way - either going down or up. Preferably
> not going up because most such data can be passed in during object
> construction.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
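The "blocked due to the pending writelock" step in the quoted scenario can be reproduced in isolation: with a {{ReentrantReadWriteLock}}, a reader that arrives after a writer has queued will block even though the lock is only read-held at that moment. The sketch below is a minimal standalone illustration, not Tez code; the class and method names are made up, and it assumes a fair lock for determinism (an unfair lock can also block a late reader when the queue head is a writer).

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustration only (not Tez code): a late reader (T3) blocks behind a
// writer (T2) that is queued while an earlier reader (T1) holds the lock.
public class PendingWriterBlocksReader {

    // Returns whether T3 managed to take the readlock while T2 was queued.
    static boolean lateReaderAcquires() throws InterruptedException {
        ReentrantReadWriteLock v1 = new ReentrantReadWriteLock(true); // fair mode

        v1.readLock().lock();                       // T1 holds the V1 readlock

        Thread t2 = new Thread(() -> {              // T2 queues for the V1 writelock
            v1.writeLock().lock();
            v1.writeLock().unlock();
        });
        t2.start();
        while (!v1.hasQueuedThreads()) {            // wait until T2 is actually parked
            Thread.onSpinWait();
        }

        // T3: the timed tryLock honors fairness, so it waits behind the
        // queued writer and times out instead of acquiring the readlock.
        boolean acquired = v1.readLock().tryLock(200, TimeUnit.MILLISECONDS);
        if (acquired) {
            v1.readLock().unlock();
        }

        v1.readLock().unlock();                     // T1 releases; T2 can now proceed
        t2.join();
        return acquired;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("T3 acquired readlock: " + lateReaderAcquires());
    }
}
```

So even though T1 only holds a readlock, T3's readlock request is parked behind T2's pending writelock, which is why T3 cannot continue in the table above.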