[ 
https://issues.apache.org/jira/browse/TEZ-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334492#comment-15334492
 ] 

Bikas Saha edited comment on TEZ-3296 at 6/16/16 7:20 PM:
----------------------------------------------------------

Sure. Let's commit this patch.

Could you please attach the task scheduler logs for the hung job and mention
the conflicting vertices? I follow what you described above, and I'd expect the
RM to return x+y containers at 2G, where x were requested at 1.5G and y at 2G.
The AM should accept y containers at 2G for vertex2G and x containers at 2G for
vertex1.5G because 2G > 1.5G. The matching heuristic in AMRMClient uses fits-in
rather than exact match because rounding means the RM can return a container
that's larger than requested. E.g. if the min container size is 1G then asking
for 1.5G will return 2G containers, and the situation would still be the same
for vertex1.5G in the AM.
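
To make the rounding and fits-in points concrete, here is a minimal sketch in
Java. It is not YARN or AMRMClient code: the class and method names are made up
for illustration, and it assumes the RM rounds a request up to the next
multiple of the minimum container size, as in the 1G / 1.5G example above.

```java
// Hypothetical sketch, not YARN code. Assumes requests are rounded up to a
// multiple of the minimum container size.
final class ContainerRoundingSketch {
    /** Rounds requestedMb up to the next multiple of minAllocMb. */
    static int roundUp(int requestedMb, int minAllocMb) {
        return ((requestedMb + minAllocMb - 1) / minAllocMb) * minAllocMb;
    }

    /** Fits-in match: the allocated container is at least as large as the request. */
    static boolean fitsIn(int requestedMb, int allocatedMb) {
        return allocatedMb >= requestedMb;
    }

    public static void main(String[] args) {
        int minAllocMb = 1024;                    // 1G minimum container size
        int granted = roundUp(1536, minAllocMb);  // ask for 1.5G
        System.out.println(granted);              // 2048
        System.out.println(fitsIn(1536, granted)); // true
        System.out.println(granted == 1536);       // false: exact match would fail
    }
}
```

With exact matching the 1.5G request would never be satisfied by the 2G
container it is actually given, which is why fits-in is the relevant check.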

One reason why I think it may hang is if the RM returns x+y containers at 1.5G,
because then the y containers for vertex2G would never get a match. Another is
if the RM returns fewer than x+y containers at 2G. The second case would be a
bad RM bug that should be fixed in YARN urgently. The AM logs would shed some
light on this.
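
The cases above can be checked with a small fits-in matching sketch. This is
only a toy model, not the Tez task scheduler or AMRMClient: the class name, the
greedy strategy (largest request first, smallest fitting container), and the
example sizes are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical model of fits-in matching; not Tez or YARN code.
final class FitsInMatchingSketch {
    /** Returns how many requests are left without a container that fits. */
    static int unmatched(List<Integer> requestsMb, List<Integer> containersMb) {
        List<Integer> requests = new ArrayList<>(requestsMb);
        List<Integer> free = new ArrayList<>(containersMb);
        // Place the largest requests first, each in the smallest container that fits.
        requests.sort(Collections.reverseOrder());
        Collections.sort(free);
        int unmatched = 0;
        for (int req : requests) {
            int chosen = -1;
            for (int i = 0; i < free.size(); i++) {
                if (free.get(i) >= req) {
                    chosen = i;
                    break;
                }
            }
            if (chosen >= 0) {
                free.remove(chosen); // remove by index: container is now in use
            } else {
                unmatched++;         // this request would wait forever
            }
        }
        return unmatched;
    }
}
```

For x = 2 requests at 1536 MB and y = 2 at 2048 MB, four 2048 MB containers
leave nothing unmatched, four 1536 MB containers leave the two vertex2G
requests unmatched, and three 2048 MB containers leave one request unmatched.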


> Tez job can hang if two vertices at the same root distance have different 
> task requirements
> -------------------------------------------------------------------------------------------
>
>                 Key: TEZ-3296
>                 URL: https://issues.apache.org/jira/browse/TEZ-3296
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.1
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: TEZ-3296.001.patch
>
>
> When two vertices have the same distance from the root, Tez will schedule 
> containers with the same priority.  However, those vertices could have 
> different task requirements and therefore different capabilities.  As 
> documented in YARN-314, YARN currently doesn't support requests for multiple 
> sizes at the same priority.  In practice this leads to one vertex's allocation 
> requests clobbering the other's, and that can result in a situation where the 
> Tez AM is waiting on containers it will never receive from the RM.
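
The clobbering described in the issue can be pictured with a deliberately
simplified model. This is not the RM's actual bookkeeping: the class, the
single map keyed only by priority, and the priority value 5 are assumptions
chosen to illustrate the "one size per priority" limitation from YARN-314.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of "one request size per priority"; not YARN code.
final class PriorityClobberSketch {
    static final class Ask {
        final int memoryMb;
        final int numContainers;

        Ask(int memoryMb, int numContainers) {
            this.memoryMb = memoryMb;
            this.numContainers = numContainers;
        }
    }

    // Assumption: the outstanding ask is keyed by priority only, not by size.
    private final Map<Integer, Ask> asksByPriority = new HashMap<>();

    /** Records an ask; a later ask at the same priority replaces the earlier one. */
    void request(int priority, int memoryMb, int numContainers) {
        asksByPriority.put(priority, new Ask(memoryMb, numContainers));
    }

    Ask outstanding(int priority) {
        return asksByPriority.get(priority);
    }

    public static void main(String[] args) {
        PriorityClobberSketch rm = new PriorityClobberSketch();
        rm.request(5, 1536, 3); // first vertex: three containers at 1.5G
        rm.request(5, 2048, 2); // second vertex, same priority: overwrites the first
        System.out.println(rm.outstanding(5).memoryMb);      // 2048
        System.out.println(rm.outstanding(5).numContainers); // 2
    }
}
```

In this model the first vertex's ask is gone after the second call, so an AM
that still expects containers for it would wait indefinitely.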



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
