[jira] [Commented] (TEZ-3296) Tez job can hang if two vertices at the same root distance have different task requirements

2016-06-16 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15335081#comment-15335081
 ] 

Bikas Saha commented on TEZ-3296:
-

Thanks! It's clear now.

> Tez job can hang if two vertices at the same root distance have different 
> task requirements
> ---
>
> Key: TEZ-3296
> URL: https://issues.apache.org/jira/browse/TEZ-3296
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.1
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Priority: Critical
> Fix For: 0.7.2, 0.9.0, 0.8.4
>
> Attachments: TEZ-3296.001.patch, taskschedulerlog
>
>
> When two vertices have the same distance from the root, Tez will schedule 
> containers with the same priority.  However, those vertices could have 
> different task requirements and therefore different capabilities.  As 
> documented in YARN-314, YARN currently doesn't support requests for multiple 
> sizes at the same priority.  In practice this leads to one vertex's allocation 
> requests clobbering the other's, and that can result in a situation where the 
> Tez AM is waiting on containers it will never receive from the RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-3296) Tez job can hang if two vertices at the same root distance have different task requirements

2016-06-16 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334623#comment-15334623
 ] 

Bikas Saha commented on TEZ-3296:
-

Ah. Looks like a result of using priority as a key for unique requests vs using 
it as just a priority.


[jira] [Commented] (TEZ-3296) Tez job can hang if two vertices at the same root distance have different task requirements

2016-06-16 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334521#comment-15334521
 ] 

Jason Lowe commented on TEZ-3296:
-

bq. Could you please attach the task scheduler logs for the hung job and 
mention conflicting vertices?

I'll see if I can find the logs for one of the failed jobs.  It was a while 
ago, so we may not have retained them this long.

bq. I'd expect the RM to return x+y containers at 2G where x is at 1.5G and y 
at 2G.

No, it only returned y containers at 2G.  See the code at 
AppSchedulingInfo#updateResourceRequests.  It will simply walk the list of 
requests and stomp over any previous request at the same priority and location. 
 So instead of x+y we're only going to get x _or_ y, depending upon which 
request appeared later in the list.  The YARN protocol is a delta protocol in 
the sense that an app only needs to send requests for what is changing relative 
to the entire ask, but what is sent overrides the total ask for that priority 
and location.  For example, asking for 5 containers at priority 2 for ANY then 
later asking for 3 containers at priority 2 for ANY will only get a total of 3 
containers (assuming no containers were granted between the two requests).
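
To illustrate the override semantics described above, here is a minimal sketch. It is not the actual AppSchedulingInfo code: the class AskTable and its methods are hypothetical names, and the model assumes that the only thing tracked per priority and location is a container count.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of an RM-side ask table; not the real YARN classes.
class AskTable {
    // Key is "priority@location"; container size is deliberately not part of it.
    private final Map<String, Integer> asks = new HashMap<>();

    // Each update replaces the total ask for that priority and location.
    void update(int priority, String location, int numContainers) {
        asks.put(priority + "@" + location, numContainers);
    }

    // Containers currently wanted at this priority and location.
    int outstanding(int priority, String location) {
        return asks.getOrDefault(priority + "@" + location, 0);
    }

    public static void main(String[] args) {
        AskTable table = new AskTable();
        table.update(2, "ANY", 5);
        table.update(2, "ANY", 3); // overrides the earlier ask, does not add to it
        System.out.println(table.outstanding(2, "ANY")); // prints 3
    }
}
```

The point of the sketch is only that the second update replaces the first instead of accumulating, which matches the 5-then-3 example.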

bq. The second case would be a bad RM bug that should be fixed in YARN urgently.

Yes, I commented on YARN-314 mentioning that it's not just a theoretical 
problem.  Note it would also be fixed by the effort at YARN-4879.


[jira] [Commented] (TEZ-3296) Tez job can hang if two vertices at the same root distance have different task requirements

2016-06-16 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334492#comment-15334492
 ] 

Bikas Saha commented on TEZ-3296:
-

Sure. Let's commit this patch.

Could you please attach the task scheduler logs for the hung job and mention 
conflicting vertices? I follow what you described above and I'd expect the RM 
to return x+y containers at 2G where x is at 1.5G and y at 2G. The AM should 
accept y containers at 2G for vertex2G and x containers at 2G for vertex1.5G 
because 2G > 1.5G and the matching heuristic in AMRMClient uses fitsIn rather 
than an exact match, since the RM is only guaranteed to return a container that 
is at least as large as requested due to rounding. E.g. if the min container 
size is 1G then asking for 1.5G will return 2G containers and the situation 
would still be the same for the vertex1.5G in the AM.
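
The rounding and fits-in matching mentioned above can be sketched roughly as follows. This is a simplified model, not the AMRMClient or RM code: it assumes requests are rounded up to a multiple of the minimum allocation, and normalize and fitsIn are illustrative names.

```java
// Simplified model of container size rounding and fits-in matching.
class ContainerSizing {
    // Round a request up to the next multiple of the minimum allocation (assumed behaviour).
    static int normalize(int requestedMb, int minAllocationMb) {
        int multiples = (requestedMb + minAllocationMb - 1) / minAllocationMb;
        return multiples * minAllocationMb;
    }

    // A request matches any container at least as large as what was asked for.
    static boolean fitsIn(int requestedMb, int allocatedMb) {
        return requestedMb <= allocatedMb;
    }

    public static void main(String[] args) {
        int allocated = normalize(1536, 1024);
        System.out.println(allocated);               // 2048
        System.out.println(fitsIn(1536, allocated)); // true
        System.out.println(fitsIn(2048, 1536));      // false: a 2G task cannot use a 1.5G container
    }
}
```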

One reason why I think it may hang is if the RM returns x+y containers at 1.5G 
because then y containers for vertex2G would never get a match. Or the RM 
returns fewer than x+y containers at 2G. The second case would be a bad RM bug 
that should be fixed in YARN urgently. The AM logs would shed some light on 
this.


[jira] [Commented] (TEZ-3296) Tez job can hang if two vertices at the same root distance have different task requirements

2016-06-16 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15334129#comment-15334129
 ] 

Jason Lowe commented on TEZ-3296:
-

bq. Wondering why the app was hung.

As I mentioned in the description this is a bug in YARN, see YARN-314.  The RM 
currently cannot support multiple requests with different sizes at the same 
priority.  It only tracks priority and container count per location, *not* the 
container size.  It hung because the AM and the RM got out of sync with respect 
to the priority aliasing.  The AM mistakenly thought it had made two _distinct_ 
requests for priority 11, one with a 1.5GB container and one with a 2GB 
container.  When the RM got that ask list, the _last_ one in the list "won" and 
smashed the other.  Again the RM is only tracking based on priority and 
location, not size, so the ANY request at priority 11 for 1.5 GB container was 
smashed by the ANY request at priority 11 for 2GB.  At that point the RM thinks 
the AM only wants the 2GB containers at priority 11, and the AM hangs waiting 
for the 1.5GB priority 11 containers that never arrive.  Sometimes container 
reuse can work around the issue, but if the AM thinks it cannot reuse the 
containers between those two separate requests then it can lead to a hang if 
both requests are sent in the same heartbeat.  One will smash the other inside 
the RM and get lost.
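
As a rough illustration of how the two sides get out of sync, the sketch below keeps two views of the same heartbeat. None of these names are Tez or YARN APIs, and the container counts (4 and 6) are made up for the example; the only assumption taken from the explanation above is that the RM keys on priority and location while the AM also distinguishes by size.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the priority aliasing; not Tez or YARN code.
class PriorityAliasing {
    // AM view: requests are distinct per (priority, location, size in MB).
    // Each request is {priority, sizeMb, count}; location is always ANY here.
    static Map<String, Integer> amView(int[][] requests) {
        Map<String, Integer> view = new HashMap<>();
        for (int[] r : requests) {
            view.put(r[0] + "/ANY/" + r[1], r[2]);
        }
        return view;
    }

    // RM view (assumed): only (priority, location) is tracked, so the last
    // request in the list replaces any earlier one at the same priority.
    // The value is {sizeMb, count} of the surviving request.
    static Map<String, int[]> rmView(int[][] requests) {
        Map<String, int[]> view = new HashMap<>();
        for (int[] r : requests) {
            view.put(r[0] + "/ANY", new int[] {r[1], r[2]});
        }
        return view;
    }

    public static void main(String[] args) {
        // Both requests are sent in the same heartbeat at priority 11.
        int[][] heartbeat = {{11, 1536, 4}, {11, 2048, 6}};
        System.out.println("AM thinks it has " + amView(heartbeat).size() + " requests");
        int[] survivor = rmView(heartbeat).get("11/ANY");
        System.out.println("RM tracks " + survivor[1] + " containers of " + survivor[0] + " MB");
    }
}
```

In this model the 1.5GB request simply disappears from the RM's view, which is the state in which the AM would wait forever.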

bq. If this is urgent I think we can go with the current proposal.

This is fairly urgent since we've seen multiple production jobs hang due to 
this, so it's not just a theoretical problem.  We're deploying this patch 
internally, and I don't think we should ship another Tez release until it's 
fixed.


[jira] [Commented] (TEZ-3296) Tez job can hang if two vertices at the same root distance have different task requirements

2016-06-13 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328921#comment-15328921
 ] 

Bikas Saha commented on TEZ-3296:
-

Sorry. My bad. I even used a calculator for that :P

If this is urgent I think we can go with the current proposal. Would be good to 
open a follow up item to use a BFS or topo-sort based method that uses the 
priority space more conservatively.


[jira] [Commented] (TEZ-3296) Tez job can hang if two vertices at the same root distance have different task requirements

2016-06-12 Thread Jonathan Eagles (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326738#comment-15326738
 ] 

Jonathan Eagles commented on TEZ-3296:
--

bq. Now - (1*24*3) + 20*3 = 150 = (2*24*3) + 2*3
The formula is set up so that all vertices with a distance of _h_ from the root 
have a logically higher priority than all vertices with a distance of _h + 1_ . 
In the example above, the calculation on the LHS should be 132.
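
A quick way to check the arithmetic is to evaluate the formula as it was quoted in the review, (Height*Total*3) + V*3. The class and method names below are illustrative only, and the sketch assumes vertex IDs run from 0 to Total-1.

```java
// Checks the arithmetic in the example above using the formula as quoted in the thread.
class PriorityFormulaCheck {
    // (height * totalVertices * 3) + vertexId * 3, as written in the review comment.
    static int priority(int height, int totalVertices, int vertexId) {
        return (height * totalVertices * 3) + vertexId * 3;
    }

    public static void main(String[] args) {
        System.out.println(priority(1, 24, 20)); // 132
        System.out.println(priority(2, 24, 2));  // 150
        // Largest value in row 1 versus smallest value in row 2, for 24 vertices (IDs 0..23).
        System.out.println(priority(1, 24, 23)); // 141
        System.out.println(priority(2, 24, 0));  // 144
    }
}
```

Under that assumption the largest value at height 1 stays below the smallest value at height 2, so the two rows do not overlap.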



[jira] [Commented] (TEZ-3296) Tez job can hang if two vertices at the same root distance have different task requirements

2016-06-12 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326673#comment-15326673
 ] 

Bikas Saha commented on TEZ-3296:
-

bq. Today each vertex uses a set of three priority values, the low, the high, 
and the mean of those two. (Oddly containers for high are never requested in 
practice, just the low and mean.)
The middle priority is the default. The lower value (higher pri) is for failed 
task reruns. The higher value (lower pri) was intended for speculative tasks but 
may never have been used for that.

Wondering why the app was hung. IIRC YARN keeps the higher resource request 
when there are multiple at the same priority because that's the safer thing to 
do. So when 2 vertices have the same priority but different resources then we 
would expect to get containers for both but with the higher resource value 
across the board.
If the above is correct then perhaps there is a bug in the task scheduler code 
that needs to get fixed, which we might miss if we change the vertex priorities 
to be unique as a workaround. The vertex priority change is good in its own 
right. But it would be good to make sure we don't have some pending bug in the 
task scheduler that may have other side effects. Could you please attach the 
task scheduler log for the job that hung, in case that has some clues?

On the patch itself the formula looks like
(Height*Total*3) + V*3.
Now - (1*24*3) + 20*3 = 150 = (2*24*3) + 2*3
So we could still have collisions depending on the manner in which vertexIds 
get assigned, right? Unless currently we are getting lucky in the vId 
assignment such that vertices close to the root also happen to get low ids.



[jira] [Commented] (TEZ-3296) Tez job can hang if two vertices at the same root distance have different task requirements

2016-06-10 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325186#comment-15325186
 ] 

Jason Lowe commented on TEZ-3296:
-

bq. Could you please help me understand the logic to make these unique.

Today each vertex uses a set of three priority values, the low, the high, and 
the mean of those two.  (Oddly containers for high are never requested in 
practice, just the low and mean.)  I preserved that so each vertex will get a 
set of three priorities to use.  Think of the priority namespace as a 
two-dimensional table.  Each row in the table corresponds to a vertex 
distance from the root.  Each row is sized for the worst case, which is the 
number of DAG vertices * 3 (for the three priorities per vertex).  So the first 
part of the expression, {{((vertexDistanceFromRoot + 1) * 
dag.getTotalVertices() * 3)}}, gets us the priority offset of the row 
corresponding to our distance from the root.  (Well, actually that distance + 1, 
like the original code.)

Once we have the offset to our row, all that remains is getting our 
offset-within-row, which is the vertex ID * 3, since there are three priorities 
for each vertex.  We use the row offset + offset-within-row to get the low 
priority.  The high priority is low-2, just like the original code.  Since we 
are selecting a block of three priorities uniquely within a row based on the 
vertex ID (which is unique across all vertices in the DAG), no two vertices 
will have the same priority.  Since we are selecting the row based on distance 
from root, no vertex farther from the root will have a priority value lower 
than that of a vertex closer to the root, preserving the relative ordering of 
priorities within the DAG.
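
The layout described above can be sketched as follows. This is an illustration of the description, not the patch itself: the class and method names are made up, and the sketch assumes vertex IDs run from 0 to totalVertices-1.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the priority layout described above; names are illustrative, not the patch.
class VertexPriorities {
    // The value the description calls the low priority: row offset + offset within row.
    static int low(int distanceFromRoot, int totalVertices, int vertexId) {
        int rowOffset = (distanceFromRoot + 1) * totalVertices * 3;
        int offsetWithinRow = vertexId * 3;
        return rowOffset + offsetWithinRow;
    }

    // The value the description calls the high priority: two below low.
    static int high(int distanceFromRoot, int totalVertices, int vertexId) {
        return low(distanceFromRoot, totalVertices, vertexId) - 2;
    }

    // Returns true if no two (distance, vertexId) pairs share any value in
    // their block of three. Every combination is checked, which is stricter
    // than a real DAG where each vertex has exactly one distance.
    static boolean allUnique(int totalVertices) {
        Set<Integer> seen = new HashSet<>();
        for (int distance = 0; distance < totalVertices; distance++) {
            for (int id = 0; id < totalVertices; id++) {
                int lo = low(distance, totalVertices, id);
                for (int p = high(distance, totalVertices, id); p <= lo; p++) {
                    if (!seen.add(p)) {
                        return false;
                    }
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(high(0, 24, 0) + ".." + low(0, 24, 0)); // 70..72
        System.out.println(allUnique(24)); // true
    }
}
```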

bq.  Instead we could do a BFS on the DAG and assign priority based on the 
traversal.

Yep that's what I mentioned above, but in the interest of fixing this critical 
issue quickly this was simpler and equivalent so I posted it.  



[jira] [Commented] (TEZ-3296) Tez job can hang if two vertices at the same root distance have different task requirements

2016-06-10 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15324989#comment-15324989
 ] 

Bikas Saha commented on TEZ-3296:
-

Could you please help me understand the logic to make these unique. I am sorry 
I could not follow from the code :)

The minimum solution would be to break ties when needed such that each vertex 
has a unique priority. Right now vertex depth from root is proxying the 
priority. Instead we could do a BFS on the DAG and assign priority based on the 
traversal. Or we could reuse the topological sort in the client (done during 
DAG submission) and assign that as the priority of the vertex.
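
A minimal sketch of the traversal-based alternative is shown below. It uses Kahn's algorithm with a FIFO queue (a breadth-first topological sort) and hands out consecutive numbers in visit order. It is not Tez code: the class name, the adjacency-list input, and the choice to start numbering at 1 are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Sketch of assigning unique priorities from a topological traversal (Kahn's algorithm).
class TraversalPriorities {
    // edges.get(u) lists the downstream vertices of u; vertices are numbered 0..n-1.
    static int[] assign(List<List<Integer>> edges) {
        int n = edges.size();
        int[] inDegree = new int[n];
        for (List<Integer> targets : edges) {
            for (int v : targets) {
                inDegree[v]++;
            }
        }
        // Start from the vertices with no upstream dependencies.
        Deque<Integer> ready = new ArrayDeque<>();
        for (int v = 0; v < n; v++) {
            if (inDegree[v] == 0) {
                ready.add(v);
            }
        }
        int[] priority = new int[n];
        int next = 1; // smaller number means closer to the roots
        while (!ready.isEmpty()) {
            int u = ready.poll();
            priority[u] = next++;
            for (int v : edges.get(u)) {
                if (--inDegree[v] == 0) {
                    ready.add(v);
                }
            }
        }
        return priority;
    }

    public static void main(String[] args) {
        // Diamond-shaped DAG: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3.
        List<List<Integer>> edges = new ArrayList<>();
        edges.add(Arrays.asList(1, 2));
        edges.add(Arrays.asList(3));
        edges.add(Arrays.asList(3));
        edges.add(new ArrayList<Integer>());
        System.out.println(Arrays.toString(assign(edges))); // [1, 2, 3, 4]
    }
}
```

Because every vertex receives its own number, vertices 1 and 2 in the diamond no longer share a priority even though they are at the same distance from the root, and the numbering uses only as many values as there are vertices.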




[jira] [Commented] (TEZ-3296) Tez job can hang if two vertices at the same root distance have different task requirements

2016-06-10 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15324647#comment-15324647
 ] 

TezQA commented on TEZ-3296:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12809315/TEZ-3296.001.patch
  against master revision 8985969.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/1789//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1789//console

This message is automatically generated.


[jira] [Commented] (TEZ-3296) Tez job can hang if two vertices at the same root distance have different task requirements

2016-06-09 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323618#comment-15323618
 ] 

Gopal V commented on TEZ-3296:
--

[~jlowe]: converting the Vertex priorities from partially ordered to fully 
ordered will possibly fix the performance issues noted in TEZ-946


[jira] [Commented] (TEZ-3296) Tez job can hang if two vertices at the same root distance have different task requirements

2016-06-09 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323552#comment-15323552
 ] 

TezQA commented on TEZ-3296:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12809315/TEZ-3296.001.patch
  against master revision 2c21285.

{color:red}-1 patch{color}.  master compilation may be broken.

Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1783//console

This message is automatically generated.


[jira] [Commented] (TEZ-3296) Tez job can hang if two vertices at the same root distance have different task requirements

2016-06-09 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323526#comment-15323526
 ] 

TezQA commented on TEZ-3296:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12809315/TEZ-3296.001.patch
  against master revision 2c21285.

{color:red}-1 patch{color}.  master compilation may be broken.

Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1782//console

This message is automatically generated.
