[jira] [Commented] (TEZ-3770) DAG-aware YARN task scheduler

2018-01-23 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336203#comment-16336203
 ] 

Siddharth Seth commented on TEZ-3770:
-

+1. Looks good. Thanks [~jlowe]

> DAG-aware YARN task scheduler
> -----------------------------
>
> Key: TEZ-3770
> URL: https://issues.apache.org/jira/browse/TEZ-3770
> Project: Apache Tez
>  Issue Type: New Feature
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>Priority: Major
> Attachments: TEZ-3770.001.patch, TEZ-3770.002.patch
>
>
> There are cases where priority alone does not convey the relationship between 
> tasks, and this can cause problems when scheduling or preempting tasks.  If 
> the YARN task scheduler was aware of the relationship between tasks then it 
> could make smarter decisions when trying to assign tasks to containers or 
> preempt running tasks to schedule pending tasks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3770) DAG-aware YARN task scheduler

2017-12-19 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297416#comment-16297416
 ] 

TezQA commented on TEZ-3770:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12902920/TEZ-3770.002.patch
  against master revision 4c378b4.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/2706//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2706//console

This message is automatically generated.



[jira] [Commented] (TEZ-3770) DAG-aware YARN task scheduler

2017-12-19 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16296958#comment-16296958
 ] 

Jason Lowe commented on TEZ-3770:
-

If this is failing on vanilla 0.9.1 then I think this warrants a separate JIRA 
since the problem occurs without the patch from this JIRA.

I suspect the issue doesn't happen when running with the dag-aware scheduler 
since it avoids preempting tasks that are not descendants of the pending 
requests in the DAG.  If the DAG has many independent trees (or even 
significant, parallel branches within a large tree) then the default YARN 
scheduler can preempt tasks unnecessarily.  Since it doesn't understand the DAG 
connections it makes the false assumption that any lower priority vertex must 
be a DAG descendant of a higher priority vertex.  It then preempts those lower 
priority tasks when the high priority requests are not being fulfilled.  The 
DAG-aware scheduler does not preempt active, low priority tasks if they are not 
descendants of the higher priority requests since those lower priority tasks 
can still complete and we don't want to lose completed work.
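The descendant rule described above can be sketched as follows. This is a minimal, hypothetical illustration (names invented, not the actual Tez code): a running task is a preemption candidate only if its vertex is a DAG descendant of the vertex whose requests are pending, so tasks in unrelated branches are spared.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class DescendantPreemption {
    // requesterDescendants: one bit per vertex downstream of the requester
    // runningTaskVertex[i]: vertex index of running task i
    static List<Integer> preemptionCandidates(BitSet requesterDescendants,
                                              int[] runningTaskVertex) {
        List<Integer> candidates = new ArrayList<>();
        for (int task = 0; task < runningTaskVertex.length; task++) {
            if (requesterDescendants.get(runningTaskVertex[task])) {
                candidates.add(task); // descendant of the pending request: may preempt
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        // DAG: v0 -> v1 -> v2, plus an independent v3
        BitSet descendantsOfV0 = new BitSet();
        descendantsOfV0.set(1);
        descendantsOfV0.set(2);
        int[] running = {1, 3, 2}; // running tasks belong to v1, v3, v2
        // only the tasks in v1 and v2 are candidates; v3's task is spared
        System.out.println(preemptionCandidates(descendantsOfV0, running));
    }
}
```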

I'll try to get an updated patch addressing the javadoc comment request today.  
After that I think this should be ready to go in since it works on "real" jobs 
like TPC-DS Q64 10TB with a noticeable improvement over the current YARN 
scheduler, and it's "opt-in" so users must explicitly configure it.



[jira] [Commented] (TEZ-3770) DAG-aware YARN task scheduler

2017-12-15 Thread Eric Wohlstadter (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293305#comment-16293305
 ] 

Eric Wohlstadter commented on TEZ-3770:
---

I applied TEZ-3770 and TEZ-394 to 0.9.1. With those patches, the job completes.

On vanilla 0.9.1 the job fails from preemption.



[jira] [Commented] (TEZ-3770) DAG-aware YARN task scheduler

2017-12-15 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293297#comment-16293297
 ] 

Jason Lowe commented on TEZ-3770:
-

Thanks for kicking the tires on the patch, Eric!

The patch doesn't modify the code for the default scheduler, 
YarnTaskSchedulerService, so I can't readily explain things if this patch makes 
that scheduler unable to run that workload.  Is it able to run that workload 
without this patch?

I could see that behavior occurring if you also had pulled in the patch for 
TEZ-394, as it's a known problem causing massive preemptions for certain DAG 
topologies when combining TEZ-394 with YarnTaskSchedulerService.



[jira] [Commented] (TEZ-3770) DAG-aware YARN task scheduler

2017-12-15 Thread Eric Wohlstadter (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293274#comment-16293274
 ] 

Eric Wohlstadter commented on TEZ-3770:
---

Here is some data for consideration. 

Running TPC-DS Q64 at 10TB, at least with my setup:
* With DagAwareYarnTaskScheduler, the job completes in 17 minutes.
* With the default scheduler, the job continues to fail:
{code}
2017-12-15 14:53:58,441 [INFO] [AMRM Callback Handler Thread] |rm.YarnTaskSchedulerService|: Trying to service 7 out of total 62 pending requests at pri: 440 by preempting from 75 running tasks at priority: 1160
{code}
and things just continue deteriorating from there.

Could this be explained by the use of the DagAwareYarnTaskScheduler?



[jira] [Commented] (TEZ-3770) DAG-aware YARN task scheduler

2017-12-14 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291855#comment-16291855
 ] 

TezQA commented on TEZ-3770:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12874106/TEZ-3770.001.patch
  against master revision 4c378b4.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/2704//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2704//console

This message is automatically generated.



[jira] [Commented] (TEZ-3770) DAG-aware YARN task scheduler

2017-12-14 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291781#comment-16291781
 ] 

Jason Lowe commented on TEZ-3770:
-

Thanks for the review, Sidd!  Apologies for the long delay, got sidetracked 
with other work.

bq. If I'm reading the code right, new containers which cannot be assigned 
immediately are released?

It's a problem with delayed containers.  This comment in the old scheduler 
sorta sums it up:
{code}
  // this container is of lower priority and given to us by the RM for
  // a task that will be matched after the current top priority. Keep
  // this container for those pending tasks since the RM is not going
  // to give this container to us again
{code}
The old scheduler holding onto the lower priority containers that cannot be 
assigned led to TEZ-3535.  This scheduler tries to avoid that issue.

bq. Not sure if priority is being considered while doing this. i.e. is it 
possible there's a pending higher priority request which has not yet been 
allocated to an idle container (primarily races in timing)? Think this is 
handled since an attempt is made to allocate a container the moment the task 
assigned to it is de-allocated.

Yes, we cover that case by going through the list of pending tasks immediately 
after releasing a container.
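That ordering can be sketched roughly as below. This is an invented, simplified illustration (names hypothetical, not the Tez implementation): when a container frees up, pending requests are scanned in priority order before the container is released, so an idle container cannot bypass a higher priority request.

```java
import java.util.Deque;
import java.util.Map;
import java.util.OptionalInt;
import java.util.SortedMap;

public class FreedContainerMatcher {
    // pendingByPriority: lower key = higher priority (YARN convention)
    static OptionalInt pickPriorityToServe(SortedMap<Integer, Deque<String>> pendingByPriority) {
        for (Map.Entry<Integer, Deque<String>> e : pendingByPriority.entrySet()) {
            if (!e.getValue().isEmpty()) {
                // highest-priority non-empty bucket wins the freed container
                return OptionalInt.of(e.getKey());
            }
        }
        return OptionalInt.empty(); // nothing pending: release the container to the RM
    }
}
```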

bq. This is broken for newly assigned containers?

It's not broken, IMHO, since again we don't want to hold onto lower priority 
containers.  This problem happens in practice because a lower priority vertex 
requests resources just before a higher priority vertex.  Due to the async 
nature of sending requests to the RM, that means we could make the requests for 
lower priority containers before sending the request for higher priority ones.  
It's a natural race in the DAG.  I chose to treat the race as, "if the RM 
thinks this is the highest priority request to allocate then that priority won 
the race."  That's why the code ignores pending request priorities when new 
containers arrive.  The same is not true when deciding to reuse a container.

bq. TaskRequest oldRequest = requests.put -> Is it possible for old to not be 
null? A single request to allocate a single attempt.

This hardens the scheduler in case someone re-requests a task.  If we don't 
account for it and it somehow were to happen in practice then the preemption 
logic and other metadata tracking the DAG could get out of sync.  Not sure it 
can happen in practice today, just some defensive programming.
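A toy sketch of that defensive pattern (hypothetical names, not the actual patch code): if the same task is requested twice, the stale entry's bookkeeping is backed out before the new request is recorded, so per-priority stats cannot drift.

```java
import java.util.HashMap;
import java.util.Map;

public class RequestTable {
    private final Map<String, Integer> requests = new HashMap<>(); // task -> priority
    private final Map<Integer, Integer> taskCountByPriority = new HashMap<>();

    void addRequest(String task, int priority) {
        Integer old = requests.put(task, priority);
        if (old != null) {
            // re-request: undo the old entry's count so the stats stay consistent
            taskCountByPriority.merge(old, -1, Integer::sum);
        }
        taskCountByPriority.merge(priority, 1, Integer::sum);
    }

    int countAt(int priority) {
        return taskCountByPriority.getOrDefault(priority, 0);
    }
}
```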

bq. Would be nice to have some more documentation or an example of how this 
ends up working.

I'll update the patch to add more documentation on how the bitsets are used and 
how we end up using them to prevent scheduling of lower priority descendants 
when reusing containers.
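A minimal illustration of the bitset idea (names hypothetical, simplified from the description above): while a vertex has pending requests, its DAG descendants are masked out of the "allowed" set at lower priorities, so container reuse cannot schedule a descendant ahead of its ancestor.

```java
import java.util.BitSet;

public class AllowedVertices {
    static BitSet blockDescendants(BitSet allowedAtLowerPriority, BitSet descendants) {
        BitSet allowed = (BitSet) allowedAtLowerPriority.clone();
        allowed.andNot(descendants); // clear every descendant's bit
        return allowed;
    }

    public static void main(String[] args) {
        BitSet allowed = new BitSet();
        allowed.set(0, 4);          // vertices 0..3 currently schedulable
        BitSet descendants = new BitSet();
        descendants.set(1);
        descendants.set(2);         // v1 and v2 are downstream of the requester
        System.out.println(blockDescendants(allowed, descendants)); // {0, 3}
    }
}
```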

bq. Does it rely on the way priorities are assigned, the kind of topological 
sort? When reading this, it seems to block off a large chunk of requests at a 
lower priority.

Yes, this is intended to block off lower priority requests that are descendants 
of this vertex in the DAG.

bq. Different code paths for the allocation of a delayed container and when a 
new task request comes in. Assuming this is a result of attempting to not place 
a YARN request if a container can be assigned immediately? Not sure if more 
re-use is possible across the various assign methods.

Yes, we try to assign a requested task to a container immediately before 
requesting the AMRM layer to reduce churn on the AMRM protocol.  I'll look into 
reuse opportunities when I put up the revised patch.

bq. The default out of box behaviour will always generate different vertices at 
different priority levels at the moment. The old behaviour was to generate the 
same priority if distance from root was the same. Is moving back to the old 
behaviour an option, given descendant information is now known?

No, this is not possible because of the bug/limitation in YARN.  Before 
YARN-4789 there is no way to correlate a request to an allocation, so all 
allocations are lumped by priority.  When they are lumped, YARN only supports 
one resource request per priority.  So we can't go back to the old 
multiple-vertices-at-the-same-priority behavior in Tez until we can make 
different resource requests at the same priority in YARN.

bq. Didn't go into enough details to figure out if an attempt is made to run 
through an entire tree before moving over to an unrelated tree

Not sure what you mean by tree here -- disconnected parts of the DAG?  If so 
the scheduler doesn't directly concern itself with whether there is just one 
tree or multiple trees, it's only concerned about what request priorities are 
"unblocked" (i.e.: do not have ancestors in the DAG at higher priority that are 
making requests) and trying to satisfy those, preempting active tasks that are 
descendants of those requesting tasks if 

[jira] [Commented] (TEZ-3770) DAG-aware YARN task scheduler

2017-07-19 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16093956#comment-16093956
 ] 

Siddharth Seth commented on TEZ-3770:
-

bq. It tries to schedule new containers for tasks that match its priority 
before trying to schedule the highest priority task first. This avoids hanging 
onto unused, lower priority containers because higher priority requests are 
pending (see TEZ-3535).
If I'm reading the code right, new containers which cannot be assigned 
immediately are released? Pending requests are removed as soon as a container 
is assigned, so YARN will not end up allocating this unused container again 
(other than the regular timing races on the protocol).

bq. New task allocation requests are first matched against idle containers 
before requesting resources from the RM. This cuts down on AM-RM protocol churn.
Not sure if priority is being considered while doing this. i.e. is it possible 
there's a pending higher priority request which has not yet been allocated to 
an idle container (primarily races in timing)? Think this is handled since an 
attempt is made to allocate a container the moment the task assigned to it is 
de-allocated.

bq. Task requests for tasks that are DAG-descendants of pending task requests 
will not be allocated to help reduce priority inversions that could lead to 
preemption.
This is broken for newly assigned containers?

On the patch itself.
DagAwareYarnTaskScheduler
- TaskRequest oldRequest = requests.put -> Is it possible for old to not be 
null? A single request to allocate a single attempt.
- incrVertexTaskCount - lowerStat.allowedVertices.andNot(d); <- Would be nice 
to have some more documentation or an example of how this ends up working. Does 
it rely on the way priorities are assigned, the kind of topological sort? When 
reading this, it seems to block off a large chunk of requests at a lower 
priority.
- Different code paths for the allocation of a delayed container and when a new 
task request comes in. Assuming this is a result of attempting to not place a 
YARN request if a container can be assigned immediately? Not sure if more 
re-use is possible across the various assign methods.
- RequestPriorityStats - javadoc on descendants is a little confusing. Mentions 
a single vertex. I think this gets set for every vertex at the same priority 
level. The default out of box behaviour will always generate different vertices 
at different priority levels at the moment. The old behaviour was to generate 
the same priority if distance from root was the same. Is moving back to the old 
behaviour an option, given descendant information is now known?
- Didn't go into enough details to figure out if an attempt is made to run 
through an entire tree before moving over to an unrelated tree
- In tryAssignReuseContainer - if a container cannot be assigned immediately, 
will it be released? Should this decision be based on headroom / pending 
requests (headroom is very often incorrect, preemption is meant to take care of 
that). e.g. a task failure, so there's a new request. If the container cannot 
be re-used for this request, and capacity is available in YARN - it may make 
sense to hold on to the container.

DagInfo - Should getVertexDescendants be exposed as a method, or just the 
Vertices and the relationship between them. Whoever wants to use this can set 
up their own representation. The bit representation could be a helper. The 
vertex relationship can likely be used for more than just the list of 
descendants.
TaskSchedulerContext - Instead of exposing a getVertexIndexForTask(Object) - I 
think a better option is to provide an interface for the requesting task 
itself. (TaskRequest instead of Object). That can expose relevant information, 
instead of making an additional call to get this from TaskSchedulerContext. 




[jira] [Commented] (TEZ-3770) DAG-aware YARN task scheduler

2017-06-27 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065756#comment-16065756
 ] 

Bikas Saha commented on TEZ-3770:
-

Just clarifying that the original scheduler was not made DAG-aware by design. 
It was an attempt to prevent leaky abstractions where code changes span both 
the scheduler and the DAG state machine, as happened in the MR code where logic 
was spread all over. The DAG core logic and the VertexManager user logic could 
determine the dependencies and priorities of tasks, and the scheduler would 
allocate resources based on priority. So other schedulers could be easily 
written, since they don't need to understand complex relationships.

However, not all of those design assumptions have been validated, since we 
don't have many schedulers written :P
