[ https://issues.apache.org/jira/browse/TEZ-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16291781#comment-16291781 ]

Jason Lowe commented on TEZ-3770:
---------------------------------

Thanks for the review, Sidd!  Apologies for the long delay; I got sidetracked 
with other work.

bq. If I'm reading the code right. New containers which cannot be assigned 
immediately are released?

It's a problem with delayed containers.  This comment in the old scheduler 
sorta sums it up:
{code}
              // this container is of lower priority and given to us by the RM for
              // a task that will be matched after the current top priority. Keep
              // this container for those pending tasks since the RM is not going
              // to give this container to us again
{code}
The old scheduler's holding onto lower-priority containers that could not be 
assigned led to TEZ-3535.  This scheduler tries to avoid that issue.

bq. Not sure if priority is being considered while doing this. i.e. is it 
possible there's a pending higher priority request which has not yet been 
allocated to an idle container (primarily races in timing)? Think this is 
handled since an attempt is made to allocate a container the moment the task 
assigned to it is de-allocated.

Yes, we cover that case by going through the list of pending tasks immediately 
after releasing a container.
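
For concreteness, here is a rough sketch of that flow (illustrative names only, 
not the actual patch code): when a task is deallocated, walk the pending 
requests highest priority first and try to hand the freed container to one of 
them before releasing or idling it.
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch (not the TEZ-3770 patch code) of re-scanning pending
// requests, highest priority first, when a container is freed by a deallocated task.
class ReleaseRescanSketch {
  static class PendingTask {
    final String name;
    PendingTask(String name) { this.name = name; }
  }

  // Pending tasks keyed by priority; lower number == higher YARN priority.
  private final TreeMap<Integer, List<PendingTask>> pendingByPriority = new TreeMap<>();

  void addPending(int priority, PendingTask task) {
    pendingByPriority.computeIfAbsent(priority, p -> new ArrayList<>()).add(task);
  }

  /** Called when a task is deallocated and its container becomes free. */
  PendingTask onContainerFreed(String containerId) {
    // Walk pending tasks in priority order and hand the container to the first
    // one it can satisfy, before considering releasing or idling the container.
    for (Map.Entry<Integer, List<PendingTask>> entry : pendingByPriority.entrySet()) {
      for (PendingTask task : entry.getValue()) {
        if (canAssign(task, containerId)) {
          entry.getValue().remove(task);
          return task;               // assigned; the container stays in use
        }
      }
    }
    return null;                     // nothing matched; release or idle the container
  }

  private boolean canAssign(PendingTask task, String containerId) {
    // The real scheduler also checks capability, locality, and DAG constraints here.
    return true;
  }
}
{code}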

bq. This is broken for newly assigned containers?

It's not broken, IMHO, since again we don't want to hold onto lower-priority 
containers.  This happens in practice when a lower-priority vertex requests 
resources just before a higher-priority vertex does.  Because requests are sent 
to the RM asynchronously, we can end up sending the lower-priority requests 
before the higher-priority ones.  It's a natural race in the DAG.  I chose to 
treat the race as, "if the RM thinks this is the highest priority request to 
allocate then that priority won the race."  That's why the code ignores pending 
request priorities when new containers arrive.  The same is not true when 
deciding to reuse a container.
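
In sketch form, the asymmetry looks roughly like this (hypothetical names, not 
the patch code): new containers are assigned to a matching request regardless of 
other pending priorities, while reused containers respect the DAG blocking rules.
{code}
// Hypothetical sketch (not the actual patch code) of the new-vs-reused
// container decision described above.
class AssignmentPolicySketch {
  enum ContainerSource { NEW_FROM_RM, REUSED }

  // requestPriority: priority of the candidate pending request (lower number = higher priority)
  // highestPendingPriority: numerically lowest priority that still has pending requests
  // requestIsDescendantOfPending: true if the candidate request's vertex descends from a
  //                               vertex that still has pending requests
  boolean mayAssign(ContainerSource source,
                    int requestPriority,
                    int highestPendingPriority,
                    boolean requestIsDescendantOfPending) {
    if (source == ContainerSource.NEW_FROM_RM) {
      // The RM chose to allocate at this priority, so treat that priority as
      // having "won the race" and ignore other pending priorities.
      return true;
    }
    // For reuse, don't hand the container to a lower-priority descendant while
    // a higher-priority ancestor still has pending requests.
    return requestPriority <= highestPendingPriority || !requestIsDescendantOfPending;
  }
}
{code}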

bq. TaskRequest oldRequest = requests.put -> Is it possible for old to not be 
null? A single request to allocate a single attempt.

This hardens the scheduler in case someone re-requests a task.  If we didn't 
account for it and it somehow happened in practice, the preemption logic and 
other metadata tracking the DAG could get out of sync.  I'm not sure it can 
happen in practice today; it's just some defensive programming.
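
Something along these lines (illustrative names only, not the actual code): if 
the put returns a previous entry, the stale request's bookkeeping is cleaned up 
before the new one is tracked, so priority and preemption metadata stay consistent.
{code}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the defensive handling of a duplicate task request.
class RequestTableSketch {
  static class TaskRequest {
    final Object task;
    final int priority;
    TaskRequest(Object task, int priority) { this.task = task; this.priority = priority; }
  }

  private final Map<Object, TaskRequest> requests = new HashMap<>();

  void allocateTask(Object task, int priority) {
    TaskRequest oldRequest = requests.put(task, new TaskRequest(task, priority));
    if (oldRequest != null) {
      // Should not happen (one request per attempt), but if it does, drop the
      // stale request from priority/vertex bookkeeping so preemption decisions
      // and DAG metadata do not drift out of sync with reality.
      removeFromPriorityTracking(oldRequest);
    }
    addToPriorityTracking(priority);
  }

  private void removeFromPriorityTracking(TaskRequest r) { /* illustrative no-op */ }
  private void addToPriorityTracking(int priority) { /* illustrative no-op */ }
}
{code}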

bq. Would be nice to have some more documentation or an example of how this 
ends up working.

I'll update the patch to add more documentation on how the bitsets are used to 
prevent scheduling lower-priority descendants when reusing containers.
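
Roughly, the mechanism works like this (hypothetical names and simplified 
bookkeeping, not the actual patch code): each vertex with pending requests 
contributes a BitSet of its descendant priorities, and a reused container is not 
handed to a request whose priority falls in any of those sets.
{code}
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of blocking container reuse by lower-priority descendants.
class DescendantBlockingSketch {
  // Descendant priorities for each vertex that still has pending requests.
  private final Map<String, BitSet> pendingVertexDescendants = new HashMap<>();

  void vertexHasPendingRequests(String vertex, BitSet descendantPriorities) {
    pendingVertexDescendants.put(vertex, descendantPriorities);
  }

  void vertexRequestsSatisfied(String vertex) {
    pendingVertexDescendants.remove(vertex);
  }

  /** A reused container may go to priority p only if no pending vertex has p among its descendants. */
  boolean mayReuseAtPriority(int p) {
    for (BitSet descendants : pendingVertexDescendants.values()) {
      if (descendants.get(p)) {
        return false;   // p belongs to a descendant of a vertex still waiting for containers
      }
    }
    return true;
  }
}
{code}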

bq. Does it rely on the way priorities are assigned?, the kind of topological 
sort? When reading this, it seems to block off a large chunk of requests at a 
lower priority.

Yes, this is intended to block off lower-priority requests that are descendants 
of this vertex in the DAG.

bq. Different code paths for the allocation of a delayed container and when a 
new task request comes in. Assuming this is a result of attempting to not place 
a YARN request if a container can be assigned immediately? Not sure if more 
re-use is possible across the various assign methods.

Yes, when a task is requested we first try to assign it to a container we 
already hold before asking the AMRM layer, to reduce churn on the AMRM protocol.  
I'll look into code reuse opportunities across the assign methods when I put up 
the revised patch.
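
Roughly (illustrative names only): when a task request arrives we first look for 
a held or idle container that can take it, and only fall back to sending an ask 
to the RM if nothing matches.
{code}
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of trying an immediate assignment before asking the RM,
// to reduce churn on the AMRM protocol, as described above.
class ImmediateAssignSketch {
  static class HeldContainer { final String id; HeldContainer(String id) { this.id = id; } }

  private final Deque<HeldContainer> idleContainers = new ArrayDeque<>();

  /** New task request: reuse an idle container if possible, otherwise ask YARN. */
  void allocateTask(Object task, int priority) {
    HeldContainer c = findMatchingIdleContainer(task, priority);
    if (c != null) {
      assign(task, c);                 // no AMRM request/release round trip needed
      return;
    }
    sendAmRmRequest(task, priority);   // fall back to a real ask to the RM
  }

  private HeldContainer findMatchingIdleContainer(Object task, int priority) {
    return idleContainers.pollFirst(); // real code checks capability, locality, DAG blocking
  }
  private void assign(Object task, HeldContainer c) { /* illustrative no-op */ }
  private void sendAmRmRequest(Object task, int priority) { /* illustrative no-op */ }
}
{code}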

bq. The default out of box behaviour will always generate different vertices at 
different priority levels at the moment. The old behaviour was to generate the 
same priority if distance from root was the same. Is moving back to the old 
behaviour an option - given descendent information is now known).

No, this is not possible because of the bug/limitation in YARN.  Before 
YARN-4789 there is no way to correlate a request with an allocation, so all 
allocations are lumped by priority, and YARN effectively supports only one 
resource request per priority.  So we can't go back to the old 
multiple-vertices-at-the-same-priority behavior in Tez until we can make 
different resource requests at the same priority in YARN.
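
For illustration only, using the stock YARN client API rather than Tez code: 
without an allocation-request id (pre-YARN-4789), an allocated Container can 
only be matched back to an ask by its priority plus location/capability, so asks 
from different vertices cannot safely share a priority.
{code}
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

// Illustration (not Tez code) of why vertices need distinct priorities before
// YARN-4789: without an allocation-request id, an allocated Container can only
// be matched back to an ask by priority (plus location/capability), so asks
// from different vertices must not share a priority.
public class PriorityLumpingIllustration {
  public static void main(String[] args) {
    Resource vertexASize = Resource.newInstance(1024, 1);
    Resource vertexBSize = Resource.newInstance(2048, 1);

    // If both vertices used priority 3, their asks would be lumped together at
    // the RM and the AM could not tell which vertex an allocation was meant for.
    ContainerRequest vertexA =
        new ContainerRequest(vertexASize, null, null, Priority.newInstance(3));
    ContainerRequest vertexB =
        new ContainerRequest(vertexBSize, null, null, Priority.newInstance(4));

    System.out.println("vertex A ask at priority " + vertexA.getPriority());
    System.out.println("vertex B ask at priority " + vertexB.getPriority());
  }
}
{code}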

bq. Didn't go into enough details to figure out if an attempt is made to run 
through an entire tree before moving over to an unrelated tree

Not sure what you mean by tree here; disconnected parts of the DAG?  If so, the 
scheduler doesn't directly concern itself with whether there is one tree or 
several.  It only cares which request priorities are "unblocked" (i.e.: they 
have no higher-priority ancestors in the DAG that are still making requests) and 
tries to satisfy those, preempting active tasks that are descendants of the 
requesting tasks if necessary.  TEZ-394 is more about dealing with separate 
trees in the DAG, but that's largely an effort to assign priorities properly so 
the scheduler does the right thing with those priorities.
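
A minimal sketch of that idea (illustrative names and data structures, not the 
patch code): a pending priority is unblocked when none of its DAG ancestors' 
priorities still have pending requests, and an unblocked request may preempt a 
running task whose priority belongs to one of its descendants.
{code}
import java.util.BitSet;
import java.util.Collection;

// Hypothetical sketch of the "unblocked request preempts a running descendant" idea.
class PreemptionSketch {
  static class RunningTask {
    final String id;
    final int priority;
    RunningTask(String id, int priority) { this.id = id; this.priority = priority; }
  }

  /** A pending priority is unblocked if none of its DAG ancestors are still requesting. */
  boolean isUnblocked(BitSet ancestorPriorities, BitSet prioritiesWithPendingRequests) {
    return !ancestorPriorities.intersects(prioritiesWithPendingRequests);
  }

  /** Pick a running task to preempt for an unblocked request: any descendant of the requester. */
  RunningTask choosePreemptionVictim(BitSet descendantPrioritiesOfRequester,
                                     Collection<RunningTask> running) {
    for (RunningTask t : running) {
      if (descendantPrioritiesOfRequester.get(t.priority)) {
        return t;    // running task belongs to a descendant vertex; safe to preempt
      }
    }
    return null;     // nothing to preempt; the request must wait for the RM
  }
}
{code}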

bq. In tryAssignReuseContainer - if a container cannot be assigned immediately, 
will it be released?

If it exhausts the levels of retention (i.e.: trying to assign for node, then 
rack, then any) then yes, assuming the container doesn't then get held for idle 
session reuse.  The idea here is to be a good citizen on the cluster.  We don't 
want to hold onto resources we have no idea what to do with, since there are 
often other jobs trying to get work done.  I suppose we could base this decision 
on headroom, but as you mentioned, headroom is notoriously incorrect at times.  
This might be best tackled in a followup JIRA.
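
In sketch form (hypothetical names, not the actual tryAssignReuseContainer 
code): the reuse attempt cascades from node-local to rack-local to any, and only 
if every level fails does the container become a candidate for release.
{code}
// Hypothetical sketch of the node -> rack -> any reuse cascade described above.
class ReuseCascadeSketch {
  enum MatchLevel { NODE, RACK, ANY }

  /** Returns true if the container was assigned; false means it should be released
   *  (unless idle-session retention keeps it around). */
  boolean tryAssignReuseContainer(String containerId, String node, String rack) {
    for (MatchLevel level : MatchLevel.values()) {
      if (assignAtLevel(containerId, node, rack, level)) {
        return true;
      }
    }
    // Exhausted node, rack, and any matching: be a good citizen and give the
    // container back to the cluster rather than hoarding it.
    return false;
  }

  private boolean assignAtLevel(String containerId, String node, String rack, MatchLevel level) {
    // Real code consults pending requests with locality at this level and the
    // DAG descendant-blocking rules before assigning.
    return false;
  }
}
{code}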

Per our offline discussion, it might be better to get this in and then consider 
reworking the scheduler API.

I'll work on improving the documentation and looking for better code reuse in 
the container reuse assignment paths.


> DAG-aware YARN task scheduler
> -----------------------------
>
>                 Key: TEZ-3770
>                 URL: https://issues.apache.org/jira/browse/TEZ-3770
>             Project: Apache Tez
>          Issue Type: New Feature
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: TEZ-3770.001.patch
>
>
> There are cases where priority alone does not convey the relationship between 
> tasks, and this can cause problems when scheduling or preempting tasks.  If 
> the YARN task scheduler was aware of the relationship between tasks then it 
> could make smarter decisions when trying to assign tasks to containers or 
> preempt running tasks to schedule pending tasks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
