[jira] [Commented] (TEZ-3770) DAG-aware YARN task scheduler
[ https://issues.apache.org/jira/browse/TEZ-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336203#comment-16336203 ]

Siddharth Seth commented on TEZ-3770:
-------------------------------------

+1. Looks good. Thanks [~jlowe]

> DAG-aware YARN task scheduler
> -----------------------------
>
>          Key: TEZ-3770
>          URL: https://issues.apache.org/jira/browse/TEZ-3770
>      Project: Apache Tez
>   Issue Type: New Feature
>     Reporter: Jason Lowe
>     Assignee: Jason Lowe
>     Priority: Major
>  Attachments: TEZ-3770.001.patch, TEZ-3770.002.patch
>
> There are cases where priority alone does not convey the relationship between
> tasks, and this can cause problems when scheduling or preempting tasks. If
> the YARN task scheduler was aware of the relationship between tasks then it
> could make smarter decisions when trying to assign tasks to containers or
> preempt running tasks to schedule pending tasks.
[jira] [Commented] (TEZ-3770) DAG-aware YARN task scheduler
[ https://issues.apache.org/jira/browse/TEZ-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297416#comment-16297416 ]

TezQA commented on TEZ-3770:
----------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12902920/TEZ-3770.002.patch
against master revision 4c378b4.

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 6 new or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/2706//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2706//console

This message is automatically generated.
[jira] [Commented] (TEZ-3770) DAG-aware YARN task scheduler
[ https://issues.apache.org/jira/browse/TEZ-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16296958#comment-16296958 ]

Jason Lowe commented on TEZ-3770:
---------------------------------

If this is failing on vanilla 0.9.1 then I think this warrants a separate JIRA, since the problem occurs without the patch from this JIRA.

I suspect the issue doesn't happen when running with the DAG-aware scheduler since it avoids preempting tasks that are not descendants of the pending requests in the DAG. If the DAG has many independent trees (or even significant, parallel branches within a large tree) then the default YARN scheduler can preempt tasks unnecessarily. Since it doesn't understand the DAG connections, it makes the false assumption that any lower priority vertex must be a DAG descendant of a higher priority vertex. It then preempts those lower priority tasks when the high priority requests are not being fulfilled. The DAG-aware scheduler does not preempt active, low priority tasks if they are not descendants of the higher priority requests, since those lower priority tasks can still complete and we don't want to lose completed work.

I'll try to get an updated patch addressing the javadoc comment request today. After that I think this should be ready to go in, since it works on "real" jobs like TPC-DS Q64 10TB with a noticeable improvement over the current YARN scheduler, and it's "opt-in" so users must explicitly configure it.
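To make the difference concrete, the preemption test can be sketched roughly as below. This is illustrative only, not the actual patch code: the local DagInfo interface and the isPreemptionCandidate helper are stand-ins, and getVertexDescendants is assumed to return a BitSet of descendant vertex indices as discussed in the review.

{code}
import java.util.BitSet;

// Rough sketch of descendant-aware preemption (illustrative names only).
// A running task is a preemption candidate only if its vertex descends, in the
// DAG, from a vertex that still has unsatisfied higher priority requests.
// Unrelated tasks are left alone so their completed work isn't thrown away.
public class PreemptionSketch {
  interface DagInfo {
    // assumed shape: bit i set => vertex i is a DAG descendant of vertexIndex
    BitSet getVertexDescendants(int vertexIndex);
  }

  static boolean isPreemptionCandidate(int runningVertexId,
      BitSet verticesWithPendingRequests, DagInfo dag) {
    for (int v = verticesWithPendingRequests.nextSetBit(0); v >= 0;
         v = verticesWithPendingRequests.nextSetBit(v + 1)) {
      if (dag.getVertexDescendants(v).get(runningVertexId)) {
        return true;  // descends from blocked work: preempting it loses the least
      }
    }
    return false;     // independent branch/tree: let it finish, keep its output
  }
}
{code}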
[jira] [Commented] (TEZ-3770) DAG-aware YARN task scheduler
[ https://issues.apache.org/jira/browse/TEZ-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293305#comment-16293305 ]

Eric Wohlstadter commented on TEZ-3770:
---------------------------------------

I applied TEZ-3770 and TEZ-394 to 0.9.1. With those patches, the job completes. On vanilla 0.9.1 the job fails due to preemption.
[jira] [Commented] (TEZ-3770) DAG-aware YARN task scheduler
[ https://issues.apache.org/jira/browse/TEZ-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293297#comment-16293297 ]

Jason Lowe commented on TEZ-3770:
---------------------------------

Thanks for kicking the tires on the patch, Eric!

The patch doesn't modify the code for the default scheduler, YarnTaskSchedulerService, so I can't readily explain things if this patch makes that scheduler unable to run that workload. Is it able to run that workload without this patch? I could see that behavior occurring if you had also pulled in the patch for TEZ-394, as it's a known problem that combining TEZ-394 with YarnTaskSchedulerService causes massive preemptions for certain DAG topologies.
[jira] [Commented] (TEZ-3770) DAG-aware YARN task scheduler
[ https://issues.apache.org/jira/browse/TEZ-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293274#comment-16293274 ]

Eric Wohlstadter commented on TEZ-3770:
---------------------------------------

Here is some data for consideration, running TPC-DS Q64 10TB. At least with my setup:
* DagAwareYarnTaskScheduler: the job completes in 17 mins.
* default scheduler: the job continues to fail:
{code}
2017-12-15 14:53:58,441 [INFO] [AMRM Callback Handler Thread] |rm.YarnTaskSchedulerService|: Trying to service 7 out of total 62 pending requests at pri: 440 by preempting from 75 running tasks at priority: 1160
{code}
and things just continue deteriorating from there.

Could this difference be explained by the use of the DagAwareYarnTaskScheduler?
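For reference, the two runs above differ only in which task scheduler the AM is configured to use. A minimal sketch of the switch is below; the property name "tez.am.yarn.scheduler.class" and the DAG-aware class name are assumptions based on this patch's discussion, so check the committed patch for the actual key.

{code}
import org.apache.tez.dag.api.TezConfiguration;

// Illustrative only: opting the AM into the DAG-aware scheduler.
// The property key below is an assumption from the TEZ-3770 discussion and
// may differ in the committed patch.
public class SchedulerConfigSketch {
  public static TezConfiguration withDagAwareScheduler() {
    TezConfiguration conf = new TezConfiguration();
    conf.set("tez.am.yarn.scheduler.class",
        "org.apache.tez.dag.app.rm.DagAwareYarnTaskScheduler");
    return conf;
  }
}
{code}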
[jira] [Commented] (TEZ-3770) DAG-aware YARN task scheduler
[ https://issues.apache.org/jira/browse/TEZ-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291855#comment-16291855 ]

TezQA commented on TEZ-3770:
----------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12874106/TEZ-3770.001.patch
against master revision 4c378b4.

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 6 new or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/2704//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2704//console

This message is automatically generated.
[jira] [Commented] (TEZ-3770) DAG-aware YARN task scheduler
[ https://issues.apache.org/jira/browse/TEZ-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291781#comment-16291781 ]

Jason Lowe commented on TEZ-3770:
---------------------------------

Thanks for the review, Sidd! Apologies for the long delay, got sidetracked with other work.

bq. If I'm reading the code right. New containers which cannot be assigned immediately are released?

It's a problem with delayed containers. This comment in the old scheduler sorta sums it up:
{code}
// this container is of lower priority and given to us by the RM for
// a task that will be matched after the current top priority. Keep
// this container for those pending tasks since the RM is not going
// to give this container to us again
{code}
The old scheduler holding onto the lower priority containers that cannot be assigned led to TEZ-3535. This scheduler tries to avoid that issue.

bq. Not sure if priority is being considered while doing this. i.e. is it possible there's a pending higher priority request which has not yet been allocated to an idle container (primarily races in timing)? Think this is handled since an attempt is made to allocate a container the moment the task assigned to it is de-allocated.

Yes, we cover that case by going through the list of pending tasks immediately after releasing a container.

bq. This is broken for newly assigned containers?

It's not broken, IMHO, since again we don't want to hold onto lower priority containers. This problem happens in practice because a lower priority vertex requests resources just before a higher priority vertex. Due to the async nature of sending requests to the RM, that means we could make the requests for lower priority containers before sending the request for higher priority ones. It's a natural race in the DAG. I chose to treat the race as, "if the RM thinks this is the highest priority request to allocate then that priority won the race." That's why the code ignores pending request priorities when new containers arrive. The same is not true when deciding to reuse a container.

bq. TaskRequest oldRequest = requests.put -> Is it possible for old to not be null? A single request to allocate a single attempt.

This hardens the scheduler in case someone re-requests a task. If we don't account for it and it somehow were to happen in practice then the preemption logic and other metadata tracking the DAG could get out of sync. Not sure it can happen in practice today, just some defensive programming.

bq. Would be nice to have some more documentation or an example of how this ends up working.

I'll update the patch to add more documentation on how the bitsets are used and how we end up using them to prevent scheduling of lower priority descendants when reusing containers.

bq. Does it rely on the way priorities are assigned, the kind of topological sort? When reading this, it seems to block off a large chunk of requests at a lower priority.

Yes, this is intended to block off lower priority requests that are descendants of this vertex in the DAG.

bq. Different code paths for the allocation of a delayed container and when a new task request comes in. Assuming this is a result of attempting to not place a YARN request if a container can be assigned immediately? Not sure if more re-use is possible across the various assign methods.

Yes, we try to assign a requested task to a container immediately before requesting the AMRM layer to reduce churn on the AMRM protocol. I'll look into reuse opportunities when I put up the revised patch.
bq. The default out of the box behaviour will always generate different vertices at different priority levels at the moment. The old behaviour was to generate the same priority if distance from root was the same. Is moving back to the old behaviour an option, given descendant information is now known?

No, this is not possible because of a bug/limitation in YARN. Before YARN-4789 there is no way to correlate a request to an allocation, so all allocations are lumped by priority. When they are lumped, YARN only supports one resource request per priority. So we can't go back to the old multiple-vertices-at-the-same-priority behavior in Tez until we can make different resource requests at the same priority in YARN.

bq. Didn't go into enough detail to figure out if an attempt is made to run through an entire tree before moving over to an unrelated tree.

Not sure what you mean by tree here -- disconnected parts of the DAG? If so, the scheduler doesn't directly concern itself with whether there is just one tree or multiple trees. It's only concerned with which request priorities are "unblocked" (i.e., they do not have ancestors in the DAG at higher priority that are making requests) and with trying to satisfy those, preempting active tasks that are descendants of those requesting tasks if necessary.
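In the meantime, the gist of the bitset handling is roughly the sketch below. The names (RequestPriorityStats, allowedVertices, getVertexDescendants) follow the review comments above, but this is an approximation rather than the exact patch code: when a vertex has outstanding requests at some priority, its DAG descendants are cleared from the allowedVertices set of every lower priority, so container reuse won't hand capacity to those descendants until the higher priority requests are satisfied.

{code}
import java.util.BitSet;
import java.util.Collection;

// Approximate sketch of the allowedVertices bookkeeping (illustrative names).
public class AllowedVerticesSketch {
  // pared-down stand-in for the per-priority stats kept by the scheduler
  static class RequestPriorityStats {
    final BitSet allowedVertices;   // vertices this priority level may schedule
    RequestPriorityStats(int numVertices) {
      allowedVertices = new BitSet(numVertices);
      allowedVertices.set(0, numVertices);   // everything allowed initially
    }
  }

  // Called when a request arrives for a vertex at some priority: every *lower*
  // priority level loses the right to schedule that vertex's DAG descendants
  // while the higher priority requests remain outstanding.
  static void blockDescendantsAtLowerPriorities(BitSet descendants,
      Collection<RequestPriorityStats> lowerPriorityStats) {
    for (RequestPriorityStats lowerStat : lowerPriorityStats) {
      lowerStat.allowedVertices.andNot(descendants);
    }
  }

  // When reusing an idle container, a pending request is only assignable if
  // its vertex is still present in allowedVertices for its priority level.
  static boolean canAssign(RequestPriorityStats stats, int vertexIndex) {
    return stats.allowedVertices.get(vertexIndex);
  }
}
{code}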
[jira] [Commented] (TEZ-3770) DAG-aware YARN task scheduler
[ https://issues.apache.org/jira/browse/TEZ-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16093956#comment-16093956 ]

Siddharth Seth commented on TEZ-3770:
-------------------------------------

bq. It tries to schedule new containers for tasks that match its priority before trying to schedule the highest priority task first. This avoids hanging onto unused, lower priority containers because higher priority requests are pending (see TEZ-3535).

If I'm reading the code right, new containers which cannot be assigned immediately are released? Pending requests are removed as soon as a container is assigned. YARN will not end up allocating this unused container again (other than the regular timing races on the protocol).

bq. New task allocation requests are first matched against idle containers before requesting resources from the RM. This cuts down on AM-RM protocol churn.

Not sure if priority is being considered while doing this, i.e. is it possible there's a pending higher priority request which has not yet been allocated to an idle container (primarily races in timing)? Think this is handled since an attempt is made to allocate a container the moment the task assigned to it is de-allocated.

bq. Task requests for tasks that are DAG-descendants of pending task requests will not be allocated to help reduce priority inversions that could lead to preemption.

This is broken for newly assigned containers?

On the patch itself:

DagAwareYarnTaskScheduler
- TaskRequest oldRequest = requests.put -> Is it possible for old to not be null? A single request to allocate a single attempt.
- incrVertexTaskCount - lowerStat.allowedVertices.andNot(d); <- Would be nice to have some more documentation or an example of how this ends up working. Does it rely on the way priorities are assigned, the kind of topological sort? When reading this, it seems to block off a large chunk of requests at a lower priority.
- Different code paths for the allocation of a delayed container and when a new task request comes in. Assuming this is a result of attempting to not place a YARN request if a container can be assigned immediately? Not sure if more re-use is possible across the various assign methods.
- RequestPriorityStats - the javadoc on descendants is a little confusing. It mentions a single vertex; I think this gets set for every vertex at the same priority level. The default out of the box behaviour will always generate different vertices at different priority levels at the moment. The old behaviour was to generate the same priority if distance from root was the same. Is moving back to the old behaviour an option, given descendant information is now known?
- Didn't go into enough detail to figure out if an attempt is made to run through an entire tree before moving over to an unrelated tree.
- In tryAssignReuseContainer - if a container cannot be assigned immediately, will it be released? Should this decision be based on headroom / pending requests (headroom is very often incorrect; preemption is meant to take care of that)? e.g. a task failure, so there's a new request. If the container cannot be re-used for this request, and capacity is available in YARN, it may make sense to hold on to the container.

DagInfo
- Should getVertexDescendants be exposed as a method, or just the vertices and the relationship between them? Whoever wants to use this can set up their own representation. The bit representation could be a helper. The vertex relationship can likely be used for more than just the list of descendants.
TaskSchedulerContext
- Instead of exposing a getVertexIndexForTask(Object), I think a better option is to provide an interface for the requesting task itself (TaskRequest instead of Object). That can expose relevant information instead of making an additional call to get this from TaskSchedulerContext.
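To make those last two points concrete, one possible shape is sketched below. This is purely illustrative, not a proposal for exact committed interfaces: the idea is to expose the vertex relationship through the DAG info and hand the scheduler a request object rather than an opaque Object.

{code}
import java.util.BitSet;

// Illustrative sketch of the API direction discussed above; the actual
// interfaces in the patch may differ.
interface DagInfoSketch {
  int getTotalVertices();
  // helper form: bit i set => vertex i is a transitive downstream of vertexIndex
  BitSet getVertexDescendants(int vertexIndex);
}

// Instead of TaskSchedulerContext.getVertexIndexForTask(Object), the scheduler
// could receive a richer request object that carries the vertex information.
interface TaskRequestSketch {
  Object getTask();
  int getVertexIndex();
  int getPriority();
}
{code}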
[jira] [Commented] (TEZ-3770) DAG-aware YARN task scheduler
[ https://issues.apache.org/jira/browse/TEZ-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065756#comment-16065756 ]

Bikas Saha commented on TEZ-3770:
---------------------------------

Just clarifying that the original scheduler was not made DAG-aware by design. It was an attempt to prevent leaky features where code changed across both the scheduler and the DAG state machine, like happened in the MR code where logic was spread all over. The DAG core logic and VertexManager user logic could determine the dependencies and priorities of tasks, and the scheduler would allocate resources based on priority. So other schedulers could be easily written since they don't need to understand complex relationships. However, not all of those design assumptions have been validated since we don't have many schedulers written :P