Hari Sekhon created YARN-3680:
---------------------------------

             Summary: Graceful queue capacity reclaim without KilledTaskAttempts
                 Key: YARN-3680
                 URL: https://issues.apache.org/jira/browse/YARN-3680
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: applications, capacityscheduler, resourcemanager, scheduler
    Affects Versions: 2.6.0
         Environment: HDP 2.2.4
            Reporter: Hari Sekhon


Request to allow graceful reclaim of queue resources by waiting until running 
containers finish naturally rather than killing them.

For example, if you dynamically reconfigure YARN queue capacity/maximum-capacity 
to decrease one queue's share, containers in that queue start getting killed 
(even though pre-emption is not configured on this cluster), instead of being 
allowed to finish naturally and simply having the freed resources no longer 
available for new tasks of that job.
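
For concreteness, a minimal sketch of the reconfiguration that triggers this 
(the "analytics" queue name and the percentages are just illustrative):

    <!-- capacity-scheduler.xml: shrink the analytics queue, e.g. from 60% to 30% -->
    <property>
      <name>yarn.scheduler.capacity.root.analytics.capacity</name>
      <value>30</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.analytics.maximum-capacity</name>
      <value>40</value>
    </property>

    # push the new queue configuration to the running ResourceManager
    yarn rmadmin -refreshQueues

Once the refresh takes effect, running containers in the now-over-capacity 
queue are killed rather than drained.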

This is relevant when a task makes non-idempotent changes, which cause problems 
if the task is half completed, the running task is killed, and it is later 
re-run from the beginning. For example, I bulk index to Elasticsearch with 
uniquely generated IDs because the source data has no unique key, or even a 
unique compound key. So if a task sends half its data, is killed and starts 
again, it introduces a large number of duplicates into the ES index, with no 
mechanism to dedupe them later other than rebuilding the entire index from 
scratch, which means hundreds of millions of docs multiplied by many, many 
indices.
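
As a rough illustration of why the re-run is non-idempotent (index name and 
fields are just examples): each bulk action is sent without an explicit _id, so 
Elasticsearch auto-generates a new ID per request, and a re-sent row becomes a 
second, distinct document:

    POST /_bulk
    { "index" : { "_index" : "my_index", "_type" : "doc" } }
    { "field1" : "value1", "field2" : "value2" }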

I appreciate this is a significant request and could cause problems with 
long-running services never returning their resources, so there would need to 
be some combination of settings to separate the indefinitely running tasks of 
long-lived services from finite-runtime analytic job tasks, with some sort of 
time-based safety cut-off.
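
Purely as a hypothetical sketch of what such a knob could look like (these 
property names do not exist, they are only for illustration), something 
per-queue along the lines of:

    <!-- hypothetical properties: drain containers on capacity decrease instead
         of killing them, but fall back to killing after a timeout so that
         long-lived services cannot hold reclaimed resources forever -->
    <property>
      <name>yarn.scheduler.capacity.root.analytics.reclaim-policy</name>
      <value>drain</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.analytics.reclaim-drain-timeout-ms</name>
      <value>3600000</value>
    </property>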


