[ https://issues.apache.org/jira/browse/YARN-6191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15876567#comment-15876567 ]
Chris Douglas commented on YARN-6191:
-------------------------------------

bq. However there's still an issue because the preemption message is too general. For example, if the message says "going to preempt 60GB of resources" and the AM kills 10 reducers that are 6GB each on 6 different nodes, the RM can still kill the maps because the RM needed 60GB of contiguous resources.

I haven't followed the modifications to the preemption policy, so I don't know whether the AM will be selected as a victim again even after satisfying the contract (it should not be). If that is the current behavior, the preemption message should be expressive enough to encode it. If the RM will only accept 60GB of resources from a single node, then that can be encoded in a ResourceRequest in the preemption message.

Even if everything behaves badly, killing the reducers is still correct, right? If the job is still entitled to resources, then it should reschedule the map tasks before the reducers. There are still interleavings of requests that could result in the same behavior described in this JIRA, but they'd be stunningly unlucky.

bq. I still wonder about the logic of preferring lower container priorities regardless of how long they've been running. I'm not sure container priority always translates well to how important a container is to the application, and we might be better served by preferring to minimize total lost work regardless of container priority.

All of the options [~sunilg] suggests are fine heuristics, but the application has the best view of the tradeoffs. For example, a long-running container might be amortizing the cost of scheduling short-lived tasks, and might actually be cheap to kill. If the preemption message is not accurately reporting the contract the RM is enforcing, then we should absolutely fix that. But ultimately I think this is a MapReduce problem.

> CapacityScheduler preemption by container priority can be problematic for
> MapReduce
> -----------------------------------------------------------------------------------
>
>                 Key: YARN-6191
>                 URL: https://issues.apache.org/jira/browse/YARN-6191
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>            Reporter: Jason Lowe
>
> A MapReduce job with thousands of reducers and just a couple of maps left to
> go was running in a preemptable queue. Periodically other queues would get
> busy and the RM would preempt some resources from the job, but it _always_
> picked the job's map tasks first because they use the lowest-priority
> containers. Even though the reducers had a shorter running time, most were
> spared but the maps were always shot. Since the map tasks ran for a longer
> time than the preemption period, the job was in a perpetual preemption loop.
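As a minimal sketch of the "encode it in a ResourceRequest" idea mentioned above, written against the public YARN records (PreemptionMessage, PreemptionContract, PreemptionResourceRequest, ResourceRequest): the host name, the 60GB figure, and the priority below are hypothetical illustration values, not taken from the scheduler or the MR AM code.

{code:java}
import java.util.List;

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.PreemptionContract;
import org.apache.hadoop.yarn.api.records.PreemptionMessage;
import org.apache.hadoop.yarn.api.records.PreemptionResourceRequest;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

public class PreemptionContractSketch {

  /**
   * RM side (illustrative only): instead of a location-free "give back 60GB",
   * the negotiable contract could carry a node-specific ResourceRequest with
   * relaxLocality=false, i.e. "one 60GB container on host1234".
   */
  static ResourceRequest nodeLocalDemand() {
    return ResourceRequest.newInstance(
        Priority.newInstance(0),                // hypothetical priority
        "host1234.example.com",                 // the specific node the RM needs
        Resource.newInstance(60 * 1024, 1),     // 60GB, expressed in MB
        1,                                      // a single container
        false);                                 // do not relax to rack/any
  }

  /**
   * AM side (illustrative only): read the negotiable contract and note which
   * hosts the RM actually needs, so the AM can yield reducers on those hosts
   * rather than 6GB slices spread across six unrelated nodes.
   */
  static void inspectContract(AllocateResponse response) {
    PreemptionMessage msg = response.getPreemptionMessage();
    if (msg == null || msg.getContract() == null) {
      return; // nothing requested, or only the strict contract is present
    }
    PreemptionContract contract = msg.getContract();
    List<PreemptionResourceRequest> demands = contract.getResourceRequest();
    if (demands == null) {
      return;
    }
    for (PreemptionResourceRequest d : demands) {
      ResourceRequest rr = d.getResourceRequest();
      System.out.println("RM wants " + rr.getCapability()
          + " x" + rr.getNumContainers()
          + " at " + rr.getResourceName()); // "*" means any node
    }
  }
}
{code}

The node-specific resource name with relaxLocality=false is what would distinguish "one 60GB container on this node" from "any 60GB across the cluster"; without that distinction, killing ten 6GB reducers on six different nodes technically satisfies the contract yet leaves the maps exposed, as described in the comment.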