[ https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067240#comment-13067240 ]
Todd Lipcon commented on MAPREDUCE-2324:
----------------------------------------

Hey Bobby. Sorry, I was on vacation last week, so I was only partially keeping up with JIRA traffic.

My worry mostly has to do with this feature kicking in as a false positive. In general, false positives here are very expensive, whereas false negatives are not nearly as drastic. For example, imagine a cluster with 10 nodes and a couple of jobs submitted. One of the nodes is out of disk space. The first job, when submitted, takes up all the reduce slots on the first 9 nodes, but the 10th node is left empty since it's out of space. When the second job is submitted, all of the free reduce slots on the cluster are located on this remaining node. Every time the node heartbeats, the counter gets incremented for the queued-up job. After 10 heartbeats, the job will fail, even though it was just a single problematic node.

So, I think we do need to wait for a scheduling opportunity on at least some number of unique nodes before failing the job. It seems we could do this with a single HashSet per job: whenever any reduce task is successfully scheduled, the set is cleared. Whenever a job is given an opportunity to schedule reduces on a node but can't due to resource constraints, the node is added to the set. Once the size of the set eclipses some percentage of the nodes in the cluster, the job fails. This memory usage would be O(nodes*jobs) rather than O(nodes*tasks) -- and thus not too bad.
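To make the proposal concrete, here is a minimal sketch of the per-job tracking described above. The class and method names (ReduceStarvationTracker, recordFailedOpportunity, etc.) are illustrative assumptions, not actual Hadoop APIs, and the threshold value is just a placeholder:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical per-job tracker for the HashSet scheme sketched in the
// comment above; names and threshold are assumptions, not Hadoop code.
public class ReduceStarvationTracker {
    // Nodes that offered a scheduling opportunity the job couldn't use
    // due to resource constraints (e.g. insufficient local disk).
    private final Set<String> starvedNodes = new HashSet<>();

    // Fraction of cluster nodes that must be starved before failing
    // the job (e.g. 0.5 for "half the cluster").
    private final double failureThreshold;

    public ReduceStarvationTracker(double failureThreshold) {
        this.failureThreshold = failureThreshold;
    }

    // Any reduce task was successfully scheduled somewhere: reset,
    // since the job is clearly still making progress.
    public void recordSuccessfulSchedule() {
        starvedNodes.clear();
    }

    // The job had a chance to schedule a reduce on this node but
    // couldn't due to resource constraints. Returns true once enough
    // *unique* nodes have been seen that the job should be failed;
    // repeated heartbeats from one bad node only ever count once.
    public boolean recordFailedOpportunity(String nodeId, int clusterSize) {
        starvedNodes.add(nodeId);
        return starvedNodes.size() >= failureThreshold * clusterSize;
    }
}
```

Because the set holds unique node IDs, the single-bad-node scenario above never trips the threshold: ten heartbeats from the same out-of-space node leave the set at size 1, and the per-job memory stays proportional to the number of nodes rather than the number of tasks.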
> Job should fail if a reduce task can't be scheduled anywhere
> ------------------------------------------------------------
>
>                 Key: MAPREDUCE-2324
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.20.2, 0.20.205.0
>            Reporter: Todd Lipcon
>            Assignee: Robert Joseph Evans
>         Attachments: MR-2324-security-v1.txt
>
>
> If there's a reduce task that needs more disk space than is available on any
> mapred.local.dir in the cluster, that task will stay pending forever. For
> example, we produced this in a QA cluster by accidentally running terasort
> with one reducer - since no mapred.local.dir had 1T free, the job remained in
> pending state for several days. The reason for the "stuck" task wasn't clear
> from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs
> and finds that there isn't enough space.