[ https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067240#comment-13067240 ]

Todd Lipcon commented on MAPREDUCE-2324:
----------------------------------------

Hey Bobby. Sorry, was on vacation last week so only partially keeping up with 
JIRA traffic.

My worry mostly has to do with this feature kicking in on a false positive. In 
general, false positives here are very expensive, whereas false negatives are 
not nearly as drastic.

For example, imagine a cluster with 10 nodes and a couple of jobs submitted. 
One of the nodes is out of disk space. The first job, when submitted, takes up 
all the reduce slots on the first 9 nodes, but the 10th node is left empty 
since it's out of space. When the second job is submitted, all of the free 
reduce slots on the cluster are located on this remaining node. Every time 
that node heartbeats, the counter for the queued-up job gets incremented. 
After 10 heartbeats, the job will fail, even though the problem was just a 
single node.

So, I think we do need to wait for a scheduling opportunity on at least some 
number of unique nodes before failing the job. It seems we could do this with a 
single HashSet per job - whenever any reduce task is successfully scheduled, the 
set is cleared. Whenever a job is given an opportunity to schedule reduces on a 
node, but can't due to resource constraints, that node is added to the set. Once 
the size of the set exceeds some percentage of the nodes in the cluster, we fail 
the job. The memory usage would be O(nodes*jobs) rather than O(nodes*tasks) -- 
and thus not too bad.
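
Roughly, here's a sketch of that bookkeeping (names like FAILURE_FRACTION, 
reduceScheduled(), and reduceRejected() are made up for illustration; this 
isn't wired into the actual JobInProgress/JobTracker code):

    import java.util.HashSet;
    import java.util.Set;

    // Per-job bookkeeping sketch. The names here are hypothetical and
    // not the real JobInProgress API.
    class ReduceSchedulingTracker {
      // Fail the job once this fraction of cluster nodes has offered a
      // reduce slot that the job couldn't use.
      private static final double FAILURE_FRACTION = 0.95;

      // Trackers that offered this job a reduce slot but couldn't run
      // the task due to resource constraints (e.g. not enough local disk).
      private final Set<String> rejectingTrackers = new HashSet<String>();

      // A reduce for this job was successfully scheduled somewhere, so
      // earlier rejections no longer count against it.
      public synchronized void reduceScheduled() {
        rejectingTrackers.clear();
      }

      // A tracker heartbeated with a free reduce slot but the job
      // couldn't use it. Returns true if the job should now be failed.
      public synchronized boolean reduceRejected(String trackerName,
                                                 int clusterSize) {
        rejectingTrackers.add(trackerName);
        int threshold = Math.max(1,
            (int) Math.ceil(FAILURE_FRACTION * clusterSize));
        return rejectingTrackers.size() >= threshold;
      }
    }

In the 10-node example above, only the one full node would ever land in the 
set, so the job would keep waiting instead of failing.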

> Job should fail if a reduce task can't be scheduled anywhere
> ------------------------------------------------------------
>
>                 Key: MAPREDUCE-2324
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.20.2, 0.20.205.0
>            Reporter: Todd Lipcon
>            Assignee: Robert Joseph Evans
>         Attachments: MR-2324-security-v1.txt
>
>
> If there's a reduce task that needs more disk space than is available on any 
> mapred.local.dir in the cluster, that task will stay pending forever. For 
> example, we produced this in a QA cluster by accidentally running terasort 
> with one reducer - since no mapred.local.dir had 1T free, the job remained in 
> pending state for several days. The reason for the "stuck" task wasn't clear 
> from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs 
> and finds that there isn't enough space.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira