[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072828#comment-13072828
 ] 

Robert Joseph Evans commented on MAPREDUCE-2324:
------------------------------------------------

We have seen this in a few cases recently, but only a small handful.  The error 
message does not say which job cannot be scheduled, so we have to manually look 
at all of the jobs to try to determine which one is causing the problem.  At a 
minimum we need to update the error message to include the name of the job.
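
For the quick fix, something along these lines is all I have in mind (just a 
sketch; the variable names are illustrative and not the exact JobInProgress 
code):

    // Sketch of the improved warning.  Assumes we are at the point where the
    // scheduler has given up placing the reduce, so the job ID and the size
    // estimate are both in scope; the names here are illustrative only.
    LOG.warn("Cannot schedule reduce of job " + jobId
        + ": no tasktracker has enough free space in mapred.local.dir."
        + " Largest free space seen: " + maxAvailableSpace + " bytes,"
        + " estimated reduce input: " + estimatedReduceInputSize + " bytes.");

That alone would have saved us the manual hunt through every running job.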

The reason we are seeing this is mostly that mapreduce.reduce.input.limit is 
disabled on our clusters, because it was causing a lot of false positives: it 
killed jobs that would otherwise have passed, which we decided was worse than 
having to find and manually kill the occasional bad job.
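
To spell out where the false positives come from, the check that property 
drives is roughly of this shape (a rough sketch, not the exact JobInProgress 
code; the failJob helper and variable names are illustrative, and a 
non-positive value is assumed to disable the check):

    // Rough sketch of the limit check driven by mapreduce.reduce.input.limit.
    // A non-positive value (our setting) disables the check entirely.
    long limit = conf.getLong("mapreduce.reduce.input.limit", -1L);
    long estimatedReduceInput = estimatedTotalMapOutput / numReduceTasks;
    if (limit > 0 && estimatedReduceInput > limit) {
      // The estimate can be far too high for skewed or well-compressed map
      // output, which is exactly where our false positives came from.
      failJob("Estimated reduce input " + estimatedReduceInput
          + " exceeds mapreduce.reduce.input.limit " + limit);
    }

With the check off, a genuinely oversized reduce just sits in pending instead 
of failing, which is how we end up in the situation this JIRA describes.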

As for your size estimate, I contend that it is way off.  First of all, it is 
not 100 bytes per entry; it is probably closer to 8 bytes, because we are 
storing a reference to a string that all 5000 failing jobs would share, but 
let's assume it is 100 bytes.  For the 5000-job figure to be accurate, 5000 
map/reduce jobs, each with only 1 or 2 reducers and a HUGE amount of data going 
to those reducers, would all have to be launched and reach the reduce stage at 
almost exactly the same time.  I don't think that would ever happen in 
practice.  It would take someone maliciously trying to do this, and anyone 
launching 5000 jobs would probably be shut down by queue limits before it ever 
got to this point.  I looked at our clusters here and we have at most about 300 
jobs running at any point in time, even on our very large clusters.

So as far as an upper bound is concerned, a closer estimate would be 5000 nodes 
* 100 bytes * 600 jobs, which comes out to about 300MB in the absolute worst 
case.  I suspect the realistic worst case is a lot closer to (5000 nodes * 100 
bytes) + (8-byte string references * 5000 nodes * 80% of nodes visited before 
the job is killed * 400 jobs), which is about 13MB.

I agree that there is some uncertainty about how this might perform at large 
scale, so I will try to get some time on our large test cluster and run a 
gridmix simulation to verify that it does not cause problems at scale.  I will 
also submit a patch that just changes the error message to include the job ID, 
to make it simpler to manually kill off jobs that are stuck.

> Job should fail if a reduce task can't be scheduled anywhere
> ------------------------------------------------------------
>
>                 Key: MAPREDUCE-2324
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.20.2, 0.20.205.0
>            Reporter: Todd Lipcon
>            Assignee: Robert Joseph Evans
>             Fix For: 0.20.205.0
>
>         Attachments: MR-2324-security-v1.txt, MR-2324-security-v2.txt, 
> MR-2324-security-v3.patch
>
>
> If there's a reduce task that needs more disk space than is available on any 
> mapred.local.dir in the cluster, that task will stay pending forever. For 
> example, we produced this in a QA cluster by accidentally running terasort 
> with one reducer - since no mapred.local.dir had 1T free, the job remained in 
> pending state for several days. The reason for the "stuck" task wasn't clear 
> from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs 
> and finds that there isn't enough space.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
