[ 
https://issues.apache.org/jira/browse/HADOOP-960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12468751
 ] 

Andrew McNabb commented on HADOOP-960:
--------------------------------------

That's a great question.  I actually care about both the number and the size.  
I think I switched to talking about size because if you make the size even, the 
number issue will get fixed automatically.

When I say "make the size of splits even," I mean "have the same number of 
records in each split."  The reason is that there is a relatively small number 
of records but they each take a long time to run.  Without a specific attempt 
at making the splits even, load balancing suffers.  I think that you'd run into 
this issue with most MapReduce programs that aren't text processors.
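
As a rough illustration of "the same number of records in each split," here is
a minimal Python sketch. The function name `split_evenly` is hypothetical and
is not part of Hadoop; it only shows the arithmetic of dividing N records into
K splits whose sizes differ by at most one record.

```python
def split_evenly(records, num_splits):
    """Divide records into num_splits contiguous groups of near-equal size.

    Hypothetical helper for illustration only; this is not a Hadoop API.
    """
    if num_splits <= 0:
        raise ValueError("num_splits must be positive")
    # Every split gets `base` records; the first `extra` splits get one more.
    base, extra = divmod(len(records), num_splits)
    splits = []
    start = 0
    for i in range(num_splits):
        size = base + (1 if i < extra else 0)
        splits.append(records[start:start + size])
        start += size
    return splits
```

With 1,000 records and 256 splits, every split would hold either 3 or 4
records, so no single map task carries much more work than another.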

I currently have jobs with 1,000 records that each take 2 minutes to map.  If 
there are 1,000 map tasks, there is too much overhead from distributing the 
tasks.  I've tried running 256 map tasks on 256 processors.  In this case, if 
the number of reduce tasks isn't a power of 2, it creates more tasks than the 
number of processors, and the job takes a long time to run.

Again, I agree that the current behavior works well in many cases by default.  
I also think it would be nice if there were a few more knobs.  I don't expect 
that this would be the highest-priority feature request, but I think it would 
be generally useful.

Thanks.

> Incorrect number of map tasks when there are multiple input files
> -----------------------------------------------------------------
>
>                 Key: HADOOP-960
>                 URL: https://issues.apache.org/jira/browse/HADOOP-960
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: documentation
>    Affects Versions: 0.10.1
>            Reporter: Andrew McNabb
>            Priority: Minor
>
> This problem happens with hadoop-streaming and possibly elsewhere.  If there 
> are 5 input files, it will create 130 map tasks, even if 
> mapred.map.tasks=128.  The number of map tasks is incorrectly set to a 
> multiple of the number of files.  (I wrote a much more complete bug report, 
> but Jira lost it when it had an error, so I'm not in the mood to write it all 
> again)
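
The figure of 130 in the description above is consistent with each input file
being split on its own against a shared target split size. The sketch below is
a simplified model under that assumption, not Hadoop's actual code; the
function name and the equal file sizes are made up for illustration.

```python
def modeled_map_task_count(file_sizes, requested_tasks):
    """Count splits when each file is cut separately (simplified model).

    Assumes a target split size of total_bytes / requested_tasks and that
    no split spans two files, so each file's count is rounded up.
    """
    total = sum(file_sizes)
    count = 0
    for size in file_sizes:
        # Integer form of ceil(size / (total / requested_tasks)).
        count += -(-size * requested_tasks // total)
    return count


# Five equal-sized files (sizes are arbitrary) with 128 requested tasks.
print(modeled_map_task_count([100] * 5, 128))  # prints 130
```

In this model each of the 5 files needs ceil(128 / 5) = 26 splits, which gives
5 * 26 = 130 map tasks rather than the requested 128.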

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
