Is there a way to force an even spread of data?

On Fri, Mar 22, 2013 at 2:14 PM, jeremy p <athomewithagroove...@gmail.com> wrote:

> Apologies -- I don't understand this advice: "If evenness is the goal, you
> can also write your own input format that returns empty locations for each
> split and read the small files in the map task directly."  How would
> manually reading the files into the map task help me?  Hadoop would still
> spawn multiple mappers per machine, which is what I'm trying to avoid.  I'm
> trying to get one mapper per machine for this job.
>
> --Jeremy
>
>
> On Thu, Mar 21, 2013 at 11:44 AM, Luke Lu <l...@apache.org> wrote:
>
>>
>>> Short version: let's say you have 20 nodes, and each node has 10 mapper
>>> slots.  You start a job with 20 very small input files.  How is the work
>>> distributed to the cluster?  Will it be even, with each node spawning one
>>> mapper task?  Is there any way of predicting or controlling how the work
>>> will be distributed?
>>
>>
>> You're right in expecting that the tasks of the small job will likely be
>> evenly distributed among the 20 nodes, provided the 20 files are evenly
>> distributed among the nodes and there are free slots on every node.
>>
>>
>>> Long version: My cluster is currently used for two different jobs.  The
>>> cluster is currently optimized for Job A, so each node has a maximum of 18
>>> mapper slots.  However, I also need to run Job B.  Job B is VERY
>>> cpu-intensive, so we really only want one mapper to run on a node at any
>>> given time.  I've done a bunch of research, and it doesn't seem like Hadoop
>>> gives you any way to set the maximum number of mappers per node on a
>>> per-job basis.  I'm at my wit's end here, and considering some rather
>>> egregious workarounds.  If you can think of anything that can help me, I'd
>>> very much appreciate it.
>>>
>>
>> Are you seeing that Job B tasks are not being evenly distributed to each
>> node? You can check the locations of the files with hadoop fsck.
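>>
>> For example (the path here is just a placeholder):
>>
>>     hadoop fsck /user/jeremy/jobB/input -files -blocks -locations
>>
>> The -files -blocks -locations flags list each file's blocks and the
>> datanodes that currently hold them.
>>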
>> If evenness is the goal, you can also write your own input format that
>> returns empty locations for each split and read the small files in the map
>> task directly.
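>>
>> A rough sketch of that idea (untested; uses the new mapreduce API, and the
>> class name is just illustrative). Wrap each split so getLocations() reports
>> no hosts, which removes the locality hint and lets the scheduler spread the
>> map tasks over whatever slots are free:
>>
>>     import java.io.IOException;
>>     import java.util.ArrayList;
>>     import java.util.List;
>>
>>     import org.apache.hadoop.mapreduce.InputSplit;
>>     import org.apache.hadoop.mapreduce.JobContext;
>>     import org.apache.hadoop.mapreduce.lib.input.FileSplit;
>>     import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
>>
>>     public class NoLocalityTextInputFormat extends TextInputFormat {
>>       @Override
>>       public List<InputSplit> getSplits(JobContext job) throws IOException {
>>         List<InputSplit> splits = new ArrayList<InputSplit>();
>>         for (InputSplit split : super.getSplits(job)) {
>>           FileSplit fs = (FileSplit) split;
>>           // Same file range, but with an empty host array: no preferred
>>           // nodes, so placement is driven by free slots instead.
>>           splits.add(new FileSplit(fs.getPath(), fs.getStart(),
>>               fs.getLength(), new String[0]));
>>         }
>>         return splits;
>>       }
>>     }
>>
>> Set it on the job with job.setInputFormatClass(NoLocalityTextInputFormat.class).
>> The map tasks then read their (small) files over HDFS as usual; since the
>> files are tiny, losing data locality costs almost nothing.
>>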
>> If you're using Hadoop 1.0.x and the fair scheduler, you might need to set
>> mapred.fairscheduler.assignmultiple to false in mapred-site.xml (JT restart
>> required) to work around a bug in the fair scheduler (MAPREDUCE-2905) that
>> causes tasks to be assigned unevenly. The bug is fixed in Hadoop 1.1+.
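>>
>> That is, in mapred-site.xml on the JobTracker:
>>
>>     <property>
>>       <name>mapred.fairscheduler.assignmultiple</name>
>>       <value>false</value>
>>     </property>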
>>
>> __Luke
>>
>
>
