Is there a way to force an even spread of data?

On Fri, Mar 22, 2013 at 2:14 PM, jeremy p <athomewithagroove...@gmail.com> wrote:
> Apologies -- I don't understand this advice: "If evenness is the goal,
> you can also write your own input format that returns empty locations
> for each split and read the small files in the map task directly." How
> would manually reading the files into the map task help me? Hadoop would
> still spawn multiple mappers per machine, which is what I'm trying to
> avoid. I'm trying to get one mapper per machine for this job.
>
> --Jeremy
>
> On Thu, Mar 21, 2013 at 11:44 AM, Luke Lu <l...@apache.org> wrote:
>
>>> Short version: let's say you have 20 nodes, and each node has 10
>>> mapper slots. You start a job with 20 very small input files. How is
>>> the work distributed to the cluster? Will it be even, with each node
>>> spawning one mapper task? Is there any way of predicting or
>>> controlling how the work will be distributed?
>>
>> You're right in expecting that the tasks of the small job will likely
>> be evenly distributed among the 20 nodes, if the 20 files are evenly
>> distributed among the nodes and there are free slots on every node.
>>
>>> Long version: My cluster is currently used for two different jobs.
>>> The cluster is currently optimized for Job A, so each node has a
>>> maximum of 18 mapper slots. However, I also need to run Job B. Job B
>>> is VERY cpu-intensive, so we really only want one mapper to run on a
>>> node at any given time. I've done a bunch of research, and it doesn't
>>> seem like Hadoop gives you any way to set the maximum number of
>>> mappers per node on a per-job basis. I'm at my wit's end here, and
>>> considering some rather egregious workarounds. If you can think of
>>> anything that can help me, I'd very much appreciate it.
>>
>> Are you seeing that Job B tasks are not being evenly distributed to
>> each node? You can check the locations of the files with hadoop fsck.
>> If evenness is the goal, you can also write your own input format that
>> returns empty locations for each split and read the small files in the
>> map task directly. If you're using Hadoop 1.0.x and the fair scheduler,
>> you might need to set mapred.fairscheduler.assignmultiple to false in
>> mapred-site.xml (JT restart required) to work around a bug in the fair
>> scheduler (MAPREDUCE-2905) that causes tasks to be assigned unevenly.
>> The bug is fixed in Hadoop 1.1+.
>>
>> __Luke
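
To check where the input blocks actually live, as Luke suggests, fsck can
list each file's blocks and their host locations. The path below is a
placeholder for the job's real input directory:

    hadoop fsck /path/to/job-b/input -files -blocks -locations

If the blocks turn out to be concentrated on a few nodes, that by itself
can explain uneven task placement, since the scheduler prefers data-local
assignments.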
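A minimal sketch of the empty-locations idea, assuming the new-API
TextInputFormat as a base (the class name LocationlessTextInputFormat is
made up for illustration): by reporting no hosts for each split, no node
is "data-local" for any task, so the scheduler hands tasks out as slots
free up rather than steering them to the nodes that hold the blocks. Note
this only covers the empty-locations half of the suggestion -- the splits
still carry the file path, offset, and length, so each map task reads its
file over HDFS as usual -- and it spreads tasks; it does not by itself
guarantee one mapper per node.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Hypothetical class, not part of Hadoop: reuses TextInputFormat's
    // record reading but reports an empty host list for every split, so
    // the scheduler sees no data-locality preference.
    public class LocationlessTextInputFormat extends TextInputFormat {

        private static final String[] NO_HOSTS = new String[0];

        @Override
        public List<InputSplit> getSplits(JobContext job) throws IOException {
            List<InputSplit> withLocations = super.getSplits(job);
            List<InputSplit> withoutLocations =
                new ArrayList<InputSplit>(withLocations.size());
            for (InputSplit split : withLocations) {
                FileSplit fs = (FileSplit) split;
                // Same file, offset, and length -- only the hosts change.
                withoutLocations.add(new FileSplit(
                    fs.getPath(), fs.getStart(), fs.getLength(), NO_HOSTS));
            }
            return withoutLocations;
        }
    }

Wiring it into a new-API job is then the usual
job.setInputFormatClass(LocationlessTextInputFormat.class).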
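The fair-scheduler workaround Luke mentions is a single property in
mapred-site.xml, followed by a JobTracker restart:

    <!-- Work around MAPREDUCE-2905 on Hadoop 1.0.x with the fair
         scheduler: assign at most one task per heartbeat, which keeps
         tasks from piling onto the first nodes that report in. -->
    <property>
      <name>mapred.fairscheduler.assignmultiple</name>
      <value>false</value>
    </property>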