Is there a way to force an even spread of data?
On Fri, Mar 22, 2013 at 2:14 PM, jeremy p wrote:
Apologies -- I don't understand this advice: "If evenness is the goal,
you can also write your own input format that returns empty locations for
each split and read the small files in the map task directly." How would
manually reading the files into the map task help me? Hadoop would still
spawn multiple map tasks on the same node.
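For anyone following along: my reading of that suggestion is that the custom input format strips the locality hints from each split, so the scheduler has no host preference and is free to place the splits evenly, and each map task then opens its file through the FileSystem API itself rather than relying on a data-local read. A rough pseudocode sketch (class and method names are illustrative, not from the thread, and this is untested):

```
// pseudocode sketch -- illustrative names, not a tested implementation
class EvenSpreadInputFormat extends FileInputFormat {
  getSplits(job) {
    splits = super.getSplits(job)
    for (split in splits) {
      split.locations = []   // report no preferred hosts -> no locality pull
    }
    return splits
  }
  // the record reader (or the map task itself) then opens the small file
  // directly via the FileSystem API
}
```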
> Short version : let's say you have 20 nodes, and each node has 10 mapper
> slots. You start a job with 20 very small input files. How is the work
> distributed to the cluster? Will it be even, with each node spawning one
> mapper task? Is there any way of predicting or controlling how the work
> is distributed?
Correction to my previous post: I completely missed
https://issues.apache.org/jira/browse/MAPREDUCE-4520, which covers the
MR config end already in 2.0.3. My bad :)
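For anyone finding this thread later: if I'm reading MAPREDUCE-4520 right, the MR-side knobs it adds are per-job properties along these lines (property names as they appear in recent 2.x releases, so verify them against your exact version):

```xml
<!-- job-level CPU requests; names per recent 2.x, double-check for 2.0.3 -->
<property>
  <name>mapreduce.map.cpu.vcores</name>
  <value>16</value> <!-- request a whole node's cores so only one map fits -->
</property>
<property>
  <name>mapreduce.reduce.cpu.vcores</name>
  <value>1</value>
</property>
```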
On Wed, Mar 20, 2013 at 5:34 AM, Harsh J wrote:
You can leverage YARN's CPU core scheduling feature for this purpose.
It was added to the 2.0.3 release via
https://issues.apache.org/jira/browse/YARN-2 and seems to fit your
need exactly. However, looking at that patch, it seems like
param-config support for MR apps wasn't added by this, so it may
not be usable from MR jobs directly yet.
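To sketch how the YARN-2 feature gets wired up on the cluster side, assuming the property name used in recent 2.x releases (the 2.0.3 name may differ slightly, so double-check):

```xml
<!-- yarn-site.xml, per node: advertise the node's CPU capacity to the scheduler -->
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>16</value>
</property>
```

A job that then requests 16 vcores per container can get at most one container per node, which is the "one heavy task per machine" behavior asked about below.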
The job we need to run executes some third-party code that utilizes
multiple cores. The only way the job will get done in a timely fashion is
if we give it all the cores available on the machine. This is not a task
that can be split up.
Yes, I know, it's not ideal, but this is the situation I have to work with.
This may not be what you were looking for, but I was just curious when you
mentioned that you would want to run only one map task because it was
cpu intensive. Well, map tasks are supposed to be cpu intensive, aren't
they? If the maximum map slots are 10, then that would mean you have close
to 10 cpu-intensive tasks running on a node at once.
Thank you for your help.
We're using MRv1. I've tried setting mapred.tasktracker.map.tasks.maximum
and mapred.map.tasks, and neither one helped me at all.
Per-job control is definitely what I need. I need to be able to say, "For
Job A, only use one mapper per node, but for Job B, use 16 mappers per
node."
Which version of Hadoop are you using? MRv1 or MRv2 (YARN)?
For MRv2 (YARN), you can pretty much achieve this using:
yarn.nodemanager.resource.memory-mb (system-wide setting)
and
mapreduce.map.memory.mb (job-level setting)
e.g. if yarn.nodemanager.resource.memory-mb=100
and mapreduce.map.memory.mb=100, then each node will run at most one map
task at a time.
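To make the per-job control concrete, here is a sketch of how the two jobs described earlier in the thread might be submitted under YARN, assuming nodes configured with yarn.nodemanager.resource.memory-mb=16384 (the jar and class names are placeholders, and -D passing requires the job to use GenericOptionsParser/Tool):

```
# Job A: request a whole node's memory per map task -> one mapper per node
hadoop jar myjob.jar JobA -Dmapreduce.map.memory.mb=16384 <in> <out>

# Job B: request 1/16 of a node's memory per map task -> up to 16 mappers per node
hadoop jar myjob.jar JobB -Dmapreduce.map.memory.mb=1024 <in> <out>
```

The scheduler packs containers by requested memory, so the per-job memory request becomes an indirect per-node concurrency knob even before CPU scheduling is available to MR jobs.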