Re: What happens when you have fewer input files than mapper slots?

2013-03-22 Thread jeremy p
Is there a way to force an even spread of data? On Fri, Mar 22, 2013 at 2:14 PM, jeremy p wrote: > Apologies -- I don't understand this advice: "If the evenness is the goal > you can also write your own input format that return empty locations for > each split and read the small files in map tas

Re: What happens when you have fewer input files than mapper slots?

2013-03-22 Thread jeremy p
Apologies -- I don't understand this advice: "If the evenness is the goal you can also write your own input format that return empty locations for each split and read the small files in map task directly." How would manually reading the files into the map task help me? Hadoop would still spawn m

Re: What happens when you have fewer input files than mapper slots?

2013-03-21 Thread Luke Lu
> Short version : let's say you have 20 nodes, and each node has 10 mapper > slots. You start a job with 20 very small input files. How is the work > distributed to the cluster? Will it be even, with each node spawning one > mapper task? Is there any way of predicting or controlling how the wor

Re: What happens when you have fewer input files than mapper slots?

2013-03-19 Thread Harsh J
Correction to my previous post: I completely missed https://issues.apache.org/jira/browse/MAPREDUCE-4520 which covers the MR config ends already in 2.0.3. My bad :) On Wed, Mar 20, 2013 at 5:34 AM, Harsh J wrote: > You can leverage YARN's CPU Core scheduling feature for this purpose. > It was add

Re: What happens when you have fewer input files than mapper slots?

2013-03-19 Thread Harsh J
You can leverage YARN's CPU Core scheduling feature for this purpose. It was added to the 2.0.3 release via https://issues.apache.org/jira/browse/YARN-2 and seems to fit your need exactly. However, looking at that patch, it seems like param-config support for MR apps wasn't added by this so it may

Re: What happens when you have fewer input files than mapper slots?

2013-03-19 Thread jeremy p
The job we need to run executes some third-party code that utilizes multiple cores. The only way the job will get done in a timely fashion is if we give it all the cores available on the machine. This is not a task that can be split up. Yes, I know, it's not ideal, but this is the situation I ha

Re: What happens when you have fewer input files than mapper slots?

2013-03-19 Thread hari
This may not be what you were looking for, but I was just curious when you mentioned that you would want to run only one map task because it was CPU intensive. Well, map tasks are supposed to be CPU intensive, aren't they? If the maximum map slots are 10 then that would mean you have close t

Re: What happens when you have fewer input files than mapper slots?

2013-03-19 Thread jeremy p
Thank you for your help. We're using MRv1. I've tried setting mapred.tasktracker.map.tasks.maximum and mapred.map.tasks, and neither one helped me at all. Per-job control is definitely what I need. I need to be able to say, "For Job A, only use one mapper per node, but for Job B, use 16 mappers

Re: What happens when you have fewer input files than mapper slots?

2013-03-19 Thread Rahul Jain
Which version of Hadoop are you using: MRv1 or MRv2 (YARN)? For MRv2 (YARN), you can pretty much achieve this using yarn.nodemanager.resource.memory-mb (system-wide setting) and mapreduce.map.memory.mb (job-level setting), e.g. if yarn.nodemanager.resource.memory-mb=100 and mapreduce.map.mem
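
The arithmetic behind the memory-based approach above can be sketched roughly as follows. This is a simplified illustration, not how the YARN scheduler is actually implemented: it assumes memory is the only limiting resource and ignores scheduler allocation rounding, virtual cores, and the container used by the application master. The function name and the numeric values are hypothetical and are not taken from the thread.

```python
def max_concurrent_maps(node_memory_mb: int, map_memory_mb: int) -> int:
    """Rough upper bound on map containers that fit on one node at a time.

    node_memory_mb stands in for yarn.nodemanager.resource.memory-mb
    (system-wide setting); map_memory_mb stands in for
    mapreduce.map.memory.mb (job-level setting).
    Assumption: memory is the only constraint.
    """
    if node_memory_mb <= 0 or map_memory_mb <= 0:
        raise ValueError("memory sizes must be positive")
    # Only whole containers can be allocated, so use floor division.
    return node_memory_mb // map_memory_mb


if __name__ == "__main__":
    node_mb = 8192  # hypothetical node capacity

    # "Job A": request the whole node per map task, so one mapper per node.
    print(max_concurrent_maps(node_mb, 8192))

    # "Job B": request small containers, so many mappers per node.
    print(max_concurrent_maps(node_mb, 512))
```

Under these assumptions, a job that requests the full node memory per map task gets at most one mapper per node, while a job that requests a small fraction gets many, which is the per-job control asked about earlier in the thread.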