I'm unfamiliar with EMR myself (perhaps the question fits EMR's own boards) but here's my take anyway:
On Mon, Jan 28, 2013 at 9:24 PM, Marcelo Elias Del Valle <mvall...@gmail.com> wrote:
> Hello,
>
> I am using hadoop with TextInputFormat, a mapper and no reducers. I am
> running my jobs at Amazon EMR. When I run my job, I set both following
> options:
> -s,mapred.tasktracker.map.tasks.maximum=10
> -jobconf,mapred.map.tasks=10

The first property you've given refers to a single TaskTracker's maximum concurrency. This means that if you have 4 TaskTrackers, each with this property set, then you have 40 concurrent map slots available in total - perhaps more than you intended to configure? Again, this may be EMR-specific and I may be wrong, since I haven't seen anyone pass this via the CLI before; it is generally configured at the service level.

The second property is more relevant to your problem. MR typically decides the number of map tasks it requires for a job based on the input size. In the stable API (the org.apache.hadoop.mapred one), mapred.map.tasks can be passed the way you seem to be passing it above, and the input format takes it as a 'hint' when deciding how many map splits to carve out of the input, even if the input isn't large enough to necessitate that many maps. The new API, however, accepts no such config-based hint (such logic changes need to be made in the program's own code). So depending on your implementation of the job here, you may or may not see it take effect.

Hope this helps.

> When I run my job with just 1 instance, I see it only creates 1 mapper.
> When I run my job with 5 instances (1 master and 4 cores), I can see only 2
> mapper slots are used and 6 stay open.

Perhaps the job itself launched with 2 total map tasks? You can check this on the JobTracker UI or whatever EMR offers as a job viewer.

> I am trying to figure why I am not being able to run more mappers in
> parallel.
> When I see the logs, I find some messages like these:
>
> INFO org.apache.hadoop.mapred.ReduceTask (main):
> attempt_201301281437_0001_r_000003_0 Scheduled 0 outputs (0 slow hosts and 0
> dup hosts)
> org.apache.hadoop.mapred.ReduceTask (main):
> attempt_201301281437_0001_r_000003_0 Need another 1 map output(s) where 0 is
> already in progress

This is a typical waiting-reduce-task log; what are you asking here specifically?

-- 
Harsh J
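P.S. On the first property: it is normally set in mapred-site.xml on each TaskTracker node rather than passed per job. A sketch, with an illustrative value:

```xml
<!-- mapred-site.xml on each TaskTracker node (value illustrative) -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>10</value>
</property>
```

On EMR you would typically apply this via a bootstrap action at cluster launch, since it takes effect when the TaskTracker daemon starts.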
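P.P.S. To see why mapred.map.tasks is only a hint, here is a simplified Python model of the split-size math in the stable API's FileInputFormat.getSplits() (Hadoop 1.x). The function names are mine and this is a sketch, not the actual implementation, but the clamping and rounding follow the same shape:

```python
# Model of how the old-API FileInputFormat turns the mapred.map.tasks
# hint into a split count. Simplified: assumes a single input file and
# ignores per-host locality bookkeeping.
SPLIT_SLOP = 1.1  # a split may run up to 10% over before a new one starts


def split_size(total_size, num_maps_hint, block_size, min_split_size=1):
    # goalSize = totalSize / numMapsHint, clamped between the minimum
    # split size and the block size.
    goal_size = total_size // max(num_maps_hint, 1)
    return max(min_split_size, min(goal_size, block_size))


def num_splits(total_size, num_maps_hint, block_size, min_split_size=1):
    size = split_size(total_size, num_maps_hint, block_size, min_split_size)
    count, remaining = 0, total_size
    while remaining / size > SPLIT_SLOP:
        count += 1
        remaining -= size
    if remaining > 0:
        count += 1  # the tail becomes the final split
    return count


MB = 1024 * 1024
# 256 MB of input on 128 MB blocks:
print(num_splits(256 * MB, 2, 128 * MB))   # default hint of 2 -> 2 maps
print(num_splits(256 * MB, 10, 128 * MB))  # hint of 10 -> 10 maps
```

Note the hint can raise the map count by shrinking the goal size, but the block size and minimum split size still bound the result, and an input format in the new API is free to ignore the property entirely.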