What is your use case? Why would you want to use only 5 mappers rather than all 10 task trackers?

"If an individual file is so large that it will affect seek time it will be split to several Splits" (http://wiki.apache.org/hadoop/HadoopMapReduce)

"if a split span over more than one dfs block, you lose the data locality scheduling benefits." (https://issues.apache.org/jira/browse/HADOOP-2560)

On Tue, Jul 26, 2011 at 12:53 AM, Anfernee Xu <anfernee...@gmail.com> wrote:
> I have a generic question about how the number of mapper tasks is
> calculated, as far as I know, the number is primarily based on the number of
> splits, say if I have 5 splits and I have 10 tasktracker running in the
> cluster, I will have 5 mapper tasks running in my MR job, right?
>
> But what I found is that sometimes if the input is huge(5 GB), at this
> point I still have 5 splits which is on purpose, but I got more than 40
> mapper tasks running, how this happens? Now, if I compress the huge input to
> smaller size, the number of mapper got back to 5 again, is something tricky
> happens here relevant to DFS block location of the input?
>
> BTW, our InputFormat is a special kind of FileInputFormat which does not
> split each file, whereas we copy each file to DFS and the location of the
> file on DFS will be the input key to mapper task.
>
> --
> --Anfernee
>
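
To make the wiki quote at the top of this reply concrete, here is a rough, self-contained sketch of the split arithmetic. It is not Hadoop code: the class and method names are made up for illustration, the five 1 GB files and the 64 MB / 128 MB block sizes are assumed values (the thread only says the input is about 5 GB), and the 1.1 slack factor is my recollection of what the stock FileInputFormat.getSplits does. If the job really falls back to per-block splitting for large uncompressed files, numbers like these would explain seeing 40 or more mappers instead of 5.

```java
public class SplitCountSketch {
    // Slack factor: the last split may be up to 10% larger than splitSize
    // (assumed to match the stock FileInputFormat behaviour).
    static final double SPLIT_SLOP = 1.1;

    // Hypothetical helper: number of splits for ONE file of fileLength bytes.
    // Assumes fileLength > 0 and splitSize > 0.
    static long countSplits(long fileLength, long splitSize, boolean splittable) {
        if (!splittable) {
            return 1; // a non-splittable file is always handed to a single mapper
        }
        long count = 0;
        long remaining = fileLength;
        // Carve off full-size splits while clearly more than one split remains.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            count++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            count++; // the tail (possibly slightly oversized) becomes the last split
        }
        return count;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long gb = 1024L * mb;
        int files = 5; // assumed: five input files of 1 GB each

        System.out.println("splittable, 128 MB blocks: " + files * countSplits(gb, 128 * mb, true));
        System.out.println("splittable, 64 MB blocks:  " + files * countSplits(gb, 64 * mb, true));
        System.out.println("not splittable:            " + files * countSplits(gb, 128 * mb, false));
    }
}
```

If your custom InputFormat is supposed to yield exactly one split per file, it is worth checking that its isSplitable override is really being called with the signature your Hadoop version expects; a mismatch would silently fall back to the default splitting behaviour.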