Yes, that is correct. It is indeed looking at the data size. Please read through what I wrote again, particularly the part about files getting broken into chunks (aka "blocks"). If you want fewer map tasks, store your files in HDFS with a larger block size. They will then be stored in fewer blocks/chunks, which will result in fewer map tasks per job.
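
For illustration, a minimal sketch of one way to do that with the Hadoop FileSystem Java API - the path and the 256 MB block size below are just example values, not anything from your setup. The create() overload that takes an explicit block size lets you override the cluster default for that one file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutWithLargerBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Example path and block size - adjust for your cluster.
        Path dst = new Path("/data/input/huge-file.txt");
        long blockSize = 256L * 1024 * 1024;  // 256 MB instead of the 64 MB default
        int bufferSize = conf.getInt("io.file.buffer.size", 4096);
        short replication = fs.getDefaultReplication();

        // Writing through create() with an explicit blockSize stores the file
        // in larger (and therefore fewer) blocks, so a job reading it later
        // gets fewer input splits and thus fewer map tasks.
        FSDataOutputStream out = fs.create(dst, true, bufferSize, replication, blockSize);
        // ... write the file contents to 'out' ...
        out.close();
    }
}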

DR

On 06/20/2011 03:44 PM, praveen.pe...@nokia.com wrote:
Hi David, I think Hadoop is looking at the data size, not the no. of
input files. If I pass in .gz files, then yes, Hadoop chooses 1 map
task per file, but if I pass in a HUGE text file, or the same file
split into 10 files, it chooses the same no. of map tasks (191 in my
case).

Thanks Praveen

-----Original Message-----
From: ext David Rosenstrauch [mailto:dar...@darose.net]
Sent: Monday, June 20, 2011 3:39 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: controlling no. of mapper tasks

On 06/20/2011 03:24 PM, praveen.pe...@nokia.com wrote:
Hi there, I know the client can set "mapred.reduce.tasks" to specify
the no. of reduce tasks and Hadoop honours it, but "mapred.map.tasks"
is not honoured by Hadoop. Is there any way to control the number of
map tasks? What I noticed is that Hadoop is choosing too many mappers,
and there is extra overhead added because of this. For example, when I
have only 10 map tasks, my job finishes faster than when Hadoop
chooses 191 map tasks. I have a 5-slave cluster and 10 tasks can run
in parallel. I want to set both map and reduce tasks to 10 for maximum
efficiency.
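
(For reference, a minimal sketch of how these settings look with the old JobConf API; the class name and the counts here are just illustrative. setNumReduceTasks() is honoured exactly, while setNumMapTasks() is only a hint - the real number of map tasks comes from the input splits, as discussed below.)

import org.apache.hadoop.mapred.JobConf;

public class JobSetup {
    public static JobConf configure() {
        JobConf jobConf = new JobConf(JobSetup.class);
        jobConf.setNumReduceTasks(10);  // honoured: the job runs exactly 10 reducers
        jobConf.setNumMapTasks(10);     // only a hint; the actual count is derived
                                        // from the InputFormat's splits
        return jobConf;
    }
}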

Thanks Praveen

The number of map tasks is determined dynamically based on the number
of input chunks you have.  If you want fewer map tasks, either pass
fewer input files to your job, or store the files using a larger chunk
size (which will result in fewer chunks per file, and thus fewer
chunks total).
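
As a rough back-of-the-envelope sketch (the 12 GB input size and the 64 MB default block size below are assumptions for illustration, not numbers taken from your job), the relationship looks like this:

public class SplitEstimate {
    public static void main(String[] args) {
        // Rough estimate only: FileInputFormat creates roughly one split per HDFS block.
        long fileSize  = 12L * 1024 * 1024 * 1024;  // assume ~12 GB of input data
        long blockSize = 64L * 1024 * 1024;         // old default HDFS block size (64 MB)
        long mapTasks  = (fileSize + blockSize - 1) / blockSize;  // ceiling -> 192 map tasks
        System.out.println(mapTasks);
        // With a 256 MB block size, the same data would need only ~48 map tasks.
    }
}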

HTH,

DR
