Yes, please open a JIRA for this. We should ensure that
avgLengthPerSplit in MultiFileInputFormat does not exceed the default
file block size. Note, however, that unlike FileInputFormat, the files
in a split may each come from a different block.
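A minimal sketch of the clamp being proposed here (the class and method names are hypothetical, not the actual Hadoop source): raise the requested split count until the average length per split no longer exceeds the block size.

```java
// Hypothetical sketch of the proposed default behaviour; not actual Hadoop code.
public class SplitClamp {
    /**
     * Return a split count such that totalSize / splits (the average
     * length per split) does not exceed blockSize.
     */
    public static int clampNumSplits(long totalSize, long blockSize,
                                     int requestedSplits) {
        // Minimum splits needed so that avgLengthPerSplit <= blockSize
        // (ceiling division).
        long minSplits = (totalSize + blockSize - 1) / blockSize;
        return (int) Math.max(requestedSplits, minSplits);
    }
}
```

With a 64 MB block size, 640 MB of input asked to fit in 2 splits would be raised to 10 splits, while a tiny input keeps whatever split count was requested.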
Goel, Ankur wrote:
In this case I have to compute the number of map tasks in the
application as (totalSize / blockSize), which is what I am doing as a
workaround.
I think this should be the default behaviour in MultiFileInputFormat.
Should a JIRA be opened for this?
-Ankur
-----Original Message-----
From: Enis Soztutar [mailto:[EMAIL PROTECTED]
Sent: Friday, July 11, 2008 7:21 PM
To: core-user@hadoop.apache.org
Subject: Re: MultiFileInputFormat - Not enough mappers
MultiFileSplit currently does not support automatic map task count
computation. You can manually set the number of maps via
JobConf#setNumMapTasks() or via the command-line argument -D
mapred.map.tasks=<number>
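A sketch of the programmatic option against the old "mapred" API of that era (the driver class, input path, and task count here are hypothetical; MultiFileWordCount.MyInputFormat is from the Hadoop examples package):

```java
// Sketch only: requires a Hadoop cluster/classpath to actually run.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.examples.MultiFileWordCount;

JobConf conf = new JobConf(MyDriver.class);            // hypothetical driver class
conf.setInputFormat(MultiFileWordCount.MyInputFormat.class);
FileInputFormat.addInputPath(conf, new Path("/input")); // hypothetical path
conf.setNumMapTasks(16);                                // explicit map task count
```

The -D mapred.map.tasks=<number> form has the same effect when the driver uses ToolRunner to pick up generic options.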
Goel, Ankur wrote:
Hi Folks,
I am using Hadoop to process some temporal data which is
split into a lot of small files (~3-4 MB). Using TextInputFormat
resulted in too many mappers (1 per file), creating a lot of overhead,
so I switched to MultiFileInputFormat
(MultiFileWordCount.MyInputFormat), which resulted in just 1 mapper.
I was hoping to set the number of mappers to 1 so that Hadoop
automatically takes care of generating the right number of map tasks.
Looks like when using MultiFileInputFormat one has to rely on the
application to specify the right number of mappers, or am I missing
something? Please advise.
Thanks
-Ankur