Hi,
The number of mappers initialized depends largely on your input format (specifically, its getSplits() method). Almost all input formats that ship with Hadoop derive from FileInputFormat, hence the "1 mapper per file block" notion (it is actually 1 mapper per split).
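To make that concrete, here is a rough sketch (not the actual Hadoop source; the class name and the sample sizes are just for illustration) of how FileInputFormat sizes its splits in the newer mapreduce API:

// Sketch of FileInputFormat's split sizing; the real getSplits() also
// handles compression, empty files, and a slack factor on the last split.
public class SplitSizeSketch {
    // splitSize = max(minSplitSize, min(maxSplitSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;  // default HDFS block size assumed here
        long splitSize = computeSplitSize(blockSize, 1, Long.MAX_VALUE);
        // A 40 MB file produces a single 40 MB split -> 1 mapper.
        // A 200 MB file produces ceil(200/64) = 4 splits -> 4 mappers.
        System.out.println("split size = " + splitSize + " bytes");
    }
}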
You say that you have too many small files. In general, each of these small files (< 64 MB) will be processed by its own mapper. However, I would suggest looking at CombineFileInputFormat, which packs many small files into a single split based on data locality, for better performance (task initialization time is a significant factor in Hadoop's performance).
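For what it's worth, here is a minimal sketch of wiring that up with the Java mapreduce API. CombineTextInputFormat ships with later Hadoop releases (on 0.20 you would subclass CombineFileInputFormat yourself), and the paths and the 128 MB cap below are made-up values:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class SmallFilesJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files");
        job.setJarByClass(SmallFilesJob.class);

        // Pack many small files into each split instead of one split per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at ~128 MB (tune to taste).
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        // Identity mapper/reducer by default; the job just copies the records.
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With streaming you would point the -inputformat option at an equivalent combine-style input format class instead.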
On the other side, many small files will also hamper your NameNode, since file metadata is kept in memory; that limits the cluster's overall capacity with respect to the number of files it can hold.

Amogh


On 2/25/10 11:15 PM, "Michael Kintzer" <michael.kint...@zerk.com> wrote:

Hi,

We are using the streaming API. We are trying to understand what Hadoop uses as a threshold or trigger to involve more TaskTracker nodes in a given Map-Reduce execution.

With default settings (64MB chunk size in HDFS), if the input file is less than 
64MB, will the data processing only occur on a single TaskTracker Node, even if 
our cluster size is greater than 1?

For example, we are trying to figure out if hadoop is more efficient at 
processing:
a) a single input file which is just an index file that refers to a jar archive 
of 100K or 1M individual small files, where the jar file is passed as the 
"-archives" argument, or
b) a single input file containing all the raw data represented by the 100K or 
1M small files.

With (a), our input file is < 64 MB. With (b), our input file is very large.

Thanks for any insight,

-Michael
