Re: cluster involvement trigger

2010-03-01 Thread Amogh Vasekar
Hi,
You mentioned you pass the files packed together using the -archives option. This 
will uncompress the archive on the compute node itself, so the namenode won't 
be hampered in this case. However, cleaning up the distributed cache is a 
tricky scenario (the user doesn't have explicit control over it); you may 
search this list for the many discussions pertaining to it. And while on the 
topic of archives: it may not be practical for you right now, but Hadoop 
Archives (har) provide similar functionality.
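
For reference, building and then reading a har looks roughly like the following; 
the exact flags vary a little across Hadoop versions, and the paths here are only 
placeholders:

    # pack the small files under /user/you/small-files into a single archive
    hadoop archive -archiveName data.har -p /user/you small-files /user/you/archives

    # the archived files stay addressable through the har:// filesystem, so the
    # namenode tracks a few index/part files instead of one entry per small file
    hadoop fs -ls har:///user/you/archives/data.har/small-files
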
Hope this helps.

Amogh


On 2/27/10 12:53 AM, Michael Kintzer michael.kint...@zerk.com wrote:

Amogh,

Thank you for the detailed information. Our initial prototyping seems to agree 
with your statements below, i.e. a single large input file is performing better 
than an index file plus an archive of small files. I will take a look at 
CombineFileInputFormat as you suggested.

One question. Since the many small input files are all in a single jar archive 
managed by the namenode, does that still hamper namenode performance? 
I was under the impression these archives are only unpacked into the 
temporary map-reduce file space (and, I'm assuming, cleaned up after the 
map-reduce completes). Does the namenode need to store the metadata of each 
individual file during the unpacking in this case?

-Michael

On Feb 25, 2010, at 10:31 PM, Amogh Vasekar wrote:

 Hi,
 The number of mappers initialized depends largely on your input format (the 
 getSplits of your input format); (almost all) input formats available in 
 Hadoop derive from FileInputFormat, hence the "1 mapper per file block" notion 
 (it is actually 1 mapper per split).
 You say that you have too many small files. In general, each of these small 
 files (< 64 MB) will be processed by a single mapper. However, I would 
 suggest looking at CombineFileInputFormat, which does the job of packaging 
 many small files together based on data locality, for better performance 
 (initialization time is a significant factor in Hadoop's performance).
 On the other hand, many small files will hamper your namenode's performance, 
 since file metadata is stored in memory, and will limit its overall capacity 
 with respect to the number of files.
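
 A streaming job can't point -inputformat at CombineFileInputFormat directly 
 (the shipped class is abstract and needs a small Java subclass), but the 
 split-related knobs look roughly like this; the property names are from the 
 old mapred API, the jar path depends on your install, and mapred.map.tasks is 
 only a hint to the framework:

     # a single file smaller than one block normally yields one split / one mapper;
     # raising mapred.map.tasks shrinks the target split size, so FileInputFormat
     # may cut a sub-block file into several splits spread over more TaskTrackers
     hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*streaming*.jar \
         -D mapred.map.tasks=10 \
         -input /user/you/big-input.txt -output /user/you/out \
         -mapper mapper.py -reducer reducer.py \
         -file mapper.py -file reducer.py

     # conversely, raising mapred.min.split.size=<bytes> gives fewer, larger splits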

 Amogh


 On 2/25/10 11:15 PM, Michael Kintzer michael.kint...@zerk.com wrote:

 Hi,

 We are using the streaming API. We are trying to understand what Hadoop 
 uses as a threshold or trigger to involve more TaskTracker nodes in a given 
 Map-Reduce execution.

 With default settings (64MB block size in HDFS), if the input file is less 
 than 64MB, will the data processing only occur on a single TaskTracker node, 
 even if our cluster size is greater than 1?

 For example, we are trying to figure out whether Hadoop is more efficient at 
 processing:
 a) a single input file which is just an index file that refers to a jar 
 archive of 100K or 1M individual small files, where the jar file is passed as 
 the -archives argument, or
 b) a single input file containing all the raw data represented by the 100K or 
 1M small files.

 With (a), our input file is < 64MB. With (b), our input file is very large.
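
 (Concretely, (a) would be launched something like the following, where the jar 
 path, archive name, and the #data link name are placeholders, and the mapper 
 reads each index line and opens the matching file under ./data/.)

     hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*streaming*.jar \
         -archives hdfs:///user/you/small-files.jar#data \
         -input /user/you/index.txt -output /user/you/out \
         -mapper mapper.py -reducer reducer.py \
         -file mapper.py -file reducer.py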

 Thanks for any insight,

 -Michael




