On Tue, Nov 30, 2010 at 3:21 AM, Harsh J <qwertyman...@gmail.com> wrote:
> Hey,
>
> On Tue, Nov 30, 2010 at 4:56 AM, Marc Sturlese <marc.sturl...@gmail.com> 
> wrote:
>>
>> Hey there,
>> I am doing some tests and wondering what the best practices are to deal
>> with very small files which are continuously being generated (1MB or even
>> less).
>
> Have a read: http://www.cloudera.com/blog/2009/02/the-small-files-problem/
>
>>
>> I see that if I have hundreds of small files in HDFS, Hadoop will
>> automatically create A LOT of map tasks to consume them. Each map task will
>> take 10 seconds or less... I don't know if it's possible to change the number
>> of map tasks from Java code using the new API (I know it can be done with the
>> old one). I would like to do something like NumMapTasksCalculatedByHadoop * 0.3.
>> This way, fewer map tasks would be instantiated and each would work for
>> more time.
>
> Perhaps you need to use MultiFileInputFormat:
> http://www.cloudera.com/blog/2009/02/the-small-files-problem/
>
> --
> Harsh J
> www.harshj.com
>

MultiFileInputFormat and CombineFileInputFormat help.
JVM reuse helps.
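
For reference, here is a minimal sketch of wiring that up with the new API. It
assumes a later Hadoop release that ships a concrete CombineTextInputFormat
(on the 0.20-era API you would subclass CombineFileInputFormat and supply your
own RecordReader); the class name SmallFileJob and the 128 MB split cap are
just illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFileJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "small-file-job");
    job.setJarByClass(SmallFileJob.class);

    // Pack many small files into each split so one mapper handles
    // several of them instead of one file apiece.
    job.setInputFormatClass(CombineTextInputFormat.class);
    FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024); // cap each combined split at ~128 MB

    job.setMapperClass(Mapper.class);          // identity mapper, just for illustration
    job.setOutputKeyClass(LongWritable.class); // CombineTextInputFormat emits <LongWritable, Text>
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(0);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With the split cap raised like this, a directory of hundreds of 1 MB files
collapses into a handful of splits, so each mapper runs for a useful amount of
time instead of ten seconds.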

The larger problem is that an average NameNode with 4GB of RAM will start
hitting long JVM pauses at a relatively low number of files/blocks, say
10,000,000. Ten million is not a large number when you are generating
thousands of files a day.

We open-sourced a tool to deal with this problem:
http://www.jointhegrid.com/hadoop_filecrush/index.jsp

Essentially it takes a pass over a directory and combines multiple
files into one. On 'hourly' directories we run it after the hour is
closed out.
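
If anyone wants to roll something similar by hand, the core idea is just to
walk a directory and pack each small file into a single SequenceFile keyed by
its original path. A rough sketch follows (this is not the filecrush code
itself; the class name and example paths are made up, and reading whole files
into memory is only sane because the inputs are small):

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class CrushDirectory {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path(args[0]);   // e.g. an hourly directory of small files
    Path crushed = new Path(args[1]);    // e.g. a single .seq file for that hour

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, crushed, Text.class, BytesWritable.class);
    try {
      for (FileStatus stat : fs.listStatus(inputDir)) {
        if (stat.isDir()) continue;                   // skip subdirectories
        byte[] contents = new byte[(int) stat.getLen()];
        InputStream in = fs.open(stat.getPath());
        try {
          IOUtils.readFully(in, contents, 0, contents.length);
        } finally {
          in.close();
        }
        // Key = original file path, value = raw bytes of that file.
        writer.append(new Text(stat.getPath().toString()),
                      new BytesWritable(contents));
      }
    } finally {
      writer.close();
    }
  }
}

Once the hour is crushed you delete the originals, and the NameNode only has
to track one file and a few blocks instead of thousands of entries.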

V2 (which we should throw over the fence in a week or so) uses the same
techniques, but is optimized for very large directories and/or subdirectories
of varying sizes: it does more intelligent planning and grouping of which
files an individual mapper or reducer will combine.
