On Tue, Nov 30, 2010 at 3:21 AM, Harsh J <qwertyman...@gmail.com> wrote:
> Hey,
>
> On Tue, Nov 30, 2010 at 4:56 AM, Marc Sturlese <marc.sturl...@gmail.com> wrote:
>>
>> Hey there,
>> I am doing some tests and wondering which are the best practices to deal
>> with very small files which are continuously being generated (1 MB or even
>> less).
>
> Have a read: http://www.cloudera.com/blog/2009/02/the-small-files-problem/
>
>>
>> I see that if I have hundreds of small files in HDFS, Hadoop automatically
>> will create A LOT of map tasks to consume them. Each map task will take 10
>> seconds or less... I don't know if it's possible to change the number of map
>> tasks from Java code using the new API (I know it can be done with the old
>> one). I would like to do something like NumMapTasksCalculatedByHadoop * 0.3.
>> This way, fewer map tasks would be instantiated and each would be working
>> for more time.
>
> Perhaps you need to use MultiFileInputFormat:
> http://www.cloudera.com/blog/2009/02/the-small-files-problem/
>
> --
> Harsh J
> www.harshj.com
>
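On the question of cutting the number of map tasks from Java with the new API: you can't scale the computed count directly, but an input format that combines many small files into each split gives you the same effect, and JVM reuse takes some of the sting out of whatever tasks remain. A rough driver sketch of that (not from this thread) is below. Caveats: it assumes a Hadoop release that ships CombineTextInputFormat (older 0.20-era versions don't have it; there you would have to subclass CombineFileInputFormat and supply your own record reader), and the JVM-reuse property name is the one from the 0.20/1.x line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FewerMapsDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Reuse each task JVM for an unlimited number of tasks in this job,
    // so per-task startup cost is paid far less often (0.20/1.x property name).
    conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

    Job job = new Job(conf, "small-files-example");
    job.setJarByClass(FewerMapsDriver.class);

    // Pack small files into combined splits of up to ~256 MB each,
    // so each map task has enough work to justify launching it.
    job.setInputFormatClass(CombineTextInputFormat.class);
    FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);

    job.setMapperClass(Mapper.class);   // identity mapper, placeholder only
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}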
MultiFileInputFormat and CombineFileInputFormat help, and JVM reuse helps too. The larger problem is that an average NameNode with 4 GB of RAM will start JVM pausing at a relatively low number of files/blocks, say 10,000,000. Ten million is not a large number when you are generating thousands of files a day.

We open sourced a tool to deal with this problem:
http://www.jointhegrid.com/hadoop_filecrush/index.jsp

Essentially it takes a pass over a directory and combines multiple files into one. On 'hourly' directories we run it after the hour is closed out. V2 (which we should throw over the fence in a week or so) uses the same techniques, but will be optimized for dealing with very large directories and/or subdirectories of varying sizes by doing more intelligent planning and grouping of which files an individual mapper or reducer is going to combine.
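For anyone who just wants the flavor of the approach without pulling the tool: the usual recipe (and what the Cloudera post above describes) is to roll a directory of small files into a single SequenceFile keyed by the original file names. The sketch below is not the filecrush code, just a minimal hand-rolled illustration of that idea; the paths are made up, and it reads each file fully into memory, which is acceptable only because the files are small.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path(args[0]);   // e.g. the closed-out hourly directory
    Path packed = new Path(args[1]);     // e.g. one big .seq file to replace it

    // One record per original file: key = file name, value = raw bytes.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, packed, Text.class, BytesWritable.class);
    try {
      for (FileStatus stat : fs.listStatus(inputDir)) {
        if (stat.isDir()) {
          continue;
        }
        // Buffering the whole file is fine only because the files are ~1 MB.
        byte[] buf = new byte[(int) stat.getLen()];
        FSDataInputStream in = fs.open(stat.getPath());
        try {
          in.readFully(0, buf);
        } finally {
          in.close();
        }
        writer.append(new Text(stat.getPath().getName()), new BytesWritable(buf));
      }
    } finally {
      writer.close();
    }
  }
}

Once the hour's directory has been packed like this, the original small files can be deleted and downstream jobs read the one SequenceFile instead, which keeps both the NameNode's file count and the per-job map task count down.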