Would the new archive feature, HADOOP-3307, which is currently being developed,
help with this problem?
http://issues.apache.org/jira/browse/HADOOP-3307

--Konstantin

Subramaniam Krishnan wrote:

We have actually written a custom Multi File Splitter that collapses small files into a single split until the DFS block size is hit. We also handle big files by splitting them at the block size and adding any remainders to a single split.

It works great for us....:-)
We are working on optimizing it further to group all the small files on a single data node together, so that each map gets as much local data as possible.
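
A rough, hypothetical sketch of that grouping step is below. This is not the actual splitter described above: the names (BlockSizeGrouper, Chunk) are made up, it only shows the bin-packing of file chunks into block-sized groups, and a real implementation would build proper split objects inside a custom InputFormat's getSplits().

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: pack small files into "splits" bounded by the DFS
// block size, cut big files at block-size boundaries, and pool the leftover
// tails together with the small files.
public class BlockSizeGrouper {

    // One (fileName, offset, length) chunk; a split is a list of chunks.
    public static class Chunk {
        final String file;
        final long offset;
        final long length;
        Chunk(String file, long offset, long length) {
            this.file = file; this.offset = offset; this.length = length;
        }
    }

    public static List<List<Chunk>> group(String[] files, long[] sizes, long blockSize) {
        List<List<Chunk>> splits = new ArrayList<List<Chunk>>();
        List<Chunk> current = new ArrayList<Chunk>();
        long currentBytes = 0;

        for (int i = 0; i < files.length; i++) {
            long remaining = sizes[i];
            long offset = 0;
            // Big files: each full block becomes its own split.
            while (remaining >= blockSize) {
                List<Chunk> whole = new ArrayList<Chunk>();
                whole.add(new Chunk(files[i], offset, blockSize));
                splits.add(whole);
                offset += blockSize;
                remaining -= blockSize;
            }
            // Small files and leftover tails: pack into the current split.
            if (remaining > 0) {
                if (currentBytes + remaining > blockSize) {
                    splits.add(current);
                    current = new ArrayList<Chunk>();
                    currentBytes = 0;
                }
                current.add(new Chunk(files[i], offset, remaining));
                currentBytes += remaining;
            }
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }
}

A real InputFormat built around this would also need a record reader that can move across file boundaries within one split.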

We plan to share this (provided it's found acceptable, of course) once it is done.

Regards,
Subru

Stuart Sierra wrote:

Thanks for the advice, everyone.  I'm going to go with #2, packing my
million files into a small number of SequenceFiles.  This is slow, but
only has to be done once.  My "datacenter" is Amazon Web Services :),
so storing a few large, compressed files is the easiest way to go.

My code, if anyone's interested, is here:
http://stuartsierra.com/2008/04/24/a-million-little-files

-Stuart
altlaw.org


On Wed, Apr 23, 2008 at 11:55 AM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
Hello all, Hadoop newbie here, asking: what's the preferred way to
 handle large (~1 million) collections of small files (10 to 100KB) in
 which each file is a single "record"?

 1. Ignore it, let Hadoop create a million Map processes;
 2. Pack all the files into a single SequenceFile; or
 3. Something else?

 I started writing code to do #2, transforming a big tar.bz2 into a
 BLOCK-compressed SequenceFile, with the file names as keys.  Will that
 work?
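
 For reference, that transformation looks roughly like the sketch below. It is a hypothetical illustration, not the code from the link above: it reads plain local files rather than a tar.bz2 (untarring would need an extra library), and the class name SmallFilesToSequenceFile is made up. The key is the original file name (Text), the value is the raw file contents (BytesWritable), and the writer uses BLOCK compression so many small records compress together.

 import java.io.ByteArrayOutputStream;
 import java.io.File;
 import java.io.FileInputStream;
 import java.io.IOException;
 import java.io.InputStream;

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.BytesWritable;
 import org.apache.hadoop.io.SequenceFile;
 import org.apache.hadoop.io.Text;

 // Hypothetical sketch: pack many small local files into one
 // BLOCK-compressed SequenceFile, keyed by file name.
 public class SmallFilesToSequenceFile {

     public static void main(String[] args) throws IOException {
         Configuration conf = new Configuration();
         FileSystem fs = FileSystem.get(conf);
         Path out = new Path(args[0]);   // destination SequenceFile

         SequenceFile.Writer writer = SequenceFile.createWriter(
                 fs, conf, out, Text.class, BytesWritable.class,
                 SequenceFile.CompressionType.BLOCK);
         try {
             for (int i = 1; i < args.length; i++) {
                 File f = new File(args[i]);
                 byte[] data = readFully(f);
                 // key = original file name, value = raw file contents
                 writer.append(new Text(f.getName()), new BytesWritable(data));
             }
         } finally {
             writer.close();
         }
     }

     private static byte[] readFully(File f) throws IOException {
         ByteArrayOutputStream buf = new ByteArrayOutputStream();
         InputStream in = new FileInputStream(f);
         try {
             byte[] b = new byte[8192];
             int n;
             while ((n = in.read(b)) > 0) {
                 buf.write(b, 0, n);
             }
         } finally {
             in.close();
         }
         return buf.toByteArray();
     }
 }

 On the read side, SequenceFileInputFormat then hands each (name, bytes) record to a map task, so one map processes many of the original small files.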

 Thanks,
 -Stuart, altlaw.org


