Would the new archive feature HADOOP-3307 that is currently being developed
help this problem?
http://issues.apache.org/jira/browse/HADOOP-3307
--Konstantin
Subramaniam Krishnan wrote:
We have actually written a custom Multi File Splitter that collapses all
the small files into a single split until the DFS block size is reached.
We also take care of handling big files by splitting them on the block
size and adding any remainders to a single split.
It works great for us....:-)
We are working on optimizing it further to group together all the small
files that reside on a single data node, so that the map can work on
maximally local data.
We plan to share this (provided it's found acceptable, of course) once
it is done.
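The packing logic described above can be sketched in plain Java, independent of the Hadoop InputFormat API (the class and method names below are illustrative, not from the actual splitter):

```java
import java.util.*;

// Sketch of the split-packing idea: small files are packed into one
// split until the DFS block size is reached; large files are cut on
// block boundaries and their remainders pooled into a single split.
public class SplitPacker {
    // A split is a list of (file, offset, length) chunks.
    public record Chunk(String file, long offset, long length) {}

    public static List<List<Chunk>> pack(LinkedHashMap<String, Long> sizes,
                                         long blockSize) {
        List<List<Chunk>> splits = new ArrayList<>();
        List<Chunk> current = new ArrayList<>();    // small files packed here
        List<Chunk> remainders = new ArrayList<>(); // tails of big files
        long currentBytes = 0;
        for (Map.Entry<String, Long> e : sizes.entrySet()) {
            String name = e.getKey();
            long size = e.getValue();
            if (size >= blockSize) {
                // Cut the big file into full-block splits.
                long off = 0;
                while (size - off >= blockSize) {
                    splits.add(List.of(new Chunk(name, off, blockSize)));
                    off += blockSize;
                }
                // Pool the remainder, if any, into one shared split.
                if (off < size) remainders.add(new Chunk(name, off, size - off));
            } else {
                // Start a new split once the block size would be exceeded.
                if (currentBytes + size > blockSize) {
                    splits.add(current);
                    current = new ArrayList<>();
                    currentBytes = 0;
                }
                current.add(new Chunk(name, 0, size));
                currentBytes += size;
            }
        }
        if (!current.isEmpty()) splits.add(current);
        if (!remainders.isEmpty()) splits.add(remainders);
        return splits;
    }
}
```

With a 100-byte block size and files of 10, 20, and 250 bytes, this yields two full-block splits from the big file, one split holding both small files, and one split holding the 50-byte remainder.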
Regards,
Subru
Stuart Sierra wrote:
Thanks for the advice, everyone. I'm going to go with #2, packing my
million files into a small number of SequenceFiles. This is slow, but
only has to be done once. My "datacenter" is Amazon Web Services :),
so storing a few large, compressed files is the easiest way to go.
My code, if anyone's interested, is here:
http://stuartsierra.com/2008/04/24/a-million-little-files
-Stuart
altlaw.org
On Wed, Apr 23, 2008 at 11:55 AM, Stuart Sierra
<[EMAIL PROTECTED]> wrote:
Hello all, Hadoop newbie here, asking: what's the preferred way to
handle large collections (~1 million) of small files (10 to 100 KB) in
which each file is a single "record"?
1. Ignore it, let Hadoop create a million Map processes;
2. Pack all the files into a single SequenceFile; or
3. Something else?
I started writing code to do #2, transforming a big tar.bz2 into a
BLOCK-compressed SequenceFile, with the file names as keys. Will that
work?
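The key/value shape behind approach #2 can be illustrated without Hadoop on the classpath: the sketch below packs many small files into one container of (name, contents) records, which is the same shape a SequenceFile holds with file names as keys. A real job would instead use Hadoop's `SequenceFile.Writer` with BLOCK compression; this `Packer` class and its record layout are just an illustration.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;

// Dependency-free illustration of packing many small files into a
// single container of length-prefixed (name, contents) records --
// conceptually what a SequenceFile with file-name keys stores.
public class Packer {
    public static void pack(Map<String, byte[]> files,
                            DataOutputStream out) throws IOException {
        for (var e : files.entrySet()) {
            byte[] key = e.getKey().getBytes(StandardCharsets.UTF_8);
            out.writeInt(key.length);          // key length
            out.write(key);                    // key bytes (file name)
            out.writeInt(e.getValue().length); // value length
            out.write(e.getValue());           // value bytes (file contents)
        }
    }

    public static Map<String, byte[]> unpack(DataInputStream in)
            throws IOException {
        var files = new LinkedHashMap<String, byte[]>();
        while (in.available() > 0) {
            byte[] key = in.readNBytes(in.readInt());
            byte[] val = in.readNBytes(in.readInt());
            files.put(new String(key, StandardCharsets.UTF_8), val);
        }
        return files;
    }
}
```

Packing is a one-time cost; afterwards each map task streams records out of one large file instead of opening a million small ones.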
Thanks,
-Stuart, altlaw.org