Re: Best practices for handling many small files

Chris K Wensel Wed, 23 Apr 2008 12:26:57 -0700

are the files to be stored on HDFS long term, or do they need to be fetched from an external authoritative source?


depending on how things are set up in your datacenter etc...

you could aggregate them into a fat sequence file (or a few). keep in mind how long it would take to fetch the files and aggregate them (this is a serial process) and if the corpus changes often (how often will you need to make these sequence files).
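
A minimal sketch of the aggregation idea, in Python rather than Hadoop's Java API. The record layout (length-prefixed file name followed by length-prefixed contents) is an assumption made for illustration and is not the real SequenceFile format; `pack_records` and `read_records` are hypothetical names.

```python
import io
import struct

def pack_records(records, out):
    """Write (name, data) pairs to `out` as length-prefixed key/value records.

    Returns the number of records written. The loop is serial, which is
    the cost mentioned above: every small file passes through one writer.
    """
    count = 0
    for name, data in records:
        key = name.encode("utf-8")
        out.write(struct.pack(">I", len(key)))   # 4-byte big-endian key length
        out.write(key)
        out.write(struct.pack(">I", len(data)))  # 4-byte big-endian value length
        out.write(data)
        count += 1
    return count

def read_records(inp):
    """Yield (name, data) pairs back from a stream written by pack_records."""
    while True:
        header = inp.read(4)
        if not header:  # clean end of stream
            return
        (key_len,) = struct.unpack(">I", header)
        name = inp.read(key_len).decode("utf-8")
        (val_len,) = struct.unpack(">I", inp.read(4))
        yield name, inp.read(val_len)

# Usage: round-trip two small in-memory "files".
buffer = io.BytesIO()
pack_records([("doc1.txt", b"first"), ("doc2.txt", b"second")], buffer)
buffer.seek(0)
unpacked = list(read_records(buffer))
```

Keeping the file name as the key means each record can still be traced back to its source file after packing.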

another option is to make a manifest (list of docs to fetch), feed that to your mapper and have it fetch each file individually. this would be useful if the corpus is reasonably arbitrary between runs and could eliminate much of the load time. but painful if the data is external to your datacenter and the cost to refetch is high.
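
A rough sketch of the manifest approach, again in plain Python and not tied to any Hadoop API. `split_manifest`, `process_split` and the `fetch` callback are hypothetical names; in a real job the framework would hand each mapper its share of the manifest, and `fetch` would be whatever retrieves a document from the authoritative source.

```python
def split_manifest(paths, num_splits):
    """Divide a list of paths into num_splits contiguous, near-equal chunks."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    base, extra = divmod(len(paths), num_splits)
    chunks = []
    start = 0
    for i in range(num_splits):
        # The first `extra` chunks take one additional path each.
        size = base + (1 if i < extra else 0)
        chunks.append(paths[start:start + size])
        start += size
    return chunks

def process_split(chunk, fetch):
    """What one map task would do: fetch each listed file, emit (path, data)."""
    return [(path, fetch(path)) for path in chunk]

# Usage: ten manifest entries spread over three map tasks.
manifest = ["doc%d" % i for i in range(10)]
splits = split_manifest(manifest, 3)
first_task_output = process_split(splits[0], lambda path: path.upper())
```

Only the manifest has to be built up front, so the fetching happens in parallel inside the map tasks; the trade-off is that every run pays the fetch cost again.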


there really is no simple answer..

ckw


On Apr 23, 2008, at 9:16 AM, Joydeep Sen Sarma wrote:

million map processes are horrible. aside from overhead - don't do it if u share the cluster with other jobs (all other jobs will get killed whenever the million map job is finished - see https://issues.apache.org/jira/browse/HADOOP-2393)
well - even for #2 - it begs the question of how the packing itself will be parallelized ..
There's a MultiFileInputFormat that can be extended - that allows processing of multiple files in a single map job. it needs improvement. For one - it's an abstract class - and a concrete implementation for (at least) text files would help. also - the splitting logic is not very smart (from what i last saw). ideally - it should take the million files and form it into N groups (say N is size of your cluster) where each group has files local to the Nth machine and then process them on that machine. currently it doesn't do this (the groups are arbitrary). But it's still the way to go ..
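
The locality-aware grouping described above might look roughly like the following. This is a standalone Python sketch of the idea, not MultiFileInputFormat's actual splitting logic; `group_by_locality` is a hypothetical name, and the replica map is assumed to be given (in a real cluster it would come from HDFS).

```python
def group_by_locality(file_hosts, hosts):
    """Assign each file to exactly one host's group.

    file_hosts maps a file path to the hosts holding a replica of it.
    A file goes to the least-loaded host that has it locally; if none of
    its replica hosts are in `hosts`, it goes to the least-loaded host
    overall. Ties are broken by host name so the result is deterministic.
    """
    groups = {host: [] for host in hosts}
    for path in sorted(file_hosts):
        local = [h for h in file_hosts[path] if h in groups]
        candidates = local or list(hosts)
        target = min(candidates, key=lambda h: (len(groups[h]), h))
        groups[target].append(path)
    return groups

# Usage: five files, three machines; "f5" has no replica on any of them.
replicas = {
    "f1": ["a", "b"],
    "f2": ["a"],
    "f3": ["b"],
    "f4": ["c"],
    "f5": ["z"],
}
groups = group_by_locality(replicas, ["a", "b", "c"])
```

With N groups for N machines, each map task reads mostly local files instead of an arbitrary slice of the million.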
-----Original Message-----
From: [EMAIL PROTECTED] on behalf of Stuart Sierra
Sent: Wed 4/23/2008 8:55 AM
To: core-user@hadoop.apache.org
Subject: Best practices for handling many small files

Hello all, Hadoop newbie here, asking: what's the preferred way to
handle large (~1 million) collections of small files (10 to 100KB) in
which each file is a single "record"?

1. Ignore it, let Hadoop create a million Map processes;
2. Pack all the files into a single SequenceFile; or
3. Something else?

I started writing code to do #2, transforming a big tar.bz2 into a
BLOCK-compressed SequenceFile, with the file names as keys.  Will that
work?

Thanks,
-Stuart, altlaw.org


Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/
