are the files to be stored on HDFS long term, or do they need to be
fetched from an external authoritative source?
depending on how things are set up in your datacenter, etc.,
you could aggregate them into a fat sequence file (or a few). keep in
mind how long it would take to fetch the files and aggregate them
(this is a serial process), and whether the corpus changes often (i.e.
how often you will need to rebuild these sequence files).
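to make the serial packing step concrete, here is a rough Python sketch. it assumes the source is a tar.bz2 archive (as in the original question). `iter_records`, `pack` and the `append` callback are hypothetical names; `append` stands in for whatever writes one key/value record, such as a SequenceFile writer.

```python
import tarfile

def iter_records(archive_path):
    """Yield (file name, file bytes) for every regular file in a tar.bz2."""
    # "r|bz2" reads the archive as a forward-only stream, so members are
    # visited strictly one after another: this is the serial part.
    with tarfile.open(archive_path, mode="r|bz2") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, etc.
            data = archive.extractfile(member).read()
            yield member.name, data

def pack(archive_path, append):
    """Feed every record to append(key, value); return the record count."""
    count = 0
    for name, data in iter_records(archive_path):
        append(name, data)
        count += 1
    return count
```

since everything passes through this one loop, the time to pack grows linearly with the corpus, and the whole pass has to be repeated whenever the sequence file is rebuilt.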
another option is to make a manifest (a list of docs to fetch), feed
that to your mappers, and have them fetch each file individually. this
would be useful if the corpus varies fairly arbitrarily between runs, and
it could eliminate much of the up-front load time. but it is painful if
the data is external to your datacenter and the cost to refetch is high.
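a minimal sketch of the manifest idea in Python, in the style of a streaming mapper. it assumes the manifest is plain text with one path per line. `map_manifest` and `read_local` are hypothetical names, and a real job would swap `read_local` for whatever actually fetches a document (HDFS, HTTP, ...).

```python
def read_local(path):
    """Default fetcher (an assumption): manifest entries are readable paths."""
    with open(path, "rb") as handle:
        return handle.read()

def map_manifest(lines, fetch=read_local):
    """Yield (path, size in bytes) for each manifest entry.

    A real mapper would parse or process the document here; emitting
    its size just keeps the sketch short.
    """
    for line in lines:
        path = line.strip()
        if not path:
            continue  # ignore blank manifest lines
        yield path, len(fetch(path))
```

each mapper only sees its own slice of the manifest, so the fetching is spread across the cluster, but every run pays the fetch cost again.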
there really is no simple answer.
ckw
On Apr 23, 2008, at 9:16 AM, Joydeep Sen Sarma wrote:
a million map processes are horrible. aside from the overhead - don't do
it if you share the cluster with other jobs (all other jobs will get
killed whenever the million-map job is finished - see https://issues.apache.org/jira/browse/HADOOP-2393)
well - even for #2 - it raises the question of how the packing itself
will be parallelized ..
There's a MultiFileInputFormat that can be extended - it allows
processing of multiple files in a single map task. it needs
improvement. For one - it's an abstract class - and a concrete
implementation for (at least) text files would help. also - the
splitting logic is not very smart (from what i last saw). ideally -
it should take the million files and form them into N groups (say N is
the size of your cluster), where the Nth group holds files local to the
Nth machine, and then process each group on that machine. currently it
doesn't do this (the groups are arbitrary). But it's still the way to go ..
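the locality-aware grouping described above could be sketched like this in Python. this is not what MultiFileInputFormat does, just an illustration of the idea. `group_by_locality` is a hypothetical helper, and the file-to-machine map is assumed to be given (in practice it would come from HDFS block locations).

```python
def group_by_locality(file_hosts, machines):
    """Assign each file to one machine, preferring a machine that holds it.

    file_hosts: dict of file name -> list of machines with a local copy.
    machines:   list of machine names (one group per machine).
    """
    groups = {m: [] for m in machines}
    for name in sorted(file_hosts):
        local = [m for m in file_hosts[name] if m in groups]
        # Prefer machines holding a copy; fall back to any machine.
        candidates = local or machines
        # Pick the least-loaded candidate to keep group sizes even.
        target = min(candidates, key=lambda m: len(groups[m]))
        groups[target].append(name)
    return groups
```

the sketch balances by file count only; with files of 10 to 100KB that is probably close enough, but balancing by bytes would be a small change.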
-----Original Message-----
From: [EMAIL PROTECTED] on behalf of Stuart Sierra
Sent: Wed 4/23/2008 8:55 AM
To: core-user@hadoop.apache.org
Subject: Best practices for handling many small files
Hello all, Hadoop newbie here, asking: what's the preferred way to
handle large (~1 million) collections of small files (10 to 100KB) in
which each file is a single "record"?
1. Ignore it, let Hadoop create a million Map processes;
2. Pack all the files into a single SequenceFile; or
3. Something else?
I started writing code to do #2, transforming a big tar.bz2 into a
BLOCK-compressed SequenceFile, with the file names as keys. Will that
work?
Thanks,
-Stuart, altlaw.org
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/