Are the files to be stored on HDFS long term, or do they need to be fetched from an external authoritative source?

Depending on how things are set up in your datacenter, etc., there are a couple of options.

You could aggregate them into a fat SequenceFile (or a few). Keep in mind how long it would take to fetch the files and aggregate them (this is a serial process), and whether the corpus changes often (i.e. how often you will need to rebuild these SequenceFiles).
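
For what it's worth, a rough sketch of what that packing step could look like, just to make it concrete. The class name, argument layout, and the local-directory walk are made up for illustration; the real version would read from wherever the corpus actually lives.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Packs every file under a local directory into one BLOCK-compressed
// SequenceFile on HDFS, keyed by file name. One process, one output
// file -- this is the serial cost mentioned above.
public class SmallFilePacker {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    File inputDir = new File(args[0]);   // local dir of small files
    Path out = new Path(args[1]);        // e.g. corpus.seq on HDFS

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK);
    try {
      for (File f : inputDir.listFiles()) {
        if (!f.isFile()) continue;
        byte[] bytes = new byte[(int) f.length()];
        FileInputStream in = new FileInputStream(f);
        try {
          IOUtils.readFully(in, bytes, 0, bytes.length);
        } finally {
          in.close();
        }
        // key = file name, value = raw contents of the file
        writer.append(new Text(f.getName()), new BytesWritable(bytes));
      }
    } finally {
      writer.close();
    }
  }
}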

Another option is to make a manifest (a list of docs to fetch), feed that to your mappers, and have each one fetch its files individually. This is useful if the corpus is fairly arbitrary between runs and can eliminate much of the load time, but it is painful if the data is external to your datacenter and the cost to refetch is high.
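
A sketch of what the fetch side of that could look like, assuming the manifest is a plain text file of URLs on HDFS and each line becomes one map input record (the class name and the bare URL fetch are just for illustration):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Each input record is one manifest line (a URL to fetch).
// Emits (url, document bytes) for downstream processing.
public class ManifestFetchMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, BytesWritable> {

  public void map(LongWritable offset, Text line,
      OutputCollector<Text, BytesWritable> output, Reporter reporter)
      throws IOException {
    String url = line.toString().trim();
    if (url.length() == 0) return;

    InputStream in = new URL(url).openStream();
    try {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      byte[] chunk = new byte[8192];
      int n;
      while ((n = in.read(chunk)) != -1) {
        buf.write(chunk, 0, n);
      }
      output.collect(new Text(url), new BytesWritable(buf.toByteArray()));
      reporter.incrCounter("fetch", "ok", 1);
    } finally {
      in.close();
    }
  }
}

With plain TextInputFormat the number of maps is driven by how the manifest gets split, so you would probably want to control the split size (or pre-split the manifest yourself) to get a sensible number of concurrent fetchers.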

There really is no simple answer.

ckw


On Apr 23, 2008, at 9:16 AM, Joydeep Sen Sarma wrote:
A million map processes are horrible. Aside from the overhead, don't do it if you share the cluster with other jobs (all other jobs will get killed whenever the million-map job finishes; see https://issues.apache.org/jira/browse/HADOOP-2393).

Well, even for #2, it raises the question of how the packing itself will be parallelized.

There's a MultiFileInputFormat that can be extended; it allows processing of multiple files in a single map task. It needs improvement. For one, it's an abstract class, and a concrete implementation for (at least) text files would help. Also, the splitting logic is not very smart (from what I last saw): ideally it would take the million files and form them into N groups (say N is the size of your cluster), where each group contains files local to one machine, and then process each group on that machine. Currently it doesn't do this (the groups are arbitrary). But it's still the way to go.
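
To make that concrete, here is a rough sketch of what a concrete subclass could look like: one record per small file, key = path, value = the whole contents. Class names are made up, and the locality-aware grouping described above is not addressed; this just rides on the default (arbitrary) splits.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MultiFileInputFormat;
import org.apache.hadoop.mapred.MultiFileSplit;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// One record per small file: key = path, value = whole contents.
public class WholeFileMultiFileInputFormat
    extends MultiFileInputFormat<Text, BytesWritable> {

  public RecordReader<Text, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new WholeFileRecordReader((MultiFileSplit) split, job);
  }

  static class WholeFileRecordReader
      implements RecordReader<Text, BytesWritable> {

    private final MultiFileSplit split;
    private final JobConf job;
    private int index = 0;

    WholeFileRecordReader(MultiFileSplit split, JobConf job) {
      this.split = split;
      this.job = job;
    }

    public boolean next(Text key, BytesWritable value) throws IOException {
      if (index >= split.getNumPaths()) return false;
      Path path = split.getPath(index);
      // files are small (10-100KB), so an int length is fine
      int len = (int) split.getLength(index);

      FileSystem fs = path.getFileSystem(job);
      FSDataInputStream in = fs.open(path);
      byte[] bytes = new byte[len];
      try {
        in.readFully(0, bytes);
      } finally {
        in.close();
      }
      key.set(path.toString());
      value.set(bytes, 0, bytes.length);
      index++;
      return true;
    }

    public Text createKey() { return new Text(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos() { return index; }
    public void close() { }
    public float getProgress() {
      return split.getNumPaths() == 0 ? 1.0f
          : index / (float) split.getNumPaths();
    }
  }
}

I believe the MultiFileWordCount example that ships with Hadoop does something similar line-by-line and is a reasonable starting point.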


-----Original Message-----
From: [EMAIL PROTECTED] on behalf of Stuart Sierra
Sent: Wed 4/23/2008 8:55 AM
To: core-user@hadoop.apache.org
Subject: Best practices for handling many small files

Hello all, Hadoop newbie here, asking: what's the preferred way to
handle large (~1 million) collections of small files (10 to 100KB) in
which each file is a single "record"?

1. Ignore it, let Hadoop create a million Map processes;
2. Pack all the files into a single SequenceFile; or
3. Something else?

I started writing code to do #2, transforming a big tar.bz2 into a
BLOCK-compressed SequenceFile, with the file names as keys.  Will that
work?
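
For reference, a rough sketch of the consuming side, assuming that layout (Text file name -> BytesWritable contents, BLOCK compression is handled transparently by the SequenceFile reader); class names are placeholders:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

// Reads the packed corpus: each map() call sees one original document,
// key = file name, value = file contents. This one just emits sizes.
public class PackedCorpusJob {

  public static class SizeMapper extends MapReduceBase
      implements Mapper<Text, BytesWritable, Text, LongWritable> {
    public void map(Text fileName, BytesWritable contents,
        OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
      output.collect(fileName, new LongWritable(contents.getLength()));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(PackedCorpusJob.class);
    conf.setJobName("packed-corpus");

    conf.setInputFormat(SequenceFileInputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));  // the .seq file
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(SizeMapper.class);
    conf.setNumReduceTasks(0);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);

    JobClient.runJob(conf);
  }
}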

Thanks,
-Stuart, altlaw.org


Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/



