A shameless attempt to defend MultiFileInputFormat:

A concrete implementation of MultiFileInputFormat is not needed, since every InputFormat extending MultiFileInputFormat is expected to provide its own RecordReader implementation, and therefore must override getRecordReader(). An implementation that returns (more or less) a LineRecordReader lives under src/examples/.../MultiFileWordCount. That said, we may include one if a generic implementation (for example, one returning a SequenceFileRecordReader) pops up.
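For illustration, here is a rough sketch of the kind of subclass one ends up writing against the old mapred API (the class and reader names below are made up, not the MultiFileWordCount example): it reads each file in a MultiFileSplit as a single (path, contents) record.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MultiFileInputFormat;
import org.apache.hadoop.mapred.MultiFileSplit;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class WholeFileMultiFileInputFormat
    extends MultiFileInputFormat<Text, BytesWritable> {

  @Override
  public RecordReader<Text, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new WholeFileRecordReader((MultiFileSplit) split, job);
  }

  // Emits one record per file in the split: key = path, value = raw bytes.
  static class WholeFileRecordReader
      implements RecordReader<Text, BytesWritable> {

    private final MultiFileSplit split;
    private final JobConf job;
    private int index = 0;

    WholeFileRecordReader(MultiFileSplit split, JobConf job) {
      this.split = split;
      this.job = job;
    }

    public boolean next(Text key, BytesWritable value) throws IOException {
      if (index >= split.getNumPaths()) {
        return false;
      }
      Path file = split.getPath(index);
      FileSystem fs = file.getFileSystem(job);
      // Files are small, so reading each one whole is fine.
      byte[] contents = new byte[(int) split.getLength(index)];
      FSDataInputStream in = fs.open(file);
      try {
        in.readFully(0, contents);
      } finally {
        in.close();
      }
      key.set(file.toString());
      value.set(contents, 0, contents.length);
      index++;
      return true;
    }

    public Text createKey() { return new Text(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos() { return index; }
    public float getProgress() {
      return split.getNumPaths() == 0 ? 1.0f : (float) index / split.getNumPaths();
    }
    public void close() { }
  }
}

The point is simply that getRecordReader() is where each job's notion of a "record" lives, which is why the base class stays abstract.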

An InputFormat returns numSplits splits from getSplits(JobConf job, int numSplits); numSplits is the number of map tasks, not the number of machines in the cluster.

Last of all, the MultiFileSplit class implements the getLocations() method, which returns the locations of the files in the split, so it is the JobTracker's job to assign tasks in a way that takes advantage of local processing.

Coming to the original question, I think #2 is better, provided the construction of the sequence file is not a bottleneck. You may, for example, create several sequence files in parallel and use all of them as input without merging.
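A minimal driver sketch of that setup (class name and paths are hypothetical; it assumes the FileInputFormat/FileOutputFormat path helpers of the old mapred API, which on older releases live on JobConf instead):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

public class PackedFilesJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(PackedFilesJob.class);
    conf.setJobName("process-packed-small-files");

    // Map input types come from the SequenceFile itself (file name -> bytes);
    // the output classes below assume an identity-style job for illustration.
    conf.setInputFormat(SequenceFileInputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(BytesWritable.class);
    // conf.setMapperClass(...); conf.setReducerClass(...);  // job-specific

    // Several SequenceFiles, created in parallel, used together as input
    // without any merge step.
    FileInputFormat.addInputPath(conf, new Path("/data/packed/part-0"));
    FileInputFormat.addInputPath(conf, new Path("/data/packed/part-1"));
    FileInputFormat.addInputPath(conf, new Path("/data/packed/part-2"));
    FileOutputFormat.setOutputPath(conf, new Path("/data/output"));

    JobClient.runJob(conf);
  }
}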


Joydeep Sen Sarma wrote:
A million map processes are horrible. Aside from the overhead, don't do it if
you share the cluster with other jobs (all other jobs will get killed whenever
the million-map job finishes; see
https://issues.apache.org/jira/browse/HADOOP-2393).

Well, even for #2, it raises the question of how the packing itself will be
parallelized.

There's a MultiFileInputFormat that can be extended; it allows processing of
multiple files in a single map task. It needs improvement. For one, it's an
abstract class, and a concrete implementation for (at least) text files would
help. Also, the splitting logic is not very smart (from what I last saw).
Ideally, it should take the million files and form them into N groups (say N
is the size of your cluster), where each group contains files local to the
Nth machine, and then process each group on that machine. Currently it
doesn't do this (the groups are arbitrary), but it's still the way to go.
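For what it's worth, a rough sketch of that locality-aware grouping: a hypothetical subclass whose getSplits() buckets files by the host of their first block. It assumes a release that has FileSystem.getFileBlockLocations, and it glosses over rack awareness, balancing group sizes, and files spanning multiple blocks; it only illustrates the idea.

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MultiFileInputFormat;
import org.apache.hadoop.mapred.MultiFileSplit;

public abstract class LocalityAwareMultiFileInputFormat<K, V>
    extends MultiFileInputFormat<K, V> {

  @Override
  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    // Bucket every input file by the host holding its first block.
    Map<String, List<FileStatus>> byHost = new HashMap<String, List<FileStatus>>();
    for (Path dir : FileInputFormat.getInputPaths(job)) {
      FileSystem fs = dir.getFileSystem(job);
      for (FileStatus file : fs.listStatus(dir)) {
        if (file.isDir()) {
          continue;
        }
        BlockLocation[] blocks =
            fs.getFileBlockLocations(file, 0, file.getLen());
        String host = (blocks.length > 0 && blocks[0].getHosts().length > 0)
            ? blocks[0].getHosts()[0] : "unknown";
        List<FileStatus> group = byHost.get(host);
        if (group == null) {
          group = new ArrayList<FileStatus>();
          byHost.put(host, group);
        }
        group.add(file);
      }
    }

    // Emit one MultiFileSplit per host group, so each map's files are
    // (mostly) local to one machine.
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (List<FileStatus> group : byHost.values()) {
      Path[] paths = new Path[group.size()];
      long[] lengths = new long[group.size()];
      for (int i = 0; i < group.size(); i++) {
        paths[i] = group.get(i).getPath();
        lengths[i] = group.get(i).getLen();
      }
      splits.add(new MultiFileSplit(job, paths, lengths));
    }
    return splits.toArray(new InputSplit[splits.size()]);
  }
}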


-----Original Message-----
From: [EMAIL PROTECTED] on behalf of Stuart Sierra
Sent: Wed 4/23/2008 8:55 AM
To: core-user@hadoop.apache.org
Subject: Best practices for handling many small files

Hello all, Hadoop newbie here, asking: what's the preferred way to
handle large (~1 million) collections of small files (10 to 100 KB) in
which each file is a single "record"?

1. Ignore it, let Hadoop create a million Map processes;
2. Pack all the files into a single SequenceFile; or
3. Something else?

I started writing code to do #2, transforming a big tar.bz2 into a
BLOCK-compressed SequenceFile, with the file names as keys.  Will that
work?
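
A sketch of that transformation (the tar.bz2 reading here assumes Apache Commons Compress is available; the class name and argument handling are made up):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class TarToSequenceFile {
  public static void main(String[] args) throws IOException {
    String tarBz2 = args[0];                 // local tar.bz2 of small files
    Path out = new Path(args[1]);            // SequenceFile destination on HDFS

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // BLOCK compression groups many small records before compressing them.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK);
    TarArchiveInputStream tar = new TarArchiveInputStream(
        new BZip2CompressorInputStream(
            new BufferedInputStream(new FileInputStream(tarBz2))));
    try {
      TarArchiveEntry entry;
      while ((entry = tar.getNextTarEntry()) != null) {
        if (entry.isDirectory()) {
          continue;
        }
        // One record per file: key = name inside the tar, value = raw bytes
        // (files are small, so an int-sized buffer is fine).
        byte[] contents = new byte[(int) entry.getSize()];
        IOUtils.readFully(tar, contents, 0, contents.length);
        writer.append(new Text(entry.getName()), new BytesWritable(contents));
      }
    } finally {
      tar.close();
      writer.close();
    }
  }
}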

Thanks,
-Stuart, altlaw.org

