Harsh J wrote:
Hello Per,
On Thu, Sep 1, 2011 at 2:27 PM, Per Steffensen <st...@designware.dk> wrote:
Hi
FileInputFormat sub-classes (TextInputFormat and SequenceFileInputFormat)
are able to take all files in a folder and split the work of handling them
into several sub-jobs (map-jobs). I know they can split a very big file into
several sub-jobs, but how do they handle many small files in the folder? If
there are 10000 small files, each with 100 data records, I would not like my
sub-jobs to become too small (due to the overhead of starting a JVM for each
sub-job etc.). I would like e.g. 100 sub-jobs each handling about 10000
data records, or maybe 10 sub-jobs each handling about 100000 data records,
but I would not like 10000 sub-jobs each handling 100 data records. For
this to be possible, one split (the work to be done by one sub-job) will have
to span more than one file. My question is whether FileInputFormat sub-classes
are able to make such splits, or whether they always create at least one
split = sub-job = map-job per file?
You need CombineFileInputFormat for your case, not the vanilla
FileInputFormat, which gives at least one split per file.
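For illustration, a rough driver-side sketch of wiring that in, assuming the
old "mapred" API (which is where CombineFileInputFormat lives in 0.20.x).
CombinedTextInputFormat is a hypothetical concrete subclass (a sketch of one
follows further down the thread), and the input/output paths are taken from
the command line:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SmallFilesDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SmallFilesDriver.class);
    // One combined split can cover many small files; with the vanilla
    // TextInputFormat each small file would become its own map task.
    conf.setInputFormat(CombinedTextInputFormat.class);
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);   // identity mapper/reducer by default
  }
}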
Yes, I found CombineFileInputFormat. It worries me a little, though, to see
that it extends the deprecated FileInputFormat instead of the new
FileInputFormat. Is that a problem?
Also I notice that CombineFileInputFormat is abstract. Why is that? Is
the extension shown on the following webpage a good way out of this:
http://blog.yetitrails.com/2011/04/dealing-with-lots-of-small-files-in.html
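For what it's worth, here is a rough sketch of the kind of concrete subclass
such posts describe, assuming you stay on the old mapred API that
CombineFileInputFormat currently sits in. The names CombinedTextInputFormat
and SingleFileLineReader are made up, the per-file reader just delegates to
the stock LineRecordReader, and the 64 MB cap is only an example value:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

// Only getRecordReader() is abstract, so a concrete subclass mostly has to
// decide how a single file inside a combined split gets read.
public class CombinedTextInputFormat
    extends CombineFileInputFormat<LongWritable, Text> {

  public CombinedTextInputFormat() {
    // Cap each combined split at roughly 64 MB so small files get packed
    // together instead of producing one map task per file.
    setMaxSplitSize(64 * 1024 * 1024);
  }

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    // CombineFileRecordReader walks the files packed into one split and
    // instantiates the per-file reader class below for each of them.
    return new CombineFileRecordReader<LongWritable, Text>(
        job, (CombineFileSplit) split, reporter,
        (Class) SingleFileLineReader.class);
  }

  // Per-file reader: wraps LineRecordReader for the idx-th file of the
  // combined split. The (split, conf, reporter, index) constructor is the
  // shape CombineFileRecordReader instantiates reflectively.
  public static class SingleFileLineReader
      implements RecordReader<LongWritable, Text> {

    private final LineRecordReader delegate;

    public SingleFileLineReader(CombineFileSplit split, Configuration conf,
        Reporter reporter, Integer idx) throws IOException {
      FileSplit fileSplit = new FileSplit(split.getPath(idx),
          split.getOffset(idx), split.getLength(idx), split.getLocations());
      delegate = new LineRecordReader(conf, fileSplit);
    }

    public boolean next(LongWritable key, Text value) throws IOException {
      return delegate.next(key, value);
    }
    public LongWritable createKey() { return delegate.createKey(); }
    public Text createValue() { return delegate.createValue(); }
    public long getPos() throws IOException { return delegate.getPos(); }
    public float getProgress() throws IOException { return delegate.getProgress(); }
    public void close() throws IOException { delegate.close(); }
  }
}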
Another thing is: I expect that FileInputFormat has to somehow list the
files in the folder. How does this listing handle many, many files in the
folder? Most OS's are bad at listing files in folders when there are a lot
of files - at some point it becomes worse than O(n), where n is the number
of files. Windows of course really sucks, and even Linux has problems with
a very high number of files. How does HDFS handle listing of files in a
folder with many, many files? Or maybe I should address this question to
the HDFS mailing list?
Listing would be one RPC call in HDFS, so it might take some time for
millions of files under a single directory. Although the listing operation
itself would be fast enough, since it is done in the NameNode's memory, the
transfer of the results to the client for a large number of items may take
up some time -- and there's no way to page either. I do not think the
complexity is worse than O(n) though, for files under the same dir - only
transfer costs should be your worry. But I've not measured these things, so
I can't give you concrete statements on this. Might be a good exercise?
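For reference, on the job-client side that listing essentially boils down to
a FileSystem.listStatus() call against each input directory, something like
the following (the path argument is just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListDir {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // listStatus() on a directory is answered out of the NameNode's
    // in-memory namespace; the whole FileStatus[] is shipped back to the
    // client in one response.
    FileStatus[] entries = fs.listStatus(new Path(args[0]));
    System.out.println(entries.length + " entries under " + args[0]);
  }
}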
It would be nice to have some numbers on this, yes. But we will run our
own performance tests anyway.
Also note that listing is only done on the front end (at job client
submission) and not later on, so it is just a one-time cost.