On Mon, Sep 15, 2008 at 8:23 AM, Brian Vargas <[EMAIL PROTECTED]> wrote:
> A simple solution is to just load all of the small files into a sequence
> file, and process the sequence file instead.
I use this approach too. I make SequenceFiles with
key= the file name (Text)
value= the contents of the file
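To make that concrete, here is a rough sketch (my own illustration, not code from the original mail) of packing a directory of small files into a SequenceFile keyed by file name, using the classic SequenceFile.Writer API; the input and output paths are just placeholders:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilesToSequenceFile {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path(args[0]);   // directory holding the small files
    Path outputFile = new Path(args[1]); // the SequenceFile to create

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, outputFile, Text.class, BytesWritable.class);
    try {
      for (FileStatus status : fs.listStatus(inputDir)) {
        if (status.isDir()) {
          continue;
        }
        // key = file name, value = the raw bytes of the file
        byte[] contents = new byte[(int) status.getLen()];
        FSDataInputStream in = fs.open(status.getPath());
        try {
          in.readFully(contents);
        } finally {
          in.close();
        }
        writer.append(new Text(status.getPath().getName()),
                      new BytesWritable(contents));
      }
    } finally {
      writer.close();
    }
  }
}

A map task can then stream through the SequenceFile record by record instead of opening millions of tiny files.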
Hi,
I'm just working on the situation you described, with millions of small
files sized around 10 KB.
My idea is to compact these files into big ones and create indexes for
them. It is a file system over a file system, and it supports appends,
updates, and lazy deletes.
Hope this helps.
--
[EMAIL PROTECTED]
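For what it's worth, the index for that kind of "file system over a file system" can be as simple as a map from file name to (container file, offset, length), with deletes only marking entries until a later compaction pass. This is my own rough sketch, not the poster's implementation:

import java.util.HashMap;
import java.util.Map;

public class PackedFileIndex {

  /** Location of one small file inside a large container file. */
  public static class Entry {
    final String container; // e.g. an HDFS path such as /data/pack-00042
    final long offset;
    final int length;
    boolean deleted;        // lazy delete: space reclaimed later by compaction

    Entry(String container, long offset, int length) {
      this.container = container;
      this.offset = offset;
      this.length = length;
    }
  }

  private final Map<String, Entry> index = new HashMap<String, Entry>();

  /** Record a file appended to the end of a container. */
  public void add(String fileName, String container, long offset, int length) {
    index.put(fileName, new Entry(container, offset, length));
  }

  /** Mark a file deleted without rewriting the container (lazy delete). */
  public void delete(String fileName) {
    Entry e = index.get(fileName);
    if (e != null) {
      e.deleted = true;
    }
  }

  /** Find where to read a file from, or null if it is unknown or deleted. */
  public Entry locate(String fileName) {
    Entry e = index.get(fileName);
    return (e == null || e.deleted) ? null : e;
  }
}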
Peter,
You are likely to hit memory limitations on the name-node.
With 100 million small files it will need to support 200 million objects,
which will require roughly 30 GB of RAM on the name-node.
You may also consider Hadoop archives, or present your files as a
collection of records and use Pig or Hive.
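To spell out the arithmetic behind that estimate (the ~150 bytes per namespace object is the usual rule of thumb, not a figure from the mail above):

  100,000,000 files + ~100,000,000 blocks ≈ 200,000,000 namespace objects
  200,000,000 objects * ~150 bytes/object ≈ 30,000,000,000 bytes ≈ 30 GB of name-node heap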
Peter,
In my testing with files of that size (well, larger, but still well
below the block size) it was impossible to achieve any real throughput
on the data because of the overhead of looking up the locations of all
those files on the NameNode.
Hi All,
I am considering using HDFS for an application that potentially has many
small files, i.e. 10-100 million files with an estimated average file size
of 50-100 KB (perhaps smaller); it is an online interactive application.
All of the documentation I have seen suggests a block size of 64-128 MB.