Re: Small Filesizes

2008-09-16 Thread Stuart Sierra
On Mon, Sep 15, 2008 at 8:23 AM, Brian Vargas <[EMAIL PROTECTED]> wrote: > A simple solution is to just load all of the small files into a sequence > file, and process the sequence file instead. I use this approach too. I make SequenceFiles with key = the file name (Text) and value = the contents of the file.
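
The message above describes the pattern only in outline, so here is a minimal, hedged sketch of it: packing local small files into one SequenceFile with the file name as the key. The value type (BytesWritable), class name, and argument handling are assumptions of mine, not something Stuart posted:

    import java.io.DataInputStream;
    import java.io.File;
    import java.io.FileInputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Packs each small local file into one SequenceFile entry:
    //   key   = file name (Text)
    //   value = raw file contents (BytesWritable -- my assumption)
    public class SmallFilesToSequenceFile {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(args[0]);  // e.g. /user/me/smallfiles.seq
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, Text.class, BytesWritable.class);
        try {
          for (int i = 1; i < args.length; i++) {
            File f = new File(args[i]);
            byte[] bytes = new byte[(int) f.length()];
            DataInputStream in = new DataInputStream(new FileInputStream(f));
            in.readFully(bytes);  // files are small, so reading fully into memory is fine
            in.close();
            writer.append(new Text(f.getName()), new BytesWritable(bytes));
          }
        } finally {
          writer.close();
        }
      }
    }

One large SequenceFile then stands in for millions of tiny HDFS files, which keeps NameNode metadata small and gives MapReduce sensibly sized input splits.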

Re: Small Filesizes

2008-09-15 Thread Mafish Liu
Hi, I'm working on exactly the situation you described, with millions of small files of around 10 KB each. My idea is to compact these files into big ones and create indexes for them. This is a file system on top of the file system, and it supports append-style updates and lazy deletes. Hope this helps. -- [EMAIL PROTECTED]
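
The post gives no code, but one possible shape of that "big file + index" idea, with entirely hypothetical names, is to append each small file to a container file and record (name, offset, length) in a side index; a delete then only removes or flags the index entry, which is the "lazy delete" mentioned above:

    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;

    // Hypothetical sketch: pack small files into packed.dat and write a
    // simple (name, offset, length) index to packed.idx.
    public class PackedStore {
      public static void main(String[] args) throws Exception {
        DataOutputStream data  = new DataOutputStream(new FileOutputStream("packed.dat"));
        DataOutputStream index = new DataOutputStream(new FileOutputStream("packed.idx"));
        long offset = 0;
        for (String name : args) {
          File f = new File(name);
          byte[] bytes = new byte[(int) f.length()];
          DataInputStream in = new DataInputStream(new FileInputStream(f));
          in.readFully(bytes);
          in.close();
          data.write(bytes);              // append the file body to the container
          index.writeUTF(f.getName());    // index record: name, offset, length
          index.writeLong(offset);
          index.writeInt(bytes.length);
          offset += bytes.length;
        }
        data.close();
        index.close();
      }
    }

An appended or updated file simply gets a new record at the end of the index; readers take the last record seen for a given name.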

Re: Small Filesizes

2008-09-15 Thread Konstantin Shvachko
Peter, You are likely to hit memory limitations on the name-node. With 100 million small files it will need to support 200 million objects, which will require roughly 30 GB of RAM on the name-node. You may also consider Hadoop archives, or present your files as a collection of records and use Pig, Hive, or similar tools.
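
A rough sketch of where the 30 GB figure comes from, assuming on the order of 150 bytes of NameNode heap per namespace object (a commonly quoted rule of thumb, not stated in the message itself):

    100 million small files -> ~100 million file objects
                             + ~100 million block objects (one block per small file)
                             = ~200 million objects
    200,000,000 objects x ~150 bytes/object  =  ~30 GB of NameNode heap

For the Hadoop archives suggestion, the tool is invoked roughly as follows (exact options vary by release, and the paths here are placeholders):

    hadoop archive -archiveName files.har /user/me/smallfiles /user/me/archives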

Re: Small Filesizes

2008-09-15 Thread Brian Vargas
Peter, In my testing with files of that size (well, larger, but still well below the block size) it was impossible to achieve any real throughput on the data because of the overhead of looking up the locations of all those files on the NameNode.
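
Combining Brian's advice with Stuart's packing approach above, a job can read the SequenceFile directly, so the NameNode is consulted about one large file instead of millions of tiny ones. This is a hedged sketch using the old (0.18-era) MapReduce API; the mapper, key/value types, and paths are my assumptions, chosen to match the writer sketch earlier in this thread:

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;

    public class ProcessPackedFiles {

      // Trivial example mapper: emits each packed file's name and its size in bytes.
      public static class NameAndSize extends MapReduceBase
          implements Mapper<Text, BytesWritable, Text, IntWritable> {
        public void map(Text fileName, BytesWritable contents,
                        OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
          out.collect(fileName, new IntWritable(contents.getLength()));
        }
      }

      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ProcessPackedFiles.class);
        conf.setJobName("process-packed-small-files");
        conf.setInputFormat(SequenceFileInputFormat.class);     // read keys/values straight from the SequenceFile
        FileInputFormat.addInputPath(conf, new Path(args[0]));  // e.g. /user/me/smallfiles.seq
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setMapperClass(NameAndSize.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        JobClient.runJob(conf);
      }
    }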

Small Filesizes

2008-09-14 Thread Peter McTaggart
Hi All, I am considering using HDFS for an application that potentially has many small files, i.e. 10-100 million files with an estimated average file size of 50-100 KB (perhaps smaller), and is an online interactive application. All of the documentation I have seen suggests that a block size of 64-