Peter,

In my testing with files of that size (well, larger, but still well
below the block size), it was impossible to achieve any real throughput
on the data because of the overhead of looking up the locations of all
those files on the NameNode.  Your application spends so much time on
those metadata lookups that most of the CPUs sit idle.

A simple solution is to load all of the small files into a single
SequenceFile and process the SequenceFile instead.
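
Something along these lines (untested, written against the old
SequenceFile.createWriter signature; the PackSmallFiles class name and
the directory/path arguments are just placeholders) should give you the
idea, with the file name as the key and the raw bytes as the value:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args[1]);   // target SequenceFile path in HDFS
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out,
                                  Text.class, BytesWritable.class);
    try {
      // args[0] is a local directory holding the small files
      for (File f : new File(args[0]).listFiles()) {
        byte[] buf = new byte[(int) f.length()];
        FileInputStream in = new FileInputStream(f);
        try {
          IOUtils.readFully(in, buf, 0, buf.length);
        } finally {
          in.close();
        }
        // key = original file name, value = the file's contents
        writer.append(new Text(f.getName()), new BytesWritable(buf));
      }
    } finally {
      writer.close();
    }
  }
}

Your tasks then read key/value pairs out of the one SequenceFile rather
than opening millions of tiny files, so the NameNode only has to be
consulted for the blocks of that single file.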

Brian

Peter McTaggart wrote:
> Hi All,
> 
> I am considering using HDFS for an application that potentially has many
> small files, i.e. 10-100 million files with an estimated average file size
> of 50-100 KB (perhaps smaller), and is an online interactive application.
> 
> All of the documentation I have seen suggests that a block size of 64-128 MB
> works best for Hadoop/HDFS and that it is best used for batch-oriented
> applications.
> 
> Does anyone have any experience using it for files of this size in an
> online application environment?
> 
> Is it worth pursuing HDFS for this type of application?
> 
> Thanks
> 
> Peter
> 