Peter,
In my testing with files of that size (well, larger, but still well below the block size), it was impossible to achieve any real throughput because of the overhead of looking up the locations of all those files on the NameNode. Your application spends so much time resolving file names that most of the CPUs sit idle.

A simple solution is to load all of the small files into a single sequence file and process that sequence file instead (a rough sketch follows below the quoted message).

Brian

Peter McTaggart wrote:
> Hi All,
>
> I am considering using HDFS for an application that potentially has many
> small files, i.e. 10-100 million files with an estimated average file size of
> 50-100k (perhaps smaller), and it is an online interactive application.
>
> All of the documentation I have seen suggests that a blocksize of 64-128 MB
> works best for Hadoop/HDFS and that it is best suited to batch-oriented
> applications.
>
> Does anyone have any experience using it for files of this size in an
> online application environment?
>
> Is it worth pursuing HDFS for this type of application?
>
> Thanks
>
> Peter
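For reference, here is a minimal sketch of that packing step. It assumes a local source directory and an HDFS output path passed as arguments; the class name and paths are just placeholders, and it uses the classic SequenceFile.createWriter(fs, conf, path, keyClass, valClass, compressionType) call, keying each record by file name with the raw bytes as the value:

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Packs every file in a local directory into one SequenceFile on HDFS,
// keyed by the original file name, so a job reads one large file instead
// of millions of small ones. (Placeholder class name.)
public class SmallFilePacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args[1]);  // HDFS path for the packed output
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK);
    try {
      for (File f : new File(args[0]).listFiles()) {  // local source dir
        byte[] data = new byte[(int) f.length()];
        DataInputStream in = new DataInputStream(new FileInputStream(f));
        try {
          in.readFully(data);  // files are small, so read each one whole
        } finally {
          in.close();
        }
        // key = original file name, value = raw file contents
        writer.append(new Text(f.getName()), new BytesWritable(data));
      }
    } finally {
      writer.close();
    }
  }
}

The job then reads the packed file with SequenceFileInputFormat and gets Text/BytesWritable pairs, which costs one NameNode lookup per block rather than one per small file.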