Brendan,

The issue with using lots of small files is increased processing overhead: every file costs a repeated, avoidable open-read(a little)-close cycle. HDFS is also used by people who heavily process the data they store, and with a huge number of files such jobs cannot cut through the data quickly. RAM is another factor, since by design the NameNode keeps the metadata for every file in memory. Ideally you do not want to end up walking millions of individual files when you wish to process them all; several tools/formats (SequenceFiles, HAR files, etc.) let you store the same data more efficiently for that purpose.
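To make the idea concrete, here is a minimal sketch of the "pack many small records into one container file with an index" approach, in the spirit of SequenceFiles/HAR files. This is plain illustrative Python, not the Hadoop API; all names (`pack`, `read_record`) are made up for the example:

```python
# Sketch: many small files vs. one packed container file.
# One container file means one open/close and (in HDFS terms) far fewer
# NameNode entries, while an index still gives random access to each record.
import os
import tempfile

def pack(records, container_path):
    """Write (name, payload) pairs into one file; return a name -> (offset, length) index."""
    index = {}
    with open(container_path, "wb") as out:
        for name, payload in records:
            index[name] = (out.tell(), len(payload))
            out.write(payload)
    return index

def read_record(container_path, index, name):
    """Random access to a single packed record via the index."""
    offset, length = index[name]
    with open(container_path, "rb") as f:
        f.seek(offset)
        return f.read(length)

if __name__ == "__main__":
    records = [("file-%d" % i, ("payload %d" % i).encode()) for i in range(1000)]
    with tempfile.TemporaryDirectory() as d:
        container = os.path.join(d, "packed.bin")
        index = pack(records, container)
        assert read_record(container, index, "file-42") == b"payload 42"
        print(len(index), "records stored in one container file")
```

A real SequenceFile additionally stores the keys and sync markers inside the file itself (so the index is recoverable and the file is splittable for MapReduce), but the storage-efficiency idea is the same.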
You can probably utilize HBase for such storage. It will let you store large amounts of data in compact files while at the same time allowing random access to individual records, if that's needed by your use case as well. Check out this previous discussion on the topic, which was about storing image files but should apply to your question too: http://search-hadoop.com/m/j95CxojSOC

Head over to u...@hbase.apache.org if you have further questions on Apache HBase.

On Tue, May 22, 2012 at 3:09 PM, Brendan cheng <ccp...@hotmail.com> wrote:
>
> Hi,
> I read the HDFS architecture doc and it said HDFS is tuned for storing large
> files, typically gigabytes to terabytes. What is the downside of storing
> millions of small files like <10MB? Or what setting of HDFS is suitable for
> storing small files?
> Actually, I plan to find a distributed file system for storing multi-millions
> of files.
> Brendan

--
Harsh J