Hi Mohit,

> How many are too many for the name-node? We have around 100M files,
> and we add roughly 100M files every year.
The name-node stores file and block metadata in RAM. Here is an estimate of the memory used per file and block:

"Estimates show that the name-node uses fewer than 200 bytes to store a single metadata object (a file inode or a block). According to statistics on our clusters, a file on average consists of 1.5 blocks, which means that it takes 600 bytes (1 file object + 2 block objects) to store an average file in name-node’s RAM"

http://www.usenix.org/publications/login/2010-04/openpdfs/shvachko.pdf

By that estimate, your current 100M files come to roughly 100M x 600 bytes = 60 GB of name-node heap, growing by about the same amount each year.

Next-generation Hadoop (Hadoop 0.23) brings HDFS Federation, which will improve the scalability of the name-node. You can read more about that here:

http://hortonworks.com/an-introduction-to-hdfs-federation/

Rohit Bakhshi
www.hortonworks.com (http://www.hortonworks.com/)

On Tuesday, February 14, 2012 at 10:56 AM, W.P. McNeill wrote:

> I'm not sure what you mean by "flat format" here.
>
> In my scenario, I have a file input.xml that looks like this:
>
> <myfile>
>   <section>
>     <value>1</value>
>   </section>
>   <section>
>     <value>2</value>
>   </section>
> </myfile>
>
> input.xml is a plain text file, not a sequence file. If I read it with
> the XMLInputFormat, my mapper gets called with (key, value) pairs that
> look like this:
>
> (nnnn, <section><value>1</value></section>)
> (nnnn, <section><value>2</value></section>)
>
> where the keys are numerical offsets into the file. I then use this
> information to write a sequence file with these (key, value) pairs. So my
> Hadoop job that uses XMLInputFormat takes a text file as input and
> produces a sequence file as output.
>
> I don't know a rule of thumb for how many small files is too many. Maybe
> someone else on the list can chime in. I just know that when your
> throughput gets slow, that's one possible cause to investigate.
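For anyone following along, here is a minimal sketch of the kind of job W.P. describes: a map-only job that reads the XML with an XmlInputFormat and writes a sequence file. It assumes an XmlInputFormat along the lines of Mahout's, which splits the input on configurable start/end tags set via the "xmlinput.start" / "xmlinput.end" properties and hands the mapper (byte offset, element text) pairs; the package name, property names, and paths below are assumptions and vary between implementations.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
// Assumed: an XmlInputFormat on the classpath, e.g. Mahout's
// org.apache.mahout.classifier.bayes.XmlInputFormat (location varies
// by Mahout version).
import org.apache.mahout.classifier.bayes.XmlInputFormat;

public class XmlToSequenceFile {

    // Pass-through mapper: the key is the element's byte offset in
    // input.xml, the value is the raw <section>...</section> text.
    public static class PassThroughMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text element, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(offset, element);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tags the record reader splits on; these property names match
        // Mahout's XmlInputFormat and may differ in other implementations.
        conf.set("xmlinput.start", "<section>");
        conf.set("xmlinput.end", "</section>");

        Job job = new Job(conf, "xml to sequence file");
        job.setJarByClass(XmlToSequenceFile.class);
        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0);                      // map-only job

        job.setInputFormatClass(XmlInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Invoked as "hadoop jar yourjob.jar XmlToSequenceFile input.xml outdir" (the jar and paths are placeholders), it writes a sequence file of (offset, section) pairs. Setting the reduce count to zero skips the sort/shuffle entirely, which is all you want when the job is just a format conversion.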