Stuart - if Dhruba is giving the per-file and per-block memory costs used by the namenode, you really cannot get a more authoritative number anywhere else :) I would do the back-of-envelope with ~160 bytes/file and ~150 bytes/block.
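In code, that back-of-envelope looks roughly like the sketch below. It is an estimate only: the ~160/~150 byte figures are the rough per-object numbers from this thread, not exact object sizes, and a real namenode heap needs extra headroom beyond them.

```java
/** Back-of-envelope namenode heap estimate using the rough per-object
 *  figures from this thread (~160 bytes per file, ~150 bytes per block).
 *  These are approximations, not exact namenode object sizes. */
public class NamenodeHeapEstimate {
    static final long BYTES_PER_FILE  = 160L;
    static final long BYTES_PER_BLOCK = 150L;

    static long roughHeapBytes(long files, long blocks) {
        return files * BYTES_PER_FILE + blocks * BYTES_PER_BLOCK;
    }

    public static void main(String[] args) {
        long files  = 100000000L;  // 100 million files (the Yahoo example below)
        long blocks = 200000000L;  // referencing 200 million blocks
        double gb = roughHeapBytes(files, blocks) / (1024.0 * 1024 * 1024);
        // Prints roughly 43 GB for the file and block objects alone;
        // the "at least 60 GB" figure quoted below leaves headroom for the rest of the heap.
        System.out.printf("~%.0f GB for file and block objects%n", gb);
    }
}
```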
On Wed, Feb 2, 2011 at 9:08 PM, Stuart Smith <stu24m...@yahoo.com> wrote:

> This is the best coverage I've seen from a source that would know:
>
> http://developer.yahoo.com/blogs/hadoop/posts/2010/05/scalability_of_the_hadoop_dist/
>
> One relevant quote:
>
> "To store 100 million files (referencing 200 million blocks), a name-node should have at least 60 GB of RAM."
>
> But, honestly, if you're just building out your cluster, you'll probably run into a lot of other limits first: hard drive space, regionserver memory, the infamous ulimit/xciever :), etc.
>
> Take care,
> -stu
>
> --- On Wed, 2/2/11, Dhruba Borthakur <dhr...@gmail.com> wrote:
>
> From: Dhruba Borthakur <dhr...@gmail.com>
> Subject: Re: HDFS without Hadoop: Why?
> To: hdfs-user@hadoop.apache.org
> Date: Wednesday, February 2, 2011, 9:00 PM
>
> The Namenode uses around 160 bytes/file and 150 bytes/block in HDFS. This is a very rough calculation.
>
> dhruba
>
> On Wed, Feb 2, 2011 at 5:11 PM, Dhodapkar, Chinmay <chinm...@qualcomm.com> wrote:
>
> What you describe is pretty much my use case as well. Since I don't know how big the number of files could get, I am trying to figure out if there is a theoretical design limitation in HDFS.
>
> From what I have read, the name node stores all metadata for all files in RAM. Assuming (in my case) that a file is smaller than the configured block size, there should be a very rough formula that can be used to calculate the maximum number of files HDFS can serve based on the RAM configured on the name node?
>
> Can any of the implementers comment on this? Am I even thinking on the right track?
>
> Thanks Ian for the haystack link - very informative indeed.
>
> -Chinmay
>
> From: Stuart Smith [mailto:stu24m...@yahoo.com]
> Sent: Wednesday, February 02, 2011 4:41 PM
> To: hdfs-user@hadoop.apache.org
> Subject: RE: HDFS without Hadoop: Why?
>
> Hello,
> I'm actually using hbase/hadoop/hdfs for lots of small files (with a long tail of larger files). Well, millions of small files - I don't know what you mean by lots :)
>
> Facebook probably knows better, but what I do is:
>
> - store metadata in hbase
> - store files smaller than 10 MB or so in hbase
> - store larger files in an hdfs directory tree.
>
> I started storing files up to 64 MB (the block size) in hbase, but that causes issues with regionservers when running M/R jobs. This is related to the fact that I'm running a cobbled-together cluster and my regionservers don't have that much memory. I would play with the size to see what works for you.
>
> Take care,
> -stu
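A minimal sketch of the size-threshold approach Stuart describes, written against the HTable/Put client API. The "files" table, the "f" column family, the "/archive" directory, and the 10 MB cutoff are made-up examples for illustration, not anything the thread prescribes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SmallFileStore {
    // Illustrative cutoff: payloads at or below this go into an HBase cell,
    // larger ones into a plain HDFS file. Tune it to your regionservers.
    static final long SMALL_FILE_LIMIT = 10L * 1024 * 1024;

    private final Configuration conf = HBaseConfiguration.create();
    private final HTable table;   // hypothetical "files" table
    private final FileSystem fs;

    public SmallFileStore() throws Exception {
        table = new HTable(conf, "files");
        fs = FileSystem.get(conf);
    }

    public void store(String name, byte[] data) throws Exception {
        if (data.length <= SMALL_FILE_LIMIT) {
            // Small payload: one cell in HBase, keyed by the file name.
            Put put = new Put(Bytes.toBytes(name));
            put.add(Bytes.toBytes("f"), Bytes.toBytes("data"), data);
            table.put(put);
        } else {
            // Large payload: a regular file in an HDFS directory tree.
            FSDataOutputStream out = fs.create(new Path("/archive", name));
            try {
                out.write(data);
            } finally {
                out.close();
            }
        }
    }
}
```

The cutoff sits well below the block size because, as Stuart notes, large cells put memory pressure on the regionservers during M/R jobs; it is worth experimenting to find what your hardware tolerates.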
> --- On Wed, 2/2/11, Dhodapkar, Chinmay <chinm...@qualcomm.com> wrote:
>
> From: Dhodapkar, Chinmay <chinm...@qualcomm.com>
> Subject: RE: HDFS without Hadoop: Why?
> To: "hdfs-user@hadoop.apache.org" <hdfs-user@hadoop.apache.org>
> Date: Wednesday, February 2, 2011, 7:28 PM
>
> Hello,
>
> I have been following this thread for some time now. I am very comfortable with the advantages of HDFS, but still have lingering questions about using HDFS for general-purpose storage (no mapreduce/hbase etc.).
>
> Can somebody shed light on what the limits are on the number of files that can be stored? Is it limited in any way by the namenode? The use case I am interested in is storing a very large number of relatively small files (1 MB to 25 MB).
>
> Interestingly, I saw a Facebook presentation on how they use hbase/hdfs internally. They seem to store all metadata in hbase and the actual images/files/etc. in something called "haystack" (why not use hdfs, since they already have it?). Anybody know what "haystack" is?
>
> Thanks!
> Chinmay
>
> From: Jeff Hammerbacher [mailto:ham...@cloudera.com]
> Sent: Wednesday, February 02, 2011 3:31 PM
> To: hdfs-user@hadoop.apache.org
> Subject: Re: HDFS without Hadoop: Why?
>
> > - Large block size wastes space for small files. The minimum file size is 1 block.
>
> That's incorrect. If a file is smaller than the block size, it will only consume as much space as there is data in the file.
>
> > - There are no hardlinks, softlinks, or quotas.
>
> That's incorrect; there are quotas and softlinks.
>
> --
> Connect to me at http://www.facebook.com/dhruba
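On the block-size point Jeff makes: a quick way to convince yourself is to write a small file and compare its length and block size to the space HDFS actually accounts for it. A sketch, with the path and sizes made up for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallFileSpaceCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a ~1 MB file; the path is illustrative.
        Path p = new Path("/tmp/one-megabyte-file");
        FSDataOutputStream out = fs.create(p);
        out.write(new byte[1024 * 1024]);
        out.close();

        FileStatus st = fs.getFileStatus(p);
        ContentSummary cs = fs.getContentSummary(p);

        // The block size (e.g. 64 MB) is only an upper bound per block;
        // space consumed is the file's length times its replication,
        // not a whole block per file.
        System.out.println("length         = " + st.getLen());
        System.out.println("block size     = " + st.getBlockSize());
        System.out.println("replication    = " + st.getReplication());
        System.out.println("space consumed = " + cs.getSpaceConsumed());
    }
}
```

The per-file cost that does matter for lots of small files is the namenode memory discussed at the top of the thread, not wasted block space on the datanodes.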