Note that there are other data structures in memory for the Namenode, like 
the BlockMap, the Directory, etc., so the per-file and per-block byte counts 
alone are not sufficient. But it is true that the File and Block structures 
occupy most of the memory; I would say these two are near the top of the 
list of high-memory objects. Konstantin's paper will probably give you a 
more holistic picture. 

Also, quite a few memory optimizations have gone into the Namenode. With the 
recent optimizations, you can expect > 60 million files (with 1 block each) 
on a 32 GB RAM machine; I am being conservative here. You can work your way 
up from these numbers. The assumption here is that the Namenode is running 
on a 64-bit JVM.
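
For anyone who wants to redo that arithmetic, here is a rough back-of-envelope
sketch in Python. It only uses the ~160 bytes/file and ~150 bytes/block figures
quoted further down in this thread; the one-block-per-file default and the
fraction of the heap left free for everything else (BlockMap, Directory, GC
headroom) are assumptions, not measured numbers.

# Rough estimate of how many files a Namenode heap can track.
# Assumed inputs: ~160 bytes per file object and ~150 bytes per block
# object (figures quoted in this thread), plus an assumed 60% of the heap
# available for these objects, leaving the rest for the BlockMap,
# Directory, transient state, and GC headroom.
BYTES_PER_FILE = 160
BYTES_PER_BLOCK = 150
USABLE_HEAP_FRACTION = 0.6   # assumption, not a measured number

def max_files(heap_gb, blocks_per_file=1):
    """Very rough upper bound on files a Namenode heap of heap_gb GB can hold."""
    usable_bytes = heap_gb * 1024**3 * USABLE_HEAP_FRACTION
    bytes_per_file = BYTES_PER_FILE + blocks_per_file * BYTES_PER_BLOCK
    return int(usable_bytes / bytes_per_file)

if __name__ == "__main__":
    for gb in (16, 32, 64):
        print(f"{gb:>3} GB heap -> ~{max_files(gb) / 1e6:.0f} million files (1 block each)")

With these assumed numbers a 32 GB heap works out to roughly 60-65 million
single-block files, in the same ballpark as the conservative figure above, and
the 100-million-file / 200-million-block / 60 GB recommendation quoted further
down the thread comes out of the same arithmetic (100M x 160 B + 200M x 150 B
is about 46 GB of object data before any other overhead).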

-Bharath



From: Stuart Smith <stu24m...@yahoo.com>
To: hdfs-user@hadoop.apache.org
Cc: 
Sent: Wednesday, February 2, 2011 7:32 PM
Subject: Re: HDFS without Hadoop: Why?



> Stuart - if Dhruba is giving hdfs file and block sizes used by the 
> namenode, you really cannot get a more authoritative number elsewhere :) 

Yes - very true! :)

I spaced out on the name there ... ;)

One more thing - I believe that if you're storing a lot of your smaller files 
in hbase, you'll end up with far fewer files on hdfs, since several of your 
smaller files will end up in a single HFile?

I'm storing 5-7 million files, with at least 70-80% ending up in hbase. I only 
have 16 GB of RAM for my name-node, and it's very far from overloading the 
memory. Off the top of my head, I think it's << 8 GB of RAM used... 
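
A minimal sketch of the hbase-vs-hdfs split described above, assuming a
hypothetical 10 MB cutoff and placeholder client objects (hbase_table,
hdfs_client) standing in for whatever HBase table handle and hdfs client you
actually use; this is illustrative only, not anyone's production code:

import os

# Assumed cutoff: files at or below this size go into an hbase cell,
# everything larger is written out to a plain hdfs path. 10 MB is only an
# example threshold; tune it to what your regionservers can handle.
SMALL_FILE_LIMIT = 10 * 1024 * 1024

def store_file(local_path, hbase_table, hdfs_client, hdfs_root="/bigfiles"):
    """Route one local file: small files into hbase, large files onto hdfs.

    hbase_table and hdfs_client are placeholders for the client objects you
    already have; only the routing logic is the point of this sketch.
    """
    size = os.path.getsize(local_path)
    name = os.path.basename(local_path)

    if size <= SMALL_FILE_LIMIT:
        # Small file: the bytes live in an hbase cell, metadata in the same row.
        with open(local_path, "rb") as f:
            data = f.read()
        hbase_table.put(name, {"file:data": data, "meta:size": str(size)})
    else:
        # Large file: bytes go to an hdfs path, hbase keeps metadata + a pointer.
        hdfs_path = "%s/%s" % (hdfs_root, name)
        hdfs_client.upload(hdfs_path, local_path)
        hbase_table.put(name, {"meta:hdfs_path": hdfs_path, "meta:size": str(size)})

The upside of this layout is exactly what's described above: millions of small
logical files collapse into a comparatively small number of HFiles, so the
namenode only has to track the large files plus whatever hbase itself writes.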


Take care,
  -stu

--- On Wed, 2/2/11, Gaurav Sharma <gaurav.gs.sha...@gmail.com> wrote:


>From: Gaurav Sharma <gaurav.gs.sha...@gmail.com>
>Subject: Re: HDFS without Hadoop: Why?
>To: hdfs-user@hadoop.apache.org
>Date: Wednesday, February 2, 2011, 9:31 PM
>
>
>Stuart - if Dhruba is giving hdfs file and block sizes used by the namenode, 
>you really cannot get a more authoritative number elsewhere :) I would do the 
>back-of-envelope with ~160 bytes/file and ~150 bytes/block.
>
>
>On Wed, Feb 2, 2011 at 9:08 PM, Stuart Smith <stu24m...@yahoo.com> wrote:
>
>>
>>This is the best coverage I've seen from a source that would know:
>>
>>http://developer.yahoo.com/blogs/hadoop/posts/2010/05/scalability_of_the_hadoop_dist/
>>
>>One relevant quote:
>>
>>To store 100 million files (referencing 200 million blocks), a name-node 
>>should have at least 60 GB of RAM.
>>
>>But, honestly, if you're just building out your cluster, you'll probably run 
>>into a lot of other limits first: hard drive space, regionserver memory, the 
>>infamous ulimit/xciever :), etc.
>>
>>Take care,
>>  -stu
>>
>>--- On Wed, 2/2/11, Dhruba Borthakur <dhr...@gmail.com> wrote:
>>
>>>From: Dhruba Borthakur <dhr...@gmail.com>
>>>
>>>Subject: Re: HDFS without Hadoop: Why?
>>>
>>>To: hdfs-user@hadoop.apache.org
>>>Date: Wednesday, February 2, 2011, 9:00 PM
>>>
>>>
>>>
>>>The Namenode uses around 160 bytes/file and 150 bytes/block in HDFS. This is 
>>>a very rough calculation.
>>>
>>>
>>>dhruba
>>>
>>>
>>>On Wed, Feb 2, 2011 at 5:11 PM, Dhodapkar, Chinmay <chinm...@qualcomm.com> 
>>>wrote:
>>>
>>>>
>>>>
>>>>
>>>>
>>>>What you describe is pretty much my use case as well. Since I don’t know 
>>>>how big the number of files could get, I am trying to figure out if there 
>>>>is a theoretical design limitation in hdfs…
>>>> 
>>>>From what I have read, the name node will store all metadata of all files 
>>>>in RAM. Assuming (in my case) that every file is smaller than the configured 
>>>>block size, there should be a very rough formula that can be used to 
>>>>calculate the max number of files that hdfs can serve based on the 
>>>>configured RAM on the name node?
>>>> 
>>>>Can any of the implementers comment on this? Am I even thinking on the 
>>>>right track…?
>>>> 
>>>>Thanks Ian for the haystack link…very informative indeed.
>>>> 
>>>>-Chinmay
>>>> 
>>>> 
>>>> 
>>>>From: Stuart Smith [mailto:stu24m...@yahoo.com]
>>>>
>>>>Sent: Wednesday, February 02, 2011 4:41 PM
>>>>
>>>>To: hdfs-user@hadoop.apache.org
>>>>
>>>>Subject: RE: HDFS without Hadoop: Why?
>>>> 
>>>>Hello,
>>>>   I'm actually using hbase/hadoop/hdfs for lots of small files (with a 
>>>> long tail of larger files). Well, millions of small files - I don't 
>>>> know what you mean by lots :)
>>>>
>>>>Facebook probably knows better, but what I do is:
>>>>  - store metadata in hbase
>>>>  - files smaller than 10 MB or so in hbase
>>>>  - larger files in an hdfs directory tree. 
>>>>
>>>>I started storing 64 MB files and smaller in hbase (chunk size), but 
>>>>that caused issues with regionservers when running M/R jobs. This is 
>>>>related to the fact that I'm running a cobbled-together cluster & my 
>>>>region servers don't have that much memory. I would play with the size 
>>>>to see what works for you.
>>>>
>>>>Take care, 
>>>>   -stu
>>>>
>>>>--- On Wed, 2/2/11, Dhodapkar, Chinmay <chinm...@qualcomm.com> wrote:
>>>>
>>>>From: Dhodapkar, Chinmay <chinm...@qualcomm.com>
>>>>Subject: RE: HDFS without Hadoop: Why?
>>>>To: "hdfs-user@hadoop.apache.org" <hdfs-user@hadoop.apache.org>
>>>>Date: Wednesday, February 2, 2011, 7:28 PM
>>>>Hello,
>>>> 
>>>>I have been following this thread for some time now. I am very comfortable 
>>>>with the advantages of hdfs, but still have lingering questions about the 
>>>>usage of hdfs for general purpose
>>>> storage (no mapreduce/hbase etc).
>>>> 
>>>>Can somebody shed light on what the limitations are on the number of files 
>>>>that can be stored? Is it limited in any way by the namenode? The use case I 
>>>>am interested in is storing a very large number of relatively small files 
>>>>(1MB to 25MB).
>>>> 
>>>>Interestingly, I saw a facebook presentation on how they use hbase/hdfs 
>>>>internally. They seem to store all metadata in hbase and the actual 
>>>>images/files/etc in something called “haystack” (why not use hdfs since 
>>>>they already have it?). Does anybody know what “haystack” is?
>>>> 
>>>>Thanks!
>>>>Chinmay
>>>> 
>>>> 
>>>> 
>>>>From: Jeff Hammerbacher [mailto:ham...@cloudera.com]
>>>>
>>>>Sent: Wednesday, February 02, 2011 3:31 PM
>>>>To: hdfs-user@hadoop.apache.org
>>>>Subject: Re: HDFS without Hadoop: Why?
>>>> 
>>>>> * Large block size wastes space for small files. The minimum file 
>>>>>   size is 1 block.
>>>>That's incorrect. If a file is smaller than the block size, it will only 
>>>>consume as much space as there is data in the file.
>>>>
>>>>> * There are no hardlinks, softlinks, or quotas.
>>>>That's incorrect; there are quotas and softlinks. 
>>>> 
>>>
>>>
>>>-- 
>>>Connect to me at http://www.facebook.com/dhruba
>>> 
>>
> 


