Thanks a lot :) ...... Imran
On Fri, Nov 13, 2009 at 9:19 AM, Ryan Rawson <[email protected]> wrote:
> HBase is semi-column oriented. Column families are the storage model:
> everything in a column family is stored linearly in a file in HDFS.
> That means accessing data from a single column family is really cheap
> and easy. Adding more column families adds more files; it has the
> performance profile of adding new tables, except you don't actually
> have additional tables, so the conceptual complexity stays low.
>
> Data is stored at the "intersection" of the row id and the column
> family + qualifier. This is sometimes called a "Cell" and contains a
> timestamp as well. You can have multiple versions, all timestamped.
> The timestamp is by default the int64 Java system time in
> milliseconds. I have to recommend against setting the timestamp
> explicitly if you can avoid it. So when you retrieve a row, you can
> get everything, a list of column qualifiers, a list of families, or
> any combination (e.g. a list of these qualifiers out of family A and
> everything from family B).
>
> One problem with MySQL is that large values tend to be pushed aside
> and require double block seeks to get them. HBase can help you avoid
> this by giving you column families, e.g. store the giant data in one
> family and the smaller metadata in a second.
>
> The terms to use are:
> - Column family (or just family): the unit of locality in HBase.
>   Everything in a family is stored in one file (or a set of files). A
>   table is a name and a list of families with attributes for those
>   families (e.g. compression). A family name is a string.
> - Column qualifier (or just qualifier): allows you to store multiple
>   values for the same row in one family. This value is a byte array
>   and can be anything. The API converts null => new byte[0]. This is
>   the tricky bit, since most people don't think of "column names" as
>   being dynamic.
> - Cell: the old name for a value + timestamp.
> The new API (see class Result) doesn't use this term; instead it
> provides a different path to read data.
>
> You can use HBase as a normal datastore and use static names for the
> qualifiers, and that is just fine. But if you need something special
> to get past the lack of relations, you can start to do fun things
> with the qualifier as data, such as building a secondary index. The
> row key would be the secondary value (e.g. city), the qualifier would
> be the primary key (e.g. userid), and the value would be a
> placeholder to indicate that the value exists.
>
> On Thu, Nov 12, 2009 at 6:04 PM, Imran M Yousuf <[email protected]> wrote:
>> On Fri, Nov 13, 2009 at 9:00 AM, Ryan Rawson <[email protected]> wrote:
>>> HBase does at least 3 things that traditional databases have a hard
>>> time with:
>>>
>>> - Large blobs of data. MySQL is particularly guilty of not handling
>>>   this well.
>>> - Tables that grow to be larger than reasonably priced single
>>>   machines.
>>> - Write loads that are not compatible with master-slave replication.
>>>
>>> The 2nd and 3rd are very interesting, since you either have to pay
>>> for something like Oracle RAC, or start sharding.
>>>
>> Exactly, and since the contents will be blob data and my experience
>> with RDBMS blobs suggests that scaling is proportional to *BIG*
>> money, I am eager to take the HBase path. I was actually praying and
>> hoping you would join this thread :). Can you please elaborate on
>> Column Family, Column and Cell and their basic use cases?
>>
>> Thanks a lot,
>>
>> Imran
>>
>>> On Thu, Nov 12, 2009 at 5:58 PM, Imran M Yousuf <[email protected]> wrote:
>>>> On Thu, Nov 12, 2009 at 10:50 PM, Chris Bates
>>>> <[email protected]> wrote:
>>>>> Hi Imran,
>>>>>
>>>>> I'm a new user as well. I found these presentations helpful in
>>>>> answering most of your questions:
>>>>> http://wiki.apache.org/hadoop/HBase/HBasePresentations
>>>>>
>>>>> There are HBase schema designs in there.
>>>>
>>>> I read them, but without the speaker's explanation the schema parts
>>>> remain unexplained for a newbie like me. I was looking for more
>>>> concrete definitions of column family, column, cell etc. and their
>>>> use cases. I guess I will have to learn them by experimenting.
>>>>
>>>>> You might also want to read the original BigTable paper and the
>>>>> chapter on HBase in O'Reilly's Hadoop book.
>>>>>
>>>>> But to answer one of your questions: "Big Data" usually refers to
>>>>> a dataset that is millions to billions of rows in length. But "Big
>>>>> Data" doesn't mean you have to use a tool like HBase. We have some
>>>>> MySQL tables that are 100 million rows and work fine. You have to
>>>>> identify what works best for your use and use the most appropriate
>>>>> tool.
>>>>
>>>> Thanks. IMHO, I am sure that HBase is more suitable than MySQL
>>>> simply because of the complexity and cost of scaling an application
>>>> with blob data.
>>>>
>>>> Thanks a lot,
>>>>
>>>> Imran
>>>>
>>>>> On Thu, Nov 12, 2009 at 9:13 AM, Imran M Yousuf <[email protected]> wrote:
>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> I am absolutely new to HBase. All I have done is read the
>>>>>> documentation and presentations and get a single instance up and
>>>>>> running. I am starting on a Content Management System which will
>>>>>> be used as a backend for multiple web applications of different
>>>>>> natures. In the CMS:
>>>>>> * Users can define their own content, known as a content type.
>>>>>> * Content can have one-to-many, one-to-one and many-to-many
>>>>>>   relationships with other content.
>>>>>> * Content fields should be versioned.
>>>>>> * Content types can change at runtime, i.e. fields (a.k.a.
>>>>>>   columns in HBase) can be added; removal will not be allowed
>>>>>>   just yet.
>>>>>> * Every content type will have a corresponding grammar to
>>>>>>   validate content of its type.
>>>>>> * It will have authentication and authorization.
>>>>>> * It will have full text search based on Lucene/Katta.
>>>>>>
>>>>>> Based on these requirements I have the following questions that I
>>>>>> would like feedback on:
>>>>>> * Reading articles and presentations, HBase looks to be a perfect
>>>>>>   match as it supports multi-dimensional rows, versioned cells,
>>>>>>   and dynamic schema modification. But I could not understand the
>>>>>>   definition of "Big Data" - that is, if a content size is
>>>>>>   roughly 1~100 kB (field/cell size 0~100 kB), is HBase meant for
>>>>>>   such uses?
>>>>>> * Since I am not sure how much load the site will have, I am
>>>>>>   planning to set up DN+RS on Rackspace cloud instances with
>>>>>>   2 GB/80 GB HDD, with a view that, as revenue and pageviews
>>>>>>   increase, more moderate "commodity" hardware can be added
>>>>>>   progressively. Any comments/suggestions on this strategy?
>>>>>> * Where can I read up on or check out sample RDBMS schemas
>>>>>>   converted to HBase schemas? Basically, I want to read up on
>>>>>>   efficient schema design for different cardinal relationships
>>>>>>   between objects.
>>>>>>
>>>>>> Thank you,
>>>>>>
>>>>>> --
>>>>>> Imran M Yousuf
>>>>>> Entrepreneur & Software Engineer
>>>>>> Smart IT Engineering
>>>>>> Dhaka, Bangladesh
>>>>>> Email: [email protected]
>>>>>> Blog: http://imyousuf-tech.blogs.smartitengineering.com/
>>>>>> Mobile: +880-1711402557

--
Imran M Yousuf
Entrepreneur & Software Engineer
Smart IT Engineering
Dhaka, Bangladesh
Email: [email protected]
Blog: http://imyousuf-tech.blogs.smartitengineering.com/
Mobile: +880-1711402557
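[Editor's note] Ryan's description above - a value lives at the intersection of row key, column family, and qualifier, versioned by an int64 millisecond timestamp, with reads returning the newest version by default - can be sketched as a toy in-memory model. This is plain Python for illustration only, not the real HBase client API; all names here are made up:

```python
import time
from collections import defaultdict

# Toy model of HBase's logical layout (illustrative, not the real API):
# (row, family, qualifier) -> {timestamp: value}, multiple versions per cell.
table = defaultdict(dict)

def put(row, family, qualifier, value, ts=None):
    # Default timestamp is int64 system millis, as Ryan describes.
    ts = ts if ts is not None else int(time.time() * 1000)
    table[(row, family, qualifier)][ts] = value
    return ts

def get(row, family, qualifier):
    """Return the newest version of a cell, mirroring the default read."""
    versions = table.get((row, family, qualifier), {})
    return versions[max(versions)] if versions else None

# Two timestamped versions of the same cell; a plain read sees the latest.
put("user1", "meta", "city", b"Dhaka", ts=1)
put("user1", "meta", "city", b"Chittagong", ts=2)
print(get("user1", "meta", "city"))  # b'Chittagong'
```

The point of the model is the coordinate system: the "column name" (qualifier) is just part of the key, which is why it can be dynamic data rather than a fixed schema element.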
