HBase is semi-column oriented. Column families are the storage model -
everything in a column family is stored together, linearly, in its own
file(s) in HDFS. That means accessing data from a single column family
is really cheap and easy. Adding more column families adds more files -
it has the performance profile of adding new tables, except you don't
actually have additional tables, so the conceptual complexity stays low.

Data is stored at the "intersection" of the row key and the column
family + qualifier.  This is sometimes called a "cell", and it carries
a timestamp as well. You can keep multiple versions, each with its own
timestamp. The timestamp defaults to an int64 holding the Java system
time in milliseconds. I recommend against setting the timestamp
explicitly if you can avoid it. When you retrieve a row, you can ask
for everything, a list of column qualifiers, a list of families, or
any combination.  (eg: these qualifiers out of family A plus
everything from family B)
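For example, here's a rough sketch of such a read with the 0.20-era
Java client; the table, family and qualifier names ("content", "meta",
"data", "title", "author") are made up for illustration:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class GetExample {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable(new HBaseConfiguration(), "content");

    Get get = new Get(Bytes.toBytes("row-123"));
    // these qualifiers out of family A ("meta") ...
    get.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("title"));
    get.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("author"));
    // ... and everything from family B ("data")
    get.addFamily(Bytes.toBytes("data"));
    get.setMaxVersions(3);  // ask for up to 3 timestamped versions per cell

    Result result = table.get(get);
    byte[] title = result.getValue(Bytes.toBytes("meta"), Bytes.toBytes("title"));
    System.out.println(Bytes.toString(title));
  }
}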

One problem with MySQL is that large values tend to be pushed aside
into separate storage and require a second block seek to get them.
HBase can help you avoid this by giving you column families, eg: store
the giant data in one family and the smaller metadata in another.
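A sketch of a write using two families for that split (the family
names "data"/"meta" and the row key are again invented for the
example):

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutExample {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable(new HBaseConfiguration(), "content");

    byte[] bigBlob = new byte[50 * 1024];  // stand-in for the large value

    Put put = new Put(Bytes.toBytes("doc-42"));
    // the giant data goes in its own family ...
    put.add(Bytes.toBytes("data"), Bytes.toBytes("body"), bigBlob);
    // ... so reads that only need the small metadata never touch the blob files
    put.add(Bytes.toBytes("meta"), Bytes.toBytes("title"), Bytes.toBytes("hello"));
    put.add(Bytes.toBytes("meta"), Bytes.toBytes("author"), Bytes.toBytes("imran"));
    table.put(put);
  }
}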

The terms to use are:
- Column family (or just family): the unit of locality in HBase.
Everything in a family is stored in one file (or a set of files). A
table is a name plus a list of families, with attributes for those
families (eg: compression) - see the sketch after this list. A family
name is a string.
- Column qualifier (or just qualifier): lets you store multiple values
for the same row in one family. The qualifier is a byte array and can
be anything; the API converts null => new byte[0].  This is the
tricky bit, since most people don't think of "column names" as being
dynamic.
- Cell: the old name for a value + timestamp.  The new API (see:
class Result) doesn't use this term and instead provides a different
path to read the data.
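To make the table/family relationship concrete, here's roughly how
you'd create a table with two families and per-family attributes via
the Java admin API (names invented; the exact setters can differ a bit
between versions):

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.hfile.Compression;

public class CreateTableExample {
  public static void main(String[] args) throws IOException {
    HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());

    // a table is just a name plus a list of families with attributes
    HTableDescriptor desc = new HTableDescriptor("content");

    HColumnDescriptor meta = new HColumnDescriptor("meta");
    meta.setMaxVersions(10);                             // keep 10 timestamped versions

    HColumnDescriptor data = new HColumnDescriptor("data");
    data.setCompressionType(Compression.Algorithm.GZ);   // per-family compression

    desc.addFamily(meta);
    desc.addFamily(data);
    admin.createTable(desc);
  }
}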

You can use HBase as a normal datastore and use static names for the
qualifiers, and that is just fine. But if you need something special
to get past the lack of relations, you can start to do fun things with
the qualifier as data - building a secondary index, for example.  The
row key would be the secondary value (eg: city), the qualifier would
be the primary key (eg: userid), and the value would be a placeholder
to indicate that the entry exists.
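A minimal sketch of that secondary-index pattern (the table name
"user_by_city" and family "ids" are hypothetical):

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class SecondaryIndexExample {
  public static void main(String[] args) throws IOException {
    HTable index = new HTable(new HBaseConfiguration(), "user_by_city");

    // row key = secondary value (city), qualifier = primary key (userid),
    // value = empty placeholder that just marks existence
    Put entry = new Put(Bytes.toBytes("dhaka"));
    entry.add(Bytes.toBytes("ids"), Bytes.toBytes("user-4711"), new byte[0]);
    index.put(entry);

    // to find every userid in a city, read the row and walk the qualifier names
    Result r = index.get(new Get(Bytes.toBytes("dhaka")));
    for (byte[] userId : r.getFamilyMap(Bytes.toBytes("ids")).keySet()) {
      System.out.println(Bytes.toString(userId));
    }
  }
}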

On Thu, Nov 12, 2009 at 6:04 PM, Imran M Yousuf <[email protected]> wrote:
> On Fri, Nov 13, 2009 at 9:00 AM, Ryan Rawson <[email protected]> wrote:
>> HBase does at least 3 things that traditional databases have a hard time 
>> with:
>>
>> - Large blobs of data. Mysql is particularly guilty of not handling this 
>> well.
>> - Tables that grow to be larger than reasonably priced single machines.
>> - Write loads that are not compatible with master-slave replication
>>
>> The 2nd and 3rd are very interesting, since you either have to pay for
>> something like Oracle RAC, or start sharding.
>>
>
> Exactly, and since contents will be blob data and my experience with
> RDBMS blob suggests that scaling is proportional to *BIG* money so I
> am eager to take the HBase path. I was actually praying and hoping you
> join this thread :). Can you please elaborate Column Family, Column
> and Cell and their basic use cases?
>
> Thanks a lot,
>
> Imran
>
>> On Thu, Nov 12, 2009 at 5:58 PM, Imran M Yousuf <[email protected]> wrote:
>>> On Thu, Nov 12, 2009 at 10:50 PM, Chris Bates
>>> <[email protected]> wrote:
>>>> Hi Imran,
>>>>
>>>> I'm a new user as well.  I found these presentations helpful in answering
>>>> most of your questions:
>>>> http://wiki.apache.org/hadoop/HBase/HBasePresentations
>>>>
>>>> There are HBase schema designs in there.
>>>>
>>>
>>> I read them, but without the speakers explanation the schema parts
>>> remain unexplained for a dumb newbie like me. I was looking for more
>>> concrete definitions of column family, column, cell etc. and their use
>>> cases. I guess I will have to learn them by experimenting.
>>>
>>>> You might also want to read the original BigTable paper and the chapter on
>>>> HBase in OReilly's Hadoop book.
>>>>
>>>> But to answer one of your questions--"Big Data" usually refers to a dataset
>>>> that is millions to billions in length.  But "Big Data" doesn't mean you
>>>> have to use a tool like HBase.  We have some MySQL tables that are 100
>>>> million rows and work fine.  You have to identify what works best for your
>>>> use and use the most appropriate tool.
>>>
>>> Thanks, IMHO, I am sure that HBase is more suitable than MySQL simply
>>> because of the complexity and cost in scaling an application with Blob
>>> data.
>>>
>>> Thanks a lot,
>>>
>>> Imran
>>>
>>>>
>>>> On Thu, Nov 12, 2009 at 9:13 AM, Imran M Yousuf <[email protected]> wrote:
>>>>
>>>>> Hi!
>>>>>
>>>>> I am absolutely new to HBase. All I have done is to read up
>>>>> documentation, presentation and getting a single instance up and
>>>>> running. I am starting on a Content Management System which will be
>>>>> used as a backend for multiple web applications of different natures.
>>>>> In the CMS:
>>>>> * User can define their content known as content type.
>>>>> * Content can have  one-2-many one-2-one and many-2-many relationship
>>>>> with other contents.
>>>>> * Content fields should be versioned
>>>>> * Content type can change in runtime, i.e. fields (a.k.a. columns in
>>>>> HBase) added and removal will not be allowed just yet.
>>>>> * Every content type will have a corresponding grammer to validate
>>>>> content of its type.
>>>>> * It will have authentication and authorization
>>>>> * It will have full text search based on Lucene/Katta.
>>>>>
>>>>> Based on these requirements I have the following questions that I
>>>>> would like feedback on:
>>>>> * Reading articles and presentations it looks to be HBase is a perfect
>>>>> match as it supports multi-dimensional rows, versioned cells, dynamic
>>>>> schema modification. But I could not understand what is the definition
>>>>> of "Big Data" - that is if a content size is roughly 1~100kB
>>>>> (field/cell size 0~100kB), is HBase meant for such uses?
>>>>> * Since I am not sure how much load the site will have, I am planning
>>>>> to setup DN+RS on Rackspace cloud instances with 2GB/80GB HDD with a
>>>>> view of with revenue and pageviews increasing, more moderate
>>>>> "commodity" hardware can be added progressively. Any
>>>>> comments/suggestions on this strategy?
>>>>> * Where can I read up on or checkout samples RDBMS schemas converted
>>>>> to HBase schema? Basically, I want to read up efficient schema design
>>>>> for different cardinal relationships between objects.
>>>>>
>>>>> Thank you,
>>>>>
>>>>> --
>>>>> Imran M Yousuf
>>>>> Entrepreneur & Software Engineer
>>>>> Smart IT Engineering
>>>>> Dhaka, Bangladesh
>>>>> Email: [email protected]
>>>>> Blog: http://imyousuf-tech.blogs.smartitengineering.com/
>>>>> Mobile: +880-1711402557
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Imran M Yousuf
>>> Entrepreneur & Software Engineer
>>> Smart IT Engineering
>>> Dhaka, Bangladesh
>>> Email: [email protected]
>>> Blog: http://imyousuf-tech.blogs.smartitengineering.com/
>>> Mobile: +880-1711402557
>>>
>>
>
>
>
> --
> Imran M Yousuf
> Entrepreneur & Software Engineer
> Smart IT Engineering
> Dhaka, Bangladesh
> Email: [email protected]
> Blog: http://imyousuf-tech.blogs.smartitengineering.com/
> Mobile: +880-1711402557
>
