Denormalization is one of those things you generally have to be careful about. In the audit case, I don't see the need to denormalize the user's first and last name into the audit table. Only the userid would be necessary - generally it's only when you are digging into specific cases that you need to tie a userid to a human-readable name, and you can make that extra fetch in your web UI.
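A minimal sketch of the pattern described above: the audit row stores only the userid, and the UI pays one extra fetch to resolve a human-readable name. Plain Python dicts stand in for the HBase tables here; the row keys, field names, and helper function are all made up for illustration.

```python
# Dicts standing in for two HBase tables (illustrative only).
users = {
    "u42": {"firstname": "Ada", "lastname": "Lovelace"},
}

# Audit rows store only the userid, not the denormalized name.
audits = {
    "audit-0001": {"userid": "u42", "action": "login"},
}

def audit_with_name(audit_id):
    """Resolve the human-readable name with one extra fetch, UI-side."""
    audit = audits[audit_id]
    user = users[audit["userid"]]  # the extra round-trip
    return {**audit, "name": f"{user['firstname']} {user['lastname']}"}
```

The trade-off is exactly the one discussed in this thread: reads cost one more round-trip, but a user rename touches a single row instead of every audit entry.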
Ditto for action.

Using HBase to the max takes a more webby approach to be fully effective, I think. Instead of using integer ids, use string descriptions for enums. Use usernames instead of userids in the audit log (it tends to be more robust against changes).

Remember, in HBase:
- disk is cheap (the denorm cost is lower)
- strings are not as expensive as in RDBMSs
- no schema means you have extreme flexibility

Good luck!
-ryan

On Thu, Jul 16, 2009 at 11:14 AM, Mr Hoberto <[email protected]> wrote:
> Thanks for the advice. I'm going to keep trying to get my head around
> this, and figure out, as you said, distributing the workload.
> Round-tripping has been verboten in my circle, but I think that's more
> due to false doctrine than actual experience. I think that's going to be
> the way to go for me.
>
> -hob
>
> On Thu, Jul 16, 2009 at 1:42 PM, Jonathan Gray <[email protected]> wrote:
>
>> Hoberto, Bharath,
>>
>> Designing these kinds of queries efficiently in HBase means doing
>> multiple round-trips, or denormalizing.
>>
>> That means degrading performance as the query complexity increases, or
>> lots of data duplication and a complex write/update process.
>>
>> In your audit example, you provide the denormalizing solution. Store
>> the fields you need with the data you are querying (details of the
>> user/action in the audit table with the audit). If you have to update
>> those details, then you have an extra expense on your write (and you
>> introduce a potential synchronization issue without transactions).
>>
>> The choice about how to solve this really depends on the use case and
>> what your requirements are. Can you ever afford to miss an update in
>> one of the denormalized fields, even if it is extremely unlikely? You
>> can build transactional layers on top, or you can take a look at
>> TransactionalHBase, which attempts to do this in a more integrated way.
>>
>> You also talk about the other approach, running multiple queries in
>> the application.
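The denormalizing solution described above can be sketched in a few lines: user details are copied into every audit row at write time, so reads need no join, but an update to the user must fan out to every copy. Plain Python dicts stand in for the HBase tables; every name here is illustrative, not real API.

```python
# Dicts standing in for HBase tables (illustrative only).
users = {"u42": {"name": "Ada Lovelace"}}
audits = {}

def write_audit(audit_id, userid, action):
    # Duplicate the user's name into the audit row: the write-time cost
    # of denormalization, paid so that reads need no join.
    audits[audit_id] = {"userid": userid,
                        "name": users[userid]["name"],
                        "action": action}

def rename_user(userid, new_name):
    users[userid]["name"] = new_name
    # The extra expense on update: touch every denormalized copy.
    # Without transactions, a failure mid-loop leaves copies out of
    # sync, which is the synchronization issue mentioned above.
    for row in audits.values():
        if row["userid"] == userid:
            row["name"] = new_name
```

At small scale the fan-out loop is trivial; at audit-log scale it becomes a scan-and-rewrite job, which is why the thread keeps weighing it against extra read-time round-trips.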
>> As far as memory pressure in the app is concerned, that would really
>> depend on the nature of the join. It's more an issue of how many joins
>> you need to make, and whether there's any way to reduce the number of
>> queries/trips needed.
>>
>> If I am pulling the most recent 10 audits, and I need to join each with
>> both the User and Actions tables, then we're talking about 1 + 10 + 10
>> total queries. That's not so pretty, but if done in a distributed or
>> threaded way it may not be too bad. In the future, I expect more and
>> more tools/frameworks to be available to aid in that process.
>>
>> Today, this is up to you.
>>
>> At Streamy, we solve these problems with layers above HBase. Some of
>> them keep lots of stuff in-memory and do complex joins in-memory.
>> Others coordinate the multiple queries to HBase, with or without an
>> OCC-style transaction.
>>
>> My suggestion is to start denormalized. Build naive queries that do
>> lots of round-trips. See what the performance is like under different
>> conditions and then go from there. Perhaps Actions are generally
>> immutable, their name never changes, so you could denormalize that
>> field and cut out half of the total queries. Have a pool of threads
>> that grab Users so you can do the join in parallel. Depending on your
>> requirements, this might be sufficient. Otherwise look at more
>> denormalization, or building a thin layer above.
>>
>> JG
>>
>> Mr Hoberto wrote:
>>
>>> I can think of two cases that I've been wondering about (I am very
>>> new, and am still reading the docs & archives, so I apologize if this
>>> has been already covered or if I use the wrong notation... I'm still
>>> learning).
>>>
>>> First case:
>>>
>>> Tracking audits.
>>> In the RDBMS world you'd have the following schema:
>>>
>>> User (userid, firstname, lastname)
>>> Actions (actionid, actionname)
>>> Audit (auditTime, userid, actionid)
>>>
>>> I think the answer in the HBase world is to denormalize the data, with
>>> a structure such as:
>>>
>>> audits (auditid, audittime[timestamp], whowhat[family (firstname,
>>> lastname, actionname)])
>>>
>>> The problem happens, as Bharath says, when a firstname or lastname
>>> needs to be updated. Running a correction on all those denormalized
>>> rows is going to be problematic.
>>>
>>> Alternatively, I suppose you could store the User and Actions tables
>>> separately, keep the audits structure in HBase storing only ids, and
>>> use the website's application layer to "merge" the different data sets
>>> together for display on a page. The downside there is that if you wind
>>> up with a significant number of users or actions, it'll put a lot of
>>> memory pressure on the app servers.
>>>
>>> Second case:
>>>
>>> Doing analysis on two time-series-based data structures, such as a "PE
>>> ratio".
>>>
>>> In the RDBMS world you'd have two tables:
>>>
>>> Prices (ticker, date, price)
>>> Earnings (ticker, date, earning)
>>>
>>> Again, I think the answer is denormalizing in the HBase world, with a
>>> structure such as:
>>>
>>> PEs (date, timestamp, PERatio[family (ticker, PEvalue)])
>>>
>>> The problem here comes, again, with updates. For instance, what if you
>>> only have earnings information available on an annual basis, and you
>>> come across a source that has it quarterly... You'll have to update
>>> 3/4 of the rows in the denormalized table.
>>>
>>> Once again, I apologize for any sort of misunderstanding... I'm still
>>> learning the concepts behind column stores and map/reduce.
>>>
>>> -hob
>>>
>>> On Thu, Jul 16, 2009 at 11:19 AM, Jonathan Gray <[email protected]>
>>> wrote:
>>>
>>>> The answer is, it depends.
>>>> What are the details of what you are trying to join? Is it just a
>>>> simple 1-to-1 join, or 1-to-many, or what? At a minimum, the join
>>>> would require two round-trips. However, 0.20 can do simple queries in
>>>> the 1-10ms range (closer to 1ms when the blocks are already cached).
>>>>
>>>> The comparison to an RDBMS cannot be made directly, because a
>>>> single-node RDBMS with a smallish table will be quite fast at simple
>>>> index-based joins. I would guess that the unloaded, single-machine
>>>> performance of this join operation would be much faster in an RDBMS.
>>>>
>>>> But if your table has millions or billions of rows, it's a different
>>>> situation. HBase performance will stay nearly constant as your table
>>>> grows, as long as you have the nodes to support your dataset and the
>>>> load.
>>>>
>>>> What are your targets for time (sub 100ms? 10ms?), and what are the
>>>> details of what you're joining?
>>>>
>>>> As far as code is concerned, there is not much to a simple join, so
>>>> I'm not sure how helpful it would be. If you give some detail,
>>>> perhaps I can provide some pseudo-code for you.
>>>>
>>>> JG
>>>>
>>>> bharath vissapragada wrote:
>>>>
>>>>> JG, thanks for your reply.
>>>>>
>>>>> Actually, I am trying to implement a realtime join of two tables on
>>>>> HBase. I tried the idea of denormalizing the tables to avoid the
>>>>> joins, but when we do that, updating the data is really difficult. I
>>>>> understand that the features I am trying to implement are those of
>>>>> an RDBMS and HBase is used for a different purpose. Even then I want
>>>>> (rather, I would like to try) to store the data in HBase and
>>>>> implement joins, so that I could test their performance, and if they
>>>>> are effective (at least on a large number of nodes), it may be of
>>>>> some help to me. I know some people have already tried this.
>>>>> If anyone has already tried this, can you just tell me how the
>>>>> results are... I mean, are they good when compared to an RDBMS join
>>>>> on a single machine?
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Wed, Jul 15, 2009 at 8:35 PM, Jonathan Gray <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Bharath,
>>>>>>
>>>>>> You need to outline what your actual requirements are if you want
>>>>>> more help. Open-ended questions that just ask for code are usually
>>>>>> not answered.
>>>>>>
>>>>>> What exactly are you trying to join? Does this join need to happen
>>>>>> in "realtime" or is this part of a batch process?
>>>>>>
>>>>>> Could you denormalize your data to prevent needing the join at
>>>>>> runtime?
>>>>>>
>>>>>> If you provide details about exactly what your data/schema is like
>>>>>> (or a similar example if this is confidential), then many of us are
>>>>>> more than happy to help you figure out what approach may work best.
>>>>>>
>>>>>> When working with HBase, figuring out how you want to pull your
>>>>>> data out is key to how you want to put the data in.
>>>>>>
>>>>>> JG
>>>>>>
>>>>>> bharath vissapragada wrote:
>>>>>>
>>>>>>> Amandeep, can you tell me what kinds of joins you have
>>>>>>> implemented, and which works the best (based on observation)? Can
>>>>>>> you show us the source code (if possible)?
>>>>>>>
>>>>>>> Thanks in advance
>>>>>>>
>>>>>>> On Wed, Jul 15, 2009 at 10:46 AM, Amandeep Khurana
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> I've been doing joins by writing my own MR jobs. That works best.
>>>>>>>>
>>>>>>>> Not tried cascading yet.
>>>>>>>>
>>>>>>>> -ak
>>>>>>>>
>>>>>>>> On 7/14/09, bharath vissapragada <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> That's fine... I know that HBase has completely different usage
>>>>>>>>> compared to SQL.
>>>>>>>>> But for my application there is some kind of dependency involved
>>>>>>>>> among the tables, so I need to implement a join. I wanted to
>>>>>>>>> know whether there is some kind of implementation already.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> On Wed, Jul 15, 2009 at 10:30 AM, Ryan Rawson
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> HBase != SQL.
>>>>>>>>>>
>>>>>>>>>> You might want map reduce or cascading.
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 14, 2009 at 9:56 PM, bharath
>>>>>>>>>> vissapragada <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> I want to join (similar to a relational database join) two
>>>>>>>>>>> tables in HBase. Can anyone tell me whether it is already
>>>>>>>>>>> implemented in the source!
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance
>>>>>>>>
>>>>>>>> --
>>>>>>>> Amandeep Khurana
>>>>>>>> Computer Science Graduate Student
>>>>>>>> University of California, Santa Cruz
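Pulling the thread's suggestions together, the 1 + 10 + 10 client-side join (one scan for the recent audits, then a pool of threads grabbing the User and Action rows in parallel) might be sketched like this. Dicts and a sleep() merely fake the per-row HBase round-trips; all table contents and names are made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Fake tables standing in for HBase (illustrative data).
users = {f"u{i}": {"name": f"user-{i}"} for i in range(10)}
actions = {f"a{i}": {"name": f"action-{i}"} for i in range(10)}
# Result of the first query: the 10 most recent audit rows.
audits = [{"userid": f"u{i}", "actionid": f"a{i}"} for i in range(10)]

def get(table, key):
    time.sleep(0.01)  # stand-in for one HBase round-trip
    return table[key]

def joined_audits():
    # 20 per-row gets issued concurrently instead of sequentially,
    # so wall-clock time is roughly one round-trip, not twenty.
    with ThreadPoolExecutor(max_workers=20) as pool:
        user_futs = [pool.submit(get, users, a["userid"]) for a in audits]
        action_futs = [pool.submit(get, actions, a["actionid"])
                       for a in audits]
        return [{**a,
                 "user": uf.result()["name"],
                 "action": af.result()["name"]}
                for a, uf, af in zip(audits, user_futs, action_futs)]
```

If the Action names turn out to be immutable, denormalizing them into the audit rows would cut this to 1 + 10 queries, which is exactly the incremental approach suggested above: start naive, measure, then denormalize the fields that never change.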
