Re: Join in HBase

Mr Hoberto Thu, 16 Jul 2009 09:42:55 -0700

I can think of two cases that I've been wondering about. (I am very new and
am still reading the docs and archives, so I apologize if this has already
been covered or if I use the wrong notation. I'm still learning.)


First case:

Tracking audits. In the RDBMS world you'd have the following schema:

User (userid, firstname, lastname)
Actions (actionid, actionname)
Audit (auditTime, userid, actionid)

I think the answer in the HBase world is to denormalize the data and have a
structure such as:

audits (auditid, audittime[timestamp], whowhat[family (firstName, lastname,
actionname)])

The problem, as Bharath says, is what happens when a firstName or lastName
needs to be updated. Running a correction on all those denormalized rows is
going to be problematic.
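
To make the update cost concrete, here is a rough sketch in plain Python, with
a dict standing in for the denormalized audits table. The row keys, the sample
names, and the correct_lastname helper are all made up for illustration, and I
have assumed a userid column in each row (not in the structure above) so the
affected rows can be identified at all.

```python
# Hypothetical denormalized audit rows keyed by auditid. The userid field is
# an assumption added so that a user's rows can be found during a correction.
audits = {
    "a1": {"userid": 1, "firstname": "Ada", "lastname": "Byron", "actionname": "login"},
    "a2": {"userid": 2, "firstname": "Alan", "lastname": "Turing", "actionname": "login"},
    "a3": {"userid": 1, "firstname": "Ada", "lastname": "Byron", "actionname": "logout"},
}


def correct_lastname(audits, userid, lastname):
    """Scan every audit row and rewrite those belonging to userid.

    Returns the number of rows rewritten.
    """
    rewritten = 0
    for row in audits.values():
        if row["userid"] == userid:
            row["lastname"] = lastname
            rewritten += 1
    return rewritten


# One logical change (a single user's last name) touches every audit row
# that user ever produced, and the whole table is scanned to find them.
rewritten = correct_lastname(audits, 1, "Lovelace")
```

With three rows this is trivial, but the work grows with the number of audit
rows rather than with the number of users being corrected.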

Alternatively, I suppose you could store the User and Actions tables
separately, keep the audits structure in HBase storing only IDs, and use the
website's application layer to "merge" the different data sets together for
display on a page. The downside there is that if you wind up with a
significant number of users or actions, it'll put a lot of memory pressure
on the app servers.
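
A minimal sketch of that application-layer merge, again in plain Python with
dicts and a list standing in for the three stores. Nothing here is HBase API;
the merge_audits name and the sample data are hypothetical, and in a real
setup each dict lookup would be a separate read against the store.

```python
# Hypothetical in-memory stand-ins for the three stores. In a real setup each
# dict lookup would be a get against HBase or wherever User/Actions live.
users = {
    1: {"firstname": "Ada", "lastname": "Byron"},
    2: {"firstname": "Alan", "lastname": "Turing"},
}
actions = {10: "login", 11: "logout"}
audits = [
    {"audittime": 1247762575, "userid": 1, "actionid": 10},
    {"audittime": 1247762590, "userid": 2, "actionid": 10},
    {"audittime": 1247762600, "userid": 1, "actionid": 11},
]


def merge_audits(audits, users, actions):
    """Resolve the IDs in each audit row into display fields."""
    merged = []
    for row in audits:
        user = users[row["userid"]]
        merged.append({
            "audittime": row["audittime"],
            "firstname": user["firstname"],
            "lastname": user["lastname"],
            "actionname": actions[row["actionid"]],
        })
    return merged


# A name correction is a single write to the user record; the audit rows
# are untouched and the next merge picks up the new value.
users[1]["lastname"] = "Lovelace"
page = merge_audits(audits, users, actions)
```

The trade-off is visible in the sketch: the update is cheap, but the app
server has to hold (or fetch) the users and actions it needs for every page.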

Second case:

Doing analysis across two time-series data structures, such as computing a
"PE Ratio".

In the RDBMS world you'd have two tables:

Prices (ticker, date, price)
Earnings (ticker, date, earning)

Again, I think the answer is denormalizing in the HBase world, with a
structure such as:

PEs (date, timestamp, PERatio[family (ticker, PEvalue)])

The problem here comes, again, with updates. For instance, what if you only
have earnings information available on an annual basis, and you then come
across a source that has it quarterly? You'll have to update 3/4 of the
rows in the denormalized table.
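
Here is a small sketch of where the 3/4 comes from, in plain Python with
made-up numbers for a single ticker. It assumes each price is paired with the
most recent earnings figure dated on or before it; the pe_rows helper and all
of the values are hypothetical.

```python
from bisect import bisect_right


def pe_rows(prices, earnings):
    """Pair each dated price with the latest earnings on or before that date
    and return {date: price / earnings}."""
    dates = sorted(earnings)
    rows = {}
    for date, price in prices.items():
        i = bisect_right(dates, date)
        if i == 0:
            continue  # no earnings figure known yet for this date
        rows[date] = price / earnings[dates[i - 1]]
    return rows


# Made-up data for one ticker. ISO date strings sort chronologically.
prices = {
    "2008-02-15": 20.0,
    "2008-05-15": 24.0,
    "2008-08-15": 30.0,
    "2008-11-15": 18.0,
}
annual = {"2008-01-01": 2.0}
quarterly = {
    "2008-01-01": 2.0,
    "2008-04-01": 3.0,
    "2008-07-01": 2.5,
    "2008-10-01": 1.5,
}

before = pe_rows(prices, annual)
after = pe_rows(prices, quarterly)

# Dates whose stored PE value would have to be rewritten after the switch.
changed = [d for d in prices if before[d] != after[d]]
```

Only the first quarter keeps its value, so three of the four stored PE rows
change once the quarterly source is loaded.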

Once again, I apologize for any misunderstanding. I'm still learning the
concepts behind column stores and map/reduce.

-hob


On Thu, Jul 16, 2009 at 11:19 AM, Jonathan Gray <[email protected]> wrote:

> The answer is, it depends.
>
> What are the details of what you are trying to join?  Is it just a simple
> 1-to-1 join, or 1-to-many or what?  At a minimum, the join would require two
> round-trips.  However, 0.20 can do simple queries in the 1-10ms time-range
> (closer to 1ms when the blocks are already cached).
>
> The comparison to an RDBMS cannot be made directly because a single-node
> RDBMS with a smallish table will be quite fast at simple index-based joins.
>  I would guess that unloaded, single machine performance of this join
> operation would be much faster in an RDBMS.
>
> But if your table has millions or billions of rows, it's a different
> situation.  HBase performance will stay nearly constant as your table
> increases, as long as you have the nodes to support your dataset and the
> load.
>
> What are your targets for time (sub 100ms? 10ms?), and what are the details
> of what you're joining?
>
> As far as code is concerned, there is not much to a simple join, so I'm not
> sure how helpful it would be.  If you give some detail perhaps I can provide
> some pseudo-code for you.
>
> JG
>
>
> bharath vissapragada wrote:
>
>> JG, thanks for your reply.
>>
>> Actually I am trying to implement a realtime join of two tables on HBase.
>> I tried the idea of denormalizing the tables to avoid the joins, but when
>> we do that, updating the data is really difficult. I understand that the
>> features I am trying to implement are those of an RDBMS and that HBase is
>> used for a different purpose. Even then I would like to try to store the
>> data in HBase and implement joins so that I can test the performance, and
>> if it is effective (at least on a large number of nodes), it may be of
>> some help to me. I know some people have already tried this. If any of
>> you have already tried this, can you tell me how the results are? I mean,
>> are they good when compared to an RDBMS join on a single machine?
>>
>> Thanks
>>
>> On Wed, Jul 15, 2009 at 8:35 PM, Jonathan Gray <[email protected]> wrote:
>>
>>> Bharath,
>>>
>>> You need to outline what your actual requirements are if you want more
>>> help.  Open-ended questions that just ask for code are usually not
>>> answered.
>>>
>>> What exactly are you trying to join?  Does this join need to happen in
>>> "realtime" or is this part of a batch process?
>>>
>>> Could you denormalize your data to prevent needing the join at runtime?
>>>
>>> If you provide details about exactly what your data/schema is like (or a
>>> similar example if this is confidential), then many of us are more than
>>> happy to help you figure out what approach may work best.
>>>
>>> When working with HBase, figuring out how you want to pull your data out
>>> is key to how you want to put the data in.
>>>
>>> JG
>>>
>>>
>>> bharath vissapragada wrote:
>>>
>>>> Amandeep, can you tell me what kinds of joins you have implemented, and
>>>> which works best (based on observation)? Can you show us the source code
>>>> (if possible)?
>>>>
>>>> Thanks in advance
>>>>
>>>> On Wed, Jul 15, 2009 at 10:46 AM, Amandeep Khurana <[email protected]>
>>>> wrote:
>>>>
>>>>> I've been doing joins by writing my own MR jobs. That works best.
>>>>> Not tried cascading yet.
>>>>>
>>>>> -ak
>>>>>
>>>>> On 7/14/09, bharath vissapragada <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> That's fine. I know that HBase has completely different usage compared
>>>>>> to SQL. But for my application there is some kind of dependency
>>>>>> involved among the tables, so I need to implement a join. I wanted to
>>>>>> know whether there is some kind of implementation already.
>>>>>>
>>>>>> Thanks
>>>>>> On Wed, Jul 15, 2009 at 10:30 AM, Ryan Rawson <[email protected]> wrote:
>>>>>>
>>>>>>> HBase != SQL.
>>>>>>>
>>>>>>> You might want map reduce or cascading.
>>>>>>>
>>>>>>> On Tue, Jul 14, 2009 at 9:56 PM, bharath
>>>>>>> vissapragada<[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I want to join (similar to a relational database join) two tables in
>>>>>>>> HBase. Can anyone tell me whether it is already implemented in the
>>>>>>>> source?
>>>>>>>>
>>>>>>>> Thanks in advance
>>>>>>>>
>>>>> --
>>>>>
>>>>>
>>>>> Amandeep Khurana
>>>>> Computer Science Graduate Student
>>>>> University of California, Santa Cruz
>>>>>
>>>>>
>>>>>
>>
