I can think of two cases that I've been wondering about (I am very new and am still reading the docs & archives, so I apologize if this has already been covered or if I use the wrong notation; I'm still learning).
First case: Tracking audits. In the RDBMS world you'd have the following schema:

    User (userid, firstname, lastname)
    Actions (actionid, actionname)
    Audit (auditTime, userid, actionid)

I think the answer in the HBase world is to denormalize the data and have a structure such as:

    audits (auditid, audittime[timestamp], whowhat[family (firstName, lastname, actionname)])

The problem, as Bharath says, is what happens if a firstName or lastName needs to be updated. Running a correction on all those denormalized rows is going to be problematic.

Alternatively, I suppose you could store the User and Actions tables separately, keep the audits structure in HBase storing only IDs, and use the website's application layer to "merge" the different data sets together for display on a page. The downside there is that if you wind up with a significant number of users or actions, it'll put a lot of memory pressure on the app servers.

Second case: Doing analysis on two time-series based data structures, such as a "PE Ratio". In the RDBMS world you'd have two tables:

    Prices (ticker, date, price)
    Earnings (ticker, date, earning)

Again, I think the answer is denormalizing in the HBase world, with a structure such as:

    PEs (date, timestamp, PERatio[family (ticker, PEvalue)])

The problem here comes, again, with updates. For instance, what if you only have earnings information available on an annual basis, and you've come across a source that has it quarterly? You'll have to update 3/4 of the rows in the denormalized table.

Once again, I apologize for any sort of misunderstanding. I'm still learning the concepts behind column stores and map/reduce.

-hob

On Thu, Jul 16, 2009 at 11:19 AM, Jonathan Gray <[email protected]> wrote:

> The answer is, it depends.
>
> What are the details of what you are trying to join? Is it just a simple
> 1-to-1 join, or 1-to-many or what? At a minimum, the join would require two
> round-trips.
> However, 0.20 can do simple queries in the 1-10ms time-range
> (closer to 1ms when the blocks are already cached).
>
> The comparison to an RDBMS cannot be made directly because a single-node
> RDBMS with a smallish table will be quite fast at simple index-based joins.
> I would guess that unloaded, single machine performance of this join
> operation would be much faster in an RDBMS.
>
> But if your table has millions or billions of rows, it's a different
> situation. HBase performance will stay nearly constant as your table
> increases, as long as you have the nodes to support your dataset and the
> load.
>
> What are your targets for time (sub 100ms? 10ms?), and what are the details
> of what you're joining?
>
> As far as code is concerned, there is not much to a simple join, so I'm not
> sure how helpful it would be. If you give some detail perhaps I can provide
> some pseudo-code for you.
>
> JG
>
>
> bharath vissapragada wrote:
>
>> JG thanks for ur reply,
>>
>> Actually iam trying to implement a realtime join of two tables on HBase .
>> Actually i tried the idea of denormalizing the tables to avoid the Joins ,
>> but when we do that Updating the data is really difficult . I understand
>> that the features i am trying to implement are that of a RDBMS and HBase
>> is
>> used for a different purpose . Even then i want (rather i would like to
>> try)
>> to store the data the data in HBase and implement Joins so that i could
>> test its performance and if its effective (atleast on large number of
>> nodes)
>> , it maybe of somehelp to me . I know some ppl have already tried this .
>> If
>> anyone of already tried this can you just tellme how the results are .. i
>> mean are they good , when compared to RDBMS join on a single machine ...
>>
>> Thanks
>>
>> On Wed, Jul 15, 2009 at 8:35 PM, Jonathan Gray <[email protected]> wrote:
>>
>> Bharath,
>>>
>>> You need to outline what your actual requirements are if you want more
>>> help.
>>> Open-ended questions that just ask for code are usually not
>>> answered.
>>>
>>> What exactly are you trying to join? Does this join need to happen in
>>> "realtime" or is this part of a batch process?
>>>
>>> Could you denormalize your data to prevent needing the join at runtime?
>>>
>>> If you provide details about exactly what your data/schema is like (or a
>>> similar example if this is confidential), then many of us are more than
>>> happy to help you figure out what approach my work best.
>>>
>>> When working with HBase, figuring out how you want to pull your data out
>>> is
>>> key to how you want to put the data in.
>>>
>>> JG
>>>
>>>
>>> bharath vissapragada wrote:
>>>
>>> Amandeep , can you tell me what kinds of joins u have implemented ? and
>>>> which works the best (based on observation ).. Can u show us the source
>>>> code
>>>> (if possible)
>>>>
>>>> Thanks in advance
>>>>
>>>> On Wed, Jul 15, 2009 at 10:46 AM, Amandeep Khurana <[email protected]>
>>>> wrote:
>>>>
>>>> I've been doing joins by writing my own MR jobs. That works best.
>>>>
>>>>> Not tried cascading yet.
>>>>>
>>>>> -ak
>>>>>
>>>>> On 7/14/09, bharath vissapragada <[email protected]>
>>>>> wrote:
>>>>>
>>>>> Thats fine .. I know that hbase has completely different usage
>>>>>> compared
>>>>>>
>>>>>> to
>>>>>
>>>>> SQL .. But for my application there is some kind of dependency
>>>>>> involved
>>>>>> among the tables . So i need to implement a Join . I wanted to know
>>>>>>
>>>>>> whether
>>>>>
>>>>> there is some kind of implementation already
>>>>>> ..
>>>>>>
>>>>>> Thanks
>>>>>> On Wed, Jul 15, 2009 at 10:30 AM, Ryan Rawson <[email protected]>
>>>>>>
>>>>>> wrote:
>>>>>
>>>>> HBase != SQL.
>>>>>>
>>>>>>> You might want map reduce or cascading.
>>>>>>>
>>>>>>> On Tue, Jul 14, 2009 at 9:56 PM, bharath
>>>>>>> vissapragada<[email protected]> wrote:
>>>>>>>
>>>>>>> Hi all ,
>>>>>>>>
>>>>>>>> I want to join(similar to relational databases join) two tables in
>>>>>>>>
>>>>>>>> HBase
>>>>>>>
>>>>>> .
>>>>>>
>>>>>>> Can anyone tell me whether it is already implemented in the source !
>>>>>>>>
>>>>>>>> Thanks in Advance
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>
>>>>>
>>>>> Amandeep Khurana
>>>>> Computer Science Graduate Student
>>>>> University of California, Santa Cruz
>>>>>
>>>>>
>>>>>
>>
