Denormalization is one of those things you generally have to be careful about. In the audit case, I don't see the need to denormalize the user's first and last name into the audit table. Only the userid would be necessary - generally it's only when you are digging into specific cases that you need to tie a userid to a human-readable name, and you can make that extra fetch in your web UI.
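A minimal sketch of the pattern described above: the audit row stores only the userid, and the UI pays one extra fetch to resolve a human-readable name. Plain Python dicts stand in for the HBase tables here; the row keys, field names, and helper function are all made up for illustration.

```python
# Dicts standing in for two HBase tables (illustrative only).
users = {
    "u42": {"firstname": "Ada", "lastname": "Lovelace"},
}

# Audit rows store only the userid, not the denormalized name.
audits = {
    "audit-0001": {"userid": "u42", "action": "login"},
}

def audit_with_name(audit_id):
    """Resolve the human-readable name with one extra fetch, UI-side."""
    audit = audits[audit_id]
    user = users[audit["userid"]]  # the extra round-trip
    return {**audit, "name": f"{user['firstname']} {user['lastname']}"}
```

The trade-off is exactly the one discussed in this thread: reads cost one more round-trip, but a user rename touches a single row instead of every audit entry.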
Ditto for action.

Using HBase to the max takes a more webby approach to be fully effective, I think. Instead of using integer ids, use string descriptions for enums. Use usernames instead of userids in the audit log (it tends to be more robust against changes).

Remember, in HBase:
- disk is cheap (the denorm cost is lower)
- strings are not as expensive as in RDBMSs
- no schema means you have extreme flexibility

Good luck!
-ryan

On Thu, Jul 16, 2009 at 11:14 AM, Mr Hoberto <[email protected]> wrote:
> Thanks for the advice. I'm going to keep trying to get my head around
> this, and figure out, as you said, distributing the workload.
> Round-tripping has been verboten in my circle, but I think that's more
> due to false doctrine than actual experience. I think that's going to be
> the way to go for me.
>
> -hob
>
> On Thu, Jul 16, 2009 at 1:42 PM, Jonathan Gray <[email protected]> wrote:
>
>> Hoberto, Bharath,
>>
>> Designing these kinds of queries efficiently in HBase means doing
>> multiple round-trips, or denormalizing.
>>
>> That means degrading performance as the query complexity increases, or
>> lots of data duplication and a complex write/update process.
>>
>> In your audit example, you provide the denormalizing solution. Store
>> the fields you need with the data you are querying (details of the
>> user/action in the audit table with the audit). If you have to update
>> those details, then you have an extra expense on your write (and you
>> introduce a potential synchronization issue without transactions).
>>
>> The choice about how to solve this really depends on the use case and
>> what your requirements are. Can you ever afford to miss an update in
>> one of the denormalized fields, even if it is extremely unlikely? You
>> can build transactional layers on top, or you can take a look at
>> TransactionalHBase, which attempts to do this in a more integrated way.
>>
>> You also talk about the other approach, running multiple queries in
>> the application.
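The denormalizing solution described above can be sketched in a few lines: user details are copied into every audit row at write time, so reads need no join, but an update to the user must fan out to every copy. Plain Python dicts stand in for the HBase tables; every name here is illustrative, not real API.

```python
# Dicts standing in for HBase tables (illustrative only).
users = {"u42": {"name": "Ada Lovelace"}}
audits = {}

def write_audit(audit_id, userid, action):
    # Duplicate the user's name into the audit row: the write-time cost
    # of denormalization, paid so that reads need no join.
    audits[audit_id] = {"userid": userid,
                        "name": users[userid]["name"],
                        "action": action}

def rename_user(userid, new_name):
    users[userid]["name"] = new_name
    # The extra expense on update: touch every denormalized copy.
    # Without transactions, a failure mid-loop leaves copies out of
    # sync, which is the synchronization issue mentioned above.
    for row in audits.values():
        if row["userid"] == userid:
            row["name"] = new_name
```

At small scale the fan-out loop is trivial; at audit-log scale it becomes a scan-and-rewrite job, which is why the thread keeps weighing it against extra read-time round-trips.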
>> As far as memory pressure in the app is concerned, that would really
>> depend on the nature of the join. It's more an issue of how many joins
>> you need to make, and whether there's any way to reduce the number of
>> queries/trips needed.
>>
>> If I am pulling the most recent 10 audits, and I need to join each with
>> both the User and Actions tables, then we're talking about 1 + 10 + 10
>> total queries. That's not so pretty, but if done in a distributed or
>> threaded way it may not be too bad. In the future, I expect more and
>> more tools/frameworks to be available to aid in that process.
>>
>> Today, this is up to you.
>>
>> At Streamy, we solve these problems with layers above HBase. Some of
>> them keep lots of stuff in-memory and do complex joins in-memory.
>> Others coordinate the multiple queries to HBase, with or without an
>> OCC-style transaction.
>>
>> My suggestion is to start denormalized. Build naive queries that do
>> lots of round-trips. See what the performance is like under different
>> conditions and then go from there. Perhaps Actions are generally
>> immutable, their name never changes, so you could denormalize that
>> field and cut out half of the total queries. Have a pool of threads
>> that grab Users so you can do the join in parallel. Depending on your
>> requirements, this might be sufficient. Otherwise look at more
>> denormalization, or building a thin layer above.
>>
>> JG
>>
>> Mr Hoberto wrote:
>>
>>> I can think of two cases that I've been wondering about (I am very
>>> new, and am still reading the docs & archives, so I apologize if this
>>> has been already covered or if I use the wrong notation... I'm still
>>> learning).
>>>
>>> First case:
>>>
>>> Tracking audits.
>>> In the RDBMS world you'd have the following schema:
>>>
>>> User (userid, firstname, lastname)
>>> Actions (actionid, actionname)
>>> Audit (auditTime, userid, actionid)
>>>
>>> I think the answer in the HBase world is to denormalize the data, with
>>> a structure such as:
>>>
>>> audits (auditid, audittime[timestamp], whowhat[family (firstname,
>>> lastname, actionname)])
>>>
>>> The problem happens, as Bharath says, when a firstname or lastname
>>> needs to be updated. Running a correction on all those denormalized
>>> rows is going to be problematic.
>>>
>>> Alternatively, I suppose you could store the User and Actions tables
>>> separately, keep the audits structure in HBase storing only ids, and
>>> use the website's application layer to "merge" the different data sets
>>> together for display on a page. The downside there is that if you wind
>>> up with a significant number of users or actions, it'll put a lot of
>>> memory pressure on the app servers.
>>>
>>> Second case:
>>>
>>> Doing analysis on two time-series-based data structures, such as a "PE
>>> ratio".
>>>
>>> In the RDBMS world you'd have two tables:
>>>
>>> Prices (ticker, date, price)
>>> Earnings (ticker, date, earning)
>>>
>>> Again, I think the answer is denormalizing in the HBase world, with a
>>> structure such as:
>>>
>>> PEs (date, timestamp, PERatio[family (ticker, PEvalue)])
>>>
>>> The problem here comes, again, with updates. For instance, what if you
>>> only have earnings information available on an annual basis, and you
>>> come across a source that has it quarterly... You'll have to update
>>> 3/4 of the rows in the denormalized table.
>>>
>>> Once again, I apologize for any sort of misunderstanding... I'm still
>>> learning the concepts behind column stores and map/reduce.
>>>
>>> -hob
>>>
>>> On Thu, Jul 16, 2009 at 11:19 AM, Jonathan Gray <[email protected]>
>>> wrote:
>>>
>>>> The answer is, it depends.
>>>> What are the details of what you are trying to join? Is it just a
>>>> simple 1-to-1 join, or 1-to-many, or what? At a minimum, the join
>>>> would require two round-trips. However, 0.20 can do simple queries in
>>>> the 1-10ms range (closer to 1ms when the blocks are already cached).
>>>>
>>>> The comparison to an RDBMS cannot be made directly, because a
>>>> single-node RDBMS with a smallish table will be quite fast at simple
>>>> index-based joins. I would guess that the unloaded, single-machine
>>>> performance of this join operation would be much faster in an RDBMS.
>>>>
>>>> But if your table has millions or billions of rows, it's a different
>>>> situation. HBase performance will stay nearly constant as your table
>>>> grows, as long as you have the nodes to support your dataset and the
>>>> load.
>>>>
>>>> What are your targets for time (sub 100ms? 10ms?), and what are the
>>>> details of what you're joining?
>>>>
>>>> As far as code is concerned, there is not much to a simple join, so
>>>> I'm not sure how helpful it would be. If you give some detail,
>>>> perhaps I can provide some pseudo-code for you.
>>>>
>>>> JG
>>>>
>>>> bharath vissapragada wrote:
>>>>
>>>>> JG, thanks for your reply.
>>>>>
>>>>> Actually, I am trying to implement a realtime join of two tables on
>>>>> HBase. I tried the idea of denormalizing the tables to avoid the
>>>>> joins, but when we do that, updating the data is really difficult. I
>>>>> understand that the features I am trying to implement are those of
>>>>> an RDBMS and HBase is used for a different purpose. Even then I want
>>>>> (rather, I would like to try) to store the data in HBase and
>>>>> implement joins, so that I could test their performance, and if they
>>>>> are effective (at least on a large number of nodes), it may be of
>>>>> some help to me. I know some people have already tried this.
>>>>> If anyone has already tried this, can you just tell me how the
>>>>> results are... I mean, are they good when compared to an RDBMS join
>>>>> on a single machine?
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Wed, Jul 15, 2009 at 8:35 PM, Jonathan Gray <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Bharath,
>>>>>>
>>>>>> You need to outline what your actual requirements are if you want
>>>>>> more help. Open-ended questions that just ask for code are usually
>>>>>> not answered.
>>>>>>
>>>>>> What exactly are you trying to join? Does this join need to happen
>>>>>> in "realtime" or is this part of a batch process?
>>>>>>
>>>>>> Could you denormalize your data to prevent needing the join at
>>>>>> runtime?
>>>>>>
>>>>>> If you provide details about exactly what your data/schema is like
>>>>>> (or a similar example if this is confidential), then many of us are
>>>>>> more than happy to help you figure out what approach may work best.
>>>>>>
>>>>>> When working with HBase, figuring out how you want to pull your
>>>>>> data out is key to how you want to put the data in.
>>>>>>
>>>>>> JG
>>>>>>
>>>>>> bharath vissapragada wrote:
>>>>>>
>>>>>>> Amandeep, can you tell me what kinds of joins you have
>>>>>>> implemented, and which works the best (based on observation)? Can
>>>>>>> you show us the source code (if possible)?
>>>>>>>
>>>>>>> Thanks in advance
>>>>>>>
>>>>>>> On Wed, Jul 15, 2009 at 10:46 AM, Amandeep Khurana
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> I've been doing joins by writing my own MR jobs. That works best.
>>>>>>>>
>>>>>>>> Not tried cascading yet.
>>>>>>>>
>>>>>>>> -ak
>>>>>>>>
>>>>>>>> On 7/14/09, bharath vissapragada <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> That's fine... I know that HBase has completely different usage
>>>>>>>>> compared to SQL.
>>>>>>>>> But for my application there is some kind of dependency involved
>>>>>>>>> among the tables, so I need to implement a join. I wanted to
>>>>>>>>> know whether there is some kind of implementation already.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> On Wed, Jul 15, 2009 at 10:30 AM, Ryan Rawson
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> HBase != SQL.
>>>>>>>>>>
>>>>>>>>>> You might want map reduce or cascading.
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 14, 2009 at 9:56 PM, bharath
>>>>>>>>>> vissapragada <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> I want to join (similar to a relational database join) two
>>>>>>>>>>> tables in HBase. Can anyone tell me whether it is already
>>>>>>>>>>> implemented in the source!
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance
>>>>>>>>
>>>>>>>> --
>>>>>>>> Amandeep Khurana
>>>>>>>> Computer Science Graduate Student
>>>>>>>> University of California, Santa Cruz
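Pulling the thread's suggestions together, the 1 + 10 + 10 client-side join (one scan for the recent audits, then a pool of threads grabbing the User and Action rows in parallel) might be sketched like this. Dicts and a sleep() merely fake the per-row HBase round-trips; all table contents and names are made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Fake tables standing in for HBase (illustrative data).
users = {f"u{i}": {"name": f"user-{i}"} for i in range(10)}
actions = {f"a{i}": {"name": f"action-{i}"} for i in range(10)}
# Result of the first query: the 10 most recent audit rows.
audits = [{"userid": f"u{i}", "actionid": f"a{i}"} for i in range(10)]

def get(table, key):
    time.sleep(0.01)  # stand-in for one HBase round-trip
    return table[key]

def joined_audits():
    # 20 per-row gets issued concurrently instead of sequentially,
    # so wall-clock time is roughly one round-trip, not twenty.
    with ThreadPoolExecutor(max_workers=20) as pool:
        user_futs = [pool.submit(get, users, a["userid"]) for a in audits]
        action_futs = [pool.submit(get, actions, a["actionid"])
                       for a in audits]
        return [{**a,
                 "user": uf.result()["name"],
                 "action": af.result()["name"]}
                for a, uf, af in zip(audits, user_futs, action_futs)]
```

If the Action names turn out to be immutable, denormalizing them into the audit rows would cut this to 1 + 10 queries, which is exactly the incremental approach suggested above: start naive, measure, then denormalize the fields that never change.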
