Thanks Ian! Very helpful breakdown. For this use case, I think the multi-version row structure is ruled out; we will investigate the one-key-many-columns approach instead. Also, the more I study the mechanics behind a SCAN vs. a GET, the more I believe the informal test I ran was inaccurate. What does warrant a closer look, however, is the set of filters on the scan. We are already filtering on the column family (CF), but we can now look at filtering on qualifiers as well.
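For reference, here's a rough sketch of what that scan could look like through the Java client (written against the 0.94-era client API; "mytable", "cf", and "mycf" are just placeholder names taken from the examples quoted below, so adjust to the real schema):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class EventScan {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");
        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("AAAAAA"));   // all event rows for this user ID...
        scan.setStopRow(Bytes.toBytes("AAAAAA_"));   // ...up to (not including) the stop row
        // either take every column in the family:
        //   scan.addFamily(Bytes.toBytes("cf"));
        // or narrow the scan to one qualifier (this overrides addFamily for "cf"):
        scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("mycf"));
        ResultScanner scanner = table.getScanner(scan);
        try {
          for (Result r : scanner) {
            // one Result per event row
            byte[] val = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("mycf"));
          }
        } finally {
          scanner.close();
          table.close();
        }
      }
    }

The addColumn() call is what limits the scan to a single qualifier; addFamily() alone would return every qualifier in the family.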
Thanks again,

Neil Yalowitz
[email protected]

On Thu, Oct 18, 2012 at 4:59 PM, Ian Varley <[email protected]> wrote:
> Hi Neil,
>
> Mike summed it up well, as usual. :) Your choices of where to describe
> this "dimension" of your data (a one-to-many between users and events) are:
>
>   - one row per event
>   - one row per user, with events as columns
>   - one row per user, with events as versions on a single cell
>
> The first two are the best choices, since the third is sort of a
> perversion of the time dimension (it isn't one thing that's changing, it's
> many things over time), and might make things counter-intuitive when
> combined with deletes, compaction, etc. You can do it, but caveat emptor. :)
>
> Since you have in the 100s or 1000s of events per user, it's reasonable to
> use the 2nd (columns). And with 1k cell sizes, even extreme cases
> (thousands of events) won't kill you.
>
> That said, the main plus you get out of using columns over rows is ACID
> properties; you could get & set all the stuff for a single user atomically
> if it's columns in a single row, but not if it's separate rows. That's nice,
> but I'm guessing you probably don't need to do that, and instead would
> write out the events as they happen (i.e., you would rarely be doing PUTs
> for multiple events for the same user at the same time, right?).
>
> In theory, tall tables (the row-wise model) should have a slight
> performance advantage over wide tables (the column-wise model), all other
> things being equal; the shape of the data is nearly the same, but the
> row-wise version doesn't have to do any work preserving consistency. Your
> informal tests about GET vs SCAN perf seem a little suspect, since a GET is
> actually implemented as a one-row SCAN; but the devil's in the details, so
> if you see that happening repeatably with data that's otherwise identical,
> raise it up to the dev list and people should look at it.
>
> The key thing is to try it for yourself and see. :)
>
> Ian
>
> ps - Sorry Mike was rude to you in his response. Your question was
> well-phrased and not at all boring. Mike, you can explain all you want, but
> saying "Your question is boring" is straight up rude; please don't do that.
>
>
> From: Neil Yalowitz <[email protected]>
> Date: Tue, Oct 16, 2012 at 2:53 PM
> Subject: crafting your key - scan vs. get
> To: [email protected]
>
>
> Hopefully this is a fun question. :)
>
> Assume you could architect an HBase table from scratch and you were
> choosing between the following two key structures.
>
> 1)
>
> The first structure creates a unique row key for each PUT. The rows are
> events related to a user ID. There may be up to several hundred events for
> each user ID (probably not thousands, an average of perhaps ~100 events per
> user). Each key would be made unique with a reverse-order-timestamp or
> perhaps just random characters (we don't particularly care about using ROT
> for sorting newest here).
>
> key
> ----
> AAAAAA + some-unique-chars
>
> The table will look like this:
>
> key              vals cf:mycf     ts
> -------------------------------------------------------------------
> AAAAAA9999...    myval1           1350345600
> AAAAAA8888...    myval2           1350259200
> AAAAAA7777...    myval3           1350172800
>
>
> Retrieving these values will use a Scan with startRow and stopRow. In
> hbase shell, it would look like:
>
> $ scan 'mytable',{STARTROW=>'AAAAAA', ENDROW=>'AAAAAA_'}
>
>
> 2)
>
> The second structure choice uses only the user ID as the key and relies on
> row versions to store all the events. For example:
>
> key              vals cf:mycf     ts
> ---------------------------------------------------------------------
> AAAAAA           myval1           1350345600
> AAAAAA           myval2           1350259200
> AAAAAA           myval3           1350172800
>
> Retrieving these values will use a Get with VERSIONS = somebignumber. In
> hbase shell, it would look like:
>
> $ get 'mytable','AAAAAA',{COLUMN=>'cf:mycf', VERSIONS=>999}
>
> ...although this probably violates a comment in the HBase documentation:
>
> "It is not recommended setting the number of max versions to an exceedingly
> high level (e.g., hundreds or more) unless those old values are very dear
> to you because this will greatly increase StoreFile size."
>
> ...found here: http://hbase.apache.org/book/schema.versions.html
>
>
> So, are there any performance considerations between Scan vs. Get in this
> use case? Which choice would you go for?
>
>
>
> Neil Yalowitz
> [email protected]
>
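For completeness, the second structure's access path through the Java client would look roughly like the sketch below (same placeholder names as above; the column family's own max-versions setting would also have to be raised, per the schema.versions link in the quote):

    import java.io.IOException;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class EventGet {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");
        try {
          Get get = new Get(Bytes.toBytes("AAAAAA"));   // one row per user
          get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("mycf"));
          get.setMaxVersions(999);                      // ask for up to 999 cell versions
          Result result = table.get(get);
          // each KeyValue is one event, distinguished only by its timestamp
          List<KeyValue> events = result.getColumn(Bytes.toBytes("cf"), Bytes.toBytes("mycf"));
        } finally {
          table.close();
        }
      }
    }

Here every event comes back as a separate version of the same cell, distinguished only by timestamp, which is exactly the property Ian flags above as a reason to prefer the first two designs.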
