Thanks Ian!  Very helpful breakdown.

For this use case, I think the multi-version row structure is ruled out.
We will investigate the one-key, many-columns approach.  Also, the more I
study the mechanics behind a SCAN vs. a GET, the more I believe the
informal test I did is inaccurate.  What does warrant a look, however, is
the filtering on the scan: we already filter on the column family, and we
can now look at filtering on qualifiers as well (sketch below).
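
For reference, here's roughly what we plan to test against the Java client
(a sketch only, not tested; the class name is a placeholder and the
'cf'/'mycf' naming is borrowed from the examples below):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class QualifierScanTest {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");

    // Bounded scan over one user's event rows, as in our shell test.
    Scan scan = new Scan(Bytes.toBytes("AAAAAA"), Bytes.toBytes("AAAAAA_"));

    // What we do today: restrict to the column family only.
    // scan.addFamily(Bytes.toBytes("cf"));

    // What we want to test: also restrict to a single qualifier, so the
    // region server skips non-matching KeyValues before the RPC.
    scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("mycf"));

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result result : scanner) {
        byte[] val = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("mycf"));
        // ... process val ...
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}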

Thanks again,

Neil Yalowitz
[email protected]

On Thu, Oct 18, 2012 at 4:59 PM, Ian Varley <[email protected]> wrote:

> Hi Neil,
>
> Mike summed it up well, as usual. :) Your choices of where to describe
> this "dimension" of your data (a one-to-many between users and events) are:
>
>  - one row per event
>  - one row per user, with events as columns
>  - one row per user, with events as versions on a single cell
>
> The first two are the best choices, since the third is sort of a
> perversion of the time dimension (it isn't one thing that's changing, it's
> many things over time), and might make things counter-intuitive when
> combined with deletes, compaction, etc. You can do it, but caveat emptor. :)
>
> Since you have on the order of 100s or 1000s of events per user, it's reasonable to
> use the 2nd (columns). And with 1k cell sizes, even extreme cases
> (thousands of events) won't kill you.
>
> That said, the main plus you get out of using columns over rows is ACID
> properties; you could get & set all the stuff for a single user atomically
> if it's columns in a single row, but not if it's separate rows. That's nice,
> but I'm guessing you probably don't need to do that, and instead would
> write out the events as they happen (i.e., you would rarely be doing PUTs
> for multiple events for the same user at the same time, right?).
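> 
> (Concretely, the difference looks something like this in the Java client;
> just a sketch, assuming an open HTable handle and your "cf:mycf" naming,
> with made-up event qualifiers:)
> 
> // Wide table: one Put per user row; all event columns commit atomically.
> Put put = new Put(Bytes.toBytes("AAAAAA"));
> put.add(Bytes.toBytes("cf"), Bytes.toBytes("event-9999"), Bytes.toBytes("myval1"));
> put.add(Bytes.toBytes("cf"), Bytes.toBytes("event-8888"), Bytes.toBytes("myval2"));
> table.put(put);  // readers see both events or neither
> 
> // Tall table: one Put per event row; each is its own atomic unit, so a
> // concurrent reader could observe the first event without the second.
> table.put(new Put(Bytes.toBytes("AAAAAA9999"))
>     .add(Bytes.toBytes("cf"), Bytes.toBytes("mycf"), Bytes.toBytes("myval1")));
> table.put(new Put(Bytes.toBytes("AAAAAA8888"))
>     .add(Bytes.toBytes("cf"), Bytes.toBytes("mycf"), Bytes.toBytes("myval2")));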
>
> In theory, tall tables (the row-wise model) should have a slight
> performance advantage over wide tables (the column-wise model), all other
> things being equal; the shape of the data is nearly the same, but the
> row-wise version doesn't have to do any work preserving consistency. Your
> informal tests about GET vs SCAN perf seem a little suspect, since a GET is
> actually implemented as a one-row SCAN; but the devil's in the details, so
> if you see that happening repeatably with data that's otherwise identical,
> raise it up to the dev list and people should look at it.
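> 
> (You can see the one-row-SCAN equivalence in the client API itself: Scan
> has a constructor that takes a Get, which is essentially the conversion
> the region server performs. A two-line sketch:)
> 
> Get get = new Get(Bytes.toBytes("AAAAAA"));
> Scan oneRowScan = new Scan(get);  // same row bounds, families, and columns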
>
> The key thing is to try it for yourself and see. :)
>
> Ian
>
> ps - Sorry Mike was rude to you in his response. Your question was
> well-phrased and not at all boring. Mike, you can explain all you want, but
> saying "Your question is boring" is straight up rude; please don't do that.
>
>
> From: Neil Yalowitz <[email protected]>
> Date: Tue, Oct 16, 2012 at 2:53 PM
> Subject: crafting your key - scan vs. get
> To: [email protected]
>
>
> Hopefully this is a fun question.  :)
>
> Assume you could architect an HBase table from scratch and you were
> choosing between the following two key structures.
>
> 1)
>
> The first structure creates a unique row key for each PUT.  The rows are
> events related to a user ID.  There may be up to several hundred events for
> each user ID (probably not thousands, an average of perhaps ~100 events per
> user).  Each key would be made unique with a reverse-order-timestamp or
> perhaps just random characters (we don't particularly care about using ROT
> for sorting newest here).
>
> key
> ----
> AAAAAA + some-unique-chars
>
> The table will look like this:
>
> key              cf:mycf value    ts
> -------------------------------------------
> AAAAAA9999...    myval1           1350345600
> AAAAAA8888...    myval2           1350259200
> AAAAAA7777...    myval3           1350172800
>
>
> Retrieving these values will use a Scan with startRow and stopRow.  In
> hbase shell, it would look like:
>
> hbase> scan 'mytable', {STARTROW => 'AAAAAA', STOPROW => 'AAAAAA_'}
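> 
> And in the Java client, roughly (an untested sketch, assuming an open
> HTable handle named "table"):
> 
> Scan scan = new Scan(Bytes.toBytes("AAAAAA"), Bytes.toBytes("AAAAAA_"));
> scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("mycf"));
> ResultScanner scanner = table.getScanner(scan);
> try {
>   for (Result result : scanner) {
>     byte[] val = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("mycf"));
>     // one event per row
>   }
> } finally {
>   scanner.close();
> }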
>
>
> 2)
>
> The second structure choice uses only the user ID as the key and relies on
> cell versions to store all the events.  For example:
>
> key          cf:mycf value    ts
> ---------------------------------------
> AAAAAA       myval1           1350345600
> AAAAAA       myval2           1350259200
> AAAAAA       myval3           1350172800
>
> Retrieving these values will use a Get with VERSIONS = somebignumber.  In
> hbase shell, it would look like:
>
> hbase> get 'mytable', 'AAAAAA', {COLUMN => 'cf:mycf', VERSIONS => 999}
>
> ...although this probably violates a comment in the HBase documentation:
>
> "It is not recommended setting the number of max versions to an exceedingly
> high level (e.g., hundreds or more) unless those old values are very dear
> to you because this will greatly increase StoreFile size."
>
> ...found here: http://hbase.apache.org/book/schema.versions.html
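> 
> (In the Java client, that versions-based read would look roughly like
> this; an untested sketch, again assuming an open HTable handle:)
> 
> Get get = new Get(Bytes.toBytes("AAAAAA"));
> get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("mycf"));
> get.setMaxVersions(999);
> Result result = table.get(get);
> // all stored versions of the cell, newest first
> List<KeyValue> events = result.getColumn(Bytes.toBytes("cf"), Bytes.toBytes("mycf"));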
>
>
> So, are there any performance considerations between Scan vs. Get in this
> use case?  Which choice would you go for?
>
>
>
> Neil Yalowitz
> [email protected]
>
>
