Hopefully this is a fun question.  :)

Assume you could architect an HBase table from scratch and you were
choosing between the following two key structures.

1)

The first structure creates a unique row key for each PUT.  The rows are
events related to a user ID.  There may be up to several hundred events for
each user ID (probably not thousands; an average of perhaps ~100 events per
user).  Each key would be made unique with a reverse-order timestamp (ROT) or
perhaps just random characters (we don't particularly care about using the
ROT to sort newest-first here).

key
----
AAAAAA + some-unique-chars
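
As a rough sketch of the reverse-order-timestamp variant of this key (not
real client code): this assumes the timestamp is subtracted from Java's
Long.MAX_VALUE and zero-padded to a fixed width.  The function name and the
padding width are made up for illustration, and Python is used only to keep
it short.

```python
# Hypothetical helper: row key = user ID + reverse-order timestamp.
LONG_MAX = 2**63 - 1  # same value as Java's Long.MAX_VALUE


def make_row_key(user_id: str, ts_seconds: int) -> str:
    """Return user_id followed by a zero-padded reversed timestamp."""
    reversed_ts = LONG_MAX - ts_seconds
    # Long.MAX_VALUE has 19 digits; fixed-width padding keeps string
    # order identical to numeric order.
    return f"{user_id}{reversed_ts:019d}"


# The three event timestamps used in the example tables.
keys = [make_row_key("AAAAAA", ts)
        for ts in (1350172800, 1350259200, 1350345600)]

# Row keys are stored in sorted order, so the newest event comes first.
sorted_keys = sorted(keys)
```

With random characters instead of a ROT, the keys would still be unique but
the order within one user ID would carry no meaning.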

The table will look like this:

key              cf:mycf    ts
----------------------------------------
AAAAAA9999...    myval1     1350345600
AAAAAA8888...    myval2     1350259200
AAAAAA7777...    myval3     1350172800


Retrieving these values will use a Scan with startRow and stopRow.  In
hbase shell, it would look like:

$ scan 'mytable',{STARTROW=>'AAAAAA', ENDROW=>'AAAAAA_'}
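
To make the range semantics concrete, here is a small model in plain Python
(no HBase involved) of a scan over sorted row keys, with the start row
inclusive and the stop row exclusive.  The helper names and the row
'AAAAAAzzzz' are hypothetical; that row is only there to show what happens
to a stop row of 'AAAAAA_' if the unique characters can include lowercase
letters, which sort after '_' in byte order.

```python
import bisect


def prefix_stop_row(prefix: bytes) -> bytes:
    """Smallest key that sorts after every key starting with prefix."""
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return b""  # empty stop row stands for "scan to the end"
    # Increment the last byte that is not 0xFF and drop what follows it.
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


def range_scan(sorted_keys, start_row: bytes, stop_row: bytes):
    """Return the keys in [start_row, stop_row) from a sorted list."""
    lo = bisect.bisect_left(sorted_keys, start_row)
    hi = (bisect.bisect_left(sorted_keys, stop_row)
          if stop_row else len(sorted_keys))
    return sorted_keys[lo:hi]


table = sorted([b"AAAAAA7777", b"AAAAAA8888", b"AAAAAA9999",
                b"AAAAAAzzzz", b"AAAAAB0001"])

# Stop row from the shell example: misses keys whose suffix sorts after '_'.
with_underscore = range_scan(table, b"AAAAAA", b"AAAAAA_")

# Stop row derived from the prefix: covers every key for this user ID.
with_prefix_stop = range_scan(table, b"AAAAAA", prefix_stop_row(b"AAAAAA"))
```

If the suffix is always digits (as with a ROT), both stop rows return the
same rows.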


2)

The second structure choice uses only the user ID as the key and relies on
row versions to store all the events.  For example:

key       cf:mycf    ts
---------------------------------
AAAAAA    myval1     1350345600
AAAAAA    myval2     1350259200
AAAAAA    myval3     1350172800

Retrieving these values will use a Get with VERSIONS = somebignumber.  In
hbase shell, it would look like:

$ get 'mytable','AAAAAA',{COLUMN=>'cf:mycf', VERSIONS=>999}

...although this probably violates a comment in the HBase documentation:

"It is not recommended setting the number of max versions to an exceedingly
high level (e.g., hundreds or more) unless those old values are very dear
to you because this will greatly increase StoreFile size."

...found here: http://hbase.apache.org/book/schema.versions.html
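
For what it's worth, here is a toy model in plain Python (again, not HBase
code) of how I understand the versions approach to behave: each version of
a cell is identified by its timestamp, a Get returns the newest versions
first, and a second write with the same timestamp replaces the earlier
value.  The class name and the max_versions value are assumptions for
illustration only.

```python
class VersionedCell:
    """Toy model of one cell holding several timestamped versions."""

    def __init__(self, max_versions: int):
        self.max_versions = max_versions
        self._versions = {}  # timestamp -> value

    def put(self, ts: int, value: str) -> None:
        # A write with an existing timestamp overwrites that version.
        self._versions[ts] = value

    def get(self, versions: int = 1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        newest_first = sorted(self._versions.items(), reverse=True)
        return newest_first[:min(versions, self.max_versions)]


cell = VersionedCell(max_versions=1000)  # assumed column family setting
cell.put(1350345600, "myval1")
cell.put(1350259200, "myval2")
cell.put(1350172800, "myval3")

events = cell.get(versions=999)  # mirrors VERSIONS=>999 in the shell
```

One consequence in this model: two events for the same user ID with the
same timestamp would collapse into one version, which the unique row keys
of the first structure avoid.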


So, are there any performance considerations between Scan vs. Get in this
use case?  Which choice would you go for?



Neil Yalowitz
neilyalow...@gmail.com
