Hopefully this is a fun question. :) Assume you could architect an HBase table from scratch and you were choosing between the following two key structures.
1) The first structure creates a unique row key for each PUT. The rows are
events related to a user ID. There may be up to several hundred events for
each user ID (probably not thousands; an average of perhaps ~100 events per
user). Each key would be made unique with a reverse-order timestamp or
perhaps just random characters (we don't particularly care about using ROT
for newest-first sorting here).

key
----
AAAAAA + some-unique-chars

The table will look like this:

key            vals cf:mycf   ts
-------------------------------------------------------------------
AAAAAA9999...  myval1         1350345600
AAAAAA8888...  myval2         1350259200
AAAAAA7777...  myval3         1350172800

Retrieving these values will use a Scan with a startRow and stopRow. In the
hbase shell, it would look like:

$ scan 'mytable',{STARTROW=>'AAAAAA', ENDROW=>'AAAAAA_'}

2) The second structure uses only the user ID as the key and relies on row
versions to store all the events. For example:

key     vals cf:mycf   ts
---------------------------------------------------------------------
AAAAAA  myval1         1350345600
AAAAAA  myval2         1350259200
AAAAAA  myval3         1350172800

Retrieving these values will use a Get with VERSIONS set to some big
number. In the hbase shell, it would look like:

$ get 'mytable','AAAAAA',{COLUMN=>'cf:mycf', VERSIONS=>999}

...although this probably violates a comment in the HBase documentation:

"It is not recommended setting the number of max versions to an exceedingly
high level (e.g., hundreds or more) unless those old values are very dear
to you because this will greatly increase StoreFile size."

...found here: http://hbase.apache.org/book/schema.versions.html

So, are there any performance considerations between a Scan and a Get in
this use case? Which choice would you go for?

Neil Yalowitz
neilyalow...@gmail.com
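P.S. To make structure 1 concrete, here is a minimal Python sketch (names
like event_key and MAX_TS are invented for illustration; this is not the
HBase API). It shows why a zero-padded reverse-order-timestamp suffix makes
the newest event sort first under HBase's lexicographic row-key ordering,
and why '_' works as a stop row: '_' (0x5F) sorts after the ASCII digits,
so 'AAAAAA_' bounds every key with a numeric suffix.

```python
# Illustrative sketch of structure 1's key design (not the HBase client API).
# HBase sorts row keys lexicographically, so appending (MAX_TS - ts),
# zero-padded to a fixed width, makes newer events sort before older ones.

MAX_TS = 9999999999  # hypothetical fixed-width ceiling for epoch seconds

def event_key(user_id, ts):
    """Build a row key: user-ID prefix + zero-padded reverse timestamp."""
    return "%s%010d" % (user_id, MAX_TS - ts)

# Three events for user AAAAAA, using the timestamps from the example above.
keys = sorted(event_key("AAAAAA", ts)
              for ts in (1350345600, 1350259200, 1350172800))

# A prefix scan over [STARTROW, ENDROW), like the shell example.
start, stop = "AAAAAA", "AAAAAA_"
in_range = [k for k in keys if start <= k < stop]

print(in_range[0])  # the key for ts=1350345600, i.e. the newest event
```

All three keys fall inside the scan range, and the first key returned is
the one built from the newest timestamp, without relying on VERSIONS.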