Is murmur part of the standard Java libraries? If not, you end up having to do a bit more maintenance on your cluster, and that's going to be part of your tradeoff.
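
To make the tradeoff concrete, here is a minimal sketch of the key layout suggested in the quoted message (hash + timestamp + userId), built with only the JDK. As far as I know the JDK ships no public Murmur implementation (Guava's Hashing.murmur3_128() is one library option), so this sketch swaps in MD5 via java.security.MessageDigest purely as a stand-in. The class and method names are hypothetical, and the 8-byte big-endian timestamp encoding is an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of HBase: builds hash(userId) + timestamp + userId.
public class HashedRowKey {
    static final int HASH_LEN = 16;        // MD5 digest length in bytes (stand-in for murmur 128)
    static final int TS_LEN = Long.BYTES;  // assumed 8-byte big-endian timestamp

    static byte[] build(String userId, long timestamp) {
        byte[] id = userId.getBytes(StandardCharsets.UTF_8);
        byte[] hash;
        try {
            hash = MessageDigest.getInstance("MD5").digest(id);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the digests every Java platform must provide.
            throw new IllegalStateException(e);
        }
        // Fixed-length prefix (hash + timestamp), then the variable-length userId.
        return ByteBuffer.allocate(HASH_LEN + TS_LEN + id.length)
                .put(hash)
                .putLong(timestamp)
                .put(id)
                .array();
    }

    // The prefix has a fixed length, so the userId can be recovered without a separator.
    static String userIdOf(byte[] rowKey) {
        int offset = HASH_LEN + TS_LEN;
        return new String(rowKey, offset, rowKey.length - offset, StandardCharsets.UTF_8);
    }

    static long timestampOf(byte[] rowKey) {
        return ByteBuffer.wrap(rowKey, HASH_LEN, TS_LEN).getLong();
    }
}
```

Swapping the digest for a Murmur implementation would change only the hash line, but it would also add a jar that has to be present wherever the key is built.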

On Jul 8, 2013, at 10:14 AM, Mike Axiak <[email protected]> wrote:

> Hello Jason,
>
> Have you considered the following rowkey?
>
> murmur_128(userId) + timestamp + userId ?
>
> This handles both of your cases, as (1) murmur 128 is much faster than
> md5, so it will have very low overhead, and (2) the userId at the end of
> the key will ensure that no murmur collisions cause issues. This
> key also handles incrementing userIds well, because close userIds will
> likely be in separate regions.
>
> Cheers,
> Mike
>
> On Mon, Jul 8, 2013 at 10:19 AM, Jason Huang <[email protected]> wrote:
>> Hello,
>>
>> I am trying to get some advice on the pros/cons of using a separator/delimiter
>> as part of an HBase row key.
>>
>> Currently one of our user activity tables has a rowkey design of
>> "UserID^TimeStamp" with a separator of "^". (UserID is a string that won't
>> include '^'.)
>>
>> This is designed for the two common use cases in our system:
>> (1) If we come from a context where the UserID is known, we can easily do a
>> scan for all the user activities with a startRowKey and stopRowKey.
>> (2) If we come from an external networked table where the row key of this
>> user activity table is stored and can be retrieved as activityRowKey, then
>> we can use the following code to parse out the UserID and do the same scan
>> as in (1):
>>
>> String activityRowKeyStr = Bytes.toString(activityRowKey);
>> // UserID is everything before the '^' separator
>> String userId =
>>     activityRowKeyStr.substring(0, activityRowKeyStr.indexOf("^"));
>>
>> Then I can set startRowKey and stopRowKey for the scan based on userId.
>> Here we get the benefit of having the UserID as part of the row key with the
>> separator (compared to another solution that stores the UserID as one of
>> the columns in the user activity table).
>>
>> The reason I picked a separator after UserID is that sometimes we may not get
>> a fixed-length string for the UserID value. At one point I actually thought
>> of using MD5 to hash the UserID and make it a fixed length; however, the
>> possibility of collisions and the possible overhead of applying the hash
>> function made me pick the separator "^".
>>
>> My questions:
>> (1) I argue that using a separator is better than using an MD5 hash value.
>> Does that seem reasonable? Could you comment on other pros and cons that I
>> might have missed (as the basis for my argument)?
>>
>> (2) On using a separator/delimiter, besides the requirement that the
>> separator/delimiter shouldn't appear elsewhere in the rowkey, are there any
>> other requirements? Are there any special separators/delimiters that are
>> better/worse than the average ones?
>>
>> thanks!
>>
>> Jason
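
For the separator-based design described in the quoted message ("UserID^TimeStamp"), the parsing and scan-boundary logic can be sketched roughly as follows. This is plain Java with no HBase dependency; the class and method names are hypothetical. It assumes keys are UTF-8 encoded, that the UserID never contains '^', and that the scan treats the start row as inclusive and the stop row as exclusive.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for row keys shaped like "UserID^TimeStamp".
public class SeparatorRowKey {
    static final char SEP = '^';

    // The UserID is everything before the first separator.
    static String userIdOf(byte[] activityRowKey) {
        String key = new String(activityRowKey, StandardCharsets.UTF_8);
        int sep = key.indexOf(SEP);
        if (sep < 0) {
            throw new IllegalArgumentException("no separator in row key: " + key);
        }
        return key.substring(0, sep);
    }

    // Inclusive start row: the smallest key that begins with "userId^".
    static byte[] startRow(String userId) {
        return (userId + SEP).getBytes(StandardCharsets.UTF_8);
    }

    // Exclusive stop row: replace the separator with the next character ('_'),
    // so the range covers every key that begins with "userId^" and nothing else.
    static byte[] stopRow(String userId) {
        return (userId + (char) (SEP + 1)).getBytes(StandardCharsets.UTF_8);
    }
}
```

With these boundaries, a scan for "alice" runs from "alice^" up to, but not including, "alice_". Keys for users such as "alice2" sort outside that range, because row keys are compared byte by byte.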
