Hi there. While this doesn't answer your specific design questions, this chapter in the RefGuide can be helpful for general schema design:
http://hbase.apache.org/book.html#schema

On 10/2/12 10:28 AM, "Jason Huang" <[email protected]> wrote:

>Hello,
>
>I am designing a HBase table for users and hope to get some
>suggestions for my row key design. Thanks...
>
>This user table will have columns which include user information such
>as names, birthday, gender, address, phone number, etc... The first
>time user comes to us we will ask all these information and we should
>generate a new row in the table with a unique row key. The next time
>the same user comes in again we will ask for his/her names and
>birthday and our application should quickly get the row(s) in the
>table which meets the name and birthday provided.
>
>Here is what I am thinking as row key:
>
>{first 6 digit of user's first name}_{first 6 digit of user's last
>name}_{birthday in MMDDYYYY}_{timestamp when user comes in for the
>first time}
>
>However, I see a few questions from this row key:
>
>(1) Although it is not very likely but there could be some small
>chances that two users with same name and birthday came in at the same
>day. And the two requests to generate new user came at the same time
>(the timestamps were defined in the HTable API and happened to be of
>the same value before calling the put method). This means the row key
>design above won't guarantee a unique row key. Any suggestions on how
>to modify it and ensure a unique ID?
>
>(2) Sometimes we will only have part of user's first name and/or last
>name. In that case, we will need to perform a scan and return multiple
>matches to the client. To avoid scanning the whole table, if we have
>user's first name, we can set start/stop row accordingly. But then if
>we only have user's last name, we can't set up a good start/stop row.
>What's even worse, if the user provides a "sounds-like" first or last
>name, then our scan won't be able to return good possible matches.
>Does anyone ever use names as part of the row key and encounter this
>type of issue?
>
>(3) The row key seems to be long (30+ chars), will this affect our
>read/write performance? Maybe it will increase the storage a bit (say
>we have 3 million rows per month)? In other words, does the length of
>the row key matter a lot?
>
>thanks!
>
>Jason
>
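
For what it's worth, here is a minimal sketch in plain Python (no HBase client involved) of two ideas related to questions (1) and (2). It builds the key in the layout proposed above and appends a short random suffix so that two requests with the same name, birthday, and timestamp still get different keys. It also derives a start/stop row pair for a prefix scan by incrementing the last byte of the prefix. The function names, the 8-character UUID suffix, and the sample values are assumptions made up for illustration; nothing here is an HBase API, and a random suffix is only one of several ways to get uniqueness.

```python
import uuid


def name_part(name, width=6):
    """Keep at most the first `width` characters of a name."""
    return name.strip()[:width]


def build_row_key(first, last, birthday_mmddyyyy, first_seen_ms, unique_suffix=None):
    """Build {first}_{last}_{birthday}_{timestamp}_{suffix} as bytes.

    The trailing suffix is a hypothetical addition to the proposed layout:
    it keeps keys distinct even when every other field collides.
    """
    suffix = unique_suffix or uuid.uuid4().hex[:8]
    parts = [
        name_part(first),
        name_part(last),
        birthday_mmddyyyy,
        str(first_seen_ms),
        suffix,
    ]
    return "_".join(parts).encode("utf-8")


def prefix_scan_bounds(prefix):
    """Return (start_row, stop_row) covering every key that begins with `prefix`.

    The stop row is exclusive. It is the prefix with its last byte incremented,
    after dropping trailing 0xFF bytes that cannot be incremented.
    """
    stop = bytearray(prefix)
    while stop and stop[-1] == 0xFF:
        stop.pop()
    if not stop:
        # Nothing left to increment: an empty stop row is used here to mean
        # "no upper bound".
        return prefix, b""
    stop[-1] += 1
    return prefix, bytes(stop)


# Example: a fixed suffix makes the key reproducible.
key = build_row_key("Jason", "Huang", "10021980", 1349188080000, unique_suffix="a1b2c3d4")
start_row, stop_row = prefix_scan_bounds(b"Jason_")
```

Because the first name leads the key, this kind of prefix range only helps when the first name (or its beginning) is known. It does nothing for last-name-only or "sounds-like" lookups, which would need some other access path.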
