You can use a composite row key like (user_id + timestamp + alert_id) so that all rows for a user are stored next to each other and can be read with a single range scan. To get better write throughput and distribution over the cluster, you could pre-split the table and use a consistent hash of the user_id as a row key prefix.
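To make that concrete, here is a rough sketch of how such a key could be built. It is plain Python with no HBase client involved, and the details are my assumptions rather than anything HBase prescribes: a one-byte bucket prefix taken from an MD5 of the user_id, 16 buckets, a NUL separator after the user_id (so user ids must not contain NUL), and a big-endian millisecond timestamp so that keys sort by time. The function names are made up for illustration.

```python
import hashlib
import struct
from typing import List, Tuple

NUM_BUCKETS = 16  # assumed number of pre-split regions; must be <= 256 for a 1-byte prefix


def salt_bucket(user_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a user_id to a stable bucket so all of a user's rows share one prefix."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets


def user_prefix(user_id: str, num_buckets: int = NUM_BUCKETS) -> bytes:
    """Bucket byte + user_id + NUL separator (assumes user ids never contain NUL)."""
    return bytes([salt_bucket(user_id, num_buckets)]) + user_id.encode("utf-8") + b"\x00"


def make_row_key(user_id: str, timestamp_ms: int, alert_id: str,
                 num_buckets: int = NUM_BUCKETS) -> bytes:
    """Row key = hash prefix + user_id + timestamp + alert_id."""
    # Big-endian unsigned 64-bit keeps byte order equal to numeric order.
    ts = struct.pack(">Q", timestamp_ms)
    return user_prefix(user_id, num_buckets) + ts + alert_id.encode("utf-8")


def split_points(num_buckets: int = NUM_BUCKETS) -> List[bytes]:
    """Region boundaries for pre-splitting: one region per bucket byte."""
    return [bytes([b]) for b in range(1, num_buckets)]


def scan_range(user_id: str, since_ms: int,
               num_buckets: int = NUM_BUCKETS) -> Tuple[bytes, bytes]:
    """Start (inclusive) and stop (exclusive) keys for one user's alerts since since_ms."""
    start = user_prefix(user_id, num_buckets) + struct.pack(">Q", since_ms)
    # Replacing the NUL separator with 0x01 gives a key just past this user's rows.
    stop = user_prefix(user_id, num_buckets)[:-1] + b"\x01"
    return start, stop
```

Because the prefix depends only on the user_id, the "alerts for a user in the last 10 days" query stays a single range scan, while writes from many users spread across the pre-split regions. The split points would be passed to whatever you use to create the table.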
Have you looked at the rowkey design section in the HBase book?
http://hbase.apache.org/book.html#rowkey.design

Alok

On Fri, Feb 20, 2015 at 8:49 AM, Marcelo Valle (BLOOMBERG/ LONDON) <mvallemil...@bloomberg.net> wrote:
> Hello,
>
> This is my first message in this mailing list, I just subscribed.
>
> I have been using Cassandra for the last few years and now I am trying to
> create a POC using HBase. Therefore, I am reading the HBase docs but it's
> been really hard to find how HBase behaves in some situations, when compared
> to Cassandra. I thought maybe it was a good idea to ask here, as people in
> this list might know the differences better than anyone else.
>
> What I want to do is creating a simple application optimized for writes (not
> interested in HBase / Cassandra product comparisions here, I am assuming I
> will use HBase and that's it, just wanna understand the best way of doing it
> in HBase world). I want to be able to write alerts to the cluster, where each
> alert would have columns like:
> - alert id
> - user id
> - date/time
> - alert data
>
> Later, I want to search for alerts per user, so my main query could be
> considered to be something like:
> Select * from alerts where user_id = $id and date/time > 10 days ago.
>
> I want to decide the data model for my application.
>
> Here are my questions:
>
> - In Cassandra, I would partition by user + day, as some users can have many
> alerts and some just 1 or a few. In hbase, assuming all alerts for a user
> would always fit in a single partition / region, can I just use user_id as my
> row key and assume data will be distributed along the cluster?
>
> - Suppose I want to write 100 000 rows from a client machine and these are
> from 30 000 users. What's the best manner to write these if I want to
> optimize for writes? Should I batch all 100 k requests in one to a single
> server? As I am trying to optimize for writes, I would like to split these
> requests across several nodes instead of sending them all to one. I found
> this article:
> http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/ But
> not sure if it's what I need
>
> Thanks in advance!
>
> Best regards,
> Marcelo.