Hi,

Yes, you would want to start your key with user_id. But you don't need the timestamp: user_id + alert_id should be enough for the key. If you want to get fancy…
If your alert_id is not a number, you could use EPOCH - timestamp (in practice, a fixed maximum such as Long.MAX_VALUE minus the timestamp) to invert the order of the alerts so that the latest alert comes first. If your alert_id is a number, you could just use EPOCH - alert_id in the same way to get the alerts in reverse order, latest first.

Depending on the number of alerts, you could make the table wider and store multiple alerts in a row… but that opens a different debate about row width and how you use the data.

> On Feb 20, 2015, at 12:55 PM, Alok Singh <aloksi...@gmail.com> wrote:
>
> You can use a key like (user_id + timestamp + alert_id) to get
> clustering of rows related to a user. To get better write throughput
> and distribution over the cluster, you could pre-split the table and
> use a consistent hash of the user_id as a row key prefix.
>
> Have you looked at the rowkey design section in the hbase book :
> http://hbase.apache.org/book.html#rowkey.design
>
> Alok
>
> On Fri, Feb 20, 2015 at 8:49 AM, Marcelo Valle (BLOOMBERG/ LONDON)
> <mvallemil...@bloomberg.net> wrote:
>> Hello,
>>
>> This is my first message in this mailing list, I just subscribed.
>>
>> I have been using Cassandra for the last few years and now I am trying to
>> create a POC using HBase. Therefore, I am reading the HBase docs but it's
>> been really hard to find how HBase behaves in some situations, when compared
>> to Cassandra. I thought maybe it was a good idea to ask here, as people in
>> this list might know the differences better than anyone else.
>>
>> What I want to do is creating a simple application optimized for writes (not
>> interested in HBase / Cassandra product comparisons here, I am assuming I
>> will use HBase and that's it, just wanna understand the best way of doing it
>> in HBase world). I want to be able to write alerts to the cluster, where
>> each alert would have columns like:
>> - alert id
>> - user id
>> - date/time
>> - alert data
>>
>> Later, I want to search for alerts per user, so my main query could be
>> considered to be something like:
>> Select * from alerts where user_id = $id and date/time > 10 days ago.
>>
>> I want to decide the data model for my application.
>>
>> Here are my questions:
>>
>> - In Cassandra, I would partition by user + day, as some users can have many
>> alerts and some just 1 or a few. In hbase, assuming all alerts for a user
>> would always fit in a single partition / region, can I just use user_id as
>> my row key and assume data will be distributed along the cluster?
>>
>> - Suppose I want to write 100 000 rows from a client machine and these are
>> from 30 000 users. What's the best manner to write these if I want to
>> optimize for writes? Should I batch all 100 k requests in one to a single
>> server? As I am trying to optimize for writes, I would like to split these
>> requests across several nodes instead of sending them all to one. I found
>> this article:
>> http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/ But
>> not sure if it's what I need
>>
>> Thanks in advance!
>>
>> Best regards,
>> Marcelo.
>
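
For what it's worth, here is a rough sketch of the reversed-key idea from the top of this reply: user_id first, then a maximum value minus alert_id, so that a user's newest alert sorts first. It is only an illustration and makes assumptions that are not in the thread: both ids are non-negative integers that fit in a signed 64-bit value, keys are 16 fixed-width big-endian bytes, and the names alert_row_key and user_scan_range are made up for this example. It does not call the HBase client; it only shows the byte layout and the resulting sort order.

```python
import struct

# Same value as Java's Long.MAX_VALUE; stands in for "EPOCH" in the reply.
MAX_LONG = 2**63 - 1


def alert_row_key(user_id, alert_id):
    """Return user_id followed by (MAX_LONG - alert_id), 8 big-endian bytes each.

    Assumes both ids are non-negative 64-bit integers (hypothetical layout).
    """
    if not (0 <= user_id <= MAX_LONG and 0 <= alert_id <= MAX_LONG):
        raise ValueError("ids must be non-negative and fit in a signed 64-bit integer")
    return struct.pack(">qq", user_id, MAX_LONG - alert_id)


def user_scan_range(user_id):
    """Return (start, stop) keys covering every alert of one user.

    start is inclusive and stop is exclusive, the usual convention for a scan.
    """
    if not (0 <= user_id < MAX_LONG):
        raise ValueError("user_id must be non-negative and below MAX_LONG")
    return struct.pack(">q", user_id), struct.pack(">q", user_id + 1)


if __name__ == "__main__":
    # Byte-wise sorting mimics how rows are ordered by key.
    keys = sorted(alert_row_key(7, alert_id) for alert_id in (1, 2, 3))
    print(keys[0] == alert_row_key(7, 3))  # newest alert (highest id) sorts first
```

Fixed-width big-endian encoding is used because row keys are compared byte by byte; a variable-length or text encoding of the numbers would not keep the numeric order.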