Re: data partitioning and data model

Michael Segel Mon, 23 Feb 2015 12:36:13 -0800

Hi, 

Yes, you would want to start your key with user_id. 
But you don’t need the timestamp. The user_id + alert_id should be enough for 
the key. 
If you want to get fancy…


If your alert_id is not a number, you could use EPOCH - Timestamp (where EPOCH 
stands for a fixed maximum value such as Long.MAX_VALUE) as a way to invert the 
order of the alerts so that the latest alert would be first.
If your alert_id is a number that increases over time, you could just use 
EPOCH - alert_id to get the alerts in reverse order with the latest alert first. 
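
To make the reversed key concrete, here is a minimal sketch in plain Python (no HBase client involved). It assumes "EPOCH" means a large fixed constant, here the value of Java's Long.MAX_VALUE, and that row keys are compared as raw bytes. The function names, the NUL separator and the fixed 8-byte encoding are illustrative choices, not something prescribed in this thread.

```python
import struct

# Assumption: "EPOCH" is read as a large constant; this is Java's Long.MAX_VALUE.
LONG_MAX = 2**63 - 1


def reversed_long(value: int) -> bytes:
    """Encode (LONG_MAX - value) as 8 big-endian bytes.

    For non-negative inputs, byte-wise ordering of the result is the
    reverse of the numeric ordering of the inputs.
    """
    if not 0 <= value <= LONG_MAX:
        raise ValueError("value must be a non-negative signed 64-bit integer")
    return struct.pack(">q", LONG_MAX - value)


def key_by_timestamp(user_id: str, ts_millis: int, alert_id: str) -> bytes:
    """Key for a non-numeric alert_id: user_id, separator, reversed timestamp, alert_id.

    Assumes user_id never contains a NUL byte (hypothetical separator).
    """
    return (user_id.encode("utf-8") + b"\x00"
            + reversed_long(ts_millis) + alert_id.encode("utf-8"))


def key_by_numeric_id(user_id: str, alert_id: int) -> bytes:
    """Key for a numeric, increasing alert_id: user_id, separator, reversed id."""
    return user_id.encode("utf-8") + b"\x00" + reversed_long(alert_id)


if __name__ == "__main__":
    keys = [key_by_timestamp("user42", ts, name)
            for ts, name in [(1000, "old"), (3000, "new"), (2000, "mid")]]
    # Sorting the raw bytes mimics how rows are ordered by key.
    for k in sorted(keys):
        print(k)
```

Sorting the keys as bytes puts the alert with the largest timestamp first within a user, so a scan that starts at the user_id prefix would see the most recent alerts first.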

Depending on the number of alerts, you could make the table wider and store 
multiple alerts in a row… but that brings in a different debate when it comes 
to row width and how you use the data. 
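
As a rough illustration of the wider layout, the sketch below models a table as a dictionary of rows, each row being a dictionary of column qualifiers. It is not HBase client code. It assumes one row per user, with each alert stored under a qualifier built from a reversed timestamp plus the alert_id; all names are hypothetical.

```python
import struct

# Assumption: a large constant for reversing the order (Java's Long.MAX_VALUE).
LONG_MAX = 2**63 - 1

# In-memory stand-in for a table: row key -> {column qualifier -> value}.
wide_table = {}


def reversed_qualifier(ts_millis: int, alert_id: str) -> bytes:
    """Qualifier that sorts newest-first: reversed timestamp, then alert_id."""
    return struct.pack(">q", LONG_MAX - ts_millis) + alert_id.encode("utf-8")


def put_alert(user_id: str, ts_millis: int, alert_id: str, data: bytes) -> None:
    """Store one alert as a column in the user's single row."""
    row = wide_table.setdefault(user_id.encode("utf-8"), {})
    row[reversed_qualifier(ts_millis, alert_id)] = data


def latest_alerts(user_id: str, limit: int) -> list:
    """Return up to `limit` alert values, newest first, for one user."""
    row = wide_table.get(user_id.encode("utf-8"), {})
    return [row[q] for q in sorted(row)[:limit]]


if __name__ == "__main__":
    put_alert("user42", 1000, "a1", b"first")
    put_alert("user42", 3000, "a3", b"third")
    put_alert("user42", 2000, "a2", b"second")
    print(latest_alerts("user42", 2))
```

The trade-off hinted at above is that every alert for a user now lives in one row, and a single row is not split across regions, so a user with a very large number of alerts produces a very wide row.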

> On Feb 20, 2015, at 12:55 PM, Alok Singh <aloksi...@gmail.com> wrote:
> 
> You can use a key like (user_id + timestamp + alert_id) to get
> clustering of rows related to a user. To get better write throughput
> and distribution over the cluster, you could pre-split the table and
> use a consistent hash of the user_id as a row key prefix.
> 
> Have you looked at the rowkey design section in the hbase book :
> http://hbase.apache.org/book.html#rowkey.design
> 
> Alok
> 
> On Fri, Feb 20, 2015 at 8:49 AM, Marcelo Valle (BLOOMBERG/ LONDON)
> <mvallemil...@bloomberg.net> wrote:
>> Hello,
>> 
>> This is my first message in this mailing list, I just subscribed.
>> 
>> I have been using Cassandra for the last few years and now I am trying to 
>> create a POC using HBase. Therefore, I am reading the HBase docs but it's 
>> been really hard to find how HBase behaves in some situations, when compared 
>> to Cassandra. I thought maybe it was a good idea to ask here, as people in 
>> this list might know the differences better than anyone else.
>> 
>> What I want to do is create a simple application optimized for writes (not 
>> interested in HBase / Cassandra product comparisons here, I am assuming I 
>> will use HBase and that's it, just wanna understand the best way of doing it 
>> in the HBase world). I want to be able to write alerts to the cluster, where 
>> each alert would have columns like:
>> - alert id
>> - user id
>> - date/time
>> - alert data
>> 
>> Later, I want to search for alerts per user, so my main query could be 
>> considered to be something like:
>> Select * from alerts where user_id = $id and date/time > 10 days ago.
>> 
>> I want to decide the data model for my application.
>> 
>> Here are my questions:
>> 
>> - In Cassandra, I would partition by user + day, as some users can have many 
>> alerts and some just 1 or a few. In hbase, assuming all alerts for a user 
>> would always fit in a single partition / region, can I just use user_id as 
>> my row key and assume data will be distributed along the cluster?
>> 
>> - Suppose I want to write 100 000 rows from a client machine and these are 
>> from 30 000 users. What's the best manner to write these if I want to 
>> optimize for writes? Should I batch all 100 k requests in one to a single 
>> server? As I am trying to optimize for writes, I would like to split these 
>> requests across several nodes instead of sending them all to one. I found 
>> this article: 
>> http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/ but 
>> I am not sure if it's what I need.
>> 
>> Thanks in advance!
>> 
>> Best regards,
>> Marcelo.
>
