Re: data partitioning and data model

Alok Singh Fri, 20 Feb 2015 10:58:17 -0800

You can use a composite row key like (user_id + timestamp + alert_id).
HBase stores rows sorted by row key, so all rows for a user end up
clustered together and ordered by time. To get better write throughput
and distribution over the cluster, you could pre-split the table and
use a consistent hash of the user_id as a row key prefix.
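
As a rough illustration of the key layout described above, here is a
minimal Python sketch that only builds the byte strings, with no HBase
client calls. The bucket count (16), the MD5-based salt, the 0x00
separator, and all function names are assumptions made for this example,
not anything prescribed by HBase. It also assumes user ids never contain
a 0x00 byte and that timestamps are non-negative milliseconds.

```python
import hashlib
import struct

NUM_BUCKETS = 16  # assumed number of salt buckets, one pre-split region each


def salt(user_id: str) -> int:
    """Stable bucket for a user: first byte of MD5(user_id) mod NUM_BUCKETS."""
    return hashlib.md5(user_id.encode("utf-8")).digest()[0] % NUM_BUCKETS


def key_prefix(user_id: str) -> bytes:
    """salt byte | user_id | 0x00 separator (assumes no 0x00 inside user_id)."""
    return bytes([salt(user_id)]) + user_id.encode("utf-8") + b"\x00"


def row_key(user_id: str, ts_millis: int, alert_id: str) -> bytes:
    """prefix | 8-byte big-endian timestamp | alert_id.

    Big-endian encoding makes byte order match chronological order.
    """
    return key_prefix(user_id) + struct.pack(">Q", ts_millis) + alert_id.encode("utf-8")


def scan_range(user_id: str, start_millis: int, end_millis: int):
    """(start_row, stop_row) covering one user's alerts in [start, end)."""
    prefix = key_prefix(user_id)
    return (prefix + struct.pack(">Q", start_millis),
            prefix + struct.pack(">Q", end_millis))


def split_points():
    """Split keys for pre-splitting the table: one boundary per salt bucket."""
    return [bytes([b]) for b in range(1, NUM_BUCKETS)]
```

Because the salt is derived from the user_id alone, all alerts for one
user still land in the same bucket, so the "alerts for user X in the last
10 days" query stays a single range scan between the two keys returned by
scan_range, while different users are spread over the pre-split regions.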


Have you looked at the row key design section in the HBase book?
http://hbase.apache.org/book.html#rowkey.design

Alok

On Fri, Feb 20, 2015 at 8:49 AM, Marcelo Valle (BLOOMBERG/ LONDON)
<mvallemil...@bloomberg.net> wrote:
> Hello,
>
> This is my first message in this mailing list, I just subscribed.
>
> I have been using Cassandra for the last few years and now I am trying to 
> create a POC using HBase. Therefore, I am reading the HBase docs but it's 
> been really hard to find how HBase behaves in some situations, when compared 
> to Cassandra. I thought maybe it was a good idea to ask here, as people in 
> this list might know the differences better than anyone else.
>
> What I want to do is create a simple application optimized for writes (not 
> interested in HBase / Cassandra product comparisons here, I am assuming I 
> will use HBase and that's it, I just want to understand the best way of doing 
> it in the HBase world). I want to be able to write alerts to the cluster, 
> where each alert would have columns like:
> - alert id
> - user id
> - date/time
> - alert data
>
> Later, I want to search for alerts per user, so my main query could be 
> considered to be something like:
> Select * from alerts where user_id = $id and date/time > 10 days ago.
>
> I want to decide the data model for my application.
>
> Here are my questions:
>
> - In Cassandra, I would partition by user + day, as some users can have many 
> alerts and some just 1 or a few. In hbase, assuming all alerts for a user 
> would always fit in a single partition / region, can I just use user_id as my 
> row key and assume data will be distributed along the cluster?
>
> - Suppose I want to write 100 000 rows from a client machine and these are 
> from 30 000 users. What's the best manner to write these if I want to 
> optimize for writes? Should I batch all 100 k requests in one to a single 
> server? As I am trying to optimize for writes, I would like to split these 
> requests across several nodes instead of sending them all to one. I found 
> this article: 
> http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/ But 
> not sure if it's what I need
>
> Thanks in advance!
>
> Best regards,
> Marcelo.
