data partitioning and data model

Marcelo Valle (BLOOMBERG/ LONDON) Fri, 20 Feb 2015 08:53:25 -0800

Hello, 

This is my first message to this mailing list; I just subscribed.


I have been using Cassandra for the last few years and now I am trying to 
create a POC using HBase. Therefore, I am reading the HBase docs but it's been 
really hard to find how HBase behaves in some situations, when compared to 
Cassandra. I thought maybe it was a good idea to ask here, as people in this 
list might know the differences better than anyone else.

What I want to do is create a simple application optimized for writes (I am not 
interested in HBase / Cassandra product comparisons here; I am assuming I will 
use HBase and that's it, and I just want to understand the best way of doing it 
in the HBase world). I want to be able to write alerts to the cluster, where 
each alert would have columns like:
- alert id
- user id
- date/time
- alert data

Later, I want to search for alerts per user, so my main query could be 
considered to be something like: 
Select * from alerts where user_id = $id and date/time > 10 days ago.
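
To make the access pattern concrete, here is a minimal sketch of how that query 
could map to a row-key range, relying only on the fact that HBase keeps rows 
sorted by the raw bytes of the row key. Everything specific in it is an 
assumption made for illustration, not something stated above: the ids are 
treated as unsigned 64-bit integers, the key layout is user id + timestamp + 
alert id, and the names alert_row_key and scan_range_since are hypothetical.

```python
import struct

def alert_row_key(user_id: int, ts_millis: int, alert_id: int) -> bytes:
    """Hypothetical composite key: user id, then timestamp, then alert id.

    Big-endian unsigned packing makes byte order match numeric order,
    so all alerts of one user sit together, oldest first.
    """
    return struct.pack(">QQQ", user_id, ts_millis, alert_id)

def scan_range_since(user_id: int, since_millis: int) -> tuple[bytes, bytes]:
    """Start (inclusive) and stop (exclusive) keys for
    'user_id = X and date/time > since_millis'."""
    # +1 because the query is strictly "greater than".
    start = struct.pack(">QQ", user_id, since_millis + 1)
    # The first possible key of the next user id bounds this user's rows.
    stop = struct.pack(">Q", user_id + 1)
    return start, stop

if __name__ == "__main__":
    now_ms = 1_424_000_000_000            # arbitrary example timestamp
    ten_days_ms = 10 * 24 * 3600 * 1000
    start, stop = scan_range_since(42, now_ms - ten_days_ms)
    key = alert_row_key(42, now_ms - 1000, 7)
    print(start <= key < stop)
```

With such a layout the query becomes a single contiguous scan over one user's 
rows. Whether this is a good key for write distribution is exactly the open 
question below.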

I want to decide the data model for my application.

Here are my questions:

- In Cassandra, I would partition by user + day, as some users can have many 
alerts and some just one or a few. In HBase, assuming all alerts for a user 
would always fit in a single partition / region, can I just use user_id as my 
row key and assume data will be distributed across the cluster?

- Suppose I want to write 100 000 rows from a client machine and these belong 
to 30 000 users. What's the best way to write them if I want to optimize for 
writes? Should I batch all 100k requests into one and send it to a single 
server? As I am trying to optimize for writes, I would like to split these 
requests across several nodes instead of sending them all to one. I found this 
article: 
http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/ but I am 
not sure it is what I need.
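
The idea of splitting a batch across nodes can be sketched as grouping row keys 
by the region whose key range contains them. This is only an illustration of 
the routing idea under stated assumptions: the region start keys below are made 
up, region_index and group_by_region are hypothetical names, and the sketch 
makes no claim about what the HBase client library does internally. It relies 
only on each region covering a range from its start key (inclusive) to the next 
region's start key (exclusive), with the first region starting at the empty key.

```python
import bisect
from collections import defaultdict

def region_index(start_keys: list[bytes], row_key: bytes) -> int:
    """Index of the region whose [start, next start) range holds row_key.

    start_keys must be sorted and begin with b"" (the first region).
    """
    return bisect.bisect_right(start_keys, row_key) - 1

def group_by_region(start_keys: list[bytes],
                    row_keys: list[bytes]) -> dict[int, list[bytes]]:
    """Split one large batch into per-region sub-batches."""
    groups: dict[int, list[bytes]] = defaultdict(list)
    for key in row_keys:
        groups[region_index(start_keys, key)].append(key)
    return dict(groups)

if __name__ == "__main__":
    # Made-up split points for three regions.
    starts = [b"", b"user3", b"user6"]
    batch = [b"user1", b"user3", b"user5", b"user9"]
    for region, keys in sorted(group_by_region(starts, batch).items()):
        print(region, keys)
```

Each sub-batch could then go to the server hosting that region, so that the 
100 000 writes are spread over as many servers as there are regions involved.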

Thanks in advance!

Best regards,
Marcelo.
