Re: data partitioning and data model

Marcelo Valle (BLOOMBERG/ LONDON) Mon, 23 Feb 2015 08:44:28 -0800

Thanks Alok, 

I will take a good look at the link for sure.


One additional question. I read here:
http://stackoverflow.com/questions/13741946/role-of-datanode-regionserver-in-hbase-hadoop-integration
that HBase can rebalance data inside region servers to keep the cluster balanced.
Does this also happen when using pre-loading?

In the case of a rebalance, if I try to WRITE data to a record being 
rebalanced, would the write performance be affected? 

Best regards,
Marcelo Valle.

From: user@hbase.apache.org 
Subject: Re: data partitioning and data model

You don't want a lot of columns in a write-heavy table. HBase stores
the "row key" along with each cell/column. (Though old, I find this
still useful:
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html)
Having a lot of columns will amplify the amount of data being stored.
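
To make the amplification concrete, here is a rough back-of-the-envelope
sketch. It assumes the classic uncompressed KeyValue layout (4-byte key
length, 4-byte value length, 2-byte row length, row, 1-byte family length,
family, qualifier, 8-byte timestamp, 1-byte type, value) and ignores tags,
compression and block encoding. The function names and the sizes in the
example are illustrative assumptions, not figures from this thread.

```python
def keyvalue_size(row_key: bytes, family: bytes, qualifier: bytes, value: bytes) -> int:
    """Approximate size in bytes of one stored cell (assumed KeyValue layout)."""
    key_len = (
        2 + len(row_key)      # row length prefix + row key
        + 1 + len(family)     # family length prefix + column family name
        + len(qualifier)      # column qualifier
        + 8                   # cell timestamp
        + 1                   # key type (put, delete, ...)
    )
    return 4 + 4 + key_len + len(value)  # key length + value length + key + value


def row_key_bytes_stored(row_key: bytes, cells_per_row: int) -> int:
    """The row key is repeated once per cell, so it scales with the column count."""
    return len(row_key) * cells_per_row


# Hypothetical sizes: 16-byte row key (user_id + timestamp), 1-byte family,
# 8-byte alert_id qualifier, 100-byte alert payload.
example_row_key = b"k" * 16
example_cell = keyvalue_size(example_row_key, b"a", b"q" * 8, b"v" * 100)
example_overhead = example_cell - 100
```

With only a handful of cells per row the repeated row key is small next to
the payload; with thousands of columns per row it starts to dominate.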

That said, if there are only going to be a handful of alert_ids for a
given "user_id+timestamp" row key, then you should be ok.

The query "Select * from table where user_id = X and timestamp > T and
(alert_id = id1 or alert_id = id2)" can be accomplished with either
design. See QualifierFilter and FuzzyRowFilter docs to get some ideas.
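
As a minimal sketch of why both designs can answer that query, the snippet
below simulates a sorted key space in plain Python. It does not use the
HBase client; the 8-byte big-endian encoding, the helper names and the
sample data are all assumptions made for illustration. In the real API the
row range would be a Scan with start/stop rows, and the alert_id check
would be done by a filter on the server.

```python
import struct


def scan(rows, start, stop):
    """Yield (key, payload) with start <= key < stop in key order, like a range scan."""
    for key in sorted(rows):
        if start <= key < stop:
            yield key, rows[key]


def user_range(user_id, after_ts):
    """Row range covering one user with timestamp > after_ts."""
    start = struct.pack(">QQ", user_id, after_ts + 1)
    stop = struct.pack(">Q", user_id + 1)
    return start, stop


def query_design_a(rows, user_id, after_ts, alert_ids):
    """Design A: row key = user_id + timestamp + alert_id, one cell per row."""
    start, stop = user_range(user_id, after_ts)
    wanted = {struct.pack(">Q", a) for a in alert_ids}
    # The alert_id sits at a fixed offset in the key (bytes 16..24).
    return [(k, v) for k, v in scan(rows, start, stop) if k[16:24] in wanted]


def query_design_b(rows, user_id, after_ts, alert_ids):
    """Design B: row key = user_id + timestamp, one column per alert_id."""
    start, stop = user_range(user_id, after_ts)
    wanted = set(alert_ids)
    result = []
    for key, columns in scan(rows, start, stop):
        hits = {q: v for q, v in columns.items() if q in wanted}
        if hits:
            result.append((key, hits))
    return result


# Hypothetical sample data: (user_id, timestamp, alert_id) -> alert json
alerts = [(7, 100, 1), (7, 200, 1), (7, 200, 3), (7, 300, 2), (8, 250, 1)]
table_a = {struct.pack(">QQQ", u, t, a): "{}" for u, t, a in alerts}
table_b = {}
for u, t, a in alerts:
    table_b.setdefault(struct.pack(">QQ", u, t), {})[a] = "{}"

hits_a = query_design_a(table_a, 7, 100, [1, 2])
hits_b = query_design_b(table_b, 7, 100, [1, 2])
```

In both cases the user_id and timestamp conditions become a contiguous row
range; only the alert_id test differs (part of the row key in design A, a
column qualifier in design B).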

Alok

On Fri, Feb 20, 2015 at 11:21 AM, Marcelo Valle (BLOOMBERG/ LONDON)
<mvallemil...@bloomberg.net> wrote:
> Hi Alok,
>
> Thanks for the answer. Yes, I have read this section, but it was a little too 
> abstract for me; I think I needed to check my understanding. Your answer 
> helped me confirm I am on the right path, thanks for that.
>
> One question: if instead of using user_id + timestamp + alert_id  I use 
> user_id + timestamp as row key, I would still be able to store alert_id + 
> alert_data in columns, right?
>
> I took the idea from the last section of this link: 
> http://www.appfirst.com/blog/best-practices-for-managing-hbase-in-a-high-write-environment/
>
> But I wonder which option would be better for my case. It seems column scans 
> are not as fast as row scans, but what would be the advantages of one design 
> over the other?
>
> If I use something like:
> Row key: user_id + timestamp
> Column prefix: alert_id
> Column value: json with alert data
>
> Would I be able to do a query like the one below?
> Select * from table where user_id = X and timestamp > T and (alert_id = id1 
> or alert_id = id2)
>
> Would I be able to do the same query using user_id + timestamp + alert_id as 
> row key?
>
> Also, I know Cassandra supports up to 2 billion columns per row (2 billion 
> rows per partition in CQL). Do you know what the limit is for HBase?
>
> Best regards,
> Marcelo Valle.
>
> From: aloksi...@gmail.com
> Subject: Re: data partitioning and data model
>
> You can use a key like (user_id + timestamp + alert_id) to get
> clustering of rows related to a user. To get better write throughput
> and distribution over the cluster, you could pre-split the table and
> use a consistent hash of the user_id as a row key prefix.
>
> Have you looked at the rowkey design section in the HBase book:
> http://hbase.apache.org/book.html#rowkey.design
>
> Alok
>
> On Fri, Feb 20, 2015 at 8:49 AM, Marcelo Valle (BLOOMBERG/ LONDON)
> <mvallemil...@bloomberg.net> wrote:
>> Hello,
>>
>> This is my first message in this mailing list, I just subscribed.
>>
>> I have been using Cassandra for the last few years and now I am trying to 
>> create a POC using HBase. Therefore, I am reading the HBase docs but it's 
>> been really hard to find how HBase behaves in some situations, when compared 
>> to Cassandra. I thought maybe it was a good idea to ask here, as people in 
>> this list might know the differences better than anyone else.
>>
>> What I want to do is create a simple application optimized for writes (not 
>> interested in HBase / Cassandra product comparisons here, I am assuming I 
>> will use HBase and that's it, just wanna understand the best way of doing it 
>> in the HBase world). I want to be able to write alerts to the cluster, where 
>> each alert would have columns like:
>> - alert id
>> - user id
>> - date/time
>> - alert data
>>
>> Later, I want to search for alerts per user, so my main query could be 
>> considered to be something like:
>> Select * from alerts where user_id = $id and date/time > 10 days ago.
>>
>> I want to decide the data model for my application.
>>
>> Here are my questions:
>>
>> - In Cassandra, I would partition by user + day, as some users can have many 
>> alerts and some just 1 or a few. In HBase, assuming all alerts for a user 
>> would always fit in a single partition / region, can I just use user_id as 
>> my row key and assume data will be distributed along the cluster?
>>
>> - Suppose I want to write 100 000 rows from a client machine and these are 
>> from 30 000 users. What's the best manner to write these if I want to 
>> optimize for writes? Should I batch all 100 k requests in one to a single 
>> server? As I am trying to optimize for writes, I would like to split these 
>> requests across several nodes instead of sending them all to one. I found 
>> this article: 
>> http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/ But 
>> I am not sure if it's what I need.
>>
>> Thanks in advance!
>>
>> Best regards,
>> Marcelo.
>
>
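
For reference, the pre-split plus hashed-prefix row key suggested earlier
in the thread could look roughly like the sketch below. It only builds
keys and split points in plain Python and does not call the HBase client.
The bucket count, the use of MD5, the one-byte salt and the 8-byte integer
fields are assumptions for illustration; note this is a simple stable hash
modulo a fixed bucket count, not a consistent-hashing ring.

```python
import hashlib
import struct

NUM_BUCKETS = 16  # hypothetical number of pre-split regions


def salt(user_id: int) -> int:
    """Stable bucket number derived from the user_id (same user, same bucket)."""
    digest = hashlib.md5(struct.pack(">Q", user_id)).digest()
    return digest[0] % NUM_BUCKETS


def row_key(user_id: int, timestamp: int, alert_id: int) -> bytes:
    """salt + user_id + timestamp + alert_id, big-endian so bytes sort numerically."""
    return bytes([salt(user_id)]) + struct.pack(">QQQ", user_id, timestamp, alert_id)


def split_points() -> list:
    """One split key per bucket boundary, usable when pre-splitting the table."""
    return [bytes([bucket]) for bucket in range(1, NUM_BUCKETS)]


def user_scan_range(user_id: int, after_ts: int):
    """Start/stop rows for all alerts of one user with timestamp > after_ts."""
    prefix = bytes([salt(user_id)])
    start = prefix + struct.pack(">QQ", user_id, after_ts + 1)
    stop = prefix + struct.pack(">Q", user_id + 1)
    return start, stop
```

Because the salt depends only on user_id, writes from many users spread
over the buckets while all rows of one user stay contiguous, so the
per-user query remains a single range scan.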
