data partitioning and data model

2015-02-20 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Hello, This is my first message to this mailing list; I just subscribed. I have been using Cassandra for the last few years and now I am trying to create a POC using HBase. Therefore, I am reading the HBase docs, but it has been really hard to find out how HBase behaves in some situations, when comp

Re: data partitioning and data model

2015-02-20 Thread Alok Singh
You can use a key like (user_id + timestamp + alert_id) to get clustering of rows related to a user. To get better write throughput and distribution over the cluster, you could pre-split the table and use a consistent hash of the user_id as a row key prefix. Have you looked at the rowkey design se
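
The key layout suggested above can be sketched as follows. This is a minimal illustration, not HBase client code: the bucket count, the 0x00 separator, the fixed-width timestamp encoding, and the function names are all assumptions made for the example, and a plain hash-modulo bucket stands in for the "consistent hash" mentioned in the message.

```python
import hashlib
import struct

NUM_BUCKETS = 16  # assumption: the table is pre-split into 16 regions


def salt_for(user_id, buckets=NUM_BUCKETS):
    """Stable bucket derived from user_id, so all rows of one user share a prefix."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return digest[0] % buckets


def make_row_key(user_id, timestamp_ms, alert_id):
    """Layout: bucket (1 byte) + user_id + 0x00 + big-endian timestamp (8 bytes) + alert_id.

    Assumes user_id never contains a 0x00 byte. The fixed-width big-endian
    timestamp makes byte order match chronological order within a user.
    """
    return (
        bytes([salt_for(user_id)])
        + user_id.encode("utf-8")
        + b"\x00"
        + struct.pack(">Q", timestamp_ms)
        + alert_id.encode("utf-8")
    )


def split_points(buckets=NUM_BUCKETS):
    """Region boundaries for pre-splitting: one boundary per bucket prefix after the first."""
    return [bytes([b]) for b in range(1, buckets)]
```

Because the prefix depends only on user_id, a user's rows stay contiguous inside one bucket while different users spread across the pre-split regions.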

Re: data partitioning and data model

2015-02-20 Thread Marcelo Valle (BLOOMBERG/ LONDON)
now what's the limit for HBase? Best regards, Marcelo Valle. From: aloksi...@gmail.com Subject: Re: data partitioning and data model You can use a key like (user_id + timestamp + alert_id) to get clustering of rows related to a user. To get better write throughput and distribution over the

Re: data partitioning and data model

2015-02-20 Thread Alok Singh
= X and timestamp > T and (alert_id = id1 > or alert_id = id2) > > Would I be able to do the same query using user_id + timestamp + alert_id as > row key? > > Also, I know Cassandra supports up to 2 billion columns per row (2 billion > rows per partition in CQL), do you

Re: data partitioning and data model

2015-02-23 Thread Marcelo Valle (BLOOMBERG/ LONDON)
. Does this happen also when using pre-loading? In the case of a rebalance, if I try to WRITE data to a record being rebalanced, would the write performance be affected? Best regards, Marcelo Valle. From: user@hbase.apache.org Subject: Re: data partitioning and data model You don't want a l

Re: data partitioning and data model

2015-02-23 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Sorry, please assume I am using auto pre-splitting for the question below. From: user@hbase.apache.org Subject: Re: data partitioning and data model Thanks Alok, I will take a good look at the link for sure. Just an additional question, I saw, reading this: http://stackoverflow.com/questions

Re: data partitioning and data model

2015-02-23 Thread Alok Singh
d1 >> or alert_id = id2) >> >> Would I be able to do the same query using user_id + timestamp + alert_id as >> row key? >> >> Also, I know Cassandra supports up to 2 billion columns per row (2 billion >> rows per partition in CQL), do you know what's

Re: data partitioning and data model

2015-02-23 Thread Marcelo Valle (BLOOMBERG/ LONDON)
o keep data almost evenly distributed on every partition, I might end up seeing an increase in read/write latency while data is moving from one region to another, although this could be rare. Is this right? From: user@hbase.apache.org Subject: Re: data partitioning and data model Assuming the clust

Re: data partitioning and data model

2015-02-23 Thread Alok Singh
key like we > described in this thread to keep data almost evenly distributed on every > partition, I might end up having the increase in read/write latency when data > is moving from a region to the other, although this could be rare, is this > right? > > From: user@hba

Re: data partitioning and data model

2015-02-23 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Thanks a lot! From: aloksi...@gmail.com Subject: Re: data partitioning and data model I meant, in the normal course of operation, rebalancing will not affect writes in flight. This is never an issue when pre splitting because, by definition, splits occurred before data was written to the

Re: data partitioning and data model

2015-02-23 Thread Michael Segel
Hi, Yes you would want to start your key by user_id. But you don’t need the timestamp. The user_id + alert_id should be enough on the key. If you want to get fancy… If your alert_id is not a number, you could use the EPOCH - Timestamp as a way to invert the order of the alerts so that the la
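
One common reading of the inverted-timestamp idea mentioned above is to store (maximum value - timestamp) in the key so that newer rows sort first. The sketch below assumes a Java Long.MAX_VALUE upper bound, a 0x00 separator, and hypothetical function names; it illustrates the ordering trick only and is not taken from the thread.

```python
import struct

MAX_TS = 2**63 - 1  # assumption: Java Long.MAX_VALUE used as the upper bound


def reversed_ts(timestamp_ms):
    """Encode (MAX_TS - timestamp) big-endian so larger timestamps sort earlier."""
    return struct.pack(">Q", MAX_TS - timestamp_ms)


def make_newest_first_key(user_id, timestamp_ms, alert_id):
    """Layout: user_id + 0x00 + reversed timestamp + alert_id (user_id must not contain 0x00)."""
    return (
        user_id.encode("utf-8")
        + b"\x00"
        + reversed_ts(timestamp_ms)
        + alert_id.encode("utf-8")
    )
```

With this layout, a scan that starts at the user's prefix returns that user's most recent alerts first, so "latest N alerts" can stop after N rows.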

Re: data partitioning and data model

2015-02-23 Thread Michael Segel
>> Would I be able to do a query like the one below? >> Select * from table where user_id = X and timestamp > T and (alert_id = id1 >> or alert_id = id2) >> >> Would I be able to do the same query using user_id + timestamp + alert_id as >> row key? >>
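
As a rough illustration of how the quoted query maps onto a user_id + timestamp + alert_id row key, the sketch below simulates a sorted key space in plain Python: the user_id and timestamp conditions become a start/stop key range, and the alert_id condition is applied as a filter on the rows inside that range. The key encoding, the 0x00 separator, and the function names are assumptions for the example; no HBase client API is used.

```python
import bisect
import struct


def row_key(user_id, timestamp_ms, alert_id):
    """Layout: user_id + 0x00 + big-endian timestamp (8 bytes) + alert_id."""
    return (
        user_id.encode("utf-8")
        + b"\x00"
        + struct.pack(">Q", timestamp_ms)
        + alert_id.encode("utf-8")
    )


def scan_alerts(sorted_keys, user_id, after_ts, wanted_alert_ids):
    """Return keys with user_id = X, timestamp > T and alert_id in the wanted set."""
    prefix = user_id.encode("utf-8") + b"\x00"
    start = prefix + struct.pack(">Q", after_ts + 1)  # first key with timestamp > T
    stop = user_id.encode("utf-8") + b"\x01"          # first key past this user's prefix
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    wanted = {a.encode("utf-8") for a in wanted_alert_ids}
    ts_end = len(prefix) + 8                          # offset where alert_id begins
    return [k for k in sorted_keys[lo:hi] if k[ts_end:] in wanted]
```

The range bounds restrict the read to one user's rows newer than T; the alert_id check still has to look at every row in that range, because alert_id comes last in the key.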