Thanks Alok, I will take a good look at the link for sure.
Just an additional question. Reading this:
http://stackoverflow.com/questions/13741946/role-of-datanode-regionserver-in-hbase-hadoop-integration
I saw that HBase can rebalance data inside region servers to keep the cluster balanced. Does this also happen when using pre-loading? In the case of a rebalance, if I try to WRITE data to a record being rebalanced, would the write performance be affected?

Best regards,
Marcelo Valle.

From: user@hbase.apache.org
Subject: Re: data partitioning and data model

You don't want a lot of columns in a write heavy table. HBase stores the "row key" along with each cell/column (though old, I find this still useful: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html). Having a lot of columns will amplify the amount of data being stored. That said, if there are only going to be a handful of alert_ids for a given "user_id+timestamp" row key, then you should be ok.

The query "Select * from table where user_id = X and timestamp > T and (alert_id = id1 or alert_id = id2)" can be accomplished with either design. See QualifierFilter and FuzzyRowFilter docs to get some ideas.

Alok

On Fri, Feb 20, 2015 at 11:21 AM, Marcelo Valle (BLOOMBERG/ LONDON) <mvallemil...@bloomberg.net> wrote:
> Hi Alok,
>
> Thanks for the answer. Yes, I have read this section, but it was a little too
> abstract for me, I think I was needing to check my understanding. Your answer
> helped me to confirm I am on the right path, thanks for that.
>
> One question: if instead of using user_id + timestamp + alert_id I use
> user_id + timestamp as row key, I would still be able to store alert_id +
> alert_data in columns, right?
>
> I took the idea from the last section of this link:
> http://www.appfirst.com/blog/best-practices-for-managing-hbase-in-a-high-write-environment/
>
> But I wonder which option would be better for my case. It seems column scans
> are not as fast as row scans, but what would be the advantages of one design
> over the other?
>
> If I use something like:
> Row key: user_id + timestamp
> Column prefix: alert_id
> Column value: json with alert data
>
> Would I be able to do a query like the one below?
> Select * from table where user_id = X and timestamp > T and (alert_id = id1
> or alert_id = id2)
>
> Would I be able to do the same query using user_id + timestamp + alert_id as
> row key?
>
> Also, I know Cassandra supports up to 2 billion columns per row (2 billion
> rows per partition in CQL), do you know what's the limit for HBase?
>
> Best regards,
> Marcelo Valle.
>
> From: aloksi...@gmail.com
> Subject: Re: data partitioning and data model
>
> You can use a key like (user_id + timestamp + alert_id) to get
> clustering of rows related to a user. To get better write throughput
> and distribution over the cluster, you could pre-split the table and
> use a consistent hash of the user_id as a row key prefix.
>
> Have you looked at the rowkey design section in the hbase book :
> http://hbase.apache.org/book.html#rowkey.design
>
> Alok
>
> On Fri, Feb 20, 2015 at 8:49 AM, Marcelo Valle (BLOOMBERG/ LONDON)
> <mvallemil...@bloomberg.net> wrote:
>> Hello,
>>
>> This is my first message in this mailing list, I just subscribed.
>>
>> I have been using Cassandra for the last few years and now I am trying to
>> create a POC using HBase. Therefore, I am reading the HBase docs but it's
>> been really hard to find how HBase behaves in some situations, when compared
>> to Cassandra. I thought maybe it was a good idea to ask here, as people in
>> this list might know the differences better than anyone else.
>>
>> What I want to do is create a simple application optimized for writes (not
>> interested in HBase / Cassandra product comparisons here, I am assuming I
>> will use HBase and that's it, just wanna understand the best way of doing it
>> in HBase world).
>> I want to be able to write alerts to the cluster, where
>> each alert would have columns like:
>> - alert id
>> - user id
>> - date/time
>> - alert data
>>
>> Later, I want to search for alerts per user, so my main query could be
>> considered to be something like:
>> Select * from alerts where user_id = $id and date/time > 10 days ago.
>>
>> I want to decide the data model for my application.
>>
>> Here are my questions:
>>
>> - In Cassandra, I would partition by user + day, as some users can have many
>> alerts and some just 1 or a few. In hbase, assuming all alerts for a user
>> would always fit in a single partition / region, can I just use user_id as
>> my row key and assume data will be distributed along the cluster?
>>
>> - Suppose I want to write 100 000 rows from a client machine and these are
>> from 30 000 users. What's the best manner to write these if I want to
>> optimize for writes? Should I batch all 100 k requests in one to a single
>> server? As I am trying to optimize for writes, I would like to split these
>> requests across several nodes instead of sending them all to one. I found
>> this article:
>> http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/ But
>> not sure if it's what I need
>>
>> Thanks in advance!
>>
>> Best regards,
>> Marcelo.
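
For reference, the row key layout suggested in this thread (a consistent hash of user_id as a prefix, followed by user_id + timestamp + alert_id, on a pre-split table) could be sketched roughly as below. This is only an illustration in plain Python, not HBase client code. The bucket count, the one-byte salt, the NUL separator, the 8-byte big-endian timestamp and all function names are assumptions made for the example; none of them come from the thread or from the HBase API.

```python
import hashlib
import struct

NUM_BUCKETS = 16  # hypothetical number of salt buckets / initial regions


def salt_bucket(user_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a user_id to a stable bucket so all of a user's rows share a prefix."""
    assert 0 < num_buckets <= 256, "the salt is stored in a single byte"
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets


def user_prefix(user_id: str, num_buckets: int = NUM_BUCKETS) -> bytes:
    # Assumes user_id never contains a NUL byte, which is used as a separator.
    return bytes([salt_bucket(user_id, num_buckets)]) + user_id.encode("utf-8") + b"\x00"


def make_row_key(user_id: str, timestamp_ms: int, alert_id: str,
                 num_buckets: int = NUM_BUCKETS) -> bytes:
    """salt + user_id + NUL + big-endian timestamp + alert_id."""
    # Big-endian packing makes byte order match numeric order for timestamps.
    return (user_prefix(user_id, num_buckets)
            + struct.pack(">Q", timestamp_ms)
            + alert_id.encode("utf-8"))


def split_points(num_buckets: int = NUM_BUCKETS) -> list:
    """Split keys for pre-splitting a table into one region per salt bucket."""
    return [bytes([b]) for b in range(1, num_buckets)]


def scan_range(user_id: str, since_ms: int, num_buckets: int = NUM_BUCKETS):
    """(start_row, stop_row) covering one user's alerts at or after since_ms.

    The start row is inclusive and the stop row is exclusive.
    """
    prefix = user_prefix(user_id, num_buckets)
    start = prefix + struct.pack(">Q", since_ms)
    stop = prefix[:-1] + b"\x01"  # first key past every row with this prefix
    return start, stop
```

Because the salt depends only on user_id, the "alerts for user X newer than T" query stays a single contiguous range scan, while different users are spread over the pre-split regions. Restricting the result to specific alert_ids would still need a filter on top of this range, as mentioned above.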