Re: data partitioning and data model

2015-02-23 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Thanks Alok, 

I will take a good look at the link for sure. 

Just an additional question. Reading this: 
http://stackoverflow.com/questions/13741946/role-of-datanode-regionserver-in-hbase-hadoop-integration
I saw that HBase can rebalance data across region servers to keep the cluster 
balanced. Does this also happen when using pre-loading?

In the case of a rebalance, if I try to WRITE data to a record being 
rebalanced, would the write performance be affected? 

Best regards,
Marcelo Valle.

From: user@hbase.apache.org 
Subject: Re: data partitioning and data model

You don't want a lot of columns in a write-heavy table. HBase stores
the row key along with each cell/column (though old, I find this
still useful:
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html).
Having a lot of columns will amplify the amount of data being stored.

That said, if there are only going to be a handful of alert_ids for a
given user_id+timestamp row key, then you should be ok.
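
To make the amplification point concrete, here is a rough back-of-the-envelope sketch. It assumes the classic uncompressed KeyValue layout (length fields, row, family, qualifier, timestamp, type, value) and ignores tags, compression and block encoding, so treat the numbers as an approximation. The key and value contents are made up for illustration.

```python
def keyvalue_size(row_key: bytes, family: bytes, qualifier: bytes, value: bytes) -> int:
    """Approximate on-disk size of one cell, assuming the classic KeyValue layout."""
    # Fixed fields: key length (4) + value length (4) + row length (2)
    # + family length (1) + timestamp (8) + key type (1).
    fixed = 4 + 4 + 2 + 1 + 8 + 1
    return fixed + len(row_key) + len(family) + len(qualifier) + len(value)


# Hypothetical row key: user_id followed by a timestamp string.
row_key = b"user-000123" + b"20150220T112100"
value = b'{"level":"warn"}'

one_cell = keyvalue_size(row_key, b"a", b"alert-1", value)
overhead_per_cell = one_cell - len(value)

# The row key is repeated in every cell, so the overhead grows with the column count.
for columns in (1, 5, 50):
    print(columns, "columns:", columns * one_cell, "bytes, of which",
          columns * overhead_per_cell, "are not alert data")
```

With a short family name and a handful of columns per row the overhead stays modest, which matches the advice above.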

The query Select * from table where user_id = X and timestamp > T and
(alert_id = id1 or alert_id = id2) can be accomplished with either
design. See the QualifierFilter and FuzzyRowFilter docs to get some ideas.

Alok
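
As a sketch of why the query works with either design, the snippet below models both layouts with plain Python dictionaries. This is an in-memory stand-in, not the HBase client API: the range scan mimics a scan over sorted row keys, and the final filtering step plays the role a row-key filter or QualifierFilter would play on the server. The `|` separator, the zero-padded timestamps and the sample data are assumptions made for illustration, and IDs are assumed not to contain `|` or `~`.

```python
# Sample alerts: (user_id, timestamp, alert_id, alert_data). Made-up data.
ALERTS = [
    ("u1", 1000, "a1", "cpu"),
    ("u1", 2000, "a2", "disk"),
    ("u1", 2000, "a3", "net"),
    ("u2", 2000, "a1", "cpu"),
]


def tall_key(user, ts, alert_id):
    # Design 1: user_id + timestamp + alert_id in the row key.
    return f"{user}|{ts:013d}|{alert_id}"


def wide_key(user, ts):
    # Design 2: user_id + timestamp in the row key, alert_id as column qualifier.
    return f"{user}|{ts:013d}"


def range_scan(table, start, stop):
    """Yield (key, value) pairs with start <= key < stop, in sorted key order."""
    for key in sorted(table):
        if start <= key < stop:
            yield key, table[key]


def bounds(user, after_ts):
    # Start just after T; '~' sorts after every digit, so it closes the user's range.
    return f"{user}|{after_ts + 1:013d}", f"{user}|~"


def query_tall(table, user, after_ts, wanted):
    start, stop = bounds(user, after_ts)
    # Filter on the alert_id part of the row key.
    return sorted(v for k, v in range_scan(table, start, stop)
                  if k.rsplit("|", 1)[1] in wanted)


def query_wide(table, user, after_ts, wanted):
    start, stop = bounds(user, after_ts)
    # Filter on the column qualifier.
    return sorted(v for _, cols in range_scan(table, start, stop)
                  for q, v in cols.items() if q in wanted)


tall = {tall_key(u, ts, a): data for u, ts, a, data in ALERTS}
wide = {}
for u, ts, a, data in ALERTS:
    wide.setdefault(wide_key(u, ts), {})[a] = data

print(query_tall(tall, "u1", 1500, {"a1", "a2"}))
print(query_wide(wide, "u1", 1500, {"a1", "a2"}))
```

Both queries scan the same key range for the user; the designs differ only in whether the alert_id test is applied to the row key or to the column name.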

On Fri, Feb 20, 2015 at 11:21 AM, Marcelo Valle (BLOOMBERG/ LONDON)
mvallemil...@bloomberg.net wrote:
 Hi Alok,

 Thanks for the answer. Yes, I have read this section, but it was a little too 
 abstract for me; I think I needed to check my understanding. Your answer 
 helped me confirm I am on the right path, thanks for that.

 One question: if instead of using user_id + timestamp + alert_id  I use 
 user_id + timestamp as row key, I would still be able to store alert_id + 
 alert_data in columns, right?

 I took the idea from the last section of this link: 
 http://www.appfirst.com/blog/best-practices-for-managing-hbase-in-a-high-write-environment/

 But I wonder which option would be better for my case. It seems column scans 
 are not as fast as row scans, but what would be the advantages of one design 
 over the other?

 If I use something like:
 Row key: user_id + timestamp
 Column prefix: alert_id
 Column value: json with alert data

 Would I be able to do a query like the one below?
 Select * from table where user_id = X and timestamp > T and (alert_id = id1 
 or alert_id = id2)

 Would I be able to do the same query using user_id + timestamp + alert_id as 
 row key?

 Also, I know Cassandra supports up to 2 billion columns per row (2 billion 
 rows per partition in CQL), do you know what's the limit for HBase?

 Best regards,
 Marcelo Valle.

 From: aloksi...@gmail.com
 Subject: Re: data partitioning and data model

 You can use a key like (user_id + timestamp + alert_id) to get
 clustering of rows related to a user. To get better write throughput
 and distribution over the cluster, you could pre-split the table and
 use a consistent hash of the user_id as a row key prefix.

 Have you looked at the rowkey design section in the hbase book :
 http://hbase.apache.org/book.html#rowkey.design

 Alok
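
 A minimal sketch of the hashed-prefix idea, in plain Python rather than the HBase API. The bucket count, the use of MD5 as the stable hash, the `|` separator and the key layout are all assumptions chosen for illustration; the split points are the values one would hand to whatever pre-split mechanism is used when creating the table.

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical; pick this to suit the number of region servers


def salt(user_id: str) -> int:
    """Stable bucket for a user: the same user_id always maps to the same bucket."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return digest[0] % NUM_BUCKETS


def row_key(user_id: str, ts_millis: int, alert_id: str) -> bytes:
    # <bucket>|<user_id>|<zero-padded timestamp>|<alert_id>
    return f"{salt(user_id):02x}|{user_id}|{ts_millis:013d}|{alert_id}".encode("utf-8")


def split_points() -> list:
    """One split point per bucket boundary, giving NUM_BUCKETS regions."""
    return [f"{bucket:02x}".encode("utf-8") for bucket in range(1, NUM_BUCKETS)]


print(row_key("user-42", 1424449260000, "a1"))
print(split_points())
```

 Because the prefix is derived from user_id alone, all rows for one user stay adjacent in one region, and a reader can recompute the prefix to scan that user's range directly.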

 On Fri, Feb 20, 2015 at 8:49 AM, Marcelo Valle (BLOOMBERG/ LONDON)
 mvallemil...@bloomberg.net wrote:
 Hello,

 This is my first message in this mailing list, I just subscribed.

 I have been using Cassandra for the last few years and now I am trying to 
 create a POC using HBase. Therefore, I am reading the HBase docs but it's 
 been really hard to find how HBase behaves in some situations, when compared 
 to Cassandra. I thought maybe it was a good idea to ask here, as people in 
 this list might know the differences better than anyone else.

 What I want to do is create a simple application optimized for writes (I am 
 not interested in HBase / Cassandra product comparisons here; I am assuming I 
 will use HBase and just want to understand the best way of doing it 
 in the HBase world). I want to be able to write alerts to the cluster, where 
 each alert would have columns like:
 - alert id
 - user id
 - date/time
 - alert data

 Later, I want to search for alerts per user, so my main query could be 
 considered to be something like:
 Select * from alerts where user_id = $id and date/time > 10 days ago.

 I want to decide the data model for my application.

 Here are my questions:

 - In Cassandra, I would partition by user + day, as some users can have many 
 alerts and some just 1 or a few. In HBase, assuming all alerts for a user 
 would always fit in a single partition / region, can I just use user_id as 
 my row key and assume data will be distributed across the cluster?

 - Suppose I want to write 100 000 rows from a client machine and these are 
 from 30 000 users. What's the best manner to write these if I want to 
 optimize for writes? Should I batch all 100 k requests in one to a single 
 server? As I am trying to optimize for writes, I would like to split these 
 requests across several nodes instead of sending them all to one. I found 
 this article: 
 http://hortonworks.com/blog/apache-hbase-region

Re: data partitioning and data model

2015-02-23 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Sorry, please assume I am using auto pre-splitting for the question below.

From: user@hbase.apache.org 
Subject: Re: data partitioning and data model

Thanks Alok, 

I will take a good look at the link for sure. 

Just an additional question. Reading this: 
http://stackoverflow.com/questions/13741946/role-of-datanode-regionserver-in-hbase-hadoop-integration
I saw that HBase can rebalance data across region servers to keep the cluster 
balanced. Does this also happen when using pre-loading?

In the case of a rebalance, if I try to WRITE data to a record being 
rebalanced, would the write performance be affected? 

Best regards,
Marcelo Valle.

Re: data partitioning and data model

2015-02-23 Thread Alok Singh
Assuming the cluster is not manually balanced, HBase will try to
maintain a roughly equal number of regions on each region server. So,
when you pre-split a table, the regions should get evenly spread out
across all of the region servers. That said, if you are pre-splitting a
new table on a cluster that already has a lot of existing
tables/regions, then you may see an uneven distribution of the regions of
the new table. HBase will try to keep the cluster-wide region distribution
even across all tables, without taking into account the distribution
of regions of a specific table.

Rebalancing shouldn't affect writes that are in flight.

After a region is split and moved, data locality between
the region server and the data node that hosts the region's data files
is sometimes lost. If you have significant load on your cluster, you will notice
an increase in read/write latency in the traffic to these regions. The
locality will eventually return after the next major compaction.

Links that have more details:
http://blog.cloudera.com/blog/2012/06/hbase-write-path/
http://www.ngdata.com/visualizing-hbase-flushes-and-compactions/

Alok
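
The uneven-distribution case can be illustrated with a toy model. This is not HBase's balancer, which is considerably more sophisticated; it only assumes the one property described above, namely that placement looks at the total region count per server and not at per-table counts. Server names and region counts are made up.

```python
def assign_regions(existing_counts, new_regions):
    """Place each new region on the server that currently hosts the fewest regions.

    Returns (total regions per server, regions of the new table per server).
    """
    totals = dict(existing_counts)
    new_table = {server: 0 for server in existing_counts}
    for _ in range(new_regions):
        # Ties are broken by server name so the result is deterministic.
        target = min(totals, key=lambda s: (totals[s], s))
        totals[target] += 1
        new_table[target] += 1
    return totals, new_table


# Empty cluster: the pre-split table spreads evenly.
print(assign_regions({"rs1": 0, "rs2": 0, "rs3": 0}, 6))

# Busy cluster where rs3 hosts few regions: the new table piles up on rs3.
print(assign_regions({"rs1": 10, "rs2": 10, "rs3": 2}, 6))
```

In the second case the cluster-wide counts end up closer together, but every region of the new table lands on one server, so writes to that table would not be spread out.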

Re: data partitioning and data model

2015-02-23 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Thanks a lot!

Re: data partitioning and data model

2015-02-23 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Alok, just to clarify:

When you say "Rebalancing shouldn't affect writes that are in flight," you 
mean only the case where I manually split the data at table creation, right?
If I choose automatic splitting, but with a row key like the one we 
described in this thread to keep data almost evenly distributed across 
partitions, I might still see increased read/write latency when data 
moves from one region to another, although this could be rare. Is this 
right?

Re: data partitioning and data model

2015-02-23 Thread Alok Singh
I meant that, in the normal course of operation, rebalancing will not
affect writes in flight. This is never an issue when pre-splitting
because, by definition, the splits occur before data is written to the
regions.

 If I choose to automatically split rows, but choosing a row key like
 we described in this thread to keep data almost evenly distributed on
 every partition, I might end up having the increase in read/write
 latency when data is moving from a region to the other, although this
 could be rare, is this right?

Yes.

Alok

Re: data partitioning and data model

2015-02-23 Thread Michael Segel
Hi, 

Yes, you would want to start your key with user_id. 
But you don’t need the timestamp. The user_id + alert_id should be enough for 
the key. 
If you want to get fancy…

If your alert_id is not a number, you could use EPOCH - timestamp as a way 
to invert the order of the alerts so that the latest alert comes first.
If your alert_id is a number, you could just use EPOCH - alert_id to get the 
alerts in reverse order with the latest alert first. 

Depending on the number of alerts, you could make the table wider and store 
multiple alerts in a row… but that brings in a different debate when it comes 
to row width and how you use the data. 
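
A small sketch of the inverted-order trick, in plain Python. It reads "EPOCH - timestamp" as subtracting the timestamp from a large constant; using Java's Long.MAX_VALUE for that constant is an assumption here, as are the `|` separator and the zero-padded decimal encoding (a real key would more likely use fixed-width bytes).

```python
MAX_LONG = 2**63 - 1  # same value as Java's Long.MAX_VALUE


def reversed_ts(ts_millis: int) -> int:
    """Larger (newer) timestamps map to smaller values, so they sort first."""
    return MAX_LONG - ts_millis


def row_key(user_id: str, ts_millis: int, alert_id: str) -> str:
    # Zero-pad to 19 digits so string order matches numeric order.
    return f"{user_id}|{reversed_ts(ts_millis):019d}|{alert_id}"


keys = sorted(row_key("u1", ts, f"alert-{ts}") for ts in (1000, 3000, 2000))
for key in keys:
    print(key)
```

Sorting the keys in ascending order, which is how a scan returns them, yields the newest alert for the user first.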






Re: data partitioning and data model

2015-02-23 Thread Michael Segel
Yes and no. 

It’s a bit more complicated, and it also depends on the data and how you’re 
using it. 

I wouldn’t go too thin and I wouldn’t go too fat. 






data partitioning and data model

2015-02-20 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Hello, 

This is my first message in this mailing list, I just subscribed. 

I have been using Cassandra for the last few years and now I am trying to 
create a POC using HBase. Therefore, I am reading the HBase docs but it's been 
really hard to find how HBase behaves in some situations, when compared to 
Cassandra. I thought maybe it was a good idea to ask here, as people in this 
list might know the differences better than anyone else.

What I want to do is create a simple application optimized for writes (I am not 
interested in HBase / Cassandra product comparisons here; I am assuming I will 
use HBase and that's it, and I just want to understand the best way of doing it 
in the HBase world). I want to be able to write alerts to the cluster, where each 
alert would have columns like:
- alert id
- user id
- date/time
- alert data

Later, I want to search for alerts per user, so my main query could be 
considered to be something like: 
Select * from alerts where user_id = $id and date/time > 10 days ago.

I want to decide the data model for my application.

Here are my questions:

- In Cassandra, I would partition by user + day, as some users can have many 
alerts and some just 1 or a few. In hbase, assuming all alerts for a user would 
always fit in a single partition / region, can I just use user_id as my row key 
and assume data will be distributed along the cluster?

- Suppose I want to write 100 000 rows from a client machine and these are from 
30 000 users. What's the best manner to write these if I want to optimize for 
writes? Should I batch all 100 k requests in one to a single server? As I am 
trying to optimize for writes, I would like to split these requests across 
several nodes instead of sending them all to one. I found this article: 
http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/ but I am 
not sure if it's what I need.

Thanks in advance!

Best regards,
Marcelo.

Re: data partitioning and data model

2015-02-20 Thread Alok Singh
You can use a key like (user_id + timestamp + alert_id) to get
clustering of rows related to a user. To get better write throughput
and distribution over the cluster, you could pre-split the table and
use a consistent hash of the user_id as a row key prefix.
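
One way to sketch the salted prefix and the matching pre-split points, in Python rather than the HBase client API. The bucket count, the MD5-mod-N salt (a plain hash bucket, not a true consistent-hash ring), and the key layout are all assumptions made for illustration:

```python
import hashlib
import struct
from bisect import bisect_right

NUM_BUCKETS = 16  # hypothetical number of pre-split regions


def salt(user_id: str) -> int:
    """Map a user_id to a stable bucket number in [0, NUM_BUCKETS)."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return digest[0] % NUM_BUCKETS


def row_key(user_id: str, timestamp_ms: int, alert_id: str) -> bytes:
    """salt byte + user_id + separator + big-endian timestamp + alert_id."""
    return (
        bytes([salt(user_id)])
        + user_id.encode("utf-8")
        + b"\x00"
        + struct.pack(">q", timestamp_ms)
        + alert_id.encode("utf-8")
    )


# One split point per bucket boundary: 0x01, 0x02, ..., 0x0f.
SPLIT_POINTS = [bytes([b]) for b in range(1, NUM_BUCKETS)]


def region_index(key: bytes) -> int:
    """Index of the pre-split region whose key range contains this key."""
    return bisect_right(SPLIT_POINTS, key)
```

All rows for one user share the same salt byte, so they stay clustered and sorted by timestamp, while different users spread over the buckets. The cost is that a reader has to recompute the salt from the user_id before scanning.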

Have you looked at the rowkey design section in the hbase book :
http://hbase.apache.org/book.html#rowkey.design

Alok



Re: data partitioning and data model

2015-02-20 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Hi Alok, 

Thanks for the answer. Yes, I have read this section, but it was a little too 
abstract for me, and I needed to check my understanding. Your answer 
helped me confirm I am on the right path, thanks for that.

One question: if instead of using user_id + timestamp + alert_id  I use user_id 
+ timestamp as row key, I would still be able to store alert_id + alert_data in 
columns, right?

I took the idea from the last section of this link: 
http://www.appfirst.com/blog/best-practices-for-managing-hbase-in-a-high-write-environment/

But I wonder which option would be better for my case. It seems column scans 
are not as fast as row scans, but what would be the advantages of one design 
over the other?

If I use something like:
Row key: user_id + timestamp
Column prefix: alert_id 
Column value: json with alert data

Would I be able to do a query like the one below?
Select * from table where user_id = X and timestamp > T and (alert_id = id1 or 
alert_id = id2)

Would I be able to do the same query using user_id + timestamp + alert_id as 
row key?

Also, I know Cassandra supports up to 2 billion columns per row (2 billion rows 
per partition in CQL), do you know what's the limit for HBase?

Best regards,
Marcelo Valle.





Re: data partitioning and data model

2015-02-20 Thread Alok Singh
You don't want a lot of columns in a write-heavy table. HBase stores
the row key along with each cell/column (though old, I find this
still useful:
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html).
Having a lot of columns will amplify the amount of data being stored.

That said, if there are only going to be a handful of alert_ids for a
given user_id+timestamp row key, then you should be ok.

The query Select * from table where user_id = X and timestamp > T and
(alert_id = id1 or alert_id = id2) can be accomplished with either
design. See the QualifierFilter and FuzzyRowFilter docs to get some ideas.
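
To make "either design" concrete, here is a small in-memory model in Python, not HBase client code. It assumes the comparison is timestamp > T with integer timestamps, uses tuples in place of byte row keys, and the sample data and function names are hypothetical. The alert_id check plays the role a row-key filter would play in the tall design and a qualifier filter in the wide one:

```python
from bisect import bisect_left

# (user_id, timestamp, alert_id, alert data as JSON text)
alerts = [
    ("u1", 100, "id1", '{"m": "a"}'),
    ("u1", 200, "id2", '{"m": "b"}'),
    ("u1", 200, "id3", '{"m": "c"}'),
    ("u1", 300, "id1", '{"m": "d"}'),
    ("u2", 250, "id1", '{"m": "e"}'),
]

# Tall design: row key = (user_id, timestamp, alert_id), one cell per row.
tall = sorted((((u, ts, a), data) for u, ts, a, data in alerts),
              key=lambda row: row[0])

# Wide design: row key = (user_id, timestamp), one column per alert_id.
wide = {}
for u, ts, a, data in alerts:
    wide.setdefault((u, ts), {})[a] = data
wide_rows = sorted(wide.items(), key=lambda row: row[0])


def query_tall(user, t, wanted):
    """Range scan from (user, t + 1), then filter on the key's alert_id part."""
    keys = [key for key, _ in tall]
    i = bisect_left(keys, (user, t + 1))
    out = []
    while i < len(tall) and tall[i][0][0] == user:
        (_, ts, alert_id), data = tall[i]
        if alert_id in wanted:
            out.append((ts, alert_id, data))
        i += 1
    return out


def query_wide(user, t, wanted):
    """Same range scan, then keep only the wanted column qualifiers."""
    keys = [key for key, _ in wide_rows]
    i = bisect_left(keys, (user, t + 1))
    out = []
    while i < len(wide_rows) and wide_rows[i][0][0] == user:
        (_, ts), columns = wide_rows[i]
        for alert_id in sorted(columns):
            if alert_id in wanted:
                out.append((ts, alert_id, columns[alert_id]))
        i += 1
    return out
```

Both `query_tall("u1", 100, {"id1", "id2"})` and `query_wide("u1", 100, {"id1", "id2"})` return the alerts at timestamps 200 (id2) and 300 (id1). In both designs the user_id + timestamp part of the key bounds the scan; they differ only in where the alert_id is matched.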

Alok
