Re: OOM when fetching all versions of single row

2014-11-03 Thread Michael Segel
St.Ack,

I think you're sidestepping the issue concerning schema design.

Since HBase isn't my core focus, I also have to ask: since when have heap sizes
over 16GB been the norm?
(Really, 8GB seems like quite a large heap size...)


On Oct 31, 2014, at 11:15 AM, Stack st...@duboce.net wrote:

 On Thu, Oct 30, 2014 at 8:20 AM, Andrejs Dubovskis dubis...@gmail.com
 wrote:
 
 Hi!
 
 We have a bunch of rows in HBase which store varying sizes of data
 (1-50MB). We use HBase versioning and keep up to 1 column
 versions. Typically each column has only a few versions, but in rare
 cases a column may have thousands of versions.

 Our MapReduce job uses a full scan, and the algorithm requires all
 versions to produce the result. So we call scan.setMaxVersions().

 In the worst case the Region Server returns only one row, but a huge
 one. Its size is unpredictable and cannot be controlled, because the
 scan parameters only limit the row count. The MR task can throw an
 OOME even with a 50GB heap.

 Is it possible to handle this situation? For example, the RS should
 not send the row to the client if the client does not have enough
 memory to hold it. In that case the client could handle the error and
 fetch each row's versions in separate Get requests.
 
 
 See HBASE-11544 [Ergonomics] hbase.client.scanner.caching is dogged and
 will try to return batch even if it means OOME.
 St.Ack





Re: OOM when fetching all versions of single row

2014-11-03 Thread Bryan Beaudreault
There are many blog posts and articles about people turning to >16GB
heaps since Java 7 and the G1 collector became mainstream.  We run with a 25GB
heap ourselves with very short GC pauses, using a mostly untuned G1
collector.  Just one example is the excellent blog post by Intel:
https://software.intel.com/en-us/blogs/2014/06/18/part-1-tuning-java-garbage-collection-for-hbase
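
For what it's worth, the relevant bits of our setup boil down to a couple of
lines in hbase-env.sh; the sizes below are only an illustration, not a
recommendation, so tune them to your own hardware:

# hbase-env.sh (illustrative values only)
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -Xms25g -Xmx25g \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=100"

Beyond that we have done very little G1 tuning; the Intel post above goes much
deeper if you need to.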

That said, two things:

1) St.Ack's reply is very relevant, because as HBase matures it needs to
make it harder for new people to shoot themselves in the foot.  I'd love to
see more tickets like HBASE-11544. This is something we run into often,
with 10s of developers writing queries against a few shared clusters.

2) Since none of these enhancements are available yet, I recommend
rethinking your schema if possible. You could change the cardinality
such that you end up with more rows with fewer versions each, instead of
these fat rows. While not exactly the same, you might be able to use TTL
or your own purge job to keep the number of rows limited.
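
To make the cardinality idea concrete, here is a rough sketch of one way to do
it. The table layout, family, and qualifier names are made up for
illustration, and the reversed timestamp is just one choice of ordering:

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionAsRowKey {
  private static final byte[] CF = Bytes.toBytes("d");         // assumed family
  private static final byte[] QUAL = Bytes.toBytes("payload"); // assumed qualifier

  // Each "version" becomes its own row keyed by <entityId><reversed ts>,
  // so a scan walks many small rows instead of one fat, many-versioned row.
  public static Put toPut(String entityId, long timestamp, byte[] value) {
    byte[] rowKey = Bytes.add(
        Bytes.toBytes(entityId),
        Bytes.toBytes(Long.MAX_VALUE - timestamp)); // newest version sorts first
    Put put = new Put(rowKey);
    put.add(CF, QUAL, value);
    return put;
  }
}

A prefix scan on the entityId then gives you all versions newest-first, and a
TTL on the column family (or a periodic purge job keyed on the timestamp part
of the key) keeps the row count bounded.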

On Mon, Nov 3, 2014 at 2:02 PM, Michael Segel mse...@segel.com wrote:

 St.Ack,

 I think you're sidestepping the issue concerning schema design.

 Since HBase isn't my core focus, I also have to ask: since when have heap
 sizes over 16GB been the norm?
 (Really, 8GB seems like quite a large heap size...)


 On Oct 31, 2014, at 11:15 AM, Stack st...@duboce.net wrote:

 On Thu, Oct 30, 2014 at 8:20 AM, Andrejs Dubovskis dubis...@gmail.com
 wrote:

 Hi!

 We have a bunch of rows in HBase which store varying sizes of data
 (1-50MB). We use HBase versioning and keep up to 1 column
 versions. Typically each column has only a few versions, but in rare
 cases a column may have thousands of versions.

 Our MapReduce job uses a full scan, and the algorithm requires all
 versions to produce the result. So we call scan.setMaxVersions().

 In the worst case the Region Server returns only one row, but a huge
 one. Its size is unpredictable and cannot be controlled, because the
 scan parameters only limit the row count. The MR task can throw an
 OOME even with a 50GB heap.

 Is it possible to handle this situation? For example, the RS should
 not send the row to the client if the client does not have enough
 memory to hold it. In that case the client could handle the error and
 fetch each row's versions in separate Get requests.


 See HBASE-11544 [Ergonomics] hbase.client.scanner.caching is dogged and
 will try to return batch even if it means OOME.
 St.Ack





Re: OOM when fetching all versions of single row

2014-11-03 Thread Michael Segel
Bryan,

I wasn’t saying St.Ack’s post wasn’t relevant, but that it’s not addressing the
easiest thing to fix: schema design.
IMHO, that’s shooting oneself in the foot.

You shouldn’t be using versioning to capture temporal data. 


On Nov 3, 2014, at 1:54 PM, Bryan Beaudreault bbeaudrea...@hubspot.com wrote:

 There are many blog posts and articles about people turning to >16GB
 heaps since Java 7 and the G1 collector became mainstream.  We run with a 25GB
 heap ourselves with very short GC pauses, using a mostly untuned G1
 collector.  Just one example is the excellent blog post by Intel:
 https://software.intel.com/en-us/blogs/2014/06/18/part-1-tuning-java-garbage-collection-for-hbase
 
 That said, two things:
 
 1) St.Ack's reply is very relevant, because as HBase matures it needs to
 make it harder for new people to shoot themselves in the foot.  I'd love to
 see more tickets like HBASE-11544. This is something we run into often,
 with 10s of developers writing queries against a few shared clusters.
 
 2) Since none of these enhancements are available yet, I recommend
 rethinking your schema if possible. You could change the cardinality
 such that you end up with more rows with fewer versions each, instead of
 these fat rows. While not exactly the same, you might be able to use TTL
 or your own purge job to keep the number of rows limited.
 
 On Mon, Nov 3, 2014 at 2:02 PM, Michael Segel mse...@segel.com wrote:
 
 St.Ack,
 
 I think you're sidestepping the issue concerning schema design.

 Since HBase isn't my core focus, I also have to ask: since when have heap
 sizes over 16GB been the norm?
 (Really, 8GB seems like quite a large heap size...)
 
 
 On Oct 31, 2014, at 11:15 AM, Stack st...@duboce.net wrote:
 
 On Thu, Oct 30, 2014 at 8:20 AM, Andrejs Dubovskis dubis...@gmail.com
 wrote:
 
 Hi!
 
 We have a bunch of rows in HBase which store varying sizes of data
 (1-50MB). We use HBase versioning and keep up to 1 column
 versions. Typically each column has only a few versions, but in rare
 cases a column may have thousands of versions.

 Our MapReduce job uses a full scan, and the algorithm requires all
 versions to produce the result. So we call scan.setMaxVersions().

 In the worst case the Region Server returns only one row, but a huge
 one. Its size is unpredictable and cannot be controlled, because the
 scan parameters only limit the row count. The MR task can throw an
 OOME even with a 50GB heap.

 Is it possible to handle this situation? For example, the RS should
 not send the row to the client if the client does not have enough
 memory to hold it. In that case the client could handle the error and
 fetch each row's versions in separate Get requests.
 
 
 See HBASE-11544 [Ergonomics] hbase.client.scanner.caching is dogged and
 will try to return batch even if it means OOME.
 St.Ack
 
 
 



Re: OOM when fetching all versions of single row

2014-10-31 Thread Michael Segel
Here’s the simple answer. 

Don’t do it. 

The way you are abusing versioning is a bad design. 

Redesign your schema. 



On Oct 30, 2014, at 10:20 AM, Andrejs Dubovskis dubis...@gmail.com wrote:

 Hi!
 
 We have a bunch of rows in HBase which store varying sizes of data
 (1-50MB). We use HBase versioning and keep up to 1 column
 versions. Typically each column has only a few versions, but in rare
 cases a column may have thousands of versions.

 Our MapReduce job uses a full scan, and the algorithm requires all
 versions to produce the result. So we call scan.setMaxVersions().

 In the worst case the Region Server returns only one row, but a huge
 one. Its size is unpredictable and cannot be controlled, because the
 scan parameters only limit the row count. The MR task can throw an
 OOME even with a 50GB heap.

 Is it possible to handle this situation? For example, the RS should
 not send the row to the client if the client does not have enough
 memory to hold it. In that case the client could handle the error and
 fetch each row's versions in separate Get requests.
 
 
 Best wishes,
 --
 Andrejs Dubovskis
 



Re: OOM when fetching all versions of single row

2014-10-31 Thread Stack
On Thu, Oct 30, 2014 at 8:20 AM, Andrejs Dubovskis dubis...@gmail.com
wrote:

 Hi!

 We have a bunch of rows in HBase which store varying sizes of data
 (1-50MB). We use HBase versioning and keep up to 1 column
 versions. Typically each column has only a few versions, but in rare
 cases a column may have thousands of versions.

 Our MapReduce job uses a full scan, and the algorithm requires all
 versions to produce the result. So we call scan.setMaxVersions().

 In the worst case the Region Server returns only one row, but a huge
 one. Its size is unpredictable and cannot be controlled, because the
 scan parameters only limit the row count. The MR task can throw an
 OOME even with a 50GB heap.

 Is it possible to handle this situation? For example, the RS should
 not send the row to the client if the client does not have enough
 memory to hold it. In that case the client could handle the error and
 fetch each row's versions in separate Get requests.


See HBASE-11544 [Ergonomics] hbase.client.scanner.caching is dogged and
will try to return batch even if it means OOME.
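
Until something along those lines lands, the closest existing knob is
Scan#setBatch: it caps the number of cells per Result, so a wide row comes
back as several partial Results instead of one giant one. A rough sketch only
(the chunk handler is hypothetical, and note that with batch set the same row
key will show up in more than one Result):

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class BatchedVersionScan {
  static void scanAllVersions(HTableInterface table) throws IOException {
    Scan scan = new Scan();
    scan.setMaxVersions();   // every version, as in the original job
    scan.setBatch(100);      // at most 100 cells per Result (tune to taste)
    scan.setCaching(1);      // one Result per RPC while cells are this large
    try (ResultScanner scanner = table.getScanner(scan)) {
      for (Result partial : scanner) {
        // With batch set, one logical row arrives as several partial Results
        // that share the same row key; stitch them together downstream.
        // handleChunk(partial);   // hypothetical handler
      }
    }
  }
}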
St.Ack


OOM when fetching all versions of single row

2014-10-30 Thread Andrejs Dubovskis
Hi!

We have a bunch of rows in HBase which store varying sizes of data
(1-50MB). We use HBase versioning and keep up to 1 column
versions. Typically each column has only a few versions, but in rare
cases a column may have thousands of versions.

Our MapReduce job uses a full scan, and the algorithm requires all
versions to produce the result. So we call scan.setMaxVersions().

In the worst case the Region Server returns only one row, but a huge
one. Its size is unpredictable and cannot be controlled, because the
scan parameters only limit the row count. The MR task can throw an
OOME even with a 50GB heap.

Is it possible to handle this situation? For example, the RS should
not send the row to the client if the client does not have enough
memory to hold it. In that case the client could handle the error and
fetch each row's versions in separate Get requests.
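
To make that fallback concrete, what I have in mind on the client side is
roughly the sketch below; the chunk size and the per-version handler are
placeholders:

import java.io.IOException;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Result;

public class PagedVersionGet {
  // Pages through the versions of a single row in fixed-size chunks so the
  // client never holds thousands of multi-MB cells in memory at once.
  static void fetchVersionsInChunks(HTableInterface table, byte[] row,
                                    byte[] family, byte[] qualifier)
      throws IOException {
    final int chunk = 100;               // placeholder chunk size
    long upper = Long.MAX_VALUE;
    while (true) {
      Get get = new Get(row);
      get.addColumn(family, qualifier);
      get.setMaxVersions(chunk);
      get.setTimeRange(0, upper);        // only versions older than those seen
      Result result = table.get(get);
      if (result.isEmpty()) {
        break;
      }
      long oldest = Long.MAX_VALUE;
      for (Cell cell : result.rawCells()) {
        oldest = Math.min(oldest, cell.getTimestamp());
        // processVersion(cell);         // placeholder per-version handler
      }
      if (result.rawCells().length < chunk) {
        break;                           // last chunk reached
      }
      upper = oldest;                    // next call: strictly older versions
    }
  }
}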


Best wishes,
--
Andrejs Dubovskis