Re: OOM when fetching all versions of single row
St.Ack, I think you're sidestepping the issue concerning schema design. Since HBase isn't my core focus, I also have to ask: since when have heap sizes over 16GB been the norm? (Really, 8GB already seems like quite a large heap size...)

On Oct 31, 2014, at 11:15 AM, Stack st...@duboce.net wrote: [quoted reply and original question snipped]
Re: OOM when fetching all versions of single row
There are many blog posts and articles about people moving to 16GB heaps since Java 7 and the G1 collector became mainstream. We run with a 25GB heap ourselves and see very short GC pauses using a mostly untuned G1 collector. Just one example is the excellent blog post by Intel: https://software.intel.com/en-us/blogs/2014/06/18/part-1-tuning-java-garbage-collection-for-hbase

That said, two things:

1) St.Ack's reply is very relevant, because as HBase matures it needs to make it harder for new people to shoot themselves in the foot. I'd love to see more tickets like HBASE-11544. This is something we run into often, with tens of developers writing queries against a few shared clusters.

2) Since none of these enhancements are available yet, I recommend rethinking your schema if possible. You could change the cardinality so that you end up with more rows with fewer versions each, instead of these fat rows. While not exactly the same, you might be able to use TTL or your own purge job to keep the number of rows limited.

On Mon, Nov 3, 2014 at 2:02 PM, Michael Segel mse...@segel.com wrote: [quoted thread snipped]
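The G1 setup described above boils down to a few JVM flags in hbase-env.sh. A minimal sketch, in the spirit of the Intel post; the heap size and pause target here are illustrative examples, not recommendations from this thread:

```shell
# Illustrative G1 settings for a region server (hbase-env.sh).
# Heap size and pause target are examples only; tune for your cluster.
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -Xms25g -Xmx25g \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=100 \
  -XX:+ParallelRefProcEnabled"
```

Setting -Xms equal to -Xmx avoids heap resizing pauses; the pause-time goal is a hint to G1, not a guarantee.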
Re: OOM when fetching all versions of single row
Bryan, I wasn't saying St.Ack's post wasn't relevant, but that it's not addressing the easiest thing to fix: schema design. IMHO, that design is shooting one's self in the foot. You shouldn't be using versioning to capture temporal data.

On Nov 3, 2014, at 1:54 PM, Bryan Beaudreault bbeaudrea...@hubspot.com wrote: [quoted thread snipped]
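The usual way to act on this advice is to fold the version into the row key instead of relying on cell versions. A hypothetical sketch (the entity name, separator, and key format are made up for illustration): subtracting the timestamp from Long.MAX_VALUE makes newer versions sort first under HBase's lexicographic key ordering, so a prefix scan on the entity id walks versions newest-first, and each version is an ordinary row of bounded size.

```java
// Sketch: encode the version into the row key instead of using cell
// versions. Names are illustrative. Reversing the timestamp makes newer
// entries sort first under lexicographic byte/string ordering, so a
// prefix scan on the entity id returns versions newest-first.
public class VersionedKeys {

    // Build "entityId~<reversed-timestamp>", zero-padded to 19 digits so
    // that string comparison matches numeric comparison of the timestamp.
    public static String versionedKey(String entityId, long timestampMillis) {
        long reversed = Long.MAX_VALUE - timestampMillis;
        return entityId + "~" + String.format("%019d", reversed);
    }

    public static void main(String[] args) {
        String newer = versionedKey("doc42", 2000L);
        String older = versionedKey("doc42", 1000L);
        // The newer version sorts before the older one.
        System.out.println(newer.compareTo(older) < 0); // true
    }
}
```

With this layout, Scan.setMaxVersions() is no longer needed; a scan bounded by the "doc42~" prefix naturally pages through versions one row at a time.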
Re: OOM when fetching all versions of single row
Here's the simple answer: don't do it. The way you are abusing versioning is a bad design. Redesign your schema.

On Oct 30, 2014, at 10:20 AM, Andrejs Dubovskis dubis...@gmail.com wrote: [quoted question snipped]
Re: OOM when fetching all versions of single row
On Thu, Oct 30, 2014 at 8:20 AM, Andrejs Dubovskis dubis...@gmail.com wrote: [quoted question snipped]

See HBASE-11544 "[Ergonomics] hbase.client.scanner.caching is dogged and will try to return batch even if it means OOME".

St.Ack
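Until the HBASE-11544 work lands, the existing knob for fat rows is Scan.setBatch(), which caps the number of cells per Result so one wide row comes back as several partial Results rather than a single object that must fit in the client heap. A hedged sketch; the connection, table name, and batch size are assumptions, not details from this thread:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class BatchedScanSketch {

    // Sketch only: assumes an already-open Connection.
    // setBatch(100) returns at most 100 cells per Result, so a row with
    // thousands of versions arrives as many partial Results.
    static void scanAllVersionsBatched(Connection conn) throws IOException {
        Scan scan = new Scan();
        scan.setMaxVersions();   // all versions, as in the original question
        scan.setBatch(100);      // but at most 100 cells per Result
        try (Table table = conn.getTable(TableName.valueOf("data"));
             ResultScanner scanner = table.getScanner(scan)) {
            for (Result partial : scanner) {
                // With batching on, consecutive Results may belong to the
                // same row; group on partial.getRow() if whole-row logic
                // is required.
                // process partial.rawCells() ...
            }
        }
    }
}
```

The trade-off is that row atomicity per Result is lost: the caller must stitch partials together if it needs the whole row at once, which of course reintroduces the memory problem setBatch() avoids.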
OOM when fetching all versions of single row
Hi!

We have a bunch of rows in HBase which store varying sizes of data (1-50MB). We use HBase versioning and keep up to 1 column versions. Typically each column has only a few versions, but in rare cases it may have thousands of versions.

The MapReduce algorithm uses a full scan, and our algorithm requires all versions to produce the result, so we call scan.setMaxVersions(). In the worst case the Region Server returns only one row, but a huge one. Its size is unpredictable and cannot be controlled, because the scan parameters only let us control the row count. The MR task can throw an OOME even with a 50GB heap.

Is it possible to handle this situation? For example, the RS should not send the row to the client if the latter has no memory to handle it. The client could then handle the error and fetch each of the row's versions in a separate get request.

Best wishes,
-- Andrejs Dubovskis