Re: OOM when fetching all versions of single row
St.Ack, I think you're sidestepping the issue concerning schema design. Since HBase isn't my core focus, I also have to ask: since when have heap sizes over 16GB been the norm? (Really, 8GB already seems like quite a large heap size...)

On Oct 31, 2014, at 11:15 AM, Stack st...@duboce.net wrote: [quoted reply and original question snipped]
Re: OOM when fetching all versions of single row
There are many blog posts and articles about people moving to 16GB heaps since Java 7 and the G1 collector became mainstream. We run with a 25GB heap ourselves and see very short GC pauses using a mostly untuned G1 collector. Just one example is the excellent blog post by Intel: https://software.intel.com/en-us/blogs/2014/06/18/part-1-tuning-java-garbage-collection-for-hbase

That said, two things:

1) St.Ack's reply is very relevant, because as HBase matures it needs to make it harder for new people to shoot themselves in the foot. I'd love to see more tickets like HBASE-11544. This is something we run into often, with tens of developers writing queries against a few shared clusters.

2) Since none of these enhancements are available yet, I recommend rethinking your schema if possible. You could change the cardinality so that you end up with more rows with fewer versions each, instead of these fat rows. While not exactly the same, you might be able to use TTL or your own purge job to keep the number of rows limited.

On Mon, Nov 3, 2014 at 2:02 PM, Michael Segel mse...@segel.com wrote: [quoted thread snipped]
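The G1 setup described above boils down to a few JVM flags in hbase-env.sh. A minimal sketch, in the spirit of the Intel post; the heap size and pause target here are illustrative examples, not recommendations from this thread:

```shell
# Illustrative G1 settings for a region server (hbase-env.sh).
# Heap size and pause target are examples only; tune for your cluster.
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -Xms25g -Xmx25g \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=100 \
  -XX:+ParallelRefProcEnabled"
```

Setting -Xms equal to -Xmx avoids heap resizing pauses; the pause-time goal is a hint to G1, not a guarantee.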
Re: OOM when fetching all versions of single row
Bryan, I wasn't saying St.Ack's post wasn't relevant, but that it's not addressing the easiest thing to fix: schema design. IMHO, that design is shooting one's self in the foot. You shouldn't be using versioning to capture temporal data.

On Nov 3, 2014, at 1:54 PM, Bryan Beaudreault bbeaudrea...@hubspot.com wrote: [quoted thread snipped]
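The usual way to act on this advice is to fold the version into the row key instead of relying on cell versions. A hypothetical sketch (the entity name, separator, and key format are made up for illustration): subtracting the timestamp from Long.MAX_VALUE makes newer versions sort first under HBase's lexicographic key ordering, so a prefix scan on the entity id walks versions newest-first, and each version is an ordinary row of bounded size.

```java
// Sketch: encode the version into the row key instead of using cell
// versions. Names are illustrative. Reversing the timestamp makes newer
// entries sort first under lexicographic byte/string ordering, so a
// prefix scan on the entity id returns versions newest-first.
public class VersionedKeys {

    // Build "entityId~<reversed-timestamp>", zero-padded to 19 digits so
    // that string comparison matches numeric comparison of the timestamp.
    public static String versionedKey(String entityId, long timestampMillis) {
        long reversed = Long.MAX_VALUE - timestampMillis;
        return entityId + "~" + String.format("%019d", reversed);
    }

    public static void main(String[] args) {
        String newer = versionedKey("doc42", 2000L);
        String older = versionedKey("doc42", 1000L);
        // The newer version sorts before the older one.
        System.out.println(newer.compareTo(older) < 0); // true
    }
}
```

With this layout, Scan.setMaxVersions() is no longer needed; a scan bounded by the "doc42~" prefix naturally pages through versions one row at a time.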
Re: OOM when fetching all versions of single row
Here's the simple answer: don't do it. The way you are abusing versioning is a bad design. Redesign your schema.

On Oct 30, 2014, at 10:20 AM, Andrejs Dubovskis dubis...@gmail.com wrote: [quoted question snipped]
Re: OOM when fetching all versions of single row
On Thu, Oct 30, 2014 at 8:20 AM, Andrejs Dubovskis dubis...@gmail.com wrote: [quoted question snipped]

See HBASE-11544 "[Ergonomics] hbase.client.scanner.caching is dogged and will try to return batch even if it means OOME".

St.Ack
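Until the HBASE-11544 work lands, the existing knob for fat rows is Scan.setBatch(), which caps the number of cells per Result so one wide row comes back as several partial Results rather than a single object that must fit in the client heap. A hedged sketch; the connection, table name, and batch size are assumptions, not details from this thread:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class BatchedScanSketch {

    // Sketch only: assumes an already-open Connection.
    // setBatch(100) returns at most 100 cells per Result, so a row with
    // thousands of versions arrives as many partial Results.
    static void scanAllVersionsBatched(Connection conn) throws IOException {
        Scan scan = new Scan();
        scan.setMaxVersions();   // all versions, as in the original question
        scan.setBatch(100);      // but at most 100 cells per Result
        try (Table table = conn.getTable(TableName.valueOf("data"));
             ResultScanner scanner = table.getScanner(scan)) {
            for (Result partial : scanner) {
                // With batching on, consecutive Results may belong to the
                // same row; group on partial.getRow() if whole-row logic
                // is required.
                // process partial.rawCells() ...
            }
        }
    }
}
```

The trade-off is that row atomicity per Result is lost: the caller must stitch partials together if it needs the whole row at once, which of course reintroduces the memory problem setBatch() avoids.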
OOM when fetching all versions of single row
Hi!

We have a bunch of rows in HBase which store varying sizes of data (1-50MB). We use HBase versioning and keep up to 1 column versions. Typically each column has only a few versions, but in rare cases it may have thousands of versions.

The MapReduce algorithm uses a full scan, and our algorithm requires all versions to produce the result, so we call scan.setMaxVersions(). In the worst case the Region Server returns only one row, but a huge one. Its size is unpredictable and cannot be controlled, because the scan parameters only let us control the row count. The MR task can throw an OOME even with a 50GB heap.

Is it possible to handle this situation? For example, the RS should not send the row to the client if the latter has no memory to handle it. The client could then handle the error and fetch each of the row's versions in a separate get request.

Best wishes,
-- Andrejs Dubovskis