Number of Regions with small Tables

2021-07-12 Thread Christian Pfarr
Hello @all,

I have a question about controlling the number of regions for small tables in 
HBase.
But first I have to give you some background on our use case.

We've built a lambda architecture with HDFS (batch), HBase (speed) and Drill as 
the serving layer, where we combine Parquet files from HDFS with HBase rows 
that are newer than the most recent row in HDFS.
The HBase table is filled in real time via NiFi, and it is cleaned up after every 
batch (nightly) so that Drill can put most of the workload on HDFS.
Unfortunately the HBase table is very small, so we have only one region. 
Because of that, Drill cannot parallelize the query, which leads to long 
query times.

If I pre-split the HBase table everything is fine, until the balancer comes and 
merges the small regions. So after a few hours everything is slow again :-/
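As an illustration of the pre-splitting mentioned above, here is a minimal sketch that generates evenly spaced split keys and prints an HBase shell `create` statement using them. It assumes row keys start with a uniformly distributed hexadecimal prefix, which the thread does not state; the table name `speed_table`, the column family `cf`, and the helper `hex_split_points` are hypothetical.

```python
def hex_split_points(num_regions: int, prefix_len: int = 2) -> list[str]:
    """Return num_regions - 1 evenly spaced split keys over a hex prefix space.

    Assumption: row keys begin with a uniformly distributed hex prefix of
    prefix_len characters (for example a salt or a hash prefix).
    """
    if num_regions < 1:
        raise ValueError("num_regions must be at least 1")
    space = 16 ** prefix_len
    if num_regions > space:
        raise ValueError("more regions requested than distinct prefixes")
    # Region i starts at the i-th fraction of the prefix space.
    return [
        format(i * space // num_regions, f"0{prefix_len}x")
        for i in range(1, num_regions)
    ]


def create_statement(table: str, family: str, splits: list[str]) -> str:
    """Render an HBase shell 'create' command with explicit split keys."""
    quoted = ", ".join(f"'{key}'" for key in splits)
    return f"create '{table}', '{family}', SPLITS => [{quoted}]"


# Hypothetical table and column family names, four regions.
print(create_statement("speed_table", "cf", hex_split_points(4)))
```

Four regions give three split keys; the printed command can be pasted into the HBase shell. Pre-splitting alone does not stop the regions from being merged again later, which is the problem described in this message.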

So... my question is: what's the best way to handle this parallelization 
issue?
I thought about setting hbase.hregion.max.filesize to a very small number, for 
example the HDFS block size of 128 MB, but I'm not sure whether this leads to new problems.
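For reference, `hbase.hregion.max.filesize` is given in bytes, so the 128 MB mentioned above has to be converted first. The following is a small sketch that does the conversion and renders a property entry in the `hbase-site.xml` style; the helper name `max_filesize_property` is hypothetical, and nothing here settles whether such a small value is a good idea.

```python
def max_filesize_property(megabytes: int) -> str:
    """Render an hbase-site.xml property for hbase.hregion.max.filesize.

    The property value is expressed in bytes, so megabytes are converted.
    """
    size_bytes = megabytes * 1024 * 1024
    return (
        "<property>\n"
        "  <name>hbase.hregion.max.filesize</name>\n"
        f"  <value>{size_bytes}</value>\n"
        "</property>"
    )


# 128 MB, matching the HDFS block size mentioned in the message.
print(max_filesize_property(128))
```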

What do you think? Is there a better way to handle this?

Regards,
z0ltrix



Re: Number of Regions with small Tables

2021-07-12 Thread Mallikarjun
Do you have any configuration for the Region Normalizer (
https://hbase.apache.org/book.html#normalizer) or something similar?

The balancer does not split or merge regions. AFAIK, the split policy configured via
`hbase.regionserver.region.split.policy` does the splitting, and there is
nothing similar for merges.

---
Mallikarjun




Re: Number of Regions with small Tables

2021-07-12 Thread Christian Pfarr
Ah, OK... I thought this was done by the balancer.

The normalizer is enabled (checked via the HBase shell), but with no 
configuration beyond the defaults in hbase-default.xml.

We run HBase 1.5.0 at the moment.
