Hello @all, I have a question about controlling the number of regions for small tables in HBase. But first, some background on our use case.
We've built a lambda architecture with HDFS (batch layer), HBase (speed layer), and Drill as the serving layer, where we combine Parquet files from HDFS with HBase rows that are newer than the most recent row in HDFS. The HBase table is filled in real time via NiFi and cleaned up after every nightly batch, so that Drill can put most of the workload on HDFS.

Unfortunately, the HBase table is very small, so it has only one region, and because of that Drill cannot parallelize the query, which leads to long query times. If I pre-split the HBase table, everything is fine until the balancer comes and merges the small regions. So after a few hours everything is slow again :-/

So my question is: what's the best way to handle this parallelization issue? I thought about setting hbase.hregion.max.filesize to a very small value, for example the HDFS block size (128 MB), but I'm not sure whether this leads to new problems.

What do you think? Is there a better way to handle this?

Regards,
z0ltrix
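
P.S. To make the pre-split mentioned above concrete, here is a minimal sketch of how evenly spaced split points could be computed and turned into an HBase shell `create` statement. It is an illustration only: the table name `speed_layer`, the column family `d`, and the assumption that row keys start with a uniformly distributed hex prefix are all hypothetical and not taken from our actual setup. It only shows the pre-split itself and does not address the regions being merged again later.

```python
from typing import List


def hex_split_keys(num_regions: int, width: int = 8) -> List[str]:
    """Return num_regions - 1 evenly spaced, zero-padded hex split points.

    Assumes (hypothetically) that row keys begin with a uniformly
    distributed hex prefix of `width` characters.
    """
    if num_regions < 2:
        return []  # a single region needs no split points
    key_space = 16 ** width
    return [
        format(i * key_space // num_regions, "0{}x".format(width))
        for i in range(1, num_regions)
    ]


def create_statement(table: str, family: str, num_regions: int) -> str:
    """Build an HBase shell `create` command with explicit SPLITS."""
    splits = ", ".join("'{}'".format(k) for k in hex_split_keys(num_regions))
    return "create '{}', '{}', SPLITS => [{}]".format(table, family, splits)


if __name__ == "__main__":
    # Hypothetical table and column family names, for illustration only.
    print(create_statement("speed_layer", "d", 4))
```

The printed statement can be pasted into the HBase shell. The number of regions would presumably be chosen to match the parallelism Drill should reach for the scan.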
