Hi,

For my project, HBase would come to a halt after about 8 hours of loading. I
managed to reduce the load time to 10 minutes.

What gave me the best results was: splitting regions to best fit my data,
compacting them manually whenever the tables changed, and using Snappy for
compression.

I have data coming from 70 different sources, so, depending on the source, I
added an offset of 10M to the primary key. The primary key data was 11 bits
wide, so steps of 10 million were more than enough: offset 0 for the first
source, 10M for the second, 20M for the third, and so on. Then I created 70
regions, one for each data source.
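
As a minimal sketch of such a pre-split (the table and column family names are
placeholders, and I'm assuming row keys are zero-padded to a fixed width so the
lexicographic split points line up with the numeric offsets):

echo "create 'mytable', {NAME => 'cf', COMPRESSION => 'SNAPPY'},
  SPLITS => (1..69).map { |i| '%09d' % (i * 10_000_000) }" | hbase shell

The 69 split points at 10M, 20M, ..., 690M give one region per source; the
zero-padding matters because HBase splits row keys by byte order, not numeric
order.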

The second task: when loading data or deleting old data, I run major
compaction manually, as mentioned. This way, I control when compaction
happens.
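
For example, a rough post-load step might look like this (the table name is a
placeholder, and the compaction_state shell command may not exist on older
versions, so check yours):

echo "major_compact 'mytable'" | hbase shell
sleep 30  # compaction is asynchronous; give it a moment to start
# Optionally block until the major compaction finishes before the next load:
while echo "compaction_state 'mytable'" | hbase shell | grep -q MAJOR; do
  sleep 30
done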

Hope this helps.

Behdad


On Wed, Jul 8, 2015 at 12:30 PM, Dejan Menges <dejan.men...@gmail.com>
wrote:

> Hi Behdad,
>
> Thanks a lot, but this part I already do. My question was more about which
> metrics (exposed or not) to use to figure out most intelligently where major
> compaction is needed the most.
>
> Currently, choosing the region with the biggest number of store files plus
> the biggest total store file size is doing the job, but I wasn't sure if
> there's maybe something better to choose from.
>
> Cheers,
> Dejan
>
> On Wed, Jul 8, 2015 at 7:19 PM Behdad Forghani <beh...@exapackets.com>
> wrote:
>
> > To start a major compaction for a table from the CLI, you need to run
> > (note the table name must be quoted, since the shell evaluates Ruby):
> > echo "major_compact 'tablename'" | hbase shell
> >
> > I do this after bulk loading into the table.
> >
> > FYI, to avoid surprises, I also turn off the load balancer and rebalance
> > regions manually.
> >
> > The CLI command to turn off the balancer is:
> > echo "balance_switch false" | hbase shell
> >
> > To rebalance regions after a bulk load or other changes, run:
> > echo balancer | hbase shell
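> >
> > For example, one hedged way to sequence these around a load (the table
> > name is a placeholder; note that on most versions the balancer command
> > respects the switch, so it has to be re-enabled for the manual run):
> >
> > echo "balance_switch false" | hbase shell
> > # ... run the bulk load here ...
> > echo "major_compact 'tablename'" | hbase shell
> > echo "balance_switch true" | hbase shell
> > echo balancer | hbase shell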
> >
> > You can run these commands over ssh. I use Ansible for this. Note that
> > Ansible's default command module does not handle pipes, so use the shell
> > module. Assuming you have defined hbase_master in your hosts file, you can
> > run:
> > ansible -i hosts hbase_master -m shell -a "echo \"major_compact 'tablename'\" | hbase shell"
> >
> > Behdad Forghani
> >
> > On Wed, Jul 8, 2015 at 8:03 AM, Dejan Menges <dejan.men...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > What's the best way to automate major compactions without just enabling
> > > them during an off-peak period?
> > >
> > > What I was testing is a simple script which runs on every node in the
> > > cluster, checks if there is a major compaction already running on that
> > > node, and if not picks one region and runs compaction on that one region.
> > >
> > > It's been running for some time and it helped us get our data into much
> > > better shape, but now I'm not quite sure how to choose which region to
> > > compact. So far I was reading rs-status#regionStoreStats for that node,
> > > first choosing the regions with the biggest number of storefiles, and
> > > then those with the biggest storefile sizes.
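> > >
> > > As a rough sketch of that picking step (everything here is assumed
> > > rather than exact: the HBase 1.x JMX metric names, the 16030 info port,
> > > and that major_compact accepts an encoded region name on this version):
> > >
> > > JMX="http://localhost:16030/jmx?qry=Hadoop:service=HBase,name=RegionServer,sub=Regions"
> > > # Per-region entries look like "..._region_<encoded>_metric_storeFileCount" : 42
> > > REGION=$(curl -s "$JMX" \
> > >   | grep -o '"[^"]*_metric_storeFileCount"[^,}]*' \
> > >   | sort -t: -k2 -rn | head -1 \
> > >   | sed 's/.*_region_\([^_]*\)_metric.*/\1/')
> > > echo "major_compact '$REGION'" | hbase shell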
> > >
> > > Is there maybe something more intelligent I could/should do?
> > >
> > > Thanks a lot!
> > >
> >
>
