Our automation uses a combination of the following to determine what to
compact:

- Which regions have bad locality (percentage of blocks that are local vs.
remote, using the HDFS getBlockLocations API; see the sketch below)
- Which regions have the largest number of HFiles (most files per region/CF
directory)
- Which regions have gone the longest without a compaction (oldest file)

The order here reflects the priority we give each, but YMMV.  We run in
EC2, so we value locality over almost everything else, to avoid network
latency on reads.
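
For the locality check specifically, something along these lines works
against the Hadoop FileSystem API (a sketch rather than our exact code; the
/hbase/data/<namespace>/<table>/<region>/<cf> layout and the hostname
matching are assumptions to adapt for your cluster):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RegionLocality {

  // Fraction of HDFS blocks under a region directory that are hosted on the
  // given region server (1.0 = fully local). Walks region dir -> CF dirs -> HFiles.
  public static double localityIndex(FileSystem fs, Path regionDir, String rsHost)
      throws IOException {
    long total = 0, local = 0;
    for (FileStatus cf : fs.listStatus(regionDir)) {
      if (!cf.isDirectory() || cf.getPath().getName().startsWith(".")) continue;
      for (FileStatus hfile : fs.listStatus(cf.getPath())) {
        if (!hfile.isFile()) continue;
        for (BlockLocation block : fs.getFileBlockLocations(hfile, 0, hfile.getLen())) {
          total++;
          for (String host : block.getHosts()) {
            if (host.equals(rsHost)) { local++; break; }
          }
        }
      }
    }
    return total == 0 ? 1.0 : (double) local / total;
  }

  public static void main(String[] args) throws IOException {
    // args[0]: region dir, e.g. /hbase/data/default/mytable/<encoded-region>
    // args[1]: hostname of the region server currently serving that region
    FileSystem fs = FileSystem.get(new Configuration());
    System.out.println(localityIndex(fs, new Path(args[0]), args[1]));
  }
}

Regions with a low ratio are the first candidates, since a major compaction
rewrites the HFiles on the hosting region server and restores locality.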

On Wed, Jul 8, 2015 at 4:48 PM Jean-Marc Spaggiari <jean-m...@spaggiari.org>
wrote:

> Just missing the ColumnFamily at the end of the path. Your memory is
> pretty good.
>
> JM
>
> 2015-07-08 16:39 GMT-04:00 Vladimir Rodionov <vladrodio...@gmail.com>:
>
> > You can find this info yourself, Dejan
> >
> > 1. Locate table dir on HDFS
> > 2. List all regions (directories)
> > 3. Iterate files in each directory and find the oldest one (creation time)
> > 4. The region with the oldest file is your candidate for major compaction
> >
> > /HBASE_ROOT/data/namespace/table/region (If my memory serves me right :))
> >
> > -Vlad
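
A rough, untested sketch of steps 1-4 above using the Hadoop FileSystem API
(it treats HDFS modification time as creation time, which is reasonable since
HFiles are immutable once written; the directory filtering is simplified):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OldestRegion {

  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    // 1. Table dir on HDFS, e.g. /hbase/data/default/mytable
    Path tableDir = new Path(args[0]);

    Path candidate = null;
    long candidateOldest = Long.MAX_VALUE;

    // 2. List all region directories
    for (FileStatus region : fs.listStatus(tableDir)) {
      if (!region.isDirectory() || region.getPath().getName().startsWith(".")) continue;
      long oldest = Long.MAX_VALUE;
      // 3. Walk the column family directories and find the oldest HFile
      for (FileStatus cf : fs.listStatus(region.getPath())) {
        if (!cf.isDirectory() || cf.getPath().getName().startsWith(".")) continue;
        for (FileStatus hfile : fs.listStatus(cf.getPath())) {
          if (hfile.isFile()) {
            oldest = Math.min(oldest, hfile.getModificationTime());
          }
        }
      }
      // 4. The region whose oldest file is the oldest overall is the candidate
      if (oldest < candidateOldest) {
        candidateOldest = oldest;
        candidate = region.getPath();
      }
    }
    System.out.println("Major compaction candidate: " + candidate);
  }
}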
> >
> > On Wed, Jul 8, 2015 at 1:07 PM, Dejan Menges <dejan.men...@gmail.com>
> > wrote:
> >
> > > Hi Mikhail,
> > >
> > > Actually, the reason is quite stupid on my side - to avoid compacting one
> > > region over and over again while others are waiting in line (reading the
> > > HTML and sorting only on the number of store files eventually leaves you
> > > with a bunch of regions having exactly the same number of store files).
> > >
> > > Thanks for this hint - this is exactly what I was looking for. I was
> > > previously trying to figure out whether it's possible to query meta for
> > > this information (currently using 0.98.0 and 0.98.4, and waiting for HDP
> > > 2.3 from Hortonworks to upgrade immediately), but for our current version
> > > I didn't find that possible, which is why I decided to go this way.
> > >
> > > On Wed, Jul 8, 2015 at 10:02 PM Mikhail Antonov <olorinb...@gmail.com>
> > > wrote:
> > >
> > > > I totally understand the reasoning behind compacting regions with the
> > > > biggest number of store files, but I didn't follow why it's best to
> > > > compact regions which have the biggest store files; maybe I'm missing
> > > > something? I'd maybe compact regions which have the smallest avg
> > > > storefile size?
> > > >
> > > > You may also want to take a look at
> > > > https://issues.apache.org/jira/browse/HBASE-12859, and compact regions
> > > > for which a major compaction was last run the longest time ago.
> > > >
> > > > -Mikhail
> > > >
> > > > On Wed, Jul 8, 2015 at 10:30 AM, Dejan Menges <dejan.men...@gmail.com>
> > > > wrote:
> > > > > Hi Behdad,
> > > > >
> > > > > Thanks a lot, but this part I do already. My question was more about
> > > > > what to use (which exposed or not-yet-exposed metrics) to most
> > > > > intelligently figure out where major compaction is needed the most.
> > > > >
> > > > > Currently, choosing the region which has the biggest number of store
> > > > > files + the biggest total store file size is doing the job, but I
> > > > > wasn't sure if there's maybe something better to choose from.
> > > > >
> > > > > Cheers,
> > > > > Dejan
> > > > >
> > > > > On Wed, Jul 8, 2015 at 7:19 PM Behdad Forghani <beh...@exapackets.com>
> > > > > wrote:
> > > > >
> > > > >> To start a major compaction for tablename from the CLI, you need to run:
> > > > >> echo "major_compact 'tablename'" | hbase shell
> > > > >>
> > > > >> I do this after bulk loading to the table.
> > > > >>
> > > > >> FYI, to avoid surprises, I also turn off the load balancer and
> > > > >> rebalance regions manually.
> > > > >>
> > > > >> The CLI command to turn off the balancer is:
> > > > >> echo balance_switch false | hbase shell
> > > > >>
> > > > >> To rebalance regions after a bulk load or other changes, run:
> > > > >> echo balancer | hbase shell
> > > > >>
> > > > >> You can run these two commands using ssh. I use Ansible to do these.
> > > > >> Assuming you have defined hbase_master in your hosts file, you can run:
> > > > >> ansible -i hosts hbase_master -a "echo major_compact tablename | hbase shell"
> > > > >>
> > > > >> Behdad Forghani
> > > > >>
> > > > >> On Wed, Jul 8, 2015 at 8:03 AM, Dejan Menges <dejan.men...@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >> > Hi,
> > > > >> >
> > > > >> > What's the best way to automate major compactions without enabling
> > > > >> > them during the off-peak period?
> > > > >> >
> > > > >> > What I was testing is a simple script which runs on every node in the
> > > > >> > cluster, checks if there is a major compaction already running on that
> > > > >> > node, and if not, picks one region and runs a compaction on that one
> > > > >> > region.
> > > > >> >
> > > > >> > It has been running for some time and has helped us get our data into
> > > > >> > much better shape, but now I'm not quite sure anymore which region to
> > > > >> > compact. So far I was reading rs-status#regionStoreStats for that node
> > > > >> > and first choosing the one with the biggest number of storefiles, and
> > > > >> > then those with the biggest storefile sizes.
> > > > >> >
> > > > >> > Is there maybe something more intelligent I could/should do?
> > > > >> >
> > > > >> > Thanks a lot!
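
For what it's worth, the store file counts that rs-status shows are also
exposed through the client API, so the selection doesn't have to scrape HTML.
A rough sketch against the 0.98 HBaseAdmin/ClusterStatus API (untested;
passing the region name straight to majorCompact is an assumption to verify
on your version):

import org.apache.hadoop.hbase.ClusterStatus;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.RegionLoad;
import org.apache.hadoop.hbase.ServerLoad;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CompactBusiestRegion {

  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    try {
      ClusterStatus status = admin.getClusterStatus();
      String candidate = null;
      int maxStoreFiles = -1;

      // Walk the per-region metrics each region server reports and pick
      // the region carrying the most store files.
      for (ServerName server : status.getServers()) {
        ServerLoad load = status.getLoad(server);
        for (RegionLoad rl : load.getRegionsLoad().values()) {
          if (rl.getStorefiles() > maxStoreFiles) {
            maxStoreFiles = rl.getStorefiles();
            candidate = rl.getNameAsString();
          }
        }
      }

      if (candidate != null) {
        // majorCompact takes a table name or a region name
        admin.majorCompact(candidate);
        System.out.println("Requested major compaction of " + candidate
            + " (" + maxStoreFiles + " store files)");
      }
    } finally {
      admin.close();
    }
  }
}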
> > > > >> >
> > > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks,
> > > > Michael Antonov
> > > >
> > >
> >
>
