Thanks Ted,

I should have been explicit - for the cases I've been working with, the
owners can make their apps effectively go "read-only" for this housekeeping
step.  At the end, a change of app config or a couple of table name changes
(a short outage) would be needed.
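
Since HBase has no native table rename, the switch at the end would be the
usual snapshot + clone trick. A minimal sketch against the 1.x Admin API
(the table and snapshot names here are made up):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Admin;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;

  public class SwapTables {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      try (Connection conn = ConnectionFactory.createConnection(conf);
           Admin admin = conn.getAdmin()) {
        TableName live = TableName.valueOf("t1");          // made-up names
        TableName rewritten = TableName.valueOf("t1_new");
        // Keep a snapshot of the old table as a fallback, then drop it
        admin.disableTable(live);
        admin.snapshot("t1_before_swap", live);
        admin.deleteTable(live);
        // Clone the rewritten table into the original name
        admin.disableTable(rewritten);
        admin.snapshot("t1_new_snap", rewritten);
        admin.cloneSnapshot("t1_new_snap", live);
        admin.deleteSnapshot("t1_new_snap");
      }
    }
  }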

I've been using the SimpleNormalizer in 1.2.0 (CDH 5.12+) - I'll dig into
the recent changes.  I had to run several iterations of small-region
merging, plus a few iterations of the SimpleNormalizer, to get a decent
result, and it took a long time (days). On the Normalizer - I had wondered
whether an approach of determining a good set of splits up front might be
portable into a Normalizer implementation.
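
For what it's worth, each merge pass I ran looked roughly like the sketch
below (1.2-era Admin API; MERGE_THRESHOLD_MB is a knob I invented, and
since mergeRegions is asynchronous each pass has to re-list the regions and
repeat until things settle):

  import java.util.List;
  import java.util.Map;
  import java.util.TreeMap;
  import org.apache.hadoop.hbase.ClusterStatus;
  import org.apache.hadoop.hbase.HRegionInfo;
  import org.apache.hadoop.hbase.RegionLoad;
  import org.apache.hadoop.hbase.ServerLoad;
  import org.apache.hadoop.hbase.ServerName;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Admin;
  import org.apache.hadoop.hbase.util.Bytes;

  public class MergeSmallRegions {
    static final int MERGE_THRESHOLD_MB = 256; // invented knob

    // One pass: merge adjacent pairs whose combined store files are tiny
    static void mergePass(Admin admin, TableName table) throws Exception {
      // Store file sizes per region, from the cluster status
      Map<byte[], RegionLoad> loads = new TreeMap<>(Bytes.BYTES_COMPARATOR);
      ClusterStatus status = admin.getClusterStatus();
      for (ServerName sn : status.getServers()) {
        ServerLoad sl = status.getLoad(sn);
        if (sl != null) loads.putAll(sl.getRegionsLoad());
      }
      // Regions come back in start-key order
      List<HRegionInfo> regions = admin.getTableRegions(table);
      for (int i = 0; i + 1 < regions.size(); i++) {
        HRegionInfo a = regions.get(i);
        HRegionInfo b = regions.get(i + 1);
        RegionLoad la = loads.get(a.getRegionName());
        RegionLoad lb = loads.get(b.getRegionName());
        if (la != null && lb != null
            && la.getStorefileSizeMB() + lb.getStorefileSizeMB()
               < MERGE_THRESHOLD_MB) {
          // mergeRegions is asynchronous - skip b and rerun later
          admin.mergeRegions(a.getEncodedNameAsBytes(),
              b.getEncodedNameAsBytes(), false);
          i++;
        }
      }
    }
  }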

I suspect a one-time rewrite is cheaper than normalization when a table is
in really bad shape.

Thanks again,
Tim

On Sat, Apr 21, 2018 at 6:59 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Looking at the proposed flow, have you considered the new data coming in
> between steps #a and #d?
>
> Also, how would client application switch between the original table and
> the new table ?
>
> BTW since you mentioned SimpleNormalizer, which release are you using (just
> want to see if all recent fixes to SimpleNormalizer were in the version you
> use)?
>
> Cheers
>
> On Sat, Apr 21, 2018 at 9:48 AM, Tim Robertson <timrobertson...@gmail.com>
> wrote:
>
> > Hi folks
> >
> > Recently I've seen a few clusters with badly unbalanced tables, including
> > some with many regions in the KB size range. It seems this is easy to
> > overlook in ops.
> >
> > Understandably, SimpleNormalizer does a fairly poor job of addressing
> > this - it takes a long time, doesn't aggressively merge small regions,
> > eagerly splits well-sized regions if many small ones exist, etc. It works
> > well if enabled on a well-set-up table, though.
> >
> > I have been exploring approaches to tackle:
> >   1) determining region splits for a one-time bulk load into a pre-split
> > table [1], and
> >   2) fixing really badly skewed tables.
> >
> > I was thinking of creating a Jira which I'd assign to myself to add a
> > utility tool that would:
> >
> >   a) read the HFiles for a table (optionally performing a major
> > compaction first to discard old edits)
> >   b) analyze the block headers and determine splits that would take you
> > back to regions at e.g. 80% of hbase.hregion.max.filesize (rough sketch
> > below)
> >   c) create a new pre-split table
> >   d) run a table copy (or bulkload?)
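> >
> > As a very rough sketch of what I mean for (b) and (c) - note this is an
> > assumption-heavy simplification: it uses each HFile's first row key and
> > total length rather than walking the block index, and it skips reference
> > files, .tmp dirs, MOB and other edge cases:
> >
> > import java.util.ArrayList;
> > import java.util.List;
> > import java.util.Map;
> > import java.util.TreeMap;
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.fs.FileStatus;
> > import org.apache.hadoop.fs.FileSystem;
> > import org.apache.hadoop.fs.Path;
> > import org.apache.hadoop.hbase.HConstants;
> > import org.apache.hadoop.hbase.TableName;
> > import org.apache.hadoop.hbase.io.hfile.CacheConfig;
> > import org.apache.hadoop.hbase.io.hfile.HFile;
> > import org.apache.hadoop.hbase.util.Bytes;
> > import org.apache.hadoop.hbase.util.FSUtils;
> >
> > public class SplitCalculator {
> >   // Accumulate (first row key -> bytes) over all HFiles, then emit a
> >   // split each time the running total crosses ~80% of the max file size
> >   static byte[][] computeSplits(Configuration conf, TableName table)
> >       throws Exception {
> >     FileSystem fs = FileSystem.get(conf);
> >     Path tableDir = FSUtils.getTableDir(FSUtils.getRootDir(conf), table);
> >     long target = (long) (0.8 *
> >         conf.getLong(HConstants.HREGION_MAX_FILESIZE, 10L << 30));
> >     TreeMap<byte[], Long> sizeByFirstKey =
> >         new TreeMap<>(Bytes.BYTES_COMPARATOR);
> >     for (FileStatus region : fs.listStatus(tableDir)) {
> >       if (!region.isDirectory()
> >           || region.getPath().getName().startsWith(".")) continue;
> >       for (FileStatus family : fs.listStatus(region.getPath())) {
> >         if (!family.isDirectory()) continue;
> >         for (FileStatus hfile : fs.listStatus(family.getPath())) {
> >           HFile.Reader r = HFile.createReader(
> >               fs, hfile.getPath(), new CacheConfig(conf), conf);
> >           try {
> >             byte[] first = r.getFirstRowKey(); // null for empty files
> >             if (first != null) {
> >               sizeByFirstKey.merge(first, r.length(), Long::sum);
> >             }
> >           } finally {
> >             r.close();
> >           }
> >         }
> >       }
> >     }
> >     List<byte[]> splits = new ArrayList<>();
> >     long acc = 0;
> >     for (Map.Entry<byte[], Long> e : sizeByFirstKey.entrySet()) {
> >       acc += e.getValue();
> >       if (acc >= target) { splits.add(e.getKey()); acc = 0; }
> >     }
> >     // for (c): admin.createTable(descriptor, result of this method)
> >     return splits.toArray(new byte[0][]);
> >   }
> > }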
> >
> > Does such a thing exist anywhere and I'm just missing it, or does anyone
> > know of a better approach please?
> >
> > Thoughts, criticism, requests very welcome.
> >
> > Thanks,
> > Tim
> >
> > [1]
> > https://github.com/opencore/hbase-bulk-load-balanced/blob/master/src/test/java/com/opencore/hbase/example/ExampleUsageTest.java
> >
>
