Looking at the proposed flow, have you considered the new data coming in between steps a) and d)?
Also, how would the client application switch between the original table and the new table?

BTW, since you mentioned SimpleNormalizer, which release are you using? (I just want to check whether all the recent fixes to SimpleNormalizer are in the version you use.)

Cheers

On Sat, Apr 21, 2018 at 9:48 AM, Tim Robertson <timrobertson...@gmail.com> wrote:

> Hi folks
>
> Recently I've seen a few clusters with badly unbalanced tables, including
> some with many regions in the KB size. It seems it is easy to overlook this
> in ops.
>
> Understandably SimpleNormalizer does a fairly poor job at addressing this -
> takes a long time, doesn't aggressively merge small regions, eagerly splits
> well sized regions if many small ones exist etc. It works well if enabled
> on a well set up table though.
>
> I have been exploring approaches to tackle:
> 1) determining region splits for a one time bulk load into a presplit
> table[1] and
> 2) approaches to fixing really badly skewed tables.
>
> I was thinking of creating a Jira which I'd assign to myself to add a
> utility tool that would:
>
> a) read the HFiles for a table (optionally performing a MC first to
> discard old edits)
> b) analyze the block headers and determine splits that would take you
> back to regions at e.g. 80% hbase.hregion.max.filesize
> c) create a new pre-split table
> d) run a table copy (or bulkload?)
>
> Does such a thing exist anywhere and I'm just missing it, or does anyone
> know of a better approach please?
>
> Thoughts, criticism, requests very welcome.
>
> Thanks,
> Tim
>
> [1]
> https://github.com/opencore/hbase-bulk-load-balanced/blob/master/src/test/java/com/opencore/hbase/example/ExampleUsageTest.java
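
On step b) of the quoted proposal, here is a rough sketch of how the split selection might work, in case it helps the discussion. This is not an existing HBase utility. The function name choose_split_keys and the input format (a key-sorted list of (first row key, size in bytes) pairs, such as might be gathered from HFile block indexes) are assumptions made purely for illustration. It also assumes the blocks from all store files have already been merged into a single key order.

```python
from typing import Iterable, List, Tuple


def choose_split_keys(
    blocks: Iterable[Tuple[bytes, int]],
    max_filesize: int,
    fill_ratio: float = 0.8,
) -> List[bytes]:
    """Pick split keys so each region holds roughly fill_ratio * max_filesize.

    blocks       -- hypothetical input: (first_row_key, size_in_bytes) pairs,
                    already sorted by row key
    max_filesize -- the value of hbase.hregion.max.filesize, in bytes
    fill_ratio   -- target fraction of max_filesize per region (e.g. 0.8)
    """
    if max_filesize <= 0 or not 0 < fill_ratio <= 1:
        raise ValueError("max_filesize must be > 0 and 0 < fill_ratio <= 1")

    target = max_filesize * fill_ratio
    split_keys: List[bytes] = []
    accumulated = 0

    for first_key, size in blocks:
        # Once the current region has reached the target size, start a new
        # region at this block's first row key.
        if accumulated >= target:
            split_keys.append(first_key)
            accumulated = 0
        accumulated += size

    return split_keys
```

The returned keys could then be used as the split points when creating the new table in step c). Splitting only at block boundaries keeps the analysis cheap, at the cost of regions that may overshoot the target by up to one block.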