Hi folks,

Recently I've seen a few clusters with badly unbalanced tables, including some with many regions only kilobytes in size. It seems this is easy to overlook in operations.

Understandably, SimpleNormalizer does a fairly poor job of addressing this: it takes a long time, doesn't aggressively merge small regions, eagerly splits well-sized regions if many small ones exist, etc. It works well if enabled on a well set up table, though.

I have been exploring approaches to tackle:

1) determining region splits for a one-time bulk load into a pre-split table [1], and
2) fixing really badly skewed tables.

I was thinking of creating a Jira, which I'd assign to myself, to add a utility tool that would:

a) read the HFiles for a table (optionally performing a major compaction first to discard old edits)
b) analyze the block headers and determine splits that would take you back to regions at e.g. 80% of hbase.hregion.max.filesize
c) create a new pre-split table
d) run a table copy (or bulk load?)

Does such a thing exist anywhere and I'm just missing it, or does anyone know of a better approach, please?

Thoughts, criticism, and requests are very welcome.

Thanks,
Tim

[1] https://github.com/opencore/hbase-bulk-load-balanced/blob/master/src/test/java/com/opencore/hbase/example/ExampleUsageTest.java
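
P.S. To make step (b) a little more concrete, here is a rough sketch of the split calculation only. It is not taken from any existing tool. It assumes that something upstream has already produced the first row key and on-disk size of every data block, sorted by key across the whole table; the class name SplitPlanner, the method planSplits and its parameters are all hypothetical. Keys are shown as String for readability, whereas real row keys are byte[], and the sketch ignores overlapping HFiles and multiple column families.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: picks split keys so each region lands near a target size. */
final class SplitPlanner {

    /**
     * @param blockStartKeys first row key of each block, sorted ascending (assumed input)
     * @param blockSizes     on-disk size in bytes of the block at the same index
     * @param maxFileSize    value of hbase.hregion.max.filesize in bytes
     * @param fillRatio      fraction of maxFileSize to aim for, e.g. 0.8
     * @return the row keys at which new regions should start
     */
    static List<String> planSplits(List<String> blockStartKeys, long[] blockSizes,
                                   long maxFileSize, double fillRatio) {
        if (blockStartKeys.size() != blockSizes.length) {
            throw new IllegalArgumentException("keys and sizes must have the same length");
        }
        long target = Math.round(maxFileSize * fillRatio);
        List<String> splits = new ArrayList<>();
        long current = 0; // bytes accumulated in the region being filled

        for (int i = 0; i < blockSizes.length; i++) {
            // Start a new region at this block if adding it would overshoot the target.
            // The current > 0 check avoids emitting an empty region for an oversized block.
            if (current > 0 && current + blockSizes[i] > target) {
                splits.add(blockStartKeys.get(i));
                current = 0;
            }
            current += blockSizes[i];
        }
        return splits;
    }
}
```

For step (c), the resulting keys, converted to byte[][], could presumably be passed to Admin.createTable(TableDescriptor, byte[][]) to create the pre-split table. A greedy pass like this never looks back, so the last region may end up much smaller than the rest; that seems acceptable for a one-time rebalance.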