[ https://issues.apache.org/jira/browse/HBASE-28068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764326#comment-17764326 ]
Viraj Jasani commented on HBASE-28068:
--------------------------------------

In fact, the config limit can be applied during plan computation (i.e. {_}computeMergeNormalizationPlans(){_}). For instance, we can limit the size of rangeMembers here:

{code:java}
...
...
...
if (
  // when there are no range members, seed the range with whatever we have.
  // this way we're prepared in case the next region is 0-size.
  rangeMembers.isEmpty()
    // when there is only one region and the size is 0, seed the range with
    // whatever we have.
    || (rangeMembers.size() == 1 && sumRangeMembersSizeMb == 0)
    // always add an empty region to the current range.
    || regionSizeMb == 0
    // add the current region to the range when there's capacity remaining.
    || (regionSizeMb + sumRangeMembersSizeMb <= avgRegionSizeMb)
) {
  rangeMembers.add(new NormalizationTarget(regionInfo, regionSizeMb));
  sumRangeMembersSizeMb += regionSizeMb;
  continue;
}
...
...
...
{code}

Once {_}rangeMembers.size(){_} reaches the configured limit, we don't need to compute any further for that range.

This covers the merge plan; plan computation might be improved in general as well.

> Normalizer should batch merging 0 sized/empty regions
> -----------------------------------------------------
>
>                 Key: HBASE-28068
>                 URL: https://issues.apache.org/jira/browse/HBASE-28068
>             Project: HBase
>          Issue Type: Improvement
>          Components: Normalizer
>   Affects Versions: 2.5.5
>            Reporter: Ravi Kishore Valeti
>            Assignee: Rahul Kumar
>            Priority: Minor
>             Fix For: 2.6.0, 2.5.6, 3.0.0-beta-1
>
>
> In our production environment, while investigating an issue, we observed that
> the Normalizer had scheduled a single merge procedure on an RS, passing it
> 27K+ empty regions of a table to merge (a failed copy table job had left
> 27K+ empty regions of the table behind).
> This caused the procedure to get stuck, and eventually the procedure
> framework bailed out after ~40 mins.
> This was happening with each normalizer run until we deleted the table
> manually.
>
> Logs
>
> Normalizer triggers a merge procedure
> normalizer.RegionNormalizerWorker - NormalizationTarget[regionInfo=\{ENCODED
> => 6e8606335a62f6bafceb017dc7edfdf5, NAME => 'TEST.TEST_TABLE,XXXX.',
> STARTKEY => 'XXXX', ENDKEY => 'YYYY'},{*}regionSizeMb=0{*}],
> NormalizationTarget[regionInfo=\{ENCODED => 79607df308d7618e632abe8a12c1bf6b,
> NAME => 'TEST.TEST_TABLE,XXXX', STARTKEY => 'XXYY', ENDKEY =>
> 'YYZZ'},{*}regionSizeMb=0]{*}]] resulting in *pid 21968356*
>
> procedure immediately gets stuck
> procedure2.ProcedureExecutor - Worker *stuck* PEWorker-56(pid=21968356), run
> time 12.4850 sec
>
> Finally fails after ~40 mins
> procedure2.ProcedureExecutor - Worker *stuck* PEWorker-56(pid=21968356), run
> time *40 mins, 58.055 sec*
>
> Bails out with RuntimeException
> procedure2.ProcedureExecutor - force=false
> java.lang.UnsupportedOperationException: pid=21968356,
> state=FAILED:MERGE_TABLE_REGIONS_UPDATE_META, locked=true,
> exception=java.lang.{*}RuntimeException via CODE-BUG: Uncaught runtime
> exception{*}: pid=21968356, state=RUNNABLE:MERGE_TABLE_REGIONS_UPDATE_META,
> locked=true; MergeTableRegionsProcedure table=TEST.TEST_TABLEXXXX,
> {*}regions={*}{*}[269a1b168af497cce9ba6d3d581568f2{*}
> .
> .
> .
> .
> *27K+ regions printed here]*

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
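
For illustration, the idea in the comment (capping rangeMembers while the merge plan is being computed) might look roughly like the sketch below. It is a simplified, standalone stand-in and not the HBase implementation: the class name CappedMergePlanner, the method computeMergeRanges and the parameter maxRegionsPerPlan are hypothetical, regions are represented only by their size in MB, and the range-closing logic is reduced to the minimum needed to show the cap.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for merge-plan computation; NOT the actual HBase normalizer code. */
final class CappedMergePlanner {

  private CappedMergePlanner() {
  }

  /**
   * Groups adjacent regions (sizes in MB, in key order) into merge ranges.
   * maxRegionsPerPlan is a hypothetical configured limit on regions per merge plan.
   */
  static List<List<Long>> computeMergeRanges(List<Long> regionSizesMb, long avgRegionSizeMb,
      int maxRegionsPerPlan) {
    if (maxRegionsPerPlan < 2) {
      throw new IllegalArgumentException("a merge needs at least two regions");
    }
    List<List<Long>> plans = new ArrayList<>();
    List<Long> rangeMembers = new ArrayList<>();
    long sumRangeMembersSizeMb = 0;

    for (long regionSizeMb : regionSizesMb) {
      // Same admission conditions as the snippet quoted in the comment.
      boolean fits = rangeMembers.isEmpty()
        || (rangeMembers.size() == 1 && sumRangeMembersSizeMb == 0)
        || regionSizeMb == 0
        || (regionSizeMb + sumRangeMembersSizeMb <= avgRegionSizeMb);
      // The extra check: stop growing the range once the limit is reached.
      boolean hasRoom = rangeMembers.size() < maxRegionsPerPlan;

      if (fits && hasRoom) {
        rangeMembers.add(regionSizeMb);
        sumRangeMembersSizeMb += regionSizeMb;
        continue;
      }

      // Close the current range (only ranges of 2+ regions form a merge plan)
      // and seed a new range with the current region.
      if (rangeMembers.size() > 1) {
        plans.add(rangeMembers);
      }
      rangeMembers = new ArrayList<>();
      rangeMembers.add(regionSizeMb);
      sumRangeMembersSizeMb = regionSizeMb;
    }

    if (rangeMembers.size() > 1) {
      plans.add(rangeMembers);
    }
    return plans;
  }
}
```

With a cap in place, a long run of empty regions such as the 27K+ case described in the issue would be split into many small merge plans instead of one oversized procedure. For example, computeMergeRanges(Collections.nCopies(7, 0L), 10, 3) yields two ranges of three regions each and leaves the last region unmerged.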