[ https://issues.apache.org/jira/browse/HBASE-25726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Manning updated HBASE-25726:
----------------------------------
    Description: 
After the OffPeakHours fix for MoveCostFunction (HBASE-24709), MoveCostFunction is no longer included in the costFunctions list. {{addCostFunction}} expects the multiplier to be non-zero, but the multiplier is now only set in the {{cost}} function.

As a result, {{hbase.master.balancer.stochastic.maxMovePercent}} is not respected, and there is no cost function to oppose a move. Any move that decreases total cost at all will be accepted, causing more churn and disruption from balancer executions.

We noticed this when investigating a case where the balancer would run after a regionserver was restarted without use of the region_mover script. The regionserver comes online with 0 regions, leading to a shortcut in {{needsBalance}} for {{idleRegionServerExist}}. The balancer runs to move regions to that newly restarted regionserver. However, it moves a large number of regions in the cluster, hyper-optimizing the other cost variables.

There were ~4300 regions in the cluster at the time, so moving 25% of the regions should have had a final cost of at least 7 (the default MoveCostFunction weight). MoveCostFunction is also not listed in the functions contributing to the initial cost.

{{2021-03-30 15:47:43,396 INFO [49187_ChoreService_3] balancer.StochasticLoadBalancer - start StochasticLoadBalancer.balancer, initCost=12.91377229840024, functionCost=RegionCountSkewCostFunction : (500.0, 0.014878672009326464); TableSkewCostFunction : (35.0, 0.013600280177445717); RegionReplicaHostCostFunction : (100000.0, 0.0); RegionReplicaRackCostFunction : (10000.0, 0.0); ReadRequestCostFunction : (5.0, 0.8296332203204705); WriteRequestCostFunction : (5.0, 0.06818455421617946); MemstoreSizeCostFunction : (5.0, 0.08132131691669181); StoreFileCostFunction : (5.0, 0.02054620605193966); computedMaxSteps: 1000000}}

{{2021-03-30 15:48:13,385 DEBUG [49187_ChoreService_3] balancer.StochasticLoadBalancer - Finished computing new load balance plan. Computation took 30004ms to try 6571 different iterations. Found a solution that moves 1095 regions; Going from a computed cost of 12.91377229840024 to a new cost of 4.804625730746651}}

> MoveCostFunction is not included in the list of cost functions for
> StochasticLoadBalancer
> -----------------------------------------------------------------------------------------
>
>                 Key: HBASE-25726
>                 URL: https://issues.apache.org/jira/browse/HBASE-25726
>             Project: HBase
>          Issue Type: Bug
>          Components: Balancer
>    Affects Versions: 3.0.0-alpha-1, 2.3.1, 1.7.0, 2.4.0
>            Reporter: David Manning
>            Priority: Major
>
> After the OffPeakHours fix for MoveCostFunction (HBASE-24709),
> MoveCostFunction is no longer included in the costFunctions list.
> {{addCostFunction}} expects the multiplier to be non-zero, but the
> multiplier is now only set in the {{cost}} function.
>
> As a result, {{hbase.master.balancer.stochastic.maxMovePercent}} is not
> respected, and there is no cost function to oppose a move. Any move that
> decreases total cost at all will be accepted, causing more churn and
> disruption from balancer executions.
>
> We noticed this when investigating a case where the balancer would run after
> a regionserver was restarted without use of the region_mover script. The
> regionserver comes online with 0 regions, leading to a shortcut in
> {{needsBalance}} for {{idleRegionServerExist}}. The balancer runs to move
> regions to that newly restarted regionserver. However, it moves a large
> number of regions in the cluster, hyper-optimizing the other cost variables.
>
> There were ~4300 regions in the cluster at the time, so moving 25% of the
> regions should have had a final cost of at least 7 (the default
> MoveCostFunction weight). MoveCostFunction is also not listed in the
> functions contributing to the initial cost.
>
> {{2021-03-30 15:47:43,396 INFO [49187_ChoreService_3]
> balancer.StochasticLoadBalancer - start StochasticLoadBalancer.balancer,
> initCost=12.91377229840024, functionCost=RegionCountSkewCostFunction :
> (500.0, 0.014878672009326464); TableSkewCostFunction : (35.0,
> 0.013600280177445717); RegionReplicaHostCostFunction : (100000.0, 0.0);
> RegionReplicaRackCostFunction : (10000.0, 0.0); ReadRequestCostFunction :
> (5.0, 0.8296332203204705); WriteRequestCostFunction : (5.0,
> 0.06818455421617946); MemstoreSizeCostFunction : (5.0, 0.08132131691669181);
> StoreFileCostFunction : (5.0, 0.02054620605193966); computedMaxSteps:
> 1000000}}
>
> {{2021-03-30 15:48:13,385 DEBUG [49187_ChoreService_3]
> balancer.StochasticLoadBalancer - Finished computing new load balance plan.
> Computation took 30004ms to try 6571 different iterations. Found a solution
> that moves 1095 regions; Going from a computed cost of 12.91377229840024 to a
> new cost of 4.804625730746651}}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
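
The numbers in the two log lines can be cross-checked with a small sketch. It assumes that the balancer's total cost is the sum of multiplier times cost over the listed functions, an assumption the logged initCost appears to bear out. The class and method names are illustrative only and are not part of HBase, and the region count of 4300 is the approximate figure from the description.

```java
public class BalancerCostCheck {
    // (multiplier, cost) pairs copied from the functionCost log line.
    static final double[][] LOGGED = {
        {500.0, 0.014878672009326464},   // RegionCountSkewCostFunction
        {35.0, 0.013600280177445717},    // TableSkewCostFunction
        {100000.0, 0.0},                 // RegionReplicaHostCostFunction
        {10000.0, 0.0},                  // RegionReplicaRackCostFunction
        {5.0, 0.8296332203204705},       // ReadRequestCostFunction
        {5.0, 0.06818455421617946},      // WriteRequestCostFunction
        {5.0, 0.08132131691669181},      // MemstoreSizeCostFunction
        {5.0, 0.02054620605193966},      // StoreFileCostFunction
    };

    // Assumed aggregation: total = sum of multiplier * cost.
    static double weightedSum(double[][] pairs) {
        double total = 0.0;
        for (double[] p : pairs) {
            total += p[0] * p[1];
        }
        return total;
    }

    // Fraction of regions moved by the plan.
    static double movedFraction(int movedRegions, int totalRegions) {
        return (double) movedRegions / totalRegions;
    }

    public static void main(String[] args) {
        double loggedInitCost = 12.91377229840024;
        double loggedNewCost = 4.804625730746651;
        double moveWeight = 7.0; // default MoveCostFunction weight, per the description

        System.out.println("recomputed initCost: " + weightedSum(LOGGED));
        System.out.println("logged initCost:     " + loggedInitCost);
        System.out.println("moved fraction:      " + movedFraction(1095, 4300));
        System.out.println("new cost below move weight: " + (loggedNewCost < moveWeight));
    }
}
```

If the assumption holds, the eight listed functions alone account for the logged initCost, which leaves no room for a MoveCostFunction term. The plan moves slightly more than 25% of the regions, yet its final cost is below 7, which is consistent with the move cost not being counted.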