[ 
https://issues.apache.org/jira/browse/HBASE-25726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Manning updated HBASE-25726:
----------------------------------
    Description: 
After the OffPeakHours fix for MoveCostFunction (HBASE-24709), MoveCostFunction 
is no longer included in the costFunctions list. {{addCostFunction}} expects the 
multiplier to be non-zero, but the multiplier is now set only in the {{cost}} 
function.
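
The registration problem described above can be illustrated with a minimal, self-contained sketch. This is not the HBase source: the class names, the {{register}} helper, and the {{> 0}} form of the non-zero check are all assumptions made for illustration. It only shows how a function whose multiplier is assigned inside {{cost}} gets skipped when the list is built.

```java
import java.util.ArrayList;
import java.util.List;

public class CostFunctionRegistrationSketch {

    /** Hypothetical stand-in for a balancer cost function. */
    abstract static class CostFunction {
        protected float multiplier = 0f;

        float getMultiplier() {
            return multiplier;
        }

        abstract double cost();
    }

    /** Assigns its multiplier only when cost() runs, as the issue describes. */
    static class LazyMoveCost extends CostFunction {
        private final float configuredWeight;

        LazyMoveCost(float configuredWeight) {
            this.configuredWeight = configuredWeight;
        }

        @Override
        double cost() {
            multiplier = configuredWeight; // too late: registration has already happened
            return 0.0;
        }
    }

    /** Assigns its multiplier in the constructor, so it is visible at registration. */
    static class EagerSkewCost extends CostFunction {
        EagerSkewCost(float weight) {
            this.multiplier = weight;
        }

        @Override
        double cost() {
            return 0.0;
        }
    }

    /** Assumed shape of the check: functions with a zero multiplier are not added. */
    static void addCostFunction(List<CostFunction> functions, CostFunction f) {
        if (f.getMultiplier() > 0) {
            functions.add(f);
        }
    }

    /** Builds the list the way a balancer might at configuration time. */
    static List<CostFunction> register() {
        List<CostFunction> functions = new ArrayList<>();
        addCostFunction(functions, new EagerSkewCost(500f));
        addCostFunction(functions, new LazyMoveCost(7f)); // silently dropped
        return functions;
    }

    public static void main(String[] args) {
        System.out.println("registered functions: " + register().size());
    }
}
```

In this sketch only the eagerly configured function ends up in the list, so nothing in the list opposes a move.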

As a result, {{hbase.master.balancer.stochastic.maxMovePercent}} is not 
respected, and there is no cost function to oppose a move. Any move that 
decreases total cost at all will be accepted, causing more churn and disruption 
from balancer executions.

We noticed this when investigating a case where the balancer ran after a 
regionserver was restarted without using the region_mover script. The 
regionserver came online with 0 regions, triggering the 
{{idleRegionServerExist}} shortcut in {{needsBalance}}. The balancer ran to 
move regions to the newly restarted regionserver. However, it moved a large 
number of regions across the cluster, hyper-optimizing the other cost 
variables. There were ~4300 regions in the cluster at the time, so moving 25% 
of the regions should have produced a final cost of at least 7 (the default 
MoveCostFunction weight). MoveCostFunction is also not listed among the 
functions contributing to the initial cost.

{{2021-03-30 15:47:43,396 INFO [49187_ChoreService_3] 
balancer.StochasticLoadBalancer - start StochasticLoadBalancer.balancer, 
initCost=12.91377229840024, functionCost=RegionCountSkewCostFunction : (500.0, 
0.014878672009326464); TableSkewCostFunction : (35.0, 0.013600280177445717); 
RegionReplicaHostCostFunction : (100000.0, 0.0); RegionReplicaRackCostFunction 
: (10000.0, 0.0); ReadRequestCostFunction : (5.0, 0.8296332203204705); 
WriteRequestCostFunction : (5.0, 0.06818455421617946); MemstoreSizeCostFunction 
: (5.0, 0.08132131691669181); StoreFileCostFunction : (5.0, 
0.02054620605193966); computedMaxSteps: 1000000}}

{{2021-03-30 15:48:13,385 DEBUG [49187_ChoreService_3] 
balancer.StochasticLoadBalancer - Finished computing new load balance plan. 
Computation took 30004ms to try 6571 different iterations. Found a solution 
that moves 1095 regions; Going from a computed cost of 12.91377229840024 to a 
new cost of 4.804625730746651}}
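
As a cross-check of the two log lines above, the following sketch recomputes the initial cost as a weighted sum of the logged (multiplier, cost) pairs and compares the final cost with the default move weight of 7 mentioned in the description. The class and method names are illustrative only. The comparison with 7 follows the description's own reasoning that a move of about 25% of regions should contribute roughly the full move weight; it is not a statement about the exact scaling used inside MoveCostFunction.

```java
public class BalancerLogArithmetic {

    /** (multiplier, cost) pairs copied from the INFO log line. */
    static final double[][] LOGGED = {
        {500.0, 0.014878672009326464},   // RegionCountSkewCostFunction
        {35.0, 0.013600280177445717},    // TableSkewCostFunction
        {100000.0, 0.0},                 // RegionReplicaHostCostFunction
        {10000.0, 0.0},                  // RegionReplicaRackCostFunction
        {5.0, 0.8296332203204705},       // ReadRequestCostFunction
        {5.0, 0.06818455421617946},      // WriteRequestCostFunction
        {5.0, 0.08132131691669181},      // MemstoreSizeCostFunction
        {5.0, 0.02054620605193966}       // StoreFileCostFunction
    };

    static final double LOGGED_INIT_COST = 12.91377229840024;
    static final double LOGGED_NEW_COST = 4.804625730746651;
    static final double DEFAULT_MOVE_WEIGHT = 7.0; // value quoted in the description

    /** Sum of multiplier * cost over all logged functions. */
    static double weightedSum(double[][] pairs) {
        double total = 0.0;
        for (double[] pair : pairs) {
            total += pair[0] * pair[1];
        }
        return total;
    }

    /** Fraction of regions moved by the plan. */
    static double movedFraction(int movedRegions, int totalRegions) {
        return (double) movedRegions / totalRegions;
    }

    public static void main(String[] args) {
        System.out.println("weighted sum of logged functions: " + weightedSum(LOGGED));
        System.out.println("logged initCost:                  " + LOGGED_INIT_COST);
        // ~4300 is the approximate region count given in the description.
        System.out.println("fraction of regions moved:        " + movedFraction(1095, 4300));
        System.out.println("new cost below move weight:       "
            + (LOGGED_NEW_COST < DEFAULT_MOVE_WEIGHT));
    }
}
```

The eight listed functions account for the whole logged initCost, and the accepted plan moves slightly more than 25% of the regions while ending at a total cost below 7, which is consistent with no move cost being applied.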

  was:
After the OffPeakHours fix for MoveCostFunction (HBASE-24709), MoveCostFunction 
is no longer included in the costFunctions list. {{addCostFunction}} expects the 
multiplier to be non-zero, but the multiplier is now set only in the {{cost}} 
function.

As a result, {{hbase.master.balancer.stochastic.maxMovePercent}} is not 
respected, and there is no cost function to oppose a move. Any move that 
decreases total cost at all will be accepted, causing more churn and disruption 
from balancer executions.

We noticed this when investigating a case where the balancer ran after a 
regionserver was restarted without using the region_mover script. The 
regionserver came online with 0 regions, triggering the 
{{idleRegionServerExist}} shortcut in {{needsBalance}}. The balancer ran to 
move regions to the newly restarted regionserver. However, it moved a large 
number of regions across the cluster, hyper-optimizing the other cost 
variables. There were ~4300 regions in the cluster at the time, so moving 25% 
of the regions should have produced a final cost of at least 7 (the default 
MoveCostFunction weight). MoveCostFunction is also not listed among the 
functions contributing to the initial cost.

{{2021-03-30 15:47:43,396 INFO [49187_ChoreService_3] 
balancer.StochasticLoadBalancer - start StochasticLoadBalancer.balancer, 
initCost=12.91377229840024, functionCost=RegionCountSkewCostFunction : (500.0, 
0.014878672009326464); TableSkewCostFunction : (35.0, 0.013600280177445717); 
RegionReplicaHostCostFunction : (100000.0, 0.0); RegionReplicaRackCostFunction 
: (10000.0, 0.0); ReadRequestCostFunction : (5.0, 0.8296332203204705); 
WriteRequestCostFunction : (5.0, 0.06818455421617946); MemstoreSizeCostFunction 
: (5.0, 0.08132131691669181); StoreFileCostFunction : (5.0, 
0.02054620605193966); computedMaxSteps: 1000000}}

{{2021-03-30 15:48:13,385 DEBUG [49187_ChoreService_3] 
balancer.StochasticLoadBalancer - Finished computing new load balance plan. 
Computation took 30004ms to try 6571 different iterations. Found a solution 
that moves 1095 regions; Going from a computed cost of 12.91377229840024 to a 
new cost of 4.804625730746651}}


> MoveCostFunction is not included in the list of cost functions for 
> StochasticLoadBalancer
> -----------------------------------------------------------------------------------------
>
>                 Key: HBASE-25726
>                 URL: https://issues.apache.org/jira/browse/HBASE-25726
>             Project: HBase
>          Issue Type: Bug
>          Components: Balancer
>    Affects Versions: 3.0.0-alpha-1, 2.3.1, 1.7.0, 2.4.0
>            Reporter: David Manning
>            Priority: Major
>
> After the OffPeakHours fix for MoveCostFunction (HBASE-24709), 
> MoveCostFunction is no longer included in the costFunctions list. 
> {{addCostFunction}} expects the multiplier to be non-zero, but the multiplier 
> is now set only in the {{cost}} function.
> As a result, {{hbase.master.balancer.stochastic.maxMovePercent}} is not 
> respected, and there is no cost function to oppose a move. Any move that 
> decreases total cost at all will be accepted, causing more churn and 
> disruption from balancer executions.
> We noticed this when investigating a case where the balancer ran after a 
> regionserver was restarted without using the region_mover script. The 
> regionserver came online with 0 regions, triggering the 
> {{idleRegionServerExist}} shortcut in {{needsBalance}}. The balancer ran to 
> move regions to the newly restarted regionserver. However, it moved a large 
> number of regions across the cluster, hyper-optimizing the other cost 
> variables. There were ~4300 regions in the cluster at the time, so moving 25% 
> of the regions should have produced a final cost of at least 7 (the default 
> MoveCostFunction weight). MoveCostFunction is also not listed among the 
> functions contributing to the initial cost.
> {{2021-03-30 15:47:43,396 INFO [49187_ChoreService_3] 
> balancer.StochasticLoadBalancer - start StochasticLoadBalancer.balancer, 
> initCost=12.91377229840024, functionCost=RegionCountSkewCostFunction : 
> (500.0, 0.014878672009326464); TableSkewCostFunction : (35.0, 
> 0.013600280177445717); RegionReplicaHostCostFunction : (100000.0, 0.0); 
> RegionReplicaRackCostFunction : (10000.0, 0.0); ReadRequestCostFunction : 
> (5.0, 0.8296332203204705); WriteRequestCostFunction : (5.0, 
> 0.06818455421617946); MemstoreSizeCostFunction : (5.0, 0.08132131691669181); 
> StoreFileCostFunction : (5.0, 0.02054620605193966); computedMaxSteps: 
> 1000000}}
> {{2021-03-30 15:48:13,385 DEBUG [49187_ChoreService_3] 
> balancer.StochasticLoadBalancer - Finished computing new load balance plan. 
> Computation took 30004ms to try 6571 different iterations. Found a solution 
> that moves 1095 regions; Going from a computed cost of 12.91377229840024 to a 
> new cost of 4.804625730746651}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
