[ https://issues.apache.org/jira/browse/HBASE-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368964#comment-17368964 ]
Nick Dimiduk commented on HBASE-25739:
--------------------------------------

bq. Because of the fix, the default 0.05 minCostNeedBalance will not quite work. As a stopgap before I check in the auto-tuning threshold, should I just reduce the default value, so people won't be caught off guard? The broken TableSkewCostFunction artificially inflates the total cost. So if the fix is in and we don't change the threshold, people will be badly surprised that the balancer gets stuck.

This sounds like a case where we have to implement both changes together, or neither of them. In that case, we have to leave them both out of any patch releases.

> TableSkewCostFunction need to use aggregated deviation
> ------------------------------------------------------
>
>                 Key: HBASE-25739
>                 URL: https://issues.apache.org/jira/browse/HBASE-25739
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Balancer, master
>            Reporter: Clara Xiong
>            Assignee: Clara Xiong
>            Priority: Major
>         Attachments: TEST-org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerBalanceCluster.xml,
>                      org.apache.hadoop.hbase.master.balancer.TestStochasticLoadBalancerBalanceCluster.txt
>
>
> TableSkewCostFunction uses the sum, over all tables, of the maximum
> deviation of the per-server region count as its measure of unevenness.
> This fails in a very common operational scenario. Say we have 100 regions
> on 50 nodes, two on each. We add 50 new nodes, which start with 0 regions
> each. The mean is now 1 region per server, so the max deviation from the
> mean is 1, compared to 99 in the worst-case scenario of all 100 regions on
> a single server. The normalized cost is 1/99 ≈ 0.010, which is below the
> default threshold of 0.05, so the balancer does not move any regions. The
> proposal is to use the aggregated deviation of the region count per region
> server to detect this scenario, which generates a cost of 100/198 ≈ 0.5 in
> this case.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
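
The arithmetic in the issue description can be checked with a short sketch. This is not the HBase implementation: the function names are hypothetical, only a single table is modeled, and both costs are assumed to be normalized against the worst case of all regions sitting on one server, as the description implies.

```python
def max_deviation_cost(counts):
    """Cost from the largest per-server deviation from the mean (old behavior,
    as described in the issue). Hypothetical helper, not HBase code."""
    total, n = sum(counts), len(counts)
    mean = total / n
    # Worst case: every region on one server, so that server deviates by total - mean.
    worst = total - mean
    if worst == 0:
        return 0.0
    return max(abs(c - mean) for c in counts) / worst


def aggregated_deviation_cost(counts):
    """Cost from the summed per-server deviations from the mean (proposed
    behavior, as described in the issue). Hypothetical helper, not HBase code."""
    total, n = sum(counts), len(counts)
    mean = total / n
    # Worst case: one server holds everything, the other n - 1 servers hold nothing.
    worst = (total - mean) + (n - 1) * mean
    if worst == 0:
        return 0.0
    return sum(abs(c - mean) for c in counts) / worst


# 100 regions, two on each of 50 old nodes, plus 50 new empty nodes.
counts = [2] * 50 + [0] * 50
MIN_COST_NEED_BALANCE = 0.05  # default threshold quoted in the issue

old_cost = max_deviation_cost(counts)         # 1 / 99
new_cost = aggregated_deviation_cost(counts)  # 100 / 198

print(f"max deviation cost:        {old_cost:.4f}")
print(f"aggregated deviation cost: {new_cost:.4f}")
print("balancer would run (old):", old_cost > MIN_COST_NEED_BALANCE)
print("balancer would run (new):", new_cost > MIN_COST_NEED_BALANCE)
```

Under these assumptions the old measure stays below the 0.05 threshold while the aggregated measure clears it, which is the behavior change the comment above is concerned about when discussing the default threshold.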