[ 
https://issues.apache.org/jira/browse/HBASE-17565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15851763#comment-15851763
 ] 

stack commented on HBASE-17565:
-------------------------------

Not what I meant.  You paste the results of a macro test as 'proof' that a 
function is doing the right thing? I was referring to the speculation about 
what certain functions return when called. I was suggesting that we test 
those functions standalone in a unit test, feeding them unusual values and 
verifying that the results are sensible.
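
Roughly the shape of test I have in mind, as a sketch only. The `scale` 
function below is a made-up stand-in and not one of the balancer's functions, 
and the class name and the "sensible" criterion (a non-NaN result within 
[0, 1]) are assumptions for illustration:

```java
final class UnusualValueCheckSketch {
    // Hypothetical stand-in for a function under test: maps value into [0, 1]
    // relative to the range [min, max]. NOT the HBase implementation.
    static double scale(double min, double max, double value) {
        if (Double.isNaN(min) || Double.isNaN(max) || Double.isNaN(value)) {
            return 0.0;
        }
        if (max <= min || value <= min) {
            return 0.0;
        }
        return Math.max(0.0, Math.min(1.0, (value - min) / (max - min)));
    }

    // Assumed definition of "sensible": a real number inside [0, 1].
    static boolean isSensible(double result) {
        return !Double.isNaN(result) && result >= 0.0 && result <= 1.0;
    }

    public static void main(String[] args) {
        // Each row is {min, max, value}; the rows are chosen to be awkward.
        double[][] unusualInputs = {
            {0.0, 10.0, 5.0},                       // ordinary case
            {5.0, 5.0, 5.0},                        // empty range
            {0.0, 10.0, Double.NaN},                // NaN value
            {0.0, 10.0, Double.POSITIVE_INFINITY},  // infinite value
            {0.0, 10.0, -3.0},                      // value below the range
            {0.0, Double.MAX_VALUE, 1.0},           // enormous range
        };
        for (double[] in : unusualInputs) {
            double result = scale(in[0], in[1], in[2]);
            if (!isSensible(result)) {
                throw new AssertionError("not sensible for min=" + in[0]
                    + " max=" + in[1] + " value=" + in[2] + ": " + result);
            }
        }
        System.out.println("all unusual inputs gave sensible results");
    }
}
```

The point is that each function gets exercised directly on awkward inputs, 
rather than inferring its behavior from the outcome of a whole balancer run.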

The problem now is that there are two JIRAs advancing instead of one. It seems 
a piece of each is needed in the patch. I suggest resolving one as a dupe of 
the other (resolving this one as a dupe of the older issue would seem to make 
the most sense) and then working together on a fix for the problem that is 
common to both.

On the patch here, this seems wonky:

      // small value for judging whether double variable is close to 0
      static final double EPSILON = 0.000000000000001D;

Then why the change to the test? There is no comment on why it is done, or on 
why we go from 1.0f to 1.01f.

> StochasticLoadBalancer may incorrectly skip balancing due to skewed 
> multiplier sum
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-17565
>                 URL: https://issues.apache.org/jira/browse/HBASE-17565
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ted Yu
>            Assignee: Ted Yu
>            Priority: Critical
>             Fix For: 2.0.0, 1.4.0
>
>         Attachments: 17565.v1.txt, 17565.v2.txt
>
>
> I was investigating why a 6 node cluster kept skipping balancing requests.
> Here were the region counts on the servers:
> 449, 448, 447, 449, 453, 0
> {code}
> 2017-01-26 22:04:47,145 INFO  
> [RpcServer.deafult.FPBQ.Fifo.handler=1,queue=0,port=16000] 
> balancer.StochasticLoadBalancer: Skipping load balancing because balanced 
> cluster; total cost is 127.0171157050385, sum multiplier is 111087.0 min cost 
> which need balance is 0.05
> {code}
> The big multiplier sum caught my eye. Here is what additional debug logging 
> showed:
> {code}
> 2017-01-27 23:25:31,749 DEBUG 
> [RpcServer.deafult.FPBQ.Fifo.handler=9,queue=0,port=16000] 
> balancer.StochasticLoadBalancer: class 
> org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer$RegionReplicaHostCostFunction 
> with multiplier 100000.0
> 2017-01-27 23:25:31,749 DEBUG 
> [RpcServer.deafult.FPBQ.Fifo.handler=9,queue=0,port=16000] 
> balancer.StochasticLoadBalancer: class 
> org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer$RegionReplicaRackCostFunction 
> with multiplier 10000.0
> {code}
> Note, however, that no table in the cluster used read replicas.
> I can think of two ways of fixing this situation:
> 1. If there is no read replica in the cluster, ignore the multipliers for the 
> above two functions.
> 2. When the cost() returned by a CostFunction is 0 (or very close to 0.0), 
> ignore its multiplier.
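
To make the arithmetic in the description concrete, here is a rough sketch. It 
assumes the balancer skips when the total cost divided by the multiplier sum 
falls below the minimum cost, which is what the log line suggests. The class 
and method names and the NEAR_ZERO threshold are hypothetical, and all cost 
functions other than the two replica ones are lumped into a single entry whose 
multiplier (1087.0) is simply 111087.0 minus the two logged multipliers:

```java
final class SkewedMultiplierSketch {
    // Hypothetical threshold for "very close to 0"; not the value in the patch.
    static final double NEAR_ZERO = 1e-9;

    // Sums the multipliers, optionally leaving out functions whose cost is ~0
    // (option 2 in the issue description).
    static double sumMultipliers(double[] multipliers, double[] costs,
                                 boolean skipNearZeroCost) {
        double sum = 0.0;
        for (int i = 0; i < multipliers.length; i++) {
            if (skipNearZeroCost && Math.abs(costs[i]) < NEAR_ZERO) {
                continue;
            }
            sum += multipliers[i];
        }
        return sum;
    }

    // Assumed form of the check behind the "Skipping load balancing" message.
    static boolean needsBalance(double totalCost, double sumMultiplier,
                                double minCostNeedBalance) {
        return sumMultiplier > 0 && totalCost / sumMultiplier >= minCostNeedBalance;
    }

    public static void main(String[] args) {
        double totalCost = 127.0171157050385;  // from the log line
        double minCost = 0.05;                 // from the log line

        // Replica host, replica rack, then everything else lumped together.
        double[] multipliers = {100000.0, 10000.0, 1087.0};
        // No read replicas, so the two replica costs are 0; the lumped cost is
        // back-computed from the logged total and is only illustrative.
        double[] costs = {0.0, 0.0, totalCost / 1087.0};

        double allSum = sumMultipliers(multipliers, costs, false);     // 111087.0
        double filteredSum = sumMultipliers(multipliers, costs, true); // 1087.0

        // About 0.00114, below 0.05, so balancing is skipped.
        System.out.println(needsBalance(totalCost, allSum, minCost));
        // About 0.117, above 0.05, so balancing would run.
        System.out.println(needsBalance(totalCost, filteredSum, minCost));
    }
}
```

Under these assumptions the two replica multipliers alone dilute the 
normalized cost by roughly a factor of 100, which is enough to push it under 
the 0.05 threshold even though one server holds no regions.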



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
