[ https://issues.apache.org/jira/browse/HBASE-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15644764#comment-15644764 ]
Charlie Qiangeng Xu commented on HBASE-17039:
---------------------------------------------

I just skimmed through the historical changes for this part and found that the code causing the problem can be attributed to HBASE-7060. The problem described in that JIRA is now handled well by other parts of the current balancer logic, so the code block mentioned in the description only causes problems today.
[~yuzhih...@gmail.com], it seems you were involved in that JIRA; any interest in taking a look at this one?

> SimpleLoadBalancer schedules large amount of invalid region moves
> -----------------------------------------------------------------
>
>                 Key: HBASE-17039
>                 URL: https://issues.apache.org/jira/browse/HBASE-17039
>             Project: HBase
>          Issue Type: Bug
>          Components: Balancer
>    Affects Versions: 2.0.0, 1.2.3, 1.1.7
>            Reporter: Charlie Qiangeng Xu
>            Assignee: Charlie Qiangeng Xu
>         Attachments: HBASE-17039.patch
>
>
> After increasing one of our clusters to 1600 nodes, we observed a large
> number of invalid region moves (more than 30k moves) fired by the balance
> chore. We simulated the problem and printed out the balance plan, only
> to find that many servers that had two regions for a certain table (we use
> the by-table strategy) sent both regions to two other servers that had zero
> regions.
> In the SimpleLoadBalancer's balanceCluster function,
> the code block that determines the underLoadedServers might have a problem:
> {code}
> if (load >= min && load > 0) {
>   continue; // look for other servers which haven't reached min
> }
> int regionsToPut = min - load;
> if (regionsToPut == 0)
> {
>   regionsToPut = 1;
> }
> {code}
> If min is zero, a server whose load is zero (and therefore equal to min) is
> not skipped by the first check, and regionsToPut is bumped from 0 to 1, so
> the server is marked as underloaded. This causes the behavior described above.
> Since we increased the cluster's size to 1600+ servers, many tables that have
> only 1000 regions now hit this issue.
> After fixing this, the balance plan went back to normal.

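To make the min == 0 case concrete, here is a minimal stand-alone sketch of the quoted check. It is not the actual SimpleLoadBalancer code and not the attached patch: the class `UnderloadSketch`, its method names, and the "guarded" variant are hypothetical, and it assumes min is the floor of the per-server average (1000 regions over 1600 servers gives 0).

```java
// Hypothetical stand-alone sketch; not the real SimpleLoadBalancer and not HBASE-17039.patch.
final class UnderloadSketch {

    // Mirrors the condition quoted in the issue. Returns how many regions a
    // server with this load would be scheduled to receive (0 means skipped).
    static int regionsToPutQuoted(int load, int min) {
        if (load >= min && load > 0) {
            return 0; // already at or above min, so not treated as underloaded
        }
        int regionsToPut = min - load;
        if (regionsToPut == 0) {
            regionsToPut = 1; // with min == 0 and load == 0, an empty server still asks for 1
        }
        return regionsToPut;
    }

    // Assumed alternative guard, for comparison only (not taken from the patch):
    // a server at or above min never requests regions.
    static int regionsToPutGuarded(int load, int min) {
        return load >= min ? 0 : min - load;
    }

    // Sums the regions requested by all servers under either variant.
    static int totalRequested(int[] loads, int min, boolean guarded) {
        int total = 0;
        for (int load : loads) {
            total += guarded ? regionsToPutGuarded(load, min) : regionsToPutQuoted(load, min);
        }
        return total;
    }

    // Builds the scenario from the issue: 1000 regions of one table on 1600
    // servers, placed two per server on the first 500 servers.
    static int[] exampleLoads() {
        int[] loads = new int[1600];
        for (int i = 0; i < 500; i++) {
            loads[i] = 2;
        }
        return loads;
    }

    public static void main(String[] args) {
        int min = 1000 / 1600; // integer division: floor of the average, assumed to match the balancer
        int[] loads = exampleLoads();
        System.out.println("min = " + min);
        System.out.println("requested (quoted check):  " + totalRequested(loads, min, false));
        System.out.println("requested (guarded check): " + totalRequested(loads, min, true));
    }
}
```

In this sketch every empty server requests one region under the quoted check even though the table is already at its minimum everywhere, which matches the flood of unnecessary moves reported above.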
-- This message was sent by Atlassian JIRA (v6.3.4#6332)