Frens Jan Rumph created HBASE-28820:
---------------------------------------
Summary: TableSkew cost scales beyond 1
Key: HBASE-28820
URL: https://issues.apache.org/jira/browse/HBASE-28820
Project: HBase
Issue Type: Bug
Components: Balancer
Affects Versions: 2.5.7
Environment: We experienced the issue with Apache HBase 2.5.7 on
Apache Hadoop 3.3.6 using Java 17 on Debian 12 (Bookworm).
Reporter: Frens Jan Rumph
This may already be covered by later releases, but we noticed that the table
skew cost function can produce cost values beyond 1. In our case, with over 1000
tables, this caused the table skew cost to suppress the region count skew (and
other) cost functions.
I think this is because the costs per table are 'simply' summed in
TableSkewCostFunction#cost. So if the number of tables with skew is large, this
cost function may cause the balancer to favour actions that decrease this cost
at too great an expense of other costs such as region count skew.
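To illustrate the suspected mechanism, here is a minimal standalone sketch. It is not the HBase implementation: the class {{TableSkewSketch}}, its method names and the simplified normalisation (deviation from the mean divided by the worst-case deviation) are assumptions made purely for illustration. It only shows that per-table costs which each lie in [0, 1] add up to more than 1 when summed, whereas an average stays within [0, 1].

```java
import java.util.Arrays;

// Hypothetical, simplified model of a per-table skew cost; not HBase code.
public class TableSkewSketch {

  // Skew of one table scaled to [0, 1]:
  // 0 = regions spread evenly, 1 = all regions on a single server.
  public static double perTableCost(int[] regionsPerServer) {
    int servers = regionsPerServer.length;
    double total = Arrays.stream(regionsPerServer).sum();
    if (servers < 2 || total == 0) {
      return 0.0;
    }
    double mean = total / servers;
    double deviation = 0.0;
    for (int regions : regionsPerServer) {
      deviation += Math.abs(regions - mean);
    }
    // Worst case: one server holds every region, all others hold none.
    double maxDeviation = (total - mean) + (servers - 1) * mean;
    return deviation / maxDeviation;
  }

  // Summing the per-table costs: grows with the number of skewed tables.
  public static double summedCost(int[][] tables) {
    double sum = 0.0;
    for (int[] table : tables) {
      sum += perTableCost(table);
    }
    return sum;
  }

  // One possible normalisation (an assumption, not a confirmed fix): the mean.
  public static double averagedCost(int[][] tables) {
    return tables.length == 0 ? 0.0 : summedCost(tables) / tables.length;
  }

  public static void main(String[] args) {
    // Two tables with two regions each, every table on its own single server.
    int[][] tables = { {2, 0}, {0, 2} };
    System.out.println("summed   = " + summedCost(tables));   // summed   = 2.0
    System.out.println("averaged = " + averagedCost(tables)); // averaged = 1.0
  }
}
```

With 1000 fully skewed tables the summed value in this sketch would be 1000, which, once multiplied by the weight of the cost function, would dwarf any cost that is bounded by 1.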
Logging from the HBase master that shows this:
{code:java}
[...] balancer.StochasticLoadBalancer: dBalancer.balancer, initial weighted
average imbalance=0.25500371101846336, functionCost=RegionCountSkewCostFunction
: (multiplier=100000.0, imbalance=0.24272066309658274, need balance);
PrimaryRegionCountSkewCostFunction : (not needed); MoveCostFunction :
(multiplier=7.0, imbalance=0.0); ServerLocalityCostFunction : (multiplier=25.0,
imbalance=0.6022498608833904, need balance); RackLocalityCostFunction :
(multiplier=15.0, imbalance=0.0); TableSkewCostFunction : (multiplier=35.0,
imbalance=35.24784226006047, need balance); RegionReplicaHostCostFunction :
(not needed); RegionReplicaRackCostFunction : (not needed);
ReadRequestCostFunction : (multiplier=5.0, imbalance=0.24057323733439073, need
balance); WriteRequestCostFunction : (multiplier=5.0,
imbalance=0.3233739875438904, need balance); MemStoreSizeCostFunction :
(multiplier=5.0, imbalance=0.3195880383071082, need balance);
StoreFileCostFunction : (multiplier=5.0, imbalance=0.23335375436276784, need
balance); computedMaxSteps=1000000 {code}
Note the {{TableSkewCostFunction : (multiplier=35.0,
imbalance=35.24784226006047)}} part.
To work around this we temporarily reduced the multiplier of the table skew
cost function to 0.
The test case below fails on the HBase 2.5 and 2.6 branches. It simply places
two tables, with two regions each, on a single server per table.
{code:java}
@Test
public void testTableSkewCost() {
  TableName t1 = TableName.valueOf("t1");
  TableName t2 = TableName.valueOf("t2");

  // Both regions of t1 live on n1, both regions of t2 live on n2.
  TreeMap<ServerName, List<RegionInfo>> clusterState = new TreeMap<>();
  clusterState.put(ServerName.valueOf("n1", 16020, 0), Arrays.asList(
    RegionInfoBuilder.newBuilder(t1).setRegionId(11).build(),
    RegionInfoBuilder.newBuilder(t1).setRegionId(12).build()
  ));
  clusterState.put(ServerName.valueOf("n2", 16020, 0), Arrays.asList(
    RegionInfoBuilder.newBuilder(t2).setRegionId(21).build(),
    RegionInfoBuilder.newBuilder(t2).setRegionId(22).build()
  ));
  BalancerClusterState cluster =
    new BalancerClusterState(clusterState, null, null, null);

  Configuration conf = HBaseConfiguration.create();
  CostFunction costFunction = new TableSkewCostFunction(conf);
  costFunction.prepare(cluster);
  double cost = costFunction.cost();

  // A scaled cost is expected to stay within [0, 1] (with a small tolerance).
  assertTrue(cost >= 0);
  assertTrue(cost <= 1.01);
} {code}
It's the second assertion that fails since the computed cost for this cluster
state is 2.
I guess none of the existing cluster state test/mock configurations have a real
table skew.