Dirk Daems created HDFS-17673:
---------------------------------
Summary: Unable to run HDFS balancer as a service
Key: HDFS-17673
URL: https://issues.apache.org/jira/browse/HDFS-17673
Project: Hadoop HDFS
Issue Type: Bug
Components: balancer & mover
Affects Versions: 3.4.0
Reporter: Dirk Daems
When running the HDFS balancer as a service using the following command
{code:java}
hdfs balancer -asService {code}
the first balancing round succeeds:
{code:java}
$ hdfs balancer -asService
2024-11-25 14:58:14,712 INFO balancer.Balancer: Balancer will run as a long
running service
2024-11-25 14:58:14,764 INFO impl.MetricsConfig: Loaded properties from
hadoop-metrics2.properties
2024-11-25 14:58:14,824 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot
period at 10 second(s).
2024-11-25 14:58:14,824 INFO impl.MetricsSystemImpl: Balancer metrics system
started
2024-11-25 14:58:14,840 INFO balancer.Balancer: namenodes = [hdfs://prd]
2024-11-25 14:58:14,841 INFO balancer.Balancer: parameters =
Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle
iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0,
#blockpools = 0, run during upgrade = false, sort top nodes = false, hot block
time interval = 0]
2024-11-25 14:58:14,841 INFO balancer.Balancer: included nodes = []
2024-11-25 14:58:14,841 INFO balancer.Balancer: excluded nodes = []
2024-11-25 14:58:14,841 INFO balancer.Balancer: source nodes = []
2024-11-25 14:58:14,841 INFO balancer.Balancer: Keytab is configured, will
login using keytab.
2024-11-25 14:58:14,955 INFO security.UserGroupInformation: Login successful
for user [email protected] using keytab file hdfs.headless.keytab. Keytab auto
renewal enabled : false
Time Stamp Iteration# Bytes Already Moved Bytes Left To Move
Bytes Being Moved NameNode
2024-11-25 14:58:14,956 INFO balancer.NameNodeConnector: getBlocks calls for
hdfs://prd will be rate-limited to 20 per second
2024-11-25 14:58:15,499 INFO balancer.KeyManager: Block token params received
from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
2024-11-25 14:58:15,501 INFO block.BlockTokenSecretManager: Block token key
range: [0, 2147483647)
2024-11-25 14:58:15,501 INFO block.BlockTokenSecretManager: Setting block keys.
BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
2024-11-25 14:58:15,501 INFO balancer.KeyManager: Update block keys every 2hrs,
30mins, 0sec
2024-11-25 14:58:15,702 INFO balancer.Balancer: dfs.namenode.get-blocks.max-qps
= 20 (default=20)
2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.movedWinWidth =
5400000 (default=5400000)
2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.moverThreads =
1000 (default=1000)
2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.dispatcherThreads
= 200 (default=200)
2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.getBlocks.size =
2147483648 (default=2147483648)
2024-11-25 14:58:15,703 INFO balancer.Balancer:
dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760)
2024-11-25 14:58:15,703 INFO balancer.Balancer:
dfs.datanode.balance.max.concurrent.moves = 100 (default=100)
2024-11-25 14:58:15,703 INFO balancer.Balancer:
dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600)
2024-11-25 14:58:15,705 INFO block.BlockTokenSecretManager: Setting block keys.
BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
2024-11-25 14:58:15,708 INFO balancer.Balancer: dfs.balancer.max-size-to-move =
10737418240 (default=10737418240)
2024-11-25 14:58:15,708 INFO balancer.Balancer: dfs.blocksize = 268435456
(default=134217728)
2024-11-25 14:58:15,732 INFO net.NetworkTopology: Adding a new node:
/LeibnizRack17/192.168.146.15:10004
2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node:
/LeibnizRack17/192.168.146.16:10004
2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node:
/LeibnizRack17/192.168.146.14:10004
2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node:
/LeibnizRack17/192.168.146.13:10004
2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node:
/LeibnizRack17/192.168.146.248:10004
2024-11-25 14:58:15,734 INFO balancer.Balancer: 0 over-utilized: []
2024-11-25 14:58:15,734 INFO balancer.Balancer: 0 underutilized: []
Nov 25, 2024, 2:58:15 PM 0 0 B 0 B
0 B 0 hdfs://prd
The cluster is balanced. Exiting...
2024-11-25 14:58:15,751 INFO balancer.Balancer: Balance succeed!
2024-11-25 14:58:15,751 INFO balancer.Balancer: Finished one round, will wait
for 5.0 minutes for next round{code}
but subsequent rounds always fail with:
{code:java}
2024-11-25 15:03:15,751 INFO balancer.Balancer: namenodes = [hdfs://prd]
2024-11-25 15:03:15,751 INFO balancer.Balancer: parameters =
Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle
iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0,
#blockpools = 0, run during upgrade = false, sort top nodes = false, hot block
time interval = 0]
2024-11-25 15:03:15,751 INFO balancer.Balancer: included nodes = []
2024-11-25 15:03:15,751 INFO balancer.Balancer: excluded nodes = []
2024-11-25 15:03:15,752 INFO balancer.Balancer: source nodes = []
2024-11-25 15:03:15,752 INFO balancer.Balancer: Keytab is configured, will
login using keytab.
2024-11-25 15:03:15,792 INFO security.UserGroupInformation: Login successful
for user [email protected] using keytab file hdfs.headless.keytab. Keytab auto
renewal enabled : false
Time Stamp Iteration# Bytes Already Moved Bytes Left To Move
Bytes Being Moved NameNode
2024-11-25 15:03:15,792 INFO balancer.NameNodeConnector: getBlocks calls for
hdfs://prd will be rate-limited to 20 per second
2024-11-25 15:03:15,909 INFO balancer.KeyManager: Block token params received
from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
2024-11-25 15:03:15,909 INFO block.BlockTokenSecretManager: Block token key
range: [0, 2147483647)
2024-11-25 15:03:15,909 INFO block.BlockTokenSecretManager: Setting block keys.
BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
2024-11-25 15:03:15,909 INFO balancer.KeyManager: Update block keys every 2hrs,
30mins, 0sec
2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.namenode.get-blocks.max-qps
= 20 (default=20)
2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.movedWinWidth =
5400000 (default=5400000)
2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.moverThreads =
1000 (default=1000)
2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.dispatcherThreads
= 200 (default=200)
2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.getBlocks.size =
2147483648 (default=2147483648)
2024-11-25 15:03:15,934 INFO balancer.Balancer:
dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760)
2024-11-25 15:03:15,935 INFO balancer.Balancer:
dfs.datanode.balance.max.concurrent.moves = 100 (default=100)
2024-11-25 15:03:15,935 INFO balancer.Balancer:
dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600)
2024-11-25 15:03:15,936 INFO block.BlockTokenSecretManager: Setting block keys.
BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
2024-11-25 15:03:15,936 INFO balancer.Balancer: dfs.balancer.max-size-to-move =
10737418240 (default=10737418240)
2024-11-25 15:03:15,936 INFO balancer.Balancer: dfs.blocksize = 268435456
(default=134217728)
2024-11-25 15:03:15,945 WARN balancer.Balancer: Encounter exception while do
balance work. Already tried 1 times
org.apache.hadoop.metrics2.MetricsException: Metrics source
Balancer-BP-1957577775-192.168.146.236-1720512277934 already exists!
at
org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
at
org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
at
org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
at
org.apache.hadoop.hdfs.server.balancer.BalancerMetrics.create(BalancerMetrics.java:52)
at org.apache.hadoop.hdfs.server.balancer.Balancer.<init>(Balancer.java:362)
at
org.apache.hadoop.hdfs.server.balancer.Balancer.doBalance(Balancer.java:824)
at org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:887)
at
org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:975)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
at org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:1133)
2024-11-25 15:03:15,946 INFO balancer.Balancer: Finished one round, will wait
for 5.0 minutes for next round {code}
and finally the process stops after retrying 5 times:
{code:java}
// code placeholder
{code}
Looking at the HDFS balancer code, I don't see a way to prevent this error,
which is strange. Each Balancer creates a metrics source named "Balancer-" +
blockpoolID, where blockpoolID is 'BP-1957577775-192.168.146.236-1720512277934'
in our case.
In service mode, new Balancers (and thus new metrics sources with the same
name) are created in the doBalance method while the service is running.
Looking at the metrics implementation, duplicate source names are only allowed
while the metrics system is being started. Once the metrics system has been
initialized correctly, duplicates are no longer allowed (the monitoring flag is
set to true).
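The interaction described above can be illustrated with a small, self-contained
sketch. It is a toy model, not the Hadoop code: ToyMetricsSystem,
firstFailingRound and the IllegalStateException are hypothetical stand-ins for
the real metrics system, the service loop and MetricsException. It only assumes
what is stated above: the same source name is registered once per round, and
duplicates are rejected once the monitoring flag is true.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the behaviour described in this report; not the Hadoop implementation. */
public class DuplicateSourceSketch {

    /** Hypothetical stand-in for a metrics system that tracks registered source names. */
    static final class ToyMetricsSystem {
        private final Set<String> sourceNames = new HashSet<>();
        private boolean monitoring = false; // stays false until the system is started

        void start() {
            monitoring = true;
        }

        /** Duplicate names are tolerated only while monitoring is still false. */
        void register(String name) {
            boolean added = sourceNames.add(name);
            if (!added && monitoring) {
                throw new IllegalStateException("Metrics source " + name + " already exists!");
            }
        }
    }

    /**
     * Simulates the service loop: every round builds a new balancer, which
     * registers "Balancer-" + blockpoolID again.
     * Returns the 1-based index of the first failing round, or -1 if none fails.
     */
    static int firstFailingRound(ToyMetricsSystem metrics, String blockpoolID, int rounds) {
        for (int round = 1; round <= rounds; round++) {
            try {
                metrics.register("Balancer-" + blockpoolID);
            } catch (IllegalStateException e) {
                return round;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        ToyMetricsSystem metrics = new ToyMetricsSystem();
        metrics.start(); // the log shows the metrics system started before round one
        int failed = firstFailingRound(
            metrics, "BP-1957577775-192.168.146.236-1720512277934", 3);
        System.out.println("first failing round: " + failed); // prints 2
    }
}
```

In this model the first round succeeds and the second one fails, which matches
the logs above. Whether the real fix should reuse the existing source or
unregister it between rounds is not something this sketch tries to answer.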