Rick Weber created HDFS-17392:
---------------------------------
Summary: NameNode rolls frequently with "EC replicas to be deleted
are not in the candidate" error
Key: HDFS-17392
URL: https://issues.apache.org/jira/browse/HDFS-17392
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 3.3.6
Reporter: Rick Weber
I recently upgraded my clusters from Hadoop v3.3.4 to Hadoop v3.3.6 and have
noticed a lot of NameNode instability. After about 1 hour, the active NameNode
shuts down and the "next" one takes over.
Looking into the shutdown reasons, I'm seeing errors similar to the following:
{code:java}
2024-02-20 12:05:37,352 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Rescan of postponedMisreplicatedBlocks completed in 8 msecs. 6639943 blocks are left. 1 blocks were removed.
2024-02-20 12:05:37,352 ERROR org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: RedundancyMonitor thread received Runtime exception.
java.lang.IllegalArgumentException: The EC replicas to be deleted are not in the candidate list
    at org.apache.hadoop.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:144)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseExcessRedundancyStriped(BlockManager.java:4082)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseExcessRedundancies(BlockManager.java:3970)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processExtraRedundancyBlock(BlockManager.java:3957)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processMisReplicatedBlock(BlockManager.java:3898)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.rescanPostponedMisreplicatedBlocks(BlockManager.java:2898)
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:5053)
    at java.lang.Thread.run(Thread.java:750)
2024-02-20 12:05:37,357 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1: java.lang.IllegalArgumentException: The EC replicas to be deleted are not in the candidate list
{code}
Looking through the code path itself, there is a
`Preconditions.checkArgument()` call that ensures a given block chosen for
deletion is actually one of the valid candidate blocks. If it is not, the NN
shuts down. This is likely a symptom of a larger issue: how is a block being
chosen that is not in the candidate list?
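To make the failure mode concrete, here is a minimal, self-contained sketch of the pattern as I understand it from the stack trace. It is not the actual BlockManager code: the class name, `confirmDeletion`, `runMonitorPass`, and the "dn1"-style identifiers are all hypothetical, and `checkArgument` is a local stand-in for the Guava method so the example runs without dependencies. It only models the log's sequence: the argument check fails, the monitor sees a runtime exception, and the process exits with status 1 (modelled here as a return value).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ExcessReplicaCheckSketch {

    // Local stand-in for Guava's Preconditions.checkArgument(boolean, Object).
    static void checkArgument(boolean expression, String message) {
        if (!expression) {
            throw new IllegalArgumentException(message);
        }
    }

    // Hypothetical helper: accept the chosen replica only if it is a candidate.
    static String confirmDeletion(Set<String> candidates, String chosen) {
        checkArgument(candidates.contains(chosen),
            "The EC replicas to be deleted are not in the candidate list");
        return chosen;
    }

    // Hypothetical monitor pass. Returns 0 on success and 1 when a runtime
    // exception escapes the check (the real NameNode exits with status 1).
    static int runMonitorPass(Set<String> candidates, String chosen) {
        try {
            confirmDeletion(candidates, chosen);
            return 0;
        } catch (RuntimeException e) {
            System.err.println("Monitor received Runtime exception: " + e.getMessage());
            return 1;
        }
    }

    public static void main(String[] args) {
        Set<String> candidates = new HashSet<>(Arrays.asList("dn1", "dn2", "dn3"));
        // Chosen replica is in the candidate set.
        System.out.println(runMonitorPass(candidates, "dn2"));
        // Chosen replica is not in the candidate set.
        System.out.println(runMonitorPass(candidates, "dn9"));
    }
}
```

The point of the sketch is that the check itself behaves as designed; the open question is how the chosen replica ends up outside the candidate set in the first place.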
On the rest of the cluster, services such as SPS and the Balancer are
disabled, so the only movement of data should be whatever is "organically"
chosen by the NameNode.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)