Shangshu Qian created HDFS-17782:
------------------------------------
Summary: The implementation of LowRedundancyBlocks can cause
unexpected lock contentions, resulting in DN timeout
Key: HDFS-17782
URL: https://issues.apache.org/jira/browse/HDFS-17782
Project: Hadoop HDFS
Issue Type: Bug
Components: datanode, namenode
Affects Versions: 3.4.1
Reporter: Shangshu Qian
The current implementation of LowRedundancyBlocks relies heavily on
synchronized methods. The main user of this class, `neededReconstruction` in
BlockManager, frequently invokes those synchronized methods. A feedback loop can
occur when the synchronized methods cause lock contention.
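
To illustrate the kind of contention meant here, the following is a minimal
sketch, not the actual LowRedundancyBlocks code. The class CoarseLockedQueue
and the method runWorkers are hypothetical names made up for this example. It
only shows that when every public method is synchronized, all callers
serialize on one monitor, even for a cheap read such as size().

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CountDownLatch;

public class SynchronizedContentionSketch {

    /** Hypothetical stand-in for a queue whose public methods are all synchronized. */
    static class CoarseLockedQueue {
        private final Set<Long> blocks = new HashSet<>();

        // Every method takes the same monitor (this), so callers run one at a time.
        synchronized boolean add(long blockId) { return blocks.add(blockId); }
        synchronized boolean remove(long blockId) { return blocks.remove(blockId); }
        synchronized int size() { return blocks.size(); }
    }

    /** Starts the given number of threads that all hit the same queue; returns its final size. */
    static int runWorkers(int threads, int opsPerThread) {
        CoarseLockedQueue queue = new CoarseLockedQueue();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final long base = (long) t * opsPerThread; // disjoint id range per thread
            workers[t] = new Thread(() -> {
                try {
                    start.await(); // release all workers at once to maximize contention
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < opsPerThread; i++) {
                    queue.add(base + i);
                    queue.size(); // even this read has to wait for the monitor
                }
            });
            workers[t].start();
        }
        start.countDown();
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return queue.size();
    }

    public static void main(String[] args) {
        long begin = System.nanoTime();
        int size = runWorkers(8, 100_000);
        long elapsedMs = (System.nanoTime() - begin) / 1_000_000;
        System.out.println("blocks queued: " + size + ", elapsed ms: " + elapsedMs);
    }
}
```

Comparing the elapsed time for different thread counts on a given machine
gives a rough feel for how much time is spent waiting on the monitor.
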
The feedback loop looks like this:
# The cluster experiences a burst in IO. The BlockManager experiences lock
contention on LowRedundancyBlocks.
# Due to the lock contention, many of the RPC operations in the BlockManager
get delayed, occupying the RPC pool for a long time.
# Heartbeats from the DNs get delayed due to the contention. We start to
lose DNs and the blocks on them.
# We need to replicate those missing blocks, which causes even higher load on
the DNs as well as on the block reports they send to the NN.
# Those block reports interact with BlockManager and eventually
LowRedundancyBlocks, making the lock contention problem worse.
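
The self-reinforcing nature of the loop above can be sketched with a
deliberately simplified toy model. Everything in it is an assumption made for
illustration: the class FeedbackLoopToyModel, the constants (capacity, cost per
lost DN, number of DNs), and the rule that extra work from lost DNs is never
cleared. It is not derived from HDFS code or measurements. It only shows that
once demand on the lock exceeds capacity, each lost DN adds work that causes
more DNs to be lost.

```java
public class FeedbackLoopToyModel {

    /**
     * Returns the number of live DNs after each tick.
     * All constants are hypothetical and chosen only to make the loop visible.
     */
    static int[] simulate(int ticks, int burstAtTick, int burstSize) {
        final int capacity = 100;   // lock-protected operations the NN can serve per tick
        final int costPerLostDn = 20; // extra operations per tick for each lost DN
        final int opsPerMissedHeartbeat = 10; // overload needed to time out one DN
        int live = 50;              // each live DN costs one operation per tick (heartbeat)
        int lostTotal = 0;
        int[] liveHistory = new int[ticks];
        for (int t = 0; t < ticks; t++) {
            int burst = (t == burstAtTick) ? burstSize : 0;
            // Heartbeats + reconstruction/block-report work for lost DNs + IO burst.
            int demand = live + lostTotal * costPerLostDn + burst;
            int overload = Math.max(0, demand - capacity);
            // Overload delays heartbeats, so some live DNs are declared dead.
            int newlyLost = Math.min(live, overload / opsPerMissedHeartbeat);
            live -= newlyLost;
            lostTotal += newlyLost;
            liveHistory[t] = live;
        }
        return liveHistory;
    }

    public static void main(String[] args) {
        int[] small = simulate(5, 0, 40);  // burst stays under capacity
        int[] large = simulate(5, 0, 100); // burst exceeds capacity once, at tick 0
        System.out.println("small burst, live DNs at end: " + small[small.length - 1]);
        System.out.println("large burst, live DNs at end: " + large[large.length - 1]);
    }
}
```

In this toy model a burst that stays under capacity has no lasting effect,
while a single burst over capacity keeps losing DNs after the burst itself
has ended.
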