Yuanbo Liu created HDFS-16657:
---------------------------------
Summary: Changing pool-level lock to volume-level lock for
invalidation of blocks
Key: HDFS-16657
URL: https://issues.apache.org/jira/browse/HDFS-16657
Project: Hadoop HDFS
Issue Type: Sub-task
Reporter: Yuanbo Liu
Attachments: image-2022-07-13-10-25-37-383.png,
image-2022-07-13-10-27-01-386.png, image-2022-07-13-10-27-44-258.png
Recently we have seen DataNode (DN) heartbeats become slow in a very busy cluster.
Here is the chart:
!image-2022-07-13-10-25-37-383.png!
After taking a jstack of the DN, we found that the DN heartbeat was stuck in the
invalidation of blocks:
!image-2022-07-13-10-27-01-386.png!
!image-2022-07-13-10-27-44-258.png!
The key code is:
{code:java}
try {
  File blockFile = new File(info.getBlockURI());
  if (blockFile != null && blockFile.getParentFile() == null) {
    errors.add("Failed to delete replica " + invalidBlks[i]
        + ". Parent not found for block file: " + blockFile);
    continue;
  }
} catch(IllegalArgumentException e) {
  LOG.warn("Parent directory check failed; replica " + info
      + " is not backed by a local file");
}
{code}
The DN tries to locate the parent path of the block file, so there is disk I/O
inside the pool-level lock. When the disk becomes very busy with high I/O wait,
all pending threads are blocked on the pool-level lock and heartbeat time
rises. We propose changing the pool-level lock to a volume-level lock for
block invalidation.
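A minimal sketch of the idea, not the DataNode's actual code: the class VolumeLockSketch, its method names, and the "bp-1", "slow" and "fast" identifiers are all hypothetical. It assumes the pool-level path takes an exclusive lock per block pool, and that the volume-level path takes the pool lock in shared mode plus an exclusive lock per volume. One thread stalls inside its lock to imitate a disk with high I/O wait, and the sketch checks whether work on another volume can still finish.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch; none of these names come from the Hadoop code base. */
public class VolumeLockSketch {
    private final ConcurrentHashMap<String, ReentrantReadWriteLock> poolLocks =
        new ConcurrentHashMap<>();
    private final ConcurrentHashMap<String, ReentrantReadWriteLock> volumeLocks =
        new ConcurrentHashMap<>();

    private ReentrantReadWriteLock poolLock(String bpid) {
        return poolLocks.computeIfAbsent(bpid, k -> new ReentrantReadWriteLock());
    }

    private ReentrantReadWriteLock volumeLock(String bpid, String volume) {
        return volumeLocks.computeIfAbsent(bpid + "/" + volume,
            k -> new ReentrantReadWriteLock());
    }

    /** Pool-level locking: one exclusive lock covers every volume of the pool. */
    public void invalidateUnderPoolLock(String bpid, Runnable diskWork) {
        ReentrantReadWriteLock pool = poolLock(bpid);
        pool.writeLock().lock();
        try {
            diskWork.run();
        } finally {
            pool.writeLock().unlock();
        }
    }

    /** Volume-level locking: shared pool lock, exclusive lock on one volume only. */
    public void invalidateUnderVolumeLock(String bpid, String volume, Runnable diskWork) {
        ReentrantReadWriteLock pool = poolLock(bpid);
        ReentrantReadWriteLock vol = volumeLock(bpid, volume);
        pool.readLock().lock();
        try {
            vol.writeLock().lock();
            try {
                diskWork.run();
            } finally {
                vol.writeLock().unlock();
            }
        } finally {
            pool.readLock().unlock();
        }
    }

    /** True if work on volume "fast" finishes while volume "slow" is stalled in its lock. */
    public static boolean fastVolumeProceeds(boolean volumeLevel) {
        VolumeLockSketch locks = new VolumeLockSketch();
        CountDownLatch slowHoldsLock = new CountDownLatch(1);
        CountDownLatch releaseSlow = new CountDownLatch(1);
        CountDownLatch fastDone = new CountDownLatch(1);

        // Imitates a disk operation that hangs until releaseSlow is counted down.
        Runnable stalledDisk = () -> {
            slowHoldsLock.countDown();
            try {
                releaseSlow.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        Runnable quickWork = fastDone::countDown;

        Thread slow = new Thread(() -> {
            if (volumeLevel) {
                locks.invalidateUnderVolumeLock("bp-1", "slow", stalledDisk);
            } else {
                locks.invalidateUnderPoolLock("bp-1", stalledDisk);
            }
        });
        Thread fast = new Thread(() -> {
            if (volumeLevel) {
                locks.invalidateUnderVolumeLock("bp-1", "fast", quickWork);
            } else {
                locks.invalidateUnderPoolLock("bp-1", quickWork);
            }
        });

        try {
            slow.start();
            slowHoldsLock.await();          // the slow thread now holds its lock
            fast.start();
            boolean proceeded = fastDone.await(1, TimeUnit.SECONDS);
            releaseSlow.countDown();        // let the stalled "disk" finish
            slow.join();
            fast.join();
            return proceeded;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("pool-level lock, other volume proceeds: "
            + fastVolumeProceeds(false));
        System.out.println("volume-level lock, other volume proceeds: "
            + fastVolumeProceeds(true));
    }
}
```

With the pool-level lock, the second thread waits until the stalled one releases. With the volume-level lock, only work on the same volume waits, so a single busy disk no longer holds up threads that touch other volumes.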
cc: [~hexiaoqiao] [~Aiphag0]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)