[
https://issues.apache.org/jira/browse/HDFS-16657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18036563#comment-18036563
]
ASF GitHub Bot commented on HDFS-16657:
---------------------------------------
github-actions[bot] commented on PR #4558:
URL: https://github.com/apache/hadoop/pull/4558#issuecomment-3507233632
We're closing this stale PR because it has been open for 100 days with no
activity. This isn't a judgement on the merit of the PR in any way. It's just a
way of keeping the PR queue manageable.
If you feel like this was a mistake, or you would like to continue working
on it, please feel free to re-open it and ask for a committer to remove the
stale tag and review again.
Thanks all for your contribution.
> Changing pool-level lock to volume-level lock for invalidation of blocks
> ------------------------------------------------------------------------
>
> Key: HDFS-16657
> URL: https://issues.apache.org/jira/browse/HDFS-16657
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Yuanbo Liu
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2022-07-13-10-25-37-383.png,
> image-2022-07-13-10-27-01-386.png, image-2022-07-13-10-27-44-258.png
>
> Time Spent: 1h
> Remaining Estimate: 0h
>
> Recently we have seen DataNode (DN) heartbeats become slow in a very busy
> cluster. Here is the chart:
> !image-2022-07-13-10-25-37-383.png|width=665,height=245!
>
> After taking a jstack dump of the DN, we found that the DN heartbeat was stuck
> in the invalidation of blocks:
> !image-2022-07-13-10-27-01-386.png|width=658,height=308!
> !image-2022-07-13-10-27-44-258.png|width=502,height=325!
> The key code is:
> {code:java}
> try {
>   File blockFile = new File(info.getBlockURI());
>   if (blockFile != null && blockFile.getParentFile() == null) {
>     errors.add("Failed to delete replica " + invalidBlks[i]
>         + ". Parent not found for block file: " + blockFile);
>     continue;
>   }
> } catch(IllegalArgumentException e) {
>   LOG.warn("Parent directory check failed; replica " + info
>       + " is not backed by a local file");
> }
> {code}
> The DN tries to locate the parent path of the block file, so disk I/O happens
> while the pool-level lock is held. When the disk is very busy with high I/O
> wait, all pending threads are blocked on the pool-level lock and the heartbeat
> time rises. We propose changing the pool-level lock to a volume-level lock
> for block invalidation.
> cc: [~hexiaoqiao] [~Aiphag0]
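
To illustrate the proposal quoted above, here is a minimal sketch of per-volume locking. It is not the DataNode's actual code or the patch in PR #4558: the class `VolumeLockSketch`, its methods, and the `bpid`/`volumeId` string keys are hypothetical names chosen for illustration. The idea shown is that a slow action on one volume holds only that volume's lock, so work on other volumes in the same block pool is not blocked.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical sketch: one lock per (block pool, volume) pair
// instead of a single lock per block pool.
class VolumeLockSketch {
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    // Returns the lock guarding a single volume within a block pool.
    ReentrantLock volumeLock(String bpid, String volumeId) {
        return locks.computeIfAbsent(bpid + "/" + volumeId, k -> new ReentrantLock());
    }

    // Runs a possibly slow, disk-bound action while holding only that volume's lock.
    <T> T withVolumeLock(String bpid, String volumeId, Supplier<T> action) {
        ReentrantLock lock = volumeLock(bpid, volumeId);
        lock.lock();
        try {
            return action.get();
        } finally {
            lock.unlock();
        }
    }

    // Reports whether a different thread could take the given volume's lock right now.
    boolean isFreeForOtherThread(String bpid, String volumeId) {
        boolean[] acquired = new boolean[1];
        Thread t = new Thread(() -> {
            ReentrantLock lock = volumeLock(bpid, volumeId);
            if (lock.tryLock()) {
                acquired[0] = true;
                lock.unlock();
            }
        });
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return acquired[0];
    }
}
```

With this layout, a thread stuck in disk I/O on one busy volume delays only other operations on that same volume. The trade-off is that any operation needing a consistent view across all volumes of a pool would have to coordinate several locks, which this sketch does not attempt.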
--
This message was sent by Atlassian Jira
(v8.20.10#820010)