[ https://issues.apache.org/jira/browse/HDFS-16657?focusedWorklogId=792622&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-792622 ]
ASF GitHub Bot logged work on HDFS-16657:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 19/Jul/22 11:59
            Start Date: 19/Jul/22 11:59
    Worklog Time Spent: 10m
      Work Description: yuanboliu commented on PR #4558:
URL: https://github.com/apache/hadoop/pull/4558#issuecomment-1188963756

   @Hexiaoqiao
   1. The default max deletion rate is 20000 blocks per minute with a 3s heartbeat, so in practice memory wouldn't be the problem.
   2.

Issue Time Tracking
-------------------

    Worklog Id:     (was: 792622)
    Time Spent: 1h  (was: 50m)

> Changing pool-level lock to volume-level lock for invalidation of blocks
> ------------------------------------------------------------------------
>
>                 Key: HDFS-16657
>                 URL: https://issues.apache.org/jira/browse/HDFS-16657
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Yuanbo Liu
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2022-07-13-10-25-37-383.png,
> image-2022-07-13-10-27-01-386.png, image-2022-07-13-10-27-44-258.png
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> Recently we saw the dn heartbeat become slow in a very busy
> cluster; here is the chart:
> !image-2022-07-13-10-25-37-383.png|width=665,height=245!
>
> After taking a jstack of the dn, we found that the dn heartbeat was stuck in
> the invalidation of blocks:
> !image-2022-07-13-10-27-01-386.png|width=658,height=308!
> !image-2022-07-13-10-27-44-258.png|width=502,height=325!
> The key code is:
> {code:java}
> try {
>   File blockFile = new File(info.getBlockURI());
>   if (blockFile != null && blockFile.getParentFile() == null) {
>     errors.add("Failed to delete replica " + invalidBlks[i]
>         + ". Parent not found for block file: " + blockFile);
>     continue;
>   }
> } catch(IllegalArgumentException e) {
>   LOG.warn("Parent directory check failed; replica " + info
>       + " is not backed by a local file");
> } {code}
> The DN is trying to locate the parent path of the block file, so there is
> disk I/O inside the pool-level lock.
> When the disk becomes very busy with high I/O wait, all the
> pending threads are blocked by the pool-level lock, and the heartbeat
> time becomes high. We propose changing the pool-level lock to a
> volume-level lock for block invalidation.
> cc: [~hexiaoqiao] [~Aiphag0]
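
The proposal in the quoted description (hold a lock scoped to one volume rather than to the whole block pool while doing the disk check) can be illustrated with a minimal sketch. This is not the HDFS DataNode code or the change made in the PR: the class VolumeLockSketch, its methods, and the "bpid/volumeId" key format are all hypothetical names chosen for illustration. It only shows the general idea that a slow disk check on one volume holds that volume's lock and leaves other volumes in the same pool free.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative only: hypothetical names, not the HDFS DataNode API.
class VolumeLockSketch {
    // One lock per (block pool, volume) pair instead of one lock per block pool.
    private final Map<String, ReentrantLock> volumeLocks = new ConcurrentHashMap<>();

    // Returns the lock for this pool/volume pair, creating it on first use.
    private ReentrantLock lockFor(String bpid, String volumeId) {
        return volumeLocks.computeIfAbsent(bpid + "/" + volumeId, k -> new ReentrantLock());
    }

    // Runs a disk-touching check while holding only the lock of the volume
    // that owns the replica, so slow I/O on that disk cannot block work
    // that targets other volumes of the same block pool.
    public void invalidateOnVolume(String bpid, String volumeId, Runnable diskCheck) {
        ReentrantLock lock = lockFor(bpid, volumeId);
        lock.lock();
        try {
            diskCheck.run();
        } finally {
            lock.unlock();
        }
    }

    // True if some thread currently holds the lock for this pool/volume pair.
    public boolean isVolumeBusy(String bpid, String volumeId) {
        return lockFor(bpid, volumeId).isLocked();
    }

    public static void main(String[] args) {
        VolumeLockSketch sketch = new VolumeLockSketch();
        // While the check for vol-a runs, report the lock state of both volumes.
        sketch.invalidateOnVolume("bp-1", "vol-a", () -> {
            System.out.println("vol-a busy: " + sketch.isVolumeBusy("bp-1", "vol-a"));
            System.out.println("vol-b busy: " + sketch.isVolumeBusy("bp-1", "vol-b"));
        });
    }
}
```

With a single pool-level lock, the check in the lambda would also make every other volume of "bp-1" appear busy for its whole duration; keying the lock by volume narrows the blocked work to the one slow disk.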