[ 
https://issues.apache.org/jira/browse/HDFS-16657?focusedWorklogId=792622&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-792622
 ]

ASF GitHub Bot logged work on HDFS-16657:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 19/Jul/22 11:59
            Start Date: 19/Jul/22 11:59
    Worklog Time Spent: 10m 
      Work Description: yuanboliu commented on PR #4558:
URL: https://github.com/apache/hadoop/pull/4558#issuecomment-1188963756

   @Hexiaoqiao 
   1. The default max deletion rate is 20000 blocks per minute with a 3s 
heartbeat, so in practice memory should not be a problem.  
   2. 

Issue Time Tracking
-------------------

    Worklog Id:     (was: 792622)
    Time Spent: 1h  (was: 50m)

> Changing pool-level lock to volume-level lock for invalidation of blocks
> ------------------------------------------------------------------------
>
>                 Key: HDFS-16657
>                 URL: https://issues.apache.org/jira/browse/HDFS-16657
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Yuanbo Liu
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2022-07-13-10-25-37-383.png, 
> image-2022-07-13-10-27-01-386.png, image-2022-07-13-10-27-44-258.png
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> Recently we saw that DN heartbeats became slow in a very busy 
> cluster; here is the chart:
> !image-2022-07-13-10-25-37-383.png|width=665,height=245!
>  
> After taking a jstack of the DN, we found that the DN heartbeat was stuck in 
> the invalidation of blocks:
> !image-2022-07-13-10-27-01-386.png|width=658,height=308!
> !image-2022-07-13-10-27-44-258.png|width=502,height=325!
> The key code is:
> {code:java}
> try {
>   File blockFile = new File(info.getBlockURI());
>   if (blockFile != null && blockFile.getParentFile() == null) {
>     errors.add("Failed to delete replica " + invalidBlks[i]
>         +  ". Parent not found for block file: " + blockFile);
>     continue;
>   }
> } catch(IllegalArgumentException e) {
>   LOG.warn("Parent directory check failed; replica " + info
>       + " is not backed by a local file");
> } {code}
> The DN tries to locate the parent path of the block file, so there is disk I/O 
> inside the pool-level lock. When the disk becomes very busy with high I/O wait, 
> all pending threads are blocked by the pool-level lock, and the heartbeat 
> time becomes high. We propose changing the pool-level lock to a volume-level 
> lock for block invalidation.
> cc: [~hexiaoqiao] [~Aiphag0] 
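
The locking change proposed in the description can be sketched roughly as
follows. This is only an illustration of the idea, not the patch in PR #4558:
the class VolumeLockSketch, its methods, and the string keys for block pool
and volume are hypothetical names chosen for this example. Each (block pool,
volume) pair gets its own lock, so slow work on one busy disk does not hold
up invalidation on the other volumes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch of volume-level locking; not the actual HDFS code. */
class VolumeLockSketch {
    // One lock per (block pool, volume) pair instead of one lock per block pool.
    private final Map<String, ReentrantLock> volumeLocks = new ConcurrentHashMap<>();

    /** Returns the lock for the given pool and volume, creating it on first use. */
    ReentrantLock lockFor(String bpid, String volumeId) {
        return volumeLocks.computeIfAbsent(bpid + "/" + volumeId, k -> new ReentrantLock());
    }

    /** Runs the action while holding only the lock of the volume that stores the replica. */
    void invalidateOnVolume(String bpid, String volumeId, Runnable action) {
        ReentrantLock lock = lockFor(bpid, volumeId);
        lock.lock();
        try {
            action.run(); // e.g. the parent-directory check and replica removal
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        VolumeLockSketch sketch = new VolumeLockSketch();
        // Two volumes in the same block pool use different locks.
        boolean shared = sketch.lockFor("BP-1", "vol-a") == sketch.lockFor("BP-1", "vol-b");
        System.out.println("volumes share a lock: " + shared);
    }
}
```

With a single pool-level lock, every caller would contend on the same lock
regardless of which disk it touches. In the sketch, a thread stuck on vol-a
only delays other work on vol-a.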



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
