[ 
https://issues.apache.org/jira/browse/HDFS-15177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039169#comment-17039169
 ] 

Stephen O'Donnell commented on HDFS-15177:
------------------------------------------

You posted just about the same time as I did.

On CDH 5.16.1, we have seen the scenario you described, where the heartbeat 
gets blocked, even without many block pools and with high load on only a 
single block pool.

I believe part of the problem is that the 2.x branch uses "synchronized" to 
provide locking, and that gives no fairness guarantee on the lock. This means 
important threads like the heartbeat can be blocked for a long time.
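
As a rough illustration of the fairness point above, here is a minimal sketch 
of a fair lock in plain Java. The class and method names (FairLockSketch, 
heartbeat) are hypothetical and are not taken from the DataNode code; the 
sketch only shows the java.util.concurrent option that "synchronized" lacks.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for a structure guarded by one lock; not Hadoop code. */
public class FairLockSketch {
    // Passing 'true' requests a fair ordering policy: under contention the
    // lock favors the longest-waiting thread. An intrinsic 'synchronized'
    // monitor has no equivalent setting.
    private final ReentrantLock lock = new ReentrantLock(true);
    private long heartbeats = 0;

    /** Work done under the lock, standing in for a heartbeat's bookkeeping. */
    public long heartbeat() {
        lock.lock();
        try {
            return ++heartbeats;
        } finally {
            lock.unlock(); // always release, even if the body throws
        }
    }

    /** Reports whether the lock was created with the fair policy. */
    public boolean isFair() {
        return lock.isFair();
    }
}
```

A fair lock usually has lower raw throughput than an unfair one, so this 
trades some speed for a bound on how long any one waiter is passed over.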

There are also some scenarios where the lock is held while disk operations 
happen, which causes a lot of slowdowns.
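
One common way to avoid holding a lock across disk operations is to update 
the in-memory state under the lock and do the file I/O afterwards. The sketch 
below assumes a simple map from block id to file; DeleteOutsideLockSketch, 
invalidate and runExample are hypothetical names, and this is not how 
FsDatasetImpl is actually written.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical replica map; illustrates the pattern only, not Hadoop code. */
public class DeleteOutsideLockSketch {
    private final ReentrantLock lock = new ReentrantLock(true);
    private final Map<Long, Path> replicas = new HashMap<>();

    public void add(long blockId, Path file) {
        lock.lock();
        try {
            replicas.put(blockId, file);
        } finally {
            lock.unlock();
        }
    }

    /** Removes the given blocks and returns how many files were deleted. */
    public int invalidate(List<Long> blockIds) {
        List<Path> toDelete = new ArrayList<>();
        // Step 1: only the cheap in-memory update happens under the lock.
        lock.lock();
        try {
            for (Long id : blockIds) {
                Path file = replicas.remove(id);
                if (file != null) {
                    toDelete.add(file);
                }
            }
        } finally {
            lock.unlock();
        }
        // Step 2: the slow disk work runs after the lock has been released.
        int deleted = 0;
        for (Path file : toDelete) {
            try {
                if (Files.deleteIfExists(file)) {
                    deleted++;
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return deleted;
    }

    /** Small usage example: two known blocks and one unknown block id. */
    public static int runExample() {
        try {
            DeleteOutsideLockSketch ds = new DeleteOutsideLockSketch();
            ds.add(1L, Files.createTempFile("blk_1", ".tmp"));
            ds.add(2L, Files.createTempFile("blk_2", ".tmp"));
            return ds.invalidate(Arrays.asList(1L, 2L, 3L));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Other threads can take the lock while the files are being deleted, at the 
cost of a short window where a block is gone from the map but still on disk.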

On the 3.x branch, the locking in the DN has been changed to a fair lock for 
some time now, and recently I have changed it to a read-write lock in 
HDFS-15150 in an effort to improve the throughput of the lock. Work is ongoing 
to move various code paths to use the read lock (HDFS-15160) to improve 
concurrency. There may be scope to enhance this further by moving to a lock per 
block pool in some scenarios, but those changes would be more difficult.
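
To make the read-write lock idea concrete, here is a minimal sketch using 
java.util.concurrent directly. ReadWriteLockSketch, addBlock and getLength 
are hypothetical names, and constructing the lock as fair is an assumption 
of the sketch; it is not the code from HDFS-15150 or HDFS-15160.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical block-length table; illustrates the lock usage only. */
public class ReadWriteLockSketch {
    // Assumption: fair mode, so waiting writers are not starved by readers.
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);
    private final Map<Long, Long> blockLengths = new HashMap<>();

    /** Mutations take the write lock, which excludes readers and writers. */
    public void addBlock(long blockId, long length) {
        lock.writeLock().lock();
        try {
            blockLengths.put(blockId, length);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Lookups take the read lock, so many readers can run concurrently. */
    public long getLength(long blockId) {
        lock.readLock().lock();
        try {
            Long length = blockLengths.get(blockId);
            return length == null ? -1L : length; // -1 means unknown block
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

The gain depends on how many code paths are read-only: every path that still 
takes the write lock serializes against everything else.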

We have also seen some problems on CDH 5.16 around the FoldedTreeSet which 
holds the blocks in the Replica Map (HDFS-15131). It would be interesting to 
know whether you are seeing anything similar. We have found that the DNs seem 
to get slower over time due to some issue with that structure, and they need a 
restart to become fast again.

> Split datanode invalide block deletion, to avoid the FsDatasetImpl lock too 
> much time.
> --------------------------------------------------------------------------------------
>
>                 Key: HDFS-15177
>                 URL: https://issues.apache.org/jira/browse/HDFS-15177
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>            Reporter: zhuqi
>            Assignee: zhuqi
>            Priority: Major
>         Attachments: image-2020-02-18-22-39-00-642.png, 
> image-2020-02-18-22-51-28-624.png, image-2020-02-18-22-52-59-202.png, 
> image-2020-02-18-22-55-38-661.png
>
>
> In our cluster, the datanode receives delete commands that contain too many 
> blocks when many block pools share the same datanode and the datanode has 
> about 30 storage dirs. This causes the FsDatasetImpl lock to be held for too 
> long.
>  


