[ https://issues.apache.org/jira/browse/HDFS-17690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17906601#comment-17906601 ]
ASF GitHub Bot commented on HDFS-17690:
---------------------------------------

junegunn opened a new pull request, #7231:
URL: https://github.com/apache/hadoop/pull/7231

### Description of PR

Avoid processing redundant BlockECReconstructionInfo entries from the queue. Check again whether each task is really needed just before dispatching the tasks to the datanodes; a sketch of the idea follows.
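To make the idea concrete, here is a minimal, self-contained Java sketch. It is a toy model, not the actual patch: `ReconstructionTask`, the `fullyReconstructed` set, and the `pending` queue are hypothetical stand-ins for `BlockECReconstructionInfo`, the namenode's redundancy check, and the `DatanodeDescriptor`'s `erasurecodeBlocks` queue.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.Set;

/**
 * Toy model of the namenode-side dispatch path for EC reconstruction
 * tasks. Everything here is a simplified stand-in for the real classes.
 */
public class DispatchSketch {

  /** Stand-in for BlockECReconstructionInfo: just the block id here. */
  record ReconstructionTask(long blockId) {}

  /** Per-datanode queue of scheduled tasks. */
  private final Queue<ReconstructionTask> pending = new ArrayDeque<>();

  /** Block ids that already have enough internal blocks. A real namenode
   *  would consult the block map instead of a precomputed set. */
  private final Set<Long> fullyReconstructed;

  DispatchSketch(Set<Long> fullyReconstructed) {
    this.fullyReconstructed = fullyReconstructed;
  }

  /** A timed-out task is recreated and rescheduled, so the same block can
   *  be queued more than once; nothing removes the stale entry. */
  void schedule(ReconstructionTask task) {
    pending.add(task);
  }

  /** Drain up to maxTasks entries, skipping tasks that are no longer
   *  needed (the re-validation step this PR proposes). */
  List<ReconstructionTask> pollTasksToDispatch(int maxTasks) {
    List<ReconstructionTask> toDispatch = new ArrayList<>();
    ReconstructionTask task;
    while (toDispatch.size() < maxTasks && (task = pending.poll()) != null) {
      if (fullyReconstructed.contains(task.blockId())) {
        System.out.println("block " + task.blockId()
            + " doesn't need reconstruction, skipping");
        continue;
      }
      toDispatch.add(task);
    }
    return toDispatch;
  }

  public static void main(String[] args) {
    DispatchSketch node = new DispatchSketch(Set.of(42L));
    node.schedule(new ReconstructionTask(42L)); // stale entry from a timeout
    node.schedule(new ReconstructionTask(7L));  // still under-replicated
    System.out.println("dispatching: " + node.pollTasksToDispatch(10));
  }
}
```

Re-validating at dispatch time leaves the queue itself untouched, which is why it needs less bookkeeping than evicting stale entries when a timeout fires.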
### How was this patch tested?

Tested on a 3.4.1 cluster with 21 datanodes. Shut down 2 datanodes holding 4~5TB of data and observed the reconstruction progress.

### For code changes:

- [x] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
- [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
- [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?

> Avoid redundant EC reconstruction tasks after pending reconstruction timeouts
> -----------------------------------------------------------------------------
>
>                 Key: HDFS-17690
>                 URL: https://issues.apache.org/jira/browse/HDFS-17690
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: ec
>    Affects Versions: 3.4.1
>            Reporter: Junegunn Choi
>            Priority: Major
>         Attachments: image-2024-12-18-13-50-11-978.png, image-2024-12-18-13-51-33-552.png
>
> h2. Problem
>
> We are running HDFS clusters with the RS-6-3-1024k policy enabled.
>
> When datanodes go down, HDFS reconstructs the under-replicated EC blocks as expected, and the number of low redundancy blocks decreases. However, after some period of time, we observed a lot of wasted effort spent reconstructing EC blocks that had already been fully reconstructed long before.
>
> !image-2024-12-18-13-50-11-978.png!
>
> * We shut down two datanodes in an HDFS cluster with 21 datanodes.
> * Each datanode held around 2TB of data on SSD devices.
> * It took 10 hours for the low redundancy metric to reach 0.
> * Lots of (17K+) failed reconstruction tasks wasted disk and network resources during the period.
> ** (There were no other activities on the cluster.)
>
> Below are the log messages for {{blk_-9223372036851532480_352520}}; note the repeated {{"Failed to reconstruct"}} messages from different datanodes.
>
> {noformat}
> @timestamp,server,loglevel,role,data
> Dec 18, 2024 @ 00:56:07.000,s19,WARN,datanode,Failed to reconstruct striped block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
> Dec 18, 2024 @ 01:03:41.000,s04,WARN,datanode,Failed to reconstruct striped block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
> Dec 18, 2024 @ 04:56:16.000,s20,WARN,datanode,Failed to reconstruct striped block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
> Dec 18, 2024 @ 05:18:37.000,s20,WARN,datanode,Failed to reconstruct striped block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
> Dec 18, 2024 @ 07:18:31.000,s20,WARN,datanode,Failed to reconstruct striped block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
> Dec 18, 2024 @ 07:38:34.000,s20,WARN,datanode,Failed to reconstruct striped block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
> Dec 18, 2024 @ 08:18:28.000,s19,WARN,datanode,Failed to reconstruct striped block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
> Dec 18, 2024 @ 08:18:40.000,s19,WARN,datanode,Failed to reconstruct striped block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
> Dec 18, 2024 @ 08:19:07.000,s19,WARN,datanode,Failed to reconstruct striped block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
> Dec 18, 2024 @ 08:23:08.000,s18,WARN,datanode,Failed to reconstruct striped block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
> Dec 18, 2024 @ 08:23:38.000,s18,WARN,datanode,Failed to reconstruct striped block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
> {noformat}
>
> The repeated failures were caused by {{ReplicaAlreadyExistsException}}, which means we were performing redundant tasks.
>
> {noformat}
> 24/12/18 08:23:59 INFO datanode.DataNode: DatanodeCommand action: DNA_ERASURE_CODING_RECOVERY
> 24/12/18 08:23:59 INFO datanode.DataNode: Receiving BP-1483963022-k1-1732859582319:blk_-9223372036851532476_352520 src: /s18:35816 dest: /s18:1004
> 24/12/18 08:23:59 INFO datanode.DataNode: opWriteBlock BP-1483963022-k1-1732859582319:blk_-9223372036851532476_352520 received exception org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-1483963022-k1-1732859582319:blk_-9223372036851532476_352520 already exists in state FINALIZED and thus cannot be created.
> 24/12/18 08:23:59 INFO datanode.DataNode: s18:1004:DataXceiver error processing WRITE_BLOCK operation src: /s18:35816 dst: /s18:1004; org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-1483963022-k1-1732859582319:blk_-9223372036851532476_352520 already exists in state FINALIZED and thus cannot be created.
> 24/12/18 08:23:59 WARN datanode.DataNode: Broken pipe
> 24/12/18 08:23:59 WARN datanode.DataNode: Failed to reconstruct striped block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
> java.io.IOException: Transfer failed for all targets.
>     at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:118)
>     at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:63)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> {noformat}
>
> {{hdfs fsck -blockId blk_-9223372036851532480}} also shows that this block doesn't need reconstruction.
>
> {noformat}
> No. of Expected Replica: 9
> No. of live Replica: 9
> {noformat}
>
> h2. Workaround
>
> Because this happens only after several {{PendingReconstructionMonitor timed out}} messages, I tried increasing {{dfs.namenode.reconstruction.pending.timeout-sec}} so that no timeout occurs, and the problem disappeared.
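> For reference, the workaround is a single property in hdfs-site.xml on the namenode. The value below is only illustrative, chosen to be comfortably larger than the longest reconstruction backlog we expect:
>
> {noformat}
> <property>
>   <name>dfs.namenode.reconstruction.pending.timeout-sec</name>
>   <!-- Illustrative value (seconds), large enough that pending
>        reconstructions do not time out while recovery is in progress. -->
>   <value>86400</value>
> </property>
> {noformat}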
> h2. Cause
>
> When an EC reconstruction task for a block is scheduled, a {{BlockECReconstructionInfo}} for the block is added to the {{erasurecodeBlocks}} queue of the {{DatanodeDescriptor}} of the target node. However, it is not removed from the queue when the reconstruction task times out and is recreated and rescheduled, so we end up with multiple redundant {{BlockECReconstructionInfo}} objects on the queues of several datanodes.
>
> This explains the large amount of wasted effort after multiple {{PendingReconstructionMonitor}} timeouts.
>
> You are more likely to experience this problem on a relatively small cluster with a small {{dfs.namenode.replication.max-streams}} value, where block reconstruction for failed datanodes takes a considerable amount of time and triggers many pending reconstruction timeouts. This was exactly our case: the cluster has only 21 datanodes, and {{dfs.namenode.replication.max-streams}} is set to 1 because we run HBase on the same cluster and don't want eager reconstruction to affect HBase's performance.
>
> h2. Suggested fix
>
> Avoid processing redundant {{BlockECReconstructionInfo}} entries from the queue. Check again whether the task is really needed just before dispatching the tasks to the datanodes.
>
> h2. Result
>
> !image-2024-12-18-13-51-33-552.png!
>
> * Took down 2 datanodes.
> * Reconstruction completed in only 3.5 hours.
> * 10K+ {{"doesn't need reconstruction"}} messages in the namenode log, showing that we successfully avoided dispatching redundant tasks.
> * No {{org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException}} errors.
>
> h2. Alternative approach considered
>
> We could instead remove the {{BlockECReconstructionInfo}} from the queue when the task times out, but this would require more bookkeeping and would be more error-prone.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org