[ https://issues.apache.org/jira/browse/HDFS-17690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17906906#comment-17906906 ]

ASF GitHub Bot commented on HDFS-17690:
---------------------------------------

junegunn commented on PR #7231:
URL: https://github.com/apache/hadoop/pull/7231#issuecomment-2552660302

   Let me look into the test failures.




> Avoid redundant EC reconstruction tasks after pending reconstruction timeouts
> -----------------------------------------------------------------------------
>
>                 Key: HDFS-17690
>                 URL: https://issues.apache.org/jira/browse/HDFS-17690
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: ec
>    Affects Versions: 3.4.1
>            Reporter: Junegunn Choi
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2024-12-18-13-50-11-978.png, 
> image-2024-12-18-13-51-33-552.png
>
>
> h2. Problem
> We are running HDFS clusters with RS-6-3-1024k policy enabled.
> When datanodes go down, HDFS reconstructs the under-replicated EC blocks as 
> expected and the number of low redundancy blocks decreases. However, after 
> some time, we observed a lot of wasted effort spent reconstructing EC blocks 
> that had already been fully reconstructed long before.
> !image-2024-12-18-13-50-11-978.png|width=1000!
>  * We shut down two datanodes in an HDFS cluster with 21 datanodes.
>  * Each datanode holds around 2 TB of data on SSD devices.
>  * It took 10 hours for the low redundancy metric to reach 0.
>  * Lots of (17K+) failed reconstruction tasks wasted disk and network 
> resources during this period.
>  ** (There was no other activity on the cluster)
> These are the log messages for {{blk_-9223372036851532480_352520}}; note the 
> repeated {{"Failed to reconstruct"}} messages on different datanodes.
> {noformat}
> @timestamp,server,loglevel,role,data
> Dec 18, 2024 @ 00:56:07.000,s19,WARN,datanode,Failed to reconstruct striped 
> block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
> Dec 18, 2024 @ 01:03:41.000,s04,WARN,datanode,Failed to reconstruct striped 
> block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
> Dec 18, 2024 @ 04:56:16.000,s20,WARN,datanode,Failed to reconstruct striped 
> block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
> Dec 18, 2024 @ 05:18:37.000,s20,WARN,datanode,Failed to reconstruct striped 
> block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
> Dec 18, 2024 @ 07:18:31.000,s20,WARN,datanode,Failed to reconstruct striped 
> block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
> Dec 18, 2024 @ 07:38:34.000,s20,WARN,datanode,Failed to reconstruct striped 
> block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
> Dec 18, 2024 @ 08:18:28.000,s19,WARN,datanode,Failed to reconstruct striped 
> block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
> Dec 18, 2024 @ 08:18:40.000,s19,WARN,datanode,Failed to reconstruct striped 
> block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
> Dec 18, 2024 @ 08:19:07.000,s19,WARN,datanode,Failed to reconstruct striped 
> block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
> Dec 18, 2024 @ 08:23:08.000,s18,WARN,datanode,Failed to reconstruct striped 
> block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
> Dec 18, 2024 @ 08:23:38.000,s18,WARN,datanode,Failed to reconstruct striped 
> block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
> {noformat}
> The repeated failures were caused by {{ReplicaAlreadyExistsException}}, 
> which means we were performing redundant tasks.
> {noformat}
> 24/12/18 08:23:59 INFO datanode.DataNode: DatanodeCommand action: 
> DNA_ERASURE_CODING_RECOVERY
> 24/12/18 08:23:59 INFO datanode.DataNode: Receiving 
> BP-1483963022-k1-1732859582319:blk_-9223372036851532476_352520 src: 
> /s18:35816 dest: /s18:1004
> 24/12/18 08:23:59 INFO datanode.DataNode: opWriteBlock 
> BP-1483963022-k1-1732859582319:blk_-9223372036851532476_352520 received 
> exception 
> org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block 
> BP-1483963022-k1-1732859582319:blk_-9223372036851532476_352520 already exists 
> in state FINALIZED and thus cannot be created.
> 24/12/18 08:23:59 INFO datanode.DataNode: s18:1004:DataXceiver error 
> processing WRITE_BLOCK operation  src: /s18:35816 dst: /s18:1004; 
> org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block 
> BP-1483963022-k1-1732859582319:blk_-9223372036851532476_352520 already exists 
> in state FINALIZED and thus cannot be created.
> 24/12/18 08:23:59 WARN datanode.DataNode: Broken pipe
> 24/12/18 08:23:59 WARN datanode.DataNode: Failed to reconstruct striped 
> block: BP-1483963022-k1-1732859582319:blk_-9223372036851532480_352520
> java.io.IOException: Transfer failed for all targets.
>         at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:118)
>         at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:63)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}
> {{hdfs fsck -blockId blk_-9223372036851532480}} also shows that this block 
> doesn't need reconstruction.
> {noformat}
> No. of Expected Replica: 9
> No. of live Replica: 9
> {noformat}
> h2. Workaround
> Because this happens only after several {{PendingReconstructionMonitor timed 
> out}} messages, I tried increasing 
> {{dfs.namenode.reconstruction.pending.timeout-sec}} so that no timeouts 
> occur, and the problem disappeared.
> h2. Cause
> When an EC reconstruction task for a block is scheduled, a 
> {{BlockECReconstructionInfo}} for the block is added to the 
> {{erasurecodeBlocks}} queue of the {{DatanodeDescriptor}} for the target 
> node. However, it is not removed from that queue when the reconstruction 
> task times out, is recreated, and is rescheduled, so we end up with multiple 
> redundant {{BlockECReconstructionInfo}} objects on the queues of several 
> datanodes.
> This explains the large amount of wasted effort after multiple 
> {{PendingReconstructionMonitor}} timeouts. A simplified model of the 
> behavior is sketched below.
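> The following self-contained Java sketch models the mechanism; the class 
> and method names are simplified stand-ins, not the actual NameNode 
> internals:
> {noformat}
> import java.util.ArrayDeque;
> import java.util.HashMap;
> import java.util.Map;
> import java.util.Queue;
> 
> // Toy model of the duplicate-enqueue problem. Each datanode has a task
> // queue, like DatanodeDescriptor#erasurecodeBlocks on the NameNode.
> public class DuplicateEcTaskModel {
>     static Map<String, Queue<String>> queues = new HashMap<>();
> 
>     static void schedule(String block, String targetNode) {
>         // Scheduling only appends; nothing removes a stale entry for the
>         // same block when the pending reconstruction times out.
>         queues.computeIfAbsent(targetNode, k -> new ArrayDeque<>()).add(block);
>     }
> 
>     public static void main(String[] args) {
>         schedule("blk_1", "dn1"); // initial scheduling
>         schedule("blk_1", "dn2"); // timeout #1: recreated and rescheduled
>         schedule("blk_1", "dn3"); // timeout #2: rescheduled again
>         // All three datanodes eventually run the same reconstruction; the
>         // late ones fail with ReplicaAlreadyExistsException on the targets.
>         queues.forEach((dn, q) -> System.out.println(dn + " -> " + q));
>     }
> }
> {noformat}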
> You're more likely to experience this problem on a relatively small cluster 
> with a small {{dfs.namenode.replication.max-streams}} value, where block 
> reconstruction for failed datanodes would take a considerable amount of time 
> and trigger many pending reconstruction timeouts. This was exactly the case 
> for us; our cluster has only 21 datanodes and 
> {{dfs.namenode.replication.max-streams}} is set to 1 because we're running 
> HBase on the same cluster and don't want eager reconstruction to affect 
> HBase's performance.
> h2. Suggested fix
> Avoid processing redundant {{BlockECReconstructionInfo}} entries from the 
> queue: check again whether the task is still needed just before dispatching 
> it to the datanode, as sketched below.
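> Continuing the simplified model above, a minimal sketch of the idea 
> follows. The {{isStillNeeded}} predicate is hypothetical; the real check 
> would compare the block's live replicas against the total expected for its 
> EC policy:
> {noformat}
> import java.util.ArrayDeque;
> import java.util.Queue;
> import java.util.function.Predicate;
> 
> // Filter the per-datanode queue at dispatch time so that stale,
> // already-satisfied reconstruction tasks are dropped instead of sent.
> public class DispatchWithRecheck {
>     static String pollNextNeededTask(Queue<String> queue,
>                                      Predicate<String> isStillNeeded) {
>         String task;
>         while ((task = queue.poll()) != null) {
>             if (isStillNeeded.test(task)) {
>                 return task; // dispatch this task to the datanode
>             }
>             // Drop the stale entry; this is where a message like
>             // "doesn't need reconstruction" would be logged.
>         }
>         return null; // nothing left to dispatch
>     }
> 
>     public static void main(String[] args) {
>         Queue<String> q =
>             new ArrayDeque<>(java.util.List.of("blk_1", "blk_1", "blk_2"));
>         // Suppose blk_1 has already been fully reconstructed elsewhere.
>         System.out.println(pollNextNeededTask(q, blk -> !blk.equals("blk_1")));
>     }
> }
> {noformat}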
> h2. Result
> !image-2024-12-18-13-51-33-552.png|width=1000!
>  * Took down 2 datanodes
>  * Only took 3.5 hours to complete the reconstruction
>  * 10K+ {{"doesn't need reconstruction"}} messages in the namenode log, 
> showing that we have successfully avoided dispatching redundant tasks.
>  * No 
> {{org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException}} 
> errors
> h2. Alternative approach considered
> We could instead remove the {{BlockECReconstructionInfo}} from the queue when 
> the task times out, but this would require more bookkeeping and be more 
> error-prone.
>  


