[ https://issues.apache.org/jira/browse/HDFS-16776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17868843#comment-17868843 ]

ASF GitHub Bot commented on HDFS-16776:
---------------------------------------

tomscut opened a new pull request, #6964:
URL: https://github.com/apache/hadoop/pull/6964

   Reverts apache/hadoop#4901
   
   As a result of this change, maintenance can get stuck in two ways: 
   1. When extra replicas are scheduled in order to satisfy the storage policy.
   2. When, in an EC block group, more than 2 DataNodes are in the Entering Maintenance state and dfs.namenode.maintenance.ec.replication.min >= 2.
   
   
   Here's a more complex example. We recently did maintenance on a batch of nodes, including host4 and host8. 
   Configuration:
   ```
   dfs.namenode.maintenance.ec.replication.min=1
   storagePolicy=HDD
   ```
   
   hdfs fsck -fs -blockId blk_-9223372035217210640
   ```
   [blk_-9223372035217210640:DatanodeInfoWithStorage[host1:50010,DS-b9b2ea24-e69b-4a95-8a36-8b73b32003d3,DISK],
    blk_-9223372035217210639:DatanodeInfoWithStorage[host2:50010,DS-dfc9b308-a493-4d9b-b1c1-a134552f089f,SSD],
    blk_-9223372035217210638:DatanodeInfoWithStorage[host3:50010,DS-67669a8d-57d9-4825-8e1e-0e834d1fd47a,DISK],
    blk_-9223372035217210637:DatanodeInfoWithStorage[host4:50010,DS-6826ff2a-a6e5-4676-ad40-284099652670,DISK], Entering Maintenance
    blk_-9223372035217210636:DatanodeInfoWithStorage[host5:50010,DS-2e042fb1-dbc2-4ccf-ba43-da51a9ef2079,DISK],
    blk_-9223372035217210635:DatanodeInfoWithStorage[host6:50010,DS-005f2bce-eb46-432f-85b0-61919554692f,DISK],
    blk_-9223372035217210633:DatanodeInfoWithStorage[host7:50010,DS-cc11ce37-e121-4602-8688-ec7d45a0f276,DISK],
    blk_-9223372035217210632:DatanodeInfoWithStorage[host8:50010,DS-076891a0-4166-4584-9cea-13c853cbd667,DISK]] Entering Maintenance
   ```
   
   Datanode log:
   ```
   2024-07-25 12:46:42,680 INFO [Command processor] org.apache.hadoop.hdfs.server.datanode.DataNode: processErasureCodingTasks BlockECReconstructionInfo(
     Recovering BP-1956563710-x.x.x.x-1622796911268:blk_-9223372035217210640_105868369
     From: [host1:50010, host2:50010, host3:50010, host4:50010, host5:50010, host6:50010, host7:50010, host8:50010]
     To: [[host9:50010, host10:50010])
    Block Indices: [0, 1, 2, 3, 4, 5, 7, 8]
   2024-07-25 12:46:42,680 WARN [Command processor] org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to reconstruct striped block blk_-9223372035217210640_105868369
   java.lang.IllegalArgumentException: Reconstruction work gets too much targets.
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:141)
        at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedWriter.<init>(StripedWriter.java:86)
        at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.<init>(StripedBlockReconstructor.java:47)
        at org.apache.hadoop.hdfs.server.datanode.erasurecode.ErasureCodingWorker.processErasureCodingTasks(ErasureCodingWorker.java:134)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:797)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:680)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1327)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1365)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1301)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1288)
   ```
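
   The "too much targets" failure comes from a sanity check hit in the StripedWriter constructor (StripedWriter.java:86 in the trace above): a reconstruction task should not carry more targets than there are internal blocks actually missing from the group. Here the live block indices are [0, 1, 2, 3, 4, 5, 7, 8] out of 9 in an RS-6-3 group, so only index 6 is missing, yet two targets were scheduled. A simplified sketch of that style of validation (class and method names, and the RS-6-3 layout, are illustrative assumptions, not the actual Hadoop code):

   ```java
   /** Simplified model of the target-count sanity check for an EC group. */
   public class TargetCheck {
       static final int DATA_BLOCKS = 6;   // RS-6-3, as in the example above
       static final int PARITY_BLOCKS = 3;
       static final int TOTAL = DATA_BLOCKS + PARITY_BLOCKS;

       /** Count internal block indices absent from the live set. */
       static int missingCount(int[] liveIndices) {
           boolean[] present = new boolean[TOTAL];
           for (int i : liveIndices) present[i] = true;
           int missing = 0;
           for (boolean p : present) if (!p) missing++;
           return missing;
       }

       /** Mirrors the spirit of the Preconditions.checkArgument guard. */
       static boolean isValidTargetCount(int[] liveIndices, int targets) {
           return targets > 0 && targets <= missingCount(liveIndices);
       }

       public static void main(String[] args) {
           // Block indices from the DataNode log: [0, 1, 2, 3, 4, 5, 7, 8]
           int[] live = {0, 1, 2, 3, 4, 5, 7, 8};
           System.out.println("missing = " + missingCount(live));            // 1 (index 6)
           System.out.println("2 targets ok? " + isValidTargetCount(live, 2)); // false
           System.out.println("1 target ok?  " + isValidTargetCount(live, 1)); // true
       }
   }
   ```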
   
   In this block group, one internal block (blk_-9223372035217210639) is written on an SSD. 
   
   When doing maintenance, two targets need to be added: one to migrate the SSD block to HDD (in order to satisfy the storage policy), and the other to ensure that at least 7 blocks remain available during maintenance. 
   
   The maintenance process then gets stuck.
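
   The arithmetic behind the two targets can be sketched as follows (a back-of-the-envelope model only; the method name and the "at least 7 blocks" minimum are taken from the description above, not from NameNode code):

   ```java
   /** Rough count of targets scheduled for the example block group above. */
   public class MaintenanceTargets {
       static int scheduledTargets(int liveBlocks, int enteringMaintenance,
                                   int minAvailableDuringMaintenance,
                                   int wrongStoragePolicyBlocks) {
           // Internal blocks still readable once the maintenance nodes go down.
           int outsideMaintenance = liveBlocks - enteringMaintenance;        // 8 - 2 = 6
           // Extra copies needed to keep the required minimum available.
           int forMaintenance =
               Math.max(0, minAvailableDuringMaintenance - outsideMaintenance); // 7 - 6 = 1
           // One extra copy per block violating the storage policy (SSD -> HDD).
           return forMaintenance + wrongStoragePolicyBlocks;                 // 1 + 1 = 2
       }

       public static void main(String[] args) {
           // 8 live blocks, 2 entering maintenance, need 7 available, 1 SSD block
           System.out.println(scheduledTargets(8, 2, 7, 1)); // prints 2
       }
   }
   ```

   Two targets are scheduled, but only one internal block index is actually missing, which is what trips the DataNode-side check.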




> Erasure Coding: The length of targets should be checked when DN gets a 
> reconstruction task
> ------------------------------------------------------------------------------------------
>
>                 Key: HDFS-16776
>                 URL: https://issues.apache.org/jira/browse/HDFS-16776
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: erasure-coding
>    Affects Versions: 3.4.0, 3.3.5
>            Reporter: Ruinan Gu
>            Assignee: Ruinan Gu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.0, 3.3.5, 3.2.5
>
>
> The length of targets should be checked when the DN gets an EC reconstruction 
> task. For some reason (HDFS-14768, HDFS-16739), the length of targets can be 
> larger than additionalReplRequired, which causes some elements in targets to 
> get the default value 0. This may trigger the bug which leads to data 
> corruption, just like HDFS-14768.
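
The check the issue summary describes amounts to a defensive guard on the DataNode side before reconstruction starts; a hedged sketch of that style of guard (names are illustrative, this is not the actual patch):

```java
import java.util.BitSet;

/** Illustrative DN-side guard: reject reconstruction tasks whose target
 *  count exceeds the number of missing internal blocks in the group. */
public class ReconstructionGuard {
    static void validate(byte[] liveBlockIndices, int totalBlocks,
                         int targetCount) {
        BitSet present = new BitSet(totalBlocks);
        for (byte i : liveBlockIndices) present.set(i);
        int missing = totalBlocks - present.cardinality();
        if (targetCount > missing) {
            // Without a guard like this, surplus target slots default to
            // block index 0 and can corrupt data (see HDFS-14768).
            throw new IllegalArgumentException(
                "Reconstruction work gets too much targets: " + targetCount
                + " targets but only " + missing + " missing blocks");
        }
    }

    public static void main(String[] args) {
        byte[] live = {0, 1, 2, 3, 4, 5, 7, 8};  // index 6 missing, RS-6-3
        try {
            validate(live, 9, 2);                // 2 targets, 1 missing: rejected
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        validate(live, 9, 1);                    // 1 target for 1 missing: OK
        System.out.println("1 target accepted");
    }
}
```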



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
