[ https://issues.apache.org/jira/browse/HDFS-16776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17868843#comment-17868843 ]
ASF GitHub Bot commented on HDFS-16776:
---------------------------------------

tomscut opened a new pull request, #6964:
URL: https://github.com/apache/hadoop/pull/6964

Reverts apache/hadoop#4901

As a result of that change, maintenance can get stuck in two ways:
1. When an extra replica is needed to satisfy the storage policy.
2. When, in an EC block group, more than two DataNodes are in the Entering Maintenance state and dfs.namenode.maintenance.ec.replication.min >= 2.

Here's a more complex example. We recently did maintenance on a batch of nodes, including host4 and host8.

Configuration:
```
dfs.namenode.maintenance.ec.replication.min=1
storagePolicy=HDD
```

hdfs fsck -fs -blockId blk_-9223372035217210640
```
[blk_-9223372035217210640:DatanodeInfoWithStorage[host1:50010,DS-b9b2ea24-e69b-4a95-8a36-8b73b32003d3,DISK],
 blk_-9223372035217210639:DatanodeInfoWithStorage[host2:50010,DS-dfc9b308-a493-4d9b-b1c1-a134552f089f,SSD],
 blk_-9223372035217210638:DatanodeInfoWithStorage[host3:50010,DS-67669a8d-57d9-4825-8e1e-0e834d1fd47a,DISK],
 blk_-9223372035217210637:DatanodeInfoWithStorage[host4:50010,DS-6826ff2a-a6e5-4676-ad40-284099652670,DISK], Entering Maintenance
 blk_-9223372035217210636:DatanodeInfoWithStorage[host5:50010,DS-2e042fb1-dbc2-4ccf-ba43-da51a9ef2079,DISK],
 blk_-9223372035217210635:DatanodeInfoWithStorage[host6:50010,DS-005f2bce-eb46-432f-85b0-61919554692f,DISK],
 blk_-9223372035217210633:DatanodeInfoWithStorage[host7:50010,DS-cc11ce37-e121-4602-8688-ec7d45a0f276,DISK],
 blk_-9223372035217210632:DatanodeInfoWithStorage[host8:50010,DS-076891a0-4166-4584-9cea-13c853cbd667,DISK]] Entering Maintenance
```

Datanode log:
```
2024-07-25 12:46:42,680 INFO [Command processor] org.apache.hadoop.hdfs.server.datanode.DataNode: processErasureCodingTasks BlockECReconstructionInfo(
  Recovering BP-1956563710-x.x.x.x-1622796911268:blk_-9223372035217210640_105868369
  From: [host1:50010, host2:50010, host3:50010, host4:50010, host5:50010, host6:50010, host7:50010, host8:50010]
  To: [[host9:50010, host10:50010])
  Block Indices: [0, 1, 2, 3, 4, 5, 7, 8]
2024-07-25 12:46:42,680 WARN [Command processor] org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to reconstruct striped block blk_-9223372035217210640_105868369
java.lang.IllegalArgumentException: Reconstruction work gets too much targets.
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:141)
        at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedWriter.<init>(StripedWriter.java:86)
        at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.<init>(StripedBlockReconstructor.java:47)
        at org.apache.hadoop.hdfs.server.datanode.erasurecode.ErasureCodingWorker.processErasureCodingTasks(ErasureCodingWorker.java:134)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:797)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:680)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1327)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1365)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1301)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1288)
```

In this block group, one block is written on an SSD (blk_-9223372035217210639). During maintenance, two target blocks need to be added: one to migrate the SSD block to HDD (to satisfy the storage policy), and the other to ensure at least 7 blocks during maintenance. The reconstruction task therefore carries more targets than required, the DataNode rejects it with the exception above, and the maintenance process gets stuck.
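The IllegalArgumentException above comes from a precondition in the StripedWriter constructor that rejects a reconstruction task carrying more targets than replicas actually required. A minimal sketch of that style of guard, in the spirit of the stack trace (the class and method names here are illustrative, not Hadoop's actual code):

```java
// Illustrative sketch of the guard that fails in the log above. The real
// check sits in org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedWriter;
// TargetCheck/checkTargets are hypothetical names for this demo.
public class TargetCheck {

    /** Rejects a reconstruction task that carries too many targets. */
    static void checkTargets(int additionalReplRequired, int numTargets) {
        if (numTargets > additionalReplRequired) {
            throw new IllegalArgumentException(
                "Reconstruction work gets too much targets.");
        }
    }

    public static void main(String[] args) {
        checkTargets(2, 2);  // fine: exactly as many targets as required
        boolean rejected = false;
        try {
            // The scenario from the log: one missing internal block, but two
            // targets (storage-policy migration plus the maintenance copy).
            checkTargets(1, 2);
        } catch (IllegalArgumentException e) {
            rejected = true;
        }
        System.out.println("oversized target list rejected: " + rejected);
    }
}
```

Because the DataNode only rejects the task and the NameNode keeps scheduling the same oversized work, the block never becomes fully maintained and the node stays in Entering Maintenance.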
> Erasure Coding: The length of targets should be checked when DN gets a
> reconstruction task
> ------------------------------------------------------------------------------------------
>
>                 Key: HDFS-16776
>                 URL: https://issues.apache.org/jira/browse/HDFS-16776
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: erasure-coding
>    Affects Versions: 3.4.0, 3.3.5
>            Reporter: Ruinan Gu
>            Assignee: Ruinan Gu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.0, 3.3.5, 3.2.5
>
>
> The length of targets should be checked when a DN gets an EC reconstruction
> task. For some reasons (HDFS-14768, HDFS-16739), the length of targets can be
> larger than additionalReplRequired, which causes some elements in targets to
> get the default value 0. This may trigger the bug that leads to data
> corruption, just like HDFS-14768.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
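The quoted description notes that when targets is longer than additionalReplRequired, the unfilled slots keep Java's default value 0. A hedged sketch of that mechanism (not Hadoop code; the array and index values are made up for illustration):

```java
import java.util.Arrays;

// Demonstrates why an oversized target array is dangerous: Java zero-fills
// arrays, so unfilled slots silently claim internal block index 0.
public class DefaultIndexDemo {
    public static void main(String[] args) {
        int additionalReplRequired = 1;
        int numTargets = 2;  // one target too many, as in the bug report

        // One slot per target, but only additionalReplRequired slots get set.
        short[] targetIndices = new short[numTargets];
        for (int i = 0; i < additionalReplRequired; i++) {
            targetIndices[i] = 6;  // suppose internal block 6 is the missing one
        }

        // The leftover slot still holds 0, i.e. it asserts the extra target
        // should receive internal block 0 -- the wrong data would be written.
        System.out.println(Arrays.toString(targetIndices));  // [6, 0]
    }
}
```

That silent zero is why HDFS-14768-style corruption can follow, and why validating targets.length on the DataNode side is the safety net this issue adds.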