[ https://issues.apache.org/jira/browse/HDFS-16420?focusedWorklogId=708046&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-708046 ]
ASF GitHub Bot logged work on HDFS-16420:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 13/Jan/22 03:26
            Start Date: 13/Jan/22 03:26
    Worklog Time Spent: 10m
      Work Description: Jackson-Wang-7 opened a new pull request #3880:
URL: https://github.com/apache/hadoop/pull/3880

…y striped blocks.

### Description of PR
If two or more internal blocks are located in the same rack, a unique data block can be added to the `exactlyOne` processing list when a redundant striped block is chosen for deletion.

`storages.remove(cur); if (storages.isEmpty()) { rackMap.remove(rack); } if (moreThanOne.remove(cur)) { if (storages.size() == 1) { **final DatanodeStorageInfo remaining = storages.get(0); moreThanOne.remove(remaining); exactlyOne.add(remaining);** } } else { exactlyOne.remove(cur); }`

In this case, the `moreThanOne` list may not contain the remaining block. The remaining block should not be deleted, but it is added to the `exactlyOne` list and is then deleted.

### How was this patch tested?
The test case uses EC 6+3 with the following block placement:

blk_-xxx009 in rack /d1/r1
blk_-xxx008 in rack /d1/r1
blk_-xxx008 in rack /d1/r2
blk_-xxx008 in rack /d1/r3
blk_-xxx007 in rack /d1/r4
blk_-xxx006 in rack /d2/r1
blk_-xxx005 in rack /d2/r2
blk_-xxx004 in rack /d2/r3
blk_-xxx003 in rack /d2/r4
blk_-xxx002 in rack /d2/r5
blk_-xxx001 in rack /d2/r6

After the full block report (FBR) is triggered and the redundant data blocks are added to the invalidate list, blk_-xxx008 in rack /d1/r1 and blk_-xxx008 in rack /d1/r2 need to be deleted, and blk_-xxx009 is HEALTHY.

### For code changes:
- [ ] Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
- [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
- [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
-------------------

            Worklog Id:     (was: 708046)
    Remaining Estimate: 0h
            Time Spent: 10m

> ec + balancer may cause missing block
> -------------------------------------
>
>                 Key: HDFS-16420
>                 URL: https://issues.apache.org/jira/browse/HDFS-16420
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: qinyuren
>            Priority: Major
>         Attachments: image-2022-01-10-17-31-35-910.png, image-2022-01-10-17-32-56-981.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> We have a problem similar to the one described in HDFS-16297.
> In our cluster, we used {color:#de350b}ec(6+3) + balancer with version 3.1.0{color}, and the {color:#de350b}missing block{color} happened.
> We got the info for the block (blk_-9223372036824119008) from fsck: only 5 live replicas and multiple redundant replicas.
> {code:java}
> blk_-9223372036824119008_220037616 len=133370338 MISSING!
Live_repl=5
> blk_-9223372036824119007:DatanodeInfoWithStorage,
> blk_-9223372036824119002:DatanodeInfoWithStorage,
> blk_-9223372036824119001:DatanodeInfoWithStorage,
> blk_-9223372036824119000:DatanodeInfoWithStorage,
> blk_-9223372036824119004:DatanodeInfoWithStorage,
> blk_-9223372036824119004:DatanodeInfoWithStorage,
> blk_-9223372036824119004:DatanodeInfoWithStorage,
> blk_-9223372036824119004:DatanodeInfoWithStorage,
> blk_-9223372036824119004:DatanodeInfoWithStorage,
> blk_-9223372036824119004:DatanodeInfoWithStorage
> {code}
>
> We searched the logs of all datanodes and found that the internal blocks of blk_-9223372036824119008 were deleted at almost the same time.
>
> {code:java}
> 08:15:58,550 INFO impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:run(333)) - Deleted BP-1606066499-xxxx-1606188026755 blk_-9223372036824119008_220037616 URI file:/data15/hadoop/hdfs/data/current/BP-1606066499-xxxx-1606188026755/current/finalized/subdir19/subdir9/blk_-9223372036824119008
> 08:16:21,214 INFO impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:run(333)) - Deleted BP-1606066499-xxxx-1606188026755 blk_-9223372036824119006_220037616 URI file:/data4/hadoop/hdfs/data/current/BP-1606066499-xxxx-1606188026755/current/finalized/subdir19/subdir9/blk_-9223372036824119006
> 08:16:55,737 INFO impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:run(333)) - Deleted BP-1606066499-xxxx-1606188026755 blk_-9223372036824119005_220037616 URI file:/data2/hadoop/hdfs/data/current/BP-1606066499-xxxx-1606188026755/current/finalized/subdir19/subdir9/blk_-9223372036824119005
> {code}
>
> The number of deletions per internal block during 08:15-08:17 is as follows:
> ||internal block||index||delete num||
> |blk_-9223372036824119008|0|1|
> |blk_-9223372036824119006|2|1|
> |blk_-9223372036824119005|3|1|
> |blk_-9223372036824119004|4|50|
> |blk_-9223372036824119003|5|1|
> |blk_-9223372036824119000|8|1|
>
> {color:#ff0000}During 08:15 to 08:17, we restarted 2 datanodes and triggered a full block report immediately.{color}
>
> There are 2 questions:
> 1. Why are there so many replicas of this block?
> 2. Why was an internal block with only one copy deleted?
>
> The reasons for the first problem may be as follows:
> 1. We set the full block report period of some datanodes to 168 hours.
> 2. We performed a namenode HA operation.
> 3. After the namenode HA operation, the state of the storage became {color:#ff0000}stale{color}, and the state does not change until the next full block report.
> 4. The balancer copied the replica without deleting the replica from the source node, because the source node had stale storage, and the request was put into {color:#ff0000}postponedMisreplicatedBlocks{color}.
> 5. The balancer continued to copy the replica, eventually resulting in multiple copies of a replica.
> !image-2022-01-10-17-31-35-910.png|width=642,height=269!
> The set {color:#ff0000}rescannedMisreplicatedBlocks{color} contains many blocks to remove.
> !image-2022-01-10-17-32-56-981.png|width=745,height=124!
> As for the second question, we checked the code of {color:#de350b}processExtraRedundancyBlock{color}, but didn't find any problem.
>

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
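
The failure mode described in the pull request can be reproduced in isolation. The following is a minimal sketch, not the Hadoop implementation: the class name `ExcessReplicaSketch`, the `guarded` flag, and the use of plain strings in place of `DatanodeStorageInfo` are all hypothetical. It also assumes that only the storages holding the duplicated internal block (blk_-xxx008) are deletion candidates, while the rack map covers every storage. The guard shown here, promoting the remaining storage to `exactlyOne` only when it was actually removed from `moreThanOne`, is one possible way to avoid the problem and may differ from the change made in the pull request.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone model of the fragment quoted in the PR description.
// Storages are plain strings of the form "block@rack".
final class ExcessReplicaSketch {
    static final String UNIQUE = "blk_-xxx009@/d1/r1";

    // Removes the chosen replica `cur` from its rack and from the candidate lists.
    // With guarded == false this mirrors the quoted fragment; with guarded == true
    // the remaining storage is promoted only if it was a candidate itself.
    static void adjustSets(Map<String, List<String>> rackMap, List<String> moreThanOne,
                           List<String> exactlyOne, String rack, String cur,
                           boolean guarded) {
        List<String> storages = rackMap.get(rack);
        storages.remove(cur);
        if (storages.isEmpty()) {
            rackMap.remove(rack);
        }
        if (moreThanOne.remove(cur)) {
            if (storages.size() == 1) {
                String remaining = storages.get(0);
                boolean wasCandidate = moreThanOne.remove(remaining);
                if (wasCandidate || !guarded) {
                    exactlyOne.add(remaining);
                }
            }
        } else {
            exactlyOne.remove(cur);
        }
    }

    // Builds the /d1 part of the test placement and deletes blk_-xxx008 from /d1/r1.
    // Returns the resulting exactlyOne list of deletion candidates.
    static List<String> exactlyOneAfterDeletion(boolean guarded) {
        Map<String, List<String>> rackMap = new HashMap<>();
        rackMap.put("/d1/r1", new ArrayList<>(Arrays.asList(UNIQUE, "blk_-xxx008@/d1/r1")));
        rackMap.put("/d1/r2", new ArrayList<>(Arrays.asList("blk_-xxx008@/d1/r2")));
        rackMap.put("/d1/r3", new ArrayList<>(Arrays.asList("blk_-xxx008@/d1/r3")));
        // Assumption: only copies of the duplicated block are candidates.
        List<String> moreThanOne = new ArrayList<>(Arrays.asList("blk_-xxx008@/d1/r1"));
        List<String> exactlyOne = new ArrayList<>(
            Arrays.asList("blk_-xxx008@/d1/r2", "blk_-xxx008@/d1/r3"));
        adjustSets(rackMap, moreThanOne, exactlyOne, "/d1/r1", "blk_-xxx008@/d1/r1", guarded);
        return exactlyOne;
    }

    public static void main(String[] args) {
        System.out.println("unguarded: " + exactlyOneAfterDeletion(false));
        System.out.println("guarded:   " + exactlyOneAfterDeletion(true));
    }
}
```

In the unguarded run, the unique block blk_-xxx009 ends up in `exactlyOne` even though it was never a candidate, which matches the behavior the pull request describes. In the guarded run, only the remaining copies of blk_-xxx008 stay eligible for deletion.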