[ https://issues.apache.org/jira/browse/HDFS-16420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takanobu Asanuma updated HDFS-16420:
------------------------------------
    Fix Version/s: 3.2.3
                       (was: 3.2.4)

> Avoid deleting unique data blocks when deleting redundancy striped blocks
> -------------------------------------------------------------------------
>
>                 Key: HDFS-16420
>                 URL: https://issues.apache.org/jira/browse/HDFS-16420
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: liubingxing
>            Assignee: Jackson Wang
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 3.4.0, 3.2.3, 3.3.3
>
>         Attachments: image-2022-01-10-17-31-35-910.png, image-2022-01-10-17-32-56-981.png
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> We have a problem similar to the one described in HDFS-16297.
> In our cluster, we used {color:#de350b}ec(6+3) + balancer with version 3.1.0{color}, and a {color:#de350b}missing block{color} occurred.
> We got the info for the block (blk_-9223372036824119008) from fsck: only 5 live replicas and multiple redundant replicas.
> {code:java}
> blk_-9223372036824119008_220037616 len=133370338 MISSING! Live_repl=5
> blk_-9223372036824119007:DatanodeInfoWithStorage,
> blk_-9223372036824119002:DatanodeInfoWithStorage,
> blk_-9223372036824119001:DatanodeInfoWithStorage,
> blk_-9223372036824119000:DatanodeInfoWithStorage,
> blk_-9223372036824119004:DatanodeInfoWithStorage,
> blk_-9223372036824119004:DatanodeInfoWithStorage,
> blk_-9223372036824119004:DatanodeInfoWithStorage,
> blk_-9223372036824119004:DatanodeInfoWithStorage,
> blk_-9223372036824119004:DatanodeInfoWithStorage,
> blk_-9223372036824119004:DatanodeInfoWithStorage
> {code}
>
> We searched the logs of all datanodes and found that the internal blocks of blk_-9223372036824119008 were deleted at almost the same time.
>
> {code:java}
> 08:15:58,550 INFO impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:run(333)) - Deleted BP-1606066499-xxxx-1606188026755 blk_-9223372036824119008_220037616 URI file:/data15/hadoop/hdfs/data/current/BP-1606066499-xxxx-1606188026755/current/finalized/subdir19/subdir9/blk_-9223372036824119008
> 08:16:21,214 INFO impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:run(333)) - Deleted BP-1606066499-xxxx-1606188026755 blk_-9223372036824119006_220037616 URI file:/data4/hadoop/hdfs/data/current/BP-1606066499-xxxx-1606188026755/current/finalized/subdir19/subdir9/blk_-9223372036824119006
> 08:16:55,737 INFO impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:run(333)) - Deleted BP-1606066499-xxxx-1606188026755 blk_-9223372036824119005_220037616 URI file:/data2/hadoop/hdfs/data/current/BP-1606066499-xxxx-1606188026755/current/finalized/subdir19/subdir9/blk_-9223372036824119005
> {code}
>
> The total numbers of internal blocks deleted between 08:15 and 08:17 are as follows:
> ||internal block||index||delete num||
> |blk_-9223372036824119008|0|1|
> |blk_-9223372036824119006|2|1|
> |blk_-9223372036824119005|3|1|
> |blk_-9223372036824119004|4|50|
> |blk_-9223372036824119003|5|1|
> |blk_-9223372036824119000|8|1|
>
> {color:#ff0000}Between 08:15 and 08:17, we restarted 2 datanodes and triggered a full block report immediately.{color}
>
> There are 2 questions:
> 1. Why are there so many replicas of this block?
> 2. Why was an internal block with only one copy deleted?
>
> The reasons for the first problem may be as follows:
> 1. We set the full block report period of some datanodes to 168 hours.
> 2. We performed a namenode HA operation.
> 3. After the namenode HA operation, the state of the storages became {color:#ff0000}stale{color}, and the state did not change until the next full block report.
> 4. The balancer copied the replica without deleting it from the source node, because the source node had stale storage, and the request was put into {color:#ff0000}postponedMisreplicatedBlocks{color}.
> 5. The balancer continued to copy the replica, eventually resulting in multiple copies of the same replica.
> !image-2022-01-10-17-31-35-910.png|width=642,height=269!
> The set {color:#ff0000}rescannedMisreplicatedBlocks{color} had a very large number of blocks to remove.
> !image-2022-01-10-17-32-56-981.png|width=745,height=124!
>
> As for the second question, we checked the code of {color:#de350b}processExtraRedundancyBlock{color}, but didn't find any problem.
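
To make the fsck output and the index table in the description easier to check, here is a minimal Java sketch that derives the internal block index from each block ID and counts how many distinct indices are still live. It assumes the HDFS convention that the low 4 bits of a striped block ID encode the internal block index, which matches the index column reported above. The class name `StripedBlockIndexCheck` and its helper methods are hypothetical and are not part of the Hadoop code base.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for reading fsck output of a striped block group; not Hadoop code.
class StripedBlockIndexCheck {
    // Assumption: the low 4 bits of a striped block ID hold the internal block index.
    static final long BLOCK_GROUP_INDEX_MASK = 15L;
    // ec(6+3) from the report: 6 data units, so 6 distinct internal blocks are needed.
    static final int DATA_UNITS = 6;

    // Index of an internal block inside its block group.
    static int blockIndex(long internalBlockId) {
        return (int) (internalBlockId & BLOCK_GROUP_INDEX_MASK);
    }

    // ID of the block group that an internal block belongs to.
    static long blockGroupId(long internalBlockId) {
        return internalBlockId & ~BLOCK_GROUP_INDEX_MASK;
    }

    // Distinct internal block indices among the reported replicas.
    static Set<Integer> distinctIndices(List<Long> internalBlockIds) {
        Set<Integer> indices = new TreeSet<>();
        for (long id : internalBlockIds) {
            indices.add(blockIndex(id));
        }
        return indices;
    }

    public static void main(String[] args) {
        // Replica IDs copied from the fsck output in the description.
        List<Long> reported = List.of(
            -9223372036824119007L, -9223372036824119002L,
            -9223372036824119001L, -9223372036824119000L,
            -9223372036824119004L, -9223372036824119004L,
            -9223372036824119004L, -9223372036824119004L,
            -9223372036824119004L, -9223372036824119004L);

        Set<Integer> live = distinctIndices(reported);
        System.out.println("block group: blk_" + blockGroupId(reported.get(0)));
        System.out.println("live indices: " + live);
        System.out.println("recoverable: " + (live.size() >= DATA_UNITS));
    }
}
```

With the IDs from the report, the ten listed replicas cover only indices 1, 4, 6, 7 and 8. Six of the replicas are copies of index 4, so the group has fewer than the 6 distinct internal blocks that ec(6+3) needs, which is consistent with fsck reporting `MISSING!` and `Live_repl=5`.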