[
https://issues.apache.org/jira/browse/HDFS-16420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Takanobu Asanuma resolved HDFS-16420.
-------------------------------------
Fix Version/s: 3.4.0
3.2.4
3.3.3
Resolution: Fixed
> Avoid deleting unique data blocks when deleting redundancy striped blocks
> -------------------------------------------------------------------------
>
> Key: HDFS-16420
> URL: https://issues.apache.org/jira/browse/HDFS-16420
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: liubingxing
> Assignee: Jackson Wang
> Priority: Critical
> Labels: pull-request-available
> Fix For: 3.4.0, 3.2.4, 3.3.3
>
> Attachments: image-2022-01-10-17-31-35-910.png,
> image-2022-01-10-17-32-56-981.png
>
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
> We hit a problem similar to the one described in HDFS-16297.
> In our cluster, we used {color:#de350b}ec(6+3) + balancer with version
> 3.1.0{color}, and a {color:#de350b}missing block{color} occurred.
> fsck reported that the block (blk_-9223372036824119008) had only 5 live
> replicas and multiple redundant replicas.
> {code:java}
> blk_-9223372036824119008_220037616 len=133370338 MISSING! Live_repl=5
> blk_-9223372036824119007:DatanodeInfoWithStorage,
> blk_-9223372036824119002:DatanodeInfoWithStorage,
> blk_-9223372036824119001:DatanodeInfoWithStorage,
> blk_-9223372036824119000:DatanodeInfoWithStorage,
> blk_-9223372036824119004:DatanodeInfoWithStorage,
> blk_-9223372036824119004:DatanodeInfoWithStorage,
> blk_-9223372036824119004:DatanodeInfoWithStorage,
> blk_-9223372036824119004:DatanodeInfoWithStorage,
> blk_-9223372036824119004:DatanodeInfoWithStorage,
> blk_-9223372036824119004:DatanodeInfoWithStorage {code}
>
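To make the fsck listing above easier to read, here is a minimal sketch that maps each internal block ID to its index in the block group and lists the indices with no reported copy. It assumes the internal block index is stored in the low 4 bits of a striped block ID, which agrees with the index column in the report's table. The class and method names are hypothetical and are not Hadoop APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Sketch only: derive internal block indices for an ec(6+3) block group. */
final class StripedIndexSketch {
    // Assumption: the low 4 bits of a striped block ID hold the internal block index.
    static final long INDEX_MASK = 0xFL;
    // 6 data units + 3 parity units for ec(6+3).
    static final int TOTAL_UNITS = 9;

    /** Index of an internal block within its block group. */
    static int indexOf(long blockId) {
        return (int) (blockId & INDEX_MASK);
    }

    /** ID of the block group that an internal block belongs to. */
    static long groupIdOf(long blockId) {
        return blockId & ~INDEX_MASK;
    }

    /** Indices in [0, TOTAL_UNITS) for which no internal block was reported. */
    static List<Integer> missingIndices(long[] reportedBlockIds) {
        TreeSet<Integer> live = new TreeSet<>();
        for (long id : reportedBlockIds) {
            live.add(indexOf(id));
        }
        List<Integer> missing = new ArrayList<>();
        for (int i = 0; i < TOTAL_UNITS; i++) {
            if (!live.contains(i)) {
                missing.add(i);
            }
        }
        return missing;
    }

    // The ten internal block IDs listed in the fsck output quoted above.
    static final long[] FSCK_REPORTED = {
        -9223372036824119007L, -9223372036824119002L, -9223372036824119001L,
        -9223372036824119000L, -9223372036824119004L, -9223372036824119004L,
        -9223372036824119004L, -9223372036824119004L, -9223372036824119004L,
        -9223372036824119004L
    };
}
```

Under that assumption, `missingIndices(FSCK_REPORTED)` yields indices 0, 2, 3 and 5. Four of the nine units are gone, one more than the three parity units of ec(6+3) can cover, which is consistent with Live_repl=5 and the MISSING status.
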
> We searched the logs of all datanodes and found that the internal blocks of
> blk_-9223372036824119008 were deleted at almost the same time.
>
> {code:java}
> 08:15:58,550 INFO impl.FsDatasetAsyncDiskService
> (FsDatasetAsyncDiskService.java:run(333)) - Deleted
> BP-1606066499-xxxx-1606188026755 blk_-9223372036824119008_220037616 URI
> file:/data15/hadoop/hdfs/data/current/BP-1606066499-xxxx-1606188026755/current/finalized/subdir19/subdir9/blk_-9223372036824119008
> 08:16:21,214 INFO impl.FsDatasetAsyncDiskService
> (FsDatasetAsyncDiskService.java:run(333)) - Deleted
> BP-1606066499-xxxx-1606188026755 blk_-9223372036824119006_220037616 URI
> file:/data4/hadoop/hdfs/data/current/BP-1606066499-xxxx-1606188026755/current/finalized/subdir19/subdir9/blk_-9223372036824119006
> 08:16:55,737 INFO impl.FsDatasetAsyncDiskService
> (FsDatasetAsyncDiskService.java:run(333)) - Deleted
> BP-1606066499-xxxx-1606188026755 blk_-9223372036824119005_220037616 URI
> file:/data2/hadoop/hdfs/data/current/BP-1606066499-xxxx-1606188026755/current/finalized/subdir19/subdir9/blk_-9223372036824119005
> {code}
>
> The total number of internal blocks deleted between 08:15 and 08:17 is as follows:
> ||internal block||index||delete num||
> |blk_-9223372036824119008|0|1|
> |blk_-9223372036824119006|2|1|
> |blk_-9223372036824119005|3|1|
> |blk_-9223372036824119004|4|50|
> |blk_-9223372036824119003|5|1|
> |blk_-9223372036824119000|8|1|
>
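A tally like the one in the table can be reproduced from the datanode logs. The following is a rough sketch, assuming each "Deleted" entry has been joined onto a single line in the format quoted earlier. The class name and the regular expression are my own and are not part of Hadoop.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch only: count "Deleted" log entries per internal block ID. */
final class DeleteTallySketch {
    // Matches "Deleted <block pool id> blk_<block id>_<generation stamp>",
    // following the FsDatasetAsyncDiskService lines quoted in the report.
    private static final Pattern DELETED =
        Pattern.compile("Deleted\\s+\\S+\\s+blk_(-?\\d+)_\\d+");

    /** Returns a map from block ID to the number of delete entries seen. */
    static Map<Long, Integer> countDeletes(Iterable<String> logLines) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = DELETED.matcher(line);
            if (m.find()) {
                counts.merge(Long.parseLong(m.group(1)), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Feeding it the merged logs of all datanodes for the 08:15-08:17 window should give one count per internal block ID, which can then be compared against the number of copies fsck reports for the same index.
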
> {color:#ff0000}Between 08:15 and 08:17, we restarted 2 datanodes and triggered
> a full block report immediately.{color}
>
> There are 2 questions:
> 1. Why are there so many replicas of this block?
> 2. Why was an internal block with only one copy deleted?
> The possible reasons for the first problem are as follows:
> 1. We set the full block report period of some datanodes to 168 hours.
> 2. We performed a namenode HA operation.
> 3. After the namenode HA operation, the state of the storage became
> {color:#ff0000}stale{color}, and the state did not change until the next full
> block report.
> 4. The balancer copied the replica without deleting it from the source
> node, because the source node had stale storage, and the request was put
> into {color:#ff0000}postponedMisreplicatedBlocks{color}.
> 5. The balancer continued to copy the replica, eventually resulting in
> multiple copies of it.
> !image-2022-01-10-17-31-35-910.png|width=642,height=269!
> The set {color:#ff0000}rescannedMisreplicatedBlocks{color} contains a large
> number of blocks to remove.
> !image-2022-01-10-17-32-56-981.png|width=745,height=124!
> As for the second question, we checked the code of
> {color:#de350b}processExtraRedundancyBlock{color}, but didn't find any
> problem.
>
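The invariant named in the issue title can be illustrated with a small sketch: when trimming excess internal blocks of a striped block group, only duplicates of an index are candidates for deletion, and the last copy of any index is always kept. This is not the actual patch and not the logic of processExtraRedundancyBlock. The `Replica` type and `chooseDeletable` method are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: pick excess internal blocks without removing a unique index. */
final class ExcessSelectionSketch {
    /** Hypothetical record of one stored internal block. */
    static final class Replica {
        final int index;      // internal block index within the block group
        final String storage; // identifier of the storage holding this copy

        Replica(int index, String storage) {
            this.index = index;
            this.storage = storage;
        }
    }

    /**
     * Returns the replicas that may be deleted. The first copy seen for each
     * index is kept, so an index with a single copy is never selected.
     */
    static List<Replica> chooseDeletable(List<Replica> replicas) {
        Map<Integer, Integer> seenPerIndex = new HashMap<>();
        List<Replica> deletable = new ArrayList<>();
        for (Replica r : replicas) {
            int count = seenPerIndex.merge(r.index, 1, Integer::sum);
            if (count > 1) {
                deletable.add(r); // a duplicate of an index already kept
            }
        }
        return deletable;
    }
}
```

Applied to a group like the one in the report, only the surplus copies of index 4 would be returned, while the single copies at indices 0, 2, 3, 5 and 8 would stay in place. A real implementation would also have to choose which duplicate to keep based on placement policy and storage state, which this sketch ignores.
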