[ https://issues.apache.org/jira/browse/HDFS-8881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhe Zhang updated HDFS-8881:
----------------------------
    Parent Issue: HDFS-8031  (was: HDFS-7285)

> Erasure Coding: internal blocks got missed and got over-replicated at the same time
> ------------------------------------------------------------------------------------
>
>                 Key: HDFS-8881
>                 URL: https://issues.apache.org/jira/browse/HDFS-8881
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Walter Su
>            Assignee: Walter Su
>         Attachments: HDFS-8881.00.patch
>
>
> We know the Repl check depends on {{BlockManager#countNodes()}}, but countNodes() has limitations for striped block groups.
> *One* missing internal block will be caught by the Repl check and handled by ReplicationMonitor.
> *One* over-replicated internal block will be caught by the Repl check and handled by processOverReplicatedBlocks.
> *One* missing internal block and *two* over-replicated internal blocks *at the same time* will be caught by the Repl check, handled by processOverReplicatedBlocks, and later by ReplicationMonitor.
> *One* missing internal block and *one* over-replicated internal block *at the same time* will *NOT* be caught by the Repl check.
> "At the same time" means the missing internal block cannot be recovered, while another internal block stays over-replicated anyway. For example:
> Scenario A:
> 1. Blocks #0 and #1 are reported missing.
> 2. A new #1 gets recovered.
> 3. The old #1 comes back, and the recovery work for #0 fails.
> Scenario B:
> 1. A DN that holds #1 is decommissioned or goes dead.
> 2. Block #0 is reported missing.
> 3. The DN holding #1 is recommissioned, and the recovery work for #0 fails.
> In the end, the block group's internal block indices are \[1, 1, 2, 3, 4, 5, 6, 7, 8\], assuming a 6+3 schema. The client always needs to decode #0 unless the block group gets handled.
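
Not part of the original report: a minimal, hypothetical sketch to make the counting gap concrete. It contrasts a naive count of live storages with a count of distinct internal block indices for the \[1, 1, 2, 3, 4, 5, 6, 7, 8\] end state under the assumed 6+3 schema. It is a simplified illustration, not the actual {{BlockManager#countNodes()}} logic, and names such as StripedCountSketch and reportedIndices are invented for the example.

{code:java}
import java.util.HashSet;
import java.util.Set;

// Simplified illustration (not the real BlockManager code): why a plain
// live-storage count cannot tell "all internal blocks present" apart from
// "one index missing and another index duplicated".
public class StripedCountSketch {
  public static void main(String[] args) {
    final int dataBlocks = 6;     // 6+3 schema assumed in the description
    final int parityBlocks = 3;
    final int totalBlocks = dataBlocks + parityBlocks;

    // Internal block indices reported for the end state in the description:
    // index 0 is missing, index 1 is stored twice.
    int[] reportedIndices = {1, 1, 2, 3, 4, 5, 6, 7, 8};

    // Naive check: 9 live storages >= 9 expected, so the group looks healthy.
    boolean naiveLooksHealthy = reportedIndices.length >= totalBlocks;

    // Index-aware check: only 8 distinct indices are actually present.
    Set<Integer> distinct = new HashSet<>();
    for (int idx : reportedIndices) {
      distinct.add(idx);
    }
    boolean missingIndex = distinct.size() < totalBlocks;
    boolean overReplicated = reportedIndices.length > distinct.size();

    System.out.println("naive check healthy = " + naiveLooksHealthy); // true
    System.out.println("missing index?      = " + missingIndex);      // true
    System.out.println("over-replicated?    = " + overReplicated);    // true
  }
}
{code}

The point of the sketch is only that a redundancy check for striped block groups has to count distinct internal block indices rather than total live replicas; otherwise the end state above is reported as healthy.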