[ 
https://issues.apache.org/jira/browse/HDFS-9837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159485#comment-15159485
 ] 

Jing Zhao commented on HDFS-9837:
---------------------------------

Thanks for the comments, [~rakeshr]. 

bq. Should be StoredReplicaState.DECOMMISSIONED

Good catch. Will fix it.

bq. I could see BlockInfoStriped#findSlot() is expanding the capacity beyond
#getTotalBlockNum. I think while counting the replicas it is checking only up
to totalBlock, so it will miss a few replica checks.

We're using a BitSet here, so we're not tracking all the storages (whose total
number can exceed 9) but all the possible internal blocks. We only need to
make sure the BitSet covers the block ID range.
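
To illustrate the point, here is a minimal standalone sketch, not the patch
code. The class {{InternalBlockCountSketch}}, the {{Counts}} holder, and the
{{count}} method are hypothetical names made up for this example, and it
assumes each storage reports the index of the internal block it holds. The
BitSet is sized by the number of possible internal blocks (9 in the example
from the description), so the number of storages can be larger without
affecting the result.

```java
import java.util.BitSet;

/** Hypothetical sketch; not the actual BlockManager or BlockInfoStriped code. */
class InternalBlockCountSketch {

  /** Result of one counting pass over the storages of a striped block. */
  static final class Counts {
    final int live;       // distinct internal blocks seen
    final int redundant;  // storages holding an already-seen internal block

    Counts(int live, int redundant) {
      this.live = live;
      this.redundant = redundant;
    }
  }

  /**
   * @param storageBlockIndices internal block index reported by each storage;
   *                            its length may exceed totalBlockNum
   * @param totalBlockNum       number of possible internal blocks (data + parity)
   */
  static Counts count(int[] storageBlockIndices, int totalBlockNum) {
    // One bit per possible internal block, not one per storage.
    BitSet seen = new BitSet(totalBlockNum);
    int redundant = 0;
    for (int idx : storageBlockIndices) {
      if (idx < 0 || idx >= totalBlockNum) {
        throw new IllegalArgumentException("bad internal block index: " + idx);
      }
      if (seen.get(idx)) {
        redundant++;      // duplicate of an internal block already counted
      } else {
        seen.set(idx);
      }
    }
    return new Counts(seen.cardinality(), redundant);
  }

  public static void main(String[] args) {
    // The scenario from the description: b8 is missing while two b0 exist.
    int[] indices = {0, 1, 2, 3, 4, 5, 6, 7, 0};
    Counts c = count(indices, 9);
    System.out.println("storages=" + indices.length
        + " live=" + c.live + " redundant=" + c.redundant);
    // prints "storages=9 live=8 redundant=1"
  }
}
```

Counting storages alone would report 9 here, while the BitSet pass reports 8
distinct internal blocks and one duplicate.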

bq. use #getStorageInfo(index) instead of storages[index] and storages[i]

I will use {{getStorageInfo(index)}} to replace {{storages[index]}} but will
keep {{storages[i]}} to stay consistent with {{indices[i]}}.

> BlockManager#countNodes should be able to detect duplicated internal blocks
> ---------------------------------------------------------------------------
>
>                 Key: HDFS-9837
>                 URL: https://issues.apache.org/jira/browse/HDFS-9837
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>    Affects Versions: 3.0.0
>            Reporter: Jing Zhao
>            Assignee: Jing Zhao
>         Attachments: HDFS-9837.000.patch, HDFS-9837.001.patch, 
> HDFS-9837.002.patch
>
>
> Currently {{BlockManager#countNodes}} only counts the number of 
> replicas/internal blocks thus it cannot detect the under-replicated scenario 
> where a striped EC block has 9 internal blocks but contains duplicated 
> data/parity blocks. E.g., b8 is missing while 2 b0 exist:
> b0, b1, b2, b3, b4, b5, b6, b7, b0
> If the NameNode keeps running, NN is able to detect the duplication of b0 and 
> will put the block into the excess map. {{countNodes}} excludes internal 
> blocks captured in the excess map thus can return the correct number of live 
> replicas. However, if NN restarts before sending out the reconstruction 
> command, the missing internal block cannot be detected anymore. The following 
> steps can reproduce the issue:
> # create an EC file
> # kill DN1 and wait for the reconstruction to happen
> # start DN1 again
> # kill DN2 and restart NN immediately



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
