[ 
https://issues.apache.org/jira/browse/HDFS-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375781#comment-14375781
 ] 

Yi Liu commented on HDFS-7960:
------------------------------

This is a good fix and improvement. Some comments:

*1.* In {{BlockManager}}, the logic for detecting zombie datanode storages has 
a problem.
{code}
      if (context != null) {
        storageInfo.setLastBlockReportId(context.getReportId());
        if (lastStorageInRpc) {
          int rpcsSeen = node.updateBlockReportContext(context);
          if (rpcsSeen >= context.getTotalRpcs()) {
            List<DatanodeStorageInfo> zombies = node.removeZombieStorages();
            if (zombies.isEmpty()) {
              ...
{code}
In the patch, *rpcsSeen* is computed on the NN by counting every RPC that 
belongs to the same block report, which is not safe when the report is split.
{{DatanodeProtocol#blockReport}} is {{@Idempotent}}, so an RPC can be retried. 
With retries, {{if (rpcsSeen >= context.getTotalRpcs())}} can become *true* 
even though some datanode storages have not yet sent their part of the split 
report. Those storages will then be treated as zombies and wrongly removed 
from the NN.
I suggest checking that all RPC ids of the block report have been received 
before checking for zombie storages.
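
To illustrate the suggestion, here is a minimal sketch of tracking distinct 
RPC indices instead of a plain counter. {{BlockReportRpcTracker}} and 
{{recordRpc}} are hypothetical names that are not part of the patch, and the 
sketch assumes each RPC carries its report id, its own index and the total 
number of RPCs in the report:

```java
import java.util.BitSet;

/**
 * Hypothetical helper (not from the patch): remembers which RPCs of the
 * current split block report have arrived, so that a retried RPC is not
 * counted twice.
 */
class BlockReportRpcTracker {
  private long reportId = 0;
  private BitSet rpcsSeen = new BitSet();

  /**
   * Records one RPC of a block report.
   *
   * @return true only when every RPC index in [0, totalRpcs) has been
   *         seen for the report identified by id.
   */
  boolean recordRpc(long id, int curRpc, int totalRpcs) {
    if (curRpc < 0 || curRpc >= totalRpcs) {
      throw new IllegalArgumentException("curRpc out of range: " + curRpc);
    }
    if (id != reportId) {
      // A new block report has started: forget the previous one.
      reportId = id;
      rpcsSeen = new BitSet(totalRpcs);
    }
    // Setting a bit is idempotent, so a retry of the same RPC changes nothing.
    rpcsSeen.set(curRpc);
    return rpcsSeen.cardinality() == totalRpcs;
  }
}
```

With a check like this, the zombie scan would run only when {{recordRpc}} 
returns true, so three deliveries of RPC 0 in a three-RPC report would not 
trigger it.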

*2.* Another comment is in {{removeZombieReplicas}}:
{code}
 removeStoredBlock(block, zombie.getDatanodeDescriptor());
{code}
When removing the stored block, we should also remove it from 
{{InvalidateBlocks}}. How about calling 
{{removeBlocksAssociatedTo(final DatanodeDescriptor node)}}? That would also 
save you some lines of code.

> The full block report should prune zombie storages even if they're not empty
> ----------------------------------------------------------------------------
>
>                 Key: HDFS-7960
>                 URL: https://issues.apache.org/jira/browse/HDFS-7960
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Lei (Eddy) Xu
>            Assignee: Colin Patrick McCabe
>            Priority: Critical
>         Attachments: HDFS-7960.002.patch, HDFS-7960.003.patch, 
> HDFS-7960.004.patch, HDFS-7960.005.patch, HDFS-7960.006.patch
>
>
> The full block report should prune zombie storages even if they're not empty. 
>  We have seen cases in production where zombie storages have not been pruned 
> subsequent to HDFS-7575.  This could arise any time the NameNode thinks there 
> is a block in some old storage which is actually not there.  In this case, 
> the block will not show up in the "new" storage (once old is renamed to new) 
> and the old storage will linger forever as a zombie, even with the HDFS-7596 
> fix applied.  This also happens with datanode hotplug, when a drive is 
> removed.  In this case, an entire storage (volume) goes away but the blocks 
> do not show up in another storage on the same datanode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)