[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15251275#comment-15251275
 ] 

Colin Patrick McCabe commented on HDFS-10301:
---------------------------------------------

Thanks for the bug report.  This is a tricky one.

One small correction: HDFS-7960 was not introduced as part of DataNode 
hotswap.  It was originally introduced to solve issues caused by HDFS-7575, 
although it fixed issues with hotswap as well.

It seems like we should be able to remove a DataNode's queued storage report 
RPCs that carry the old block report ID when we receive one with a new block 
report ID.  This would also avoid a possible congestion collapse scenario, in 
which reports retransmitted after each timeout pile up behind the ones they 
replace.
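
To make the idea concrete, here is a rough sketch of the queue pruning, not 
actual NameNode code.  The class, field, and method names are all made up for 
illustration, and it assumes that any queued RPC from the same DataNode with a 
different block report ID is stale.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical model of the pending storage report queue (not NameNode code).
class BlockReportQueueSketch {
    // One queued storage report RPC; all names here are illustrative.
    static final class StorageReportRpc {
        final String datanodeId;
        final long blockReportId;
        final String storageId;

        StorageReportRpc(String datanodeId, long blockReportId, String storageId) {
            this.datanodeId = datanodeId;
            this.blockReportId = blockReportId;
            this.storageId = storageId;
        }
    }

    private final Deque<StorageReportRpc> queue = new ArrayDeque<>();

    // Enqueue an RPC after dropping queued RPCs from the same DataNode that
    // carry a different block report ID (assumed here to be the old report).
    // Returns the number of RPCs dropped.
    int enqueue(StorageReportRpc rpc) {
        int before = queue.size();
        queue.removeIf(q -> q.datanodeId.equals(rpc.datanodeId)
                && q.blockReportId != rpc.blockReportId);
        int dropped = before - queue.size();
        queue.addLast(rpc);
        return dropped;
    }

    // Queue contents in processing order, as "datanode/reportId/storage".
    List<String> pending() {
        List<String> out = new ArrayList<>();
        for (StorageReportRpc q : queue) {
            out.add(q.datanodeId + "/" + q.blockReportId + "/" + q.storageId);
        }
        return out;
    }

    public static void main(String[] args) {
        BlockReportQueueSketch q = new BlockReportQueueSketch();
        q.enqueue(new StorageReportRpc("dn1", 1L, "s1"));
        q.enqueue(new StorageReportRpc("dn1", 1L, "s2"));
        // The DataNode timed out and resent the report under a new ID.
        int dropped = q.enqueue(new StorageReportRpc("dn1", 2L, "s1"));
        System.out.println(dropped + " stale RPCs dropped, pending: " + q.pending());
    }
}
```

A real implementation would also need a reliable way to tell which of two IDs 
is the newer one, since an RPC from the old report could arrive late.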

> Blocks removed by thousands due to falsely detected zombie storages
> -------------------------------------------------------------------
>
>                 Key: HDFS-10301
>                 URL: https://issues.apache.org/jira/browse/HDFS-10301
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.1
>            Reporter: Konstantin Shvachko
>            Assignee: Walter Su
>            Priority: Critical
>         Attachments: HDFS-10301.01.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out sending a block report. 
> Then it sends the block report again. While processing these two reports at 
> the same time, the NameNode can interleave processing of storages from 
> different reports. This screws up the blockReportId field, which makes the 
> NameNode think that some storages are zombies. Replicas from zombie storages 
> are immediately removed, causing missing blocks.
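
The interleaving described in the report can be reproduced with a toy model.  
This is only a sketch under stated assumptions, not the NameNode logic: it 
assumes that each storage remembers the ID of the last block report that 
touched it, and that the zombie check runs when the last RPC of a report is 
processed.  All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of zombie detection keyed on block report IDs (not NameNode code).
class ZombieInterleavingSketch {
    // Storage ID -> ID of the last block report that touched this storage.
    private final Map<String, Long> lastReportId = new LinkedHashMap<>();

    // Process one storage from a report. On the last RPC of that report,
    // return every storage whose last-seen report ID differs from this one;
    // the model treats those storages as zombies.
    List<String> process(long reportId, String storageId, boolean isLastRpc) {
        lastReportId.put(storageId, reportId);
        List<String> zombies = new ArrayList<>();
        if (isLastRpc) {
            for (Map.Entry<String, Long> e : lastReportId.entrySet()) {
                if (e.getValue().longValue() != reportId) {
                    zombies.add(e.getKey());
                }
            }
        }
        return zombies;
    }

    public static void main(String[] args) {
        ZombieInterleavingSketch nn = new ZombieInterleavingSketch();
        // Report 1 timed out, so the DataNode resent the same storages as
        // report 2; the two reports are then processed interleaved.
        nn.process(1L, "s1", false);
        nn.process(2L, "s1", false);
        // Last RPC of report 1: s1 was last touched by report 2, so it is
        // falsely flagged even though it is a healthy storage.
        System.out.println("flagged: " + nn.process(1L, "s2", true));
    }
}
```

In this model, processing each report to completion before starting the next 
flags nothing, which is why the interleaving is the trigger.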



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
