[ https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322994#comment-15322994 ]

Colin Patrick McCabe commented on HDFS-10301:
---------------------------------------------

[~shv], comments about me "being on a -1 spree" are not constructive and they 
don't do anything to help the tone of the discussion.  We've been talking about 
this since April and my views have been consistent the whole time.  I have a 
solution, but I am open to other solutions as long as they don't have big 
disadvantages.

bq. The whole approach of keeping the state for the block report processing on 
the NameNode is error-prone. It assumes at-once execution, and therefore when 
block reports interleave the BR-state gets messed up. Particularly, the BitSet 
used to mark storages, which have been processed, can be reset during 
interleaving multiple times and cannot be used to count storages in the report. 
In current implementation the messing-up of BR-state leads to false positive 
detection of a zombie storage and removal of a perfectly valid one.
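To make the failure mode in the quoted paragraph concrete, here is a minimal illustrative model (not actual HDFS code; class and method names are hypothetical) of the per-DataNode state described: a current block report ID plus a bitset of storages already processed under that ID. When two reports interleave, each ID change resets the bitset, so a storage that was in fact reported ends up looking unprocessed:

```python
# Hypothetical sketch of the NN-side block-report state described above.
# Not HDFS source; names are illustrative.
class DatanodeBRState:
    def __init__(self, num_storages):
        self.cur_report_id = None
        self.processed = [False] * num_storages  # the "BitSet" of seen storages

    def process_storage(self, report_id, storage_idx):
        # A report ID change resets the bitset -- this is the reset that
        # interleaving can trigger multiple times within one report.
        if report_id != self.cur_report_id:
            self.cur_report_id = report_id
            self.processed = [False] * len(self.processed)
        self.processed[storage_idx] = True

    def apparent_zombies(self):
        # Storages whose bit is still clear look like zombies.
        return [i for i, done in enumerate(self.processed) if not done]

state = DatanodeBRState(num_storages=2)
# Interleaved processing of two retransmitted reports for the same DN:
state.process_storage(report_id=1, storage_idx=0)  # report 1, storage 0
state.process_storage(report_id=2, storage_idx=0)  # report 2 resets bitset
state.process_storage(report_id=1, storage_idx=1)  # report 1 resets it again
# Storage 0 was reported in both reports, yet its bit is now clear:
# apparent_zombies() -> [0], a false positive.
```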

Block report processing is inherently stateful: it is the mechanism by which the 
DN synchronizes its entire block state with the block state on the NN.  
Interleaved block reports are very bad news, even if 
this bug didn't exist, because they mean that the state on the NN will go "back 
in time" for some storages, rather than monotonically moving forward in time.  
This may lead the NN to make incorrect (and potentially irreversible) decisions 
like deleting a replica somewhere because it appears to exist on an old stale 
interleaved block report.  Keep in mind that these old stale interleaved FBRs 
will override any incremental BRs that were sent in the meantime!
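The "back in time" hazard can be sketched as follows. This is an illustrative model of the monotonicity argument, not the patch on this JIRA or actual HDFS code: if the NN only accepts a full report whose ID is strictly newer than the last one it finished, a stale retransmission cannot roll a storage's state backward and clobber incremental BRs applied in the meantime:

```python
# Illustrative sketch of a monotonic-ID guard; names are hypothetical.
class StorageView:
    def __init__(self):
        self.last_full_report_id = 0
        self.blocks = set()

    def apply_full_report(self, report_id, blocks):
        if report_id <= self.last_full_report_id:
            return False  # stale interleaved FBR: drop it, don't go back in time
        self.last_full_report_id = report_id
        self.blocks = set(blocks)  # an FBR replaces the storage's state wholesale
        return True

    def apply_incremental(self, added):
        self.blocks |= set(added)

view = StorageView()
view.apply_full_report(1, ["blk_1"])
view.apply_incremental(["blk_2"])            # incremental BR adds blk_2
accepted = view.apply_full_report(1, ["blk_1"])  # stale retransmission of FBR 1
# Without the guard, the stale FBR would wipe out blk_2; with it, the
# report is rejected and the state keeps moving forward in time.
```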

Interleaved block reports also potentially indicate that the DNs are sending 
new full block reports before the last ones have been processed.  So either our 
FBR retransmission mechanism is screwed up and is spewing a firehose of FBRs at 
an unresponsive NameNode (making it even more unresponsive, no doubt), or the 
NN can't process an FBR in the extremely long FBR sending period.  Both of 
these explanations mean that you've got a cluster which has serious, serious 
problems and you should fix it right now.

That's the reason why people are not taking this JIRA as seriously as they 
otherwise might: they know that interleaved FBRs mean that something 
is very wrong.  And you are consistently ignoring this feedback and telling us 
how my patch is bad because it doesn't perform zombie storage elimination when 
FBRs get interleaved.

bq. It seems that you don't or don't want to understand reasoning around adding 
separate storage reporting RPC call. At least you addressed it only by 
repeating your -1. For the third time. And did not respond to Zhe Zhang's 
proposal to merge the storage reporting RPC into one of the storage reports in 
the next jira.  Given that and in order to move forward, we should look into 
making changes to the last BR RPC call, which should now also report all 
storages.

I am fine with adding storage reporting to any of the existing FBR RPCs.  What 
I am not fine with is adding another RPC which will create more load.
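The piggybacking idea can be sketched briefly. This is a hedged illustration of the proposal (hypothetical function and field names, not an actual HDFS API): the last RPC of a block report also carries the DataNode's complete storage list, so the NN can prune unreported (zombie) storages without any additional RPC:

```python
# Hypothetical sketch: the final block-report RPC includes the DN's full
# storage list; the NN drops any storage the DN no longer reports.
def prune_zombies(nn_storages, reported_storage_ids):
    # Keep only storages the DN says it still has.
    return {sid: blocks for sid, blocks in nn_storages.items()
            if sid in reported_storage_ids}

# NN's view of one DN, including a storage the DN no longer has:
nn_view = {"DS-1": {"blk_1"}, "DS-2": {"blk_2"}, "DS-stale": {"blk_9"}}
# The last FBR RPC piggybacks the complete storage list:
final_rpc_storages = {"DS-1", "DS-2"}
nn_view = prune_zombies(nn_view, final_rpc_storages)
# "DS-stale" is removed with zero extra RPC load.
```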

> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10301
>                 URL: https://issues.apache.org/jira/browse/HDFS-10301
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.1
>            Reporter: Konstantin Shvachko
>            Assignee: Colin Patrick McCabe
>            Priority: Critical
>         Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.01.patch, 
> HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy a DataNode can time out sending a block report, and 
> then it sends the block report again. The NameNode, while processing these two 
> reports at the same time, can interleave processing of storages from different 
> reports. This corrupts the blockReportId field, which makes the NameNode think 
> that some 
> storages are zombie. Replicas from zombie storages are immediately removed, 
> causing missing blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
