[ https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15322994#comment-15322994 ]
Colin Patrick McCabe commented on HDFS-10301:
---------------------------------------------

[~shv], comments about me "being on a -1 spree" are not constructive, and they do nothing to help the tone of the discussion. We have been talking about this since April and my views have been consistent the whole time. I have a solution, but I am open to other solutions as long as they don't have big disadvantages.

bq. The whole approach of keeping the state for the block report processing on the NameNode is error-prone. It assumes at-once execution, and therefore when block reports interleave the BR-state gets messed up. Particularly, the BitSet used to mark storages, which have been processed, can be reset during interleaving multiple times and cannot be used to count storages in the report. In current implementation the messing-up of BR-state leads to false positive detection of a zombie storage and removal of a perfectly valid one.

Block report processing is inherently stateful: it is the mechanism by which the DN synchronizes its entire block state with the block state on the NN. Interleaved block reports are very bad news, even if this bug didn't exist, because they mean that the state on the NN will go "back in time" for some storages rather than moving monotonically forward. This may lead the NN to make incorrect (and potentially irreversible) decisions, like deleting a replica somewhere because it appears to exist in an old, stale, interleaved block report. Keep in mind that these old stale interleaved FBRs will override any incremental BRs that were sent in the meantime! Interleaved block reports also potentially indicate that the DNs are sending new full block reports before the last ones have been processed.
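To make the interleaving hazard concrete, here is a minimal, hypothetical Python simulation of the failure mode described above. It is not the actual NameNode code: the class, the per-report `seen` set (standing in for the BitSet mentioned in the quote), and the RPC ordering are all simplified assumptions.

```python
class NameNodeSim:
    """Toy model of per-block-report state on the NameNode.

    Hypothetical and simplified; not the real HDFS implementation.
    """

    def __init__(self, storages):
        self.storages = set(storages)  # storages believed to hold replicas
        self.cur_report_id = None
        self.seen = set()              # stands in for the per-report BitSet

    def process_rpc(self, report_id, storage, is_last):
        # An RPC carrying a different report id resets the per-report
        # state -- this is exactly what an interleaved retransmission
        # triggers.
        if report_id != self.cur_report_id:
            self.cur_report_id = report_id
            self.seen = set()
        self.seen.add(storage)
        if is_last:
            # "Zombie" elimination: storages not seen in this report are
            # assumed dead, and their replicas are dropped.
            zombies = self.storages - self.seen
            self.storages -= zombies
            return zombies
        return set()


# In-order processing of report 1 over storages s1 and s2: no zombies.
nn = NameNodeSim({"s1", "s2"})
nn.process_rpc(1, "s1", is_last=False)
ok = nn.process_rpc(1, "s2", is_last=True)

# A retransmitted report (id 2) interleaves between the two RPCs of
# report 1: the reset wipes the record that s1 was already processed,
# and the perfectly healthy storage s1 is declared a zombie.
nn2 = NameNodeSim({"s1", "s2"})
nn2.process_rpc(1, "s1", is_last=False)
nn2.process_rpc(2, "s1", is_last=False)
bad = nn2.process_rpc(1, "s2", is_last=True)
```

In the in-order run `ok` comes back empty; in the interleaved run `bad` contains `s1` and the simulated NameNode drops its replicas, mirroring the false zombie detection this thread describes.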
So either our FBR retransmission mechanism is screwed up and is spewing a firehose of FBRs at an unresponsive NameNode (making it even more unresponsive, no doubt), or the NN can't process an FBR in the extremely long FBR sending period. Both of these explanations mean that you've got a cluster which has serious, serious problems and you should fix it right now. That's the reason why people are not taking this JIRA as seriously as they otherwise might: because they know that interleaved FBRs mean that something is very wrong. And you are consistently ignoring this feedback and telling us how my patch is bad because it doesn't perform zombie storage elimination when FBRs get interleaved.

bq. It seems that you don't or don't want to understand reasoning around adding separate storage reporting RPC call. At least you addressed it only by repeating your -1. For the third time. And did not respond to Zhe Zhang's proposal to merge the storage reporting RPC into one of the storage reports in the next jira. Given that and in order to move forward, we should look into making changes to the last BR RPC call, which should now also report all storages.

I am fine with adding storage reporting to any of the existing FBR RPCs. What I am not fine with is adding another RPC, which will create more load.
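One way to see why carrying the storage list inside an existing FBR RPC side-steps the interleaving problem: zombie detection then compares against data carried in the report itself, not against mutable cross-RPC state. The sketch below is an illustrative assumption about how such a check could look (the function name and shapes are invented); it is not the patch attached to this JIRA.

```python
def detect_zombies(known_storages, reported_storage_ids):
    # The full storage list travels inside the report RPC, so this check
    # needs no per-report state on the NameNode and cannot be clobbered
    # by an interleaved retransmission.
    return set(known_storages) - set(reported_storage_ids)


# Even a stale, retransmitted report still names every live storage,
# so interleaving cannot make a healthy storage look like a zombie.
no_zombies = detect_zombies({"s1", "s2"}, ["s1", "s2"])

# A storage genuinely absent from the report is still caught.
real_zombie = detect_zombies({"s1", "s2", "s3"}, ["s1", "s2"])
```

The stateless check trades a slightly larger RPC payload for immunity to out-of-order report processing, which is the core of the disagreement in this thread.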
> BlockReport retransmissions may lead to storages falsely being declared
> zombie if storage report processing happens out of order
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10301
>                 URL: https://issues.apache.org/jira/browse/HDFS-10301
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.1
>            Reporter: Konstantin Shvachko
>            Assignee: Colin Patrick McCabe
>            Priority: Critical
>         Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.01.patch, HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
> When the NameNode is busy, a DataNode can time out sending a block report
> and then sends the block report again. The NameNode, while processing these
> two reports at the same time, can interleave processing of storages from
> different reports. This corrupts the blockReportId field, which makes the
> NameNode think that some storages are zombie. Replicas from zombie storages
> are immediately removed, causing missing blocks.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)