[ https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419823#comment-15419823 ]
Colin P. McCabe commented on HDFS-10301:
----------------------------------------

I don't think the heartbeat is the right place to handle reconciling the block storages. One reason is that this adds extra complexity and time to the heartbeat, which happens far more frequently than an FBR. We even talked about making the heartbeat lockless-- clearly you can't do that if you are traversing all the block storages. Taking the FSN lock is expensive, and heartbeats are sent quite frequently from each DN-- every few seconds.

Another reason reconciling storages in heartbeats is bad: if the heartbeat tells you about a new storage, you won't know what blocks are in it until the FBR arrives. So the NN may end up assigning a bunch of new blocks to a storage which looks empty but is actually full (see the sketch below).

I came up with what I believe is the correct patch to fix this problem months ago. It's here: https://issues.apache.org/jira/secure/attachment/12805931/HDFS-10301.005.patch . It doesn't modify any RPCs or add any new mechanisms. Instead, it just fixes the obvious bug in the HDFS-7960 logic.

The only counter-argument to applying patch 005 that anyone ever came up with is that it doesn't eliminate zombies when FBRs get interleaved. But this is not a good counter-argument, since FBR interleaving is extremely, extremely rare in well-run clusters. The proof should be obvious-- if FBR interleaving happened on more clusters, more people would hit this serious data loss bug.

This JIRA has been extremely frustrating. It seems like most, if not all, of the points that I brought up in my reviews were ignored. I talked about the obvious compatibility problems with [~shv]'s solution and even explicitly asked him to test the upgrade case. I told him that this JIRA was a bad one to give to a promising new contributor such as [~redvine], because it required a lot of context and was extremely tricky. Both [~andrew.wang] and I commented that overloading BlockListAsLongs was confusing and not necessary. The patch confused "not modifying the .proto file" with "not modifying the RPC content", which are two very separate concepts, as I commented over and over. Clearly these comments were ignored. If anything, I think [~shv] got very lucky that the bug manifested itself quickly rather than creating a serious data loss situation a few months down the road, like the one I had to debug when fixing HDFS-7960.

Again, I would urge you to just commit patch 005. Or at least evaluate it.
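To make the "looks empty until the FBR arrives" concern concrete, here is a minimal toy sketch in plain Java. It is not the actual NameNode code; ToyNameNode, StorageView, and the placement logic are invented purely for illustration. The point it shows: a storage the NameNode learns about only from a heartbeat has no blocks recorded for it, so any free-space-based placement heuristic will treat it as wide open even if it is actually full.

{code:java}
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class HeartbeatStorageSketch {

    /** NameNode-side view of one DataNode storage (toy model only). */
    static class StorageView {
        final String storageId;
        long capacityBytes;     // learned from heartbeats
        long nnKnownUsedBytes;  // derived from blocks the NN knows about (0 until the FBR)

        StorageView(String storageId, long capacityBytes) {
            this.storageId = storageId;
            this.capacityBytes = capacityBytes;
        }

        /** Remaining space as the NN would estimate it from its own block map. */
        long estimatedRemaining() {
            return capacityBytes - nnKnownUsedBytes;
        }
    }

    static class ToyNameNode {
        final Map<String, StorageView> storages = new HashMap<>();

        /** Heartbeat path: cheap, only registers the storage and its capacity. */
        void onHeartbeat(String storageId, long capacityBytes) {
            storages.computeIfAbsent(storageId, id -> new StorageView(id, capacityBytes));
        }

        /** FBR path: the NN finally learns how much the storage really holds. */
        void onFullBlockReport(String storageId, long usedBytes) {
            storages.get(storageId).nnKnownUsedBytes = usedBytes;
        }

        /** Naive placement: pick the storage with the most estimated free space. */
        String pickTargetStorage() {
            return storages.values().stream()
                    .max(Comparator.comparingLong(StorageView::estimatedRemaining))
                    .map(s -> s.storageId)
                    .orElseThrow(IllegalStateException::new);
        }
    }

    public static void main(String[] args) {
        ToyNameNode nn = new ToyNameNode();

        // storage-A has sent its FBR; the NN knows it is half full.
        nn.onHeartbeat("storage-A", 1_000_000_000L);
        nn.onFullBlockReport("storage-A", 500_000_000L);

        // storage-B shows up in a heartbeat only. In reality it is nearly full,
        // but with no FBR the NN has zero blocks recorded for it.
        nn.onHeartbeat("storage-B", 1_000_000_000L);

        // The NN targets the "empty" storage-B for new blocks.
        System.out.println("Placement target: " + nn.pickTargetStorage());
    }
}
{code}

In this toy model the placement picks storage-B, which in reality has no room-- which is why a storage discovered at heartbeat time should not be acted on before its FBR has been processed.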
> BlockReport retransmissions may lead to storages falsely being declared
> zombie if storage report processing happens out of order
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10301
>                 URL: https://issues.apache.org/jira/browse/HDFS-10301
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.1
>            Reporter: Konstantin Shvachko
>            Assignee: Vinitha Reddy Gankidi
>            Priority: Critical
>             Fix For: 2.7.4
>
>         Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch,
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch,
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch,
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch,
> HDFS-10301.012.patch, HDFS-10301.013.patch, HDFS-10301.branch-2.7.patch,
> HDFS-10301.branch-2.patch, HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
> When the NameNode is busy, a DataNode can time out sending a block report.
> It then sends the block report again. While processing these two reports at
> the same time, the NameNode can interleave processing of storages from
> different reports. This screws up the blockReportId field, which makes the
> NameNode think that some storages are zombies. Replicas from zombie storages
> are immediately removed, causing missing blocks.
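The interleaving described above can be seen in a heavily simplified sketch, again in plain Java. The names (Storage, processStorageReport, pruneZombies, lastSeenReportId) are illustrative stand-ins, not the real HDFS-7960 classes. The idea being modeled: each per-storage section of an FBR stamps the storage with that report's ID, and a pruning pass then treats any storage not carrying the current report's ID as a zombie. If the original FBR and its retransmitted copy are processed interleaved, a live storage can end up stamped with the other report's ID at pruning time.

{code:java}
import java.util.List;

public class InterleavedFbrSketch {

    static class Storage {
        final String id;
        long lastSeenReportId;  // stamped whenever a report section for this storage is processed
        Storage(String id) { this.id = id; }
    }

    /** Storages the NameNode tracks for one DataNode. */
    static final List<Storage> STORAGES = List.of(new Storage("s1"), new Storage("s2"));

    /** Process one per-storage section of a full block report. */
    static void processStorageReport(Storage s, long reportId) {
        s.lastSeenReportId = reportId;  // "this storage was seen in report <reportId>"
    }

    /** After the report's last section: drop storages the report never mentioned. */
    static void pruneZombies(long reportId) {
        for (Storage s : STORAGES) {
            if (s.lastSeenReportId != reportId) {
                System.out.println("ZOMBIE (replicas would be removed): " + s.id);
            }
        }
    }

    public static void main(String[] args) {
        long firstReport = 100L;  // the original FBR
        long retryReport = 200L;  // the same FBR, resent after the first RPC timed out

        // The two copies are processed concurrently and their sections interleave:
        processStorageReport(STORAGES.get(0), firstReport);  // s1 stamped 100
        processStorageReport(STORAGES.get(0), retryReport);  // s1 stamped 200
        processStorageReport(STORAGES.get(1), retryReport);  // s2 stamped 200
        processStorageReport(STORAGES.get(1), firstReport);  // s2 stamped 100 (overwrites 200)

        // The retry's final section triggers pruning with reportId 200, but s2 now
        // carries 100, so a perfectly healthy storage is declared a zombie.
        pruneZombies(retryReport);
    }
}
{code}

Running the sketch reports s2 as a zombie even though the DataNode never lost that storage; in the real NameNode the analogous outcome is the immediate removal of its replicas, i.e. the missing blocks described above.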