[ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15402951#comment-15402951
 ] 

Daryn Sharp commented on HDFS-10301:
------------------------------------

I've read this jira as I said I would, and I've looked at the patch.

Our nightly build & deploy for 2.7 is broken.  DNs claim to report thousands of 
blocks, NN says nope, -1.  This should be reason enough to revert until we get 
to the bottom of it.  We're reverting internally.  If that fixes it, I will 
have someone help me revert this patch tomorrow morning, if that has not 
already been done.

Why is this patch changing per-storage reports when it's the single-rpc report 
that is the problem?  Is this change compatible?
# What does an old NN do if it gets this pseudo-report?  Will it forget about 
all the blocks on the non-last storage?
# What does a new NN do when it gets old style reports?  Will it remove all but 
the last storage?

This zombie detection, report context, etc. is getting out of hand.  I don't 
understand why the zombie detection isn't based on the healthy storages in the 
heartbeat.  Anything else gets flagged as failed, and the heartbeat monitor 
disposes of it.
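
A minimal sketch of what heartbeat-based detection could look like, assuming 
the NN keeps a set of known storage IDs per DN and the heartbeat carries the 
set of healthy storage IDs.  The class and method names are hypothetical and 
are not taken from the NameNode code or from any of the attached patches.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch only; not the actual NameNode implementation or API.
final class HeartbeatZombieSketch {
    /**
     * Returns the storage IDs that the NN knows about for a DN but that the
     * DN's latest heartbeat did not list as healthy. These are the
     * candidates to flag as failed. Neither input set is modified.
     */
    static Set<String> storagesToFail(Set<String> knownStorageIds,
                                      Set<String> healthyInHeartbeat) {
        Set<String> failed = new HashSet<>(knownStorageIds);
        failed.removeAll(healthyInHeartbeat);
        return failed;
    }
}
```

With this shape the decision depends only on the latest heartbeat, so it does 
not matter in which order block report RPCs are processed.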


> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10301
>                 URL: https://issues.apache.org/jira/browse/HDFS-10301
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.1
>            Reporter: Konstantin Shvachko
>            Assignee: Vinitha Reddy Gankidi
>            Priority: Critical
>             Fix For: 2.7.4
>
>         Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, 
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch, 
> HDFS-10301.012.patch, HDFS-10301.branch-2.7.patch, HDFS-10301.branch-2.patch, 
> HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out sending a block report. 
> It then sends the block report again. While processing these two reports at 
> the same time, the NameNode can interleave the processing of storages from 
> different reports. This screws up the blockReportId field, which makes the 
> NameNode think that some storages are zombies. Replicas from zombie storages 
> are immediately removed, causing missing blocks.
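
The interleaving in the description can be illustrated with a simplified 
model.  It assumes that the NameNode stamps each storage with the ID of the 
last report that touched it and, after the final storage of a report, treats 
every storage carrying a different stamp as a zombie.  That is a reading of 
the description above, not the actual NameNode code, and all names in the 
sketch are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical model of the race; not the NameNode code.
final class InterleavedReportSketch {
    // storage ID -> ID of the last block report that touched the storage
    private final Map<String, Long> lastReportId = new LinkedHashMap<>();

    /**
     * Processes one storage of a report and returns the storages that this
     * model would declare zombie as a result (empty unless this call is the
     * last storage of its report).
     */
    List<String> process(long reportId, String storageId,
                         boolean lastStorageOfReport) {
        lastReportId.put(storageId, reportId);
        List<String> zombies = new ArrayList<>();
        if (lastStorageOfReport) {
            for (Map.Entry<String, Long> e : lastReportId.entrySet()) {
                // A stamp from a different report looks like "not reported".
                if (e.getValue().longValue() != reportId) {
                    zombies.add(e.getKey());
                }
            }
        }
        return zombies;
    }
}
```

Feeding the model storage s1 of report 1, then storage s1 of the 
retransmitted report 2, then storage s2 of report 1 as the last storage makes 
it declare s1 a zombie, even though s1 is healthy and appeared in both 
reports.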



