[jira] [Commented] (HDFS-10301) BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order

Vinitha Reddy Gankidi (JIRA) Mon, 12 Sep 2016 18:21:35 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15485884#comment-15485884
 ]


Vinitha Reddy Gankidi commented on HDFS-10301:
----------------------------------------------

After a thorough investigation of the heartbeat logic, I have verified that 
unreported storages do get removed without any code change. The attached patch 
014 eliminates the state and the zombie storage removal logic introduced in 
HDFS-7960. 
I have added a unit test that verifies that when a DN storage with blocks is 
removed, the storage is also removed from the DatanodeDescriptor and does not 
linger forever. Unreported storages are marked as FAILED in the 
{{updateHeartbeatState}} method when {{checkFailedStorages}} is true. Thus, when 
a DN storage is removed, it will be marked as FAILED in the next heartbeat. 
The storage removal then happens in two steps (refer to Steps 2 and 3 in 
https://issues.apache.org/jira/browse/HDFS-10301?focusedCommentId=15427387&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15427387).
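
The heartbeat-driven removal described above can be sketched roughly as 
follows. This is a simplified, self-contained model and not the actual Hadoop 
code: the classes {{StorageSketch}} and {{DatanodeSketch}} and the methods 
{{removeBlocksOfFailedStorages}} and {{pruneEmptyFailedStorages}} are 
hypothetical names, and the two later steps are my assumption of what Steps 2 
and 3 in the linked comment amount to.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a storage tracked by the NameNode.
class StorageSketch {
    enum State { NORMAL, FAILED }

    final String id;
    State state = State.NORMAL;
    int numBlocks;

    StorageSketch(String id, int numBlocks) {
        this.id = id;
        this.numBlocks = numBlocks;
    }
}

// Hypothetical stand-in for the per-DataNode descriptor.
class DatanodeSketch {
    final Map<String, StorageSketch> storages = new HashMap<>();

    void addStorage(String id, int numBlocks) {
        storages.put(id, new StorageSketch(id, numBlocks));
    }

    // Step 1: on a heartbeat, any known storage that the DataNode did not
    // report is marked FAILED, but only when checkFailedStorages is true.
    void updateHeartbeatState(Set<String> reportedIds,
                              boolean checkFailedStorages) {
        if (!checkFailedStorages) {
            return;
        }
        for (StorageSketch s : storages.values()) {
            if (!reportedIds.contains(s.id)) {
                s.state = StorageSketch.State.FAILED;
            }
        }
    }

    // Step 2 (assumed): the periodic heartbeat recheck drops the blocks
    // associated with FAILED storages.
    void removeBlocksOfFailedStorages() {
        for (StorageSketch s : storages.values()) {
            if (s.state == StorageSketch.State.FAILED) {
                s.numBlocks = 0;
            }
        }
    }

    // Step 3 (assumed): a later heartbeat prunes FAILED storages that no
    // longer hold any blocks.
    void pruneEmptyFailedStorages() {
        storages.values().removeIf(
            s -> s.state == StorageSketch.State.FAILED && s.numBlocks == 0);
    }
}
```

In this model, a storage that stops appearing in heartbeats is marked FAILED 
first, loses its blocks on the next recheck, and only then disappears from the 
descriptor, so nothing is removed on the basis of block report IDs.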
 
The test {{testRemovingStorageDoesNotProduceZombies}} introduced in HDFS-7960 
passes once the heartbeat recheck interval is reduced so that the test doesn't 
time out. By default, the Heartbeat Manager removes blocks associated with 
failed storages every 5 minutes.
I have ignored {{testProcessOverReplicatedAndMissingStripedBlock}} in this 
patch. Please refer to HDFS-10854 for more details.


> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10301
>                 URL: https://issues.apache.org/jira/browse/HDFS-10301
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.1
>            Reporter: Konstantin Shvachko
>            Assignee: Vinitha Reddy Gankidi
>            Priority: Critical
>             Fix For: 2.7.4
>
>         Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, 
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, 
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch, 
> HDFS-10301.012.patch, HDFS-10301.013.patch, HDFS-10301.branch-2.7.patch, 
> HDFS-10301.branch-2.patch, HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out while sending a block 
> report. It then sends the block report again. While processing these two 
> reports at the same time, the NameNode can interleave the processing of 
> storages from different reports. This corrupts the blockReportId field, which 
> makes the NameNode think that some storages are zombies. Replicas from zombie 
> storages are immediately removed, causing missing blocks.
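
The interleaving in the description quoted above can be illustrated with a 
small model. This is only a sketch under stated assumptions, not the NameNode 
code: the class {{ZombieRaceSketch}} and the method {{processStorageReport}} 
are hypothetical, and the zombie check is assumed to compare each storage's 
last-seen block report ID with the ID of the report that just finished.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of block-report-ID based zombie detection.
class ZombieRaceSketch {
    // Storage ID -> ID of the last block report that touched the storage.
    final Map<String, Long> lastBlockReportId = new LinkedHashMap<>();

    // Process one storage from a block report. When it is the last storage
    // of that report, every storage whose recorded report ID differs from
    // this report's ID is returned as a (supposed) zombie.
    List<String> processStorageReport(String storageId, long reportId,
                                      boolean lastInReport) {
        lastBlockReportId.put(storageId, reportId);
        List<String> zombies = new ArrayList<>();
        if (lastInReport) {
            for (Map.Entry<String, Long> e : lastBlockReportId.entrySet()) {
                if (e.getValue().longValue() != reportId) {
                    zombies.add(e.getKey());
                }
            }
        }
        return zombies;
    }
}
```

If report 1 and its retransmission, report 2, both cover storages s1 and s2, 
and the NameNode processes them in the order (s1, report 1), (s1, report 2), 
(s2, report 2), (s2, report 1), then the final check for report 1 finds s1 
still tagged with report 2 and flags it as a zombie even though the DataNode 
reported it in both reports.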



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
