[ https://issues.apache.org/jira/browse/HDFS-10857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15484751#comment-15484751 ]
Kihwal Lee commented on HDFS-10857:
-----------------------------------

It looks like it is fixed in 2.8 and later. {{DataNode#checkDiskError()}} does remove the failed volume from {{DataStorage}}.

> Rolling upgrade can make data unavailable when the cluster has many failed
> volumes
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-10857
>                 URL: https://issues.apache.org/jira/browse/HDFS-10857
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.7.4
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Critical
>
> When the marker file or trash dir is created or removed during heartbeat
> response processing, an {{IOException}} is thrown if this is attempted on a
> failed volume. This stops processing of the rest of the storage directories
> and of any DNA commands that were part of the heartbeat response.
> While this is happening, the block token key update does not happen, and all
> read and write requests start to fail until the upgrade is finalized and the
> DN receives a new key. All it takes is one failed volume. If there are three
> such nodes in the cluster, it is very likely that some blocks cannot be read.
> Unlike the common missing-blocks scenarios, the NN has no idea, although the
> effect is the same.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
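To make the failure mode concrete: the bug pattern described above is a loop over storage directories in which one volume's {{IOException}} escapes the loop and aborts everything that follows, including unrelated commands from the same heartbeat response. A minimal Java sketch (hypothetical names, not the actual Hadoop code) contrasts that with the per-volume error handling that keeps the remaining directories and commands alive:

```java
import java.io.IOException;
import java.util.List;

public class VolumeScanSketch {

    // Fragile pattern: the first failed volume throws, so the remaining
    // volumes are never visited and any follow-up work is skipped.
    static int processAllOrAbort(List<Boolean> volumeOk) throws IOException {
        int processed = 0;
        for (boolean ok : volumeOk) {
            if (!ok) {
                throw new IOException("volume failed");
            }
            processed++;
        }
        return processed;
    }

    // Resilient pattern: each volume's failure is caught locally, so the
    // remaining volumes (and subsequent heartbeat commands) still run.
    static int processPerVolume(List<Boolean> volumeOk) {
        int processed = 0;
        for (boolean ok : volumeOk) {
            try {
                if (!ok) {
                    throw new IOException("volume failed");
                }
                processed++;
            } catch (IOException e) {
                // Log and continue; a real DN would also mark the volume
                // failed and remove it from its storage list.
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        // One failed volume out of four.
        List<Boolean> vols = List.of(true, false, true, true);

        int abortStyle;
        try {
            abortStyle = processAllOrAbort(vols);
        } catch (IOException e) {
            abortStyle = -1; // processing stopped mid-way
        }
        System.out.println("abort-style: stopped early = " + (abortStyle < 0));
        System.out.println("per-volume processed: " + processPerVolume(vols));
    }
}
```

The fix noted in the comment follows the second shape: {{DataNode#checkDiskError()}} removes the failed volume so later heartbeat processing never touches it.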