[ https://issues.apache.org/jira/browse/HDFS-10857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15484751#comment-15484751 ]
Kihwal Lee commented on HDFS-10857:
-----------------------------------

It looks like it is fixed in 2.8 and later. {{DataNode#checkDiskError()}} does remove the failed volume from {{DataStorage}}.

> Rolling upgrade can make data unavailable when the cluster has many failed
> volumes
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-10857
>                 URL: https://issues.apache.org/jira/browse/HDFS-10857
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.7.4
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Critical
>
> When the marker file or trash dir is created or removed during heartbeat
> response processing, an {{IOException}} is thrown if this is attempted on a
> failed volume. This stops processing of the rest of the storage directories
> and of any DNA commands that were part of the heartbeat response.
> While this is happening, the block token key update does not happen, and all
> read and write requests start to fail until the upgrade is finalized and the
> DN receives a new key. All it takes is one failed volume. If there are three
> such nodes in the cluster, it is very likely that some blocks cannot be read.
> Unlike the common missing-blocks scenarios, the NN has no idea, although the
> effect is the same.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
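To make the failure mode concrete: the bug pattern described above is a loop over storage directories in which one volume's {{IOException}} escapes the loop and aborts everything that follows, including unrelated commands from the same heartbeat response. A minimal Java sketch (hypothetical names, not the actual Hadoop code) contrasts that with the per-volume error handling that keeps the remaining directories and commands alive:

```java
import java.io.IOException;
import java.util.List;

public class VolumeScanSketch {

    // Fragile pattern: the first failed volume throws, so the remaining
    // volumes are never visited and any follow-up work is skipped.
    static int processAllOrAbort(List<Boolean> volumeOk) throws IOException {
        int processed = 0;
        for (boolean ok : volumeOk) {
            if (!ok) {
                throw new IOException("volume failed");
            }
            processed++;
        }
        return processed;
    }

    // Resilient pattern: each volume's failure is caught locally, so the
    // remaining volumes (and subsequent heartbeat commands) still run.
    static int processPerVolume(List<Boolean> volumeOk) {
        int processed = 0;
        for (boolean ok : volumeOk) {
            try {
                if (!ok) {
                    throw new IOException("volume failed");
                }
                processed++;
            } catch (IOException e) {
                // Log and continue; a real DN would also mark the volume
                // failed and remove it from its storage list.
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        // One failed volume out of four.
        List<Boolean> vols = List.of(true, false, true, true);

        int abortStyle;
        try {
            abortStyle = processAllOrAbort(vols);
        } catch (IOException e) {
            abortStyle = -1; // processing stopped mid-way
        }
        System.out.println("abort-style: stopped early = " + (abortStyle < 0));
        System.out.println("per-volume processed: " + processPerVolume(vols));
    }
}
```

The fix noted in the comment follows the second shape: {{DataNode#checkDiskError()}} removes the failed volume so later heartbeat processing never touches it.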