Kihwal Lee created HDFS-10857:
---------------------------------

             Summary: Rolling upgrade can make data unavailable when the 
cluster has many failed volumes
                 Key: HDFS-10857
                 URL: https://issues.apache.org/jira/browse/HDFS-10857
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Kihwal Lee
            Priority: Critical


When the marker file or trash dir is created or removed during heartbeat 
response processing, an {{IOException}} is thrown if the operation is attempted 
on a failed volume. This stops processing of the remaining storage directories 
and of any DNA commands that were part of the heartbeat response.
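To illustrate the failure mode, here is a simplified sketch (not the actual 
DataNode code; {{StorageDirectory}}, {{setRollingUpgradeMarker}} and 
{{processCommands}} below are stand-ins for the real internals). An unguarded 
loop over the storage directories aborts on the first {{IOException}}, so the 
commands that follow are never handled; catching the exception per volume 
would let the remaining directories and commands proceed.

{code:java}
import java.io.File;
import java.io.IOException;
import java.util.List;

class HeartbeatResponseSketch {
  static class StorageDirectory {
    final File root;
    StorageDirectory(File root) { this.root = root; }

    // Throws IOException when the underlying volume has failed.
    void setRollingUpgradeMarker() throws IOException {
      if (!root.canWrite()) {
        throw new IOException("Cannot create marker on failed volume " + root);
      }
      new File(root, "rollingUpgradeInProgress").createNewFile();
    }
  }

  // Problematic pattern: the first failed volume aborts the loop, and the
  // DNA commands after it (e.g. the block token key update) never run.
  void processHeartbeatResponse(List<StorageDirectory> dirs) throws IOException {
    for (StorageDirectory sd : dirs) {
      sd.setRollingUpgradeMarker();  // IOException here stops everything
    }
    processCommands();               // skipped if any volume failed
  }

  // Sketch of a fix: tolerate per-volume failures so the remaining
  // directories and commands are still processed.
  void processHeartbeatResponseTolerant(List<StorageDirectory> dirs) {
    for (StorageDirectory sd : dirs) {
      try {
        sd.setRollingUpgradeMarker();
      } catch (IOException e) {
        // Log and skip the failed volume instead of aborting.
        System.err.println("Skipping failed volume: " + e.getMessage());
      }
    }
    processCommands();
  }

  void processCommands() {
    // Stand-in for handling DNA commands, including block token key updates.
  }
}
{code}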

While this is happening, the block token key update does not happen, so all 
read and write requests start to fail until the upgrade is finalized and the 
DN receives a new key. All it takes is one failed volume. If there are three 
such nodes in the cluster, it is very likely that some blocks cannot be read. 
Unlike the common missing-block scenarios, the NN has no idea, although the 
effect is the same.


