DataNode fail-stops due to a bad disk (or storage directory)
------------------------------------------------------------
Key: HDFS-1223
URL: https://issues.apache.org/jira/browse/HDFS-1223
Project: Hadoop HDFS
Issue Type: Bug
Components: data-node
Affects Versions: 0.20.1
Reporter: Thanh Do

A datanode can store block files in multiple volumes. If a datanode sees a bad volume during startup (i.e., faces an exception when accessing that volume), it simply fail-stops, making all block files stored in the other, healthy volumes inaccessible. Consequently, those lost replicas must be regenerated later on other datanodes. If the datanode could instead mark the bad disk and continue working with the healthy ones, availability would improve and the unnecessary regeneration would be avoided.

As an extreme example, consider a datanode with two volumes, V1 and V2, each containing about 10000 64MB block files. During startup, the datanode gets an exception when accessing V1 and fail-stops, so all 20000 block files must be regenerated later. If the datanode instead marked V1 as bad and continued working with V2, the number of replicas to regenerate would be cut in half.

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and
Haryadi Gunawi (hary...@eecs.berkeley.edu)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
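A minimal sketch of the proposed behavior (this is hypothetical illustration code, not the actual DataNode implementation; the class and method names below are invented for this example): instead of letting the first inaccessible volume abort startup, scan all configured volumes, mark the bad ones, and continue with whatever is healthy.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Hadoop code: tolerate a bad volume at
// startup by skipping it rather than fail-stopping the whole node.
public class VolumeScan {

    /** Returns the subset of configured volumes that are usable. */
    static List<File> scanVolumes(List<File> configured) {
        List<File> healthy = new ArrayList<>();
        for (File vol : configured) {
            try {
                // In the behavior this issue describes, an exception
                // from a bad disk here kills the entire DataNode,
                // taking the healthy volumes down with it.
                if (vol.isDirectory() && vol.canRead() && vol.canWrite()) {
                    healthy.add(vol);
                }
                // else: mark the volume as bad, skip it, keep going
            } catch (SecurityException e) {
                // mark the volume as bad and continue with the rest
            }
        }
        return healthy;
    }

    public static void main(String[] args) {
        // One bad path (like V1 in the example) and one good one (V2).
        List<File> volumes = List.of(
                new File("/nonexistent/V1"),
                new File(System.getProperty("java.io.tmpdir")));
        List<File> ok = scanVolumes(volumes);
        System.out.println("usable volumes: " + ok.size());
    }
}
```

With this approach, the node in the example above would come up serving V2's ~10000 blocks, and only V1's replicas would need regeneration.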