He Xiaoqiao created HDFS-9068:
---------------------------------

             Summary: SBN checkpoint could not work after the only name directory recovery from failure
                 Key: HDFS-9068
                 URL: https://issues.apache.org/jira/browse/HDFS-9068
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 2.4.1
            Reporter: He Xiaoqiao

The Standby NameNode (SBN) periodically checkpoints to {{dfs.namenode.name.dir}}, but the checkpointer stops working when {{dfs.namenode.name.dir}} is configured with only one directory and the disk hosting that directory fails and later recovers.

The affected class is org.apache.hadoop.hdfs.server.namenode.FSImage:

{code:title=org.apache.hadoop.hdfs.server.namenode.FSImage.java|borderStyle=solid}
@Override
public void run() {
  try {
    saveFSImage(context, sd, nnf);
  } catch (SaveNamespaceCancelledException snce) {
    LOG.info("Cancelled image saving for " + sd.getRoot() + ": "
        + snce.getMessage());
    // don't report an error on the storage dir!
  } catch (Throwable t) {
    LOG.error("Unable to save image for " + sd.getRoot(), t);
    context.reportErrorOnStorageDirectory(sd);
  }
}
{code}

When {{saveFSImage(context, sd, nnf)}} fails because the storage is unavailable or has failed, {{context.reportErrorOnStorageDirectory(sd)}} adds sd to errorSDs. From then on sd is never used again, even after the disk recovers from the failure. Because the checkpointer keeps failing, the JournalNodes accumulate a large number of editlog files, and a NameNode restart will take a long time.
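
To make the failure mode concrete, here is a minimal, self-contained sketch. It is a toy model, not Hadoop code: the class {{CheckpointSketch}}, the nested {{Storage}} class, its methods, and the path {{/data/nn}} are all hypothetical names chosen for illustration. It only assumes what the report describes, namely that a directory which fails once is recorded in an error set and is skipped from then on.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the reported failure mode. Not Hadoop code; all names are hypothetical. */
public class CheckpointSketch {

    /** Stands in for the configured name directories and their error state. */
    static class Storage {
        private final List<String> dirs = new ArrayList<>();
        // Directories that failed once; plays the role of errorSDs in the report.
        private final Set<String> errorDirs = new HashSet<>();
        // Simulated disk state: directories that are currently writable.
        private final Set<String> healthyDisks = new HashSet<>();

        Storage(String... configured) {
            for (String d : configured) {
                dirs.add(d);
                healthyDisks.add(d);
            }
        }

        void failDisk(String dir) { healthyDisks.remove(dir); }

        void recoverDisk(String dir) { healthyDisks.add(dir); }

        /** Returns true if the image was saved to at least one directory. */
        boolean checkpoint() {
            int saved = 0;
            for (String d : dirs) {
                if (errorDirs.contains(d)) {
                    continue; // skipped forever: nothing ever removes d from errorDirs
                }
                if (healthyDisks.contains(d)) {
                    saved++;
                } else {
                    errorDirs.add(d); // analogous to reportErrorOnStorageDirectory(sd)
                }
            }
            return saved > 0;
        }
    }

    public static void main(String[] args) {
        Storage storage = new Storage("/data/nn"); // a single name directory
        System.out.println("healthy:        " + storage.checkpoint()); // true
        storage.failDisk("/data/nn");
        System.out.println("disk failed:    " + storage.checkpoint()); // false
        storage.recoverDisk("/data/nn");
        System.out.println("disk recovered: " + storage.checkpoint()); // false
    }
}
```

In this model the third checkpoint still fails although the disk is writable again, because the only configured directory remains in the error set. With more than one directory configured, the remaining healthy directories would still receive the image, which is why the problem shows up only in the single-directory case.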