[ https://issues.apache.org/jira/browse/HDFS-6145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13965255#comment-13965255 ]
Ding Yuan commented on HDFS-6145:
---------------------------------

Ping. Is there anything else I can help with from my side?

> Stopping unexpected exception from propagating to avoid serious consequences
> ----------------------------------------------------------------------------
>
>                 Key: HDFS-6145
>                 URL: https://issues.apache.org/jira/browse/HDFS-6145
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 2.2.0
>            Reporter: Ding Yuan
>
> There are a few cases where an exception should never have occurred, but the
> code simply logged it and let execution continue. Since these exceptions
> shouldn't occur at all, a safer approach may be to terminate execution
> immediately and stop them from propagating into unexpected consequences.
> ==========================
> Case 1:
> Line: 336, File: "org/apache/hadoop/hdfs/server/namenode/snapshot/INodeDirectorySnapshottable.java"
> {noformat}
> 325:    try {
> 326:      Quota.Counts counts = cleanSubtree(snapshot, prior, collectedBlocks,
> 327:          removedINodes, true);
> 328:      INodeDirectory parent = getParent();
>  .. ..
> 335:    } catch(QuotaExceededException e) {
> 336:      LOG.error("BUG: removeSnapshot increases namespace usage.", e);
> 337:    }
> {noformat}
> Since this shouldn't occur unless some unexpected bug is present, should the
> NN simply stop execution to prevent bad state from propagating?
> Similar handling of QuotaExceededException can be found at:
> Line: 544, File: "org/apache/hadoop/hdfs/server/namenode/INodeReference.java"
> Line: 657, File: "org/apache/hadoop/hdfs/server/namenode/INodeReference.java"
> Line: 669, File: "org/apache/hadoop/hdfs/server/namenode/INodeReference.java"
> ==========================================
> ==========================
> Case 2:
> Line: 601, File: "org/apache/hadoop/hdfs/server/namenode/JournalSet.java"
> {noformat}
> 591:  public synchronized RemoteEditLogManifest getEditLogManifest(long fromTxId,
>  ..
> 595:    for (JournalAndStream j : journals) {
>  ..
> 598:      try {
> 599:        allLogs.addAll(fjm.getRemoteEditLogs(fromTxId, forReading, false));
> 600:      } catch (Throwable t) {
> 601:        LOG.warn("Cannot list edit logs in " + fjm, t);
> 602:      }
> {noformat}
> An exception from addAll will result in some edit log files not being
> considered, and not included in the checkpoint, which may result in data loss.
> ==========================================
> ==========================
> Case 3:
> Line: 4029, File: "org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java"
> {noformat}
> 4010:      try {
> 4011:        while (fsRunning && shouldNNRmRun) {
> 4012:          checkAvailableResources();
> 4013:          if(!nameNodeHasResourcesAvailable()) {
> 4014:            String lowResourcesMsg = "NameNode low on available disk space. ";
> 4015:            if (!isInSafeMode()) {
> 4016:              FSNamesystem.LOG.warn(lowResourcesMsg + "Entering safe mode.");
> 4017:            } else {
> 4018:              FSNamesystem.LOG.warn(lowResourcesMsg + "Already in safe mode.");
> 4019:            }
> 4020:            enterSafeMode(true);
> 4021:          }
>  .. ..
> 4027:        }
> 4028:      } catch (Exception e) {
> 4029:        FSNamesystem.LOG.error("Exception in NameNodeResourceMonitor: ", e);
> 4030:      }
> {noformat}
> enterSafeMode might throw an exception. If the NN cannot enter safe mode,
> should execution simply terminate?
> ==========================================

--
This message was sent by Atlassian JIRA
(v6.2#6252)
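The fail-fast alternative proposed in the report (Case 1) could be sketched roughly as follows. This is a standalone illustration, not the real NameNode code: the names QuotaExceededException, cleanSubtree, and removeSnapshot only mirror the HDFS identifiers quoted above, and AssertionError stands in for whatever abort mechanism (e.g. terminating the NN) the project would actually choose.

```java
// Sketch of the fail-fast pattern: instead of logging a "should never
// happen" exception and continuing with possibly-corrupt state, wrap it
// in an unchecked error so the operation aborts immediately.
public class FailFastDemo {

    // Stand-in for org.apache.hadoop.hdfs.protocol.QuotaExceededException.
    static class QuotaExceededException extends Exception {
        QuotaExceededException(String msg) { super(msg); }
    }

    // Simulates the cleanup step; triggerBug=true models the "impossible" case.
    static void cleanSubtree(boolean triggerBug) throws QuotaExceededException {
        if (triggerBug) {
            throw new QuotaExceededException("namespace usage increased");
        }
    }

    static void removeSnapshot(boolean triggerBug) {
        try {
            cleanSubtree(triggerBug);
        } catch (QuotaExceededException e) {
            // Old behavior: LOG.error(...) and fall through silently.
            // Fail-fast alternative: surface the bug immediately.
            throw new AssertionError(
                "BUG: removeSnapshot increases namespace usage", e);
        }
    }

    public static void main(String[] args) {
        removeSnapshot(false);       // normal path: completes quietly
        try {
            removeSnapshot(true);    // buggy path: aborts loudly
        } catch (AssertionError expected) {
            System.out.println("aborted: " + expected.getMessage());
        }
    }
}
```

The same shape applies to Cases 2 and 3: catching a narrow, expected exception is fine, but a blanket catch of Throwable or Exception that only logs turns an invariant violation into silent data corruption, whereas an immediate abort keeps the damage bounded.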