Ding Yuan created HDFS-5931:
-------------------------------

             Summary: Potential bugs and improvements for exception handlers
                 Key: HDFS-5931
                 URL: https://issues.apache.org/jira/browse/HDFS-5931
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode, namenode
    Affects Versions: 2.2.0
            Reporter: Ding Yuan


Hi,
This is to report some improvements and potential bug fixes to some error handling code. Also attaching a patch for review.

==========================
Case 1:
org/apache/hadoop/hdfs/server/namenode/FSEditLog.java

In recoverUnclosedStreams, purgeLogsOlderThan, and endCurrentLogSegment, mapJournalsAndReportErrors throws an IOException when not enough journals are available. In some other paths (such as logEdit()), this condition is handled by a later call to logSync(), which terminates the process. In these three methods, however, the exception might be swallowed, since logSync() won't be called any time soon.

Propose to handle such failures immediately, the same way logSync() does. For example, in recoverUnclosedStreams:

{noformat}
  synchronized void recoverUnclosedStreams() {
    Preconditions.checkState(
        state == State.BETWEEN_LOG_SEGMENTS,
        "May not recover segments - wrong state: %s", state);
    try {
      journalSet.recoverUnfinalizedSegments();
    } catch (IOException ex) {
-      // All journals have failed, it is handled in logSync.
-      // TODO: are we sure this is OK?
+      // All journals have failed; handle it here instead of waiting for logSync().
+      final String msg = "Could not find enough journals.";
+      LOG.fatal(msg, new Exception());
+      IOUtils.cleanup(LOG, journalSet);
+      terminate(1, msg);
    }
  }
{noformat}
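For context, the fail-fast handling in logSync() that the fix above mirrors looks roughly like the following. This is a paraphrased sketch, not a verbatim excerpt from FSEditLog; the exact message text and fields (txid, synctxid) may differ in the actual source.

{noformat}
  // Inside logSync(), when flushing fails on every journal (sketch):
  } catch (IOException ex) {
    final String msg =
        "Could not sync enough journals to persistent storage. "
        + "Unsynced transactions: " + (txid - synctxid);
    LOG.fatal(msg, new Exception());
    IOUtils.cleanup(LOG, journalSet);
    terminate(1, msg);
  }
{noformat}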
==========================================

==========================
Case 2:
File: "org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceStorage.java"

{noformat}
@@ -446,6 +446,8 @@
           deleteDir(tmpDir);
         } catch (IOException ex) {
           LOG.error("Finalize upgrade for " + dataDirPath + " failed.", ex);
+          return; // return so users will not see confusing log messages where
+                  // "failed" is immediately followed by "complete"
         }
         LOG.info("Finalize upgrade for " + dataDirPath + " is complete.");
       }
{noformat}

In this case, the log can be confusing if deleteDir fails: the "Finalize upgrade ... failed" message will be immediately followed by "Finalize upgrade ... is complete".

The same pattern can be seen at:
  Line: 600, File: "org/apache/hadoop/hdfs/server/datanode/DataStorage.java"
  Line: 1250, File: "org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java"
==========================================

==========================
Case 3:
File: "org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java"

The log message doesn't include the exception cause.

{noformat}
@@ -167,9 +167,9 @@
         LOG.debug(this + " received versionRequest response: " + nsInfo);
         break;
       } catch(SocketTimeoutException e) {  // namenode is busy
-        LOG.warn("Problem connecting to server: " + nnAddr);
+        LOG.warn("Problem connecting to server: " + nnAddr, e);
       } catch(IOException e ) {  // namenode is not available
-        LOG.warn("Problem connecting to server: " + nnAddr);
+        LOG.warn("Problem connecting to server: " + nnAddr, e);
       }
{noformat}
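For reference, the difference is just which overload of the commons-logging Log API is used: warn(Object) records the message only, while warn(Object, Throwable) also records the stack trace of the cause. A minimal standalone illustration (the class name and host name below are made up for the example):

{noformat}
import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class LogCauseExample {
  private static final Log LOG = LogFactory.getLog(LogCauseExample.class);

  public static void main(String[] args) {
    IOException e = new IOException("Connection refused");
    LOG.warn("Problem connecting to server: nn-host:8020");     // message only; the cause is lost
    LOG.warn("Problem connecting to server: nn-host:8020", e);  // message plus the full stack trace
  }
}
{noformat}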
==========================================

==========================
Case 4:
File: "org/apache/hadoop/hdfs/server/datanode/BlockSender.java"

{noformat}
@@ -371,6 +371,7 @@
    IOException ioe = null;
    if (blockInFd != null &&
        ((dropCacheBehindAllReads) ||
         (dropCacheBehindLargeReads && isLongRead()))) {
      try {
        NativeIO.POSIX.getCacheManipulator().posixFadviseIfPossible(
            block.getBlockName(), blockInFd, lastCacheDropOffset,
            offset - lastCacheDropOffset,
            NativeIO.POSIX.POSIX_FADV_DONTNEED);
      } catch (IOException e) {
        LOG.warn("Unable to drop cache on file close", e);
+       ioe = e;
      }
    }

    if(checksumIn!=null) {
      try {
        checksumIn.close(); // close checksum file
      } catch (IOException e) {
        ioe = e;
      }
      checksumIn = null;

 .. ..
    if (ioe!=null) {
      throw ioe;
    }
{noformat}

It's unclear why the exception from posixFadviseIfPossible is not returned to the caller.
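A related point worth considering (a sketch only, not part of the attached patch): with the change above, a later failure in checksumIn.close() would overwrite the fadvise exception stored in ioe. Keeping only the first failure avoids masking the original cause. The helper names below are hypothetical stand-ins for the real calls:

{noformat}
  IOException ioe = null;
  try {
    dropCacheIfNeeded();        // hypothetical stand-in for posixFadviseIfPossible(...)
  } catch (IOException e) {
    LOG.warn("Unable to drop cache on file close", e);
    ioe = e;
  }
  try {
    closeChecksumStream();      // hypothetical stand-in for checksumIn.close()
  } catch (IOException e) {
    if (ioe == null) {
      ioe = e;                  // keep the first exception; don't let cleanup errors mask it
    }
  }
  if (ioe != null) {
    throw ioe;
  }
{noformat}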
==========================================

==========================
Case 5:
File: org/apache/hadoop/hdfs/server/datanode/DataXceiver.java

{noformat}
    } finally {
      dataXceiverServer.balanceThrottler.release();
      if (isOpSuccess) {
@@ -751,7 +751,8 @@
         try {
           // send one last byte to indicate that the resource is cleaned.
           reply.writeChar('d');
-        } catch (IOException ignored) {
+        } catch (IOException e) {
+          LOG.warn("Exception caught after copy block succeeded", e);
         }
       }
       IOUtils.closeStream(reply);
{noformat}

It is unclear why the IOException is not logged.

Similar case:
  Line: 876, File: "org/apache/hadoop/hdfs/server/datanode/DataXceiver.java"
==========================================




--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
