[ 
https://issues.apache.org/jira/browse/YARN-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ding Yuan updated YARN-1677:
----------------------------

    Description: 
Hi Yarn developers,
We are a group of researchers on software reliability, and recently we did a 
study and found that majority of the most severe failures in hadoop are caused 
by bugs in exception handling logic. Therefore we built a simple checking tool 
that automatically detects some bug patterns that have caused some very severe 
failures. I am reporting some of the results for Yarn here. Any feedback is 
much appreciated!

==========================
Case 1:
Line: 551, File: 
"org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java"

{noformat}
    switch (monitoringEvent.getType()) {
    case START_MONITORING_CONTAINER:
      .. ..
    default:
      // TODO: Wrong event.
    }
{noformat}

The switch fall-through (handling any potential unexpected event) is empty. 
Should we at least print an error message here?
==========================

==========================
Case 2:
  Line: 491, File: 
"org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java"
{noformat}
          } catch (Throwable e) {

            // TODO Better error handling. Thread can die with the rest of the
            // NM still running.
            LOG.error("Caught exception in status-updater", e);
          }
{noformat}

The handler of this very general exception only logs the error. The TODO seems 
to indicate it is not sufficient.
==========================

==========================
Case 3:
Line: 861, File: 
"org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java"

   for (LocalResourceStatus stat : remoteResourceStatuses) {
        LocalResource rsrc = stat.getResource();
        LocalResourceRequest req = null;
        try {
          req = new LocalResourceRequest(rsrc);
        } catch (URISyntaxException e) {
          // TODO fail? Already translated several times...
        }

The handler for URISyntaxException is empty, and the TODO seems to indicate it 
is not sufficient.
The same code pattern can also be found at:
Line: 901, File: 
"org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java"
Line: 838, File: 
"org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java"
Line: 878, File: 
"org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java"
At line: 803, File: 
org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java, 
the handler of URISyntaxException also seems not sufficient:
{noformat}
       try {
          shellRsrc.setResource(ConverterUtils.getYarnUrlFromURI(new URI(
              shellScriptPath)));
        } catch (URISyntaxException e) {
          LOG.error("Error when trying to use shell script path specified"
              + " in env, path=" + shellScriptPath);
          e.printStackTrace();

          // A failure scenario on bad input such as invalid shell script path
          // We know we cannot continue launching the container
          // so we should release it.
          // TODO
          numCompletedContainers.incrementAndGet();
          numFailedContainers.incrementAndGet();
          return;
        }
{noformat}
==========================

==========================
Case 4:
Line: 627, File: 
"org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java"

{noformat}
      try {
        /* keep the master in sync with the state machine */
        this.stateMachine.doTransition(event.getType(), event);
      } catch (InvalidStateTransitonException e) {
        LOG.error("Can't handle this event at current state", e);
        /* TODO fail the application on the failed transition */
      }
{noformat}

The handler of this exception only logs the error. The TODO seems to indicate 
it is not sufficient.

This exact same code pattern can also be found at:
Line: 573, File: 
"org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java"
==========================

==========================
Case 5: empty handler for exception: java.lang.InterruptedException
  Line: 123, File: "org/apache/hadoop/yarn/server/webproxy/WebAppProxy.java"

{noformat}
  public void join() {
    if(proxyServer != null) {
      try {
        proxyServer.join();
      } catch (InterruptedException e) {
      }
    }
  }
{noformat}

The InterruptedException is completely ignored. As a result, any events causing 
this interrupt will be lost.

More info on why InterruptedException shouldn't be ignored: 
http://stackoverflow.com/questions/1087475/when-does-javas-thread-sleep-throw-interruptedexception

This pattern of handling InterruptedException can be found in a few other 
places:
Line: 434, File: 
org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
{noformat}
          try {
            event = eventQueue.take();
          } catch (InterruptedException e) {
            LOG.error("Returning, interrupted : " + e);
            return; // TODO: Kill RM.
          }
{noformat}
Line: 722, File: 
"org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java"
Line: 191, File: 
"org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java"
Line: 393, File: 
"org/apache/hadoop/yarn/util/LinuxResourceCalculatorPlugin.java"
Line: 258, File: 
"org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalizedResource.java"
==========================

==========================
Case 6: potential divide by zero
line: 581, File: 
org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java
{noformat}
    int availableContainers =
      node.getAvailableResource().getMemory() / capability.getMemory(); // 
TODO: A buggy
   // application
    // with this
    // zero would
    // crash the
    // scheduler.
{noformat}

Should this potential divide by zero be handled?
==========================

==========================
Case 7:
  Line: 260, File: 
"org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalizedResource.java"
{noformat}
      } catch (YarnException e) {
        // TODO cleanup
        return;
      }
{noformat}

The error handler simply returns without proper clean up.
==========================

  was:
Hi Yarn developers,
We are a group of researchers on software reliability, and recently we did a 
study and found that majority of the most severe failures in hadoop are caused 
by bugs in exception handling logic. Therefore we built a simple checking tool 
that automatically detects some bug patterns that have caused some very severe 
failures. I am reporting some of the results for Yarn here. Any feedback is 
much appreciated!

==========================
Case 1:
Line: 551, File: 
"org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java"

{noformat}
    switch (monitoringEvent.getType()) {
    case START_MONITORING_CONTAINER:
      .. ..
    default:
      // TODO: Wrong event.
    }
{noformat}

The switch fall-through (handling any potential unexpected event) is empty. 
Should we at least print an error message here?
==========================

==========================
Case 2:
  Line: 491, File: 
"org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java"
{noformat}
          } catch (Throwable e) {

            // TODO Better error handling. Thread can die with the rest of the
            // NM still running.
            LOG.error("Caught exception in status-updater", e);
          }
{noformat}

The handler of this very general exception only logs the error. The TODO seems 
to indicate it is not sufficient.
==========================

==========================
Case 3:
Line: 861, File: 
"org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java"

   for (LocalResourceStatus stat : remoteResourceStatuses) {
        LocalResource rsrc = stat.getResource();
        LocalResourceRequest req = null;
        try {
          req = new LocalResourceRequest(rsrc);
        } catch (URISyntaxException e) {
          // TODO fail? Already translated several times...
        }

The handler for URISyntaxException is empty, and the TODO seems to indicate it 
is not sufficient.
The same code pattern can also be found at:
Line: 901, File: 
"org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java"
Line: 838, File: 
"org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java"
Line: 878, File: 
"org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java"
At line: 803, File: 
org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java, 
the handler of URISyntaxException also seems not sufficient:
{noformat}
       try {
          shellRsrc.setResource(ConverterUtils.getYarnUrlFromURI(new URI(
              shellScriptPath)));
        } catch (URISyntaxException e) {
          LOG.error("Error when trying to use shell script path specified"
              + " in env, path=" + shellScriptPath);
          e.printStackTrace();

          // A failure scenario on bad input such as invalid shell script path
          // We know we cannot continue launching the container
          // so we should release it.
          // TODO
          numCompletedContainers.incrementAndGet();
          numFailedContainers.incrementAndGet();
          return;
        }
{noformat}
==========================

==========================
Case 4:
Line: 627, File: 
"org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java"

{noformat}
      try {
        /* keep the master in sync with the state machine */
        this.stateMachine.doTransition(event.getType(), event);
      } catch (InvalidStateTransitonException e) {
        LOG.error("Can't handle this event at current state", e);
        /* TODO fail the application on the failed transition */
      }
{noformat}

The handler of this exception only logs the error. The TODO seems to indicate 
it is not sufficient.

This exact same code pattern can also be found at:
Line: 573, File: 
"org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java"
==========================

==========================
Case 5: empty handler for exception: java.lang.InterruptedException
  Line: 123, File: "org/apache/hadoop/yarn/server/webproxy/WebAppProxy.java"

{noformat}
  public void join() {
    if(proxyServer != null) {
      try {
        proxyServer.join();
      } catch (InterruptedException e) {
      }
    }
  }
{noformat}

The InterruptedException is completely ignored. As a result, any events causing 
this interrupt will be lost.

More info on why InterruptedException shouldn't be ignored: 
http://stackoverflow.com/questions/1087475/when-does-javas-thread-sleep-throw-interruptedexception

This pattern of handling InterruptedException can be found in a few other 
places:
Line: 434, File: 
org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
{noformat}
          try {
            event = eventQueue.take();
          } catch (InterruptedException e) {
            LOG.error("Returning, interrupted : " + e);
            return; // TODO: Kill RM.
          }
{noformat}
Line: 722, File: 
"org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java"
Line: 191, File: 
"org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java"
Line: 393, File: 
"org/apache/hadoop/yarn/util/LinuxResourceCalculatorPlugin.java"
Line: 258, File: 
"org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalizedResource.java"
==========================

==========================
Case 6: potential divide by zero
line: 581, File: 
org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java
{noformat}
    int availableContainers =
      node.getAvailableResource().getMemory() / capability.getMemory(); // 
TODO: A buggy
                                                                        // 
application
                                                                        // with 
this
                                                                        // zero 
would
                                                                        // 
crash the
                                                                        // 
scheduler.
{noformat}

Should this potential divide by zero be handled?
==========================

==========================
Case 7:
  Line: 260, File: 
"org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalizedResource.java"
{noformat}
      } catch (YarnException e) {
        // TODO cleanup
        return;
      }
{noformat}

The error handler simply returns without proper clean up.
==========================


> Potential bugs in exception handlers
> ------------------------------------
>
>                 Key: YARN-1677
>                 URL: https://issues.apache.org/jira/browse/YARN-1677
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.2.0
>            Reporter: Ding Yuan
>
> Hi Yarn developers,
> We are a group of researchers on software reliability, and recently we did a 
> study and found that majority of the most severe failures in hadoop are 
> caused by bugs in exception handling logic. Therefore we built a simple 
> checking tool that automatically detects some bug patterns that have caused 
> some very severe failures. I am reporting some of the results for Yarn here. 
> Any feedback is much appreciated!
> ==========================
> Case 1:
> Line: 551, File: 
> "org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java"
> {noformat}
>     switch (monitoringEvent.getType()) {
>     case START_MONITORING_CONTAINER:
>       .. ..
>     default:
>       // TODO: Wrong event.
>     }
> {noformat}
> The switch fall-through (handling any potential unexpected event) is empty. 
> Should we at least print an error message here?
> ==========================
> ==========================
> Case 2:
>   Line: 491, File: 
> "org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java"
> {noformat}
>           } catch (Throwable e) {
>             // TODO Better error handling. Thread can die with the rest of the
>             // NM still running.
>             LOG.error("Caught exception in status-updater", e);
>           }
> {noformat}
> The handler of this very general exception only logs the error. The TODO 
> seems to indicate it is not sufficient.
> ==========================
> ==========================
> Case 3:
> Line: 861, File: 
> "org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java"
>    for (LocalResourceStatus stat : remoteResourceStatuses) {
>         LocalResource rsrc = stat.getResource();
>         LocalResourceRequest req = null;
>         try {
>           req = new LocalResourceRequest(rsrc);
>         } catch (URISyntaxException e) {
>           // TODO fail? Already translated several times...
>         }
> The handler for URISyntaxException is empty, and the TODO seems to indicate 
> it is not sufficient.
> The same code pattern can also be found at:
> Line: 901, File: 
> "org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java"
> Line: 838, File: 
> "org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java"
> Line: 878, File: 
> "org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java"
> At line: 803, File: 
> org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java, 
> the handler of URISyntaxException also seems not sufficient:
> {noformat}
>        try {
>           shellRsrc.setResource(ConverterUtils.getYarnUrlFromURI(new URI(
>               shellScriptPath)));
>         } catch (URISyntaxException e) {
>           LOG.error("Error when trying to use shell script path specified"
>               + " in env, path=" + shellScriptPath);
>           e.printStackTrace();
>           // A failure scenario on bad input such as invalid shell script path
>           // We know we cannot continue launching the container
>           // so we should release it.
>           // TODO
>           numCompletedContainers.incrementAndGet();
>           numFailedContainers.incrementAndGet();
>           return;
>         }
> {noformat}
> ==========================
> ==========================
> Case 4:
> Line: 627, File: 
> "org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java"
> {noformat}
>       try {
>         /* keep the master in sync with the state machine */
>         this.stateMachine.doTransition(event.getType(), event);
>       } catch (InvalidStateTransitonException e) {
>         LOG.error("Can't handle this event at current state", e);
>         /* TODO fail the application on the failed transition */
>       }
> {noformat}
> The handler of this exception only logs the error. The TODO seems to indicate 
> it is not sufficient.
> This exact same code pattern can also be found at:
> Line: 573, File: 
> "org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java"
> ==========================
> ==========================
> Case 5: empty handler for exception: java.lang.InterruptedException
>   Line: 123, File: "org/apache/hadoop/yarn/server/webproxy/WebAppProxy.java"
> {noformat}
>   public void join() {
>     if(proxyServer != null) {
>       try {
>         proxyServer.join();
>       } catch (InterruptedException e) {
>       }
>     }
>   }
> {noformat}
> The InterruptedException is completely ignored. As a result, any events 
> causing this interrupt will be lost.
> More info on why InterruptedException shouldn't be ignored: 
> http://stackoverflow.com/questions/1087475/when-does-javas-thread-sleep-throw-interruptedexception
> This pattern of handling InterruptedException can be found in a few other 
> places:
> Line: 434, File: 
> org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
> {noformat}
>           try {
>             event = eventQueue.take();
>           } catch (InterruptedException e) {
>             LOG.error("Returning, interrupted : " + e);
>             return; // TODO: Kill RM.
>           }
> {noformat}
> Line: 722, File: 
> "org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java"
> Line: 191, File: 
> "org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java"
> Line: 393, File: 
> "org/apache/hadoop/yarn/util/LinuxResourceCalculatorPlugin.java"
> Line: 258, File: 
> "org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalizedResource.java"
> ==========================
> ==========================
> Case 6: potential divide by zero
> line: 581, File: 
> org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java
> {noformat}
>     int availableContainers =
>       node.getAvailableResource().getMemory() / capability.getMemory(); // 
> TODO: A buggy
>    // application
>     // with this
>     // zero would
>     // crash the
>     // scheduler.
> {noformat}
> Should this potential divide by zero be handled?
> ==========================
> ==========================
> Case 7:
>   Line: 260, File: 
> "org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalizedResource.java"
> {noformat}
>       } catch (YarnException e) {
>         // TODO cleanup
>         return;
>       }
> {noformat}
> The error handler simply returns without proper clean up.
> ==========================



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

Reply via email to