Ding Yuan created YARN-1677: ------------------------------- Summary: Potential bugs in exception handlers Key: YARN-1677 URL: https://issues.apache.org/jira/browse/YARN-1677 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Ding Yuan
Hi Yarn developers, We are a group of researchers on software reliability, and recently we did a study and found that majority of the most severe failures in hadoop are caused by bugs in exception handling logic. Therefore we built a simple checking tool that automatically detects some bug patterns that have caused some very severe failures. I am reporting some of the results for Yarn here. Any feedback is much appreciated! ========================== Case 1: Line: 551, File: "org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java" {noformat} switch (monitoringEvent.getType()) { case START_MONITORING_CONTAINER: .. .. default: // TODO: Wrong event. } {noformat} The switch fall-through (handling any potential unexpected event) is empty. Should we at least print an error message here? ========================== ========================== Case 2: Line: 491, File: "org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java" {noformat} } catch (Throwable e) { // TODO Better error handling. Thread can die with the rest of the // NM still running. LOG.error("Caught exception in status-updater", e); } {noformat} The handler of this very general exception only logs the error. The TODO seems to indicate it is not sufficient. ========================== ========================== Case 3: Line: 861, File: "org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java" for (LocalResourceStatus stat : remoteResourceStatuses) { LocalResource rsrc = stat.getResource(); LocalResourceRequest req = null; try { req = new LocalResourceRequest(rsrc); } catch (URISyntaxException e) { // TODO fail? Already translated several times... } The handler for URISyntaxException is empty, and the TODO seems to indicate it is not sufficient. The same code pattern can also be found at: Line: 901, File: "org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java" Line: 838, File: "org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java" Line: 878, File: "org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java" At line: 803, File: org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java, the handler of URISyntaxException also seems not sufficient: {noformat} try { shellRsrc.setResource(ConverterUtils.getYarnUrlFromURI(new URI( shellScriptPath))); } catch (URISyntaxException e) { LOG.error("Error when trying to use shell script path specified" + " in env, path=" + shellScriptPath); e.printStackTrace(); // A failure scenario on bad input such as invalid shell script path // We know we cannot continue launching the container // so we should release it. // TODO numCompletedContainers.incrementAndGet(); numFailedContainers.incrementAndGet(); return; } {noformat} ========================== ========================== Case 4: Line: 627, File: "org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java" {noformat} try { /* keep the master in sync with the state machine */ this.stateMachine.doTransition(event.getType(), event); } catch (InvalidStateTransitonException e) { LOG.error("Can't handle this event at current state", e); /* TODO fail the application on the failed transition */ } {noformat} The handler of this exception only logs the error. The TODO seems to indicate it is not sufficient. This exact same code pattern can also be found at: Line: 573, File: "org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java" ========================== ========================== Case 5: empty handler for exception: java.lang.InterruptedException Line: 123, File: "org/apache/hadoop/yarn/server/webproxy/WebAppProxy.java" {noformat} public void join() { if(proxyServer != null) { try { proxyServer.join(); } catch (InterruptedException e) { } } } {noformat} The InterruptedException is completely ignored. As a result, any events causing this interrupt will be lost. More info on why InterruptedException shouldn't be ignored: http://stackoverflow.com/questions/1087475/when-does-javas-thread-sleep-throw-interruptedexception This pattern of handling InterruptedException can be found in a few other places: Line: 434, File: org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java {noformat} try { event = eventQueue.take(); } catch (InterruptedException e) { LOG.error("Returning, interrupted : " + e); return; // TODO: Kill RM. } {noformat} Line: 722, File: "org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java" Line: 191, File: "org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java" Line: 393, File: "org/apache/hadoop/yarn/util/LinuxResourceCalculatorPlugin.java" Line: 258, File: "org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalizedResource.java" ========================== ========================== Case 6: potential divide by zero line: 581, File: org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java {noformat} int availableContainers = node.getAvailableResource().getMemory() / capability.getMemory(); // TODO: A buggy // application // with this // zero would // crash the // scheduler. {noformat} Should this potential divide by zero be handled? ========================== ========================== Case 7: Line: 260, File: "org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalizedResource.java" {noformat} } catch (YarnException e) { // TODO cleanup return; } {noformat} The error handler simply returns without proper clean up. ========================== -- This message was sent by Atlassian JIRA (v6.1.5#6160)