[ https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13784503#comment-13784503 ]
Bikas Saha commented on YARN-867:
---------------------------------

Probably we can ignore the error here since the container has already failed.
{code}
     // From LOCALIZATION_FAILED State
     .addTransition(ContainerState.LOCALIZATION_FAILED,
@@ -180,6 +184,9 @@ public ContainerImpl(Configuration conf, Dispatcher dispatcher,
     .addTransition(ContainerState.LOCALIZATION_FAILED,
         ContainerState.LOCALIZATION_FAILED,
         ContainerEventType.RESOURCE_FAILED)
+    .addTransition(ContainerState.LOCALIZATION_FAILED, ContainerState.EXITED_WITH_FAILURE,
+        ContainerEventType.CONTAINER_EXITED_WITH_FAILURE,
+        new ExitedWithFailureTransition(false))
{code}

Probably have one try/catch instead of multiple.

Can we rename AUXSERVICE_FAIL to AUXSERVICE_ERROR, since the service probably hasn't failed?

TestAuxService needs an addition for the new code.

TestContainer: the new test can be made simpler by not mocking AuxServiceHandler and instead sending the failed event directly, like it's done for other tests there.

In AuxService.handle(APPLICATION_INIT) and other places like that, if the service does not exist then we should fail too. Zhijie, we should err on the side of caution here and fail the container. If we see real use cases where the failure can be ignored, then we can make that improvement.

> Isolation of failures in aux services
> --------------------------------------
>
>                 Key: YARN-867
>                 URL: https://issues.apache.org/jira/browse/YARN-867
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Hitesh Shah
>            Assignee: Xuan Gong
>            Priority: Critical
>         Attachments: YARN-867.1.sampleCode.patch, YARN-867.3.patch, YARN-867.4.patch, YARN-867.sampleCode.2.patch
>
>
> Today, a malicious application can bring down the NM by sending bad data to a service. For example, sending data to the ShuffleService such that it results in any non-IOException will cause the NM's async dispatcher to exit, as the service's INIT APP event is not handled properly.
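
For illustration only, the isolation discussed above (a single try/catch around the service call, and failing when the named service does not exist) might be sketched as follows. This is not the patch's code or the NodeManager's classes: AuxServiceIsolationSketch, AuxService, AuxServices, and handleApplicationInit are hypothetical names, and the AUXSERVICE_ERROR string only echoes the rename proposed in the comment.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AuxServiceIsolationSketch {

    /** Hypothetical stand-in for an auxiliary service; not the Hadoop interface. */
    public interface AuxService {
        void initializeApplication(String appId, byte[] data);
    }

    /** Hypothetical registry that isolates service failures from its caller. */
    public static class AuxServices {
        private final Map<String, AuxService> services = new HashMap<>();
        private final List<String> errors = new ArrayList<>();

        public void register(String name, AuxService service) {
            services.put(name, service);
        }

        /**
         * Returns true if the service handled the init call. A false return
         * means the caller should fail the container rather than let the
         * exception escape into the event-dispatching thread.
         */
        public boolean handleApplicationInit(String serviceName, String appId, byte[] data) {
            AuxService service = services.get(serviceName);
            if (service == null) {
                // Err on the side of caution: a missing service is a failure too.
                errors.add("AUXSERVICE_ERROR: no service named " + serviceName);
                return false;
            }
            try {
                // One try/catch around the whole call, covering non-IOExceptions too.
                service.initializeApplication(appId, data);
                return true;
            } catch (Exception e) {
                errors.add("AUXSERVICE_ERROR: " + serviceName + " threw " + e);
                return false;
            }
        }

        public List<String> getErrors() {
            return errors;
        }
    }

    public static void main(String[] args) {
        AuxServices aux = new AuxServices();
        aux.register("good", (appId, data) -> { });
        aux.register("bad", (appId, data) -> {
            throw new IllegalStateException("bad data");
        });
        System.out.println("good=" + aux.handleApplicationInit("good", "app_1", new byte[0]));
        System.out.println("bad=" + aux.handleApplicationInit("bad", "app_1", new byte[0]));
        System.out.println("missing=" + aux.handleApplicationInit("missing", "app_1", new byte[0]));
        System.out.println(aux.getErrors());
    }
}
```

In this sketch the caller decides what to do with a false return; per the comment above, the cautious choice would be to fail the container.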