[ https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13784745#comment-13784745 ]
Bikas Saha commented on YARN-867:
---------------------------------

Why is this check needed?
{code}
+  private void handleAuxServiceFail(AuxServicesEvent event, Throwable th) {
+    if (event.getType() instanceof AuxServicesEventType) {
+      Container container = event.getContainer();
{code}
If the container has already failed, why do we need to change its state again?
{code}
+        .addTransition(ContainerState.LOCALIZATION_FAILED, ContainerState.EXITED_WITH_FAILURE,
+            ContainerEventType.CONTAINER_EXITED_WITH_FAILURE,
+            new ExitedWithFailureTransition(false))
{code}
{code}
+        .addTransition(ContainerState.CONTAINER_CLEANEDUP_AFTER_KILL,
+            ContainerState.EXITED_WITH_FAILURE,
+            ContainerEventType.CONTAINER_EXITED_WITH_FAILURE,
+            new ExitedWithFailureTransition(false))
{code}
Why is CONTAINER_EXITED_WITH_FAILURE not being handled while the container state is localized/running?

Why are extra events being ignored in addition to ContainerEventType.CONTAINER_EXITED_WITH_FAILURE?
{code}
+            ContainerState.EXITED_WITH_FAILURE,
+            EnumSet.of(
+                ContainerEventType.KILL_CONTAINER,
+                ContainerEventType.CONTAINER_EXITED_WITH_FAILURE,
+                ContainerEventType.RESOURCE_LOCALIZED,
+                ContainerEventType.RESOURCE_FAILED,
+                ContainerEventType.CONTAINER_LAUNCHED,
+                ContainerEventType.CONTAINER_EXITED_WITH_SUCCESS,
+                ContainerEventType.CONTAINER_KILLED_ON_REQUEST))
{code}
{code}
+        .addTransition(ContainerState.DONE, ContainerState.DONE,
+            EnumSet.of(
+                ContainerEventType.RESOURCE_LOCALIZED,
+                ContainerEventType.CONTAINER_LAUNCHED,
+                ContainerEventType.CONTAINER_EXITED_WITH_FAILURE,
+                ContainerEventType.CONTAINER_RESOURCES_CLEANEDUP,
+                ContainerEventType.CONTAINER_EXITED_WITH_SUCCESS,
+                ContainerEventType.CONTAINER_KILLED_ON_REQUEST))
{code}
Can you please check whether ExitedWithFailureTransition(true) needs to be called in the places where the patch adds ExitedWithFailureTransition(false)? Is cleanup required?

Do the new tests fail without the changes?
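
To make the questions above concrete, here is a minimal, self-contained sketch of the two ideas under discussion: containing an aux-service failure so that it fails only the container, and letting a container that has already failed ignore late events. Every name in it (AuxFailureSketch, State, Event, AuxService, IGNORED_AFTER_FAILURE, transition, dispatch) is a hypothetical stand-in invented for illustration. It is not the patch and does not use the NodeManager classes.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical stand-ins only; these are NOT the Hadoop YARN classes.
public class AuxFailureSketch {
    public enum State { RUNNING, EXITED_WITH_FAILURE }

    public enum Event { CONTAINER_EXITED_WITH_FAILURE, CONTAINER_LAUNCHED, KILL_CONTAINER }

    // Illustrative subset of events that an already-failed container ignores.
    public static final Set<Event> IGNORED_AFTER_FAILURE = EnumSet.of(
        Event.CONTAINER_EXITED_WITH_FAILURE,
        Event.CONTAINER_LAUNCHED,
        Event.KILL_CONTAINER);

    // Hypothetical aux service whose per-application init may throw anything.
    public interface AuxService {
        void initApp(String appId) throws Exception;
    }

    // Tiny transition function: self-loop for ignored events, one real
    // transition, and an exception for any pair that is not registered.
    public static State transition(State current, Event event) {
        if (current == State.EXITED_WITH_FAILURE && IGNORED_AFTER_FAILURE.contains(event)) {
            return current; // already failed, so there is nothing left to change
        }
        if (current == State.RUNNING && event == Event.CONTAINER_EXITED_WITH_FAILURE) {
            return State.EXITED_WITH_FAILURE;
        }
        throw new IllegalStateException("Unhandled event " + event + " in state " + current);
    }

    // Calls the service and converts any failure into a container-level
    // failure event instead of letting it escape to the calling thread.
    public static State dispatch(State current, AuxService service, String appId) {
        try {
            service.initApp(appId);
            return current;
        } catch (Throwable t) {
            // Assumption: catching Throwable is acceptable here because the
            // goal is to keep the dispatching thread alive.
            return transition(current, Event.CONTAINER_EXITED_WITH_FAILURE);
        }
    }

    public static void main(String[] args) {
        AuxService bad = appId -> { throw new IllegalStateException("bad data"); };
        System.out.println(dispatch(State.RUNNING, bad, "app_1"));
        System.out.println(dispatch(State.EXITED_WITH_FAILURE, bad, "app_1"));
    }
}
```

In this sketch, a failure raised for a container that is already in EXITED_WITH_FAILURE leaves the state unchanged, which is the behaviour the second question is probing. Whether the real state machine should self-loop, transition again, or run cleanup is exactly what the patch needs to justify.
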
> Isolation of failures in aux services
> --------------------------------------
>
>                 Key: YARN-867
>                 URL: https://issues.apache.org/jira/browse/YARN-867
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Hitesh Shah
>            Assignee: Xuan Gong
>            Priority: Critical
>         Attachments: YARN-867.1.sampleCode.patch, YARN-867.3.patch,
> YARN-867.4.patch, YARN-867.5.patch, YARN-867.sampleCode.2.patch
>
>
> Today, a malicious application can bring down the NM by sending bad data to a
> service. For example, sending data to the ShuffleService such that it results
> in any non-IOException will cause the NM's async dispatcher to exit, as the
> service's INIT APP event is not handled properly.



--
This message was sent by Atlassian JIRA
(v6.1#6144)