[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13784503#comment-13784503
 ] 

Bikas Saha commented on YARN-867:
---------------------------------

Probably we can ignore the error here since the container has already failed.
{code}
     // From LOCALIZATION_FAILED State
     .addTransition(ContainerState.LOCALIZATION_FAILED,
@@ -180,6 +184,9 @@ public ContainerImpl(Configuration conf, Dispatcher dispatcher,
     .addTransition(ContainerState.LOCALIZATION_FAILED,
         ContainerState.LOCALIZATION_FAILED,
         ContainerEventType.RESOURCE_FAILED)
+    .addTransition(ContainerState.LOCALIZATION_FAILED, ContainerState.EXITED_WITH_FAILURE,
+        ContainerEventType.CONTAINER_EXITED_WITH_FAILURE,
+        new ExitedWithFailureTransition(false))
{code}

We should probably have a single try/catch instead of multiple ones.
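
Roughly what I mean, as a minimal sketch. All names here (AuxEventKind, HypotheticalService, SingleTryCatchSketch) are made up for illustration and are not taken from the patch, and catching Throwable is an assumption about how broad the catch should be:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; none of these names come from the YARN-867 patch.
enum AuxEventKind { APPLICATION_INIT, APPLICATION_STOP }

interface HypotheticalService {
  void initApp(String appId);
  void stopApp(String appId);
}

class SingleTryCatchSketch {
  // Records error notifications in place of events sent to a real dispatcher.
  final List<String> errors = new ArrayList<>();

  void handle(AuxEventKind kind, String appId, HypotheticalService service) {
    try {
      switch (kind) {
        case APPLICATION_INIT:
          service.initApp(appId);
          break;
        case APPLICATION_STOP:
          service.stopApp(appId);
          break;
        default:
          throw new IllegalStateException("Unknown event kind: " + kind);
      }
    } catch (Throwable t) {
      // One catch covers every branch: report the error instead of
      // letting it propagate to the caller.
      errors.add(kind + " failed for " + appId + ": " + t);
    }
  }
}
```

With one handler around the whole switch, every event type gets the same error reporting, and a new case cannot forget its own try/catch.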

Can we rename AUXSERVICE_FAIL to AUXSERVICE_ERROR, since the service probably 
hasn't failed?

TestAuxService needs an addition for the new code.

TestContainer - the new test can be made simpler by not mocking AuxServiceHandler 
and instead sending the failed event directly, as is done for other tests 
there.

In AuxService.handle(APPLICATION_INIT) and other places like that, if the 
service does not exist then we should fail too.
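
Something along these lines, as a sketch only. MissingServiceSketch, its registry, and the returned error strings are hypothetical and are not the NodeManager's actual classes or events; the point is just that an unknown service name produces a failure instead of being skipped silently:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; not the actual NodeManager code.
class MissingServiceSketch {
  // Registry of aux services keyed by service name; each Runnable
  // stands in for that service's application-init hook.
  private final Map<String, Runnable> services = new HashMap<>();

  void register(String name, Runnable initApp) {
    services.put(name, initApp);
  }

  // Returns an empty Optional on success, or an error message that the
  // caller would turn into a failure event for the container.
  Optional<String> handleApplicationInit(String serviceName) {
    Runnable service = services.get(serviceName);
    if (service == null) {
      // A missing service is treated as a failure, not ignored.
      return Optional.of("Unknown aux service: " + serviceName);
    }
    try {
      service.run();
      return Optional.empty();
    } catch (Throwable t) {
      return Optional.of("Aux service " + serviceName + " threw: " + t);
    }
  }
}
```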

Zhijie, we should err on the side of caution here and fail the container. If we 
see real use cases where failure can be ignored then we can make that 
improvement.

> Isolation of failures in aux services 
> --------------------------------------
>
>                 Key: YARN-867
>                 URL: https://issues.apache.org/jira/browse/YARN-867
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Hitesh Shah
>            Assignee: Xuan Gong
>            Priority: Critical
>         Attachments: YARN-867.1.sampleCode.patch, YARN-867.3.patch, 
> YARN-867.4.patch, YARN-867.sampleCode.2.patch
>
>
> Today, a malicious application can bring down the NM by sending bad data to a 
> service. For example, sending data to the ShuffleService such that it results 
> in any non-IOException will cause the NM's async dispatcher to exit as the 
> service's INIT APP event is not handled properly. 



--
This message was sent by Atlassian JIRA
(v6.1#6144)
