[ 
https://issues.apache.org/jira/browse/SLIDER-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15770763#comment-15770763
 ] 

Billie Rinaldi commented on SLIDER-1181:
----------------------------------------

No, the problem is that doing nothing in the onError callback will leave the 
Slider AM in a bad state: the AMRMClientAsync connection to the RM will be 
broken, because its threads will no longer be running. That holds unless the 
YARN-5999 patch is applied. An interim solution (if YARN-5999 is not 
available) might be to have the AM halt instead of shutting down the entire 
application, as initially proposed in 
YARN-5996-yarn-native-services.001.patch. Then, when the AM is brought back 
up, it will set up a new AMRMClientAsync instance, and the app should 
continue running.
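
To illustrate the halt-and-restart idea, here is a minimal, self-contained 
sketch. It does not use the real Hadoop API: FakeRmClient and AmAttempt are 
hypothetical stand-ins invented for this example. It assumes only what is 
described above, namely that the client's threads stop on error and that a 
restarted AM builds a new client instance.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class AmHaltSketch {

    /** Hypothetical stand-in for the RM client; NOT the real AMRMClientAsync API. */
    static final class FakeRmClient {
        private final AtomicBoolean threadsRunning = new AtomicBoolean(true);

        /** Models the error path: the client's threads stop and never restart. */
        void breakConnection() {
            threadsRunning.set(false);
        }

        boolean isUsable() {
            return threadsRunning.get();
        }
    }

    /** Hypothetical AM attempt; each attempt builds its own client instance. */
    static final class AmAttempt {
        private final FakeRmClient client = new FakeRmClient();
        private boolean halted;

        /** Error callback: halt only this AM attempt, not the whole application. */
        void onError(Throwable cause) {
            client.breakConnection(); // simulate the client being broken after the error
            halted = true;            // stands in for halting the AM process
        }

        boolean isHalted() {
            return halted;
        }

        boolean canTalkToRm() {
            return !halted && client.isUsable();
        }
    }

    public static void main(String[] args) {
        AmAttempt first = new AmAttempt();
        first.onError(new IllegalStateException("simulated RM failure"));

        // A restarted AM attempt creates a fresh client, so RM communication resumes.
        AmAttempt second = new AmAttempt();

        System.out.println("first attempt usable:  " + first.canTalkToRm());
        System.out.println("second attempt usable: " + second.canTalkToRm());
    }
}
```

In this toy model the first attempt can no longer reach the RM once onError 
has run, while the restarted attempt can, which is the behavior the interim 
solution relies on.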

However, my preference would be not to allow the behavior of the YARN native 
services AM and the Slider AM to diverge, if possible.

> Keep Slider AM running during RM failure
> ----------------------------------------
>
>                 Key: SLIDER-1181
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1181
>             Project: Slider
>          Issue Type: Bug
>          Components: appmaster
>            Reporter: Billie Rinaldi
>            Assignee: Billie Rinaldi
>             Fix For: Slider 1.0.0
>
>         Attachments: SLIDER-1181.1.patch
>
>
> YARN-5944 and YARN-5996 made the native services AM more robust to temporary 
> RM failures. We should apply these to the Slider AM as well. YARN-5996 
> requires YARN change YARN-5999.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
