[ https://issues.apache.org/jira/browse/SLIDER-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15770763#comment-15770763 ]
Billie Rinaldi commented on SLIDER-1181:
----------------------------------------

No, the problem is that doing nothing in the onError command will leave the Slider AM in a bad state because the AMRMClientAsync client connection to the RM will be broken -- its threads will no longer be running. That is unless patch YARN-5999 is applied. An interim solution (if YARN-5999 is not available) might be to have the AM halt instead of shutting down the entire application, as initially proposed in the YARN-5996-yarn-native-services.001.patch. Then when the AM is brought back up, it will set up a new AMRMClientAsync instance, and the app should continue running. However, my preference would be not to allow the behavior of the YARN native services AM and the Slider AM to diverge, if possible.

> Keep Slider AM running during RM failure
> ----------------------------------------
>
>                 Key: SLIDER-1181
>                 URL: https://issues.apache.org/jira/browse/SLIDER-1181
>             Project: Slider
>          Issue Type: Bug
>          Components: appmaster
>            Reporter: Billie Rinaldi
>            Assignee: Billie Rinaldi
>             Fix For: Slider 1.0.0
>
>         Attachments: SLIDER-1181.1.patch
>
>
> YARN-5944 and YARN-5996 made the native services AM more robust to temporary
> RM failures. We should apply these to the Slider AM as well. YARN-5996
> requires YARN change YARN-5999.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)