You should be able to figure out the cause from the AM log. It sounds like it could be SLIDER-1183. A complete fix for this issue also requires YARN-5999: the SLIDER-1183 fix by itself should stop the app from being killed, but the AM will remain in a broken state.

On Wed, Sep 27, 2017 at 4:48 PM, David.Serafini <david.seraf...@target.com> wrote:

> I'm seeing my slider jobs sometimes fail for no obvious reason.
> One hypothesis is that this happens when the resource manager is restarted
> (actually, when one of the 2 redundant RMs restarts).
>
> Is this expected behavior?
>
> The jobs don't always fail completely; sometimes, yarn will fail an
> attempt and start another one, and the job's containers will all restart
> and everything will be fine. Sometimes some of the jobs that are running
> will have trouble and some won't. I haven't figured out a pattern yet.
>
> Any insight would be appreciated.
>
> -david