My Slider jobs sometimes fail for no obvious reason. One hypothesis is that this happens when a ResourceManager restarts (specifically, when one of the two redundant RMs restarts).
Is this expected behavior? The jobs don't always fail completely; sometimes YARN fails an attempt and starts another one, the job's containers all restart, and everything is fine. Other times, some of the running jobs have trouble and others don't. I haven't figured out a pattern yet. Any insight would be appreciated. -david