We run a 3-node Marathon cluster on top of 3 Mesos masters + 6 slaves (Mesos 0.21.0, Marathon 0.7.5).
This morning we had a network outage long enough for everything to lose its ZooKeeper connection. Now our Marathon UI is empty (all 3 Marathons think another instance is the leader, and Marathon's 'proxy to leader' feature means the REST API is toast). The odd thing is that, at the Mesos level, the master UI shows no tasks running (the logs mention orphaned tasks), but if I click into the 'slaves' tab and dig down, the slave view lists tasks that are in fact active.

Is there any way to bring order to this without needing to kill those tasks? We have no actual outage from a user's point of view, but the cluster itself is pretty confused, and our service discovery relies on the Marathon API, which is timing out.

Although Mesos has checkpointing enabled, Marathon isn't running with checkpointing on (it's the default now, but apparently doesn't apply to existing frameworks, and we started this around Marathon 0.4.x). Would enabling checkpointing help with this kind of issue? If so, how do I enable it for an existing framework?