We run a 3-node Marathon cluster on top of 3 Mesos masters + 6 slaves
(Mesos 0.21.0, Marathon 0.7.5).

This morning we had a network outage long enough for everything to
lose its connection to ZooKeeper.
Now our Marathon UI is empty (all 3 Marathons think someone else is the
leader, and Marathon's 'proxy to leader' feature means the REST API is
toast).

The odd thing is that, at the Mesos level, the master UI shows no tasks
running (the logs mention orphaned tasks), but if I click into the
'slaves' tab and dig down, the slave view lists tasks that are in fact
active.
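
For reference, here is a rough Python sketch of how that mismatch could be
checked from the state endpoints rather than by clicking through the UI.
It is only a sketch under assumptions: that the master serves its state at
/master/state.json and each slave at /slave(1)/state.json, and that the
JSON has the "frameworks", "orphan_tasks", "executors" and "tasks" fields
used below. The hostnames are placeholders, not our real ones.

```python
import json
from urllib.request import urlopen


def fetch_state(url, timeout=5):
    """Fetch and parse a state.json document from a master or slave."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def master_task_ids(master_state):
    """IDs of tasks the master lists, including ones it marks as orphaned."""
    ids = set()
    for framework in master_state.get("frameworks", []):
        for task in framework.get("tasks", []):
            ids.add(task["id"])
    # Assumed field: tasks whose framework is not currently registered.
    for task in master_state.get("orphan_tasks", []):
        ids.add(task["id"])
    return ids


def slave_running_task_ids(slave_state):
    """IDs of tasks a slave reports as TASK_RUNNING."""
    ids = set()
    for framework in slave_state.get("frameworks", []):
        for executor in framework.get("executors", []):
            for task in executor.get("tasks", []):
                if task.get("state") == "TASK_RUNNING":
                    ids.add(task["id"])
    return ids


def unknown_to_master(master_state, slave_states):
    """Sorted IDs of tasks running on slaves that the master does not list."""
    running = set()
    for slave_state in slave_states:
        running |= slave_running_task_ids(slave_state)
    return sorted(running - master_task_ids(master_state))


if __name__ == "__main__":
    # Placeholder hosts and assumed endpoint paths; adjust for the real cluster.
    master = fetch_state("http://mesos-master.example:5050/master/state.json")
    slaves = [
        fetch_state("http://%s:5051/slave(1)/state.json" % host)
        for host in ["mesos-slave1.example", "mesos-slave2.example"]
    ]
    for task_id in unknown_to_master(master, slaves):
        print(task_id)
```

Anything this prints would be a task that a slave still reports as running
but that the master does not list at all, not even as orphaned.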

Is there any way to bring order to this without needing to kill those
tasks? We have no actual outage from a user point of view, but the
cluster itself is pretty confused, and our service discovery relies on
the Marathon API, which is timing out.

Although Mesos has checkpointing enabled, Marathon isn't running with
checkpointing on (it's the default now, but apparently doesn't apply to
existing frameworks, and we started this one around Marathon 0.4.x).

Would enabling checkpointing help with this kind of issue? If so, how
do I enable it for an existing framework?
