Lost jobs on cluster failure

Mauricio Garavaglia Tue, 16 Jun 2015 14:20:58 -0700

Hello!

We had an issue with our Aurora/Mesos cluster that caused it to lose quorum,
and we are wondering how the recovery of lost jobs works. What happened is
basically:


#1 Start Aurora job, and have it allocated to node A.
#2 Aurora Schedulers, Mesos Master and ZK stopped
#3 node A stopped
#4 Aurora Schedulers, Mesos Master and ZK started again

Should Aurora assume the node list from Mesos is complete, treat the missing
nodes as gone, and hence restart the jobs? Is there any guarantee that
multiple instances of the same job will not be started?
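
To make the question concrete, here is a minimal Python sketch of the reconciliation logic being asked about. It is not Aurora's or Mesos's actual code: `reconcile`, `scheduler_tasks`, and `live_agents` are hypothetical names, and it assumes the scheduler simply treats any task whose node is absent from the master's list as lost.

```python
def reconcile(scheduler_tasks, live_agents):
    """Split tasks into those on a known node and those presumed lost.

    scheduler_tasks: dict of task id -> node name (hypothetical shape).
    live_agents: set of node names the master currently reports,
                 assumed here to be complete.
    """
    running, lost = {}, {}
    for task_id, node in scheduler_tasks.items():
        if node in live_agents:
            running[task_id] = node
        else:
            # Node is missing from the master's list: presume the task lost.
            lost[task_id] = node
    return running, lost


# Node A never came back after the restart, node B did.
running, lost = reconcile({"job/0": "A", "job/1": "B"}, {"B"})
```

If node A were only slow to re-register rather than actually gone, rescheduling everything in `lost` is exactly where a duplicate instance could appear.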

If we had health checks, we could presumably use those to validate that the
job is, indeed, truly dead. Would that work?

Thanks!
