Hello! We had an issue with our Aurora/Mesos cluster that caused it to lose quorum, and we are wondering how recovery of lost jobs works. What happened is basically:
#1 We started an Aurora job, and it was allocated to node A.
#2 The Aurora schedulers, Mesos master, and ZK stopped.
#3 Node A stopped.
#4 The Aurora schedulers, Mesos master, and ZK started again.

Should Aurora assume that the list it gets from Mesos is complete, treat the missing nodes as really gone, and therefore restart the jobs? Is there any guarantee that multiple instances of the same job will not be started?

If we had health checks, we could presumably use those to validate that the job is truly dead. Would that work?

Thanks!
