Hi there, I'm implementing scheduler failover in my framework and have hit an issue - I know this is simply how Mesos works for now:
My framework lets Mesos agents fetch my custom executor jar from an HTTP endpoint served by the scheduler process. If the scheduler process is restarted on a different machine after a failure (by Marathon or whatever), the URL from which the executor jar is downloaded changes to that of the new scheduler process. This causes an ExecutorInfo validation failure on re-registration, as in [1], and I suspect this is also why Spark's MesosClusterDispatcher is not ready for HA yet.

As a (major?) workaround, [1] avoids this by keeping the URL stable through DNS or a load balancer. Another, admittedly short-sighted, kludge would be relaxing the ExecutorInfo validation for the failover case, which I believe would spare many framework developers this headache. The best fix in Mesos itself, though, might be for the Master to simply clear the stored ExecutorInfo once it detects a scheduler failover.

I think ExecutorInfo must stay 1:1 with FrameworkInfo, but it does not have to be immutable. Under a network partition it may diverge across masters, but a last-writer-wins (LWW) merge after the partition heals would be enough to keep it unique.

Thoughts?

[1] https://github.com/mesosphere/kubernetes-mesos/issues/15

Kota UENISHI
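To make the failure mode concrete, here is a rough sketch of the kind of ExecutorInfo my scheduler builds, using the Mesos Java protobuf API; the executor id, jar name, port, and endpoint path are placeholders rather than my actual values. The fetch URI embeds the scheduler's own address, which is exactly the part that changes after failover:

    import org.apache.mesos.Protos.CommandInfo;
    import org.apache.mesos.Protos.ExecutorID;
    import org.apache.mesos.Protos.ExecutorInfo;

    // Rough sketch only: the executor id, port, and jar path are placeholders.
    public final class ExecutorInfoFactory {

        public static ExecutorInfo build(String schedulerHost, int httpPort) {
            // The agent's fetcher downloads the jar from the scheduler's HTTP endpoint.
            // After a failover to another machine, schedulerHost changes, and so does
            // this URI - so the ExecutorInfo no longer matches what the Master stored.
            String jarUrl = "http://" + schedulerHost + ":" + httpPort + "/my-executor.jar";

            return ExecutorInfo.newBuilder()
                    .setExecutorId(ExecutorID.newBuilder().setValue("my-executor"))
                    .setCommand(CommandInfo.newBuilder()
                            .addUris(CommandInfo.URI.newBuilder()
                                    .setValue(jarUrl)
                                    .setExtract(false)) // fetch the jar as-is, no unpacking
                            .setValue("java -jar my-executor.jar"))
                    .build();
        }
    }

With the workaround from [1], schedulerHost would be a stable DNS name or load-balancer address that survives failover, so the ExecutorInfo sent by the re-registered scheduler stays byte-identical to the one the Master already holds.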