Hi there,

I'm implementing scheduler failover in my framework and have hit an
issue - although I understand this is how Mesos currently works:

My framework lets Mesos agents fetch my custom executor jar from the
scheduler process's HTTP endpoint. Suppose the framework process is
restarted by Marathon (or similar) on a different machine after a
failure; the URL of the HTTP endpoint to download the executor jar
from then changes to that of the new scheduler process. This causes an
ExecutorInfo validation failure, as in [1]. I suspect this is also why
Spark's MesosClusterDispatcher is not ready for HA yet.
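
To make the shape of the problem concrete, here is a simplified sketch
using the Mesos Java protobuf bindings (the host/port parameters, class
names, and jar path are placeholders, not my actual code):

    import org.apache.mesos.Protos.CommandInfo;
    import org.apache.mesos.Protos.ExecutorID;
    import org.apache.mesos.Protos.ExecutorInfo;

    public class ExecutorInfoFactory {
        public static ExecutorInfo build(String schedulerHost, int schedulerPort) {
            CommandInfo command = CommandInfo.newBuilder()
                // The jar is served from the scheduler's own HTTP endpoint,
                // so the fetcher URI embeds the scheduler's host and port.
                .addUris(CommandInfo.URI.newBuilder()
                    .setValue("http://" + schedulerHost + ":" + schedulerPort
                              + "/executor.jar")
                    .setExtract(false))
                .setValue("java -cp executor.jar com.example.MyExecutor")
                .build();

            return ExecutorInfo.newBuilder()
                .setExecutorId(ExecutorID.newBuilder().setValue("my-executor"))
                .setCommand(command)
                .build();
        }
    }

After failover the scheduler comes back on a different host and port,
so the ExecutorInfo built by the new process no longer matches what the
master stored for the already-running tasks, and validation rejects it.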

As a (major?) workaround, [1] avoids this by keeping the URL stable
with DNS or a load balancer. Another short-sighted kludge would be to
relax the ExecutorInfo validation for the failover case - which I
believe would solve a headache for many framework developers.
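
For completeness, the workaround in [1] amounts to something like this
on the framework side: build the fetcher URI from a stable name that
always resolves to whichever scheduler instance is alive, so the
ExecutorInfo stays byte-identical across failovers (the environment
variable and hostname below are hypothetical):

    import org.apache.mesos.Protos.CommandInfo;

    public class StableUriFactory {
        public static CommandInfo.URI executorJarUri() {
            // A stable DNS name / VIP in front of the scheduler, instead
            // of the current scheduler process's own host and port.
            String base = System.getenv().getOrDefault(
                "EXECUTOR_ARTIFACT_BASE",
                "http://my-framework.example.com:8080");
            return CommandInfo.URI.newBuilder()
                .setValue(base + "/executor.jar")
                .setExtract(false)
                .build();
        }
    }

It works, but it pushes an extra infrastructure requirement onto every
framework that wants HA.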

Also, the best fix in Mesos itself might be simply to clear the stored
ExecutorInfo once the master detects a scheduler failover. I think
ExecutorInfo must stay 1:1 with FrameworkInfo, but it does not have to
be immutable. Under a network partition it may diverge across masters,
but a last-write-wins (LWW) merge after the partition heals would be
enough to keep it unique.
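
To illustrate what I mean by LWW, a toy sketch (the Versioned wrapper
is purely hypothetical, not Mesos code): each master would keep the
most recently registered ExecutorInfo tagged with its registration
time, and merging after the partition heals would just keep the newer
one.

    import org.apache.mesos.Protos.ExecutorInfo;

    public final class VersionedExecutorInfo {
        public final ExecutorInfo info;
        public final long registeredAtMillis;  // taken at (re-)registration

        public VersionedExecutorInfo(ExecutorInfo info, long registeredAtMillis) {
            this.info = info;
            this.registeredAtMillis = registeredAtMillis;
        }

        // LWW merge: keep whichever side saw the later scheduler registration.
        public static VersionedExecutorInfo merge(VersionedExecutorInfo a,
                                                  VersionedExecutorInfo b) {
            return a.registeredAtMillis >= b.registeredAtMillis ? a : b;
        }
    }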

Thoughts?

[1] https://github.com/mesosphere/kubernetes-mesos/issues/15

Kota UENISHI
