We cannot easily make ExecutorInfo mutable because there might be existing
tasks running under executors that were launched with the old ExecutorInfo.
If there are two different ExecutorInfos for the same ExecutorID, things
become ambiguous for Mesos (e.g., which executor does a SHUTDOWN for
executor id 'foo' kill?).

One possible solution is to not re-use ExecutorID, but that depends on what
semantics you want for your executor.
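
For illustration, here is a minimal sketch of that approach, assuming the
scheduler derives the ExecutorID from the jar URL so that a changed
ExecutorInfo never shares an ID with an old one. The helper names
(executor_id_for, build_executor_info) and the example URLs are hypothetical,
and a plain dict stands in for the ExecutorInfo protobuf:

```python
import hashlib


def executor_id_for(base_name: str, jar_url: str) -> str:
    """Derive an ExecutorID that changes whenever the jar URL changes.

    Hypothetical helper: hashing the URL is just one way to avoid
    re-using an ExecutorID with a different ExecutorInfo.
    """
    digest = hashlib.sha256(jar_url.encode("utf-8")).hexdigest()[:12]
    return f"{base_name}-{digest}"


def build_executor_info(base_name: str, jar_url: str) -> dict:
    """Build a dict mirroring the executor_id and command.uris fields."""
    return {
        "executor_id": {"value": executor_id_for(base_name, jar_url)},
        "command": {"uris": [{"value": jar_url}]},
    }


# Before and after a failover to another machine (example URLs only).
old = build_executor_info("my-executor", "http://host-a:8080/executor.jar")
new = build_executor_info("my-executor", "http://host-b:8080/executor.jar")
```

The trade-off is that tasks launched after the failover start a new executor
instead of joining the one that is already running, which is where the
semantics question comes in.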

On Thu, Sep 29, 2016 at 3:01 AM, Kota UENISHI <
ueni...@nautilus-technologies.com> wrote:

> Hi there,
>
> I'm implementing scheduler failover in my framework and have hit an
> issue, although I know this is how Mesos works for now:
>
> My framework lets Mesos agents fetch my custom executor jar file from
> the scheduler process's HTTP endpoint. If the framework process is
> restarted on a different machine after a failure (by Marathon or
> something similar), the URL of the HTTP endpoint serving the executor
> jar changes to that of the new scheduler process. This causes an
> ExecutorInfo validation failure, as in [1]. I think this is why
> Spark's MesosClusterDispatcher is not ready for HA yet.
>
> As a (major?) workaround, [1] avoids this by keeping the URL identical
> via DNS or something like a load balancer. Another, short-sighted
> kludge would be to relax the ExecutorInfo validation in the failover
> case, which I believe would spare many framework developers a
> headache.
>
> Also, the best workaround in the Mesos code would be to simply clear
> the ExecutorInfo once the master detects a scheduler failover. I think
> ExecutorInfo must be 1:1 with FrameworkInfo, but it does not have to be
> immutable. Under a partition it may diverge across masters, but a
> last-write-wins (LWW) merge after the partition heals would be enough
> to keep it unique.
>
> Thoughts?
>
> [1] https://github.com/mesosphere/kubernetes-mesos/issues/15
>
> Kota UENISHI
>
