I would not consider it a "workaround" to make the executor URI stable
across failovers. I think that's a requirement for an HA system. If you are
serving the resource from the scheduler itself, then yes, you need to set up
DNS or some sort of proxy that can direct the fetch request to the current
scheduler.

Alternatively, you could put it in a well-known location (e.g., HDFS or S3)
and pass that URI instead. The current scheduler can update that storage
system on startup if it is serving a new jar. If you do this, you can also
decouple resource serving from the scheduler itself, which I think is a
nice-to-have.
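
To make the difference concrete, here is a minimal Python sketch. It does not
use the Mesos API: ExecutorSpec is a hypothetical stand-in for the few
ExecutorInfo fields that matter here, compatible() only approximates the
validation discussed in this thread, and the bucket name, hosts, and ports
are made up.

```python
from dataclasses import dataclass


# Hypothetical stand-in for the relevant ExecutorInfo fields;
# this is NOT the real Mesos protobuf class.
@dataclass(frozen=True)
class ExecutorSpec:
    executor_id: str
    command: str
    uri: str


# Assumed stable location for the jar (made-up bucket and path).
STABLE_EXECUTOR_URI = "s3://example-bucket/executors/my-executor.jar"


def spec_from_scheduler_host(host: str, port: int) -> ExecutorSpec:
    """Executor URI tied to whichever machine the scheduler runs on."""
    return ExecutorSpec(
        executor_id="my-executor",
        command="java -jar my-executor.jar",
        uri=f"http://{host}:{port}/my-executor.jar",
    )


def spec_from_stable_location() -> ExecutorSpec:
    """Executor URI that does not depend on the scheduler's address."""
    return ExecutorSpec(
        executor_id="my-executor",
        command="java -jar my-executor.jar",
        uri=STABLE_EXECUTOR_URI,
    )


def compatible(old: ExecutorSpec, new: ExecutorSpec) -> bool:
    """Rough approximation: a reused executor ID must carry identical info."""
    return old.executor_id != new.executor_id or old == new


# Scheduler restarts on a different machine: the URI changes, the ID does not.
before = spec_from_scheduler_host("10.0.0.1", 8080)
after = spec_from_scheduler_host("10.0.0.2", 8080)
print(compatible(before, after))  # False

# With a stable URI, the spec is identical before and after failover.
print(compatible(spec_from_stable_location(), spec_from_stable_location()))  # True
```

The same reasoning applies if the stable URI is a DNS name or proxy in front
of the scheduler rather than HDFS or S3.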


On Thu, Sep 29, 2016 at 10:25 AM, Vinod Kone <vinodk...@apache.org> wrote:

> We cannot easily make ExecutorInfo mutable because there might be existing
> tasks with executors with the old ExecutorInfo. If there are two different
> ExecutorInfos for the same ExecutorID it gets confusing for Mesos (e.g.,
> SHUTDOWN executor id 'foo' kills which executor?).
>
> One possible solution is to not re-use ExecutorID, but that depends on
> what semantics you want for your executor.
>
> On Thu, Sep 29, 2016 at 3:01 AM, Kota UENISHI <uenishi@nautilus-technologies.com> wrote:
>
>> Hi there,
>>
>> I'm going to implement scheduler failover into my framework, and hit
>> an issue - while I know it's how Mesos works for now:
>>
>> My framework lets Mesos agents fetch my custom executor jar file from
>> the scheduler process's HTTP endpoint. Suppose the framework process is
>> restarted by Marathon or whatever on a different machine after a
>> failure: the URL of the HTTP endpoint to download the executor jar file
>> from changes to that of the new scheduler process. This causes an
>> ExecutorInfo validation failure, like [1]. And I think this is why
>> Spark's MesosClusterDispatcher is not ready for HA yet.
>>
>> As a (major?) workaround, [1] avoids this by assuming URL identity by
>> DNS or load balancer-ish stuff. Another short-sighted kludge
>> workaround would be relaxing the ExecutorInfo validation for the
>> failover case - which I believe solves many framework developers'
>> headache.
>>
>> Also, the best workaround in Mesos code would be just clearing
>> ExecutorInfo after the master detects scheduler failover. I think
>> ExecutorInfo must be 1:1 with FrameworkInfo, but it does not have to be
>> immutable. Under partition, it may diverge across masters, but an LWW
>> merge after the partition heals would be enough to keep it unique.
>>
>> Thoughts?
>>
>> [1] https://github.com/mesosphere/kubernetes-mesos/issues/15
>>
>> Kota UENISHI
>>
>
>
