Re: Registering and framework failover

Neil Conway Wed, 13 Jul 2016 06:37:36 -0700

Ah, right -- yes, at the moment you need to look at error strings to
decide whether to retry with a new framework ID, unfortunately. IMO we
should introduce error codes or enums to make this process more
reliable, but no one has done so yet:


https://issues.apache.org/jira/browse/MESOS-4548
https://issues.apache.org/jira/browse/MESOS-5322
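
For illustration, a minimal sketch of the string-matching fallback
described above. It assumes the error text is the "Framework has been
removed" message quoted later in this thread, which may differ between
Mesos versions. `try_register` is a hypothetical helper standing in for
whatever your scheduler uses to attempt one registration; it is not part
of any Mesos API.

```python
from typing import Callable, Optional

# Error text quoted in this thread; matching on it is fragile and may
# break if the master's wording changes.
REMOVED_MARKER = "Framework has been removed"


def is_framework_removed(error_message: str) -> bool:
    """Return True if the error text suggests the old framework ID is gone."""
    return REMOVED_MARKER in error_message


def register_with_fallback(
    try_register: Callable[[Optional[str]], Optional[str]],
    old_framework_id: Optional[str],
) -> bool:
    """Try the old ID first; fall back to a fresh registration if removed.

    `try_register` is a hypothetical helper: it attempts one registration
    with the given framework ID (None means "ask for a new one") and
    returns None on success or the error message on failure.

    Returns True if the old ID was reused (so tasks can be reconciled),
    False if a new ID had to be requested.
    """
    error = try_register(old_framework_id)
    if error is None:
        return old_framework_id is not None
    if old_framework_id is not None and is_framework_removed(error):
        # The failover timeout has presumably expired: start over.
        error = try_register(None)
        if error is None:
            return False
    raise RuntimeError(f"registration failed: {error}")
```

Any other error is raised rather than retried, so an unrelated failure
does not silently discard the old framework ID.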

Neil


On Wed, Jul 13, 2016 at 3:27 PM, Evers Benno <ben...@yandex-team.ru> wrote:
> Let me try to clarify:
>
> The problem is that I don't get to decide manually if the framework
> should try to take a new id or re-use the old one, but it needs to be
> decided programmatically, by an algorithm.
>
> AFAIK it's not possible to get the time when the framework disconnected
> from Mesos, so it's not possible to know how much time is left until the
> failover timeout runs out. Therefore, if I want to attempt task
> reconciliation, I just have to try registering with my old framework id
> and see what happens.
>
> However, in the case where the failover timeout already passed, I now
> need to programmatically detect this error and try again with an empty
> framework id.
>
> My question was, is it possible to do this?
>
> (also, we actually use a failover timeout of 1 week, but it doesn't
> really change the problem and I mistakenly assumed that an example with
> smaller values would be more intuitive)
>
> On 13.07.2016 14:50, Neil Conway wrote:
>> On Wed, Jul 13, 2016 at 2:44 PM, Evers Benno <ben...@yandex-team.ru> wrote:
>>> imagine the following situation: I am a framework with failover timeout
>>> of 1 hour, and 59 minutes and 55 seconds after shutting down I want to
>>> register with the master again.
>>>
>>> If my registration attempt arrives at the master within the time limit
>>> everything will be fine and I even get back the old tasks for
>>> reconciliation, but if it arrives slightly later the framework id is
>>> permanently blocked by Mesos, and I am not able to register. Instead, I
>>> will receive an error()-callback with the message "Framework has been
>>> removed".
>>
>> Right: if you set a failover_timeout of 1 hour, your framework is
>> expected to reregister within one hour. If it does not, all of its
>> tasks will be killed and you need to start over with a new
>> FrameworkID. Can you clarify which aspect of this behavior is
>> problematic for you?
>>
>> Note that a failover_timeout of 1 hour is probably a little low.
>>
>>> Is there any way to reliably connect to the master while also
>>> reconciling old tasks if possible?
>>
>> Sorry, not sure what you mean by this.
>>
>> Neil
>>
