So, given that this probably won't be changed before the 1.0 release, are the error strings considered part of the stable API? Or is it recommended not to rely on `error()` at all? (That's what we did for now, by setting the failover timeout to 5 years.)
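
For reference, the fallback I would otherwise have to write looks roughly like this. It is only a sketch in Python and is not tied to the real scheduler driver: `register` and `RegistrationError` are hypothetical stand-ins for starting the driver and for the message delivered to the `error()` callback, and the matched substring is simply the text quoted further down, which may well change between releases.

```python
from typing import Callable, Optional, Tuple

# Assumption: the message text quoted in this thread. It is not a documented,
# stable identifier, which is exactly the problem being discussed.
FRAMEWORK_REMOVED = "Framework has been removed"


class RegistrationError(Exception):
    """Hypothetical wrapper around the string passed to the error() callback."""


def register_with_fallback(
    register: Callable[[Optional[str]], str],
    saved_id: Optional[str],
) -> Tuple[str, bool]:
    """Try the saved framework id first, then fall back to a fresh one.

    `register` is a hypothetical callable: it takes a framework id (or None
    for "assign me a new one"), returns the id the master accepted, and
    raises RegistrationError carrying the error() message on failure.

    Returns (framework_id, reused). `reused` is True only when the old id
    was accepted, i.e. when reconciling the old tasks is still possible.
    """
    if saved_id is not None:
        try:
            return register(saved_id), True
        except RegistrationError as err:
            # Only the "failover timeout expired" case is retried; any other
            # error is passed on to the caller unchanged.
            if FRAMEWORK_REMOVED not in str(err):
                raise
    # No saved id, or the master has already removed the old framework.
    return register(None), False
```

The substring match is the fragile part: if the wording of the message changes, the framework stops recovering and just fails.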

On 13.07.2016 15:37, Neil Conway wrote:
> Ah, right -- yes, at the moment you need to look at error strings to
> decide whether to retry with a new framework ID, unfortunately. IMO we
> should introduce error codes or enums to make this process more
> reliable, but no one has done so yet:
>
> https://issues.apache.org/jira/browse/MESOS-4548
> https://issues.apache.org/jira/browse/MESOS-5322
>
> Neil
>
>
> On Wed, Jul 13, 2016 at 3:27 PM, Evers Benno <ben...@yandex-team.ru> wrote:
>> Let me try to clarify:
>>
>> The problem is that I don't get to decide manually if the framework
>> should try to take a new id or re-use the old one, but it needs to be
>> decided programmatically, by an algorithm.
>>
>> Afaik it's not possible to get the time when the framework disconnected
>> from mesos, so it's not possible to know how much time is left until the
>> failover timeout runs out. Therefore, if I want to attempt task
>> reconciliation, I just have to try registering with my old framework id
>> and see what happens.
>>
>> However, in the case where the failover timeout already passed, I now
>> need to programmatically detect this error and try again with an empty
>> framework id.
>>
>> My question was, is it possible to do this?
>>
>> (also, we actually use a failover timeout of 1 week, but it doesn't
>> really change the problem and I mistakenly assumed that an example with
>> smaller values would be more intuitive)
>>
>> On 13.07.2016 14:50, Neil Conway wrote:
>>> On Wed, Jul 13, 2016 at 2:44 PM, Evers Benno <ben...@yandex-team.ru> wrote:
>>>> imagine the following situation: I am a framework with failover timeout
>>>> of 1 hour, and 59 minutes and 55 seconds after shutting down I want to
>>>> register with the master again.
>>>>
>>>> If my registration attempt arrives at the master within the time limit
>>>> everything will be fine and I even get back the old tasks for
>>>> reconciliation, but if it arrives slightly later the framework id is
>>>> permanently blocked by mesos, and I am not able to register. Instead, I
>>>> will receive an error()-callback with the message "Framework has been
>>>> removed".
>>>
>>> Right: if you set a failover_timeout of 1 hour, your framework is
>>> expected to reregister within one hour. If it does not, all of its
>>> tasks will be killed and you need to start over with a new
>>> FrameworkID. Can you clarify which aspect of this behavior is
>>> problematic for you?
>>>
>>> Note that a failover_timeout of 1 hour is probably a little low.
>>>
>>>> Is there any way to reliably connect to the master while also
>>>> reconciling old tasks if possible?
>>>
>>> Sorry, not sure what you mean by this.
>>>
>>> Neil
>>>