Ah, right -- yes, at the moment you need to look at error strings to decide whether to retry with a new framework ID, unfortunately. IMO we should introduce error codes or enums to make this process more reliable, but no one has done so yet:

https://issues.apache.org/jira/browse/MESOS-4548
https://issues.apache.org/jira/browse/MESOS-5322

Neil

On Wed, Jul 13, 2016 at 3:27 PM, Evers Benno <ben...@yandex-team.ru> wrote:
> Let me try to clarify:
>
> The problem is that I don't get to decide manually if the framework
> should try to take a new id or re-use the old one, but it needs to be
> decided programmatically, by an algorithm.
>
> Afaik it's not possible to get the time when the framework disconnected
> from mesos, so it's not possible to know how much time is left until the
> failover timeout runs out. Therefore, if I want to attempt task
> reconciliation, I just have to try registering with my old framework id
> and see what happens.
>
> However, in the case where the failover timeout already passed, I now
> need to programmatically detect this error and try again with an empty
> framework id.
>
> My question was, is it possible to do this?
>
> (also, we actually use a failover timeout of 1 week, but it doesn't
> really change the problem and I mistakenly assumed that an example with
> smaller values would be more intuitive)
>
> On 13.07.2016 14:50, Neil Conway wrote:
>> On Wed, Jul 13, 2016 at 2:44 PM, Evers Benno <ben...@yandex-team.ru> wrote:
>>> imagine the following situation: I am a framework with failover timeout
>>> of 1 hour, and 59 minutes and 55 seconds after shutting down I want to
>>> register with the master again.
>>>
>>> If my registration attempt arrives at the master within the time limit
>>> everything will be fine and I even get back the old tasks for
>>> reconciliation, but if it arrives slightly later the framework id is
>>> permanently blocked by mesos, and I am not able to register. Instead, I
>>> will receive an error()-callback with the message "Framework has been
>>> removed".
>>
>> Right: if you set a failover_timeout of 1 hour, your framework is
>> expected to reregister within one hour. If it does not, all of its
>> tasks will be killed and you need to start over with a new
>> FrameworkID. Can you clarify which aspect of this behavior is
>> problematic for you?
>>
>> Note that a failover_timeout of 1 hour is probably a little low.
>>
>>> Is there any way to reliably connect to the master while also
>>> reconciling old tasks if possible?
>>
>> Sorry, not sure what you mean by this.
>>
>> Neil
>>
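
For illustration, the string-matching workaround discussed in this thread might look roughly like the Python sketch below. It is a sketch under assumptions, not part of the Mesos API: the helper names (is_framework_removed, next_framework_id) are hypothetical, and the only detail taken from the thread is the message text "Framework has been removed" passed to the error() callback. The exact wording of that message could differ between Mesos versions, which is why matching on it is fragile.

```python
# Hypothetical helpers; none of these names come from the Mesos API.
# The message text is the one quoted in this thread and may change
# between Mesos versions, so treat the match as best-effort.
FRAMEWORK_REMOVED_MESSAGE = "Framework has been removed"


def is_framework_removed(message):
    """Return True if the error text says the framework ID was removed."""
    return FRAMEWORK_REMOVED_MESSAGE in message


def next_framework_id(stored_id, error_message):
    """Pick the framework ID to use for the next registration attempt.

    stored_id     -- the ID persisted from the previous run
    error_message -- text received in the scheduler's error() callback,
                     or None if no error has been reported yet

    Returns stored_id to keep re-using the old ID, or "" to ask the
    master for a fresh one. Raises RuntimeError for any other error,
    since retrying with a new ID would not help in that case.
    """
    if error_message is None:
        return stored_id
    if is_framework_removed(error_message):
        return ""
    raise RuntimeError("unrecoverable scheduler error: " + error_message)


if __name__ == "__main__":
    # First attempt: no error yet, so try the old ID and hope that the
    # failover timeout has not expired.
    print(repr(next_framework_id("old-framework-id", None)))
    # The master rejected the old ID: fall back to an empty ID.
    print(repr(next_framework_id("old-framework-id",
                                 "Framework has been removed")))
```

A scheduler could call next_framework_id from its error() handler, then stop the current driver and start a new one with the returned ID. Tasks from the old framework cannot be reconciled once the fallback path is taken, because the master has already removed them.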