[
https://issues.apache.org/jira/browse/MESOS-658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Benjamin Mahler updated MESOS-658:
----------------------------------
Fix Version/s: 0.14.1 (was: 0.14.0)
> A framework can be incorrectly removed by the Master.
> -----------------------------------------------------
>
> Key: MESOS-658
> URL: https://issues.apache.org/jira/browse/MESOS-658
> Project: Mesos
> Issue Type: Bug
> Reporter: Benjamin Mahler
> Assignee: Benjamin Mahler
> Priority: Blocker
> Fix For: 0.14.1
>
>
> Discovered this while reading through the failover code in the Master.
> There is a case during re-registration where the re-registered time was not
> being set.
> This can cause a serious issue when the following occurs:
> - Scheduler disconnects from the master; Master::exited(UPID) sets
> framework->active = false.
> - Scheduler re-registers with ReregisterFrameworkMessage::failover=false.
> Currently, the master does _not_ update the re-registration time in this case!
> - Now the failoverFramework timeout is set up in the Master.
> - Scheduler disconnects again from the master; Master::exited(UPID) sets
> framework->active = false once again.
> - The original failoverFramework timeout fires and checks
> framework->reregisteredTime. Since that time has not been updated, the master
> proceeds to shut down the framework on all the slaves!
> I have a short term fix here: https://reviews.apache.org/r/13744/
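
The sequence described above can be reproduced with a small stand-alone model. This is only a sketch in Python, not the Master's actual C++ code. It assumes that each disconnect arms a timer which captures the re-registration time at that moment, and that a firing timer removes the framework only if the framework is inactive and its re-registration time still equals the captured value. The names ToyMaster, simulate, and update_on_plain_reregister are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Framework:
    active: bool = True
    reregistered_time: float = 0.0
    removed: bool = False


@dataclass
class ToyMaster:
    """Hypothetical, simplified stand-in for the Master's failover logic."""
    failover_timeout: float
    # False models the reported bug; True models the intended behaviour.
    update_on_plain_reregister: bool
    framework: Framework = field(default_factory=Framework)
    # Pending timers as (fire_at, re-registration time captured when armed).
    timers: list = field(default_factory=list)

    def exited(self, now: float) -> None:
        # The scheduler disconnected: mark inactive and arm a failover timer.
        self.framework.active = False
        self.timers.append(
            (now + self.failover_timeout, self.framework.reregistered_time))

    def reregister(self, now: float, failover: bool) -> None:
        self.framework.active = True
        # Assumed bug: the time is refreshed only when failover is True.
        if failover or self.update_on_plain_reregister:
            self.framework.reregistered_time = now

    def advance(self, now: float) -> None:
        # Fire every timer that is due at or before `now`.
        due = [t for t in self.timers if t[0] <= now]
        self.timers = [t for t in self.timers if t[0] > now]
        for _, captured in due:
            stale = self.framework.reregistered_time != captured
            if not self.framework.active and not stale:
                self.framework.removed = True


def simulate(update_on_plain_reregister: bool) -> bool:
    """Return True if the framework is removed when the first timer fires."""
    master = ToyMaster(failover_timeout=10.0,
                       update_on_plain_reregister=update_on_plain_reregister)
    master.exited(now=0.0)                       # first disconnect
    master.reregister(now=1.0, failover=False)   # plain re-registration
    master.exited(now=2.0)                       # second disconnect
    master.advance(now=10.0)                     # original timer fires
    return master.framework.removed


if __name__ == "__main__":
    print("without time update:", simulate(False))
    print("with time update:   ", simulate(True))
```

In this model, the timer armed at the first disconnect removes the framework only 8 time units after the second disconnect when the re-registration time is left untouched. When the time is refreshed on every re-registration, the stale timer is ignored and only the timer armed by the second disconnect can remove the framework.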