[ https://issues.apache.org/jira/browse/MESOS-658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-658:
----------------------------------

    Fix Version/s: 0.14.0
    
> A framework can be incorrectly removed by the Master.
> -----------------------------------------------------
>
>                 Key: MESOS-658
>                 URL: https://issues.apache.org/jira/browse/MESOS-658
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Mahler
>            Assignee: Benjamin Mahler
>            Priority: Blocker
>             Fix For: 0.14.0
>
>
> I discovered this while reading through the failover code in the Master.
> There is a case during re-registration where the re-registration time is not
> set.
> This can cause a serious issue when the following occurs:
>  - The scheduler disconnects from the master; Master::exited(UPID) sets
>    framework->active = false.
>  - The scheduler re-registers with ReregisterFrameworkMessage::failover=false.
>    Currently, the master does _not_ update the re-registration time in this case!
>  - Now the failoverFramework timeout is set up in the Master.
>  - The scheduler disconnects from the master again; Master::exited(UPID) sets
>    active = false once again.
>  - The original failoverFramework timeout fires and compares
>    Framework->reregisteredTime with the value it captured. Since
>    reregisteredTime has not been updated, the two match, and the master
>    proceeds to shut down the framework on all the slaves!
> I have a short-term fix here: https://reviews.apache.org/r/13744/

--
This message is automatically generated by JIRA.
