Changing FrameworkInfo (while keeping the FrameworkID) is not handled correctly by Mesos at the moment. This is what you currently need to do to propagate FrameworkInfo.checkpoint throughout the cluster.
--> Update FrameworkInfo inside your framework and re-register with the master. (The old FrameworkInfo is still cached at the master and slaves.)
--> Fail over the leading master. (The new FrameworkInfo will be cached by the new leading master.)
--> Hard restart (kill the slave and wipe its metadata) your slaves in batches.

The proper fix for this is tracked at: https://issues.apache.org/jira/browse/MESOS-703

On Tue, Feb 24, 2015 at 4:23 PM, Zameer Manji <zma...@twopensource.com> wrote:

> For anyone who is going to read this information in the future, this works
> because the information in the replicated log can be recovered by the
> master. In future releases of Mesos the master might store information
> which cannot be recovered so please take extra care if you are going to do
> this.
>
> On Tue, Feb 24, 2015 at 4:11 PM, Steve Niemitz <st...@tellapart.com>
> wrote:
>
>> Definitely don't change the frameworkID, we did that once and it was a
>> disaster, for reasons described already.
>>
>> Here's what we did to force it on (as I can recall)
>> - Change the startup flags for all masters to use the in memory DB
>> instead of the replicated log (--registry=in_memory)
>> - Restart all masters (not all at once, let them fail over)
>> - Delete the replicated log on all masters
>> - Ensure the framework is now registered with checkpoint = true (the
>> slaves won't be yet however)
>> - Remove the --registry flag from the masters and do a rolling restart
>> again
>> - Do another rolling restart of the masters
>> *- At this point the framework will be persisted as checkpoint = true*
>> - Now, restart your slaves. Restarting them should cause them to pick up
>> the new framework. I'm not 100% sure if I deleted their state or not when
>> I did this part, if it doesn't seem to take, try deleting their slave info
>> on each one.
>>
>> On Tue, Feb 24, 2015 at 4:02 PM, Zameer Manji <zma...@twopensource.com>
>> wrote:
>>
>>> I would like to point out that using a new FrameworkID is not a solution
>>> to this problem. This means that a cluster operator has to drain the entire
>>> cluster to enable checkpointing, or lose all previous tasks. Both scenarios
>>> are not desirable.
>>>
>>> Fortunately it is possible to do this without changing the FrameworkID.
>>> I have cced Steve from TellApart who has enabled checkpointing without
>>> changing the FrameworkID on a production cluster. I hope he can share his
>>> process here.
>>>
>>> On Tue, Feb 24, 2015 at 3:51 PM, Tim Chen <t...@mesosphere.io> wrote:
>>>
>>>> Mesos checkpoints the FrameworkInfo into disk, and recovers it on
>>>> relaunch.
>>>>
>>>> I don't think we expose any API to remove the framework manually though
>>>> if you really want to keep the FrameworkID. If you hit the failover timeout
>>>> the framework will get removed from the master and slave.
>>>>
>>>> I think for now the best way is just use a new FrameworkID when you
>>>> want to change the FrameworkInfo.
>>>>
>>>> Tim
>>>>
>>>> On Tue, Feb 24, 2015 at 3:32 PM, Thomas Petr <tp...@hubspot.com> wrote:
>>>>
>>>>> Hey folks,
>>>>>
>>>>> Is there a best practice for rolling out FrameworkInfo changes? We
>>>>> need to set checkpoint to true, so I redeployed our framework with
>>>>> the new settings (with tasks still running), but when I hit a slave's
>>>>> stats.json endpoint, it appears that the old FrameworkInfo data is
>>>>> still there (which makes sense since there's active executors running). I
>>>>> then tried draining the tasks and completely restarting a Mesos slave, but
>>>>> still no luck.
>>>>>
>>>>> Is there anything additional / special I need to do here? Is some part
>>>>> of Mesos caching FrameworkInfo based on the framework ID?
>>>>>
>>>>> Another wrinkle with our setup is we have a rather large
>>>>> failover_timeout set for the framework -- maybe that's affecting
>>>>> things too?
>>>>>
>>>>> Thanks,
>>>>> Tom
>>>>>
>>>>
>>>
>>> --
>>> Zameer Manji
>>>
>>
>
> --
> Zameer Manji
>
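
For anyone scripting the procedure from the top of this thread, here is a minimal Python sketch of two helper pieces: splitting the slave list into restart batches, and checking whether a given framework shows up with checkpoint enabled in an already-parsed state document. Everything here is an assumption for illustration: the function names are hypothetical, the code does not contact Mesos or restart anything, and the JSON shape (a top-level "frameworks" list whose entries carry "id" and "checkpoint" keys) should be confirmed against what your master and slaves actually return.

```python
from typing import Any, Dict, Iterator, List, Optional


def batches(hosts: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` hosts (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(hosts), size):
        yield hosts[start:start + size]


def framework_checkpoint(state: Dict[str, Any], framework_id: str) -> Optional[bool]:
    """Return the checkpoint flag recorded for a framework, or None if absent.

    ASSUMPTION: `state` is parsed JSON with a top-level "frameworks" list
    whose entries have "id" and "checkpoint" keys. Verify this against the
    output of your own cluster before relying on it.
    """
    for framework in state.get("frameworks", []):
        if framework.get("id") == framework_id:
            return framework.get("checkpoint")
    return None


if __name__ == "__main__":
    # Made-up host names and framework ID, purely for demonstration.
    slaves = ["slave-1", "slave-2", "slave-3", "slave-4", "slave-5"]
    for group in batches(slaves, 2):
        print("restart together:", group)

    sample_state = {"frameworks": [{"id": "example-framework", "checkpoint": True}]}
    print(framework_checkpoint(sample_state, "example-framework"))
```

The idea would be to hard restart one batch, fetch and parse that batch's state, and only move on once framework_checkpoint reports True for your FrameworkID. The actual kill, metadata wipe, and restart are left to whatever tooling you already use.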