Safe update of agent attributes

2016-02-22 Thread Zhitao Li
Hi, We recently discovered that updating attributes on Mesos agents is a very risky operation, and has the potential to send agent(s) into a crash loop if not done properly, with errors like "Failed to perform recovery: Incompatible slave info detected". This combined with --recovery_timeout made the

Re: Safe update of agent attributes

2016-02-22 Thread Zameer Manji
Zhitao, In my experience the best way to manage these attributes is to keep attribute changes minimal (i.e., one attribute at a time) and roll them out slowly across a cluster. This way you can catch unsafe mutations quickly and roll back if needed. I don't think there is a whitelist/blacklist

Re: Safe update of agent attributes

2016-02-22 Thread Marco Massenzio
IIRC you can avoid the issue by either using a different work_dir for the agent, or removing (and, possibly, re-creating) it. I'm afraid I don't have a running instance of Mesos on this machine and can't test it out. Also (and this is strictly my opinion :) I would consider a change of attribute

Re: Safe update of agent attributes

2016-02-22 Thread Adam Bordelon
Currently, changing any --attributes or --resources requires draining the agent and killing all running tasks. See https://issues.apache.org/jira/browse/MESOS-1739. You could do a `mesos-slave --recover=cleanup` which essentially kills all the tasks and clears the work_dir; then restart with a `mes

Re: Safe update of agent attributes

2016-02-23 Thread Zhitao Li
Hi Adam, The command `mesos-slave --recover=cleanup` could indeed be used to clean up an incompatible change. I am still concerned about the possibility that a totally valid change to an attributes or resources value could leave the Mesos agent in a crash loop and lose critical tasks after --

Re: Safe update of agent attributes

2016-02-23 Thread Vinod Kone
On Tue, Feb 23, 2016 at 8:44 AM, Zhitao Li wrote:
> Can we consider to add a new option like "--auto_recovery_cleanup" which
> would automatically perform the clean up if detected incompatible slave
> info, or change the default behavior for "--recover"?
Wouldn't you want to know that an incom

Re: Safe update of agent attributes

2016-02-23 Thread Zameer Manji
Is incompatible slave info signaled by a certain exit code?
On Tue, Feb 23, 2016 at 11:15 AM, Vinod Kone wrote:
> On Tue, Feb 23, 2016 at 8:44 AM, Zhitao Li wrote:
>> Can we consider to add a new option like "--auto_recovery_cleanup" which
>> would automatically perform the clean up if dete

Re: Safe update of agent attributes

2016-02-23 Thread Vinod Kone
On Tue, Feb 23, 2016 at 12:59 PM, Zameer Manji wrote:
> Is incompatible slave info signaled by a certain exit code?
Not currently, but we could. A naive/hacky implementation could look at log lines.
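
The log-line approach mentioned above could be sketched roughly as follows. This is only an illustration, not anything shipped with Mesos: the function name and the sample lines are hypothetical, and the matched text is simply the error string quoted earlier in this thread, which may be worded differently in other Mesos versions.

```python
import re
from typing import Iterable

# Error text as quoted earlier in this thread. Assumption: the agent
# logs it verbatim; the wording may differ between Mesos versions.
INCOMPATIBLE_PATTERN = re.compile(r"Incompatible slave info detected")


def has_incompatible_slave_info(log_lines: Iterable[str]) -> bool:
    """Return True if any log line reports incompatible slave info.

    Hypothetical helper for illustration; a wrapper script could call it
    on the agent's log output to decide whether to alert an operator.
    """
    return any(INCOMPATIBLE_PATTERN.search(line) for line in log_lines)


if __name__ == "__main__":
    # Made-up sample input: one unrelated line plus the quoted error.
    sample = [
        "some unrelated log line",
        "Failed to perform recovery: Incompatible slave info detected",
    ]
    print(has_incompatible_slave_info(sample))
```

A check like this only reports the condition. Whether to follow it with an automatic cleanup or leave that decision to an operator is the open question in this thread.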