Hi Adam,

The command `mesos-slave --recover=cleanup` can indeed be used to clean up after an incompatible change.
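For reference, the two-step sequence discussed in this thread (cleanup, then restart with the new attributes) could be scripted roughly as follows. This is only a sketch and has not been run against a live agent: the function name `apply_attribute_change` and the variables `MESOS_SLAVE_BIN`, `WORK_DIR` and `NEW_ATTRIBUTES` (with their default values) are made up for illustration, and any other flags the agent normally needs are simply passed through.

```shell
#!/bin/sh
# Sketch only. The variable names and defaults below are hypothetical,
# not Mesos defaults; adjust them to the local setup.
MESOS_SLAVE_BIN="${MESOS_SLAVE_BIN:-mesos-slave}"
WORK_DIR="${WORK_DIR:-/var/lib/mesos}"
NEW_ATTRIBUTES="${NEW_ATTRIBUTES:-rack:r1}"

# Hypothetical helper: run the cleanup pass, then start the agent with
# the new attributes. Extra arguments ("$@") are forwarded to both
# invocations so the agent's usual flags can be supplied by the caller.
apply_attribute_change() {
    # Step 1: cleanup recovery. Tasks from the previous run are killed,
    # so anything critical should be drained before this point.
    "$MESOS_SLAVE_BIN" --work_dir="$WORK_DIR" --recover=cleanup "$@" || return 1

    # Step 2: start the agent again with the changed attributes.
    "$MESOS_SLAVE_BIN" --work_dir="$WORK_DIR" --attributes="$NEW_ATTRIBUTES" "$@"
}
```

The second step only runs if the cleanup pass exits successfully, which is the ordering that matters here: starting with the new attributes before the old state is cleaned up is what triggers the "Incompatible slave info detected" failure.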
I am still concerned that a perfectly valid change to an --attributes or --resources value could leave the Mesos agent in a crash loop, and lose critical tasks once --recovery_timeout expires, if the update sequence is wrong. Could we consider adding a new option such as "--auto_recovery_cleanup", which would automatically perform the cleanup when incompatible slave info is detected, or changing the default behavior of "--recover"?

Thanks.

On Mon, Feb 22, 2016 at 3:41 PM, Adam Bordelon <a...@mesosphere.io> wrote:

> Currently, changing any --attributes or --resources requires draining the
> agent and killing all running tasks.
> See https://issues.apache.org/jira/browse/MESOS-1739
> You could do a `mesos-slave --recovery=cleanup` which essentially kills
> all the tasks and clears the work_dir; then restart with a `mesos-slave
> --attributes=new_attributes`
> Note that even adding a new attribute is the kind of change that could
> cause a framework scheduler to no longer want its task on that node. For
> example, you add "public_ip=true" and now my scheduler no longer wants to
> run private tasks there. As such, any attribute change needs to notify all
> schedulers of the change.
>
>
> On Mon, Feb 22, 2016 at 2:01 PM, Marco Massenzio <m.massen...@gmail.com>
> wrote:
>
>> IIRC you can avoid the issue by either using a different work_dir for the
>> agent, or removing (and, possibly, re-creating) it.
>>
>> I'm afraid I don't have a running instance of Mesos on this machine and
>> can't test it out.
>>
>> Also (and this is strictly my opinion :) I would consider a change of
>> attribute a "material" change for the Agent and I would avoid trying to
>> recover state from previous runs; but, again, there may be perfectly
>> legitimate cases in which this is desirable.
>>
>> --
>> *Marco Massenzio*
>> http://codetrips.com
>>
>> On Mon, Feb 22, 2016 at 12:11 PM, Zhitao Li <zhitaoli...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> We recently discovered that updating attributes on Mesos agents is a
>>> very risk operation, and has a potential to send agent(s) into a crash loop
>>> if not done properly with errors like "Failed to perform recovery:
>>> Incompatible slave info detected". This combined with
>>> --recovery_timeout made the situation even worse.
>>>
>>> In our setup, some of the attributes are generated from automated
>>> configuration management system, so this opens a possibility that "bad"
>>> configuration could be left on the machine and causing big trouble on next
>>> agent upgrade, if the USR1 signal was not sent on time.
>>>
>>> Some questions:
>>>
>>> 1. Does anyone have a good practice recommended on managing these
>>> attributes safely?
>>> 2. Has Mesos considered to fallback to old metadata if it detects
>>> incompatibility, so agents would keep running with old attributes instead
>>> of falling into crash loop?
>>>
>>> Thanks.
>>>
>>> --
>>> Cheers,
>>>
>>> Zhitao Li
>>>
>>
>>
>

--
Cheers,
Zhitao