Hi Adam,

The command `mesos-slave --recover=cleanup` could indeed be used to clean
up after an incompatible change.

I am still concerned that a perfectly valid change to the attributes or
resources values could leave the Mesos agent in a crash loop and cause it
to lose critical tasks after --recovery_timeout, if the update sequence is
incorrect.

Could we consider adding a new option such as "--auto_recovery_cleanup",
which would automatically perform the cleanup when incompatible slave info
is detected, or changing the default behavior of "--recover"?
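
To make the idea concrete, here is a rough sketch of how a wrapper outside
Mesos could approximate this today. It is not an existing Mesos feature:
`parse_attributes`, `plan_restart` and `extra_flags` are hypothetical names,
the previously applied attributes string is assumed to be recorded by the
wrapper itself, and the attributes are assumed to use the
`name:value;name:value` form. It only builds the command lines and does not
run them.

```python
import shlex


def parse_attributes(spec):
    """Split 'name:value;name:value' into a list of (name, value) pairs."""
    pairs = []
    for item in spec.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing ';' or an empty string
        name, _, value = item.partition(":")
        pairs.append((name.strip(), value.strip()))
    return pairs


def plan_restart(old_spec, new_spec, extra_flags=()):
    """Return the agent command lines (argv lists) to run, in order.

    old_spec is the attributes string the wrapper applied last time
    (an assumption: the wrapper stores it; it is not read from Mesos).
    extra_flags stands in for whatever other flags the agent already uses.
    """
    base = ["mesos-slave", *extra_flags]
    steps = []
    # Conservative check: any difference, including order, counts as a change.
    if parse_attributes(old_spec) != parse_attributes(new_spec):
        # Clean up first so the next start does not hit
        # "Incompatible slave info detected".
        steps.append(base + ["--recover=cleanup"])
    steps.append(base + ["--attributes=" + new_spec])
    return steps


if __name__ == "__main__":
    for argv in plan_restart("rack:r1", "rack:r1;public_ip:true"):
        print(shlex.join(argv))
```

Note that the cleanup step still kills the running tasks, as Adam described,
so this only avoids the crash loop; it does not make the change safe for the
tasks themselves.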

Thanks.

On Mon, Feb 22, 2016 at 3:41 PM, Adam Bordelon <a...@mesosphere.io> wrote:

> Currently, changing any --attributes or --resources requires draining the
> agent and killing all running tasks.
> See https://issues.apache.org/jira/browse/MESOS-1739
> You could do a `mesos-slave --recover=cleanup` which essentially kills
> all the tasks and clears the work_dir; then restart with a `mesos-slave
> --attributes=new_attributes`
> Note that even adding a new attribute is the kind of change that could
> cause a framework scheduler to no longer want its task on that node. For
> example, you add "public_ip=true" and now my scheduler no longer wants to
> run private tasks there. As such, any attribute change needs to notify all
> schedulers of the change.
>
>
> On Mon, Feb 22, 2016 at 2:01 PM, Marco Massenzio <m.massen...@gmail.com>
> wrote:
>
>> IIRC you can avoid the issue by either using a different work_dir for the
>> agent, or removing (and, possibly, re-creating) it.
>>
>> I'm afraid I don't have a running instance of Mesos on this machine and
>> can't test it out.
>>
>> Also (and this is strictly my opinion :) I would consider a change of
>> attribute a "material" change for the Agent and I would avoid trying to
>> recover state from previous runs; but, again, there may be perfectly
>> legitimate cases in which this is desirable.
>>
>> --
>> *Marco Massenzio*
>> http://codetrips.com
>>
>> On Mon, Feb 22, 2016 at 12:11 PM, Zhitao Li <zhitaoli...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> We recently discovered that updating attributes on Mesos agents is a
>>> very risky operation, and has the potential to send agent(s) into a crash
>>> loop if not done properly, with errors like "Failed to perform recovery:
>>> Incompatible slave info detected". Combined with --recovery_timeout,
>>> this made the situation even worse.
>>>
>>> In our setup, some of the attributes are generated by an automated
>>> configuration management system, so it is possible for a "bad"
>>> configuration to be left on the machine and cause serious trouble on the
>>> next agent upgrade if the USR1 signal is not sent in time.
>>>
>>> Some questions:
>>>
>>> 1. Does anyone have a recommended practice for managing these
>>> attributes safely?
>>> 2. Has Mesos considered falling back to the old metadata when it detects
>>> an incompatibility, so that agents keep running with the old attributes
>>> instead of falling into a crash loop?
>>>
>>> Thanks.
>>>
>>> --
>>> Cheers,
>>>
>>> Zhitao Li
>>>
>>
>>
>


-- 
Cheers,

Zhitao
