Hi,
We recently discovered that updating attributes on Mesos agents is a very
risky operation, and has the potential to send agent(s) into a crash loop if
not done properly, with errors like "Failed to perform recovery:
Incompatible slave info detected". This combined with --recovery_timeout made the
Zhitao,
In my experience the best way to manage these attributes is to ensure
attribute changes are minimal (i.e. one attribute at a time) and roll them
out slowly across a cluster. This way you can catch unsafe mutations
quickly and roll back if needed.
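A rough sketch of the "roll out slowly, roll back on trouble" approach described above, in Python. Everything here is hypothetical: `apply_change`, `is_healthy`, and `rollback` are placeholder callbacks you would have to supply yourself (for example, wrappers around your own config management and health checks), not Mesos APIs.

```python
def rolling_update(agents, apply_change, is_healthy, rollback, batch_size=1):
    """Apply a change to agents in small batches.

    Stops at the first unhealthy agent and rolls back every agent that was
    already changed, most recent first. All callbacks are hypothetical
    placeholders supplied by the caller.

    Returns (updated_agents, failed_agent); failed_agent is None on success.
    """
    updated = []
    for start in range(0, len(agents), batch_size):
        batch = agents[start:start + batch_size]
        for agent in batch:
            apply_change(agent)
            updated.append(agent)
        for agent in batch:
            if not is_healthy(agent):
                # Unsafe mutation caught: undo everything done so far.
                for done in reversed(updated):
                    rollback(done)
                return [], agent
    return updated, None
```

With `batch_size=1` this matches the "one agent at a time" end of the spectrum; a larger batch trades safety for speed.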
I don't think there is a whitelist/blacklist
IIRC you can avoid the issue by either using a different work_dir for the
agent, or removing (and, possibly, re-creating) it.
I'm afraid I don't have a running instance of Mesos on this machine and
can't test it out.
Also (and this is strictly my opinion :) I would consider a change of
attribute
Currently, changing any --attributes or --resources requires draining the
agent and killing all running tasks.
See https://issues.apache.org/jira/browse/MESOS-1739
You could do a `mesos-slave --recover=cleanup` which essentially kills all
the tasks and clears the work_dir; then restart with a `mes
Hi Adam,
The command `mesos-slave --recover=cleanup` could indeed be used to
clean up after an incompatible change.
I am still concerned about the possibility that a totally valid change to an
attributes or resources value could leave the Mesos agent in a crash loop
and lose critical tasks after --
On Tue, Feb 23, 2016 at 8:44 AM, Zhitao Li wrote:
> Can we consider adding a new option like "--auto_recovery_cleanup" which
> would automatically perform the clean up if incompatible slave info is
> detected, or change the default behavior for "--recover"?
>
Wouldn't you want to know that an incom
Is incompatible slave info signaled by a certain exit code?
On Tue, Feb 23, 2016 at 11:15 AM, Vinod Kone wrote:
>
> On Tue, Feb 23, 2016 at 8:44 AM, Zhitao Li wrote:
>
>> Can we consider adding a new option like "--auto_recovery_cleanup" which
>> would automatically perform the clean up if dete
On Tue, Feb 23, 2016 at 12:59 PM, Zameer Manji wrote:
> Is incompatible slave info signaled by a certain exit code?
>
Not currently, but we could. A naive/hacky implementation could look at log
lines.
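For illustration, the naive log-line approach might look roughly like the Python sketch below. The matched text is taken from the error message quoted earlier in this thread; the function names and the log path handling are made up for this example, and the exact wording in the agent log may differ between Mesos versions.

```python
import re

# Message text as quoted earlier in this thread; treat it as an assumption
# and verify it against the logs of the Mesos version you actually run.
INCOMPATIBLE_PATTERN = re.compile(r"Incompatible\s+slave\s+info\s+detected")


def has_incompatible_slave_info(log_lines):
    """Return True if any log line reports incompatible slave info."""
    return any(INCOMPATIBLE_PATTERN.search(line) for line in log_lines)


def check_log_file(path):
    """Scan an agent log file (path supplied by the caller) for the error."""
    with open(path, errors="replace") as log_file:
        return has_incompatible_slave_info(log_file)
```

A supervisor script could call `check_log_file` after a failed agent start and alert an operator, or decide whether a cleanup is warranted, rather than cleaning up blindly.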