Zhitao,

In my experience the best way to manage these attributes is to keep each
change minimal (i.e. one attribute at a time) and roll it out slowly
across the cluster. This way you can catch unsafe mutations quickly and
roll back if needed.
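
To make that concrete, here is a minimal Python sketch of such a staged
rollout. It is only an illustration under stated assumptions: `apply` and
`healthy` are hypothetical callbacks you would supply yourself (for example,
"rewrite the agent's attribute config and restart it" and "the agent
re-registered and is not crash looping"). They are not Mesos APIs, and the
one-attribute limit is just the rule of thumb from above.

```python
from typing import Callable, Dict, Iterable, List

Attributes = Dict[str, str]


def changed_keys(old: Attributes, new: Attributes) -> List[str]:
    """Return the sorted attribute names that differ (changed, added, or removed)."""
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


def batches(items: List[str], size: int) -> Iterable[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def staged_rollout(
    agents: List[str],
    old: Attributes,
    new: Attributes,
    apply: Callable[[str, Attributes], None],   # hypothetical: push attributes to one agent
    healthy: Callable[[str], bool],             # hypothetical: did the agent come back cleanly?
    batch_size: int = 1,
) -> List[str]:
    """Apply `new` to agents batch by batch; roll back and stop on the first unhealthy batch."""
    diff = changed_keys(old, new)
    if len(diff) > 1:
        # Enforce the "one attribute at a time" rule before touching any agent.
        raise ValueError(f"refusing to change several attributes at once: {diff}")
    if not diff:
        return []  # nothing to roll out

    updated: List[str] = []
    for batch in batches(agents, batch_size):
        for agent in batch:
            apply(agent, new)
            updated.append(agent)
        if not all(healthy(agent) for agent in batch):
            # Restore the previous attributes on every agent touched so far.
            for agent in updated:
                apply(agent, old)
            raise RuntimeError(f"rollout halted and rolled back after batch {batch}")
    return updated
```

Starting with batch_size=1 limits a bad change to a single agent; the batch
size can be raised once the first few agents have come back cleanly.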

I don't think there is a whitelist/blacklist of attributes that are safe or
unsafe to change, so I think this is the safest way to go.

On Mon, Feb 22, 2016 at 12:11 PM, Zhitao Li <zhitaoli...@gmail.com> wrote:

> Hi,
>
> We recently discovered that updating attributes on Mesos agents is a very
> risky operation, and has the potential to send agent(s) into a crash loop if
> not done properly, with errors like "Failed to perform recovery:
> Incompatible slave info detected". This, combined with --recovery_timeout,
> made the situation even worse.
>
> In our setup, some of the attributes are generated by an automated
> configuration management system, so this opens the possibility that a "bad"
> configuration could be left on the machine and cause big trouble on the next
> agent upgrade, if the USR1 signal was not sent in time.
>
> Some questions:
>
> 1. Does anyone have a recommended practice for managing these
> attributes safely?
> 2. Has Mesos considered falling back to the old metadata if it detects
> an incompatibility, so agents would keep running with the old attributes
> instead of falling into a crash loop?
>
> Thanks.
>
> --
> Cheers,
>
> Zhitao Li
>
> --
> Zameer Manji
>
>
