Zhitao,

In my experience, the best way to manage these attributes is to keep each change minimal (i.e., one attribute at a time) and roll it out slowly across the cluster. That way you can catch unsafe mutations quickly and roll back if needed.
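
As a rough sketch of what I mean, something like the following could sit in front of the configuration management run. It assumes attributes are written in the `name:value;name:value` form passed to the agent's `--attributes` flag. The function names are made up for illustration, and actually pushing the config and restarting agents is left to whatever tooling you already have.

```python
from __future__ import annotations


def parse_attributes(flag: str) -> dict[str, str]:
    """Parse 'rack:r1;zone:a' into {'rack': 'r1', 'zone': 'a'}."""
    attrs: dict[str, str] = {}
    for item in flag.split(";"):
        item = item.strip()
        if not item:
            continue
        # Split on the first colon only, so values may contain colons.
        name, sep, value = item.partition(":")
        if not sep:
            raise ValueError(f"malformed attribute: {item!r}")
        attrs[name.strip()] = value.strip()
    return attrs


def changed_attributes(old: str, new: str) -> set[str]:
    """Names of attributes that were added, removed, or given a new value."""
    before, after = parse_attributes(old), parse_attributes(new)
    return {
        name
        for name in before.keys() | after.keys()
        if before.get(name) != after.get(name)
    }


def is_minimal_change(old: str, new: str) -> bool:
    """True if at most one attribute differs between old and new."""
    return len(changed_attributes(old, new)) <= 1


def rollout_batches(hosts, first_batch: int = 1, growth: int = 2):
    """Yield batches of hosts that grow in size, starting with a small canary."""
    hosts = list(hosts)
    size, start = first_batch, 0
    while start < len(hosts):
        yield hosts[start:start + size]
        start += size
        size *= growth
```

The idea would be to refuse any change where `is_minimal_change` is false, apply the new attributes to the first batch from `rollout_batches`, confirm those agents recover cleanly, and only then move on to the next batch.
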
I don't think there is a whitelist/blacklist of attributes to reference, so I think this is the safest way to go.

On Mon, Feb 22, 2016 at 12:11 PM, Zhitao Li <zhitaoli...@gmail.com> wrote:

> Hi,
>
> We recently discovered that updating attributes on Mesos agents is a very
> risky operation, and has the potential to send agent(s) into a crash loop if
> not done properly, with errors like "Failed to perform recovery:
> Incompatible slave info detected". This, combined with --recovery_timeout,
> made the situation even worse.
>
> In our setup, some of the attributes are generated by an automated
> configuration management system, so this opens the possibility that a "bad"
> configuration could be left on the machine and cause big trouble on the next
> agent upgrade, if the USR1 signal was not sent in time.
>
> Some questions:
>
> 1. Does anyone have a recommended practice for managing these
> attributes safely?
> 2. Has Mesos considered falling back to the old metadata if it detects an
> incompatibility, so agents would keep running with the old attributes instead
> of falling into a crash loop?
>
> Thanks.
>
> --
> Cheers,
>
> Zhitao Li
>
> --
> Zameer Manji