If we do option 1, then there will be no agent crash since the master won't send any unknown operation to an old agent, so option 2 is not a necessity.
On Mon, Apr 16, 2018 at 2:12 PM, Silas Snider <swsni...@apple.com> wrote: > I think we should definitely do option 2 regardless of whether we do > option 1 as well, since although in this case it will still crash 1.5.0, at > least in the future we won't have to have this worry again. > > On 4/16/18, 2:04 PM, "Chun-Hung Hsiao" <chhs...@apache.org> wrote: > > Hi all, > > As some might have already known, we are currently working on patches > to > implement the new GROW_VOLUME and SHRINK_VOLUME operations [1]. > > One problem surfaces is that, since the new operations are not > supported in > Mesos 1.5, they will lead to an agent crash during the operation > application > cycle if a Mesos 1.6 master send these operations to a Mesos 1.5 agent > [2]. > > We are now consider two possibilities to address this compatibility > problem: > > 1) The Mesos 1.6 master should check the agent's Mesos version in > `Master::accept` [3]. Moving forward, if we add new operations in > future > Mesos > releases, we would have code like the following: > > ``` > Version slaveVersion = ...; // Get the Mesos version of the slave of > the > offer. > switch (operation.type()) { > ... > case SOME_NEW_OPERATION: { > if (slaveVersion < minVersionForSomeNewOperation) { > ... // Drop the operation. > } > break; > } > ... > } > ``` > > Pros and cons: > + The new operation won't go into the operation application cycle > since it > is > rejected in the very beginning. This means no resource metadata is > touched. > - Explicit slave version checks at master side make the code look not > very > clean, > and we will need to update this list every time we add a new > operation. > > 2) Treat this issue as an agent crash bug. The Mesos master would > forward > the operation to the agent, regardless of the agent's Mesos version. > In the > agent, > we deploy and backport the following logic in `Slave::applyOperation` > [4]: > > ``` > if (message.operation_info().type() == OPERATION_UNKNOWN) { > ... // Drop the operation and trigger a re-registration or send an > // `UpdateSlaveMessage` to force the master to update the total > resource of > // the slave. > } > ``` > > Pros and cons: > + Easier to add new operations since no new logic needs to be added for > backward > Compability. > - Since the old agent won't know whether the new operations are > speculative > or not, > a re-registration or an `UpdateSlaveMessage` is required. > - Mesos 1.5.0 agents will still have the bug and crash when a new > master > sends a > new operation to them. > > Since both options are viable and there seems to be no clear winner, > we'd > like to > check with the community to see which convention is preferable. Please > let > us know > what you think. Thanks! > > Best, > Chun-Hung > > > [1] https://issues.apache.org/jira/browse/MESOS-4965 > [2] > https://github.com/apache/mesos/blob/1.5.x/src/common/ > protobuf_utils.cpp#L851 > [3] https://github.com/apache/mesos/blob/master/src/master/ > master.cpp#L3899 > [4] https://github.com/apache/mesos/blob/1.5.x/src/slave/ > slave.cpp#L4359 > > > >