I think we should definitely do option 2 regardless of whether we do option 1 
as well, since although in this case it will still crash 1.5.0, at least in the 
future we won't have to have this worry again.

On 4/16/18, 2:04 PM, "Chun-Hung Hsiao" <chhs...@apache.org> wrote:

    Hi all,
    
    As some might have already known, we are currently working on patches to
    implement the new GROW_VOLUME and SHRINK_VOLUME operations [1].
    
    One problem surfaces is that, since the new operations are not supported in
    Mesos 1.5, they will lead to an agent crash during the operation application
    cycle if a Mesos 1.6 master send these operations to a Mesos 1.5 agent [2].
    
    We are now consider two possibilities to address this compatibility problem:
    
    1) The Mesos 1.6 master should check the agent's Mesos version in
    `Master::accept` [3]. Moving forward, if we add new operations in future
    Mesos
    releases, we would have code like the following:
    
    ```
    Version slaveVersion = ...; // Get the Mesos version of the slave of the
    offer.
    switch (operation.type()) {
      ...
      case SOME_NEW_OPERATION: {
        if (slaveVersion < minVersionForSomeNewOperation) {
          ... // Drop the operation.
        }
        break;
      }
      ...
    }
    ```
    
    Pros and cons:
    + The new operation won't go into the operation application cycle since it
    is
      rejected in the very beginning. This means no resource metadata is
    touched.
    - Explicit slave version checks at master side make the code look not very
    clean,
      and we will need to update this list every time we add a new operation.
    
    2) Treat this issue as an agent crash bug. The Mesos master would forward
    the operation to the agent, regardless of the agent's Mesos version. In the
    agent,
    we deploy and backport the following logic in `Slave::applyOperation` [4]:
    
    ```
    if (message.operation_info().type() == OPERATION_UNKNOWN) {
      ... // Drop the operation and trigger a re-registration or send an
          // `UpdateSlaveMessage` to force the master to update the total
    resource of
          // the slave.
    }
    ```
    
    Pros and cons:
    + Easier to add new operations since no new logic needs to be added for
    backward
      Compability.
    - Since the old agent won't know whether the new operations are speculative
    or not,
      a re-registration or an `UpdateSlaveMessage` is required.
    - Mesos 1.5.0 agents will still have the bug and crash when a new master
    sends a
      new operation to them.
    
    Since both options are viable and there seems to be no clear winner, we'd
    like to
    check with the community to see which convention is preferable. Please let
    us know
    what you think. Thanks!
    
    Best,
    Chun-Hung
    
    
    [1] https://issues.apache.org/jira/browse/MESOS-4965
    [2]
    
https://github.com/apache/mesos/blob/1.5.x/src/common/protobuf_utils.cpp#L851
    [3] https://github.com/apache/mesos/blob/master/src/master/master.cpp#L3899
    [4] https://github.com/apache/mesos/blob/1.5.x/src/slave/slave.cpp#L4359
    


Reply via email to