[API WG] Meeting tomorrow

2018-04-16 Thread Greg Mann
Hey folks,
The API working group will be meeting tomorrow at 11am PST. We'll be
chatting about CPU guarantees and limits.

Feel free to add items to the agenda!

Cheers,
Greg


Scheduler library API change: method to consume Request/Response API calls

2018-04-16 Thread Gastón Kleiman
Hello everyone,


As you all probably know by now, Greg and I are working on enabling operation
feedback for frameworks.

I implemented a new ReconcileOperations scheduler API call, which will allow
frameworks to get the latest operation status sent to the framework, and to
find out if the agent (and external resource provider in the future)
performing an operation is unreachable, marked gone, etc.

This is the first scheduler API call that follows the “Request/Response”
pattern, i.e., the master will synchronously respond with a
mesos::v1::scheduler::Response protobuf message instead of via events. This
allows a scheduler to unambiguously associate status updates with a
particular reconciliation call, which is not possible with the Call/Event
pattern.

Our C++ Scheduler API client implements a void send() method that doesn’t
return the response from the master. I’m adding a new method that will make
it possible for users of the library to get that response.

This method can be implemented in many ways, so I’d like to get your
opinion.

Note: this method will initially be used only by tests, but we might want
to use it if we update our Java bindings to support the “Request/Response”
pattern.

Option #1

process::Future> call(const Call& callMessage)

Where APIResult is a new protobuf message in the mesos.v1.scheduler package
with the following definition:

message APIResult {
  message Status {
    required uint32 status = 1; // HTTP status code
    optional string error = 2; // Body of the HTTP response
  }

  required Status status = 1;

  // The deserialized `v1.scheduler.Response` message.
  //
  // Note: this field is optional, because the master's response to most API
  // calls has a `202 Accepted` status code and an empty body.
  optional Response response = 2;
}

Advantages

- It is possible to know whether the server answered with 200 or with 202.
- It is easy to know if the server’s response included a
  mesos.v1.scheduler.Response.
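
A minimal sketch of how a caller might consume an Option #1 result. The plain structs below are hypothetical stand-ins for the proposed protobuf messages (a `hasResponse` flag stands in for the generated `has_response()` accessor), and `Outcome` and `classify` are illustrative names, not part of the proposal.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for the deserialized `v1.scheduler.Response`.
struct Response {
  std::string body;
};

// Hypothetical plain-struct mirror of the Option #1 `APIResult` message.
struct APIResult {
  std::uint32_t status;  // `Status.status`: HTTP status code (always set).
  std::string error;     // `Status.error`: body of the HTTP response, if any.
  bool hasResponse;      // True when the optional `response` field is set.
  Response response;
};

enum class Outcome { Accepted, Responded, Failed };

// Classifies a result by reading the status code, which Option #1 always
// carries, so 200 and 202 can be told apart directly.
Outcome classify(const APIResult& result) {
  if (result.status == 202) {
    return Outcome::Accepted;   // Call/Event style: empty body.
  }
  if (result.status == 200 && result.hasResponse) {
    return Outcome::Responded;  // Request/Response style: body was parsed.
  }
  return Outcome::Failed;       // Anything else is treated as an error here.
}
```

In this sketch a 200 without a parsed response is treated as a failure; whether that case can occur at all is left open by the proposal.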


Option #2

process::Future> call(const Call& callMessage)

Where APIResult is a new protobuf message in the mesos.v1.scheduler package
with the following definition:

message APIResult {
  message Error {
    required uint32 status = 1; // HTTP status code
    optional string message = 2; // Body of the HTTP response
  }

  optional Error error = 1;
  optional Response response = 2; // mesos.v1.scheduler.Response
}

Note: if the server responds with “202 Accepted” and an empty body, then
both fields will be empty.
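
For comparison, a minimal sketch of how a caller might interpret an Option #2 result. As before, the structs are hypothetical stand-ins for the proposed protobuf messages, and `Outcome` and `classify` are illustrative names only.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for the deserialized `v1.scheduler.Response`.
struct Response {
  std::string body;
};

// Hypothetical mirror of the nested `APIResult.Error` message.
struct Error {
  std::uint32_t status;  // HTTP status code.
  std::string message;   // Body of the HTTP response.
};

// Hypothetical plain-struct mirror of the Option #2 `APIResult` message.
// The `has*` flags stand in for the generated `has_error()` and
// `has_response()` accessors.
struct APIResult {
  bool hasError;
  Error error;
  bool hasResponse;
  Response response;
};

enum class Outcome { Accepted, Responded, Failed };

// With Option #2 the success status code is not carried explicitly, so a
// "202 Accepted" is inferred from both optional fields being unset.
Outcome classify(const APIResult& result) {
  if (result.hasError) {
    return Outcome::Failed;
  }
  if (result.hasResponse) {
    return Outcome::Responded;
  }
  return Outcome::Accepted;
}
```

The difference from Option #1 is visible in the last branch: the caller infers the 202 case from the absence of both fields rather than reading it from a status code.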

Please take a look and let me know what you think!

Thanks,

-Gastón


REMINDER: Mesos 1.6.0 release

2018-04-16 Thread Greg Mann
Hey Apache Mesos devs!
For all tickets which you want to *definitely* make it into Mesos 1.6.0,
ensure that 1.6.0 is set as a 'Target Version', and 'Priority' is set to
Blocker. Please make this change ASAP so that I can sync with you on
progress and be sure that your patches are in 1.6.0.

We currently have 9 open blockers targeting 1.6.0, and they are mostly on
schedule.

Feel free to contact me if you have any questions or concerns relating to
the 1.6.0 release!

Cheers,
Greg


Re: Convention for Backward Compatibility for New Operations in Mesos 1.6

2018-04-16 Thread James Peach

> On Apr 16, 2018, at 2:04 PM, Chun-Hung Hsiao  wrote:
> 
> Hi all,
> 
> As some might have already known, we are currently working on patches to
> implement the new GROW_VOLUME and SHRINK_VOLUME operations [1].
> 
> One problem surfaces is that, since the new operations are not supported in
> Mesos 1.5, they will lead to an agent crash during the operation application
> cycle if a Mesos 1.6 master send these operations to a Mesos 1.5 agent [2].
> 
> We are now consider two possibilities to address this compatibility problem:
> 
> 1) The Mesos 1.6 master should check the agent's Mesos version in
> `Master::accept` [3]. Moving forward, if we add new operations in future
> Mesos
> releases, we would have code like the following:

Using a capability follows the existing practice. I'm also sympathetic to the
argument that this is an experimental feature and will cause 1.5 agents to
crash.

> 2) Treat this issue as an agent crash bug. The Mesos master would forward
> the operation to the agent, regardless of the agent's Mesos version. In the
> agent,
> we deploy and backport the following logic in `Slave::applyOperation` [4]:
> 
> ```
> if (message.operation_info().type() == OPERATION_UNKNOWN) {
>  ... // Drop the operation and trigger a re-registration or send an
>  // `UpdateSlaveMessage` to force the master to update the total
> resource of
>  // the slave.
> }
> ```

You should never drop operations. The agent should respond with some sort of
"UNKNOWN/UNSUPPORTED" status.

J

Re: Convention for Backward Compatibility for New Operations in Mesos 1.6

2018-04-16 Thread Greg Mann
> Crashing the agent is definitely not a viable option IMO.
>
> Why can't we use agent capabilities instead of agent version and reject
> such operations at master? This is one of the main reasons we introduced
> the concept of framework, master, agent capabilities.
>
>
One thing worth mentioning is that this crash would only manifest when:
1) The operator has set the experimental RESOURCE_PROVIDER capability on
the 1.5 agents in the cluster, and
2) Frameworks/operators begin using the new GROW_VOLUME/SHRINK_VOLUME
operations before upgrade is complete, while some agents are still on 1.5

We can definitely use a capability to address this scenario. For some
reason I find myself hesitant to add capabilities for small features or
edge cases like this, but perhaps there's no reason for such hesitation?

Cheers,
Greg


Re: Proposal: Constrained upgrades from Mesos 1.6

2018-04-16 Thread Greg Mann
Thanks Ben! I think that we can take care of this issue by either examining
the version present in 'MasterInfo', or by introducing a new master
capability.

So, no need for constrained upgrades after all.

Cheers,
Greg

On Tue, Apr 10, 2018 at 5:48 PM, Benjamin Mahler  wrote:

> -user
>
> Do you have a link to the technical details of why this needs to be done?
>
> For instance, why can't master/agent versions be used to determine which
> behavior is performed between the master and agent?
>
> On Tue, Apr 10, 2018 at 5:34 PM, Greg Mann  wrote:
>
> > Hi all,
> > We are currently working on patches to implement the new GROW_VOLUME and
> > SHRINK_VOLUME operations [1]. In order to make it into Mesos 1.6, we're
> > pursuing a workaround which affects the way these operations are
> accounted
> > for in the Mesos master. These operations will be marked as
> *experimental* in
> > Mesos 1.6.
> >
> > As a result of this workaround, upgrades from Mesos 1.6 to later versions
> > would be affected. Specifically, 1.6 masters would not be able to
> properly
> > account for the resources of failed GROW/SHRINK operations on 1.7+
> agents.
> > This means that when upgrading from Mesos 1.6, if GROW_VOLUME or
> > SHRINK_VOLUME operations are being used during the upgrade, the masters
> > *must* be upgraded first. If we follow this proposal, this constraint
> > would be clearly spelled out in our upgrade documentation.
> >
> > Since, in general, we guarantee compatibility between Mesos masters and
> > agents of the same major version, we wanted to check with the community
> to
> > see if this constraint on 1.6 upgrades would be acceptable. Please let us
> > know what you think!
> >
> > Cheers,
> > Greg
> >
> >
> > [1] https://issues.apache.org/jira/browse/MESOS-4965
> >
>


Re: Convention for Backward Compatibility for New Operations in Mesos 1.6

2018-04-16 Thread Chun-Hung Hsiao
If we do option 1, then there will be no agent crash since the master won't
send any unknown operation to an old agent,
so option 2 is not a necessity.

On Mon, Apr 16, 2018 at 2:12 PM, Silas Snider  wrote:

> I think we should definitely do option 2 regardless of whether we do
> option 1 as well, since although in this case it will still crash 1.5.0, at
> least in the future we won't have to have this worry again.


Re: Convention for Backward Compatibility for New Operations in Mesos 1.6

2018-04-16 Thread Chun-Hung Hsiao
Are you suggesting that for every new operation we'll introduce a new
capability?

On Mon, Apr 16, 2018 at 2:14 PM, Vinod Kone  wrote:

> Crashing the agent is definitely not a viable option IMO.
>
> Why can't we use agent capabilities instead of agent version and reject
> such operations at master? This is one of the main reasons we introduced
> the concept of framework, master, agent capabilities.


Re: Convention for Backward Compatibility for New Operations in Mesos 1.6

2018-04-16 Thread Vinod Kone
Crashing the agent is definitely not a viable option IMO.

Why can't we use agent capabilities instead of agent version and reject
such operations at master? This is one of the main reasons we introduced
the concept of framework, master, agent capabilities.
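
A rough sketch of what a capability-based check at the master could look like. Everything here is an assumption made for illustration: the `Agent` struct, the `requiredCapability` and `masterAccepts` helpers, and the capability name are hypothetical and not taken from the Mesos code base.

```cpp
#include <set>
#include <string>

// Hypothetical subset of operation types.
enum class OperationType { RESERVE, CREATE, GROW_VOLUME, SHRINK_VOLUME };

// Hypothetical view of an agent: the set of capabilities it advertised
// when it registered with the master.
struct Agent {
  std::set<std::string> capabilities;
};

// Returns the capability an agent must advertise before the master forwards
// the given operation, or an empty string if none is required.
std::string requiredCapability(OperationType type) {
  switch (type) {
    case OperationType::GROW_VOLUME:
    case OperationType::SHRINK_VOLUME:
      return "RESIZE_VOLUME";  // Capability name assumed for this sketch.
    default:
      return "";
  }
}

// The master rejects the operation up front when the capability is missing,
// so an old agent never sees an operation it does not understand.
bool masterAccepts(const Agent& agent, OperationType type) {
  const std::string required = requiredCapability(type);
  return required.empty() || agent.capabilities.count(required) > 0;
}
```

Compared with a version comparison, the check keys on what the agent says it supports rather than on which release it runs.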


Re: Convention for Backward Compatibility for New Operations in Mesos 1.6

2018-04-16 Thread Silas Snider
I think we should definitely do option 2 regardless of whether we do option 1 
as well, since although in this case it will still crash 1.5.0, at least in the 
future we won't have to have this worry again.


Re: Question on status update retry in agent

2018-04-16 Thread Varun Gupta
We use explicit acks from the scheduler.

Here is a snippet of the logs. Please see the entries for status update UUID:
a918f5ed-a604-415a-ad62-5a34fb6334ef

W0416 00:41:25.843505 124530 status_update_manager.cpp:761] Duplicate
status update acknowledgment (UUID: 67f548b4-96cb-4b57-8720-2c8a4ba347e8)
for update TASK_RUNNING (UUID: a918f5ed-a604-415a-ad62-5a34fb6334ef) for
task node-12__aea9be06-3d26-42c1-ac00-e7b840d47699 of framework
5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190

E0416 00:41:25.843559 124542 slave.cpp:2951] Failed to handle status update
acknowledgement (UUID: 67f548b4-96cb-4b57-8720-2c8a4ba347e8) for task
node-12__aea9be06-3d26-42c1-ac00-e7b840d47699 of framework
5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190: Duplicate acknowledgement

W0416 00:41:28.416702 124539 status_update_manager.cpp:478] Resending
status update TASK_RUNNING (UUID: f320b354-067d-4421-a650-aba7a17b219e) for
task node-12__50d38098-f026-4720-b374-a94e7171c830 of framework
5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190

I0416 00:41:28.416910 124530 slave.cpp:4051] Forwarding the update
TASK_RUNNING (UUID: f320b354-067d-4421-a650-aba7a17b219e) for task
node-12__50d38098-f026-4720-b374-a94e7171c830 of framework
5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190 to master@10.160.41.62:5050

W0416 00:41:28.425237 124537 status_update_manager.cpp:478] Resending
status update TASK_RUNNING (UUID: a918f5ed-a604-415a-ad62-5a34fb6334ef) for
task node-12__aea9be06-3d26-42c1-ac00-e7b840d47699 of framework
5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190

I0416 00:41:28.425361 124525 slave.cpp:4051] Forwarding the update
TASK_RUNNING (UUID: a918f5ed-a604-415a-ad62-5a34fb6334ef) for task
node-12__aea9be06-3d26-42c1-ac00-e7b840d47699 of framework
5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190 to master@10.160.41.62:5050

I0416 00:41:36.723466 124518 status_update_manager.cpp:395] Received status
update acknowledgement (UUID: b99b15d6-d9af-4395-9c46-709f91db90e1) for
task node-12__50d38098-f026-4720-b374-a94e7171c830 of framework
5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190

W0416 00:41:36.723588 124518 status_update_manager.cpp:761] Duplicate
status update acknowledgment (UUID: b99b15d6-d9af-4395-9c46-709f91db90e1)
for update TASK_RUNNING (UUID: f320b354-067d-4421-a650-aba7a17b219e) for
task node-12__50d38098-f026-4720-b374-a94e7171c830 of framework
5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190

E0416 00:41:36.723650 124518 slave.cpp:2951] Failed to handle status update
acknowledgement (UUID: b99b15d6-d9af-4395-9c46-709f91db90e1) for task
node-12__50d38098-f026-4720-b374-a94e7171c830 of framework
5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190: Duplicate acknowledgement

I0416 00:41:36.730357 124527 status_update_manager.cpp:395] Received status
update acknowledgement (UUID: 67f548b4-96cb-4b57-8720-2c8a4ba347e8) for
task node-12__aea9be06-3d26-42c1-ac00-e7b840d47699 of framework
5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190

W0416 00:41:36.730417 124527 status_update_manager.cpp:761] Duplicate
status update acknowledgment (UUID: 67f548b4-96cb-4b57-8720-2c8a4ba347e8)
for update TASK_RUNNING (UUID: a918f5ed-a604-415a-ad62-5a34fb6334ef) for
task node-12__aea9be06-3d26-42c1-ac00-e7b840d47699 of framework
5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190

E0416 00:41:36.730465 124518 slave.cpp:2951] Failed to handle status update
acknowledgement (UUID: 67f548b4-96cb-4b57-8720-2c8a4ba347e8) for task
node-12__aea9be06-3d26-42c1-ac00-e7b840d47699 of framework
5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190: Duplicate acknowledgement

I0416 00:41:43.769918 124532 http.cpp:277] HTTP GET for
/slave(1)/state.json from 10.65.5.13:42600 with
User-Agent='filebundle-agent'

W0416 00:41:48.417913 124539 status_update_manager.cpp:478] Resending
status update TASK_RUNNING (UUID: f320b354-067d-4421-a650-aba7a17b219e) for
task node-12__50d38098-f026-4720-b374-a94e7171c830 of framework
5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190

I0416 00:41:48.418166 124538 slave.cpp:4051] Forwarding the update
TASK_RUNNING (UUID: f320b354-067d-4421-a650-aba7a17b219e) for task
node-12__50d38098-f026-4720-b374-a94e7171c830 of framework
5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190 to master@10.160.41.62:5050

W0416 00:41:48.425532 124539 status_update_manager.cpp:478] Resending
status update TASK_RUNNING (UUID: a918f5ed-a604-415a-ad62-5a34fb6334ef) for
task node-12__aea9be06-3d26-42c1-ac00-e7b840d47699 of framework
5f606c54-8caa-4e69-afdf-7f2c79acc7e6-1190


I discussed this with @zhitao. Here is a plausible explanation of the bug:
the executor sends updates to the agent every 60 seconds, and the agent keeps
them in a pending queue. When the acks come in, they can arrive in any order
relative to the status updates, but the _handle section pops the last update
from the queue without making sure the ack was for that status update.
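
To make the suspected failure mode concrete, here is a minimal sketch of a pending queue that only pops an update when the ack's UUID matches it. This is not the Mesos status update manager; `StatusUpdate`, `StatusUpdateStream` and their members are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical status update: a UUID plus the task state it reports.
struct StatusUpdate {
  std::string uuid;
  std::string state;
};

// Hypothetical per-task stream of updates awaiting acknowledgement.
class StatusUpdateStream {
public:
  void enqueue(const StatusUpdate& update) { pending.push_back(update); }

  // Pops the oldest pending update only if the ack carries its UUID.
  // An ack for any other UUID is reported as a mismatch and the queue
  // is left untouched.
  bool acknowledge(const std::string& uuid) {
    if (pending.empty() || pending.front().uuid != uuid) {
      return false;
    }
    pending.pop_front();
    return true;
  }

  std::size_t size() const { return pending.size(); }

private:
  std::deque<StatusUpdate> pending;
};
```

With a guard like this, an out-of-order ack cannot remove an update it was not sent for; the update stays queued until its own ack arrives.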




On Tue, Apr 10, 2018 at 6:43 PM, Benjamin Mahler  wrote:

> Do you have log

Convention for Backward Compatibility for New Operations in Mesos 1.6

2018-04-16 Thread Chun-Hung Hsiao
Hi all,

As some of you may already know, we are currently working on patches to
implement the new GROW_VOLUME and SHRINK_VOLUME operations [1].

One problem that has surfaced is that, since the new operations are not
supported in Mesos 1.5, they will lead to an agent crash during the operation
application cycle if a Mesos 1.6 master sends these operations to a Mesos 1.5
agent [2].

We are now considering two possibilities to address this compatibility problem:

1) The Mesos 1.6 master should check the agent's Mesos version in
`Master::accept` [3]. Moving forward, if we add new operations in future Mesos
releases, we would have code like the following:

```
Version slaveVersion = ...; // Get the Mesos version of the slave of the offer.
switch (operation.type()) {
  ...
  case SOME_NEW_OPERATION: {
    if (slaveVersion < minVersionForSomeNewOperation) {
      ... // Drop the operation.
    }
    break;
  }
  ...
}
```

Pros and cons:
+ The new operation won't go into the operation application cycle since it is
  rejected at the very beginning. This means no resource metadata is touched.
- Explicit slave version checks on the master side make the code less clean,
  and we will need to update this list every time we add a new operation.
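
As a self-contained illustration of option 1, the sketch below gates the two new operations on a minimum agent version. The `Version` struct, its field names, and `masterShouldDrop` are hypothetical; the 1.6.0 threshold simply reflects that the operations are new in Mesos 1.6.

```cpp
#include <tuple>

// Hypothetical version triple (not the stout `Version` class).
struct Version {
  int majorVersion;
  int minorVersion;
  int patchVersion;
};

// Lexicographic comparison on (major, minor, patch).
bool operator<(const Version& left, const Version& right) {
  return std::tie(left.majorVersion, left.minorVersion, left.patchVersion) <
         std::tie(right.majorVersion, right.minorVersion, right.patchVersion);
}

// Hypothetical subset of operation types.
enum class OperationType { RESERVE, GROW_VOLUME, SHRINK_VOLUME };

// Returns true if the master should drop the operation because the agent
// is too old to understand it.
bool masterShouldDrop(const Version& slaveVersion, OperationType type) {
  const Version minVersionForResize{1, 6, 0};

  switch (type) {
    case OperationType::GROW_VOLUME:
    case OperationType::SHRINK_VOLUME:
      return slaveVersion < minVersionForResize;
    default:
      return false;  // Operations that predate 1.6 are always forwarded.
  }
}
```

The `switch` is where the maintenance cost mentioned in the cons shows up: each new operation needs its own case and minimum version.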

2) Treat this issue as an agent crash bug. The Mesos master would forward
the operation to the agent, regardless of the agent's Mesos version. In the
agent, we deploy and backport the following logic in `Slave::applyOperation` [4]:

```
if (message.operation_info().type() == OPERATION_UNKNOWN) {
  ... // Drop the operation and trigger a re-registration or send an
      // `UpdateSlaveMessage` to force the master to update the total
      // resource of the slave.
}
```

Pros and cons:
+ Easier to add new operations since no new logic needs to be added for
  backward compatibility.
- Since the old agent won't know whether the new operations are speculative
  or not, a re-registration or an `UpdateSlaveMessage` is required.
- Mesos 1.5.0 agents will still have the bug and crash when a new master
  sends a new operation to them.
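
A small sketch of the agent-side idea in option 2: any operation type the agent does not recognize collapses to an unknown value, and the agent resynchronizes with the master instead of crashing. The enum values, the raw wire numbers, and the `fromWire` and `applyOperation` helpers are all hypothetical.

```cpp
// Hypothetical operation types known to an old agent. Anything else
// received from a newer master maps to UNKNOWN.
enum class OperationType { UNKNOWN = 0, RESERVE = 1, CREATE = 2 };

// Maps a raw wire value to a known type, falling back to UNKNOWN.
OperationType fromWire(int raw) {
  switch (raw) {
    case 1: return OperationType::RESERVE;
    case 2: return OperationType::CREATE;
    default: return OperationType::UNKNOWN;
  }
}

// What the agent does with an incoming operation in this sketch.
enum class AgentAction { Apply, DropAndResync };

// Unknown operations are not applied; the agent instead resynchronizes its
// total resources with the master (for example via re-registration or an
// `UpdateSlaveMessage`, as described in the text).
AgentAction applyOperation(int rawType) {
  if (fromWire(rawType) == OperationType::UNKNOWN) {
    return AgentAction::DropAndResync;
  }
  return AgentAction::Apply;
}
```

Because the fallback lives in one place, adding a new operation on the master side needs no per-operation compatibility code on old agents that already carry this logic.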

Since both options are viable and there seems to be no clear winner, we'd
like to check with the community to see which convention is preferable.
Please let us know what you think. Thanks!

Best,
Chun-Hung


[1] https://issues.apache.org/jira/browse/MESOS-4965
[2] https://github.com/apache/mesos/blob/1.5.x/src/common/protobuf_utils.cpp#L851
[3] https://github.com/apache/mesos/blob/master/src/master/master.cpp#L3899
[4] https://github.com/apache/mesos/blob/1.5.x/src/slave/slave.cpp#L4359