[jira] [Created] (MESOS-7689) Libprocess can crash on malformed request paths for libprocess messages.

2017-06-16 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-7689:
--

 Summary: Libprocess can crash on malformed request paths for 
libprocess messages.
 Key: MESOS-7689
 URL: https://issues.apache.org/jira/browse/MESOS-7689
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Reporter: Benjamin Mahler


The following code will crash when there is a libprocess message and the path 
cannot be decoded:

https://github.com/apache/mesos/blob/1.3.0/3rdparty/libprocess/src/process.cpp#L798-L800
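
For illustration, a self-contained sketch of the failure mode (this is not the 
actual libprocess code): if the request path contains a malformed 
percent-escape such as {{/foo%zz}}, a decoder paired with a CHECK/assert aborts 
the whole process, whereas a decoder that reports failure lets the server 
answer with a 400 instead.

{code}
#include <cctype>
#include <optional>
#include <string>

// Minimal percent-decoder that reports failure instead of aborting.
// A CHECK/assert on the decode result is what turns a malformed path
// into a process-wide crash.
std::optional<std::string> percentDecode(const std::string& path)
{
  std::string out;
  for (size_t i = 0; i < path.size(); ++i) {
    if (path[i] != '%') {
      out += path[i];
      continue;
    }
    // A '%' must be followed by two hex digits.
    if (i + 2 >= path.size() ||
        !std::isxdigit(static_cast<unsigned char>(path[i + 1])) ||
        !std::isxdigit(static_cast<unsigned char>(path[i + 2]))) {
      return std::nullopt;  // Malformed escape, e.g. "/foo%zz".
    }
    out += static_cast<char>(std::stoi(path.substr(i + 1, 2), nullptr, 16));
    i += 2;
  }
  return out;
}
{code}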



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7688) Improve master failover performance by reducing unnecessary agent retries.

2017-06-16 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7688:
---
Description: 
Currently, during a failover the agents will (re-)register with the master. 
While the master is recovering, the master may drop messages from the agents, 
and so the agents must retry registration using a backoff mechanism. For large 
clusters, there can be a lot of overhead in processing unnecessary retries from 
the agents, given that these messages must be deserialized and contain all of 
the task / executor information many times over.

In order to reduce this overhead, the idea is to avoid the need for agents to 
blindly retry (re-)registration with the master. Two approaches for this are:

(1) Update the MasterInfo in ZK when the master is recovered. This is a bit of 
an abuse of MasterInfo unfortunately, but the idea is for agents to only 
(re-)register when they see that the master reaches a recovered state. Once 
recovered, the master will not drop messages, and therefore agents only need to 
retry when the connection breaks.

(2) Have the master reply with a retry message when it's in the recovering 
state, so that agents get a clear signal that their messages were dropped. The 
agents only retry when the connection breaks or they get a retry message. This 
one is less optimal, because the master may have to process a lot of messages 
and send retries, but once the master is recovered, the master will process 
only a single (re-)registration from each agent. The number of 
(re-)registrations that occur while the master is recovering can be reduced to 
1 in this approach if the master sends the retry message only after the master 
completes recovery.

  was:
Currently, during a failover the agents will (re-)register with the master. 
While the master is recovering, the master may drop messages from the agents, 
and so the agents must retry registration using a backoff mechanism. For large 
clusters, there can be a lot of overhead in processing unnecessary retries from 
the agents, given that these messages must be deserialized and contain all of 
the task / executor information many times over.

In order to reduce this overhead, the idea is to avoid the need for agents to 
blindly retry (re-)registration with the master. Two approaches for this are:

(1) Update the MasterInfo in ZK when the master is recovered. This is a bit of 
an abuse of MasterInfo unfortunately, but the idea is for agents to only 
(re-)register when they see that the master reaches a recovered state. Once 
recovered, the master will not drop messages, and therefore agents only need to 
retry when the connection breaks.

(2) Have the master reply with a retry message when it's in the recovering 
state, so that agents get a clear signal that their messages were dropped. This 
one is less optimal, because the master may have to process a lot of messages 
and send retries, but once the master is recovered, the master will process 
only a single (re-)registration from each agent. Here, agents only retry when 
the connection breaks or they get a retry message.


> Improve master failover performance by reducing unnecessary agent retries.
> --
>
> Key: MESOS-7688
> URL: https://issues.apache.org/jira/browse/MESOS-7688
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, master
>Reporter: Benjamin Mahler
>  Labels: scalability
>
> Currently, during a failover the agents will (re-)register with the master. 
> While the master is recovering, the master may drop messages from the agents, 
> and so the agents must retry registration using a backoff mechanism. For 
> large clusters, there can be a lot of overhead in processing unnecessary 
> retries from the agents, given that these messages must be deserialized and 
> contain all of the task / executor information many times over.
> In order to reduce this overhead, the idea is to avoid the need for agents to 
> blindly retry (re-)registration with the master. Two approaches for this are:
> (1) Update the MasterInfo in ZK when the master is recovered. This is a bit 
> of an abuse of MasterInfo unfortunately, but the idea is for agents to only 
> (re-)register when they see that the master reaches a recovered state. Once 
> recovered, the master will not drop messages, and therefore agents only need 
> to retry when the connection breaks.
> (2) Have the master reply with a retry message when it's in the recovering 
> state, so that agents get a clear signal that their messages were dropped. 
> The agents only retry when the connection breaks or they get a retry message. 
> This one is less optimal, because the master may have to process a lot of 
> messages and send retries, but once the master is recovered, the master will 
> process only a single (re-)registration from each agent. The number of 
> (re-)registrations that occur while the master is recovering can be reduced to 
> 1 in this approach if the master sends the retry message only after the master 
> completes recovery.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Created] (MESOS-7688) Improve master failover performance by reducing unnecessary agent retries.

2017-06-16 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-7688:
--

 Summary: Improve master failover performance by reducing 
unnecessary agent retries.
 Key: MESOS-7688
 URL: https://issues.apache.org/jira/browse/MESOS-7688
 Project: Mesos
  Issue Type: Improvement
  Components: agent, master
Reporter: Benjamin Mahler


Currently, during a failover the agents will (re-)register with the master. 
While the master is recovering, the master may drop messages from the agents, 
and so the agents must retry registration using a backoff mechanism. For large 
clusters, there can be a lot of overhead in processing unnecessary retries from 
the agents, given that these messages must be deserialized and contain all of 
the task / executor information many times over.

In order to reduce this overhead, the idea is to avoid the need for agents to 
blindly retry (re-)registration with the master. Two approaches for this are:

(1) Update the MasterInfo in ZK when the master is recovered. This is a bit of 
an abuse of MasterInfo unfortunately, but the idea is for agents to only 
(re-)register when they see that the master reaches a recovered state. Once 
recovered, the master will not drop messages, and therefore agents only need to 
retry when the connection breaks.

(2) Have the master reply with a retry message when it's in the recovering 
state, so that agents get a clear signal that their messages were dropped. This 
one is less optimal, because the master may have to process a lot of messages 
and send retries, but once the master is recovered, the master will process 
only a single (re-)registration from each agent. Here, agents only retry when 
the connection breaks or they get a retry message.
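
As a rough sketch of the current blind-retry behavior that both approaches aim 
to avoid (names and constants here are illustrative, not the actual agent 
code, which uses libprocess timers):

{code}
#include <algorithm>
#include <chrono>
#include <random>
#include <thread>

// Hypothetical: sends a (re-)registration message; returns true once the
// master acknowledges it.
bool sendRegistration();

// Illustrative (re-)registration retry with exponential backoff and jitter.
void registerWithBackoff()
{
  std::mt19937 rng{std::random_device{}()};
  std::chrono::milliseconds backoff{500};
  const std::chrono::milliseconds cap{60 * 1000};

  while (!sendRegistration()) {
    // Sleep a random duration in [0, backoff) to avoid a thundering herd.
    std::uniform_int_distribution<long long> jitter(0, backoff.count());
    std::this_thread::sleep_for(std::chrono::milliseconds(jitter(rng)));
    backoff = std::min(backoff * 2, cap);
  }
}
{code}

Under approach (1) or (2), the loop would instead wait for a recovered-master 
signal or an explicit retry message rather than retrying blindly.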



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7683) Introduce a master capability for reservation refinement.

2017-06-15 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-7683:
--

 Summary: Introduce a master capability for reservation refinement.
 Key: MESOS-7683
 URL: https://issues.apache.org/jira/browse/MESOS-7683
 Project: Mesos
  Issue Type: Task
Reporter: Benjamin Mahler


So that frameworks can detect which features the master supports, we have 
proposed introducing master capabilities: MESOS-5675.

For reservation refinement we can add a master capability.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7682) Agent agent downgrade capability checking.

2017-06-15 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7682:
---
Summary: Agent agent downgrade capability checking.  (was: Agent downgrade 
capability checking.)

> Agent agent downgrade capability checking.
> --
>
> Key: MESOS-7682
> URL: https://issues.apache.org/jira/browse/MESOS-7682
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Mahler
>
> It would be great if the agent could prevent a downgrade if it reaches a 
> point where a capability becomes required but the downgraded agent does not 
> have the capability.
> For example, consider the case that an agent starts writing refined 
> reservations to disk (per the RESERVATION_REFINEMENT capability). At this 
> point, the RESERVATION_REFINEMENT capability becomes required. Ideally, the 
> agent persists this information into its state, so that if the agent is 
> downgraded to a pre-RESERVATION_REFINEMENT state, the old agent could detect 
> that a capability it does not have is required. At this point the old agent 
> could refuse to start. This would prevent a "buggy" downgrade due to the old 
> agent mis-reading the checkpointed resources.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7682) Agent downgrade capability checking.

2017-06-15 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-7682:
--

 Summary: Agent downgrade capability checking.
 Key: MESOS-7682
 URL: https://issues.apache.org/jira/browse/MESOS-7682
 Project: Mesos
  Issue Type: Improvement
Reporter: Benjamin Mahler


It would be great if the agent could prevent a downgrade if it reaches a point 
where a capability becomes required but the downgraded agent does not have the 
capability.

For example, consider the case that an agent starts writing refined 
reservations to disk (per the RESERVATION_REFINEMENT capability). At this 
point, the RESERVATION_REFINEMENT capability becomes required. Ideally, the 
agent persists this information into its state, so that if the agent is 
downgraded to a pre-RESERVATION_REFINEMENT state, the old agent could detect 
that a capability it does not have is required. At this point the old agent 
could refuse to start. This would prevent a "buggy" downgrade due to the old 
agent mis-reading the checkpointed resources.
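
A minimal sketch of what such a startup check could look like, assuming a 
hypothetical checkpoint file that lists one required capability per line (the 
actual persistence format would be part of the agent's checkpointed state):

{code}
#include <fstream>
#include <iostream>
#include <set>
#include <string>

// Capabilities this agent binary understands. A pre-refinement build would
// not list RESERVATION_REFINEMENT here.
const std::set<std::string> SUPPORTED = {"MULTI_ROLE"};

// Returns false (refuse to start) if the checkpointed state requires a
// capability this binary does not have.
bool checkRequiredCapabilities(const std::string& checkpointPath)
{
  std::ifstream in(checkpointPath);
  std::string capability;
  while (std::getline(in, capability)) {
    if (SUPPORTED.count(capability) == 0) {
      std::cerr << "Refusing to start: checkpointed state requires "
                << "unsupported capability " << capability << std::endl;
      return false;
    }
  }
  return true;
}
{code}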



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-7655) Reservation Refinement: Update the resources logic.

2017-06-14 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-7655:
--

Assignee: Michael Park

> Reservation Refinement: Update the resources logic.
> ---
>
> Key: MESOS-7655
> URL: https://issues.apache.org/jira/browse/MESOS-7655
> Project: Mesos
>  Issue Type: Bug
>Reporter: Michael Park
>Assignee: Michael Park
>
> With reservation refinement, there is a new framework capability called 
> {{RESERVATION_REFINEMENT}}. The framework is required to use the 
> {{Resource.reservations}} field to express reservations if the capability is 
> set; otherwise, it is required to use the {{Resource.role}} and 
> {{Resource.reservation}} fields.
> After the validation, we transform the resources from the old format to the 
> new format and deal with the new format internally.
> This allows us to only deal with the old format at the validation phase, and 
> update the code to only consider the new format for all other 
> Resources-related functions.
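
A sketch of that transformation, using illustrative structs in place of the 
protobuf messages (field names follow the ticket, but this is not the actual 
Mesos code):

{code}
#include <optional>
#include <string>
#include <vector>

struct ReservationInfo
{
  std::string role;
  std::string principal;
};

struct Resource
{
  std::string role;                            // Old format.
  std::optional<ReservationInfo> reservation;  // Old format.
  std::vector<ReservationInfo> reservations;   // New format: a stack.
};

// Upgrade a validated old-format resource: the single (role, reservation)
// pair becomes a one-element reservation stack, so the rest of the code only
// ever deals with the new format.
void upgradeResource(Resource& resource)
{
  if (!resource.reservations.empty()) {
    return;  // Already in the new format.
  }

  if (resource.reservation.has_value()) {
    ReservationInfo info = *resource.reservation;
    info.role = resource.role;
    resource.reservations.push_back(info);
    resource.reservation.reset();
  }

  resource.role.clear();
}
{code}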



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7668) Update authorization to handle reservation refinement.

2017-06-13 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-7668:
--

 Summary: Update authorization to handle reservation refinement.
 Key: MESOS-7668
 URL: https://issues.apache.org/jira/browse/MESOS-7668
 Project: Mesos
  Issue Type: Task
Reporter: Benjamin Mahler


With reservation refinement, the local authorizer needs to be updated to 
retrieve the role and principal via the {{Resource.reservations}} field.
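
A minimal sketch of that lookup, again with an illustrative struct standing in 
for the protobuf message: with refinement, the relevant role and principal 
come from the most refined entry, i.e. the back of the {{reservations}} stack.

{code}
#include <string>
#include <vector>

struct ReservationInfo { std::string role; std::string principal; };
struct Resource { std::vector<ReservationInfo> reservations; };

// The authorizer would consult the most refined reservation (assumes the
// resource is reserved, i.e. the stack is non-empty).
const ReservationInfo& effectiveReservation(const Resource& resource)
{
  return resource.reservations.back();
}
{code}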



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7665) v0 Operator API update for reservation refinement.

2017-06-13 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-7665:
--

 Summary: v0 Operator API update for reservation refinement.
 Key: MESOS-7665
 URL: https://issues.apache.org/jira/browse/MESOS-7665
 Project: Mesos
  Issue Type: Task
Reporter: Benjamin Mahler


In order to preserve backwards compatibility, the v0 endpoints (e.g. /state) 
should expose the old format using `Resource.role` and `Resource.reservation` 
if the resources do not contain a refined reservation.

If the resource contains a refined reservation, then we need to ensure the v0 
endpoints reflect that in the JSON.
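
A sketch of the corresponding check, with an illustrative struct in place of 
the protobuf message: a resource whose {{reservations}} stack has more than 
one entry is refined and cannot be represented in the old format.

{code}
#include <string>
#include <vector>

struct ReservationInfo { std::string role; };
struct Resource { std::vector<ReservationInfo> reservations; };

// A stack with more than one entry is a refined reservation, which the old
// Resource.role / Resource.reservation format cannot express; such resources
// cannot be downgraded for the v0 JSON.
bool hasRefinedReservation(const Resource& resource)
{
  return resource.reservations.size() > 1;
}
{code}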



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7664) Framework API update for reservation refinement.

2017-06-13 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7664:
---
Issue Type: Task  (was: Documentation)

> Framework API update for reservation refinement.
> 
>
> Key: MESOS-7664
> URL: https://issues.apache.org/jira/browse/MESOS-7664
> Project: Mesos
>  Issue Type: Task
>Reporter: Benjamin Mahler
>Assignee: Michael Park
>
> In order to add reservation refinement, the framework API needs:
> * A way to express the "stack" of reservations.
> * A new capability to gate the feature, since the resource format has to 
> change.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7663) Update the documentation to reflect the addition of reservation refinement.

2017-06-13 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-7663:
--

 Summary: Update the documentation to reflect the addition of 
reservation refinement.
 Key: MESOS-7663
 URL: https://issues.apache.org/jira/browse/MESOS-7663
 Project: Mesos
  Issue Type: Documentation
  Components: documentation
Reporter: Benjamin Mahler


There are a few things we need to be sure to document:

* What reservation refinement is.
* The new "format" for Resource, when using the RESERVATION_REFINEMENT 
capability.
* The filtering of resources if a framework is not RESERVATION_REFINEMENT 
capable.
* The current limitation that only a single reservation can be pushed / popped 
within a single RESERVE / UNRESERVE operation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-7575) Support reservation refinement for hierarchical roles.

2017-06-13 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048573#comment-16048573
 ] 

Benjamin Mahler edited comment on MESOS-7575 at 6/14/17 1:24 AM:
-

Moving this to an epic so that we can capture all of the work needed here.


was (Author: bmahler):
Moving this to an epic so that capture all of the work needed here.

> Support reservation refinement for hierarchical roles.
> --
>
> Key: MESOS-7575
> URL: https://issues.apache.org/jira/browse/MESOS-7575
> Project: Mesos
>  Issue Type: Epic
>Reporter: Michael Park
>Assignee: Michael Park
>
> With the introduction of hierarchical roles, Mesos provides a mechanism to 
> delegate resources down a hierarchy.
> To complement this, we'll introduce a mechanism to *refine* the reservations 
> down the hierarchy.
> For example, given resources allocated to role {{foo}}, it can be further 
> reserved for {{foo/bar}}.
> When the resources allocated to {{foo/bar}} is unreserved, it goes back to 
> where it came from. In this case, back to role {{foo}}.
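
The push/pop semantics can be pictured as a stack of roles; a minimal 
self-contained sketch:

{code}
#include <cassert>
#include <string>
#include <vector>

int main()
{
  // Refinement as a stack: RESERVE pushes a more refined role, UNRESERVE
  // pops back to where the resources came from.
  std::vector<std::string> reservations = {"foo"};

  reservations.push_back("foo/bar");  // Refine: now reserved for foo/bar.
  assert(reservations.back() == "foo/bar");

  reservations.pop_back();            // Unreserve: back to role foo.
  assert(reservations.back() == "foo");

  return 0;
}
{code}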



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7575) Support reservation refinement for hierarchical roles.

2017-06-13 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7575:
---
Epic Name: reservation refinement

> Support reservation refinement for hierarchical roles.
> --
>
> Key: MESOS-7575
> URL: https://issues.apache.org/jira/browse/MESOS-7575
> Project: Mesos
>  Issue Type: Epic
>Reporter: Michael Park
>Assignee: Michael Park
>
> With the introduction of hierarchical roles, Mesos provides a mechanism to 
> delegate resources down a hierarchy.
> To complement this, we'll introduce a mechanism to *refine* the reservations 
> down the hierarchy.
> For example, given resources allocated to role {{foo}}, it can be further 
> reserved for {{foo/bar}}.
> When the resources allocated to {{foo/bar}} is unreserved, it goes back to 
> where it came from. In this case, back to role {{foo}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7575) Support reservation refinement for hierarchical roles.

2017-06-13 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7575:
---
   Summary: Support reservation refinement for hierarchical roles.  (was: 
Support reservation refinement)
Issue Type: Epic  (was: Task)

Moving this to an epic so that we can capture all of the work needed here.

> Support reservation refinement for hierarchical roles.
> --
>
> Key: MESOS-7575
> URL: https://issues.apache.org/jira/browse/MESOS-7575
> Project: Mesos
>  Issue Type: Epic
>Reporter: Michael Park
>Assignee: Michael Park
>
> With the introduction of hierarchical roles, Mesos provides a mechanism to 
> delegate resources down a hierarchy.
> To complement this, we'll introduce a mechanism to *refine* the reservations 
> down the hierarchy.
> For example, given resources allocated to role {{foo}}, it can be further 
> reserved for {{foo/bar}}.
> When the resources allocated to {{foo/bar}} is unreserved, it goes back to 
> where it came from. In this case, back to role {{foo}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-6972) Improve performance of protobuf message passing by removing RepeatedPtrField to vector conversion.

2017-06-12 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16046889#comment-16046889
 ] 

Benjamin Mahler commented on MESOS-6972:


[~dzhuk] I see, sounds like we have two options that depend on the install 
handler implementation. One case is const-access style, in which arenas and 
const access are best; the other is moving the data out of the protobuf, in 
which case non-const access (for movability) with no arenas is best.

> Improve performance of protobuf message passing by removing RepeatedPtrField 
> to vector conversion.
> --
>
> Key: MESOS-6972
> URL: https://issues.apache.org/jira/browse/MESOS-6972
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Mahler
>  Labels: tech-debt
>
> Currently, all protobuf message handlers must take a {{vector}} for repeated 
> fields, rather than a {{RepeatedPtrField}}.
> This requires that a copy be performed of the repeated field's entries (see 
> [here|https://github.com/apache/mesos/blob/9228ebc239dac42825390bebc72053dbf3ae7b09/3rdparty/libprocess/include/process/protobuf.hpp#L78-L87]),
>  which can be very expensive in some cases. We should avoid requiring this 
> expense on the callers.
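
The conversion in question boils down to something like the following sketch 
(the linked libprocess code is the authoritative version): constructing the 
vector deep-copies every element of the repeated field.

{code}
#include <google/protobuf/repeated_field.h>

#include <vector>

// Sketch of the copy the ticket wants to avoid: building a vector from a
// RepeatedPtrField copy-constructs every element, which is expensive for
// large messages (e.g. re-registration messages full of tasks).
template <typename T>
std::vector<T> convert(const google::protobuf::RepeatedPtrField<T>& field)
{
  return std::vector<T>(field.begin(), field.end());
}
{code}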



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7651) Consider a more explicit way to bind reservations / volumes to a framework.

2017-06-09 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16045268#comment-16045268
 ] 

Benjamin Mahler commented on MESOS-7651:


[~xujyan] Updated the description to mention lifecycle.

> Consider a more explicit way to bind reservations / volumes to a framework.
> ---
>
> Key: MESOS-7651
> URL: https://issues.apache.org/jira/browse/MESOS-7651
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Mahler
>
> Currently, when a framework creates a reservation or a persistent volume, and 
> it wants exclusive access to this volume or reservation, it must take a few 
> steps:
> * Ensure that no other frameworks are running within the reservation role (or 
> the other frameworks are co-operative).
> * With hierarchical roles, frameworks must also ensure that the role is a 
> leaf so that no descendant roles will have access to the reservation/volume. 
> This could be done by generating a role (e.g. eng/kafka/).
> It's not easy for the framework to ensure these things, since role ACLs are 
> controlled by the operator.
> We should consider a more direct way for a framework to ensure that their 
> reservation/volume cannot be shared. E.g. by binding it to their framework id 
> (perhaps re-using roles for this rather than introducing something new?)
> We should also consider binding the reservation / volumes, much like other 
> objects (tasks, executors), to the framework's lifecycle, so that if the 
> framework is removed, the reservations / volumes it left behind are cleaned 
> up.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7651) Consider a more explicit way to bind reservations / volumes to a framework.

2017-06-09 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7651:
---
Description: 
Currently, when a framework creates a reservation or a persistent volume, and 
it wants exclusive access to this volume or reservation, it must take a few 
steps:

* Ensure that no other frameworks are running within the reservation role (or 
the other frameworks are co-operative).
* With hierarchical roles, frameworks must also ensure that the role is a leaf 
so that no descendant roles will have access to the reservation/volume. This 
could be done by generating a role (e.g. eng/kafka/).

It's not easy for the framework to ensure these things, since role ACLs are 
controlled by the operator.

We should consider a more direct way for a framework to ensure that their 
reservation/volume cannot be shared. E.g. by binding it to their framework id 
(perhaps re-using roles for this rather than introducing something new?)

We should also consider binding the reservation / volumes, much like other 
objects (tasks, executors), to the framework's lifecycle, so that if the 
framework is removed, the reservations / volumes it left behind are cleaned up.

  was:
Currently, when a framework creates a reservation or a persistent volume, and 
it wants exclusive access to this volume or reservation, it must take a few 
steps:

* Ensure that no other frameworks are running within the reservation role (or 
the other frameworks are co-operative).
* With hierarchical roles, frameworks must also ensure that the role is a leaf 
so that no descendant roles will have access to the reservation/volume. This 
could be done by generating a role (e.g. eng/kafka/).

It's not easy for the framework to ensure these things, since role ACLs are 
controlled by the operator.

We should consider a more direct way for a framework to ensure that their 
reservation/volume cannot be shared. E.g. by binding it to their framework id 
(perhaps re-using roles for this rather than introducing something new?)


> Consider a more explicit way to bind reservations / volumes to a framework.
> ---
>
> Key: MESOS-7651
> URL: https://issues.apache.org/jira/browse/MESOS-7651
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Mahler
>
> Currently, when a framework creates a reservation or a persistent volume, and 
> it wants exclusive access to this volume or reservation, it must take a few 
> steps:
> * Ensure that no other frameworks are running within the reservation role (or 
> the other frameworks are co-operative).
> * With hierarchical roles, frameworks must also ensure that the role is a 
> leaf so that no descendant roles will have access to the reservation/volume. 
> This could be done by generating a role (e.g. eng/kafka/).
> It's not easy for the framework to ensure these things, since role ACLs are 
> controlled by the operator.
> We should consider a more direct way for a framework to ensure that their 
> reservation/volume cannot be shared. E.g. by binding it to their framework id 
> (perhaps re-using roles for this rather than introducing something new?)
> We should also consider binding the reservation / volumes, much like other 
> objects (tasks, executors), to the framework's lifecycle, so that if the 
> framework is removed, the reservations / volumes it left behind are cleaned 
> up.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-3826) Add an optional unique identifier for resource reservations

2017-06-09 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16044833#comment-16044833
 ] 

Benjamin Mahler commented on MESOS-3826:


Filed a related issue: https://issues.apache.org/jira/browse/MESOS-7651

> Add an optional unique identifier for resource reservations
> ---
>
> Key: MESOS-3826
> URL: https://issues.apache.org/jira/browse/MESOS-3826
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Sargun Dhillon
>  Labels: mesosphere, reservations
>
> Thanks to the resource reservation primitives, frameworks can reserve 
> resources. These reservations are per role, which means multiple frameworks 
> can share reservations. This can get very hairy, as multiple reservations can 
> occur on each agent. 
> It would be nice to be able to optionally, uniquely identify reservations by 
> ID, much like persistent volumes are today. This could be done by adding a 
> new protobuf field, such as Resource.ReservationInfo.id, that, if set at 
> reservation time, would come back when the reservation is advertised.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7651) Consider a more explicit way to bind reservations / volumes to a framework.

2017-06-09 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-7651:
--

 Summary: Consider a more explicit way to bind reservations / 
volumes to a framework.
 Key: MESOS-7651
 URL: https://issues.apache.org/jira/browse/MESOS-7651
 Project: Mesos
  Issue Type: Improvement
Reporter: Benjamin Mahler


Currently, when a framework creates a reservation or a persistent volume, and 
it wants exclusive access to this volume or reservation, it must take a few 
steps:

* Ensure that no other frameworks are running within the reservation role (or 
the other frameworks are co-operative).
* With hierarchical roles, frameworks must also ensure that the role is a leaf 
so that no descendant roles will have access to the reservation/volume. This 
could be done by generating a role (e.g. eng/kafka/).

It's not easy for the framework to ensure these things, since role ACLs are 
controlled by the operator.

We should consider a more direct way for a framework to ensure that their 
reservation/volume cannot be shared. E.g. by binding it to their framework id 
(perhaps re-using roles for this rather than introducing something new?)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7033) Update documentation for hierarchical roles.

2017-06-09 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7033:
---
Description: 
A few things to be sure to cover:

* How to ensure that a volume is not shared with other frameworks. Previously, 
this meant running only 1 framework in the role and using ACLs to prevent other 
frameworks from running in the role. With hierarchical roles, this now also 
includes using ACLs to prevent any child roles from being created beneath the 
role (as these children would be able to obtain the reserved resources). We've 
been advising frameworks to generate a role (e.g. eng/kafka/) to 
ensure that they own their reservations (but the dynamic nature of this makes 
setting up ACLs difficult). Longer term, we may need a more explicit way to 
bind reservations or volumes to frameworks.

> Update documentation for hierarchical roles.
> 
>
> Key: MESOS-7033
> URL: https://issues.apache.org/jira/browse/MESOS-7033
> Project: Mesos
>  Issue Type: Task
>  Components: documentation
>Reporter: Neil Conway
>Assignee: Neil Conway
>  Labels: mesosphere
>
> A few things to be sure to cover:
> * How to ensure that a volume is not shared with other frameworks. 
> Previously, this meant running only 1 framework in the role and using ACLs to 
> prevent other frameworks from running in the role. With hierarchical roles, 
> this now also includes using ACLs to prevent any child roles from being 
> created beneath the role (as these children would be able to obtain the 
> reserved resources). We've been advising frameworks to generate a role (e.g. 
> eng/kafka/) to ensure that they own their reservations (but the 
> dynamic nature of this makes setting up ACLs difficult). Longer term, we may 
> need a more explicit way to bind reservations or volumes to frameworks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-6972) Improve performance of protobuf message passing by removing RepeatedPtrField to vector conversion.

2017-06-08 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043178#comment-16043178
 ] 

Benjamin Mahler commented on MESOS-6972:


[~dzhuk] great, yeah I was thinking of this option as well but wasn't sure if 
this works when we use arena allocation, since Swap performs deep copies if one 
side is from an arena and the other is not:

https://developers.google.com/protocol-buffers/docs/reference/arenas#message-class-methods

> Improve performance of protobuf message passing by removing RepeatedPtrField 
> to vector conversion.
> --
>
> Key: MESOS-6972
> URL: https://issues.apache.org/jira/browse/MESOS-6972
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Mahler
>  Labels: tech-debt
>
> Currently, all protobuf message handlers must take a {{vector}} for repeated 
> fields, rather than a {{RepeatedPtrField}}.
> This requires that a copy be performed of the repeated field's entries (see 
> [here|https://github.com/apache/mesos/blob/9228ebc239dac42825390bebc72053dbf3ae7b09/3rdparty/libprocess/include/process/protobuf.hpp#L78-L87]),
>  which can be very expensive in some cases. We should avoid requiring this 
> expense on the callers.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (MESOS-6244) Add support for streaming HTTP request bodies in libprocess.

2017-06-07 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-6244:
--

Shepherd: Benjamin Mahler
Assignee: Anand Mazumdar

> Add support for streaming HTTP request bodies in libprocess.
> 
>
> Key: MESOS-6244
> URL: https://issues.apache.org/jira/browse/MESOS-6244
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Benjamin Mahler
>Assignee: Anand Mazumdar
>
> We currently have support for streaming responses. See MESOS-2438.  Servers 
> can start sending the response body before the body is complete. Clients can 
> start reading a response before the body is complete. This is an optimization 
> for large responses and is a requirement for infinite "streaming" style 
> endpoints.
> We currently do not have support for streaming requests. This would allow a 
> client to stream a large or infinite request body to the server without 
> having to have the complete body in hand, and it would allow a server to read 
> request bodies before they are have been completely received over the 
> connection.
> This is a requirement if we want to allow clients to "stream" data into a 
> server, i.e. an infinite request body.
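
For reference, response streaming in libprocess is built around 
{{http::Pipe}}; a rough sketch of the server-side pattern that this ticket 
wants mirrored for requests (treat the API details as approximate rather than 
exact):

{code}
#include <process/http.hpp>

using process::http::OK;
using process::http::Pipe;
using process::http::Response;

// Sketch: the server hands back a PIPE response and keeps writing chunks;
// the client can start reading before the body is complete.
Response streamingResponse()
{
  Pipe pipe;
  Pipe::Writer writer = pipe.writer();

  OK response;
  response.type = Response::PIPE;
  response.reader = pipe.reader();

  writer.write("chunk 1\n");  // More chunks can follow asynchronously.
  writer.close();             // Signals end-of-body.

  return response;
}
{code}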



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-6531) Add support for incremental gzip compression.

2017-06-07 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-6531:
---
Description: We currently only support compression assuming the entire 
input is available at once. We can add a {{gzip::Compressor}} to support 
incremental compression.  (was: We currently only support compression / 
decompression assuming the entire input is available at once. We can add a 
{{gzip::Compressor}} to support incremental compression.)

> Add support for incremental gzip compression.
> -
>
> Key: MESOS-6531
> URL: https://issues.apache.org/jira/browse/MESOS-6531
> Project: Mesos
>  Issue Type: Improvement
>  Components: stout
>Reporter: Benjamin Mahler
>
> We currently only support compression assuming the entire input is available 
> at once. We can add a {{gzip::Compressor}} to support incremental compression.
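
For context, a minimal sketch of what a {{gzip::Compressor}} might wrap, using 
zlib's incremental deflate API directly (error handling omitted):

{code}
#include <zlib.h>

#include <string>

// Minimal incremental gzip compressor: feed input in chunks via compress(),
// then call finish() to flush the trailing gzip frame.
class Compressor
{
public:
  Compressor()
  {
    stream_ = {};
    // windowBits of 16 + MAX_WBITS selects gzip (not raw zlib) framing.
    deflateInit2(&stream_, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                 16 + MAX_WBITS, 8, Z_DEFAULT_STRATEGY);
  }

  ~Compressor() { deflateEnd(&stream_); }

  std::string compress(const std::string& input, int flush = Z_NO_FLUSH)
  {
    std::string output;
    char buffer[4096];

    stream_.next_in =
      reinterpret_cast<Bytef*>(const_cast<char*>(input.data()));
    stream_.avail_in = static_cast<uInt>(input.size());

    do {
      stream_.next_out = reinterpret_cast<Bytef*>(buffer);
      stream_.avail_out = sizeof(buffer);
      deflate(&stream_, flush);
      output.append(buffer, sizeof(buffer) - stream_.avail_out);
    } while (stream_.avail_out == 0);

    return output;
  }

  std::string finish() { return compress("", Z_FINISH); }

private:
  z_stream stream_;
};
{code}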



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7566) Master crash due to failed check in DRFSorter::remove

2017-06-05 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16037599#comment-16037599
 ] 

Benjamin Mahler commented on MESOS-7566:


[~xujyan] can you file a ticket for the race you described? It isn't the issue 
in this ticket AFAICT, but we should capture it and fix it as well.

> Master crash due to failed check in DRFSorter::remove
> -
>
> Key: MESOS-7566
> URL: https://issues.apache.org/jira/browse/MESOS-7566
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.1, 1.1.2
>Reporter: Zhitao Li
>Priority: Critical
>
> A check in [sorter.cpp#L355 in 1.1.2 | 
> https://github.com/apache/mesos/blob/1.1.2/src/master/allocator/sorter/drf/sorter.cpp#L355]
>  is triggered occasionally in our cluster and crashes the master leader.
> I manually modified that check to print out the related variables, and the 
> following is a master log.
> https://gist.github.com/zhitaoli/0662d9fe1f6d57de344951c05b536bad#file-gistfile1-txt
> From the log, it seems like the check was using a stale revocable CPU value of 
> {{26}} while the new value had been updated to {{25}}, thus the check crashed.
> So far, the two verified occurrences of this bug were both observed near an 
> {{UNRESERVE}} operation (see lines above in the log).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7566) Master crash due to failed check in DRFSorter::remove

2017-06-05 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16037576#comment-16037576
 ] 

Benjamin Mahler commented on MESOS-7566:


For posterity, line 773 in [~zhitao]'s version corresponds to:
https://github.com/apache/mesos/blob/1.1.2/src/master/allocator/mesos/hierarchical.cpp#L749

> Master crash due to failed check in DRFSorter::remove
> -
>
> Key: MESOS-7566
> URL: https://issues.apache.org/jira/browse/MESOS-7566
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.1, 1.1.2
>Reporter: Zhitao Li
>Priority: Critical
>
> A check in [sorter.cpp#L355 in 1.1.2 | 
> https://github.com/apache/mesos/blob/1.1.2/src/master/allocator/sorter/drf/sorter.cpp#L355]
>  is triggered occasionally in our cluster and crashes the master leader.
> I manually modified that check to print out the related variables, and the 
> following is a master log.
> https://gist.github.com/zhitaoli/0662d9fe1f6d57de344951c05b536bad#file-gistfile1-txt
> From the log, it seems like the check was using a stale revocable CPU value of 
> {{26}} while the new value had been updated to {{25}}, thus the check crashed.
> So far, the two verified occurrences of this bug were both observed near an 
> {{UNRESERVE}} operation (see lines above in the log).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7095) Basic make check from getting started link fails

2017-06-02 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035652#comment-16035652
 ] 

Benjamin Mahler commented on MESOS-7095:


[~tillt] does the getting started guide need any updates related to this so 
that users don't hit it?
http://mesos.apache.org/gettingstarted/

> Basic make check from getting started link fails
> 
>
> Key: MESOS-7095
> URL: https://issues.apache.org/jira/browse/MESOS-7095
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: Alec Bruns
>
> *** Aborted at 1486657215 (unix time) try "date -d @1486657215" if you are 
> using GNU date ***
> PC: @ 0x1080b7367 apr_pool_create_ex
> *** SIGSEGV (@0x30) received by PID 25167 (TID 0x7fffbdd073c0) stack trace: ***
> @ 0x7fffb50c7bba _sigtramp
> @ 0x72c0517 (unknown)
> @ 0x107eaa13a svn_pool_create_ex
> @ 0x107691d6e svn::diff()
> @ 0x107691042 SVNTest_DiffPatch_Test::TestBody()
> @ 0x1077026ba testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0x1076b3ad7 testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0x1076b3985 testing::Test::Run()
> @ 0x1076b54f8 testing::TestInfo::Run()
> @ 0x1076b6867 testing::TestCase::Run()
> @ 0x1076c65dc testing::internal::UnitTestImpl::RunAllTests()
> @ 0x1077033da testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0x1076c6007 testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0x1076c5ed8 testing::UnitTest::Run()
> @ 0x1074d55c1 RUN_ALL_TESTS()
> @ 0x1074d5580 main
> @ 0x7fffb4eba255 start
> make[6]: *** [check-local] Segmentation fault: 11
> make[5]: *** [check-am] Error 2
> make[4]: *** [check-recursive] Error 1
> make[3]: *** [check] Error 2
> make[2]: *** [check-recursive] Error 1
> make[1]: *** [check] Error 2
> make: *** [check-recursive] Error 1



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7590) Make the default decline timeout configurable on the master rather than burned into the protobuf.

2017-05-30 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-7590:
--

 Summary: Make the default decline timeout configurable on the 
master rather than burned into the protobuf.
 Key: MESOS-7590
 URL: https://issues.apache.org/jira/browse/MESOS-7590
 Project: Mesos
  Issue Type: Improvement
  Components: master
Reporter: Benjamin Mahler


Currently, many frameworks decline without setting the filter timeout, and we 
have a default filter timeout of 5 seconds burned into the protobuf.

Instead, it would be better if we could configure the default filter timeout on 
the master via a flag. When many frameworks are running and declining with 
short filter timeouts, the master may not have time to try to offer the 
resources to each framework before circling back to the original framework that 
declined them. So, allowing the operator to configure this as a workaround 
would be nice.
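
The proposed behavior amounts to a fallback like the following sketch; the 
flag name is hypothetical, and the structs stand in for the real {{Filters}} 
protobuf message and the master's flags:

{code}
#include <optional>

// Illustrative stand-ins: the real Filters is a protobuf message and the
// real flag would live in the master's flags.
struct Filters { std::optional<double> refuse_seconds; };
struct Flags { double default_filter_refuse_seconds = 5.0; };  // Hypothetical.

// Prefer the framework-provided filter timeout; otherwise fall back to the
// master flag rather than a constant baked into the protobuf definition.
double effectiveRefuseSeconds(const Filters& filters, const Flags& flags)
{
  return filters.refuse_seconds.value_or(flags.default_filter_refuse_seconds);
}
{code}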



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7401) Optionally reject messages when the UPID does not match the IP.

2017-05-30 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7401:
---
Issue Type: Improvement  (was: Bug)

> Optionally reject messages when the UPID does not match the IP.
> 
>
> Key: MESOS-7401
> URL: https://issues.apache.org/jira/browse/MESOS-7401
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: James Peach
>Assignee: James Peach
>Priority: Minor
> Fix For: 1.4.0
>
>
> {{libprocess}} does no validation of the peer UPID so in some deployments it 
> is trivial to inject bogus messages and impersonate legitimate actors. If we 
> add a check to verify that messages are received from the same IP address as 
> the peer UPID claims to be using, we can increase the difficulty of UPID 
> spoofing, and mitigate this somewhat.
> For compatibility, this has to be an optional setting and disabled by default.
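
The check amounts to comparing the address a message actually arrived from 
with the address embedded in the sender's UPID; a self-contained sketch with 
simplified stand-in types (not the actual libprocess change):

{code}
#include <cstdint>
#include <iostream>
#include <string>

// Simplified stand-ins for the libprocess types.
struct UPID { std::string id; uint32_t ip; };     // e.g. "master@10.0.0.1:5050"
struct Message { UPID from; std::string name; };

// Drop the message if the claimed UPID address differs from the address the
// message actually arrived from. Disabled by default for compatibility.
bool shouldDrop(const Message& message, uint32_t peerIp, bool enforce)
{
  if (enforce && message.from.ip != peerIp) {
    std::cerr << "Dropping '" << message.name
              << "': peer IP does not match the UPID" << std::endl;
    return true;
  }
  return false;
}
{code}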



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7540) Add an agent flag for executor re-registration timeout.

2017-05-26 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7540:
---
Fix Version/s: 1.1.3

> Add an agent flag for executor re-registration timeout.
> ---
>
> Key: MESOS-7540
> URL: https://issues.apache.org/jira/browse/MESOS-7540
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: mesosphere
> Fix For: 1.2.2, 1.3.1, 1.4.0, 1.1.3
>
>
> Currently, the executor re-register timeout is hard-coded at 2 seconds. It 
> would be beneficial to allow operators to specify this value.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7569) Allow "old" executors with half-open connections to be preserved during agent upgrade / restart.

2017-05-26 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7569:
---
Fix Version/s: 1.1.3

> Allow "old" executors with half-open connections to be preserved during agent 
> upgrade / restart.
> 
>
> Key: MESOS-7569
> URL: https://issues.apache.org/jira/browse/MESOS-7569
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
> Fix For: 1.2.2, 1.3.1, 1.4.0, 1.1.3
>
>
> Users who have executors in their cluster without the fix to MESOS-7057 will 
> experience these executors potentially being destroyed whenever the agent 
> restarts (or is upgraded).
> This occurs when these old executors have connections idle for > 5 days 
> (default conntrack tcp timeout). At this point, the connection has timed out 
> and is no longer tracked by conntrack. From what we've seen, if the agent 
> stays up, the packets still flow between the executor and agent. However, once 
> the agent restarts, in some cases (presence of a DROP rule, or some flavors of 
> NATing), the executor does not receive the RST/FIN from the kernel and will 
> hold a half-open TCP connection. At this point, when the executor responds to 
> the reconnect message from the restarted agent, its half-open TCP connection 
> closes, and the executor will be destroyed by the agent.
> In order to allow users to preserve the tasks running in these "old" 
> executors (i.e. without the MESOS-7057 fix), we can add *optional* retrying 
> of the reconnect message in the agent. This allows the old executor to 
> correctly establish a link to the agent when the second reconnect message is 
> handled.
> Longer term, heartbeating or TCP keepalives will prevent the connections from 
> reaching the conntrack timeout (see MESOS-7568).
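
On the longer-term fix mentioned above: TCP keepalives keep an otherwise idle 
connection generating traffic, so its conntrack entry never expires. A sketch 
using the standard Linux socket options (not Mesos code):

{code}
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Enable TCP keepalives on a connected socket so that an otherwise idle
// connection keeps refreshing its conntrack entry.
int enableKeepalive(int fd)
{
  int enable = 1;
  if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &enable, sizeof(enable)) != 0) {
    return -1;
  }

  int idle = 3600;  // Seconds of idleness before the first probe.
  return setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
}
{code}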



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7569) Allow "old" executors with half-open connections to be preserved during agent upgrade / restart.

2017-05-26 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7569:
---
Fix Version/s: 1.2.2

> Allow "old" executors with half-open connections to be preserved during agent 
> upgrade / restart.
> 
>
> Key: MESOS-7569
> URL: https://issues.apache.org/jira/browse/MESOS-7569
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
> Fix For: 1.2.2, 1.3.1, 1.4.0
>
>
> Users who have executors in their cluster without the fix to MESOS-7057 will 
> experience these executors potentially being destroyed whenever the agent 
> restarts (or is upgraded).
> This occurs when these old executors have connections idle for > 5 days 
> (default conntrack tcp timeout). At this point, the connection has timed out 
> and is no longer tracked by conntrack. From what we've seen, if the agent 
> stays up, the packets still flow between the executor and agent. However, once 
> the agent restarts, in some cases (presence of a DROP rule, or some flavors of 
> NATing), the executor does not receive the RST/FIN from the kernel and will 
> hold a half-open TCP connection. At this point, when the executor responds to 
> the reconnect message from the restarted agent, its half-open TCP connection 
> closes, and the executor will be destroyed by the agent.
> In order to allow users to preserve the tasks running in these "old" 
> executors (i.e. without the MESOS-7057 fix), we can add *optional* retrying 
> of the reconnect message in the agent. This allows the old executor to 
> correctly establish a link to the agent when the second reconnect message is 
> handled.
> Longer term, heartbeating or TCP keepalives will prevent the connections from 
> reaching the conntrack timeout (see MESOS-7568).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7540) Add an agent flag for executor re-registration timeout.

2017-05-26 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7540:
---
Fix Version/s: 1.2.2

> Add an agent flag for executor re-registration timeout.
> ---
>
> Key: MESOS-7540
> URL: https://issues.apache.org/jira/browse/MESOS-7540
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: mesosphere
> Fix For: 1.2.2, 1.3.1, 1.4.0
>
>
> Currently, the executor re-register timeout is hard-coded at 2 seconds. It 
> would be beneficial to allow operators to specify this value.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7540) Add an agent flag for executor re-registration timeout.

2017-05-26 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7540:
---
Fix Version/s: 1.3.1

> Add an agent flag for executor re-registration timeout.
> ---
>
> Key: MESOS-7540
> URL: https://issues.apache.org/jira/browse/MESOS-7540
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: mesosphere
> Fix For: 1.3.1, 1.4.0
>
>
> Currently, the executor re-register timeout is hard-coded at 2 seconds. It 
> would be beneficial to allow operators to specify this value.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7569) Allow "old" executors with half-open connections to be preserved during agent upgrade / restart.

2017-05-26 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7569:
---
Fix Version/s: 1.3.1

> Allow "old" executors with half-open connections to be preserved during agent 
> upgrade / restart.
> 
>
> Key: MESOS-7569
> URL: https://issues.apache.org/jira/browse/MESOS-7569
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
> Fix For: 1.3.1, 1.4.0
>
>
> Users who have executors in their cluster without the fix to MESOS-7057 will 
> experience these executors potentially being destroyed whenever the agent 
> restarts (or is upgraded).
> This occurs when these old executors have connections idle for > 5 days 
> (default conntrack tcp timeout). At this point, the connection has timed out 
> and is no longer tracked by conntrack. From what we've seen, if the agent 
> stays up, the packets still flow between the executor and agent. However, once 
> the agent restarts, in some cases (presence of a DROP rule, or some flavors of 
> NATing), the executor does not receive the RST/FIN from the kernel and will 
> hold a half-open TCP connection. At this point, when the executor responds to 
> the reconnect message from the restarted agent, its half-open TCP connection 
> closes, and the executor will be destroyed by the agent.
> In order to allow users to preserve the tasks running in these "old" 
> executors (i.e. without the MESOS-7057 fix), we can add *optional* retrying 
> of the reconnect message in the agent. This allows the old executor to 
> correctly establish a link to the agent when the second reconnect message is 
> handled.
> Longer term, heartbeating or TCP keepalives will prevent the connections from 
> reaching the conntrack timeout (see MESOS-7568).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7540) Add an agent flag for executor re-registration timeout.

2017-05-26 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7540:
---
Summary: Add an agent flag for executor re-registration timeout.  (was: Add 
an agent flag for executor re-register timeout)

> Add an agent flag for executor re-registration timeout.
> ---
>
> Key: MESOS-7540
> URL: https://issues.apache.org/jira/browse/MESOS-7540
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: mesosphere
> Fix For: 1.4.0
>
>
> Currently, the executor re-register timeout is hard-coded at 2 seconds. It 
> would be beneficial to allow operators to specify this value.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7574) Allow reservations to multiple roles.

2017-05-26 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-7574:
--

 Summary: Allow reservations to multiple roles.
 Key: MESOS-7574
 URL: https://issues.apache.org/jira/browse/MESOS-7574
 Project: Mesos
  Issue Type: Improvement
Reporter: Benjamin Mahler


There have been some discussions for allowing reservations to multiple roles 
(or more generally, role expressions).

E.g. All resources on GPU agents are reserved for "eng/machine-learning" or 
"finance/forecasting" or "data-science/modeling" to use, because these are the 
roles in my organization that make use of GPUs, and I want to guarantee that 
none of the non-GPU workloads tie up the GPU machines' cpus/mem/disk.

This GPU-related example would allow us to deprecate and remove the 
GPU_RESOURCES capability, which is a hack implementation of reservations to 
multiple roles. Mesos will only offer GPU machine resources to GPU capable 
schedulers. Having the ability to make reservations to multiple roles obviates 
this hack.

With hierarchical roles, we have a restricted version of reservations to 
multiple roles, where the roles are restricted to the descendant roles. For 
example, a reservation for "gpu-workloads" can be allocated to 
"gpu-workloads/eng/image-processing",  "gpu-workloads/data-science/modeling", 
"gpu-workloads/finance/forecasting etc. What isn't achievable is a reservation 
to multiple roles across the tree, e.g. "eng/image-processing" OR 
"finance/forecasting" OR "data-science/modeling". This can get clumsy because 
if "eng/ML" wants to get in on the reserved gpus, the user would have to place 
a related role underneath the "gpu-workloads" role, e.g. "gpu-workloads/eng/ML".

A similar use case has been that some agents are "public" and there are 
disparate roles in the organization that need access to these hosts, so we want 
to ensure that only these roles get access and no other roles can tie up the 
resources on these hosts.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-5332) TASK_LOST on slave restart potentially due to executor race condition

2017-05-25 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025625#comment-16025625
 ] 

Benjamin Mahler commented on MESOS-5332:


In order to enable users who hit this situation to safely upgrade (without 
destroying all executors whose connections have been idle for >5 days), we will 
introduce an optional retry of the reconnect message via MESOS-7569:

https://reviews.apache.org/r/59584/

This will allow the preservation of executors without the relink (MESOS-7057) 
fix when upgrading an agent. Longer term, TCP keepalives or heartbeating will 
be put in place to avoid the connections timing out in conntrack.

> TASK_LOST on slave restart potentially due to executor race condition
> -
>
> Key: MESOS-5332
> URL: https://issues.apache.org/jira/browse/MESOS-5332
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, libprocess
>Affects Versions: 0.26.0
> Environment: Mesos 0.26
> Aurora 0.13
>Reporter: Stephan Erb
>Assignee: Anand Mazumdar
> Attachments: executor-logs.tar.gz, executor-stderr.log, 
> executor-stderrV2.log, mesos-slave.log
>
>
> When restarting the Mesos agent binary, tasks can end up as LOST. We lose 
> from 20% to 50% of all tasks. They are killed by the Mesos agent via:
> {code}
> I0505 08:42:06.781318 21738 slave.cpp:2702] Cleaning up un-reregistered 
> executors
> I0505 08:42:06.781366 21738 slave.cpp:2720] Killing un-reregistered executor 
> 'thermos-nobody-devel-service-28854-0-6a88d62e-656
> 4-4e33-b0bb-1d8039d97afc' of framework 
> 20151001-085346-58917130-5050-37976- at executor(1)@10.X.X.X:40541
> I0505 08:42:06.781446 21738 slave.cpp:2720] Killing un-reregistered executor 
> 'thermos-nobody-devel-service-23839-0-1d2cd0e6-699
> 4-4cba-a9df-3dfc1552667f' of framework 
> 20151001-085346-58917130-5050-37976- at executor(1)@10.X.X.X:35757
> I0505 08:42:06.781466 21738 slave.cpp:2720] Killing un-reregistered executor 
> 'thermos-nobody-devel-service-29970-0-478a7291-d070-4aa8
> -af21-6fda889f750c' of framework 20151001-085346-58917130-5050-37976- at 
> executor(1)@10.X.X.X:51463
> ...
> I0505 08:42:06.781558 21738 slave.cpp:4230] Finished recovery
> {code}
> We have verified that the tasks and their executors are killed by the agent 
> during startup. When stopping the agent using supervisorctl stop, the 
> executors are still running (verified via {{ps aux}}). They are only killed 
> once the agent tries to reregister.
> The issue is hard to reproduce:
> * When restarting the agent binary multiple times, tasks are only lost for 
> the first restart.
> * It is much more likely to occur if the agent binary has been running for a 
> longer period of time (> 7 days)
> Mesos is correctly sticking to the 2 seconds wait time before killing 
> un-reregistered executors. The failed executors receive the reregistration 
> request, but it seems like they fail to send a reply.
> A successful reregistration (not leading to LOST):
> {code}
> I0505 08:41:59.581231 21664 exec.cpp:456] Slave exited, but framework has 
> checkpointing enabled. Waiting 15mins to reconnect with slave 
> 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.780591 21665 exec.cpp:256] Received reconnect request from 
> slave 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.785297 21676 exec.cpp:233] Executor re-registered on slave 
> 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.788579 21676 exec.cpp:245] Executor::reregistered took 
> 1.492339ms
> {code}
> A failed one:
> {code}
> I0505 08:42:04.779677  2389 exec.cpp:256] Received reconnect request from 
> slave 20160118-141153-92471562-5050-6270-S17
> E0505 08:42:05.481374  2408 process.cpp:1911] Failed to shutdown socket with 
> fd 11: Transport endpoint is not connected
> I0505 08:42:05.481374  2395 exec.cpp:456] Slave exited, but framework has 
> checkpointing enabled. Waiting 15mins to reconnect with slave 
> 20160118-141153-92471562-5050-6270-S17
> {code}
> All tasks ending up in LOST have output similar to the one posted above, 
> i.e. the log messages are in the wrong order.
> Does anyone have an idea what might be going on here? 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (MESOS-7569) Allow "old" executors with half-open connections to be preserved during agent upgrade / restart.

2017-05-25 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-7569:
--

Assignee: Benjamin Mahler

> Allow "old" executors with half-open connections to be preserved during agent 
> upgrade / restart.
> 
>
> Key: MESOS-7569
> URL: https://issues.apache.org/jira/browse/MESOS-7569
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>
> Users who have executors in their cluster without the fix to MESOS-7057 may 
> see these executors destroyed whenever the agent restarts (or is upgraded).
> This occurs when these old executors have connections idle for > 5 days 
> (the default conntrack tcp timeout). At this point, the connection has 
> timed out and is no longer tracked by conntrack. From what we've seen, if 
> the agent stays up, the packets still flow between the executor and agent. 
> However, once the agent restarts, in some cases (presence of a DROP rule, 
> or some flavors of NATing), the executor does not receive the RST/FIN from 
> the kernel and will hold a half-open TCP connection. At this point, when 
> the executor responds to the reconnect message from the restarted agent, 
> its half-open TCP connection closes, and the executor will be destroyed by 
> the agent.
> In order to allow users to preserve the tasks running in these "old" 
> executors (i.e. without the MESOS-7057 fix), we can add *optional* retrying 
> of the reconnect message in the agent. This allows the old executor to 
> correctly establish a link to the agent when the second reconnect message 
> is handled.
> Longer term, heartbeating or TCP keepalives will prevent the connections from 
> reaching the conntrack timeout (see MESOS-7568).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7569) Allow "old" executors with half-open connections to be preserved during agent upgrade / restart.

2017-05-25 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7569:
---
Target Version/s: 1.2.2, 1.3.1, 1.4.0, 1.1.3  (was: 1.2.2, 1.3.1, 1.4.0)

> Allow "old" executors with half-open connections to be preserved during agent 
> upgrade / restart.
> 
>
> Key: MESOS-7569
> URL: https://issues.apache.org/jira/browse/MESOS-7569
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Benjamin Mahler
>
> Users who have executors in their cluster without the fix to MESOS-7057 may 
> see these executors destroyed whenever the agent restarts (or is upgraded).
> This occurs when these old executors have connections idle for > 5 days 
> (the default conntrack tcp timeout). At this point, the connection has 
> timed out and is no longer tracked by conntrack. From what we've seen, if 
> the agent stays up, the packets still flow between the executor and agent. 
> However, once the agent restarts, in some cases (presence of a DROP rule, 
> or some flavors of NATing), the executor does not receive the RST/FIN from 
> the kernel and will hold a half-open TCP connection. At this point, when 
> the executor responds to the reconnect message from the restarted agent, 
> its half-open TCP connection closes, and the executor will be destroyed by 
> the agent.
> In order to allow users to preserve the tasks running in these "old" 
> executors (i.e. without the MESOS-7057 fix), we can add *optional* retrying 
> of the reconnect message in the agent. This allows the old executor to 
> correctly establish a link to the agent when the second reconnect message 
> is handled.
> Longer term, heartbeating or TCP keepalives will prevent the connections from 
> reaching the conntrack timeout (see MESOS-7568).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7569) Allow "old" executors with half-open connections to be preserved during agent upgrade / restart.

2017-05-25 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-7569:
--

 Summary: Allow "old" executors with half-open connections to be 
preserved during agent upgrade / restart.
 Key: MESOS-7569
 URL: https://issues.apache.org/jira/browse/MESOS-7569
 Project: Mesos
  Issue Type: Bug
  Components: agent
Reporter: Benjamin Mahler


Users who have executors in their cluster without the fix to MESOS-7057 may 
see these executors destroyed whenever the agent restarts (or is upgraded).

This occurs when these old executors have connections idle for > 5 days (the 
default conntrack tcp timeout). At this point, the connection has timed out 
and is no longer tracked by conntrack. From what we've seen, if the agent 
stays up, the packets still flow between the executor and agent. However, once 
the agent restarts, in some cases (presence of a DROP rule, or some flavors of 
NATing), the executor does not receive the RST/FIN from the kernel and will 
hold a half-open TCP connection. At this point, when the executor responds to 
the reconnect message from the restarted agent, its half-open TCP connection 
closes, and the executor will be destroyed by the agent.

In order to allow users to preserve the tasks running in these "old" executors 
(i.e. without the MESOS-7057 fix), we can add *optional* retrying of the 
reconnect message in the agent. This allows the old executor to correctly 
establish a link to the agent when the second reconnect message is handled.

Longer term, heartbeating or TCP keepalives will prevent the connections from 
reaching the conntrack timeout (see MESOS-7568).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-5361) Consider introducing TCP KeepAlive for Libprocess sockets.

2017-05-25 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025600#comment-16025600
 ] 

Benjamin Mahler commented on MESOS-5361:


Linking in the executor-related tickets that came up due to conntrack 
considering connections stale after 5 days.
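For reference, a minimal Linux sketch of enabling TCP keepalive with tuned 
parameters on a single socket; the specific values below are illustrative, not 
a proposal for what libprocess should use:

{code}
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#include <cstdio>

int main() {
  int fd = socket(AF_INET, SOCK_STREAM, 0);

  // Enable keepalive probes on the socket.
  int on = 1;
  setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));

  // Tighten the Linux defaults (2 hours idle, 75s interval, 9 probes)
  // so dead peers are detected in minutes rather than hours.
  int idle = 60;      // Seconds of idle time before the first probe.
  int interval = 10;  // Seconds between probes.
  int count = 6;      // Unacknowledged probes before the connection is broken.
  setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
  setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval));
  setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count));

  printf("keepalive configured on fd %d\n", fd);
  return 0;
}
{code}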

> Consider introducing TCP KeepAlive for Libprocess sockets.
> --
>
> Key: MESOS-5361
> URL: https://issues.apache.org/jira/browse/MESOS-5361
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Anand Mazumdar
>  Labels: mesosphere
>
> We currently don't use TCP keepalives when creating sockets in libprocess. 
> This might benefit master <-> scheduler and master <-> agent connections, 
> i.e. we can detect failures of any of them faster.
> Currently, if the master process goes down and for some reason the {{RST}} 
> sequence did not reach the scheduler, the scheduler can only come to know 
> about the disconnection when it tries to do a {{send}} itself. 
> The default TCP keepalive values on Linux are of little use in a real world 
> application:
> {code}
> The keepalive routines wait for two hours (7200 secs) before sending the 
> first keepalive probe, and then resend it every 75 seconds. If no ACK 
> response is received for nine consecutive probes, the connection is marked 
> as broken.
> {code}
> However, for long running instances of the scheduler/agent this can still 
> be beneficial. Also, operators might start tuning the values for their 
> clusters explicitly once we start supporting it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7564) Introduce a heartbeat mechanism for v1 HTTP executor <-> agent communication.

2017-05-25 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7564:
---
Summary: Introduce a heartbeat mechanism for v1 HTTP executor <-> agent 
communication.  (was: Introduce a heartbeat mechanism for executor <-> agent 
communication.)

> Introduce a heartbeat mechanism for v1 HTTP executor <-> agent communication.
> -
>
> Key: MESOS-7564
> URL: https://issues.apache.org/jira/browse/MESOS-7564
> Project: Mesos
>  Issue Type: Bug
>Reporter: Anand Mazumdar
>
> Currently, we do not have heartbeats for executor <-> agent communication. 
> This is especially problematic in scenarios where IPFilters are enabled, 
> since the default conntrack keepalive timeout is 5 days. When that timeout 
> elapses, the executor doesn't get notified via a socket disconnection when 
> the agent process restarts. The executor would then get killed if it doesn't 
> re-register by the time the agent recovery process completes.
> Enabling application-level heartbeats or TCP keepalives is a possible way 
> to fix this issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7564) Introduce a heartbeat mechanism for executor <-> agent communication.

2017-05-25 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7564:
---
Issue Type: Bug  (was: Task)

> Introduce a heartbeat mechanism for executor <-> agent communication.
> -
>
> Key: MESOS-7564
> URL: https://issues.apache.org/jira/browse/MESOS-7564
> Project: Mesos
>  Issue Type: Bug
>Reporter: Anand Mazumdar
>
> Currently, we do not have heartbeats for executor <-> agent communication. 
> This is especially problematic in scenarios where IPFilters are enabled, 
> since the default conntrack keepalive timeout is 5 days. When that timeout 
> elapses, the executor doesn't get notified via a socket disconnection when 
> the agent process restarts. The executor would then get killed if it doesn't 
> re-register by the time the agent recovery process completes.
> Enabling application-level heartbeats or TCP keepalives is a possible way 
> to fix this issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7568) Introduce a heartbeat mechanism for v0 executor <-> agent links.

2017-05-25 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-7568:
--

 Summary: Introduce a heartbeat mechanism for v0 executor <-> agent 
links.
 Key: MESOS-7568
 URL: https://issues.apache.org/jira/browse/MESOS-7568
 Project: Mesos
  Issue Type: Bug
Reporter: Anand Mazumdar


Currently, we do not have heartbeats for executor <-> agent communication. This 
is especially problematic in scenarios where IPFilters are enabled, since the 
default conntrack keepalive timeout is 5 days. When that timeout elapses, the 
executor doesn't get notified via a socket disconnection when the agent process 
restarts. The executor would then get killed if it doesn't re-register by the 
time the agent recovery process completes.

Enabling application-level heartbeats or TCP keepalives is a possible way to 
fix this issue.
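As a rough illustration of what an application-level heartbeat could look 
like, here is a standalone sketch that declares the link dead after a few 
missed beats; all names, intervals, and thresholds are hypothetical, not the 
eventual Mesos design:

{code}
#include <chrono>
#include <iostream>
#include <thread>

// Hypothetical stand-in for sending a heartbeat over the executor <-> agent
// link; returns whether the peer acknowledged it.
bool sendHeartbeat() {
  return false; // Pretend the agent stopped acknowledging.
}

int main() {
  // Short interval for demonstration; a real deployment would use a larger
  // value (but far below the 5 day conntrack timeout).
  const auto interval = std::chrono::seconds(1);
  const int maxMissed = 3;

  int missed = 0;
  while (missed < maxMissed) {
    if (sendHeartbeat()) {
      missed = 0; // Peer is alive; reset the counter.
    } else {
      ++missed;   // No ack; count towards declaring the link dead.
    }
    std::this_thread::sleep_for(interval);
  }

  std::cout << "Link presumed dead; a reconnect would be triggered here"
            << std::endl;
  return 0;
}
{code}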



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7468) Could not copy the sandbox path on WebUI

2017-05-10 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005725#comment-16005725
 ] 

Benjamin Mahler commented on MESOS-7468:


I gave a ship it with some comments. Do you know if there's a way to make the 
breadcrumb slash copyable?

> Could not copy the sandbox path on WebUI 
> -
>
> Key: MESOS-7468
> URL: https://issues.apache.org/jira/browse/MESOS-7468
> Project: Mesos
>  Issue Type: Bug
>  Components: webui
>Reporter: haosdent
>Assignee: haosdent
>Priority: Minor
>
> I would get 
> {code}
>  var  lib  mesos  slaves  08879b43-58c9-4db7-a93e-4873e35c8144-S1  frameworks 
>  1c092dff-e6d2-4537-a872-52752929ea7e-  executors  
> test-copy.cfd4d72a-3397-11e7-8e73-02426ed45ffc  runs  
> 3d8e16cb-f5c7-4580-952d-1a230943e154
> {code}
> when I select text in the webui.
> This is because the breadcrumb separator in Bootstrap is defined as 
> {code}
> .breadcrumb > li + li:before {
> content: "/";
> }
> {code}
> So "/" is not included when selecting and copying text, since it is 
> CSS-generated content.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (MESOS-5255) Add GPUs to container resource consumption metrics.

2017-05-10 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-5255:
--

Assignee: (was: Chun-Hung Hsiao)

> Add GPUs to container resource consumption metrics.
> ---
>
> Key: MESOS-5255
> URL: https://issues.apache.org/jira/browse/MESOS-5255
> Project: Mesos
>  Issue Type: Task
>Reporter: Kevin Klues
>  Labels: gpu
>
> Currently the usage callback in the Nvidia GPU isolator is unimplemented:
> {noformat}
> src/slave/containerizer/mesos/isolators/cgroups/devices/gpus/nvidia.cpp
> {noformat}
> It should use functionality from NVML to gather the current GPU usage and add 
> it to a ResourceStatistics object. It is still an open question as to exactly 
> what information we want to expose here (power, memory consumption, current 
> load, etc.). Whatever we decide on should be standard across different GPU 
> types, different GPU vendors, etc.
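For a sense of what gathering usage via NVML could look like, here is a hedged 
standalone sketch (link with -lnvidia-ml; error handling elided, and this is 
not the isolator code):

{code}
#include <nvml.h>

#include <cstdio>

int main() {
  nvmlInit();

  nvmlDevice_t device;
  nvmlDeviceGetHandleByIndex(0, &device);

  // Instantaneous GPU and memory-bandwidth utilization, in percent.
  nvmlUtilization_t utilization;
  nvmlDeviceGetUtilizationRates(device, &utilization);

  // Framebuffer memory consumption, in bytes.
  nvmlMemory_t memory;
  nvmlDeviceGetMemoryInfo(device, &memory);

  printf("gpu: %u%%, mem bw: %u%%, mem used: %llu/%llu bytes\n",
         utilization.gpu, utilization.memory, memory.used, memory.total);

  nvmlShutdown();
  return 0;
}
{code}

These readings could then be copied into the {{ResourceStatistics}} object 
returned by the usage callback; which of them to expose is the open question 
above.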



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7478) Pre-1.2.x master does not work with 1.2.x agent.

2017-05-09 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003380#comment-16003380
 ] 

Benjamin Mahler commented on MESOS-7478:


[~anandmazumdar] aside from my manual testing, I ran the upgrade script. It 
turns out it doesn't catch it because it upgrades the master first, then 
agents. Filed MESOS-7483.

> Pre-1.2.x master does not work with 1.2.x agent.
> 
>
> Key: MESOS-7478
> URL: https://issues.apache.org/jira/browse/MESOS-7478
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Blocker
>
> [~evilezh] reported the following crash in the agent upon running a 1.1.0 
> master against a 1.2.0 agent:
> {noformat}
> F0509 00:19:07.045413  3469 slave.cpp:4609] Check failed: 
> resource.has_allocation_info() 
> *** Check failure stack trace: ***
> @ 0x7f4c4a4fa3cd  google::LogMessage::Fail()
> @ 0x7f4c4a4fc180  google::LogMessage::SendToLog()
> @ 0x7f4c4a4f9fb3  google::LogMessage::Flush()
> @ 0x7f4c4a4fcba9  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f4c49b3bcf5  mesos::internal::slave::Slave::getExecutorInfo()
> @ 0x7f4c49b3cf76  mesos::internal::slave::Slave::runTask()
> @ 0x7f4c49b8832c  ProtobufProcess<>::handler4<>()
> @ 0x7f4c49b4dc06  std::_Function_handler<>::_M_invoke()
> @ 0x7f4c49b6975a  ProtobufProcess<>::visit()
> @ 0x7f4c4a46c933  process::ProcessManager::resume()
> @ 0x7f4c4a477537  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f4c486b8c80  (unknown)
> @ 0x7f4c481d46ba  start_thread
> @ 0x7f4c47f0a82d  (unknown)
> Aborted (core dumped)
> {noformat}
> This appears to have been due to a lack of manual upgrade testing (we also 
> don't have any automated upgrade testing in place).
> The check in {{getExecutorInfo(...)}} 
> [here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp#L4609] 
> crashes with an old master because it occurs before our injection in 
> {{run(...)}}. See the {{runTask(...)}} call into {{getExecutorInfo(...)}} 
> [here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp#L1556].



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-5935) Add upgrade testing to the ASF CI

2017-05-09 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003378#comment-16003378
 ] 

Benjamin Mahler commented on MESOS-5935:


[~vinodkone] I also filed a ticket to test the case where agents are upgraded 
first. Seems like we need an epic here.

> Add upgrade testing to the ASF CI
> -
>
> Key: MESOS-5935
> URL: https://issues.apache.org/jira/browse/MESOS-5935
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 1.0.0
>Reporter: Greg Mann
>  Labels: mesosphere
>
> We should add execution of the {{support/test-upgrade.py}} script to the ASF 
> CI runs. This will require having a build of a previous Mesos version to run 
> against latest master; perhaps we could cache builds of the last stable 
> release somewhere, which could be fetched and executed against CI builds.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7483) Update upgrade test script to also test when agents are upgraded first.

2017-05-09 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-7483:
--

 Summary: Update upgrade test script to also test when agents are 
upgraded first.
 Key: MESOS-7483
 URL: https://issues.apache.org/jira/browse/MESOS-7483
 Project: Mesos
  Issue Type: Improvement
  Components: test
Reporter: Benjamin Mahler


Currently the upgrade test only tries to upgrade the masters first, e.g.

{noformat}
Running upgrade test from mesos 1.1.0 to mesos 1.2.1
+-----------+--------------+--------------+--------------+
| Test case | Framework    | Master       | Agent        |
+-----------+--------------+--------------+--------------+
| #1        | mesos 1.1.0  | mesos 1.1.0  | mesos 1.1.0  |
| #2        | mesos 1.1.0  | mesos 1.2.1  | mesos 1.1.0  |
| #3        | mesos 1.1.0  | mesos 1.2.1  | mesos 1.2.1  |
| #4        | mesos 1.2.1  | mesos 1.2.1  | mesos 1.2.1  |
+-----------+--------------+--------------+--------------+
{noformat}

We should also test the case where the agents are upgraded first.
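As a sketch, an agent-first run would exercise an intermediate case like #2 
below, i.e. a new agent against an old master (version numbers illustrative):

{noformat}
+-----------+--------------+--------------+--------------+
| Test case | Framework    | Master       | Agent        |
+-----------+--------------+--------------+--------------+
| #1        | mesos 1.1.0  | mesos 1.1.0  | mesos 1.1.0  |
| #2        | mesos 1.1.0  | mesos 1.1.0  | mesos 1.2.1  |
| #3        | mesos 1.1.0  | mesos 1.2.1  | mesos 1.2.1  |
| #4        | mesos 1.2.1  | mesos 1.2.1  | mesos 1.2.1  |
+-----------+--------------+--------------+--------------+
{noformat}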



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7478) Pre-1.2.x master does not work with 1.2.x agent.

2017-05-09 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003347#comment-16003347
 ] 

Benjamin Mahler commented on MESOS-7478:


[~anandmazumdar] yeah we need to do that, I synced with vinod about 
resurrecting this on CI: MESOS-5935

> Pre-1.2.x master does not work with 1.2.x agent.
> 
>
> Key: MESOS-7478
> URL: https://issues.apache.org/jira/browse/MESOS-7478
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Blocker
>
> [~evilezh] reported the following crash in the agent upon running a 1.1.0 
> master against a 1.2.0 agent:
> {noformat}
> F0509 00:19:07.045413  3469 slave.cpp:4609] Check failed: 
> resource.has_allocation_info() 
> *** Check failure stack trace: ***
> @ 0x7f4c4a4fa3cd  google::LogMessage::Fail()
> @ 0x7f4c4a4fc180  google::LogMessage::SendToLog()
> @ 0x7f4c4a4f9fb3  google::LogMessage::Flush()
> @ 0x7f4c4a4fcba9  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f4c49b3bcf5  mesos::internal::slave::Slave::getExecutorInfo()
> @ 0x7f4c49b3cf76  mesos::internal::slave::Slave::runTask()
> @ 0x7f4c49b8832c  ProtobufProcess<>::handler4<>()
> @ 0x7f4c49b4dc06  std::_Function_handler<>::_M_invoke()
> @ 0x7f4c49b6975a  ProtobufProcess<>::visit()
> @ 0x7f4c4a46c933  process::ProcessManager::resume()
> @ 0x7f4c4a477537  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f4c486b8c80  (unknown)
> @ 0x7f4c481d46ba  start_thread
> @ 0x7f4c47f0a82d  (unknown)
> Aborted (core dumped)
> {noformat}
> This appears to have been due to a lack of manual upgrade testing (we also 
> don't have any automated upgrade testing in place).
> The check in {{getExecutorInfo(...)}} 
> [here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp#L4609] 
> crashes with an old master because it occurs before our injection in 
> {{run(...)}}. See the {{runTask(...)}} call into {{getExecutorInfo(...)}} 
> [here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp#L1556].



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7478) Pre-1.2.x master does not work with 1.2.x agent.

2017-05-09 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003215#comment-16003215
 ] 

Benjamin Mahler commented on MESOS-7478:


Re ETA: Yes, I have a patch, just manually testing it before posting it.

Re upgrade ordering: My understanding is that we've generally agreed to support 
1.x master against 1.y agents, for all x and y. [~vinodkone] is that clearly 
documented?

> Pre-1.2.x master does not work with 1.2.x agent.
> 
>
> Key: MESOS-7478
> URL: https://issues.apache.org/jira/browse/MESOS-7478
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Blocker
>
> [~evilezh] reported the following crash in the agent upon running a 1.1.0 
> master against a 1.2.0 agent:
> {noformat}
> F0509 00:19:07.045413  3469 slave.cpp:4609] Check failed: 
> resource.has_allocation_info() 
> *** Check failure stack trace: ***
> @ 0x7f4c4a4fa3cd  google::LogMessage::Fail()
> @ 0x7f4c4a4fc180  google::LogMessage::SendToLog()
> @ 0x7f4c4a4f9fb3  google::LogMessage::Flush()
> @ 0x7f4c4a4fcba9  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f4c49b3bcf5  mesos::internal::slave::Slave::getExecutorInfo()
> @ 0x7f4c49b3cf76  mesos::internal::slave::Slave::runTask()
> @ 0x7f4c49b8832c  ProtobufProcess<>::handler4<>()
> @ 0x7f4c49b4dc06  std::_Function_handler<>::_M_invoke()
> @ 0x7f4c49b6975a  ProtobufProcess<>::visit()
> @ 0x7f4c4a46c933  process::ProcessManager::resume()
> @ 0x7f4c4a477537  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f4c486b8c80  (unknown)
> @ 0x7f4c481d46ba  start_thread
> @ 0x7f4c47f0a82d  (unknown)
> Aborted (core dumped)
> {noformat}
> This appears to have been due to a lack of manual upgrade testing (we also 
> don't have any automated upgrade testing in place).
> The check in {{getExecutorInfo(...)}} 
> [here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp#L4609] 
> crashes with an old master because it occurs before our injection in 
> {{run(...)}}. See the {{runTask(...)}} call into {{getExecutorInfo(...)}} 
> [here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp#L1556].



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (MESOS-7478) Pre-1.2.x master does not work with 1.2.x agent.

2017-05-08 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-7478:
--

Assignee: Benjamin Mahler

> Pre-1.2.x master does not work with 1.2.x agent.
> 
>
> Key: MESOS-7478
> URL: https://issues.apache.org/jira/browse/MESOS-7478
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Blocker
>
> [~evilezh] reported the following crash in the agent upon running a 1.1.0 
> master against a 1.2.0 agent:
> {noformat}
> F0509 00:19:07.045413  3469 slave.cpp:4609] Check failed: 
> resource.has_allocation_info() 
> *** Check failure stack trace: ***
> @ 0x7f4c4a4fa3cd  google::LogMessage::Fail()
> @ 0x7f4c4a4fc180  google::LogMessage::SendToLog()
> @ 0x7f4c4a4f9fb3  google::LogMessage::Flush()
> @ 0x7f4c4a4fcba9  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f4c49b3bcf5  mesos::internal::slave::Slave::getExecutorInfo()
> @ 0x7f4c49b3cf76  mesos::internal::slave::Slave::runTask()
> @ 0x7f4c49b8832c  ProtobufProcess<>::handler4<>()
> @ 0x7f4c49b4dc06  std::_Function_handler<>::_M_invoke()
> @ 0x7f4c49b6975a  ProtobufProcess<>::visit()
> @ 0x7f4c4a46c933  process::ProcessManager::resume()
> @ 0x7f4c4a477537  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f4c486b8c80  (unknown)
> @ 0x7f4c481d46ba  start_thread
> @ 0x7f4c47f0a82d  (unknown)
> Aborted (core dumped)
> {noformat}
> This appears to have been due to a lack of manual upgrade testing (we also 
> don't have any automated upgrade testing in place).
> The check in {{getExecutorInfo(...)}} 
> [here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp#L4609] 
> crashes with an old master because it occurs before our injection in 
> {{run(...)}}. See the {{runTask(...)}} call into {{getExecutorInfo(...)}} 
> [here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp#L1556].



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7478) Pre-1.2.x master does not work with 1.2.x agent.

2017-05-08 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-7478:
--

 Summary: Pre-1.2.x master does not work with 1.2.x agent.
 Key: MESOS-7478
 URL: https://issues.apache.org/jira/browse/MESOS-7478
 Project: Mesos
  Issue Type: Bug
  Components: agent
Reporter: Benjamin Mahler
Priority: Blocker


[~evilezh] reported the following crash in the agent upon running a 1.1.0 
master against a 1.2.0 agent:

{noformat}
F0509 00:19:07.045413  3469 slave.cpp:4609] Check failed: 
resource.has_allocation_info() 
*** Check failure stack trace: ***
@ 0x7f4c4a4fa3cd  google::LogMessage::Fail()
@ 0x7f4c4a4fc180  google::LogMessage::SendToLog()
@ 0x7f4c4a4f9fb3  google::LogMessage::Flush()
@ 0x7f4c4a4fcba9  google::LogMessageFatal::~LogMessageFatal()
@ 0x7f4c49b3bcf5  mesos::internal::slave::Slave::getExecutorInfo()
@ 0x7f4c49b3cf76  mesos::internal::slave::Slave::runTask()
@ 0x7f4c49b8832c  ProtobufProcess<>::handler4<>()
@ 0x7f4c49b4dc06  std::_Function_handler<>::_M_invoke()
@ 0x7f4c49b6975a  ProtobufProcess<>::visit()
@ 0x7f4c4a46c933  process::ProcessManager::resume()
@ 0x7f4c4a477537  
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
@ 0x7f4c486b8c80  (unknown)
@ 0x7f4c481d46ba  start_thread
@ 0x7f4c47f0a82d  (unknown)
Aborted (core dumped)
{noformat}

This appears to have been due to a lack of manual upgrade testing (we don't 
have any automated upgrade testing in place).

The check in {{getExecutorInfo(...)}} 
[here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp#L4609] 
crashes with an old master because it occurs before our injection in 
{{run(...)}}. See the {{runTask(...)}} call into {{getExecutorInfo(...)}} 
[here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp#L1556].
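A fix needs the agent to tolerate resources that arrive from an old master 
without {{AllocationInfo}} rather than CHECK-failing on them. A toy sketch of 
that defensive injection, using simplified stand-in types instead of the real 
protobufs:

{code}
#include <iostream>
#include <string>
#include <vector>

// Simplified stand-ins for mesos::Resource and its AllocationInfo.
struct Resource {
  std::string name;
  std::string allocationRole; // Empty means "not set" (old master).
};

// Inject the framework's role into any resource that lacks allocation
// info, instead of CHECK-failing on it.
void injectAllocationInfo(std::vector<Resource>& resources,
                          const std::string& frameworkRole) {
  for (Resource& resource : resources) {
    if (resource.allocationRole.empty()) {
      resource.allocationRole = frameworkRole;
    }
  }
}

int main() {
  std::vector<Resource> resources = {{"cpus", ""}, {"mem", "marathon"}};
  injectAllocationInfo(resources, "marathon");
  for (const Resource& r : resources) {
    std::cout << r.name << " allocated to " << r.allocationRole << std::endl;
  }
  return 0;
}
{code}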



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7478) Pre-1.2.x master does not work with 1.2.x agent.

2017-05-08 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7478:
---
Description: 
[~evilezh] reported the following crash in the agent upon running a 1.1.0 
master against a 1.2.0 agent:

{noformat}
F0509 00:19:07.045413  3469 slave.cpp:4609] Check failed: 
resource.has_allocation_info() 
*** Check failure stack trace: ***
@ 0x7f4c4a4fa3cd  google::LogMessage::Fail()
@ 0x7f4c4a4fc180  google::LogMessage::SendToLog()
@ 0x7f4c4a4f9fb3  google::LogMessage::Flush()
@ 0x7f4c4a4fcba9  google::LogMessageFatal::~LogMessageFatal()
@ 0x7f4c49b3bcf5  mesos::internal::slave::Slave::getExecutorInfo()
@ 0x7f4c49b3cf76  mesos::internal::slave::Slave::runTask()
@ 0x7f4c49b8832c  ProtobufProcess<>::handler4<>()
@ 0x7f4c49b4dc06  std::_Function_handler<>::_M_invoke()
@ 0x7f4c49b6975a  ProtobufProcess<>::visit()
@ 0x7f4c4a46c933  process::ProcessManager::resume()
@ 0x7f4c4a477537  
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
@ 0x7f4c486b8c80  (unknown)
@ 0x7f4c481d46ba  start_thread
@ 0x7f4c47f0a82d  (unknown)
Aborted (core dumped)
{noformat}

This appears to have been due to a lack of manual upgrade testing (we also 
don't have any automated upgrade testing in place).

The check in {{getExecutorInfo(...)}} 
[here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp#L4609] 
crashes with an old master because it occurs before our injection in 
{{run(...)}}. See the {{runTask(...)}} call into {{getExecutorInfo(...)}} 
[here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp#L1556].

  was:
[~evilezh] reported the following crash in the agent upon running a 1.1.0 
master against a 1.2.0 agent:

{noformat}
F0509 00:19:07.045413  3469 slave.cpp:4609] Check failed: 
resource.has_allocation_info() 
*** Check failure stack trace: ***
@ 0x7f4c4a4fa3cd  google::LogMessage::Fail()
@ 0x7f4c4a4fc180  google::LogMessage::SendToLog()
@ 0x7f4c4a4f9fb3  google::LogMessage::Flush()
@ 0x7f4c4a4fcba9  google::LogMessageFatal::~LogMessageFatal()
@ 0x7f4c49b3bcf5  mesos::internal::slave::Slave::getExecutorInfo()
@ 0x7f4c49b3cf76  mesos::internal::slave::Slave::runTask()
@ 0x7f4c49b8832c  ProtobufProcess<>::handler4<>()
@ 0x7f4c49b4dc06  std::_Function_handler<>::_M_invoke()
@ 0x7f4c49b6975a  ProtobufProcess<>::visit()
@ 0x7f4c4a46c933  process::ProcessManager::resume()
@ 0x7f4c4a477537  
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
@ 0x7f4c486b8c80  (unknown)
@ 0x7f4c481d46ba  start_thread
@ 0x7f4c47f0a82d  (unknown)
Aborted (core dumped)
{noformat}

This appears to have been due to a lack of manual upgrade testing (we don't 
have any automated upgrade testing in place).

The check in {{getExecutorInfo(...)}} 
[here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp#L4609] 
crashes with an old master because it occurs before our injection in 
{{run(...)}}. See the {{runTask(...)}} call into {{getExecutorInfo(...)}} 
[here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp#L1556].


> Pre-1.2.x master does not work with 1.2.x agent.
> 
>
> Key: MESOS-7478
> URL: https://issues.apache.org/jira/browse/MESOS-7478
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Benjamin Mahler
>Priority: Blocker
>
> [~evilezh] reported the following crash in the agent upon running a 1.1.0 
> master against a 1.2.0 agent:
> {noformat}
> F0509 00:19:07.045413  3469 slave.cpp:4609] Check failed: 
> resource.has_allocation_info() 
> *** Check failure stack trace: ***
> @ 0x7f4c4a4fa3cd  google::LogMessage::Fail()
> @ 0x7f4c4a4fc180  google::LogMessage::SendToLog()
> @ 0x7f4c4a4f9fb3  google::LogMessage::Flush()
> @ 0x7f4c4a4fcba9  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f4c49b3bcf5  mesos::internal::slave::Slave::getExecutorInfo()
> @ 0x7f4c49b3cf76  mesos::internal::slave::Slave::runTask()
> @ 0x7f4c49b8832c  ProtobufProcess<>::handler4<>()
> @ 0x7f4c49b4dc06  std::_Function_handler<>::_M_invoke()
> @ 0x7f4c49b6975a  ProtobufProcess<>::visit()
> @ 0x7f4c4a46c933  process::ProcessManager::resume()
> @ 0x7f4c4a477537  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f4c486b8c80  (unknown)
> @ 0x7f4c481d46ba  start_thread
> @ 0x7f4c47f0a82d  (unknown)
> Aborted (core dumped)
> {noformat}
> This appears to have been due to a lack of manual upgrade testing (we also 
> don't have any automated upgrade testing in place).

[jira] [Created] (MESOS-7460) UpdateFrameworkMessage may send a Framework role(s) change to a non-MULTI_ROLE agent.

2017-05-04 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-7460:
--

 Summary: UpdateFrameworkMessage may send a Framework role(s) 
change to a non-MULTI_ROLE agent.
 Key: MESOS-7460
 URL: https://issues.apache.org/jira/browse/MESOS-7460
 Project: Mesos
  Issue Type: Bug
  Components: master
Reporter: Benjamin Mahler
Assignee: Michael Park
Priority: Blocker


When a framework is MULTI_ROLE capable and was previously running tasks on an 
old (non-MULTI_ROLE capable) agent, the master *must* ensure that the 
UpdateFramework message sent to this old agent preserves the framework's 
original role. Otherwise the agent will interpret the role as having changed, 
which can break things (e.g. it may fail to locate the reservations, volumes, 
etc).

In addition, a framework without MULTI_ROLE currently has the ability to 
change its role. We'll need to change this to make the {{role}} field 
immutable; frameworks must use the {{roles}} field with the MULTI_ROLE 
capability if they want to change their role.
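A toy sketch of the invariant described above, with hypothetical stand-in 
types (the real message construction lives in the master and uses protobufs):

{code}
#include <iostream>
#include <string>
#include <vector>

// Hypothetical stand-ins for FrameworkInfo and the update message.
struct FrameworkInfo {
  std::string role;               // Original (legacy) role.
  std::vector<std::string> roles; // MULTI_ROLE roles.
};

struct UpdateFrameworkMessage {
  std::string role;
};

// An agent without the MULTI_ROLE capability only understands a single
// role, and that role must be the framework's original one; otherwise the
// agent concludes the role changed and loses track of reservations/volumes.
UpdateFrameworkMessage createUpdate(const FrameworkInfo& framework,
                                    bool agentHasMultiRoleCapability) {
  UpdateFrameworkMessage message;
  if (!agentHasMultiRoleCapability) {
    message.role = framework.role; // Preserve the original role unchanged.
  }
  return message;
}

int main() {
  FrameworkInfo framework{"analytics", {"analytics", "reporting"}};
  std::cout << createUpdate(framework, false).role << std::endl; // analytics
  return 0;
}
{code}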



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7260) Authorization for `/role` endpoint should take both VIEW_ROLES and VIEW_FRAMEWORKS into account.

2017-05-01 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7260:
---
Shepherd: Adam B

To confirm, [~arojas] and [~adam-mesos] can you guys review / shepherd this?

> Authorization for `/role` endpoint should take both VIEW_ROLES and 
> VIEW_FRAMEWORKS into account.
> 
>
> Key: MESOS-7260
> URL: https://issues.apache.org/jira/browse/MESOS-7260
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API, master
>Reporter: Jay Guo
>Assignee: Jay Guo
>
> Consider the following case: both {{framework1}} and {{framework2}} 
> subscribe to {{roleX}}, and {{principal}} is allowed to view {{roleX}} and 
> {{framework1}}, but *NOT* {{framework2}}. Therefore, the {{/role}} endpoint 
> should only contain {{framework1}}, not both frameworks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-6441) Display reservations in the agent page in the webui.

2017-05-01 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-6441:
---
Description: 
We currently do not display the reservations present on an agent in the webui. 
It would be nice to see this information.

It would also be nice to update the resource statistics tables to make the 
distinction between unreserved and reserved resources. E.g.

Reserved:
Used, Allocated, Available and Total

Unreserved:
Used, Allocated, Available and Total

  was:
We currently do not display the reservations present on an agent in the webui. 
It would be nice to see this information.

It would also be nice to update the resource statistics tables to make the 
distinction between unreserved and reserved resources.


> Display reservations in the agent page in the webui.
> 
>
> Key: MESOS-6441
> URL: https://issues.apache.org/jira/browse/MESOS-6441
> Project: Mesos
>  Issue Type: Task
>  Components: webui
>Reporter: Benjamin Mahler
>
> We currently do not display the reservations present on an agent in the 
> webui. It would be nice to see this information.
> It would also be nice to update the resource statistics tables to make the 
> distinction between unreserved and reserved resources. E.g.
> Reserved:
> Used, Allocated, Available and Total
> Unreserved:
> Used, Allocated, Available and Total



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7438) Double free or corruption when using parallel test runner

2017-04-28 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-7438:
--

 Summary: Double free or corruption when using parallel test runner
 Key: MESOS-7438
 URL: https://issues.apache.org/jira/browse/MESOS-7438
 Project: Mesos
  Issue Type: Bug
  Components: technical debt, test
Reporter: Benjamin Mahler


I observed the following when using the parallel test runner:

{noformat}
/home/bmahler/git/mesos/build/../support/mesos-gtest-runner.py 
--sequential=*ROOT_* ./mesos-tests
..
*** Error in `/home/bmahler/git/mesos/build/src/.libs/mesos-tests': double free 
or corruption (out): 0x7fa818001310 ***
=== Backtrace: =
/usr/lib64/libc.so.6(+0x7c503)[0x7fa87f27e503]
/usr/lib64/libsasl2.so.3(+0x866d)[0x7fa880f0d66d]
/usr/lib64/libsasl2.so.3(sasl_dispose+0x3b)[0x7fa880f1075b]
/home/bmahler/git/mesos/build/src/.libs/libmesos-1.3.0.so(_ZN5mesos8internal8cram_md527CRAMMD5AuthenticateeProcessD1Ev+0x5d)[0x7fa88708f67d]
/home/bmahler/git/mesos/build/src/.libs/libmesos-1.3.0.so(_ZN5mesos8internal8cram_md527CRAMMD5AuthenticateeProcessD0Ev+0x18)[0x7fa88708f734]
/home/bmahler/git/mesos/build/src/.libs/libmesos-1.3.0.so(_ZN5mesos8internal8cram_md520CRAMMD5AuthenticateeD1Ev+0xfb)[0x7fa88708a065]
/home/bmahler/git/mesos/build/src/.libs/libmesos-1.3.0.so(_ZN5mesos8internal8cram_md520CRAMMD5AuthenticateeD0Ev+0x18)[0x7fa88708a0b4]
/home/bmahler/git/mesos/build/src/.libs/libmesos-1.3.0.so(_ZN5mesos8internal5slave5Slave13_authenticateEv+0x67)[0x7fa8879ff579]
/home/bmahler/git/mesos/build/src/.libs/libmesos-1.3.0.so(_ZZN7process8dispatchIN5mesos8internal5slave5SlaveEEEvRKNS_3PIDIT_EEMS6_FvvEENKUlPNS_11ProcessBaseEE_clESD_+0xe2)[0x7fa887a60b7a]
/home/bmahler/git/mesos/build/src/.libs/libmesos-1.3.0.so(_ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal5slave5SlaveEEEvRKNS0_3PIDIT_EEMSA_FvvEEUlS2_E_E9_M_invokeERKSt9_Any_dataS2_+0x37)[0x7fa887aa0efe]
/home/bmahler/git/mesos/build/src/.libs/libmesos-1.3.0.so(_ZNKSt8functionIFvPN7process11ProcessBaseEEEclES2_+0x49)[0x7fad1177]
/home/bmahler/git/mesos/build/src/.libs/libmesos-1.3.0.so(_ZN7process11ProcessBase5visitERKNS_13DispatchEventE+0x2f)[0x7fab5063]
/home/bmahler/git/mesos/build/src/.libs/libmesos-1.3.0.so(_ZNK7process13DispatchEvent5visitEPNS_12EventVisitorE+0x2e)[0x7fac0422]
/home/bmahler/git/mesos/build/src/.libs/mesos-tests(_ZN7process11ProcessBase5serveERKNS_5EventE+0x2e)[0xb088c8]
/home/bmahler/git/mesos/build/src/.libs/libmesos-1.3.0.so(_ZN7process14ProcessManager6resumeEPNS_11ProcessBaseE+0x525)[0x7fab10d5]
/home/bmahler/git/mesos/build/src/.libs/libmesos-1.3.0.so(+0x5f1a880)[0x7faad880]
/home/bmahler/git/mesos/build/src/.libs/libmesos-1.3.0.so(+0x5f2ca8a)[0x7fabfa8a]
/home/bmahler/git/mesos/build/src/.libs/libmesos-1.3.0.so(+0x5f2c9ce)[0x7fabf9ce]
/home/bmahler/git/mesos/build/src/.libs/libmesos-1.3.0.so(+0x5f2c958)[0x7fabf958]
/usr/lib64/libstdc++.so.6(+0xb5230)[0x7fa87fb90230]
/usr/lib64/libpthread.so.0(+0x7dc5)[0x7fa88040ddc5]
/usr/lib64/libc.so.6(clone+0x6d)[0x7fa87f2f973d]
{noformat}

Not sure how reproducible this is; it appears to occur in the authentication 
path of the agent.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (MESOS-6345) ExamplesTest.PersistentVolumeFramework failing due to double free corruption on Ubuntu 14.04

2017-04-28 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559169#comment-15559169
 ] 

Benjamin Mahler edited comment on MESOS-6345 at 4/28/17 10:38 PM:
--

{noformat}
[04:56:48] : [Step 10/10] [ RUN  ] 
ExamplesTest.PersistentVolumeFramework
[04:56:48]W: [Step 10/10] I1008 04:56:48.212661 25257 master.cpp:1097] 
Master terminating
[04:56:48]W: [Step 10/10] I1008 04:56:48.212674 25254 
status_update_manager.cpp:395] Received status update acknowledgement (UUID: 
542b14f7-bfc9-4be3-81b4-c23a1da9ecb5) for task 2 of framework 
84cfc7a4-ad66-4f0d-965c-33ff6093ef32-
[04:56:48]W: [Step 10/10] I1008 04:56:48.212709 25254 
status_update_manager.cpp:531] Cleaning up status update stream for task 2 of 
framework 84cfc7a4-ad66-4f0d-965c-33ff6093ef32-
[04:56:48]W: [Step 10/10] I1008 04:56:48.212712 25257 master.cpp:7725] 
Removing executor 'default' with resources {} of framework 
84cfc7a4-ad66-4f0d-965c-33ff6093ef32- on agent 
84cfc7a4-ad66-4f0d-965c-33ff6093ef32-S2 at slave(1)@172.30.2.21:52703 
(ip-172-30-2-21.mesosphere.io)
[04:56:48]W: [Step 10/10] I1008 04:56:48.212767 25254 slave.cpp:2953] 
Status update manager successfully handled status update acknowledgement (UUID: 
542b14f7-bfc9-4be3-81b4-c23a1da9ecb5) for task 2 of framework 
84cfc7a4-ad66-4f0d-965c-33ff6093ef32-
[04:56:48]W: [Step 10/10] I1008 04:56:48.212782 25254 slave.cpp:6543] 
Completing task 2
[04:56:48]W: [Step 10/10] I1008 04:56:48.212792 25258 hierarchical.cpp:517] 
Removed agent 84cfc7a4-ad66-4f0d-965c-33ff6093ef32-S2
[04:56:48]W: [Step 10/10] I1008 04:56:48.212829 25257 master.cpp:7696] 
Removing task 3 with resources cpus(*):1; mem(*):128 of framework 
84cfc7a4-ad66-4f0d-965c-33ff6093ef32- on agent 
84cfc7a4-ad66-4f0d-965c-33ff6093ef32-S1 at slave(3)@172.30.2.21:52703 
(ip-172-30-2-21.mesosphere.io)
[04:56:48]W: [Step 10/10] I1008 04:56:48.212888 25257 master.cpp:7725] 
Removing executor 'default' with resources {} of framework 
84cfc7a4-ad66-4f0d-965c-33ff6093ef32- on agent 
84cfc7a4-ad66-4f0d-965c-33ff6093ef32-S1 at slave(3)@172.30.2.21:52703 
(ip-172-30-2-21.mesosphere.io)
[04:56:48]W: [Step 10/10] I1008 04:56:48.212915 25258 hierarchical.cpp:517] 
Removed agent 84cfc7a4-ad66-4f0d-965c-33ff6093ef32-S1
[04:56:48]W: [Step 10/10] I1008 04:56:48.213017 25257 master.cpp:7725] 
Removing executor 'default' with resources {} of framework 
84cfc7a4-ad66-4f0d-965c-33ff6093ef32- on agent 
84cfc7a4-ad66-4f0d-965c-33ff6093ef32-S0 at slave(2)@172.30.2.21:52703 
(ip-172-30-2-21.mesosphere.io)
[04:56:48]W: [Step 10/10] I1008 04:56:48.213102 25254 hierarchical.cpp:517] 
Removed agent 84cfc7a4-ad66-4f0d-965c-33ff6093ef32-S0
[04:56:48]W: [Step 10/10] I1008 04:56:48.213281 25251 hierarchical.cpp:337] 
Removed framework 84cfc7a4-ad66-4f0d-965c-33ff6093ef32-
[04:56:48]W: [Step 10/10] I1008 04:56:48.213404 25257 slave.cpp:4174] Got 
exited event for master@172.30.2.21:52703
[04:56:48]W: [Step 10/10] I1008 04:56:48.213418 25253 slave.cpp:4174] Got 
exited event for master@172.30.2.21:52703
[04:56:48]W: [Step 10/10] W1008 04:56:48.213426 25257 slave.cpp:4179] 
Master disconnected! Waiting for a new master to be elected
[04:56:48]W: [Step 10/10] W1008 04:56:48.213433 25253 slave.cpp:4179] 
Master disconnected! Waiting for a new master to be elected
[04:56:48]W: [Step 10/10] I1008 04:56:48.213407 25254 slave.cpp:4174] Got 
exited event for master@172.30.2.21:52703
[04:56:48]W: [Step 10/10] W1008 04:56:48.213448 25254 slave.cpp:4179] 
Master disconnected! Waiting for a new master to be elected
[04:56:48]W: [Step 10/10] I1008 04:56:48.214047 25254 slave.cpp:787] Agent 
terminating
[04:56:48]W: [Step 10/10] I1008 04:56:48.214068 25254 slave.cpp:2506] Asked 
to shut down framework 84cfc7a4-ad66-4f0d-965c-33ff6093ef32- by @0.0.0.0:0
[04:56:48]W: [Step 10/10] I1008 04:56:48.214076 25254 slave.cpp:2531] 
Shutting down framework 84cfc7a4-ad66-4f0d-965c-33ff6093ef32-
[04:56:48]W: [Step 10/10] I1008 04:56:48.214083 25254 slave.cpp:4855] 
Shutting down executor 'default' of framework 
84cfc7a4-ad66-4f0d-965c-33ff6093ef32- (via HTTP)
[04:56:48]W: [Step 10/10] E1008 04:56:48.215160 25384 executor.cpp:681] 
End-Of-File received from agent. The agent closed the event stream
[04:56:48]W: [Step 10/10] I1008 04:56:48.215250 25254 slave.cpp:787] Agent 
terminating
[04:56:48]W: [Step 10/10] I1008 04:56:48.215266 25254 slave.cpp:2506] Asked 
to shut down framework 84cfc7a4-ad66-4f0d-965c-33ff6093ef32- by @0.0.0.0:0
[04:56:48]W: [Step 10/10] I1008 04:56:48.215279 25254 slave.cpp:2531] 
Shutting down framework 84cfc7a4-ad66-4f0d-965c-33ff6093ef32-
[04:56:48]W: [Step 10/10] I1008 04:56:48.215291 25254 slave.cpp:4855] 
Shutting down executor 'default' of framework 

[jira] [Assigned] (MESOS-7430) Per-role Suppress call implementation is broken.

2017-04-27 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-7430:
--

Assignee: Benjamin Mahler

> Per-role Suppress call implementation is broken.
> 
>
> Key: MESOS-7430
> URL: https://issues.apache.org/jira/browse/MESOS-7430
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, master
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>Priority: Blocker
>
> The per-role Suppress call implementation is currently broken in the 
> allocator, since it still uses a global 'suppress' bit for the framework.
> Before fixing, we should discuss whether we want to keep role within 
> Suppress (it hasn't been released yet), or add calls to move towards 
> consistent naming, e.g. {{Call::ActivateRole}} / {{Call::DeactivateRole}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-5967) Add support for 'docker image inspect' in our docker abstraction.

2017-04-27 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-5967:
---
Target Version/s:   (was: 1.4.0)

> Add support for 'docker image inspect' in our docker abstraction.
> -
>
> Key: MESOS-5967
> URL: https://issues.apache.org/jira/browse/MESOS-5967
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization, docker
>Reporter: Kevin Klues
>Assignee: Guangya Liu
>  Labels: gpu
>
> Docker's command line tool for {{docker inspect}} can take either a 
> {{container}}, an {{image}}, or a {{task}} as its argument, and return a JSON 
> array containing low-level information about that container, image or task. 
> However, the current {{docker inspect}} support in our docker abstraction 
> only supports inspecting containers (not images or tasks).  We should expand 
> this to (at least) support images.
> In particular, this additional functionality is motivated by the upcoming GPU 
> support, which needs to inspect the labels in a docker image to decide if it 
> should inject the required Nvidia volumes into a container.  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7430) Per-role Suppress call implementation is broken.

2017-04-26 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-7430:
--

 Summary: Per-role Suppress call implementation is broken.
 Key: MESOS-7430
 URL: https://issues.apache.org/jira/browse/MESOS-7430
 Project: Mesos
  Issue Type: Bug
  Components: allocation, master
Reporter: Benjamin Mahler
Priority: Blocker


The per-role Suppress call implementation is currently broken in the allocator, 
since it still uses a global 'suppress' bit for the framework.

Before fixing, we should discuss whether we want to keep role within Suppress 
(it hasn't been released yet), or add calls to move towards consistent naming, 
e.g. {{Call::ActivateRole}} / {{Call::DeactivateRole}}.
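One plausible shape of the fix, sketched with standard containers and 
hypothetical names (the real allocator state is more involved):

{code}
#include <iostream>
#include <set>
#include <string>

// Instead of a single global 'suppressed' bit per framework, track the
// set of roles for which the framework has suppressed offers.
struct FrameworkState {
  std::set<std::string> suppressedRoles;

  void suppress(const std::string& role) { suppressedRoles.insert(role); }
  void revive(const std::string& role) { suppressedRoles.erase(role); }

  bool isSuppressed(const std::string& role) const {
    return suppressedRoles.count(role) > 0;
  }
};

int main() {
  FrameworkState framework;
  framework.suppress("analytics");

  // Offers for "analytics" are withheld; offers for "web" still flow.
  std::cout << framework.isSuppressed("analytics") << std::endl; // 1
  std::cout << framework.isSuppressed("web") << std::endl;       // 0
  return 0;
}
{code}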



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7401) Optionally reject messages when UPIDs does not match IP.

2017-04-26 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7401:
---
Shepherd: Benjamin Mahler

> Optionally reject messages when UPIDs does not match IP.
> 
>
> Key: MESOS-7401
> URL: https://issues.apache.org/jira/browse/MESOS-7401
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: James Peach
>Assignee: James Peach
>Priority: Minor
>
> {{libprocess}} does no validation of the peer UPID so in some deployments it 
> is trivial to inject bogus messages and impersonate legitimate actors. If we 
> add a check to verify that messages are received from the same IP address as 
> the peer UPID claims to be using, we can increase the difficulty of UPID 
> spoofing, and mitigate this somewhat.
> For compatibility, this has to be an optional setting and disabled by default.
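A minimal sketch of the proposed check using POSIX calls; this is 
illustrative, not the libprocess implementation (IPv4 only for brevity):

{code}
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#include <string>

// Return true if the connected peer's IP address matches the IP the
// sender's UPID claims, e.g. "master@10.0.0.1:5050" claims "10.0.0.1".
bool peerMatchesUpid(int fd, const std::string& claimedIp) {
  sockaddr_in peer;
  socklen_t length = sizeof(peer);
  if (getpeername(fd, reinterpret_cast<sockaddr*>(&peer), &length) != 0) {
    return false;
  }

  char actual[INET_ADDRSTRLEN];
  inet_ntop(AF_INET, &peer.sin_addr, actual, sizeof(actual));

  return claimedIp == actual;
}

int main() { return 0; }
{code}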



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7401) Optionally reject messages when UPIDs does not match IP.

2017-04-26 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7401:
---
Summary: Optionally reject messages when UPIDs does not match IP.  (was: 
Optionally pin UPIDs to their IP address.)

> Optionally reject messages when UPIDs does not match IP.
> 
>
> Key: MESOS-7401
> URL: https://issues.apache.org/jira/browse/MESOS-7401
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: James Peach
>Assignee: James Peach
>Priority: Minor
>
> {{libprocess}} does no validation of the peer UPID so in some deployments it 
> is trivial to inject bogus messages and impersonate legitimate actors. If we 
> add a check to verify that messages are received from the same IP address as 
> the peer UPID claims to be using, we can increase the difficulty of UPID 
> spoofing, and mitigate this somewhat.
> For compatibility, this has to be an optional setting and disabled by default.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-6441) Display reservations in the agent page in the webui.

2017-04-24 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-6441:
---
Description: 
We currently do not display the reservations present on an agent in the webui. 
It would be nice to see this information.

It would also be nice to update the resource statistics tables to make the 
distinction between unreserved and reserved resources.

  was:We currently do not display the reservations present on an agent in the 
webui. It would be nice to see this information.


> Display reservations in the agent page in the webui.
> 
>
> Key: MESOS-6441
> URL: https://issues.apache.org/jira/browse/MESOS-6441
> Project: Mesos
>  Issue Type: Task
>  Components: webui
>Reporter: Benjamin Mahler
>
> We currently do not display the reservations present on an agent in the 
> webui. It would be nice to see this information.
> It would also be nice to update the resource statistics tables to make the 
> distinction between unreserved and reserved resources.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-6447) Display role weight / role quota information in the webui.

2017-04-24 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15981989#comment-15981989
 ] 

Benjamin Mahler commented on MESOS-6447:


This was resolved via MESOS-6995. There is now a tab for viewing roles.

> Display role weight / role quota information in the webui.
> --
>
> Key: MESOS-6447
> URL: https://issues.apache.org/jira/browse/MESOS-6447
> Project: Mesos
>  Issue Type: Task
>  Components: webui
>Reporter: Benjamin Mahler
> Fix For: 1.3.0
>
>
> The webui does not display role weight and role quotas. It would be nice to 
> have this information visible to users of the webui.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7136) Eliminate fair sharing between frameworks within a role.

2017-04-24 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15981959#comment-15981959
 ] 

Benjamin Mahler commented on MESOS-7136:


[~qianzhang] hm.. not sure I follow the difficulty in figuring out which one 
replies first? I would hope to avoid the implicit leaf roles, since it's 
difficult to configure and view (need to know framework ids, and these can keep 
changing if frameworks complete / if new frameworks arrive).

> Eliminate fair sharing between frameworks within a role.
> 
>
> Key: MESOS-7136
> URL: https://issues.apache.org/jira/browse/MESOS-7136
> Project: Mesos
>  Issue Type: Epic
>  Components: allocation, technical debt
>Reporter: Benjamin Mahler
>  Labels: multi-tenancy
>
> The current fair sharing algorithm performs fair sharing between frameworks 
> within a role. This is equivalent to having the framework id behave as a 
> pseudo-role beneath the role. Consider the case where there are two spark 
> frameworks running within the same "spark" role. This behaves similarly to 
> hierarchical roles with the framework ID acting as an implicit role:
> {noformat}
>              ^
>            /   \
>        spark   services
>          ^
>        /   \
>   FrameworkId1       FrameworkId2
> (fixed weight of 1) (fixed weight of 1)
> {noformat}
> Unfortunately, the frameworks cannot change their weight to be a value other 
> than 1 (see MESOS-6247) and they cannot set quota.
> With the addition of hierarchical roles (see MESOS-6375) we can eliminate the 
> notion of the framework ID acting as a pseudo-role in favor of explicitly 
> using hierarchical roles. E.g.
> {noformat}
>           ^
>         /   \
>      eng     sales
>       ^
>     /   \
> analytics   ui
>     ^
>   /   \
> learning   reports
> {noformat}
> Here if two frameworks run within the eng/analytics role, then they will 
> compete for resources without fair sharing. However, if resource guarantees 
> are required, sub-roles can be created explicitly, e.g. 
> eng/analytics/learning and eng/analytics/reports. These roles can be given 
> weights and quota.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7136) Eliminate fair sharing between frameworks within a role.

2017-04-19 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15975879#comment-15975879
 ] 

Benjamin Mahler commented on MESOS-7136:


[~qianzhang] yes, the plan would be to remove the special casing of framework 
id as a pseudo-role, so there would no longer be the additional sorters. To 
clarify what I meant by compete, the resources would be offered to all 
participants within a role and the first one to take it wins. If users wish to 
have constraints among frameworks using a role, they can use sub-roles 
identifying their frameworks, which achieves the existing behavior. But, you 
also get the added benefit of using actual roles, so you can change weights, 
set quota, etc.

> Eliminate fair sharing between frameworks within a role.
> 
>
> Key: MESOS-7136
> URL: https://issues.apache.org/jira/browse/MESOS-7136
> Project: Mesos
>  Issue Type: Epic
>  Components: allocation, technical debt
>Reporter: Benjamin Mahler
>  Labels: multi-tenancy
>
> The current fair sharing algorithm performs fair sharing between frameworks 
> within a role. This is equivalent to having the framework id behave as a 
> pseudo-role beneath the role. Consider the case where there are two spark 
> frameworks running within the same "spark" role. This behaves similarly to 
> hierarchical roles with the framework ID acting as an implicit role:
> {noformat}
>              ^
>            /   \
>        spark   services
>          ^
>        /   \
>   FrameworkId1       FrameworkId2
> (fixed weight of 1) (fixed weight of 1)
> {noformat}
> Unfortunately, the frameworks cannot change their weight to be a value other 
> than 1 (see MESOS-6247) and they cannot set quota.
> With the addition of hierarchical roles (see MESOS-6375) we can eliminate the 
> notion of the framework ID acting as a pseudo-role in favor of explicitly 
> using hierarchical roles. E.g.
> {noformat}
>           ^
>         /   \
>      eng     sales
>       ^
>     /   \
> analytics   ui
>     ^
>   /   \
> learning   reports
> {noformat}
> Here if two frameworks run within the eng/analytics role, then they will 
> compete for resources without fair sharing. However, if resource guarantees 
> are required, sub-roles can be created explicitly, e.g. 
> eng/analytics/learning and eng/analytics/reports. These roles can be given 
> weights and quota.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7376) Reduce copying of the Registry to improve Registrar performance.

2017-04-19 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7376:
---
Summary: Reduce copying of the Registry to improve Registrar performance.  
(was: Long registry updates when the number of agents is high)

> Reduce copying of the Registry to improve Registrar performance.
> 
>
> Key: MESOS-7376
> URL: https://issues.apache.org/jira/browse/MESOS-7376
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Affects Versions: 1.3.0
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Critical
>
> During scale testing we discovered that as the number of registered agents 
> grows, the time it takes to update the registry very quickly grows to 
> unacceptable values. At some point it starts exceeding 
> {{registry_store_timeout}}, which doesn't fire.
> With 55k agents we saw this ({{registry_store_timeout=20secs}}):
> {noformat}
> I0331 17:11:21.227442 36472 registrar.cpp:473] Applied 69 operations in 
> 3.138843387secs; attempting to update the registry
> I0331 17:11:24.441409 36464 log.cpp:529] LogStorage.set: acquired the lock in 
> 74461ns
> I0331 17:11:24.441541 36464 log.cpp:543] LogStorage.set: started in 51770ns
> I0331 17:11:26.869323 36462 log.cpp:628] LogStorage.set: wrote append at 
> position=6420881 in 2.41043644secs
> I0331 17:11:26.869454 36462 state.hpp:179] State.store: storage.set has 
> finished in 2.428189561secs (b=1)
> I0331 17:11:56.199453 36469 registrar.cpp:518] Successfully updated the 
> registry in 34.971944192secs
> {noformat}
> This is caused by repeated {{Registry}} copying which involves copying a big 
> object graph that takes roughly 0.4 sec (with 55k agents).
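
As a rough illustration of the cost involved (a standalone sketch, not the 
actual registrar code):

{noformat}
// Sketch only: why repeated by-value copies of a large object graph
// (like the Registry with 55k agents) dominate, and how passing by
// pointer avoids them. "Registry" here is a stand-in struct.
#include <string>
#include <vector>

struct Registry
{
  std::vector<std::string> slaves;  // imagine ~55k entries here
};

// Copies the entire object graph on every call (~0.4s at 55k agents):
void applyByValue(Registry registry) { /* mutates a throwaway copy */ }

// Avoids the copy by mutating the caller's instance in place:
void applyInPlace(Registry* registry)
{
  registry->slaves.push_back("agent-55001");
}
{noformat}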



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7389) Mesos 1.2.0 crashes with pre-1.0 Mesos agents

2017-04-19 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15975401#comment-15975401
 ] 

Benjamin Mahler commented on MESOS-7389:


One thing that could be simple to cherry-pick into the 1.2.x branch is to 
log and drop registrations from pre-1.0 agents in the master.
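
A minimal sketch of what that could look like (names and message layout here 
are assumptions for illustration, not the actual master code):

{noformat}
// Hypothetical sketch: log and drop (re-)registrations from agents
// that do not report a 1.0+ version. Field and handler names assumed.
void Master::registerSlave(const RegisterSlaveMessage& message,
                           const process::UPID& from)
{
  // Pre-1.0 agents either omit the version or report one below 1.0.0.
  Try<Version> version = Version::parse(message.version());

  if (version.isError() || version.get() < Version(1, 0, 0)) {
    LOG(WARNING) << "Dropping registration from pre-1.0 agent at " << from;
    return;
  }

  // ... normal registration path ...
}
{noformat}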

> Mesos 1.2.0 crashes with pre-1.0 Mesos agents
> -
>
> Key: MESOS-7389
> URL: https://issues.apache.org/jira/browse/MESOS-7389
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
> Environment: Ubuntu 14.04 
>Reporter: Nicholas Studt
>Assignee: Benjamin Mahler
>Priority: Critical
>  Labels: mesosphere
>
> During upgrade from 1.0.1 to 1.2.0, a single mesos-slave reregistering with 
> the running leader caused the leader to terminate. All 3 of the masters 
> suffered the same failure as the same slave node reregistered against the new 
> leader; this continued across the entire cluster until the offending slave 
> node was removed and fixed. The fix to the slave node was to remove the mesos 
> directory and then start the slave node back up. 
>  F0412 17:24:42.736600  6317 master.cpp:5701] Check failed: 
> frameworks_.contains(task.framework_id())
>  *** Check failure stack trace: ***
>  @ 0x7f59f944f94d  google::LogMessage::Fail()
>  @ 0x7f59f945177d  google::LogMessage::SendToLog()
>  @ 0x7f59f944f53c  google::LogMessage::Flush()
>  @ 0x7f59f9452079  google::LogMessageFatal::~LogMessageFatal()
>  I0412 17:24:42.750300  6316 replica.cpp:693] Replica received learned notice 
> for position 6896 from @0.0.0.0:0 
>  @ 0x7f59f88f2341  mesos::internal::master::Master::_reregisterSlave()
>  @ 0x7f59f88f488f  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master6MasterERKNS5_9SlaveInfoERKNS0_4UPIDERKSt6vectorINS5_8ResourceESaISG_EERKSF_INS5_12ExecutorInfoESaISL_EERKSF_INS5_4TaskESaISQ_EERKSF_INS5_13FrameworkInfoESaISV_EERKSF_INS6_17Archive_FrameworkESaIS10_EERKSsRKSF_INS5_20SlaveInfo_CapabilityESaIS17_EERKNS0_6FutureIbEES9_SC_SI_SN_SS_SX_S12_SsS19_S1D_EEvRKNS0_3PIDIT_EEMS1H_FvT0_T1_T2_T3_T4_T5_T6_T7_T8_T9_ET10_T11_T12_T13_T14_T15_T16_T17_T18_T19_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
>  @ 0x7f59f93c3eb1  process::ProcessManager::resume()
>  @ 0x7f59f93ccd57  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
>  @ 0x7f59f77cfa60  (unknown)
>  @ 0x7f59f6fec184  start_thread
>  @ 0x7f59f6d19bed  (unknown)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7389) Mesos 1.2.0 crashes with pre-1.0 Mesos agents

2017-04-19 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15975396#comment-15975396
 ] 

Benjamin Mahler commented on MESOS-7389:


Sorry for the confusion, I just meant that I'm inclined not to support pre-1.0 
agents against a 1.2 master given the complexity of the solution.

However, I totally agree that incompatible versions of the master and agent 
should not lead to crashes (especially vague ones). Explicit handling of 
incompatible agents (i.e. just MESOS-6976 within the MESOS-6975 epic) is long 
overdue. In the interim, we can start with explicitly stating the upgrade 
requirements for getting into 1.x from 0.y, since they aren't captured 
[here|http://mesos.apache.org/documentation/latest/versioning/] and you need to 
reach a certain 0.y release before you can upgrade to 1.x. cc [~vinodkone]

> Mesos 1.2.0 crashes with pre-1.0 Mesos agents
> -
>
> Key: MESOS-7389
> URL: https://issues.apache.org/jira/browse/MESOS-7389
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
> Environment: Ubuntu 14.04 
>Reporter: Nicholas Studt
>Assignee: Benjamin Mahler
>Priority: Critical
>  Labels: mesosphere
>
> During upgrade from 1.0.1 to 1.2.0, a single mesos-slave reregistering with 
> the running leader caused the leader to terminate. All 3 of the masters 
> suffered the same failure as the same slave node reregistered against the new 
> leader; this continued across the entire cluster until the offending slave 
> node was removed and fixed. The fix to the slave node was to remove the mesos 
> directory and then start the slave node back up. 
>  F0412 17:24:42.736600  6317 master.cpp:5701] Check failed: 
> frameworks_.contains(task.framework_id())
>  *** Check failure stack trace: ***
>  @ 0x7f59f944f94d  google::LogMessage::Fail()
>  @ 0x7f59f945177d  google::LogMessage::SendToLog()
>  @ 0x7f59f944f53c  google::LogMessage::Flush()
>  @ 0x7f59f9452079  google::LogMessageFatal::~LogMessageFatal()
>  I0412 17:24:42.750300  6316 replica.cpp:693] Replica received learned notice 
> for position 6896 from @0.0.0.0:0 
>  @ 0x7f59f88f2341  mesos::internal::master::Master::_reregisterSlave()
>  @ 0x7f59f88f488f  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master6MasterERKNS5_9SlaveInfoERKNS0_4UPIDERKSt6vectorINS5_8ResourceESaISG_EERKSF_INS5_12ExecutorInfoESaISL_EERKSF_INS5_4TaskESaISQ_EERKSF_INS5_13FrameworkInfoESaISV_EERKSF_INS6_17Archive_FrameworkESaIS10_EERKSsRKSF_INS5_20SlaveInfo_CapabilityESaIS17_EERKNS0_6FutureIbEES9_SC_SI_SN_SS_SX_S12_SsS19_S1D_EEvRKNS0_3PIDIT_EEMS1H_FvT0_T1_T2_T3_T4_T5_T6_T7_T8_T9_ET10_T11_T12_T13_T14_T15_T16_T17_T18_T19_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
>  @ 0x7f59f93c3eb1  process::ProcessManager::resume()
>  @ 0x7f59f93ccd57  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
>  @ 0x7f59f77cfa60  (unknown)
>  @ 0x7f59f6fec184  start_thread
>  @ 0x7f59f6d19bed  (unknown)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7389) Mesos 1.2.0 crashes with pre-1.0 Mesos agents

2017-04-18 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15973853#comment-15973853
 ] 

Benjamin Mahler commented on MESOS-7389:


Ah, I'm sorry! I misread that code.

With regard to this ticket, I'm inclined not to fix given the difficulty and 
that we don't support pre-1.0 agents against a 1.2 master. Any objections?

> Mesos 1.2.0 crashes with pre-1.0 Mesos agents
> -
>
> Key: MESOS-7389
> URL: https://issues.apache.org/jira/browse/MESOS-7389
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
> Environment: Ubuntu 14.04 
>Reporter: Nicholas Studt
>Assignee: Benjamin Mahler
>Priority: Critical
>  Labels: mesosphere
>
> During upgrade from 1.0.1 to 1.2.0, a single mesos-slave reregistering with 
> the running leader caused the leader to terminate. All 3 of the masters 
> suffered the same failure as the same slave node reregistered against the new 
> leader; this continued across the entire cluster until the offending slave 
> node was removed and fixed. The fix to the slave node was to remove the mesos 
> directory and then start the slave node back up. 
>  F0412 17:24:42.736600  6317 master.cpp:5701] Check failed: 
> frameworks_.contains(task.framework_id())
>  *** Check failure stack trace: ***
>  @ 0x7f59f944f94d  google::LogMessage::Fail()
>  @ 0x7f59f945177d  google::LogMessage::SendToLog()
>  @ 0x7f59f944f53c  google::LogMessage::Flush()
>  @ 0x7f59f9452079  google::LogMessageFatal::~LogMessageFatal()
>  I0412 17:24:42.750300  6316 replica.cpp:693] Replica received learned notice 
> for position 6896 from @0.0.0.0:0 
>  @ 0x7f59f88f2341  mesos::internal::master::Master::_reregisterSlave()
>  @ 0x7f59f88f488f  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master6MasterERKNS5_9SlaveInfoERKNS0_4UPIDERKSt6vectorINS5_8ResourceESaISG_EERKSF_INS5_12ExecutorInfoESaISL_EERKSF_INS5_4TaskESaISQ_EERKSF_INS5_13FrameworkInfoESaISV_EERKSF_INS6_17Archive_FrameworkESaIS10_EERKSsRKSF_INS5_20SlaveInfo_CapabilityESaIS17_EERKNS0_6FutureIbEES9_SC_SI_SN_SS_SX_S12_SsS19_S1D_EEvRKNS0_3PIDIT_EEMS1H_FvT0_T1_T2_T3_T4_T5_T6_T7_T8_T9_ET10_T11_T12_T13_T14_T15_T16_T17_T18_T19_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
>  @ 0x7f59f93c3eb1  process::ProcessManager::resume()
>  @ 0x7f59f93ccd57  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
>  @ 0x7f59f77cfa60  (unknown)
>  @ 0x7f59f6fec184  start_thread
>  @ 0x7f59f6d19bed  (unknown)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7389) Mesos 1.2.0 crashes with pre-1.0 Mesos agents

2017-04-18 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15973514#comment-15973514
 ] 

Benjamin Mahler commented on MESOS-7389:


As far as I can tell, fixing this to support pre-1.0 agents is complicated and 
is likely to produce its own subtle bugs. 1.2+ masters maintain an invariant 
that each task / executor has a known allocation role (and it can determine 
this given that 1.0+ agents report their frameworks). If we were to support 
pre-1.0 agents against a 1.2+ master, the master would have to be updated to 
handle tasks that have an unknown allocation role (i.e. what used to be called 
"orphaned" tasks).

A partial fix here would be to handle the case where the framework is already 
re-registered, leaving only the "orphaned" task case triggering this check.
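
Concretely, the crashing invariant could be relaxed along these lines (a 
sketch with assumed names, not the actual {{_reregisterSlave}} code):

{noformat}
// Sketch only: replace the crashing CHECK in the agent re-registration
// path with explicit handling of tasks whose framework is unknown.
foreach (const Task& task, tasks) {
  if (!frameworks_.contains(task.framework_id())) {
    // Pre-1.0 agents don't send FrameworkInfos, so the master cannot
    // recover the allocation role; log and skip rather than abort.
    LOG(WARNING) << "Dropping task " << task.task_id()
                 << " of unknown framework " << task.framework_id();
    continue;
  }

  // ... normal task recovery ...
}
{noformat}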

[~neilc] [~vinodkone] The handling of pre-1.0 agents in the context of 
"orphaned tasks" already appears to have issues, e.g.:
* Master upgraded to 1.2.x
* Pre-1.0 agent re-registers with a task and the task's framework id, but 
doesn't send the FrameworkInfos.
* This task's framework hasn't re-registered yet, so this is what we used to 
call an "orphan task".
* The re-registration handling drops the task, see 
[here|https://github.com/apache/mesos/blob/1.2.0/src/master/master.cpp#L5784-L5807].
* Later, when this framework re-registers, the task is absent in the master but 
known to the agent.

Is this broken or am I missing something? If broken, given that fixing this 
ticket requires a complicated solution, and we didn't originally intend to 
support pre-1.0 upgrades for > 1.0.x masters, I'd be inclined to not support it 
(and possibly cherry-pick safety checks like MESOS-6975).

> Mesos 1.2.0 crashes with pre-1.0 Mesos agents
> -
>
> Key: MESOS-7389
> URL: https://issues.apache.org/jira/browse/MESOS-7389
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
> Environment: Ubuntu 14.04 
>Reporter: Nicholas Studt
>Assignee: Benjamin Mahler
>Priority: Critical
>  Labels: mesosphere
>
> During upgrade from 1.0.1 to 1.2.0, a single mesos-slave reregistering with 
> the running leader caused the leader to terminate. All 3 of the masters 
> suffered the same failure as the same slave node reregistered against the new 
> leader; this continued across the entire cluster until the offending slave 
> node was removed and fixed. The fix to the slave node was to remove the mesos 
> directory and then start the slave node back up. 
>  F0412 17:24:42.736600  6317 master.cpp:5701] Check failed: 
> frameworks_.contains(task.framework_id())
>  *** Check failure stack trace: ***
>  @ 0x7f59f944f94d  google::LogMessage::Fail()
>  @ 0x7f59f945177d  google::LogMessage::SendToLog()
>  @ 0x7f59f944f53c  google::LogMessage::Flush()
>  @ 0x7f59f9452079  google::LogMessageFatal::~LogMessageFatal()
>  I0412 17:24:42.750300  6316 replica.cpp:693] Replica received learned notice 
> for position 6896 from @0.0.0.0:0 
>  @ 0x7f59f88f2341  mesos::internal::master::Master::_reregisterSlave()
>  @ 0x7f59f88f488f  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master6MasterERKNS5_9SlaveInfoERKNS0_4UPIDERKSt6vectorINS5_8ResourceESaISG_EERKSF_INS5_12ExecutorInfoESaISL_EERKSF_INS5_4TaskESaISQ_EERKSF_INS5_13FrameworkInfoESaISV_EERKSF_INS6_17Archive_FrameworkESaIS10_EERKSsRKSF_INS5_20SlaveInfo_CapabilityESaIS17_EERKNS0_6FutureIbEES9_SC_SI_SN_SS_SX_S12_SsS19_S1D_EEvRKNS0_3PIDIT_EEMS1H_FvT0_T1_T2_T3_T4_T5_T6_T7_T8_T9_ET10_T11_T12_T13_T14_T15_T16_T17_T18_T19_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
>  @ 0x7f59f93c3eb1  process::ProcessManager::resume()
>  @ 0x7f59f93ccd57  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
>  @ 0x7f59f77cfa60  (unknown)
>  @ 0x7f59f6fec184  start_thread
>  @ 0x7f59f6d19bed  (unknown)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7376) Long registry updates when the number of agents is high

2017-04-17 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971824#comment-15971824
 ] 

Benjamin Mahler commented on MESOS-7376:


Yes, I will shepherd, thanks for taking this on!

> Long registry updates when the number of agents is high
> ---
>
> Key: MESOS-7376
> URL: https://issues.apache.org/jira/browse/MESOS-7376
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Affects Versions: 1.3.0
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Critical
>
> During scale testing we discovered that, as the number of registered agents 
> grows, the time it takes to update the registry grows to unacceptable values 
> very fast. At some point it starts exceeding {{registry_store_timeout}}, which 
> nevertheless doesn't fire.
> With 55k agents we saw this ({{registry_store_timeout=20secs}}):
> {noformat}
> I0331 17:11:21.227442 36472 registrar.cpp:473] Applied 69 operations in 
> 3.138843387secs; attempting to update the registry
> I0331 17:11:24.441409 36464 log.cpp:529] LogStorage.set: acquired the lock in 
> 74461ns
> I0331 17:11:24.441541 36464 log.cpp:543] LogStorage.set: started in 51770ns
> I0331 17:11:26.869323 36462 log.cpp:628] LogStorage.set: wrote append at 
> position=6420881 in 2.41043644secs
> I0331 17:11:26.869454 36462 state.hpp:179] State.store: storage.set has 
> finished in 2.428189561secs (b=1)
> I0331 17:11:56.199453 36469 registrar.cpp:518] Successfully updated the 
> registry in 34.971944192secs
> {noformat}
> This is caused by repeated {{Registry}} copying which involves copying a big 
> object graph that takes roughly 0.4 sec (with 55k agents).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-6975) Prevent pre-1.0 agents from registering with 1.3+ master.

2017-04-17 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-6975:
---
Summary: Prevent pre-1.0 agents from registering with 1.3+ master.  (was: 
Prevent old Mesos agents from registering)

> Prevent pre-1.0 agents from registering with 1.3+ master.
> -
>
> Key: MESOS-6975
> URL: https://issues.apache.org/jira/browse/MESOS-6975
> Project: Mesos
>  Issue Type: Epic
>  Components: master
>Reporter: Neil Conway
>Assignee: Neil Conway
>
> https://www.mail-archive.com/dev@mesos.apache.org/msg37194.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7387) ZK master contender and detector don't respect zk_session_timeout option

2017-04-17 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971814#comment-15971814
 ] 

Benjamin Mahler commented on MESOS-7387:


Looks like Vinod is shepherding, thanks Vinod.

> ZK master contender and detector don't respect zk_session_timeout option
> 
>
> Key: MESOS-7387
> URL: https://issues.apache.org/jira/browse/MESOS-7387
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Affects Versions: 1.3.0
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Minor
>
> {{ZooKeeperMasterContender}} and {{ZooKeeperMasterDetector}} are using 
> hardcoded ZK session timeouts ({{MASTER_CONTENDER_ZK_SESSION_TIMEOUT}} and 
> {{MASTER_DETECTOR_ZK_SESSION_TIMEOUT}}) and do not respect 
> {{--zk_session_timeout}} master option. This is unexpected and doesn't play 
> well with ZK updates that take longer than 10 secs.
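
The shape of the fix is presumably to plumb the flag through (a hypothetical 
sketch; the real contender/detector constructor signatures may differ):

{noformat}
// Hypothetical sketch: pass --zk_session_timeout through instead of
// the hardcoded constant. The constructor shape is an assumption.
Duration sessionTimeout = flags.zk_session_timeout;

ZooKeeperMasterContender contender(
    url,
    sessionTimeout);  // instead of MASTER_CONTENDER_ZK_SESSION_TIMEOUT
{noformat}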



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7376) Long registry updates when the number of agents is high

2017-04-17 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7376:
---
Shepherd: Benjamin Mahler

> Long registry updates when the number of agents is high
> ---
>
> Key: MESOS-7376
> URL: https://issues.apache.org/jira/browse/MESOS-7376
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Affects Versions: 1.3.0
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Critical
>
> During scale testing we discovered that, as the number of registered agents 
> grows, the time it takes to update the registry grows to unacceptable values 
> very fast. At some point it starts exceeding {{registry_store_timeout}}, which 
> nevertheless doesn't fire.
> With 55k agents we saw this ({{registry_store_timeout=20secs}}):
> {noformat}
> I0331 17:11:21.227442 36472 registrar.cpp:473] Applied 69 operations in 
> 3.138843387secs; attempting to update the registry
> I0331 17:11:24.441409 36464 log.cpp:529] LogStorage.set: acquired the lock in 
> 74461ns
> I0331 17:11:24.441541 36464 log.cpp:543] LogStorage.set: started in 51770ns
> I0331 17:11:26.869323 36462 log.cpp:628] LogStorage.set: wrote append at 
> position=6420881 in 2.41043644secs
> I0331 17:11:26.869454 36462 state.hpp:179] State.store: storage.set has 
> finished in 2.428189561secs (b=1)
> I0331 17:11:56.199453 36469 registrar.cpp:518] Successfully updated the 
> registry in 34.971944192secs
> {noformat}
> This is caused by repeated {{Registry}} copying which involves copying a big 
> object graph that takes roughly 0.4 sec (with 55k agents).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7323) Framework role tracking in allocator results in framework treated as active incorrectly.

2017-03-31 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7323:
---
Priority: Blocker  (was: Critical)

> Framework role tracking in allocator results in framework treated as active 
> incorrectly.
> 
>
> Key: MESOS-7323
> URL: https://issues.apache.org/jira/browse/MESOS-7323
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, master
>Reporter: Benjamin Mahler
>Assignee: Michael Park
>Priority: Blocker
>
> When an agent is added to the allocator and there are resources allocated to 
> a known framework, where the allocation role is not one of the framework's 
> subscribed roles, then the allocator will "track" the role (i.e. allocation 
> information) for this framework. However, the current implementation results 
> in the framework being treated as an active client of the sorter, when it 
> should be an inactive client.
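
In sorter terms, the intended behavior is presumably something like this 
sketch (assumed names, not the actual allocator code):

{noformat}
// Sketch only: track the framework's allocation under the role
// without making it an active client of the sorter.
frameworkSorters[role]->add(frameworkId.value());
frameworkSorters[role]->deactivate(frameworkId.value());  // inactive
{noformat}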



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-6627) Allow frameworks to modify the role(s) they are subscribed to.

2017-03-31 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-6627:
---
Fix Version/s: 1.3.0

[~mcypark] we need to fix MESOS-7323 prior to the 1.3 release, will mark it as 
a blocker. We can move MESOS-7258 as an improvement outside of this epic.

> Allow frameworks to modify the role(s) they are subscribed to.
> --
>
> Key: MESOS-6627
> URL: https://issues.apache.org/jira/browse/MESOS-6627
> Project: Mesos
>  Issue Type: Epic
>  Components: framework api, master
>Reporter: Benjamin Mahler
>Assignee: Michael Park
> Fix For: 1.3.0
>
>
> Currently, we do not provide the ability for frameworks to change the roles 
> they are subscribed with. As we begin to support "multi-tenant" frameworks 
> (i.e. multi-role support in MESOS-1763), it will become necessary to allow 
> frameworks to add and remove roles as "tenants" come and go from the 
> framework.
> Because of this being necessary to support multi-role frameworks, this is 
> considered "phase 2" of the multi-role framework project. See the design 
> published in MESOS-4284.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-6058) Register slave in deactivate mode

2017-03-30 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15949891#comment-15949891
 ] 

Benjamin Mahler commented on MESOS-6058:


This sounds like a "reservation template" feature we've talked about, where the 
master would apply the reservation before letting the agent's resources be 
allocated.

This might also be covered by the addition of endpoints to deactivate/activate 
agents (MESOS-7317), depending on how that's solved (i.e. whether agents are 
identified by machine or by specific agent id).

> Register slave in deactivate mode
> -
>
> Key: MESOS-6058
> URL: https://issues.apache.org/jira/browse/MESOS-6058
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, master
>Reporter: Klaus Ma
>
> In my cluster, I'd like to reserve some resources for one application; the 
> dynamic reservation feature is used because the reservation may be changed. But 
> when a slave registers with the master, some tasks from other frameworks may be 
> dispatched to the new slave before the reservation is made. The proposal is to 
> let the slave register in deactivated mode, and activate it after configuration, 
> e.g. dynamic reservation.
> cc [~kaysoky]/[~jvanremoortere]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7317) Add master endpoint to deactivate / activate agent

2017-03-30 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15949876#comment-15949876
 ] 

Benjamin Mahler commented on MESOS-7317:


Linking in the original ticket.

> Add master endpoint to deactivate / activate agent
> --
>
> Key: MESOS-7317
> URL: https://issues.apache.org/jira/browse/MESOS-7317
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, master
>Reporter: Neil Conway
>  Labels: mesosphere
>
> This would allow the operator to deactivate and then subsequently activate an 
> agent. The allocator does not make offers for deactivated agents; this 
> functionality would be useful to help operators "manually (incrementally) 
> drain" the tasks running on an agent, e.g., before taking the agent down.
> At present, if the operator causes a framework to kill a task running on the 
> agent, the framework will often receive an offer for the unused resources on 
> the agent, which will often result in respawning the killed task on the same 
> agent.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7325) ProcessRemoteLinkTest.RemoteLinkLeak is flaky.

2017-03-30 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15949869#comment-15949869
 ] 

Benjamin Mahler commented on MESOS-7325:


Hm.. I wasn't able to reproduce after 3000 iterations.

> ProcessRemoteLinkTest.RemoteLinkLeak is flaky.
> --
>
> Key: MESOS-7325
> URL: https://issues.apache.org/jira/browse/MESOS-7325
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.3.0
> Environment: macOS, clang, libev build
>Reporter: Till Toenshoff
>  Labels: flaky-test
>
> After this initially hit me on a regular make check, I ran the test in 
> isolation with infinite repetition; in those 3 attempts, the bug surfaced 
> at around 100-150 repetitions.
> {noformat}
> $ ./3rdparty/libprocess/libprocess-tests 
> --gtest_filter="ProcessRemoteLinkTest.RemoteLinkLeak" --gtest_repeat=-1 
> --gtest_break_on_failure
> {noformat}
> {noformat}
> Repeating all tests (iteration 119) . . .
> Note: Google Test filter = ProcessRemoteLinkTest.RemoteLinkLeak
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from ProcessRemoteLinkTest
> [ RUN  ] ProcessRemoteLinkTest.RemoteLinkLeak
> (libev) select: Invalid argument
> *** Aborted at 1490866597 (unix time) try "date -d @1490866597" if you are 
> using GNU date ***
> PC: @ 0x7fffb7621d42 __pthread_kill
> *** SIGABRT (@0x7fffb7621d42) received by PID 60372 (TID 0x7d123000) 
> stack trace: ***
> @ 0x7fffb7702b3a _sigtramp
> @ 0x7ff8fecfc080 (unknown)
> @ 0x7fffb7587420 abort
> @0x10e17051d ev_syserr
> @0x10e170e16 select_poll
> @0x10e16c635 ev_run
> @0x10e126f2b ev_loop()
> @0x10e126e96 process::EventLoop::run()
> @0x10e0498bf _ZNSt3__114__thread_proxyINS_5tupleIJPFvvEEPvS5_
> @ 0x7fffb770c9af _pthread_body
> @ 0x7fffb770c8fb _pthread_start
> @ 0x7fffb770c101 thread_start
> Abort trap: 6
> {noformat}
> Note that this is obviously a libev build.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7325) ProcessRemoteLinkTest.RemoteLinkLeak is flaky.

2017-03-30 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15949852#comment-15949852
 ] 

Benjamin Mahler commented on MESOS-7325:


Hm.. this looks the same as MESOS-6453, which also occurs on darwin but with a 
different test run in repetition. I'm curious whether this fails after a fixed 
number of runs (e.g. hitting some limit) or non-deterministically.

> ProcessRemoteLinkTest.RemoteLinkLeak is flaky.
> --
>
> Key: MESOS-7325
> URL: https://issues.apache.org/jira/browse/MESOS-7325
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.3.0
> Environment: macOS, clang, libev build
>Reporter: Till Toenshoff
>  Labels: flaky-test
>
> After this initially hit me on a regular make check, I ran the test in 
> isolation with infinite repetition; in those 3 attempts, the bug surfaced 
> at around 100-150 repetitions.
> {noformat}
> $ ./3rdparty/libprocess/libprocess-tests 
> --gtest_filter="ProcessRemoteLinkTest.RemoteLinkLeak" --gtest_repeat=-1 
> --gtest_break_on_failure
> {noformat}
> {noformat}
> Repeating all tests (iteration 119) . . .
> Note: Google Test filter = ProcessRemoteLinkTest.RemoteLinkLeak
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from ProcessRemoteLinkTest
> [ RUN  ] ProcessRemoteLinkTest.RemoteLinkLeak
> (libev) select: Invalid argument
> *** Aborted at 1490866597 (unix time) try "date -d @1490866597" if you are 
> using GNU date ***
> PC: @ 0x7fffb7621d42 __pthread_kill
> *** SIGABRT (@0x7fffb7621d42) received by PID 60372 (TID 0x7d123000) 
> stack trace: ***
> @ 0x7fffb7702b3a _sigtramp
> @ 0x7ff8fecfc080 (unknown)
> @ 0x7fffb7587420 abort
> @0x10e17051d ev_syserr
> @0x10e170e16 select_poll
> @0x10e16c635 ev_run
> @0x10e126f2b ev_loop()
> @0x10e126e96 process::EventLoop::run()
> @0x10e0498bf _ZNSt3__114__thread_proxyINS_5tupleIJPFvvEEPvS5_
> @ 0x7fffb770c9af _pthread_body
> @ 0x7fffb770c8fb _pthread_start
> @ 0x7fffb770c101 thread_start
> Abort trap: 6
> {noformat}
> Note that this is obviously a libev build.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7324) Update documentation to reflect the addition of multi-role framework support.

2017-03-29 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-7324:
--

 Summary: Update documentation to reflect the addition of 
multi-role framework support.
 Key: MESOS-7324
 URL: https://issues.apache.org/jira/browse/MESOS-7324
 Project: Mesos
  Issue Type: Documentation
  Components: documentation
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler


The current documentation assumes single-role frameworks; we need to update it 
to reflect the support for subscribing to multiple roles.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (MESOS-6762) Update release notes for multi-role changes

2017-03-29 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-6762:
--

Assignee: Benjamin Mahler

> Update release notes for multi-role changes
> ---
>
> Key: MESOS-6762
> URL: https://issues.apache.org/jira/browse/MESOS-6762
> Project: Mesos
>  Issue Type: Task
>Reporter: Benjamin Bannier
>Assignee: Benjamin Mahler
>
> When adding support for multi-role frameworks we should call out a number of 
> potential issues in the changelog/release notes.
> This ticket collects potential pitfalls.
> h6. Changes in master and agent endpoints
> When rendering the {{FrameworkInfo}} of multi-role enabled frameworks in 
> master or agent endpoints the {{role}} field will not be supported anymore; 
> instead the {{roles}} field should be used. Any tooling parsing endpoint 
> information and relying on the {{role}} field needs to be updated before 
> multi-role enabled frameworks can be run in the cluster.
> h6. Changes to the allocator interface / implementation requirements for 
> module implementors
> Implementors of allocator modules have to provide new implementation 
> functionality to satisfy the MULTI_ROLE framework capability. Also, the 
> interface has changed.
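
For illustration (abbreviated JSON; the surrounding endpoint shape is elided, 
and "batch" is an invented role), tooling that reads a framework's role from 
the master or agent endpoints would see roughly this change:

{noformat}
// Before (single-role framework as rendered today):
{ "frameworks": [ { "id": "...", "role": "spark", ... } ] }

// After (MULTI_ROLE framework; "role" is no longer rendered):
{ "frameworks": [ { "id": "...", "roles": ["spark", "batch"], ... } ] }
{noformat}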



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-6762) Update release notes for multi-role changes

2017-03-29 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15947997#comment-15947997
 ] 

Benjamin Mahler commented on MESOS-6762:


CHANGELOG update:

{noformat}
commit 10d7988ee5948bc45518e7c1c339a371c4bf151f
Author: Benjamin Mahler 
Date:   Thu Mar 16 15:33:56 2017 -0700

Added multi-role framework support to the CHANGELOG.

Review: https://reviews.apache.org/r/57707
{noformat}

Will close once the additional documentation described in this ticket is added.

> Update release notes for multi-role changes
> ---
>
> Key: MESOS-6762
> URL: https://issues.apache.org/jira/browse/MESOS-6762
> Project: Mesos
>  Issue Type: Task
>Reporter: Benjamin Bannier
>
> When adding support for multi-role frameworks we should call out a number of 
> potential issues in the changelog/release notes.
> This ticket collects potential pitfalls.
> h6. Changes in master and agent endpoints
> When rendering the {{FrameworkInfo}} of multi-role enabled frameworks in 
> master or agent endpoints the {{role}} field will not be supported anymore; 
> instead the {{roles}} field should be used. Any tooling parsing endpoint 
> information and relying on the {{role}} field needs to be updated before 
> multi-role enabled frameworks can be run in the cluster.
> h6. Changes to the allocator interface / implementation requirements for 
> module implementors
> Implementors of allocator modules have to provide new implementation 
> functionality to satisfy the MULTI_ROLE framework capability. Also, the 
> interface has changed.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-3875) Account dynamic reservations towards quota.

2017-03-29 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15947990#comment-15947990
 ] 

Benjamin Mahler commented on MESOS-3875:


I think the situation I'm describing is addressed by this ticket, since I'm 
referring to quota allocation not accounting for reserved resources and hence 
enabling gaming. Unless this is prevented already?

MESOS-3338 is fairly vague. It could use some clarification since it seems to 
be referring only to endpoints (and I'm not sure the suggestion of MESOS-3338 
is the right thing to do as far as the endpoints are concerned). Is it 
addressing the fair sharing side of the reservation gaming?

> Account dynamic reservations towards quota.
> ---
>
> Key: MESOS-3875
> URL: https://issues.apache.org/jira/browse/MESOS-3875
> Project: Mesos
>  Issue Type: Task
>  Components: allocation, master
>Reporter: Alexander Rukletsov
>Priority: Critical
>  Labels: mesosphere
>
> Dynamic reservations—whether allocated or not—should be accounted towards 
> role's quota. This requires update in at least two places:
> * The built-in allocator, which actually satisfies quota;
> * The sanity check in the master.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7323) Framework role tracking in allocator results in framework treated as active incorrectly.

2017-03-29 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7323:
---
Target Version/s: 1.3.0
Priority: Critical  (was: Major)
 Description: When an agent is added to the allocator and there are 
resources allocated to a known framework, where the allocation role is not one 
of the framework's subscribed roles, then the allocator will "track" the role 
(i.e. allocation information) for this framework. However, the current 
implementation results in the framework being treated as an active client of 
the sorter, when it should be an inactive client.

> Framework role tracking in allocator results in framework treated as active 
> incorrectly.
> 
>
> Key: MESOS-7323
> URL: https://issues.apache.org/jira/browse/MESOS-7323
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, master
>Reporter: Benjamin Mahler
>Assignee: Michael Park
>Priority: Critical
>
> When an agent is added to the allocator and there are resources allocated to 
> a known framework, where the allocation role is not one of the framework's 
> subscribed roles, then the allocator will "track" the role (i.e. allocation 
> information) for this framework. However, the current implementation results 
> in the framework being treated as an active client of the sorter, when it 
> should be an inactive client.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7323) Framework role tracking in allocator results in framework treated as active incorrectly.

2017-03-29 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-7323:
--

 Summary: Framework role tracking in allocator results in framework 
treated as active incorrectly.
 Key: MESOS-7323
 URL: https://issues.apache.org/jira/browse/MESOS-7323
 Project: Mesos
  Issue Type: Bug
  Components: allocation, master
Reporter: Benjamin Mahler
Assignee: Michael Park






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-3875) Account dynamic reservations towards quota.

2017-03-28 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-3875:
---
Priority: Critical  (was: Major)

Not accounting them when not allocated seems to allow gaming of the quota 
system.

The scenario is: I keep making reservations and filter out any reservations that 
come back, in an effort to amass as many reserved resources as possible. Can I 
reserve the whole cluster this way, given that we don't count the reserved 
resources towards the quota? Or is there something in place already that 
prevents this?
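
A worked example of the concern (numbers invented for illustration): suppose 
role A has a quota of 100 cpus and unallocated reservations are not counted 
towards it. A's framework receives an offer, dynamically reserves the offered 
cpus, and declines with a long filter. Those reserved cpus no longer count 
against A's 100-cpu quota, so the allocator keeps offering A more resources to 
satisfy the guarantee; by repeating this, A can accumulate reservations far 
beyond its quota, potentially the whole cluster.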

> Account dynamic reservations towards quota.
> ---
>
> Key: MESOS-3875
> URL: https://issues.apache.org/jira/browse/MESOS-3875
> Project: Mesos
>  Issue Type: Task
>  Components: allocation, master
>Reporter: Alexander Rukletsov
>Priority: Critical
>  Labels: mesosphere
>
> Dynamic reservations—whether allocated or not—should be accounted towards 
> role's quota. This requires update in at least two places:
> * The built-in allocator, which actually satisfies quota;
> * The sanity check in the master.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7319) Rename the DRAIN maintenance mode to SCHEDULED to avoid confusion.

2017-03-27 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7319:
---
Description: 
The current naming of the DRAIN mode in maintenance has been confusing to users 
as there tends to be an expectation of mesos doing something (e.g. not sending 
offers, or killing tasks) to achieve the drain, whereas in reality mesos does 
nothing and expects the schedulers to act (this only applies for maintenance 
aware schedulers).

Rather, what's actually happening in the DRAIN mode is that the maintenance 
is scheduled; that's it. So a name like SCHEDULED would be less confusing for 
users: http://mesos.apache.org/documentation/latest/maintenance/
Component/s: documentation

> Rename the DRAIN maintenance mode to SCHEDULED to avoid confusion.
> --
>
> Key: MESOS-7319
> URL: https://issues.apache.org/jira/browse/MESOS-7319
> Project: Mesos
>  Issue Type: Improvement
>  Components: documentation, HTTP API, master
>Reporter: Benjamin Mahler
>
> The current naming of the DRAIN mode in maintenance has been confusing to 
> users as there tends to be an expectation of mesos doing something (e.g. not 
> sending offers, or killing tasks) to achieve the drain, whereas in reality 
> mesos does nothing and expects the schedulers to act (this only applies for 
> maintenance aware schedulers).
> Rather, what's actually happening in the DRAIN mode is that the 
> maintenance is scheduled; that's it. So a name like SCHEDULED would be less 
> confusing for users: http://mesos.apache.org/documentation/latest/maintenance/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7319) Rename the DRAIN maintenance mode to SCHEDULED to avoid confusion.

2017-03-27 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-7319:
--

 Summary: Rename the DRAIN maintenance mode to SCHEDULED to avoid 
confusion.
 Key: MESOS-7319
 URL: https://issues.apache.org/jira/browse/MESOS-7319
 Project: Mesos
  Issue Type: Improvement
  Components: HTTP API, master
Reporter: Benjamin Mahler






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7201) Improvements to maintenance primitives

2017-03-27 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15944343#comment-15944343
 ] 

Benjamin Mahler commented on MESOS-7201:


[~kaysoky] I'm inclined to rename the {{DRAIN}} mode to {{SCHEDULED}}, as there 
is not necessarily any "draining" occurring in the {{DRAIN}} mode; this tends to 
confuse users, who expect mesos to do something (e.g. not sending offers, or 
killing tasks) to achieve the drain. Thoughts?

> Improvements to maintenance primitives
> --
>
> Key: MESOS-7201
> URL: https://issues.apache.org/jira/browse/MESOS-7201
> Project: Mesos
>  Issue Type: Epic
>Reporter: Joseph Wu
>  Labels: mesosphere
>
> This is a follow up epic to MESOS-1474 to capture further improvements for 
> maintenance primitives.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7317) Add master endpoint to deactivate / activate agent

2017-03-27 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7317:
---
Target Version/s: 1.3.0

> Add master endpoint to deactivate / activate agent
> --
>
> Key: MESOS-7317
> URL: https://issues.apache.org/jira/browse/MESOS-7317
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, master
>Reporter: Neil Conway
>  Labels: mesosphere
>
> This would allow the operator to deactivate and then subsequently activate an 
> agent. The allocator does not make offers for deactivated agents; this 
> functionality would be useful to help operators "manually (incrementally) 
> drain" the tasks running on an agent, e.g., before taking the agent down.
> At present, if the operator causes a framework to kill a task running on the 
> agent, the framework will often receive an offer for the unused resources on 
> the agent, which will often result in respawning the killed task on the same 
> agent.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7317) Add master endpoint to deactivate / activate agent

2017-03-27 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15944208#comment-15944208
 ] 

Benjamin Mahler commented on MESOS-7317:


Linking in the "maintenance improvements" epic.

> Add master endpoint to deactivate / activate agent
> --
>
> Key: MESOS-7317
> URL: https://issues.apache.org/jira/browse/MESOS-7317
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, master
>Reporter: Neil Conway
>  Labels: mesosphere
>
> This would allow the operator to deactivate and then subsequently activate an 
> agent. The allocator does not make offers for deactivated agents; this 
> functionality would be useful to help operators "manually (incrementally) 
> drain" the tasks running on an agent, e.g., before taking the agent down.
> At present, if the operator causes a framework to kill a task running on the 
> agent, the framework will often receive an offer for the unused resources on 
> the agent, which will often result in respawning the killed task on the same 
> agent.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7318) Libprocess delays and timers should be undisturbed by system clock jumps.

2017-03-27 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-7318:
--

 Summary: Libprocess delays and timers should be undisturbed by 
system clock jumps.
 Key: MESOS-7318
 URL: https://issues.apache.org/jira/browse/MESOS-7318
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Reporter: Benjamin Mahler


Currently, libprocess timers / delays / timeouts are affected by system clock 
jumps because they do not use a monotonic clock as a reference point.

Since these require relative timing, we can use a monotonic clock as the 
reference point. We also need the approach to be affected by clock manipulation 
at the libprocess level (i.e. {{Clock::advance(...)}} and 
{{Clock::update(...)}}) for testing purposes.

The current recommendation is for users to use NTP with skewing applied to 
adjust for leaps, e.g.: 
https://googleblog.blogspot.com/2011/09/time-technology-and-leaping-seconds.html

I thought we already had a ticket for this but can't seem to find it.
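
A minimal illustration of the idea in standard C++ (not libprocess code): 
measure relative timeouts against a monotonic clock so that wall-clock jumps, 
e.g. NTP step corrections, neither fire timers early nor stall them:

{noformat}
// Minimal sketch using standard C++ (not libprocess internals):
// relative timeouts measured on a monotonic clock are immune to
// system (wall) clock jumps.
#include <chrono>

bool timerExpired(
    std::chrono::steady_clock::time_point start,  // monotonic reference
    std::chrono::milliseconds duration)
{
  // steady_clock::now() never jumps with the system time, unlike
  // system_clock::now().
  return std::chrono::steady_clock::now() - start >= duration;
}
{noformat}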



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-5995) Protobuf JSON deserialisation does not accept numbers formated as strings

2017-03-24 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-5995:
---
Priority: Critical  (was: Minor)

> Protobuf JSON deserialisation does not accept numbers formated as strings
> -
>
> Key: MESOS-5995
> URL: https://issues.apache.org/jira/browse/MESOS-5995
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 1.0.0
>Reporter: Tomasz Janiszewski
>Assignee: Tomasz Janiszewski
>Priority: Critical
>
> Proto2 does not specify JSON mappings, but 
> [Proto3|https://developers.google.com/protocol-buffers/docs/proto3#json] does, 
> and it recommends mapping 64-bit numbers as strings. Unfortunately, Mesos does 
> not accept strings in place of uint64 and returns 400 Bad 
> {quote}
> Request error Failed to convert JSON into Call protobuf: Not expecting a JSON 
> string for field 'value'.
> {quote}
> Is this by purpose or is this a bug?
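
For illustration, the two encodings in question (payload shape abbreviated):

{noformat}
// Accepted by Mesos today (JSON number):
{ "value": 9223372036854775807 }

// Rejected with "Not expecting a JSON string for field 'value'",
// although Proto3's JSON mapping emits 64-bit integers this way:
{ "value": "9223372036854775807" }
{noformat}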



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7281) Backwards incompatible UpdateFrameworkMessage handling

2017-03-22 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7281:
---
Target Version/s: 1.3.0

[~ipronin] thanks for reporting this, we'll get your fix in.

> Backwards incompatible UpdateFrameworkMessage handling
> --
>
> Key: MESOS-7281
> URL: https://issues.apache.org/jira/browse/MESOS-7281
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Blocker
>
> The patch in [r/57108|https://reviews.apache.org/r/57108/] introduced framework 
> info updates. Agents are using the new {{framework_info}} field without 
> checking that it's present. If a patched agent is used with an unpatched 
> master, it will get a default-initialized {{framework_info}}. This will cause 
> agent failures later, e.g. an abort on framework ID validation when the agent 
> tries to launch a new task for the updated framework.
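
The defensive pattern being asked for looks roughly like this (a sketch with 
assumed message and handler names, not the actual agent code):

{noformat}
// Sketch only: guard the new optional field instead of reading a
// default-initialized FrameworkInfo from an unpatched master.
void Slave::updateFramework(const UpdateFrameworkMessage& message)
{
  if (!message.has_framework_info()) {
    // Unpatched master: skip rather than trust a default-constructed
    // (invalid) FrameworkInfo.
    LOG(WARNING) << "Ignoring framework update without 'framework_info'";
    return;
  }

  const FrameworkInfo& frameworkInfo = message.framework_info();
  // ... apply the update ...
}
{noformat}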



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-6995) Update the webui to reflect hierarchical roles.

2017-03-20 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15933897#comment-15933897
 ] 

Benjamin Mahler commented on MESOS-6995:


[~haosd...@gmail.com] the flat table is just an initial approach to get the 
information exposed. We could figure out a tree structure and/or show roles 
per level (e.g. the top-level page shows eng and sales; one can click into eng 
to see the roles beneath it, and so on recursively).

I'm inclined to at least do the per-level approach, where the user has to 
"click in" to view the level beneath. We could later include a tree structure 
on top of this, where users would still be able to click in to roles.

> Update the webui to reflect hierarchical roles.
> ---
>
> Key: MESOS-6995
> URL: https://issues.apache.org/jira/browse/MESOS-6995
> Project: Mesos
>  Issue Type: Task
>  Components: webui
>Reporter: Benjamin Mahler
>Assignee: Jay Guo
>
> It may not need any changes, but we should confirm that the new role format 
> for hierarchical roles is correctly displayed in the webui.
> In addition, we can add a roles tab that shows the summary information 
> (shares, weights, quotas). For now, we don't need to make any of this 
> clickable (e.g. to see the tasks / frameworks under the role).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7240) Slave logs do not show gpus resources

2017-03-20 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15933304#comment-15933304
 ] 

Benjamin Mahler commented on MESOS-7240:


Hi [~osallou], looking at the code, it seems to me this should be printed. Are 
you sure the executor has GPU resources? It could be that it's your task that 
has the GPU resources. Including the full logs and how you're constructing the 
task would help diagnose further.

+ [~klueska] 

> Slave logs do not show gpus resources
> -
>
> Key: MESOS-7240
> URL: https://issues.apache.org/jira/browse/MESOS-7240
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Affects Versions: 1.1.0
>Reporter: Olivier Sallou
>Priority: Trivial
>
> In slave logs, when starting a container, there is information on the requested 
> cpu and mem. It would be nice to also show the requested gpus:
>  Launching executor '12' of framework 
> 37ef8db2-8203-471c-be90-b79fdc88ed3a- with resources cpus(*):0.1; 
> mem(*):32
> => gpus(*):1



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

