[jira] [Created] (MESOS-7689) Libprocess can crash on malformed request paths for libprocess messages.
Benjamin Mahler created MESOS-7689: -- Summary: Libprocess can crash on malformed request paths for libprocess messages. Key: MESOS-7689 URL: https://issues.apache.org/jira/browse/MESOS-7689 Project: Mesos Issue Type: Bug Components: libprocess Reporter: Benjamin Mahler The following code will crash when there is a libprocess message and the path cannot be decoded: https://github.com/apache/mesos/blob/1.3.0/3rdparty/libprocess/src/process.cpp#L798-L800 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
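The crash described above comes from using a decoded request path without handling decode failure. The following is an illustrative stand-in sketch, not the actual libprocess code: `decodePath` and `handleMessagePath` are hypothetical names, and `std::optional` stands in for stout's `Try`. The point is that a malformed path should produce a client error rather than abort the process.

```cpp
#include <cctype>
#include <optional>
#include <string>

// Hypothetical stand-in for libprocess's http::decode(): percent-decodes a
// URL path, returning std::nullopt on malformed input instead of aborting.
std::optional<std::string> decodePath(const std::string& path) {
  std::string out;
  for (std::size_t i = 0; i < path.size(); ++i) {
    if (path[i] != '%') {
      out += path[i];
      continue;
    }
    // A '%' must be followed by two hex digits; otherwise the path is
    // malformed and failure is signaled to the caller.
    if (i + 2 >= path.size() ||
        !std::isxdigit(static_cast<unsigned char>(path[i + 1])) ||
        !std::isxdigit(static_cast<unsigned char>(path[i + 2]))) {
      return std::nullopt;
    }
    out += static_cast<char>(std::stoi(path.substr(i + 1, 2), nullptr, 16));
    i += 2;
  }
  return out;
}

// A message handler that treats a malformed path as a client error
// (an HTTP 400-style response) rather than CHECK-failing the process.
std::string handleMessagePath(const std::string& path) {
  std::optional<std::string> decoded = decodePath(path);
  if (!decoded) {
    return "400 Bad Request";  // reject the request, don't crash
  }
  return "OK: " + *decoded;
}
```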
[jira] [Updated] (MESOS-7688) Improve master failover performance by reducing unnecessary agent retries.
[ https://issues.apache.org/jira/browse/MESOS-7688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7688: --- Description: Currently, during a failover the agents will (re-)register with the master. While the master is recovering, the master may drop messages from the agents, and so the agents must retry registration using a backoff mechanism. For large clusters, there can be a lot of overhead in processing unnecessary retries from the agents, given that these messages must be deserialized and contain all of the task / executor information many times over. In order to reduce this overhead, the idea is to avoid the need for agents to blindly retry (re-)registration with the master. Two approaches for this are: (1) Update the MasterInfo in ZK when the master is recovered. This is a bit of an abuse of MasterInfo unfortunately, but the idea is for agents to only (re-)register when they see that the master reaches a recovered state. Once recovered, the master will not drop messages, and therefore agents only need to retry when the connection breaks. (2) Have the master reply with a retry message when it's in the recovering state, so that agents get a clear signal that their messages were dropped. The agents only retry when the connection breaks or they get a retry message. This one is less optimal, because the master may have to process a lot of messages and send retries, but once the master is recovered, the master will process only a single (re-)registration from each agent. The number of (re-)registrations that occur while the master is recovering can be reduced to 1 in this approach if the master sends the retry message only after the master completes recovery. was: Currently, during a failover the agents will (re-)register with the master. While the master is recovering, the master may drop messages from the agents, and so the agents must retry registration using a backoff mechanism. 
For large clusters, there can be a lot of overhead in processing unnecessary retries from the agents, given that these messages must be deserialized and contain all of the task / executor information many times over. In order to reduce this overhead, the idea is to avoid the need for agents to blindly retry (re-)registration with the master. Two approaches for this are: (1) Update the MasterInfo in ZK when the master is recovered. This is a bit of an abuse of MasterInfo unfortunately, but the idea is for agents to only (re-)register when they see that the master reaches a recovered state. Once recovered, the master will not drop messages, and therefore agents only need to retry when the connection breaks. (2) Have the master reply with a retry message when it's in the recovering state, so that agents get a clear signal that their messages were dropped. This one is less optimal, because the master may have to process a lot of messages and send retries, but once the master is recovered, the master will process only a single (re-)registration from each agent. Here, agents only retry when the connection breaks or they get a retry message. > Improve master failover performance by reducing unnecessary agent retries. > -- > > Key: MESOS-7688 > URL: https://issues.apache.org/jira/browse/MESOS-7688 > Project: Mesos > Issue Type: Improvement > Components: agent, master >Reporter: Benjamin Mahler > Labels: scalability > > Currently, during a failover the agents will (re-)register with the master. > While the master is recovering, the master may drop messages from the agents, > and so the agents must retry registration using a backoff mechanism. For > large clusters, there can be a lot of overhead in processing unnecessary > retries from the agents, given that these messages must be deserialized and > contain all of the task / executor information many times over. 
> In order to reduce this overhead, the idea is to avoid the need for agents to > blindly retry (re-)registration with the master. Two approaches for this are: > (1) Update the MasterInfo in ZK when the master is recovered. This is a bit > of an abuse of MasterInfo unfortunately, but the idea is for agents to only > (re-)register when they see that the master reaches a recovered state. Once > recovered, the master will not drop messages, and therefore agents only need > to retry when the connection breaks. > (2) Have the master reply with a retry message when it's in the recovering > state, so that agents get a clear signal that their messages were dropped. > The agents only retry when the connection breaks or they get a retry message. > This one is less optimal, because the master may have to process a lot of > messages and send retries, but once the master is recovered, the master will > process only a single (re-)registration from each agent. The number of > (re-)registrations that occur while the master is recovering can be reduced > to 1 in this approach if the master sends the retry message only after the > master completes recovery. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-7688) Improve master failover performance by reducing unnecessary agent retries.
Benjamin Mahler created MESOS-7688: -- Summary: Improve master failover performance by reducing unnecessary agent retries. Key: MESOS-7688 URL: https://issues.apache.org/jira/browse/MESOS-7688 Project: Mesos Issue Type: Improvement Components: agent, master Reporter: Benjamin Mahler Currently, during a failover the agents will (re-)register with the master. While the master is recovering, the master may drop messages from the agents, and so the agents must retry registration using a backoff mechanism. For large clusters, there can be a lot of overhead in processing unnecessary retries from the agents, given that these messages must be deserialized and contain all of the task / executor information many times over. In order to reduce this overhead, the idea is to avoid the need for agents to blindly retry (re-)registration with the master. Two approaches for this are: (1) Update the MasterInfo in ZK when the master is recovered. This is a bit of an abuse of MasterInfo unfortunately, but the idea is for agents to only (re-)register when they see that the master reaches a recovered state. Once recovered, the master will not drop messages, and therefore agents only need to retry when the connection breaks. (2) Have the master reply with a retry message when it's in the recovering state, so that agents get a clear signal that their messages were dropped. This one is less optimal, because the master may have to process a lot of messages and send retries, but once the master is recovered, the master will process only a single (re-)registration from each agent. Here, agents only retry when the connection breaks or they get a retry message. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
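The backoff mechanism the ticket refers to can be sketched as exponential backoff with a cap and jitter (so agents don't retry in lockstep after a failover). This is a minimal illustration, not the actual Mesos agent code; the function name and parameters are assumptions.

```cpp
#include <algorithm>
#include <chrono>
#include <random>

using Duration = std::chrono::milliseconds;

// Sketch of agent-side registration backoff: the wait doubles on each
// attempt (base * 2^attempt), saturates at `cap`, and is then jittered
// uniformly so many agents don't hammer a recovering master at once.
Duration backoff(int attempt, Duration base, Duration cap, std::mt19937& rng) {
  // Exponential growth, with the shift clamped to avoid overflow.
  long long raw = base.count() << std::min(attempt, 20);
  long long capped = std::min<long long>(raw, cap.count());
  // Full jitter: pick uniformly in [0, capped].
  std::uniform_int_distribution<long long> dist(0, capped);
  return Duration(dist(rng));
}
```

Even with jitter, every retry still carries the full (re-)registration payload, which is what both proposals in the ticket aim to avoid.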
[jira] [Created] (MESOS-7683) Introduce a master capability for reservation refinement.
Benjamin Mahler created MESOS-7683: -- Summary: Introduce a master capability for reservation refinement. Key: MESOS-7683 URL: https://issues.apache.org/jira/browse/MESOS-7683 Project: Mesos Issue Type: Task Reporter: Benjamin Mahler So that frameworks can detect which features the master supports, we have proposed introducing master capabilities: MESOS-5675. For reservation refinement we can add a master capability. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7682) Agent agent downgrade capability checking.
[ https://issues.apache.org/jira/browse/MESOS-7682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7682: --- Summary: Agent agent downgrade capability checking. (was: Agent downgrade capability checking.) > Agent agent downgrade capability checking. > -- > > Key: MESOS-7682 > URL: https://issues.apache.org/jira/browse/MESOS-7682 > Project: Mesos > Issue Type: Improvement >Reporter: Benjamin Mahler > > It would be great if the agent could prevent a downgrade if it reaches a > point where a capability becomes required but the downgraded agent does not > have the capability. > For example, consider the case that an agent starts writing refined > reservations to disk (per the RESERVATION_REFINEMENT capability). At this > point, the RESERVATION_REFINEMENT capability becomes required. Ideally, the > agent persists this information into its state, so that if the agent is > downgraded to a pre-RESERVATION_REFINEMENT state, the old agent could detect > that a capability it does not have is required. At this point the old agent > could refuse to start. This would prevent a "buggy" downgrade due to the old > agent mis-reading the checkpointed resources. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-7682) Agent downgrade capability checking.
Benjamin Mahler created MESOS-7682: -- Summary: Agent downgrade capability checking. Key: MESOS-7682 URL: https://issues.apache.org/jira/browse/MESOS-7682 Project: Mesos Issue Type: Improvement Reporter: Benjamin Mahler It would be great if the agent could prevent a downgrade if it reaches a point where a capability becomes required but the downgraded agent does not have the capability. For example, consider the case that an agent starts writing refined reservations to disk (per the RESERVATION_REFINEMENT capability). At this point, the RESERVATION_REFINEMENT capability becomes required. Ideally, the agent persists this information into its state, so that if the agent is downgraded to a pre-RESERVATION_REFINEMENT state, the old agent could detect that a capability it does not have is required. At this point the old agent could refuse to start. This would prevent a "buggy" downgrade due to the old agent mis-reading the checkpointed resources. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
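The proposed check could look roughly like the following. This is an illustrative sketch only, not Mesos code: the struct, function name, and the idea of representing capabilities as strings are all assumptions. The agent would checkpoint the set of capabilities its on-disk state requires; an older binary reading that checkpoint refuses to start if it lacks any of them.

```cpp
#include <set>
#include <string>

struct StartupDecision {
  bool ok;
  std::string reason;
};

// Compare the capabilities the checkpointed state requires against the
// capabilities this (possibly downgraded) agent binary supports.
StartupDecision checkDowngrade(
    const std::set<std::string>& requiredByCheckpoint,
    const std::set<std::string>& supportedByThisBinary) {
  for (const std::string& capability : requiredByCheckpoint) {
    if (supportedByThisBinary.count(capability) == 0) {
      // e.g. an old agent reading state written with RESERVATION_REFINEMENT:
      // refuse to start instead of mis-reading checkpointed resources.
      return {false,
              "Checkpointed state requires unsupported capability: " +
                  capability};
    }
  }
  return {true, ""};
}
```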
[jira] [Assigned] (MESOS-7655) Reservation Refinement: Update the resources logic.
[ https://issues.apache.org/jira/browse/MESOS-7655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-7655: -- Assignee: Michael Park > Reservation Refinement: Update the resources logic. > --- > > Key: MESOS-7655 > URL: https://issues.apache.org/jira/browse/MESOS-7655 > Project: Mesos > Issue Type: Bug >Reporter: Michael Park >Assignee: Michael Park > > With reservation refinement, there is a new framework capability called > {{RESERVATION_REFINEMENT}}. The framework is required to use the > {{Resource.reservations}} field to express reservations if the capability is > set, otherwise it is required to use the {{Resource.role}} and > {{Resource.reservation}} fields. > After the validation, we transform the resources from the old format to the > new format and deal with the new format internally. > This allows us to only deal with the old format at the validation phase, and > update the code to only consider the new format for all other > Resources-related functions. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-7668) Update authorization to handle reservation refinement.
Benjamin Mahler created MESOS-7668: -- Summary: Update authorization to handle reservation refinement. Key: MESOS-7668 URL: https://issues.apache.org/jira/browse/MESOS-7668 Project: Mesos Issue Type: Task Reporter: Benjamin Mahler With reservation refinement, the local authorizer needs to be updated to retrieve the role and principal via the {{Resource.reservations}} field. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-7665) v0 Operator API update for reservation refinement.
Benjamin Mahler created MESOS-7665: -- Summary: v0 Operator API update for reservation refinement. Key: MESOS-7665 URL: https://issues.apache.org/jira/browse/MESOS-7665 Project: Mesos Issue Type: Task Reporter: Benjamin Mahler In order to preserve backwards compatibility, the v0 endpoints (e.g. /state) should expose the old format using `Resource.role` and `Resource.reservation` if the resources do not contain a refined reservation. If the resource contains a refined reservation, then we need to ensure the v0 endpoints reflect that in the JSON. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
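The v0 compatibility rule above can be sketched as follows. The types and field names here are a deliberately simplified, hypothetical model (not the actual Mesos protobufs): a resource with at most one reservation can be rendered in the old `Resource.role` format, while a refined (stacked) reservation cannot.

```cpp
#include <string>
#include <vector>

// Hypothetical simplified model of the two formats.
struct Reservation {
  std::string role;
  std::string principal;
};

struct Resource {
  std::vector<Reservation> reservations;  // new "stack" format
  std::string role = "*";                 // old format, filled on downgrade
  bool hasLegacyReservation = false;
};

// For /state-style v0 endpoints: emit the old format when the resource has
// no refinement (zero or one reservation); a refined reservation cannot be
// represented in the old format, so the endpoint must reflect the new one.
bool downgradeForV0(Resource& r) {
  if (r.reservations.size() > 1) {
    return false;  // refined: keep the new format in the JSON
  }
  if (r.reservations.size() == 1) {
    r.role = r.reservations[0].role;
    r.hasLegacyReservation = true;
  }
  return true;
}
```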
[jira] [Updated] (MESOS-7664) Framework API update for reservation refinement.
[ https://issues.apache.org/jira/browse/MESOS-7664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7664: --- Issue Type: Task (was: Documentation) > Framework API update for reservation refinement. > > > Key: MESOS-7664 > URL: https://issues.apache.org/jira/browse/MESOS-7664 > Project: Mesos > Issue Type: Task >Reporter: Benjamin Mahler >Assignee: Michael Park > > In order to add reservation refinement, the framework API needs: > * A way to express the "stack" of reservations. > * A new capability to gate the feature, since the resource format has to > change. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-7663) Update the documentation to reflect the addition of reservation refinement.
Benjamin Mahler created MESOS-7663: -- Summary: Update the documentation to reflect the addition of reservation refinement. Key: MESOS-7663 URL: https://issues.apache.org/jira/browse/MESOS-7663 Project: Mesos Issue Type: Documentation Components: documentation Reporter: Benjamin Mahler There are a few things we need to be sure to document: * What reservation refinement is. * The new "format" for Resource, when using the RESERVATION_REFINEMENT capability. * The filtering of resources if a framework is not RESERVATION_REFINEMENT capable. * The current limitations that only a single reservation can be pushed / popped within a single RESERVE / UNRESERVE operation. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (MESOS-7575) Support reservation refinement for hierarchical roles.
[ https://issues.apache.org/jira/browse/MESOS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048573#comment-16048573 ] Benjamin Mahler edited comment on MESOS-7575 at 6/14/17 1:24 AM: - Moving this to an epic so that we can capture all of the work needed here. was (Author: bmahler): Moving this to an epic so that capture all of the work needed here. > Support reservation refinement for hierarchical roles. > -- > > Key: MESOS-7575 > URL: https://issues.apache.org/jira/browse/MESOS-7575 > Project: Mesos > Issue Type: Epic >Reporter: Michael Park >Assignee: Michael Park > > With the introduction of hierarchical roles, Mesos provides a mechanism to > delegate resources down a hierarchy. > To complement this, we'll introduce a mechanism to *refine* the reservations > down the hierarchy. > For example, given resources allocated to role {{foo}}, it can be further > reserved for {{foo/bar}}. > When the resources allocated to {{foo/bar}} is unreserved, it goes back to > where it came from. In this case, back to role {{foo}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7575) Support reservation refinement for hierarchical roles.
[ https://issues.apache.org/jira/browse/MESOS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7575: --- Epic Name: reservation refinement > Support reservation refinement for hierarchical roles. > -- > > Key: MESOS-7575 > URL: https://issues.apache.org/jira/browse/MESOS-7575 > Project: Mesos > Issue Type: Epic >Reporter: Michael Park >Assignee: Michael Park > > With the introduction of hierarchical roles, Mesos provides a mechanism to > delegate resources down a hierarchy. > To complement this, we'll introduce a mechanism to *refine* the reservations > down the hierarchy. > For example, given resources allocated to role {{foo}}, it can be further > reserved for {{foo/bar}}. > When the resources allocated to {{foo/bar}} is unreserved, it goes back to > where it came from. In this case, back to role {{foo}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7575) Support reservation refinement for hierarchical roles.
[ https://issues.apache.org/jira/browse/MESOS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7575: --- Summary: Support reservation refinement for hierarchical roles. (was: Support reservation refinement) Issue Type: Epic (was: Task) Moving this to an epic so that capture all of the work needed here. > Support reservation refinement for hierarchical roles. > -- > > Key: MESOS-7575 > URL: https://issues.apache.org/jira/browse/MESOS-7575 > Project: Mesos > Issue Type: Epic >Reporter: Michael Park >Assignee: Michael Park > > With the introduction of hierarchical roles, Mesos provides a mechanism to > delegate resources down a hierarchy. > To complement this, we'll introduce a mechanism to *refine* the reservations > down the hierarchy. > For example, given resources allocated to role {{foo}}, it can be further > reserved for {{foo/bar}}. > When the resources allocated to {{foo/bar}} is unreserved, it goes back to > where it came from. In this case, back to role {{foo}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-6972) Improve performance of protobuf message passing by removing RepeatedPtrField to vector conversion.
[ https://issues.apache.org/jira/browse/MESOS-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046889#comment-16046889 ] Benjamin Mahler commented on MESOS-6972: [~dzhuk] I see, sounds like we have two options that depend on the install handler implementation. One case is const-access style, in which arenas and const access are best, and the other is wanting to move out the data from the protobuf, in which case non-const access (for moveability) with no arenas is best. > Improve performance of protobuf message passing by removing RepeatedPtrField > to vector conversion. > -- > > Key: MESOS-6972 > URL: https://issues.apache.org/jira/browse/MESOS-6972 > Project: Mesos > Issue Type: Improvement >Reporter: Benjamin Mahler > Labels: tech-debt > > Currently, all protobuf message handlers must take a {{vector}} for repeated > fields, rather than a {{RepeatedPtrField}}. > This requires that a copy be performed of the repeated field's entries (see > [here|https://github.com/apache/mesos/blob/9228ebc239dac42825390bebc72053dbf3ae7b09/3rdparty/libprocess/include/process/protobuf.hpp#L78-L87]), > which can be very expensive in some cases. We should avoid requiring this > expense on the callers. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
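The two handler styles discussed in the comment can be illustrated with a stand-in sketch. To stay self-contained this uses `std::vector<std::string>` in place of protobuf's `RepeatedPtrField`; the function names are illustrative, not a proposed API.

```cpp
#include <string>
#include <utility>
#include <vector>

// Option 1: const-access style. Compatible with arena-allocated messages;
// the handler reads the repeated field in place and copies nothing, but it
// cannot take ownership of the data.
std::size_t totalSize(const std::vector<std::string>& field) {
  std::size_t n = 0;
  for (const std::string& s : field) n += s.size();
  return n;
}

// Option 2: move-out style. Requires non-const access (and no arena, since
// moving out of an arena-owned message would degenerate to a deep copy),
// but avoids the per-element copy that the vector conversion forces today.
std::vector<std::string> moveOut(std::vector<std::string>& field) {
  std::vector<std::string> out;
  out.reserve(field.size());
  for (std::string& s : field) out.push_back(std::move(s));
  return out;
}
```

Which option wins depends on the installed handler, as the comment notes: handlers that only inspect the message favor const access plus arenas, while handlers that take ownership of the repeated data favor non-const access without arenas.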
[jira] [Commented] (MESOS-7651) Consider a more explicit way to bind reservations / volumes to a framework.
[ https://issues.apache.org/jira/browse/MESOS-7651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045268#comment-16045268 ] Benjamin Mahler commented on MESOS-7651: [~xujyan] Updated the description to mention lifecycle. > Consider a more explicit way to bind reservations / volumes to a framework. > --- > > Key: MESOS-7651 > URL: https://issues.apache.org/jira/browse/MESOS-7651 > Project: Mesos > Issue Type: Improvement >Reporter: Benjamin Mahler > > Currently, when a framework creates a reservation or a persistent volume, and > it wants exclusive access to this volume or reservation, it must take a few > steps: > * Ensure that no other frameworks are running within the reservation role (or > the other frameworks are co-operative). > * With hierarchical roles, frameworks must also ensure that the role is a > leaf so that no descendant roles will have access to the reservation/volume. > This could be done by generating a role (e.g. eng/kafka/). > It's not easy for the framework to ensure these things, since role ACLs are > controlled by the operator. > We should consider a more direct way for a framework to ensure that their > reservation/volume cannot be shared. E.g. by binding it to their framework id > (perhaps re-using roles for this rather than introducing something new?) > We should also consider binding the reservation / volumes, much like other > objects (tasks, executors), to the framework's lifecycle. So that if the > framework is removed, the reservations / volumes it left behind are cleaned > up. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7651) Consider a more explicit way to bind reservations / volumes to a framework.
[ https://issues.apache.org/jira/browse/MESOS-7651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7651: --- Description: Currently, when a framework creates a reservation or a persistent volume, and it wants exclusive access to this volume or reservation, it must take a few steps: * Ensure that no other frameworks are running within the reservation role (or the other frameworks are co-operative). * With hierarchical roles, frameworks must also ensure that the role is a leaf so that no descendant roles will have access to the reservation/volume. This could be done by generating a role (e.g. eng/kafka/). It's not easy for the framework to ensure these things, since role ACLs are controlled by the operator. We should consider a more direct way for a framework to ensure that their reservation/volume cannot be shared. E.g. by binding it to their framework id (perhaps re-using roles for this rather than introducing something new?) We should also consider binding the reservation / volumes, much like other objects (tasks, executors), to the framework's lifecycle. So that if the framework is removed, the reservations / volumes it left behind are cleaned up. was: Currently, when a framework creates a reservation or a persistent volume, and it wants exclusive access to this volume or reservation, it must take a few steps: * Ensure that no other frameworks are running within the reservation role (or the other frameworks are co-operative). * With hierarchical roles, frameworks must also ensure that the role is a leaf so that no descendant roles will have access to the reservation/volume. This could be done by generating a role (e.g. eng/kafka/). It's not easy for the framework to ensure these things, since role ACLs are controlled by the operator. We should consider a more direct way for a framework to ensure that their reservation/volume cannot be shared. E.g. 
by binding it to their framework id (perhaps re-using roles for this rather than introducing something new?) > Consider a more explicit way to bind reservations / volumes to a framework. > --- > > Key: MESOS-7651 > URL: https://issues.apache.org/jira/browse/MESOS-7651 > Project: Mesos > Issue Type: Improvement >Reporter: Benjamin Mahler > > Currently, when a framework creates a reservation or a persistent volume, and > it wants exclusive access to this volume or reservation, it must take a few > steps: > * Ensure that no other frameworks are running within the reservation role (or > the other frameworks are co-operative). > * With hierarchical roles, frameworks must also ensure that the role is a > leaf so that no descendant roles will have access to the reservation/volume. > This could be done by generating a role (e.g. eng/kafka/). > It's not easy for the framework to ensure these things, since role ACLs are > controlled by the operator. > We should consider a more direct way for a framework to ensure that their > reservation/volume cannot be shared. E.g. by binding it to their framework id > (perhaps re-using roles for this rather than introducing something new?) > We should also consider binding the reservation / volumes, much like other > objects (tasks, executors), to the framework's lifecycle. So that if the > framework is removed, the reservations / volumes it left behind are cleaned > up. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-3826) Add an optional unique identifier for resource reservations
[ https://issues.apache.org/jira/browse/MESOS-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16044833#comment-16044833 ] Benjamin Mahler commented on MESOS-3826: Filed a related issue: https://issues.apache.org/jira/browse/MESOS-7651 > Add an optional unique identifier for resource reservations > --- > > Key: MESOS-3826 > URL: https://issues.apache.org/jira/browse/MESOS-3826 > Project: Mesos > Issue Type: Improvement >Reporter: Sargun Dhillon > Labels: mesosphere, reservations > > Thanks to the resource reservation primitives, frameworks can reserve > resources. These reservations are per role, which means multiple frameworks > can share reservations. This can get very hairy, as multiple reservations can > occur on each agent. > It would be nice to be able to optionally, uniquely identify reservations by > ID, much like persistent volumes are today. This could be done by adding a > new protobuf field, such as Resource.ReservationInfo.id, that if set upon > reservation time, would come back when the reservation is advertised. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (MESOS-7651) Consider a more explicit way to bind reservations / volumes to a framework.
Benjamin Mahler created MESOS-7651: -- Summary: Consider a more explicit way to bind reservations / volumes to a framework. Key: MESOS-7651 URL: https://issues.apache.org/jira/browse/MESOS-7651 Project: Mesos Issue Type: Improvement Reporter: Benjamin Mahler Currently, when a framework creates a reservation or a persistent volume, and it wants exclusive access to this volume or reservation, it must take a few steps: * Ensure that no other frameworks are running within the reservation role (or the other frameworks are co-operative). * With hierarchical roles, frameworks must also ensure that the role is a leaf so that no descendant roles will have access to the reservation/volume. This could be done by generating a role (e.g. eng/kafka/). It's not easy for the framework to ensure these things, since role ACLs are controlled by the operator. We should consider a more direct way for a framework to ensure that their reservation/volume cannot be shared. E.g. by binding it to their framework id (perhaps re-using roles for this rather than introducing something new?) -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7033) Update documentation for hierarchical roles.
[ https://issues.apache.org/jira/browse/MESOS-7033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7033: --- Description: A few things to be sure cover: * How to ensure that a volume is not shared with other frameworks. Previously, this meant running only 1 framework in the role and using ACLs to prevent other frameworks from running in the role. With hierarchical roles, this now also includes using ACLs to prevent any child roles from being created beneath the role (as these children would be able to obtain the reserved resources). We've been advising frameworks to generate a role (e.g. eng/kafka/) to ensure that they own their reservations (but the dynamic nature of this makes setting up ACLs difficult). Longer term, we may need a more explicit way to bind reservations or volumes to frameworks. > Update documentation for hierarchical roles. > > > Key: MESOS-7033 > URL: https://issues.apache.org/jira/browse/MESOS-7033 > Project: Mesos > Issue Type: Task > Components: documentation >Reporter: Neil Conway >Assignee: Neil Conway > Labels: mesosphere > > A few things to be sure cover: > * How to ensure that a volume is not shared with other frameworks. > Previously, this meant running only 1 framework in the role and using ACLs to > prevent other frameworks from running in the role. With hierarchical roles, > this now also includes using ACLs to prevent any child roles from being > created beneath the role (as these children would be able to obtain the > reserved resources). We've been advising frameworks to generate a role (e.g. > eng/kafka/) to ensure that they own their reservations (but the > dynamic nature of this makes setting up ACLs difficult). Longer term, we may > need a more explicit way to bind reservations or volumes to frameworks. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-6972) Improve performance of protobuf message passing by removing RepeatedPtrField to vector conversion.
[ https://issues.apache.org/jira/browse/MESOS-6972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16043178#comment-16043178 ] Benjamin Mahler commented on MESOS-6972: [~dzhuk] great, yeah I was thinking of this option as well but wasn't sure if this works when we use arena allocation, since Swap performs deep copies if one side is from an arena and the other is not: https://developers.google.com/protocol-buffers/docs/reference/arenas#message-class-methods > Improve performance of protobuf message passing by removing RepeatedPtrField > to vector conversion. > -- > > Key: MESOS-6972 > URL: https://issues.apache.org/jira/browse/MESOS-6972 > Project: Mesos > Issue Type: Improvement >Reporter: Benjamin Mahler > Labels: tech-debt > > Currently, all protobuf message handlers must take a {{vector}} for repeated > fields, rather than a {{RepeatedPtrField}}. > This requires that a copy be performed of the repeated field's entries (see > [here|https://github.com/apache/mesos/blob/9228ebc239dac42825390bebc72053dbf3ae7b09/3rdparty/libprocess/include/process/protobuf.hpp#L78-L87]), > which can be very expensive in some cases. We should avoid requiring this > expense on the callers. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (MESOS-6244) Add support for streaming HTTP request bodies in libprocess.
[ https://issues.apache.org/jira/browse/MESOS-6244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-6244: -- Shepherd: Benjamin Mahler Assignee: Anand Mazumdar > Add support for streaming HTTP request bodies in libprocess. > > > Key: MESOS-6244 > URL: https://issues.apache.org/jira/browse/MESOS-6244 > Project: Mesos > Issue Type: Improvement > Components: libprocess >Reporter: Benjamin Mahler >Assignee: Anand Mazumdar > > We currently have support for streaming responses. See MESOS-2438. Servers > can start sending the response body before the body is complete. Clients can > start reading a response before the body is complete. This is an optimization > for large responses and is a requirement for infinite "streaming" style > endpoints. > We currently do not have support for streaming requests. This would allow a > client to stream a large or infinite request body to the server without > having to have the complete body in hand, and it would allow a server to read > request bodies before they are have been completely received over the > connection. > This is a requirement if we want to allow clients to "stream" data into a > server, i.e. an infinite request body. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-6531) Add support for incremental gzip compression.
[ https://issues.apache.org/jira/browse/MESOS-6531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-6531: --- Description: We currently only support compression assuming the entire input is available at once. We can add a {{gzip::Compressor}} to support incremental compression. (was: We currently only support compression / decompression assuming the entire input is available at once. We can add a {{gzip::Compressor}} to support incremental compression.) > Add support for incremental gzip compression. > - > > Key: MESOS-6531 > URL: https://issues.apache.org/jira/browse/MESOS-6531 > Project: Mesos > Issue Type: Improvement > Components: stout >Reporter: Benjamin Mahler > > We currently only support compression assuming the entire input is available > at once. We can add a {{gzip::Compressor}} to support incremental compression. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-7566) Master crash due to failed check in DRFSorter::remove
[ https://issues.apache.org/jira/browse/MESOS-7566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16037599#comment-16037599 ] Benjamin Mahler commented on MESOS-7566: [~xujyan] can you file a ticket for the race you described? It isn't the issue in this ticket AFAICT, but we should capture it and fix it as well. > Master crash due to failed check in DRFSorter::remove > - > > Key: MESOS-7566 > URL: https://issues.apache.org/jira/browse/MESOS-7566 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.1.1, 1.1.2 >Reporter: Zhitao Li >Priority: Critical > > A check in [sorter.cpp#L355 in 1.1.2 | > https://github.com/apache/mesos/blob/1.1.2/src/master/allocator/sorter/drf/sorter.cpp#L355] > is triggered occasionally in our cluster and crashes the master leader. > I manually modified that check to print out the related variables, and the > following is a master log. > https://gist.github.com/zhitaoli/0662d9fe1f6d57de344951c05b536bad#file-gistfile1-txt > From the log, it seems like the check was using a stale revocable CPU value > of {{26}} while the new value had been updated to {{25}}, thus the check crashed. > So far, the two verified occurrences of this bug were both observed near an > {{UNRESERVE}} operation (see lines above in the log). -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-7566) Master crash due to failed check in DRFSorter::remove
[ https://issues.apache.org/jira/browse/MESOS-7566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16037576#comment-16037576 ] Benjamin Mahler commented on MESOS-7566: For posterity, line 773 in [~zhitao]'s version corresponds to: https://github.com/apache/mesos/blob/1.1.2/src/master/allocator/mesos/hierarchical.cpp#L749 > Master crash due to failed check in DRFSorter::remove > - > > Key: MESOS-7566 > URL: https://issues.apache.org/jira/browse/MESOS-7566 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.1.1, 1.1.2 >Reporter: Zhitao Li >Priority: Critical > > A check in [sorter.cpp#L355 in 1.1.2 | > https://github.com/apache/mesos/blob/1.1.2/src/master/allocator/sorter/drf/sorter.cpp#L355] > is triggered occasionally in our cluster and crashes the master leader. > I manually modified that check to print out the related variables, and the > following is a master log. > https://gist.github.com/zhitaoli/0662d9fe1f6d57de344951c05b536bad#file-gistfile1-txt > From the log, it seems like the check was using a stale revocable CPU value > of {{26}} while the new value had been updated to {{25}}, thus the check crashed. > So far, the two verified occurrences of this bug were both observed near an > {{UNRESERVE}} operation (see lines above in the log). -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-7095) Basic make check from getting started link fails
[ https://issues.apache.org/jira/browse/MESOS-7095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035652#comment-16035652 ] Benjamin Mahler commented on MESOS-7095: [~tillt] does the getting started guide need any updates related to this so that users don't hit it? http://mesos.apache.org/gettingstarted/ > Basic make check from getting started link fails > > > Key: MESOS-7095 > URL: https://issues.apache.org/jira/browse/MESOS-7095 > Project: Mesos > Issue Type: Bug > Components: build >Reporter: Alec Bruns > > {*** Aborted at 1486657215 (unix time) try "date -d @1486657215" if you are > using GNU date *** PC: @0x1080b7367 apr_pool_create_ex *** SIGSEGV > (@0x30) received by PID 25167 (TID 0x7fffbdd073c0) stack trace: ***} > \{@ 0x7fffb50c7bba _sigtramp > @\{ 0x72c0517 (unknown)\} > @0x107eaa13a svn_pool_create_ex > @0x107691d6e svn::diff() > @0x107691042 SVNTest_DiffPatch_Test::TestBody() > @0x1077026ba > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @0x1076b3ad7 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @0x1076b3985 testing::Test::Run() > @0x1076b54f8 testing::TestInfo::Run() > @0x1076b6867 testing::TestCase::Run() > @0x1076c65dc testing::internal::UnitTestImpl::RunAllTests() > @0x1077033da > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @0x1076c6007 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @0x1076c5ed8 testing::UnitTest::Run() > @0x1074d55c1 RUN_ALL_TESTS() > @0x1074d5580 main > @ 0x7fffb4eba255 start > make[6]: *** [check-local] Segmentation fault: 11 > make[5]: *** [check-am] Error 2 make[4]: *** [check-recursive] Error 1 > make[3]: *** [check] Error 2 make[2]: *** [check-recursive] Error 1 > make[1]: *** [check] Error 2 make: *** [check-recursive] Error 1 > make: *** [check-recursive] Error 1 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (MESOS-7590) Make the default decline timeout configurable on the master rather than burned into the protobuf.
Benjamin Mahler created MESOS-7590: -- Summary: Make the default decline timeout configurable on the master rather than burned into the protobuf. Key: MESOS-7590 URL: https://issues.apache.org/jira/browse/MESOS-7590 Project: Mesos Issue Type: Improvement Components: master Reporter: Benjamin Mahler Currently, many frameworks decline without setting the filter timeout, and we have a default filter timeout of 5 seconds burned into the protobuf. Instead, it would be better if we could configure the default filter timeout on the master via a flag. When many frameworks are running and declining with short filter timeouts, the master may not have time to try to offer the resources to each framework before circling back to the original framework that declined them. So, allowing the operator to configure this as a workaround would be nice. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
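The proposed behavior is simple to state precisely: the framework's filter timeout wins when set, and the master flag supplies the fallback instead of the protobuf default. A sketch, where the flag name and function are invented for illustration:

```python
# Hypothetical sketch of the proposed resolution order. The flag name
# (--default_filter_timeout, here DEFAULT_FILTER_TIMEOUT_FLAG) and the
# helper are invented; only the 5-second protobuf default comes from the
# ticket.

DEFAULT_FILTER_TIMEOUT_FLAG = 60.0  # operator-configured, replaces the burned-in 5.0

def effective_filter_timeout(framework_timeout=None):
    """Use the framework's filter timeout if it set one, else the master flag."""
    if framework_timeout is not None:
        return framework_timeout
    return DEFAULT_FILTER_TIMEOUT_FLAG

# A framework that declines without a filter gets the operator's default.
assert effective_filter_timeout() == 60.0
# A framework that sets its own filter is unaffected.
assert effective_filter_timeout(10.0) == 10.0
```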
[jira] [Updated] (MESOS-7401) Optionally reject messages when the UPID does not match the IP.
[ https://issues.apache.org/jira/browse/MESOS-7401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7401: --- Issue Type: Improvement (was: Bug) > Optionally reject messages when the UPID does not match the IP. > > > Key: MESOS-7401 > URL: https://issues.apache.org/jira/browse/MESOS-7401 > Project: Mesos > Issue Type: Improvement > Components: libprocess >Reporter: James Peach >Assignee: James Peach >Priority: Minor > Fix For: 1.4.0 > > > {{libprocess}} does no validation of the peer UPID, so in some deployments it > is trivial to inject bogus messages and impersonate legitimate actors. If we > add a check to verify that messages are received from the same IP address as > the peer UPID claims to be using, we can increase the difficulty of UPID > spoofing, and mitigate this somewhat. > For compatibility, this has to be an optional setting and disabled by default. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
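A libprocess UPID has the shape "name@ip:port", so the proposed check reduces to comparing the IP the UPID claims against the IP the message actually arrived from. A sketch (the helper name is invented; this is not the libprocess implementation):

```python
# Illustrative sketch of the UPID-vs-peer-IP validation the ticket proposes.
# upid_matches_peer is a hypothetical helper, not a libprocess API.

def upid_matches_peer(upid, peer_ip):
    """True if the UPID's claimed IP equals the connection's peer IP."""
    try:
        _, address = upid.rsplit("@", 1)          # "name@ip:port" -> "ip:port"
        claimed_ip, _ = address.rsplit(":", 1)    # "ip:port" -> "ip"
    except ValueError:
        return False  # malformed UPID: reject
    return claimed_ip == peer_ip

# A message whose UPID claims the address it actually came from passes;
# a spoofed UPID from a different host is rejected.
assert upid_matches_peer("slave(1)@10.0.0.5:5051", "10.0.0.5")
assert not upid_matches_peer("slave(1)@10.0.0.5:5051", "192.168.1.9")
```

As the ticket notes, this raises the bar for spoofing rather than eliminating it (e.g. it does not help against an attacker sharing the legitimate actor's IP, such as behind NAT), which is one reason it is opt-in.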
[jira] [Updated] (MESOS-7540) Add an agent flag for executor re-registration timeout.
[ https://issues.apache.org/jira/browse/MESOS-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7540: --- Fix Version/s: 1.1.3 > Add an agent flag for executor re-registration timeout. > --- > > Key: MESOS-7540 > URL: https://issues.apache.org/jira/browse/MESOS-7540 > Project: Mesos > Issue Type: Improvement > Components: agent >Reporter: Greg Mann >Assignee: Greg Mann > Labels: mesosphere > Fix For: 1.2.2, 1.3.1, 1.4.0, 1.1.3 > > > Currently, the executor re-register timeout is hard-coded at 2 seconds. It > would be beneficial to allow operators to specify this value. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7569) Allow "old" executors with half-open connections to be preserved during agent upgrade / restart.
[ https://issues.apache.org/jira/browse/MESOS-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7569: --- Fix Version/s: 1.1.3 > Allow "old" executors with half-open connections to be preserved during agent > upgrade / restart. > > > Key: MESOS-7569 > URL: https://issues.apache.org/jira/browse/MESOS-7569 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler > Fix For: 1.2.2, 1.3.1, 1.4.0, 1.1.3 > > > Users who have executors in their cluster without the fix to MESOS-7057 will > experience these executors potentially being destroyed whenever the agent > restarts (or is upgraded). > This occurs when these old executors have connections idle for > 5 days > (default conntrack tcp timeout). At this point, the connection has timed out > and is no longer tracked by conntrack. From what we've seen, if the agent stays > up, the packets still flow between the executor and agent. However, once the > agent restarts, in some cases (presence of a DROP rule, or some flavors of > NATing), the executor does not receive the RST/FIN from the kernel and will > hold a half-open TCP connection. At this point, when the executor responds to > the reconnect message from the restarted agent, its half-open TCP connection > closes, and the executor will be destroyed by the agent. > In order to allow users to preserve the tasks running in these "old" > executors (i.e. without the MESOS-7057 fix), we can add *optional* retrying > of the reconnect message in the agent. This allows the old executor to > correctly establish a link to the agent when the second reconnect message is > handled. > Longer term, heartbeating or TCP keepalives will prevent the connections from > reaching the conntrack timeout (see MESOS-7568). -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7569) Allow "old" executors with half-open connections to be preserved during agent upgrade / restart.
[ https://issues.apache.org/jira/browse/MESOS-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7569: --- Fix Version/s: 1.2.2 > Allow "old" executors with half-open connections to be preserved during agent > upgrade / restart. > > > Key: MESOS-7569 > URL: https://issues.apache.org/jira/browse/MESOS-7569 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler > Fix For: 1.2.2, 1.3.1, 1.4.0 > > > Users who have executors in their cluster without the fix to MESOS-7057 will > experience these executors potentially being destroyed whenever the agent > restarts (or is upgraded). > This occurs when these old executors have connections idle for > 5 days > (default conntrack tcp timeout). At this point, the connection has timed out > and is no longer tracked by conntrack. From what we've seen, if the agent stays > up, the packets still flow between the executor and agent. However, once the > agent restarts, in some cases (presence of a DROP rule, or some flavors of > NATing), the executor does not receive the RST/FIN from the kernel and will > hold a half-open TCP connection. At this point, when the executor responds to > the reconnect message from the restarted agent, its half-open TCP connection > closes, and the executor will be destroyed by the agent. > In order to allow users to preserve the tasks running in these "old" > executors (i.e. without the MESOS-7057 fix), we can add *optional* retrying > of the reconnect message in the agent. This allows the old executor to > correctly establish a link to the agent when the second reconnect message is > handled. > Longer term, heartbeating or TCP keepalives will prevent the connections from > reaching the conntrack timeout (see MESOS-7568). -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7540) Add an agent flag for executor re-registration timeout.
[ https://issues.apache.org/jira/browse/MESOS-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7540: --- Fix Version/s: 1.2.2 > Add an agent flag for executor re-registration timeout. > --- > > Key: MESOS-7540 > URL: https://issues.apache.org/jira/browse/MESOS-7540 > Project: Mesos > Issue Type: Improvement > Components: agent >Reporter: Greg Mann >Assignee: Greg Mann > Labels: mesosphere > Fix For: 1.2.2, 1.3.1, 1.4.0 > > > Currently, the executor re-register timeout is hard-coded at 2 seconds. It > would be beneficial to allow operators to specify this value. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7540) Add an agent flag for executor re-registration timeout.
[ https://issues.apache.org/jira/browse/MESOS-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7540: --- Fix Version/s: 1.3.1 > Add an agent flag for executor re-registration timeout. > --- > > Key: MESOS-7540 > URL: https://issues.apache.org/jira/browse/MESOS-7540 > Project: Mesos > Issue Type: Improvement > Components: agent >Reporter: Greg Mann >Assignee: Greg Mann > Labels: mesosphere > Fix For: 1.3.1, 1.4.0 > > > Currently, the executor re-register timeout is hard-coded at 2 seconds. It > would be beneficial to allow operators to specify this value. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7569) Allow "old" executors with half-open connections to be preserved during agent upgrade / restart.
[ https://issues.apache.org/jira/browse/MESOS-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7569: --- Fix Version/s: 1.3.1 > Allow "old" executors with half-open connections to be preserved during agent > upgrade / restart. > > > Key: MESOS-7569 > URL: https://issues.apache.org/jira/browse/MESOS-7569 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler > Fix For: 1.3.1, 1.4.0 > > > Users who have executors in their cluster without the fix to MESOS-7057 will > experience these executors potentially being destroyed whenever the agent > restarts (or is upgraded). > This occurs when these old executors have connections idle for > 5 days > (default conntrack tcp timeout). At this point, the connection has timed out > and is no longer tracked by conntrack. From what we've seen, if the agent stays > up, the packets still flow between the executor and agent. However, once the > agent restarts, in some cases (presence of a DROP rule, or some flavors of > NATing), the executor does not receive the RST/FIN from the kernel and will > hold a half-open TCP connection. At this point, when the executor responds to > the reconnect message from the restarted agent, its half-open TCP connection > closes, and the executor will be destroyed by the agent. > In order to allow users to preserve the tasks running in these "old" > executors (i.e. without the MESOS-7057 fix), we can add *optional* retrying > of the reconnect message in the agent. This allows the old executor to > correctly establish a link to the agent when the second reconnect message is > handled. > Longer term, heartbeating or TCP keepalives will prevent the connections from > reaching the conntrack timeout (see MESOS-7568). -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7540) Add an agent flag for executor re-registration timeout.
[ https://issues.apache.org/jira/browse/MESOS-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7540: --- Summary: Add an agent flag for executor re-registration timeout. (was: Add an agent flag for executor re-register timeout) > Add an agent flag for executor re-registration timeout. > --- > > Key: MESOS-7540 > URL: https://issues.apache.org/jira/browse/MESOS-7540 > Project: Mesos > Issue Type: Improvement > Components: agent >Reporter: Greg Mann >Assignee: Greg Mann > Labels: mesosphere > Fix For: 1.4.0 > > > Currently, the executor re-register timeout is hard-coded at 2 seconds. It > would be beneficial to allow operators to specify this value. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (MESOS-7574) Allow reservations to multiple roles.
Benjamin Mahler created MESOS-7574: -- Summary: Allow reservations to multiple roles. Key: MESOS-7574 URL: https://issues.apache.org/jira/browse/MESOS-7574 Project: Mesos Issue Type: Improvement Reporter: Benjamin Mahler There have been some discussions about allowing reservations to multiple roles (or more generally, role expressions). E.g. All resources on GPU agents are reserved for "eng/machine-learning" or "finance/forecasting" or "data-science/modeling" to use, because these are the roles in my organization that make use of GPUs, and I want to guarantee that none of the non-GPU workloads tie up the GPU machines' cpus/mem/disk. This GPU related example would allow us to deprecate and remove the GPU_RESOURCES capability, which is a hack implementation of reservations to multiple roles. Mesos will only offer GPU machine resources to GPU capable schedulers. Having the ability to make reservations to multiple roles obviates this hack. With hierarchical roles, we have a restricted version of reservations to multiple roles, where the roles are restricted to the descendant roles. For example, a reservation for "gpu-workloads" can be allocated to "gpu-workloads/eng/image-processing", "gpu-workloads/data-science/modeling", "gpu-workloads/finance/forecasting", etc. What isn't achievable is a reservation to multiple roles across the tree, e.g. "eng/image-processing" OR "finance/forecasting" OR "data-science/modeling". This can get clumsy because if "eng/ML" wants to get in on the reserved gpus, the user would have to place a related role underneath the "gpu-workloads" role, e.g. "gpu-workloads/eng/ML". A similar use case has been that some agents are "public" and there are disparate roles in the organization that need access to these hosts, so we want to ensure that only these roles get access and no other roles can tie up the resources on these hosts. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
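The contrast the ticket draws between hierarchical roles and true multi-role reservations can be made concrete. A sketch of both matching rules (hypothetical helper names; not the allocator's code):

```python
# Illustrative sketch, not Mesos allocator code. Hierarchical roles allow a
# reservation to role R to be allocated to R and its descendants; the ticket
# asks for reservations to an arbitrary set of roles across the tree.

def covered_by(role, reserved):
    """Hierarchical rule: `role` is `reserved` itself or a descendant."""
    return role == reserved or role.startswith(reserved + "/")

def multi_role_covered(role, reserved_roles):
    """Proposed rule: the role may match any expression in the set."""
    return any(covered_by(role, r) for r in reserved_roles)

# Hierarchical roles: only descendants of "gpu-workloads" qualify...
assert covered_by("gpu-workloads/eng/image-processing", "gpu-workloads")
# ...so a role elsewhere in the tree cannot use the reservation.
assert not covered_by("eng/image-processing", "gpu-workloads")

# Multi-role reservations: disparate subtrees qualify directly.
gpu_roles = {"eng/image-processing", "finance/forecasting", "data-science/modeling"}
assert multi_role_covered("finance/forecasting", gpu_roles)
assert not multi_role_covered("web/frontend", gpu_roles)
```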
[jira] [Commented] (MESOS-5332) TASK_LOST on slave restart potentially due to executor race condition
[ https://issues.apache.org/jira/browse/MESOS-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025625#comment-16025625 ] Benjamin Mahler commented on MESOS-5332: In order to enable users who hit this situation to safely upgrade (without all >5 day idle connection executors being destroyed), we will introduce an optional retry of the reconnect message via MESOS-7569: https://reviews.apache.org/r/59584/ This will allow the preservation of executors without the relink (MESOS-7057) fix when upgrading an agent. Longer term, TCP keepalives or heartbeating will be put in place to avoid the connections timing out in conntrack. > TASK_LOST on slave restart potentially due to executor race condition > - > > Key: MESOS-5332 > URL: https://issues.apache.org/jira/browse/MESOS-5332 > Project: Mesos > Issue Type: Bug > Components: agent, libprocess >Affects Versions: 0.26.0 > Environment: Mesos 0.26 > Aurora 0.13 >Reporter: Stephan Erb >Assignee: Anand Mazumdar > Attachments: executor-logs.tar.gz, executor-stderr.log, > executor-stderrV2.log, mesos-slave.log > > > When restarting the Mesos agent binary, tasks can end up as LOST. We lose > from 20% to 50% of all tasks. 
They are killed by the Mesos agent via: > {code} > I0505 08:42:06.781318 21738 slave.cpp:2702] Cleaning up un-reregistered > executors > I0505 08:42:06.781366 21738 slave.cpp:2720] Killing un-reregistered executor > 'thermos-nobody-devel-service-28854-0-6a88d62e-656 > 4-4e33-b0bb-1d8039d97afc' of framework > 20151001-085346-58917130-5050-37976- at executor(1)@10.X.X.X:40541 > I0505 08:42:06.781446 21738 slave.cpp:2720] Killing un-reregistered executor > 'thermos-nobody-devel-service-23839-0-1d2cd0e6-699 > 4-4cba-a9df-3dfc1552667f' of framework > 20151001-085346-58917130-5050-37976- at executor(1)@10.X.X.X:35757 > I0505 08:42:06.781466 21738 slave.cpp:2720] Killing un-reregistered executor > 'thermos-nobody-devel-service-29970-0-478a7291-d070-4aa8 > -af21-6fda889f750c' of framework 20151001-085346-58917130-5050-37976- at > executor(1)@10.X.X.X:51463 > ... > I0505 08:42:06.781558 21738 slave.cpp:4230] Finished recovery > {code} > We have verified that the tasks and their executors are killed by the agent > during startup. When stopping the agent using supervisorctl stop, the > executors are still running (verified via {{ps aux}}). They are only killed > once the agent tries to reregister. > The issue is hard to reproduce: > * When restarting the agent binary multiple times, tasks are only lost for > the first restart. > * It is much more likely to occur if the agent binary has been running for a > longer period of time (> 7 days) > Mesos is correctly sticking to the 2 seconds wait time before killing > un-reregistered executors. The failed executors receive the reregistration > request, but it seems like they fail to send a reply. > A successful reregistration (not leading to LOST): > {code} > I0505 08:41:59.581231 21664 exec.cpp:456] Slave exited, but framework has > checkpointing enabled. 
Waiting 15mins to reconnect with slave > 20160118-141153-92471562-5050-6270-S17 > I0505 08:42:04.780591 21665 exec.cpp:256] Received reconnect request from > slave 20160118-141153-92471562-5050-6270-S17 > I0505 08:42:04.785297 21676 exec.cpp:233] Executor re-registered on slave > 20160118-141153-92471562-5050-6270-S17 > I0505 08:42:04.788579 21676 exec.cpp:245] Executor::reregistered took > 1.492339ms > {code} > A failed one: > {code} > I0505 08:42:04.779677 2389 exec.cpp:256] Received reconnect request from > slave 20160118-141153-92471562-5050-6270-S17 > E0505 08:42:05.481374 2408 process.cpp:1911] Failed to shutdown socket with > fd 11: Transport endpoint is not connected > I0505 08:42:05.481374 2395 exec.cpp:456] Slave exited, but framework has > checkpointing enabled. Waiting 15mins to reconnect with slave > 20160118-141153-92471562-5050-6270-S17 > {code} > All tasks ending up in LOST have output similar to the one posted above, > i.e. the log messages are in the wrong order. > Anyone have an idea what might be going on here? -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (MESOS-7569) Allow "old" executors with half-open connections to be preserved during agent upgrade / restart.
[ https://issues.apache.org/jira/browse/MESOS-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-7569: -- Assignee: Benjamin Mahler > Allow "old" executors with half-open connections to be preserved during agent > upgrade / restart. > > > Key: MESOS-7569 > URL: https://issues.apache.org/jira/browse/MESOS-7569 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler > > Users who have executors in their cluster without the fix to MESOS-7057 will > experience these executors potentially being destroyed whenever the agent > restarts (or is upgraded). > This occurs when these old executors have connections idle for > 5 days > (default conntrack tcp timeout). At this point, the connection has timed out > and is no longer tracked by conntrack. From what we've seen, if the agent stays > up, the packets still flow between the executor and agent. However, once the > agent restarts, in some cases (presence of a DROP rule, or some flavors of > NATing), the executor does not receive the RST/FIN from the kernel and will > hold a half-open TCP connection. At this point, when the executor responds to > the reconnect message from the restarted agent, its half-open TCP connection > closes, and the executor will be destroyed by the agent. > In order to allow users to preserve the tasks running in these "old" > executors (i.e. without the MESOS-7057 fix), we can add *optional* retrying > of the reconnect message in the agent. This allows the old executor to > correctly establish a link to the agent when the second reconnect message is > handled. > Longer term, heartbeating or TCP keepalives will prevent the connections from > reaching the conntrack timeout (see MESOS-7568). -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7569) Allow "old" executors with half-open connections to be preserved during agent upgrade / restart.
[ https://issues.apache.org/jira/browse/MESOS-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7569: --- Target Version/s: 1.2.2, 1.3.1, 1.4.0, 1.1.3 (was: 1.2.2, 1.3.1, 1.4.0) > Allow "old" executors with half-open connections to be preserved during agent > upgrade / restart. > > > Key: MESOS-7569 > URL: https://issues.apache.org/jira/browse/MESOS-7569 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Benjamin Mahler > > Users who have executors in their cluster without the fix to MESOS-7057 will > experience these executors potentially being destroyed whenever the agent > restarts (or is upgraded). > This occurs when these old executors have connections idle for > 5 days > (default conntrack tcp timeout). At this point, the connection has timed out > and is no longer tracked by conntrack. From what we've seen, if the agent stays > up, the packets still flow between the executor and agent. However, once the > agent restarts, in some cases (presence of a DROP rule, or some flavors of > NATing), the executor does not receive the RST/FIN from the kernel and will > hold a half-open TCP connection. At this point, when the executor responds to > the reconnect message from the restarted agent, its half-open TCP connection > closes, and the executor will be destroyed by the agent. > In order to allow users to preserve the tasks running in these "old" > executors (i.e. without the MESOS-7057 fix), we can add *optional* retrying > of the reconnect message in the agent. This allows the old executor to > correctly establish a link to the agent when the second reconnect message is > handled. > Longer term, heartbeating or TCP keepalives will prevent the connections from > reaching the conntrack timeout (see MESOS-7568). -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (MESOS-7569) Allow "old" executors with half-open connections to be preserved during agent upgrade / restart.
Benjamin Mahler created MESOS-7569: -- Summary: Allow "old" executors with half-open connections to be preserved during agent upgrade / restart. Key: MESOS-7569 URL: https://issues.apache.org/jira/browse/MESOS-7569 Project: Mesos Issue Type: Bug Components: agent Reporter: Benjamin Mahler Users who have executors in their cluster without the fix to MESOS-7057 will experience these executors potentially being destroyed whenever the agent restarts (or is upgraded). This occurs when these old executors have connections idle for > 5 days (default conntrack tcp timeout). At this point, the connection has timed out and is no longer tracked by conntrack. From what we've seen, if the agent stays up, the packets still flow between the executor and agent. However, once the agent restarts, in some cases (presence of a DROP rule, or some flavors of NATing), the executor does not receive the RST/FIN from the kernel and will hold a half-open TCP connection. At this point, when the executor responds to the reconnect message from the restarted agent, its half-open TCP connection closes, and the executor will be destroyed by the agent. In order to allow users to preserve the tasks running in these "old" executors (i.e. without the MESOS-7057 fix), we can add *optional* retrying of the reconnect message in the agent. This allows the old executor to correctly establish a link to the agent when the second reconnect message is handled. Longer term, heartbeating or TCP keepalives will prevent the connections from reaching the conntrack timeout (see MESOS-7568). -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-5361) Consider introducing TCP KeepAlive for Libprocess sockets.
[ https://issues.apache.org/jira/browse/MESOS-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025600#comment-16025600 ] Benjamin Mahler commented on MESOS-5361: Linking in the executor-related tickets that came up due to conntrack considering connections stale after 5 days. > Consider introducing TCP KeepAlive for Libprocess sockets. > -- > > Key: MESOS-5361 > URL: https://issues.apache.org/jira/browse/MESOS-5361 > Project: Mesos > Issue Type: Improvement > Components: libprocess >Reporter: Anand Mazumdar > Labels: mesosphere > > We currently don't use TCP keepalives when creating sockets in libprocess. > This might benefit master <-> scheduler and master <-> agent connections, i.e. we can > detect failures of either faster. > Currently, if the master process goes down and for some reason the {{RST}} > sequence did not reach the scheduler, the scheduler can only come to know > about the disconnection when it tries to do a {{send}} itself. > The default TCP keepalive values on Linux are of little use in a real world > application: > {code} > This means that the keepalive routines wait for two hours (7200 secs) > before sending the first keepalive probe, and then resend it every 75 > seconds. If no ACK response is received for nine consecutive times, the > connection is marked as broken. > {code} > However, for long-running instances of scheduler/agent this still can be > beneficial. Also, operators might start tuning the values for their clusters > explicitly once we start supporting it. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
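What "introducing TCP keepalives" amounts to at the socket level can be sketched directly (illustrative Python, not libprocess code; the 60/10/5 values are arbitrary examples of overriding the 7200s/75s/9-probe Linux defaults quoted in the ticket):

```python
import socket

# Sketch of enabling and tuning TCP keepalives on a socket.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# The per-connection tuning knobs are Linux-specific constants.
if hasattr(socket, "TCP_KEEPIDLE"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle seconds before the first probe
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # seconds between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # unanswered probes before the connection is declared dead

keepalive_enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
sock.close()
```

With settings like these, a peer that dies without sending RST/FIN is detected in minutes rather than only on the next write, and the periodic probes also keep conntrack entries from aging out, which ties this ticket to the 5-day idle-connection executor issues linked above.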
[jira] [Updated] (MESOS-7564) Introduce a heartbeat mechanism for v1 HTTP executor <-> agent communication.
[ https://issues.apache.org/jira/browse/MESOS-7564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7564: --- Summary: Introduce a heartbeat mechanism for v1 HTTP executor <-> agent communication. (was: Introduce a heartbeat mechanism for executor <-> agent communication.) > Introduce a heartbeat mechanism for v1 HTTP executor <-> agent communication. > - > > Key: MESOS-7564 > URL: https://issues.apache.org/jira/browse/MESOS-7564 > Project: Mesos > Issue Type: Bug >Reporter: Anand Mazumdar > > Currently, we do not have heartbeats for executor <-> agent communication. > This is especially problematic in scenarios when IPFilters are enabled since > the default conntrack keep alive timeout is 5 days. When that timeout > elapses, the executor doesn't get notified via a socket disconnection when > the agent process restarts. The executor would then get killed if it hasn't > re-registered by the time the agent recovery process completes. > Enabling application-level heartbeats or TCP keepalives is a possible > way to fix this issue. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7564) Introduce a heartbeat mechanism for executor <-> agent communication.
[ https://issues.apache.org/jira/browse/MESOS-7564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7564: --- Issue Type: Bug (was: Task) > Introduce a heartbeat mechanism for executor <-> agent communication. > - > > Key: MESOS-7564 > URL: https://issues.apache.org/jira/browse/MESOS-7564 > Project: Mesos > Issue Type: Bug >Reporter: Anand Mazumdar > > Currently, we do not have heartbeats for executor <-> agent communication. > This is especially problematic in scenarios when IPFilters are enabled since > the default conntrack keep alive timeout is 5 days. When that timeout > elapses, the executor doesn't get notified via a socket disconnection when > the agent process restarts. The executor would then get killed if it hasn't > re-registered by the time the agent recovery process completes. > Enabling application-level heartbeats or TCP keepalives is a possible > way to fix this issue. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (MESOS-7568) Introduce a heartbeat mechanism for v0 executor <-> agent links.
Benjamin Mahler created MESOS-7568: -- Summary: Introduce a heartbeat mechanism for v0 executor <-> agent links. Key: MESOS-7568 URL: https://issues.apache.org/jira/browse/MESOS-7568 Project: Mesos Issue Type: Bug Reporter: Anand Mazumdar Currently, we do not have heartbeats for executor <-> agent communication. This is especially problematic in scenarios where IPFilters are enabled, since the default conntrack keepalive timeout is 5 days. When that timeout elapses, the executor doesn't get notified via a socket disconnection when the agent process restarts. The executor would then get killed if it doesn't re-register once the agent recovery process completes. Enabling application-level heartbeats or TCP keepalives is a possible way to fix this issue. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
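The application-level heartbeat proposed here can be sketched as follows. This is a hypothetical illustration, not Mesos code: the class name and API are invented for this sketch. The agent would track the last time it heard from the executor link and declare the link dead after a timeout, independent of whether the kernel ever reports a socket disconnection.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical sketch of an application-level heartbeat monitor; the
// names are invented for illustration. Time points are passed in
// explicitly so the logic is deterministic and testable.
class HeartbeatMonitor {
public:
  using Clock = std::chrono::steady_clock;

  HeartbeatMonitor(std::chrono::seconds timeout, Clock::time_point start)
    : timeout_(timeout), lastSeen_(start) {}

  // Record a heartbeat received from the peer.
  void beat(Clock::time_point now) { lastSeen_ = now; }

  // True once the peer has been silent for longer than the timeout,
  // even if the TCP connection still looks healthy (e.g. a stale
  // conntrack entry keeping the link "alive" for days).
  bool expired(Clock::time_point now) const {
    return now - lastSeen_ > timeout_;
  }

private:
  std::chrono::seconds timeout_;
  Clock::time_point lastSeen_;
};
```

With this shape, either side can run the monitor: the executor heartbeats the agent, and the agent kills the link (rather than waiting on conntrack) when the monitor expires.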
[jira] [Commented] (MESOS-7468) Could not copy the sandbox path on WebUI
[ https://issues.apache.org/jira/browse/MESOS-7468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005725#comment-16005725 ] Benjamin Mahler commented on MESOS-7468: I gave a ship it with some comments. Do you know if there's a way to make the breadcrumb slash copy-able? > Could not copy the sandbox path on WebUI > - > > Key: MESOS-7468 > URL: https://issues.apache.org/jira/browse/MESOS-7468 > Project: Mesos > Issue Type: Bug > Components: webui >Reporter: haosdent >Assignee: haosdent >Priority: Minor > > I would get > {code} > var lib mesos slaves 08879b43-58c9-4db7-a93e-4873e35c8144-S1 frameworks > 1c092dff-e6d2-4537-a872-52752929ea7e- executors > test-copy.cfd4d72a-3397-11e7-8e73-02426ed45ffc runs > 3d8e16cb-f5c7-4580-952d-1a230943e154 > {code} > when I select texts in webui. > It is because the definition of breadcrumb in bootstrap is > {code} > .breadcrumb > li + li:before { > content: "/"; > } > {code} > So "/" would not be included when select and copy text -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (MESOS-5255) Add GPUs to container resource consumption metrics.
[ https://issues.apache.org/jira/browse/MESOS-5255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-5255: -- Assignee: (was: Chun-Hung Hsiao) > Add GPUs to container resource consumption metrics. > --- > > Key: MESOS-5255 > URL: https://issues.apache.org/jira/browse/MESOS-5255 > Project: Mesos > Issue Type: Task >Reporter: Kevin Klues > Labels: gpu > > Currently the usage callback in the Nvidia GPU isolator is unimplemented: > {noformat} > src/slave/containerizer/mesos/isolators/cgroups/devices/gpus/nvidia.cpp > {noformat} > It should use functionality from NVML to gather the current GPU usage and add > it to a ResourceStatistics object. It is still an open question as to exactly > what information we want to expose here (power, memory consumption, current > load, etc.). Whatever we decide on should be standard across different GPU > types, different GPU vendors, etc. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-7478) Pre-1.2.x master does not work with 1.2.x agent.
[ https://issues.apache.org/jira/browse/MESOS-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003380#comment-16003380 ] Benjamin Mahler commented on MESOS-7478: [~anandmazumdar] aside from my manual testing, I ran the upgrade script. It turns out it doesn't catch it because it upgrades the master first, then agents. Filed MESOS-7483. > Pre-1.2.x master does not work with 1.2.x agent. > > > Key: MESOS-7478 > URL: https://issues.apache.org/jira/browse/MESOS-7478 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Blocker > > [~evilezh] reported the following crash in the agent upon running a 1.1.0 > master against a 1.2.0 agent: > {noformat} > F0509 00:19:07.045413 3469 slave.cpp:4609] Check failed: > resource.has_allocation_info() > *** Check failure stack trace: *** > @ 0x7f4c4a4fa3cd google::LogMessage::Fail() > @ 0x7f4c4a4fc180 google::LogMessage::SendToLog() > @ 0x7f4c4a4f9fb3 google::LogMessage::Flush() > @ 0x7f4c4a4fcba9 google::LogMessageFatal::~LogMessageFatal() > @ 0x7f4c49b3bcf5 mesos::internal::slave::Slave::getExecutorInfo() > @ 0x7f4c49b3cf76 mesos::internal::slave::Slave::runTask() > @ 0x7f4c49b8832c ProtobufProcess<>::handler4<>() > @ 0x7f4c49b4dc06 std::_Function_handler<>::_M_invoke() > @ 0x7f4c49b6975a ProtobufProcess<>::visit() > @ 0x7f4c4a46c933 process::ProcessManager::resume() > @ 0x7f4c4a477537 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv > @ 0x7f4c486b8c80 (unknown) > @ 0x7f4c481d46ba start_thread > @ 0x7f4c47f0a82d (unknown) > Aborted (core dumped) > {noformat} > This appears to have been due to a lack of manual upgrade testing (we also > don't have any automated upgrade testing in place). > The check in {{getExecutorInfo(...)}} > [here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp#L4609] > crashes with an old master because it occurs before our injection in > {{run(...)}}. 
See the {{runTask(...)}} call into {{getExecutorInfo(...)}} > [here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp#L1556]. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-5935) Add upgrade testing to the ASF CI
[ https://issues.apache.org/jira/browse/MESOS-5935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003378#comment-16003378 ] Benjamin Mahler commented on MESOS-5935: [~vinodkone] I also filed a ticket to test the case where agents are upgraded first. Seems like we need an epic here. > Add upgrade testing to the ASF CI > - > > Key: MESOS-5935 > URL: https://issues.apache.org/jira/browse/MESOS-5935 > Project: Mesos > Issue Type: Improvement >Affects Versions: 1.0.0 >Reporter: Greg Mann > Labels: mesosphere > > We should add execution of the {{support/test-upgrade.py}} script to the ASF > CI runs. This will require having a build of a previous Mesos version to run > against latest master; perhaps we could cache builds of the last stable > release somewhere, which could be fetched and executed against CI builds. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (MESOS-7483) Update upgrade test script to also test when agents are upgraded first.
Benjamin Mahler created MESOS-7483: -- Summary: Update upgrade test script to also test when agents are upgraded first. Key: MESOS-7483 URL: https://issues.apache.org/jira/browse/MESOS-7483 Project: Mesos Issue Type: Improvement Components: test Reporter: Benjamin Mahler Currently the upgrade test only tries to upgrade masters first, e.g.
{noformat}
Running upgrade test from mesos 1.1.0 to mesos 1.2.1
+-----------+-------------+-------------+-------------+
| Test case | Framework   | Master      | Agent       |
+-----------+-------------+-------------+-------------+
|        #1 | mesos 1.1.0 | mesos 1.1.0 | mesos 1.1.0 |
|        #2 | mesos 1.1.0 | mesos 1.2.1 | mesos 1.1.0 |
|        #3 | mesos 1.1.0 | mesos 1.2.1 | mesos 1.2.1 |
|        #4 | mesos 1.2.1 | mesos 1.2.1 | mesos 1.2.1 |
+-----------+-------------+-------------+-------------+
{noformat}
We should also test the case where the agents are upgraded first. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
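The agent-first ordering requested here amounts to generating a second four-step matrix. A minimal sketch of both orderings (hypothetical; the real script is {{support/test-upgrade.py}} and this does not reproduce its code):

```cpp
#include <cassert>
#include <string>
#include <vector>

// One row of the upgrade-test table: which version each component runs.
struct Case {
  std::string framework, master, agent;
};

// Build the four-step upgrade matrix for one ordering. masterFirst =
// true mirrors the existing table; masterFirst = false is the
// agent-first ordering this ticket proposes to add.
std::vector<Case> upgradeMatrix(const std::string& oldV,
                                const std::string& newV,
                                bool masterFirst) {
  if (masterFirst) {
    return {{oldV, oldV, oldV},
            {oldV, newV, oldV},
            {oldV, newV, newV},
            {newV, newV, newV}};
  }
  return {{oldV, oldV, oldV},
          {oldV, oldV, newV},
          {oldV, newV, newV},
          {newV, newV, newV}};
}
```

Note that in the agent-first matrix, test case #2 runs a new agent against an old master, which is exactly the combination that would have caught MESOS-7478.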
[jira] [Commented] (MESOS-7478) Pre-1.2.x master does not work with 1.2.x agent.
[ https://issues.apache.org/jira/browse/MESOS-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003347#comment-16003347 ] Benjamin Mahler commented on MESOS-7478: [~anandmazumdar] yeah we need to do that, I synced with vinod about resurrecting this on CI: MESOS-5935 > Pre-1.2.x master does not work with 1.2.x agent. > > > Key: MESOS-7478 > URL: https://issues.apache.org/jira/browse/MESOS-7478 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Blocker > > [~evilezh] reported the following crash in the agent upon running a 1.1.0 > master against a 1.2.0 agent: > {noformat} > F0509 00:19:07.045413 3469 slave.cpp:4609] Check failed: > resource.has_allocation_info() > *** Check failure stack trace: *** > @ 0x7f4c4a4fa3cd google::LogMessage::Fail() > @ 0x7f4c4a4fc180 google::LogMessage::SendToLog() > @ 0x7f4c4a4f9fb3 google::LogMessage::Flush() > @ 0x7f4c4a4fcba9 google::LogMessageFatal::~LogMessageFatal() > @ 0x7f4c49b3bcf5 mesos::internal::slave::Slave::getExecutorInfo() > @ 0x7f4c49b3cf76 mesos::internal::slave::Slave::runTask() > @ 0x7f4c49b8832c ProtobufProcess<>::handler4<>() > @ 0x7f4c49b4dc06 std::_Function_handler<>::_M_invoke() > @ 0x7f4c49b6975a ProtobufProcess<>::visit() > @ 0x7f4c4a46c933 process::ProcessManager::resume() > @ 0x7f4c4a477537 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv > @ 0x7f4c486b8c80 (unknown) > @ 0x7f4c481d46ba start_thread > @ 0x7f4c47f0a82d (unknown) > Aborted (core dumped) > {noformat} > This appears to have been due to a lack of manual upgrade testing (we also > don't have any automated upgrade testing in place). > The check in {{getExecutorInfo(...)}} > [here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp#L4609] > crashes with an old master because it occurs before our injection in > {{run(...)}}. 
See the {{runTask(...)}} call into {{getExecutorInfo(...)}} > [here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp#L1556]. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-7478) Pre-1.2.x master does not work with 1.2.x agent.
[ https://issues.apache.org/jira/browse/MESOS-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003215#comment-16003215 ] Benjamin Mahler commented on MESOS-7478: Re ETA: Yes, I have a patch, just manually testing it before posting it. Re upgrade ordering: My understanding is that we've generally agreed to support 1.x master against 1.y agents, for all x and y. [~vinodkone] is that clearly documented? > Pre-1.2.x master does not work with 1.2.x agent. > > > Key: MESOS-7478 > URL: https://issues.apache.org/jira/browse/MESOS-7478 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Blocker > > [~evilezh] reported the following crash in the agent upon running a 1.1.0 > master against a 1.2.0 agent: > {noformat} > F0509 00:19:07.045413 3469 slave.cpp:4609] Check failed: > resource.has_allocation_info() > *** Check failure stack trace: *** > @ 0x7f4c4a4fa3cd google::LogMessage::Fail() > @ 0x7f4c4a4fc180 google::LogMessage::SendToLog() > @ 0x7f4c4a4f9fb3 google::LogMessage::Flush() > @ 0x7f4c4a4fcba9 google::LogMessageFatal::~LogMessageFatal() > @ 0x7f4c49b3bcf5 mesos::internal::slave::Slave::getExecutorInfo() > @ 0x7f4c49b3cf76 mesos::internal::slave::Slave::runTask() > @ 0x7f4c49b8832c ProtobufProcess<>::handler4<>() > @ 0x7f4c49b4dc06 std::_Function_handler<>::_M_invoke() > @ 0x7f4c49b6975a ProtobufProcess<>::visit() > @ 0x7f4c4a46c933 process::ProcessManager::resume() > @ 0x7f4c4a477537 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv > @ 0x7f4c486b8c80 (unknown) > @ 0x7f4c481d46ba start_thread > @ 0x7f4c47f0a82d (unknown) > Aborted (core dumped) > {noformat} > This appears to have been due to a lack of manual upgrade testing (we also > don't have any automated upgrade testing in place). 
> The check in {{getExecutorInfo(...)}} > [here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp#L4609] > crashes with an old master because it occurs before our injection in > {{run(...)}}. See the {{runTask(...)}} call into {{getExecutorInfo(...)}} > [here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp#L1556]. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (MESOS-7478) Pre-1.2.x master does not work with 1.2.x agent.
[ https://issues.apache.org/jira/browse/MESOS-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-7478: -- Assignee: Benjamin Mahler > Pre-1.2.x master does not work with 1.2.x agent. > > > Key: MESOS-7478 > URL: https://issues.apache.org/jira/browse/MESOS-7478 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Blocker > > [~evilezh] reported the following crash in the agent upon running a 1.1.0 > master against a 1.2.0 agent: > {noformat} > F0509 00:19:07.045413 3469 slave.cpp:4609] Check failed: > resource.has_allocation_info() > *** Check failure stack trace: *** > @ 0x7f4c4a4fa3cd google::LogMessage::Fail() > @ 0x7f4c4a4fc180 google::LogMessage::SendToLog() > @ 0x7f4c4a4f9fb3 google::LogMessage::Flush() > @ 0x7f4c4a4fcba9 google::LogMessageFatal::~LogMessageFatal() > @ 0x7f4c49b3bcf5 mesos::internal::slave::Slave::getExecutorInfo() > @ 0x7f4c49b3cf76 mesos::internal::slave::Slave::runTask() > @ 0x7f4c49b8832c ProtobufProcess<>::handler4<>() > @ 0x7f4c49b4dc06 std::_Function_handler<>::_M_invoke() > @ 0x7f4c49b6975a ProtobufProcess<>::visit() > @ 0x7f4c4a46c933 process::ProcessManager::resume() > @ 0x7f4c4a477537 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv > @ 0x7f4c486b8c80 (unknown) > @ 0x7f4c481d46ba start_thread > @ 0x7f4c47f0a82d (unknown) > Aborted (core dumped) > {noformat} > This appears to have been due to a lack of manual upgrade testing (we also > don't have any automated upgrade testing in place). > The check in {{getExecutorInfo(...)}} > [here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp#L4609] > crashes with an old master because it occurs before our injection in > {{run(...)}}. See the {{runTask(...)}} call into {{getExecutorInfo(...)}} > [here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp#L1556]. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (MESOS-7478) Pre-1.2.x master does not work with 1.2.x agent.
Benjamin Mahler created MESOS-7478: -- Summary: Pre-1.2.x master does not work with 1.2.x agent. Key: MESOS-7478 URL: https://issues.apache.org/jira/browse/MESOS-7478 Project: Mesos Issue Type: Bug Components: agent Reporter: Benjamin Mahler Priority: Blocker [~evilezh] reported the following crash in the agent upon running a 1.1.0 master against a 1.2.0 agent: {noformat} F0509 00:19:07.045413 3469 slave.cpp:4609] Check failed: resource.has_allocation_info() *** Check failure stack trace: *** @ 0x7f4c4a4fa3cd google::LogMessage::Fail() @ 0x7f4c4a4fc180 google::LogMessage::SendToLog() @ 0x7f4c4a4f9fb3 google::LogMessage::Flush() @ 0x7f4c4a4fcba9 google::LogMessageFatal::~LogMessageFatal() @ 0x7f4c49b3bcf5 mesos::internal::slave::Slave::getExecutorInfo() @ 0x7f4c49b3cf76 mesos::internal::slave::Slave::runTask() @ 0x7f4c49b8832c ProtobufProcess<>::handler4<>() @ 0x7f4c49b4dc06 std::_Function_handler<>::_M_invoke() @ 0x7f4c49b6975a ProtobufProcess<>::visit() @ 0x7f4c4a46c933 process::ProcessManager::resume() @ 0x7f4c4a477537 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv @ 0x7f4c486b8c80 (unknown) @ 0x7f4c481d46ba start_thread @ 0x7f4c47f0a82d (unknown) Aborted (core dumped) {noformat} This appears to have been due to a lack of manual upgrade testing (we don't have any automated upgrade testing in place). The check in {{getExecutorInfo(...)}} [here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp#L4609] crashes with an old master because it occurs before our injection in {{run(...)}}. See the {{runTask(...)}} call into {{getExecutorInfo(...)}} [here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp#L1556]. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
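The fix direction implied by the description (perform the injection before the check fires) can be illustrated with a toy version of the injection step. This is a hypothetical sketch, not the actual Mesos patch; {{Resource}} here is a stand-in struct, not the real protobuf:

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Stand-in for mesos::Resource; the real type is a protobuf message
// with an optional AllocationInfo submessage.
struct Resource {
  std::string name;
  std::optional<std::string> allocationRole;  // allocation_info.role
};

// Inject the framework's role wherever allocation info is absent --
// the kind of defaulting a 1.2.x agent needs to apply to resources
// sent by a pre-1.2.x master, before any code path (such as
// getExecutorInfo) asserts that allocation info is present.
void injectAllocationInfo(std::vector<Resource>& resources,
                          const std::string& frameworkRole) {
  for (Resource& resource : resources) {
    if (!resource.allocationRole) {
      resource.allocationRole = frameworkRole;
    }
  }
}
```

The ordering bug described above is then simply a matter of this injection running after, rather than before, the code that assumes it has already happened.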
[jira] [Updated] (MESOS-7478) Pre-1.2.x master does not work with 1.2.x agent.
[ https://issues.apache.org/jira/browse/MESOS-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7478: --- Description: [~evilezh] reported the following crash in the agent upon running a 1.1.0 master against a 1.2.0 agent: {noformat} F0509 00:19:07.045413 3469 slave.cpp:4609] Check failed: resource.has_allocation_info() *** Check failure stack trace: *** @ 0x7f4c4a4fa3cd google::LogMessage::Fail() @ 0x7f4c4a4fc180 google::LogMessage::SendToLog() @ 0x7f4c4a4f9fb3 google::LogMessage::Flush() @ 0x7f4c4a4fcba9 google::LogMessageFatal::~LogMessageFatal() @ 0x7f4c49b3bcf5 mesos::internal::slave::Slave::getExecutorInfo() @ 0x7f4c49b3cf76 mesos::internal::slave::Slave::runTask() @ 0x7f4c49b8832c ProtobufProcess<>::handler4<>() @ 0x7f4c49b4dc06 std::_Function_handler<>::_M_invoke() @ 0x7f4c49b6975a ProtobufProcess<>::visit() @ 0x7f4c4a46c933 process::ProcessManager::resume() @ 0x7f4c4a477537 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv @ 0x7f4c486b8c80 (unknown) @ 0x7f4c481d46ba start_thread @ 0x7f4c47f0a82d (unknown) Aborted (core dumped) {noformat} This appears to have been due to a lack of manual upgrade testing (we also don't have any automated upgrade testing in place). The check in {{getExecutorInfo(...)}} [here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp#L4609] crashes with an old master because it occurs before our injection in {{run(...)}}. See the {{runTask(...)}} call into {{getExecutorInfo(...)}} [here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp#L1556]. 
was: [~evilezh] reported the following crash in the agent upon running a 1.1.0 master against a 1.2.0 agent: {noformat} F0509 00:19:07.045413 3469 slave.cpp:4609] Check failed: resource.has_allocation_info() *** Check failure stack trace: *** @ 0x7f4c4a4fa3cd google::LogMessage::Fail() @ 0x7f4c4a4fc180 google::LogMessage::SendToLog() @ 0x7f4c4a4f9fb3 google::LogMessage::Flush() @ 0x7f4c4a4fcba9 google::LogMessageFatal::~LogMessageFatal() @ 0x7f4c49b3bcf5 mesos::internal::slave::Slave::getExecutorInfo() @ 0x7f4c49b3cf76 mesos::internal::slave::Slave::runTask() @ 0x7f4c49b8832c ProtobufProcess<>::handler4<>() @ 0x7f4c49b4dc06 std::_Function_handler<>::_M_invoke() @ 0x7f4c49b6975a ProtobufProcess<>::visit() @ 0x7f4c4a46c933 process::ProcessManager::resume() @ 0x7f4c4a477537 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv @ 0x7f4c486b8c80 (unknown) @ 0x7f4c481d46ba start_thread @ 0x7f4c47f0a82d (unknown) Aborted (core dumped) {noformat} This appears to have been due to a lack of manual upgrade testing (we don't have any automated upgrade testing in place). The check in {{getExecutorInfo(...)}} [here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp#L4609] crashes with an old master because it occurs before our injection in {{run(...)}}. See the {{runTask(...)}} call into {{getExecutorInfo(...)}} [here|https://github.com/apache/mesos/blob/1.2.0/src/slave/slave.cpp#L1556]. > Pre-1.2.x master does not work with 1.2.x agent. 
> > > Key: MESOS-7478 > URL: https://issues.apache.org/jira/browse/MESOS-7478 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Benjamin Mahler >Priority: Blocker > > [~evilezh] reported the following crash in the agent upon running a 1.1.0 > master against a 1.2.0 agent: > {noformat} > F0509 00:19:07.045413 3469 slave.cpp:4609] Check failed: > resource.has_allocation_info() > *** Check failure stack trace: *** > @ 0x7f4c4a4fa3cd google::LogMessage::Fail() > @ 0x7f4c4a4fc180 google::LogMessage::SendToLog() > @ 0x7f4c4a4f9fb3 google::LogMessage::Flush() > @ 0x7f4c4a4fcba9 google::LogMessageFatal::~LogMessageFatal() > @ 0x7f4c49b3bcf5 mesos::internal::slave::Slave::getExecutorInfo() > @ 0x7f4c49b3cf76 mesos::internal::slave::Slave::runTask() > @ 0x7f4c49b8832c ProtobufProcess<>::handler4<>() > @ 0x7f4c49b4dc06 std::_Function_handler<>::_M_invoke() > @ 0x7f4c49b6975a ProtobufProcess<>::visit() > @ 0x7f4c4a46c933 process::ProcessManager::resume() > @ 0x7f4c4a477537 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv > @ 0x7f4c486b8c80 (unknown) > @ 0x7f4c481d46ba start_thread > @ 0x7f4c47f0a82d (unknown) > Aborted (core dumped) > {noformat} > This appears to have been due to a lack of manual
[jira] [Created] (MESOS-7460) UpdateFrameworkMessage may send a Framework role(s) change to a non-MULTI_ROLE agent.
Benjamin Mahler created MESOS-7460: -- Summary: UpdateFrameworkMessage may send a Framework role(s) change to a non-MULTI_ROLE agent. Key: MESOS-7460 URL: https://issues.apache.org/jira/browse/MESOS-7460 Project: Mesos Issue Type: Bug Components: master Reporter: Benjamin Mahler Assignee: Michael Park Priority: Blocker When a framework is MULTI_ROLE capable, if the framework was previously running tasks on an old agent (non-MULTI_ROLE capable), the master *must* ensure the UpdateFramework message sent to this old agent preserves the framework's original role. Otherwise the agent will interpret the role to have changed, which can break things (e.g. not locate the reservations, volumes, etc). In addition, a framework without MULTI_ROLE has the ability to change their role. We'll need to change this to ensure that the {{role}} field is immutable and frameworks need to use the {{roles}} field with the MULTI_ROLE capability if they want to change their role. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
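The invariant in the description can be sketched as a small decision function. The names here are hypothetical; the real logic lives in the master's construction of UpdateFrameworkMessage:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Decide which role an UpdateFrameworkMessage should carry for a given
// agent. A non-MULTI_ROLE agent keys reservations and volumes on the
// role it originally saw, so it must keep seeing that role even if the
// framework has since changed roles via the MULTI_ROLE path.
std::string roleForAgent(const std::string& originalRole,
                         const std::vector<std::string>& currentRoles,
                         bool agentIsMultiRoleCapable) {
  if (!agentIsMultiRoleCapable) {
    return originalRole;  // Preserve; never send a changed role.
  }
  // MULTI_ROLE-capable agents can receive the current roles; this
  // sketch just returns the first one for brevity.
  return currentRoles.empty() ? originalRole : currentRoles.front();
}
```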
[jira] [Updated] (MESOS-7260) Authorization for `/role` endpoint should take both VIEW_ROLES and VIEW_FRAMEWORKS into account.
[ https://issues.apache.org/jira/browse/MESOS-7260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7260: --- Shepherd: Adam B To confirm, [~arojas] and [~adam-mesos] can you guys review / shepherd this? > Authorization for `/role` endpoint should take both VIEW_ROLES and > VIEW_FRAMEWORKS into account. > > > Key: MESOS-7260 > URL: https://issues.apache.org/jira/browse/MESOS-7260 > Project: Mesos > Issue Type: Bug > Components: HTTP API, master >Reporter: Jay Guo >Assignee: Jay Guo > > Consider following case: both {{framework1}} and {{framework2}} subscribe to > {{roleX}}, {{principal}} is allowed to view {{roleX}} and {{framework1}}, but > *NOT* {{framework2}}, therefore, {{/role}} endpoint should only contain > {{framework1}}, but not both frameworks. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-6441) Display reservations in the agent page in the webui.
[ https://issues.apache.org/jira/browse/MESOS-6441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-6441: --- Description: We currently do not display the reservations present on an agent in the webui. It would be nice to see this information. It would also be nice to update the resource statistics tables to make the distinction between unreserved and reserved resources. E.g. Reserved: Used, Allocated, Available and Total Unreserved: Used, Allocated, Available and Total was: We currently do not display the reservations present on an agent in the webui. It would be nice to see this information. It would also be nice to update the resource statistics tables to make the distinction between unreserved and reserved resources. > Display reservations in the agent page in the webui. > > > Key: MESOS-6441 > URL: https://issues.apache.org/jira/browse/MESOS-6441 > Project: Mesos > Issue Type: Task > Components: webui >Reporter: Benjamin Mahler > > We currently do not display the reservations present on an agent in the > webui. It would be nice to see this information. > It would also be nice to update the resource statistics tables to make the > distinction between unreserved and reserved resources. E.g. > Reserved: > Used, Allocated, Available and Total > Unreserved: > Used, Allocated, Available and Total -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (MESOS-7438) Double free or corruption when using parallel test runner
Benjamin Mahler created MESOS-7438: -- Summary: Double free or corruption when using parallel test runner Key: MESOS-7438 URL: https://issues.apache.org/jira/browse/MESOS-7438 Project: Mesos Issue Type: Bug Components: technical debt, test Reporter: Benjamin Mahler I observed the following when using the parallel test runner: {noformat} /home/bmahler/git/mesos/build/../support/mesos-gtest-runner.py --sequential=*ROOT_* ./mesos-tests .. *** Error in `/home/bmahler/git/mesos/build/src/.libs/mesos-tests': double free or corruption (out): 0x7fa818001310 *** === Backtrace: = /usr/lib64/libc.so.6(+0x7c503)[0x7fa87f27e503] /usr/lib64/libsasl2.so.3(+0x866d)[0x7fa880f0d66d] /usr/lib64/libsasl2.so.3(sasl_dispose+0x3b)[0x7fa880f1075b] /home/bmahler/git/mesos/build/src/.libs/libmesos-1.3.0.so(_ZN5mesos8internal8cram_md527CRAMMD5AuthenticateeProcessD1Ev+0x5d)[0x7fa88708f67d] /home/bmahler/git/mesos/build/src/.libs/libmesos-1.3.0.so(_ZN5mesos8internal8cram_md527CRAMMD5AuthenticateeProcessD0Ev+0x18)[0x7fa88708f734] /home/bmahler/git/mesos/build/src/.libs/libmesos-1.3.0.so(_ZN5mesos8internal8cram_md520CRAMMD5AuthenticateeD1Ev+0xfb)[0x7fa88708a065] /home/bmahler/git/mesos/build/src/.libs/libmesos-1.3.0.so(_ZN5mesos8internal8cram_md520CRAMMD5AuthenticateeD0Ev+0x18)[0x7fa88708a0b4] /home/bmahler/git/mesos/build/src/.libs/libmesos-1.3.0.so(_ZN5mesos8internal5slave5Slave13_authenticateEv+0x67)[0x7fa8879ff579] /home/bmahler/git/mesos/build/src/.libs/libmesos-1.3.0.so(_ZZN7process8dispatchIN5mesos8internal5slave5SlaveEEEvRKNS_3PIDIT_EEMS6_FvvEENKUlPNS_11ProcessBaseEE_clESD_+0xe2)[0x7fa887a60b7a] /home/bmahler/git/mesos/build/src/.libs/libmesos-1.3.0.so(_ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal5slave5SlaveEEEvRKNS0_3PIDIT_EEMSA_FvvEEUlS2_E_E9_M_invokeERKSt9_Any_dataS2_+0x37)[0x7fa887aa0efe] /home/bmahler/git/mesos/build/src/.libs/libmesos-1.3.0.so(_ZNKSt8functionIFvPN7process11ProcessBaseEEEclES2_+0x49)[0x7fad1177] 
/home/bmahler/git/mesos/build/src/.libs/libmesos-1.3.0.so(_ZN7process11ProcessBase5visitERKNS_13DispatchEventE+0x2f)[0x7fab5063] /home/bmahler/git/mesos/build/src/.libs/libmesos-1.3.0.so(_ZNK7process13DispatchEvent5visitEPNS_12EventVisitorE+0x2e)[0x7fac0422] /home/bmahler/git/mesos/build/src/.libs/mesos-tests(_ZN7process11ProcessBase5serveERKNS_5EventE+0x2e)[0xb088c8] /home/bmahler/git/mesos/build/src/.libs/libmesos-1.3.0.so(_ZN7process14ProcessManager6resumeEPNS_11ProcessBaseE+0x525)[0x7fab10d5] /home/bmahler/git/mesos/build/src/.libs/libmesos-1.3.0.so(+0x5f1a880)[0x7faad880] /home/bmahler/git/mesos/build/src/.libs/libmesos-1.3.0.so(+0x5f2ca8a)[0x7fabfa8a] /home/bmahler/git/mesos/build/src/.libs/libmesos-1.3.0.so(+0x5f2c9ce)[0x7fabf9ce] /home/bmahler/git/mesos/build/src/.libs/libmesos-1.3.0.so(+0x5f2c958)[0x7fabf958] /usr/lib64/libstdc++.so.6(+0xb5230)[0x7fa87fb90230] /usr/lib64/libpthread.so.0(+0x7dc5)[0x7fa88040ddc5] /usr/lib64/libc.so.6(clone+0x6d)[0x7fa87f2f973d] {noformat} Not sure how reproducible this is, appears to occur in the authentication path of the agent. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Comment Edited] (MESOS-6345) ExamplesTest.PersistentVolumeFramework failing due to double free corruption on Ubuntu 14.04
[ https://issues.apache.org/jira/browse/MESOS-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559169#comment-15559169 ] Benjamin Mahler edited comment on MESOS-6345 at 4/28/17 10:38 PM: -- {noformat} [04:56:48] : [Step 10/10] [ RUN ] ExamplesTest.PersistentVolumeFramework [04:56:48]W: [Step 10/10] I1008 04:56:48.212661 25257 master.cpp:1097] Master terminating [04:56:48]W: [Step 10/10] I1008 04:56:48.212674 25254 status_update_manager.cpp:395] Received status update acknowledgement (UUID: 542b14f7-bfc9-4be3-81b4-c23a1da9ecb5) for task 2 of framework 84cfc7a4-ad66-4f0d-965c-33ff6093ef32- [04:56:48]W: [Step 10/10] I1008 04:56:48.212709 25254 status_update_manager.cpp:531] Cleaning up status update stream for task 2 of framework 84cfc7a4-ad66-4f0d-965c-33ff6093ef32- [04:56:48]W: [Step 10/10] I1008 04:56:48.212712 25257 master.cpp:7725] Removing executor 'default' with resources {} of framework 84cfc7a4-ad66-4f0d-965c-33ff6093ef32- on agent 84cfc7a4-ad66-4f0d-965c-33ff6093ef32-S2 at slave(1)@172.30.2.21:52703 (ip-172-30-2-21.mesosphere.io) [04:56:48]W: [Step 10/10] I1008 04:56:48.212767 25254 slave.cpp:2953] Status update manager successfully handled status update acknowledgement (UUID: 542b14f7-bfc9-4be3-81b4-c23a1da9ecb5) for task 2 of framework 84cfc7a4-ad66-4f0d-965c-33ff6093ef32- [04:56:48]W: [Step 10/10] I1008 04:56:48.212782 25254 slave.cpp:6543] Completing task 2 [04:56:48]W: [Step 10/10] I1008 04:56:48.212792 25258 hierarchical.cpp:517] Removed agent 84cfc7a4-ad66-4f0d-965c-33ff6093ef32-S2 [04:56:48]W: [Step 10/10] I1008 04:56:48.212829 25257 master.cpp:7696] Removing task 3 with resources cpus(*):1; mem(*):128 of framework 84cfc7a4-ad66-4f0d-965c-33ff6093ef32- on agent 84cfc7a4-ad66-4f0d-965c-33ff6093ef32-S1 at slave(3)@172.30.2.21:52703 (ip-172-30-2-21.mesosphere.io) [04:56:48]W: [Step 10/10] I1008 04:56:48.212888 25257 master.cpp:7725] Removing executor 'default' with resources {} of framework 
84cfc7a4-ad66-4f0d-965c-33ff6093ef32- on agent 84cfc7a4-ad66-4f0d-965c-33ff6093ef32-S1 at slave(3)@172.30.2.21:52703 (ip-172-30-2-21.mesosphere.io) [04:56:48]W: [Step 10/10] I1008 04:56:48.212915 25258 hierarchical.cpp:517] Removed agent 84cfc7a4-ad66-4f0d-965c-33ff6093ef32-S1 [04:56:48]W: [Step 10/10] I1008 04:56:48.213017 25257 master.cpp:7725] Removing executor 'default' with resources {} of framework 84cfc7a4-ad66-4f0d-965c-33ff6093ef32- on agent 84cfc7a4-ad66-4f0d-965c-33ff6093ef32-S0 at slave(2)@172.30.2.21:52703 (ip-172-30-2-21.mesosphere.io) [04:56:48]W: [Step 10/10] I1008 04:56:48.213102 25254 hierarchical.cpp:517] Removed agent 84cfc7a4-ad66-4f0d-965c-33ff6093ef32-S0 [04:56:48]W: [Step 10/10] I1008 04:56:48.213281 25251 hierarchical.cpp:337] Removed framework 84cfc7a4-ad66-4f0d-965c-33ff6093ef32- [04:56:48]W: [Step 10/10] I1008 04:56:48.213404 25257 slave.cpp:4174] Got exited event for master@172.30.2.21:52703 [04:56:48]W: [Step 10/10] I1008 04:56:48.213418 25253 slave.cpp:4174] Got exited event for master@172.30.2.21:52703 [04:56:48]W: [Step 10/10] W1008 04:56:48.213426 25257 slave.cpp:4179] Master disconnected! Waiting for a new master to be elected [04:56:48]W: [Step 10/10] W1008 04:56:48.213433 25253 slave.cpp:4179] Master disconnected! Waiting for a new master to be elected [04:56:48]W: [Step 10/10] I1008 04:56:48.213407 25254 slave.cpp:4174] Got exited event for master@172.30.2.21:52703 [04:56:48]W: [Step 10/10] W1008 04:56:48.213448 25254 slave.cpp:4179] Master disconnected! 
Waiting for a new master to be elected [04:56:48]W: [Step 10/10] I1008 04:56:48.214047 25254 slave.cpp:787] Agent terminating [04:56:48]W: [Step 10/10] I1008 04:56:48.214068 25254 slave.cpp:2506] Asked to shut down framework 84cfc7a4-ad66-4f0d-965c-33ff6093ef32- by @0.0.0.0:0 [04:56:48]W: [Step 10/10] I1008 04:56:48.214076 25254 slave.cpp:2531] Shutting down framework 84cfc7a4-ad66-4f0d-965c-33ff6093ef32- [04:56:48]W: [Step 10/10] I1008 04:56:48.214083 25254 slave.cpp:4855] Shutting down executor 'default' of framework 84cfc7a4-ad66-4f0d-965c-33ff6093ef32- (via HTTP) [04:56:48]W: [Step 10/10] E1008 04:56:48.215160 25384 executor.cpp:681] End-Of-File received from agent. The agent closed the event stream [04:56:48]W: [Step 10/10] I1008 04:56:48.215250 25254 slave.cpp:787] Agent terminating [04:56:48]W: [Step 10/10] I1008 04:56:48.215266 25254 slave.cpp:2506] Asked to shut down framework 84cfc7a4-ad66-4f0d-965c-33ff6093ef32- by @0.0.0.0:0 [04:56:48]W: [Step 10/10] I1008 04:56:48.215279 25254 slave.cpp:2531] Shutting down framework 84cfc7a4-ad66-4f0d-965c-33ff6093ef32- [04:56:48]W: [Step 10/10] I1008 04:56:48.215291 25254 slave.cpp:4855] Shutting down executor 'default' of framework
[jira] [Assigned] (MESOS-7430) Per-role Suppress call implementation is broken.
[ https://issues.apache.org/jira/browse/MESOS-7430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-7430: -- Assignee: Benjamin Mahler > Per-role Suppress call implementation is broken. > > > Key: MESOS-7430 > URL: https://issues.apache.org/jira/browse/MESOS-7430 > Project: Mesos > Issue Type: Bug > Components: allocation, master >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Blocker > > The per-role Suppress call implementation is broken currently in the > allocator, since it still uses a global 'suppress' bit for the framework. > Before fixing, we should discuss whether we want keep role within Suppress > (it hasn't been released yet), or add calls to move towards consistent > naming, e.g. {{Call::ActivateRole}} / {{Call::DeactivateRole}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-5967) Add support for 'docker image inspect' in our docker abstraction.
[ https://issues.apache.org/jira/browse/MESOS-5967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-5967: --- Target Version/s: (was: 1.4.0) > Add support for 'docker image inspect' in our docker abstraction. > - > > Key: MESOS-5967 > URL: https://issues.apache.org/jira/browse/MESOS-5967 > Project: Mesos > Issue Type: Improvement > Components: containerization, docker >Reporter: Kevin Klues >Assignee: Guangya Liu > Labels: gpu > > Docker's command line tool for {{docker inspect}} can take either a > {{container}}, an {{image}}, or a {{task}} as its argument, and return a JSON > array containing low-level information about that container, image or task. > However, the current {{docker inspect}} support in our docker abstraction > only supports inspecting containers (not images or tasks). We should expand > this to (at least) support images. > In particular, this additional functionality is motivated by the upcoming GPU > support, which needs to inspect the labels in a docker image to decide if it > should inject the required Nvidia volumes into a container. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (MESOS-7430) Per-role Suppress call implementation is broken.
Benjamin Mahler created MESOS-7430: -- Summary: Per-role Suppress call implementation is broken. Key: MESOS-7430 URL: https://issues.apache.org/jira/browse/MESOS-7430 Project: Mesos Issue Type: Bug Components: allocation, master Reporter: Benjamin Mahler Priority: Blocker The per-role Suppress call implementation is currently broken in the allocator, since it still uses a global 'suppress' bit for the framework. Before fixing, we should discuss whether we want to keep role within Suppress (it hasn't been released yet), or add calls to move towards consistent naming, e.g. {{Call::ActivateRole}} / {{Call::DeactivateRole}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7401) Optionally reject messages when the UPID does not match the IP.
[ https://issues.apache.org/jira/browse/MESOS-7401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7401: --- Shepherd: Benjamin Mahler > Optionally reject messages when UPIDs does not match IP. > > > Key: MESOS-7401 > URL: https://issues.apache.org/jira/browse/MESOS-7401 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: James Peach >Assignee: James Peach >Priority: Minor > > {{libprocess}} does no validation of the peer UPID so in some deployments it > is trivial to inject bogus messages and impersonate legitimate actors. If we > add a check to verify that messages are received from the same IP address as > the peer UPID claims to be using, we can increase the difficulty of UPID > spoofing, and mitigate this somewhat. > For compatibility, this has to be an optional setting and disabled by default. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7401) Optionally reject messages when the UPID does not match the IP.
[ https://issues.apache.org/jira/browse/MESOS-7401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7401: --- Summary: Optionally reject messages when UPIDs does not match IP. (was: Optionally pin UPIDs to their IP address.) > Optionally reject messages when UPIDs does not match IP. > > > Key: MESOS-7401 > URL: https://issues.apache.org/jira/browse/MESOS-7401 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: James Peach >Assignee: James Peach >Priority: Minor > > {{libprocess}} does no validation of the peer UPID so in some deployments it > is trivial to inject bogus messages and impersonate legitimate actors. If we > add a check to verify that messages are received from the same IP address as > the peer UPID claims to be using, we can increase the difficulty of UPID > spoofing, and mitigate this somewhat. > For compatibility, this has to be an optional setting and disabled by default. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-6441) Display reservations in the agent page in the webui.
[ https://issues.apache.org/jira/browse/MESOS-6441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-6441: --- Description: We currently do not display the reservations present on an agent in the webui. It would be nice to see this information. It would also be nice to update the resource statistics tables to make the distinction between unreserved and reserved resources. was:We currently do not display the reservations present on an agent in the webui. It would be nice to see this information. > Display reservations in the agent page in the webui. > > > Key: MESOS-6441 > URL: https://issues.apache.org/jira/browse/MESOS-6441 > Project: Mesos > Issue Type: Task > Components: webui >Reporter: Benjamin Mahler > > We currently do not display the reservations present on an agent in the > webui. It would be nice to see this information. > It would also be nice to update the resource statistics tables to make the > distinction between unreserved and reserved resources. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-6447) Display role weight / role quota information in the webui.
[ https://issues.apache.org/jira/browse/MESOS-6447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15981989#comment-15981989 ] Benjamin Mahler commented on MESOS-6447: This was resolved via MESOS-6995. There is now a tab for viewing roles. > Display role weight / role quota information in the webui. > -- > > Key: MESOS-6447 > URL: https://issues.apache.org/jira/browse/MESOS-6447 > Project: Mesos > Issue Type: Task > Components: webui >Reporter: Benjamin Mahler > Fix For: 1.3.0 > > > The webui does not display role weight and role quotas. It would be nice to > have this information visible to users of the webui. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-7136) Eliminate fair sharing between frameworks within a role.
[ https://issues.apache.org/jira/browse/MESOS-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15981959#comment-15981959 ] Benjamin Mahler commented on MESOS-7136: [~qianzhang] hm.. not sure I follow the difficulty in figuring out which one replies first? I would hope to avoid the implicit leaf roles, since it's difficult to configure and view (need to know framework ids, and these can keep changing if frameworks complete / if new frameworks arrive). > Eliminate fair sharing between frameworks within a role. > > > Key: MESOS-7136 > URL: https://issues.apache.org/jira/browse/MESOS-7136 > Project: Mesos > Issue Type: Epic > Components: allocation, technical debt >Reporter: Benjamin Mahler > Labels: multi-tenancy > > The current fair sharing algorithm performs fair sharing between frameworks > within a role. This is equivalent to having the framework id behave as a > pseudo-role beneath the role. Consider the case where there are two spark > frameworks running within the same "spark" role. This behaves similarly to > hierarchical roles with the framework ID acting as an implicit role: > {noformat} > ^ >/ \ > spark services > ^ > / \ > / \ > FrameworkId1 FrameworkId2 > (fixed weight of 1)(fixed weight of 1) > {noformat} > Unfortunately, the frameworks cannot change their weight to be a value other > than 1 (see MESOS-6247) and they cannot set quota. > With the addition of hierarchical roles (see MESOS-6375) we can eliminate the > notion of the framework ID acting as a pseudo-role in favor of explicitly > using hierarchical roles. E.g. > {noformat} > ^ >/ \ > engsales > ^ > / \ > analytics ui > ^ >/ \ >learning reports > {noformat} > Here if two frameworks run within the eng/analytics role, then they will > compete for resources without fair sharing. However, if resource guarantees > are required, sub-roles can be created explicitly, e.g. > eng/analytics/learning and eng/analytics/reports. These roles can be given > weights and quota. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-7136) Eliminate fair sharing between frameworks within a role.
[ https://issues.apache.org/jira/browse/MESOS-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15975879#comment-15975879 ] Benjamin Mahler commented on MESOS-7136: [~qianzhang] yes, the plan would be to remove the special casing of framework id as a pseudo-role, so there would no longer be the additional sorters. To clarify what I meant by compete, the resources would be offered to all participants within a role and the first one to take it wins. If users wish to have constraints among frameworks using a role, they can use sub-roles identifying their frameworks, which achieves the existing behavior. But, you also get the added benefit of using actual roles, so you can change weights, set quota, etc. > Eliminate fair sharing between frameworks within a role. > > > Key: MESOS-7136 > URL: https://issues.apache.org/jira/browse/MESOS-7136 > Project: Mesos > Issue Type: Epic > Components: allocation, technical debt >Reporter: Benjamin Mahler > Labels: multi-tenancy > > The current fair sharing algorithm performs fair sharing between frameworks > within a role. This is equivalent to having the framework id behave as a > pseudo-role beneath the role. Consider the case where there are two spark > frameworks running within the same "spark" role. This behaves similarly to > hierarchical roles with the framework ID acting as an implicit role: > {noformat} > ^ >/ \ > spark services > ^ > / \ > / \ > FrameworkId1 FrameworkId2 > (fixed weight of 1)(fixed weight of 1) > {noformat} > Unfortunately, the frameworks cannot change their weight to be a value other > than 1 (see MESOS-6247) and they cannot set quota. > With the addition of hierarchical roles (see MESOS-6375) we can eliminate the > notion of the framework ID acting as a pseudo-role in favor of explicitly > using hierarchical roles. E.g. 
> {noformat} > ^ >/ \ > engsales > ^ > / \ > analytics ui > ^ >/ \ >learning reports > {noformat} > Here if two frameworks run within the eng/analytics role, then they will > compete for resources without fair sharing. However, if resource guarantees > are required, sub-roles can be created explicitly, e.g. > eng/analytics/learning and eng/analytics/reports. These roles can be given > weights and quota. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7376) Reduce copying of the Registry to improve Registrar performance.
[ https://issues.apache.org/jira/browse/MESOS-7376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7376: --- Summary: Reduce copying of the Registry to improve Registrar performance. (was: Long registry updates when the number of agents is high) > Reduce copying of the Registry to improve Registrar performance. > > > Key: MESOS-7376 > URL: https://issues.apache.org/jira/browse/MESOS-7376 > Project: Mesos > Issue Type: Improvement > Components: master >Affects Versions: 1.3.0 >Reporter: Ilya Pronin >Assignee: Ilya Pronin >Priority: Critical > > During scale testing we discovered that as the number of registered agents > grows the time it takes to update the registry grows to unacceptable values > very fast. At some point it starts exceeding {{registry_store_timeout}} which > doesn't fire. > With 55k agents we saw this ({{registry_store_timeout=20secs}}): > {noformat} > I0331 17:11:21.227442 36472 registrar.cpp:473] Applied 69 operations in > 3.138843387secs; attempting to update the registry > I0331 17:11:24.441409 36464 log.cpp:529] LogStorage.set: acquired the lock in > 74461ns > I0331 17:11:24.441541 36464 log.cpp:543] LogStorage.set: started in 51770ns > I0331 17:11:26.869323 36462 log.cpp:628] LogStorage.set: wrote append at > position=6420881 in 2.41043644secs > I0331 17:11:26.869454 36462 state.hpp:179] State.store: storage.set has > finished in 2.428189561secs (b=1) > I0331 17:11:56.199453 36469 registrar.cpp:518] Successfully updated the > registry in 34.971944192secs > {noformat} > This is caused by repeated {{Registry}} copying which involves copying a big > object graph that takes roughly 0.4 sec (with 55k agents). -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-7389) Mesos 1.2.0 crashes with pre-1.0 Mesos agents
[ https://issues.apache.org/jira/browse/MESOS-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15975401#comment-15975401 ] Benjamin Mahler commented on MESOS-7389: One thing that could be simple to cherry pick into the 1.2.x branch is to have the master simply log and drop registrations from pre-1.0 agents. > Mesos 1.2.0 crashes with pre-1.0 Mesos agents > - > > Key: MESOS-7389 > URL: https://issues.apache.org/jira/browse/MESOS-7389 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.2.0 > Environment: Ubuntu 14.04 >Reporter: Nicholas Studt >Assignee: Benjamin Mahler >Priority: Critical > Labels: mesosphere > > During upgrade from 1.0.1 to 1.2.0 a single mesos-slave reregistering with > the running leader caused the leader to terminate. All 3 of the masters > suffered the same failure as the same slave node reregistered against the new > leader, this continued across the entire cluster until the offending slave > node was removed and fixed. The fix to the slave node was to remove the mesos > directory and then start the slave node back up. 
> F0412 17:24:42.736600 6317 master.cpp:5701] Check failed: > frameworks_.contains(task.framework_id()) > *** Check failure stack trace: *** > @ 0x7f59f944f94d google::LogMessage::Fail() > @ 0x7f59f945177d google::LogMessage::SendToLog() > @ 0x7f59f944f53c google::LogMessage::Flush() > @ 0x7f59f9452079 google::LogMessageFatal::~LogMessageFatal() > I0412 17:24:42.750300 6316 replica.cpp:693] Replica received learned notice > for position 6896 from @0.0.0.0:0 > @ 0x7f59f88f2341 mesos::internal::master::Master::_reregisterSlave() > @ 0x7f59f88f488f > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master6MasterERKNS5_9SlaveInfoERKNS0_4UPIDERKSt6vectorINS5_8ResourceESaISG_EERKSF_INS5_12ExecutorInfoESaISL_EERKSF_INS5_4TaskESaISQ_EERKSF_INS5_13FrameworkInfoESaISV_EERKSF_INS6_17Archive_FrameworkESaIS10_EERKSsRKSF_INS5_20SlaveInfo_CapabilityESaIS17_EERKNS0_6FutureIbEES9_SC_SI_SN_SS_SX_S12_SsS19_S1D_EEvRKNS0_3PIDIT_EEMS1H_FvT0_T1_T2_T3_T4_T5_T6_T7_T8_T9_ET10_T11_T12_T13_T14_T15_T16_T17_T18_T19_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f59f93c3eb1 process::ProcessManager::resume() > @ 0x7f59f93ccd57 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv > @ 0x7f59f77cfa60 (unknown) > @ 0x7f59f6fec184 start_thread > @ 0x7f59f6d19bed (unknown) -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-7389) Mesos 1.2.0 crashes with pre-1.0 Mesos agents
[ https://issues.apache.org/jira/browse/MESOS-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15975396#comment-15975396 ] Benjamin Mahler commented on MESOS-7389: Sorry for the confusion, I just meant that I'm inclined not to support pre-1.0 agents against a 1.2 master given the complexity of the solution. However, I totally agree that incompatible versions of the master and agent should not lead to crashes (especially vague ones). Explicit handling of incompatible agents (i.e. just MESOS-6976 within the MESOS-6975 epic) is long overdue. In the interim, we can start with explicitly stating the upgrade requirements for getting into 1.x from 0.y, since they aren't captured [here|http://mesos.apache.org/documentation/latest/versioning/] and you need to reach a certain 0.y release before you can upgrade to 1.x. cc [~vinodkone] > Mesos 1.2.0 crashes with pre-1.0 Mesos agents > - > > Key: MESOS-7389 > URL: https://issues.apache.org/jira/browse/MESOS-7389 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.2.0 > Environment: Ubuntu 14.04 >Reporter: Nicholas Studt >Assignee: Benjamin Mahler >Priority: Critical > Labels: mesosphere > > During upgrade from 1.0.1 to 1.2.0 a single mesos-slave reregistering with > the running leader caused the leader to terminate. All 3 of the masters > suffered the same failure as the same slave node reregistered against the new > leader, this continued across the entire cluster until the offending slave > node was removed and fixed. The fix to the slave node was to remove the mesos > directory and then start the slave node back up. 
> F0412 17:24:42.736600 6317 master.cpp:5701] Check failed: > frameworks_.contains(task.framework_id()) > *** Check failure stack trace: *** > @ 0x7f59f944f94d google::LogMessage::Fail() > @ 0x7f59f945177d google::LogMessage::SendToLog() > @ 0x7f59f944f53c google::LogMessage::Flush() > @ 0x7f59f9452079 google::LogMessageFatal::~LogMessageFatal() > I0412 17:24:42.750300 6316 replica.cpp:693] Replica received learned notice > for position 6896 from @0.0.0.0:0 > @ 0x7f59f88f2341 mesos::internal::master::Master::_reregisterSlave() > @ 0x7f59f88f488f > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master6MasterERKNS5_9SlaveInfoERKNS0_4UPIDERKSt6vectorINS5_8ResourceESaISG_EERKSF_INS5_12ExecutorInfoESaISL_EERKSF_INS5_4TaskESaISQ_EERKSF_INS5_13FrameworkInfoESaISV_EERKSF_INS6_17Archive_FrameworkESaIS10_EERKSsRKSF_INS5_20SlaveInfo_CapabilityESaIS17_EERKNS0_6FutureIbEES9_SC_SI_SN_SS_SX_S12_SsS19_S1D_EEvRKNS0_3PIDIT_EEMS1H_FvT0_T1_T2_T3_T4_T5_T6_T7_T8_T9_ET10_T11_T12_T13_T14_T15_T16_T17_T18_T19_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f59f93c3eb1 process::ProcessManager::resume() > @ 0x7f59f93ccd57 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv > @ 0x7f59f77cfa60 (unknown) > @ 0x7f59f6fec184 start_thread > @ 0x7f59f6d19bed (unknown) -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-7389) Mesos 1.2.0 crashes with pre-1.0 Mesos agents
[ https://issues.apache.org/jira/browse/MESOS-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15973853#comment-15973853 ] Benjamin Mahler commented on MESOS-7389: Ah, I'm sorry! I misread that code. With regard to this ticket, I'm inclined not to fix given the difficulty and that we don't support pre-1.0 agents against a 1.2 master. Any objections? > Mesos 1.2.0 crashes with pre-1.0 Mesos agents > - > > Key: MESOS-7389 > URL: https://issues.apache.org/jira/browse/MESOS-7389 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.2.0 > Environment: Ubuntu 14.04 >Reporter: Nicholas Studt >Assignee: Benjamin Mahler >Priority: Critical > Labels: mesosphere > > During upgrade from 1.0.1 to 1.2.0 a single mesos-slave reregistering with > the running leader caused the leader to terminate. All 3 of the masters > suffered the same failure as the same slave node reregistered against the new > leader, this continued across the entire cluster until the offending slave > node was removed and fixed. The fix to the slave node was to remove the mesos > directory and then start the slave node back up. 
> F0412 17:24:42.736600 6317 master.cpp:5701] Check failed: > frameworks_.contains(task.framework_id()) > *** Check failure stack trace: *** > @ 0x7f59f944f94d google::LogMessage::Fail() > @ 0x7f59f945177d google::LogMessage::SendToLog() > @ 0x7f59f944f53c google::LogMessage::Flush() > @ 0x7f59f9452079 google::LogMessageFatal::~LogMessageFatal() > I0412 17:24:42.750300 6316 replica.cpp:693] Replica received learned notice > for position 6896 from @0.0.0.0:0 > @ 0x7f59f88f2341 mesos::internal::master::Master::_reregisterSlave() > @ 0x7f59f88f488f > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master6MasterERKNS5_9SlaveInfoERKNS0_4UPIDERKSt6vectorINS5_8ResourceESaISG_EERKSF_INS5_12ExecutorInfoESaISL_EERKSF_INS5_4TaskESaISQ_EERKSF_INS5_13FrameworkInfoESaISV_EERKSF_INS6_17Archive_FrameworkESaIS10_EERKSsRKSF_INS5_20SlaveInfo_CapabilityESaIS17_EERKNS0_6FutureIbEES9_SC_SI_SN_SS_SX_S12_SsS19_S1D_EEvRKNS0_3PIDIT_EEMS1H_FvT0_T1_T2_T3_T4_T5_T6_T7_T8_T9_ET10_T11_T12_T13_T14_T15_T16_T17_T18_T19_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f59f93c3eb1 process::ProcessManager::resume() > @ 0x7f59f93ccd57 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv > @ 0x7f59f77cfa60 (unknown) > @ 0x7f59f6fec184 start_thread > @ 0x7f59f6d19bed (unknown) -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-7389) Mesos 1.2.0 crashes with pre-1.0 Mesos agents
[ https://issues.apache.org/jira/browse/MESOS-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15973514#comment-15973514 ] Benjamin Mahler commented on MESOS-7389: As far as I can tell, fixing this to support pre-1.0 agents is complicated and is likely to produce its own subtle bugs. 1.2+ masters maintain an invariant that each task / executor has a known allocation role (and it can determine this given 1.0+ agents report their frameworks). Whereas if we were to support pre-1.0 agents against a 1.2+ master, the master would have to be updated to handle tasks that have an unknown allocation role (i.e. what used to be called "orphaned" tasks). A partial fix here would be to handle the case where the framework is already re-registered, leaving the "orphaned" task case triggering this check. [~neilc] [~vinodkone] The handling of pre-1.0 agents in the context of "orphaned tasks" already appears to have issues, e.g.: * Master upgraded to 1.2.x * Pre-1.0 agent re-registers with task and task's framework id, doesn't send the FrameworkInfos. * This task's framework hasn't re-registered yet, so this is what we used to call an "orphan task". * The re-registration handling drops the task, see [here|https://github.com/apache/mesos/blob/1.2.0/src/master/master.cpp#L5784-L5807]. * Later, when this framework re-registers, the task is absent in the master but known to the agent. Is this broken or am I missing something? If broken, given that fixing this ticket requires a complicated solution, and we didn't originally intend to support pre-1.0 upgrades for > 1.0.x masters, I'd be inclined to not support it (and possibly cherry-pick safety checks like MESOS-6975). 
> Mesos 1.2.0 crashes with pre-1.0 Mesos agents > - > > Key: MESOS-7389 > URL: https://issues.apache.org/jira/browse/MESOS-7389 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.2.0 > Environment: Ubuntu 14.04 >Reporter: Nicholas Studt >Assignee: Benjamin Mahler >Priority: Critical > Labels: mesosphere > > During upgrade from 1.0.1 to 1.2.0 a single mesos-slave reregistering with > the running leader caused the leader to terminate. All 3 of the masters > suffered the same failure as the same slave node reregistered against the new > leader, this continued across the entire cluster until the offending slave > node was removed and fixed. The fix to the slave node was to remove the mesos > directory and then start the slave node back up. > F0412 17:24:42.736600 6317 master.cpp:5701] Check failed: > frameworks_.contains(task.framework_id()) > *** Check failure stack trace: *** > @ 0x7f59f944f94d google::LogMessage::Fail() > @ 0x7f59f945177d google::LogMessage::SendToLog() > @ 0x7f59f944f53c google::LogMessage::Flush() > @ 0x7f59f9452079 google::LogMessageFatal::~LogMessageFatal() > I0412 17:24:42.750300 6316 replica.cpp:693] Replica received learned notice > for position 6896 from @0.0.0.0:0 > @ 0x7f59f88f2341 mesos::internal::master::Master::_reregisterSlave() > @ 0x7f59f88f488f > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master6MasterERKNS5_9SlaveInfoERKNS0_4UPIDERKSt6vectorINS5_8ResourceESaISG_EERKSF_INS5_12ExecutorInfoESaISL_EERKSF_INS5_4TaskESaISQ_EERKSF_INS5_13FrameworkInfoESaISV_EERKSF_INS6_17Archive_FrameworkESaIS10_EERKSsRKSF_INS5_20SlaveInfo_CapabilityESaIS17_EERKNS0_6FutureIbEES9_SC_SI_SN_SS_SX_S12_SsS19_S1D_EEvRKNS0_3PIDIT_EEMS1H_FvT0_T1_T2_T3_T4_T5_T6_T7_T8_T9_ET10_T11_T12_T13_T14_T15_T16_T17_T18_T19_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x7f59f93c3eb1 process::ProcessManager::resume() > @ 0x7f59f93ccd57 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv > @ 
0x7f59f77cfa60 (unknown) > @ 0x7f59f6fec184 start_thread > @ 0x7f59f6d19bed (unknown) -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-7376) Long registry updates when the number of agents is high
[ https://issues.apache.org/jira/browse/MESOS-7376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971824#comment-15971824 ] Benjamin Mahler commented on MESOS-7376: Yes, I will shepherd, thanks for taking this on! > Long registry updates when the number of agents is high > --- > > Key: MESOS-7376 > URL: https://issues.apache.org/jira/browse/MESOS-7376 > Project: Mesos > Issue Type: Improvement > Components: master >Affects Versions: 1.3.0 >Reporter: Ilya Pronin >Assignee: Ilya Pronin >Priority: Critical > > During scale testing we discovered that as the number of registered agents > grows the time it takes to update the registry grows to unacceptable values > very fast. At some point it starts exceeding {{registry_store_timeout}} which > doesn't fire. > With 55k agents we saw this ({{registry_store_timeout=20secs}}): > {noformat} > I0331 17:11:21.227442 36472 registrar.cpp:473] Applied 69 operations in > 3.138843387secs; attempting to update the registry > I0331 17:11:24.441409 36464 log.cpp:529] LogStorage.set: acquired the lock in > 74461ns > I0331 17:11:24.441541 36464 log.cpp:543] LogStorage.set: started in 51770ns > I0331 17:11:26.869323 36462 log.cpp:628] LogStorage.set: wrote append at > position=6420881 in 2.41043644secs > I0331 17:11:26.869454 36462 state.hpp:179] State.store: storage.set has > finished in 2.428189561secs (b=1) > I0331 17:11:56.199453 36469 registrar.cpp:518] Successfully updated the > registry in 34.971944192secs > {noformat} > This is caused by repeated {{Registry}} copying which involves copying a big > object graph that takes roughly 0.4 sec (with 55k agents). -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-6975) Prevent pre-1.0 agents from registering with 1.3+ master.
[ https://issues.apache.org/jira/browse/MESOS-6975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-6975: --- Summary: Prevent pre-1.0 agents from registering with 1.3+ master. (was: Prevent old Mesos agents from registering) > Prevent pre-1.0 agents from registering with 1.3+ master. > - > > Key: MESOS-6975 > URL: https://issues.apache.org/jira/browse/MESOS-6975 > Project: Mesos > Issue Type: Epic > Components: master >Reporter: Neil Conway >Assignee: Neil Conway > > https://www.mail-archive.com/dev@mesos.apache.org/msg37194.html -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-7387) ZK master contender and detector don't respect zk_session_timeout option
[ https://issues.apache.org/jira/browse/MESOS-7387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971814#comment-15971814 ] Benjamin Mahler commented on MESOS-7387: Looks like Vinod is shepherding, thanks Vinod. > ZK master contender and detector don't respect zk_session_timeout option > > > Key: MESOS-7387 > URL: https://issues.apache.org/jira/browse/MESOS-7387 > Project: Mesos > Issue Type: Improvement > Components: master >Affects Versions: 1.3.0 >Reporter: Ilya Pronin >Assignee: Ilya Pronin >Priority: Minor > > {{ZooKeeperMasterContender}} and {{ZooKeeperMasterDetector}} are using > hardcoded ZK session timeouts ({{MASTER_CONTENDER_ZK_SESSION_TIMEOUT}} and > {{MASTER_DETECTOR_ZK_SESSION_TIMEOUT}}) and do not respect > {{--zk_session_timeout}} master option. This is unexpected and doesn't play > well with ZK updates that take longer than 10 secs. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7376) Long registry updates when the number of agents is high
[ https://issues.apache.org/jira/browse/MESOS-7376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7376: --- Shepherd: Benjamin Mahler > Long registry updates when the number of agents is high > --- > > Key: MESOS-7376 > URL: https://issues.apache.org/jira/browse/MESOS-7376 > Project: Mesos > Issue Type: Improvement > Components: master >Affects Versions: 1.3.0 >Reporter: Ilya Pronin >Assignee: Ilya Pronin >Priority: Critical > > During scale testing we discovered that as the number of registered agents > grows the time it takes to update the registry grows to unacceptable values > very fast. At some point it starts exceeding {{registry_store_timeout}} which > doesn't fire. > With 55k agents we saw this ({{registry_store_timeout=20secs}}): > {noformat} > I0331 17:11:21.227442 36472 registrar.cpp:473] Applied 69 operations in > 3.138843387secs; attempting to update the registry > I0331 17:11:24.441409 36464 log.cpp:529] LogStorage.set: acquired the lock in > 74461ns > I0331 17:11:24.441541 36464 log.cpp:543] LogStorage.set: started in 51770ns > I0331 17:11:26.869323 36462 log.cpp:628] LogStorage.set: wrote append at > position=6420881 in 2.41043644secs > I0331 17:11:26.869454 36462 state.hpp:179] State.store: storage.set has > finished in 2.428189561secs (b=1) > I0331 17:11:56.199453 36469 registrar.cpp:518] Successfully updated the > registry in 34.971944192secs > {noformat} > This is caused by repeated {{Registry}} copying which involves copying a big > object graph that takes roughly 0.4 sec (with 55k agents). -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7323) Framework role tracking in allocator results in framework treated as active incorrectly.
[ https://issues.apache.org/jira/browse/MESOS-7323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7323: --- Priority: Blocker (was: Critical) > Framework role tracking in allocator results in framework treated as active > incorrectly. > > > Key: MESOS-7323 > URL: https://issues.apache.org/jira/browse/MESOS-7323 > Project: Mesos > Issue Type: Bug > Components: allocation, master >Reporter: Benjamin Mahler >Assignee: Michael Park >Priority: Blocker > > When an agent is added to the allocator and there are resources allocated to > a known framework, where the allocation role is not one of the framework's > subscribed roles, then the allocator will "track" the role (i.e. allocation > information) for this framework. However, the current implementation results > in the framework being treated as an active client of the sorter, when it > should be an inactive client. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-6627) Allow frameworks to modify the role(s) they are subscribed to.
[ https://issues.apache.org/jira/browse/MESOS-6627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-6627: --- Fix Version/s: 1.3.0 [~mcypark] we need to fix MESOS-7323 prior to the 1.3 release, will mark it as a blocker. We can move MESOS-7258 as an improvement outside of this epic. > Allow frameworks to modify the role(s) they are subscribed to. > -- > > Key: MESOS-6627 > URL: https://issues.apache.org/jira/browse/MESOS-6627 > Project: Mesos > Issue Type: Epic > Components: framework api, master >Reporter: Benjamin Mahler >Assignee: Michael Park > Fix For: 1.3.0 > > > Currently, we do not provide the ability for frameworks to change the roles > they are subscribed with. As we begin to support "multi-tenant" frameworks > (i.e. multi-role support in MESOS-1763), it will become necessary to allow > frameworks to add and remove roles as "tenants" come and go from the > framework. > Because of this being necessary to support multi-role frameworks, this is > considered "phase 2" of the multi-role framework project. See the design > published in MESOS-4284. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-6058) Register slave in deactivate mode
[ https://issues.apache.org/jira/browse/MESOS-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15949891#comment-15949891 ] Benjamin Mahler commented on MESOS-6058: This sounds like a "reservation template" feature we've talked about, where the master would apply the reservation before letting the agent's resources be allocated. This might also be covered by the addition of endpoints to deactivate/activate agents MESOS-7317, depending on how that's solved (if identified by machine vs specific agent id). > Register slave in deactivate mode > - > > Key: MESOS-6058 > URL: https://issues.apache.org/jira/browse/MESOS-6058 > Project: Mesos > Issue Type: Improvement > Components: agent, master >Reporter: Klaus Ma > > In my cluster, I'd like to reserve some resources for one application; the dynamic reservation feature is used because the reservation may be changed. But when a slave registers with the master, some tasks from other frameworks may be dispatched to the new slave before the reservation is made. The proposal is to enable the slave to register in deactivated mode, and activate it after configuration, e.g. dynamic reservation. > cc [~kaysoky]/[~jvanremoortere]
[jira] [Commented] (MESOS-7317) Add master endpoint to deactivate / activate agent
[ https://issues.apache.org/jira/browse/MESOS-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15949876#comment-15949876 ] Benjamin Mahler commented on MESOS-7317: Linking in the original ticket. > Add master endpoint to deactivate / activate agent > -- > > Key: MESOS-7317 > URL: https://issues.apache.org/jira/browse/MESOS-7317 > Project: Mesos > Issue Type: Improvement > Components: agent, master >Reporter: Neil Conway > Labels: mesosphere > > This would allow the operator to deactivate and then subsequently activate an > agent. The allocator does not make offers for deactivated agents; this > functionality would be useful to help operators "manually (incrementally) > drain" the tasks running on an agent, e.g., before taking the agent down. > At present, if the operator causes a framework to kill a task running on the > agent, the framework will often receive an offer for the unused resources on > the agent, which will often result in respawning the killed task on the same > agent. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-7325) ProcessRemoteLinkTest.RemoteLinkLeak is flaky.
[ https://issues.apache.org/jira/browse/MESOS-7325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15949869#comment-15949869 ] Benjamin Mahler commented on MESOS-7325: Hm.. I wasn't able to reproduce after 3000 iterations. > ProcessRemoteLinkTest.RemoteLinkLeak is flaky. > -- > > Key: MESOS-7325 > URL: https://issues.apache.org/jira/browse/MESOS-7325 > Project: Mesos > Issue Type: Bug > Components: libprocess >Affects Versions: 1.3.0 > Environment: macOS, clang, libev build >Reporter: Till Toenshoff > Labels: flaky-test > > After this hit me initially on a regular make check, I did run the test > isolated in infinite repetition - those 3 times I tried it, the bug surfaced > at around 100-150 repetitions. > {noformat} > $ ./3rdparty/libprocess/libprocess-tests > --gtest_filter="ProcessRemoteLinkTest.RemoteLinkLeak" --gtest_repeat=-1 > --gtest_break_on_failure > {noformat} > {noformat} > Repeating all tests (iteration 119) . . . > Note: Google Test filter = ProcessRemoteLinkTest.RemoteLinkLeak > [==] Running 1 test from 1 test case. > [--] Global test environment set-up. > [--] 1 test from ProcessRemoteLinkTest > [ RUN ] ProcessRemoteLinkTest.RemoteLinkLeak > (libev) select: Invalid argument > *** Aborted at 1490866597 (unix time) try "date -d @1490866597" if you are > using GNU date *** > PC: @ 0x7fffb7621d42 __pthread_kill > *** SIGABRT (@0x7fffb7621d42) received by PID 60372 (TID 0x7d123000) > stack trace: *** > @ 0x7fffb7702b3a _sigtramp > @ 0x7ff8fecfc080 (unknown) > @ 0x7fffb7587420 abort > @0x10e17051d ev_syserr > @0x10e170e16 select_poll > @0x10e16c635 ev_run > @0x10e126f2b ev_loop() > @0x10e126e96 process::EventLoop::run() > @0x10e0498bf _ZNSt3__114__thread_proxyINS_5tupleIJPFvvEEPvS5_ > @ 0x7fffb770c9af _pthread_body > @ 0x7fffb770c8fb _pthread_start > @ 0x7fffb770c101 thread_start > Abort trap: 6 > {noformat} > Note that this is obviously a libev build. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-7325) ProcessRemoteLinkTest.RemoteLinkLeak is flaky.
[ https://issues.apache.org/jira/browse/MESOS-7325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15949852#comment-15949852 ] Benjamin Mahler commented on MESOS-7325: Hm.. this looks the same as MESOS-6453, which also occurs on darwin but surfaced on a different test when run in repetition. I'm curious whether this fails after a fixed number of runs (e.g. hitting some limit) or non-deterministically. > ProcessRemoteLinkTest.RemoteLinkLeak is flaky. > -- > > Key: MESOS-7325 > URL: https://issues.apache.org/jira/browse/MESOS-7325 > Project: Mesos > Issue Type: Bug > Components: libprocess >Affects Versions: 1.3.0 > Environment: macOS, clang, libev build >Reporter: Till Toenshoff > Labels: flaky-test
[jira] [Created] (MESOS-7324) Update documentation to reflect the addition of multi-role framework support.
Benjamin Mahler created MESOS-7324: -- Summary: Update documentation to reflect the addition of multi-role framework support. Key: MESOS-7324 URL: https://issues.apache.org/jira/browse/MESOS-7324 Project: Mesos Issue Type: Documentation Components: documentation Reporter: Benjamin Mahler Assignee: Benjamin Mahler The current documentation assumes single-role frameworks; we need to update it to reflect the support for subscribing to multiple roles.
[jira] [Assigned] (MESOS-6762) Update release notes for multi-role changes
[ https://issues.apache.org/jira/browse/MESOS-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-6762: -- Assignee: Benjamin Mahler > Update release notes for multi-role changes > --- > > Key: MESOS-6762 > URL: https://issues.apache.org/jira/browse/MESOS-6762 > Project: Mesos > Issue Type: Task >Reporter: Benjamin Bannier >Assignee: Benjamin Mahler > > When adding support for multi-role frameworks we should call out a number of > potential issues in the changelog/release notes. > This ticket collects potential pitfalls. > h6. Changes in master and agent endpoints > When rendering the {{FrameworkInfo}} of multi-role enabled frameworks in > master or agent endpoints the {{role}} field will not be supported anymore; > instead the {{roles}} field should be used. Any tooling parsing endpoint > information and relying on the {{role}} field needs to be updated before > multi-role enabled frameworks can be run in the cluster. > h6. Changes to the allocator interface / implementation requirements for > module implementors > Implementors of allocator modules have to provide new implementation > functionality to satisfy the MULTI_ROLE framework capability. Also, the > interface has changed. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-6762) Update release notes for multi-role changes
[ https://issues.apache.org/jira/browse/MESOS-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947997#comment-15947997 ] Benjamin Mahler commented on MESOS-6762: CHANGELOG update: {noformat} commit 10d7988ee5948bc45518e7c1c339a371c4bf151f Author: Benjamin Mahler Date: Thu Mar 16 15:33:56 2017 -0700 Added multi-role framework support to the CHANGELOG. Review: https://reviews.apache.org/r/57707 {noformat} Will close once the additional documentation described in this ticket is added. > Update release notes for multi-role changes > --- > > Key: MESOS-6762 > URL: https://issues.apache.org/jira/browse/MESOS-6762 > Project: Mesos > Issue Type: Task >Reporter: Benjamin Bannier
[jira] [Commented] (MESOS-3875) Account dynamic reservations towards quota.
[ https://issues.apache.org/jira/browse/MESOS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947990#comment-15947990 ] Benjamin Mahler commented on MESOS-3875: I think the situation I'm describing is addressed by this ticket, since I'm referring to quota allocation not accounting for reserved resources and hence enabling gaming. Unless this is prevented already? MESOS-3338 is fairly vague. It could use some clarification since it seems to be referring only to endpoints (and I'm not sure the suggestion of MESOS-3338 is the right thing to do as far as the endpoints are concerned). Is it addressing the fair sharing side of the reservation gaming? > Account dynamic reservations towards quota. > --- > > Key: MESOS-3875 > URL: https://issues.apache.org/jira/browse/MESOS-3875 > Project: Mesos > Issue Type: Task > Components: allocation, master >Reporter: Alexander Rukletsov >Priority: Critical > Labels: mesosphere > > Dynamic reservations—whether allocated or not—should be accounted towards > role's quota. This requires update in at least two places: > * The built-in allocator, which actually satisfies quota; > * The sanity check in the master. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7323) Framework role tracking in allocator results in framework treated as active incorrectly.
[ https://issues.apache.org/jira/browse/MESOS-7323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7323: --- Target Version/s: 1.3.0 Priority: Critical (was: Major) Description: When an agent is added to the allocator and there are resources allocated to a known framework, where the allocation role is not one of the framework's subscribed roles, then the allocator will "track" the role (i.e. allocation information) for this framework. However, the current implementation results in the framework being treated as an active client of the sorter, when it should be an inactive client. > Framework role tracking in allocator results in framework treated as active > incorrectly. > > > Key: MESOS-7323 > URL: https://issues.apache.org/jira/browse/MESOS-7323 > Project: Mesos > Issue Type: Bug > Components: allocation, master >Reporter: Benjamin Mahler >Assignee: Michael Park >Priority: Critical > > When an agent is added to the allocator and there are resources allocated to > a known framework, where the allocation role is not one of the framework's > subscribed roles, then the allocator will "track" the role (i.e. allocation > information) for this framework. However, the current implementation results > in the framework being treated as an active client of the sorter, when it > should be an inactive client. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (MESOS-7323) Framework role tracking in allocator results in framework treated as active incorrectly.
Benjamin Mahler created MESOS-7323: -- Summary: Framework role tracking in allocator results in framework treated as active incorrectly. Key: MESOS-7323 URL: https://issues.apache.org/jira/browse/MESOS-7323 Project: Mesos Issue Type: Bug Components: allocation, master Reporter: Benjamin Mahler Assignee: Michael Park -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-3875) Account dynamic reservations towards quota.
[ https://issues.apache.org/jira/browse/MESOS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-3875: --- Priority: Critical (was: Major) Not accounting them when not allocated seems to allow gaming of the quota system. The scenario is: I keep making reservations, and I filter out any reservations that come back, in an effort to amass as many reserved resources as possible. Can I reserve the whole cluster this way, given we don't count the reserved resources towards the quota? Or is there something in place already that is preventing this? > Account dynamic reservations towards quota. > --- > > Key: MESOS-3875 > URL: https://issues.apache.org/jira/browse/MESOS-3875 > Project: Mesos > Issue Type: Task > Components: allocation, master >Reporter: Alexander Rukletsov >Priority: Critical > Labels: mesosphere
[jira] [Updated] (MESOS-7319) Rename the DRAIN maintenance mode to SCHEDULED to avoid confusion.
[ https://issues.apache.org/jira/browse/MESOS-7319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7319: --- Description: The current naming of the DRAIN mode in maintenance has been confusing to users as there tends to be an expectation of mesos doing something (e.g. not sending offers, or killing tasks) to achieve the drain, whereas in reality mesos does nothing and expects the schedulers to act (this only applies for maintenance aware schedulers). Rather, what's actually happening in the DRAIN mode is that the maintenance is scheduled, that's it. So a name like SCHEDULED would be less confusing for users: http://mesos.apache.org/documentation/latest/maintenance/ Component/s: documentation > Rename the DRAIN maintenance mode to SCHEDULED to avoid confusion. > -- > > Key: MESOS-7319 > URL: https://issues.apache.org/jira/browse/MESOS-7319 > Project: Mesos > Issue Type: Improvement > Components: documentation, HTTP API, master >Reporter: Benjamin Mahler > > The current naming of the DRAIN mode in maintenance has been confusing to users as there tends to be an expectation of mesos doing something (e.g. not sending offers, or killing tasks) to achieve the drain, whereas in reality mesos does nothing and expects the schedulers to act (this only applies for maintenance aware schedulers). > Rather, what's actually happening in the DRAIN mode is that the maintenance is scheduled, that's it. So a name like SCHEDULED would be less confusing for users: http://mesos.apache.org/documentation/latest/maintenance/
[jira] [Created] (MESOS-7319) Rename the DRAIN maintenance mode to SCHEDULED to avoid confusion.
Benjamin Mahler created MESOS-7319: -- Summary: Rename the DRAIN maintenance mode to SCHEDULED to avoid confusion. Key: MESOS-7319 URL: https://issues.apache.org/jira/browse/MESOS-7319 Project: Mesos Issue Type: Improvement Components: HTTP API, master Reporter: Benjamin Mahler -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-7201) Improvements to maintenance primitives
[ https://issues.apache.org/jira/browse/MESOS-7201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944343#comment-15944343 ] Benjamin Mahler commented on MESOS-7201: [~kaysoky] I'm inclined to rename the {{DRAIN}} mode to {{SCHEDULED}} as there is not necessarily "draining" occurring in the {{DRAIN}} mode, so this tends to confuse users as they have an expectation of mesos doing something (e.g. not sending offers, or killing tasks) to achieve the drain. Thoughts? > Improvements to maintenance primitives > -- > > Key: MESOS-7201 > URL: https://issues.apache.org/jira/browse/MESOS-7201 > Project: Mesos > Issue Type: Epic >Reporter: Joseph Wu > Labels: mesosphere > > This is a follow up epic to MESOS-1474 to capture further improvements for > maintenance primitives. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-7317) Add master endpoint to deactivate / activate agent
[ https://issues.apache.org/jira/browse/MESOS-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7317: --- Target Version/s: 1.3.0 > Add master endpoint to deactivate / activate agent > -- > > Key: MESOS-7317 > URL: https://issues.apache.org/jira/browse/MESOS-7317 > Project: Mesos > Issue Type: Improvement > Components: agent, master >Reporter: Neil Conway > Labels: mesosphere > > This would allow the operator to deactivate and then subsequently activate an > agent. The allocator does not make offers for deactivated agents; this > functionality would be useful to help operators "manually (incrementally) > drain" the tasks running on an agent, e.g., before taking the agent down. > At present, if the operator causes a framework to kill a task running on the > agent, the framework will often receive an offer for the unused resources on > the agent, which will often result in respawning the killed task on the same > agent. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-7317) Add master endpoint to deactivate / activate agent
[ https://issues.apache.org/jira/browse/MESOS-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944208#comment-15944208 ] Benjamin Mahler commented on MESOS-7317: Linking in the "maintenance improvements" epic. > Add master endpoint to deactivate / activate agent > -- > > Key: MESOS-7317 > URL: https://issues.apache.org/jira/browse/MESOS-7317 > Project: Mesos > Issue Type: Improvement > Components: agent, master >Reporter: Neil Conway > Labels: mesosphere > > This would allow the operator to deactivate and then subsequently activate an > agent. The allocator does not make offers for deactivated agents; this > functionality would be useful to help operators "manually (incrementally) > drain" the tasks running on an agent, e.g., before taking the agent down. > At present, if the operator causes a framework to kill a task running on the > agent, the framework will often receive an offer for the unused resources on > the agent, which will often result in respawning the killed task on the same > agent. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (MESOS-7318) Libprocess delays and timers should be undisturbed by system clock jumps.
Benjamin Mahler created MESOS-7318: -- Summary: Libprocess delays and timers should be undisturbed by system clock jumps. Key: MESOS-7318 URL: https://issues.apache.org/jira/browse/MESOS-7318 Project: Mesos Issue Type: Bug Components: libprocess Reporter: Benjamin Mahler Currently, libprocess timers / delays / timeouts are affected by system clock jumps because they do not use a monotonic clock as a reference point. Since these require relative timing, we can use a monotonic clock as the reference point. We also need the approach to be affected by clock manipulation at the libprocess level (i.e. {{Clock::advance(...)}} and {{Clock::update(...)}}) for testing purposes. The current recommendation is for users to use NTP with skewing applied to adjust for leaps, e.g.: https://googleblog.blogspot.com/2011/09/time-technology-and-leaping-seconds.html I thought we already had a ticket for this but can't seem to find it. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MESOS-5995) Protobuf JSON deserialisation does not accept numbers formated as strings
[ https://issues.apache.org/jira/browse/MESOS-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-5995: --- Priority: Critical (was: Minor) > Protobuf JSON deserialisation does not accept numbers formated as strings > - > > Key: MESOS-5995 > URL: https://issues.apache.org/jira/browse/MESOS-5995 > Project: Mesos > Issue Type: Bug > Components: HTTP API >Affects Versions: 1.0.0 >Reporter: Tomasz Janiszewski >Assignee: Tomasz Janiszewski >Priority: Critical > > Proto2 does not specify JSON mappings, but [Proto3|https://developers.google.com/protocol-buffers/docs/proto3#json] does, and it recommends mapping 64-bit numbers as strings. Unfortunately Mesos does not accept strings in place of uint64 and returns a 400 Bad Request error: > {quote} > Failed to convert JSON into Call protobuf: Not expecting a JSON > string for field 'value'. > {quote} > Is this by purpose or is this a bug?
[jira] [Updated] (MESOS-7281) Backwards incompatible UpdateFrameworkMessage handling
[ https://issues.apache.org/jira/browse/MESOS-7281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-7281: --- Target Version/s: 1.3.0 [~ipronin] thanks for reporting this, we'll get your fix in. > Backwards incompatible UpdateFrameworkMessage handling > -- > > Key: MESOS-7281 > URL: https://issues.apache.org/jira/browse/MESOS-7281 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Ilya Pronin >Assignee: Ilya Pronin >Priority: Blocker > > Patch in [r/57108|https://reviews.apache.org/r/57108/] introduced framework > info updates. Agents are using a new {{framework_info}} field without > checking that it's present. If a patched agent is used with not patched > master it will get a default-initialized {{framework_info}}. This will cause > agent failures later. E.g abort on framework ID validation when it tries to > launch a new task for the updated framework. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-6995) Update the webui to reflect hierarchical roles.
[ https://issues.apache.org/jira/browse/MESOS-6995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15933897#comment-15933897 ] Benjamin Mahler commented on MESOS-6995: [~haosd...@gmail.com] the flat table is just an initial approach to get the information exposed. We could figure out a tree structure and/or have it so that roles are shown per level (e.g. top level page shows eng and sales, can click into eng to see roles that are beneath eng, and so on recursively). I'm inclined to at least do the per-level approach, where the user has to "click in" to view the level beneath. We could later include a tree structure on top of this, where users would still be able to click in to roles. > Update the webui to reflect hierarchical roles. > --- > > Key: MESOS-6995 > URL: https://issues.apache.org/jira/browse/MESOS-6995 > Project: Mesos > Issue Type: Task > Components: webui >Reporter: Benjamin Mahler >Assignee: Jay Guo > > It may not need any changes, but we should confirm that the new role format > for hierarchical roles is correctly displayed in the webui. > In addition, we can add a roles tab that shows the summary information > (shares, weights, quotas). For now, we don't need to make any of this > clickable (e.g. to see the tasks / frameworks under the role). -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MESOS-7240) Slave logs do not show gpus resources
[ https://issues.apache.org/jira/browse/MESOS-7240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15933304#comment-15933304 ] Benjamin Mahler commented on MESOS-7240: Hi [~osallou], looking at the code it seems to me this should be printing. Are you sure the executor has GPU resources? It could be that it's your task that has the GPU resources. Including the full logs and how you're constructing the task would help diagnose further. + [~klueska] > Slave logs do not show gpus resources > - > > Key: MESOS-7240 > URL: https://issues.apache.org/jira/browse/MESOS-7240 > Project: Mesos > Issue Type: Bug > Components: executor >Affects Versions: 1.1.0 >Reporter: Olivier Sallou >Priority: Trivial > > In slave logs, when starting a container, there is information on requested cpu and mem. It would be nice to also show requested gpus: > Launching executor '12' of framework > 37ef8db2-8203-471c-be90-b79fdc88ed3a- with resources cpus(*):0.1; > mem(*):32 > => gpus(*):1