[jira] [Commented] (MESOS-5366) Update documentation to include contender/detector module

2016-05-12 Thread Jay Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281267#comment-15281267
 ] 

Jay Guo commented on MESOS-5366:


Reviewable at: https://reviews.apache.org/r/47292/

> Update documentation to include contender/detector module
> -
>
> Key: MESOS-5366
> URL: https://issues.apache.org/jira/browse/MESOS-5366
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Jay Guo
>Assignee: Jay Guo
>Priority: Minor
>
> Since contender and detector are modularized, the documentation should be 
> updated to reflect this change as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5366) Update documentation to include contender/detector module

2016-05-12 Thread Jay Guo (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jay Guo updated MESOS-5366:
---
Shepherd: Kapil Arya

> Update documentation to include contender/detector module
> -
>
> Key: MESOS-5366
> URL: https://issues.apache.org/jira/browse/MESOS-5366
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Jay Guo
>Assignee: Jay Guo
>Priority: Minor
>
> Since contender and detector are modularized, the documentation should be 
> updated to reflect this change as well.





[jira] [Commented] (MESOS-5286) Add authorization to libprocess HTTP endpoints

2016-05-12 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281292#comment-15281292
 ] 

Adam B commented on MESOS-5286:
---

Looks like these just got committed. Please mark the RRs as Submitted, and 
resolve this JIRA with the commit log in a comment.

> Add authorization to libprocess HTTP endpoints
> --
>
> Key: MESOS-5286
> URL: https://issues.apache.org/jira/browse/MESOS-5286
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: mesosphere
> Fix For: 0.29.0
>
>
> Now that the libprocess-level HTTP endpoints have had authentication added to 
> them in MESOS-4902, we can add authorization to them as well. As a first 
> step, we can implement a "coarse-grained" approach, in which a principal is 
> granted or denied access to a given endpoint. We will likely need to register 
> an authorizer with libprocess.





[jira] [Commented] (MESOS-5212) Allow any principal in ReservationInfo when HTTP authentication is off

2016-05-12 Thread Bernd Mathiske (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281385#comment-15281385
 ] 

Bernd Mathiske commented on MESOS-5212:
---

This patch is implementation-only (with tests), which is proper. I am assuming 
the documentation changes that go along with the new behavior will then be 
posted against MESOS-5215? IMHO it would also be OK to dedicate limited doc 
updates to the ticket here.

> Allow any principal in ReservationInfo when HTTP authentication is off
> --
>
> Key: MESOS-5212
> URL: https://issues.apache.org/jira/browse/MESOS-5212
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 0.28.1
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: mesosphere
> Fix For: 0.29.0
>
>
> Mesos currently provides no way for operators to pass their principal to HTTP 
> endpoints when HTTP authentication is off. Since we enforce that 
> {{ReservationInfo.principal}} be equal to the operator principal in requests 
> to {{/reserve}}, this means that when HTTP authentication is disabled, the 
> {{ReservationInfo.principal}} field cannot be set.
> To address this in the short-term, we should allow 
> {{ReservationInfo.principal}} to hold any value when HTTP authentication is 
> disabled.





[jira] [Created] (MESOS-5368) Consider introducing persistent agent ID

2016-05-12 Thread Neil Conway (JIRA)
Neil Conway created MESOS-5368:
--

 Summary: Consider introducing persistent agent ID
 Key: MESOS-5368
 URL: https://issues.apache.org/jira/browse/MESOS-5368
 Project: Mesos
  Issue Type: Improvement
Reporter: Neil Conway


Currently, agent IDs identify a single "session" by an agent: that is, an agent 
receives an agent ID when it registers with the master; it reuses that agent ID 
if it disconnects and successfully reregisters; if the agent shuts down and 
restarts, it registers anew and receives a new agent ID.

It would be convenient to have a "persistent agent ID" that remains the same 
for the duration of a given agent {{work_dir}}. This would mean that a given 
persistent volume would not migrate between different agent IDs over time, for 
example (see MESOS-4894). If we supported permanently removing an agent from 
the cluster (i.e., the {{work_dir}} and any volumes used by the agent will 
never be reused), we could use the persistent agent ID to report which agent 
has been removed.





[jira] [Updated] (MESOS-5335) Add authorization to GET /weights

2016-05-12 Thread Yongqiao Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongqiao Wang updated MESOS-5335:
-
Assignee: (was: Yongqiao Wang)

> Add authorization to GET /weights
> -
>
> Key: MESOS-5335
> URL: https://issues.apache.org/jira/browse/MESOS-5335
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, security
>Reporter: Adam B
>  Labels: mesosphere, security
> Fix For: 0.29.0
>
>
> We already authorize which http users can update weights for particular 
> roles, but even knowing of the existence of these roles (let alone their 
> weights) may be sensitive information. We should add authz around GET 
> operations on /weights.
> Easy option: GET_ENDPOINT_WITH_PATH /weights
> - Pro: No new verb
> - Con: All or nothing
> Complex option: GET_WEIGHTS_WITH_ROLE
> - Pro: Filters contents based on roles the user is authorized to see
> - Con: More authorize calls (one per role in each /weights request)





[jira] [Updated] (MESOS-5344) New TASK_LOST Semantics

2016-05-12 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-5344:
---
Epic Name: New TaskStatuses  (was: TASK_GONE)

> New TASK_LOST Semantics
> ---
>
> Key: MESOS-5344
> URL: https://issues.apache.org/jira/browse/MESOS-5344
> Project: Mesos
>  Issue Type: Epic
>  Components: master
>Reporter: Neil Conway
>  Labels: mesosphere
>
> The TASK_LOST task status describes two different situations: (a) the task 
> was not launched because of an error (e.g., insufficient available 
> resources), or (b) the master lost contact with a running task (e.g., due to 
> a network partition); the master will kill the task when it can (e.g., when 
> the network partition heals), but in the meantime the task may still be 
> running.
> This has two problems:
> 1. Using the same task status for two fairly different situations is 
> confusing.
> 2. In the partitioned-but-still-running case, frameworks have no easy way to 
> determine when a task has truly terminated.
> To address these problems, we propose introducing a new task status, 
> TASK_GONE, which would be used whenever a task can be guaranteed to not be 
> running.





[jira] [Updated] (MESOS-5344) New TASK_LOST Semantics

2016-05-12 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-5344:
---
Summary: New TASK_LOST Semantics  (was: Introduce TASK_GONE task status)

> New TASK_LOST Semantics
> ---
>
> Key: MESOS-5344
> URL: https://issues.apache.org/jira/browse/MESOS-5344
> Project: Mesos
>  Issue Type: Epic
>  Components: master
>Reporter: Neil Conway
>  Labels: mesosphere
>
> The TASK_LOST task status describes two different situations: (a) the task 
> was not launched because of an error (e.g., insufficient available 
> resources), or (b) the master lost contact with a running task (e.g., due to 
> a network partition); the master will kill the task when it can (e.g., when 
> the network partition heals), but in the meantime the task may still be 
> running.
> This has two problems:
> 1. Using the same task status for two fairly different situations is 
> confusing.
> 2. In the partitioned-but-still-running case, frameworks have no easy way to 
> determine when a task has truly terminated.
> To address these problems, we propose introducing a new task status, 
> TASK_GONE, which would be used whenever a task can be guaranteed to not be 
> running.





[jira] [Updated] (MESOS-5344) Revise TaskStatus semantics

2016-05-12 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-5344:
---
Summary: Revise TaskStatus semantics  (was: New TASK_LOST Semantics)

> Revise TaskStatus semantics
> ---
>
> Key: MESOS-5344
> URL: https://issues.apache.org/jira/browse/MESOS-5344
> Project: Mesos
>  Issue Type: Epic
>  Components: master
>Reporter: Neil Conway
>  Labels: mesosphere
>
> The TASK_LOST task status describes two different situations: (a) the task 
> was not launched because of an error (e.g., insufficient available 
> resources), or (b) the master lost contact with a running task (e.g., due to 
> a network partition); the master will kill the task when it can (e.g., when 
> the network partition heals), but in the meantime the task may still be 
> running.
> This has two problems:
> 1. Using the same task status for two fairly different situations is 
> confusing.
> 2. In the partitioned-but-still-running case, frameworks have no easy way to 
> determine when a task has truly terminated.
> To address these problems, we propose introducing a new task status, 
> TASK_GONE, which would be used whenever a task can be guaranteed to not be 
> running.





[jira] [Updated] (MESOS-5345) Design doc for TASK_LOST_IN_PROGRESS

2016-05-12 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-5345:
---
Summary: Design doc for TASK_LOST_IN_PROGRESS  (was: Design doc for 
TASK_GONE)

> Design doc for TASK_LOST_IN_PROGRESS
> 
>
> Key: MESOS-5345
> URL: https://issues.apache.org/jira/browse/MESOS-5345
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Neil Conway
>  Labels: mesosphere
>






[jira] [Updated] (MESOS-5344) Revise TaskStatus semantics

2016-05-12 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-5344:
---
Description: 
This epic covers two related tasks:
1. Clarifying the semantics of TASK_LOST and allowing frameworks to learn when a 
task is *truly* lost (i.e., not running), versus the current LOST semantics of 
"may or may not be running".
2. Allowing frameworks to control how partitioned tasks are handled.


  was:
The TASK_LOST task status describes two different situations: (a) the task was 
not launched because of an error (e.g., insufficient available resources), or 
(b) the master lost contact with a running task (e.g., due to a network 
partition); the master will kill the task when it can (e.g., when the network 
partition heals), but in the meantime the task may still be running.

This has two problems:
1. Using the same task status for two fairly different situations is confusing.
2. In the partitioned-but-still-running case, frameworks have no easy way to 
determine when a task has truly terminated.

To address these problems, we propose introducing a new task status, TASK_GONE, 
which would be used whenever a task can be guaranteed to not be running.


> Revise TaskStatus semantics
> ---
>
> Key: MESOS-5344
> URL: https://issues.apache.org/jira/browse/MESOS-5344
> Project: Mesos
>  Issue Type: Epic
>  Components: master
>Reporter: Neil Conway
>  Labels: mesosphere
>
> This epic covers two related tasks:
> 1. Clarifying the semantics of TASK_LOST and allowing frameworks to learn when 
> a task is *truly* lost (i.e., not running), versus the current LOST semantics 
> of "may or may not be running".
> 2. Allowing frameworks to control how partitioned tasks are handled.





[jira] [Updated] (MESOS-5345) Design doc for TASK_LOST_IN_PROGRESS

2016-05-12 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-5345:
---
Description: 
The TASK_LOST task status describes two different situations: (a) the task was 
not launched because of an error (e.g., insufficient available resources), or 
(b) the master lost contact with a running task (e.g., due to a network 
partition); the master will kill the task when it can (e.g., when the network 
partition heals), but in the meantime the task may still be running.

This has two problems:
1. Using the same task status for two fairly different situations is confusing.
2. In the partitioned-but-still-running case, frameworks have no easy way to 
determine when a task has truly terminated.

To address these problems, we propose introducing a new task status, 
TASK_LOST_IN_PROGRESS. If a framework opts into this behavior using a new 
capability, TASK_LOST would mean "the task is definitely not running", whereas 
TASK_LOST_IN_PROGRESS would mean "the task may or may not be running (we've 
lost contact with the agent), but the master will try to shut it down when 
possible."
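The proposed split could be sketched roughly as follows (all names are hypothetical stand-ins, not the actual Mesos protobuf enums or capability machinery):

```cpp
#include <cassert>

// Illustrative sketch only: with the opt-in capability, TASK_LOST means
// "definitely not running", while TASK_LOST_IN_PROGRESS means "may or may
// not be running; the master will shut it down when possible".
enum class TaskState { TASK_LOST, TASK_LOST_IN_PROGRESS };

bool definitelyNotRunning(TaskState state, bool hasCapability)
{
  if (!hasCapability) {
    // Legacy semantics: TASK_LOST is ambiguous, so we cannot conclude
    // the task has terminated.
    return false;
  }
  return state == TaskState::TASK_LOST;
}
```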

> Design doc for TASK_LOST_IN_PROGRESS
> 
>
> Key: MESOS-5345
> URL: https://issues.apache.org/jira/browse/MESOS-5345
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Neil Conway
>  Labels: mesosphere
>
> The TASK_LOST task status describes two different situations: (a) the task 
> was not launched because of an error (e.g., insufficient available 
> resources), or (b) the master lost contact with a running task (e.g., due to 
> a network partition); the master will kill the task when it can (e.g., when 
> the network partition heals), but in the meantime the task may still be 
> running.
> This has two problems:
> 1. Using the same task status for two fairly different situations is 
> confusing.
> 2. In the partitioned-but-still-running case, frameworks have no easy way to 
> determine when a task has truly terminated.
> To address these problems, we propose introducing a new task status, 
> TASK_LOST_IN_PROGRESS. If a framework opts into this behavior using a new 
> capability, TASK_LOST would mean "the task is definitely not running", 
> whereas TASK_LOST_IN_PROGRESS would mean "the task may or may not be running 
> (we've lost contact with the agent), but the master will try to shut it down 
> when possible."





[jira] [Commented] (MESOS-5345) Design doc for TASK_LOST_IN_PROGRESS

2016-05-12 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281443#comment-15281443
 ] 

Neil Conway commented on MESOS-5345:


Revised design doc: 
https://docs.google.com/document/d/1D2mJnwuC1qlT_SJGspfj4MdAQXflESCqKANY0Pj4644

> Design doc for TASK_LOST_IN_PROGRESS
> 
>
> Key: MESOS-5345
> URL: https://issues.apache.org/jira/browse/MESOS-5345
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Neil Conway
>  Labels: mesosphere
>
> The TASK_LOST task status describes two different situations: (a) the task 
> was not launched because of an error (e.g., insufficient available 
> resources), or (b) the master lost contact with a running task (e.g., due to 
> a network partition); the master will kill the task when it can (e.g., when 
> the network partition heals), but in the meantime the task may still be 
> running.
> This has two problems:
> 1. Using the same task status for two fairly different situations is 
> confusing.
> 2. In the partitioned-but-still-running case, frameworks have no easy way to 
> determine when a task has truly terminated.
> To address these problems, we propose introducing a new task status, 
> TASK_LOST_IN_PROGRESS. If a framework opts into this behavior using a new 
> capability, TASK_LOST would mean "the task is definitely not running", 
> whereas TASK_LOST_IN_PROGRESS would mean "the task may or may not be running 
> (we've lost contact with the agent), but the master will try to shut it down 
> when possible."





[jira] [Created] (MESOS-5369) Coarse-grained authorization of endpoints is supported only for short url paths.

2016-05-12 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-5369:
--

 Summary: Coarse-grained authorization of endpoints is supported 
only for short url paths.
 Key: MESOS-5369
 URL: https://issues.apache.org/jira/browse/MESOS-5369
 Project: Mesos
  Issue Type: Bug
Affects Versions: 0.29.0
Reporter: Alexander Rukletsov


For coarse-grained authorization actions, e.g., {{GET_ENDPOINT_WITH_PATH}}, we 
currently pass the short version of the url path, i.e., {{/state}} instead of 
{{/master/state}}, to the authorizer in some cases. This means that ACLs for 
the local authorizer will not work as expected if absolute paths are used. 
Moreover, both local and modularized authorizers should be able to understand 
both short url paths for endpoints that belong to the "major" actor process 
(e.g., master, agent) and absolute url paths for all other actors (e.g., 
{{/files/browse}}, {{/metrics/snapshot}}).

One possible solution is to pass absolute paths to authorizers and let them do 
the necessary processing, e.g., removing agent id from {{/slave(id)/state}}. 
This will also require normalizing endpoints from ACLs 
to absolute path form, similar to what we did in MESOS-3143. Additionally, this 
solution removes the ambiguity that may arise when the same endpoint belongs to 
different actors, e.g., {{/master/flags}} vs. {{/slave/flags}}.

Here are some code snippets to illustrate the problem and the reasons:
* 
https://github.com/apache/mesos/blob/eaf0d3461b3f17c9037490e873f114c2ee1c14d9/src/slave/http.cpp#L824-L833
* 
https://github.com/apache/mesos/blob/0104e7349a0539f38d02a0e7e23b7712ebefc201/3rdparty/libprocess/src/process.cpp#L2398
* 
https://github.com/apache/mesos/blob/0104e7349a0539f38d02a0e7e23b7712ebefc201/src/master/main.cpp#L247
* 
https://github.com/apache/mesos/blob/0104e7349a0539f38d02a0e7e23b7712ebefc201/3rdparty/libprocess/src/process.cpp#L2875
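The normalization idea could be sketched as follows (a minimal illustration with a hypothetical helper, not code from the Mesos tree): strip the "(id)" suffix from the leading actor component so ACLs can always be written against absolute paths.

```cpp
#include <string>

// Hypothetical helper (not part of Mesos): normalize an endpoint path to
// absolute form before passing it to an authorizer, e.g.
// "/slave(1)/state" -> "/slave/state". Paths without an "(id)" suffix in
// the first component are returned unchanged.
std::string normalizeEndpointPath(const std::string& path)
{
  const size_t open = path.find('(');
  if (open == std::string::npos) {
    return path;
  }

  const size_t close = path.find(')', open);
  const size_t slash = path.find('/', 1);

  // Only strip when the "(id)" sits inside the first path component.
  if (close != std::string::npos &&
      (slash == std::string::npos || close < slash)) {
    return path.substr(0, open) + path.substr(close + 1);
  }

  return path;
}
```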





[jira] [Commented] (MESOS-5369) Coarse-grained authorization of endpoints is supported only for short url paths.

2016-05-12 Thread Alexander Rojas (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281535#comment-15281535
 ] 

Alexander Rojas commented on MESOS-5369:


We faced this issue while implementing the firewall. Some solutions were 
proposed:

# Add a regex library (discarded because it would add another third-party 
binary dependency, and GCC did not yet support C++11 regex).
# Implement our own wildcard mechanism (discarded because it would add extra 
abstractions to the Mesos codebase).

In the end we just added an MVP requirement of knowing the suffix of the 
process you want to hit. The idea was to postpone the final solution to a later 
iteration, though in truth we never got to that.

> Coarse-grained authorization of endpoints is supported only for short url 
> paths.
> 
>
> Key: MESOS-5369
> URL: https://issues.apache.org/jira/browse/MESOS-5369
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.29.0
>Reporter: Alexander Rukletsov
>  Labels: authorization, mesosphere, security
>
> For coarse-grained authorization actions, e.g., {{GET_ENDPOINT_WITH_PATH}}, 
> we currently pass the short version of the url path, i.e., {{/state}} instead 
> of {{/master/state}}, to the authorizer in some cases. This means that ACLs 
> for the local authorizer will not work as expected if absolute paths are used. 
> Moreover, both local and modularized authorizers should be able to understand 
> both short url paths for endpoints that belong to the "major" actor process 
> (e.g., master, agent) and absolute url paths for all other actors (e.g., 
> {{/files/browse}}, {{/metrics/snapshot}}).
> One possible solution is to pass absolute paths to authorizers and let them 
> do the necessary processing, e.g., removing agent id from 
> {{/slave(id)/state}}. This will also require normalizing endpoints from ACLs 
> to absolute path form, similar to what we did in MESOS-3143. Additionally, 
> this solution removes the ambiguity that may arise when the same endpoint 
> belongs to different actors, e.g., {{/master/flags}} vs. {{/slave/flags}}.
> Here are some code snippets to illustrate the problem and the reasons:
> * 
> https://github.com/apache/mesos/blob/eaf0d3461b3f17c9037490e873f114c2ee1c14d9/src/slave/http.cpp#L824-L833
> * 
> https://github.com/apache/mesos/blob/0104e7349a0539f38d02a0e7e23b7712ebefc201/3rdparty/libprocess/src/process.cpp#L2398
> * 
> https://github.com/apache/mesos/blob/0104e7349a0539f38d02a0e7e23b7712ebefc201/src/master/main.cpp#L247
> * 
> https://github.com/apache/mesos/blob/0104e7349a0539f38d02a0e7e23b7712ebefc201/3rdparty/libprocess/src/process.cpp#L2875





[jira] [Assigned] (MESOS-5368) Consider introducing persistent agent ID

2016-05-12 Thread Abhishek Dasgupta (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Dasgupta reassigned MESOS-5368:


Assignee: Abhishek Dasgupta

> Consider introducing persistent agent ID
> 
>
> Key: MESOS-5368
> URL: https://issues.apache.org/jira/browse/MESOS-5368
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Neil Conway
>Assignee: Abhishek Dasgupta
>  Labels: mesosphere
>
> Currently, agent IDs identify a single "session" by an agent: that is, an 
> agent receives an agent ID when it registers with the master; it reuses that 
> agent ID if it disconnects and successfully reregisters; if the agent shuts 
> down and restarts, it registers anew and receives a new agent ID.
> It would be convenient to have a "persistent agent ID" that remains the same 
> for the duration of a given agent {{work_dir}}. This would mean that a given 
> persistent volume would not migrate between different agent IDs over time, 
> for example (see MESOS-4894). If we supported permanently removing an agent 
> from the cluster (i.e., the {{work_dir}} and any volumes used by the agent 
> will never be reused), we could use the persistent agent ID to report which 
> agent has been removed.





[jira] [Commented] (MESOS-5368) Consider introducing persistent agent ID

2016-05-12 Thread Abhishek Dasgupta (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281555#comment-15281555
 ] 

Abhishek Dasgupta commented on MESOS-5368:
--

+1

> Consider introducing persistent agent ID
> 
>
> Key: MESOS-5368
> URL: https://issues.apache.org/jira/browse/MESOS-5368
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Neil Conway
>Assignee: Abhishek Dasgupta
>  Labels: mesosphere
>
> Currently, agent IDs identify a single "session" by an agent: that is, an 
> agent receives an agent ID when it registers with the master; it reuses that 
> agent ID if it disconnects and successfully reregisters; if the agent shuts 
> down and restarts, it registers anew and receives a new agent ID.
> It would be convenient to have a "persistent agent ID" that remains the same 
> for the duration of a given agent {{work_dir}}. This would mean that a given 
> persistent volume would not migrate between different agent IDs over time, 
> for example (see MESOS-4894). If we supported permanently removing an agent 
> from the cluster (i.e., the {{work_dir}} and any volumes used by the agent 
> will never be reused), we could use the persistent agent ID to report which 
> agent has been removed.





[jira] [Updated] (MESOS-5298) Unset slave authenticator during tear-down of every test case that has authentication enabled.

2016-05-12 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-5298:
---
Labels: mesosphere tech-debt-test  (was: )
Issue Type: Improvement  (was: Wish)

> Unset slave authenticator during tear-down of every test case that has 
> authentication enabled.
> --
>
> Key: MESOS-5298
> URL: https://issues.apache.org/jira/browse/MESOS-5298
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave, tests
>Reporter: Jan Schlicht
>  Labels: mesosphere, tech-debt-test
>
> When Mesos agents are started with authentication enabled, 
> {{process::http::authentication::setAuthenticator}} is called to enable the 
> HTTP authenticator at the libprocess level. This authenticator is never 
> unset, which is fine for the general use case, because libprocess's lifetime 
> is tied to the agent's lifetime.
> In a test fixture the situation is different, though. The lifetime of 
> libprocess is tied to the lifetime of the fixture. As a consequence, a test 
> case that wants to disable HTTP authentication of an agent needs to manually 
> unset the authenticator as it already may have been set by a different test 
> case of the fixture.
> The naive solution would be to add the unset call to 
> {{cluster::Slave::~Slave}} but that could cause problems in test cases with 
> multiple slaves. A better solution would be to unset the HTTP authenticator 
> during the teardown of a test case that used agents with enabled HTTP 
> authentication.
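The teardown approach described above could look roughly like this (an illustrative sketch with stand-in types; the real code would call {{process::http::authentication::setAuthenticator}}/{{unsetAuthenticator}} and a gtest fixture):

```cpp
#include <memory>
#include <string>

// Stand-in for the global HTTP authenticator registered with libprocess.
static std::unique_ptr<std::string> authenticator;

// Stand-ins for the libprocess set/unset calls.
void setAuthenticator(const std::string& realm)
{
  authenticator.reset(new std::string(realm));
}

void unsetAuthenticator()
{
  authenticator.reset();
}

struct TestFixture
{
  ~TestFixture()
  {
    // Teardown: always unset, so a test that enabled HTTP authentication
    // cannot leak the authenticator into later test cases.
    unsetAuthenticator();
  }
};
```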





[jira] [Updated] (MESOS-970) Upgrade leveldb to 1.18.

2016-05-12 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-970:
---
Summary: Upgrade leveldb to 1.18.  (was: Upgrade leveldb.)

> Upgrade leveldb to 1.18.
> 
>
> Key: MESOS-970
> URL: https://issues.apache.org/jira/browse/MESOS-970
> Project: Mesos
>  Issue Type: Improvement
>  Components: replicated log
>Reporter: Benjamin Mahler
>
> We currently bundle leveldb 1.4, and the latest version is leveldb 1.15.
> A careful review of the fixes and changes in each release would be prudent. 
> Regression testing and performance testing would also be prudent, given the 
> replicated log is built on leveldb.





[jira] [Updated] (MESOS-970) Upgrade bundled leveldb to 1.18

2016-05-12 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-970:
---
Summary: Upgrade bundled leveldb to 1.18  (was: Upgrade leveldb to 1.18)

> Upgrade bundled leveldb to 1.18
> ---
>
> Key: MESOS-970
> URL: https://issues.apache.org/jira/browse/MESOS-970
> Project: Mesos
>  Issue Type: Improvement
>  Components: replicated log
>Reporter: Benjamin Mahler
>
> We currently bundle leveldb 1.4, and the latest version is leveldb 1.15.
> A careful review of the fixes and changes in each release would be prudent. 
> Regression testing and performance testing would also be prudent, given the 
> replicated log is built on leveldb.





[jira] [Updated] (MESOS-970) Upgrade leveldb to 1.18

2016-05-12 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-970:
---
Summary: Upgrade leveldb to 1.18  (was: Upgrade leveldb to 1.18.)

> Upgrade leveldb to 1.18
> ---
>
> Key: MESOS-970
> URL: https://issues.apache.org/jira/browse/MESOS-970
> Project: Mesos
>  Issue Type: Improvement
>  Components: replicated log
>Reporter: Benjamin Mahler
>
> We currently bundle leveldb 1.4, and the latest version is leveldb 1.15.
> A careful review of the fixes and changes in each release would be prudent. 
> Regression testing and performance testing would also be prudent, given the 
> replicated log is built on leveldb.





[jira] [Updated] (MESOS-970) Upgrade bundled leveldb to 1.18

2016-05-12 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-970:
---
Description: 
We currently bundle leveldb 1.4, and the latest version is leveldb 1.18.

Upgrading to 1.18 could solve problems when building Mesos on some non-x86 
CPU architectures.

  was:
We currently bundle leveldb 1.4, and the latest version is leveldb 1.15.

A careful review of the fixes and changes in each release would be prudent. 
Regression testing and performance testing would also be prudent, given the 
replicated log is built on leveldb.


> Upgrade bundled leveldb to 1.18
> ---
>
> Key: MESOS-970
> URL: https://issues.apache.org/jira/browse/MESOS-970
> Project: Mesos
>  Issue Type: Improvement
>  Components: replicated log
>Reporter: Benjamin Mahler
>
> We currently bundle leveldb 1.4, and the latest version is leveldb 1.18.
> Upgrading to 1.18 could solve problems when building Mesos on some non-x86 
> CPU architectures.





[jira] [Commented] (MESOS-5279) DRF sorter add/activate doesn't check if it's adding a duplicate entry

2016-05-12 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281855#comment-15281855
 ] 

Yan Xu commented on MESOS-5279:
---

So in the reviews, {{activate}} and {{add}} are treated differently: 
- {{activate}} is made idempotent: activating a client that is already active 
is simply a no-op.
- {{add}} doesn't allow adding a client that is already in the sorter, because 
it takes another argument, {{weight}}, which could be different.
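That distinction could be sketched as follows (a simplified stand-in, not the actual {{DRFSorter}} interface):

```cpp
#include <map>
#include <set>
#include <string>

// Illustrative sketch only: `add` rejects duplicates because the weight
// argument could differ between calls, while `activate` is idempotent.
class Sorter
{
public:
  bool add(const std::string& client, double weight)
  {
    // Adding an already-known client is a caller error: refuse it.
    if (weights.count(client) > 0) {
      return false;
    }
    weights[client] = weight;
    return true;
  }

  void activate(const std::string& client)
  {
    // Activating an already-active client is simply a no-op.
    active.insert(client);
  }

private:
  std::map<std::string, double> weights;
  std::set<std::string> active;
};
```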

> DRF sorter add/activate doesn't check if it's adding a duplicate entry
> --
>
> Key: MESOS-5279
> URL: https://issues.apache.org/jira/browse/MESOS-5279
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: Yan Xu
>Assignee: Yan Xu
>
> Currently the sorter relies on the caller to make sure the sorter is in a 
> good state when add/activate is called. It's not as defensive against caller 
> mistakes as it should be. It's never an acceptable result if duplicates are 
> added to {{DRFSorter::clients}}.





[jira] [Comment Edited] (MESOS-5279) DRF sorter add/activate doesn't check if it's adding a duplicate entry

2016-05-12 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281851#comment-15281851
 ] 

Yan Xu edited comment on MESOS-5279 at 5/12/16 6:12 PM:


https://reviews.apache.org/r/47257/
https://reviews.apache.org/r/47258/
https://reviews.apache.org/r/47259/


was (Author: xujyan):
https://reviews.apache.org/r/47257/
https://reviews.apache.org/r/47258/
https://reviews.apache.org/r/47259/
https://reviews.apache.org/r/47260/

> DRF sorter add/activate doesn't check if it's adding a duplicate entry
> --
>
> Key: MESOS-5279
> URL: https://issues.apache.org/jira/browse/MESOS-5279
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: Yan Xu
>Assignee: Yan Xu
>
> Currently the sorter relies on the caller to make sure the sorter is in a 
> good state when add/activate is called. It is not as defensive against 
> caller mistakes as it should be. It is never acceptable for duplicates to be 
> added to {{DRFSorter::clients}}.





[jira] [Created] (MESOS-5370) Add deprecation support for Flags

2016-05-12 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-5370:
-

 Summary: Add deprecation support for Flags
 Key: MESOS-5370
 URL: https://issues.apache.org/jira/browse/MESOS-5370
 Project: Mesos
  Issue Type: Improvement
Reporter: Vinod Kone
Assignee: Vinod Kone


MESOS-5271 adds support for a flag name to have an alias. This ticket captures 
the work needed to add deprecation support. The idea is for the caller to 
explicitly specify deprecation via `FlagsBase::add()` and get a list of 
deprecation warnings when doing `FlagsBase::load()`.
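A minimal sketch of what such alias resolution could look like. This is not the actual stout `FlagsBase` API; the `FlagSpec`, `LoadResult`, and `resolve` names are illustrative assumptions used only to show the idea of resolving a deprecated alias to a canonical flag name while collecting warnings:

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical flag description: a canonical name plus an optional
// deprecated alias (empty string means no alias).
struct FlagSpec {
  std::string name;
  std::string deprecatedAlias;
};

struct LoadResult {
  std::map<std::string, std::string> values;  // canonical name -> value
  std::vector<std::string> warnings;          // deprecation warnings
};

// Resolve command-line values against the specs: prefer the canonical
// name, fall back to the deprecated alias and record a warning.
LoadResult resolve(const std::vector<FlagSpec>& specs,
                   const std::map<std::string, std::string>& argv) {
  LoadResult result;
  for (const FlagSpec& spec : specs) {
    auto it = argv.find(spec.name);
    if (it != argv.end()) {
      result.values[spec.name] = it->second;
      continue;
    }
    if (!spec.deprecatedAlias.empty()) {
      it = argv.find(spec.deprecatedAlias);
      if (it != argv.end()) {
        result.values[spec.name] = it->second;
        result.warnings.push_back(
            "Flag '" + spec.deprecatedAlias + "' is deprecated; use '" +
            spec.name + "' instead");
      }
    }
  }
  return result;
}
```

Under this sketch, loading `--authenticate=true` against a spec whose canonical name is `authenticate_frameworks` would yield the canonical value plus one deprecation warning.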





[jira] [Updated] (MESOS-5271) Add alias support for Flags

2016-05-12 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-5271:
--
Shepherd: Michael Park  (was: Benjamin Mahler)

Split out the work needed for deprecation support into a separate ticket 
MESOS-5370.

> Add alias support for Flags
> ---
>
> Key: MESOS-5271
> URL: https://issues.apache.org/jira/browse/MESOS-5271
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Vinod Kone
>Assignee: Vinod Kone
>
> Currently there is no support for a flag to have an alias. Such support would 
> be useful to rename/deprecate a flag.
> For example, for MESOS-4386, we could let the flag have `--authenticate` name 
> and a `--authenticate_frameworks` alias. The alias can be marked as 
> deprecated (need to add support for this as well).
> This support will also be useful for slave/agent flag rename. See MESOS-3781 
> for details.





[jira] [Commented] (MESOS-2201) ReplicaTest.Restore fails with leveldb greater than v1.7.

2016-05-12 Thread Bing Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281888#comment-15281888
 ] 

Bing Li commented on MESOS-2201:


I was working on MESOS-5288.
ReplicaTest.Restore failed on SLES12SP1 with leveldb 1.18 on s390x and I 
confirm that the proposed fix resolves the problem.

> ReplicaTest.Restore fails with leveldb greater than v1.7.
> -
>
> Key: MESOS-2201
> URL: https://issues.apache.org/jira/browse/MESOS-2201
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.29.0
> Environment: E.g. Ubuntu 14.04.4 LTS + leveldb 1.10
>Reporter: Kapil Arya
>Assignee: Tomasz Janiszewski
>Priority: Minor
>  Labels: mesosphere
> Fix For: 0.29.0
>
>
> I wanted to configure Mesos with system provided leveldb libraries when I ran 
> into this issue. Apparently,  if one does {{../configure 
> --with-leveldb=/path/to/leveldb}}, compilation succeeds, however the 
> "ReplicaTest_Restore" test fails with the following back trace:
> {code}
> [ RUN  ] ReplicaTest.Restore
> Using temporary directory '/tmp/ReplicaTest_Restore_IZbbRR'
> I1222 14:16:49.517500  2927 leveldb.cpp:176] Opened db in 10.758917ms
> I1222 14:16:49.526495  2927 leveldb.cpp:183] Compacted db in 8.931146ms
> I1222 14:16:49.526523  2927 leveldb.cpp:198] Created db iterator in 5787ns
> I1222 14:16:49.526531  2927 leveldb.cpp:204] Seeked to beginning of db in 
> 511ns
> I1222 14:16:49.526535  2927 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 197ns
> I1222 14:16:49.526623  2927 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I1222 14:16:49.530972  2945 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 3.084458ms
> I1222 14:16:49.531008  2945 replica.cpp:320] Persisted replica status to 
> VOTING
> I1222 14:16:49.541263  2927 leveldb.cpp:176] Opened db in 9.980586ms
> I1222 14:16:49.551636  2927 leveldb.cpp:183] Compacted db in 10.348096ms
> I1222 14:16:49.551683  2927 leveldb.cpp:198] Created db iterator in 3405ns
> I1222 14:16:49.551693  2927 leveldb.cpp:204] Seeked to beginning of db in 
> 3559ns
> I1222 14:16:49.551728  2927 leveldb.cpp:273] Iterated through 1 keys in the 
> db in 29722ns
> I1222 14:16:49.551751  2927 replica.cpp:741] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I1222 14:16:49.551996  2947 replica.cpp:474] Replica received implicit 
> promise request with proposal 1
> I1222 14:16:49.560921  2947 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 8.899591ms
> I1222 14:16:49.560940  2947 replica.cpp:342] Persisted promised to 1
> I1222 14:16:49.561338  2943 replica.cpp:508] Replica received write request 
> for position 1
> I1222 14:16:49.568677  2943 leveldb.cpp:343] Persisting action (27 bytes) to 
> leveldb took 7.287155ms
> I1222 14:16:49.568692  2943 replica.cpp:676] Persisted action at 1
> I1222 14:16:49.569042  2942 leveldb.cpp:438] Reading position from leveldb 
> took 26339ns
> F1222 14:16:49.569411  2927 replica.cpp:721] CHECK_SOME(state): IO error: 
> lock /tmp/ReplicaTest_Restore_IZbbRR/.log/LOCK: already held by process 
> Failed to recover the log
> *** Check failure stack trace: ***
> @ 0x7f7f6c53e688  google::LogMessage::Fail()
> @ 0x7f7f6c53e5e7  google::LogMessage::SendToLog()
> @ 0x7f7f6c53dff8  google::LogMessage::Flush()
> @ 0x7f7f6c540d2c  google::LogMessageFatal::~LogMessageFatal()
> @   0x90a520  _CheckFatal::~_CheckFatal()
> @ 0x7f7f6c400f4d  mesos::internal::log::ReplicaProcess::restore()
> @ 0x7f7f6c3fd763  
> mesos::internal::log::ReplicaProcess::ReplicaProcess()
> @ 0x7f7f6c401271  mesos::internal::log::Replica::Replica()
> @   0xcd7ca3  ReplicaTest_Restore_Test::TestBody()
> @  0x10934b2  
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x108e584  
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x10768fd  testing::Test::Run()
> @  0x1077020  testing::TestInfo::Run()
> @  0x10775a8  testing::TestCase::Run()
> @  0x107c324  testing::internal::UnitTestImpl::RunAllTests()
> @  0x1094348  
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @  0x108f2b7  
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @  0x107b1d4  testing::UnitTest::Run()
> @   0xd344a9  main
> @ 0x7f7f66fdfb45  __libc_start_main
> @   0x8f3549  (unknown)
> @  (nil)  (unknown)
> [2]2927 abort (core dumped)  GLOG_logtostderr=1 GTEST_v=10 
> ./bin/mesos-tests.sh --verbose
> {code}
> The bundled version 

[jira] [Created] (MESOS-5371) Implement `fcntl.hpp`

2016-05-12 Thread Alex Clemmer (JIRA)
Alex Clemmer created MESOS-5371:
---

 Summary: Implement `fcntl.hpp`
 Key: MESOS-5371
 URL: https://issues.apache.org/jira/browse/MESOS-5371
 Project: Mesos
  Issue Type: Bug
  Components: stout
Reporter: Alex Clemmer
Assignee: Alex Clemmer


`fcntl.hpp` has a bunch of functions that will never work on Windows. We will 
need to work around them, either by working around specific call sites of 
functions like `os::cloexec`, or by implementing something that keeps track of 
which file descriptors are cloexec, and which aren't.
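The second workaround mentioned above (tracking which descriptors are cloexec) could be sketched as a small process-wide registry. This is purely illustrative; the class and method names are assumptions, not anything in stout:

```cpp
#include <mutex>
#include <set>

// Hypothetical sketch: since Windows has no FD_CLOEXEC, keep a
// process-wide registry of descriptors that should not be inherited
// by child processes. A spawn helper would consult this registry
// when deciding which handles to pass to the child.
class CloexecRegistry {
public:
  void markCloexec(int fd) {
    std::lock_guard<std::mutex> lock(mutex_);
    cloexec_.insert(fd);
  }

  void unmarkCloexec(int fd) {
    std::lock_guard<std::mutex> lock(mutex_);
    cloexec_.erase(fd);
  }

  bool isCloexec(int fd) {
    std::lock_guard<std::mutex> lock(mutex_);
    return cloexec_.count(fd) > 0;
  }

private:
  std::mutex mutex_;       // registry may be queried from multiple threads
  std::set<int> cloexec_;  // descriptors marked close-on-exec
};
```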





[jira] [Updated] (MESOS-4259) mesos HA can't delete the the redundant container on failure slave node.

2016-05-12 Thread jhiuvg (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jhiuvg updated MESOS-4259:
--
Attachment: canon.pdf

> mesos HA can't delete the  the redundant container on failure slave node.
> -
>
> Key: MESOS-4259
> URL: https://issues.apache.org/jira/browse/MESOS-4259
> Project: Mesos
>  Issue Type: Bug
>  Components: docker, framework
>Affects Versions: 0.25.0
>Reporter: wangqun
>Priority: Critical
>  Labels: patch
> Fix For: 0.25.0
>
> Attachments: canon.pdf
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> We have set up one Mesos cluster: one master node and two slave nodes.
> We want to test HA, but we find that Mesos HA can't delete the redundant 
> container on the failed slave node.
> 1. We create one container on a slave node. 
> 2. We stop the slave node hosting the container, and the container is 
> transferred to the remaining slave node. 
> However, if we restore the slave node we stopped, the status of the original 
> container is "exited". We can then start the container manually and it 
> starts up; i.e., two containers are running on different slave nodes.
> We think the original and new containers are duplicates after the migration; 
> they are redundant. Can Mesos delete the redundant container automatically?





[jira] [Commented] (MESOS-970) Upgrade bundled leveldb to 1.18

2016-05-12 Thread Tomasz Janiszewski (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15282005#comment-15282005
 ] 

Tomasz Janiszewski commented on MESOS-970:
--

Review: https://reviews.apache.org/r/47324/

> Upgrade bundled leveldb to 1.18
> ---
>
> Key: MESOS-970
> URL: https://issues.apache.org/jira/browse/MESOS-970
> Project: Mesos
>  Issue Type: Improvement
>  Components: replicated log
>Reporter: Benjamin Mahler
>Assignee: Tomasz Janiszewski
>
> We currently bundle leveldb 1.4, and the latest version is leveldb 1.18.
> Upgrading to 1.18 could solve problems when building Mesos on some non-x86 
> CPU architectures.





[jira] [Assigned] (MESOS-970) Upgrade bundled leveldb to 1.18

2016-05-12 Thread Tomasz Janiszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomasz Janiszewski reassigned MESOS-970:


Assignee: Tomasz Janiszewski

> Upgrade bundled leveldb to 1.18
> ---
>
> Key: MESOS-970
> URL: https://issues.apache.org/jira/browse/MESOS-970
> Project: Mesos
>  Issue Type: Improvement
>  Components: replicated log
>Reporter: Benjamin Mahler
>Assignee: Tomasz Janiszewski
>
> We currently bundle leveldb 1.4, and the latest version is leveldb 1.18.
> Upgrading to 1.18 could solve problems when building Mesos on some non-x86 
> CPU architectures.





[jira] [Created] (MESOS-5372) Add random() to os:: namespace

2016-05-12 Thread Daniel Pravat (JIRA)
Daniel Pravat created MESOS-5372:


 Summary: Add random() to os:: namespace 
 Key: MESOS-5372
 URL: https://issues.apache.org/jira/browse/MESOS-5372
 Project: Mesos
  Issue Type: Improvement
  Components: stout
Affects Versions: 0.29.0
Reporter: Daniel Pravat
Assignee: Daniel Pravat
 Fix For: 0.29.0


The function "random()" is not available in Windows. After this improvement the 
calls to "os::random()" will result in calls to "::random()" on POSIX and 
"::rand()" on Windows.  





[jira] [Created] (MESOS-5373) Remove `Zookeeper's` NTDDI_VERSION define

2016-05-12 Thread Daniel Pravat (JIRA)
Daniel Pravat created MESOS-5373:


 Summary: Remove `Zookeeper's` NTDDI_VERSION define
 Key: MESOS-5373
 URL: https://issues.apache.org/jira/browse/MESOS-5373
 Project: Mesos
  Issue Type: Improvement
  Components: general
Affects Versions: 0.29.0
Reporter: Daniel Pravat
Assignee: Daniel Pravat
 Fix For: 0.29.0


The ZooKeeper client library defines NTDDI_VERSION to 0x0400 in "winconfig.h". 
While this API level is sufficient to compile the client library, Mesos has to 
use a newer API set. After this improvement, the code will compile with the 
latest NTDDI_VERSION.






[jira] [Assigned] (MESOS-3094) Mesos on Windows

2016-05-12 Thread Daniel Pravat (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Pravat reassigned MESOS-3094:


Assignee: Daniel Pravat  (was: Alex Clemmer)

> Mesos on Windows
> 
>
> Key: MESOS-3094
> URL: https://issues.apache.org/jira/browse/MESOS-3094
> Project: Mesos
>  Issue Type: Epic
>  Components: containerization, libprocess, stout
>Reporter: Joseph Wu
>Assignee: Daniel Pravat
>  Labels: mesosphere
>
> The ultimate goal of this is to have all containerizer tests running and 
> passing on Windows Server.
> # It must build (see MESOS-898).
> # All OS-specific code (that is touched by the containerizer) must be ported 
> to Windows.
> # The containerizer itself must be ported to Windows, alongside the 
> MesosContainerizer.
> Note: Isolation (cgroups) will probably not exist on Windows.





[jira] [Updated] (MESOS-3094) Mesos on Windows

2016-05-12 Thread Daniel Pravat (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Pravat updated MESOS-3094:
-
Assignee: Alex Clemmer  (was: Daniel Pravat)

> Mesos on Windows
> 
>
> Key: MESOS-3094
> URL: https://issues.apache.org/jira/browse/MESOS-3094
> Project: Mesos
>  Issue Type: Epic
>  Components: containerization, libprocess, stout
>Reporter: Joseph Wu
>Assignee: Alex Clemmer
>  Labels: mesosphere
>
> The ultimate goal of this is to have all containerizer tests running and 
> passing on Windows Server.
> # It must build (see MESOS-898).
> # All OS-specific code (that is touched by the containerizer) must be ported 
> to Windows.
> # The containerizer itself must be ported to Windows, alongside the 
> MesosContainerizer.
> Note: Isolation (cgroups) will probably not exist on Windows.





[jira] [Created] (MESOS-5374) Add support for Console Ctrl handling in `slave.cpp`

2016-05-12 Thread Daniel Pravat (JIRA)
Daniel Pravat created MESOS-5374:


 Summary: Add support for Console Ctrl handling in `slave.cpp`
 Key: MESOS-5374
 URL: https://issues.apache.org/jira/browse/MESOS-5374
 Project: Mesos
  Issue Type: Improvement
Reporter: Daniel Pravat
Assignee: Daniel Pravat


Extract the supporting code that handles POSIX signals into a separate header, 
and add support for a console Ctrl handler when running on Windows.





[jira] [Created] (MESOS-5375) Implement stout/os/windows/kill.hpp

2016-05-12 Thread Daniel Pravat (JIRA)
Daniel Pravat created MESOS-5375:


 Summary: Implement stout/os/windows/kill.hpp
 Key: MESOS-5375
 URL: https://issues.apache.org/jira/browse/MESOS-5375
 Project: Mesos
  Issue Type: Improvement
Reporter: Daniel Pravat
Assignee: Daniel Pravat


Implement the equivalent functionality on Windows.





[jira] [Commented] (MESOS-5286) Add authorization to libprocess HTTP endpoints

2016-05-12 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15282238#comment-15282238
 ] 

Greg Mann commented on MESOS-5286:
--

{code}
Commit: 25376d8ee9227653a93f546e33be49500b6d9d5c
Parents: acde41a
Author: Greg Mann 
Authored: Wed May 11 22:45:30 2016 -0400
Committer: Kapil Arya 
Committed: Thu May 12 01:50:20 2016 -0400

Enabled authorization of libprocess HTTP endpoints (libprocess).
{code}

{code}
Commit: d0b0ca638ec033b62da8f86cba9e42a955916eb4
Parents: 25376d8
Author: Greg Mann 
Authored: Wed May 11 22:45:37 2016 -0400
Committer: Kapil Arya 
Committed: Thu May 12 01:50:20 2016 -0400

Enabled authorization of libprocess HTTP endpoints (Mesos).
{code}

{code}
Commit: 12b05e837f96acfa7a88926fc331eeba695d2112
Parents: d0b0ca6
Author: Greg Mann 
Authored: Wed May 11 22:45:40 2016 -0400
Committer: Kapil Arya 
Committed: Thu May 12 01:50:20 2016 -0400

Added authorization callback for '/metrics/snapshot'.
{code}

{code}
Commit: d5e1a47dbb5db4af372a079d8cd287392ebe30fa
Parents: 12b05e8
Author: Greg Mann 
Authored: Wed May 11 22:45:43 2016 -0400
Committer: Kapil Arya 
Committed: Thu May 12 01:50:20 2016 -0400

Allowed tests to authorize libprocess HTTP endpoints.
{code}

{code}
Commit: a776785f3ea94ee7e827bd5aa7e37f323b6a2230
Parents: d5e1a47
Author: Greg Mann g...@mesosphere.io
Authored: Wed May 11 22:45:46 2016 -0400
Committer: Kapil Arya ka...@mesosphere.io
Committed: Thu May 12 01:50:20 2016 -0400

Added MetricsTests with authorization.
{code}

{code}
Commit: 1140f6e5c3757034896ec9256a9d118c4331a361
Parents: a776785
Author: Greg Mann g...@mesosphere.io
Authored: Wed May 11 22:45:49 2016 -0400
Committer: Kapil Arya ka...@mesosphere.io
Committed: Thu May 12 01:50:20 2016 -0400

Added authorization callback for '/logging/toggle'.
{code}

{code}
Commit: a5ce87b268bbb9eb0c7fc8e32873d62dcb05d9e4
Parents: 1140f6e
Author: Greg Mann 
Authored: Wed May 11 22:45:52 2016 -0400
Committer: Kapil Arya 
Committed: Thu May 12 01:50:20 2016 -0400

Added a LoggingTest with authorization.
{code}

> Add authorization to libprocess HTTP endpoints
> --
>
> Key: MESOS-5286
> URL: https://issues.apache.org/jira/browse/MESOS-5286
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: mesosphere
> Fix For: 0.29.0
>
>
> Now that the libprocess-level HTTP endpoints have had authentication added to 
> them in MESOS-4902, we can add authorization to them as well. As a first 
> step, we can implement a "coarse-grained" approach, in which a principal is 
> granted or denied access to a given endpoint. We will likely need to register 
> an authorizer with libprocess.





[jira] [Created] (MESOS-5376) Add systemd watchdog support

2016-05-12 Thread David Robinson (JIRA)
David Robinson created MESOS-5376:
-

 Summary: Add systemd watchdog support
 Key: MESOS-5376
 URL: https://issues.apache.org/jira/browse/MESOS-5376
 Project: Mesos
  Issue Type: Improvement
Reporter: David Robinson


It would be great if Mesos had support for systemd's 
[watchdog|http://0pointer.de/blog/projects/watchdog.html]. Users would 
typically use a supervisor like [monit|https://mmonit.com/monit/] to check the 
agent/master's /health endpoint and restart it upon consecutive failures. 
Systemd doesn't support polling services; it uses a watchdog to communicate 
liveness instead. Supervisor solutions like monit could be replaced with 
systemd if Mesos had watchdog support. Note that simply restarting the service 
upon failure (i.e., when the process exits) is not sufficient -- a deadlock 
within Mesos would not cause the process to exit, but a watchdog could detect 
this.
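For reference, the watchdog ping itself is a tiny protocol: send the datagram "WATCHDOG=1" to the unix socket named by $NOTIFY_SOCKET. The sketch below (POSIX only) hand-implements that; a real integration would simply call libsystemd's sd_notify() and derive the ping interval from $WATCHDOG_USEC (typically half of it):

```cpp
#include <cstdlib>
#include <cstring>
#include <string>

#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

// Minimal sketch of the sd_notify(3) wire protocol: send "WATCHDOG=1"
// as a datagram to the socket named by $NOTIFY_SOCKET. Returns false
// if not running under systemd (no NOTIFY_SOCKET) or if the send fails.
bool notifyWatchdog() {
  const char* path = ::getenv("NOTIFY_SOCKET");
  if (path == nullptr || path[0] == '\0') {
    return false;  // no watchdog configured for this service
  }

  int fd = ::socket(AF_UNIX, SOCK_DGRAM, 0);
  if (fd < 0) {
    return false;
  }

  sockaddr_un addr;
  std::memset(&addr, 0, sizeof(addr));
  addr.sun_family = AF_UNIX;
  std::strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
  if (path[0] == '@') {
    addr.sun_path[0] = '\0';  // '@' denotes the abstract socket namespace
  }

  const std::string message = "WATCHDOG=1";
  ssize_t sent = ::sendto(
      fd, message.data(), message.size(), 0,
      reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  ::close(fd);
  return sent == static_cast<ssize_t>(message.size());
}
```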





[jira] [Created] (MESOS-5377) Improve DRF behavior with scarce resources.

2016-05-12 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-5377:
--

 Summary: Improve DRF behavior with scarce resources.
 Key: MESOS-5377
 URL: https://issues.apache.org/jira/browse/MESOS-5377
 Project: Mesos
  Issue Type: Epic
  Components: allocation
Reporter: Benjamin Mahler


The allocator currently uses the notion of Weighted [Dominant Resource 
Fairness|https://www.cs.berkeley.edu/~alig/papers/drf.pdf] (WDRF) to establish 
a linear notion of fairness across allocation roles.

DRF behaves well for resources that are present within each machine in a 
cluster (e.g. CPUs, memory, disk). However, some resources (e.g. GPUs) are only 
present on a subset of machines in the cluster.

Consider the behavior when there are the following agents in a cluster:

1000 agents with (cpus:4,mem:1024,disk:1024)
1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)

If a role wishes to use both GPU and non-GPU resources for tasks, consuming 1 
GPU will lead DRF to consider the role to have a 100% share of the cluster, 
since it consumes 100% of the GPUs in the cluster. This framework will then not 
receive any other offers.

Among possible improvements, fairness could incorporate an understanding of 
resource packages. In a sense, there is 1 GPU package and 1000 non-GPU 
packages being competed for, and consuming the GPU package does not have a 
large effect on the role's access to the 1000 non-GPU packages.

In the interim, we should consider having a recommended way to deal with scarce 
resources in the current model.
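The scenario above follows directly from the DRF definition: a role's dominant share is the maximum, over all resources, of (allocated / total in cluster). A minimal sketch (simplified, unweighted) showing why one GPU yields a 100% share:

```cpp
#include <algorithm>
#include <map>
#include <string>

// Dominant share per DRF: max over resources of allocated/total.
// Resources absent from the cluster total are ignored.
double dominantShare(const std::map<std::string, double>& allocated,
                     const std::map<std::string, double>& total) {
  double share = 0.0;
  for (const auto& entry : allocated) {
    auto it = total.find(entry.first);
    if (it != total.end() && it->second > 0.0) {
      share = std::max(share, entry.second / it->second);
    }
  }
  return share;
}
```

With 1001 agents as described, a role holding 4 cpus and the single GPU has a cpu share of 4/4004 but a GPU share of 1/1, so its dominant share is 1.0 and it sorts behind every other role.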





[jira] [Updated] (MESOS-5377) Improve DRF behavior with scarce resources.

2016-05-12 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-5377:
---
Description: 
The allocator currently uses the notion of Weighted [Dominant Resource 
Fairness|https://www.cs.berkeley.edu/~alig/papers/drf.pdf] (WDRF) to establish 
a linear notion of fairness across allocation roles.

DRF behaves well for resources that are present within each machine in a 
cluster (e.g. CPUs, memory, disk). However, some resources (e.g. GPUs) are only 
present on a subset of machines in the cluster.

Consider the behavior when there are the following agents in a cluster:

1000 agents with (cpus:4,mem:1024,disk:1024)
1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)

If a role wishes to use both GPU and non-GPU resources for tasks, consuming 1 
GPU will lead DRF to consider the role to have a 100% share of the cluster, 
since it consumes 100% of the GPUs in the cluster. This framework will then not 
receive any other offers.

Among possible improvements, fairness could incorporate an understanding of 
resource packages. In a sense, there is 1 GPU package and 1000 non-GPU 
packages being competed for, and ideally a role's consumption of the single 
GPU package does not have a large effect on the role's access to the other 
1000 non-GPU packages.

In the interim, we should consider having a recommended way to deal with scarce 
resources in the current model.

  was:
The allocator currently uses the notion of Weighted [Dominant Resource 
Fairness|https://www.cs.berkeley.edu/~alig/papers/drf.pdf] (WDRF) to establish 
a linear notion of fairness across allocation roles.

DRF behaves well for resources that are present within each machine in a 
cluster (e.g. CPUs, memory, disk). However, some resources (e.g. GPUs) are only 
present on a subset of machines in the cluster.

Consider the behavior when there are the following agents in a cluster:

1000 agents with (cpus:4,mem:1024,disk:1024)
1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)

If a role wishes to use both GPU and non-GPU resources for tasks, consuming 1 
GPU will lead DRF to consider the role to have a 100% share of the cluster, 
since it consumes 100% of the GPUs in the cluster. This framework will then not 
receive any other offers.

Among possible improvements, fairness can have understanding of resource 
packages. In a sense there is 1 GPU package that is competed on and 1000 
non-GPU packages competed on, and consuming the GPU package does not have a 
large effect on the role's access to the 1000 non-GPU packages.

In the interim, we should consider having a recommended way to deal with scarce 
resources in the current model.


> Improve DRF behavior with scarce resources.
> ---
>
> Key: MESOS-5377
> URL: https://issues.apache.org/jira/browse/MESOS-5377
> Project: Mesos
>  Issue Type: Epic
>  Components: allocation
>Reporter: Benjamin Mahler
>
> The allocator currently uses the notion of Weighted [Dominant Resource 
> Fairness|https://www.cs.berkeley.edu/~alig/papers/drf.pdf] (WDRF) to 
> establish a linear notion of fairness across allocation roles.
> DRF behaves well for resources that are present within each machine in a 
> cluster (e.g. CPUs, memory, disk). However, some resources (e.g. GPUs) are 
> only present on a subset of machines in the cluster.
> Consider the behavior when there are the following agents in a cluster:
> 1000 agents with (cpus:4,mem:1024,disk:1024)
> 1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)
> If a role wishes to use both GPU and non-GPU resources for tasks, consuming 1 
> GPU will lead DRF to consider the role to have a 100% share of the cluster, 
> since it consumes 100% of the GPUs in the cluster. This framework will then 
> not receive any other offers.
> Among possible improvements, fairness could incorporate an understanding of 
> resource packages. In a sense, there is 1 GPU package and 1000 non-GPU 
> packages being competed for, and ideally a role's consumption of the single 
> GPU package does not have a large effect on the role's access to the other 
> 1000 non-GPU packages.
> In the interim, we should consider having a recommended way to deal with 
> scarce resources in the current model.





[jira] [Created] (MESOS-5378) Terminating a framework during master failover leads to orphaned tasks

2016-05-12 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-5378:


 Summary: Terminating a framework during master failover leads to 
orphaned tasks
 Key: MESOS-5378
 URL: https://issues.apache.org/jira/browse/MESOS-5378
 Project: Mesos
  Issue Type: Bug
  Components: framework, master
Affects Versions: 0.28.1, 0.27.2
Reporter: Joseph Wu


Repro steps:

1) Setup:
{code}
bin/mesos-master.sh --work_dir=/tmp/master
bin/mesos-slave.sh --work_dir=/tmp/slave --master=localhost:5050
src/mesos-execute --checkpoint --command="sleep 1000" --master=localhost:5050 
--name="test"
{code}

2) Kill all three from (1), in the order they were started.

3) Restart the master and agent.  Do not restart the framework.

Result)
* The agent will reconnect to an orphaned task.
* The Web UI will report no memory usage
* {{curl localhost:5050/metrics/snapshot}} will say:  {{"master/mem_used": 
128,}}

Cause) 
When a framework registers with the master, it provides a {{failover_timeout}}, 
in case the framework disconnects.  If the framework disconnects and does not 
reconnect within this {{failover_timeout}}, the master will kill all tasks 
belonging to the framework.

However, the master does not persist this {{failover_timeout}} across master 
failover.  The master will "forget" about a framework if:
1) The master dies before {{failover_timeout}} passes.
2) The framework dies while the master is dead.

When the master comes back up, the agent will re-register.  The agent will 
report the orphaned task(s).  Because the master failed over, it does not know 
these tasks are orphans (i.e. it thinks the frameworks might re-register).

Proposed solution)
The master should save the {{FrameworkID}} and {{failover_timeout}} in the 
registry.  Upon recovery, the master should resume the {{failover_timeout}} 
timers.
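Resuming the timers amounts to a small computation at recovery time. This is a hypothetical sketch of the proposed solution, assuming the registry also persisted how long each framework has been disconnected; the struct and field names are illustrative, not Mesos code:

```cpp
#include <algorithm>
#include <chrono>

// Hypothetical recovered registry entry: the framework's declared
// failover_timeout plus the elapsed time since it disconnected.
struct RecoveredFramework {
  std::chrono::seconds failoverTimeout;
  std::chrono::seconds disconnectedFor;
};

// Time left on the failover timer after master recovery. Zero means
// the timeout already expired and the framework's tasks should be
// removed immediately instead of being left as orphans.
std::chrono::seconds remainingFailover(const RecoveredFramework& f) {
  return std::max(std::chrono::seconds(0),
                  f.failoverTimeout - f.disconnectedFor);
}
```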





[jira] [Assigned] (MESOS-3639) Implement stout/os/windows/killtree.hpp

2016-05-12 Thread Daniel Pravat (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Pravat reassigned MESOS-3639:


Assignee: Daniel Pravat  (was: Alex Clemmer)

> Implement stout/os/windows/killtree.hpp
> ---
>
> Key: MESOS-3639
> URL: https://issues.apache.org/jira/browse/MESOS-3639
> Project: Mesos
>  Issue Type: Task
>  Components: stout
>Reporter: Alex Clemmer
>Assignee: Daniel Pravat
>  Labels: mesosphere, windows
>






[jira] [Updated] (MESOS-3639) Implement stout/os/windows/killtree.hpp

2016-05-12 Thread Daniel Pravat (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Pravat updated MESOS-3639:
-
Description: 
killtree() is implemented using Windows Job Objects. The processes created by 
the executor are associated with a job object using `create_job`. killtree() 
simply terminates the job object. 

Helper functions:
The `create_job` function creates a job object whose name is derived from the 
`pid` and associates the process with the job object. Every process started by 
a process which is part of the job object becomes part of the job object. The 
job name should match the name used in `kill_job`.

The `kill_job` function assumes the process identified by `pid` is associated 
with a job object whose name is derived from it. Every process started by a 
process which is part of the job object becomes part of the job object. 
Destroying the task will close all such processes.

> Implement stout/os/windows/killtree.hpp
> ---
>
> Key: MESOS-3639
> URL: https://issues.apache.org/jira/browse/MESOS-3639
> Project: Mesos
>  Issue Type: Task
>  Components: stout
>Reporter: Alex Clemmer
>Assignee: Daniel Pravat
>  Labels: mesosphere, windows
>
> killtree() is implemented using Windows Job Objects. The processes created 
> by the executor are associated with a job object using `create_job`. 
> killtree() simply terminates the job object. 
> Helper functions:
> The `create_job` function creates a job object whose name is derived from 
> the `pid` and associates the process with the job object. Every process 
> started by a process which is part of the job object becomes part of the 
> job object. The job name should match the name used in `kill_job`.
> The `kill_job` function assumes the process identified by `pid` is 
> associated with a job object whose name is derived from it. Every process 
> started by a process which is part of the job object becomes part of the 
> job object. Destroying the task will close all such processes.





[jira] [Updated] (MESOS-5330) Agent should backoff before connecting to the master

2016-05-12 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-5330:
---
Shepherd: Benjamin Mahler

> Agent should backoff before connecting to the master
> 
>
> Key: MESOS-5330
> URL: https://issues.apache.org/jira/browse/MESOS-5330
> Project: Mesos
>  Issue Type: Bug
>Reporter: David Robinson
>Assignee: David Robinson
>
> When an agent is started it starts a background task (libprocess process?) to 
> detect the leading master. When the leading master is detected (or changes) 
> the [SocketManager's link() method is called and a TCP connection to the 
> master is 
> established|https://github.com/apache/mesos/blob/a138e2246a30c4b5c9bc3f7069ad12204dcaffbc/src/slave/slave.cpp#L954].
>  The agent _then_ backs off before sending a ReRegisterSlave message via the 
> newly established connection. The agent needs to backoff _before_ attempting 
> to establish a TCP connection to the master, not before sending the first 
> message over the connection.
> During scale tests at Twitter we discovered that agents can SYN flood the 
> master upon leader changes; the problem described in MESOS-5200, where 
> ephemeral connections are used, can then occur and exacerbate the situation. 
> The end result is a lot of hosts setting up and tearing down TCP connections 
> every slave_ping_timeout seconds (15 by default), connections failing to be 
> established, and hosts being marked unhealthy and shut down. We observed 
> ~800 passive TCP connections per second on the leading master during scale 
> tests.
> The problem can be somewhat mitigated by tuning the kernel to handle a 
> thundering herd of TCP connections, but ideally there would not be a 
> thundering herd to begin with.





[jira] [Updated] (MESOS-3639) Implement stout/os/windows/killtree.hpp

2016-05-12 Thread Daniel Pravat (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Pravat updated MESOS-3639:
-
Description: 
killtree() is implemented using Windows Job Objects. The processes created by 
the executor are associated with a job object using `create_job`. killtree() 
then simply terminates the job object. 

Helper functions:
The `create_job` function creates a job object whose name is derived from the 
`pid` and associates the `pid` process with the job object. Every process 
started by a process that belongs to the job object also becomes part of the 
job object. The job name must match the name used in `kill_job`.

The `kill_job` function assumes the process identified by `pid` is associated 
with a job object whose name is derived from that `pid`. Every process started 
by a process that belongs to the job object also becomes part of the job 
object, so destroying the task will close all such processes.

  was:
killtree() is implemented using Windows Job Objects. The processes created by 
the  executor are associated with a job object using `create_job'. killtree() 
is simply terminating the job object. 

Helper functions:
`create_job` function create a job object whose name is derived from the `pid` 
and associates the  process with the job object. Every process started by the 
process which is part of the job object becomes part of the job object. The job 
name should match the name used in `kill_job`.

`kill_job` function assumes the process identified by `pid` is associated with 
a job object whose name is derive from it. Every process started by the process 
which is part of the job object becomes part of the job object. Destroying the 
task will close all such processes.


> Implement stout/os/windows/killtree.hpp
> ---
>
> Key: MESOS-3639
> URL: https://issues.apache.org/jira/browse/MESOS-3639
> Project: Mesos
>  Issue Type: Task
>  Components: stout
>Reporter: Alex Clemmer
>Assignee: Daniel Pravat
>  Labels: mesosphere, windows
>
> killtree() is implemented using Windows Job Objects. The processes created by 
> the executor are associated with a job object using `create_job`. killtree() 
> then simply terminates the job object. 
> Helper functions:
> The `create_job` function creates a job object whose name is derived from the 
> `pid` and associates the `pid` process with the job object. Every process 
> started by a process that belongs to the job object also becomes part of the 
> job object. The job name must match the name used in `kill_job`.
> The `kill_job` function assumes the process identified by `pid` is associated 
> with a job object whose name is derived from that `pid`. Every process started 
> by a process that belongs to the job object also becomes part of the job 
> object, so destroying the task will close all such processes.





[jira] [Commented] (MESOS-3435) Add containerizer support for hyper

2016-05-12 Thread Timothy Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15282470#comment-15282470
 ] 

Timothy Chen commented on MESOS-3435:
-

Sounds like the hyper folks are improving their APIs for easier integration 
with Mesos. [~haosd...@gmail.com], is the containerizer module merged now?

> Add containerizer support for hyper
> ---
>
> Key: MESOS-3435
> URL: https://issues.apache.org/jira/browse/MESOS-3435
> Project: Mesos
>  Issue Type: Story
>Reporter: Deshi Xiao
>Assignee: haosdent
>
> Secure as a hypervisor, fast and as easy to use as Docker: this is Hyper. 
> https://docs.hyper.sh/Introduction/what_is_hyper_.html We could implement 
> this as a module once MESOS-3709 is finished.





[jira] [Commented] (MESOS-3435) Add containerizer support for hyper

2016-05-12 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15282478#comment-15282478
 ] 

haosdent commented on MESOS-3435:
-

Unfortunately, Till is not available for at least the next 4 weeks, so we 
have not yet started to review the containerizer modularization.

{quote}
Sounds like the hyper folks are improving their APIs for easier integration 
with Mesos
{quote}

Do you have more details about this or their contact information? I would like 
to contact them. :-)

> Add containerizer support for hyper
> ---
>
> Key: MESOS-3435
> URL: https://issues.apache.org/jira/browse/MESOS-3435
> Project: Mesos
>  Issue Type: Story
>Reporter: Deshi Xiao
>Assignee: haosdent
>
> Secure as a hypervisor, fast and as easy to use as Docker: this is Hyper. 
> https://docs.hyper.sh/Introduction/what_is_hyper_.html We could implement 
> this as a module once MESOS-3709 is finished.


