[jira] [Comment Edited] (MESOS-2980) Allow runtime configuration to be returned from provisioner
[ https://issues.apache.org/jira/browse/MESOS-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049109#comment-15049109 ] Gilbert Song edited comment on MESOS-2980 at 12/10/15 7:02 AM: --- Just finished this series: 1. https://reviews.apache.org/r/41192/ | add protobuf 2. https://reviews.apache.org/r/41011/ | provisioner/filesystem isolator 3. https://reviews.apache.org/r/41032/ | local/registry puller 4. https://reviews.apache.org/r/41194/ | simple cleanup JSON parse 5. https://reviews.apache.org/r/41195/ | metadata manager 6. https://reviews.apache.org/r/41125/ | docker/appc store was (Author: gilbert): Just finished this series: 1. https://reviews.apache.org/r/41122/ | add protobuf 2. https://reviews.apache.org/r/41011/ | provisioner/filesystem isolator 3. https://reviews.apache.org/r/41032/ | local/registry puller 4. https://reviews.apache.org/r/41123/ | simple cleanup JSON parse 5. https://reviews.apache.org/r/41124/ | metadata manager 6. https://reviews.apache.org/r/41125/ | docker/appc store > Allow runtime configuration to be returned from provisioner > --- > > Key: MESOS-2980 > URL: https://issues.apache.org/jira/browse/MESOS-2980 > Project: Mesos > Issue Type: Improvement >Reporter: Timothy Chen >Assignee: Gilbert Song > Labels: mesosphere > > Image specs also include execution configuration (e.g. env, user, ports, > etc.). > We should support passing this information from the image provisioner back > to the containerizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3841) Master HTTP API support to get the leader
[ https://issues.apache.org/jira/browse/MESOS-3841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15050187#comment-15050187 ] Jian Qiu commented on MESOS-3841: - How about an endpoint {code} http://master:port/leader {code} with a return of {code} {"leader": {"hostname":"xxx","ip":"x.x.x.x","port":5050}} {code} > Master HTTP API support to get the leader > - > > Key: MESOS-3841 > URL: https://issues.apache.org/jira/browse/MESOS-3841 > Project: Mesos > Issue Type: Improvement > Components: HTTP API >Reporter: Cosmin Lehene >Assignee: Jian Qiu > > There's currently no good way to query the current master ensemble leader. > Current workarounds are to get the leader (and parse it from leader@ip) from > {{/state.json}} or to grep it from {{master/redirect}}. > The scheduler API does an HTTP redirect, but that requires an HTTP POST > coming from a framework as well: > {{POST /api/v1/scheduler HTTP/1.1}} > There should be a lightweight API call to get the current master. > This could be part of a more granular representation (REST) of the current > state.json. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-4097) Change /roles endpoint to include quotas, weights, reserved resources?
[ https://issues.apache.org/jira/browse/MESOS-4097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049956#comment-15049956 ] Yong Qiao Wang edited comment on MESOS-4097 at 12/10/15 4:27 AM: - OK, per my understanding of role management, I think Mesos prefers to set role-related configuration with separate endpoints, such as using /quota to set quota, /reserve to dynamically set reservations, and maybe /weights to set weights, etc., but to use the unified endpoint /roles to show all role-related information. In that case, do we need to defer the ability to query quota via /quota once /roles shows the quota information of a role? [~neilc] and [~alexr], do you think so? was (Author: jamesyongqiaowang): OK, per my understanding of role management, I think Mesos prefers to set role-related configuration with separate endpoints, such as using /quota to set quota, /reserve to dynamically set reservations, and maybe /weights to set weights, etc., and to use the unified endpoint /roles to show all role-related information. In that case, do we need to defer the ability to query quota via /quota once /roles shows the quota information of a role? > Change /roles endpoint to include quotas, weights, reserved resources? > -- > > Key: MESOS-4097 > URL: https://issues.apache.org/jira/browse/MESOS-4097 > Project: Mesos > Issue Type: Improvement >Reporter: Neil Conway > Labels: mesosphere, quota, reservations, roles > > MESOS-4085 changes the behavior of the {{/roles}} endpoint: rather than > listing all the explicitly defined roles, we will now only list those roles > that have one or more registered frameworks. > As suggested by [~alexr] in code review, this could be improved -- an > operator might reasonably expect to see all the roles that have > * non-default weight > * non-default quota > * non-default ACLs? > * any static or dynamically reserved resources -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3843) Audit `src/CMakelists.txt` to make sure we're compiling everything we need to build the agent binary.
[ https://issues.apache.org/jira/browse/MESOS-3843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049980#comment-15049980 ] Joris Van Remoortere commented on MESOS-3843: - {code} commit 07725b5f0cf46439607919bc6f3d51437dbe2088 Author: Diana Arroyo Date: Wed Dec 9 19:26:09 2015 -0800 CMake: Added FindCurl.cmake script to locate cURL library. Review: https://reviews.apache.org/r/41090 {code} > Audit `src/CMakelists.txt` to make sure we're compiling everything we need to > build the agent binary. > - > > Key: MESOS-3843 > URL: https://issues.apache.org/jira/browse/MESOS-3843 > Project: Mesos > Issue Type: Task > Components: cmake >Reporter: Alex Clemmer >Assignee: Diana Arroyo > > `src/CMakeLists.txt` has fallen into some state of disrepair. There are some > source files that seem to be missing (e.g., the `src/launcher/` and > `src/linux`/ directories), so the first step is to audit the source file to > make sure everything we need is there. Likely this will mean looking at the > corresponding `src/Makefile.am` to see what's missing. > Once we understand the limitations of the current build, we can fan out more > tickets or proceed to generating the agent binary, as well as the master. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4097) Change /roles endpoint to include quotas, weights, reserved resources?
[ https://issues.apache.org/jira/browse/MESOS-4097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049956#comment-15049956 ] Yong Qiao Wang commented on MESOS-4097: --- OK, per my understanding of role management, I think Mesos prefers to set role-related configuration with separate endpoints, such as using /quota to set quota, /reserve to dynamically set reservations, and maybe /weights to set weights, etc., and to use the unified endpoint /roles to show all role-related information. In that case, do we need to defer the ability to query quota via /quota once /roles shows the quota information of a role? > Change /roles endpoint to include quotas, weights, reserved resources? > -- > > Key: MESOS-4097 > URL: https://issues.apache.org/jira/browse/MESOS-4097 > Project: Mesos > Issue Type: Improvement >Reporter: Neil Conway > Labels: mesosphere, quota, reservations, roles > > MESOS-4085 changes the behavior of the {{/roles}} endpoint: rather than > listing all the explicitly defined roles, we will now only list those roles > that have one or more registered frameworks. > As suggested by [~alexr] in code review, this could be improved -- an > operator might reasonably expect to see all the roles that have > * non-default weight > * non-default quota > * non-default ACLs? > * any static or dynamically reserved resources -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1718) Command executor can overcommit the slave.
[ https://issues.apache.org/jira/browse/MESOS-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049905#comment-15049905 ] Klaus Ma commented on MESOS-1718: - re reusing {{TaskInfo::CommandInfo}}: I'm thinking of only using {{TaskInfo::ExecutorInfo}} to launch tasks in the future; considering backward compatibility, we could add {{ExecutorInfo::task_command}} without touching {{TaskInfo::CommandInfo}}. For this JIRA, I'm OK with reusing {{TaskInfo::CommandInfo}} and relaxing the constraint on {{TaskInfo::CommandInfo}} and {{TaskInfo::ExecutorInfo}}. re filling {{ExecutorInfo::CommandInfo}} in the slave: good point :). I used to avoid persisting that info (e.g. launch_dir); but after thinking more about it, maybe we do not need to persist it; the slave will report it when it re-registers. Regarding backwards compatibility, for now, most cases pass with the draft RR without interface changes. The major issue is how we define the command-line executor's resources, since the framework did not assign resources to it. Currently, I carve the executor's resources out of the task's (e.g. for 1 CPU, the command-line executor will use 0.9 CPU for the task and 0.1 for the executor); there may then be two offers returning that 1 CPU instead of one offer as before; and task resources in metrics also change, for example, Marathon's UI will show 0.9 CPU when a service is launched with 1 CPU. > Command executor can overcommit the slave. > -- > > Key: MESOS-1718 > URL: https://issues.apache.org/jira/browse/MESOS-1718 > Project: Mesos > Issue Type: Bug > Components: slave >Reporter: Benjamin Mahler >Assignee: Ian Downes > > Currently we give a small amount of resources to the command executor, in > addition to resources used by the command task: > https://github.com/apache/mesos/blob/0.20.0-rc1/src/slave/slave.cpp#L2448 > {code: title=} > ExecutorInfo Slave::getExecutorInfo( > const FrameworkID& frameworkId, > const TaskInfo& task) > { > ... > // Add an allowance for the command executor. This does lead to a > // small overcommit of resources. > executor.mutable_resources()->MergeFrom( > Resources::parse( > "cpus:" + stringify(DEFAULT_EXECUTOR_CPUS) + ";" + > "mem:" + stringify(DEFAULT_EXECUTOR_MEM.megabytes())).get()); > ... > } > {code} > This leads to an overcommit of the slave. Ideally, for command tasks we can > "transfer" all of the task resources to the executor at the slave / isolation > level. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
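To make the resource carving described in the comment above concrete, here is a small self-contained sketch (not Mesos code; the {{carve}} helper, the {{Allocation}} struct, and the 0.1-CPU executor allowance are illustrative assumptions). The point is that the executor's share is taken out of the task's request rather than added on top of it, which is what avoids overcommitting the agent:
{code}
// Illustrative only: split a command task's CPU request between the task
// itself and the command executor, instead of adding an extra allowance on
// top of it (which is what overcommits the agent today).
#include <iostream>

struct Allocation
{
  double taskCpus;
  double executorCpus;
};

// Hypothetical helper: carve the executor's share out of the task's cpus.
Allocation carve(double requestedCpus, double executorAllowance = 0.1)
{
  Allocation allocation;
  allocation.executorCpus = executorAllowance;
  allocation.taskCpus = requestedCpus - executorAllowance;
  return allocation;
}

int main()
{
  // A framework asks for 1 CPU for a command task.
  Allocation allocation = carve(1.0);

  // Prints "task: 0.9 cpus, executor: 0.1 cpus", matching the Marathon
  // example in the comment above.
  std::cout << "task: " << allocation.taskCpus
            << " cpus, executor: " << allocation.executorCpus
            << " cpus" << std::endl;

  return 0;
}
{code}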
[jira] [Created] (MESOS-4111) Provide a means for libprocess users to exit while ensuring messages are flushed.
Benjamin Mahler created MESOS-4111: -- Summary: Provide a means for libprocess users to exit while ensuring messages are flushed. Key: MESOS-4111 URL: https://issues.apache.org/jira/browse/MESOS-4111 Project: Mesos Issue Type: Bug Components: libprocess Reporter: Benjamin Mahler Priority: Minor Currently after a {{send}} there is no way to ensure that the message is flushed on the socket before terminating. We work around this by inserting {{os::sleep}} calls (see MESOS-243, MESOS-4106). There are a number of approaches to this: (1) Return a Future from send that notifies when the message is flushed from the system. (2) Call process::finalize before exiting. This would require that process::finalize flushes all of the outstanding data on any active sockets, which may block. Regardless of the approach, there needs to be a timer if we want to guarantee termination. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
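As a rough illustration of approach (1), the toy sketch below shows a {{send}} whose returned future is only satisfied after the message has actually been written out, so the caller can do a bounded wait before exiting. The use of {{std::future}} and a background thread is purely illustrative; this is not the libprocess implementation:
{code}
// Toy illustration of approach (1): a send() that returns a future which is
// only satisfied after the message has actually been written out, so a
// short-lived process can wait (with a timeout) before exiting. Names are
// illustrative; this is not the libprocess implementation.
#include <chrono>
#include <future>
#include <iostream>
#include <memory>
#include <string>
#include <thread>

std::future<void> send(const std::string& message)
{
  auto flushed = std::make_shared<std::promise<void>>();
  std::future<void> future = flushed->get_future();

  // Stand-in for the socket manager writing the message asynchronously.
  std::thread([message, flushed]() {
    std::cout << message << std::endl;  // "Write" the message.
    flushed->set_value();               // Signal that it was flushed.
  }).detach();

  return future;
}

int main()
{
  std::future<void> flushed = send("TaskHealthStatus");

  // Bounded wait so that termination is still guaranteed even if the flush
  // never completes (the timer mentioned in the description above).
  flushed.wait_for(std::chrono::seconds(5));

  return 0;
}
{code}
The bounded {{wait_for}} is what preserves the termination guarantee mentioned above even when the flush never happens.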
[jira] [Commented] (MESOS-4097) Change /roles endpoint to include quotas, weights, reserved resources?
[ https://issues.apache.org/jira/browse/MESOS-4097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049900#comment-15049900 ] Neil Conway commented on MESOS-4097: Yeah, we probably want {{/roles}} to return information for all "visible" (active and/or explicitly configured roles), per discussion in https://reviews.apache.org/r/41075/ -- implementing that at the moment. > Change /roles endpoint to include quotas, weights, reserved resources? > -- > > Key: MESOS-4097 > URL: https://issues.apache.org/jira/browse/MESOS-4097 > Project: Mesos > Issue Type: Improvement >Reporter: Neil Conway > Labels: mesosphere, quota, reservations, roles > > MESOS-4085 changes the behavior of the {{/roles}} endpoint: rather than > listing all the explicitly defined roles, we will now only list those roles > that have one or more registered frameworks. > As suggested by [~alexr] in code review, this could be improved -- an > operator might reasonably expect to see all the roles that have > * non-default weight > * non-default quota > * non-default ACLs? > * any static or dynamically reserved resources -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4106) The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.
[ https://issues.apache.org/jira/browse/MESOS-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049899#comment-15049899 ] Benjamin Mahler commented on MESOS-4106: Yeah, I'll reference MESOS-4111 now that we have it; I'll also reference it in the existing command executor sleep. > The health checker may fail to inform the executor to kill an unhealthy task > after max_consecutive_failures. > > > Key: MESOS-4106 > URL: https://issues.apache.org/jira/browse/MESOS-4106 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.20.0, 0.20.1, 0.21.1, 0.21.2, 0.22.1, 0.22.2, 0.23.0, > 0.23.1, 0.24.0, 0.24.1, 0.25.0 >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Blocker > > This was reported by [~tan] experimenting with health checks. Many tasks were > launched with the following health check, taken from the container > stdout/stderr: > {code} > Launching health check process: /usr/local/libexec/mesos/mesos-health-check > --executor=(1)@127.0.0.1:39629 > --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0} > --task_id=sleepy-2 > {code} > This should have led to all tasks getting killed due to > {{\-\-consecutive_failures}} being set; however, only some tasks get killed, > while others remain running. > It turns out that the health check binary does a {{send}} and promptly exits. > Unfortunately, this may lead to a message drop since libprocess may not have > sent this message over the socket by the time the process exits. > We work around this in the command executor with a manual sleep, which has > been around since the svn days. See > [here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4097) Change /roles endpoint to include quotas, weights, reserved resources?
[ https://issues.apache.org/jira/browse/MESOS-4097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049894#comment-15049894 ] Yong Qiao Wang commented on MESOS-4097: --- Maybe we should bring back RoleInfo and improve /roles for the above requirements. MESOS-3791 does similar things. > Change /roles endpoint to include quotas, weights, reserved resources? > -- > > Key: MESOS-4097 > URL: https://issues.apache.org/jira/browse/MESOS-4097 > Project: Mesos > Issue Type: Improvement >Reporter: Neil Conway > Labels: mesosphere, quota, reservations, roles > > MESOS-4085 changes the behavior of the {{/roles}} endpoint: rather than > listing all the explicitly defined roles, we will now only list those roles > that have one or more registered frameworks. > As suggested by [~alexr] in code review, this could be improved -- an > operator might reasonably expect to see all the roles that have > * non-default weight > * non-default quota > * non-default ACLs? > * any static or dynamically reserved resources -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4106) The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.
[ https://issues.apache.org/jira/browse/MESOS-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049893#comment-15049893 ] Neil Conway commented on MESOS-4106: Sounds good -- maybe add a {{TODO}} to the fix for this bug to put in a more robust fix later? > The health checker may fail to inform the executor to kill an unhealthy task > after max_consecutive_failures. > > > Key: MESOS-4106 > URL: https://issues.apache.org/jira/browse/MESOS-4106 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.20.0, 0.20.1, 0.21.1, 0.21.2, 0.22.1, 0.22.2, 0.23.0, > 0.23.1, 0.24.0, 0.24.1, 0.25.0 >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Blocker > > This was reported by [~tan] experimenting with health checks. Many tasks were > launched with the following health check, taken from the container > stdout/stderr: > {code} > Launching health check process: /usr/local/libexec/mesos/mesos-health-check > --executor=(1)@127.0.0.1:39629 > --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0} > --task_id=sleepy-2 > {code} > This should have led to all tasks getting killed due to > {{\-\-consecutive_failures}} being set; however, only some tasks get killed, > while others remain running. > It turns out that the health check binary does a {{send}} and promptly exits. > Unfortunately, this may lead to a message drop since libprocess may not have > sent this message over the socket by the time the process exits. > We work around this in the command executor with a manual sleep, which has > been around since the svn days. See > [here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4106) The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.
[ https://issues.apache.org/jira/browse/MESOS-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049889#comment-15049889 ] Benjamin Mahler commented on MESOS-4106: Yeah, I had the same thought when I was looking at MESOS-243, but now we also have process::finalize, which could be the mechanism for cleanly shutting down before {{exit}} calls. I'll file a ticket to express this issue more generally (MESOS-243 was the original but is specific to the executor driver). > The health checker may fail to inform the executor to kill an unhealthy task > after max_consecutive_failures. > > > Key: MESOS-4106 > URL: https://issues.apache.org/jira/browse/MESOS-4106 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.20.0, 0.20.1, 0.21.1, 0.21.2, 0.22.1, 0.22.2, 0.23.0, > 0.23.1, 0.24.0, 0.24.1, 0.25.0 >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Blocker > > This was reported by [~tan] experimenting with health checks. Many tasks were > launched with the following health check, taken from the container > stdout/stderr: > {code} > Launching health check process: /usr/local/libexec/mesos/mesos-health-check > --executor=(1)@127.0.0.1:39629 > --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0} > --task_id=sleepy-2 > {code} > This should have led to all tasks getting killed due to > {{\-\-consecutive_failures}} being set; however, only some tasks get killed, > while others remain running. > It turns out that the health check binary does a {{send}} and promptly exits. > Unfortunately, this may lead to a message drop since libprocess may not have > sent this message over the socket by the time the process exits. > We work around this in the command executor with a manual sleep, which has > been around since the svn days. See > [here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4097) Change /roles endpoint to include quotas, weights, reserved resources?
[ https://issues.apache.org/jira/browse/MESOS-4097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049884#comment-15049884 ] Yong Qiao Wang commented on MESOS-4097: --- Some concerns: 1. We removed RoleInfo in MESOS-4085, but now we are going to show RoleInfo (all role-related configuration) via /roles? 2. The current design may also hurt the operator experience; for example, after an operator configures the quota for a role with the /quota endpoint, they have to check that configuration with another endpoint (/roles), which returns information for all active roles and does not allow querying a specific one. > Change /roles endpoint to include quotas, weights, reserved resources? > -- > > Key: MESOS-4097 > URL: https://issues.apache.org/jira/browse/MESOS-4097 > Project: Mesos > Issue Type: Improvement >Reporter: Neil Conway > Labels: mesosphere, quota, reservations, roles > > MESOS-4085 changes the behavior of the {{/roles}} endpoint: rather than > listing all the explicitly defined roles, we will now only list those roles > that have one or more registered frameworks. > As suggested by [~alexr] in code review, this could be improved -- an > operator might reasonably expect to see all the roles that have > * non-default weight > * non-default quota > * non-default ACLs? > * any static or dynamically reserved resources -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4110) Implement `WindowsError` to correspond with `ErrnoError`.
Alex Clemmer created MESOS-4110: --- Summary: Implement `WindowsError` to correspond with `ErrnoError`. Key: MESOS-4110 URL: https://issues.apache.org/jira/browse/MESOS-4110 Project: Mesos Issue Type: Bug Components: stout Reporter: Alex Clemmer Assignee: Alex Clemmer In the C standard library, `errno` records the last error on a thread. You can pretty-print it with `strerror`. In Stout, we report these errors with `ErrnoError`. The Windows API has something similar, called `GetLastError()`. The way to pretty-print this is hilariously unintuitive and terrible, so in this case it is actually very beneficial to wrap it with something similar to `ErrnoError`, maybe called `WindowsError`. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
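For reference, a minimal sketch of the kind of wrapper being proposed. The shape mirrors `ErrnoError`, but the class below is an illustrative assumption rather than the eventual stout implementation:
{code}
// Windows-only sketch: wrap GetLastError() and pretty-print it with
// FormatMessageA, analogous to how ErrnoError wraps errno + strerror.
#include <windows.h>

#include <string>

struct WindowsError
{
  WindowsError() : code(::GetLastError()), message(format(code)) {}

  const DWORD code;
  const std::string message;

private:
  static std::string format(DWORD code)
  {
    char buffer[512];
    DWORD size = ::FormatMessageA(
        FORMAT_MESSAGE_FROM_SYSTEM | FORMAT_MESSAGE_IGNORE_INSERTS,
        nullptr,         // Message source: the system message table.
        code,            // The error code to format.
        0,               // Default language.
        buffer,
        sizeof(buffer),
        nullptr);

    return size == 0 ? std::string("Unknown error")
                     : std::string(buffer, size);
  }
};
{code}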
[jira] [Commented] (MESOS-4106) The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.
[ https://issues.apache.org/jira/browse/MESOS-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049866#comment-15049866 ] Neil Conway commented on MESOS-4106: To fix the problem properly (without a {{sleep}} hack), it seems we need something akin to a {{flush}} primitive in libprocess. For example, we could provide a variant of {{send}} that returns a future, where the future is only satisfied once the associated message has been delivered to the kernel. > The health checker may fail to inform the executor to kill an unhealthy task > after max_consecutive_failures. > > > Key: MESOS-4106 > URL: https://issues.apache.org/jira/browse/MESOS-4106 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.20.0, 0.20.1, 0.21.1, 0.21.2, 0.22.1, 0.22.2, 0.23.0, > 0.23.1, 0.24.0, 0.24.1, 0.25.0 >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Blocker > > This was reported by [~tan] experimenting with health checks. Many tasks were > launched with the following health check, taken from the container > stdout/stderr: > {code} > Launching health check process: /usr/local/libexec/mesos/mesos-health-check > --executor=(1)@127.0.0.1:39629 > --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0} > --task_id=sleepy-2 > {code} > This should have led to all tasks getting killed due to > {{\-\-consecutive_failures}} being set; however, only some tasks get killed, > while others remain running. > It turns out that the health check binary does a {{send}} and promptly exits. > Unfortunately, this may lead to a message drop since libprocess may not have > sent this message over the socket by the time the process exits. > We work around this in the command executor with a manual sleep, which has > been around since the svn days. See > [here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-4025) SlaveRecoveryTest/0.GCExecutor is flaky.
[ https://issues.apache.org/jira/browse/MESOS-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049741#comment-15049741 ] Greg Mann edited comment on MESOS-4025 at 12/10/15 2:09 AM: I was just observing this error on Ubuntu 14.04 when running the tests as root. It does indeed seem to be due to artifacts left behind by some tests. After running the entire test suite the first time I saw that several of the {{SlaveRecoveryTest}} tests had failed: {code} [==] 988 tests from 144 test cases ran. (590989 ms total) [ PASSED ] 980 tests. [ FAILED ] 8 tests, listed below: [ FAILED ] SlaveRecoveryTest/0.GCExecutor, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.ShutdownSlave, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.ShutdownSlaveSIGUSR1, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveTest.PingTimeoutNoPings [ FAILED ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample [ FAILED ] CgroupsAnyHierarchyWithPerfEventTest.ROOT_CGROUPS_Perf [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery {code} and the {{SlaveRecoveryTest}} errors all looked the same: {code} [ RUN ] SlaveRecoveryTest/0.ShutdownSlave ../../src/tests/mesos.cpp:906: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave': Device or resource busy --- We're very sorry but we can't seem to destroy existing cgroups that we likely created as part of an earlier invocation of the tests. Please manually destroy the cgroup at '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave' by first manually killing all the processes found in the file at '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave/tasks' --- ../../src/tests/mesos.cpp:940: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave': Device or resource busy [ FAILED ] SlaveRecoveryTest/0.ShutdownSlave, where TypeParam = mesos::internal::slave::MesosContainerizer (31 ms) {code} Next, re-running just these tests with {{sudo GTEST_FILTER="SlaveRecoverTest*" bin/mesos-tests.sh}}, they _all_ failed, with a slightly different error. At this point, these failures are reliably produced 100% of the time: {code} [ RUN ] SlaveRecoveryTest/0.MultipleSlaves ../../src/tests/mesos.cpp:906: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/perf_event/mesos_test': Device or resource busy --- We're very sorry but we can't seem to destroy existing cgroups that we likely created as part of an earlier invocation of the tests. 
Please manually destroy the cgroup at '/sys/fs/cgroup/perf_event/mesos_test' by first manually killing all the processes found in the file at '/sys/fs/cgroup/perf_event/mesos_test/tasks' --- ../../src/tests/mesos.cpp:940: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/perf_event/mesos_test': Device or resource busy [ FAILED ] SlaveRecoveryTest/0.MultipleSlaves, where TypeParam = mesos::internal::slave::MesosContainerizer (14 ms) {code} Finally, I recompiled from scratch and ran _only_ the {{SlaveRecoveryTest}} tests, doing: {code} GTEST_FILTER="" make check sudo GTEST_FILTER="SlaveRecoveryTest*" bin/mesos-tests.sh {code} and they all passed. Regarding [~nfnt]'s comment above on the {{HealthCheckTest}} tests, after all this I was able to run {{sudo GTEST_FILTER="HealthCheckTest*" bin/mesos-tests.sh}}, followed by {{sudo GTEST_FILTER="SlaveRecoveryTest*" bin/mesos-tests.sh}}, and all of the {{SlaveRecoveryTest}} tests passed, so perhaps it isn't an artifact of the {{HealthCheckTest}} tests that is causing this problem? However, I only did that a couple times, and didn't try the exact command that Jan did: {code} sudo ./bin/mesos-tests.sh --gtest_repeat=1 --gtest_break_on_failure --gtest_filter="*ROOT_DOCKER_DockerHealthStatusChange:SlaveRecoveryTest*GCExecutor" {code} so if it's a flaky thing I may not have caught it. was (Author: greggomann): I was just observing this error on Ubuntu 14.04 when running the tests as root. It does indeed seem to be due to artifacts left behind by some tests. After running the entire test suite the first time I saw that several of the {{SlaveRecoveryTest}} tests had failed: {code} [==] 988 tests from 144 test cases ran. (5909
[jira] [Commented] (MESOS-1613) HealthCheckTest.ConsecutiveFailures is flaky
[ https://issues.apache.org/jira/browse/MESOS-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049850#comment-15049850 ] Benjamin Mahler commented on MESOS-1613: For posterity, I also wasn't able to reproduce this by just running in repetition. However, when I ran one {{openssl speed}} for each core on my laptop in order to induce load, I could reproduce easily. We probably want to direct folks to try this when they are having trouble reproducing something flaky from CI. I will post a fix through MESOS-4106. > HealthCheckTest.ConsecutiveFailures is flaky > > > Key: MESOS-1613 > URL: https://issues.apache.org/jira/browse/MESOS-1613 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 0.20.0 > Environment: Ubuntu 10.04 GCC >Reporter: Vinod Kone >Assignee: Timothy Chen > Labels: flaky, mesosphere > > {code} > [ RUN ] HealthCheckTest.ConsecutiveFailures > Using temporary directory '/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV' > I0717 04:39:59.288471 5009 leveldb.cpp:176] Opened db in 21.575631ms > I0717 04:39:59.295274 5009 leveldb.cpp:183] Compacted db in 6.471982ms > I0717 04:39:59.295552 5009 leveldb.cpp:198] Created db iterator in 16783ns > I0717 04:39:59.296026 5009 leveldb.cpp:204] Seeked to beginning of db in > 2125ns > I0717 04:39:59.296257 5009 leveldb.cpp:273] Iterated through 0 keys in the > db in 10747ns > I0717 04:39:59.296584 5009 replica.cpp:741] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0717 04:39:59.297322 5033 recover.cpp:425] Starting replica recovery > I0717 04:39:59.297413 5033 recover.cpp:451] Replica is in EMPTY status > I0717 04:39:59.297824 5033 replica.cpp:638] Replica in EMPTY status received > a broadcasted recover request > I0717 04:39:59.297899 5033 recover.cpp:188] Received a recover response from > a replica in EMPTY status > I0717 04:39:59.297997 5033 recover.cpp:542] Updating replica status to > STARTING > I0717 04:39:59.301985 5031 master.cpp:288] Master > 20140717-043959-16842879-40280-5009 (lucid) started on 127.0.1.1:40280 > I0717 04:39:59.302026 5031 master.cpp:325] Master only allowing > authenticated frameworks to register > I0717 04:39:59.302032 5031 master.cpp:330] Master only allowing > authenticated slaves to register > I0717 04:39:59.302039 5031 credentials.hpp:36] Loading credentials for > authentication from > '/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV/credentials' > I0717 04:39:59.302283 5031 master.cpp:359] Authorization enabled > I0717 04:39:59.302971 5031 hierarchical_allocator_process.hpp:301] > Initializing hierarchical allocator process with master : > master@127.0.1.1:40280 > I0717 04:39:59.303022 5031 master.cpp:122] No whitelist given. Advertising > offers for all slaves > I0717 04:39:59.303390 5033 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 5.325097ms > I0717 04:39:59.303419 5033 replica.cpp:320] Persisted replica status to > STARTING > I0717 04:39:59.304076 5030 master.cpp:1128] The newly elected leader is > master@127.0.1.1:40280 with id 20140717-043959-16842879-40280-5009 > I0717 04:39:59.304095 5030 master.cpp:1141] Elected as the leading master! 
> I0717 04:39:59.304102 5030 master.cpp:959] Recovering from registrar > I0717 04:39:59.304182 5030 registrar.cpp:313] Recovering registrar > I0717 04:39:59.304635 5033 recover.cpp:451] Replica is in STARTING status > I0717 04:39:59.304962 5033 replica.cpp:638] Replica in STARTING status > received a broadcasted recover request > I0717 04:39:59.305026 5033 recover.cpp:188] Received a recover response from > a replica in STARTING status > I0717 04:39:59.305130 5033 recover.cpp:542] Updating replica status to VOTING > I0717 04:39:59.310416 5033 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 5.204157ms > I0717 04:39:59.310459 5033 replica.cpp:320] Persisted replica status to > VOTING > I0717 04:39:59.310534 5033 recover.cpp:556] Successfully joined the Paxos > group > I0717 04:39:59.310607 5033 recover.cpp:440] Recover process terminated > I0717 04:39:59.310773 5033 log.cpp:656] Attempting to start the writer > I0717 04:39:59.311157 5033 replica.cpp:474] Replica received implicit > promise request with proposal 1 > I0717 04:39:59.313451 5033 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 2.271822ms > I0717 04:39:59.313627 5033 replica.cpp:342] Persisted promised to 1 > I0717 04:39:59.318038 5031 coordinator.cpp:230] Coordinator attemping to > fill missing position > I0717 04:39:59.318430 5031 replica.cpp:375] Replica received explicit > promise request for position 0 with proposal 2 > I0717 04:39:59.323459 5031 leveldb.cpp:343] Persisting action (8 bytes) to > leve
[jira] [Created] (MESOS-4109) HTTPConnectionTest.ClosingResponse is flaky
Joseph Wu created MESOS-4109: Summary: HTTPConnectionTest.ClosingResponse is flaky Key: MESOS-4109 URL: https://issues.apache.org/jira/browse/MESOS-4109 Project: Mesos Issue Type: Bug Components: libprocess, test Affects Versions: 0.26.0 Environment: ASF Ubuntu 14 {{--enable-ssl --enable-libevent}} Reporter: Joseph Wu Priority: Minor Output of the test: {code} [ RUN ] HTTPConnectionTest.ClosingResponse I1210 01:20:27.048532 26671 process.cpp:3077] Handling HTTP event for process '(22)' with path: '/(22)/get' ../../../3rdparty/libprocess/src/tests/http_tests.cpp:919: Failure Actual function call count doesn't match EXPECT_CALL(*http.process, get(_))... Expected: to be called twice Actual: called once - unsatisfied and active [ FAILED ] HTTPConnectionTest.ClosingResponse (43 ms) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-4106) The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.
[ https://issues.apache.org/jira/browse/MESOS-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-4106: -- Assignee: Benjamin Mahler [~haosd...@gmail.com]: From my testing so far, yes. I will send a fix and re-enable the test from MESOS-1613. > The health checker may fail to inform the executor to kill an unhealthy task > after max_consecutive_failures. > > > Key: MESOS-4106 > URL: https://issues.apache.org/jira/browse/MESOS-4106 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.20.0, 0.20.1, 0.21.1, 0.21.2, 0.22.1, 0.22.2, 0.23.0, > 0.23.1, 0.24.0, 0.24.1, 0.25.0 >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Blocker > > This was reported by [~tan] experimenting with health checks. Many tasks were > launched with the following health check, taken from the container > stdout/stderr: > {code} > Launching health check process: /usr/local/libexec/mesos/mesos-health-check > --executor=(1)@127.0.0.1:39629 > --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0} > --task_id=sleepy-2 > {code} > This should have led to all tasks getting killed due to > {{\-\-consecutive_failures}} being set; however, only some tasks get killed, > while others remain running. > It turns out that the health check binary does a {{send}} and promptly exits. > Unfortunately, this may lead to a message drop since libprocess may not have > sent this message over the socket by the time the process exits. > We work around this in the command executor with a manual sleep, which has > been around since the svn days. See > [here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3615) Port slave/state.cpp
[ https://issues.apache.org/jira/browse/MESOS-3615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-3615: Sprint: Mesosphere Sprint 24 Story Points: 3 > Port slave/state.cpp > > > Key: MESOS-3615 > URL: https://issues.apache.org/jira/browse/MESOS-3615 > Project: Mesos > Issue Type: Task > Components: slave >Reporter: Alex Clemmer >Assignee: Alex Clemmer > Labels: mesosphere, windows > Fix For: 0.27.0 > > > Important subset of changes this depends on: > slave/state.cpp: pid, os, path, protobuf, paths, state > pid.hpp: address.hpp, ip.hpp > address.hpp: ip.hpp, net.hpp > net.hpp: ip, networking stuff > state: type_utils, pid, os, path, protobuf, uuid > type_utils.hpp: uuid.hpp -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3615) Port slave/state.cpp
[ https://issues.apache.org/jira/browse/MESOS-3615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049841#comment-15049841 ] Joris Van Remoortere commented on MESOS-3615: - https://reviews.apache.org/r/39219 > Port slave/state.cpp > > > Key: MESOS-3615 > URL: https://issues.apache.org/jira/browse/MESOS-3615 > Project: Mesos > Issue Type: Task > Components: slave >Reporter: Alex Clemmer >Assignee: Alex Clemmer > Labels: mesosphere, windows > > Important subset of changes this depends on: > slave/state.cpp: pid, os, path, protobuf, paths, state > pid.hpp: address.hpp, ip.hpp > address.hpp: ip.hpp, net.hpp > net.hpp: ip, networking stuff > state: type_utils, pid, os, path, protobuf, uuid > type_utils.hpp: uuid.hpp -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4106) The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.
[ https://issues.apache.org/jira/browse/MESOS-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049824#comment-15049824 ] haosdent commented on MESOS-4106: - Is adding 'os::sleep(Seconds(1));' enough? > The health checker may fail to inform the executor to kill an unhealthy task > after max_consecutive_failures. > > > Key: MESOS-4106 > URL: https://issues.apache.org/jira/browse/MESOS-4106 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.20.0, 0.20.1, 0.21.1, 0.21.2, 0.22.1, 0.22.2, 0.23.0, > 0.23.1, 0.24.0, 0.24.1, 0.25.0 >Reporter: Benjamin Mahler >Priority: Blocker > > This was reported by [~tan] experimenting with health checks. Many tasks were > launched with the following health check, taken from the container > stdout/stderr: > {code} > Launching health check process: /usr/local/libexec/mesos/mesos-health-check > --executor=(1)@127.0.0.1:39629 > --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0} > --task_id=sleepy-2 > {code} > This should have led to all tasks getting killed due to > {{\-\-consecutive_failures}} being set; however, only some tasks get killed, > while others remain running. > It turns out that the health check binary does a {{send}} and promptly exits. > Unfortunately, this may lead to a message drop since libprocess may not have > sent this message over the socket by the time the process exits. > We work around this in the command executor with a manual sleep, which has > been around since the svn days. See > [here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4108) Implement `os::mkdtemp` for Windows
[ https://issues.apache.org/jira/browse/MESOS-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049819#comment-15049819 ] Joris Van Remoortere commented on MESOS-4108: - https://reviews.apache.org/r/39559 > Implement `os::mkdtemp` for Windows > --- > > Key: MESOS-4108 > URL: https://issues.apache.org/jira/browse/MESOS-4108 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: Alex Clemmer >Assignee: Alex Clemmer > Labels: mesosphere, stout, windows > > Used basically exclusively for testing, this insecure and > otherwise-not-quite-suitable-for-prod function needs to work to run what will > eventually become the FS tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4108) Implement `os::mkdtemp` for Windows
[ https://issues.apache.org/jira/browse/MESOS-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-4108: Sprint: Mesosphere Sprint 24 Story Points: 5 Labels: mesosphere stout windows (was: stout windows) > Implement `os::mkdtemp` for Windows > --- > > Key: MESOS-4108 > URL: https://issues.apache.org/jira/browse/MESOS-4108 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: Alex Clemmer >Assignee: Alex Clemmer > Labels: mesosphere, stout, windows > > Used basically exclusively for testing, this insecure and > otherwise-not-quite-suitable-for-prod function needs to work to run what will > eventually become the FS tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4108) Implement `os::mkdtemp` for Windows
Alex Clemmer created MESOS-4108: --- Summary: Implement `os::mkdtemp` for Windows Key: MESOS-4108 URL: https://issues.apache.org/jira/browse/MESOS-4108 Project: Mesos Issue Type: Bug Components: stout Reporter: Alex Clemmer Assignee: Alex Clemmer Used basically exclusively for testing, this insecure and otherwise-not-quite-suitable-for-prod function needs to work to run what will eventually become the FS tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
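To pin down the requested semantics, here is a rough sketch of a {{mkdtemp}}-style helper for Windows: replace a trailing {{XXXXXX}} template with random characters and create the directory. The function name and the use of {{_mkdir}}/{{std::rand}} are illustrative assumptions, not the stout implementation, and as the description notes this is a test-only convenience rather than a secure primitive:
{code}
// Illustrative only: mimic mkdtemp() semantics on Windows by replacing a
// trailing "XXXXXX" template with pseudo-random characters and creating the
// directory. Test-only convenience, not a secure primitive.
#include <direct.h>   // _mkdir

#include <cstdlib>
#include <string>

inline bool make_temp_directory(std::string& path)
{
  static const std::string chars =
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

  const std::string suffix = "XXXXXX";
  if (path.size() < suffix.size() ||
      path.compare(path.size() - suffix.size(), suffix.size(), suffix) != 0) {
    return false;  // The template must end in "XXXXXX", like POSIX mkdtemp.
  }

  // Fill in the template with pseudo-random characters.
  for (std::string::size_type i = path.size() - suffix.size();
       i < path.size();
       i++) {
    path[i] = chars[std::rand() % chars.size()];
  }

  // Create the directory; returns false if creation fails.
  return ::_mkdir(path.c_str()) == 0;
}
{code}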
[jira] [Updated] (MESOS-4107) `os::strerror_r` breaks the Windows build
[ https://issues.apache.org/jira/browse/MESOS-4107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-4107: Sprint: Mesosphere Sprint 24 Story Points: 1 Labels: mesosphere stout (was: stout) https://reviews.apache.org/r/40382/ > `os::strerror_r` breaks the Windows build > - > > Key: MESOS-4107 > URL: https://issues.apache.org/jira/browse/MESOS-4107 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: Alex Clemmer >Assignee: Alex Clemmer > Labels: mesosphere, stout > > `os::strerror_r` does not exist on Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4107) `os::strerror_r` breaks the Windows build
Alex Clemmer created MESOS-4107: --- Summary: `os::strerror_r` breaks the Windows build Key: MESOS-4107 URL: https://issues.apache.org/jira/browse/MESOS-4107 Project: Mesos Issue Type: Bug Components: stout Reporter: Alex Clemmer Assignee: Alex Clemmer `os::strerror_r` does not exist on Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4088) Modularize existing plain-file logging for executor/task logs
[ https://issues.apache.org/jira/browse/MESOS-4088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049749#comment-15049749 ] Joseph Wu commented on MESOS-4088: -- || Reviews || Summary || | https://reviews.apache.org/r/41166/ | Add {{ExecutorLogger}} to {{Containerizer::Create}} | | https://reviews.apache.org/r/41167/ | Initialize and call the {{ExecutorLogger}} in {{MesosContainerizer::_launch}} | | https://reviews.apache.org/r/41168/ | Update {{MesosTest}} | | https://reviews.apache.org/r/41169/ | Update {{MesosContainerizer}} tests | > Modularize existing plain-file logging for executor/task logs > - > > Key: MESOS-4088 > URL: https://issues.apache.org/jira/browse/MESOS-4088 > Project: Mesos > Issue Type: Task > Components: modules >Reporter: Joseph Wu >Assignee: Joseph Wu > Labels: logging, mesosphere > > Once a module for executor/task output logging has been introduced, the > default module will mirror the existing behavior. Executor/task > stdout/stderr is piped into files within the executor's sandbox directory. > The files are exposed in the web UI, via the {{/files}} endpoint. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
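For readers following the review chain above, a rough guess at the shape such a module interface could take; the class and method names below are assumptions based on the review summaries, not the committed Mesos API:
{code}
// Hypothetical shape of an executor/task log module: the containerizer asks
// the logger where an executor's stdout/stderr should go when it launches a
// container. The default implementation keeps today's plain-file behavior by
// pointing both streams at files in the sandbox, which the /files endpoint
// then exposes in the web UI.
#include <string>

struct SubprocessIO
{
  std::string stdoutPath;
  std::string stderrPath;
};

class ExecutorLogger
{
public:
  virtual ~ExecutorLogger() {}

  // Called from the containerizer's launch path with the sandbox directory;
  // returns where the executor's output streams should be redirected.
  // (executorId is unused by the default implementation.)
  virtual SubprocessIO prepare(
      const std::string& executorId,
      const std::string& sandboxDirectory)
  {
    return SubprocessIO{
        sandboxDirectory + "/stdout",
        sandboxDirectory + "/stderr"};
  }
};
{code}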
[jira] [Comment Edited] (MESOS-4025) SlaveRecoveryTest/0.GCExecutor is flaky.
[ https://issues.apache.org/jira/browse/MESOS-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049741#comment-15049741 ] Greg Mann edited comment on MESOS-4025 at 12/10/15 12:45 AM: - I was just observing this error on Ubuntu 14.04 when running the tests as root. It does indeed seem to be due to artifacts left behind by some tests. After running the entire test suite the first time I saw that several of the {{SlaveRecoveryTest}} tests had failed: {code} [==] 988 tests from 144 test cases ran. (590989 ms total) [ PASSED ] 980 tests. [ FAILED ] 8 tests, listed below: [ FAILED ] SlaveRecoveryTest/0.GCExecutor, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.ShutdownSlave, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.ShutdownSlaveSIGUSR1, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveTest.PingTimeoutNoPings [ FAILED ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample [ FAILED ] CgroupsAnyHierarchyWithPerfEventTest.ROOT_CGROUPS_Perf [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery {code} and the {{SlaveRecoveryTest}} errors all looked the same: {code} [ RUN ] SlaveRecoveryTest/0.ShutdownSlave ../../src/tests/mesos.cpp:906: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave': Device or resource busy --- We're very sorry but we can't seem to destroy existing cgroups that we likely created as part of an earlier invocation of the tests. Please manually destroy the cgroup at '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave' by first manually killing all the processes found in the file at '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave/tasks' --- ../../src/tests/mesos.cpp:940: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave': Device or resource busy [ FAILED ] SlaveRecoveryTest/0.ShutdownSlave, where TypeParam = mesos::internal::slave::MesosContainerizer (31 ms) {code} Next, re-running just these tests with {{sudo GTEST_FILTER="SlaveRecoverTest*" bin/mesos-tests.sh}}, they _all_ failed, with a slightly different error. At this point, these failures are reliably produced 100% of the time: {code} [ RUN ] SlaveRecoveryTest/0.MultipleSlaves ../../src/tests/mesos.cpp:906: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/perf_event/mesos_test': Device or resource busy --- We're very sorry but we can't seem to destroy existing cgroups that we likely created as part of an earlier invocation of the tests. 
Please manually destroy the cgroup at '/sys/fs/cgroup/perf_event/mesos_test' by first manually killing all the processes found in the file at '/sys/fs/cgroup/perf_event/mesos_test/tasks' --- ../../src/tests/mesos.cpp:940: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/perf_event/mesos_test': Device or resource busy [ FAILED ] SlaveRecoveryTest/0.MultipleSlaves, where TypeParam = mesos::internal::slave::MesosContainerizer (14 ms) {code} Finally, I recompiled from scratch and ran _only_ the {{SlaveRecoveryTest}} tests, doing: {code} GTEST_FILTER="" make check sudo GTEST_FILTER="SlaveRecoveryTest*" bin/mesos-tests.sh {code} and they all passed. Regarding [~nfnt]'s comment above on the {{HealthCheckTest}} tests, after all this I was able to run {{sudo GTEST_FILTER="HealthCheckTest*" bin/mesos-tests.sh}}, followed by {{sudo GTEST_FILTER="SlaveRecoveryTest*" bin/mesos-tests.sh}}, and all of the {{SlaveRecoveryTest}} tests passed, so perhaps it isn't an artifact of the {{HealthCheckTest}} tests that is causing this problem. was (Author: greggomann): I was just observing this error on Ubuntu 14.04 when running the tests as root. It does indeed seem to be due to artifacts left behind by some tests. After running the entire test suite the first time I saw that several of the {{SlaveRecoveryTest}} tests had failed: {code} [==] 988 tests from 144 test cases ran. (590989 ms total) [ PASSED ] 980 tests. [ FAILED ] 8 tests, listed below: [ FAILED ] SlaveRecoveryTest/0.GCExecutor, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.ShutdownSlave, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] Slav
[jira] [Comment Edited] (MESOS-4025) SlaveRecoveryTest/0.GCExecutor is flaky.
[ https://issues.apache.org/jira/browse/MESOS-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049741#comment-15049741 ] Greg Mann edited comment on MESOS-4025 at 12/10/15 12:44 AM: - I was just observing this error on Ubuntu 14.04 when running the tests as root. It does indeed seem to be due to artifacts left behind by some tests. After running the entire test suite the first time I saw that several of the {{SlaveRecoveryTest}} tests had failed: {code} [==] 988 tests from 144 test cases ran. (590989 ms total) [ PASSED ] 980 tests. [ FAILED ] 8 tests, listed below: [ FAILED ] SlaveRecoveryTest/0.GCExecutor, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.ShutdownSlave, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.ShutdownSlaveSIGUSR1, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveTest.PingTimeoutNoPings [ FAILED ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample [ FAILED ] CgroupsAnyHierarchyWithPerfEventTest.ROOT_CGROUPS_Perf [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery {code} and the {{SlaveRecoveryTest}} errors all looked the same: {code} [ RUN ] SlaveRecoveryTest/0.ShutdownSlave ../../src/tests/mesos.cpp:906: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave': Device or resource busy --- We're very sorry but we can't seem to destroy existing cgroups that we likely created as part of an earlier invocation of the tests. Please manually destroy the cgroup at '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave' by first manually killing all the processes found in the file at '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave/tasks' --- ../../src/tests/mesos.cpp:940: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave': Device or resource busy [ FAILED ] SlaveRecoveryTest/0.ShutdownSlave, where TypeParam = mesos::internal::slave::MesosContainerizer (31 ms) {code} Next, re-running just these tests with {{sudo GTEST_FILTER="SlaveRecoverTest*" bin/mesos-tests.sh}}, they _all_ failed, with a slightly different error. At this point, these failures are reliably produced 100% of the time: {code} [ RUN ] SlaveRecoveryTest/0.MultipleSlaves ../../src/tests/mesos.cpp:906: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/perf_event/mesos_test': Device or resource busy --- We're very sorry but we can't seem to destroy existing cgroups that we likely created as part of an earlier invocation of the tests. 
Please manually destroy the cgroup at '/sys/fs/cgroup/perf_event/mesos_test' by first manually killing all the processes found in the file at '/sys/fs/cgroup/perf_event/mesos_test/tasks' --- ../../src/tests/mesos.cpp:940: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/perf_event/mesos_test': Device or resource busy [ FAILED ] SlaveRecoveryTest/0.MultipleSlaves, where TypeParam = mesos::internal::slave::MesosContainerizer (14 ms) {code} Finally, I recompiled from scratch and ran _only_ the {{SlaveRecoveryTest}} tests, doing: {code} GTEST_FILTER="" make check sudo GTEST_FILTER="SlaveRecoveryTest*" bin/mesos-tests.sh {code} and they all passed. Regarding [~nfnt]'s comment above on the {{HealthCheckTest}} tests, after all this I was able to run {{sudo GTEST_FILTER="HealthCheckTest*" bin/mesos-tests.sh}}, followed by {{sudo GTEST_FILTER="SlaveRecoveryTest*" bin/mesos-tests.sh}}, and all of the {{SlaveRecoveryTest}} tests passed, so perhaps it isn't an artifact of the {{HealthCheckTest}}s that is causing this problem. was (Author: greggomann): I was just observing this error on Ubuntu 14.04 when running the tests as root. It does indeed seem to be due to artifacts left behind by some tests. After running the entire test suite the first time I saw that several of the {{SlaveRecoveryTest}}s had failed: {code} [==] 988 tests from 144 test cases ran. (590989 ms total) [ PASSED ] 980 tests. [ FAILED ] 8 tests, listed below: [ FAILED ] SlaveRecoveryTest/0.GCExecutor, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.ShutdownSlave, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryT
[jira] [Commented] (MESOS-4025) SlaveRecoveryTest/0.GCExecutor is flaky.
[ https://issues.apache.org/jira/browse/MESOS-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049741#comment-15049741 ] Greg Mann commented on MESOS-4025: -- I was just observing this error on Ubuntu 14.04 when running the tests as root. It does indeed seem to be due to artifacts left behind by some tests. After running the entire test suite the first time I saw that several of the {{SlaveRecoveryTest}}s had failed: {code} [==] 988 tests from 144 test cases ran. (590989 ms total) [ PASSED ] 980 tests. [ FAILED ] 8 tests, listed below: [ FAILED ] SlaveRecoveryTest/0.GCExecutor, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.ShutdownSlave, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.ShutdownSlaveSIGUSR1, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveTest.PingTimeoutNoPings [ FAILED ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample [ FAILED ] CgroupsAnyHierarchyWithPerfEventTest.ROOT_CGROUPS_Perf [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery {code} and the {{SlaveRecoveryTest}} errors all looked the same: {code} [ RUN ] SlaveRecoveryTest/0.ShutdownSlave ../../src/tests/mesos.cpp:906: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave': Device or resource busy --- We're very sorry but we can't seem to destroy existing cgroups that we likely created as part of an earlier invocation of the tests. Please manually destroy the cgroup at '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave' by first manually killing all the processes found in the file at '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave/tasks' --- ../../src/tests/mesos.cpp:940: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave': Device or resource busy [ FAILED ] SlaveRecoveryTest/0.ShutdownSlave, where TypeParam = mesos::internal::slave::MesosContainerizer (31 ms) {code} Next, re-running just these tests with {{sudo GTEST_FILTER="SlaveRecoverTest*" bin/mesos-tests.sh}}, they _all_ failed, with a slightly different error. At this point, these failures are reliably produced 100% of the time: {code} [ RUN ] SlaveRecoveryTest/0.MultipleSlaves ../../src/tests/mesos.cpp:906: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/perf_event/mesos_test': Device or resource busy --- We're very sorry but we can't seem to destroy existing cgroups that we likely created as part of an earlier invocation of the tests. Please manually destroy the cgroup at '/sys/fs/cgroup/perf_event/mesos_test' by first manually killing all the processes found in the file at '/sys/fs/cgroup/perf_event/mesos_test/tasks' --- ../../src/tests/mesos.cpp:940: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/perf_event/mesos_test': Device or resource busy [ FAILED ] SlaveRecoveryTest/0.MultipleSlaves, where TypeParam = mesos::internal::slave::MesosContainerizer (14 ms) {code} Finally, I recompiled from scratch and ran _only_ the {{SlaveRecoveryTest}}s, doing: {code} GTEST_FILTER="" make check sudo GTEST_FILTER="SlaveRecoveryTest*" bin/mesos-tests.sh {code} and they all passed. 
Regarding [~nfnt]'s comment above on the {{HealthCheckTest}}s, after all this I was able to run {{sudo GTEST_FILTER="HealthCheckTest*" bin/mesos-tests.sh}}, followed by {{sudo GTEST_FILTER="SlaveRecoveryTest*" bin/mesos-tests.sh}}, and all of the {{SlaveRecoveryTest}}s passed, so perhaps it isn't an artifact of the {{HealthCheckTest}}s that is causing this problem. > SlaveRecoveryTest/0.GCExecutor is flaky. > > > Key: MESOS-4025 > URL: https://issues.apache.org/jira/browse/MESOS-4025 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 0.26.0 >Reporter: Till Toenshoff >Assignee: Jan Schlicht > Labels: flaky, flaky-test, test > > Build was SSL enabled (--enable-ssl, --enable-libevent). The build was based > on 0.26.0-rc1. > Testsuite was run as root. > {noformat} > sudo ./bin/mesos-tests.sh --gtest_break_on_failure --gtest_repeat=-1 > {noformat} > {noformat} > [ RUN ] SlaveRecoveryTest/0.GCExecutor > I1130 16:49:16.336833 1032
[jira] [Commented] (MESOS-1718) Command executor can overcommit the slave.
[ https://issues.apache.org/jira/browse/MESOS-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049692#comment-15049692 ] Vinod Kone commented on MESOS-1718: --- I think adding yet another field "ExecutorInfo::task_command" is potentially confusing. Why not reuse "TaskInfo::CommandInfo" instead? I suggest we relax the current constraint that "only one of TaskInfo::CommandInfo or TaskInfo::ExecutorInfo" should be set. It should be OK if both are set, as far as the slave is concerned. So, when the framework sends a launch task with TaskInfo::ExecutorInfo unset, the master can set that field with the command executor information and pass it on to the slave. Note that there are no required fields in CommandInfo, so the master can leave them unset and let the slave fill them in. But I'm not convinced that having the slave fill in ExecutorInfo::CommandInfo is a good idea. This is because, when a master fails over, it learns about existing ExecutorInfos from the re-registered slaves. Since these ExecutorInfos have updated CommandInfo, it will look weird in the master. The weirdness is because any new command tasks launched will not have ExecutorInfo::CommandInfo::* set, whereas command executors from re-registered slaves will have it set. Also note that we haven't yet talked about backwards-compatibility concerns here. I'm guessing we need to make changes to the slave and the command executor to make sure they work with both old-style and new-style command tasks. > Command executor can overcommit the slave. > -- > > Key: MESOS-1718 > URL: https://issues.apache.org/jira/browse/MESOS-1718 > Project: Mesos > Issue Type: Bug > Components: slave >Reporter: Benjamin Mahler >Assignee: Ian Downes > > Currently we give a small amount of resources to the command executor, in > addition to resources used by the command task: > https://github.com/apache/mesos/blob/0.20.0-rc1/src/slave/slave.cpp#L2448 > {code: title=} > ExecutorInfo Slave::getExecutorInfo( > const FrameworkID& frameworkId, > const TaskInfo& task) > { > ... > // Add an allowance for the command executor. This does lead to a > // small overcommit of resources. > executor.mutable_resources()->MergeFrom( > Resources::parse( > "cpus:" + stringify(DEFAULT_EXECUTOR_CPUS) + ";" + > "mem:" + stringify(DEFAULT_EXECUTOR_MEM.megabytes())).get()); > ... > } > {code} > This leads to an overcommit of the slave. Ideally, for command tasks we can > "transfer" all of the task resources to the executor at the slave / isolation > level. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
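To make the suggestion in the comment above concrete, here is a rough sketch of what the master-side step might look like. The function name and placement are assumptions for illustration only, not actual Mesos master code:
{code}
#include <mesos/mesos.pb.h>

using mesos::ExecutorInfo;
using mesos::TaskInfo;

// If the framework sent a command task (CommandInfo set, ExecutorInfo unset),
// attach a bare command-executor ExecutorInfo before forwarding the task to
// the slave. CommandInfo has no required fields, so it can be left empty here
// and filled in by the slave.
void maybeAttachCommandExecutor(TaskInfo* task)
{
  if (task->has_executor() || !task->has_command()) {
    return; // Custom-executor task, or nothing to do.
  }

  ExecutorInfo executor;
  executor.mutable_executor_id()->set_value(task->task_id().value());
  executor.mutable_command(); // Present but deliberately empty.

  task->mutable_executor()->CopyFrom(executor);
}
{code}
Under this scheme the slave would see both {{TaskInfo::command}} and {{TaskInfo::executor}} set and could launch the command executor, filling in the still-empty CommandInfo locally.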
[jira] [Commented] (MESOS-4003) Pass agent work_dir to isolator modules
[ https://issues.apache.org/jira/browse/MESOS-4003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049684#comment-15049684 ] Greg Mann commented on MESOS-4003: -- In order to prevent breaking the isolator interface in the future when more parameters may be added, a new protobuf message was added and made the sole parameter of {{Isolator::recover()}}. Review is posted here: https://reviews.apache.org/r/41113/ > Pass agent work_dir to isolator modules > --- > > Key: MESOS-4003 > URL: https://issues.apache.org/jira/browse/MESOS-4003 > Project: Mesos > Issue Type: Bug >Reporter: Greg Mann >Assignee: Greg Mann > Labels: external-volumes, mesosphere > > Some isolator modules can benefit from access to the agent's {{work_dir}}. > For example, the DVD isolator (https://github.com/emccode/mesos-module-dvdi) > is currently forced to mount external volumes in a hard-coded directory. > Making the {{work_dir}} accessible to the isolator via > {{Isolator::recover()}} would allow the isolator to mount volumes within the > agent's {{work_dir}}. This can be accomplished by simply adding an overloaded > signature for {{Isolator::recover()}} which includes the {{work_dir}} as a > parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
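Since the comment above is terse, here is an illustrative sketch of the pattern being described: a single, growable parameter object instead of an ever-longer argument list. The type and field names are stand-ins, not the actual message added in r/41113:
{code}
#include <string>
#include <vector>

// Stand-in for the new protobuf message: everything recover() needs travels
// in one place, so future additions (like the agent work_dir here) become
// new fields rather than new parameters.
struct IsolatorRecoverInfo
{
  std::vector<std::string> containerIds;  // state recovered from checkpoints
  std::string workDir;                    // agent work_dir (the new addition)
  // ...more fields can be appended later without breaking modules...
};

class Isolator
{
public:
  virtual ~Isolator() {}

  // Before: recover() took positional parameters, so every new piece of
  // information changed the signature and broke existing isolator modules.
  // After: the signature stays fixed; only the message grows.
  virtual bool recover(const IsolatorRecoverInfo& info) = 0;
};
{code}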
[jira] [Commented] (MESOS-4106) The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.
[ https://issues.apache.org/jira/browse/MESOS-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049651#comment-15049651 ] Benjamin Mahler commented on MESOS-4106: This is also possibly the reason for MESOS-1613. > The health checker may fail to inform the executor to kill an unhealthy task > after max_consecutive_failures. > > > Key: MESOS-4106 > URL: https://issues.apache.org/jira/browse/MESOS-4106 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.20.0, 0.20.1, 0.21.1, 0.21.2, 0.22.1, 0.22.2, 0.23.0, > 0.23.1, 0.24.0, 0.24.1, 0.25.0 >Reporter: Benjamin Mahler >Priority: Blocker > > This was reported by [~tan] experimenting with health checks. Many tasks were > launched with the following health check, taken from the container > stdout/stderr: > {code} > Launching health check process: /usr/local/libexec/mesos/mesos-health-check > --executor=(1)@127.0.0.1:39629 > --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0} > --task_id=sleepy-2 > {code} > This should have led to all tasks getting killed due to > {{\-\-consecutive_failures}} being set; however, only some tasks get killed, > while others remain running. > It turns out that the health check binary does a {{send}} and promptly exits. > Unfortunately, this may lead to a message drop since libprocess may not have > sent this message over the socket by the time the process exits. > We work around this in the command executor with a manual sleep, which has > been around since the svn days. See > [here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4106) The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.
Benjamin Mahler created MESOS-4106: -- Summary: The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures. Key: MESOS-4106 URL: https://issues.apache.org/jira/browse/MESOS-4106 Project: Mesos Issue Type: Bug Affects Versions: 0.25.0, 0.24.1, 0.24.0, 0.23.1, 0.23.0, 0.22.2, 0.22.1, 0.21.2, 0.21.1, 0.20.1, 0.20.0 Reporter: Benjamin Mahler Priority: Blocker This was reported by [~tan] experimenting with health checks. Many tasks were launched with the following health check, taken from the container stdout/stderr: {code} Launching health check process: /usr/local/libexec/mesos/mesos-health-check --executor=(1)@127.0.0.1:39629 --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0} --task_id=sleepy-2 {code} This should have led to all tasks getting killed due to {{\-\-consecutive_failures}} being set; however, only some tasks get killed, while others remain running. It turns out that the health check binary does a {{send}} and promptly exits. Unfortunately, this may lead to a message drop since libprocess may not have sent this message over the socket by the time the process exits. We work around this in the command executor with a manual sleep, which has been around since the svn days. See [here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4105) Network isolator causes corrupt packets to reach application
[ https://issues.apache.org/jira/browse/MESOS-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Downes updated MESOS-4105: -- Description: The optional network isolator (network/port_mapping) will let corrupt TCP packets reach the application. This could lead to data corruption in applications. Normally these packets are dropped immediately by the network stack and do not reach the application. Networks may have a very low level of corrupt packets (a few per million) or, may have very high levels if there are hardware or software errors in networking equipment. 1) We receive a corrupt packet externally 2) The hardware driver is able to checksum it and notices it has a bad checksum 3) The driver delivers this packet anyway to wait for TCP layer to checksum it again and then drop it 4) This packet is moved to a veth interface because it is for a container 5) Both sides of the veth pair have RX checksum offloading enabled by default 6) The veth_xmit() marks the packet's checksum as UNNECESSARY since its peer device has rx checksum offloading 7) Packet is moved into the container TCP/IP stack 8) TCP layer is not going to checksum it since it is not necessary 9) The packet gets delivered to application layer was: The optional network isolator (network/port_mapping) will let corrupt TCP packets reach the application. This could lead to data corruption in applications. Normally these packets are dropped immediately by the network stack and do not reach the application. Networks may have a very low level of corrupt packets (a few per million) or, may have very high levels if there are hardware or software errors in networking equipment. Investigation is ongoing but an initial hypothesis is being tested: 1) The checksum error is correctly detected by the host interface. 2) The Mesos tc filters used by the network isolator redirect the packet to the virtual interface, even when a checksum error has occurred. 3) Either in copying to the veth device or passing across the veth pipe the checksum flag is cleared. 4) The veth inside the container does not verify the checksum, even though TCP RX checksum offloading is supposedly on. \[This is hypothesized to be acceptable normally because it's receiving packets over the virtual link where corruption should not occur\] 5) The container network stack accepts the packet and delivers it to the application. Disabling tcp rx cso on the container veth appears to fix this: it forces the container network stack to compute the packet checksums (in software) whereby it detects the checksum errors and does not deliver the packet to the application. > Network isolator causes corrupt packets to reach application > > > Key: MESOS-4105 > URL: https://issues.apache.org/jira/browse/MESOS-4105 > Project: Mesos > Issue Type: Bug > Components: isolation >Affects Versions: 0.20.0, 0.20.1, 0.21.0, 0.21.1, 0.21.2, 0.22.0, 0.22.1, > 0.22.2, 0.23.0, 0.23.1, 0.24.0, 0.24.1, 0.25.0 >Reporter: Ian Downes >Assignee: Cong Wang >Priority: Critical > > The optional network isolator (network/port_mapping) will let corrupt TCP > packets reach the application. This could lead to data corruption in > applications. Normally these packets are dropped immediately by the network > stack and do not reach the application. > Networks may have a very low level of corrupt packets (a few per million) or, > may have very high levels if there are hardware or software errors in > networking equipment. 
> 1) We receive a corrupt packet externally > 2) The hardware driver is able to checksum it and notices it has a bad > checksum > 3) The driver delivers this packet anyway to wait for TCP layer to checksum > it again and then drop it > 4) This packet is moved to a veth interface because it is for a container > 5) Both sides of the veth pair have RX checksum offloading enabled by default > 6) The veth_xmit() marks the packet's checksum as UNNECESSARY since its peer > device has rx checksum offloading > 7) Packet is moved into the container TCP/IP stack > 8) TCP layer is not going to checksum it since it is not necessary > 9) The packet gets delivered to application layer -- This message was sent by Atlassian JIRA (v6.3.4#6332)
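The workaround noted in the earlier version of the description, disabling RX checksum offload on the container-side veth (what {{ethtool -K <veth> rx off}} does), comes down to an ethtool ioctl. Below is a standalone sketch using the legacy {{ETHTOOL_SRXCSUM}} command; this is an assumption about the mechanism, not the actual Mesos patch, and it would need to run inside the container's network namespace against the container-side veth:
{code}
#include <cstdio>
#include <cstring>

#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

#include <linux/ethtool.h>
#include <linux/sockios.h>

// Turn off RX checksum offload on `ifname` so the container's network stack
// verifies TCP checksums in software and drops corrupt packets instead of
// delivering them to the application.
int disableRxChecksumOffload(const char* ifname)
{
  int fd = socket(AF_INET, SOCK_DGRAM, 0);
  if (fd < 0) {
    perror("socket");
    return -1;
  }

  struct ethtool_value value;
  value.cmd = ETHTOOL_SRXCSUM; // Legacy "set RX checksumming" command.
  value.data = 0;              // 0 == off.

  struct ifreq ifr;
  memset(&ifr, 0, sizeof(ifr));
  strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
  ifr.ifr_data = reinterpret_cast<char*>(&value);

  int result = ioctl(fd, SIOCETHTOOL, &ifr);
  if (result != 0) {
    perror("SIOCETHTOOL(ETHTOOL_SRXCSUM)");
  }

  close(fd);
  return result;
}

int main(int argc, char** argv)
{
  return disableRxChecksumOffload(argc > 1 ? argv[1] : "eth0");
}
{code}
Wherever the eventual fix lands, the effect described above is the same: with RX checksum offload off, the container's TCP stack computes checksums itself and discards the corrupt packets.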
[jira] [Updated] (MESOS-4105) Network isolator causes corrupt packets to reach application
[ https://issues.apache.org/jira/browse/MESOS-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Downes updated MESOS-4105: -- Assignee: Cong Wang > Network isolator causes corrupt packets to reach application > > > Key: MESOS-4105 > URL: https://issues.apache.org/jira/browse/MESOS-4105 > Project: Mesos > Issue Type: Bug > Components: isolation >Affects Versions: 0.20.0, 0.20.1, 0.21.0, 0.21.1, 0.21.2, 0.22.0, 0.22.1, > 0.22.2, 0.23.0, 0.23.1, 0.24.0, 0.24.1, 0.25.0 >Reporter: Ian Downes >Assignee: Cong Wang >Priority: Critical > > The optional network isolator (network/port_mapping) will let corrupt TCP > packets reach the application. This could lead to data corruption in > applications. Normally these packets are dropped immediately by the network > stack and do not reach the application. > Networks may have a very low level of corrupt packets (a few per million) or, > may have very high levels if there are hardware or software errors in > networking equipment. > Investigation is ongoing but an initial hypothesis is being tested: > 1) The checksum error is correctly detected by the host interface. > 2) The Mesos tc filters used by the network isolator redirect the packet to > the virtual interface, even when a checksum error has occurred. > 3) Either in copying to the veth device or passing across the veth pipe the > checksum flag is cleared. > 4) The veth inside the container does not verify the checksum, even though > TCP RX checksum offloading is supposedly on. \[This is hypothesized to be > acceptable normally because it's receiving packets over the virtual link > where corruption should not occur\] > 5) The container network stack accepts the packet and delivers it to the > application. > Disabling tcp rx cso on the container veth appears to fix this: it forces the > container network stack to compute the packet checksums (in software) whereby > it detects the checksum errors and does not deliver the packet to the > application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4105) Network isolator causes corrupt packets to reach application
Ian Downes created MESOS-4105: - Summary: Network isolator causes corrupt packets to reach application Key: MESOS-4105 URL: https://issues.apache.org/jira/browse/MESOS-4105 Project: Mesos Issue Type: Bug Components: isolation Affects Versions: 0.25.0, 0.24.1, 0.24.0, 0.23.1, 0.23.0, 0.22.2, 0.22.1, 0.22.0, 0.21.2, 0.21.1, 0.21.0, 0.20.1, 0.20.0 Reporter: Ian Downes Priority: Critical The optional network isolator (network/port_mapping) will let corrupt TCP packets reach the application. This could lead to data corruption in applications. Normally these packets are dropped immediately by the network stack and do not reach the application. Networks may have a very low level of corrupt packets (a few per million) or, may have very high levels if there are hardware or software errors in networking equipment. Investigation is ongoing but an initial hypothesis is being tested: 1) The checksum error is correctly detected by the host interface. 2) The Mesos tc filters used by the network isolator redirect the packet to the virtual interface, even when a checksum error has occurred. 3) Either in copying to the veth device or passing across the veth pipe the checksum flag is cleared. 4) The veth inside the container does not verify the checksum, even though TCP RX checksum offloading is supposedly on. \[This is hypothesized to be acceptable normally because it's receiving packets over the virtual link where corruption should not occur\] 5) The container network stack accepts the packet and delivers it to the application. Disabling tcp rx cso on the container veth appears to fix this: it forces the container network stack to compute the packet checksums (in software) whereby it detects the checksum errors and does not deliver the packet to the application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4082) Add Tests for quota authentication and authorization
[ https://issues.apache.org/jira/browse/MESOS-4082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-4082: --- Shepherd: Till Toenshoff > Add Tests for quota authentication and authorization > -- > > Key: MESOS-4082 > URL: https://issues.apache.org/jira/browse/MESOS-4082 > Project: Mesos > Issue Type: Task > Components: master, test >Reporter: Jan Schlicht >Assignee: Jan Schlicht > Labels: mesosphere, quota > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3065) Add framework authorization for persistent volume
[ https://issues.apache.org/jira/browse/MESOS-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-3065: - Sprint: Mesosphere Sprint 16, Mesosphere Sprint 22, Mesosphere Sprint 24 (was: Mesosphere Sprint 16, Mesosphere Sprint 22) > Add framework authorization for persistent volume > - > > Key: MESOS-3065 > URL: https://issues.apache.org/jira/browse/MESOS-3065 > Project: Mesos > Issue Type: Task >Reporter: Michael Park >Assignee: Greg Mann > Labels: mesosphere, persistent-volumes > > Persistent volume should be authorized with the {{principal}} of the > reserving entity (framework or master). The idea is to introduce {{Create}} > and {{Destroy}} into the ACL. > {code} > message Create { > // Subjects. > required Entity principals = 1; > // Objects? Perhaps the kind of volume? allowed permissions? > } > message Destroy { > // Subjects. > required Entity principals = 1; > // Objects. > required Entity creator_principals = 2; > } > {code} > When a framework creates a persistent volume, "create" ACLs are checked to > see if the framework (FrameworkInfo.principal) or the operator > (Credential.user) is authorized to create persistent volumes. If not > authorized, the create operation is rejected. > When a framework destroys a persistent volume, "destroy" ACLs are checked to > see if the framework (FrameworkInfo.principal) or the operator > (Credential.user) is authorized to destroy the persistent volume created by a > framework or operator (Resource.DiskInfo.principal). If not authorized, the > destroy operation is rejected. > A separate ticket will use the structures created here to enable > authorization of the "/create" and "/destroy" HTTP endpoints: > https://issues.apache.org/jira/browse/MESOS-3903 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2980) Allow runtime configuration to be returned from provisioner
[ https://issues.apache.org/jira/browse/MESOS-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilbert Song updated MESOS-2980: Story Points: 5 > Allow runtime configuration to be returned from provisioner > --- > > Key: MESOS-2980 > URL: https://issues.apache.org/jira/browse/MESOS-2980 > Project: Mesos > Issue Type: Improvement >Reporter: Timothy Chen >Assignee: Gilbert Song > Labels: mesosphere > > Image specs also includes execution configuration (e.g: Env, user, ports, > etc). > We should support passing those information from the image provisioner back > to the containerizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2980) Allow runtime configuration to be returned from provisioner
[ https://issues.apache.org/jira/browse/MESOS-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049109#comment-15049109 ] Gilbert Song commented on MESOS-2980: - Just finish this series: 1. https://reviews.apache.org/r/41122/ | add protobuf 2. https://reviews.apache.org/r/41011/ | provisioner/filesystem isolator 3. https://reviews.apache.org/r/41032/ | local/registry puller 4. https://reviews.apache.org/r/41123/ | simple cleanup JSON parse 5. https://reviews.apache.org/r/41124/ | metadata manager 6. https://reviews.apache.org/r/41125/ | docker/appc store > Allow runtime configuration to be returned from provisioner > --- > > Key: MESOS-2980 > URL: https://issues.apache.org/jira/browse/MESOS-2980 > Project: Mesos > Issue Type: Improvement >Reporter: Timothy Chen >Assignee: Gilbert Song > Labels: mesosphere > > Image specs also includes execution configuration (e.g: Env, user, ports, > etc). > We should support passing those information from the image provisioner back > to the containerizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4102) Quota doesn't allocate resources on slave joining
[ https://issues.apache.org/jira/browse/MESOS-4102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049080#comment-15049080 ] Neil Conway commented on MESOS-4102: Thanks for the explanation. I understand what is going on, so the question is whether this is the best behavior. Basically, the current implementation of event-triggered allocations will always make legal allocations (per quota), but might not make all the allocations that legally could be made. Is that considered a problem and/or something we want to change? It would be helpful for me to understand why we have event-triggered allocations in the first place. If we need regular batch allocations to ensure that all resources are allocated appropriately, then I guess event-triggered allocations are just intended to be a "best-effort" mechanism, to do _something_ about a change in cluster state until the next batch allocation occurs? > Quota doesn't allocate resources on slave joining > - > > Key: MESOS-4102 > URL: https://issues.apache.org/jira/browse/MESOS-4102 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Neil Conway > Labels: mesosphere, quota > Attachments: quota_absent_framework_test-1.patch > > > See attached patch. {{framework1}} is not allocated any resources, despite > the fact that the resources on {{agent2}} can safely be allocated to it > without risk of violating {{quota1}}. If I understand the intended quota > behavior correctly, this doesn't seem intended. > Note that if the framework is added _after_ the slaves are added, the > resources on {{agent2}} are allocated to {{framework1}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3738) Mesos health check is invoked incorrectly when Mesos slave is within the docker container
[ https://issues.apache.org/jira/browse/MESOS-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049014#comment-15049014 ] haosdent commented on MESOS-3738: - Hi, you need patch https://issues.apache.org/jira/secure/attachment/12766990/MESOS-3738-0_23_1.patch which I upload in attachments. > Mesos health check is invoked incorrectly when Mesos slave is within the > docker container > - > > Key: MESOS-3738 > URL: https://issues.apache.org/jira/browse/MESOS-3738 > Project: Mesos > Issue Type: Bug > Components: containerization, docker >Affects Versions: 0.25.0 > Environment: Docker 1.8.0: > Client: > Version: 1.8.0 > API version: 1.20 > Go version: go1.4.2 > Git commit: 0d03096 > Built:Tue Aug 11 16:48:39 UTC 2015 > OS/Arch: linux/amd64 > Server: > Version: 1.8.0 > API version: 1.20 > Go version: go1.4.2 > Git commit: 0d03096 > Built:Tue Aug 11 16:48:39 UTC 2015 > OS/Arch: linux/amd64 > Host: Ubuntu 14.04 > Container: Debian 8.1 + Java-7 >Reporter: Yong Tang >Assignee: haosdent > Fix For: 0.26.0 > > Attachments: MESOS-3738-0_23_1.patch, MESOS-3738-0_24_1.patch, > MESOS-3738-0_25_0.patch > > > When Mesos slave is within the container, the COMMAND health check from > Marathon is invoked incorrectly. > In such a scenario, the sandbox directory (instead of the > launcher/health-check directory) is used. This result in an error with the > container. > Command to invoke the Mesos slave container: > {noformat} > sudo docker run -d -v /sys:/sys -v /usr/bin/docker:/usr/bin/docker:ro -v > /usr/lib/x86_64-linux-gnu/libapparmor.so.1:/usr/lib/x86_64-linux-gnu/libapparmor.so.1:ro > -v /var/run/docker.sock:/var/run/docker.sock -v /tmp/mesos:/tmp/mesos mesos > mesos slave --master=zk://10.2.1.2:2181/mesos --containerizers=docker,mesos > --executor_registration_timeout=5mins --docker_stop_timeout=10secs > --launcher=posix > {noformat} > Marathon JSON file: > {code} > { > "id": "ubuntu", > "container": > { > "type": "DOCKER", > "docker": > { > "image": "ubuntu", > "network": "BRIDGE", > "parameters": [] > } > }, > "args": [ "bash", "-c", "while true; do echo 1; sleep 5; done" ], > "uris": [], > "healthChecks": > [ > { > "protocol": "COMMAND", > "command": { "value": "echo Success" }, > "gracePeriodSeconds": 3000, > "intervalSeconds": 5, > "timeoutSeconds": 5, > "maxConsecutiveFailures": 300 > } > ], > "instances": 1 > } > {code} > {noformat} > STDOUT: > root@cea2be47d64f:/mnt/mesos/sandbox# cat stdout > --container="mesos-e20f8959-cd9f-40ae-987d-809401309361-S0.815cc886-1cd1-4f13-8f9b-54af1f127c3f" > --docker="docker" --docker_socket="/var/run/docker.sock" --help="false" > --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO" > --mapped_directory="/mnt/mesos/sandbox" --quiet="false" > --sandbox_directory="/tmp/mesos/slaves/e20f8959-cd9f-40ae-987d-809401309361-S0/frameworks/e20f8959-cd9f-40ae-987d-809401309361-/executors/ubuntu.86bca10f-72c9-11e5-b36d-02420a020106/runs/815cc886-1cd1-4f13-8f9b-54af1f127c3f" > --stop_timeout="10secs" > --container="mesos-e20f8959-cd9f-40ae-987d-809401309361-S0.815cc886-1cd1-4f13-8f9b-54af1f127c3f" > --docker="docker" --docker_socket="/var/run/docker.sock" --help="false" > --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO" > --mapped_directory="/mnt/mesos/sandbox" --quiet="false" > 
--sandbox_directory="/tmp/mesos/slaves/e20f8959-cd9f-40ae-987d-809401309361-S0/frameworks/e20f8959-cd9f-40ae-987d-809401309361-/executors/ubuntu.86bca10f-72c9-11e5-b36d-02420a020106/runs/815cc886-1cd1-4f13-8f9b-54af1f127c3f" > --stop_timeout="10secs" > Registered docker executor on b01e2e75afcb > Starting task ubuntu.86bca10f-72c9-11e5-b36d-02420a020106 > 1 > Launching health check process: > /tmp/mesos/slaves/e20f8959-cd9f-40ae-987d-809401309361-S0/frameworks/e20f8959-cd9f-40ae-987d-809401309361-/executors/ubuntu.86bca10f-72c9-11e5-b36d-02420a020106/runs/815cc886-1cd1-4f13-8f9b-54af1f127c3f/mesos-health-check > --executor=(1)@10.2.1.7:40695 > --health_check_json={"command":{"shell":true,"value":"docker exec > mesos-e20f8959-cd9f-40ae-987d-809401309361-S0.815cc886-1cd1-4f13-8f9b-54af1f127c3f > sh -c \" echo Success > \""},"consecutive_failures":300,"delay_seconds":0.0,"grace_period_seconds":3000.0,"interval_seconds":5.0,"timeout_seconds":5.0} > --task_id=ubuntu.86bca10f-72c9-11e5-b36d-02420a020106 > Health check process launched at pid: 94 > 1 > 1 > 1 > 1 > 1 > STDERR: > root@cea2be47d64f:/mnt/mesos/sandbox# cat stderr > I1014 23:15:58.12795056 exec.cpp:134] Version: 0.25.0 > I1014 23:15:58
[jira] [Updated] (MESOS-4098) Allow interactive terminal for mesos containerizer
[ https://issues.apache.org/jira/browse/MESOS-4098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jojy Varghese updated MESOS-4098: - Story Points: 10 (was: 4) > Allow interactive terminal for mesos containerizer > -- > > Key: MESOS-4098 > URL: https://issues.apache.org/jira/browse/MESOS-4098 > Project: Mesos > Issue Type: Improvement > Components: containerization > Environment: linux >Reporter: Jojy Varghese >Assignee: Jojy Varghese > Labels: mesosphere > > Today mesos containerizer does not have a way to run tasks that require > interactive sessions. An example use case is running a task that requires a > manual password entry from an operator. Another use case could be debugging > (gdb). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4098) Allow interactive terminal for mesos containerizer
[ https://issues.apache.org/jira/browse/MESOS-4098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jojy Varghese updated MESOS-4098: - Issue Type: Story (was: Improvement) > Allow interactive terminal for mesos containerizer > -- > > Key: MESOS-4098 > URL: https://issues.apache.org/jira/browse/MESOS-4098 > Project: Mesos > Issue Type: Story > Components: containerization > Environment: linux >Reporter: Jojy Varghese >Assignee: Jojy Varghese > Labels: mesosphere > > Today mesos containerizer does not have a way to run tasks that require > interactive sessions. An example use case is running a task that requires a > manual password entry from an operator. Another use case could be debugging > (gdb). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4104) Design document for interactive terminal for mesos containerizer
Jojy Varghese created MESOS-4104: Summary: Design document for interactive terminal for mesos containerizer Key: MESOS-4104 URL: https://issues.apache.org/jira/browse/MESOS-4104 Project: Mesos Issue Type: Task Components: containerization Reporter: Jojy Varghese Assignee: Jojy Varghese As a first step toward addressing the use cases, propose a design document covering the requirements, design, and implementation details. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4102) Quota doesn't allocate resources on slave joining
[ https://issues.apache.org/jira/browse/MESOS-4102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048495#comment-15048495 ] Alexander Rukletsov commented on MESOS-4102: The reason you see this behaviour is a) because event-triggered allocations do not include all available agents *and* b) because we do not persist set aside resources across allocations. However, during the next batch allocation cycle, we will observe all active agents and will be able to properly allocate resources not set aside for quota. Now the question is: *do you find this behaviour surprising*, i.e. *shall we fix it*? In your case, you can make your test succeed if you add {code} Clock::advance(flags.allocation_interval); Clock::settle(); {code} before {{Future allocation = allocations.get();}} The reason why we do b) is because we do not want to "attach" unallocated part of quota to particular agents. Technically, we do not even set aside resources, rather we stop allocating to non quota'ed frameworks if remaining resources are less than the unsatisfied quota part. For posterity, let me elaborate on the sequence of events happening in your test. # Quota {{cpus:2;mem:1024}} is set for {{QUOTA_ROLE}} #* {{allocate()}} for all agents is triggered #* total resources are {{0}} #* no resources to allocate, hence allocation callback is not called, hence nothing is pushed into the {{allocations}} queue # {{framework1}} is added to {{NO_QUOTA_ROLE}} #* {{allocate()}} for all agents is triggered #* total resources are {{0}} #* no resources to allocate, hence allocation callback is not called, hence nothing is pushed into the {{allocations}} queue # {{slave1}} with {{cpus(* ):2; mem(* ):1024}} is added #* {{allocate()}} for {{slave1}} *only* is triggered #* total resources are {{cpus(* ):2; mem(* ):1024}} from {{slave1}} #* total resources are less or equal than unallocated part of quota #* no resources to allocate, hence allocation callback is not called, hence nothing is pushed into the {{allocations}} queue # {{slave2}} with {{cpus(* ):1; mem(* ):512}} is added #* {{allocate()}} for {{slave2}} *only* is triggered #* total resources are {{cpus(* ):1; mem(* ):512}} from {{slave2}} #* total resources are less or equal than unallocated part of quota #* no resources to allocate, hence allocation callback is not called, hence nothing is pushed into the {{allocations}} queue # {{AWAIT_READY(allocation);}} fails since not a single allocation happened in the test. > Quota doesn't allocate resources on slave joining > - > > Key: MESOS-4102 > URL: https://issues.apache.org/jira/browse/MESOS-4102 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Neil Conway > Labels: mesosphere, quota > Attachments: quota_absent_framework_test-1.patch > > > See attached patch. {{framework1}} is not allocated any resources, despite > the fact that the resources on {{agent2}} can safely be allocated to it > without risk of violating {{quota1}}. If I understand the intended quota > behavior correctly, this doesn't seem intended. > Note that if the framework is added _after_ the slaves are added, the > resources on {{agent2}} are allocated to {{framework1}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)