[jira] [Comment Edited] (MESOS-2980) Allow runtime configuration to be returned from provisioner
[ https://issues.apache.org/jira/browse/MESOS-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049109#comment-15049109 ] Gilbert Song edited comment on MESOS-2980 at 12/10/15 7:02 AM: --- Just finished this series: 1. https://reviews.apache.org/r/41192/ | add protobuf 2. https://reviews.apache.org/r/41011/ | provisioner/filesystem isolator 3. https://reviews.apache.org/r/41032/ | local/registry puller 4. https://reviews.apache.org/r/41194/ | simple cleanup JSON parse 5. https://reviews.apache.org/r/41195/ | metadata manager 6. https://reviews.apache.org/r/41125/ | docker/appc store was (Author: gilbert): Just finished this series: 1. https://reviews.apache.org/r/41122/ | add protobuf 2. https://reviews.apache.org/r/41011/ | provisioner/filesystem isolator 3. https://reviews.apache.org/r/41032/ | local/registry puller 4. https://reviews.apache.org/r/41123/ | simple cleanup JSON parse 5. https://reviews.apache.org/r/41124/ | metadata manager 6. https://reviews.apache.org/r/41125/ | docker/appc store > Allow runtime configuration to be returned from provisioner > --- > > Key: MESOS-2980 > URL: https://issues.apache.org/jira/browse/MESOS-2980 > Project: Mesos > Issue Type: Improvement >Reporter: Timothy Chen >Assignee: Gilbert Song > Labels: mesosphere > > Image specs also include execution configuration (e.g. env, user, ports, > etc.). > We should support passing this information from the image provisioner back > to the containerizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3841) Master HTTP API support to get the leader
[ https://issues.apache.org/jira/browse/MESOS-3841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15050187#comment-15050187 ] Jian Qiu commented on MESOS-3841: - How about an endpoint {code} http://master:port/leader {code} with a return of {code} {"leader": {"hostname":"xxx","ip":"x.x.x.x","port":5050}} {code} > Master HTTP API support to get the leader > - > > Key: MESOS-3841 > URL: https://issues.apache.org/jira/browse/MESOS-3841 > Project: Mesos > Issue Type: Improvement > Components: HTTP API >Reporter: Cosmin Lehene >Assignee: Jian Qiu > > There's currently no good way to query the current master ensemble leader. > Current workarounds are to get the leader (and parse it from leader@ip) from > {{/state.json}} or to grep it from {{master/redirect}}. > The scheduler API does an HTTP redirect, but that requires an HTTP POST > coming from a framework as well: > {{POST /api/v1/scheduler HTTP/1.1}} > There should be a lightweight API call to get the current master. > This could be part of a more granular representation (REST) of the current > state.json. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-4097) Change /roles endpoint to include quotas, weights, reserved resources?
[ https://issues.apache.org/jira/browse/MESOS-4097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049956#comment-15049956 ] Yong Qiao Wang edited comment on MESOS-4097 at 12/10/15 4:27 AM: - OK, per my understanding of role management, I think Mesos prefers to set role-related configuration with separate endpoints, such as using /quota to set quota, /reserve to dynamically set reservations, and maybe /weights to set weights, etc., but to use the unified endpoint /roles to show all role-related information. In that case, do we need to defer the ability to query quota via /quota once /roles shows the quota information of a role? [~neilc] and [~alexr], do you think so? was (Author: jamesyongqiaowang): OK, per my understanding of role management, I think Mesos prefers to set role-related configuration with separate endpoints, such as using /quota to set quota, /reserve to dynamically set reservations, and maybe /weights to set weights, etc., and to use the unified endpoint /roles to show all role-related information. In that case, do we need to defer the ability to query quota via /quota once /roles shows the quota information of a role? > Change /roles endpoint to include quotas, weights, reserved resources? > -- > > Key: MESOS-4097 > URL: https://issues.apache.org/jira/browse/MESOS-4097 > Project: Mesos > Issue Type: Improvement >Reporter: Neil Conway > Labels: mesosphere, quota, reservations, roles > > MESOS-4085 changes the behavior of the {{/roles}} endpoint: rather than > listing all the explicitly defined roles, we will now only list those roles > that have one or more registered frameworks. > As suggested by [~alexr] in code review, this could be improved -- an > operator might reasonably expect to see all the roles that have > * non-default weight > * non-default quota > * non-default ACLs? > * any static or dynamically reserved resources -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3843) Audit `src/CMakelists.txt` to make sure we're compiling everything we need to build the agent binary.
[ https://issues.apache.org/jira/browse/MESOS-3843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049980#comment-15049980 ] Joris Van Remoortere commented on MESOS-3843: - {code} commit 07725b5f0cf46439607919bc6f3d51437dbe2088 Author: Diana Arroyo Date: Wed Dec 9 19:26:09 2015 -0800 CMake: Added FindCurl.cmake script to locate cURL library. Review: https://reviews.apache.org/r/41090 {code} > Audit `src/CMakelists.txt` to make sure we're compiling everything we need to > build the agent binary. > - > > Key: MESOS-3843 > URL: https://issues.apache.org/jira/browse/MESOS-3843 > Project: Mesos > Issue Type: Task > Components: cmake >Reporter: Alex Clemmer >Assignee: Diana Arroyo > > `src/CMakeLists.txt` has fallen into some state of disrepair. There are some > source files that seem to be missing (e.g., the `src/launcher/` and > `src/linux`/ directories), so the first step is to audit the source file to > make sure everything we need is there. Likely this will mean looking at the > corresponding `src/Makefile.am` to see what's missing. > Once we understand the limitations of the current build, we can fan out more > tickets or proceed to generating the agent binary, as well as the master. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4097) Change /roles endpoint to include quotas, weights, reserved resources?
[ https://issues.apache.org/jira/browse/MESOS-4097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049956#comment-15049956 ] Yong Qiao Wang commented on MESOS-4097: --- OK, per my understanding of role management, I think Mesos prefers to set role-related configuration with separate endpoints, such as using /quota to set quota, /reserve to dynamically set reservations, and maybe /weights to set weights, etc., and to use the unified endpoint /roles to show all role-related information. In that case, do we need to defer the ability to query quota via /quota once /roles shows the quota information of a role? > Change /roles endpoint to include quotas, weights, reserved resources? > -- > > Key: MESOS-4097 > URL: https://issues.apache.org/jira/browse/MESOS-4097 > Project: Mesos > Issue Type: Improvement >Reporter: Neil Conway > Labels: mesosphere, quota, reservations, roles > > MESOS-4085 changes the behavior of the {{/roles}} endpoint: rather than > listing all the explicitly defined roles, we will now only list those roles > that have one or more registered frameworks. > As suggested by [~alexr] in code review, this could be improved -- an > operator might reasonably expect to see all the roles that have > * non-default weight > * non-default quota > * non-default ACLs? > * any static or dynamically reserved resources -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1718) Command executor can overcommit the slave.
[ https://issues.apache.org/jira/browse/MESOS-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049905#comment-15049905 ] Klaus Ma commented on MESOS-1718: - re reusing {{TaskInfo::CommandInfo}}: I'm thinking of only using {{TaskInfo::ExecutorInfo}} to launch tasks in the future; considering backward compatibility, we could add {{ExecutorInfo::task_command}} without touching {{TaskInfo::CommandInfo}}. For this JIRA, I'm OK with reusing {{TaskInfo::CommandInfo}} and relaxing the constraint on {{TaskInfo::CommandInfo}} and {{TaskInfo::ExecutorInfo}}. re filling {{ExecutorInfo::CommandInfo}} in the slave: good point :). I used to avoid persisting that info (e.g. launch_dir); but after thinking more about it, maybe we do not need to persist it; the slave will report it when it re-registers. Regarding backwards compatibility, for now, most cases pass with the draft RR without interface changes. The major issue is how we define the command-line executor's resources, since the framework did not assign resources to it. Currently, I carve the executor's resources out of the task's (e.g. for 1 CPU, the command-line executor will use 0.9 CPU for the task and 0.1 for the executor); there may then be two offers returning that 1 CPU instead of one offer as before; and task resources in metrics also change, for example, Marathon's UI will show 0.9 CPU when a service is launched with 1 CPU. > Command executor can overcommit the slave. > -- > > Key: MESOS-1718 > URL: https://issues.apache.org/jira/browse/MESOS-1718 > Project: Mesos > Issue Type: Bug > Components: slave >Reporter: Benjamin Mahler >Assignee: Ian Downes > > Currently we give a small amount of resources to the command executor, in > addition to resources used by the command task: > https://github.com/apache/mesos/blob/0.20.0-rc1/src/slave/slave.cpp#L2448 > {code: title=} > ExecutorInfo Slave::getExecutorInfo( > const FrameworkID& frameworkId, > const TaskInfo& task) > { > ... > // Add an allowance for the command executor. This does lead to a > // small overcommit of resources. > executor.mutable_resources()->MergeFrom( > Resources::parse( > "cpus:" + stringify(DEFAULT_EXECUTOR_CPUS) + ";" + > "mem:" + stringify(DEFAULT_EXECUTOR_MEM.megabytes())).get()); > ... > } > {code} > This leads to an overcommit of the slave. Ideally, for command tasks we can > "transfer" all of the task resources to the executor at the slave / isolation > level. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
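To make the resource carving described in the comment above concrete, here is a small self-contained sketch (not Mesos code; the {{carve}} helper, the {{Allocation}} struct, and the 0.1-CPU executor allowance are illustrative assumptions). The point is that the executor's share is taken out of the task's request rather than added on top of it, which is what avoids overcommitting the agent:
{code}
// Illustrative only: split a command task's CPU request between the task
// itself and the command executor, instead of adding an extra allowance on
// top of it (which is what overcommits the agent today).
#include <iostream>

struct Allocation
{
  double taskCpus;
  double executorCpus;
};

// Hypothetical helper: carve the executor's share out of the task's cpus.
Allocation carve(double requestedCpus, double executorAllowance = 0.1)
{
  Allocation allocation;
  allocation.executorCpus = executorAllowance;
  allocation.taskCpus = requestedCpus - executorAllowance;
  return allocation;
}

int main()
{
  // A framework asks for 1 CPU for a command task.
  Allocation allocation = carve(1.0);

  // Prints "task: 0.9 cpus, executor: 0.1 cpus", matching the Marathon
  // example in the comment above.
  std::cout << "task: " << allocation.taskCpus
            << " cpus, executor: " << allocation.executorCpus
            << " cpus" << std::endl;

  return 0;
}
{code}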
[jira] [Created] (MESOS-4111) Provide a means for libprocess users to exit while ensuring messages are flushed.
Benjamin Mahler created MESOS-4111: -- Summary: Provide a means for libprocess users to exit while ensuring messages are flushed. Key: MESOS-4111 URL: https://issues.apache.org/jira/browse/MESOS-4111 Project: Mesos Issue Type: Bug Components: libprocess Reporter: Benjamin Mahler Priority: Minor Currently after a {{send}} there is no way to ensure that the message is flushed on the socket before terminating. We work around this by inserting {{os::sleep}} calls (see MESOS-243, MESOS-4106). There are a number of approaches to this: (1) Return a Future from send that notifies when the message is flushed from the system. (2) Call process::finalize before exiting. This would require that process::finalize flushes all of the outstanding data on any active sockets, which may block. Regardless of the approach, there needs to be a timer if we want to guarantee termination. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
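As a rough illustration of approach (1), the toy sketch below shows a {{send}} whose returned future is only satisfied after the message has actually been written out, so the caller can do a bounded wait before exiting. The use of {{std::future}} and a background thread is purely illustrative; this is not the libprocess implementation:
{code}
// Toy illustration of approach (1): a send() that returns a future which is
// only satisfied after the message has actually been written out, so a
// short-lived process can wait (with a timeout) before exiting. Names are
// illustrative; this is not the libprocess implementation.
#include <chrono>
#include <future>
#include <iostream>
#include <memory>
#include <string>
#include <thread>

std::future<void> send(const std::string& message)
{
  auto flushed = std::make_shared<std::promise<void>>();
  std::future<void> future = flushed->get_future();

  // Stand-in for the socket manager writing the message asynchronously.
  std::thread([message, flushed]() {
    std::cout << message << std::endl;  // "Write" the message.
    flushed->set_value();               // Signal that it was flushed.
  }).detach();

  return future;
}

int main()
{
  std::future<void> flushed = send("TaskHealthStatus");

  // Bounded wait so that termination is still guaranteed even if the flush
  // never completes (the timer mentioned in the description above).
  flushed.wait_for(std::chrono::seconds(5));

  return 0;
}
{code}
The bounded {{wait_for}} is what preserves the termination guarantee mentioned above even when the flush never happens.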
[jira] [Commented] (MESOS-4097) Change /roles endpoint to include quotas, weights, reserved resources?
[ https://issues.apache.org/jira/browse/MESOS-4097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049900#comment-15049900 ] Neil Conway commented on MESOS-4097: Yeah, we probably want {{/roles}} to return information for all "visible" (active and/or explicitly configured roles), per discussion in https://reviews.apache.org/r/41075/ -- implementing that at the moment. > Change /roles endpoint to include quotas, weights, reserved resources? > -- > > Key: MESOS-4097 > URL: https://issues.apache.org/jira/browse/MESOS-4097 > Project: Mesos > Issue Type: Improvement >Reporter: Neil Conway > Labels: mesosphere, quota, reservations, roles > > MESOS-4085 changes the behavior of the {{/roles}} endpoint: rather than > listing all the explicitly defined roles, we will now only list those roles > that have one or more registered frameworks. > As suggested by [~alexr] in code review, this could be improved -- an > operator might reasonably expect to see all the roles that have > * non-default weight > * non-default quota > * non-default ACLs? > * any static or dynamically reserved resources -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4106) The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.
[ https://issues.apache.org/jira/browse/MESOS-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049899#comment-15049899 ] Benjamin Mahler commented on MESOS-4106: Yeah, I'll reference MESOS-4111 now that we have it; I'll also reference it in the existing command executor sleep. > The health checker may fail to inform the executor to kill an unhealthy task > after max_consecutive_failures. > > > Key: MESOS-4106 > URL: https://issues.apache.org/jira/browse/MESOS-4106 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.20.0, 0.20.1, 0.21.1, 0.21.2, 0.22.1, 0.22.2, 0.23.0, > 0.23.1, 0.24.0, 0.24.1, 0.25.0 >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Blocker > > This was reported by [~tan] experimenting with health checks. Many tasks were > launched with the following health check, taken from the container > stdout/stderr: > {code} > Launching health check process: /usr/local/libexec/mesos/mesos-health-check > --executor=(1)@127.0.0.1:39629 > --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0} > --task_id=sleepy-2 > {code} > This should have led to all tasks getting killed due to > {{\-\-consecutive_failures}} being set; however, only some tasks get killed, > while others remain running. > It turns out that the health check binary does a {{send}} and promptly exits. > Unfortunately, this may lead to a message drop since libprocess may not have > sent this message over the socket by the time the process exits. > We work around this in the command executor with a manual sleep, which has > been around since the svn days. See > [here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4097) Change /roles endpoint to include quotas, weights, reserved resources?
[ https://issues.apache.org/jira/browse/MESOS-4097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049894#comment-15049894 ] Yong Qiao Wang commented on MESOS-4097: --- Maybe we should bring back RoleInfo and improve /roles for the above requirements. MESOS-3791 does similar things. > Change /roles endpoint to include quotas, weights, reserved resources? > -- > > Key: MESOS-4097 > URL: https://issues.apache.org/jira/browse/MESOS-4097 > Project: Mesos > Issue Type: Improvement >Reporter: Neil Conway > Labels: mesosphere, quota, reservations, roles > > MESOS-4085 changes the behavior of the {{/roles}} endpoint: rather than > listing all the explicitly defined roles, we will now only list those roles > that have one or more registered frameworks. > As suggested by [~alexr] in code review, this could be improved -- an > operator might reasonably expect to see all the roles that have > * non-default weight > * non-default quota > * non-default ACLs? > * any static or dynamically reserved resources -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4106) The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.
[ https://issues.apache.org/jira/browse/MESOS-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049893#comment-15049893 ] Neil Conway commented on MESOS-4106: Sounds good -- maybe add a {{TODO}} to the fix for this bug to put in a more robust fix later? > The health checker may fail to inform the executor to kill an unhealthy task > after max_consecutive_failures. > > > Key: MESOS-4106 > URL: https://issues.apache.org/jira/browse/MESOS-4106 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.20.0, 0.20.1, 0.21.1, 0.21.2, 0.22.1, 0.22.2, 0.23.0, > 0.23.1, 0.24.0, 0.24.1, 0.25.0 >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Blocker > > This was reported by [~tan] experimenting with health checks. Many tasks were > launched with the following health check, taken from the container > stdout/stderr: > {code} > Launching health check process: /usr/local/libexec/mesos/mesos-health-check > --executor=(1)@127.0.0.1:39629 > --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0} > --task_id=sleepy-2 > {code} > This should have led to all tasks getting killed due to > {{\-\-consecutive_failures}} being set; however, only some tasks get killed, > while others remain running. > It turns out that the health check binary does a {{send}} and promptly exits. > Unfortunately, this may lead to a message drop since libprocess may not have > sent this message over the socket by the time the process exits. > We work around this in the command executor with a manual sleep, which has > been around since the svn days. See > [here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4106) The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.
[ https://issues.apache.org/jira/browse/MESOS-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049889#comment-15049889 ] Benjamin Mahler commented on MESOS-4106: Yeah, I had the same thought when I was looking at MESOS-243, but now we also have process::finalize, which could be the mechanism for cleanly shutting down before {{exit}} calls. I'll file a ticket to express this issue more generally (MESOS-243 was the original but is specific to the executor driver). > The health checker may fail to inform the executor to kill an unhealthy task > after max_consecutive_failures. > > > Key: MESOS-4106 > URL: https://issues.apache.org/jira/browse/MESOS-4106 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.20.0, 0.20.1, 0.21.1, 0.21.2, 0.22.1, 0.22.2, 0.23.0, > 0.23.1, 0.24.0, 0.24.1, 0.25.0 >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Blocker > > This was reported by [~tan] experimenting with health checks. Many tasks were > launched with the following health check, taken from the container > stdout/stderr: > {code} > Launching health check process: /usr/local/libexec/mesos/mesos-health-check > --executor=(1)@127.0.0.1:39629 > --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0} > --task_id=sleepy-2 > {code} > This should have led to all tasks getting killed due to > {{\-\-consecutive_failures}} being set; however, only some tasks get killed, > while others remain running. > It turns out that the health check binary does a {{send}} and promptly exits. > Unfortunately, this may lead to a message drop since libprocess may not have > sent this message over the socket by the time the process exits. > We work around this in the command executor with a manual sleep, which has > been around since the svn days. See > [here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4097) Change /roles endpoint to include quotas, weights, reserved resources?
[ https://issues.apache.org/jira/browse/MESOS-4097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049884#comment-15049884 ] Yong Qiao Wang commented on MESOS-4097: --- Some concerns: 1. We removed RoleInfo in MESOS-4085, but now we are going to show RoleInfo (all role-related configuration) via /roles? 2. The current design may also hurt the operator experience; for example, after an operator configures the quota for a role with the /quota endpoint, they have to check that configuration with another endpoint (/roles), which returns information for all active roles and does not allow querying a specific one. > Change /roles endpoint to include quotas, weights, reserved resources? > -- > > Key: MESOS-4097 > URL: https://issues.apache.org/jira/browse/MESOS-4097 > Project: Mesos > Issue Type: Improvement >Reporter: Neil Conway > Labels: mesosphere, quota, reservations, roles > > MESOS-4085 changes the behavior of the {{/roles}} endpoint: rather than > listing all the explicitly defined roles, we will now only list those roles > that have one or more registered frameworks. > As suggested by [~alexr] in code review, this could be improved -- an > operator might reasonably expect to see all the roles that have > * non-default weight > * non-default quota > * non-default ACLs? > * any static or dynamically reserved resources -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4110) Implement `WindowsError` to correspond with `ErrnoError`.
Alex Clemmer created MESOS-4110: --- Summary: Implement `WindowsError` to correspond with `ErrnoError`. Key: MESOS-4110 URL: https://issues.apache.org/jira/browse/MESOS-4110 Project: Mesos Issue Type: Bug Components: stout Reporter: Alex Clemmer Assignee: Alex Clemmer In the C standard library, `errno` records the last error on a thread. You can pretty-print it with `strerror`. In Stout, we report these errors with `ErrnoError`. The Windows API has something similar, called `GetLastError()`. The way to pretty-print this is hilariously unintuitive and terrible, so in this case it is actually very beneficial to wrap it with something similar to `ErrnoError`, maybe called `WindowsError`. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
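For reference, a minimal sketch of the kind of wrapper being proposed. The shape mirrors `ErrnoError`, but the class below is an illustrative assumption rather than the eventual stout implementation:
{code}
// Windows-only sketch: wrap GetLastError() and pretty-print it with
// FormatMessageA, analogous to how ErrnoError wraps errno + strerror.
#include <windows.h>

#include <string>

struct WindowsError
{
  WindowsError() : code(::GetLastError()), message(format(code)) {}

  const DWORD code;
  const std::string message;

private:
  static std::string format(DWORD code)
  {
    char buffer[512];
    DWORD size = ::FormatMessageA(
        FORMAT_MESSAGE_FROM_SYSTEM | FORMAT_MESSAGE_IGNORE_INSERTS,
        nullptr,         // Message source: the system message table.
        code,            // The error code to format.
        0,               // Default language.
        buffer,
        sizeof(buffer),
        nullptr);

    return size == 0 ? std::string("Unknown error")
                     : std::string(buffer, size);
  }
};
{code}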
[jira] [Commented] (MESOS-4106) The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.
[ https://issues.apache.org/jira/browse/MESOS-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049866#comment-15049866 ] Neil Conway commented on MESOS-4106: To fix the problem properly (without a {{sleep}} hack), it seems we need something akin to a {{flush}} primitive in libprocess. For example, we could provide a variant of {{send}} that returns a future, where the future is only satisfied once the associated message has been delivered to the kernel. > The health checker may fail to inform the executor to kill an unhealthy task > after max_consecutive_failures. > > > Key: MESOS-4106 > URL: https://issues.apache.org/jira/browse/MESOS-4106 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.20.0, 0.20.1, 0.21.1, 0.21.2, 0.22.1, 0.22.2, 0.23.0, > 0.23.1, 0.24.0, 0.24.1, 0.25.0 >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Blocker > > This was reported by [~tan] experimenting with health checks. Many tasks were > launched with the following health check, taken from the container > stdout/stderr: > {code} > Launching health check process: /usr/local/libexec/mesos/mesos-health-check > --executor=(1)@127.0.0.1:39629 > --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0} > --task_id=sleepy-2 > {code} > This should have led to all tasks getting killed due to > {{\-\-consecutive_failures}} being set; however, only some tasks get killed, > while others remain running. > It turns out that the health check binary does a {{send}} and promptly exits. > Unfortunately, this may lead to a message drop since libprocess may not have > sent this message over the socket by the time the process exits. > We work around this in the command executor with a manual sleep, which has > been around since the svn days. See > [here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-4025) SlaveRecoveryTest/0.GCExecutor is flaky.
[ https://issues.apache.org/jira/browse/MESOS-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049741#comment-15049741 ] Greg Mann edited comment on MESOS-4025 at 12/10/15 2:09 AM: I was just observing this error on Ubuntu 14.04 when running the tests as root. It does indeed seem to be due to artifacts left behind by some tests. After running the entire test suite the first time I saw that several of the {{SlaveRecoveryTest}} tests had failed: {code} [==] 988 tests from 144 test cases ran. (590989 ms total) [ PASSED ] 980 tests. [ FAILED ] 8 tests, listed below: [ FAILED ] SlaveRecoveryTest/0.GCExecutor, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.ShutdownSlave, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.ShutdownSlaveSIGUSR1, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveTest.PingTimeoutNoPings [ FAILED ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample [ FAILED ] CgroupsAnyHierarchyWithPerfEventTest.ROOT_CGROUPS_Perf [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery {code} and the {{SlaveRecoveryTest}} errors all looked the same: {code} [ RUN ] SlaveRecoveryTest/0.ShutdownSlave ../../src/tests/mesos.cpp:906: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave': Device or resource busy --- We're very sorry but we can't seem to destroy existing cgroups that we likely created as part of an earlier invocation of the tests. Please manually destroy the cgroup at '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave' by first manually killing all the processes found in the file at '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave/tasks' --- ../../src/tests/mesos.cpp:940: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave': Device or resource busy [ FAILED ] SlaveRecoveryTest/0.ShutdownSlave, where TypeParam = mesos::internal::slave::MesosContainerizer (31 ms) {code} Next, re-running just these tests with {{sudo GTEST_FILTER="SlaveRecoverTest*" bin/mesos-tests.sh}}, they _all_ failed, with a slightly different error. At this point, these failures are reliably produced 100% of the time: {code} [ RUN ] SlaveRecoveryTest/0.MultipleSlaves ../../src/tests/mesos.cpp:906: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/perf_event/mesos_test': Device or resource busy --- We're very sorry but we can't seem to destroy existing cgroups that we likely created as part of an earlier invocation of the tests. 
Please manually destroy the cgroup at '/sys/fs/cgroup/perf_event/mesos_test' by first manually killing all the processes found in the file at '/sys/fs/cgroup/perf_event/mesos_test/tasks' --- ../../src/tests/mesos.cpp:940: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/perf_event/mesos_test': Device or resource busy [ FAILED ] SlaveRecoveryTest/0.MultipleSlaves, where TypeParam = mesos::internal::slave::MesosContainerizer (14 ms) {code} Finally, I recompiled from scratch and ran _only_ the {{SlaveRecoveryTest}} tests, doing: {code} GTEST_FILTER="" make check sudo GTEST_FILTER="SlaveRecoveryTest*" bin/mesos-tests.sh {code} and they all passed. Regarding [~nfnt]'s comment above on the {{HealthCheckTest}} tests, after all this I was able to run {{sudo GTEST_FILTER="HealthCheckTest*" bin/mesos-tests.sh}}, followed by {{sudo GTEST_FILTER="SlaveRecoveryTest*" bin/mesos-tests.sh}}, and all of the {{SlaveRecoveryTest}} tests passed, so perhaps it isn't an artifact of the {{HealthCheckTest}} tests that is causing this problem? However, I only did that a couple times, and didn't try the exact command that Jan did: {code} sudo ./bin/mesos-tests.sh --gtest_repeat=1 --gtest_break_on_failure --gtest_filter="*ROOT_DOCKER_DockerHealthStatusChange:SlaveRecoveryTest*GCExecutor" {code} so if it's a flaky thing I may not have caught it. was (Author: greggomann): I was just observing this error on Ubuntu 14.04 when running the tests as root. It does indeed seem to be due to artifacts left behind by some tests. After running the entire test suite the first time I saw that several of the {{SlaveRecoveryTest}} tests had failed: {code} [==] 988 tests from 144 test cases ran. (5909
[jira] [Commented] (MESOS-1613) HealthCheckTest.ConsecutiveFailures is flaky
[ https://issues.apache.org/jira/browse/MESOS-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049850#comment-15049850 ] Benjamin Mahler commented on MESOS-1613: For posterity, I also wasn't able to reproduce this by just running in repetition. However, when I ran one {{openssl speed}} for each core on my laptop in order to induce load, I could reproduce easily. We probably want to direct folks to try this when they are having trouble reproducing something flaky from CI. I will post a fix through MESOS-4106. > HealthCheckTest.ConsecutiveFailures is flaky > > > Key: MESOS-1613 > URL: https://issues.apache.org/jira/browse/MESOS-1613 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 0.20.0 > Environment: Ubuntu 10.04 GCC >Reporter: Vinod Kone >Assignee: Timothy Chen > Labels: flaky, mesosphere > > {code} > [ RUN ] HealthCheckTest.ConsecutiveFailures > Using temporary directory '/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV' > I0717 04:39:59.288471 5009 leveldb.cpp:176] Opened db in 21.575631ms > I0717 04:39:59.295274 5009 leveldb.cpp:183] Compacted db in 6.471982ms > I0717 04:39:59.295552 5009 leveldb.cpp:198] Created db iterator in 16783ns > I0717 04:39:59.296026 5009 leveldb.cpp:204] Seeked to beginning of db in > 2125ns > I0717 04:39:59.296257 5009 leveldb.cpp:273] Iterated through 0 keys in the > db in 10747ns > I0717 04:39:59.296584 5009 replica.cpp:741] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0717 04:39:59.297322 5033 recover.cpp:425] Starting replica recovery > I0717 04:39:59.297413 5033 recover.cpp:451] Replica is in EMPTY status > I0717 04:39:59.297824 5033 replica.cpp:638] Replica in EMPTY status received > a broadcasted recover request > I0717 04:39:59.297899 5033 recover.cpp:188] Received a recover response from > a replica in EMPTY status > I0717 04:39:59.297997 5033 recover.cpp:542] Updating replica status to > STARTING > I0717 04:39:59.301985 5031 master.cpp:288] Master > 20140717-043959-16842879-40280-5009 (lucid) started on 127.0.1.1:40280 > I0717 04:39:59.302026 5031 master.cpp:325] Master only allowing > authenticated frameworks to register > I0717 04:39:59.302032 5031 master.cpp:330] Master only allowing > authenticated slaves to register > I0717 04:39:59.302039 5031 credentials.hpp:36] Loading credentials for > authentication from > '/tmp/HealthCheckTest_ConsecutiveFailures_AzK0OV/credentials' > I0717 04:39:59.302283 5031 master.cpp:359] Authorization enabled > I0717 04:39:59.302971 5031 hierarchical_allocator_process.hpp:301] > Initializing hierarchical allocator process with master : > master@127.0.1.1:40280 > I0717 04:39:59.303022 5031 master.cpp:122] No whitelist given. Advertising > offers for all slaves > I0717 04:39:59.303390 5033 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 5.325097ms > I0717 04:39:59.303419 5033 replica.cpp:320] Persisted replica status to > STARTING > I0717 04:39:59.304076 5030 master.cpp:1128] The newly elected leader is > master@127.0.1.1:40280 with id 20140717-043959-16842879-40280-5009 > I0717 04:39:59.304095 5030 master.cpp:1141] Elected as the leading master! 
> I0717 04:39:59.304102 5030 master.cpp:959] Recovering from registrar > I0717 04:39:59.304182 5030 registrar.cpp:313] Recovering registrar > I0717 04:39:59.304635 5033 recover.cpp:451] Replica is in STARTING status > I0717 04:39:59.304962 5033 replica.cpp:638] Replica in STARTING status > received a broadcasted recover request > I0717 04:39:59.305026 5033 recover.cpp:188] Received a recover response from > a replica in STARTING status > I0717 04:39:59.305130 5033 recover.cpp:542] Updating replica status to VOTING > I0717 04:39:59.310416 5033 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 5.204157ms > I0717 04:39:59.310459 5033 replica.cpp:320] Persisted replica status to > VOTING > I0717 04:39:59.310534 5033 recover.cpp:556] Successfully joined the Paxos > group > I0717 04:39:59.310607 5033 recover.cpp:440] Recover process terminated > I0717 04:39:59.310773 5033 log.cpp:656] Attempting to start the writer > I0717 04:39:59.311157 5033 replica.cpp:474] Replica received implicit > promise request with proposal 1 > I0717 04:39:59.313451 5033 leveldb.cpp:306] Persisting metadata (8 bytes) to > leveldb took 2.271822ms > I0717 04:39:59.313627 5033 replica.cpp:342] Persisted promised to 1 > I0717 04:39:59.318038 5031 coordinator.cpp:230] Coordinator attemping to > fill missing position > I0717 04:39:59.318430 5031 replica.cpp:375] Replica received explicit > promise request for position 0 with proposal 2 > I0717 04:39:59.323459 5031 leveldb.cpp:343] Persisting action (8 bytes) to > leve
[jira] [Created] (MESOS-4109) HTTPConnectionTest.ClosingResponse is flaky
Joseph Wu created MESOS-4109: Summary: HTTPConnectionTest.ClosingResponse is flaky Key: MESOS-4109 URL: https://issues.apache.org/jira/browse/MESOS-4109 Project: Mesos Issue Type: Bug Components: libprocess, test Affects Versions: 0.26.0 Environment: ASF Ubuntu 14 {{--enable-ssl --enable-libevent}} Reporter: Joseph Wu Priority: Minor Output of the test: {code} [ RUN ] HTTPConnectionTest.ClosingResponse I1210 01:20:27.048532 26671 process.cpp:3077] Handling HTTP event for process '(22)' with path: '/(22)/get' ../../../3rdparty/libprocess/src/tests/http_tests.cpp:919: Failure Actual function call count doesn't match EXPECT_CALL(*http.process, get(_))... Expected: to be called twice Actual: called once - unsatisfied and active [ FAILED ] HTTPConnectionTest.ClosingResponse (43 ms) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-4106) The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.
[ https://issues.apache.org/jira/browse/MESOS-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-4106: -- Assignee: Benjamin Mahler [~haosd...@gmail.com]: From my testing so far, yes. I will send a fix and re-enable the test from MESOS-1613. > The health checker may fail to inform the executor to kill an unhealthy task > after max_consecutive_failures. > > > Key: MESOS-4106 > URL: https://issues.apache.org/jira/browse/MESOS-4106 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.20.0, 0.20.1, 0.21.1, 0.21.2, 0.22.1, 0.22.2, 0.23.0, > 0.23.1, 0.24.0, 0.24.1, 0.25.0 >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Blocker > > This was reported by [~tan] experimenting with health checks. Many tasks were > launched with the following health check, taken from the container > stdout/stderr: > {code} > Launching health check process: /usr/local/libexec/mesos/mesos-health-check > --executor=(1)@127.0.0.1:39629 > --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0} > --task_id=sleepy-2 > {code} > This should have led to all tasks getting killed due to > {{\-\-consecutive_failures}} being set; however, only some tasks get killed, > while others remain running. > It turns out that the health check binary does a {{send}} and promptly exits. > Unfortunately, this may lead to a message drop since libprocess may not have > sent this message over the socket by the time the process exits. > We work around this in the command executor with a manual sleep, which has > been around since the svn days. See > [here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3615) Port slave/state.cpp
[ https://issues.apache.org/jira/browse/MESOS-3615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-3615: Sprint: Mesosphere Sprint 24 Story Points: 3 > Port slave/state.cpp > > > Key: MESOS-3615 > URL: https://issues.apache.org/jira/browse/MESOS-3615 > Project: Mesos > Issue Type: Task > Components: slave >Reporter: Alex Clemmer >Assignee: Alex Clemmer > Labels: mesosphere, windows > Fix For: 0.27.0 > > > Important subset of changes this depends on: > slave/state.cpp: pid, os, path, protobuf, paths, state > pid.hpp: address.hpp, ip.hpp > address.hpp: ip.hpp, net.hpp > net.hpp: ip, networking stuff > state: type_utils, pid, os, path, protobuf, uuid > type_utils.hpp: uuid.hpp -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3615) Port slave/state.cpp
[ https://issues.apache.org/jira/browse/MESOS-3615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049841#comment-15049841 ] Joris Van Remoortere commented on MESOS-3615: - https://reviews.apache.org/r/39219 > Port slave/state.cpp > > > Key: MESOS-3615 > URL: https://issues.apache.org/jira/browse/MESOS-3615 > Project: Mesos > Issue Type: Task > Components: slave >Reporter: Alex Clemmer >Assignee: Alex Clemmer > Labels: mesosphere, windows > > Important subset of changes this depends on: > slave/state.cpp: pid, os, path, protobuf, paths, state > pid.hpp: address.hpp, ip.hpp > address.hpp: ip.hpp, net.hpp > net.hpp: ip, networking stuff > state: type_utils, pid, os, path, protobuf, uuid > type_utils.hpp: uuid.hpp -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4106) The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.
[ https://issues.apache.org/jira/browse/MESOS-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049824#comment-15049824 ] haosdent commented on MESOS-4106: - Is adding 'os::sleep(Seconds(1));' enough? > The health checker may fail to inform the executor to kill an unhealthy task > after max_consecutive_failures. > > > Key: MESOS-4106 > URL: https://issues.apache.org/jira/browse/MESOS-4106 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.20.0, 0.20.1, 0.21.1, 0.21.2, 0.22.1, 0.22.2, 0.23.0, > 0.23.1, 0.24.0, 0.24.1, 0.25.0 >Reporter: Benjamin Mahler >Priority: Blocker > > This was reported by [~tan] experimenting with health checks. Many tasks were > launched with the following health check, taken from the container > stdout/stderr: > {code} > Launching health check process: /usr/local/libexec/mesos/mesos-health-check > --executor=(1)@127.0.0.1:39629 > --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0} > --task_id=sleepy-2 > {code} > This should have led to all tasks getting killed due to > {{\-\-consecutive_failures}} being set; however, only some tasks get killed, > while others remain running. > It turns out that the health check binary does a {{send}} and promptly exits. > Unfortunately, this may lead to a message drop since libprocess may not have > sent this message over the socket by the time the process exits. > We work around this in the command executor with a manual sleep, which has > been around since the svn days. See > [here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4108) Implement `os::mkdtemp` for Windows
[ https://issues.apache.org/jira/browse/MESOS-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049819#comment-15049819 ] Joris Van Remoortere commented on MESOS-4108: - https://reviews.apache.org/r/39559 > Implement `os::mkdtemp` for Windows > --- > > Key: MESOS-4108 > URL: https://issues.apache.org/jira/browse/MESOS-4108 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: Alex Clemmer >Assignee: Alex Clemmer > Labels: mesosphere, stout, windows > > Used basically exclusively for testing, this insecure and > otherwise-not-quite-suitable-for-prod function needs to work to run what will > eventually become the FS tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4108) Implement `os::mkdtemp` for Windows
[ https://issues.apache.org/jira/browse/MESOS-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-4108: Sprint: Mesosphere Sprint 24 Story Points: 5 Labels: mesosphere stout windows (was: stout windows) > Implement `os::mkdtemp` for Windows > --- > > Key: MESOS-4108 > URL: https://issues.apache.org/jira/browse/MESOS-4108 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: Alex Clemmer >Assignee: Alex Clemmer > Labels: mesosphere, stout, windows > > Used basically exclusively for testing, this insecure and > otherwise-not-quite-suitable-for-prod function needs to work to run what will > eventually become the FS tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4108) Implement `os::mkdtemp` for Windows
Alex Clemmer created MESOS-4108: --- Summary: Implement `os::mkdtemp` for Windows Key: MESOS-4108 URL: https://issues.apache.org/jira/browse/MESOS-4108 Project: Mesos Issue Type: Bug Components: stout Reporter: Alex Clemmer Assignee: Alex Clemmer Used basically exclusively for testing, this insecure and otherwise-not-quite-suitable-for-prod function needs to work to run what will eventually become the FS tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
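To pin down the requested semantics, here is a rough sketch of a {{mkdtemp}}-style helper for Windows: replace a trailing {{XXXXXX}} template with random characters and create the directory. The function name and the use of {{_mkdir}}/{{std::rand}} are illustrative assumptions, not the stout implementation, and as the description notes this is a test-only convenience rather than a secure primitive:
{code}
// Illustrative only: mimic mkdtemp() semantics on Windows by replacing a
// trailing "XXXXXX" template with pseudo-random characters and creating the
// directory. Test-only convenience, not a secure primitive.
#include <direct.h>   // _mkdir

#include <cstdlib>
#include <string>

inline bool make_temp_directory(std::string& path)
{
  static const std::string chars =
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

  const std::string suffix = "XXXXXX";
  if (path.size() < suffix.size() ||
      path.compare(path.size() - suffix.size(), suffix.size(), suffix) != 0) {
    return false;  // The template must end in "XXXXXX", like POSIX mkdtemp.
  }

  // Fill in the template with pseudo-random characters.
  for (std::string::size_type i = path.size() - suffix.size();
       i < path.size();
       i++) {
    path[i] = chars[std::rand() % chars.size()];
  }

  // Create the directory; returns false if creation fails.
  return ::_mkdir(path.c_str()) == 0;
}
{code}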
[jira] [Updated] (MESOS-4107) `os::strerror_r` breaks the Windows build
[ https://issues.apache.org/jira/browse/MESOS-4107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van Remoortere updated MESOS-4107: Sprint: Mesosphere Sprint 24 Story Points: 1 Labels: mesosphere stout (was: stout) https://reviews.apache.org/r/40382/ > `os::strerror_r` breaks the Windows build > - > > Key: MESOS-4107 > URL: https://issues.apache.org/jira/browse/MESOS-4107 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: Alex Clemmer >Assignee: Alex Clemmer > Labels: mesosphere, stout > > `os::strerror_r` does not exist on Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4107) `os::strerror_r` breaks the Windows build
Alex Clemmer created MESOS-4107: --- Summary: `os::strerror_r` breaks the Windows build Key: MESOS-4107 URL: https://issues.apache.org/jira/browse/MESOS-4107 Project: Mesos Issue Type: Bug Components: stout Reporter: Alex Clemmer Assignee: Alex Clemmer `os::strerror_r` does not exist on Windows. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4088) Modularize existing plain-file logging for executor/task logs
[ https://issues.apache.org/jira/browse/MESOS-4088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049749#comment-15049749 ] Joseph Wu commented on MESOS-4088: -- || Reviews || Summary || | https://reviews.apache.org/r/41166/ | Add {{ExecutorLogger}} to {{Containerizer::Create}} | | https://reviews.apache.org/r/41167/ | Initialize and call the {{ExecutorLogger}} in {{MesosContainerizer::_launch}} | | https://reviews.apache.org/r/41168/ | Update {{MesosTest}} | | https://reviews.apache.org/r/41169/ | Update {{MesosContainerizer}} tests | > Modularize existing plain-file logging for executor/task logs > - > > Key: MESOS-4088 > URL: https://issues.apache.org/jira/browse/MESOS-4088 > Project: Mesos > Issue Type: Task > Components: modules >Reporter: Joseph Wu >Assignee: Joseph Wu > Labels: logging, mesosphere > > Once a module for executor/task output logging has been introduced, the > default module will mirror the existing behavior. Executor/task > stdout/stderr is piped into files within the executor's sandbox directory. > The files are exposed in the web UI, via the {{/files}} endpoint. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
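For readers following the review chain above, a rough guess at the shape such a module interface could take; the class and method names below are assumptions based on the review summaries, not the committed Mesos API:
{code}
// Hypothetical shape of an executor/task log module: the containerizer asks
// the logger where an executor's stdout/stderr should go when it launches a
// container. The default implementation keeps today's plain-file behavior by
// pointing both streams at files in the sandbox, which the /files endpoint
// then exposes in the web UI.
#include <string>

struct SubprocessIO
{
  std::string stdoutPath;
  std::string stderrPath;
};

class ExecutorLogger
{
public:
  virtual ~ExecutorLogger() {}

  // Called from the containerizer's launch path with the sandbox directory;
  // returns where the executor's output streams should be redirected.
  // (executorId is unused by the default implementation.)
  virtual SubprocessIO prepare(
      const std::string& executorId,
      const std::string& sandboxDirectory)
  {
    return SubprocessIO{
        sandboxDirectory + "/stdout",
        sandboxDirectory + "/stderr"};
  }
};
{code}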
[jira] [Comment Edited] (MESOS-4025) SlaveRecoveryTest/0.GCExecutor is flaky.
[ https://issues.apache.org/jira/browse/MESOS-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049741#comment-15049741 ] Greg Mann edited comment on MESOS-4025 at 12/10/15 12:45 AM: - I was just observing this error on Ubuntu 14.04 when running the tests as root. It does indeed seem to be due to artifacts left behind by some tests. After running the entire test suite the first time I saw that several of the {{SlaveRecoveryTest}} tests had failed: {code} [==] 988 tests from 144 test cases ran. (590989 ms total) [ PASSED ] 980 tests. [ FAILED ] 8 tests, listed below: [ FAILED ] SlaveRecoveryTest/0.GCExecutor, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.ShutdownSlave, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.ShutdownSlaveSIGUSR1, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveTest.PingTimeoutNoPings [ FAILED ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample [ FAILED ] CgroupsAnyHierarchyWithPerfEventTest.ROOT_CGROUPS_Perf [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery {code} and the {{SlaveRecoveryTest}} errors all looked the same: {code} [ RUN ] SlaveRecoveryTest/0.ShutdownSlave ../../src/tests/mesos.cpp:906: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave': Device or resource busy --- We're very sorry but we can't seem to destroy existing cgroups that we likely created as part of an earlier invocation of the tests. Please manually destroy the cgroup at '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave' by first manually killing all the processes found in the file at '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave/tasks' --- ../../src/tests/mesos.cpp:940: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave': Device or resource busy [ FAILED ] SlaveRecoveryTest/0.ShutdownSlave, where TypeParam = mesos::internal::slave::MesosContainerizer (31 ms) {code} Next, re-running just these tests with {{sudo GTEST_FILTER="SlaveRecoverTest*" bin/mesos-tests.sh}}, they _all_ failed, with a slightly different error. At this point, these failures are reliably produced 100% of the time: {code} [ RUN ] SlaveRecoveryTest/0.MultipleSlaves ../../src/tests/mesos.cpp:906: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/perf_event/mesos_test': Device or resource busy --- We're very sorry but we can't seem to destroy existing cgroups that we likely created as part of an earlier invocation of the tests. 
Please manually destroy the cgroup at '/sys/fs/cgroup/perf_event/mesos_test' by first manually killing all the processes found in the file at '/sys/fs/cgroup/perf_event/mesos_test/tasks' --- ../../src/tests/mesos.cpp:940: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/perf_event/mesos_test': Device or resource busy [ FAILED ] SlaveRecoveryTest/0.MultipleSlaves, where TypeParam = mesos::internal::slave::MesosContainerizer (14 ms) {code} Finally, I recompiled from scratch and ran _only_ the {{SlaveRecoveryTest}} tests, doing: {code} GTEST_FILTER="" make check sudo GTEST_FILTER="SlaveRecoveryTest*" bin/mesos-tests.sh {code} and they all passed. Regarding [~nfnt]'s comment above on the {{HealthCheckTest}} tests, after all this I was able to run {{sudo GTEST_FILTER="HealthCheckTest*" bin/mesos-tests.sh}}, followed by {{sudo GTEST_FILTER="SlaveRecoveryTest*" bin/mesos-tests.sh}}, and all of the {{SlaveRecoveryTest}} tests passed, so perhaps it isn't an artifact of the {{HealthCheckTest}} tests that is causing this problem. was (Author: greggomann): I was just observing this error on Ubuntu 14.04 when running the tests as root. It does indeed seem to be due to artifacts left behind by some tests. After running the entire test suite the first time I saw that several of the {{SlaveRecoveryTest}} tests had failed: {code} [==] 988 tests from 144 test cases ran. (590989 ms total) [ PASSED ] 980 tests. [ FAILED ] 8 tests, listed below: [ FAILED ] SlaveRecoveryTest/0.GCExecutor, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.ShutdownSlave, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] Slav
[jira] [Comment Edited] (MESOS-4025) SlaveRecoveryTest/0.GCExecutor is flaky.
[ https://issues.apache.org/jira/browse/MESOS-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049741#comment-15049741 ] Greg Mann edited comment on MESOS-4025 at 12/10/15 12:44 AM: - I was just observing this error on Ubuntu 14.04 when running the tests as root. It does indeed seem to be due to artifacts left behind by some tests. After running the entire test suite the first time I saw that several of the {{SlaveRecoveryTest}} tests had failed: {code} [==] 988 tests from 144 test cases ran. (590989 ms total) [ PASSED ] 980 tests. [ FAILED ] 8 tests, listed below: [ FAILED ] SlaveRecoveryTest/0.GCExecutor, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.ShutdownSlave, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.ShutdownSlaveSIGUSR1, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveTest.PingTimeoutNoPings [ FAILED ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample [ FAILED ] CgroupsAnyHierarchyWithPerfEventTest.ROOT_CGROUPS_Perf [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery {code} and the {{SlaveRecoveryTest}} errors all looked the same: {code} [ RUN ] SlaveRecoveryTest/0.ShutdownSlave ../../src/tests/mesos.cpp:906: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave': Device or resource busy --- We're very sorry but we can't seem to destroy existing cgroups that we likely created as part of an earlier invocation of the tests. Please manually destroy the cgroup at '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave' by first manually killing all the processes found in the file at '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave/tasks' --- ../../src/tests/mesos.cpp:940: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave': Device or resource busy [ FAILED ] SlaveRecoveryTest/0.ShutdownSlave, where TypeParam = mesos::internal::slave::MesosContainerizer (31 ms) {code} Next, re-running just these tests with {{sudo GTEST_FILTER="SlaveRecoverTest*" bin/mesos-tests.sh}}, they _all_ failed, with a slightly different error. At this point, these failures are reliably produced 100% of the time: {code} [ RUN ] SlaveRecoveryTest/0.MultipleSlaves ../../src/tests/mesos.cpp:906: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/perf_event/mesos_test': Device or resource busy --- We're very sorry but we can't seem to destroy existing cgroups that we likely created as part of an earlier invocation of the tests. 
Please manually destroy the cgroup at '/sys/fs/cgroup/perf_event/mesos_test' by first manually killing all the processes found in the file at '/sys/fs/cgroup/perf_event/mesos_test/tasks' --- ../../src/tests/mesos.cpp:940: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/perf_event/mesos_test': Device or resource busy [ FAILED ] SlaveRecoveryTest/0.MultipleSlaves, where TypeParam = mesos::internal::slave::MesosContainerizer (14 ms) {code} Finally, I recompiled from scratch and ran _only_ the {{SlaveRecoveryTest}} tests, doing: {code} GTEST_FILTER="" make check sudo GTEST_FILTER="SlaveRecoveryTest*" bin/mesos-tests.sh {code} and they all passed. Regarding [~nfnt]'s comment above on the {{HealthCheckTest}} tests, after all this I was able to run {{sudo GTEST_FILTER="HealthCheckTest*" bin/mesos-tests.sh}}, followed by {{sudo GTEST_FILTER="SlaveRecoveryTest*" bin/mesos-tests.sh}}, and all of the {{SlaveRecoveryTest}} tests passed, so perhaps it isn't an artifact of the {{HealthCheckTest}}s that is causing this problem. was (Author: greggomann): I was just observing this error on Ubuntu 14.04 when running the tests as root. It does indeed seem to be due to artifacts left behind by some tests. After running the entire test suite the first time I saw that several of the {{SlaveRecoveryTest}}s had failed: {code} [==] 988 tests from 144 test cases ran. (590989 ms total) [ PASSED ] 980 tests. [ FAILED ] 8 tests, listed below: [ FAILED ] SlaveRecoveryTest/0.GCExecutor, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.ShutdownSlave, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryT
[jira] [Commented] (MESOS-4025) SlaveRecoveryTest/0.GCExecutor is flaky.
[ https://issues.apache.org/jira/browse/MESOS-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049741#comment-15049741 ] Greg Mann commented on MESOS-4025: -- I was just observing this error on Ubuntu 14.04 when running the tests as root. It does indeed seem to be due to artifacts left behind by some tests. After running the entire test suite the first time I saw that several of the {{SlaveRecoveryTest}}s had failed: {code} [==] 988 tests from 144 test cases ran. (590989 ms total) [ PASSED ] 980 tests. [ FAILED ] 8 tests, listed below: [ FAILED ] SlaveRecoveryTest/0.GCExecutor, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.ShutdownSlave, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveRecoveryTest/0.ShutdownSlaveSIGUSR1, where TypeParam = mesos::internal::slave::MesosContainerizer [ FAILED ] SlaveTest.PingTimeoutNoPings [ FAILED ] PerfEventIsolatorTest.ROOT_CGROUPS_Sample [ FAILED ] CgroupsAnyHierarchyWithPerfEventTest.ROOT_CGROUPS_Perf [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_Statistics [ FAILED ] MemoryPressureMesosTest.CGROUPS_ROOT_SlaveRecovery {code} and the {{SlaveRecoveryTest}} errors all looked the same: {code} [ RUN ] SlaveRecoveryTest/0.ShutdownSlave ../../src/tests/mesos.cpp:906: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave': Device or resource busy --- We're very sorry but we can't seem to destroy existing cgroups that we likely created as part of an earlier invocation of the tests. Please manually destroy the cgroup at '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave' by first manually killing all the processes found in the file at '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave/tasks' --- ../../src/tests/mesos.cpp:940: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/memory/mesos_test_55fd4c14-a506-40ec-8965-2350ca5556c7/slave': Device or resource busy [ FAILED ] SlaveRecoveryTest/0.ShutdownSlave, where TypeParam = mesos::internal::slave::MesosContainerizer (31 ms) {code} Next, re-running just these tests with {{sudo GTEST_FILTER="SlaveRecoverTest*" bin/mesos-tests.sh}}, they _all_ failed, with a slightly different error. At this point, these failures are reliably produced 100% of the time: {code} [ RUN ] SlaveRecoveryTest/0.MultipleSlaves ../../src/tests/mesos.cpp:906: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/perf_event/mesos_test': Device or resource busy --- We're very sorry but we can't seem to destroy existing cgroups that we likely created as part of an earlier invocation of the tests. Please manually destroy the cgroup at '/sys/fs/cgroup/perf_event/mesos_test' by first manually killing all the processes found in the file at '/sys/fs/cgroup/perf_event/mesos_test/tasks' --- ../../src/tests/mesos.cpp:940: Failure (cgroups::destroy(hierarchy, cgroup)).failure(): Failed to remove cgroup '/sys/fs/cgroup/perf_event/mesos_test': Device or resource busy [ FAILED ] SlaveRecoveryTest/0.MultipleSlaves, where TypeParam = mesos::internal::slave::MesosContainerizer (14 ms) {code} Finally, I recompiled from scratch and ran _only_ the {{SlaveRecoveryTest}}s, doing: {code} GTEST_FILTER="" make check sudo GTEST_FILTER="SlaveRecoveryTest*" bin/mesos-tests.sh {code} and they all passed. 
Regarding [~nfnt]'s comment above on the {{HealthCheckTest}}s, after all this I was able to run {{sudo GTEST_FILTER="HealthCheckTest*" bin/mesos-tests.sh}}, followed by {{sudo GTEST_FILTER="SlaveRecoveryTest*" bin/mesos-tests.sh}}, and all of the {{SlaveRecoveryTest}}s passed, so perhaps it isn't an artifact of the {{HealthCheckTest}}s that is causing this problem. > SlaveRecoveryTest/0.GCExecutor is flaky. > > > Key: MESOS-4025 > URL: https://issues.apache.org/jira/browse/MESOS-4025 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 0.26.0 >Reporter: Till Toenshoff >Assignee: Jan Schlicht > Labels: flaky, flaky-test, test > > Build was SSL enabled (--enable-ssl, --enable-libevent). The build was based > on 0.26.0-rc1. > Testsuite was run as root. > {noformat} > sudo ./bin/mesos-tests.sh --gtest_break_on_failure --gtest_repeat=-1 > {noformat} > {noformat} > [ RUN ] SlaveRecoveryTest/0.GCExecutor > I1130 16:49:16.336833 1032
[jira] [Commented] (MESOS-1718) Command executor can overcommit the slave.
[ https://issues.apache.org/jira/browse/MESOS-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049692#comment-15049692 ] Vinod Kone commented on MESOS-1718: --- I think adding yet another field "ExecutorInfo::task_command" is potentially confusing. Why not reuse "TaskInfo::CommandInfo" instead? I suggest we relax the current constraint that "only one of TaskInfo::CommandInfo or TaskInfo::ExecutorInfo" should be set. It should be OK if both are set, as far as the slave is concerned. So, when the framework sends a launch task with TaskInfo::ExecutorInfo unset, the master can set that field with the command executor information and pass it on to the slave. Note that there are no required fields in CommandInfo, so the master can leave them unset and let the slave fill them in. But I'm not convinced that having the slave fill in ExecutorInfo::CommandInfo is a good idea. This is because, when a master fails over, it learns about existing ExecutorInfos from the re-registered slaves. Since these ExecutorInfos have updated CommandInfo, it will look weird in the master. The weirdness is because any new command tasks launched will not have ExecutorInfo::CommandInfo::* set, whereas command executors from re-registered slaves will have it set. Also note that we haven't yet talked about backwards-compatibility concerns here. I'm guessing we need to make changes to the slave and the command executor to make sure they work with both old-style and new-style command tasks. > Command executor can overcommit the slave. > -- > > Key: MESOS-1718 > URL: https://issues.apache.org/jira/browse/MESOS-1718 > Project: Mesos > Issue Type: Bug > Components: slave >Reporter: Benjamin Mahler >Assignee: Ian Downes > > Currently we give a small amount of resources to the command executor, in > addition to resources used by the command task: > https://github.com/apache/mesos/blob/0.20.0-rc1/src/slave/slave.cpp#L2448 > {code: title=} > ExecutorInfo Slave::getExecutorInfo( > const FrameworkID& frameworkId, > const TaskInfo& task) > { > ... > // Add an allowance for the command executor. This does lead to a > // small overcommit of resources. > executor.mutable_resources()->MergeFrom( > Resources::parse( > "cpus:" + stringify(DEFAULT_EXECUTOR_CPUS) + ";" + > "mem:" + stringify(DEFAULT_EXECUTOR_MEM.megabytes())).get()); > ... > } > {code} > This leads to an overcommit of the slave. Ideally, for command tasks we can > "transfer" all of the task resources to the executor at the slave / isolation > level. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
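To make the suggestion in the comment above concrete, here is a rough sketch of what the master-side step might look like. The function name and placement are assumptions for illustration only, not actual Mesos master code:
{code}
#include <mesos/mesos.pb.h>

using mesos::ExecutorInfo;
using mesos::TaskInfo;

// If the framework sent a command task (CommandInfo set, ExecutorInfo unset),
// attach a bare command-executor ExecutorInfo before forwarding the task to
// the slave. CommandInfo has no required fields, so it can be left empty here
// and filled in by the slave.
void maybeAttachCommandExecutor(TaskInfo* task)
{
  if (task->has_executor() || !task->has_command()) {
    return; // Custom-executor task, or nothing to do.
  }

  ExecutorInfo executor;
  executor.mutable_executor_id()->set_value(task->task_id().value());
  executor.mutable_command(); // Present but deliberately empty.

  task->mutable_executor()->CopyFrom(executor);
}
{code}
Under this scheme the slave would see both {{TaskInfo::command}} and {{TaskInfo::executor}} set and could launch the command executor, filling in the still-empty CommandInfo locally.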
[jira] [Commented] (MESOS-4003) Pass agent work_dir to isolator modules
[ https://issues.apache.org/jira/browse/MESOS-4003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049684#comment-15049684 ] Greg Mann commented on MESOS-4003: -- In order to prevent breaking the isolator interface in the future when more parameters may be added, a new protobuf message was added and made the sole parameter of {{Isolator::recover()}}. Review is posted here: https://reviews.apache.org/r/41113/ > Pass agent work_dir to isolator modules > --- > > Key: MESOS-4003 > URL: https://issues.apache.org/jira/browse/MESOS-4003 > Project: Mesos > Issue Type: Bug >Reporter: Greg Mann >Assignee: Greg Mann > Labels: external-volumes, mesosphere > > Some isolator modules can benefit from access to the agent's {{work_dir}}. > For example, the DVD isolator (https://github.com/emccode/mesos-module-dvdi) > is currently forced to mount external volumes in a hard-coded directory. > Making the {{work_dir}} accessible to the isolator via > {{Isolator::recover()}} would allow the isolator to mount volumes within the > agent's {{work_dir}}. This can be accomplished by simply adding an overloaded > signature for {{Isolator::recover()}} which includes the {{work_dir}} as a > parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
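Since the comment above is terse, here is an illustrative sketch of the pattern being described: a single, growable parameter object instead of an ever-longer argument list. The type and field names are stand-ins, not the actual message added in r/41113:
{code}
#include <string>
#include <vector>

// Stand-in for the new protobuf message: everything recover() needs travels
// in one place, so future additions (like the agent work_dir here) become
// new fields rather than new parameters.
struct IsolatorRecoverInfo
{
  std::vector<std::string> containerIds;  // state recovered from checkpoints
  std::string workDir;                    // agent work_dir (the new addition)
  // ...more fields can be appended later without breaking modules...
};

class Isolator
{
public:
  virtual ~Isolator() {}

  // Before: recover() took positional parameters, so every new piece of
  // information changed the signature and broke existing isolator modules.
  // After: the signature stays fixed; only the message grows.
  virtual bool recover(const IsolatorRecoverInfo& info) = 0;
};
{code}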
[jira] [Commented] (MESOS-4106) The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.
[ https://issues.apache.org/jira/browse/MESOS-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049651#comment-15049651 ] Benjamin Mahler commented on MESOS-4106: This is also possibly the reason for MESOS-1613. > The health checker may fail to inform the executor to kill an unhealthy task > after max_consecutive_failures. > > > Key: MESOS-4106 > URL: https://issues.apache.org/jira/browse/MESOS-4106 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.20.0, 0.20.1, 0.21.1, 0.21.2, 0.22.1, 0.22.2, 0.23.0, > 0.23.1, 0.24.0, 0.24.1, 0.25.0 >Reporter: Benjamin Mahler >Priority: Blocker > > This was reported by [~tan] experimenting with health checks. Many tasks were > launched with the following health check, taken from the container > stdout/stderr: > {code} > Launching health check process: /usr/local/libexec/mesos/mesos-health-check > --executor=(1)@127.0.0.1:39629 > --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0} > --task_id=sleepy-2 > {code} > This should have led to all tasks getting killed due to > {{\-\-consecutive_failures}} being set; however, only some tasks get killed, > while others remain running. > It turns out that the health check binary does a {{send}} and promptly exits. > Unfortunately, this may lead to a message drop since libprocess may not have > sent this message over the socket by the time the process exits. > We work around this in the command executor with a manual sleep, which has > been around since the svn days. See > [here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4106) The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.
Benjamin Mahler created MESOS-4106: -- Summary: The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures. Key: MESOS-4106 URL: https://issues.apache.org/jira/browse/MESOS-4106 Project: Mesos Issue Type: Bug Affects Versions: 0.25.0, 0.24.1, 0.24.0, 0.23.1, 0.23.0, 0.22.2, 0.22.1, 0.21.2, 0.21.1, 0.20.1, 0.20.0 Reporter: Benjamin Mahler Priority: Blocker This was reported by [~tan] experimenting with health checks. Many tasks were launched with the following health check, taken from the container stdout/stderr: {code} Launching health check process: /usr/local/libexec/mesos/mesos-health-check --executor=(1)@127.0.0.1:39629 --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0} --task_id=sleepy-2 {code} This should have led to all tasks getting killed due to {{\-\-consecutive_failures}} being set; however, only some tasks get killed, while others remain running. It turns out that the health check binary does a {{send}} and promptly exits. Unfortunately, this may lead to a message drop since libprocess may not have sent this message over the socket by the time the process exits. We work around this in the command executor with a manual sleep, which has been around since the svn days. See [here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4105) Network isolator causes corrupt packets to reach application
[ https://issues.apache.org/jira/browse/MESOS-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Downes updated MESOS-4105: -- Description: The optional network isolator (network/port_mapping) will let corrupt TCP packets reach the application. This could lead to data corruption in applications. Normally these packets are dropped immediately by the network stack and do not reach the application. Networks may have a very low level of corrupt packets (a few per million) or, may have very high levels if there are hardware or software errors in networking equipment. 1) We receive a corrupt packet externally 2) The hardware driver is able to checksum it and notices it has a bad checksum 3) The driver delivers this packet anyway to wait for TCP layer to checksum it again and then drop it 4) This packet is moved to a veth interface because it is for a container 5) Both sides of the veth pair have RX checksum offloading enabled by default 6) The veth_xmit() marks the packet's checksum as UNNECESSARY since its peer device has rx checksum offloading 7) Packet is moved into the container TCP/IP stack 8) TCP layer is not going to checksum it since it is not necessary 9) The packet gets delivered to application layer was: The optional network isolator (network/port_mapping) will let corrupt TCP packets reach the application. This could lead to data corruption in applications. Normally these packets are dropped immediately by the network stack and do not reach the application. Networks may have a very low level of corrupt packets (a few per million) or, may have very high levels if there are hardware or software errors in networking equipment. Investigation is ongoing but an initial hypothesis is being tested: 1) The checksum error is correctly detected by the host interface. 2) The Mesos tc filters used by the network isolator redirect the packet to the virtual interface, even when a checksum error has occurred. 3) Either in copying to the veth device or passing across the veth pipe the checksum flag is cleared. 4) The veth inside the container does not verify the checksum, even though TCP RX checksum offloading is supposedly on. \[This is hypothesized to be acceptable normally because it's receiving packets over the virtual link where corruption should not occur\] 5) The container network stack accepts the packet and delivers it to the application. Disabling tcp rx cso on the container veth appears to fix this: it forces the container network stack to compute the packet checksums (in software) whereby it detects the checksum errors and does not deliver the packet to the application. > Network isolator causes corrupt packets to reach application > > > Key: MESOS-4105 > URL: https://issues.apache.org/jira/browse/MESOS-4105 > Project: Mesos > Issue Type: Bug > Components: isolation >Affects Versions: 0.20.0, 0.20.1, 0.21.0, 0.21.1, 0.21.2, 0.22.0, 0.22.1, > 0.22.2, 0.23.0, 0.23.1, 0.24.0, 0.24.1, 0.25.0 >Reporter: Ian Downes >Assignee: Cong Wang >Priority: Critical > > The optional network isolator (network/port_mapping) will let corrupt TCP > packets reach the application. This could lead to data corruption in > applications. Normally these packets are dropped immediately by the network > stack and do not reach the application. > Networks may have a very low level of corrupt packets (a few per million) or, > may have very high levels if there are hardware or software errors in > networking equipment. 
> 1) We receive a corrupt packet externally > 2) The hardware driver is able to checksum it and notices it has a bad > checksum > 3) The driver delivers this packet anyway to wait for TCP layer to checksum > it again and then drop it > 4) This packet is moved to a veth interface because it is for a container > 5) Both sides of the veth pair have RX checksum offloading enabled by default > 6) The veth_xmit() marks the packet's checksum as UNNECESSARY since its peer > device has rx checksum offloading > 7) Packet is moved into the container TCP/IP stack > 8) TCP layer is not going to checksum it since it is not necessary > 9) The packet gets delivered to application layer -- This message was sent by Atlassian JIRA (v6.3.4#6332)
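The workaround noted in the earlier version of the description, disabling RX checksum offload on the container-side veth (what {{ethtool -K <veth> rx off}} does), comes down to an ethtool ioctl. Below is a standalone sketch using the legacy {{ETHTOOL_SRXCSUM}} command; this is an assumption about the mechanism, not the actual Mesos patch, and it would need to run inside the container's network namespace against the container-side veth:
{code}
#include <cstdio>
#include <cstring>

#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

#include <linux/ethtool.h>
#include <linux/sockios.h>

// Turn off RX checksum offload on `ifname` so the container's network stack
// verifies TCP checksums in software and drops corrupt packets instead of
// delivering them to the application.
int disableRxChecksumOffload(const char* ifname)
{
  int fd = socket(AF_INET, SOCK_DGRAM, 0);
  if (fd < 0) {
    perror("socket");
    return -1;
  }

  struct ethtool_value value;
  value.cmd = ETHTOOL_SRXCSUM; // Legacy "set RX checksumming" command.
  value.data = 0;              // 0 == off.

  struct ifreq ifr;
  memset(&ifr, 0, sizeof(ifr));
  strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
  ifr.ifr_data = reinterpret_cast<char*>(&value);

  int result = ioctl(fd, SIOCETHTOOL, &ifr);
  if (result != 0) {
    perror("SIOCETHTOOL(ETHTOOL_SRXCSUM)");
  }

  close(fd);
  return result;
}

int main(int argc, char** argv)
{
  return disableRxChecksumOffload(argc > 1 ? argv[1] : "eth0");
}
{code}
Wherever the eventual fix lands, the effect described above is the same: with RX checksum offload off, the container's TCP stack computes checksums itself and discards the corrupt packets.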
[jira] [Updated] (MESOS-4105) Network isolator causes corrupt packets to reach application
[ https://issues.apache.org/jira/browse/MESOS-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Downes updated MESOS-4105: -- Assignee: Cong Wang > Network isolator causes corrupt packets to reach application > > > Key: MESOS-4105 > URL: https://issues.apache.org/jira/browse/MESOS-4105 > Project: Mesos > Issue Type: Bug > Components: isolation >Affects Versions: 0.20.0, 0.20.1, 0.21.0, 0.21.1, 0.21.2, 0.22.0, 0.22.1, > 0.22.2, 0.23.0, 0.23.1, 0.24.0, 0.24.1, 0.25.0 >Reporter: Ian Downes >Assignee: Cong Wang >Priority: Critical > > The optional network isolator (network/port_mapping) will let corrupt TCP > packets reach the application. This could lead to data corruption in > applications. Normally these packets are dropped immediately by the network > stack and do not reach the application. > Networks may have a very low level of corrupt packets (a few per million) or, > may have very high levels if there are hardware or software errors in > networking equipment. > Investigation is ongoing but an initial hypothesis is being tested: > 1) The checksum error is correctly detected by the host interface. > 2) The Mesos tc filters used by the network isolator redirect the packet to > the virtual interface, even when a checksum error has occurred. > 3) Either in copying to the veth device or passing across the veth pipe the > checksum flag is cleared. > 4) The veth inside the container does not verify the checksum, even though > TCP RX checksum offloading is supposedly on. \[This is hypothesized to be > acceptable normally because it's receiving packets over the virtual link > where corruption should not occur\] > 5) The container network stack accepts the packet and delivers it to the > application. > Disabling tcp rx cso on the container veth appears to fix this: it forces the > container network stack to compute the packet checksums (in software) whereby > it detects the checksum errors and does not deliver the packet to the > application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4105) Network isolator causes corrupt packets to reach application
Ian Downes created MESOS-4105: - Summary: Network isolator causes corrupt packets to reach application Key: MESOS-4105 URL: https://issues.apache.org/jira/browse/MESOS-4105 Project: Mesos Issue Type: Bug Components: isolation Affects Versions: 0.25.0, 0.24.1, 0.24.0, 0.23.1, 0.23.0, 0.22.2, 0.22.1, 0.22.0, 0.21.2, 0.21.1, 0.21.0, 0.20.1, 0.20.0 Reporter: Ian Downes Priority: Critical The optional network isolator (network/port_mapping) will let corrupt TCP packets reach the application. This could lead to data corruption in applications. Normally these packets are dropped immediately by the network stack and do not reach the application. Networks may have a very low level of corrupt packets (a few per million) or, may have very high levels if there are hardware or software errors in networking equipment. Investigation is ongoing but an initial hypothesis is being tested: 1) The checksum error is correctly detected by the host interface. 2) The Mesos tc filters used by the network isolator redirect the packet to the virtual interface, even when a checksum error has occurred. 3) Either in copying to the veth device or passing across the veth pipe the checksum flag is cleared. 4) The veth inside the container does not verify the checksum, even though TCP RX checksum offloading is supposedly on. \[This is hypothesized to be acceptable normally because it's receiving packets over the virtual link where corruption should not occur\] 5) The container network stack accepts the packet and delivers it to the application. Disabling tcp rx cso on the container veth appears to fix this: it forces the container network stack to compute the packet checksums (in software) whereby it detects the checksum errors and does not deliver the packet to the application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4082) Add Tests for quota authentication and authorization
[ https://issues.apache.org/jira/browse/MESOS-4082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-4082: --- Shepherd: Till Toenshoff > Add Tests for quota authentication and authorization > -- > > Key: MESOS-4082 > URL: https://issues.apache.org/jira/browse/MESOS-4082 > Project: Mesos > Issue Type: Task > Components: master, test >Reporter: Jan Schlicht >Assignee: Jan Schlicht > Labels: mesosphere, quota > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-3065) Add framework authorization for persistent volume
[ https://issues.apache.org/jira/browse/MESOS-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-3065: - Sprint: Mesosphere Sprint 16, Mesosphere Sprint 22, Mesosphere Sprint 24 (was: Mesosphere Sprint 16, Mesosphere Sprint 22) > Add framework authorization for persistent volume > - > > Key: MESOS-3065 > URL: https://issues.apache.org/jira/browse/MESOS-3065 > Project: Mesos > Issue Type: Task >Reporter: Michael Park >Assignee: Greg Mann > Labels: mesosphere, persistent-volumes > > Persistent volume should be authorized with the {{principal}} of the > reserving entity (framework or master). The idea is to introduce {{Create}} > and {{Destroy}} into the ACL. > {code} > message Create { > // Subjects. > required Entity principals = 1; > // Objects? Perhaps the kind of volume? allowed permissions? > } > message Destroy { > // Subjects. > required Entity principals = 1; > // Objects. > required Entity creator_principals = 2; > } > {code} > When a framework creates a persistent volume, "create" ACLs are checked to > see if the framework (FrameworkInfo.principal) or the operator > (Credential.user) is authorized to create persistent volumes. If not > authorized, the create operation is rejected. > When a framework destroys a persistent volume, "destroy" ACLs are checked to > see if the framework (FrameworkInfo.principal) or the operator > (Credential.user) is authorized to destroy the persistent volume created by a > framework or operator (Resource.DiskInfo.principal). If not authorized, the > destroy operation is rejected. > A separate ticket will use the structures created here to enable > authorization of the "/create" and "/destroy" HTTP endpoints: > https://issues.apache.org/jira/browse/MESOS-3903 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2980) Allow runtime configuration to be returned from provisioner
[ https://issues.apache.org/jira/browse/MESOS-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilbert Song updated MESOS-2980: Story Points: 5 > Allow runtime configuration to be returned from provisioner > --- > > Key: MESOS-2980 > URL: https://issues.apache.org/jira/browse/MESOS-2980 > Project: Mesos > Issue Type: Improvement >Reporter: Timothy Chen >Assignee: Gilbert Song > Labels: mesosphere > > Image specs also includes execution configuration (e.g: Env, user, ports, > etc). > We should support passing those information from the image provisioner back > to the containerizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2980) Allow runtime configuration to be returned from provisioner
[ https://issues.apache.org/jira/browse/MESOS-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049109#comment-15049109 ] Gilbert Song commented on MESOS-2980: - Just finish this series: 1. https://reviews.apache.org/r/41122/ | add protobuf 2. https://reviews.apache.org/r/41011/ | provisioner/filesystem isolator 3. https://reviews.apache.org/r/41032/ | local/registry puller 4. https://reviews.apache.org/r/41123/ | simple cleanup JSON parse 5. https://reviews.apache.org/r/41124/ | metadata manager 6. https://reviews.apache.org/r/41125/ | docker/appc store > Allow runtime configuration to be returned from provisioner > --- > > Key: MESOS-2980 > URL: https://issues.apache.org/jira/browse/MESOS-2980 > Project: Mesos > Issue Type: Improvement >Reporter: Timothy Chen >Assignee: Gilbert Song > Labels: mesosphere > > Image specs also includes execution configuration (e.g: Env, user, ports, > etc). > We should support passing those information from the image provisioner back > to the containerizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4102) Quota doesn't allocate resources on slave joining
[ https://issues.apache.org/jira/browse/MESOS-4102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049080#comment-15049080 ] Neil Conway commented on MESOS-4102: Thanks for the explanation. I understand what is going on, so the question is whether this is the best behavior. Basically, the current implementation of event-triggered allocations will always make legal allocations (per quota), but might not make all the allocations that legally could be made. Is that considered a problem and/or something we want to change? It would be helpful for me to understand why we have event-triggered allocations in the first place. If we need regular batch allocations to ensure that all resources are allocated appropriately, then I guess event-triggered allocations are just intended to be a "best-effort" mechanism, to do _something_ about a change in cluster state until the next batch allocation occurs? > Quota doesn't allocate resources on slave joining > - > > Key: MESOS-4102 > URL: https://issues.apache.org/jira/browse/MESOS-4102 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Neil Conway > Labels: mesosphere, quota > Attachments: quota_absent_framework_test-1.patch > > > See attached patch. {{framework1}} is not allocated any resources, despite > the fact that the resources on {{agent2}} can safely be allocated to it > without risk of violating {{quota1}}. If I understand the intended quota > behavior correctly, this doesn't seem intended. > Note that if the framework is added _after_ the slaves are added, the > resources on {{agent2}} are allocated to {{framework1}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3738) Mesos health check is invoked incorrectly when Mesos slave is within the docker container
[ https://issues.apache.org/jira/browse/MESOS-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049014#comment-15049014 ] haosdent commented on MESOS-3738: - Hi, you need patch https://issues.apache.org/jira/secure/attachment/12766990/MESOS-3738-0_23_1.patch which I upload in attachments. > Mesos health check is invoked incorrectly when Mesos slave is within the > docker container > - > > Key: MESOS-3738 > URL: https://issues.apache.org/jira/browse/MESOS-3738 > Project: Mesos > Issue Type: Bug > Components: containerization, docker >Affects Versions: 0.25.0 > Environment: Docker 1.8.0: > Client: > Version: 1.8.0 > API version: 1.20 > Go version: go1.4.2 > Git commit: 0d03096 > Built:Tue Aug 11 16:48:39 UTC 2015 > OS/Arch: linux/amd64 > Server: > Version: 1.8.0 > API version: 1.20 > Go version: go1.4.2 > Git commit: 0d03096 > Built:Tue Aug 11 16:48:39 UTC 2015 > OS/Arch: linux/amd64 > Host: Ubuntu 14.04 > Container: Debian 8.1 + Java-7 >Reporter: Yong Tang >Assignee: haosdent > Fix For: 0.26.0 > > Attachments: MESOS-3738-0_23_1.patch, MESOS-3738-0_24_1.patch, > MESOS-3738-0_25_0.patch > > > When Mesos slave is within the container, the COMMAND health check from > Marathon is invoked incorrectly. > In such a scenario, the sandbox directory (instead of the > launcher/health-check directory) is used. This result in an error with the > container. > Command to invoke the Mesos slave container: > {noformat} > sudo docker run -d -v /sys:/sys -v /usr/bin/docker:/usr/bin/docker:ro -v > /usr/lib/x86_64-linux-gnu/libapparmor.so.1:/usr/lib/x86_64-linux-gnu/libapparmor.so.1:ro > -v /var/run/docker.sock:/var/run/docker.sock -v /tmp/mesos:/tmp/mesos mesos > mesos slave --master=zk://10.2.1.2:2181/mesos --containerizers=docker,mesos > --executor_registration_timeout=5mins --docker_stop_timeout=10secs > --launcher=posix > {noformat} > Marathon JSON file: > {code} > { > "id": "ubuntu", > "container": > { > "type": "DOCKER", > "docker": > { > "image": "ubuntu", > "network": "BRIDGE", > "parameters": [] > } > }, > "args": [ "bash", "-c", "while true; do echo 1; sleep 5; done" ], > "uris": [], > "healthChecks": > [ > { > "protocol": "COMMAND", > "command": { "value": "echo Success" }, > "gracePeriodSeconds": 3000, > "intervalSeconds": 5, > "timeoutSeconds": 5, > "maxConsecutiveFailures": 300 > } > ], > "instances": 1 > } > {code} > {noformat} > STDOUT: > root@cea2be47d64f:/mnt/mesos/sandbox# cat stdout > --container="mesos-e20f8959-cd9f-40ae-987d-809401309361-S0.815cc886-1cd1-4f13-8f9b-54af1f127c3f" > --docker="docker" --docker_socket="/var/run/docker.sock" --help="false" > --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO" > --mapped_directory="/mnt/mesos/sandbox" --quiet="false" > --sandbox_directory="/tmp/mesos/slaves/e20f8959-cd9f-40ae-987d-809401309361-S0/frameworks/e20f8959-cd9f-40ae-987d-809401309361-/executors/ubuntu.86bca10f-72c9-11e5-b36d-02420a020106/runs/815cc886-1cd1-4f13-8f9b-54af1f127c3f" > --stop_timeout="10secs" > --container="mesos-e20f8959-cd9f-40ae-987d-809401309361-S0.815cc886-1cd1-4f13-8f9b-54af1f127c3f" > --docker="docker" --docker_socket="/var/run/docker.sock" --help="false" > --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO" > --mapped_directory="/mnt/mesos/sandbox" --quiet="false" > 
--sandbox_directory="/tmp/mesos/slaves/e20f8959-cd9f-40ae-987d-809401309361-S0/frameworks/e20f8959-cd9f-40ae-987d-809401309361-/executors/ubuntu.86bca10f-72c9-11e5-b36d-02420a020106/runs/815cc886-1cd1-4f13-8f9b-54af1f127c3f" > --stop_timeout="10secs" > Registered docker executor on b01e2e75afcb > Starting task ubuntu.86bca10f-72c9-11e5-b36d-02420a020106 > 1 > Launching health check process: > /tmp/mesos/slaves/e20f8959-cd9f-40ae-987d-809401309361-S0/frameworks/e20f8959-cd9f-40ae-987d-809401309361-/executors/ubuntu.86bca10f-72c9-11e5-b36d-02420a020106/runs/815cc886-1cd1-4f13-8f9b-54af1f127c3f/mesos-health-check > --executor=(1)@10.2.1.7:40695 > --health_check_json={"command":{"shell":true,"value":"docker exec > mesos-e20f8959-cd9f-40ae-987d-809401309361-S0.815cc886-1cd1-4f13-8f9b-54af1f127c3f > sh -c \" echo Success > \""},"consecutive_failures":300,"delay_seconds":0.0,"grace_period_seconds":3000.0,"interval_seconds":5.0,"timeout_seconds":5.0} > --task_id=ubuntu.86bca10f-72c9-11e5-b36d-02420a020106 > Health check process launched at pid: 94 > 1 > 1 > 1 > 1 > 1 > STDERR: > root@cea2be47d64f:/mnt/mesos/sandbox# cat stderr > I1014 23:15:58.12795056 exec.cpp:134] Version: 0.25.0 > I1014 23:15:58
[jira] [Updated] (MESOS-4098) Allow interactive terminal for mesos containerizer
[ https://issues.apache.org/jira/browse/MESOS-4098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jojy Varghese updated MESOS-4098: - Story Points: 10 (was: 4) > Allow interactive terminal for mesos containerizer > -- > > Key: MESOS-4098 > URL: https://issues.apache.org/jira/browse/MESOS-4098 > Project: Mesos > Issue Type: Improvement > Components: containerization > Environment: linux >Reporter: Jojy Varghese >Assignee: Jojy Varghese > Labels: mesosphere > > Today mesos containerizer does not have a way to run tasks that require > interactive sessions. An example use case is running a task that requires a > manual password entry from an operator. Another use case could be debugging > (gdb). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4098) Allow interactive terminal for mesos containerizer
[ https://issues.apache.org/jira/browse/MESOS-4098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jojy Varghese updated MESOS-4098: - Issue Type: Story (was: Improvement) > Allow interactive terminal for mesos containerizer > -- > > Key: MESOS-4098 > URL: https://issues.apache.org/jira/browse/MESOS-4098 > Project: Mesos > Issue Type: Story > Components: containerization > Environment: linux >Reporter: Jojy Varghese >Assignee: Jojy Varghese > Labels: mesosphere > > Today mesos containerizer does not have a way to run tasks that require > interactive sessions. An example use case is running a task that requires a > manual password entry from an operator. Another use case could be debugging > (gdb). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4104) Design document for interactive terminal for mesos containerizer
Jojy Varghese created MESOS-4104: Summary: Design document for interactive terminal for mesos containerizer Key: MESOS-4104 URL: https://issues.apache.org/jira/browse/MESOS-4104 Project: Mesos Issue Type: Task Components: containerization Reporter: Jojy Varghese Assignee: Jojy Varghese As a first step toward addressing the use cases, propose a design document covering the requirements, design, and implementation details. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4102) Quota doesn't allocate resources on slave joining
[ https://issues.apache.org/jira/browse/MESOS-4102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15048495#comment-15048495 ] Alexander Rukletsov commented on MESOS-4102: The reason you see this behaviour is a) because event-triggered allocations do not include all available agents *and* b) because we do not persist set aside resources across allocations. However, during the next batch allocation cycle, we will observe all active agents and will be able to properly allocate resources not set aside for quota. Now the question is: *do you find this behaviour surprising*, i.e. *shall we fix it*? In your case, you can make your test succeed if you add {code} Clock::advance(flags.allocation_interval); Clock::settle(); {code} before {{Future allocation = allocations.get();}} The reason why we do b) is because we do not want to "attach" unallocated part of quota to particular agents. Technically, we do not even set aside resources, rather we stop allocating to non quota'ed frameworks if remaining resources are less than the unsatisfied quota part. For posterity, let me elaborate on the sequence of events happening in your test. # Quota {{cpus:2;mem:1024}} is set for {{QUOTA_ROLE}} #* {{allocate()}} for all agents is triggered #* total resources are {{0}} #* no resources to allocate, hence allocation callback is not called, hence nothing is pushed into the {{allocations}} queue # {{framework1}} is added to {{NO_QUOTA_ROLE}} #* {{allocate()}} for all agents is triggered #* total resources are {{0}} #* no resources to allocate, hence allocation callback is not called, hence nothing is pushed into the {{allocations}} queue # {{slave1}} with {{cpus(* ):2; mem(* ):1024}} is added #* {{allocate()}} for {{slave1}} *only* is triggered #* total resources are {{cpus(* ):2; mem(* ):1024}} from {{slave1}} #* total resources are less or equal than unallocated part of quota #* no resources to allocate, hence allocation callback is not called, hence nothing is pushed into the {{allocations}} queue # {{slave2}} with {{cpus(* ):1; mem(* ):512}} is added #* {{allocate()}} for {{slave2}} *only* is triggered #* total resources are {{cpus(* ):1; mem(* ):512}} from {{slave2}} #* total resources are less or equal than unallocated part of quota #* no resources to allocate, hence allocation callback is not called, hence nothing is pushed into the {{allocations}} queue # {{AWAIT_READY(allocation);}} fails since not a single allocation happened in the test. > Quota doesn't allocate resources on slave joining > - > > Key: MESOS-4102 > URL: https://issues.apache.org/jira/browse/MESOS-4102 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Neil Conway > Labels: mesosphere, quota > Attachments: quota_absent_framework_test-1.patch > > > See attached patch. {{framework1}} is not allocated any resources, despite > the fact that the resources on {{agent2}} can safely be allocated to it > without risk of violating {{quota1}}. If I understand the intended quota > behavior correctly, this doesn't seem intended. > Note that if the framework is added _after_ the slaves are added, the > resources on {{agent2}} are allocated to {{framework1}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)