[jira] [Commented] (MESOS-9769) Add direct containerized support for filesystem operations.
[ https://issues.apache.org/jira/browse/MESOS-9769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860494#comment-16860494 ]

Qian Zhang commented on MESOS-9769:
---

The patch above did not update the command executor for `ContainerFileOperation` support. I posted a patch to fix it: https://reviews.apache.org/r/70826/

> Add direct containerized support for filesystem operations.
> ---
>
> Key: MESOS-9769
> URL: https://issues.apache.org/jira/browse/MESOS-9769
> Project: Mesos
> Issue Type: Improvement
> Components: containerization
> Reporter: James Peach
> Assignee: James Peach
> Priority: Major
> Fix For: 1.9.0
>
>
> When setting up the container filesystems, we use `pre_exec_commands` to make
> ABI symlinks and other things. The problem with this is that, depending on
> the order of operations, we may not have the full security policy in place
> yet, but since we are running in the context of the container's mount
> namespaces, the programs we execute are under the control of whoever built
> the container image.
> [~jieyu] and I previously discussed adding filesystem operations to the
> `ContainerLaunchInfo`. Just `ln` would be sufficient for the `cgroups` and
> `linux/filesystem` isolators. Secrets and port mapping isolators need more,
> so we should discuss and file new tickets if necessary.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
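The idea in the quoted description is that the launcher performs a filesystem operation itself instead of running a program such as `ln` from the container image. A minimal Python sketch of that idea follows. It is not Mesos code: `FileOperation`, `apply_operation` and the `SYMLINK` type are hypothetical names chosen for illustration, and the real `ContainerFileOperation` message may look different.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class FileOperation:
    """Hypothetical stand-in for a declarative filesystem operation."""
    type: str    # Only "SYMLINK" is handled in this sketch.
    source: str  # What the link should point to.
    target: str  # Path of the link to create.


def apply_operation(op: FileOperation) -> None:
    """Perform the operation with a direct system call.

    No external program is executed, so no binary supplied by the
    container image runs on behalf of the launcher.
    """
    if op.type != "SYMLINK":
        raise ValueError(f"unsupported operation type: {op.type}")
    os.symlink(op.source, op.target)
```

For example, `apply_operation(FileOperation("SYMLINK", "usr/lib64", "/path/to/rootfs/lib64"))` would create an ABI symlink without invoking `ln`; the paths here are illustrative only.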
[jira] [Created] (MESOS-9836) Docker containerizer overwrites `/mesos/slave` cgroups.
Chun-Hung Hsiao created MESOS-9836:
--

Summary: Docker containerizer overwrites `/mesos/slave` cgroups.
Key: MESOS-9836
URL: https://issues.apache.org/jira/browse/MESOS-9836
Project: Mesos
Issue Type: Bug
Components: containerization
Reporter: Chun-Hung Hsiao

The following bug was observed on our internal testing cluster.

The docker containerizer launched a container on an agent:
{noformat}
I0523 06:00:53.888579 21815 docker.cpp:1195] Starting container 'f69c8a8c-eba4-4494-a305-0956a44a6ad2' for task 'apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1' (and executor 'apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1') of framework 415284b7-2967-407d-b66f-f445e93f064e-0011
I0523 06:00:54.524171 21815 docker.cpp:783] Checkpointing pid 13716 to '/var/lib/mesos/slave/meta/slaves/60c42ab7-eb1a-4cec-b03d-ea06bff00c3f-S2/frameworks/415284b7-2967-407d-b66f-f445e93f064e-0011/executors/apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1/runs/f69c8a8c-eba4-4494-a305-0956a44a6ad2/pids/forked.pid'
{noformat}

After the container was launched, the docker containerizer did a {{docker inspect}} on the container and cached the pid:
[https://github.com/apache/mesos/blob/0c431dd60ae39138cc7e8b099d41ad794c02c9a9/src/slave/containerizer/docker.cpp#L1764]
The cached pid should be slightly greater than 13716.

The docker executor sent a {{TASK_FINISHED}} status update around 16 minutes later:
{noformat}
I0523 06:16:17.287595 21809 slave.cpp:5566] Handling status update TASK_FINISHED (Status UUID: 4e00b786-b773-46cd-8327-c7deb08f1de9) for task apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1 of framework 415284b7-2967-407d-b66f-f445e93f064e-0011 from executor(1)@172.31.1.7:36244
{noformat}

After receiving the terminal status update, the agent asked the docker containerizer to update {{cpu.cfs_period_us}}, {{cpu.cfs_quota_us}} and {{memory.soft_limit_in_bytes}} of the container through the cached pid:
[https://github.com/apache/mesos/blob/0c431dd60ae39138cc7e8b099d41ad794c02c9a9/src/slave/containerizer/docker.cpp#L1696]
{noformat}
I0523 06:16:17.290447 21815 docker.cpp:1868] Updated 'cpu.shares' to 102 at /sys/fs/cgroup/cpu,cpuacct/mesos/slave for container f69c8a8c-eba4-4494-a305-0956a44a6ad2
I0523 06:16:17.290660 21815 docker.cpp:1895] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 10ms (cpus 0.1) for container f69c8a8c-eba4-4494-a305-0956a44a6ad2
I0523 06:16:17.889816 21815 docker.cpp:1937] Updated 'memory.soft_limit_in_bytes' to 32MB for container f69c8a8c-eba4-4494-a305-0956a44a6ad2
{noformat}

Note that {{cpu.shares}} was updated in the {{/mesos/slave}} cgroup rather than in the container's own cgroup. This was possibly because the pid got reused during those 16 minutes:
{noformat}
# zgrep 'systemd.cpp:98\]' /var/log/mesos/archive/mesos-agent.log.12.gz
...
I0523 06:00:54.525178 21815 systemd.cpp:98] Assigned child process '13716' to 'mesos_executors.slice'
I0523 06:00:55.078546 21808 systemd.cpp:98] Assigned child process '13798' to 'mesos_executors.slice'
I0523 06:00:55.134096 21808 systemd.cpp:98] Assigned child process '13799' to 'mesos_executors.slice'
...
I0523 06:06:30.997439 21808 systemd.cpp:98] Assigned child process '32689' to 'mesos_executors.slice'
I0523 06:06:31.050976 21808 systemd.cpp:98] Assigned child process '32690' to 'mesos_executors.slice'
I0523 06:06:31.110514 21815 systemd.cpp:98] Assigned child process '32692' to 'mesos_executors.slice'
I0523 06:06:33.143726 21818 systemd.cpp:98] Assigned child process '446' to 'mesos_executors.slice'
I0523 06:06:33.196251 21818 systemd.cpp:98] Assigned child process '447' to 'mesos_executors.slice'
I0523 06:06:33.266332 21816 systemd.cpp:98] Assigned child process '449' to 'mesos_executors.slice'
...
I0523 06:09:34.870056 21808 systemd.cpp:98] Assigned child process '13717' to 'mesos_executors.slice'
I0523 06:09:34.937762 21813 systemd.cpp:98] Assigned child process '13744' to 'mesos_executors.slice'
I0523 06:09:35.073971 21817 systemd.cpp:98] Assigned child process '13754' to 'mesos_executors.slice'
...
{noformat}

The container itself most likely exited around 06:09:35, well before the docker executor detected and reported the terminal status update. Its pid was then reused by another forked child of the agent, so {{cpu.cfs_period_us}}, {{cpu.cfs_quota_us}} and {{memory.soft_limit_in_bytes}} of the {{/mesos/slave}} cgroup were mistakenly overwritten.
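One possible guard against the pid reuse described above is to record the process start time together with the pid and to compare the two again before touching any cgroup. The sketch below illustrates that check in Python rather than in the C++ of the docker containerizer. It is not the fix adopted for this ticket, and `CachedPid`, `read_starttime` and `still_same_process` are hypothetical names. It relies on field 22 (`starttime`) of `/proc/<pid>/stat` as documented in proc(5).

```python
from dataclasses import dataclass
from typing import Callable, Optional


def parse_starttime(stat_line: str) -> int:
    """Return field 22 (starttime, in clock ticks since boot) of a stat line."""
    # Field 2 (comm) is parenthesised and may contain spaces or ')' itself,
    # so split only the text after the LAST ')'.
    after_comm = stat_line[stat_line.rindex(")") + 1:].split()
    # after_comm[0] is field 3 (state), so field 22 sits at index 22 - 3.
    return int(after_comm[19])


def read_starttime(pid: int) -> Optional[int]:
    """Read the start time of a live process, or None if it no longer exists."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            return parse_starttime(f.read())
    except (FileNotFoundError, ProcessLookupError):
        return None


@dataclass(frozen=True)
class CachedPid:
    pid: int
    starttime: int  # Recorded at the moment the pid was cached.


def still_same_process(
    cached: CachedPid,
    reader: Callable[[int], Optional[int]] = read_starttime,
) -> bool:
    """True only if the pid still refers to the process that was cached."""
    return reader(cached.pid) == cached.starttime
```

A caller would skip the cgroup update when `still_same_process` returns false. The check narrows the race but does not remove it, because the process can still exit between the check and the write.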
[jira] [Created] (MESOS-9835) `QuotaRoleAllocateNonQuotaResource` is failing.
Meng Zhu created MESOS-9835:
---

Summary: `QuotaRoleAllocateNonQuotaResource` is failing.
Key: MESOS-9835
URL: https://issues.apache.org/jira/browse/MESOS-9835
Project: Mesos
Issue Type: Bug
Components: test
Reporter: Meng Zhu
Assignee: Meng Zhu

{noformat}
[ RUN ] HierarchicalAllocatorTest.QuotaRoleAllocateNonQuotaResource
../../src/tests/hierarchical_allocator_tests.cpp:4094: Failure
Value of: allocations.get().isPending()
Actual: false
Expected: true
[ FAILED ] HierarchicalAllocatorTest.QuotaRoleAllocateNonQuotaResource (12 ms)
{noformat}

The test is failing for two reasons.

First, the test is missing a settle call after agent3 is added, so the allocation of agent3 is racy.

Second, after https://github.com/apache/mesos/commit/7df8cc6b79e294c075de09f1de4b31a2b88423c8 we now offer nonquota resources on an agent (even if that means "chopping") on top of a role's satisfied guarantees, so the test needs to be updated to match the behavior change.
[jira] [Assigned] (MESOS-9818) Implement minimal agent-side draining handler
[ https://issues.apache.org/jira/browse/MESOS-9818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Mann reassigned MESOS-9818:

Assignee: Greg Mann

> Implement minimal agent-side draining handler
> -
>
> Key: MESOS-9818
> URL: https://issues.apache.org/jira/browse/MESOS-9818
> Project: Mesos
> Issue Type: Task
> Components: agent
> Reporter: Greg Mann
> Assignee: Greg Mann
> Priority: Major
> Labels: foundations, mesosphere
>
> To unblock other work that can be done in parallel, this ticket captures the
> implementation of a handler for the {{DrainSlaveMessage}} in the agent which
> will:
> * Checkpoint the {{DrainInfo}}
> * Populate a new data member in the agent with the {{DrainInfo}}
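The two handler steps listed in the ticket can be sketched roughly as below. This is a simplified Python illustration, not the agent's C++ implementation: the `Agent` class, the `drain_info.json` file name, the JSON encoding and the decision to checkpoint before updating memory are all assumptions made for the example.

```python
import json
import os
import tempfile
from typing import Optional


class Agent:
    """Hypothetical, heavily simplified agent used only for illustration."""

    def __init__(self, meta_dir: str) -> None:
        self.meta_dir = meta_dir
        # The "new data member" that holds the drain information.
        self.drain_info: Optional[dict] = None

    def handle_drain(self, drain_info: dict) -> None:
        # Step 1: checkpoint. Write to a temporary file in the same
        # directory and rename it into place, so a crash never leaves a
        # partially written checkpoint behind.
        fd, tmp_path = tempfile.mkstemp(dir=self.meta_dir)
        with os.fdopen(fd, "w") as f:
            json.dump(drain_info, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, os.path.join(self.meta_dir, "drain_info.json"))

        # Step 2: populate the in-memory member once the checkpoint exists.
        self.drain_info = drain_info
```

Checkpointing first means that an agent which restarts after the handler returns could, in principle, recover the drain state from disk; whether the real handler orders the steps this way is not stated in the ticket.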
[jira] [Assigned] (MESOS-9814) Implement DrainAgent master/operator call with associated registry actions
[ https://issues.apache.org/jira/browse/MESOS-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Mann reassigned MESOS-9814:

Assignee: Joseph Wu

> Implement DrainAgent master/operator call with associated registry actions
> --
>
> Key: MESOS-9814
> URL: https://issues.apache.org/jira/browse/MESOS-9814
> Project: Mesos
> Issue Type: Task
> Components: master
> Reporter: Joseph Wu
> Assignee: Joseph Wu
> Priority: Major
> Labels: foundations, mesosphere
>
> We want to add several calls associated with agent draining:
> {code}
> message Call {
>   enum Type {
>     . . .
>     DRAIN_AGENT = 37;
>     DEACTIVATE_AGENT = 38;
>     REACTIVATE_AGENT = 39;
>   }
>   . . .
>   message DrainAgents {
>     message DrainConfig {
>       required AgentID agent = 1;
>       // The duration after which the agent should complete draining.
>       // If tasks are still running after this time, they will
>       // be forcefully terminated.
>       optional Duration max_grace_period = 2;
>       // Whether or not this agent will be removed permanently
>       // from the cluster when draining is complete.
>       optional bool destructive = 3 [default = false];
>     }
>     repeated DrainConfig drain_config = 1;
>   }
>   message DeactivateAgents {
>     repeated AgentID agents = 1;
>   }
>   message ReactivateAgents {
>     repeated AgentID agents = 1;
>   }
> }
> {code}
> Each field will be persisted in the registry:
> {code}
> message Registry {
>   . . .
>   message Slave {
>     . . .
>     optional DrainInfo drain_info = 2;
>   }
>   . . .
>   message UnreachableSlave {
>     . . .
>     optional DrainInfo drain_info = 3;
>   }
> }
> {code}
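To make the semantics of the proposed `DrainConfig` fields concrete, the sketch below mirrors the message as a Python dataclass and computes the point after which still-running tasks would be forcefully terminated. It is an illustration only: the dataclass, the use of seconds for `Duration`, the `force_kill_deadline` helper, and the treatment of an unset `max_grace_period` as "no forced deadline" are assumptions, not part of the proposal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DrainConfig:
    """Python mirror of the proposed DrainConfig message (illustrative)."""
    agent: str                                # required AgentID agent = 1
    max_grace_period: Optional[float] = None  # optional Duration, seconds here
    destructive: bool = False                 # optional bool, default false


def force_kill_deadline(cfg: DrainConfig, drain_started: float) -> Optional[float]:
    """Time after which tasks still running would be forcefully terminated.

    Returns None when no grace period is set. Interpreting that case as
    "no forced deadline" is an assumption of this sketch.
    """
    if cfg.max_grace_period is None:
        return None
    return drain_started + cfg.max_grace_period
```

With `DrainConfig(agent="agent-1", max_grace_period=600.0)` and a drain that starts at time `1000.0`, the helper returns `1600.0`; the agent name and times are made up for the example.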