[jira] [Commented] (MESOS-9769) Add direct containerized support for filesystem operations.

2019-06-10 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860494#comment-16860494
 ] 

Qian Zhang commented on MESOS-9769:
---

In the above patch, the command executor was not updated for
`ContainerFileOperation` support. I posted a patch to fix that:

https://reviews.apache.org/r/70826/
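
For illustration, a rough sketch of how an isolator might emit a symlink as a
declarative file operation rather than a `pre_exec_command` (the
`file_operations`/`symlink` field names are assumptions here; see the patches
for the actual proto):

{code}
// Sketch only: declare the ABI symlink for the launcher to perform
// itself, inside the container's mount namespace but without exec'ing
// any program from the container image. Field names are assumed.
ContainerLaunchInfo launchInfo;

ContainerFileOperation* operation = launchInfo.add_file_operations();
operation->mutable_symlink()->set_source("/sys/fs/cgroup/cpu,cpuacct");
operation->mutable_symlink()->set_target("/sys/fs/cgroup/cpu");

// Previously this would have been something like:
//   launchInfo.add_pre_exec_commands()->set_value(
//       "ln -s /sys/fs/cgroup/cpu,cpuacct /sys/fs/cgroup/cpu");
{code}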

> Add direct containerized support for filesystem operations.
> ---
>
> Key: MESOS-9769
> URL: https://issues.apache.org/jira/browse/MESOS-9769
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: James Peach
>Assignee: James Peach
>Priority: Major
> Fix For: 1.9.0
>
>
> When setting up the container filesystems, we use `pre_exec_commands` to make 
> ABI symlinks and other things. The problem with this is that, depending on 
> the order of operations, we may not have the full security policy in place 
> yet, and since we are running in the context of the container's mount 
> namespaces, the programs we execute are under the control of whoever built 
> the container image.
> [~jieyu] and I previously discussed adding filesystem operations to the 
> `ContainerLaunchInfo`. Just `ln` would be sufficient for the `cgroups` and 
> `linux/filesystem` isolators. Secrets and port mapping isolators need more, 
> so we should discuss and file new tickets if necessary.





[jira] [Created] (MESOS-9836) Docker containerizer overwrites `/mesos/slave` cgroups.

2019-06-10 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-9836:
--

 Summary: Docker containerizer overwrites `/mesos/slave` cgroups.
 Key: MESOS-9836
 URL: https://issues.apache.org/jira/browse/MESOS-9836
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Reporter: Chun-Hung Hsiao


The following bug was observed on our internal testing cluster.

The docker containerizer launched a container on an agent:
{noformat}
I0523 06:00:53.888579 21815 docker.cpp:1195] Starting container 
'f69c8a8c-eba4-4494-a305-0956a44a6ad2' for task 
'apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1' (and executor 
'apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1') of framework 
415284b7-2967-407d-b66f-f445e93f064e-0011
I0523 06:00:54.524171 21815 docker.cpp:783] Checkpointing pid 13716 to 
'/var/lib/mesos/slave/meta/slaves/60c42ab7-eb1a-4cec-b03d-ea06bff00c3f-S2/frameworks/415284b7-2967-407d-b66f-f445e93f064e-0011/executors/apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1/runs/f69c8a8c-eba4-4494-a305-0956a44a6ad2/pids/forked.pid'
{noformat}
After the container was launched, the docker containerizer did a {{docker 
inspect}} on the container and cached the pid:
[https://github.com/apache/mesos/blob/0c431dd60ae39138cc7e8b099d41ad794c02c9a9/src/slave/containerizer/docker.cpp#L1764]
The cached pid would have been slightly greater than 13716.

The docker executor sent a {{TASK_FINISHED}} status update around 16 minutes 
later:
{noformat}
I0523 06:16:17.287595 21809 slave.cpp:5566] Handling status update 
TASK_FINISHED (Status UUID: 4e00b786-b773-46cd-8327-c7deb08f1de9) for task 
apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1 of framework 
415284b7-2967-407d-b66f-f445e93f064e-0011 from executor(1)@172.31.1.7:36244
{noformat}
After receiving the terminal status update, the agent asked the docker 
containerizer to update {{cpu.cfs_period_us}}, {{cpu.cfs_quota_us}} and 
{{memory.soft_limit_in_bytes}} of the container through the cached pid:
[https://github.com/apache/mesos/blob/0c431dd60ae39138cc7e8b099d41ad794c02c9a9/src/slave/containerizer/docker.cpp#L1696]
{noformat}
I0523 06:16:17.290447 21815 docker.cpp:1868] Updated 'cpu.shares' to 102 at 
/sys/fs/cgroup/cpu,cpuacct/mesos/slave for container 
f69c8a8c-eba4-4494-a305-0956a44a6ad2
I0523 06:16:17.290660 21815 docker.cpp:1895] Updated 'cpu.cfs_period_us' to 
100ms and 'cpu.cfs_quota_us' to 10ms (cpus 0.1) for container 
f69c8a8c-eba4-4494-a305-0956a44a6ad2
I0523 06:16:17.889816 21815 docker.cpp:1937] Updated 
'memory.soft_limit_in_bytes' to 32MB for container 
f69c8a8c-eba4-4494-a305-0956a44a6ad2
{noformat}
Note that the cgroup updated for {{cpu.shares}} was {{/mesos/slave}}. This was 
possibly because the pid got reused over the 16 minutes:
{noformat}
# zgrep 'systemd.cpp:98\]' /var/log/mesos/archive/mesos-agent.log.12.gz
...
I0523 06:00:54.525178 21815 systemd.cpp:98] Assigned child process '13716' to 
'mesos_executors.slice'
I0523 06:00:55.078546 21808 systemd.cpp:98] Assigned child process '13798' to 
'mesos_executors.slice'
I0523 06:00:55.134096 21808 systemd.cpp:98] Assigned child process '13799' to 
'mesos_executors.slice'
...
I0523 06:06:30.997439 21808 systemd.cpp:98] Assigned child process '32689' to 
'mesos_executors.slice'
I0523 06:06:31.050976 21808 systemd.cpp:98] Assigned child process '32690' to 
'mesos_executors.slice'
I0523 06:06:31.110514 21815 systemd.cpp:98] Assigned child process '32692' to 
'mesos_executors.slice'
I0523 06:06:33.143726 21818 systemd.cpp:98] Assigned child process '446' to 
'mesos_executors.slice'
I0523 06:06:33.196251 21818 systemd.cpp:98] Assigned child process '447' to 
'mesos_executors.slice'
I0523 06:06:33.266332 21816 systemd.cpp:98] Assigned child process '449' to 
'mesos_executors.slice'
...
I0523 06:09:34.870056 21808 systemd.cpp:98] Assigned child process '13717' to 
'mesos_executors.slice'
I0523 06:09:34.937762 21813 systemd.cpp:98] Assigned child process '13744' to 
'mesos_executors.slice'
I0523 06:09:35.073971 21817 systemd.cpp:98] Assigned child process '13754' to 
'mesos_executors.slice'
...
{noformat}
It was highly likely that the container itself exited around 06:09:35, well 
before the docker executor detected and reported the terminal status update, 
and that its pid was then reused by another forked child of the agent. As a 
result, {{cpu.cfs_period_us}}, {{cpu.cfs_quota_us}} and 
{{memory.soft_limit_in_bytes}} of the {{/mesos/slave}} cgroup were mistakenly 
overwritten.
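
The root cause is that a pid alone does not uniquely identify a process over 
time. One possible guard (a sketch, not what the containerizer currently 
does) is to record the process start time together with the pid and verify 
it before acting on the cached pid; on Linux the start time is field 22 of 
{{/proc/<pid>/stat}}:

{code}
#include <sys/types.h>

#include <fstream>
#include <iterator>
#include <sstream>
#include <string>

// Returns the start time of `pid` in clock ticks since boot (field 22
// of /proc/<pid>/stat), or -1 if the process does not exist. Comparing
// this against a value recorded when the pid was first cached detects
// reuse: a recycled pid has a different start time.
long long processStartTime(pid_t pid)
{
  std::ifstream stat("/proc/" + std::to_string(pid) + "/stat");
  if (!stat.is_open()) {
    return -1;
  }

  std::string contents(
      (std::istreambuf_iterator<char>(stat)),
      std::istreambuf_iterator<char>());

  // The executable name (field 2) is parenthesized and may contain
  // spaces, so parse starting from the last ')'.
  const size_t offset = contents.rfind(')');
  if (offset == std::string::npos) {
    return -1;
  }

  std::istringstream fields(contents.substr(offset + 2));

  std::string skipped;
  for (int field = 3; field <= 21; field++) {
    fields >> skipped; // Skip fields 3..21.
  }

  long long startTime = -1;
  fields >> startTime; // Field 22.
  return startTime;
}
{code}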





[jira] [Created] (MESOS-9835) `QuotaRoleAllocateNonQuotaResource` is failing.

2019-06-10 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9835:
---

 Summary: `QuotaRoleAllocateNonQuotaResource` is failing.
 Key: MESOS-9835
 URL: https://issues.apache.org/jira/browse/MESOS-9835
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Meng Zhu
Assignee: Meng Zhu


{noformat}
[ RUN  ] HierarchicalAllocatorTest.QuotaRoleAllocateNonQuotaResource
../../src/tests/hierarchical_allocator_tests.cpp:4094: Failure
Value of: allocations.get().isPending()
  Actual: false
Expected: true
[  FAILED  ] HierarchicalAllocatorTest.QuotaRoleAllocateNonQuotaResource (12 ms)
{noformat}

The test is failing for two reasons:

First, after agent3 is added, the test is missing a settle call, so the 
allocation triggered by adding agent3 is racy.

Second, after 
https://github.com/apache/mesos/commit/7df8cc6b79e294c075de09f1de4b31a2b88423c8
we now offer non-quota resources on an agent (even if that means "chopping") 
on top of a role's satisfied guarantees, so the test needs to be updated to 
match the behavior change.
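
For illustration, the settle fix would follow the pattern used elsewhere in 
the test suite (a sketch; argument lists abbreviated from 
src/tests/hierarchical_allocator_tests.cpp):

{code}
Clock::pause();

// Adding agent3 triggers an asynchronous allocation in the
// allocator actor.
allocator->addSlave(
    agent3.id(), agent3, AGENT_CAPABILITIES(), None(), agent3.resources(), {});

// Settle so that the allocation triggered above completes before the
// test inspects `allocations`; without this the expectation is racy.
Clock::settle();
{code}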





[jira] [Assigned] (MESOS-9818) Implement minimal agent-side draining handler

2019-06-10 Thread Greg Mann (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-9818:


Assignee: Greg Mann

> Implement minimal agent-side draining handler
> -
>
> Key: MESOS-9818
> URL: https://issues.apache.org/jira/browse/MESOS-9818
> Project: Mesos
>  Issue Type: Task
>  Components: agent
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Major
>  Labels: foundations, mesosphere
>
> To unblock other work that can be done in parallel, this ticket captures the 
> implementation of a handler for the {{DrainSlaveMessage}} in the agent, which 
> will:
> * Checkpoint the {{DrainInfo}}
> * Populate a new data member in the agent with the {{DrainInfo}}
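
A minimal sketch of such a handler follows; the method name, the 
`getDrainInfoPath` helper, and the `drainInfo` member are hypothetical 
stand-ins for illustration, not the committed implementation:

{code}
// Hypothetical sketch of the agent-side handler described above.
void Slave::handleDrainSlaveMessage(
    const process::UPID& from,
    const DrainSlaveMessage& message)
{
  LOG(INFO) << "Received DrainSlaveMessage from " << from;

  // (1) Checkpoint the DrainInfo so that draining survives an agent
  //     restart and can be resumed during recovery.
  const std::string path = getDrainInfoPath(metaDir, info.id());
  CHECK_SOME(state::checkpoint(path, message.drain_info()));

  // (2) Populate the new in-memory data member so that other handlers
  //     (e.g. status update processing) can consult it.
  drainInfo = message.drain_info();
}
{code}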





[jira] [Assigned] (MESOS-9814) Implement DrainAgent master/operator call with associated registry actions

2019-06-10 Thread Greg Mann (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-9814:


Assignee: Joseph Wu

> Implement DrainAgent master/operator call with associated registry actions
> --
>
> Key: MESOS-9814
> URL: https://issues.apache.org/jira/browse/MESOS-9814
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>Priority: Major
>  Labels: foundations, mesosphere
>
> We want to add several calls associated with agent draining:
> {code}
> message Call {
>   enum Type {
> . . .
> DRAIN_AGENT = 37;
> DEACTIVATE_AGENT = 38;
> REACTIVATE_AGENT = 39;
>   }
>   . . .
>   message DrainAgents {
> message DrainConfig {
>   required AgentID agent = 1;
>   // The duration after which the agent should complete draining.
>   // If tasks are still running after this time, they will
>   // be forcefully terminated.
>   optional Duration max_grace_period = 2;
>   // Whether or not this agent will be removed permanently
>   // from the cluster when draining is complete.
>   optional bool destructive = 3 [default = false];
> }
> repeated DrainConfig drain_config = 1;
>   }
>   message DeactivateAgents {
> repeated AgentID agents = 1;
>   }
>   message ReactivateAgents {
> repeated AgentID agents = 1;
>   }
> }
> {code}
> The resulting {{DrainInfo}} will be persisted in the registry:
> {code}
> message Registry {
>   . . .
>   message Slave {
> . . .
> optional DrainInfo drain_info = 2;
>   }
>   . . .
>   message UnreachableSlave {
> . . .
> optional DrainInfo drain_info = 3;
>   }
> }
> {code}
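
For illustration, a client of the proposed API might construct a drain call 
like this (a sketch against the protos above, assuming generated protobuf 
accessors and a DurationInfo-style {{nanoseconds}} field; the final shape may 
differ):

{code}
mesos::master::Call call;
call.set_type(mesos::master::Call::DRAIN_AGENT);

mesos::master::Call::DrainAgents::DrainConfig* config =
  call.mutable_drain_agents()->add_drain_config();

// Drain this agent, giving tasks up to 10 minutes to finish, and keep
// the agent in the cluster once draining completes.
config->mutable_agent()->set_value("60c42ab7-eb1a-4cec-b03d-ea06bff00c3f-S2");
config->mutable_max_grace_period()->set_nanoseconds(
    10 * 60 * 1000000000LL); // 10 minutes.
config->set_destructive(false);
{code}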


