[jira] [Commented] (MESOS-9937) 53598228fe should be backported to 1.7.x

2019-08-20 Thread longfei (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911906#comment-16911906
 ] 

longfei commented on MESOS-9937:


(y)(y)

> 53598228fe should be backported to 1.7.x
> 
>
> Key: MESOS-9937
> URL: https://issues.apache.org/jira/browse/MESOS-9937
> Project: Mesos
> Issue Type: Bug
> Reporter: longfei
> Assignee: Greg Mann
> Priority: Blocker
> Labels: foundations
> Fix For: 1.7.3
>
>
> Commit 53598228fe on the master branch should be backported to 1.7.x. 
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Comment Edited] (MESOS-9545) Marking an unreachable agent as gone should transition the tasks to terminal state

2019-08-20 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911836#comment-16911836
 ] 

Greg Mann edited comment on MESOS-9545 at 8/21/19 1:40 AM:
---

1.8.x:
{noformat}
commit 13e4cd1c42ae88094f14d6b05cfb9832d4494193
Author: Greg Mann 
Date:   Tue Apr 23 22:25:29 2019 -0700

Transitioned tasks when an unreachable agent is marked as gone.

This patch updates the master code responsible for marking
agents as gone to properly transition tasks on agents which
were previously marked as unreachable.

Review: https://reviews.apache.org/r/70519/
{noformat}
{noformat}
commit 6f90cc334701fad10e721312cd4cbd0690e1c6ec
Author: Greg Mann 
Date:   Tue Apr 23 22:25:21 2019 -0700

Fixed a memory leak in the master's 'removeTask()' helper.

Previously, all removed tasks were added to the
`slaves.unreachableTasks` map. This patch adds a conditional
so that removed tasks are only added to that structure when
they are being marked unreachable.

Review: https://reviews.apache.org/r/70518/
{noformat}


1.7.x:
{noformat}
commit 61f1155675bd3bc5312e0501ea6182d2ee7434af
Author: Greg Mann 
Date:   Tue Apr 23 22:25:29 2019 -0700

Transitioned tasks when an unreachable agent is marked as gone.

This patch updates the master code responsible for marking
agents as gone to properly transition tasks on agents which
were previously marked as unreachable.

Review: https://reviews.apache.org/r/70519/
{noformat}
{noformat}
commit 0c5e78bc26653d26a03b08b82923ea517de46fc0
Author: Greg Mann 
Date:   Tue Apr 23 22:25:21 2019 -0700

Fixed a memory leak in the master's 'removeTask()' helper.

Previously, all removed tasks were added to the
`slaves.unreachableTasks` map. This patch adds a conditional
so that removed tasks are only added to that structure when
they are being marked unreachable.

Review: https://reviews.apache.org/r/70518/
{noformat}


1.6.x:
{noformat}
commit c6da50d10511a1046b8d4bc563dc3ccee875
Author: Greg Mann 
Date:   Tue Apr 23 22:25:29 2019 -0700

Transitioned tasks when an unreachable agent is marked as gone.

This patch updates the master code responsible for marking
agents as gone to properly transition tasks on agents which
were previously marked as unreachable.

Review: https://reviews.apache.org/r/70519/
{noformat}
{noformat}
commit 6a9cee7999be0a3a4f89d21ec58947fe90c01eeb
Author: Greg Mann 
Date:   Tue Apr 23 22:25:21 2019 -0700

Fixed a memory leak in the master's 'removeTask()' helper.

Previously, all removed tasks were added to the
`slaves.unreachableTasks` map. This patch adds a conditional
so that removed tasks are only added to that structure when
they are being marked unreachable.

Review: https://reviews.apache.org/r/70518/
{noformat}


was (Author: greggomann):
1.8.x:
{noformat}
commit 13e4cd1c42ae88094f14d6b05cfb9832d4494193
Author: Greg Mann 
Date:   Tue Apr 23 22:25:29 2019 -0700

Transitioned tasks when an unreachable agent is marked as gone.

This patch updates the master code responsible for marking
agents as gone to properly transition tasks on agents which
were previously marked as unreachable.

Review: https://reviews.apache.org/r/70519/
{noformat}
{noformat}
commit 6f90cc334701fad10e721312cd4cbd0690e1c6ec
Author: Greg Mann 
Date:   Tue Apr 23 22:25:21 2019 -0700

Fixed a memory leak in the master's 'removeTask()' helper.

Previously, all removed tasks were added to the
`slaves.unreachableTasks` map. This patch adds a conditional
so that removed tasks are only added to that structure when
they are being marked unreachable.

Review: https://reviews.apache.org/r/70518/
{noformat}

1.7.x:
{noformat}
commit 61f1155675bd3bc5312e0501ea6182d2ee7434af
Author: Greg Mann 
Date:   Tue Apr 23 22:25:29 2019 -0700

Transitioned tasks when an unreachable agent is marked as gone.

This patch updates the master code responsible for marking
agents as gone to properly transition tasks on agents which
were previously marked as unreachable.

Review: https://reviews.apache.org/r/70519/
{noformat}
{noformat}
commit 0c5e78bc26653d26a03b08b82923ea517de46fc0
Author: Greg Mann 
Date:   Tue Apr 23 22:25:21 2019 -0700

Fixed a memory leak in the master's 'removeTask()' helper.

Previously, all removed tasks were added to the
`slaves.unreachableTasks` map. This patch adds a conditional
so that removed tasks are only added to that structure when
they are being marked unreachable.

Review: https://reviews.apache.org/r/70518/
{noformat}

1.6.x:
{noformat}
commit c6da50d10511a1046b8d4bc563dc3ccee875 (HEAD -> 1.6.x, origin/1.6.x, 
mesos-private/ci/greg/mesos-9545-1.6.x, ci/greg/mesos-9545-1.6.x)
Author: Greg Mann 
Date:   Tue Apr 23 22:25:29 2019 -0700

Transitioned tasks when an 

[jira] [Comment Edited] (MESOS-9545) Marking an unreachable agent as gone should transition the tasks to terminal state

2019-08-20 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911836#comment-16911836
 ] 

Greg Mann edited comment on MESOS-9545 at 8/21/19 1:40 AM:
---

1.8.x:
{noformat}
commit 13e4cd1c42ae88094f14d6b05cfb9832d4494193
Author: Greg Mann 
Date:   Tue Apr 23 22:25:29 2019 -0700

Transitioned tasks when an unreachable agent is marked as gone.

This patch updates the master code responsible for marking
agents as gone to properly transition tasks on agents which
were previously marked as unreachable.

Review: https://reviews.apache.org/r/70519/
{noformat}
{noformat}
commit 6f90cc334701fad10e721312cd4cbd0690e1c6ec
Author: Greg Mann 
Date:   Tue Apr 23 22:25:21 2019 -0700

Fixed a memory leak in the master's 'removeTask()' helper.

Previously, all removed tasks were added to the
`slaves.unreachableTasks` map. This patch adds a conditional
so that removed tasks are only added to that structure when
they are being marked unreachable.

Review: https://reviews.apache.org/r/70518/
{noformat}

1.7.x:
{noformat}
commit 61f1155675bd3bc5312e0501ea6182d2ee7434af
Author: Greg Mann 
Date:   Tue Apr 23 22:25:29 2019 -0700

Transitioned tasks when an unreachable agent is marked as gone.

This patch updates the master code responsible for marking
agents as gone to properly transition tasks on agents which
were previously marked as unreachable.

Review: https://reviews.apache.org/r/70519/
{noformat}
{noformat}
commit 0c5e78bc26653d26a03b08b82923ea517de46fc0
Author: Greg Mann 
Date:   Tue Apr 23 22:25:21 2019 -0700

Fixed a memory leak in the master's 'removeTask()' helper.

Previously, all removed tasks were added to the
`slaves.unreachableTasks` map. This patch adds a conditional
so that removed tasks are only added to that structure when
they are being marked unreachable.

Review: https://reviews.apache.org/r/70518/
{noformat}

1.6.x:
{noformat}
commit c6da50d10511a1046b8d4bc563dc3ccee875 (HEAD -> 1.6.x, origin/1.6.x, 
mesos-private/ci/greg/mesos-9545-1.6.x, ci/greg/mesos-9545-1.6.x)
Author: Greg Mann 
Date:   Tue Apr 23 22:25:29 2019 -0700

Transitioned tasks when an unreachable agent is marked as gone.

This patch updates the master code responsible for marking
agents as gone to properly transition tasks on agents which
were previously marked as unreachable.

Review: https://reviews.apache.org/r/70519/
{noformat}
{noformat}
commit 6a9cee7999be0a3a4f89d21ec58947fe90c01eeb
Author: Greg Mann 
Date:   Tue Apr 23 22:25:21 2019 -0700

Fixed a memory leak in the master's 'removeTask()' helper.

Previously, all removed tasks were added to the
`slaves.unreachableTasks` map. This patch adds a conditional
so that removed tasks are only added to that structure when
they are being marked unreachable.

Review: https://reviews.apache.org/r/70518/
{noformat}


was (Author: greggomann):
1.8.x:
{noformat}
commit 13e4cd1c42ae88094f14d6b05cfb9832d4494193
Author: Greg Mann 
Date:   Tue Apr 23 22:25:29 2019 -0700

Transitioned tasks when an unreachable agent is marked as gone.

This patch updates the master code responsible for marking
agents as gone to properly transition tasks on agents which
were previously marked as unreachable.

Review: https://reviews.apache.org/r/70519/
{noformat}
{noformat}
commit 6f90cc334701fad10e721312cd4cbd0690e1c6ec
Author: Greg Mann 
Date:   Tue Apr 23 22:25:21 2019 -0700

Fixed a memory leak in the master's 'removeTask()' helper.

Previously, all removed tasks were added to the
`slaves.unreachableTasks` map. This patch adds a conditional
so that removed tasks are only added to that structure when
they are being marked unreachable.

Review: https://reviews.apache.org/r/70518/
{noformat}

1.7.x:
{noformat}
commit 61f1155675bd3bc5312e0501ea6182d2ee7434af
Author: Greg Mann 
Date:   Tue Apr 23 22:25:29 2019 -0700

Transitioned tasks when an unreachable agent is marked as gone.

This patch updates the master code responsible for marking
agents as gone to properly transition tasks on agents which
were previously marked as unreachable.

Review: https://reviews.apache.org/r/70519/
{noformat}
{noformat}
commit 0c5e78bc26653d26a03b08b82923ea517de46fc0
Author: Greg Mann 
Date:   Tue Apr 23 22:25:21 2019 -0700

Fixed a memory leak in the master's 'removeTask()' helper.

Previously, all removed tasks were added to the
`slaves.unreachableTasks` map. This patch adds a conditional
so that removed tasks are only added to that structure when
they are being marked unreachable.

Review: https://reviews.apache.org/r/70518/
{noformat}

> Marking an unreachable agent as gone should transition the tasks to terminal 
> state
> 

[jira] [Commented] (MESOS-9937) 53598228fe should be backported to 1.7.x

2019-08-20 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911854#comment-16911854
 ] 

Greg Mann commented on MESOS-9937:
--

[~carlone], this is done, see the commit below:
{noformat}
commit 0c5e78bc26653d26a03b08b82923ea517de46fc0
Author: Greg Mann 
Date:   Tue Apr 23 22:25:21 2019 -0700

Fixed a memory leak in the master's 'removeTask()' helper.

Previously, all removed tasks were added to the
`slaves.unreachableTasks` map. This patch adds a conditional
so that removed tasks are only added to that structure when
they are being marked unreachable.

Review: https://reviews.apache.org/r/70518/
{noformat}
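
A minimal sketch of the conditional described in this commit message, written
in Python rather than the master's C++. The class `Master`, the method
`remove_task` and the `unreachable` flag are hypothetical stand-ins; only the
`unreachableTasks` map and the behaviour (a removed task is recorded only when
it is being marked unreachable) come from the commit message.

```python
from collections import defaultdict


class Master:
    """Toy stand-in for the master's task bookkeeping (hypothetical names)."""

    def __init__(self):
        # framework id -> task id -> agent id, for tasks currently tracked.
        self.tasks = defaultdict(dict)
        # agent id -> list of task ids; stands in for `slaves.unreachableTasks`.
        self.unreachable_tasks = defaultdict(list)

    def add_task(self, framework_id, task_id, agent_id):
        self.tasks[framework_id][task_id] = agent_id

    def remove_task(self, framework_id, task_id, unreachable=False):
        agent_id = self.tasks[framework_id].pop(task_id)
        # The fix: only remember the task when it is being marked unreachable.
        # Recording every removed task here lets the map grow without bound.
        if unreachable:
            self.unreachable_tasks[agent_id].append(task_id)


master = Master()
master.add_task("fw", "t1", "agent-1")
master.add_task("fw", "t2", "agent-1")
master.remove_task("fw", "t1")                    # task finished normally
master.remove_task("fw", "t2", unreachable=True)  # agent became unreachable
print(dict(master.unreachable_tasks))
```

Only `t2` ends up in the map, so tasks that are removed for any other reason
no longer accumulate there.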

> 53598228fe should be backported to 1.7.x
> 
>
> Key: MESOS-9937
> URL: https://issues.apache.org/jira/browse/MESOS-9937
> Project: Mesos
> Issue Type: Bug
> Reporter: longfei
> Assignee: Greg Mann
> Priority: Blocker
> Labels: foundations
>
> Commit 53598228fe on the master branch should be backported to 1.7.x. 
>  





[jira] [Comment Edited] (MESOS-9545) Marking an unreachable agent as gone should transition the tasks to terminal state

2019-08-20 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911836#comment-16911836
 ] 

Greg Mann edited comment on MESOS-9545 at 8/21/19 1:25 AM:
---

1.8.x:
{noformat}
commit 13e4cd1c42ae88094f14d6b05cfb9832d4494193
Author: Greg Mann 
Date:   Tue Apr 23 22:25:29 2019 -0700

Transitioned tasks when an unreachable agent is marked as gone.

This patch updates the master code responsible for marking
agents as gone to properly transition tasks on agents which
were previously marked as unreachable.

Review: https://reviews.apache.org/r/70519/
{noformat}
{noformat}
commit 6f90cc334701fad10e721312cd4cbd0690e1c6ec
Author: Greg Mann 
Date:   Tue Apr 23 22:25:21 2019 -0700

Fixed a memory leak in the master's 'removeTask()' helper.

Previously, all removed tasks were added to the
`slaves.unreachableTasks` map. This patch adds a conditional
so that removed tasks are only added to that structure when
they are being marked unreachable.

Review: https://reviews.apache.org/r/70518/
{noformat}

1.7.x:
{noformat}
commit 61f1155675bd3bc5312e0501ea6182d2ee7434af
Author: Greg Mann 
Date:   Tue Apr 23 22:25:29 2019 -0700

Transitioned tasks when an unreachable agent is marked as gone.

This patch updates the master code responsible for marking
agents as gone to properly transition tasks on agents which
were previously marked as unreachable.

Review: https://reviews.apache.org/r/70519/
{noformat}
{noformat}
commit 0c5e78bc26653d26a03b08b82923ea517de46fc0
Author: Greg Mann 
Date:   Tue Apr 23 22:25:21 2019 -0700

Fixed a memory leak in the master's 'removeTask()' helper.

Previously, all removed tasks were added to the
`slaves.unreachableTasks` map. This patch adds a conditional
so that removed tasks are only added to that structure when
they are being marked unreachable.

Review: https://reviews.apache.org/r/70518/
{noformat}


was (Author: greggomann):
1.8.x:
{noformat}
commit 13e4cd1c42ae88094f14d6b05cfb9832d4494193
Author: Greg Mann 
Date:   Tue Apr 23 22:25:29 2019 -0700

Transitioned tasks when an unreachable agent is marked as gone.

This patch updates the master code responsible for marking
agents as gone to properly transition tasks on agents which
were previously marked as unreachable.

Review: https://reviews.apache.org/r/70519/
{noformat}
{noformat}
commit 6f90cc334701fad10e721312cd4cbd0690e1c6ec
Author: Greg Mann 
Date:   Tue Apr 23 22:25:21 2019 -0700

Fixed a memory leak in the master's 'removeTask()' helper.

Previously, all removed tasks were added to the
`slaves.unreachableTasks` map. This patch adds a conditional
so that removed tasks are only added to that structure when
they are being marked unreachable.

Review: https://reviews.apache.org/r/70518/
{noformat}

> Marking an unreachable agent as gone should transition the tasks to terminal 
> state
> --
>
> Key: MESOS-9545
> URL: https://issues.apache.org/jira/browse/MESOS-9545
> Project: Mesos
> Issue Type: Improvement
> Reporter: Vinod Kone
> Assignee: Greg Mann
> Priority: Major
> Labels: foundations
> Fix For: 1.9.0
>
>
> If an unreachable agent is marked as gone, currently master just marks that 
> agent in the registry but doesn't do anything about its tasks. So the tasks 
> are in UNREACHABLE state in the master forever, until the master fails over. 
> This is not great UX. We should transition these to terminal state instead.
> This fix should also include a test to verify.
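
A rough sketch of the requested behaviour, in Python for brevity. Everything
here is a hypothetical in-memory stand-in for the master's state, and
`TASK_GONE_BY_OPERATOR` is an assumed choice of terminal state; the issue text
only asks for "terminal state".

```python
TERMINAL_STATE = "TASK_GONE_BY_OPERATOR"  # assumed choice of terminal state


def mark_agent_gone(agent_id, registry_gone, unreachable_tasks, task_states):
    """Mark an agent gone and transition its unreachable tasks.

    All parameters are hypothetical in-memory stand-ins:
      registry_gone:     set of agent ids recorded as gone
      unreachable_tasks: dict mapping agent id to a list of task ids
      task_states:       dict mapping task id to its current state string
    Returns the list of task ids that were transitioned.
    """
    registry_gone.add(agent_id)
    # Without the fix, handling stops here and the tasks stay UNREACHABLE.
    transitioned = unreachable_tasks.pop(agent_id, [])
    for task_id in transitioned:
        task_states[task_id] = TERMINAL_STATE
    return transitioned


gone = set()
unreachable = {"agent-1": ["t1", "t2"]}
states = {
    "t1": "TASK_UNREACHABLE",
    "t2": "TASK_UNREACHABLE",
    "t3": "TASK_RUNNING",
}
updated = mark_agent_gone("agent-1", gone, unreachable, states)
print(updated, states)
```

Tasks on other agents (`t3` here) are left untouched; only the tasks that were
tracked as unreachable on the gone agent change state.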





[jira] [Commented] (MESOS-9545) Marking an unreachable agent as gone should transition the tasks to terminal state

2019-08-20 Thread Greg Mann (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911836#comment-16911836
 ] 

Greg Mann commented on MESOS-9545:
--

1.8.x:
{noformat}
commit 13e4cd1c42ae88094f14d6b05cfb9832d4494193
Author: Greg Mann 
Date:   Tue Apr 23 22:25:29 2019 -0700

Transitioned tasks when an unreachable agent is marked as gone.

This patch updates the master code responsible for marking
agents as gone to properly transition tasks on agents which
were previously marked as unreachable.

Review: https://reviews.apache.org/r/70519/
{noformat}
{noformat}
commit 6f90cc334701fad10e721312cd4cbd0690e1c6ec
Author: Greg Mann 
Date:   Tue Apr 23 22:25:21 2019 -0700

Fixed a memory leak in the master's 'removeTask()' helper.

Previously, all removed tasks were added to the
`slaves.unreachableTasks` map. This patch adds a conditional
so that removed tasks are only added to that structure when
they are being marked unreachable.

Review: https://reviews.apache.org/r/70518/
{noformat}

> Marking an unreachable agent as gone should transition the tasks to terminal 
> state
> --
>
> Key: MESOS-9545
> URL: https://issues.apache.org/jira/browse/MESOS-9545
> Project: Mesos
> Issue Type: Improvement
> Reporter: Vinod Kone
> Assignee: Greg Mann
> Priority: Major
> Labels: foundations
> Fix For: 1.9.0
>
>
> If an unreachable agent is marked as gone, currently master just marks that 
> agent in the registry but doesn't do anything about its tasks. So the tasks 
> are in UNREACHABLE state in the master forever, until the master fails over. 
> This is not great UX. We should transition these to terminal state instead.
> This fix should also include a test to verify.





[jira] [Created] (MESOS-9946) DefaultExecutorTest.ROOT_INTERNET_CURL_DockerTaskWithFileURI is flaky

2019-08-20 Thread Greg Mann (Jira)
Greg Mann created MESOS-9946:


 Summary: DefaultExecutorTest.ROOT_INTERNET_CURL_DockerTaskWithFileURI is flaky
 Key: MESOS-9946
 URL: https://issues.apache.org/jira/browse/MESOS-9946
 Project: Mesos
 Issue Type: Bug
 Components: test
 Reporter: Greg Mann


Observed this on a 1.8.x build. I suspect it's due to a slow image pull based 
on the logs.





[jira] [Created] (MESOS-9945) Use streaming response in the checker process

2019-08-20 Thread Greg Mann (Jira)
Greg Mann created MESOS-9945:


 Summary: Use streaming response in the checker process
 Key: MESOS-9945
 URL: https://issues.apache.org/jira/browse/MESOS-9945
 Project: Mesos
 Issue Type: Improvement
 Reporter: Greg Mann


Because we do not currently use a streaming response for nested container 
command health checks in the checker process, we are not able to display the 
output of failed checks (MESOS-7903), and we are not able to begin the health 
check timeout at the appropriate moment (MESOS-9944).

We should update the checker process to use a streaming response for the 
LAUNCH_NESTED_CONTAINER_SESSION call that it uses to initiate command health 
checks.





[jira] [Created] (MESOS-9944) Command health check timeout begins too early

2019-08-20 Thread Greg Mann (Jira)
Greg Mann created MESOS-9944:


 Summary: Command health check timeout begins too early
 Key: MESOS-9944
 URL: https://issues.apache.org/jira/browse/MESOS-9944
 Project: Mesos
 Issue Type: Bug
 Components: agent
 Affects Versions: 1.9.0
 Reporter: Greg Mann


The checker process begins the timer for the command health check timeout when 
the LAUNCH_NESTED_CONTAINER_SESSION request is first sent, which means any 
delay in the execution of the health check command is included in the health 
check timeout. This can be an issue when the agent is under heavy load, and it 
may take a few seconds for the health check command to be run.

Once we have a streaming response for the ATTACH_CONTAINER_OUTPUT call which 
follows the nested container launch, we can initiate the health check timeout 
once the first byte of the response is received; this is a more accurate signal 
that the health check command has begun running.
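
To illustrate the difference, here is a small simulation in Python. It is only
a sketch: `run_check` and the `events` list are hypothetical, the streamed
response is reduced to timestamped chunks, and the timeout is evaluated when a
chunk arrives rather than by a real timer.

```python
def run_check(events, timeout, start_on_first_byte):
    """Return "passed" or "timed out" for one simulated command health check.

    `events` is a list of (seconds_since_request, chunk) pairs standing in
    for a streamed response; the last event marks the end of the output.
    """
    # Current behaviour: the timer starts when the request is sent (t = 0).
    # Proposed behaviour: the timer starts when the first byte arrives.
    timer_start = None if start_on_first_byte else 0.0
    for at, chunk in events:
        if timer_start is None and chunk:
            timer_start = at  # first byte: the command has started running
        if timer_start is not None and at - timer_start > timeout:
            return "timed out"
    return "passed"


# A loaded agent: 4 s pass before the command starts, then it runs for 2 s.
events = [(4.0, b"checking"), (6.0, b" ok\n")]
print(run_check(events, timeout=5.0, start_on_first_byte=False))  # timed out
print(run_check(events, timeout=5.0, start_on_first_byte=True))   # passed
```

With a 5 s timeout, the same 2 s command fails when the 4 s launch delay is
counted against it and passes when the timer starts at the first byte.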





[jira] [Commented] (MESOS-9836) Docker containerizer overwrites `/mesos/slave` cgroups.

2019-08-20 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911305#comment-16911305
 ] 

Andrei Budnik commented on MESOS-9836:
--

Shall we deprecate the option to run a custom executor in a Docker container? 
If no one responds to our proposal on the dev@ and user@ mailing lists, then we 
can safely deprecate this feature.

> Docker containerizer overwrites `/mesos/slave` cgroups.
> ---
>
> Key: MESOS-9836
> URL: https://issues.apache.org/jira/browse/MESOS-9836
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Reporter: Chun-Hung Hsiao
> Priority: Critical
> Labels: docker, mesosphere
>
> The following bug was observed on our internal testing cluster.
> The docker containerizer launched a container on an agent:
> {noformat}
> I0523 06:00:53.888579 21815 docker.cpp:1195] Starting container 
> 'f69c8a8c-eba4-4494-a305-0956a44a6ad2' for task 
> 'apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1' (and executor 
> 'apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1') of framework 
> 415284b7-2967-407d-b66f-f445e93f064e-0011
> I0523 06:00:54.524171 21815 docker.cpp:783] Checkpointing pid 13716 to 
> '/var/lib/mesos/slave/meta/slaves/60c42ab7-eb1a-4cec-b03d-ea06bff00c3f-S2/frameworks/415284b7-2967-407d-b66f-f445e93f064e-0011/executors/apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1/runs/f69c8a8c-eba4-4494-a305-0956a44a6ad2/pids/forked.pid'
> {noformat}
> After the container was launched, the docker containerizer did a {{docker 
> inspect}} on the container and cached the pid:
>  
> [https://github.com/apache/mesos/blob/0c431dd60ae39138cc7e8b099d41ad794c02c9a9/src/slave/containerizer/docker.cpp#L1764]
>  The pid should be slightly greater than 13716.
> The docker executor sent a {{TASK_FINISHED}} status update around 16 minutes 
> later:
> {noformat}
> I0523 06:16:17.287595 21809 slave.cpp:5566] Handling status update 
> TASK_FINISHED (Status UUID: 4e00b786-b773-46cd-8327-c7deb08f1de9) for task 
> apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1 of framework 
> 415284b7-2967-407d-b66f-f445e93f064e-0011 from executor(1)@172.31.1.7:36244
> {noformat}
> After receiving the terminal status update, the agent asked the docker 
> containerizer to update {{cpu.cfs_period_us}}, {{cpu.cfs_quota_us}} and 
> {{memory.soft_limit_in_bytes}} of the container through the cached pid:
>  
> [https://github.com/apache/mesos/blob/0c431dd60ae39138cc7e8b099d41ad794c02c9a9/src/slave/containerizer/docker.cpp#L1696]
> {noformat}
> I0523 06:16:17.290447 21815 docker.cpp:1868] Updated 'cpu.shares' to 102 at 
> /sys/fs/cgroup/cpu,cpuacct/mesos/slave for container 
> f69c8a8c-eba4-4494-a305-0956a44a6ad2
> I0523 06:16:17.290660 21815 docker.cpp:1895] Updated 'cpu.cfs_period_us' to 
> 100ms and 'cpu.cfs_quota_us' to 10ms (cpus 0.1) for container 
> f69c8a8c-eba4-4494-a305-0956a44a6ad2
> I0523 06:16:17.889816 21815 docker.cpp:1937] Updated 
> 'memory.soft_limit_in_bytes' to 32MB for container 
> f69c8a8c-eba4-4494-a305-0956a44a6ad2
> {noformat}
> Note that the cgroup of {{cpu.shares}} was {{/mesos/slave}}. This was 
> possibly because the pid got reused over the 16 minutes:
> {noformat}
> # zgrep 'systemd.cpp:98\]' /var/log/mesos/archive/mesos-agent.log.12.gz
> ...
> I0523 06:00:54.525178 21815 systemd.cpp:98] Assigned child process '13716' to 
> 'mesos_executors.slice'
> I0523 06:00:55.078546 21808 systemd.cpp:98] Assigned child process '13798' to 
> 'mesos_executors.slice'
> I0523 06:00:55.134096 21808 systemd.cpp:98] Assigned child process '13799' to 
> 'mesos_executors.slice'
> ...
> I0523 06:06:30.997439 21808 systemd.cpp:98] Assigned child process '32689' to 
> 'mesos_executors.slice'
> I0523 06:06:31.050976 21808 systemd.cpp:98] Assigned child process '32690' to 
> 'mesos_executors.slice'
> I0523 06:06:31.110514 21815 systemd.cpp:98] Assigned child process '32692' to 
> 'mesos_executors.slice'
> I0523 06:06:33.143726 21818 systemd.cpp:98] Assigned child process '446' to 
> 'mesos_executors.slice'
> I0523 06:06:33.196251 21818 systemd.cpp:98] Assigned child process '447' to 
> 'mesos_executors.slice'
> I0523 06:06:33.266332 21816 systemd.cpp:98] Assigned child process '449' to 
> 'mesos_executors.slice'
> ...
> I0523 06:09:34.870056 21808 systemd.cpp:98] Assigned child process '13717' to 
> 'mesos_executors.slice'
> I0523 06:09:34.937762 21813 systemd.cpp:98] Assigned child process '13744' to 
> 'mesos_executors.slice'
> I0523 06:09:35.073971 21817 systemd.cpp:98] Assigned child process '13754' to 
> 'mesos_executors.slice'
> ...
> {noformat}
> It was highly likely that the container itself exited around 06:09:35, way 
> before the docker executor detected and reported the terminal status update, 
> and then its pid was reused by 

[jira] [Commented] (MESOS-9836) Docker containerizer overwrites `/mesos/slave` cgroups.

2019-08-20 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911293#comment-16911293
 ] 

Qian Zhang commented on MESOS-9836:
---

I think there is a case where we need to update the Docker container's CPU and 
memory directly in the Docker containerizer: a custom executor is launched as a 
Docker container and launches tasks as processes inside that container. In this 
case, every time a new task is sent to the custom executor or a running task 
terminates, the Docker containerizer needs to update the Docker container's 
CPU and memory in cgroups.
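
For reference, a rough sketch (in Python) of how the CPU values written during
such an update could be derived from the container's CPU allocation. The
constants are assumptions chosen to agree with the log lines quoted in the
issue description ('cpu.shares' 102, period 100ms and quota 10ms for cpus 0.1);
the function name is hypothetical.

```python
CPU_SHARES_PER_CPU = 1024   # assumed scaling factor
MIN_CPU_SHARES = 2          # assumed lower bound
CFS_PERIOD_US = 100_000     # 100ms, as reported in the quoted log
MIN_CFS_QUOTA_US = 1_000    # assumed lower bound (1ms)


def cpu_cgroup_values(cpus):
    """Map a fractional CPU allocation to cgroup CPU settings."""
    shares = max(int(CPU_SHARES_PER_CPU * cpus), MIN_CPU_SHARES)
    quota_us = max(round(CFS_PERIOD_US * cpus), MIN_CFS_QUOTA_US)
    return {
        "cpu.shares": shares,
        "cpu.cfs_period_us": CFS_PERIOD_US,
        "cpu.cfs_quota_us": quota_us,
    }


print(cpu_cgroup_values(0.1))
# {'cpu.shares': 102, 'cpu.cfs_period_us': 100000, 'cpu.cfs_quota_us': 10000}
```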

> Docker containerizer overwrites `/mesos/slave` cgroups.
> ---
>
> Key: MESOS-9836
> URL: https://issues.apache.org/jira/browse/MESOS-9836
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Reporter: Chun-Hung Hsiao
> Priority: Critical
> Labels: docker, mesosphere
>
> The following bug was observed on our internal testing cluster.
> The docker containerizer launched a container on an agent:
> {noformat}
> I0523 06:00:53.888579 21815 docker.cpp:1195] Starting container 
> 'f69c8a8c-eba4-4494-a305-0956a44a6ad2' for task 
> 'apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1' (and executor 
> 'apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1') of framework 
> 415284b7-2967-407d-b66f-f445e93f064e-0011
> I0523 06:00:54.524171 21815 docker.cpp:783] Checkpointing pid 13716 to 
> '/var/lib/mesos/slave/meta/slaves/60c42ab7-eb1a-4cec-b03d-ea06bff00c3f-S2/frameworks/415284b7-2967-407d-b66f-f445e93f064e-0011/executors/apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1/runs/f69c8a8c-eba4-4494-a305-0956a44a6ad2/pids/forked.pid'
> {noformat}
> After the container was launched, the docker containerizer did a {{docker 
> inspect}} on the container and cached the pid:
>  
> [https://github.com/apache/mesos/blob/0c431dd60ae39138cc7e8b099d41ad794c02c9a9/src/slave/containerizer/docker.cpp#L1764]
>  The pid should be slightly greater than 13716.
> The docker executor sent a {{TASK_FINISHED}} status update around 16 minutes 
> later:
> {noformat}
> I0523 06:16:17.287595 21809 slave.cpp:5566] Handling status update 
> TASK_FINISHED (Status UUID: 4e00b786-b773-46cd-8327-c7deb08f1de9) for task 
> apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1 of framework 
> 415284b7-2967-407d-b66f-f445e93f064e-0011 from executor(1)@172.31.1.7:36244
> {noformat}
> After receiving the terminal status update, the agent asked the docker 
> containerizer to update {{cpu.cfs_period_us}}, {{cpu.cfs_quota_us}} and 
> {{memory.soft_limit_in_bytes}} of the container through the cached pid:
>  
> [https://github.com/apache/mesos/blob/0c431dd60ae39138cc7e8b099d41ad794c02c9a9/src/slave/containerizer/docker.cpp#L1696]
> {noformat}
> I0523 06:16:17.290447 21815 docker.cpp:1868] Updated 'cpu.shares' to 102 at 
> /sys/fs/cgroup/cpu,cpuacct/mesos/slave for container 
> f69c8a8c-eba4-4494-a305-0956a44a6ad2
> I0523 06:16:17.290660 21815 docker.cpp:1895] Updated 'cpu.cfs_period_us' to 
> 100ms and 'cpu.cfs_quota_us' to 10ms (cpus 0.1) for container 
> f69c8a8c-eba4-4494-a305-0956a44a6ad2
> I0523 06:16:17.889816 21815 docker.cpp:1937] Updated 
> 'memory.soft_limit_in_bytes' to 32MB for container 
> f69c8a8c-eba4-4494-a305-0956a44a6ad2
> {noformat}
> Note that the cgroup of {{cpu.shares}} was {{/mesos/slave}}. This was 
> possibly because the pid got reused over the 16 minutes:
> {noformat}
> # zgrep 'systemd.cpp:98\]' /var/log/mesos/archive/mesos-agent.log.12.gz
> ...
> I0523 06:00:54.525178 21815 systemd.cpp:98] Assigned child process '13716' to 
> 'mesos_executors.slice'
> I0523 06:00:55.078546 21808 systemd.cpp:98] Assigned child process '13798' to 
> 'mesos_executors.slice'
> I0523 06:00:55.134096 21808 systemd.cpp:98] Assigned child process '13799' to 
> 'mesos_executors.slice'
> ...
> I0523 06:06:30.997439 21808 systemd.cpp:98] Assigned child process '32689' to 
> 'mesos_executors.slice'
> I0523 06:06:31.050976 21808 systemd.cpp:98] Assigned child process '32690' to 
> 'mesos_executors.slice'
> I0523 06:06:31.110514 21815 systemd.cpp:98] Assigned child process '32692' to 
> 'mesos_executors.slice'
> I0523 06:06:33.143726 21818 systemd.cpp:98] Assigned child process '446' to 
> 'mesos_executors.slice'
> I0523 06:06:33.196251 21818 systemd.cpp:98] Assigned child process '447' to 
> 'mesos_executors.slice'
> I0523 06:06:33.266332 21816 systemd.cpp:98] Assigned child process '449' to 
> 'mesos_executors.slice'
> ...
> I0523 06:09:34.870056 21808 systemd.cpp:98] Assigned child process '13717' to 
> 'mesos_executors.slice'
> I0523 06:09:34.937762 21813 systemd.cpp:98] Assigned child process '13744' to 
> 'mesos_executors.slice'
> I0523 06:09:35.073971 21817 systemd.cpp:98] Assigned child process '13754' to 
> 

[jira] [Assigned] (MESOS-9482) Resource provider manager can crash on invalid data from resource providers

2019-08-20 Thread Benjamin Bannier (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier reassigned MESOS-9482:
---

Assignee: Benjamin Bannier

> Resource provider manager can crash on invalid data from resource providers
> ---
>
> Key: MESOS-9482
> URL: https://issues.apache.org/jira/browse/MESOS-9482
> Project: Mesos
> Issue Type: Bug
> Reporter: Benjamin Bannier
> Assignee: Benjamin Bannier
> Priority: Major
> Labels: mesosphere, mesosphere-dss-post-ga, storage
>
> The resource provider manager code currently contains a number of assertions 
> which will crash the manager (and its agent) if some forms of invalid data 
> are received from a resource provider. This is dangerous since resource 
> providers are not necessarily part of Mesos-controlled code (they talk to the 
> manager over an HTTP API and could even be in external processes).
> Instead of crashing, the resource provider manager should disconnect the 
> resource providers in such scenarios.
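
A minimal sketch of the "disconnect instead of crash" pattern, in Python. The
class, its methods and the `resources` field are hypothetical and do not
mirror the actual resource provider manager API; the point is only that
validation failures are handled per provider rather than asserted on.

```python
class ResourceProviderManager:
    """Toy manager (hypothetical API) that tolerates invalid provider input."""

    def __init__(self):
        self.connected = {}  # provider id -> last accepted message

    def subscribe(self, provider_id):
        self.connected[provider_id] = None

    def handle_update(self, provider_id, message):
        error = self.validate(message)
        if error is not None:
            # An assertion here would take down the manager (and its agent).
            # Dropping only the offending provider contains the failure.
            self.connected.pop(provider_id, None)
            return f"disconnected {provider_id}: {error}"
        self.connected[provider_id] = message
        return "accepted"

    @staticmethod
    def validate(message):
        # Illustrative checks only; real validation would be more thorough.
        if not isinstance(message, dict):
            return "message is not an object"
        if not isinstance(message.get("resources"), list):
            return "missing or malformed 'resources'"
        return None


manager = ResourceProviderManager()
manager.subscribe("rp-good")
manager.subscribe("rp-bad")
print(manager.handle_update("rp-good", {"resources": []}))
print(manager.handle_update("rp-bad", {"resources": "oops"}))
print(sorted(manager.connected))
```

The well-behaved provider stays connected while the one sending malformed data
is dropped, and the manager itself keeps running.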


