[ 
https://issues.apache.org/jira/browse/MESOS-10139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129929#comment-17129929
 ] 

Qian Zhang commented on MESOS-10139:
------------------------------------

I asked a 
[question|https://unix.stackexchange.com/questions/591566/why-does-linux-become-unresponsive-when-a-large-number-of-memory-is-used-oom-ca]
 on Stack Exchange about this issue and found that it has actually been discussed 
in the Linux community for a long time. The usual solution is to run a daemon that 
monitors memory pressure and, when the system comes under low-memory conditions, 
kills a memory-hog process (or triggers the OOM killer to do so).
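
For illustration, a minimal user-space watchdog along those lines could look like the sketch below. This is only a sketch, assuming a Linux host, root privileges, and a sysrq configuration that allows triggering the OOM killer; real daemons such as earlyoom or oomd are much more careful about thresholds and victim selection.
{code:cpp}
// Minimal memory-pressure watchdog sketch (assumes Linux and root).
// Polls MemAvailable in /proc/meminfo and, when it falls below a
// threshold, asks the kernel to run its OOM killer via sysrq.
#include <chrono>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <thread>

// Returns MemAvailable in kB, or -1 if it cannot be read.
long memAvailableKb() {
  std::ifstream meminfo("/proc/meminfo");
  std::string line;
  while (std::getline(meminfo, line)) {
    if (line.rfind("MemAvailable:", 0) == 0) {
      std::istringstream fields(line.substr(13));
      long kb = -1;
      fields >> kb;
      return kb;
    }
  }
  return -1;
}

int main() {
  const long thresholdKb = 512 * 1024;  // Act when < 512MB is available.

  while (true) {
    long available = memAvailableKb();
    if (available >= 0 && available < thresholdKb) {
      std::cerr << "Low memory: " << available
                << " kB available, invoking the kernel OOM killer" << std::endl;

      // SysRq 'f' asks the kernel to run its OOM killer; this requires root
      // and a kernel.sysrq setting that permits it.
      std::ofstream("/proc/sysrq-trigger") << "f" << std::flush;
    }
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
}
{code}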

[~greggomann] also suggests that we could fix this issue by setting 
`/sys/fs/cgroup/memory/mesos/memory.limit_in_bytes` to the allocatable memory 
of the agent (rather than leaving it at the default value) and also ensuring that 
`memory.use_hierarchy` is enabled. In addition, the [current 
logic|https://github.com/apache/mesos/blob/1.10.0/src/slave/containerizer/containerizer.cpp#L145:L158]
 for determining the allocatable memory of an agent node may need to be changed: 
currently, in most cases we just leave 1GB for system services and offer all the 
remaining memory to frameworks, but for an agent node with relatively large 
memory that may not be enough. For example, on an agent node with 32GB of memory, 
the node may become unresponsive once tasks have used 29GB. So instead of an 
absolute value (1GB), we may want to adopt a relative ratio, e.g. leave 10% of 
memory for system services and offer the other 90% to frameworks. But we need to 
figure out a reasonable and safe ratio.
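
Just to sketch that idea: the snippet below computes the allocatable memory from a ratio with a 1GB floor. The 10% ratio and the floor are placeholders rather than values we have agreed on, and `allocatableMemory` is a hypothetical helper, not the existing containerizer code.
{code:cpp}
#include <algorithm>
#include <cstdint>
#include <iostream>

constexpr uint64_t GB = 1ULL << 30;

// Reserve the larger of 1GB and 10% of total memory for the OS and
// system services; offer the rest to frameworks. On very small hosts,
// fall back to offering half of the total.
uint64_t allocatableMemory(uint64_t totalBytes) {
  uint64_t reserved = std::max<uint64_t>(GB, totalBytes / 10);
  return totalBytes > reserved ? totalBytes - reserved : totalBytes / 2;
}

int main() {
  // A 32GB agent would offer ~28.8GB instead of 31GB; the agent could
  // then also write this value into the hierarchy's limit, e.g.
  // /sys/fs/cgroup/memory/mesos/memory.limit_in_bytes.
  std::cout << allocatableMemory(32 * GB) / double(GB) << " GB" << std::endl;
}
{code}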

> Mesos agent host may become unresponsive when it is under low memory pressure
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-10139
>                 URL: https://issues.apache.org/jira/browse/MESOS-10139
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Qian Zhang
>            Priority: Major
>
> When a user launches a task that uses a large amount of memory on an agent host 
> (e.g., a task running `stress --vm 1 --vm-bytes 29800M --vm-hang 0` on an agent 
> host which has 32GB of memory), the whole agent host will become unresponsive 
> (no commands can be executed anymore, but it is still pingable). A few minutes 
> later the Mesos master will mark this agent as unreachable and update the state 
> of all its tasks to `TASK_UNREACHABLE`.
> {code:java}
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.103382 15491 master.cpp:260] Scheduling 
> transition of agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 to UNREACHABLE 
> because of health check timeout
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.103612 15491 master.cpp:8592] Marking 
> agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: 
> health check timed out
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.108093 15495 master.cpp:8635] Marked 
> agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: 
> health check timed out
> …
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.108419 15495 master.cpp:11149] Updating 
> the state of task app10.instance-1f70be9f-9ef5-11ea-8981-9a93e42a6514._app.2 
> of framework 89d2d679-fa08-49be-94c3-880ebb595212-0000 (latest state: 
> TASK_UNREACHABLE, status update state: TASK_UNREACHABLE)
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.108865 15495 master.cpp:11149] Updating 
> the state of task app9.instance-954f91ad-9ef4-11ea-8981-9a93e42a6514._app.1 
> of framework 89d2d679-fa08-49be-94c3-880ebb595212-0000 (latest state: 
> TASK_UNREACHABLE, status update state: TASK_UNREACHABLE)
> ...{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
