[
https://issues.apache.org/jira/browse/MESOS-10139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129923#comment-17129923
]
Qian Zhang commented on MESOS-10139:
------------------------------------
When this issue happens, `top` shows a high `wa` (I/O wait) value, which
appears to be caused by `kswapd0`:
{code:java}
top - 01:18:41 up 1:23, 4 users, load average: 73.47, 38.72, 41.05
Tasks: 227 total, 3 running, 223 sleeping, 0 stopped, 1 zombie
%Cpu(s): 1.4 us, 3.0 sy, 0.0 ni, 48.7 id, 46.9 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 31211.2 total, 208.8 free, 30836.6 used, 165.8 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 1.4 avail Mem
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  103 root      20   0       0      0      0 R 100.0   0.0   2:40.74 kswapd0
...
{code}
Please note that swap is NOT enabled on the agent host, so it seems `kswapd0`
is trying to page out the executable code of some processes, and the OOM
killer is never triggered.
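
As a rough illustration (not part of Mesos), the state seen in the `top`
summary above could be detected by a small script. The sketch below is Python;
the function names and the thresholds (30% `wa`, 100 MiB available) are
assumptions chosen for illustration, not values taken from Mesos or the kernel.

```python
import re

# Summary lines copied from the `top` output in this comment.
SAMPLE = """%Cpu(s): 1.4 us, 3.0 sy, 0.0 ni, 48.7 id, 46.9 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 31211.2 total, 208.8 free, 30836.6 used, 165.8 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 1.4 avail Mem"""


def parse_top_summary(text):
    """Extract iowait (%), available memory (MiB) and swap size (MiB)."""
    wa = re.search(r"([\d.]+)\s+wa\b", text)
    avail = re.search(r"([\d.]+)\s+avail Mem", text)
    swap_total = re.search(r"MiB Swap:\s*([\d.]+)\s+total", text)
    if not (wa and avail and swap_total):
        raise ValueError("unrecognized top summary format")
    return {
        "wa_pct": float(wa.group(1)),
        "avail_mib": float(avail.group(1)),
        "swap_total_mib": float(swap_total.group(1)),
    }


def looks_like_reclaim_thrash(summary, wa_threshold=30.0, avail_threshold_mib=100.0):
    """Heuristic (hypothetical thresholds): no swap, high iowait, almost no
    available memory, i.e. the pattern reported in this comment."""
    return (
        summary["swap_total_mib"] == 0.0
        and summary["wa_pct"] >= wa_threshold
        and summary["avail_mib"] <= avail_threshold_mib
    )


if __name__ == "__main__":
    summary = parse_top_summary(SAMPLE)
    print(summary, looks_like_reclaim_thrash(summary))
```

The check only reads the numbers `top` already prints; it does not confirm
that `kswapd0` is the cause, which remains an inference from the process list.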
> Mesos agent host may become unresponsive when it is under low memory pressure
> -----------------------------------------------------------------------------
>
> Key: MESOS-10139
> URL: https://issues.apache.org/jira/browse/MESOS-10139
> Project: Mesos
> Issue Type: Bug
> Reporter: Qian Zhang
> Priority: Major
>
> When a user launches a task that uses a large amount of memory on an agent
> host (e.g., a task that runs `stress --vm 1 --vm-bytes 29800M --vm-hang 0` on
> an agent host that has 32GB of memory), the whole agent host becomes
> unresponsive (no commands can be executed anymore, but it is still pingable).
> A few minutes later, the Mesos master marks this agent as unreachable and
> updates the state of all its tasks to `TASK_UNREACHABLE`.
> {code:java}
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal
> mesos-master[15468]: I0526 02:13:31.103382 15491 master.cpp:260] Scheduling
> transition of agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 to UNREACHABLE
> because of health check timeout
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal
> mesos-master[15468]: I0526 02:13:31.103612 15491 master.cpp:8592] Marking
> agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable:
> health check timed out
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal
> mesos-master[15468]: I0526 02:13:31.108093 15495 master.cpp:8635] Marked
> agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable:
> health check timed out
> …
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal
> mesos-master[15468]: I0526 02:13:31.108419 15495 master.cpp:11149] Updating
> the state of task app10.instance-1f70be9f-9ef5-11ea-8981-9a93e42a6514._app.2
> of framework 89d2d679-fa08-49be-94c3-880ebb595212-0000 (latest state:
> TASK_UNREACHABLE, status update state: TASK_UNREACHABLE)
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal
> mesos-master[15468]: I0526 02:13:31.108865 15495 master.cpp:11149] Updating
> the state of task app9.instance-954f91ad-9ef4-11ea-8981-9a93e42a6514._app.1
> of framework 89d2d679-fa08-49be-94c3-880ebb595212-0000 (latest state:
> TASK_UNREACHABLE, status update state: TASK_UNREACHABLE)
> ...{code}
>