[jira] [Comment Edited] (MESOS-10139) Mesos agent host may become unresponsive when it is under low memory pressure

Qian Zhang (Jira) Tue, 09 Jun 2020 18:38:49 -0700


    [ 
https://issues.apache.org/jira/browse/MESOS-10139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129923#comment-17129923
 ]


Qian Zhang edited comment on MESOS-10139 at 6/10/20, 1:37 AM:
--------------------------------------------------------------

When this issue happens, `top` shows a high `wa` (I/O wait) value, which is 
most likely caused by `kswapd0`:
{code:java}
top - 01:18:41 up  1:23,  4 users,  load average: 73.47, 38.72, 41.05
Tasks: 227 total,   3 running, 223 sleeping,   0 stopped,   1 zombie
%Cpu(s):  1.4 us,  3.0 sy,  0.0 ni, 48.7 id, 46.9 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  31211.2 total,    208.8 free,  30836.6 used,    165.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.      1.4 avail Mem   
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  103 root      20   0       0      0      0 R 100.0  0.0   2:40.74 kswapd0
...

{code}
Please note that swap is NOT enabled on the agent host, so it seems `kswapd0` 
is trying to page out the executable code of some processes, and the OOM killer 
is not triggered at all. That means we may be hitting [this 
issue|https://askubuntu.com/questions/432809/why-is-kswapd0-running-on-a-computer-with-no-swap/1134491#1134491]:
{quote}It is a well known problem that when Linux runs out of memory it can 
enter swap loops instead of doing what it should be doing, killing processes to 
free up ram. There are an OOM (Out of Memory) killer that does this but only if 
Swap and RAM are full.

However this should not really be a problem. If there are a bunch of offending 
processes, for example Firefox and Chrome, each with tabs that are both using 
and grabbing memory, then these processes will cause swap read back. Linux then 
enters a loop where the same memory are being moved back and forth between 
memory and hard drive. This in turn causes priority inversion where swapping a 
few processes back and forth makes the system unresponsive.
{quote}
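
To make the symptom above easier to spot, here is a minimal Python sketch that reads the summary lines printed by `top` and flags the combination seen on this host: no swap, high `wa`, and almost no available memory. The function names and both thresholds are assumptions chosen for illustration; they do not come from Mesos or from the issue.

```python
import re


def parse_top_summary(text):
    """Extract iowait %, swap total and available memory (MiB) from top's summary."""
    wa = re.search(r"([\d.]+)\s+wa", text)
    swap = re.search(r"MiB Swap:\s*([\d.]+)\s+total", text)
    avail = re.search(r"([\d.]+)\s+avail Mem", text)
    if not (wa and swap and avail):
        raise ValueError("unrecognized top summary format")
    return {
        "iowait_pct": float(wa.group(1)),
        "swap_total_mib": float(swap.group(1)),
        "avail_mib": float(avail.group(1)),
    }


def looks_like_reclaim_thrash(stats, wa_threshold=30.0, avail_threshold_mib=100.0):
    """Heuristic only: the thresholds are illustrative assumptions."""
    return (
        stats["swap_total_mib"] == 0.0
        and stats["iowait_pct"] >= wa_threshold
        and stats["avail_mib"] <= avail_threshold_mib
    )


# Summary lines copied from the top output in this comment.
SAMPLE = """\
%Cpu(s):  1.4 us,  3.0 sy,  0.0 ni, 48.7 id, 46.9 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  31211.2 total,    208.8 free,  30836.6 used,    165.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.      1.4 avail Mem
"""

if __name__ == "__main__":
    stats = parse_top_summary(SAMPLE)
    print(stats, looks_like_reclaim_thrash(stats))
```

This only classifies a snapshot. It does not show that `kswapd0` is the cause, and on a host that is already unresponsive such a check would have to run before the stall begins.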


was (Author: qianzhang):
When this issue happens, `top` shows a high `wa` (I/O wait) value, which is 
most likely caused by `kswapd0`:
{code:java}
top - 01:18:41 up  1:23,  4 users,  load average: 73.47, 38.72, 41.05
Tasks: 227 total,   3 running, 223 sleeping,   0 stopped,   1 zombie
%Cpu(s):  1.4 us,  3.0 sy,  0.0 ni, 48.7 id, 46.9 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  31211.2 total,    208.8 free,  30836.6 used,    165.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.      1.4 avail Mem   
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  103 root      20   0       0      0      0 R 100.0  0.0   2:40.74 kswapd0
...

{code}
Please note that swap is NOT enabled on the agent host, so it seems `kswapd0` 
is trying to page out the executable code of some processes, and the OOM killer 
is not triggered at all. 

> Mesos agent host may become unresponsive when it is under low memory pressure
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-10139
>                 URL: https://issues.apache.org/jira/browse/MESOS-10139
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Qian Zhang
>            Priority: Major
>
> When a user launches a task that uses a large amount of memory on an agent 
> host (e.g., a task that runs `stress --vm 1 --vm-bytes 29800M --vm-hang 0` on 
> an agent host which has 32GB of memory), the whole agent host becomes 
> unresponsive (no commands can be executed anymore, although the host is still 
> pingable). A few minutes later the Mesos master marks this agent as 
> unreachable and updates the state of all its tasks to `TASK_UNREACHABLE`.
> {code:java}
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.103382 15491 master.cpp:260] Scheduling 
> transition of agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 to UNREACHABLE 
> because of health check timeout
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.103612 15491 master.cpp:8592] Marking 
> agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: 
> health check timed out
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.108093 15495 master.cpp:8635] Marked 
> agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: 
> health check timed out
> …
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.108419 15495 master.cpp:11149] Updating 
> the state of task app10.instance-1f70be9f-9ef5-11ea-8981-9a93e42a6514._app.2 
> of framework 89d2d679-fa08-49be-94c3-880ebb595212-0000 (latest state: 
> TASK_UNREACHABLE, status update state: TASK_UNREACHABLE)
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.108865 15495 master.cpp:11149] Updating 
> the state of task app9.instance-954f91ad-9ef4-11ea-8981-9a93e42a6514._app.1 
> of framework 89d2d679-fa08-49be-94c3-880ebb595212-0000 (latest state: 
> TASK_UNREACHABLE, status update state: TASK_UNREACHABLE)
> ...{code}
>  
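
The master log excerpt in the description can be scanned for agents that were marked unreachable. The following is a minimal Python sketch, assuming the wrapped log lines have first been rejoined into one line per message. The regular expression is written against the message text shown in the excerpt only, and the names `MARKED_RE` and `unreachable_agents` are hypothetical.

```python
import re

# Matches "Marked agent <id> (<ip>) unreachable: <reason>" as it appears in
# the excerpt; "Marking agent ..." lines are deliberately not matched.
MARKED_RE = re.compile(r"Marked\s+agent\s+(\S+)\s+\((\S+)\)\s+unreachable:\s+(.+)")


def unreachable_agents(log_lines):
    """Return (agent_id, ip, reason) for every 'Marked agent' message."""
    events = []
    for line in log_lines:
        match = MARKED_RE.search(line)
        if match:
            events.append((match.group(1), match.group(2), match.group(3).strip()))
    return events


# Two messages from the excerpt, rejoined onto single lines.
SAMPLE_LINES = [
    "May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal "
    "mesos-master[15468]: I0526 02:13:31.103612 15491 master.cpp:8592] Marking "
    "agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: "
    "health check timed out",
    "May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal "
    "mesos-master[15468]: I0526 02:13:31.108093 15495 master.cpp:8635] Marked "
    "agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: "
    "health check timed out",
]

if __name__ == "__main__":
    for agent_id, ip, reason in unreachable_agents(SAMPLE_LINES):
        print(agent_id, ip, reason)
```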



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
