[ 
https://issues.apache.org/jira/browse/MESOS-10139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129923#comment-17129923
 ] 

Qian Zhang commented on MESOS-10139:
------------------------------------

When this issue happens, the `top` command shows a very high `wa` (I/O wait) 
percentage, which appears to be caused by `kswapd0`:
{code:java}
top - 01:18:41 up  1:23,  4 users,  load average: 73.47, 38.72, 41.05
Tasks: 227 total,   3 running, 223 sleeping,   0 stopped,   1 zombie
%Cpu(s):  1.4 us,  3.0 sy,  0.0 ni, 48.7 id, 46.9 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  31211.2 total,    208.8 free,  30836.6 used,    165.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.      1.4 avail Mem   
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  103 root      20   0        0      0      0 R 100.0  0.0   2:40.74 kswapd0
...

{code}
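To corroborate this reading, a few diagnostic commands can be run on the affected 
host (a sketch, assuming a Linux host; `/proc/pressure/memory` requires a kernel 
with PSI support, 4.20+):
{code:bash}
# Confirm that no swap device is configured (no output means none):
swapon --show

# The Swap row should report 0 total, matching the top output above:
free -m

# With no swap, a high 'bi' (blocks read in) column together with high
# 'wa' suggests file-backed (executable) pages being re-read from disk
# after kswapd0 evicted them:
vmstat 1 5

# On PSI-enabled kernels, a sustained high "full" avg10 value means all
# runnable tasks are stalled on memory reclaim at once:
cat /proc/pressure/memory
{code}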
Please note that swap is NOT enabled on the agent host, so it seems `kswapd0` 
keeps evicting (paging out) the file-backed executable code of running 
processes, and the OOM killer is not triggered at all.
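This would also explain why the host hangs instead of recovering: the anonymous 
allocation fits (barely) under total RAM, so no allocation hard-fails and the 
kernel thrashes the page cache rather than invoking the OOM killer. As an 
illustration (a hypothetical sketch, assuming systemd with cgroups v2; the 4G 
cap is arbitrary), the same workload run under an explicit memory cap is 
OOM-killed promptly instead of stalling the whole host:
{code:bash}
# Uncapped (the scenario in this ticket): ~29 GB of anonymous memory on a
# 32 GB host thrashes file-backed pages instead of triggering the OOM killer:
#   stress --vm 1 --vm-bytes 29800M --vm-hang 0

# Capped: a hard cgroup memory limit makes the kernel OOM-kill the
# workload as soon as it exceeds the cap:
systemd-run --scope -p MemoryMax=4G stress --vm 1 --vm-bytes 29800M --vm-hang 0
{code}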

> Mesos agent host may become unresponsive when it is under low memory pressure
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-10139
>                 URL: https://issues.apache.org/jira/browse/MESOS-10139
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Qian Zhang
>            Priority: Major
>
> When a user launches a task that consumes a large amount of memory on an agent 
> host (e.g., a task running `stress --vm 1 --vm-bytes 29800M --vm-hang 0` on an 
> agent host that has 32 GB of memory), the whole agent host becomes 
> unresponsive (no commands can be executed anymore, but it is still pingable). A 
> few minutes later the Mesos master will mark this agent as unreachable and 
> update all of its tasks' states to `TASK_UNREACHABLE`.
> {code:java}
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.103382 15491 master.cpp:260] Scheduling 
> transition of agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 to UNREACHABLE 
> because of health check timeout
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.103612 15491 master.cpp:8592] Marking 
> agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: 
> health check timed out
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.108093 15495 master.cpp:8635] Marked 
> agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: 
> health check timed out
> …
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.108419 15495 master.cpp:11149] Updating 
> the state of task app10.instance-1f70be9f-9ef5-11ea-8981-9a93e42a6514._app.2 
> of framework 89d2d679-fa08-49be-94c3-880ebb595212-0000 (latest state: 
> TASK_UNREACHABLE, status update state: TASK_UNREACHABLE)
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal 
> mesos-master[15468]: I0526 02:13:31.108865 15495 master.cpp:11149] Updating 
> the state of task app9.instance-954f91ad-9ef4-11ea-8981-9a93e42a6514._app.1 
> of framework 89d2d679-fa08-49be-94c3-880ebb595212-0000 (latest state: 
> TASK_UNREACHABLE, status update state: TASK_UNREACHABLE)
> ...{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
