[
https://issues.apache.org/jira/browse/MESOS-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13754173#comment-13754173
]
Eric W. Biederman commented on MESOS-662:
-----------------------------------------
If the concern is reporting the OOM, that is simply a solvable synchronization
issue on slave teardown. The information is available: the kernel sends the
OOM notification and updates the appropriate status files in all cases.
When the kernel handles an OOM it kills processes in the cgroup one at a time
with SIGKILL until enough memory is available for the cgroup to continue, and
for each process it kills it logs a message, so there is no doubt about what
happened to that process.
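For concreteness, here is a minimal sketch of how userspace subscribes to
those per-cgroup OOM notifications under cgroup v1 (the
/sys/fs/cgroup/memory mount layout and the abbreviated error handling are
assumptions for illustration, not what mesos-slave actually does):

    /* Sketch: subscribing to cgroup-v1 memory OOM notifications via eventfd. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/eventfd.h>

    int watch_oom(const char *cgroup_dir) {
        char path[512], line[64];
        int efd = eventfd(0, 0);            /* kernel signals OOM events here */

        snprintf(path, sizeof(path), "%s/memory.oom_control", cgroup_dir);
        int ofd = open(path, O_RDONLY);

        /* Register by writing "<eventfd> <oom_control fd>" to
         * cgroup.event_control. */
        snprintf(path, sizeof(path), "%s/cgroup.event_control", cgroup_dir);
        int cfd = open(path, O_WRONLY);
        snprintf(line, sizeof(line), "%d %d", efd, ofd);
        write(cfd, line, strlen(line));
        close(cfd);

        uint64_t events;
        read(efd, &events, sizeof(events)); /* blocks until an OOM fires */
        printf("cgroup hit its memory limit (%llu event(s))\n",
               (unsigned long long)events);
        close(ofd);
        close(efd);
        return 0;
    }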
Which means that mesos-slave still needs to kill all of the processes in a
cgroup when the cgroup runs out of memory, and that most likely the processes
killed by the kernel will not be the executor, so races between reaping the
executor and OOM handling are unlikely to occur. It also means that if the
reaper wins the race because it was the executor that was killed, it will be
known that there was a deliberate SIGKILL sent to the process. Not that the
reaper process, which runs once a second, is likely to win a race with
anything.
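The slave-side kill amounts to something like the following sketch (the
cgroup-v1 path layout is an assumption, and a real implementation would
presumably freeze the cgroup first so killed processes cannot fork):

    /* Sketch: SIGKILL every process listed in the cgroup until none remain. */
    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>

    void kill_cgroup(const char *cgroup_dir) {
        char path[512];
        snprintf(path, sizeof(path), "%s/cgroup.procs", cgroup_dir);

        for (;;) {
            FILE *procs = fopen(path, "r");
            if (procs == NULL)
                return;                    /* cgroup already removed */
            int pid, remaining = 0;
            while (fscanf(procs, "%d", &pid) == 1) {
                kill((pid_t)pid, SIGKILL); /* same signal the kernel OOM killer uses */
                remaining++;
            }
            fclose(procs);
            if (remaining == 0)
                break;                     /* empty: the cgroup can now be destroyed */
        }
    }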
So the practical choice is really between the user debugging a TASK_LOST and
a hung box, and a TASK_KILLED with an indication that the executor was
explicitly killed (probably with a message from Mesos about the OOM, and
definitely a system log message about it). So I really don't see any valid
error-reporting reason for not letting the kernel kill processes. Even with
no other changes to the code there should not be any mysteries.
Furthermore mesos-slave is not set up to be a safe OOM killer. No steps are
taken to guarantee that mesos-slave will not allocate memory from the
operating system while handling an OOM condition, nor to guarantee that there
is free memory somewhere for mesos-slave to allocate from. In fact almost the
exact opposite is true: we spawn multiple libprocess processes to kill a
cgroup and free up memory, while busily logging random things from random
libprocess processes to disk. During a slave restart things are even worse,
as there is no mesos-slave to even try to kill anything, which might leave
too little memory to start a new copy of mesos-slave.
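To illustrate what "safe" would even require: a userspace OOM handler has to
pin its own pages and forgo allocation entirely while acting. A sketch of the
kind of precaution that is missing (illustrative only, not existing Mesos
code):

    /* Lock all current and future pages so the handler itself cannot
     * page-fault for new memory while the cgroup is under OOM. */
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");
            return 1;
        }
        /* From here on, the handler must also avoid malloc(), growing
         * stdio buffers, and anything else that demands new pages. */
        return 0;
    }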
Which is my long-winded way of saying: no, letting the kernel kill processes
will not significantly impact the debuggability of the system, and the
problems inherent in letting the kernel kill processes are much simpler to
solve than the problems inherent in getting mesos-slave to kill processes to
relieve a memory shortage.
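Mechanically, leaving the kernel OOM killer enabled is a one-character write
per cgroup. A sketch under cgroup v1 (the mount path is an assumption about
the local system):

    /* Write "0" (oom_kill_disable off) to memory.oom_control so the
     * kernel is allowed to kill on OOM. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int enable_kernel_oom_kill(const char *cgroup_dir) {
        char path[512];
        snprintf(path, sizeof(path), "%s/memory.oom_control", cgroup_dir);
        int fd = open(path, O_WRONLY);
        if (fd < 0)
            return -1;
        ssize_t n = write(fd, "0", 1);  /* 0 = let the kernel kill on OOM */
        close(fd);
        return n == 1 ? 0 : -1;
    }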
> Executor OOM could lead to a kernel hang
> ----------------------------------------
>
> Key: MESOS-662
> URL: https://issues.apache.org/jira/browse/MESOS-662
> Project: Mesos
> Issue Type: Bug
> Reporter: Vinod Kone
> Assignee: Benjamin Mahler
> Priority: Critical
> Fix For: 0.15.0
>
>
> We observed this in production at Twitter.
> An executor OOMed and the kernel put it to sleep instead of killing it,
> because the Mesos slave disables OOM kills. Mesos disables the kernel OOM
> killer so that it can take some action of its own. Currently the only action
> it takes is cleaning up the cgroup, but in the future the action could be to
> increase the memory limit.
> [6290807.554028] SysRq : Show Blocked State
> [6290807.554175] task PC stack pid father
> [6290807.554251] python2.6 D ffff88097b1c3158 0 31039 1 0x00000000
> [6290807.554255] ffff88120ae19b48 0000000000000082 0000000000000000 ffff88093ffffa08
> [6290807.554259] ffff88093fffed00 ffff88120ae18010 0000000000013300 0000000000013300
> [6290807.554263] 0000000000013300 ffff88120ae19fd8 0000000000013300 0000000000013300
> [6290807.554267] Call Trace:
> [6290807.554279] [<ffffffff814dfabd>] schedule+0x64/0x66
> [6290807.554285] [<ffffffff8113ad09>] mem_cgroup_handle_oom+0x132/0x21f
> [6290807.554289] [<ffffffff81138e62>] ? mem_cgroup_update_tree+0x165/0x165
> [6290807.554292] [<ffffffff8113aef5>] mem_cgroup_do_charge+0xff/0x124
> [6290807.554295] [<ffffffff8113b0ce>] __mem_cgroup_try_charge+0x1b4/0x298
> [6290807.554298] [<ffffffff8113b643>] mem_cgroup_charge_common+0x6a/0x91
> [6290807.554301] [<ffffffff8113b72f>] mem_cgroup_newpage_charge+0x23/0x25
> [6290807.554307] [<ffffffff8110c26e>] do_anonymous_page+0x169/0x29a
> [6290807.554311] [<ffffffff81110137>] handle_pte_fault+0x8d/0x1b1
> [6290807.554315] [<ffffffff8110a793>] ? anon_vma_interval_tree_insert+0x8a/0x8c
> [6290807.554319] [<ffffffff81113afe>] ? vma_adjust+0x50f/0x5b9
> [6290807.554324] [<ffffffff811a196d>] ? ext3_dx_readdir+0x181/0x1d7
> [6290807.554327] [<ffffffff81110489>] handle_mm_fault+0x22e/0x248
> [6290807.554332] [<ffffffff814e3c6a>] do_page_fault+0x367/0x3ae
> [6290807.554335] [<ffffffff811149f4>] ? do_brk+0x291/0x2f2
> [6290807.554339] [<ffffffff81141289>] ? __fput+0x1e7/0x1f6
> [6290807.554342] [<ffffffff814e0ba5>] page_fault+0x25/0x30
> A short-term solution is to enable the kernel OOM killer in cgroups (until
> we get around to adding support for soft memory limits in the cgroups
> isolator). The slave should still get an OOM notification and properly
> inform the frameworks of the OOM. One concern is that we don't know whether
> the kernel handling the OOM would cause problems with the cgroup cleanup
> done by the slave.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira