[
https://issues.apache.org/jira/browse/MESOS-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13754173#comment-13754173
]
Eric W. Biederman commented on MESOS-662:
-----------------------------------------
If the concern is reporting the OOM, that is simply a solvable synchronization
issue on slave teardown. The information is available: the kernel sends the
OOM notification and updates the appropriate status files in all cases.
When the kernel handles an OOM it kills processes in the cgroup one at a time
with SIGKILL until enough memory is available for the cgroup to continue, and
for each process it kills it logs a message, so there is no doubt about what
happened to that process.
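For concreteness, here is a minimal sketch of how userspace subscribes to
those per-cgroup OOM notifications under cgroup v1 (the
/sys/fs/cgroup/memory mount layout and the abbreviated error handling are
assumptions for illustration, not what mesos-slave actually does):

    /* Sketch: subscribing to cgroup-v1 memory OOM notifications via eventfd. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/eventfd.h>

    int watch_oom(const char *cgroup_dir) {
        char path[512], line[64];
        int efd = eventfd(0, 0);            /* kernel signals OOM events here */

        snprintf(path, sizeof(path), "%s/memory.oom_control", cgroup_dir);
        int ofd = open(path, O_RDONLY);

        /* Register by writing "<eventfd> <oom_control fd>" to
         * cgroup.event_control. */
        snprintf(path, sizeof(path), "%s/cgroup.event_control", cgroup_dir);
        int cfd = open(path, O_WRONLY);
        snprintf(line, sizeof(line), "%d %d", efd, ofd);
        write(cfd, line, strlen(line));
        close(cfd);

        uint64_t events;
        read(efd, &events, sizeof(events)); /* blocks until an OOM fires */
        printf("cgroup hit its memory limit (%llu event(s))\n",
               (unsigned long long)events);
        close(ofd);
        close(efd);
        return 0;
    }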
Which means that mesos-slave still needs to kill all of the processes in a
cgroup when the cgroup runs out of memory, and that most likely the processes
killed by the kernel will not be the executor, so races between reaping the
executor and OOM handling are unlikely to occur. It also means that if the
reaper wins the race because it was the executor that was killed, it will be
known that there was a deliberate SIGKILL sent to the process. Not that the
reaper process, which runs once a second, is likely to win a race with
anything.
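The slave-side kill amounts to something like the following sketch (the
cgroup-v1 path layout is an assumption, and a real implementation would
presumably freeze the cgroup first so killed processes cannot fork):

    /* Sketch: SIGKILL every process listed in the cgroup until none remain. */
    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>

    void kill_cgroup(const char *cgroup_dir) {
        char path[512];
        snprintf(path, sizeof(path), "%s/cgroup.procs", cgroup_dir);

        for (;;) {
            FILE *procs = fopen(path, "r");
            if (procs == NULL)
                return;                    /* cgroup already removed */
            int pid, remaining = 0;
            while (fscanf(procs, "%d", &pid) == 1) {
                kill((pid_t)pid, SIGKILL); /* same signal the kernel OOM killer uses */
                remaining++;
            }
            fclose(procs);
            if (remaining == 0)
                break;                     /* empty: the cgroup can now be destroyed */
        }
    }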
So the practical choice is really between the user debugging a TASK_LOST and
a hung box, and a TASK_KILLED with an indication that the executor was
explicitly killed (probably with a message from Mesos about the OOM, and
definitely a system log message about it). So I really don't see any valid
error-reporting reason for not letting the kernel kill processes. Even with
no other changes to the code there should not be any mysteries.
Furthermore mesos-slave is not set up to be a safe OOM killer. No steps are
taken to guarantee that mesos-slave will not allocate memory from the
operating system while handling an OOM condition, nor to guarantee that there
is free memory somewhere for mesos-slave to allocate from. In fact almost the
exact opposite is true: we spawn multiple libprocess processes to kill a
cgroup and free up memory, while busily logging random things from random
libprocess processes to disk. During a slave restart things are even worse,
as there is no mesos-slave to even try to kill anything, which might leave
too little memory to start a new copy of mesos-slave.
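To illustrate what "safe" would even require: a userspace OOM handler has to
pin its own pages and forgo allocation entirely while acting. A sketch of the
kind of precaution that is missing (illustrative only, not existing Mesos
code):

    /* Lock all current and future pages so the handler itself cannot
     * page-fault for new memory while the cgroup is under OOM. */
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");
            return 1;
        }
        /* From here on, the handler must also avoid malloc(), growing
         * stdio buffers, and anything else that demands new pages. */
        return 0;
    }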
Which is my long-winded way of saying: no, letting the kernel kill processes
will not significantly impact the debuggability of the system, and the
problems inherent in letting the kernel kill processes are much simpler to
solve than the problems inherent in getting mesos-slave to kill processes to
relieve a memory shortage.
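Mechanically, leaving the kernel OOM killer enabled is a one-character write
per cgroup. A sketch under cgroup v1 (the mount path is an assumption about
the local system):

    /* Write "0" (oom_kill_disable off) to memory.oom_control so the
     * kernel is allowed to kill on OOM. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int enable_kernel_oom_kill(const char *cgroup_dir) {
        char path[512];
        snprintf(path, sizeof(path), "%s/memory.oom_control", cgroup_dir);
        int fd = open(path, O_WRONLY);
        if (fd < 0)
            return -1;
        ssize_t n = write(fd, "0", 1);  /* 0 = let the kernel kill on OOM */
        close(fd);
        return n == 1 ? 0 : -1;
    }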
> Executor OOM could lead to a kernel hang
> ----------------------------------------
>
> Key: MESOS-662
> URL: https://issues.apache.org/jira/browse/MESOS-662
> Project: Mesos
> Issue Type: Bug
> Reporter: Vinod Kone
> Assignee: Benjamin Mahler
> Priority: Critical
> Fix For: 0.15.0
>
>
> We observed this in production at Twitter.
> An executor OOMed and the kernel put it to sleep instead of killing it,
> because the Mesos slave disables OOM kills. Mesos disables the kernel OOM
> killer so that it can take some action of its own. Currently the only action
> it takes is cleaning up the cgroup, but in the future the action could be to
> increase the memory limit.
> [6290807.554028] SysRq : Show Blocked State
> [6290807.554175] task PC stack pid father
> [6290807.554251] python2.6 D ffff88097b1c3158 0 31039 1 0x00000000
> [6290807.554255] ffff88120ae19b48 0000000000000082 0000000000000000 ffff88093ffffa08
> [6290807.554259] ffff88093fffed00 ffff88120ae18010 0000000000013300 0000000000013300
> [6290807.554263] 0000000000013300 ffff88120ae19fd8 0000000000013300 0000000000013300
> [6290807.554267] Call Trace:
> [6290807.554279] [<ffffffff814dfabd>] schedule+0x64/0x66
> [6290807.554285] [<ffffffff8113ad09>] mem_cgroup_handle_oom+0x132/0x21f
> [6290807.554289] [<ffffffff81138e62>] ? mem_cgroup_update_tree+0x165/0x165
> [6290807.554292] [<ffffffff8113aef5>] mem_cgroup_do_charge+0xff/0x124
> [6290807.554295] [<ffffffff8113b0ce>] __mem_cgroup_try_charge+0x1b4/0x298
> [6290807.554298] [<ffffffff8113b643>] mem_cgroup_charge_common+0x6a/0x91
> [6290807.554301] [<ffffffff8113b72f>] mem_cgroup_newpage_charge+0x23/0x25
> [6290807.554307] [<ffffffff8110c26e>] do_anonymous_page+0x169/0x29a
> [6290807.554311] [<ffffffff81110137>] handle_pte_fault+0x8d/0x1b1
> [6290807.554315] [<ffffffff8110a793>] ? anon_vma_interval_tree_insert+0x8a/0x8c
> [6290807.554319] [<ffffffff81113afe>] ? vma_adjust+0x50f/0x5b9
> [6290807.554324] [<ffffffff811a196d>] ? ext3_dx_readdir+0x181/0x1d7
> [6290807.554327] [<ffffffff81110489>] handle_mm_fault+0x22e/0x248
> [6290807.554332] [<ffffffff814e3c6a>] do_page_fault+0x367/0x3ae
> [6290807.554335] [<ffffffff811149f4>] ? do_brk+0x291/0x2f2
> [6290807.554339] [<ffffffff81141289>] ? __fput+0x1e7/0x1f6
> [6290807.554342] [<ffffffff814e0ba5>] page_fault+0x25/0x30
> A short-term solution is to enable the kernel OOM killer in cgroups (until
> we get around to adding support for soft memory limits in the cgroups
> isolator). The slave should still get an OOM notification and properly
> inform the frameworks of the OOM. One concern is that we don't know whether
> the kernel handling the OOM would cause problems with the cgroup cleanup
> done by the slave.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira