On Tue, 19 Nov 2013, Michal Hocko wrote:

> Hi,
> it's been quite some time since LSFMM 2013 when this has been
> discussed[1]. In short, it seems that there are usecases with a
> strong demand on a better user/admin policy control for the global
> OOM situations. Per process oom_{adj,score} which is used for the
> prioritizing is no longer sufficient because there are other categories
> which might be important. For example, often it doesn't make sense to
> kill just a part of the workload and killing the whole group would be a
> better fit. I am pretty sure there are many others some of them workload
> specific and thus not appropriate for the generic implementation.
> 

Thanks for starting this thread.  We'd like to have two things:

 - allow userspace to call into our implementation of malloc() to free
   excess memory so that nothing needs to be killed, which may include
   freeing userspace caches back to the kernel or using MADV_DONTNEED
   over a range of unused memory within the arena, and

 - enforce a hierarchical memcg prioritization policy so that memcgs can 
   be iterated at each level beneath the oom memcg (which may include the
   root memcg for system oom conditions) and eligible processes are killed 
   in the lowest priority memcg.

Handling oom conditions in userspace also allows for much more powerful 
implementations, defined by users of memcgs, that drop caches, increase 
memcg limits, signal applications to free unused memory, start throttling 
memory usage, perform heap analysis, log, etc.  Userspace oom handlers are 
the perfect place to do so.
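
To make the hierarchical prioritization in the second item above more 
concrete, here is a small sketch of the walk: start at the oom memcg, pick 
the lowest priority child at each level, and stop at a leaf.  Everything 
here is an assumption for illustration: struct memcg_node, the integer 
priority field (lower value means killed first), and the tie-break (first 
child wins) are not an existing kernel or userspace interface.

```c
#include <stddef.h>

/* Hypothetical userspace view of a memcg hierarchy. */
struct memcg_node {
    const char *name;
    int priority;                  /* assumed: lower value = killed first */
    struct memcg_node **children;  /* child memcgs, may be NULL */
    size_t nr_children;
};

/*
 * Descend from the oom memcg (possibly the root memcg for system oom),
 * choosing the lowest priority child at each level.  The returned leaf is
 * the memcg whose eligible processes would be killed.
 */
static const struct memcg_node *
select_victim_memcg(const struct memcg_node *oom_memcg)
{
    const struct memcg_node *cur = oom_memcg;

    while (cur && cur->nr_children > 0) {
        const struct memcg_node *lowest = cur->children[0];
        size_t i;

        for (i = 1; i < cur->nr_children; i++)
            if (cur->children[i]->priority < lowest->priority)
                lowest = cur->children[i];
        cur = lowest;
    }
    return cur;
}
```

Note that priorities are only compared among siblings, so a high priority 
memcg's children are never considered while a lower priority sibling exists.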

> We have basically ended up with 3 options AFAIR:
>       1) allow memcg approach (memcg.oom_control) on the root level
>          for both OOM notification and blocking OOM killer and handle
>          the situation from the userspace same as we can for other
>          memcgs.

This is what I've been proposing: with my latest patches, with the 
memory.oom_delay_millisecs patch in the past, and with a future patch to 
allow for per-memcg memory reserves.  Those reserves would allow charging 
to be bypassed up to a pre-defined threshold, much like per-zone memory 
reserves for TIF_MEMDIE processes today, so that userspace has access to 
memory to handle the situation even in system oom conditions.
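
For reference, this is roughly how a userspace handler registers for oom 
notification on a non-root memcg today with the cgroup v1 memory 
controller: create an eventfd, open memory.oom_control, and write 
"<event_fd> <oom_control_fd>" to cgroup.event_control.  The function name 
register_oom_notifier() and the memcg directory argument are illustrative 
assumptions; doing the same on the root memcg is what is being proposed, 
not something this sketch shows working.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: register an eventfd for oom events on the memcg
 * whose directory is memcg_dir.  Returns the eventfd (to be read() by the
 * handler) and stores the open memory.oom_control fd in *control_fd_out,
 * or returns -1 on failure.  The control fd is conservatively kept open
 * for as long as the handler listens.
 */
static int register_oom_notifier(const char *memcg_dir, int *control_fd_out)
{
    char path[4096], line[32];
    int efd, ofd = -1, cfd = -1, n;

    efd = eventfd(0, 0);
    if (efd < 0)
        return -1;

    n = snprintf(path, sizeof(path), "%s/memory.oom_control", memcg_dir);
    if (n < 0 || (size_t)n >= sizeof(path))
        goto fail;
    ofd = open(path, O_RDONLY);
    if (ofd < 0)
        goto fail;

    n = snprintf(path, sizeof(path), "%s/cgroup.event_control", memcg_dir);
    if (n < 0 || (size_t)n >= sizeof(path))
        goto fail;
    cfd = open(path, O_WRONLY);
    if (cfd < 0)
        goto fail;

    /* Registration format: "<event_fd> <fd of memory.oom_control>". */
    n = snprintf(line, sizeof(line), "%d %d", efd, ofd);
    if (n < 0 || (size_t)n >= sizeof(line) ||
        write(cfd, line, (size_t)n) != (ssize_t)n)
        goto fail;

    close(cfd);
    *control_fd_out = ofd;
    return efd;

fail:
    if (cfd >= 0)
        close(cfd);
    if (ofd >= 0)
        close(ofd);
    close(efd);
    return -1;
}
```

A handler would then block in read() on the returned eventfd with an 
8-byte buffer and act when it returns, for example by freeing caches or 
raising the memcg limit.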