On Tue, 19 Nov 2013, Michal Hocko wrote:

> Hi,
> it's been quite some time since LSFMM 2013 when this has been
> discussed[1]. In short, it seems that there are usecases with a
> strong demand on a better user/admin policy control for the global
> OOM situations. Per process oom_{adj,score} which is used for the
> prioritizing is no longer sufficient because there are other categories
> which might be important. For example, often it doesn't make sense to
> kill just a part of the workload and killing the whole group would be a
> better fit. I am pretty sure there are many others some of them workload
> specific and thus not appropriate for the generic implementation.
>

Thanks for starting this thread. We'd like to have two things:

 - allow userspace to call into our implementation of malloc() to free
   excess memory so that nothing needs to be killed, which may include
   freeing userspace caches back to the kernel or using MADV_DONTNEED
   over a range of unused memory within the arena, and

 - enforce a hierarchical memcg prioritization policy so that memcgs can
   be iterated at each level beneath the oom memcg (which may include
   the root memcg for system oom conditions) and eligible processes are
   killed in the lowest priority memcg.

This also allows for much more powerful implementations, defined by
users of memcgs, that drop caches, increase memcg limits, signal
applications to free unused memory, start throttling memory usage, do
heap analysis, logging, etc. Userspace oom handlers are the perfect
place to do so.

> We have basically ended up with 3 options AFAIR:
> 	1) allow memcg approach (memcg.oom_control) on the root level
> 	   for both OOM notification and blocking OOM killer and handle
> 	   the situation from the userspace same as we can for other
> 	   memcgs.

This is what I've been proposing with my latest patches, with the
memory.oom_delay_millisecs patch in the past, and with a future patch
to allow for per-memcg memory reserves. Those reserves would allow
charging to be bypassed up to a pre-defined threshold, much like
per-zone memory reserves for TIF_MEMDIE processes today, so that
userspace has access to memory to handle the situation even in system
oom conditions.
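
To make the MADV_DONTNEED item in the list above a bit more concrete,
here is a rough sketch of what handing an unused range back to the
kernel might look like. It is not our malloc() implementation: the
helper name release_unused_range() and the use of a freshly mmap()ed
region as a stand-in for an allocator arena are assumptions made purely
for illustration. It relies on the Linux behavior that private anonymous
pages read back zero-filled after MADV_DONTNEED.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from any real allocator: map npages of
 * private anonymous memory (standing in for an arena), dirty it, then
 * give the pages back to the kernel with MADV_DONTNEED while keeping
 * the mapping itself in place.
 *
 * Returns 0 if the range was released and reads back as zeroes,
 * -1 on any failure.
 */
int release_unused_range(size_t npages)
{
	assert(npages > 0);

	long page = sysconf(_SC_PAGESIZE);
	if (page <= 0)
		return -1;

	size_t len = npages * (size_t)page;
	unsigned char *arena = mmap(NULL, len, PROT_READ | PROT_WRITE,
				    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (arena == MAP_FAILED)
		return -1;

	/* Touch every page so that it is actually backed by memory. */
	memset(arena, 0xaa, len);

	int ret = -1;
	if (madvise(arena, len, MADV_DONTNEED) == 0) {
		/*
		 * The address range stays valid; the next access faults in
		 * fresh zero-filled pages, so the old contents are gone.
		 */
		ret = 0;
		for (size_t i = 0; i < len; i += (size_t)page)
			if (arena[i] != 0)
				ret = -1;
	}

	munmap(arena, len);
	return ret;
}
```

A userspace oom handler, or the allocator itself on notification, could
call something along these lines over ranges it knows to be free, for
example release_unused_range(4) in the sketch above. A real allocator
would apply madvise() to page-aligned free ranges inside its existing
arenas rather than to a fresh mapping.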