On Thu, 13 Jun 2013, Michal Hocko wrote: > > That's not at all the objective, the changelog quite explicitly states > > this is a deadlock as the result of userspace having disabled the oom > > killer so that its userspace oom handler can resolve the condition and it > > being unresponsive or unable to perform its job. > > Ohh, so another round. Sigh. You insist on having user space handlers > running in the context of the limited group. OK, I can understand your > use case, although I think it is pushing the limits of the interface and > it is dangerous.
Ok, this is where our misunderstanding is, and I can see why you have reacted the way you have. It's my fault for not describing where we're going with this. Userspace oom handling right now is very inadequate, not only for the reasons that you and Johannes have described. This requirement that userspace oom handlers must run with more memory than the memcgs it is monitoring is where the pain is. We have to accept the fact that users can be given control of their own memcg trees by chowning that "user root" directory to the user. The only real change you must enforce is that the limit cannot be increased unless you have a special capability. Otherwise, users can be given complete control over their own tree and enforce their own memory limits (or just accounting). I haven't seen any objection to that usecase from anybody, so I'm assuming that it's something that is understood and acknowledged. Nothing in the cgroups code prevents it and I believe it gives a much more powerful interface to the user to control their own destiny much like setting up root level memcgs for the admin. The problem is that we want to control global oom conditions as well. oom_score_adj simply doesn't allow for user-specific implementations either within user subcontainers or at the root. Users may very well want to kill the allocating task in their memcg to avoid expensive scans (SGI required oom_kill_allocating_task at a global level), kill the largest memory consumer, do a priority-based memcg scoring, kill the newest process, etc. That's the power of user oom notifications beyond just statistics collection or creating a tombstone to examine later. We want to do that at the root level as well for global oom conditions. We have a priority-based scoring method amongst top-level memcgs. They are jobs that are scheduled on machines and given a certain priority to run at. We also overcommit the machine so the sum of all top-level memory.limit_in_bytes exceeds the amount of physical RAM since jobs seldom use their entire reservation. When RAM is exhausted, we want to kill the lowest priority process from the lowest priority memcg. That's not possible to do currently because we lack global oom notification and userspace oom handling. We want to change that by notifying on the root memcg's memory.oom_control for global oom and wakeup a process that will handle global oom conditions. Memcg is the perfect vehicle for this since it already does userspace oom notification and because the oom condition in a "user root" can be handled the same as a "global oom". Waking up userspace and asking it to do anything useful in a global oom condition is problematic if it is unresponsive for whatever reason. Especially with true kmem accounting of all slab, we cannot be guaranteed that this global oom handler can actually do things like query the memory usage of a process or read a memcg's tasks file. For this reason, we want a kernel failsafe to ensure that if it isn't able to respond in a reasonable amount of time, the kernel frees memory itself. There's simply no other alternative to doing something like memory.oom_delay_millisecs then at the root level. For notification to actually work, we need to preempt the kernel at least for a certain amount of time from killing: the solution right now is to disable the oom killer entirely with memory.oom_control == 1. If the global oom handler fails the respond, the machine is essentially dead unless something starts freeing memory back for some strange reason. For our own priority-based memcg oom scoring implementation, this global oom handler would then signal the memcg oom handler in the lowest priority memcg, if it exists, to handle the situation or it would kill something from that memcg itself. The key is that without some sort of kernel-imposed delay, there's no way to sanely implement a global oom handler that doesn't risk killing the entire machine because the kernel oom killer is deferred. This patch makes that delay configurable. I could certainly push the patch that does the global oom notification on the root level memcg's memory.oom_control first if you'd like, but I didn't think there would be so much opposition to running a memcg oom handler in the memcg itself, which is what this essentially is. If that's done, then we can't simply rely on the global oom killer, which your suggestion does when things go wrong or to act as a watchdog, to step in in such a situation and kill something. We always want to kill from the lowest-priority memcg in global oom conditions and we don't expect to implement a multitude of different oom selection algorithms in the kernel. We want to punt to userspace and give it the functionality in the kernel that it needs when things go wrong. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/