On Wed, Sep 27, 2017 at 9:23 AM, Roman Gushchin <g...@fb.com> wrote:
> On Wed, Sep 27, 2017 at 08:35:50AM -0700, Tim Hockin wrote:
>> On Wed, Sep 27, 2017 at 12:43 AM, Michal Hocko <mho...@kernel.org> wrote:
>> > On Tue 26-09-17 20:37:37, Tim Hockin wrote:
>> > [...]
>> >> I feel like David has offered examples here, and many of us at Google
>> >> have offered examples as long ago as 2013 (if I recall) of cases where
>> >> the proposed heuristic is EXACTLY WRONG.
>> >
>> > I do not think we have discussed anything resembling the current
>> > approach. And I would really appreciate some more examples where
>> > decisions based on leaf nodes would be EXACTLY WRONG.
>> >
>> >> We need OOM behavior to kill in a deterministic order configured by
>> >> policy.
>> >
>> > And nobody is objecting to this use case. I think we can build a priority
>> > policy on top of leaf-based decisions as well. The main point we are
>> > trying to sort out here is a reasonable semantic that would work for
>> > most workloads. Sibling-based selection will simply not work on those
>> > that have to use deeper hierarchies for organizational purposes. I
>> > haven't heard a counter-argument for that example yet.
>> >
>
> Hi, Tim!
>
>> We have a priority-based, multi-user cluster. That cluster runs a
>> variety of work, including critical things like search and gmail, as
>> well as non-critical things like batch work. We try to offer our
>> users an SLA around how often they will be killed by factors outside
>> themselves, but we also want to get higher utilization. We know for a
>> fact (data, lots of data) that most jobs have spare memory capacity,
>> set aside for spikes or simply because accurate sizing is hard. We
>> can sell "guaranteed" resources to critical jobs, with a high SLA. We
>> can sell "best effort" resources to non-critical jobs with a low SLA.
>> We achieve much better overall utilization this way.
>
> This is well understood.
>
>>
>> I need to represent the priority of these tasks in a way that gives me
>> a very strong promise that, in case of system OOM, the non-critical
>> jobs will be chosen before the critical jobs. Regardless of size.
>> Regardless of how many non-critical jobs have to die. I'd rather kill
>> *all* of the non-critical jobs than a single critical job. Size of
>> the process or cgroup is simply not a factor, and honestly given 2
>> options of equal priority I'd say age matters more than size.
>>
>> So concretely I have 2 first-level cgroups, one for "guaranteed" and
>> one for "best effort" classes. I always want to kill from "best
>> effort", even if that means killing 100 small cgroups, before touching
>> "guaranteed".
>>
>> I apologize if this is not as thorough as the rest of the thread - I
>> am somewhat out of touch with the guts of it all these days. I just
>> feel compelled to indicate that, as a historical user (via Google
>> systems) and current user (via Kubernetes), some of the assertions
>> being made here do not ring true for our very real use cases. I
>> desperately want cgroup-aware OOM handling, but it has to be
>> policy-based or it is just not useful to us.
>
> A policy-based approach was suggested by Michal at the very beginning of
> this discussion. Although nobody had any strong objections against it,
> we agreed that it is out of scope for this patchset.
>
> The idea of this patchset is to introduce the ability to select a memcg
> as an OOM victim, with optional killing of all tasks belonging to it.
> I believe it's absolutely mandatory for _any_ further development
> of the OOM killer that wants to deal with memory cgroups as OOM entities.
>
> If you think that it makes it impossible to support some use cases in
> the future, let's discuss it. Otherwise, I'd prefer to finish this part
> of the work, and proceed to the follow-up improvements on top of it.
>
> Thank you!
I am 100% in favor of killing whole groups. We want that too. I just needed to express disagreement with statements that size-based decisions could not produce bad results. They can and do.
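
To make that concrete, here is a minimal sketch of the two-class setup
described above. Hedged assumptions: a cgroup2 hierarchy mounted at
/sys/fs/cgroup, and the per-memcg kill knob (memory.oom_group) that this
patchset introduces. The memory.oom_priority knob is hypothetical - it
stands in for the policy layer under discussion and is NOT part of these
patches.

/*
 * Sketch only: set up two first-level cgroups, one per service class,
 * and mark each as a single OOM-kill unit.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

#define CGROOT "/sys/fs/cgroup"

static void write_knob(const char *cg, const char *knob, const char *val)
{
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path), CGROOT "/%s/%s", cg, knob);
        f = fopen(path, "w");
        if (!f) {
                perror(path);
                exit(1);
        }
        fputs(val, f);
        fclose(f);
}

int main(void)
{
        /* Two first-level cgroups, one per service class. */
        mkdir(CGROOT "/guaranteed", 0755);
        mkdir(CGROOT "/besteffort", 0755);

        /* From this patchset: treat each memcg as one OOM victim. */
        write_knob("guaranteed", "memory.oom_group", "1");
        write_knob("besteffort", "memory.oom_group", "1");

        /*
         * Hypothetical policy knob (not in these patches): always pick
         * the lower-priority class first, regardless of size.
         */
        write_knob("guaranteed", "memory.oom_priority", "100");
        write_knob("besteffort", "memory.oom_priority", "1");

        return 0;
}

The intent would be that on system OOM the killer compares priorities
before sizes at the first level, so "besteffort" groups are always
victimized first, no matter how small they are or how many of them have
to die - the deterministic ordering described above.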