On 2026-01-12 14:48, Michal Hocko wrote:
On Mon 12-01-26 14:37:49, Mathieu Desnoyers wrote:
On 2026-01-12 03:42, Michal Hocko wrote:
Hi,
sorry to jump in this late but the timing of previous versions didn't
really work well for me.

On Sun 11-01-26 14:49:57, Mathieu Desnoyers wrote:
[...]
Here is a (possibly incomplete) list of the prior approaches that were
used or proposed, along with their downside:

1) Per-thread rss tracking: large error for processes with many threads.

2) Per-CPU counters: up to 12% slower for short-lived processes and 9%
     increased system time in make test workloads [1]. Moreover, the
     inaccuracy increases as O(n^2) with the number of CPUs.

3) Per-NUMA-node counters: requires atomics on the fast path (overhead),
     and the error is high on systems with many NUMA nodes (32 times
     the number of NUMA nodes).

The approach proposed here is to replace this with hierarchical
per-cpu counters, which bound the inaccuracy, based on the system
topology, to O(N*logN).
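
To illustrate the idea, here is a minimal userspace sketch of the concept
(illustration only: the hpcc_* names, the topology and the batch sizes are
made up and are not the API proposed in this series). Each CPU owns a leaf
counter, leaves are grouped following the topology, an update is carried up
one level whenever the accumulated delta reaches that level's batch, and an
approximate read only looks at the root:

/*
 * Minimal userspace sketch of a hierarchical per-cpu counter.
 * The error of an approximate (root-only) read is bounded by the
 * per-level batches and the topology rather than by
 * batch * nr_possible_cpus as with flat per-cpu counters.
 */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct hpcc_node {
	atomic_long delta;		/* value not yet carried upwards */
	long batch;			/* carry threshold at this level */
	struct hpcc_node *parent;	/* NULL for the root */
};

/* Add @inc to the leaf owned by the current CPU, carrying up as needed. */
static void hpcc_add(struct hpcc_node *leaf, long inc)
{
	struct hpcc_node *node = leaf;

	for (;;) {
		long cur = atomic_fetch_add(&node->delta, inc) + inc;

		if (!node->parent || labs(cur) < node->batch)
			return;
		/* Move the accumulated delta up one level. */
		atomic_fetch_sub(&node->delta, cur);
		inc = cur;
		node = node->parent;
	}
}

/* Approximate read: root only, no iteration over all leaves. */
static long hpcc_read_approx(struct hpcc_node *root)
{
	return atomic_load(&root->delta);
}

int main(void)
{
	struct hpcc_node root = { .batch = 0, .parent = NULL };
	struct hpcc_node pkg  = { .batch = 8, .parent = &root };
	struct hpcc_node cpu0 = { .batch = 4, .parent = &pkg  };

	for (int i = 0; i < 10; i++)
		hpcc_add(&cpu0, 1);

	/* Exact value is 10; the root sees 8 here, within the batch bounds. */
	printf("approximate value: %ld\n", hpcc_read_approx(&root));
	return 0;
}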

The concept of hierarchical pcp counters is interesting and I am
definitely not opposed if there are more users that would benefit.

From the OOM POV, IIUC the primary problem is that get_mm_counter
(percpu_counter_read_positive) is too imprecise on systems where the task
is moving across a large number of CPUs. In the list of alternative
solutions I do not see percpu_counter_sum_positive mentioned.
oom_badness() is a really slow path, and taking the slow path to
calculate a much more precise value seems acceptable. Have you
considered that option?

I must admit I assumed that, since there was already a mechanism in place
to avoid summing the per-cpu counters when the oom killer selects tasks,
it was because this

   O(nr_possible_cpus * nr_processes)

operation is too slow for the oom killer's requirements.

AFAIU, the oom killer is executed when the memory allocator fails to
allocate memory, which can happen within code paths that need to progress
eventually. So even though it's a slow path compared to the allocator
fast path, there must be at least _some_ expectation that it completes
within a decent amount of time. What would that ballpark be?

I do not think we have ever promised more than that the oom killer will
try to unlock the system blocked on memory shortage.

To give an order of magnitude, I've tried modifying the upstream
oom killer to use percpu_counter_sum_positive and compared it to
the hierarchical approach:

AMD EPYC 9654 96-Core (2 sockets)
Within a KVM guest configured with 256 logical CPUs.

                    nr_processes=40    nr_processes=10000
Counter sum:            0.4 ms             81.0 ms
HPCC with 2-pass:       0.3 ms              9.3 ms

These are peanuts for global oom situations. We have had situations
where the soft lockup detector triggered because of the process tree
traversal, so adding 100ms is not really critical.

So as we scale up the number of processes on large SMP systems,
the latency of oom killer task selection increases much more
with the counter sums than with the hierarchical approach.

Yes, I am not really questioning that the hierarchical approach will perform
much better, but I am thinking of a good-enough solution, and calculating
the number might be just that stop-gap solution (one that would also be
suitable for stable tree backports). I am not ruling out improving on
top of that with a more clever solution like your hierarchical counters
approach, especially if there are more benefits from it elsewhere.


Would you be OK with introducing changes in the following order?

1) Fix the OOM killer inaccuracy by using the counter sum (iteration over
   all per-cpu counters) in task selection, as sketched below. This may
   slow down the oom killer, but would at least fix its current
   inaccuracy issues. This could be backported to stable kernels.

2) Introduce the hierarchical percpu counters on top, as an oom killer
   task selection performance optimization (reducing the latency of the
   oom kill).

This way, (2) becomes purely a performance optimization, so it's easy
to bisect and revert if it causes issues.
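
For reference, here is a rough sketch of what (1) could look like, assuming
the current layout where mm->rss_stat[] is an array of struct percpu_counter.
The get_mm_counter_sum() helper name is made up, and the points calculation
below paraphrases the one in oom_badness():

/*
 * Sketch only: get_mm_counter_sum() is a made-up helper;
 * percpu_counter_sum_positive() is the existing precise (slow-path)
 * accessor that iterates over all per-cpu counters.
 */
static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
{
	return percpu_counter_sum_positive(&mm->rss_stat[member]);
}

/* In oom_badness(), use the precise sums instead of the approximate reads: */
	points = get_mm_counter_sum(p->mm, MM_FILEPAGES) +
		 get_mm_counter_sum(p->mm, MM_ANONPAGES) +
		 get_mm_counter_sum(p->mm, MM_SHMEMPAGES) +
		 get_mm_counter_sum(p->mm, MM_SWAPENTS) +
		 mm_pgtables_bytes(p->mm) / PAGE_SIZE;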

I agree that combining a fix with a performance optimization in a
single commit makes it hard to backport to stable, and tricky to
revert if it causes problems.

As for finding other users of the hpcc, I have ideas, but not so much
time available to try them out, as I'm pretty much doing this in my
spare time.

One possibility I see would be to track memory access patterns for
NUMA locality by sampling within the page fault handlers.
Using hierarchical counters could help make quick migration decisions
(either migrating tasks or moving memory pages across NUMA nodes)
based on hpcc approximations, including using the intermediate counters
within the tree levels.
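
Purely hypothetical sketch of what I mean (the hpcc_* API and the
mm->numa_{local,remote} fields are invented for illustration and are not
part of this series):

/* In the NUMA hinting fault path (e.g. do_numa_page()), sample locality: */
	if (folio_nid(folio) == numa_node_id())
		hpcc_add(&mm->numa_local, 1);	/* invented field and API */
	else
		hpcc_add(&mm->numa_remote, 1);

/*
 * Later, e.g. from task_numa_work(), a cheap check based on approximate
 * values only, without summing all per-cpu counters:
 */
	if (hpcc_read_approx(&mm->numa_remote) >
	    4 * hpcc_read_approx(&mm->numa_local))
		trigger_numa_migration(p);	/* placeholder for a real policy hook */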

Another category of use cases would be tracking resource limits
(e.g. cgroups, network traffic control) through HPCC, and then
using approximated counter values to check whether the current
usage goes beyond the limit threshold.
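
For instance (again a hypothetical sketch; hpcc_read_approx(),
hpcc_error_bound() and hpcc_read_sum() are invented names), a limit check
could rely on the approximation and only pay for a precise sum when the
value is within the known error bound of the threshold:

/* Sketch: hpcc_* names invented; only illustrates the check pattern. */
static bool usage_within_limit(struct hpcc_node *counter, long limit)
{
	long approx = hpcc_read_approx(counter);
	long slack = hpcc_error_bound(counter);	/* topology-derived bound */

	if (approx + slack <= limit)		/* certainly under the limit */
		return true;
	if (approx - slack > limit)		/* certainly over the limit */
		return false;
	/* Too close to call: fall back to the precise sum. */
	return hpcc_read_sum(counter) <= limit;
}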

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
