On 2025-04-08 12:37, Christoph Lameter (Ampere) wrote:
> On Tue, 8 Apr 2025, Mathieu Desnoyers wrote:

>> - Minimize contention when incrementing and decrementing counters,
>> - Provide fast access to a sum approximation,

> In general I like this as an abstraction of the Zoned VM counters in
> vmstat.c that will make the scalable counters there useful elsewhere.

I'm glad you like it! :)


>> It aims at fixing the per-mm RSS tracking which has become too
>> inaccurate for OOM killer purposes on large many-core systems [1].

> There are numerous cases where these issues occur. I know of a few places
> where I could use something like this.

I'm looking forward to seeing it used.


>> The hierarchical per-CPU counters propagate a sum approximation through
>> a binary tree. When the batch size is reached, the carry is propagated
>> through a binary tree which consists of log2(nr_cpu_ids) levels. The
>> batch size for each level is twice the batch size of the prior level.

> A binary tree? Could we do this N-way? Otherwise the tree will be 8 levels
> on a 512 cpu machine. Given the inflation of the number of cpus, this
> scheme had better work up to 8K cpus.

That's a good point. We should probably tune the tree n-arity to limit
the tree depth to about 5. Here is a table showing a possible n-arity
choice for each power-of-2 nr_cpu_ids (a small code sketch reproducing
this selection follows the second table):

nr_cpu_ids that can be represented at depth N for tree n-arity
2, 4, 8, and 16:

   N      2^N        4^N             8^N                 16^N
--------------------------------------------------------------
   0        1          1               1                    1
   1        2          4               8                   16
   2        4         16              64                  256
   3        8         64             512                 4096
   4       16        256            4096               *65536*
   5       32       1024          *32768*             1048576
   6       64       4096          262144             16777216
   7      128     *16384*        2097152            268435456
   8      256      65536        16777216           4294967296
   9      512     262144       134217728          68719476736
  10     1024    1048576      1073741824        1099511627776
  11     2048    4194304      8589934592       17592186044416
  12     4096   16777216     68719476736      281474976710656
  13    *8192*  67108864    549755813888     4503599627370496
  14    16384  268435456   4398046511104    72057594037927936
  15    32768 1073741824  35184372088832  1152921504606846976
  16    65536 4294967296 281474976710656 18446744073709551616

nr_levels is the number of tree levels for carry propagation
(excludes the final accumulation counter).

nr_cpu_ids   n-arity          nr_levels
      2         2                 1
      4         2                 2
      8         2                 3
     16         4                 2
     32         4                 3
     64         4                 3
    128         8                 3
    256         8                 3
    512         8                 3
   1024         8                 4
   2048         8                 4
   4096         8                 4
   8192        16                 4
  16384        16                 4
  32768        16                 4
  65536        16                 4
 131072        16                 5
 262144        16                 5
 524288        16                 5
1048576        16                 5
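
To make the selection rule concrete, here is a small standalone C
sketch (not from the patch series) that reproduces the n-arity choice
and the resulting carry-propagation depth from the tables above:

#include <stdio.h>

/* Pick an n-arity that keeps the carry-propagation depth at about 5
 * or less; the cutoffs mirror the second table above. */
static unsigned int pick_arity(unsigned int nr_cpu_ids)
{
	if (nr_cpu_ids <= 8)
		return 2;
	if (nr_cpu_ids <= 64)
		return 4;
	if (nr_cpu_ids <= 4096)
		return 8;
	return 16;
}

/* Number of tree levels for carry propagation, excluding the final
 * accumulation counter: the smallest l with arity^l >= nr_cpu_ids. */
static unsigned int nr_levels(unsigned int nr_cpu_ids, unsigned int arity)
{
	unsigned int l = 0, span = 1;

	while (span < nr_cpu_ids) {
		span *= arity;
		l++;
	}
	return l;
}

int main(void)
{
	unsigned int n;

	for (n = 2; n <= 1048576; n *= 2)
		printf("%8u %8u %8u\n", n, pick_arity(n),
		       nr_levels(n, pick_arity(n)));
	return 0;
}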


>> +int percpu_counter_tree_precise_sum(struct percpu_counter_tree *counter);
>> +int percpu_counter_tree_precise_compare(struct percpu_counter_tree *a, struct percpu_counter_tree *b);
>> +int percpu_counter_tree_precise_compare_value(struct percpu_counter_tree *counter, int v);

> Precise? Concurrent counter updates can occur while determining the global
> value. People may get confused.

If counters are incremented or decremented concurrently, a "sum" reader
will of course observe the value either before or after the increment.
This is inherent to reading a counter concurrently with its updates.
However, the same can be said of reading a global counter concurrently
with its updates.

If you introduce an external synchronization mechanism that makes the
updaters quiescent (e.g. RCU) before reading the counter value, then
you have a sum which is guaranteed to include all prior increments/decrements.

So using the term "precise" does not appear misleading to me from that
perspective.
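
For instance, a kernel-style sketch of that scheme could look as
follows; percpu_counter_tree_add() is an assumed add-side API name
here, since only the precise-sum prototypes are quoted above:

/* Updater side: perform increments inside an RCU read-side
 * critical section. */
static void updater(struct percpu_counter_tree *counter)
{
	rcu_read_lock();
	percpu_counter_tree_add(counter, 1);	/* assumed API name */
	rcu_read_unlock();
}

/* Reader side: waiting for updaters to become quiescent guarantees
 * that the precise sum includes all prior increments/decrements. */
static int quiescent_sum(struct percpu_counter_tree *counter)
{
	synchronize_rcu();
	return percpu_counter_tree_precise_sum(counter);
}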

Also maybe there would be a need for a function to collape the values into
the global if f.e. a cpu goes off line or in order to switch off OS
activities on a cpu.

Currently percpu_counter_tree_precise_sum_unbiased() iterates over each
possible cpu, which does not require cpu hotplug integration.
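
Conceptually, with the intermediate tree levels collapsed for brevity
and with assumed field names (level0, accumulator), it does something
like:

static int precise_sum_sketch(struct percpu_counter_tree *counter)
{
	int sum = atomic_read(&counter->accumulator);	/* assumed field */
	int cpu;

	/* Walking every *possible* cpu is what avoids the need for
	 * cpu hotplug integration. */
	for_each_possible_cpu(cpu)
		sum += READ_ONCE(*per_cpu_ptr(counter->level0, cpu));
	return sum;
}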

Indeed, as an improvement, we could integrate with cpu hotplug notifiers
and move the offlining CPU's counters to a global accumulator, but this
would require the precise sum to synchronize against concurrent CPU
hotplug. That would allow the precise sum iteration to only consider
online cpus.
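
lib/percpu_counter.c already does something along those lines with its
CPUHP_PERCPU_CNT_DEAD state; a sketch of the same approach for the tree
(hypothetical hotplug state and internal field names, plus an assumed
global list of counters) could be:

static int percpu_counter_tree_cpu_dead(unsigned int cpu)
{
	struct percpu_counter_tree *counter;

	spin_lock_irq(&percpu_counter_tree_lock);	/* assumed lock */
	list_for_each_entry(counter, &percpu_counter_tree_list, list) {
		s32 *pcount = per_cpu_ptr(counter->level0, cpu);

		/* Fold the dead cpu's leaf into the final accumulator
		 * so sums no longer need to visit this cpu. */
		atomic_add(xchg(pcount, 0), &counter->accumulator);
	}
	spin_unlock_irq(&percpu_counter_tree_lock);
	return 0;
}

/* Registered once at init, mirroring lib/percpu_counter.c: */
cpuhp_setup_state_nocalls(CPUHP_PERCPU_COUNTER_TREE_DEAD, /* hypothetical state */
			  "percpu_counter_tree:dead", NULL,
			  percpu_counter_tree_cpu_dead);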

Note that the hierarchical counter tree algorithms, for both
increment/decrement and the approximate/precise sums, are conveniently
lock-free at the moment.

We'd have to consider whether it's preferable to keep lock-freedom or to
introduce locking so that the precise sum iterates over fewer CPUs.
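
To make the lock-freedom point concrete, here is a simplified increment
sketch, collapsed to a single carry level (leaf -> accumulator, whereas
the series propagates through log-depth levels with the batch size
doubling at each level) and using assumed field names (level0 as
per-cpu atomics, batch, accumulator):

static void tree_add_sketch(struct percpu_counter_tree *counter, int inc)
{
	atomic_t *leaf = this_cpu_ptr(counter->level0);
	int val = atomic_add_return(inc, leaf);

	/* On reaching the batch size, move the observed value up to
	 * the accumulator.  Subtracting exactly what we add keeps the
	 * leaf+accumulator total invariant, using only atomics, so
	 * increment and sum stay lock-free. */
	if (abs(val) >= counter->batch) {
		atomic_sub(val, leaf);
		atomic_add(val, &counter->accumulator);
	}
}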

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
