On 2025-04-02 20:00, Shakeel Butt wrote:
> On Mon, Mar 31, 2025 at 06:35:14PM -0400, Sweet Tea Dorminy wrote:
> [...]
>
> I am still not buying the 'good performance' point. To me, we might need
> to go with a reduced batch size for the existing approach, or the
> multi-level approach from Mathieu (I still have to see Mateusz's and
> Kairui's proposals).
Here is an initial userspace prototype of my hierarchical split counters:

https://github.com/compudj/librseq/blob/percpu-counter/include/rseq/percpu-counter.h
https://github.com/compudj/librseq/blob/percpu-counter/src/percpu-counter.c

How to try it out:

* Install liburcu,
* Clone & build https://github.com/compudj/librseq, branch: percpu-counter
  (note: it's currently a POC, very lightly tested.),
* Run tests/percpu_counter_test.tap:

ok 1 - Registered current thread with rseq
Counter init: approx: 0 precise: 0 inaccuracy: ±2048
Counter after sum: approx: 2998016 precise: 2996800 inaccuracy: ±2048
Counter after set=0: approx: 1216 precise: 0 inaccuracy: ±2048
ok 2 - Unregistered current thread with rseq
1..2

It implements the following operations:

Fast paths:

- counter_add
- counter_approx_sum

Function call APIs for:

- counter_add_slowpath (approximation propagation to levels > 0),
- counter_precise_sum (iterate on all per-cpu counters),
- counter_set: set a bias to bring the precise sum to a given target value,
- counter_inaccuracy: return the maximum inaccuracy of the approximation
  for this counter configuration,
- counter_compare: compare a counter against a value. Use the approximation
  if the value is further away than the inaccuracy limits, else use the
  precise sum (a simplified sketch of this follows below).
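To make the last point concrete, the idea behind counter_compare is roughly
the following. This is a simplified sketch rather than the prototype code;
the exact signatures and return convention are illustrative only:

static int counter_compare_sketch(struct percpu_counter *counter, long v)
{
        long approx = counter_approx_sum(counter);
        long inaccuracy = counter_inaccuracy(counter);
        long precise;

        /*
         * If v lies outside the [approx - inaccuracy, approx + inaccuracy]
         * window, the cheap approximate sum is conclusive.
         */
        if (v < approx - inaccuracy)
                return 1;       /* counter value > v */
        if (v > approx + inaccuracy)
                return -1;      /* counter value < v */

        /* Otherwise fall back to the precise (per-cpu iteration) sum. */
        precise = counter_precise_sum(counter);
        if (precise > v)
                return 1;
        if (precise < v)
                return -1;
        return 0;
}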
Porting it to the Linux kernel and replacing lib/percpu_counter.c should be
straightforward. AFAIU, the only thing I have not implemented is a
replacement for percpu_counter_limited_add, and I'm not so sure how useful
it is.

The most relevant piece of the algorithm is within counter_add, as follows:

static inline void counter_add(struct percpu_counter *counter, long inc)
{
        unsigned long bit_mask = counter->level0_bit_mask;
        intptr_t orig, res;
        int ret, cpu;

        if (!inc)
                return;

        // This is basically a percpu_add_return() in userspace with rseq.
        do {
                cpu = rseq_cpu_start();
                orig = *rseq_percpu_ptr(counter->level0, cpu);
                ret = rseq_load_cbne_store__ptr(RSEQ_MO_RELAXED, RSEQ_PERCPU_CPU_ID,
                                                rseq_percpu_ptr(counter->level0, cpu),
                                                orig, orig + inc, cpu);
        } while (ret);
        res = orig + inc;
        counter_dbg_printf("counter_add: inc: %ld, bit_mask: %ld, orig %ld, res %ld\n",
                           inc, bit_mask, orig, res);
        /*
         * Only the multiples of (bit_mask << 1) contained in inc are carried
         * to the upper levels; the remainder stays in the per-cpu level-0
         * counter. If the bit_mask bit flipped between orig and res without
         * a sign change (no wrap-around), carry one extra bit_mask in the
         * direction of inc.
         */
        if (inc < 0) {
                inc = -(-inc & ~((bit_mask << 1) - 1));
                /* xor bit_mask, same sign: underflow */
                if (((orig & bit_mask) ^ (res & bit_mask)) &&
                    __counter_same_sign(orig, res))
                        inc -= bit_mask;
        } else {
                inc &= ~((bit_mask << 1) - 1);
                /* xor bit_mask, same sign: overflow */
                if (((orig & bit_mask) ^ (res & bit_mask)) &&
                    __counter_same_sign(orig, res))
                        inc += bit_mask;
        }
        if (inc)
                counter_add_slowpath(counter, inc);
}

void counter_add_slowpath(struct percpu_counter *counter, long inc)
{
        struct percpu_counter_level_item *item = counter->items;
        unsigned int level_items = counter->nr_cpus >> 1;
        unsigned int level, nr_levels = counter->nr_levels;
        long bit_mask = counter->level0_bit_mask;
        int cpu = rseq_current_cpu_raw();

        for (level = 1; level < nr_levels; level++) {
                long orig, res;
                long *count = &item[cpu & (level_items - 1)].count;

                bit_mask <<= 1;
                res = uatomic_add_return(count, inc, CMM_RELAXED);
                orig = res - inc;
                counter_dbg_printf("counter_add_slowpath: level %d, inc: %ld, bit_mask: %ld, orig %ld, res %ld\n",
                                   level, inc, bit_mask, orig, res);
                /* Same carry computation as in counter_add, one level up. */
                if (inc < 0) {
                        inc = -(-inc & ~((bit_mask << 1) - 1));
                        /* xor bit_mask, same sign: underflow */
                        if (((orig & bit_mask) ^ (res & bit_mask)) &&
                            __counter_same_sign(orig, res))
                                inc -= bit_mask;
                } else {
                        inc &= ~((bit_mask << 1) - 1);
                        /* xor bit_mask, same sign: overflow */
                        if (((orig & bit_mask) ^ (res & bit_mask)) &&
                            __counter_same_sign(orig, res))
                                inc += bit_mask;
                }
                item += level_items;
                level_items >>= 1;
                if (!inc)
                        return;
        }
        counter_dbg_printf("counter_add_slowpath: last level add %ld\n", inc);
        uatomic_add(&item->count, inc, CMM_RELAXED);
}

Feedback is welcome!

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
