On 2025-04-02 20:00, Shakeel Butt wrote:
> On Mon, Mar 31, 2025 at 06:35:14PM -0400, Sweet Tea Dorminy wrote:
> [...]
>
> I am still not buying the 'good performance' point. To me, we might need
> to go with a reduced batch size for the existing approach, or the
> multi-level approach from Mathieu (I still have to see Mateusz's and
> Kairui's proposals).
Here is an initial userspace prototype of my hierarchical split counters:

https://github.com/compudj/librseq/blob/percpu-counter/include/rseq/percpu-counter.h
https://github.com/compudj/librseq/blob/percpu-counter/src/percpu-counter.c

How to try it out:

* Install liburcu,
* Clone & build https://github.com/compudj/librseq, branch: percpu-counter
  (note: it's currently a POC, very lightly tested.),
* Run tests/percpu_counter_test.tap:

ok 1 - Registered current thread with rseq
Counter init: approx: 0 precise: 0 inaccuracy: ±2048
Counter after sum: approx: 2998016 precise: 2996800 inaccuracy: ±2048
Counter after set=0: approx: 1216 precise: 0 inaccuracy: ±2048
ok 2 - Unregistered current thread with rseq
1..2

It implements the following operations:

Fast paths:

- counter_add
- counter_approx_sum

Function call APIs for:

- counter_add_slowpath (approximation propagation to levels > 0),
- counter_precise_sum (iterate on all per-cpu counters),
- counter_set: set a bias to bring the precise sum to a given target value,
- counter_inaccuracy: return the maximum inaccuracy of the approximation
  for this counter configuration,
- counter_compare: compare a counter against a value. Use the approximation
  if the value is further away than the inaccuracy limits, else use the
  precise sum (a simplified sketch of this follows below).
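To make the last point concrete, the idea behind counter_compare is roughly
the following. This is a simplified sketch rather than the prototype code;
the exact signatures and return convention are illustrative only:

static int counter_compare_sketch(struct percpu_counter *counter, long v)
{
        long approx = counter_approx_sum(counter);
        long inaccuracy = counter_inaccuracy(counter);
        long precise;

        /*
         * If v lies outside the [approx - inaccuracy, approx + inaccuracy]
         * window, the cheap approximate sum is conclusive.
         */
        if (v < approx - inaccuracy)
                return 1;       /* counter value > v */
        if (v > approx + inaccuracy)
                return -1;      /* counter value < v */

        /* Otherwise fall back to the precise (per-cpu iteration) sum. */
        precise = counter_precise_sum(counter);
        if (precise > v)
                return 1;
        if (precise < v)
                return -1;
        return 0;
}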
Porting it to the Linux kernel and replacing lib/percpu_counter.c should be
straightforward. AFAIU, the only thing I have not implemented is a
replacement for percpu_counter_limited_add, and I'm not so sure how useful
it is.

The most relevant piece of the algorithm is within counter_add, as follows:

static inline void counter_add(struct percpu_counter *counter, long inc)
{
        unsigned long bit_mask = counter->level0_bit_mask;
        intptr_t orig, res;
        int ret, cpu;

        if (!inc)
                return;

        // This is basically a percpu_add_return() in userspace with rseq.
        do {
                cpu = rseq_cpu_start();
                orig = *rseq_percpu_ptr(counter->level0, cpu);
                ret = rseq_load_cbne_store__ptr(RSEQ_MO_RELAXED, RSEQ_PERCPU_CPU_ID,
                                                rseq_percpu_ptr(counter->level0, cpu),
                                                orig, orig + inc, cpu);
        } while (ret);
        res = orig + inc;
        counter_dbg_printf("counter_add: inc: %ld, bit_mask: %ld, orig %ld, res %ld\n",
                           inc, bit_mask, orig, res);
        /*
         * Only the multiples of (bit_mask << 1) contained in inc are carried
         * to the upper levels; the remainder stays in the per-cpu level-0
         * counter. If the bit_mask bit flipped between orig and res without
         * a sign change (no wrap-around), carry one extra bit_mask in the
         * direction of inc.
         */
        if (inc < 0) {
                inc = -(-inc & ~((bit_mask << 1) - 1));
                /* xor bit_mask, same sign: underflow */
                if (((orig & bit_mask) ^ (res & bit_mask)) &&
                    __counter_same_sign(orig, res))
                        inc -= bit_mask;
        } else {
                inc &= ~((bit_mask << 1) - 1);
                /* xor bit_mask, same sign: overflow */
                if (((orig & bit_mask) ^ (res & bit_mask)) &&
                    __counter_same_sign(orig, res))
                        inc += bit_mask;
        }
        if (inc)
                counter_add_slowpath(counter, inc);
}

void counter_add_slowpath(struct percpu_counter *counter, long inc)
{
        struct percpu_counter_level_item *item = counter->items;
        unsigned int level_items = counter->nr_cpus >> 1;
        unsigned int level, nr_levels = counter->nr_levels;
        long bit_mask = counter->level0_bit_mask;
        int cpu = rseq_current_cpu_raw();

        for (level = 1; level < nr_levels; level++) {
                long orig, res;
                long *count = &item[cpu & (level_items - 1)].count;

                bit_mask <<= 1;
                res = uatomic_add_return(count, inc, CMM_RELAXED);
                orig = res - inc;
                counter_dbg_printf("counter_add_slowpath: level %d, inc: %ld, bit_mask: %ld, orig %ld, res %ld\n",
                                   level, inc, bit_mask, orig, res);
                /* Same carry computation as in counter_add, one level up. */
                if (inc < 0) {
                        inc = -(-inc & ~((bit_mask << 1) - 1));
                        /* xor bit_mask, same sign: underflow */
                        if (((orig & bit_mask) ^ (res & bit_mask)) &&
                            __counter_same_sign(orig, res))
                                inc -= bit_mask;
                } else {
                        inc &= ~((bit_mask << 1) - 1);
                        /* xor bit_mask, same sign: overflow */
                        if (((orig & bit_mask) ^ (res & bit_mask)) &&
                            __counter_same_sign(orig, res))
                                inc += bit_mask;
                }
                item += level_items;
                level_items >>= 1;
                if (!inc)
                        return;
        }
        counter_dbg_printf("counter_add_slowpath: last level add %ld\n", inc);
        uatomic_add(&item->count, inc, CMM_RELAXED);
}

Feedback is welcome!

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
