* Mathieu Desnoyers <[email protected]> [250408 15:40]:
> On 2025-04-08 13:41, Liam R. Howlett wrote:
> > * Matthew Wilcox <[email protected]> [250408 13:03]:
> > > On Tue, Apr 08, 2025 at 09:37:18AM -0700, Christoph Lameter (Ampere) wrote:
> > > > > The hierarchical per-CPU counters propagate a sum approximation
> > > > > through a binary tree. When reaching the batch size, the carry is
> > > > > propagated through a binary tree which consists of log2(nr_cpu_ids)
> > > > > levels. The batch size for each level is twice the batch size of
> > > > > the prior level.
> > > >
> > > > A binary tree? Could we do this N-way? Otherwise the tree will be 8
> > > > levels on a 512 cpu machine. Given the inflation of the number of
> > > > cpus this scheme better work up to 8K cpus.
> > >
> > > I find that a fan-out somewhere between 8 and 16 works well in practice.
> > > log16(512) gives a 3 level tree as does a log8 tree. log16(8192) is a 4
> > > level tree whereas log8(8192) is a 5 level tree. Not a big difference
> > > either way.
> > >
> > > Somebody was trying to persuade me that a new tree type that maintained
> > > additional information at each level of the tree to make some operations
> > > log(log(N)) would be a better idea than a B-tree that is log(N). I
> > > countered that a wider tree made the argument unsound at any size tree
> > > up to 100k. And we don't tend to have _that_ many objects in a
> > > data structure inside the kernel.
> > 
> > I still maintain vEB trees are super cool, but I am glad we didn't try
> > to implement an RCU safe version.
> > 
> > > ceil(log14(100,000)) = 5
> > > ceil(log2(log2(100,000))) = 5
> > > 
> > > at a million, there's actually a gap, 6 vs 5. But constant factors
> > > become a much larger factor than scalability arguments at that point.
> > 
> > In retrospect, it seems more of a math win than a practical win - and
> > only really the O(n) bounds.
> > Beyond what willy points out, writes rippling up the tree should be a
> > concern for most users since it will impact the restart of readers and
> > negatively affect the writer speed - but probably not here (hot plug?).
> 
> This implementation of hierarchical per-cpu counters is lock-free
> for increment/decrement *and* for precise/approximate sums.
> 
> The increment/decrement use:
> 
> - this_cpu_add_return on the fast-path,
> - atomic_add_return_relaxed for intermediate levels carry propagation,
> - atomic_add for approximate sum updates.
> 
> The precise sum iterates on all possible cpus, loading their current
> value. The approximate sum simply loads the current value of the
> approximate sum.
> 
> So I'm unsure about your concern of writers restarting readers, because
> this tree does not rely on mutual exclusion between updaters and
> readers, nor does it rely on cmpxchg-based retry mechanisms in readers.
I don't think it matters, but I'm not sure how hot-plug affects the tree.

> I agree with you that updates going all the way up the tree may
> negatively affect the updater and approximate sum reader performance due
> to bouncing of the counter cache line across CPUs.
> 
> > Working in (multiples of) cacheline sized b-tree nodes makes the most
> > sense, in my experience.
> 
> I'm confused. Can you explain how this recommendation can practically
> apply to the hierarchical counters ?

It would apply if you switch to a b-tree with a larger branching factor.

Thanks,
Liam
