On Wed 14-01-26 14:19:38, Mathieu Desnoyers wrote:
> On 2026-01-14 11:41, Michal Hocko wrote:
> >
> > One thing you should probably mention here is the memory consumption of
> > the structure.
>
> Good point.
>
> The most important parts are the per-cpu counters and the tree items
> which propagate the carry.
>
> In the proposed implementation, the per-cpu counters are allocated
> within per-cpu data structures, so they end up using:
>
>   nr_possible_cpus * sizeof(unsigned long)
>
> In addition, the tree items are appended at the end of the mm_struct.
> The number of those items is defined by the per_nr_cpu_order_config
> table "nr_items" field.
>
> Each item is aligned on cacheline size (typically 64 bytes) to minimize
> false sharing.
>
> Here is the footprint for a few nr_cpus on a 64-bit arch:
>
>   nr_cpus  percpu counters (bytes)  nr_items  items size (bytes)  total (bytes)
>       2              16                  1            64                  80
>       4              32                  3           192                 224
>       8              64                  7           448                 512
>      64             512                 21          1344                1856
>     128            1024                 21          1344                2368
>     256            2048                 37          2368                4416
>     512            4096                 73          4672                8768
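>
> For illustration, here is a minimal userspace sketch of this footprint
> computation. The nr_items values are taken from the table above; the
> helper itself is hypothetical and not part of the kernel code:
>
> #include <stdio.h>
>
> #define CACHELINE_SIZE	64	/* assumed item alignment */
>
> /*
>  * Hypothetical helper: footprint of one counter tree instance, in
>  * bytes, given nr_possible_cpus and the nr_items value picked by the
>  * per_nr_cpu_order_config table for that cpu count.
>  */
> static unsigned long hpcc_footprint(unsigned long nr_possible_cpus,
> 				    unsigned long nr_items)
> {
> 	/* One unsigned long counter per possible cpu... */
> 	unsigned long percpu = nr_possible_cpus * sizeof(unsigned long);
> 	/* ...plus one cacheline-aligned intermediate item per tree node. */
> 	unsigned long items = nr_items * CACHELINE_SIZE;
>
> 	return percpu + items;
> }
>
> int main(void)
> {
> 	/* { nr_cpus, nr_items } pairs from the table above. */
> 	static const unsigned long config[][2] = {
> 		{ 2, 1 }, { 4, 3 }, { 8, 7 }, { 64, 21 },
> 		{ 128, 21 }, { 256, 37 }, { 512, 73 },
> 	};
>
> 	for (size_t i = 0; i < sizeof(config) / sizeof(config[0]); i++)
> 		printf("%4lu cpus: %5lu bytes\n", config[i][0],
> 		       hpcc_footprint(config[i][0], config[i][1]));
> 	return 0;
> }
>
> This reproduces the "total" column, e.g. 2 * 8 + 1 * 64 = 80 bytes for
> 2 cpus and 512 * 8 + 73 * 64 = 8768 bytes for 512 cpus.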
I assume this is nr_possible_cpus, not NR_CPUS, right?

> There are of course various trade-offs we can make here. We can:
>
> * Increase the n-arity of the intermediate items to shrink the nr_items
>   required for a given nr_cpus (a uniform-arity version of this is
>   sketched at the end of this message). This will increase contention
>   of carry propagation across more cores.
>
> * Remove cacheline alignment of intermediate tree items. This will
>   shrink the memory needed for tree items, but will increase false
>   sharing.
>
> * Represent intermediate tree items as a byte rather than a long.
>   This further reduces the memory required for intermediate tree
>   items, but further increases false sharing.
>
> * Represent per-cpu counters as bytes rather than longs. This makes
>   the "sum" operation trickier, because it needs to iterate on the
>   intermediate carry propagation nodes as well and synchronize with
>   ongoing "tree add" operations. It further reduces memory use.
>
> * Implement a custom strided allocator for the carry propagation bytes
>   of intermediate items. This shares cachelines across different tree
>   instances, keeping good locality: all accesses from a given location
>   in the machine topology touch the same cacheline for the various
>   tree instances. This adds complexity, but provides compactness as
>   well as minimal false sharing.
>
> Compared to this, the upstream percpu counters use a 32-bit integer
> per cpu (4 bytes), and accumulate into a 64-bit global value.
>
> So yes, there is an extra memory footprint added by the current hpcc
> implementation, but if it's an issue we have various options to
> consider for reducing it.
>
> Is it OK if I add this discussion to the commit message, or should it
> be also added into the high level design doc within
> Documentation/core-api/percpu-counter-tree.rst ?

I would mention them in both the changelog and the documentation.

-- 
Michal Hocko
SUSE Labs
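For illustration, a minimal userspace sketch of the n-arity trade-off
mentioned above, assuming a single uniform arity across all tree levels.
The actual per_nr_cpu_order_config table can mix arities per level, so
this model only reproduces some rows of the footprint table (e.g. 64
cpus at arity 4):

#include <stdio.h>

/*
 * Hypothetical model: number of intermediate tree items needed to
 * reduce nr_cpus leaf counters with a uniform per-level arity.
 * Each level holds ceil(n / arity) items until a single root remains.
 */
static unsigned long tree_nr_items(unsigned long nr_cpus, unsigned long arity)
{
	unsigned long n = nr_cpus, total = 0;

	while (n > 1) {
		n = (n + arity - 1) / arity;
		total += n;
	}
	return total;
}

int main(void)
{
	/* Higher arity -> fewer items, but more carry contention per item. */
	printf("64 cpus, arity 2: %lu items\n", tree_nr_items(64, 2)); /* 63 */
	printf("64 cpus, arity 4: %lu items\n", tree_nr_items(64, 4)); /* 21 */
	printf("64 cpus, arity 8: %lu items\n", tree_nr_items(64, 8)); /*  9 */
	return 0;
}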
