On Fri, Mar 20, 2026 at 04:42:35PM -0400, Waiman Long wrote:
> The vmstats flush threshold currently increases linearly with the
> number of online CPUs. As the number of CPUs increases over time, it
> will become increasingly difficult to meet the threshold and update the
> vmstats data in a timely manner. These days, systems with hundreds of
> CPUs or even thousands of them are becoming more common.
>
> For example, the test_memcg_sock test of test_memcontrol always fails
> when running on an arm64 system with 128 CPUs. This is because the
> threshold is now 64*128 = 8192. With a 4k page size, it takes changes
> to 32 MB of memory to reach it. It will be even worse with a larger
> page size like 64k.
>
> To make the output of memory.stat more accurate, it is better to scale
> the threshold up more slowly than linearly with the number of CPUs. The
> int_sqrt() function is a good compromise, as suggested by Li Wang [1].
> An extra 2 is added to make sure that the threshold is doubled for a
> 2-core system. The increase will be slower after that.
>
> With the int_sqrt() scaling, we can use the possibly larger
> num_possible_cpus() instead of num_online_cpus(), which may change at
> run time.
>
> Although there is supposed to be a periodic and asynchronous flush of
> vmstats every 2 seconds, the actual time lag between successive runs
> can vary quite a bit. In fact, I have seen time lags of tens of
> seconds in some cases. So we cannot rely too heavily on the assumption
> that there will be an asynchronous vmstats flush every 2 seconds. This
> may be something we need to look into.
>
> [1] https://lore.kernel.org/lkml/[email protected]/
>
> Suggested-by: Li Wang <[email protected]>
> Signed-off-by: Waiman Long <[email protected]>
> ---
>  mm/memcontrol.c | 18 +++++++++++++-----
>  1 file changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 772bac21d155..cc1fc0f5aeea 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -548,20 +548,20 @@ struct memcg_vmstats {
>   * rstat update tree grow unbounded.
>   *
>   * 2) Flush the stats synchronously on reader side only when there are more than
> - *    (MEMCG_CHARGE_BATCH * nr_cpus) update events. Though this optimization
> - *    will let stats be out of sync by atmost (MEMCG_CHARGE_BATCH * nr_cpus) but
> - *    only for 2 seconds due to (1).
> + *    (MEMCG_CHARGE_BATCH * int_sqrt(nr_cpus+2)) update events. Though this
> + *    optimization will let stats be out of sync by up to that amount. This is
> + *    supposed to last for up to 2 seconds due to (1).
>   */
>  static void flush_memcg_stats_dwork(struct work_struct *w);
>  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
>  static u64 flush_last_time;
> +static int vmstats_flush_threshold __ro_after_init;
>
>  #define FLUSH_TIME (2UL*HZ)
>
>  static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
>  {
> -	return atomic_read(&vmstats->stats_updates) >
> -		MEMCG_CHARGE_BATCH * num_online_cpus();
> +	return atomic_read(&vmstats->stats_updates) > vmstats_flush_threshold;
>  }
>
>  static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val,
> @@ -5191,6 +5191,14 @@ int __init mem_cgroup_init(void)
>
>  	memcg_pn_cachep = KMEM_CACHE(mem_cgroup_per_node,
>  				     SLAB_PANIC | SLAB_HWCACHE_ALIGN);
> +	/*
> +	 * Scale up the vmstats flush threshold with int_sqrt(nr_cpus + 2). The
> +	 * extra 2 is to make sure that the threshold is doubled for a 2-core
> +	 * system. After that, it will increase by MEMCG_CHARGE_BATCH when the
> +	 * number of CPUs reaches the next (n^2 - 2) value.
> +	 */
> +	vmstats_flush_threshold = MEMCG_CHARGE_BATCH *
> +				  (int_sqrt(num_possible_cpus() + 2));
>
>  	return 0;
>  }
Reviewed-by: Li Wang <[email protected]>

--
Regards,
Li Wang

