On Fri, Apr 4, 2025 at 6:51 PM Kairui Song <[email protected]> wrote:
>
> On Thu, Apr 3, 2025 at 10:31 PM Mateusz Guzik <[email protected]> wrote:
> > Note there are 2 unrelated components in that patchset:
> > - one per-cpu instance of rss counters which is rolled up on context
> >   switches, avoiding the costly counter alloc/free on mm
> >   creation/teardown
> > - cpu iteration in get_mm_counter
> >
> > The allocation problem is fixable without abandoning the counters, see
> > my other e-mail (tl;dr let mm's hanging out in slab caches *keep* the
> > counters). This aspect has to be solved anyway due to mm_alloc_cid().
> > Providing a way to sort it out covers *both* the rss counters and the
> > cid thing.
>
> It's not just about the fork performance, on some servers there could
> be ~100K processes and ~200 CPUs, that will be hundreds of MBs of
> memory just for the counters.
>
> And nowadays it's not something uncommon for a desktop to have ~64
> CPUs and ~10K processes.
>
> If we use a single shared "per-cpu" counter (as in the patch), the
> total consumption will always be only about just dozens of bytes.
>

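For a rough sense of the figures quoted above, here is a back-of-the-envelope sketch. It assumes 4 RSS counter types per mm and a 4-byte per-CPU slot per counter. Both numbers are assumptions for illustration, the helper names are hypothetical, and real percpu allocations may round up.

```python
# Rough estimate of memory used by per-mm percpu RSS counters.
# Assumptions (not stated in the thread): 4 counter types per mm and a
# 4-byte per-CPU slot per counter; real percpu allocations may round up.

COUNTERS_PER_MM = 4      # assumed number of RSS counter types
BYTES_PER_CPU_SLOT = 4   # assumed size of one per-CPU counter slot


def percpu_rss_bytes(nr_processes: int, nr_cpus: int) -> int:
    """Total bytes of per-CPU counter slots across all processes."""
    return nr_processes * nr_cpus * COUNTERS_PER_MM * BYTES_PER_CPU_SLOT


def to_mib(nr_bytes: int) -> float:
    """Convert a byte count to MiB."""
    return nr_bytes / (1024 * 1024)


if __name__ == "__main__":
    # The two configurations mentioned in the quoted text.
    for label, procs, cpus in (("server", 100_000, 200),
                               ("desktop", 10_000, 64)):
        print(f"{label}: {to_mib(percpu_rss_bytes(procs, cpus)):.1f} MiB")
```

Under these assumptions the server case comes to about 305 MiB and the desktop case to about 10 MiB, which is in line with the "hundreds of MBs" estimate.
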
I agree there is a tradeoff here and your approach saves memory in
exchange for more work during a context switch. I have no opinion
which way to go here.

> >
> > In your patchset the accuracy increase comes at the expense of walking
> > all CPUs every time, while a big part of the point of using percpu
> > counters is to have a good enough approximation somewhere that this is
> > not necessary.
>
> It usually doesn't walk all CPUs, only the CPUs that actually used
> that mm_struct, by checking mm_struct's cpu_bitmap. I didn't check if
> all arch uses that bitmap though.
>
> It's true that one CPU having its bit set on one mm_struct's
> cpu_bitmap doesn't mean it updated the RSS counter so there will be
> false positives, the false positive rate is low as schedulers don't
> shuffle processes between processors randomly, and not every process
> will be ran at a period.
>
> Also per my observation the reader side is much colder compared to
> updater for /proc.
>

Per my earlier comment, the read side is exercised a lot by mmap and
munmap, so it cannot be taken lightly. You can check yourself with
bpftrace.

While I can agree the vast majority of processes are not very
thread-heavy and the vast majority of machines out there don't have
hundreds of cores, this does have to behave sanely for the cases which
*do* exhibit these conditions. For example a box with > 200 cores and
200+ threads to boot, all running on the entirety of the box.

In your patch as posted, fetching the value will force the walk *a lot*
and is consequently a no-go.

This aspect needs to be dealt with for the patchset to be ok. Otherwise
a few months down the road someone else will show up and complain about
a new slowdown stemming from it.

-- 
Mateusz Guzik <mjguzik gmail.com>
