Hi Thomas,

On 10/22/25 6:25 PM, Thomas Gleixner wrote:
The MM CID management has two fundamental requirements:

   1) It has to guarantee that at no given point in time the same CID is
      used by concurrent tasks in userspace.

   2) The CID space must not exceed the number of possible CPUs in a
      system. While most allocators (glibc, tcmalloc, jemalloc) do not
      care about that, there seems to be at least some LTTng library
      depending on it.

The CID space compaction itself is not a functional correctness
requirement; it is only a useful optimization mechanism to reduce the
memory footprint in unused user space pools.


Just wondering: if there is no user space request for CIDs, should this whole
mechanism be placed behind a static check to avoid any overhead?

(Trying to understand this change. Very interesting. Ignore if this is not
applicable.)

The optimal CID space is:

     min(nr_tasks, nr_cpus_allowed);

Where @nr_tasks is the number of actual user space threads associated with
the mm and @nr_cpus_allowed is the superset of all task affinities. It is
grow-only, as it would be insane to take a racy snapshot of all task
affinities when the affinity of one task changes, just to redo it 2
milliseconds later when the next task changes its affinity.
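
(If I read the grow-only part right, an affinity change amounts to something
like the sketch below. This is only my reading back of the text;
mm_cpus_allowed() and mm_cid.nr_cpus_allowed are assumptions for
illustration, not names taken from the series.)

static inline void mm_cid_affinity_changed(struct mm_struct *mm,
                                           const struct cpumask *new_mask)
{
        /* Grow-only: accumulate into the superset instead of re-snapshotting */
        cpumask_or(mm_cpus_allowed(mm), mm_cpus_allowed(mm), new_mask);
        WRITE_ONCE(mm->mm_cid.nr_cpus_allowed,
                   cpumask_weight(mm_cpus_allowed(mm)));
}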

That means that as long as the number of tasks is less than or equal to the
number of CPUs allowed, each task owns a CID. If the number of tasks
exceeds the number of CPUs allowed, it switches to per CPU mode, where the
CPUs own the CIDs and the tasks borrow them as long as they are scheduled
in.
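
(So the mode selection, as I understand it, boils down to roughly the
following. Again this is my guess at the logic; the mode field, its values
and nr_tasks are made-up names, only max_cids appears in the patch.)

static inline void mm_cid_update_mode(struct mm_struct *mm)
{
        unsigned int nr_tasks = mm->mm_cid.nr_tasks;
        unsigned int nr_cpus_allowed = mm->mm_cid.nr_cpus_allowed;

        /* Optimal CID space: min(nr_tasks, nr_cpus_allowed) */
        WRITE_ONCE(mm->mm_cid.max_cids, min(nr_tasks, nr_cpus_allowed));

        /*
         * nr_tasks <= nr_cpus_allowed: each task owns a CID. Otherwise the
         * CPUs own the CIDs and tasks borrow them while scheduled in.
         */
        WRITE_ONCE(mm->mm_cid.mode, nr_tasks <= nr_cpus_allowed ?
                   MM_CID_PER_TASK : MM_CID_PER_CPU);
}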

For transition periods CIDs can go beyond the optimal space as long as they
don't go beyond the number of possible CPUs.

The current upstream implementation tries to keep the CID with the task
even in overcommit situations, which complicates task migration. It also
has to do the CID space consolidation work from a task work in the exit to
user space path. As that work is assigned to a random task related to an
MM, this can inflict unwanted exit latencies.

Implement the context switch parts of a strict ownership mechanism to
address this.

This removes most of the work from the task which schedules out. Only
during the transition from per CPU to per task ownership is it required to
drop the CID when leaving the CPU to prevent CID space exhaustion. Other
than that, scheduling out is just a single check and branch.
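
(Checking my understanding of the schedule-out side with a sketch; all
helper and field names below are my own placeholders, only MM_CID_UNSET is
taken from the patch.)

static inline void mm_cid_schedule_out(struct task_struct *prev,
                                       struct mm_struct *mm)
{
        /* Common case: a single check and branch, nothing else to do */
        if (likely(READ_ONCE(mm->mm_cid.mode) != MM_CID_TO_PER_TASK))
                return;

        /*
         * Transitioning from per CPU to per task ownership: drop the CID
         * when leaving the CPU so the CID space cannot be exhausted.
         */
        mm_drop_cid(mm, prev->mm_cid.cid);
        prev->mm_cid.cid = MM_CID_UNSET;
}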

The task which schedules in has to check whether:

     1) The ownership mode changed
     2) The CID is within the optimal CID space

In stable situations this results in zero work. The only short disruption
is when ownership mode changes or when the associated CID is not in the
optimal CID space. The latter only happens when tasks exit and therefore
the optimal CID space shrinks.
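
(And the schedule-in side would then look roughly like this; same
disclaimer, the names are placeholders except for mm_get_cid(),
mm_cid.max_cids and MM_CID_UNSET, which appear in the patch below.)

static inline void mm_cid_schedule_in(struct task_struct *next,
                                      struct mm_struct *mm)
{
        unsigned int max_cids = READ_ONCE(mm->mm_cid.max_cids);
        unsigned int mode = READ_ONCE(mm->mm_cid.mode);

        /* Stable situation: mode unchanged and CID within the optimal space */
        if (likely(next->mm_cid.mode == mode && next->mm_cid.cid < max_cids))
                return;

        /* Short disruption: return the stale CID and allocate a fresh one */
        if (next->mm_cid.cid != MM_CID_UNSET)
                mm_drop_cid(mm, next->mm_cid.cid);
        next->mm_cid.cid = mm_get_cid(mm);
        next->mm_cid.mode = mode;
}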

That mechanism is strictly optimized for the common case where no change
happens. The only case where it actually causes a temporary one-time spike
is on mode changes, when and only when a lot of tasks related to an MM
schedule at exactly the same time and eventually have to compete for
allocating a CID from the bitmap.

In the sysbench test case which triggered the spinlock contention in the
initial CID code, __schedule() drops significantly in perf top on a 128
Core (256 threads) machine when running sysbench with 255 threads, which
fits into the task mode limit of 256 together with the parent thread:

   Upstream  rseq/perf branch  +CID rework
   0.42%     0.37%             0.32%          [k] __schedule

Increasing the number of threads to 256, which puts the test process into
per CPU mode, looks about the same.

Signed-off-by: Thomas Gleixner <[email protected]>
---

[...]
+static inline unsigned int __mm_get_cid(struct mm_struct *mm, unsigned int max_cids)
+{
+       unsigned int cid = find_first_zero_bit(mm_cidmask(mm), max_cids);
+
+       if (cid >= max_cids)
+               return MM_CID_UNSET;
+       if (test_and_set_bit(cid, mm_cidmask(mm)))
+               return MM_CID_UNSET;
+       return cid;
+}
+
+static inline unsigned int mm_get_cid(struct mm_struct *mm)
+{
+       unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));
+
+       for (; cid == MM_CID_UNSET; cpu_relax())

This triggers a compile error on ppc64le.

In file included from ./include/vdso/processor.h:10,
                 from ./arch/powerpc/include/asm/processor.h:9,
                 from ./include/linux/sched.h:13,
                 from ./include/linux/sched/affinity.h:1,
                 from kernel/sched/sched.h:8,
                 from kernel/sched/rq-offsets.c:5:
kernel/sched/sched.h: In function ‘mm_get_cid’:
./arch/powerpc/include/asm/vdso/processor.h:26:9: error: expected expression before ‘asm’
   26 |         asm volatile(ASM_FTR_IFCLR(                                     \
      |         ^~~
kernel/sched/sched.h:3615:37: note: in expansion of macro ‘cpu_relax’
 3615 |         for (; cid == MM_CID_UNSET; cpu_relax())


+linuxppc-dev.


Moving it into a while loop seems to make the compiler happy.
Need to see why, but thought of informing you first.
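
A guess at the why, from the error output: cpu_relax() on powerpc is a
macro whose expansion starts with a bare asm statement (line 26 of
asm/vdso/processor.h above), and GNU C treats asm as a statement, not an
expression, so it cannot sit in the third clause of a for loop. On x86,
AFAICS, cpu_relax() is a static inline function, i.e. an expression, which
is why the loop compiles there. A minimal standalone illustration (nothing
below is kernel code, just a sketch):

/* Statement-style macro, like the powerpc cpu_relax() */
#define relax_stmt()    asm volatile("" ::: "memory")

/* Expression-style wrapper, like the x86 cpu_relax() */
static inline void relax_expr(void)
{
        asm volatile("" ::: "memory");
}

void spin_until_set(volatile int *flag)
{
        /*
         * for (; !*flag; relax_stmt())
         *         ;
         * fails with "expected expression before 'asm'": the third clause
         * of a for loop must be an expression, and an asm is a statement.
         */
        for (; !*flag; relax_expr())    /* a function call is an expression */
                ;

        /* Moving the macro into the loop body puts it back in statement position */
        while (!*flag)
                relax_stmt();
}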

 kernel/sched/sched.h | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 50bc0eb08a66..8a6ad4fc9534 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3612,8 +3612,11 @@ static inline unsigned int mm_get_cid(struct mm_struct *mm)
 {
        unsigned int cid = __mm_get_cid(mm, READ_ONCE(mm->mm_cid.max_cids));
-       for (; cid == MM_CID_UNSET; cpu_relax())
-               cid = __mm_get_cid(mm, num_possible_cpus());
+       while (cid == MM_CID_UNSET) {
+               cid = __mm_get_cid(mm, num_possible_cpus());
+               if (cid == MM_CID_UNSET)
+                       cpu_relax();
+       }
        return cid;
 }




+               cid = __mm_get_cid(mm, num_possible_cpus());
+
+       return cid;
+}
+
