On 05. 03. 26, 8:00, Jiri Slaby wrote:
> On 02. 03. 26, 12:46, Peter Zijlstra wrote:
>> On Mon, Mar 02, 2026 at 06:28:38AM +0100, Jiri Slaby wrote:
>>> The state of the lock:
>>>
>>> crash> struct rq.__lock -x ffff8d1a6fd35dc0
>>>    __lock = {
>>>      raw_lock = {
>>>        {
>>>          val = {
>>>            counter = 0x40003
>>>          },
>>>          {
>>>            locked = 0x3,
>>>            pending = 0x0
>>>          },
>>>          {
>>>            locked_pending = 0x3,
>>>            tail = 0x4
>>>          }
>>>        }
>>>      }
>>>    },
>>
>> That had me remember the below patch that never quite made it. I've
>> rebased it to something more recent so it applies.
>>
>> If you stick that in, we might get a clue as to who is owning that lock.
>> Provided it all wants to reproduce well enough.
>
> Thanks, I applied it, but to date it has still not been accepted:
> https://build.opensuse.org/requests/1335893

OK, I have a first dump with the patch applied:
  __lock = {
    raw_lock = {
      {
        val = {
          counter = 0x2c0003
        },
        {
          locked = 0x3,
          pending = 0x0
        },
        {
          locked_pending = 0x3,
          tail = 0x2c
        }
      }
    }
  },

I am not sure whether it is of any help.
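FWIW, here is a minimal decoder for those lock words, assuming the
mainline qspinlock layout from kernel/locking/qspinlock.h with
NR_CPUS < 16K (so 8 pending bits). A sketch, not authoritative. The
locked byte being 0x3 looks like _Q_SLOW_VAL from the paravirt slow
path in kernel/locking/qspinlock_paravirt.h, which would fit a qemu
guest, but that reading is my assumption:

/*
 * qspinlock value decoder (assumed layout):
 *   bits  0- 7  locked byte (_Q_LOCKED_VAL == 1, _Q_SLOW_VAL == 3)
 *   bits  8-15  pending byte
 *   bits 16-17  tail MCS node index
 *   bits 18-31  tail CPU + 1 (0 means nobody is queued)
 */
#include <stdio.h>

static void decode(unsigned int val)
{
	unsigned int locked  =  val        & 0xff;
	unsigned int pending = (val >>  8) & 0xff;
	unsigned int idx     = (val >> 16) & 0x3;
	unsigned int cpu1    =  val >> 18;	/* tail CPU + 1 */

	printf("val=%#x: locked=%#x pending=%#x tail_idx=%u tail_cpu=%d\n",
	       val, locked, pending, idx, (int)cpu1 - 1);
}

int main(void)
{
	decode(0x40003);	/* first dump:   queue tail on CPU 0  */
	decode(0x2c0003);	/* patched dump: queue tail on CPU 10 */
	return 0;
}

So in both dumps the lock is held (via the paravirt slow path, if the
_Q_SLOW_VAL reading is right) and at least one CPU is queued behind it.
The same decode applied to the LOCKDEP dump below (0x100003) gives
tail_cpu == 3, matching CPU 3 waiting for the lock.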




BUT: I have another dump, this time with LOCKDEP (but NOT the patch
above). The kernel is again spinning in mm_get_cid(), presumably
waiting for a free bit in the map, as before [1]:


[  162.660584] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
...
[  162.661378] Sending NMI from CPU 3 to CPUs 1:
[  162.661398] NMI backtrace for cpu 1
...
[  162.661411] RIP: 0010:mm_get_cid+0x54/0xc0


PID 7680 is active on CPU 1:
PID: 7680     TASK: ffff8cc4038525c0  CPU: 1    COMMAND: "asm"


CPU 3 is waiting for CPU 1's rq lock:
RDX: 0000000000000000  RSI: 0000000000000003  RDI: ffff8cc72fcb8500
...
 #3 [ffffd2e9c0083da0] raw_spin_rq_lock_nested+0x20 at ffffffff9339e700

crash> struct rq.__lock -x ffff8cc72fcb8500
  __lock = {
    raw_lock = {
      {
        val = {
          counter = 0x100003
        },
        {
          locked = 0x3,
          pending = 0x0
        },
        {
          locked_pending = 0x3,
          tail = 0x10
        }
      }
    },
    magic = 0xdead4ead,
    owner_cpu = 0x1,
    owner = 0xffff8cc4038b8000,
    dep_map = {
      key = 0xffffffff96245970 <__key.7>,
      class_cache = {0xffffffff9644b488 <lock_classes+10600>, 0x0},
      name = 0xffffffff94ba3ab3 "&rq->__lock",
      wait_type_outer = 0x0,
      wait_type_inner = 0x2,
      lock_type = 0x0
    }
  },

owner_cpu is 1, owner is:
PID: 7508     TASK: ffff8cc4038b8000  CPU: 1    COMMAND: "compile"

But as you can see above, CPU 1 is occupied with a different task:
crash> bt -sxc 1
PID: 7680     TASK: ffff8cc4038525c0  CPU: 1    COMMAND: "asm"

It is spinning in mm_get_cid(), as I wrote. See the objdump of
mm_get_cid() below.

[1] https://bugzilla.suse.com/show_bug.cgi?id=1258936#c17


ffffffff8139cd40 <mm_get_cid>:
mm_get_cid():
include/linux/cpumask.h:1020
ffffffff8139cd40:       8b 05 9a d7 40 02       mov    0x240d79a(%rip),%eax        # ffffffff837aa4e0 <nr_cpu_ids>
kernel/sched/sched.h:3779
ffffffff8139cd46:       55                      push   %rbp
ffffffff8139cd47:       53                      push   %rbx
include/linux/mm_types.h:1477
ffffffff8139cd48:       48 8d 9f 80 0b 00 00    lea    0xb80(%rdi),%rbx
kernel/sched/sched.h:3780 (discriminator 2)
ffffffff8139cd4f:       8b b7 0c 01 00 00       mov    0x10c(%rdi),%esi
include/linux/cpumask.h:1020
ffffffff8139cd55:       83 c0 3f                add    $0x3f,%eax
ffffffff8139cd58:       c1 e8 03                shr    $0x3,%eax
kernel/sched/sched.h:3780 (discriminator 2)
ffffffff8139cd5b:       48 89 f5                mov    %rsi,%rbp
include/linux/mm_types.h:1479 (discriminator 1)
ffffffff8139cd5e:       25 f8 ff ff 1f          and    $0x1ffffff8,%eax
include/linux/mm_types.h:1489 (discriminator 1)
ffffffff8139cd63:       48 8d 3c 43             lea    (%rbx,%rax,2),%rdi
include/linux/find.h:393
ffffffff8139cd67:       e8 44 d8 6e 00          call   ffffffff81a8a5b0 <_find_first_zero_bit>
kernel/sched/sched.h:3771
ffffffff8139cd6c:       39 e8                   cmp    %ebp,%eax
ffffffff8139cd6e:       73 7c                   jae    ffffffff8139cdec <mm_get_cid+0xac>
ffffffff8139cd70:       89 c1                   mov    %eax,%ecx
kernel/sched/sched.h:3773 (discriminator 1)
ffffffff8139cd72:       89 c2                   mov    %eax,%edx
include/linux/cpumask.h:1020
ffffffff8139cd74:       8b 05 66 d7 40 02       mov    0x240d766(%rip),%eax        # ffffffff837aa4e0 <nr_cpu_ids>
ffffffff8139cd7a:       83 c0 3f                add    $0x3f,%eax
ffffffff8139cd7d:       c1 e8 03                shr    $0x3,%eax
include/linux/mm_types.h:1479 (discriminator 1)
ffffffff8139cd80:       25 f8 ff ff 1f          and    $0x1ffffff8,%eax
include/linux/mm_types.h:1489 (discriminator 1)
ffffffff8139cd85:       48 8d 04 43             lea    (%rbx,%rax,2),%rax
arch/x86/include/asm/bitops.h:136
ffffffff8139cd89:       f0 48 0f ab 10          lock bts %rdx,(%rax)
kernel/sched/sched.h:3773 (discriminator 2)
ffffffff8139cd8e:       73 4b                   jae    ffffffff8139cddb <mm_get_cid+0x9b>
ffffffff8139cd90:       eb 5a                   jmp    ffffffff8139cdec <mm_get_cid+0xac>
arch/x86/include/asm/vdso/processor.h:13
ffffffff8139cd92:       f3 90                   pause
include/linux/cpumask.h:1020
ffffffff8139cd94:       8b 05 46 d7 40 02       mov    0x240d746(%rip),%eax        # ffffffff837aa4e0 <nr_cpu_ids>

CPU 1 was caught by the NMI here ^^^^^^^^^^^^^^^^^^^^ (mm_get_cid+0x54
from the backtrace is exactly this reload of nr_cpu_ids right after the
pause, i.e. another round of the retry loop).
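For readers not staring at the disassembly, the loop above boils down
to roughly the following. This is a sketch reconstructed from the
inlined helpers visible in the listing, not the actual source (which
lives in kernel/sched/sched.h); the names I attach to 0xb80(%rdi) and
0x10c(%rdi) are my guesses:

/* Sketch of the allocation loop: find a free CID bit, try to claim it
 * atomically, otherwise spin with pause until one is freed. */
static int mm_get_cid_sketch(struct mm_struct *mm)
{
	unsigned long *cidmask = mm_cidmask(mm);	   /* ~ 0xb80(%rdi) */
	unsigned int max = READ_ONCE(mm->nr_cpus_allowed); /* ~ 0x10c(%rdi)? */
	int cid;

	for (;;) {
		/* _find_first_zero_bit: first unused CID in the bitmap */
		cid = find_first_zero_bit(cidmask, nr_cpu_ids);
		if (cid < max && !test_and_set_bit(cid, cidmask))
			return cid;	/* the "lock bts" claimed it */
		/* bitmap exhausted (or we lost the race): this pause is
		 * where the NMI caught CPU 1 */
		cpu_relax();
	}
}

If no other task ever drops a CID, this never terminates, which matches
the stall.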




In the meantime, Michal K. and I did some digging into the qemu dumps.
Details (and a couple of previous comments) at:
https://bugzilla.suse.com/show_bug.cgi?id=1258936#c17

tl;dr:

In one of the dumps, one process sits in
   context_switch
     -> mm_get_cid (before switch_to())

> 65 kworker/1:1 SP= 0xffffcf82c022fd98 -> __schedule+0x16ee (ffffffff820f162e) -> call mm_get_cid

Michal extracted the vCPU's RIP, and it turned out:
> Hm, I'd say the CPU could be spinning in mm_get_cid() waiting for a free CID.
> ...
> ffff8a88458137c0:  000000000000000f 000000000000000f
>                                                    ^
> Hm, so indeed CIDs for all four CPUs are occupied.
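Just to spell out the arithmetic in the quoted dump, assuming the word
at ffff8a88458137c0 really is the per-mm CID bitmap (I am taking
Michal's reading on trust here):

#include <stdio.h>

int main(void)
{
	unsigned long cidmask = 0xf;	/* the word Michal pointed at */

	/* bits 0..3 set: CIDs 0..3 all taken on the 4-vCPU guest, so
	 * find_first_zero_bit() in mm_get_cid() can never succeed */
	printf("CIDs in use: %d\n", __builtin_popcountl(cidmask));
	return 0;
}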

To me (not that I know what a CID is either), this might point to
Thomas' "sched/mmcid: Cure mode transition woes" series [2] as a
possible culprit.

Funnily enough, 47ee94efccf6 ("sched/mmcid: Protect transition on
weakly ordered systems") spells:
>     As a consequence the task will
>     not drop the CID when scheduling out before the fixup is completed, which
>     means the CID space can be exhausted and the next task scheduling in will
>     loop in mm_get_cid() and the fixup thread can livelock on the held runqueue
>     lock as above.

Which sounds exactly like what happens here. Except that the patch is
part of the series above, so it is obviously already in 6.19.


I noticed there is also a 7.0-rc1 fix:
   1e83ccd5921a sched/mmcid: Don't assume CID is CPU owned on mode switch
But that got into 6.19.1 already (we are at 6.19.3), so it does not
improve the situation.

Any ideas?



[2] https://lore.kernel.org/all/[email protected]/

thanks,

--
js
suse labs

