Re: Stalls when starting a VSOCK listening socket: soft lockups, RCU stalls, timeout

Jiri Slaby Wed, 04 Mar 2026 23:00:44 -0800

On 02. 03. 26, 12:46, Peter Zijlstra wrote:

On Mon, Mar 02, 2026 at 06:28:38AM +0100, Jiri Slaby wrote:

The state of the lock:

crash> struct rq.__lock -x ffff8d1a6fd35dc0
   __lock = {
     raw_lock = {
       {
         val = {
           counter = 0x40003
         },
         {
           locked = 0x3,
           pending = 0x0
         },
         {
           locked_pending = 0x3,
           tail = 0x4
         }
       }
     }
   },
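
For readers who want to cross-check the dump, the raw counter can be split into its qspinlock fields by hand. The following is only a sketch: the bit layout (locked byte in bits 0-7, pending in bits 8-15, tail index in bits 16-17, tail CPU + 1 from bit 18 up) is assumed from the generic qspinlock layout for NR_CPUS < 16K, and decode_qspinlock is a hypothetical helper name, not a kernel or crash function.

```python
# Sketch: decode a 32-bit qspinlock word as printed by crash.
# ASSUMPTION: generic qspinlock layout for NR_CPUS < 16K:
#   bits  0-7   locked byte
#   bits  8-15  pending byte
#   bits 16-17  tail index (per-CPU MCS node nesting level)
#   bits 18-31  tail CPU, stored as cpu + 1 (0 means "no tail")

def decode_qspinlock(val):
    """Split a raw qspinlock counter into its fields (hypothetical helper)."""
    locked = val & 0xFF
    pending = (val >> 8) & 0xFF
    tail = val >> 16                      # matches the 16-bit 'tail' member
    tail_idx = tail & 0x3
    tail_cpu = (tail >> 2) - 1 if tail else None
    return {
        "locked": locked,
        "pending": pending,
        "tail": tail,
        "tail_idx": tail_idx,
        "tail_cpu": tail_cpu,
    }

fields = decode_qspinlock(0x40003)
for name, value in fields.items():
    print(f"{name:9s} = {value}")
```

Under that assumed layout, 0x40003 gives the same locked, pending and tail values as the dump, and the tail decodes to CPU 0 queued at nesting level 0. The decode only identifies the last queued waiter, not the current owner. A locked byte of 3 rather than 1 would be consistent with the paravirt slow-path marker, but that reading is also an assumption.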



That had me remember the below patch that never quite made it. I've
rebased it to something more recent so it applies.

If you stick that in, we might get a clue as to who is owning that lock.
Provided it all wants to reproduce well enough.


Thanks, I applied it, but the submission has not been accepted yet:
https://build.opensuse.org/requests/1335893

In the meantime, Michal K. and I did some digging into the qemu dumps. Details are at (see also a couple of previous comments):

https://bugzilla.suse.com/show_bug.cgi?id=1258936#c17

tl;dr:

In one of the dumps, one process sits in
  context_switch
    -> mm_get_cid (before switch_to())

> 65 kworker/1:1 SP= 0xffffcf82c022fd98 -> __schedule+0x16ee(ffffffff820f162e) -> call mm_get_cid


Michal extracted the vCPU's RIP and it turned out:

> Hm, I'd say the CPU could be spinning in mm_get_cid() waiting for a free CID.

> ...
> ffff8a88458137c0:  000000000000000f 000000000000000f
>                                                    ^
> Hm, so indeed CIDs for all four CPUs are occupied.
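
The reading of that word can be checked mechanically. This is a minimal sketch that assumes the quoted value 0xf is the mm's CID bitmap and that the guest has four possible CIDs (one per vCPU); free_cids is a hypothetical helper, not anything from the kernel.

```python
# Sketch: list the free CIDs in a CID bitmap.
# ASSUMPTIONS: the dumped word 0xf is the mm's CID bitmap, and the
# guest has four possible CIDs (one per vCPU).

def free_cids(bitmap, max_cids):
    """Return the CID numbers whose bit is clear in 'bitmap' (hypothetical helper)."""
    return [cid for cid in range(max_cids) if not (bitmap >> cid) & 1]

available = free_cids(0xF, 4)
print("free CIDs:", available)
if not available:
    print("CID space exhausted: a task needing a CID would have to wait")
```

With all four bits set the list is empty, which matches the observation that a task entering mm_get_cid() has nothing to pick and keeps spinning.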

To me (I don't know what a CID is either), this points to Thomas' "sched/mmcid: Cure mode transition woes" [1] as a possible culprit.

Funnily enough, 47ee94efccf6 ("sched/mmcid: Protect transition on weakly ordered systems") says:

> As a consequence the task will not drop the CID when scheduling out
> before the fixup is completed, which means the CID space can be
> exhausted and the next task scheduling in will loop in mm_get_cid()
> and the fixup thread can livelock on the held runqueue lock as above.

Which sounds like exactly what happens here. Except the patch is from the series above, so it is obviously already in 6.19.



I noticed there is also a 7.0-rc1 fix:
  1e83ccd5921a sched/mmcid: Don't assume CID is CPU owned on mode switch

But that got into 6.19.1 already (we are at 6.19.3), so it does not improve the situation.


Any ideas?



[1] https://lore.kernel.org/all/[email protected]/

thanks,
--
js
suse labs
