Hi,

On 03. 03. 26, 14:23, Matthieu Baerts wrote:
On 26/02/2026 11:37, Jiri Slaby wrote:
On 06. 02. 26, 12:54, Matthieu Baerts wrote:
Our CI for the MPTCP subsystem is now regularly hitting various stalls
before even starting the MPTCP test suite. These issues are visible on
top of the latest net and net-next trees, which have been synced with
Linus' tree yesterday. All these issues have been seen on a "public CI"
using GitHub-hosted runners with KVM support, where the tested kernel is
launched in a nested (I suppose) VM. I can see the issue with or without
debug.config. According to the logs, it might have started around
v6.19-rc0, but I was unavailable for a few weeks, and I couldn't react
quicker, sorry for that. Unfortunately, I cannot reproduce this locally,
and the CI doesn't currently have the ability to execute bisections.

Hmm, after the switch of the qemu guest kernels to 6.19, our (opensuse)
build service is stalling in smp_call_function_many_cond() randomly too:
https://bugzilla.suse.com/show_bug.cgi?id=1258936

The attachment from there contains sysrq-t logs too:
https://bugzilla.suse.com/attachment.cgi?id=888612

I'm glad I'm not the only one with this issue :)

In your case, do you also have nested VMs with KVM support?

No, it's KVM directly on bare metal.

Are you able to easily reproduce the issue and change the guest kernel
in your build service?

Unfortunately no and no.

On my side, any debugging steps need to be automated. Lately, it looks
like the issue is more easily triggered on a stable 6.19 kernel than on
the last RC.

The stalls happen before starting the MPTCP test suite. The init program
creates a VSOCK listening socket via socat [1], and different hangs are
then visible: RCU stalls followed by a soft lockup [2], only a soft
lockup [3], sometimes the soft lockup comes with a delay [4] [5], or
there are no RCU stalls or soft lockups detected after one minute, but the VM
is stalled [6]. In the last case, the VM is stopped after having
launched GDB to get more details about what was being executed.

It feels like the issue is not directly caused by the VSOCK listening
socket, but the stalls always happen after having started the socat
command [1] in the background.

It fails randomly while building random packages (go, libreoffice,
bayle, ...). I don't think it is VSOCK related in those cases, but who
knows what the builds do...

Indeed, unlikely to be VSOCK then.

I cannot reproduce locally either.

I came across:
   614da1d3d4cd x86: make page fault handling disable interrupts properly
but I have no idea if it could have any impact on this at all.

Did it help to revert it?

We haven't tried; it is unlikely to be the cause.

--
js
suse labs
