On 22.01.26 18:36, Roger Pau Monné wrote:
> On Thu, Jan 22, 2026 at 05:21:12PM +0000, Andrew Cooper wrote:
>> On 22/01/2026 3:56 pm, Jürgen Groß wrote:
>>> Just as a heads up: a hardware partner of SUSE has seen hard lockups
>>> of the Linux kernel during boot on a new machine. This machine has
>>> 8 NUMA nodes and 960 CPUs. The hang occurs in roughly 1.5% of the boot
>>> attempts in MTRR initialization of the APs.
>>>
>>> I have sent a small patch series to LKML which seems to fix the problem:
>>> https://lore.kernel.org/lkml/[email protected]/
>>>
>>> As Xen MTRR handling is taken from the Linux kernel, I guess the same
>>> problem could happen in Xen, too.
>>>
>>> As the hang always occurred while waiting for the lock, which is
>>> serializing the single CPUs doing MTRR initialization, my solution was
>>> to eliminate the lock, allowing all APs to init MTRRs in parallel.
>>>
>>> Maybe we want to do the same in Xen.
>>
>> I suspect Xen might be insulated by the fact that we don't have parallel
>> AP start (yet), so we don't have the whole system competing on the
>> spinlock at once.
>
> Oh, I think I've misunderstood the issue.  Linux is doing MTRR init in
> the AP startup path, and so if it takes too long Linux will report
> that the AP has failed to start.

No, Linux is deferring the MTRRs until all APs are up, just like Xen
(or Xen does it like Linux).

> This is not an issue on Xen because MTRR initialization is deferred
> until all APs are up, and hence is not part of the timed AP start
> path.  This optimization was done in:
>
> 0d22c8d92c6c x86: CPU synchronization while doing MTRR register update
>
> So even if we did parallel AP startup we won't likely be affected,
> because we would still defer the MTRR setup until all APs are up.

We will be affected, as it's the deferred MTRR setup which is the
problem.
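
As a rough illustration of the two variants discussed in this thread
(deferred init serialized by a lock versus deferred init with no lock),
here is a minimal user-space sketch. It is not taken from the patch
series or from Xen; the names deferred_init, per_cpu_init and state are
hypothetical, threads stand in for CPUs, and it assumes each CPU only
writes its own state. It does not attempt to reproduce the hang.

```python
import threading


def deferred_init(n_cpus, serialize):
    """Bring up n_cpus workers, wait until all are up, then run a per-CPU init.

    Hypothetical model only: each worker writes nothing but its own slot
    in `state`, which is the assumption under which dropping the lock is safe.
    """
    all_up = threading.Barrier(n_cpus)  # stands in for "all APs are up"
    lock = threading.Lock()             # the serializing lock under discussion
    state = [None] * n_cpus             # one slot per CPU, no shared writes

    def per_cpu_init(cpu):
        state[cpu] = ("initialized", cpu)

    def worker(cpu):
        all_up.wait()                   # init is deferred until every CPU has arrived
        if serialize:
            with lock:                  # one CPU at a time
                per_cpu_init(cpu)
        else:
            per_cpu_init(cpu)           # all CPUs may run the init concurrently

    threads = [threading.Thread(target=worker, args=(cpu,)) for cpu in range(n_cpus)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state
```

In this model both variants end in the same per-CPU state; the only
difference is whether the CPUs queue up behind one lock after the
barrier, which is the point where the hang was seen.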


Juergen
