Hi Stephen and Cedric,

This issue hasn't been found on a real platform, but it sometimes happens in emulators, e.g. Simics.

> Adding Aspeed Engineers. This reminds me of a discussion a while ago.
>
> On 1/11/24 18:38, Stephen Longfield wrote:
> > We’ve noticed inconsistent behavior when running a large number of aspeed
> > ast2600 executions, that seems to be tied to a race condition in the smp boot
> > when executing on TCG-QEMU, and were wondering what a good mediation
> > strategy might be.
> >
> > The problem first shows up as part of SMP boot. On a run that’s likely to
> > later run into issues, we’ll see something like:
> >
> > ```
> > [ 0.008350] smp: Bringing up secondary CPUs ...
> > [ 1.168584] CPU1: failed to come online
> > [ 1.187277] smp: Brought up 1 node, 1 CPU
> > ```
> >
> > Compared to the more likely to succeed:
> >
> > ```
> > [ 0.080313] smp: Bringing up secondary CPUs ...
> > [ 0.093166] smp: Brought up 1 node, 2 CPUs
> > [ 0.093345] SMP: Total of 2 processors activated (4800.00 BogoMIPS).
> > ```
> >
> > It’s somewhat reliably reproducible by running the ast2600-evb with an
> > OpenBMC image, using ‘-icount auto’ to slow execution and make the race
> > condition more frequent (it happens without this, just easier to debug if we
> > can reproduce):
> >
> > ```
> > ./aarch64-softmmu/qemu-system-aarch64 -machine ast2600-evb -nographic -drive file=~/bmc-bin/image-obmc-ast2600,if=mtd,bus=0,unit=0,snapshot=on -nic user -icount auto
> > ```

Have you tried running QEMU with "-smp 2"?

> > Our current hypothesis is that the problem comes up in the platform uboot.
> > As part of the boot, the secondary core waits for the smp mailbox to
> > get a magic number written by the primary core:
> >
> > https://github.com/AspeedTech-BMC/u-boot/blob/aspeed-master-v2019.04/arch/arm/mach-aspeed/ast2600/platform.S#L168
> >
> > However, this memory address is cleared on boot:
> >
> > https://github.com/AspeedTech-BMC/u-boot/blob/aspeed-master-v2019.04/arch/arm/mach-aspeed/ast2600/platform.S#L146
> >
> > The race condition occurs if the primary core runs far ahead of the
> > secondary core: if the primary core gets to the point where it signals the
> > secondary core’s mailbox before the secondary core gets past the point where
> > it does the initial reset and starts waiting, the reset will clear the
> > signal, and then the secondary core will never get past the point where it’s
> > looping in `poll_smp_mbox_ready`.
> >
> > We’ve observed this race happening by dumping all SCU reads and writes,
> > and validated that this is the problem by using a modified `platform.S` that
> > doesn’t clear the =SCU_SMP_READY mailbox on reset, but would rather not
> > have to use a modified version of SMP boot just for QEMU-TCG execution.

To prevent the race condition described, both CPU#0 and CPU#1 zero SCU188 as early as possible. After that, CPU#0 has to execute at least 100 instructions before it gets the chance to set SCU188 to 0xbabecafe. On real, parallel HW, it is unusual for CPU#1 to lag behind CPU#0 by 100 instruction cycles.

> you could use '-trace aspeed_scu*' to collect the MMIO accesses on the SCU
> unit. A TCG plugin also.
>
> > Is there a way to have QEMU insert a barrier synchronization at some point
> > in the bootloader?
> > I think getting both cores past the =SCU_SMP_READY reset
> > would get rid of this race, but I’m not aware of a way to do that kind of
> > thing in QEMU-TCG.
> >
> > Thanks for any insights!
>
> Could we change the default value to registers 0x180 ... 0x18C in
> hw/misc/aspeed_scu.c to make sure the SMP regs are immune to the race ?
>
> Thanks,
>
> C.

Thanks,
Troy Lee
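
P.S. For anyone following the thread, the interleaving Stephen describes can be replayed with a toy model. This is only a sketch in Python, not U-Boot or QEMU code. The step names and the function `secondary_comes_online` are hypothetical, made up for illustration, and the model assumes the mailbox (SCU188) is a single word that reads 0 at reset and that 0xbabecafe is the only value CPU#0 writes to it.

```python
# Toy model of the SMP mailbox race discussed in this thread.
# Assumption: the mailbox is one word that reads 0 at reset.
# Step names below are hypothetical labels, not symbols from platform.S.

MAGIC = 0xBABECAFE  # value CPU#0 writes to signal CPU#1


def secondary_comes_online(steps, clear_on_entry=True):
    """Replay one interleaving and report whether CPU#1 can leave its poll loop.

    steps: ordered list of "cpu0_clear", "cpu1_clear", "cpu0_signal".
    clear_on_entry: if False, the early zeroing is skipped, which models
    a platform.S modified not to clear the mailbox.
    """
    mailbox = 0
    for step in steps:
        if step in ("cpu0_clear", "cpu1_clear"):
            if clear_on_entry:
                mailbox = 0  # early zeroization done by each core
        elif step == "cpu0_signal":
            mailbox = MAGIC  # primary core signals the secondary
        else:
            raise ValueError(f"unknown step: {step}")
    # CPU#1 polls after its own clear; it exits only if the magic survived.
    return mailbox == MAGIC


# Both cores clear before CPU#0 signals (the expected case on real HW).
IN_STEP = ["cpu0_clear", "cpu1_clear", "cpu0_signal"]
# CPU#0 runs far ahead, so CPU#1's clear wipes out the signal.
CPU0_AHEAD = ["cpu0_clear", "cpu0_signal", "cpu1_clear"]

if __name__ == "__main__":
    print(secondary_comes_online(IN_STEP))
    print(secondary_comes_online(CPU0_AHEAD))
    print(secondary_comes_online(CPU0_AHEAD, clear_on_entry=False))
```

The second ordering is the one where CPU#1 stays in `poll_smp_mbox_ready` forever. The third call corresponds to the modified `platform.S` Stephen mentioned, and it only works in this model because the mailbox is assumed to start at 0.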