Sort of resolved: We were able to find a good-enough workaround. In case
anyone else is running into this, here's what we did:

We drop to the u-boot console and run:

```
mw.l 0x1e6e2188 0xbabecafe
```

This sets the magic number in the SCU regardless of how the race goes, and the
second core gets released from its mailbox polling loop.
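
For anyone who wants to see why the ordering doesn't matter once the write
happens from the console, here is a toy Python model of the handshake. It is
only a sketch of the hypothesis in the quoted message below, not u-boot or
QEMU code: the `boot` function, its parameters, and the single-variable
"mailbox" are made up for illustration, and it assumes the secondary core has
already passed its mailbox clear by the time the console prompt is reachable.

```python
MAGIC = 0xBABECAFE  # the value written by the mw.l command above


def boot(primary_signals_first, console_rewrite=False):
    """Toy model of the SMP mailbox handshake (hypothetical, for illustration).

    Returns True if the secondary core would see the magic number and
    leave its polling loop, False if it would spin forever.
    """
    mailbox = 0
    if primary_signals_first:
        # Bad ordering: the primary core runs far ahead and signals first ...
        mailbox = MAGIC
        # ... then the secondary core's reset path clears the mailbox.
        mailbox = 0
    else:
        # Good ordering: the secondary core clears the mailbox, starts polling ...
        mailbox = 0
        # ... and the primary core signals afterwards.
        mailbox = MAGIC

    if console_rewrite:
        # Workaround: write the magic number again from the console.
        # Assumption: the secondary core is past its clear by this point,
        # so nothing overwrites this value.
        mailbox = MAGIC

    return mailbox == MAGIC


if __name__ == "__main__":
    print("good ordering:", boot(primary_signals_first=False))
    print("bad ordering:", boot(primary_signals_first=True))
    print("bad ordering + console write:",
          boot(primary_signals_first=True, console_rewrite=True))
```

In this model the bad ordering is the only case where the secondary core stays
stuck, and the extra write from the console fixes it because it lands after
the clear no matter which core won.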

On Thu, Jan 11, 2024 at 9:38 AM Stephen Longfield <slongfi...@google.com>
wrote:

> We’ve noticed inconsistent behavior when running a large number of aspeed
> ast2600 executions that seems to be tied to a race condition in the SMP
> boot when executing on TCG-QEMU, and were wondering what a good mitigation
> strategy might be.
>
> The problem first shows up as part of SMP boot. On a run that’s likely to
> later run into issues, we’ll see something like:
>
> ```
> [    0.008350] smp: Bringing up secondary CPUs ...
> [    1.168584] CPU1: failed to come online
> [    1.187277] smp: Brought up 1 node, 1 CPU
> ```
>
> Compared to the more likely to succeed:
>
> ```
> [    0.080313] smp: Bringing up secondary CPUs ...
> [    0.093166] smp: Brought up 1 node, 2 CPUs
> [    0.093345] SMP: Total of 2 processors activated (4800.00 BogoMIPS).
> ```
>
> It’s somewhat reliably reproducible by running the ast2600-evb with an
> OpenBMC image, using ‘-icount auto’ to slow execution and make the race
> condition more frequent (it happens without this, just easier to debug if
> we can reproduce):
>
>
> ```
> ./aarch64-softmmu/qemu-system-aarch64 -machine ast2600-evb -nographic \
>   -drive file=~/bmc-bin/image-obmc-ast2600,if=mtd,bus=0,unit=0,snapshot=on \
>   -nic user -icount auto
> ```
>
> Our current hypothesis is that the problem comes up in the platform
> uboot.  As part of the boot, the secondary core waits for the smp mailbox
> to get a magic number written by the primary core:
>
>
> https://github.com/AspeedTech-BMC/u-boot/blob/aspeed-master-v2019.04/arch/arm/mach-aspeed/ast2600/platform.S#L168
>
> However, this memory address is cleared on boot:
>
>
> https://github.com/AspeedTech-BMC/u-boot/blob/aspeed-master-v2019.04/arch/arm/mach-aspeed/ast2600/platform.S#L146
>
> The race condition occurs if the primary core runs far ahead of the
> secondary core: if the primary core gets to the point where it signals the
> secondary core’s mailbox before the secondary core gets past the point
> where it does the initial reset and starts waiting, the reset will clear
> the signal, and then the secondary core will never get past the point where
> it’s looping in `poll_smp_mbox_ready`.
>
> We’ve observed this race happening by dumping all SCU reads and writes,
> and validated that this is the problem by using a modified `platform.S`
> that doesn’t clear the `=SCU_SMP_READY` mailbox on reset, but we would rather
> not have to use a modified version of SMP boot just for QEMU-TCG execution.
>
> Is there a way to have QEMU insert a barrier synchronization at some point
> in the bootloader?  I think getting both cores past the `=SCU_SMP_READY`
> reset would get rid of this race, but I’m not aware of a way to do that
> kind of thing in QEMU-TCG.
>
> Thanks for any insights!
>
> --Stephen
>
> ---
>
> P.S. Additional note about the aspeed platform.S:
>
> Clearing the mailbox was added in this patch:
>
>
> https://github.com/AspeedTech-BMC/u-boot/commit/55825c55d1dabc00e37999a38495ed05c901bec2
>
> At the time, the write to what was then known as
> `=AST_SMP_MBOX_FIELD_READY` (now `=SCU_SMP_READY`) happened after
> `scu_unlock`.  But, when the boot flow was revised in
>
>
> https://github.com/AspeedTech-BMC/u-boot/commit/46a48bbe56c1e790c9bd1794364db86ec609c48e
> the `scu_unlock` was moved to primary core boot, so, unless the primary core
> wins the race, it doesn’t seem like the mailbox ready clear actually will
> have any effect, since it’ll be writing while the SCU is locked.
>
