On 9/5/22 11:41 AM, Heinrich Schuchardt wrote:
On 9/5/22 17:30, Sean Anderson wrote:
On 9/5/22 3:47 AM, Nikita Shubin wrote:
Hi Rick!

On Mon, 5 Sep 2022 14:22:41 +0800
Rick Chen <rickche...@gmail.com> wrote:

Hi,

When I free-run an SMP system, I once hit a failure case where some
harts didn't boot to the kernel shell successfully.
However, I have not been able to reproduce it since, even after many tries.

But when I set a breakpoint while debugging with GDB, the failure
triggers every time.

If a hart fails to register itself in available_harts before the main
hart reaches send_ipi_many:
https://elixir.bootlin.com/u-boot/v2022.10-rc3/source/arch/riscv/lib/smp.c#L50

it won't exit the secondary_hart_loop:
https://elixir.bootlin.com/u-boot/v2022.10-rc3/source/arch/riscv/cpu/start.S#L433
because no IPI will be sent to it.
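
To make the race concrete, here is a minimal single-threaded C model of it. This is only a sketch, not the code in arch/riscv/lib/smp.c: NR_HARTS, register_hart(), send_ipi_registered(), hart_released() and reset_model() are hypothetical names invented for illustration, and the mask and IPI flags are plain variables standing in for available_harts and the hardware IPI state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_HARTS 4 /* hypothetical hart count for this sketch */

/* Stand-in for the shared available_harts bitmask. */
static uint32_t available_harts;
/* Stand-in for each hart's pending-IPI state. */
static bool ipi_pending[NR_HARTS];

/* Clear the model so a scenario starts from a clean state. */
void reset_model(void)
{
	available_harts = 0;
	for (int hart = 0; hart < NR_HARTS; hart++)
		ipi_pending[hart] = false;
}

/* A hart marks itself as available early in its startup path. */
void register_hart(int hart)
{
	assert(hart >= 0 && hart < NR_HARTS);
	available_harts |= 1u << hart;
}

/*
 * Boot hart: send an IPI to every other hart that is present in the
 * mask right now. Returns the number of IPIs sent.
 */
int send_ipi_registered(int boot_hart)
{
	int sent = 0;

	for (int hart = 0; hart < NR_HARTS; hart++) {
		if (hart == boot_hart)
			continue;
		if (!(available_harts & (1u << hart)))
			continue; /* not registered yet: skipped for good */
		ipi_pending[hart] = true;
		sent++;
	}
	return sent;
}

/* True if the hart has an IPI and would leave its wait loop. */
bool hart_released(int hart)
{
	assert(hart >= 0 && hart < NR_HARTS);
	return ipi_pending[hart];
}
```

If harts 0 to 2 register, hart 0 then calls send_ipi_registered(0), and hart 3 registers only afterwards, hart_released(3) stays false: nothing in the model ever revisits the late hart, which matches the stuck secondary_hart_loop seen under GDB.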

Can we call send_ipi_many() again when booting?

AFAIK we do; see arch/riscv/lib/bootm.c and arch/riscv/lib/spl.c

Do we need to call it before booting?

Yes. We also call it when relocating (in SPL and U-Boot proper).


This might be exactly your case.

When working on the IPI mechanism, I considered this possibility. However,
there's really no way to know how long to wait. On normal systems, the boot
hart is going to do a lot of work before calling send_ipi_many, and the
other harts just have to make it through ~100 instructions. So I figured we
would never run into this issue.

We might not even need the mask... the only direct reason we might is for
OpenSBI, as spl_invoke_opensbi is the only function which uses the wait
parameter.
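
For comparison, a mask-free variant could look roughly like the sketch below. It assumes that a pending IPI stays latched until the target hart clears it, so a hart that reaches its wait loop late still sees it; whether that holds depends on the IPI hardware. NR_HARTS, send_ipi_all() and hart_poll_ipi() are hypothetical names for illustration, not U-Boot functions.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_HARTS 4 /* hypothetical hart count for this sketch */

/* Stand-in for a latched per-hart IPI pending bit. */
static bool ipi_pending[NR_HARTS];

/* Boot hart: raise an IPI for every other hart, without consulting a mask. */
void send_ipi_all(int boot_hart)
{
	assert(boot_hart >= 0 && boot_hart < NR_HARTS);
	for (int hart = 0; hart < NR_HARTS; hart++) {
		if (hart != boot_hart)
			ipi_pending[hart] = true;
	}
}

/*
 * Called by a hart in its wait loop: consume a pending IPI, if any,
 * and report whether the hart may proceed.
 */
bool hart_poll_ipi(int hart)
{
	assert(hart >= 0 && hart < NR_HARTS);
	bool pending = ipi_pending[hart];

	ipi_pending[hart] = false;
	return pending;
}
```

In this model a hart that only starts polling after send_ipi_all(0) has run still gets released, because no registration step can be missed. The cost is that the boot hart no longer knows which harts actually exist, which matters for callers that wait for completion.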

I think the available_harts mechanism cannot guarantee that the SMP
system boots successfully.
Maybe we should think of a better way to do SMP booting, or just
remove it?

I haven't experienced any unexplained problem with hart_lottery or
available_harts_lock unless:

1) harts are started non-simultaneously
2) SPL/U-Boot is in some kind of TCM, OCRAM, etc. that is not cleared
on reset, which leaves available_harts dirty

XIP, of course, has this problem every time and just doesn't use the mask.
I remember thinking a lot about how to deal with this, but I never ended
up sending a patch because I didn't have an XIP system.

--Sean

3) something is wrong with atomics

Also, there might be something wrong with IPI send/receive.


Thread 8 hit Breakpoint 1, harts_early_init ()

(gdb) c
Continuing.
[Switching to Thread 7]

Thread 7 hit Breakpoint 1, harts_early_init ()

(gdb)
Continuing.
[Switching to Thread 6]

Thread 6 hit Breakpoint 1, harts_early_init ()

(gdb)
Continuing.
[Switching to Thread 5]

Thread 5 hit Breakpoint 1, harts_early_init ()

(gdb)
Continuing.
[Switching to Thread 4]

Thread 4 hit Breakpoint 1, harts_early_init ()

(gdb)
Continuing.
[Switching to Thread 3]

Thread 3 hit Breakpoint 1, harts_early_init ()
(gdb)
Continuing.
[Switching to Thread 2]

Thread 2 hit Breakpoint 1, harts_early_init ()
(gdb)
Continuing.
[Switching to Thread 1]

Thread 1 hit Breakpoint 1, harts_early_init ()
(gdb)
Continuing.
[Switching to Thread 5]


Thread 5 hit Breakpoint 3, 0x0000000001200000 in ?? ()
(gdb) info threads
   Id   Target Id         Frame
   1    Thread 1 (hart 1) secondary_hart_loop () at arch/riscv/cpu/start.S:436
   2    Thread 2 (hart 2) secondary_hart_loop () at arch/riscv/cpu/start.S:436
   3    Thread 3 (hart 3) secondary_hart_loop () at arch/riscv/cpu/start.S:436
   4    Thread 4 (hart 4) secondary_hart_loop () at arch/riscv/cpu/start.S:436
* 5    Thread 5 (hart 5) 0x0000000001200000 in ?? ()
   6    Thread 6 (hart 6) 0x000000000000b650 in ?? ()
   7    Thread 7 (hart 7) 0x000000000000b650 in ?? ()
   8    Thread 8 (hart 8) 0x0000000000005fa0 in ?? ()
(gdb) c
Continuing.

Do all the "offline" harts remain in the SPL/U-Boot secondary_hart_loop?




[    0.175619] smp: Bringing up secondary CPUs ...
[    1.230474] CPU1: failed to come online
[    2.282349] CPU2: failed to come online
[    3.334394] CPU3: failed to come online
[    4.386783] CPU4: failed to come online
[    4.427829] smp: Brought up 1 node, 4 CPUs


/root # cat /proc/cpuinfo
processor       : 0
hart            : 4
isa     : rv64i2p0m2p0a2p0c2p0xv5-1p1
mmu             : sv39

processor       : 5
hart            : 5
isa     : rv64i2p0m2p0a2p0c2p0xv5-1p1
mmu             : sv39

processor       : 6
hart            : 6
isa     : rv64i2p0m2p0a2p0c2p0xv5-1p1
mmu             : sv39

processor       : 7
hart            : 7
isa     : rv64i2p0m2p0a2p0c2p0xv5-1p1
mmu             : sv39

/root #

Thanks,
Rick




