On 28/06/2022 13:25, Matheus K. Ferst wrote:
On 27/06/2022 15:25, Frederic Barrat wrote:
[ Resending as it was meant for the qemu-ppc list ]
Hello,
I've been looking at why our qemu powernv model is so slow when booting
a compressed linux kernel, using multiple vcpus and multi-thread tcg.
With only one vcpu, the decompression time of the kernel is what it is,
but when using multiple vcpus, the decompression is actually slower. And
worse: it degrades very fast with the number of vcpus!
Rough measurement of the decompression time on an x86 laptop with
multi-thread tcg and using the qemu powernv10 machine:
1 vcpu => 15 seconds
2 vcpus => 45 seconds
4 vcpus => 1 min 30 seconds
Looking in detail, when the firmware (skiboot) hands over execution to
the linux kernel, there's one main thread entering some bootstrap code
and running the kernel decompression algorithm. All the other secondary
threads are left spinning in skiboot (1 thread per vcpu). So on paper,
with multi-thread tcg and assuming the system has enough available
physical cpus, I would expect the decompression to hog one physical cpu
and the time needed to be constant, no matter the number of vcpus.
All the secondary threads are left spinning in code like this:
    for (;;) {
        if (cpu_check_jobs(cpu))  // reading cpu-local data
            break;
        if (reconfigure_idle)     // global variable
            break;
        barrier();
    }
The barrier is to force reading the memory with each iteration. It's
defined as:
    asm volatile("" : : : "memory");
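For anyone who wants to look at the idle loop outside of skiboot, here
is a minimal sketch of the same pattern in plain C. The names
cpu_sketch, reconfigure_idle_sketch, wake_reason and idle_spin are
hypothetical stand-ins and not skiboot's real definitions, and the
barrier uses the GCC/Clang inline-asm syntax. It only illustrates the
compiler-barrier mechanics: in portable C, another thread writing these
plain variables would formally be a data race, so treat it as an
illustration rather than production code.

```c
#include <assert.h>
#include <stddef.h>

/* Compiler-only barrier, as quoted above: it emits no instruction, but
 * tells the compiler that memory may have changed, so values cached in
 * registers must be reloaded on the next iteration. */
#define barrier() __asm__ volatile("" : : : "memory")

/* Hypothetical stand-in for the cpu-local data checked by the loop. */
struct cpu_sketch {
    int pending_jobs;
};

/* Hypothetical stand-in for the global flag checked by the loop. */
int reconfigure_idle_sketch;

enum wake_reason { WAKE_JOB, WAKE_RECONFIGURE };

/* Spin until a job is posted or a reconfiguration is requested, and
 * report which of the two conditions ended the loop. */
enum wake_reason idle_spin(struct cpu_sketch *cpu)
{
    assert(cpu != NULL);

    for (;;) {
        if (cpu->pending_jobs > 0)      /* cpu-local data */
            return WAKE_JOB;
        if (reconfigure_idle_sketch)    /* global variable */
            return WAKE_RECONFIGURE;
        barrier();                      /* force a re-read next time */
    }
}
```

Calling idle_spin() on a cpu_sketch whose pending_jobs is already
non-zero returns WAKE_JOB immediately; with no job and the global flag
set it returns WAKE_RECONFIGURE.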
Some time later, the main thread in the linux kernel will get the
secondary threads out of that loop by posting a job.
My first thought was that the translation of that code through tcg was
somehow causing some abnormally slow behavior, maybe due to some
non-obvious contention between the threads. However, if I send the
threads spinning forever with simply:
    for (;;) ;
supposedly removing any contention, then the decompression time is the
same.
Ironically, the behavior seen with single thread tcg is what I would
expect: 1 thread decompressing in 15 seconds, all the other threads
spinning for that same amount of time, all sharing the same physical
cpu, so it all adds up nicely: I see 60 seconds decompression time with
4 vcpus (4x15). Which means multi-thread tcg is slower by quite a bit.
And single thread tcg hogs one physical cpu of the laptop vs. 4 physical
cpus for the slower multi-thread tcg.
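To spell out the arithmetic above, here is a small sketch of the two
naive models being compared. It assumes a perfectly fair round-robin
between vcpus for single-thread tcg and enough idle physical cpus for
multi-thread tcg; both are simplifications, and the function names are
made up for illustration.

```c
#include <assert.h>

/* Single-thread tcg: all vcpus share one host thread, so the vcpu doing
 * the decompression gets roughly 1/n of the time. */
unsigned expected_single_thread_secs(unsigned base_secs, unsigned n_vcpus)
{
    assert(n_vcpus > 0);
    return base_secs * n_vcpus;
}

/* Multi-thread tcg with enough idle host cpus: the spinning vcpus
 * should not slow down the one doing the work. */
unsigned expected_multi_thread_secs(unsigned base_secs, unsigned n_vcpus)
{
    assert(n_vcpus > 0);
    return base_secs;
}

/* Ratio of measured to expected time, scaled by 100 to stay in
 * integer arithmetic (600 means 6x slower than expected). */
unsigned slowdown_x100(unsigned measured_secs, unsigned expected_secs)
{
    assert(expected_secs > 0);
    return measured_secs * 100 / expected_secs;
}
```

With the 15 second single-vcpu baseline, the single-thread model gives
60 seconds for 4 vcpus, which matches what I see. The multi-thread
model gives 15 seconds, against the 90 seconds measured, i.e. a ratio
of 600 (6x).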
Does anybody have an idea of what might be happening, or suggestions on
how to keep investigating?
Thanks for your help!
Fred
Hi Frederic,
I did some boot time tests recently and didn't notice this behavior.
Could you share your QEMU command line with us? Did you build QEMU with
any debug option or sanitizer enabled?
You should be able to see it with:
    qemu-system-ppc64 -machine powernv10 -smp 4 -m 4G -nographic \
        -bios <path to skiboot.lid> -kernel <path to compressed kernel> \
        -initrd <path to initrd> -serial mon:stdio
-smp is what matters.
When simplifying the command line above, I noticed something
interesting: the problem doesn't show using the skiboot.lid shipped with
qemu! I'm using something closer to the current upstream head and the
idle code (the for loop in my initial mail) had been reworked in
between. So, clearly, the way the guest code is written matters. But
that doesn't explain it.
I'm using a kernel in debug mode, so it's pretty big and that's why I
was using a compressed image. The compressed image is about 8 MB.
The initrd shouldn't matter: the issue is seen during kernel
decompression, before the initrd is used.
I can share my binaries if you'd like. Especially a recent version of
skiboot showing the problem.
Fred