Frederic Barrat <fbar...@linux.ibm.com> writes:
> On 28/06/2022 17:12, Alex Bennée wrote:
>> Frederic Barrat <fbar...@linux.ibm.com> writes:
>>
>>> On 28/06/2022 13:25, Matheus K. Ferst wrote:
>>>> On 27/06/2022 15:25, Frederic Barrat wrote:
>>>>> [ Resending as it was meant for the qemu-ppc list ]
>>>>>
>>>>> Hello,
>>>>>
>>>>> I've been looking at why our qemu powernv model is so slow when
>>>>> booting a compressed linux kernel, using multiple vcpus and
>>>>> multi-thread tcg. With only one vcpu, the decompression time of
>>>>> the kernel is what it is, but when using multiple vcpus, the
>>>>> decompression is actually slower. And worse: it degrades very
>>>>> fast with the number of vcpus!
>>>>>
>>>>> Rough measurement of the decompression time on an x86 laptop with
>>>>> multi-thread tcg and using the qemu powernv10 machine:
>>>>> 1 vcpu  => 15 seconds
>>>>> 2 vcpus => 45 seconds
>>>>> 4 vcpus => 1 min 30 seconds
>>>>>
>>>>> Looking at the details: when the firmware (skiboot) hands over
>>>>> execution to the linux kernel, there's one main thread entering
>>>>> some bootstrap code and running the kernel decompression
>>>>> algorithm. All the other secondary threads are left spinning in
>>>>> skiboot (1 thread per vcpu). So on paper, with multi-thread tcg
>>>>> and assuming the system has enough available physical cpus, I
>>>>> would expect the decompression to hog one physical cpu and the
>>>>> time needed to be constant, no matter the number of vcpus.
>> <snip>
>>>>>
>>>>> Ironically, the behavior seen with single thread tcg is what I
>>>>> would expect: 1 thread decompressing in 15 seconds, all the other
>>>>> threads spinning for that same amount of time, all sharing the
>>>>> same physical cpu, so it all adds up nicely: I see 60 seconds
>>>>> decompression time with 4 vcpus (4x15). Which means multi-thread
>>>>> tcg is slower by quite a bit. And single thread tcg hogs one
>>>>> physical cpu of the laptop vs. 4 physical cpus for the slower
>>>>> multi-thread tcg.
>>>>>
>>>>> Does anybody have an idea of what might be happening, or
>>>>> suggestions on how to keep investigating?
>>>>> Thanks for your help!
>>>>>
>>>>> Fred
>>>>>
>>>>>
>>>> Hi Frederic,
>>>> I did some boot time tests recently and didn't notice this
>>>> behavior. Could you share your QEMU command line with us? Did you
>>>> build QEMU with any debug option or sanitizer enabled?
>>>
>>>
>>> You should be able to see it with:
>>>
>>> qemu-system-ppc64 -machine powernv10 -smp 4 -m 4G -nographic \
>>>     -bios <path to skiboot.lid> -kernel <path to compressed kernel> \
>>>     -initrd <path to initrd> -serial mon:stdio
>>>
>>>
>>> -smp is what matters.
>>>
>>> When simplifying the command line above, I noticed something
>>> interesting: the problem doesn't show up with the skiboot.lid
>>> shipped with qemu! I'm using something closer to the current
>>> upstream head, and the idle code (the for loop in my initial mail)
>>> has been reworked in between. So, clearly, the way the guest code
>>> is written matters. But that doesn't explain it.
>>>
>>> I'm using a kernel in debug mode, so it's pretty big, and that's
>>> why I was using a compressed image. The compressed image is about
>>> 8 MB.

You can use split debug info to avoid keeping the symbols in the final
kernel image. Or are there other debugging options turned on?

>> If the debug mode on PPC enables live patching of kernel functions
>> for instrumentation, that can certainly slow things down. You would
>> see that as tcg_optimize appearing in the perf log, and "info jit"
>> showing constantly growing translation buffers.
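For reference, "info jit" is available from the HMP monitor. A minimal
sketch, assuming the -serial mon:stdio setup from the command line
above (Ctrl-a c toggles between the guest serial console and the
monitor):

  (qemu) info jit

If the translation buffer statistics keep growing while the guest is
only sitting in the decompression loop, that would point at constant
retranslation.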
>
> The part where I'm seeing the huge slowdown is not quite in the
> kernel yet. Only one thread is in bootstrap code decompressing the
> real kernel. All the other threads are still spinning in firmware.
>
> Anyway, I've run perf. I couldn't figure out how to trigger the
> recording only around the decompression part with the slowdown. So I
> booted with 4 cpus to make it really slow, expecting that the initial
> steps of the boot, which happen quickly enough, would be dwarfed by
> the time spent while one thread is decompressing the kernel (the part
> where I see the huge slowdown). I'd say the recording was taken with
> ~80% of the time in the interesting part. Here is what I got:
>
>   12,62%  qemu-system-ppc  [kernel.kallsyms]  [k] syscall_exit_to_user_mode
>    6,93%  qemu-system-ppc  [kernel.kallsyms]  [k] syscall_return_via_sysret
>    5,64%  qemu-system-ppc  [kernel.kallsyms]  [k] __entry_text_start
>    3,93%  qemu-system-ppc  libc.so.6          [.] pthread_mutex_lock@@GLIBC_2.2.5
>    3,21%  qemu-system-ppc  libc.so.6          [.] __GI___pthread_mutex_unlock_usercnt
>    3,12%  qemu-system-ppc  libc.so.6          [.] __GI___lll_lock_wait
>    2,60%  qemu-system-ppc  qemu-system-ppc64  [.] cpu_handle_interrupt
>    2,55%  qemu-system-ppc  [kernel.kallsyms]  [k] futex_wake
>    2,43%  qemu-system-ppc  [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
>    1,97%  qemu-system-ppc  [kernel.kallsyms]  [k] _raw_spin_lock
>    1,89%  qemu-system-ppc  qemu-system-ppc64  [.] qemu_mutex_lock_impl
>    1,83%  qemu-system-ppc  qemu-system-ppc64  [.] tb_lookup
>    1,71%  qemu-system-ppc  [kernel.kallsyms]  [k] __get_user_nocheck_4
>    1,55%  qemu-system-ppc  qemu-system-ppc64  [.] hreg_compute_hflags_value
>    1,46%  qemu-system-ppc  [kernel.kallsyms]  [k] futex_q_lock
>    1,39%  qemu-system-ppc  [kernel.kallsyms]  [k] futex_q_unlock
>    1,23%  qemu-system-ppc  [kernel.kallsyms]  [k] audit_reset_context.part.0.constprop.0
>    1,14%  qemu-system-ppc  qemu-system-ppc64  [.] object_class_dynamic_cast_assert
>    1,09%  qemu-system-ppc  qemu-system-ppc64  [.] qemu_mutex_unlock_impl
>    1,02%  qemu-system-ppc  qemu-system-ppc64  [.] object_dynamic_cast_assert
>    1,00%  qemu-system-ppc  [kernel.kallsyms]  [k] __x64_sys_futex
>
> Any known pattern here? There seems to be some contention with the
> mutex/futex calls, but it's not obvious to me what it is.

If you run the sync profiler (via the HMP command "sync-profile on")
you can then get a breakdown of which mutexes are being held and for
how long ("info sync-profile"). There's a sketch of such a session at
the end of this mail.

> I was also pointed to enabling gprof in qemu. I'll look into it.

gprof will likely change the behaviour due to its overhead.

> Thanks!
>
> Fred
>
>
>>>
>>> The initrd shouldn't matter; the issue is seen during kernel
>>> decompression, before the init ram is used.
>>>
>>> I can share my binaries if you'd like. Especially a recent version
>>> of skiboot showing the problem.
>>>
>>> Fred
>>
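A minimal sketch of a session with the synchronization profiler,
assuming the same -serial mon:stdio setup as above (the exact report
layout may differ between QEMU versions):

  (qemu) sync-profile on
  ... let the guest run through the slow decompression phase ...
  (qemu) sync-profile off
  (qemu) info sync-profile

And to narrow the perf recording to roughly the decompression window,
one option is to attach to the already-running process for a fixed
time once the guest reaches the slow phase, e.g.:

  $ perf record -p $(pidof qemu-system-ppc64) -- sleep 30

--
Alex Bennée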