On Wed, Mar 1, 2023 at 9:04 AM Laszlo Ersek <ler...@redhat.com> wrote:
>
> Hello Christian,
>
> On 3/1/23 08:17, Christian Ehrhardt wrote:
> > On Thu, Jan 5, 2023 at 8:14 AM Laszlo Ersek <ler...@redhat.com> wrote:
> >>
> >> On 1/4/23 13:35, Michael S. Tsirkin wrote:
> >>> On Wed, Jan 04, 2023 at 10:01:38AM +0100, Laszlo Ersek wrote:
> >>>> The modern ACPI CPU hotplug interface was introduced in the following
> >>>> series (aa1dd39ca307..679dd1a957df), released in v2.7.0:
> >>>>
> >>>> 1 abd49bc2ed2f docs: update ACPI CPU hotplug spec with new protocol
> >>>> 2 16bcab97eb9f pc: piix4/ich9: add 'cpu-hotplug-legacy' property
> >>>> 3 5e1b5d93887b acpi: cpuhp: add CPU devices AML with _STA method
> >>>> 4 ac35f13ba8f8 pc: acpi: introduce AcpiDeviceIfClass.madt_cpu hook
> >>>> 5 d2238cb6781d acpi: cpuhp: implement hot-add parts of CPU hotplug interface
> >>>> 6 8872c25a26cc acpi: cpuhp: implement hot-remove parts of CPU hotplug interface
> >>>> 7 76623d00ae57 acpi: cpuhp: add cpu._OST handling
> >>>> 8 679dd1a957df pc: use new CPU hotplug interface since 2.7 machine type
> >>>>
> > ...
> >>
> >> The solution to the riddle
> >
> > Hi,
> > just to add to this nicely convoluted case an FYI to everyone involved
> > back then: the fix seems to have caused a regression [1] in - as far
> > as I've found - an edge case.
> >
> > [1]: https://gitlab.com/qemu-project/qemu/-/issues/1520
>
> After reading the gitlab case, here's my theory on it:
>
> - Without the patch applied, the CPU hotplug register block in QEMU is
> broken. Effectively, it has *always* been broken; to put it differently,
> you have most likely *never* seen a QEMU in which the CPU hotplug
> register block was not broken. The reason is that the only QEMU release
> without the breakage (as far as a guest could see it!) was v5.0.0, but
> it got exposed to the guest as early as v5.1.0 (IOW, in the 5.* series,
> the first stable release already exposed the issue), and the symptom has
> existed since (up to and including 7.2).
>
> - With the register block broken, OVMF's multiprocessing is broken, and
> the random chaos just happens to play out in a way that makes OVMF think
> it's running on a uniprocessor system.
>
> - With the register block *fixed* (commit dab30fbe applied), OVMF
> actually boots up your VCPUs. With MT-TCG, this translates to as many
> host-side VCPU threads running in your QEMU process as you have VCPUs.
>
> - Furthermore, if your OVMF build includes the SMM driver stack, then
> each UEFI variable update will require all VCPUs to enter SMM. All VCPUs
> entering SMM is a "thundering herd" event, so it seriously spins up all
> your host-side threads. (I assume the SMM-enabled binaries are what you
> refer to as "signed OVMF cases" in the gitlab ticket.)
>
> - If you overcommit the VCPUs (#vcpus > #pcpus), then your host-side
> threads will be competing for PCPUs. On s390x, there is apparently some
> bottleneck in QEMU's locking or in the host kernel or wherever else that
> penalizes (#threads > #pcpus) heavily, while on other host arches, the
> penalty is (apparently) not as severe.
>
> So, the QEMU fix actually "only exposes" the high penalty of the MT-TCG
> VCPU thread overcommit that appears characteristic of s390x hosts.
> You've not seen this symptom before because, regardless of how many
> VCPUs you've specified in the past, OVMF has never actually attempted to
> bring those up, due to the hotplug regblock breakage "masking" the
> actual VCPU counts (the present-at-boot VCPU count and the possible max
> VCPU count).

Thank you for the detailed thoughts - if we can confirm this, we can
close the case as "it is odd that there is so much penalty, but =>
Won't Fix / Works as Intended".

> Here's a test you could try: go back to QEMU v5.0.0 *precisely*, and try
> to reproduce the symptom. I expect that it should reproduce.

v5.0.0 - 1 host cpu  vs 2 vcpus - 58.47s
v5.0.0 - 1 host cpu  vs 1 vcpu  -  5.33s
v5.0.0 - 2 host cpus vs 2 vcpus -  5.27s
v5.1.0 - 1 host cpu  vs 2 vcpus -  7.18s
v5.1.0 - 1 host cpu  vs 1 vcpu  -  5.22s
v5.1.0 - 2 host cpus vs 2 vcpus -  5.40s

Yes, v5.0.0 behaves exactly like the recent master branch does since
your fix, and v5.1.0 no longer does - just as you predicted.
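(In case someone wants to re-run such a matrix, here is a minimal sketch
of how one cell of it - say "1 host cpu vs 2 vcpus" - can be driven. This
assumes a Linux host; the firmware image and machine flags are
placeholders, not my actual test harness:)

    #!/usr/bin/env python3
    # Sketch only: pin this process - and therefore the QEMU child it
    # spawns - to a single host CPU, then time a 2-VCPU TCG boot.
    # Assumes whatever is booted powers the VM off when it is done.
    import os
    import subprocess
    import time

    os.sched_setaffinity(0, {0})   # "1 host cpu"; use {0, 1} for two

    start = time.monotonic()
    subprocess.run([
        "qemu-system-x86_64",
        "-accel", "tcg",
        "-smp", "2",               # "2 vcpus"
        "-bios", "OVMF.fd",        # placeholder firmware build
        "-display", "none",
        "-no-reboot",
    ], check=True)
    print(f"took {time.monotonic() - start:.2f}s")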
> Here's another test you can try: with latest QEMU, boot an x86 Linux
> guest, but using SeaBIOS, not OVMF, on your s390x host. Then, in the
> Linux guest, run as many busy loops (e.g. in the shell) as there are
> VCPUs. Compare the behavior between #vcpus = #pcpus vs. #vcpus > #pcpus.
> The idea here is of course to show that the impact of overcommitting x86
> VCPUs on s390x is not specific to OVMF. Note that I don't *fully* expect
> this test to confirm the expectation, because the guest workload will be
> very different: in the Linux guest case, your VCPUs will not be
> attempting to enter SMM *or* to access pflash, so the paths exercised in
> QEMU will be very different. But, the test may still be worth a try.

That felt like too different a workload to me, so I have skipped this
one - without further evidence that it would help, it could be quite a
time sink.

> Yet another test (or more like, information gathering): re-run the
> problematic case, while printing the OVMF debug log (the x86 debug
> console) to stdout, and visually determine at what part(s) the slowdown
> hits. (I guess you can also feed the debug console log through some
> timestamping utility like "logger".) I suspect it's going to be those
> log sections that relate to SMM entry -- initial SMBASE relocation, and
> then whenever UEFI variables are modified.

Building without -b RELEASE, adding a debugcon, and timestamping that
output showed that each individual initialization takes the expected
~10x longer. Up to those initializations the two cases run at more or
less the same speed; then the bad case slows down.

Here is one example, on BootGraphicsResourceTableDxe.efi:

good, ~0.009s:
[08:14:36.657866559] Loading driver at 0x0000DA42000 EntryPoint=0x0000DA43545 BootGraphicsResourceTableDxe.efi
[08:14:36.658913369] InstallProtocolInterface: BC62157E-3E33-4FEC-9920-2D3B36D750DF DA72D18
[08:14:36.659946746] ProtectUefiImageCommon - 0xDA72040
[08:14:36.660982120] - 0x000000000DA42000 - 0x0000000000002840
[08:14:36.662043745] InstallProtocolInterface: CDEA2BD3-FC25-4C1C-B97C-B31186064990 DA445F0
[08:14:36.663092745] InstallProtocolInterface: 4B5DC1DF-1EAA-48B2-A7E9-EAC489A00B5C DA44670
[08:14:36.664139682] Loading driver 961578FE-B6B7-44C3-AF35-6BC705CD2B1F
[08:14:36.665191815] InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B DA72540
[08:14:36.666244307] Loading driver at 0x0000DA02000 EntryPoint=0x0000DA099BC Fat.efi

bad, ~0.17s:
[08:15:30.386201946] Loading driver at 0x0000DA49000 EntryPoint=0x0000DA4A545 BootGraphicsResourceTableDxe.efi
[08:15:30.410568994] InstallProtocolInterface: BC62157E-3E33-4FEC-9920-2D3B36D750DF DA7EB18
[08:15:30.430838932] ProtectUefiImageCommon - 0xDA7E140
[08:15:30.440526879] - 0x000000000DA49000 - 0x0000000000002840
[08:15:30.450730504] InstallProtocolInterface: CDEA2BD3-FC25-4C1C-B97C-B31186064990 DA4B5F0
[08:15:30.480538889] InstallProtocolInterface: 4B5DC1DF-1EAA-48B2-A7E9-EAC489A00B5C DA4B670
[08:15:30.490532370] Loading driver 961578FE-B6B7-44C3-AF35-6BC705CD2B1F
[08:15:30.510566744] InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B DA7D040
[08:15:30.550572432] Loading driver at 0x0000D7F6000 EntryPoint=0x0000D7FD9BC Fat.efi

This seems to be the case for each driver load in there, which then
adds up.
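(The per-line deltas above are easy enough to eyeball, but for comparing
the full files a small helper along these lines - hypothetical, written
against the timestamp format shown above - prints the time spent on each
line, so the slow spots stand out:)

    #!/usr/bin/env python3
    # Hypothetical helper: read a timestamped debugcon log on stdin and
    # print, for every line, the time elapsed since the previous line.
    # Assumes the "[HH:MM:SS.nnnnnnnnn]" prefix format shown above.
    import re
    import sys
    from datetime import datetime

    STAMP = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\.(\d+)\]\s*(.*)")

    prev = None
    for line in sys.stdin:
        m = STAMP.match(line)
        if not m:
            continue  # skip lines without a timestamp prefix
        # strptime's %f takes at most 6 fractional digits, so clip the
        # nanosecond field down to microseconds before parsing.
        now = datetime.strptime(f"{m.group(1)}.{m.group(2)[:6]}",
                                "%H:%M:%S.%f")
        if prev is not None:
            print(f"{(now - prev).total_seconds():9.6f}  {m.group(3)}")
        prev = now

(Something like "python3 deltas.py < bad.log | sort -n | tail" then
surfaces the biggest gaps, such as the jump right below.)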
There is another, rather big jump a bit later:

good, ~instant:
[08:14:37.267336194] Select Item: 0xE
[08:14:37.268346995] [Bds]RegisterKeyNotify: 000C/0000 80000000/00 Success

bad, ~8s:
[08:15:43.561054490] Select Item: 0xE
[08:15:51.291039364] [Bds]RegisterKeyNotify: 000C/0000 80000000/00 Success

The whole late section of OVMF init accounts for almost all of the loss.

Full files:
- good: https://paste.ubuntu.com/p/DcMpxtd9Cy/
- bad:  https://paste.ubuntu.com/p/4wDfzmC9Sm/

> Preliminary advice: don't overcommit VCPUs in the setup at hand, or else
> please increase the timeout. :)

I was always in the "if possible, do not overcommit" camp anyway, and
we have by now resolved this in the tests [1] due to my bug report about
it - thanks @Dann Frazier!

[1]: https://salsa.debian.org/qemu-team/edk2/-/commit/243f0c2533fc18671dc373645e44b5071d8474a5

> In edk2, a way to mitigate said "thundering herd" problem *supposedly*
> exists (using unicast SMIs rather than broadcast ones), but that
> configuration of the core SMM components in edk2 had always been
> extremely unstable when built into OVMF *and* running on QEMU/KVM. So we
> opted for broadcast SMIs (supporting which actually required some QEMU
> patches). Broadcast SMIs generate larger spikes in host load, but
> regarding guest functionality, they are much more stable/robust.
>
> Laszlo

-- 
Christian Ehrhardt
Senior Staff Engineer, Ubuntu Server
Canonical Ltd