On Wed, Mar 1, 2023 at 9:04 AM Laszlo Ersek <ler...@redhat.com> wrote:
>
> Hello Christian,
>
> On 3/1/23 08:17, Christian Ehrhardt wrote:
> > On Thu, Jan 5, 2023 at 8:14 AM Laszlo Ersek <ler...@redhat.com> wrote:
> >>
> >> On 1/4/23 13:35, Michael S. Tsirkin wrote:
> >>> On Wed, Jan 04, 2023 at 10:01:38AM +0100, Laszlo Ersek wrote:
> >>>> The modern ACPI CPU hotplug interface was introduced in the following
> >>>> series (aa1dd39ca307..679dd1a957df), released in v2.7.0:
> >>>>
> >>>> 1 abd49bc2ed2f docs: update ACPI CPU hotplug spec with new protocol
> >>>> 2 16bcab97eb9f pc: piix4/ich9: add 'cpu-hotplug-legacy' property
> >>>> 3 5e1b5d93887b acpi: cpuhp: add CPU devices AML with _STA method
> >>>> 4 ac35f13ba8f8 pc: acpi: introduce AcpiDeviceIfClass.madt_cpu hook
> >>>> 5 d2238cb6781d acpi: cpuhp: implement hot-add parts of CPU hotplug interface
> >>>> 6 8872c25a26cc acpi: cpuhp: implement hot-remove parts of CPU hotplug interface
> >>>> 7 76623d00ae57 acpi: cpuhp: add cpu._OST handling
> >>>> 8 679dd1a957df pc: use new CPU hotplug interface since 2.7 machine type
> >>>>
> > ...
> >>
> >> The solution to the riddle
> >
> > Hi,
> > just to add to this nicely convoluted case an FYI to everyone involved
> > back then: the fix seems to have caused a regression [1] in - as far
> > as I've found - an edge case.
> >
> > [1]: https://gitlab.com/qemu-project/qemu/-/issues/1520
>
> After reading the gitlab case, here's my theory on it:
>
> - Without the patch applied, the CPU hotplug register block in QEMU is
> broken. Effectively, it has *always* been broken; to put it differently,
> you have most likely *never* seen a QEMU in which the CPU hotplug
> register block was not broken. The reason is that the only QEMU release
> without the breakage (as far as a guest could see it!) was v5.0.0, but
> it got exposed to the guest as early as v5.1.0 (IOW, in the 5.* series,
> the first stable release already exposed the issue), and the symptom has
> existed since (up to and including 7.2).
>
> - With the register block broken, OVMF's multiprocessing is broken, and
> the random chaos just happens to play out in a way that makes OVMF think
> it's running on a uniprocessor system.
>
> - With the register block *fixed* (commit dab30fbe applied), OVMF
> actually boots up your VCPUs. With MT-TCG, this translates to as many
> host-side VCPU threads running in your QEMU process as you have VCPUs.
>
> - Furthermore, if your OVMF build includes the SMM driver stack, then
> each UEFI variable update will require all VCPUs to enter SMM. All VCPUs
> entering SMM is a "thundering herd" event, so it seriously spins up all
> your host-side threads. (I assume the SMM-enabled binaries are what you
> refer to as "signed OVMF cases" in the gitlab ticket.)
>
> - If you overcommit the VCPUs (#vcpus > #pcpus), then your host-side
> threads will be competing for PCPUs. On s390x, there is apparently some
> bottleneck in QEMU's locking or in the host kernel or wherever else that
> penalizes (#threads > #pcpus) heavily, while on other host arches, the
> penalty is (apparently) not as severe.
>
> So, the QEMU fix actually "only exposes" the high penalty of the MT-TCG
> VCPU thread overcommit that appears characteristic of s390x hosts.
> You've not seen this symptom before because, regardless of how many
> VCPUs you've specified in the past, OVMF has never actually attempted to
> bring those up, due to the hotplug regblock breakage "masking" the
> actual VCPU counts (the present-at-boot VCPU count and the possible max
> VCPU count).

Thank you for the detailed thoughts - if we can confirm this, we can
close the case as "it is odd that there is so much penalty, but =>
Won't Fix / Works as Intended".

> Here's a test you could try: go back to QEMU v5.0.0 *precisely*, and try
> to reproduce the symptom. I expect that it should reproduce.

v5.0.0 - 1 host cpu  vs 2 vcpus - 58.47s
v5.0.0 - 1 host cpu  vs 1 vcpu  -  5.33s
v5.0.0 - 2 host cpus vs 2 vcpus -  5.27s
v5.1.0 - 1 host cpu  vs 2 vcpus -  7.18s
v5.1.0 - 1 host cpu  vs 1 vcpu  -  5.22s
v5.1.0 - 2 host cpus vs 2 vcpus -  5.40s

Yes, v5.0.0 behaves exactly like the recent master branch does since
your fix, and v5.1.0 no longer does - just as you predicted.
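(In case someone wants to re-run such a matrix, here is a minimal sketch
of how one cell of it - say "1 host cpu vs 2 vcpus" - can be driven. This
assumes a Linux host; the firmware image and machine flags are
placeholders, not my actual test harness:)

    #!/usr/bin/env python3
    # Sketch only: pin this process - and therefore the QEMU child it
    # spawns - to a single host CPU, then time a 2-VCPU TCG boot.
    # Assumes whatever is booted powers the VM off when it is done.
    import os
    import subprocess
    import time

    os.sched_setaffinity(0, {0})   # "1 host cpu"; use {0, 1} for two

    start = time.monotonic()
    subprocess.run([
        "qemu-system-x86_64",
        "-accel", "tcg",
        "-smp", "2",               # "2 vcpus"
        "-bios", "OVMF.fd",        # placeholder firmware build
        "-display", "none",
        "-no-reboot",
    ], check=True)
    print(f"took {time.monotonic() - start:.2f}s")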
> Here's another test you can try: with latest QEMU, boot an x86 Linux
> guest, but using SeaBIOS, not OVMF, on your s390x host. Then, in the
> Linux guest, run as many busy loops (e.g. in the shell) as there are
> VCPUs. Compare the behavior between #vcpus = #pcpus vs. #vcpus > #pcpus.
> The idea here is of course to show that the impact of overcommitting x86
> VCPUs on s390x is not specific to OVMF. Note that I don't *fully* expect
> this test to confirm the expectation, because the guest workload will be
> very different: in the Linux guest case, your VCPUs will not be
> attempting to enter SMM *or* to access pflash, so the paths exercised in
> QEMU will be very different. But, the test may still be worth a try.

That felt like too different a workload to me, so I have skipped this
one - without further evidence that it would help, it could be quite a
time sink.

> Yet another test (or more like, information gathering): re-run the
> problematic case, while printing the OVMF debug log (the x86 debug
> console) to stdout, and visually determine at what part(s) the slowdown
> hits. (I guess you can also feed the debug console log through some
> timestamping utility like "logger".) I suspect it's going to be those
> log sections that relate to SMM entry -- initial SMBASE relocation, and
> then whenever UEFI variables are modified.

Building without -b RELEASE, adding a debugcon, and timestamping that
output showed that each individual initialization takes the expected
~10x longer. Up to those initializations the two cases run at more or
less the same speed; then the bad case slows down.

Here is one example, on BootGraphicsResourceTableDxe.efi:

good, ~0.009s:
[08:14:36.657866559] Loading driver at 0x0000DA42000 EntryPoint=0x0000DA43545 BootGraphicsResourceTableDxe.efi
[08:14:36.658913369] InstallProtocolInterface: BC62157E-3E33-4FEC-9920-2D3B36D750DF DA72D18
[08:14:36.659946746] ProtectUefiImageCommon - 0xDA72040
[08:14:36.660982120] - 0x000000000DA42000 - 0x0000000000002840
[08:14:36.662043745] InstallProtocolInterface: CDEA2BD3-FC25-4C1C-B97C-B31186064990 DA445F0
[08:14:36.663092745] InstallProtocolInterface: 4B5DC1DF-1EAA-48B2-A7E9-EAC489A00B5C DA44670
[08:14:36.664139682] Loading driver 961578FE-B6B7-44C3-AF35-6BC705CD2B1F
[08:14:36.665191815] InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B DA72540
[08:14:36.666244307] Loading driver at 0x0000DA02000 EntryPoint=0x0000DA099BC Fat.efi

bad, ~0.17s:
[08:15:30.386201946] Loading driver at 0x0000DA49000 EntryPoint=0x0000DA4A545 BootGraphicsResourceTableDxe.efi
[08:15:30.410568994] InstallProtocolInterface: BC62157E-3E33-4FEC-9920-2D3B36D750DF DA7EB18
[08:15:30.430838932] ProtectUefiImageCommon - 0xDA7E140
[08:15:30.440526879] - 0x000000000DA49000 - 0x0000000000002840
[08:15:30.450730504] InstallProtocolInterface: CDEA2BD3-FC25-4C1C-B97C-B31186064990 DA4B5F0
[08:15:30.480538889] InstallProtocolInterface: 4B5DC1DF-1EAA-48B2-A7E9-EAC489A00B5C DA4B670
[08:15:30.490532370] Loading driver 961578FE-B6B7-44C3-AF35-6BC705CD2B1F
[08:15:30.510566744] InstallProtocolInterface: 5B1B31A1-9562-11D2-8E3F-00A0C969723B DA7D040
[08:15:30.550572432] Loading driver at 0x0000D7F6000 EntryPoint=0x0000D7FD9BC Fat.efi

This seems to be the case for each driver load in there, which then
adds up.
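(The per-line deltas above are easy enough to eyeball, but for comparing
the full files a small helper along these lines - hypothetical, written
against the timestamp format shown above - prints the time spent on each
line, so the slow spots stand out:)

    #!/usr/bin/env python3
    # Hypothetical helper: read a timestamped debugcon log on stdin and
    # print, for every line, the time elapsed since the previous line.
    # Assumes the "[HH:MM:SS.nnnnnnnnn]" prefix format shown above.
    import re
    import sys
    from datetime import datetime

    STAMP = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\.(\d+)\]\s*(.*)")

    prev = None
    for line in sys.stdin:
        m = STAMP.match(line)
        if not m:
            continue  # skip lines without a timestamp prefix
        # strptime's %f takes at most 6 fractional digits, so clip the
        # nanosecond field down to microseconds before parsing.
        now = datetime.strptime(f"{m.group(1)}.{m.group(2)[:6]}",
                                "%H:%M:%S.%f")
        if prev is not None:
            print(f"{(now - prev).total_seconds():9.6f}  {m.group(3)}")
        prev = now

(Something like "python3 deltas.py < bad.log | sort -n | tail" then
surfaces the biggest gaps, such as the jump right below.)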
There is another, rather big jump a bit later:

good, ~instant:
[08:14:37.267336194] Select Item: 0xE
[08:14:37.268346995] [Bds]RegisterKeyNotify: 000C/0000 80000000/00 Success

bad, ~8s:
[08:15:43.561054490] Select Item: 0xE
[08:15:51.291039364] [Bds]RegisterKeyNotify: 000C/0000 80000000/00 Success

The whole late section of OVMF init accounts for almost all of the loss.

Full files:
- good: https://paste.ubuntu.com/p/DcMpxtd9Cy/
- bad:  https://paste.ubuntu.com/p/4wDfzmC9Sm/

> Preliminary advice: don't overcommit VCPUs in the setup at hand, or else
> please increase the timeout. :)

I was always in the "if possible, do not overcommit" camp anyway, and
we have by now resolved this in the tests [1] due to my bug report about
it - thanks @Dann Frazier!

[1]: https://salsa.debian.org/qemu-team/edk2/-/commit/243f0c2533fc18671dc373645e44b5071d8474a5

> In edk2, a way to mitigate said "thundering herd" problem *supposedly*
> exists (using unicast SMIs rather than broadcast ones), but that
> configuration of the core SMM components in edk2 had always been
> extremely unstable when built into OVMF *and* running on QEMU/KVM. So we
> opted for broadcast SMIs (supporting which actually required some QEMU
> patches). Broadcast SMIs generate larger spikes in host load, but
> regarding guest functionality, they are much more stable/robust.
>
> Laszlo

-- 
Christian Ehrhardt
Senior Staff Engineer, Ubuntu Server
Canonical Ltd