Hello folks, I'd like to start a discussion (or bump it, in case it was
already discussed) about an "issue", or rather a limitation, that we've
been observing (and receiving reports about) in qemu/OVMF with regard
to PCI passthrough of large-BAR devices.

After OVMF commit 7e5b1b670c38 ("OvmfPkg: PlatformPei: determine the
64-bit PCI host aperture for X64 DXE"), the PCI 64-bit aperture is a
hardcoded value exposed to the guest via the ACPI _CRS, which in
practical terms prevents PCI devices with 32G+ BARs from being
correctly passed through to guests.

There was a very informative discussion on the edk2 groups [0], started
by my colleague Dann, to which some edk2 and qemu developers responded
with a good amount of information and rationale about this limitation,
and about the problems that increasing the limit would bring. All the
colleagues who responded in that group discussion are hereby CC'ed.

The summary (in my understanding) is:

- The main rationale for the current limitation is simplicity: OVMF
needs to take the 64-bit aperture into account when setting up memory
mapping, and the current 32G limit accommodates the majority of common
use cases.

- On top of that, increasing the 64-bit aperture increases the memory
required for the OVMF-calculated PEI (Pre-EFI Initialization) page
tables.

- The current aperture also accounts for the 36-bit physical address
width common in older processors and in some generic qemu vCPU models;
this "helps" with live migration, since 36 bits seems to be the lowest
common denominator among such processors (for 64-bit architectures),
hence the limited PCI64 aperture isn't yet another factor making live
migration difficult or impossible.

- Finally, there's an _experimental_ parameter that gives users some
flexibility over the PCI64 aperture calculation: "X-PciMmio64Mb" (see
the usage example right below).
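
For reference, that knob is set through fw_cfg, with the value being
the aperture size in MB; a minimal example from my tests (the 65536
value here is just an illustration, requesting a 64G aperture):

  qemu-system-x86_64 ... \
      -fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=65536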

The point is that there are more and more devices out there with big
BARs (mostly GPUs) that either exceed 32G by themselves or come close
(16G), and when users want to pass such devices through, OVMF doesn't
allow it. Relying on "X-PciMmio64Mb" is problematic due to the
experimental/unstable nature of that parameter.

The Linux kernel allows bypassing the ACPI _CRS information with
"pci=nocrs"; there's some discussion about that in [1]. But other OSes
may not have such an option, effectively preventing PCI-PT of such
large devices from succeeding, or forcing users to rely on the
experimental parameter.
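
For completeness, on a Linux guest that workaround is just a kernel
parameter appended to the boot command line; an illustrative GRUB-style
entry, with placeholder paths:

  linux /boot/vmlinuz root=/dev/vda1 ro pci=nocrs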

I'd like to discuss a definitive solution here; I had started this
discussion on the Tianocore bugzilla [2], but Laszlo wisely suggested
we move it here to gather input from the qemu community.
Currently I see 2 options, (a) being my preferred one:

(a) We could rely on the guest physbits to calculate the PCI64
aperture. If users are passing through the host's physbits (or setting
the physbits manually via the "phys-bits" CPU property), they are
already risking a live migration failure. And if users are not setting
the physbits at all, there must be a default (40 bits, according to my
experiments), so it seems a good idea to rely on that.
If the guest physbits is 40, why should OVMF limit it as if it were 36,
right? (A rough sketch of this idea follows after option (b) below.)

(b) Making the experimental "X-PciMmio64Mb" parameter non-experimental
is also an option, allowing users to rely on it without the risk of it
being dropped.
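
Regarding (a), to make it concrete, here's a minimal sketch of the kind
of calculation I have in mind. This is not OVMF code: the function name
and the quarter-of-the-address-space policy are hypothetical, just to
illustrate deriving the aperture from physbits instead of hardcoding
it:

  #include <stdint.h>
  #include <stdio.h>

  /*
   * Hypothetical policy: size the 64-bit PCI aperture as a fraction
   * (here, 1/4) of the guest physical address space, instead of the
   * fixed 32G that OVMF effectively uses today.
   */
  static uint64_t pci64_aperture_size(unsigned phys_bits)
  {
      uint64_t addr_space = 1ULL << phys_bits; /* e.g. 2^40 = 1T */
      return addr_space / 4;
  }

  int main(void)
  {
      unsigned bits;

      for (bits = 36; bits <= 44; bits += 2) {
          printf("phys-bits=%u -> aperture %llu G\n", bits,
                 (unsigned long long)(pci64_aperture_size(bits) >> 30));
      }
      return 0;
  }

With the default 40 physbits, a policy like this would already yield a
256G aperture, comfortably fitting today's 32G+ BAR GPUs.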

Please let me know your thoughts on this limitation and how we could
improve it. Other ideas are also welcome, of course. Thanks for the
attention,


Guilherme


[0] edk2.groups.io/g/discuss/topic/ovmf_resource_assignment/59340711
[1] bugs.launchpad.net/bugs/1849563
[2] bugzilla.tianocore.org/show_bug.cgi?id=2796
