Hello folks,

I'd like to start a discussion (or bump it, in case it was already discussed) about an "issue", or better said, a limitation we've been observing (and receiving reports about) in qemu/OVMF with regard to PCI passthrough of large-BAR devices.
After OVMF commit 7e5b1b670c38 ("OvmfPkg: PlatformPei: determine the 64-bit PCI host aperture for X64 DXE"), the 64-bit PCI aperture is a hardcoded value passed to the guest via the ACPI _CRS object which, in practical terms, does not allow PCI devices with 32G+ BARs to be correctly passed through to guests.

There was a very informative discussion about this on the edk2 groups [0], started by my colleague Dann, to which some edk2 and qemu developers responded with a good amount of information and rationale about this limitation and about the problems that raising the limit would bring. All the colleagues that responded in that discussion are hereby CC'ed.

The summary (in my understanding) is:

- The main reasoning behind the current limit is simplicity: OVMF needs to take the 64-bit aperture into account when building its memory map, and the current 32G limit accommodates the majority of use cases.

- On top of that, increasing the 64-bit aperture increases the memory required for the page tables that OVMF builds during PEI (Pre-EFI Initialization).

- The current aperture also fits within the 36-bit CPU physical address width ("physbits") common in old processors and in some of qemu's generic vCPU models. This "helps" live migration: 36 bits seems to be the lowest common denominator across 64-bit processors, so the PCI64 aperture is not yet another factor that makes live migration difficult or impossible. (For scale: 36 address bits cover 64G of guest-physical address space, while 40 bits cover 1T.)

- Finally, there is an _experimental_ parameter that gives users some flexibility in the PCI64 aperture calculation: "X-PciMmio64Mb" (see the first example at the end of this mail).

The point is that there are more and more devices out there with big BARs (mostly GPUs) that either exceed 32G by themselves or are getting close (16G), and if users want to pass such devices through, OVMF doesn't allow it. Relying on "X-PciMmio64Mb" is problematic due to the experimental/unstable nature of that parameter. The Linux kernel allows bypassing the ACPI _CRS information with "pci=nocrs" (second example at the end of this mail); there is some discussion about that in [1]. But other OSes may not have such an option, effectively preventing the passthrough of such large devices, or forcing users to rely on the experimental parameter.

I'd like to discuss a definitive solution here; I started this discussion on the TianoCore bugzilla [2], but Laszlo wisely suggested we move it here to gather input from the qemu community. Currently I see 2 options, (a) being my preferred one:

(a) We could rely on the guest physbits to calculate the PCI64 aperture. If users pass the host's physbits through, or set the physbits manually via the "phys-bits" CPU property (third example at the end of this mail), they are already risking a live migration failure. And if users don't set the physbits at all, there is a default (40 bits, according to my experiments), which seems to be a good value to rely on. If the guest physbits is 40, why should OVMF limit the aperture as if it were 36, right?

(b) We could make the experimental "X-PciMmio64Mb" parameter non-experimental, allowing users to rely on it without the risk of it being dropped.

Please let me know your thoughts on this limitation and how we could improve it. Other ideas are also welcome, of course.

Thanks for the attention,
Guilherme

[0] edk2.groups.io/g/discuss/topic/ovmf_resource_assignment/59340711
[1] bugs.launchpad.net/bugs/1849563
[2] bugzilla.tianocore.org/show_bug.cgi?id=2796
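P.S.: For concreteness, here are examples of the three knobs mentioned above. First, the experimental OVMF knob: the 64-bit aperture size is given in MB through the fw_cfg file "opt/ovmf/X-PciMmio64Mb" (the machine, memory and VFIO device details below are illustrative only):

  $ qemu-system-x86_64 -machine q35,accel=kvm -m 16G \
      -drive if=pflash,format=raw,readonly=on,file=OVMF_CODE.fd \
      -drive if=pflash,format=raw,file=OVMF_VARS.fd \
      -device vfio-pci,host=0000:41:00.0 \
      -fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=65536

65536 MB gives a 64G aperture, which should be enough for a 32G BAR plus the device's remaining BARs.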
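Second, the Linux guest-side workaround discussed in [1]: booting the guest kernel with "pci=nocrs" makes it ignore the host bridge windows reported via _CRS, so it should be used with care. On a Debian/Ubuntu-style guest, for instance:

  # in the guest's /etc/default/grub, append pci=nocrs to the existing options
  GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=nocrs"

  $ sudo update-grub && sudo reboot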
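Third, the guest physical address width that option (a) would key off of; it is controlled today by the "phys-bits" / "host-phys-bits" CPU properties, both of which (as said above) already carry the live migration risk:

  # pass the host's physical address width through to the guest
  $ qemu-system-x86_64 -machine q35,accel=kvm -m 4G \
      -cpu host,host-phys-bits=on

  # or set an explicit width on a generic vCPU model
  $ qemu-system-x86_64 -machine q35,accel=kvm -m 4G \
      -cpu qemu64,phys-bits=40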