On Wed, Aug 25, 2021 at 05:50:23PM -0400, Peter Xu wrote:
> On Wed, Aug 25, 2021 at 02:28:55PM +0200, Markus Armbruster wrote:
> > Markus Armbruster <arm...@redhat.com> writes:
> >
> > > Peter Xu <pet...@redhat.com> writes:
> > >
> > >> On Mon, Aug 23, 2021 at 05:56:23PM -0400, Eduardo Habkost wrote:
> > >>> I don't have any other example, but I assume address assignment
> > >>> based on ordering is a common pattern in device code.
> > >>>
> > >>> I would take a very close and careful look at the devices with
> > >>> non-default vmsd priority.  If you can prove that the 13 device
> > >>> types with non-default priority are all order-insensitive, a
> > >>> custom sort function as you describe might be safe.
> > >>
> > >> Besides virtio-mem-pci, there'll also be a similar devfn issue with
> > >> all MIG_PRI_PCI_BUS devices, as they'll be allocated just like other
> > >> pci devices.  Say, the two command lines below will generate
> > >> different pci topologies too:
> > >>
> > >>   $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \
> > >>                        -device pcie-root-port,chassis=1 \
> > >>                        -device virtio-net-pci
> > >>
> > >> And:
> > >>
> > >>   $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \
> > >>                        -device virtio-net-pci \
> > >>                        -device pcie-root-port,chassis=1
> > >>
> > >> This cannot be solved by keeping priority==0 ordering.
> > >>
> > >> On second thought, I think I was initially wrong to treat migration
> > >> priority and device realization as the same problem.
> > >>
> > >> For example, for live migration we have a requirement that PCI_BUS
> > >> be migrated earlier than MIG_PRI_IOMMU, because the IOMMU relies on
> > >> the bus number to find address spaces.  However that's definitely
> > >> not a requirement for device realization; say, realizing the vIOMMU
> > >> after the pci buses is fine (bus numbers are assigned during bios).
> > >>
> > >> I've probably mixed up the two ideas (though they really look alike!).
> > >> Sorry about that.
> > >>
> > >> Since the only ordering constraint so far is IOMMU vs all the rest
> > >> of the devices, I'll introduce a new priority mechanism and only
> > >> make sure vIOMMUs are realized earlier.  That'll also avoid other
> > >> implications for pci devfn allocations.
> > >>
> > >> Will rework a new version tomorrow.  Thanks a lot for all the comments,
> > >
> > > Is it really a good idea to magically reorder device realization just
> > > to make a non-working command line work?  Why can't we just fail the
> > > non-working command line in a way that tells users how to get a
> > > working one?  We have way too much ordering magic already...
> > >
> > > If we decide we want more magic, then I'd argue for *dependencies*
> > > instead of priorities.  Dependencies are specific and local: $this
> > > needs to go after $that because $reasons.  Priorities are unspecific
> > > and global.
> >
> > Having thought about this a bit more...
> >
> > Constraints on realize order are nothing new.  For instance, when a
> > device plugs into a bus, it needs to be realized after the device
> > providing the bus.
> >
> > We ensure this by having the device refer to the bus, e.g. bus=pci.0.
> > The reference may be implicit, but it's there.  It must resolve for
> > device creation to succeed, and if it resolves, the device providing
> > the bus will be realized in time.
> >
> > I believe what's getting us into trouble with the IOMMU is not having
> > such a reference.  Or in other words, keeping the dependence between
> > the IOMMU and the devices relying on it *implicit*, and thus hidden
> > from the existing realize-ordering machinery.
[1]

> > Instead of inventing another such machinery, let's try to use the one
> > we already have.
>
> Hmm... I just found that we don't have such machinery, do we?
>
> This does not really work:
>
>   $ ./qemu-system-x86_64 -M q35 -device virtio-net-pci,bus=pcie.1 \
>       -device pcie-root-port,id=pcie.1,bus=pcie.0
>   qemu-system-x86_64: -device virtio-net-pci,bus=pcie.1: Bus 'pcie.1' not found
>
> While this will:
>
>   $ ./qemu-system-x86_64 -M q35 -device pcie-root-port,id=pcie.1,bus=pcie.0 \
>       -device virtio-net-pci,bus=pcie.1

I think I fully agree at [1]: the iommu is special in that it's not only
implicit, but also has no default value.  Pci buses have default values
(the root pci bus, e.g. pcie.0 on q35), so they're actually easier.  When
parsing the "-device" entry for a pci device, we're 100% sure what the
pci bus for the device is, either specified or just the default.  We
don't need to look at the rest of the "-device"s to be sure of it.  We
just try to look up the pci bus; if it's there we continue, otherwise we
abort.

But an IOMMU is different: a device can run without a vIOMMU (in which
case there's no dependency), but when a vIOMMU is specified the
dependency appears.  That's why I am not sure whether the old machinery
and the "early failure" solution would work trivially for IOMMUs.

We could add an "-iommu" parameter and simply parse it before "-device".
However I forgot why Marcel (IIRC) made it a "-device" parameter; also
"-iommu" is not ideal in that the IOMMU unit is indeed implemented as a
device on the archs I'm aware of, so it would be extra glue that just
works around the ordering problem we have right now.  Meanwhile that
solution won't help with the ordering issue of pci bus/device.

That's why I still think the idea of having a global priority for device
realization (or describing the dependencies between devices) makes sense.
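To make the comparison concrete, here is a rough sketch of what such a
global realize priority could look like.  This is not QEMU code: the
names RealizePriority, DevEntry, and sort_for_realize are invented for
illustration.  Devices sort by a per-class priority, with the
command-line index as a tiebreak so everything at the default priority
keeps its "-device" order (and thus its devfn assignment):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch, not QEMU code: order -device entries for realize
 * by a per-class priority, keeping command-line order within one
 * priority so devfn auto-assignment of default-priority pci devices is
 * unchanged. */

typedef enum {
    REALIZE_PRI_DEFAULT = 0,
    REALIZE_PRI_IOMMU   = 1,    /* vIOMMUs realize before everything else */
} RealizePriority;

typedef struct {
    const char *name;
    RealizePriority pri;
    int cmdline_idx;            /* position on the command line */
} DevEntry;

static int realize_cmp(const void *pa, const void *pb)
{
    const DevEntry *a = pa, *b = pb;

    if (a->pri != b->pri) {
        return b->pri - a->pri;             /* higher priority first */
    }
    return a->cmdline_idx - b->cmdline_idx; /* stable within a priority */
}

static void sort_for_realize(DevEntry *devs, size_t n)
{
    /* qsort() is not stable, so the explicit cmdline_idx tiebreak above
     * is what preserves -device ordering within each priority. */
    qsort(devs, n, sizeof(*devs), realize_cmp);
}
```

Because qsort() gives no ordering guarantee for equal keys, the explicit
command-line-index tiebreak is the part that keeps the existing pci
topology stable for all default-priority devices.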
We can actually fix both the IOMMU and the pci bus cases: alongside
fixing the IOMMU, we could allow a pci bus to be put later on the
command line than a pci device that belongs to it.

IMHO a global priority list (maybe a realize_priority field in the
device class) is just a simpler version of device dependencies.  Say, at
least we don't need to worry about cyclic dependencies.  So far the
ordering requirements are still simple, so defining them globally should
fix most of the issues already, in a straightforward way and with fewer
LOCs.  If it gets complicated one day, we can always switch the global
priority list to device dependencies easily, because that's not guest
ABI.

Any further thoughts will be greatly welcomed.

Thanks,

-- 
Peter Xu
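For contrast, the dependency-based alternative Markus argues for above
might look roughly like the following sketch.  Again these are
hypothetical names, not QEMU code: each device records the devices it
must realize after, and realization walks the graph depth-first:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical contrast to the global-priority idea: explicit
 * per-device dependencies resolved by a depth-first walk.  All names
 * are invented for illustration. */

#define MAX_DEVS 16

typedef struct Dev {
    const char *name;
    struct Dev *deps[4];    /* devices that must realize before this one */
    size_t ndeps;
    bool realized;
} Dev;

static void realize_with_deps(Dev *d, const char **order, size_t *n)
{
    if (d->realized) {
        return;
    }
    /* Marking before the recursion avoids infinite loops on a cycle,
     * but a cycle would still be silently realized in a broken order;
     * real code would have to detect and reject cycles. */
    d->realized = true;
    for (size_t i = 0; i < d->ndeps; i++) {
        realize_with_deps(d->deps[i], order, n);
    }
    order[(*n)++] = d->name;
}
```

This illustrates the cycle concern mentioned above: unlike a flat global
priority list, a dependency graph has to detect and reject cycles, which
the sketch merely papers over.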