On Wed, Aug 25, 2021 at 05:50:23PM -0400, Peter Xu wrote:
> On Wed, Aug 25, 2021 at 02:28:55PM +0200, Markus Armbruster wrote:
> > Markus Armbruster <arm...@redhat.com> writes:
> > 
> > > Peter Xu <pet...@redhat.com> writes:
> > >
> > >> On Mon, Aug 23, 2021 at 05:56:23PM -0400, Eduardo Habkost wrote:
> > >>> I don't have any other example, but I assume address assignment
> > >>> based on ordering is a common pattern in device code.
> > >>> 
> > >>> I would take a very close and careful look at the devices with
> > >>> non-default vmsd priority.  If you can prove that the 13 device
> > >>> types with non-default priority are all order-insensitive, a
> > >>> custom sort function as you describe might be safe.
> > >>
> > >> Besides virtio-mem-pci, there'll also similar devfn issue with all
> > >> MIG_PRI_PCI_BUS, as they'll be allocated just like other pci devices.  
> > >> Say,
> > >> below two cmdlines will generate different pci topology too:
> > >>
> > >>   $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \
> > >>                        -device pcie-root-port,chassis=1 \
> > >>                        -device virtio-net-pci
> > >>
> > >> And:
> > >>
> > >>   $ qemu-system-x86_64 -device pcie-root-port,chassis=0 \
> > >>                        -device virtio-net-pci
> > >>                        -device pcie-root-port,chassis=1 \
> > >>
> > >> This cannot be solved by keeping priority==0 ordering.
> > >>
> > >> After a second thought, I think I was initially wrong on seeing migration
> > >> priority and device realization the same problem.
> > >>
> > >> For example, for live migration we have a requirement on PCI_BUS being 
> > >> migrated
> > >> earlier than MIG_PRI_IOMMU because there's bus number information 
> > >> required
> > >> because IOMMU relies on the bus number to find address spaces.  However 
> > >> that's
> > >> definitely not a requirement for device realizations, say, realizing 
> > >> vIOMMU
> > >> after pci buses are fine (bus assigned during bios).
> > >>
> > >> I've probably messed up with the ideas (though they really look alike!). 
> > >>  Sorry
> > >> about that.
> > >>
> > >> Since the only ordering constraint so far is IOMMU vs all the rest of 
> > >> devices,
> > >> I'll introduce a new priority mechanism and only make sure vIOMMUs are 
> > >> realized
> > >> earlier.  That'll also avoid other implications on pci devfn allocations.
> > >>
> > >> Will rework a new version tomorrow.  Thanks a lot for all the comments,
> > >
> > > Is it really a good idea to magically reorder device realization just to
> > > make a non-working command line work?  Why can't we just fail the
> > > non-working command line in a way that tells users how to get a working
> > > one?  We have way too much ordering magic already...
> > >
> > > If we decide we want more magic, then I'd argue for *dependencies*
> > > instead of priorities.  Dependencies are specific and local: $this needs
> > > to go after $that because $reasons.  Priorities are unspecific and
> > > global.
> > 
> > Having thought about this a bit more...
> > 
> > Constraints on realize order are nothing new.  For instance, when a
> > device plugs into a bus, it needs to be realized after the device
> > providing the bus.
> > 
> > We ensure this by having the device refer to the bus, e.g. bus=pci.0.
> > The reference may be implicit, but it's there.  It must resolve for
> > device creation to succeed, and if it resolves, the device providing the
> > bus will be realized in time.
> > 
> > I believe what's getting us into trouble with IOMMU is not having such a
> > reference.  Or in other words, keeping the dependence between the IOMMU
> > and the devices relying on it *implicit*, and thus hidden from the
> > existing realize-ordering machinery.

[1]

> > 
> > Instead of inventing another such machinery, let's try to use the one we
> > already have.
> 
> Hmm... I just found that we don't have such machinery, do we?
> 
> This does not really work:
> 
> $ ./qemu-system-x86_64 -M q35 -device virtio-net-pci,bus=pcie.1 \
>                        -device pcie-root-port,id=pcie.1,bus=pcie.0
> qemu-system-x86_64: -device virtio-net-pci,bus=pcie.1: Bus 'pcie.1' not found
> 
> While this will:
> 
> $ ./qemu-system-x86_64 -M q35 -device pcie-root-port,id=pcie.1,bus=pcie.0 \
>                        -device virtio-net-pci,bus=pcie.1

I think I fully agree at [1], the iommu is special in that it's not only
implicit, but also does not have a default value.  Pci buses have default
values (the root pci bus; e.g. pcie.0 on q35), so it's actually easier.

When parsing the "-device" entry for a pci device, we'll 100% sure what's the
pci bus for the device, either specified or it's just the default.  We don't
look at the rest of "-device"s to be sure of it.  We just try to look up the
pci bus, if it's there we continue, otherwise we abort.

But IOMMU is different, the device can run without a vIOMMU (in which case
there's no dependency), but when vIOMMU is specified there's the dependency.

That's why I am not sure whether the old machinery and "early failure" solution
would work trivially for IOMMUs.

We can add an "-iommu" parameter, then we simply parse it before "-device".
However I forgot why Marcel (IIRC) made it a "-device" parameter, also "-iommu"
is not ideal in that the IOMMU unit is indeed implemented as a device for the
archs I'm aware of, so it's kind of some extra glue that seems to just work
around the ordering problem we have right now.  Meanwhile that solution won't
help the ordering issue of pci bus/device.

That's why I still think the idea of having a global priority for device
realization (or describe the dependency of devices) makes sense.  We can
actually fix both IOMMU and pci bus so we can allow pci bus to be put latter
than the pci device that belongs to the bus alongside of fixing the IOMMU.

IMHO a list of global priority (maybe a realize_priority field in the class of
devices) is just a simpler version of device dependencies.  Say, at least we
don't need to worry about cycle-dependency issues.  So far the ordering
requirement is still simple, so globally define them should fix most of the
issues already and in a straightforward way, also less LOCs.  If it goes
complicated one day, we can always switch the global priority list into device
dependency easily, because that's not guest ABI.

Any further thoughts will be greatly welcomed.

Thanks,

-- 
Peter Xu


Reply via email to