On Thu, Jul 16, 2020 at 12:32:44PM +0200, Greg Kurz wrote:
> On Thu, 16 Jul 2020 14:45:40 +1000
> David Gibson <da...@gibson.dropbear.id.au> wrote:
> 
> > On Thu, Jul 09, 2020 at 07:12:47PM +0200, Greg Kurz wrote:
> > > Some recent error handling cleanups unveiled issues with our support of
> > > PCI bridges:
> > > 
> > > 1) QEMU aborts when using non-standard PCI bridge types,
> > >    unveiled by commit 7ef1553dac "spapr_pci: Drop some dead error 
> > > handling"
> > > 
> > > $ qemu-system-ppc64 -M pseries -device pcie-pci-bridge
> > > Unexpected error in object_property_find() at qom/object.c:1240:
> > > qemu-system-ppc64: -device pcie-pci-bridge: Property '.chassis_nr' not 
> > > found
> > > Aborted (core dumped)
> > 
> > Oops, I thought we had a check that we actually had a "pci-bridge"
> > device before continuing with the hotplug, but I guess not.
> 
> Ah... are you suggesting we should explicitly check the actual type
> of the bridge rather than looking for the "chassis_nr" property ?

Uh.. I thought about it, but I don't think it matters much which way
we do it.

> > > This happens because we assume all PCI bridge types to have a "chassis_nr"
> > > property. This property only exists with the standard PCI bridge type
> > > "pci-bridge" actually. We could possibly revert 7ef1553dac but it seems
> > > much simpler to check the presence of "chassis_nr" earlier.
> > 
> > Hrm, right, 7ef1553dac was not really correct since add_drcs() really
> > can fail.
> 
> Yes.
> 
> > > 2) QEMU abort if same "chassis_nr" value is used several times,
> > >    unveiled by commit d2623129a7de "qom: Drop parameter @errp of
> > >    object_property_add() & friends"
> > > 
> > > $ qemu-system-ppc64 -M pseries -device pci-bridge,chassis_nr=1 \
> > >                         -device pci-bridge,chassis_nr=1
> > > Unexpected error in object_property_try_add() at qom/object.c:1167:
> > > qemu-system-ppc64: -device pci-bridge,chassis_nr=1: attempt to add 
> > > duplicate property '40000100' to object (type 'container')
> > > Aborted (core dumped)
> > > 
> > > This happens because we assume that "chassis_nr" values are unique, but
> > > nobody enforces that and we end up generating duplicate DRC ids. The PCI
> > > code doesn't really care for duplicate "chassis_nr" properties since it
> > > is only used to initialize the "Chassis Number Register" of the bridge,
> > > with no functional impact on QEMU. So, even if passing the same value
> > > several times might look weird, it never broke anything before, so
> > > I guess we don't necessarily want to enforce strict checking in the PCI
> > > code now.
> > 
> > Yeah, I guess.  I'm pretty sure that the chassis number of bridges is
> > supposed to be system-unique (well, unique within the PCI domain at
> > least, I guess) as part of the hardware spec.  So specifying multiple
> > chassis ids the same is a user error, but we need a better failure
> > mode.
> 
> According to section 3.2.6.4 of "PCI-to-PCI Bridge Architecture
> Specification", the chassis number is exposed to the OS as a 
> non-volatile r/w register.

Argh.  Dammit.  I could have sworn I checked that chassis numbers were
supposed to be unique (and not guest alterable).  That's the whole
reason I chose it.

> It seems to be expected that chassis
> numbers might collide, in which case the system software can
> overwrite the register with a new number. So I'm not sure that
> specifying the same number multiple times is an actual user error.
> 
> > > Workaround both issues in the PAPR code: check that the bridge has a
> > > unique and non null "chassis_nr" when plugging it into its parent bus.
> > >
> > > Fixes: 05929a6c5dfe ("spapr: Don't use bus number for building DRC ids")
> > 
> > Arguably, it's really fixing 7ef1553dac.
> 
> True for issue 1 but not for issue 2, which is the result of
> 05929a6c5dfe (switch to "chassis_nr" introduces a condition
> to end up with duplicate DRC ids)

Well, technically.  But the method we had before was *way* more broken
than chassis numbers.  It was using bus number which is not stable
across the VM's lifetime, which is a non-starter.  Plus, the bus
number too won't be unique, until the guest has enumerated the bus,
which is too late for DRC creation.

The only reason we got away with it, was that it was basically dead
code - at that stage we didn't support hotplug under bridges, so we
never actually created DRCs except for the root bus.

> and d2623129a7de (assert
> when trying to add a duplicated DRC).

Well, again, only on a technicality.  It might not have immediatley
assert()ed, but adding a duplicated DRC was still completely broken
before that.

> I'm starting to think I should maybe split this in
> two patches. One for each issue.

At this stage, I'd kind of prefer to merge this fix now, with the
intention of doing a pull request tomorrow.  AFAICT none of the
suggested improvements can't be done as followups.

> > > Reported-by: Thomas Huth <th...@redhat.com>
> > > Signed-off-by: Greg Kurz <gr...@kaod.org>
> > 
> > I had a few misgivings about the details of this, but I think I've
> > convinced myself they're fine.  There's a couple of things I'd like to
> > polish, but I'll do that as a follow up.
> 
> Some fixes for d2623129a7de just got merged. Let me have a look
> again.

In fact the main part of the polish I was thinking of didn't really
work out.  The only change I made was a tiny move to the check (it's
not really relevant until *after* we've checked if hotplug is enabled
at all).  So I just folded that in and put it into the ppc-for-5.1
tree.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson

Attachment: signature.asc
Description: PGP signature

Reply via email to