Re: [PATCH updated v2] spapr: Fix EEH capability issue on KVM guest for PCI passthru

2021-05-13 Thread Oliver O'Halloran
On Thu, May 13, 2021 at 2:22 PM David Gibson
 wrote:
>
> On Wed, May 05, 2021 at 08:18:27PM +0530, Mahesh Salgaonkar wrote:
> > With upstream kernel, especially after commit 98ba956f6a389
> > ("powerpc/pseries/eeh: Rework device EEH PE determination") we see that KVM
> > guest isn't able to enable EEH option for PCI pass-through devices anymore.
> >
> > [root@atest-guest ~]# dmesg | grep EEH
> > [0.032337] EEH: pSeries platform initialized
> > [0.298207] EEH: No capable adapters found: recovery disabled.
> > [root@atest-guest ~]#
> >
> > So far the linux kernel was assuming pe_config_addr equal to device's
> > config_addr and using it to enable EEH on the PE through ibm,set-eeh-option
> > RTAS call. Which wasn't the correct way as per PAPR. The linux kernel
> > commit 98ba956f6a389 fixed this flow. With that fixed, linux now uses PE
> > config address returned by ibm,get-config-addr-info2 RTAS call to enable
> > EEH option per-PE basis instead of per-device basis. However this has
> > uncovered a bug in qemu where ibm,set-eeh-option is treating PE config
> > address as per-device config address.
>
> Huh.  To be fair, the stuff about this in PAPR is nearly
> incomprehensible, so we probably used what the kernel was doing as a
> guide instead.

I found the PAPR documentation made some sense after I learned how EEH
was handled on PCI(-X) systems. What's in Linux never made sense,
unfortunately.

> Hmm.. shouldn't we at least check that the supplied config_addr
> matches the one it should be for this PHB, rather than just ignoring
> it?

I think that'd cause issues with older kernels. Prior to the rework
mentioned by Mahesh (linux commit 98ba956f6a389 ("powerpc/pseries/eeh:
Rework device EEH PE determination")) the kernel would call
eeh-set-option for each device in the PE using the device's
config_address as the argument rather than the PE address. If we
return an error from eeh-set-option when the argument isn't a valid PE
address then older kernels will interpret that as EEH not being
supported. That really needs to be called out in a comment though.
Preferably with kernel version numbers, etc.

> ..and, looking back at rtas_ibm_get_config_addr_info2(), I think
> that looks wrong in the case of PCI bridges.  AFAICT it gives an
> address that depends on the bus, but in other places we assume that
> the entire PHB is a single PE on the guest side, so it really
> shouldn't.

Yep, get_config_addr_info2 should map every device inside that PE to
the same PE address, even when they're on child busses. That said, I'm
not sure how well EEH works when there's a mix of real (vfio) and
emulated (qemu bridges) devices in the same PHB. Can VFIO pass through
a bridge?



Re: [PATCH] spapr: Modify ibm, get-config-addr-info2 to set DEVNUM in PE config address.

2021-04-29 Thread Oliver O'Halloran
On Thu, Apr 29, 2021 at 7:02 PM Mahesh J Salgaonkar
 wrote:
>
> On 2021-04-28 22:33:45 Wed, Oliver O'Halloran wrote:
> > On Tue, Apr 27, 2021 at 9:56 PM Mahesh Salgaonkar  
> > wrote:
> > >
> > > With upstream kernel, especially after commit 98ba956f6a389
> > > ("powerpc/pseries/eeh: Rework device EEH PE determination") we see that 
> > > KVM
> > > guest isn't able to enable EEH option for PCI pass-through devices 
> > > anymore.
> >
> > How are you passing the devices through to the guest?
>
> I am using libvirt with below xml section to add pass-through:
>
> 
>   
>   
> 
>   
>function='0x0' multifunction='on'/>
> 
> 
>   
>   
> 
>   
>function='0x1' multifunction='on'/>
> 
>
> Looks like libvirt does not allow pass through device in slot zero, and
> throws following error.
>
> error: XML error: Invalid PCI address :01:00.0. slot must be >= 1
> Failed. Try again? [y,n,i,f,?]:

That's pretty odd and I have no idea why that's happening. I seem to
remember being able to use slot 0 for vfio devices when doing the
passthru manually with the qemu command line so this might be a
libvirt quirk.

> *snip*
>
> Agree. I realize my fix is not correctly handling this. The current code
> under ibm,set-eeh-option is checking for individual PCI device presence.
> Better fix should be to check if there is any PCI device (vfio-pci)
> present under specified bus and enable the EEH if found. And no change
> in return value of get-config-addr-info2. What do you say ?

That sounds reasonable. You would however need to verify that all the
devices on that bus are within the same PE on the hypervisor side.



Re: [PATCH] spapr: Modify ibm, get-config-addr-info2 to set DEVNUM in PE config address.

2021-04-28 Thread Oliver O'Halloran
On Tue, Apr 27, 2021 at 9:56 PM Mahesh Salgaonkar  wrote:
>
> With upstream kernel, especially after commit 98ba956f6a389
> ("powerpc/pseries/eeh: Rework device EEH PE determination") we see that KVM
> guest isn't able to enable EEH option for PCI pass-through devices anymore.

How are you passing the devices through to the guest?

> [root@atest-guest ~]# dmesg | grep EEH
> [0.032337] EEH: pSeries platform initialized
> [0.298207] EEH: No capable adapters found: recovery disabled.
> [root@atest-guest ~]#
>
> So far the linux kernel was assuming pe_config_addr equal to device's
> config_addr and using it to enable EEH on the PE through ibm,set-eeh-option
> RTAS call. Which wasn't the correct way as per PAPR. The linux kernel
> commit 98ba956f6a389 fixed this flow. With that fixed, linux now gets the
> PE config address first using the ibm,get-config-addr-info2 RTAS call, and
> then uses the found address as argument to ibm,set-eeh-option RTAS call to
> enable EEH on the PCI device.

That's not really correct. EEH being enabled or disabled is a per-PE
setting rather than a per-device setting. The fact there's not a 1-1
correspondence between devices and PEs is a large part of why the
get-config-addr-info2 RTAS call exists in the first place.
Unfortunately, the initial implementation of EEH support in linux
conflated the two because in the past there was typically a single
device within a PE. However, that assumption was never really correct
and it has long outlived its usefulness.

> This works on PowerVM lpar but fails in qemu
> KVM guests. This is because ibm,set-eeh-option RTAS call in qemu expects PE
> config address bits 16:20 be populated with device number (DEVNUM).
>
> The rtas call ibm,get-config-addr-info2 in qemu always returns the PE
> config address in form of "00BB0001" (i.e.  <00>) where
> "BB" represents the bus number of PE's primary bus and with device number
> information always set to zero. However until commit 98ba956f6a389 this
> return value wasn't used to enable EEH on the PCI device.
>
> Now in qemu guest with recent kernel the ibm,set-eeh-option RTAS call fails
> with -3 return value indicating that there is no PCI device exist for the
> specified pe config address. The rtas_ibm_set_eeh_option call uses
> pci_find_device() to get the PC device that matches specific bus and devfn
> extracted from PE config address passed as argument. Since the DEVFN part
> of PE config always contains zero, pci_find_device() fails to find the
> specific PCI device and hence fails to enable the EEH capability.
>
> hw/ppc/spapr_pci_vfio.c: spapr_phb_vfio_eeh_set_option()
>case RTAS_EEH_ENABLE: {
> PCIHostState *phb;
> PCIDevice *pdev;
>
> /*
>  * The EEH functionality is enabled on basis of PCI device,
>  * instead of PE. We need check the validity of the PCI
>  * device address.
>  */
> phb = PCI_HOST_BRIDGE(sphb);
> pdev = pci_find_device(phb->bus,
>(addr >> 16) & 0xFF, (addr >> 8) & 0xFF);
> if (!pdev || !object_dynamic_cast(OBJECT(pdev), "vfio-pci")) {
> return RTAS_OUT_PARAM_ERROR;
> }
>
> This patch fixes this issue by populating DEVNUM field (bits 16:20) of PE
> config address with device number.

I don't think this is a good idea and I'm fairly sure you're
introducing some subtle breakage here. As an example, say that on the
host side you have two devices on the same bus:

:00:00.0 - card 1
:00:01.0 - card 2

On PowerNV (i.e. the hypervisor) these would be placed into the same
hardware PE since they're on the same bus. Now, if I take both devices
and pass them through to the guest on the same PHB and bus (use
0005:ff) we'll have:

0005:ff:00.0 - card 1
0005:ff:01.0 - card 2

With this patch applied get-config-addr-info2 returns 00BBD001, so the
PE we get for each device will be:

card 1 - 00ff0001
card 2 - 00ff1001

Which implies the two are in different PEs. As a result, if the guest
requests a reset of card 1's PE then the guest will see an unexpected
reset of card 2 as well. From the hypervisor's point of view the two
are in the same PE so this is a legitimate thing to do, but due to
this patch the guest doesn't know that.

As far as I can remember this is why you're supposed to pass each EEH
capable devices to the guest on a seperate spapr-phb (which matches
what PHYP does). Alexy can probably tell you more.

Oliver



Re: [PATCH 1/2] ppc/pnv: Add models for POWER9 PHB4 PCIe Host bridge

2020-01-28 Thread Oliver O'Halloran
On Wed, Jan 29, 2020 at 2:09 PM David Gibson
 wrote:
>
> On Mon, Jan 27, 2020 at 03:45:05PM +0100, Cédric Le Goater wrote:
> > From: Benjamin Herrenschmidt 
> >

*snip*

> > +
> > +/*
> > + * The CONFIG_DATA register expects little endian accesses, but as the
> > + * region is big endian, we have to swap the value.
> > + */
> > +static void pnv_phb4_config_write(PnvPHB4 *phb, unsigned off,
> > +  unsigned size, uint64_t val)
> > +{
> > +uint32_t cfg_addr, limit;
> > +PCIDevice *pdev;
> > +
> > +pdev = pnv_phb4_find_cfg_dev(phb);
> > +if (!pdev) {
> > +return;
> > +}
> > +cfg_addr = (phb->regs[PHB_CONFIG_ADDRESS >> 3] >> 32) & 0xffc;
> > +cfg_addr |= off;
> > +limit = pci_config_size(pdev);
> > +if (limit <= cfg_addr) {
> > +/*
> > + * conventional pci device can be behind pcie-to-pci bridge.
> > + * 256 <= addr < 4K has no effects.
> > + */
> > +return;
> > +}
> > +switch (size) {
> > +case 1:
> > +break;
> > +case 2:
> > +val = bswap16(val);
>
> I'm a little confused by these byteswaps.  As I see below the device
> is set to big endian, so the values passed in here should already be
> in host-native endian.  Why do you need the swap?  Are some of the
> registers in the bank BE and some LE?

All the registers are BE except for CONFIG_DATA, which isn't actually
a register. It's really a window into the config space of the device
specified in CONFIG_ADDR so it doesn't do any byte-swapping.

> > +break;
> > +case 4:
> > +val = bswap32(val);
> > +break;
> > +default:
> > +g_assert_not_reached();
> > +}
> > +pci_host_config_write_common(pdev, cfg_addr, limit, val, size);
> > +}