On Wed, Feb 15, 2017 at 9:47 PM, Alex Williamson <alex.william...@redhat.com > wrote:
> On Thu, 16 Feb 2017 10:28:39 +0800 > Peter Xu <pet...@redhat.com> wrote: > > > On Wed, Feb 15, 2017 at 11:15:52AM -0700, Alex Williamson wrote: > > > > [...] > > > > > > Alex, do you like something like below to fix above issue that > Jintack > > > > has encountered? > > > > > > > > (note: this code is not for compile, only trying show what I mean...) > > > > > > > > ------8<------- > > > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c > > > > index 332f41d..4dca631 100644 > > > > --- a/hw/vfio/pci.c > > > > +++ b/hw/vfio/pci.c > > > > @@ -1877,25 +1877,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice > *vdev) > > > > */ > > > > config = g_memdup(pdev->config, vdev->config_size); > > > > > > > > - /* > > > > - * Extended capabilities are chained with each pointing to the > next, so we > > > > - * can drop anything other than the head of the chain simply by > modifying > > > > - * the previous next pointer. For the head of the chain, we > can modify the > > > > - * capability ID to something that cannot match a valid > capability. ID > > > > - * 0 is reserved for this since absence of capabilities is > indicated by > > > > - * 0 for the ID, version, AND next pointer. However, > pcie_add_capability() > > > > - * uses ID 0 as reserved for list management and will > incorrectly match and > > > > - * assert if we attempt to pre-load the head of the chain with > this ID. > > > > - * Use ID 0xFFFF temporarily since it is also seems to be > reserved in > > > > - * part for identifying absence of capabilities in a root > complex register > > > > - * block. If the ID still exists after adding capabilities, > switch back to > > > > - * zero. We'll mark this entire first dword as emulated for > this purpose. > > > > - */ > > > > - pci_set_long(pdev->config + PCI_CONFIG_SPACE_SIZE, > > > > - PCI_EXT_CAP(0xFFFF, 0, 0)); > > > > - pci_set_long(pdev->wmask + PCI_CONFIG_SPACE_SIZE, 0); > > > > - pci_set_long(vdev->emulated_config_bits + > PCI_CONFIG_SPACE_SIZE, ~0); > > > > - > > > > for (next = PCI_CONFIG_SPACE_SIZE; next; > > > > next = PCI_EXT_CAP_NEXT(pci_get_long(config + next))) { > > > > header = pci_get_long(config + next); > > > > @@ -1917,6 +1898,8 @@ static void vfio_add_ext_cap(VFIOPCIDevice > *vdev) > > > > switch (cap_id) { > > > > case PCI_EXT_CAP_ID_SRIOV: /* Read-only VF BARs confuse > OVMF */ > > > > case PCI_EXT_CAP_ID_ARI: /* XXX Needs next function > virtualization */ > > > > + /* keep this ecap header (4 bytes), but mask cap_id to > 0xffff */ > > > > + ... > > > > trace_vfio_add_ext_cap_dropped(vdev->vbasedev.name, > cap_id, next); > > > > break; > > > > default: > > > > @@ -1925,11 +1908,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice > *vdev) > > > > > > > > } > > > > > > > > - /* Cleanup chain head ID if necessary */ > > > > - if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == > 0xFFFF) { > > > > - pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0); > > > > - } > > > > - > > > > g_free(config); > > > > return; > > > > } > > > > ----->8----- > > > > > > > > Since after all we need the assumption that 0xffff is reserved for > > > > cap_id. Then, we can just remove the "first 0xffff then 0x0" hack, > > > > which is imho error-prone and hacky. > > > > > > This doesn't fix the bug, which is that pcie_add_capability() uses a > > > valid capability ID for it's own internal tracking. It's only doing > > > this to find the end of the capability chain, which we could do in a > > > spec complaint way by looking for a zero next pointer. Fix that and > > > then vfio doesn't need to do this set to 0xffff then back to zero > > > nonsense at all. Capability ID zero is valid. Thanks, > > > > Yeah I see Michael's fix on the capability list stuff. However, imho > > these are two different issues? Or say, even if with that patch, we > > should still need this hack (first 0x0, then 0xffff) right? Since > > looks like that patch didn't solve the problem if the first pcie ecap > > is masked at 0x100. > > I thought the problem was that QEMU in the host exposes a device with a > capability ID of 0 to the L1 guest. QEMU in the L1 guest balks at a > capability ID of 0 because that's how it finds the end of the chain. > Therefore if we make QEMU not use capability ID 0 for internal > purposes, things work. vfio using 0xffff and swapping back to 0x0 > becomes unnecessary, but doesn't hurt anything. Thanks, > I've applied Peter's hack and Michael's patch below, but still can't use the assigned device in L2. commit 4bb571d857d973d9308d9fdb1f48d983d6639bd4 Author: Michael S. Tsirkin <m...@redhat.com> Date: Wed Feb 15 22:37:45 2017 +0200 pci/pcie: don't assume cap id 0 is reserved I was able to boot L2 with following qemu warnings, qemu-system-x86_64: vfio: Cannot reset device 0000:00:03.0, no available reset mechanism. qemu-system-x86_64: vfio: Cannot reset device 0000:00:03.0, no available reset mechanism. but then I don't see the network device, which I was trying to assign to L2, in L2. This is from L2 dmesg, and it looks like the device is not initialized. [ 5.884115] mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014) [ 5.891563] mlx4_core: Initializing 0000:00:03.0 [ 5.896947] ACPI: PCI Interrupt Link [GSIH] enabled at IRQ 23 [ 6.913559] mlx4_core 0000:00:03.0: Installed FW has unsupported command interface revision 0 [ 6.920925] mlx4_core 0000:00:03.0: (Installed FW version is 0.0.000) [ 6.926490] mlx4_core 0000:00:03.0: This driver version supports only revisions 2 to 3 [ 6.933300] mlx4_core 0000:00:03.0: QUERY_FW command failed, aborting [ 6.940279] mlx4_core 0000:00:03.0: Failed to init fw, aborting. This is the full kernel log from L2. https://paste.ubuntu.com/24039462/ L0, L1 and L2 are using the same kernel, so I think they are using the same device driver. This is the L0/L1 kernel log about the network device. --- From L0 --- [ 8.175533] mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014) [ 8.175543] mlx4_core: Initializing 0000:08:00.0 [ 14.524093] mlx4_core 0000:08:00.0: PCIe link speed is 8.0GT/s, device supports 8.0GT/s [ 14.533030] mlx4_core 0000:08:00.0: PCIe link width is x8, device supports x8 [ 14.714296] mlx4_en: Mellanox ConnectX HCA Ethernet driver v2.2-1 (Feb 2014) [ 14.722295] mlx4_en 0000:08:00.0: Activating port:2 [ 14.735186] mlx4_en: 0000:08:00.0: Port 2: Using 128 TX rings [ 14.741608] mlx4_en: 0000:08:00.0: Port 2: Using 8 RX rings [ 14.747826] mlx4_en: 0000:08:00.0: Port 2: frag:0 - size:1522 prefix:0 stride:1536 [ 14.756698] mlx4_en: 0000:08:00.0: Port 2: Initializing port [ 14.764036] mlx4_en 0000:08:00.0: registered PHC clock --- From L1 --- [ 3.790302] mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014) [ 3.791089] mlx4_core: Initializing 0000:00:03.0 [ 9.053077] mlx4_core 0000:00:03.0: Unable to determine PCIe device BW capabilities [ 9.203290] mlx4_en: Mellanox ConnectX HCA Ethernet driver v2.2-1 (Feb 2014) [ 9.204503] mlx4_en 0000:00:03.0: Activating port:2 [ 9.212853] mlx4_en: 0000:00:03.0: Port 2: Using 32 TX rings [ 9.213514] mlx4_en: 0000:00:03.0: Port 2: Using 4 RX rings [ 9.214131] mlx4_en: 0000:00:03.0: Port 2: frag:0 - size:1522 prefix:0 stride:1536 [ 9.215260] mlx4_en: 0000:00:03.0: Port 2: Initializing port [ 9.216377] mlx4_en 0000:00:03.0: registered PHC clock [ 9.261518] mlx4_en: eth1: Link Up [ 9.690730] mlx4_core 0000:00:03.0 eth2: renamed from eth1 Any thoughts? > Alex > >