Re: [Qemu-devel] iommu emulation

2017-03-02 Thread Bandan Das
Peter Xu  writes:

> On Thu, Mar 02, 2017 at 05:20:19PM -0500, Bandan Das wrote:
>> Jintack Lim  writes:
>> 
>> > [cc Bandan]
>> >
...
>> 
>> Jintack, any progress with this ?
>> 
>> I am testing on a X540-AT2 and I see a different behavior. It appears
>> config succeeds but the driver keeps resetting the device due to a Tx
>> hang:
>> 
>> [ 568.612391 ] ixgbe 0000:00:03.0 enp0s3: tx hang 38 detected on queue 0,
>> resetting adapter
>> [ 568.612393 ]  ixgbe 0000:00:03.0 enp0s3: initiating reset due to tx
>> timeout
>> [ 568.612397 ]  ixgbe 0000:00:03.0 enp0s3: Reset adapter
>> 
>> This may be device specific but I think the actual behavior you see is
>> also dependent on the ixgbe driver in the guest. Are you on a recent
>> kernel ? Also, can you point me to the hack (by Peter) that you have
>> mentioned above ?
>
> Hi, Bandan,
>
> Are you using the vtd vfio v7 series or another branch?

Thanks for the tip. Jintack pointed me to your repo and I am using v7
from there.

> If it's the upstream one... just a note that we need to make sure
> "-device intel-iommu" is the first device specified on the QEMU command
> line (it needs to come before "-device vfio-pci,..."). Thanks,
>
> -- peterx



Re: [Qemu-devel] iommu emulation

2017-03-02 Thread Peter Xu
On Thu, Mar 02, 2017 at 05:20:19PM -0500, Bandan Das wrote:
> Jintack Lim  writes:
> 
> > [cc Bandan]
> >
> > On Tue, Feb 21, 2017 at 5:33 AM, Jintack Lim 
> > wrote:
> >
> >>
> >>
> >> On Wed, Feb 15, 2017 at 9:47 PM, Alex Williamson <
> >> alex.william...@redhat.com> wrote:
> ...
> >>
> >
> > I've tried another network device on a different machine. It has "Intel
> > Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection" ethernet
> > controller. I got the same problem of getting the network device
> > initialization failure in L2. I think I'm missing something since I heard
> > from Bandan that he had no problem to assign a device to L2 with ixgbe.
> >
> > This is the error message from dmesg in L2.
> >
> > [3.692871] ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver -
> > version 4.2.1-k
> > [3.697716] ixgbe: Copyright (c) 1999-2015 Intel Corporation.
> > [3.964875] ixgbe 0000:00:02.0: HW Init failed: -12
> > [3.972362] ixgbe: probe of 0000:00:02.0 failed with error -12
> >
> > I checked that L2 indeed had that device.
> > root@guest0:~# lspci
> > 00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM
> > Controller
> > 00:01.0 VGA compatible controller: Device 1234: (rev 02)
> > 00:02.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+
> > Network Connection (rev 01)
> 
> Jintack, any progress with this ?
> 
> I am testing on a X540-AT2 and I see a different behavior. It appears
> config succeeds but the driver keeps resetting the device due to a Tx
> hang:
> 
> [ 568.612391 ] ixgbe 0000:00:03.0 enp0s3: tx hang 38 detected on queue 0,
> resetting adapter
> [ 568.612393 ]  ixgbe 0000:00:03.0 enp0s3: initiating reset due to tx
> timeout
> [ 568.612397 ]  ixgbe 0000:00:03.0 enp0s3: Reset adapter
> 
> This may be device specific but I think the actual behavior you see is
> also dependent on the ixgbe driver in the guest. Are you on a recent
> kernel ? Also, can you point me to the hack (by Peter) that you have
> mentioned above ?

Hi, Bandan,

Are you using the vtd vfio v7 series or another branch?

If it's the upstream one... just a note that we need to make sure
"-device intel-iommu" is the first device specified on the QEMU command
line (it needs to come before "-device vfio-pci,..."). Thanks,

-- peterx
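
For illustration, a minimal L1 command line that follows this ordering might
look like the sketch below; the image path and the host BDF are placeholders,
not values taken from this thread:

qemu-system-x86_64 -M q35,accel=kvm,kernel-irqchip=split \
-m 4G -cpu host \
-device intel-iommu,intremap=on,caching-mode=on \
-device vfio-pci,host=08:00.0,id=net0 \
-drive file=/path/to/l1guest.img,format=raw --nographic

The point is simply that "-device intel-iommu" appears before "-device
vfio-pci", as Peter notes above.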



Re: [Qemu-devel] iommu emulation

2017-03-02 Thread Jintack Lim
On Thu, Mar 2, 2017 at 5:20 PM, Bandan Das  wrote:
> Jintack Lim  writes:
>
>> [cc Bandan]
>>
>> On Tue, Feb 21, 2017 at 5:33 AM, Jintack Lim 
>> wrote:
>>
>>>
>>>
>>> On Wed, Feb 15, 2017 at 9:47 PM, Alex Williamson <
>>> alex.william...@redhat.com> wrote:
> ...
>>>
>>
>> I've tried another network device on a different machine. It has "Intel
>> Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection" ethernet
>> controller. I got the same problem of getting the network device
>> initialization failure in L2. I think I'm missing something since I heard
>> from Bandan that he had no problem to assign a device to L2 with ixgbe.
>>
>> This is the error message from dmesg in L2.
>>
>> [3.692871] ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver -
>> version 4.2.1-k
>> [3.697716] ixgbe: Copyright (c) 1999-2015 Intel Corporation.
>> [3.964875] ixgbe 0000:00:02.0: HW Init failed: -12
>> [3.972362] ixgbe: probe of 0000:00:02.0 failed with error -12
>>
>> I checked that L2 indeed had that device.
>> root@guest0:~# lspci
>> 00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM
>> Controller
>> 00:01.0 VGA compatible controller: Device 1234: (rev 02)
>> 00:02.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+
>> Network Connection (rev 01)
>
> Jintack, any progress with this ?

Not much, unfortunately.

>
> I am testing on a X540-AT2 and I see a different behavior. It appears
> config succeeds but the driver keeps resetting the device due to a Tx
> hang:

Thanks for your effort!

>
> [ 568.612391 ] ixgbe 0000:00:03.0 enp0s3: tx hang 38 detected on queue 0,
> resetting adapter
> [ 568.612393 ]  ixgbe 0000:00:03.0 enp0s3: initiating reset due to tx
> timeout
> [ 568.612397 ]  ixgbe 0000:00:03.0 enp0s3: Reset adapter
>
> This may be device specific but I think the actual behavior you see is
> also dependent on the ixgbe driver in the guest. Are you on a recent
> kernel ? Also, can you point me to the hack (by Peter) that you have
> mentioned above ?

I was using 4.6.0-rc on the machine with Mellanox device, and
4.10.0-rc on the machine with Intel device. L0, L1 and L2 had the same
version.

This is the initial hack from Peter,
--8<---
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 332f41d..bacd302 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1925,11 +1925,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)

     }
 
-    /* Cleanup chain head ID if necessary */
-    if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
-        pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
-    }
-
     g_free(config);
     return;
 }
-->8---

and I believe this is the commit that was merged into the QEMU repo.

commit d0d1cd70d10639273e2a23870e7e7d80b2bc4e21
Author: Alex Williamson 
Date:   Wed Feb 22 13:19:58 2017 -0700

vfio/pci: Improve extended capability comments, skip masked caps


Thanks,
Jintack

>
> Thanks,
> Bandan
>
>> I'm describing steps I took, so if you notice something wrong, PLEASE let
>> me know.
>>
>> 1. [L0] Check the device with lspci. Result is [1]
>> 2. [L0] Unbind from the original driver and bind to vfio-pci driver
>> following [2][3]
>> 3. [L0] Start L1 with this script. [4]
>> 4. [L1] L1 is able to use the network device.
>> 5. [L1] Unbind from the original driver and bind to vfio-pci driver same as
>> the step 2.
>> 6. [L1] Start L2 with this script. [5]
>> 7. [L2] Got the init failure error message above.
>>
>> [1] https://paste.ubuntu.com/24055745/
>> [2] http://www.linux-kvm.org/page/10G_NIC_performance:_VFIO_vs_virtio
>> [3] http://www.linux-kvm.org/images/b/b4/2012-forum-VFIO.pdf
>> [4] https://paste.ubuntu.com/24055715/
>> [5] https://paste.ubuntu.com/24055720/
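
(For reference, steps 2 and 5 above (unbinding the device from ixgbe and
binding it to vfio-pci) are normally done through sysfs, roughly as in the
sketch below. The BDF 0000:08:00.0 and the 8086:10fb vendor:device ID are
placeholder values for a typical 82599 port, not values taken from the pastes
above; substitute the ones for the actual device.

modprobe vfio-pci
echo 0000:08:00.0 > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
echo 8086 10fb > /sys/bus/pci/drivers/vfio-pci/new_id

On kernels that support it, writing "vfio-pci" to the device's driver_override
attribute and then echoing the BDF to /sys/bus/pci/drivers/vfio-pci/bind works
as well.)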
>>
>> Thanks,
>> Jintack
>>
>>
>>>
>>>
 Alex


>>>
>




Re: [Qemu-devel] iommu emulation

2017-03-02 Thread Bandan Das
Jintack Lim  writes:

> [cc Bandan]
>
> On Tue, Feb 21, 2017 at 5:33 AM, Jintack Lim 
> wrote:
>
>>
>>
>> On Wed, Feb 15, 2017 at 9:47 PM, Alex Williamson <
>> alex.william...@redhat.com> wrote:
...
>>
>
> I've tried another network device on a different machine. It has "Intel
> Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection" ethernet
> controller. I got the same problem of getting the network device
> initialization failure in L2. I think I'm missing something since I heard
> from Bandan that he had no problem to assign a device to L2 with ixgbe.
>
> This is the error message from dmesg in L2.
>
> [3.692871] ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver -
> version 4.2.1-k
> [3.697716] ixgbe: Copyright (c) 1999-2015 Intel Corporation.
> [3.964875] ixgbe 0000:00:02.0: HW Init failed: -12
> [3.972362] ixgbe: probe of 0000:00:02.0 failed with error -12
>
> I checked that L2 indeed had that device.
> root@guest0:~# lspci
> 00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM
> Controller
> 00:01.0 VGA compatible controller: Device 1234: (rev 02)
> 00:02.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+
> Network Connection (rev 01)

Jintack, any progress with this ?

I am testing on a X540-AT2 and I see a different behavior. It appears
config succeeds but the driver keeps resetting the device due to a Tx
hang:

[ 568.612391 ] ixgbe 0000:00:03.0 enp0s3: tx hang 38 detected on queue 0,
resetting adapter
[ 568.612393 ]  ixgbe 0000:00:03.0 enp0s3: initiating reset due to tx
timeout
[ 568.612397 ]  ixgbe 0000:00:03.0 enp0s3: Reset adapter

This may be device specific but I think the actual behavior you see is
also dependent on the ixgbe driver in the guest. Are you on a recent
kernel ? Also, can you point me to the hack (by Peter) that you have
mentioned above ?

Thanks,
Bandan

> I'm describing steps I took, so if you notice something wrong, PLEASE let
> me know.
>
> 1. [L0] Check the device with lspci. Result is [1]
> 2. [L0] Unbind from the original driver and bind to vfio-pci driver
> following [2][3]
> 3. [L0] Start L1 with this script. [4]
> 4. [L1] L1 is able to use the network device.
> 5. [L1] Unbind from the original driver and bind to vfio-pci driver same as
> the step 2.
> 6. [L1] Start L2 with this script. [5]
> 7. [L2] Got the init failure error message above.
>
> [1] https://paste.ubuntu.com/24055745/
> [2] http://www.linux-kvm.org/page/10G_NIC_performance:_VFIO_vs_virtio
> [3] http://www.linux-kvm.org/images/b/b4/2012-forum-VFIO.pdf
> [4] https://paste.ubuntu.com/24055715/
> [5] https://paste.ubuntu.com/24055720/
>
> Thanks,
> Jintack
>
>
>>
>>
>>> Alex
>>>
>>>
>>



Re: [Qemu-devel] iommu emulation

2017-02-23 Thread Jintack Lim
[cc Bandan]

On Tue, Feb 21, 2017 at 5:33 AM, Jintack Lim 
wrote:

>
>
> On Wed, Feb 15, 2017 at 9:47 PM, Alex Williamson <
> alex.william...@redhat.com> wrote:
>
>> On Thu, 16 Feb 2017 10:28:39 +0800
>> Peter Xu  wrote:
>>
>> > On Wed, Feb 15, 2017 at 11:15:52AM -0700, Alex Williamson wrote:
>> >
>> > [...]
>> >
>> > > > Alex, do you like something like below to fix above issue that
>> Jintack
>> > > > has encountered?
>> > > >
>> > > > (note: this code is not for compile, only trying show what I
>> mean...)
>> > > >
>> > > > --8<---
>> > > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> > > > index 332f41d..4dca631 100644
>> > > > --- a/hw/vfio/pci.c
>> > > > +++ b/hw/vfio/pci.c
>> > > > @@ -1877,25 +1877,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice
>> *vdev)
>> > > >   */
>> > > >  config = g_memdup(pdev->config, vdev->config_size);
>> > > >
>> > > > -/*
>> > > > - * Extended capabilities are chained with each pointing to the
>> next, so we
>> > > > - * can drop anything other than the head of the chain simply
>> by modifying
>> > > > - * the previous next pointer.  For the head of the chain, we
>> can modify the
>> > > > - * capability ID to something that cannot match a valid
>> capability.  ID
>> > > > - * 0 is reserved for this since absence of capabilities is
>> indicated by
>> > > > - * 0 for the ID, version, AND next pointer.  However,
>> pcie_add_capability()
>> > > > - * uses ID 0 as reserved for list management and will
>> incorrectly match and
>> > > > - * assert if we attempt to pre-load the head of the chain with
>> this ID.
>> > > > - * Use ID 0xFFFF temporarily since it is also seems to be
>> reserved in
>> > > > - * part for identifying absence of capabilities in a root
>> complex register
>> > > > - * block.  If the ID still exists after adding capabilities,
>> switch back to
>> > > > - * zero.  We'll mark this entire first dword as emulated for
>> this purpose.
>> > > > - */
>> > > > -pci_set_long(pdev->config + PCI_CONFIG_SPACE_SIZE,
>> > > > - PCI_EXT_CAP(0xFFFF, 0, 0));
>> > > > -pci_set_long(pdev->wmask + PCI_CONFIG_SPACE_SIZE, 0);
>> > > > -pci_set_long(vdev->emulated_config_bits +
>> PCI_CONFIG_SPACE_SIZE, ~0);
>> > > > -
>> > > >  for (next = PCI_CONFIG_SPACE_SIZE; next;
>> > > >   next = PCI_EXT_CAP_NEXT(pci_get_long(config + next))) {
>> > > >  header = pci_get_long(config + next);
>> > > > @@ -1917,6 +1898,8 @@ static void vfio_add_ext_cap(VFIOPCIDevice
>> *vdev)
>> > > >  switch (cap_id) {
>> > > >  case PCI_EXT_CAP_ID_SRIOV: /* Read-only VF BARs confuse
>> OVMF */
>> > > >  case PCI_EXT_CAP_ID_ARI: /* XXX Needs next function
>> virtualization */
>> > > > +/* keep this ecap header (4 bytes), but mask cap_id to
>> 0xffff */
>> > > > +...
>> > > >  trace_vfio_add_ext_cap_dropped(vdev->vbasedev.name,
>> cap_id, next);
>> > > >  break;
>> > > >  default:
>> > > > @@ -1925,11 +1908,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice
>> *vdev)
>> > > >
>> > > >  }
>> > > >
>> > > > -/* Cleanup chain head ID if necessary */
>> > > > -if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) ==
>> 0xFFFF) {
>> > > > -pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
>> > > > -}
>> > > > -
>> > > >  g_free(config);
>> > > >  return;
>> > > >  }
>> > > > ->8-
>> > > >
>> > > > Since after all we need the assumption that 0xffff is reserved for
>> > > > cap_id, we can then just remove the "first 0xffff then 0x0" hack,
>> > > > which is imho error-prone and hacky.
>> > >
>> > > This doesn't fix the bug, which is that pcie_add_capability() uses a
>> > > valid capability ID for its own internal tracking.  It's only doing
>> > > this to find the end of the capability chain, which we could do in a
>> > > spec-compliant way by looking for a zero next pointer.  Fix that and
>> > > then vfio doesn't need to do this set to 0xffff then back to zero
>> > > nonsense at all.  Capability ID zero is valid.  Thanks,
>> >
>> > Yeah, I see Michael's fix on the capability list stuff. However, imho
>> > these are two different issues. Or say, even with that patch, we
>> > would still need this hack (first 0x0, then 0xffff), right? Since it
>> > looks like that patch didn't solve the problem if the first pcie ecap
>> > is masked at 0x100.
>>
>> I thought the problem was that QEMU in the host exposes a device with a
>> capability ID of 0 to the L1 guest.  QEMU in the L1 guest balks at a
>> capability ID of 0 because that's how it finds the end of the chain.
>> Therefore if we make QEMU not use capability ID 0 for internal
>> purposes, things work.  vfio using 0xffff and swapping back to 0x0
>> becomes unnecessary, but doesn't hurt anything.  Thanks,
>>
>
> I've applied Peter's hack and Michael's patch below, but still can't use
> the assigned device in L2.

Re: [Qemu-devel] iommu emulation

2017-02-21 Thread Jintack Lim
On Wed, Feb 15, 2017 at 9:47 PM, Alex Williamson  wrote:

> On Thu, 16 Feb 2017 10:28:39 +0800
> Peter Xu  wrote:
>
> > On Wed, Feb 15, 2017 at 11:15:52AM -0700, Alex Williamson wrote:
> >
> > [...]
> >
> > > > Alex, do you like something like below to fix above issue that
> Jintack
> > > > has encountered?
> > > >
> > > > (note: this code is not for compile, only trying show what I mean...)
> > > >
> > > > --8<---
> > > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > > > index 332f41d..4dca631 100644
> > > > --- a/hw/vfio/pci.c
> > > > +++ b/hw/vfio/pci.c
> > > > @@ -1877,25 +1877,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice
> *vdev)
> > > >   */
> > > >  config = g_memdup(pdev->config, vdev->config_size);
> > > >
> > > > -/*
> > > > - * Extended capabilities are chained with each pointing to the
> next, so we
> > > > - * can drop anything other than the head of the chain simply by
> modifying
> > > > - * the previous next pointer.  For the head of the chain, we
> can modify the
> > > > - * capability ID to something that cannot match a valid
> capability.  ID
> > > > - * 0 is reserved for this since absence of capabilities is
> indicated by
> > > > - * 0 for the ID, version, AND next pointer.  However,
> pcie_add_capability()
> > > > - * uses ID 0 as reserved for list management and will
> incorrectly match and
> > > > - * assert if we attempt to pre-load the head of the chain with
> this ID.
> > > > - * Use ID 0xFFFF temporarily since it is also seems to be
> reserved in
> > > > - * part for identifying absence of capabilities in a root
> complex register
> > > > - * block.  If the ID still exists after adding capabilities,
> switch back to
> > > > - * zero.  We'll mark this entire first dword as emulated for
> this purpose.
> > > > - */
> > > > -pci_set_long(pdev->config + PCI_CONFIG_SPACE_SIZE,
> > > > - PCI_EXT_CAP(0xFFFF, 0, 0));
> > > > -pci_set_long(pdev->wmask + PCI_CONFIG_SPACE_SIZE, 0);
> > > > -pci_set_long(vdev->emulated_config_bits +
> PCI_CONFIG_SPACE_SIZE, ~0);
> > > > -
> > > >  for (next = PCI_CONFIG_SPACE_SIZE; next;
> > > >   next = PCI_EXT_CAP_NEXT(pci_get_long(config + next))) {
> > > >  header = pci_get_long(config + next);
> > > > @@ -1917,6 +1898,8 @@ static void vfio_add_ext_cap(VFIOPCIDevice
> *vdev)
> > > >  switch (cap_id) {
> > > >  case PCI_EXT_CAP_ID_SRIOV: /* Read-only VF BARs confuse
> OVMF */
> > > >  case PCI_EXT_CAP_ID_ARI: /* XXX Needs next function
> virtualization */
> > > > +/* keep this ecap header (4 bytes), but mask cap_id to
> 0xffff */
> > > > +...
> > > >  trace_vfio_add_ext_cap_dropped(vdev->vbasedev.name,
> cap_id, next);
> > > >  break;
> > > >  default:
> > > > @@ -1925,11 +1908,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice
> *vdev)
> > > >
> > > >  }
> > > >
> > > > -/* Cleanup chain head ID if necessary */
> > > > -if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) ==
> 0xFFFF) {
> > > > -pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
> > > > -}
> > > > -
> > > >  g_free(config);
> > > >  return;
> > > >  }
> > > > ->8-
> > > >
> > > > Since after all we need the assumption that 0xffff is reserved for
> > > > cap_id, we can then just remove the "first 0xffff then 0x0" hack,
> > > > which is imho error-prone and hacky.
> > >
> > > This doesn't fix the bug, which is that pcie_add_capability() uses a
> > > valid capability ID for its own internal tracking.  It's only doing
> > > this to find the end of the capability chain, which we could do in a
> > > spec-compliant way by looking for a zero next pointer.  Fix that and
> > > then vfio doesn't need to do this set to 0xffff then back to zero
> > > nonsense at all.  Capability ID zero is valid.  Thanks,
> >
> > Yeah, I see Michael's fix on the capability list stuff. However, imho
> > these are two different issues. Or say, even with that patch, we
> > would still need this hack (first 0x0, then 0xffff), right? Since it
> > looks like that patch didn't solve the problem if the first pcie ecap
> > is masked at 0x100.
>
> I thought the problem was that QEMU in the host exposes a device with a
> capability ID of 0 to the L1 guest.  QEMU in the L1 guest balks at a
> capability ID of 0 because that's how it finds the end of the chain.
> Therefore if we make QEMU not use capability ID 0 for internal
> purposes, things work.  vfio using 0xffff and swapping back to 0x0
> becomes unnecessary, but doesn't hurt anything.  Thanks,
>

I've applied Peter's hack and Michael's patch below, but still can't use
the assigned device in L2.
 commit 4bb571d857d973d9308d9fdb1f48d983d6639bd4
Author: Michael S. Tsirkin 
Date:   Wed Feb 15 22:37:45 2017 +0200

pci/pcie: don't assume cap id 0 is reserved
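
For readers not familiar with that code: the idea being discussed is that the
end of the PCIe extended capability list should be detected by a zero "next"
pointer rather than by capability ID 0, so that ID 0 stays usable for a real
(masked) capability at offset 0x100. Below is a rough sketch of such a walk,
reusing the pci_get_long()/PCI_EXT_CAP_NEXT()/PCI_CONFIG_SPACE_SIZE helpers
seen in the diffs above; it is only an illustration, not the code of either
commit.

--8<---
/* Return the offset of the last extended capability, or 0 if there are none.
 * The chain is terminated by a zero next pointer, so a capability whose ID
 * happens to be 0 (e.g. a masked ecap at 0x100) does not end the walk early. */
static uint16_t last_ext_cap_offset(const uint8_t *config)
{
    uint16_t offset = PCI_CONFIG_SPACE_SIZE;   /* extended caps start at 0x100 */
    uint16_t next;

    if (!pci_get_long(config + offset)) {
        return 0;                              /* no extended capabilities */
    }

    while ((next = PCI_EXT_CAP_NEXT(pci_get_long(config + offset)))) {
        offset = next;
    }

    return offset;
}
-->8---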


Re: [Qemu-devel] iommu emulation

2017-02-15 Thread Alex Williamson
On Thu, 16 Feb 2017 10:28:39 +0800
Peter Xu  wrote:

> On Wed, Feb 15, 2017 at 11:15:52AM -0700, Alex Williamson wrote:
> 
> [...]
> 
> > > Alex, do you like something like below to fix above issue that Jintack
> > > has encountered?
> > > 
> > > (note: this code is not for compile, only trying show what I mean...)
> > > 
> > > --8<---
> > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > > index 332f41d..4dca631 100644
> > > --- a/hw/vfio/pci.c
> > > +++ b/hw/vfio/pci.c
> > > @@ -1877,25 +1877,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> > >   */
> > >  config = g_memdup(pdev->config, vdev->config_size);
> > > 
> > > -/*
> > > - * Extended capabilities are chained with each pointing to the next, 
> > > so we
> > > - * can drop anything other than the head of the chain simply by 
> > > modifying
> > > - * the previous next pointer.  For the head of the chain, we can 
> > > modify the
> > > - * capability ID to something that cannot match a valid capability.  
> > > ID
> > > - * 0 is reserved for this since absence of capabilities is indicated 
> > > by
> > > - * 0 for the ID, version, AND next pointer.  However, 
> > > pcie_add_capability()
> > > - * uses ID 0 as reserved for list management and will incorrectly 
> > > match and
> > > - * assert if we attempt to pre-load the head of the chain with this 
> > > ID.
> > > - * Use ID 0xFFFF temporarily since it is also seems to be reserved in
> > > - * part for identifying absence of capabilities in a root complex 
> > > register
> > > - * block.  If the ID still exists after adding capabilities, switch 
> > > back to
> > > - * zero.  We'll mark this entire first dword as emulated for this 
> > > purpose.
> > > - */
> > > -pci_set_long(pdev->config + PCI_CONFIG_SPACE_SIZE,
> > > - PCI_EXT_CAP(0xFFFF, 0, 0));
> > > -pci_set_long(pdev->wmask + PCI_CONFIG_SPACE_SIZE, 0);
> > > -pci_set_long(vdev->emulated_config_bits + PCI_CONFIG_SPACE_SIZE, ~0);
> > > -
> > >  for (next = PCI_CONFIG_SPACE_SIZE; next;
> > >   next = PCI_EXT_CAP_NEXT(pci_get_long(config + next))) {
> > >  header = pci_get_long(config + next);
> > > @@ -1917,6 +1898,8 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> > >  switch (cap_id) {
> > >  case PCI_EXT_CAP_ID_SRIOV: /* Read-only VF BARs confuse OVMF */
> > >  case PCI_EXT_CAP_ID_ARI: /* XXX Needs next function 
> > > virtualization */
> > > +/* keep this ecap header (4 bytes), but mask cap_id to 
> > > 0xffff */
> > > +...
> > >  trace_vfio_add_ext_cap_dropped(vdev->vbasedev.name, cap_id, 
> > > next);
> > >  break;
> > >  default:
> > > @@ -1925,11 +1908,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> > > 
> > >  }
> > > 
> > > -/* Cleanup chain head ID if necessary */
> > > -if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
> > > -pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
> > > -}
> > > -
> > >  g_free(config);
> > >  return;
> > >  }  
> > > ->8-
> > > 
> > > Since after all we need the assumption that 0xffff is reserved for
> > > cap_id, we can then just remove the "first 0xffff then 0x0" hack,
> > > which is imho error-prone and hacky.
> > 
> > This doesn't fix the bug, which is that pcie_add_capability() uses a
> > valid capability ID for its own internal tracking.  It's only doing
> > this to find the end of the capability chain, which we could do in a
> > spec-compliant way by looking for a zero next pointer.  Fix that and
> > then vfio doesn't need to do this set to 0xffff then back to zero
> > nonsense at all.  Capability ID zero is valid.  Thanks,
> 
> Yeah, I see Michael's fix on the capability list stuff. However, imho
> these are two different issues. Or say, even with that patch, we
> would still need this hack (first 0x0, then 0xffff), right? Since it
> looks like that patch didn't solve the problem if the first pcie ecap
> is masked at 0x100.

I thought the problem was that QEMU in the host exposes a device with a
capability ID of 0 to the L1 guest.  QEMU in the L1 guest balks at a
capability ID of 0 because that's how it finds the end of the chain.
Therefore if we make QEMU not use capability ID 0 for internal
purposes, things work.  vfio using 0xffff and swapping back to 0x0
becomes unnecessary, but doesn't hurt anything.  Thanks,

Alex



Re: [Qemu-devel] iommu emulation

2017-02-15 Thread Peter Xu
On Wed, Feb 15, 2017 at 11:15:52AM -0700, Alex Williamson wrote:

[...]

> > Alex, do you like something like below to fix above issue that Jintack
> > has encountered?
> > 
> > (note: this code is not for compile, only trying show what I mean...)
> > 
> > --8<---
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index 332f41d..4dca631 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -1877,25 +1877,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> >   */
> >  config = g_memdup(pdev->config, vdev->config_size);
> > 
> > -/*
> > - * Extended capabilities are chained with each pointing to the next, 
> > so we
> > - * can drop anything other than the head of the chain simply by 
> > modifying
> > - * the previous next pointer.  For the head of the chain, we can 
> > modify the
> > - * capability ID to something that cannot match a valid capability.  ID
> > - * 0 is reserved for this since absence of capabilities is indicated by
> > - * 0 for the ID, version, AND next pointer.  However, 
> > pcie_add_capability()
> > - * uses ID 0 as reserved for list management and will incorrectly 
> > match and
> > - * assert if we attempt to pre-load the head of the chain with this ID.
> > - * Use ID 0xFFFF temporarily since it is also seems to be reserved in
> > - * part for identifying absence of capabilities in a root complex 
> > register
> > - * block.  If the ID still exists after adding capabilities, switch 
> > back to
> > - * zero.  We'll mark this entire first dword as emulated for this 
> > purpose.
> > - */
> > -pci_set_long(pdev->config + PCI_CONFIG_SPACE_SIZE,
> > - PCI_EXT_CAP(0xFFFF, 0, 0));
> > -pci_set_long(pdev->wmask + PCI_CONFIG_SPACE_SIZE, 0);
> > -pci_set_long(vdev->emulated_config_bits + PCI_CONFIG_SPACE_SIZE, ~0);
> > -
> >  for (next = PCI_CONFIG_SPACE_SIZE; next;
> >   next = PCI_EXT_CAP_NEXT(pci_get_long(config + next))) {
> >  header = pci_get_long(config + next);
> > @@ -1917,6 +1898,8 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> >  switch (cap_id) {
> >  case PCI_EXT_CAP_ID_SRIOV: /* Read-only VF BARs confuse OVMF */
> >  case PCI_EXT_CAP_ID_ARI: /* XXX Needs next function virtualization 
> > */
> > +/* keep this ecap header (4 bytes), but mask cap_id to 0xffff
> > */
> > +...
> >  trace_vfio_add_ext_cap_dropped(vdev->vbasedev.name, cap_id, 
> > next);
> >  break;
> >  default:
> > @@ -1925,11 +1908,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> > 
> >  }
> > 
> > -/* Cleanup chain head ID if necessary */
> > -if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
> > -pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
> > -}
> > -
> >  g_free(config);
> >  return;
> >  }
> > ->8-  
> > 
> > Since after all we need the assumption that 0xffff is reserved for
> > cap_id, we can then just remove the "first 0xffff then 0x0" hack,
> > which is imho error-prone and hacky.
> 
> This doesn't fix the bug, which is that pcie_add_capability() uses a
> valid capability ID for its own internal tracking.  It's only doing
> this to find the end of the capability chain, which we could do in a
> spec-compliant way by looking for a zero next pointer.  Fix that and
> then vfio doesn't need to do this set to 0xffff then back to zero
> nonsense at all.  Capability ID zero is valid.  Thanks,

Yeah, I see Michael's fix on the capability list stuff. However, imho
these are two different issues. Or say, even with that patch, we
would still need this hack (first 0x0, then 0xffff), right? Since it
looks like that patch didn't solve the problem if the first pcie ecap
is masked at 0x100.

Please correct me if I missed anything. Thanks,

-- peterx



Re: [Qemu-devel] iommu emulation

2017-02-15 Thread Jintack Lim
On Wed, Feb 15, 2017 at 5:50 PM, Alex Williamson  wrote:

> On Wed, 15 Feb 2017 17:05:35 -0500
> Jintack Lim  wrote:
>
> > On Tue, Feb 14, 2017 at 9:52 PM, Peter Xu  wrote:
> >
> > > On Tue, Feb 14, 2017 at 07:50:39AM -0500, Jintack Lim wrote:
> > >
> > > [...]
> > >
> > > > > > >> > I misunderstood what you said?
> > > > > > >
> > > > > > > I failed to understand why an vIOMMU could help boost
> performance.
> > > :(
> > > > > > > Could you provide your command line here so that I can try to
> > > > > > > reproduce?
> > > > > >
> > > > > > Sure. This is the command line to launch L1 VM
> > > > > >
> > > > > > qemu-system-x86_64 -M q35,accel=kvm,kernel-irqchip=split \
> > > > > > -m 12G -device intel-iommu,intremap=on,eim=off,caching-mode=on \
> > > > > > -drive file=/mydata/guest0.img,format=raw --nographic -cpu host
> \
> > > > > > -smp 4,sockets=4,cores=1,threads=1 \
> > > > > > -device vfio-pci,host=08:00.0,id=net0
> > > > > >
> > > > > > And this is for L2 VM.
> > > > > >
> > > > > > ./qemu-system-x86_64 -M q35,accel=kvm \
> > > > > > -m 8G \
> > > > > > -drive file=/vm/l2guest.img,format=raw --nographic -cpu host \
> > > > > > -device vfio-pci,host=00:03.0,id=net0
> > > > >
> > > > > ... here looks like these are command lines for L1/L2 guest, rather
> > > > > than L1 guest with/without vIOMMU?
> > > > >
> > > >
> > > > That's right. I thought you were asking about command lines for L1/L2
> > > guest
> > > > :(.
> > > > I think I made the confusion, and as I said above, I didn't mean to
> talk
> > > > about the performance of L1 guest with/without vIOMMO.
> > > > We can move on!
> > >
> > > I see. Sure! :-)
> > >
> > > [...]
> > >
> > > > >
> > > > > Then, I *think* above assertion you encountered would fail only if
> > > > > prev == 0 here, but I still don't quite sure why was that
> happening.
> > > > > Btw, could you paste me your "lspci -vvv -s 00:03.0" result in
> your L1
> > > > > guest?
> > > > >
> > > >
> > > > Sure. This is from my L1 guest.
> > >
> > > Hmm... I think I found the problem...
> > >
> > > >
> > > > root@guest0:~# lspci -vvv -s 00:03.0
> > > > 00:03.0 Network controller: Mellanox Technologies MT27500 Family
> > > > [ConnectX-3]
> > > > Subsystem: Mellanox Technologies Device 0050
> > > > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> > > > Stepping- SERR+ FastB2B- DisINTx+
> > > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> > > > Latency: 0, Cache Line Size: 64 bytes
> > > > Interrupt: pin A routed to IRQ 23
> > > > Region 0: Memory at fe90 (64-bit, non-prefetchable) [size=1M]
> > > > Region 2: Memory at fe00 (64-bit, prefetchable) [size=8M]
> > > > Expansion ROM at fea0 [disabled] [size=1M]
> > > > Capabilities: [40] Power Management version 3
> > > > Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
> PME(D0-,D1-,D2-,D3hot-,D3cold-
> > > )
> > > > Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> > > > Capabilities: [48] Vital Product Data
> > > > Product Name: CX354A - ConnectX-3 QSFP
> > > > Read-only fields:
> > > > [PN] Part number: MCX354A-FCBT
> > > > [EC] Engineering changes: A4
> > > > [SN] Serial number: MT1346X00791
> > > > [V0] Vendor specific: PCIe Gen3 x8
> > > > [RV] Reserved: checksum good, 0 byte(s) reserved
> > > > Read/write fields:
> > > > [V1] Vendor specific: N/A
> > > > [YA] Asset tag: N/A
> > > > [RW] Read-write area: 105 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 253 byte(s) free
> > > > [RW] Read-write area: 252 byte(s) free
> > > > End
> > > > Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
> > > > Vector table: BAR=0 offset=0007c000
> > > > PBA: BAR=0 offset=0007d000
> > > > Capabilities: [60] Express (v2) Root Complex Integrated Endpoint,
> MSI 00
> > > > DevCap: MaxPayload 256 bytes, PhantFunc 0
> > > > ExtTag- RBE+
> > > > DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
> > > > RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
> > > > MaxPayload 256 bytes, MaxReadReq 4096 bytes
> > > > DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
> > > > DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not
> > > > Supported
> > > > DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF
> > > Disabled
> > > > Capabilities: [100 v0] #00
> > >
> > > Here we 

Re: [Qemu-devel] iommu emulation

2017-02-09 Thread Jintack Lim
On Wed, Feb 8, 2017 at 10:52 PM, Peter Xu  wrote:
> (cc qemu-devel and Alex)
>
> On Wed, Feb 08, 2017 at 09:14:03PM -0500, Jintack Lim wrote:
>> On Wed, Feb 8, 2017 at 10:49 AM, Jintack Lim  wrote:
>> > Hi Peter,
>> >
>> > On Tue, Feb 7, 2017 at 10:12 PM, Peter Xu  wrote:
>> >> On Tue, Feb 07, 2017 at 02:16:29PM -0500, Jintack Lim wrote:
>> >>> Hi Peter and Michael,
>> >>
>> >> Hi, Jintack,
>> >>
>> >>>
>> >>> I would like to get some help to run a VM with the emulated iommu. I
>> >>> have tried for a few days to make it work, but I couldn't.
>> >>>
>> >>> What I want to do eventually is to assign a network device to the
>> >>> nested VM so that I can measure the performance of applications
>> >>> running in the nested VM.
>> >>
>> >> Good to know that you are going to use [4] to do something useful. :-)
>> >>
>> >> However, could I ask why you want to measure the performance of
>> >> application inside nested VM rather than host? That's something I am
>> >> just curious about, considering that virtualization stack will
>> >> definitely introduce overhead along the way, and I don't know whether
>> >> that'll affect your measurement to the application.
>> >
>> > I have added nested virtualization support to KVM/ARM, which is under
>> > review now. I found that application performance running inside the
>> > nested VM is really bad both on ARM and x86, and I'm trying to figure
>> > out what's the real overhead. I think one way to figure that out is to
>> > see if the direct device assignment to L2 helps to reduce the overhead
>> > or not.
>
> I see. IIUC you are trying to use an assigned device to replace your
> old emulated device in L2 guest to see whether performance will drop
> as well, right? Then at least I can know that you won't need a nested
> VT-d here (so we should not need a vIOMMU in L2 guest).

That's right.

>
> In that case, I think we can give it a shot, considering that L1 guest
> will use vfio-pci for that assigned device as well, and when L2 guest
> QEMU uses this assigned device, it'll use a static mapping (just to
> map the whole GPA for L2 guest) there, so even if you are using a
> kernel driver in L2 guest with your to-be-tested application, we
> should still be having a static mapping in vIOMMU in L1 guest, which
> is IMHO fine from performance POV.
>
> I cced Alex in case I missed anything here.
>
>> >
>> >>
>> >> Another thing to mention is that (in case you don't know that), device
>> >> assignment with VT-d protection would be even slower than generic VMs
>> >> (without Intel IOMMU protection) if you are using generic kernel
>> >> drivers in the guest, since we may need real-time DMA translation on
>> >> data path.
>> >>
>> >
>> > So, this is the comparison between using virtio and using the device
>> > assignment for L1? I have tested application performance running
>> > inside L1 with and without iommu, and I found that the performance is
>> > better with iommu. I thought whether the device is assigned to L1 or
>> > L2, the DMA translation is done by iommu, which is pretty fast? Maybe
>> > I misunderstood what you said?
>
> I failed to understand why a vIOMMU could help boost performance. :(
> Could you provide your command line here so that I can try to
> reproduce?

Sure. This is the command line to launch L1 VM

qemu-system-x86_64 -M q35,accel=kvm,kernel-irqchip=split \
-m 12G -device intel-iommu,intremap=on,eim=off,caching-mode=on \
-drive file=/mydata/guest0.img,format=raw --nographic -cpu host \
-smp 4,sockets=4,cores=1,threads=1 \
-device vfio-pci,host=08:00.0,id=net0

And this is for L2 VM.

./qemu-system-x86_64 -M q35,accel=kvm \
-m 8G \
-drive file=/vm/l2guest.img,format=raw --nographic -cpu host \
-device vfio-pci,host=00:03.0,id=net0

>
> Besides, what I mentioned above is just in case you don't know that
> vIOMMU will drag down the performance in most cases.
>
> I think here to be more explicit, the overhead of vIOMMU is different
> for assigned devices and emulated ones.
>
>   (1) For emulated devices, the overhead is when we do the
>   translation, or say when we do the DMA operation. We need
>   real-time translation which should drag down the performance.
>
>   (2) For assigned devices (our case), the overhead is when we setup
>   the pages (since we are trapping the setup procedures via CM
>   bit). However, after it's set up, we should not have much
>   performance drag when we really do the data transfer (during
>   DMA) since that'll all be done in the hardware IOMMU (no matter
>   whether the device is assigned to L1/L2 guest).
>
> Now that I know your use case (use vIOMMU in L1 guest, don't use
> vIOMMU in L2 guest, only use assigned devices), I suspect we would
> have no big problem according to (2).
>
>> >
>> >>>
>> >>> First, I am having trouble to boot a VM with the emulated iommu. I
>> >>> have posted my problem to the qemu user mailing list[1],
>> >>
>> 

Re: [Qemu-devel] iommu emulation

2017-02-08 Thread Peter Xu
(cc qemu-devel and Alex)

On Wed, Feb 08, 2017 at 09:14:03PM -0500, Jintack Lim wrote:
> On Wed, Feb 8, 2017 at 10:49 AM, Jintack Lim  wrote:
> > Hi Peter,
> >
> > On Tue, Feb 7, 2017 at 10:12 PM, Peter Xu  wrote:
> >> On Tue, Feb 07, 2017 at 02:16:29PM -0500, Jintack Lim wrote:
> >>> Hi Peter and Michael,
> >>
> >> Hi, Jintack,
> >>
> >>>
> >>> I would like to get some help to run a VM with the emulated iommu. I
> >>> have tried for a few days to make it work, but I couldn't.
> >>>
> >>> What I want to do eventually is to assign a network device to the
> >>> nested VM so that I can measure the performance of applications
> >>> running in the nested VM.
> >>
> >> Good to know that you are going to use [4] to do something useful. :-)
> >>
> >> However, could I ask why you want to measure the performance of
> >> application inside nested VM rather than host? That's something I am
> >> just curious about, considering that virtualization stack will
> >> definitely introduce overhead along the way, and I don't know whether
> >> that'll affect your measurement to the application.
> >
> > I have added nested virtualization support to KVM/ARM, which is under
> > review now. I found that application performance running inside the
> > nested VM is really bad both on ARM and x86, and I'm trying to figure
> > out what's the real overhead. I think one way to figure that out is to
> > see if the direct device assignment to L2 helps to reduce the overhead
> > or not.

I see. IIUC you are trying to use an assigned device to replace your
old emulated device in L2 guest to see whether performance will drop
as well, right? Then at least I can know that you won't need a nested
VT-d here (so we should not need a vIOMMU in L2 guest).

In that case, I think we can give it a shot, considering that L1 guest
will use vfio-pci for that assigned device as well, and when L2 guest
QEMU uses this assigned device, it'll use a static mapping (just to
map the whole GPA for L2 guest) there, so even if you are using a
kernel driver in L2 guest with your to-be-tested application, we
should still be having a static mapping in vIOMMU in L1 guest, which
is IMHO fine from performance POV.

I cced Alex in case I missed anything here.

> >
> >>
> >> Another thing to mention is that (in case you don't know that), device
> >> assignment with VT-d protection would be even slower than generic VMs
> >> (without Intel IOMMU protection) if you are using generic kernel
> >> drivers in the guest, since we may need real-time DMA translation on
> >> data path.
> >>
> >
> > So, this is the comparison between using virtio and using the device
> > assignment for L1? I have tested application performance running
> > inside L1 with and without iommu, and I found that the performance is
> > better with iommu. I thought whether the device is assigned to L1 or
> > L2, the DMA translation is done by iommu, which is pretty fast? Maybe
> > I misunderstood what you said?

I failed to understand why a vIOMMU could help boost performance. :(
Could you provide your command line here so that I can try to
reproduce?

Besides, what I mentioned above is just in case you don't know that
vIOMMU will drag down the performance in most cases.

I think here to be more explicit, the overhead of vIOMMU is different
for assigned devices and emulated ones.

  (1) For emulated devices, the overhead is when we do the
  translation, or say when we do the DMA operation. We need
  real-time translation which should drag down the performance.

  (2) For assigned devices (our case), the overhead is when we setup
  the pages (since we are trapping the setup procedures via CM
  bit). However, after it's set up, we should not have much
  performance drag when we really do the data transfer (during
  DMA) since that'll all be done in the hardware IOMMU (no matter
  whether the device is assigned to L1/L2 guest).

Now that I know your use case (use vIOMMU in L1 guest, don't use
vIOMMU in L2 guest, only use assigned devices), I suspect we would
have no big problem according to (2).

> >
> >>>
> >>> First, I am having trouble to boot a VM with the emulated iommu. I
> >>> have posted my problem to the qemu user mailing list[1],
> >>
> >> Here I would suggest that you cc qemu-devel as well next time:
> >>
> >>   qemu-devel@nongnu.org
> >>
> >> Since I guess not all people are registered to qemu-discuss, at least
> >> I am not in that loop. Imho cc qemu-devel could let the question
> >> spread to more people, and it'll get a higher chance to be answered.
> >
> > Thanks. I'll cc qemu-devel next time.
> >
> >>
> >>> but to put it
> >>> in a nutshell, I'd like to know the setting I can reuse to boot a VM
> >>> with the emulated iommu. (e.g. how to create a VM with q35 chipset
> >>> and/or libvirt xml if you use virsh).
> >>
> >> IIUC you are looking for device assignment for the nested VM case. So,
> >> firstly, you may need my