Re: [PATCH 4/4] dmaengine: idxd: Use DMA API for in-kernel DMA with PASID

2021-12-08 Thread Dave Jiang



On 12/8/2021 4:39 PM, Jason Gunthorpe wrote:

On Wed, Dec 08, 2021 at 01:59:45PM -0800, Jacob Pan wrote:

Hi Jason,

On Wed, 8 Dec 2021 16:30:22 -0400, Jason Gunthorpe wrote:


On Wed, Dec 08, 2021 at 11:55:16AM -0800, Jacob Pan wrote:

Hi Jason,

On Wed, 8 Dec 2021 09:13:58 -0400, Jason Gunthorpe wrote:

This patch utilizes iommu_enable_pasid_dma() to enable DSA to
perform DMA requests with PASID under the same mapping managed by
DMA mapping API. In addition, SVA-related bits for kernel DMA are
removed. As a result, DSA users shall use DMA mapping API to obtain
DMA handles instead of using kernel virtual addresses.

Er, shouldn't this be adding dma_map/etc type calls?

You can't really say a driver is using the DMA API without actually
calling the DMA API..

The IDXD driver is not aware of the addressing mode; it is up to the user of
the dmaengine API to prepare the buffer mappings. Here we only set up the
PASID such that it can be picked up during DMA work submission. I
tested with drivers/dma/dmatest.c, which does dma_map_page(),
dma_map_single(), etc., and also tested with other pieces under development.
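
For illustration, a minimal sketch of what such a dmaengine client does before handing work to idxd; the device and page names here are hypothetical, not from this patch:

	#include <linux/dma-mapping.h>

	/* The client, not idxd, creates the DMA handle for its buffer. */
	static dma_addr_t map_src(struct device *dma_dev, struct page *page)
	{
		dma_addr_t src;

		src = dma_map_page(dma_dev, page, 0, PAGE_SIZE, DMA_TO_DEVICE);
		if (dma_mapping_error(dma_dev, src))
			return DMA_MAPPING_ERROR;
		return src;	/* handed to the device in the descriptor */
	}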

Ignoring the work, doesn't IDXD prepare the DMA queues itself, don't
those need the DMA API?


Do you mean the wq completion record address? That is already using the DMA API:

wq->compls = dma_alloc_coherent(dev, wq->compls_size,
				&wq->compls_addr, GFP_KERNEL);
desc->compl_dma = wq->compls_addr + idxd->data->compl_size * i;

I would have expected something on the queue submission side too?


DSA is different from typical DMA devices of the past. Instead of a
software descriptor ring that the device DMAs from after software rings
a doorbell or writes a head index, descriptors are submitted directly
to the device via a CPU instruction (i.e. MOVDIR64B or ENQCMD(S)). The
CPU takes the KVA of the 64-byte descriptor and writes it to the device
atomically. No DMA mapping is necessary in this case.
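
On x86, a minimal sketch of that submission path, assuming the kernel's movdir64b() helper; the portal and descriptor names are hypothetical:

	#include <asm/special_insns.h>	/* movdir64b() */
	#include <uapi/linux/idxd.h>	/* struct dsa_hw_desc */

	/*
	 * The 64-byte descriptor lives in ordinary kernel memory and is
	 * copied to the device's WQ portal with one atomic 64-byte store;
	 * no dma_map_*() call is involved for the descriptor itself.
	 */
	static void submit_desc(void __iomem *wq_portal, struct dsa_hw_desc *hw)
	{
		movdir64b(wq_portal, hw);
	}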






the same thing, they do not use the same IOVA's. Did you test this
with bypass mode off?

Yes, with dmatest. IOVA is the default; I separated out the SATC patch, which
will put internal accelerators in bypass mode. It can also be verified by an
iommu debugfs dump showing that the DMA PASID (2) and PASID 0 (RIDPASID)
point to the same default domain, e.g.

Well, OK then..

Jason



Re: [PATCH 4/4] dmaengine: idxd: Use DMA API for in-kernel DMA with PASID

2021-12-08 Thread Dave Jiang



On 12/8/2021 6:13 AM, Jason Gunthorpe wrote:

On Tue, Dec 07, 2021 at 05:47:14AM -0800, Jacob Pan wrote:

In-kernel DMA should be managed by DMA mapping API. The existing kernel
PASID support is based on the SVA machinery in SVA lib that is intended
for user process SVA. The binding between a kernel PASID and kernel
mapping has many flaws. See discussions in the link below.

This patch utilizes iommu_enable_pasid_dma() to enable DSA to perform DMA
requests with PASID under the same mapping managed by DMA mapping API.
In addition, SVA-related bits for kernel DMA are removed. As a result,
DSA users shall use DMA mapping API to obtain DMA handles instead of
using kernel virtual addresses.

Er, shouldn't this be adding dma_map/etc type calls?

You can't really say a driver is using the DMA API without actually
calling the DMA API..


+	/*
+	 * Try to enable both in-kernel and user DMA requests with PASID.
+	 * PASID is supported only if both user and kernel PASID are
+	 * supported. Do not fail probe here in that idxd can still be
+	 * used w/o PASID or IOMMU.
+	 */
+	if (iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_SVA) ||
+	    idxd_enable_system_pasid(idxd)) {
+		dev_warn(dev, "Failed to enable PASID\n");
+	} else {
+		set_bit(IDXD_FLAG_PASID_ENABLED, &idxd->flags);
	}

Huh? How can the driver keep going if PASID isn't supported? I thought
the whole point of this was because the device cannot do DMA without
PASID at all?


There are two types of WQ supported by DSA devices: a dedicated WQ type
and a shared WQ type. A dedicated WQ can do DMA with or without PASID,
while a shared WQ must have a PASID to operate. When there is no PASID
support, the driver can still support dedicated WQs, just without PASID
usage.
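
A rough sketch of that policy, along the lines of idxd's own helpers (treat the helper names as assumptions):

	/*
	 * A shared WQ requires PASID support; a dedicated WQ works with
	 * or without it.
	 */
	static bool wq_usable(struct idxd_device *idxd, struct idxd_wq *wq)
	{
		if (wq_dedicated(wq))
			return true;
		return device_pasid_enabled(idxd);
	}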





Jason



Re: [PATCH 4/4] dmaengine: idxd: Use DMA API for in-kernel DMA with PASID

2021-12-07 Thread Dave Jiang



On 12/7/2021 6:47 AM, Jacob Pan wrote:

In-kernel DMA should be managed by DMA mapping API. The existing kernel
PASID support is based on the SVA machinery in SVA lib that is intended
for user process SVA. The binding between a kernel PASID and kernel
mapping has many flaws. See discussions in the link below.

This patch utilizes iommu_enable_pasid_dma() to enable DSA to perform DMA
requests with PASID under the same mapping managed by DMA mapping API.
In addition, SVA-related bits for kernel DMA are removed. As a result,
DSA users shall use DMA mapping API to obtain DMA handles instead of
using kernel virtual addresses.

Link: https://lore.kernel.org/linux-iommu/20210511194726.gp1002...@nvidia.com/
Signed-off-by: Jacob Pan 


Acked-by: Dave Jiang 

Also cc Vinod and dmaengine@vger



---
  .../admin-guide/kernel-parameters.txt |  6 --
  drivers/dma/Kconfig   | 10 
  drivers/dma/idxd/idxd.h   |  1 -
  drivers/dma/idxd/init.c   | 59 ++-
  drivers/dma/idxd/sysfs.c  |  7 ---
  5 files changed, 19 insertions(+), 64 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 9725c546a0d4..fe73d02c62f3 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1751,12 +1751,6 @@
In such case C2/C3 won't be used again.
idle=nomwait: Disable mwait for CPU C-states
  
-	idxd.sva=	[HW]
-			Format: <bool>
-			Allow force disabling of Shared Virtual Memory (SVA)
-			support for the idxd driver. By default it is set to
-			true (1).
-
	idxd.tc_override= [HW]
			Format: <bool>
Allow override of default traffic class configuration
diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 6bcdb4e6a0d1..3b28bd720e7d 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -313,16 +313,6 @@ config INTEL_IDXD_COMPAT
  
  	  If unsure, say N.
  
-# Config symbol that collects all the dependencies that's necessary to
-# support shared virtual memory for the devices supported by idxd.
-config INTEL_IDXD_SVM
-	bool "Accelerator Shared Virtual Memory Support"
-	depends on INTEL_IDXD
-	depends on INTEL_IOMMU_SVM
-	depends on PCI_PRI
-	depends on PCI_PASID
-	depends on PCI_IOV
-
  config INTEL_IDXD_PERFMON
bool "Intel Data Accelerators performance monitor support"
depends on INTEL_IDXD
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 0cf8d3145870..3155e3a2d3ae 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -262,7 +262,6 @@ struct idxd_device {
struct idxd_wq **wqs;
struct idxd_engine **engines;
  
-	struct iommu_sva *sva;
 	unsigned int pasid;
  
  	int num_groups;

diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
index 7bf03f371ce1..44633f8113e2 100644
--- a/drivers/dma/idxd/init.c
+++ b/drivers/dma/idxd/init.c
@@ -16,6 +16,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include "../dmaengine.h"
@@ -28,10 +29,6 @@ MODULE_LICENSE("GPL v2");
  MODULE_AUTHOR("Intel Corporation");
  MODULE_IMPORT_NS(IDXD);
  
-static bool sva = true;
-module_param(sva, bool, 0644);
-MODULE_PARM_DESC(sva, "Toggle SVA support on/off");
-
  bool tc_override;
  module_param(tc_override, bool, 0644);
  MODULE_PARM_DESC(tc_override, "Override traffic class defaults");
@@ -530,36 +527,22 @@ static struct idxd_device *idxd_alloc(struct pci_dev *pdev, struct idxd_driver_data *data)
  
 static int idxd_enable_system_pasid(struct idxd_device *idxd)
 {
-	int flags;
-	unsigned int pasid;
-	struct iommu_sva *sva;
-
-	flags = SVM_FLAG_SUPERVISOR_MODE;
-
-	sva = iommu_sva_bind_device(&idxd->pdev->dev, NULL, &flags);
-	if (IS_ERR(sva)) {
-		dev_warn(&idxd->pdev->dev,
-			 "iommu sva bind failed: %ld\n", PTR_ERR(sva));
-		return PTR_ERR(sva);
-	}
+	u32 pasid;

-	pasid = iommu_sva_get_pasid(sva);
-	if (pasid == IOMMU_PASID_INVALID) {
-		iommu_sva_unbind_device(sva);
+	pasid = iommu_enable_pasid_dma(&idxd->pdev->dev);
+	if (pasid == INVALID_IOASID) {
+		dev_err(&idxd->pdev->dev, "No kernel DMA PASID\n");
 		return -ENODEV;
 	}
-
-	idxd->sva = sva;
 	idxd->pasid = pasid;
-	dev_dbg(&idxd->pdev->dev, "system pasid: %u\n", pasid);
+
 	return 0;
 }
  
 static void idxd_disable_system_pasid(struct idxd_device *idxd)
 {
-
-	iommu_sva_unbind_device(idxd->sva);
-	idxd->sva = NULL;
+	iommu_disable_pasid_dma(&idxd->pdev->dev);

Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Dave Jiang



On 12/1/2021 3:03 PM, Thomas Gleixner wrote:

On Wed, Dec 01 2021 at 14:49, Dave Jiang wrote:

On 12/1/2021 2:44 PM, Thomas Gleixner wrote:

How that is backed on the host does not really matter. You can expose
MSI-X to the guest with a INTx backing as well.

I'm still failing to see the connection between the 9 MSIX vectors and
the 2048 IMS vectors which I assume that this is the limitation of the
physical device, right?

I think I was confused by what you were asking; I thought you were saying
why can't we just have MSI-X on the guest backed by the MSI-X on the
physical device, and that there would not be enough vectors to service
the many guests. I think I understand your position now with the
clarification above.

This still depends on how this overall discussion about representation
of all of this stuff is resolved.


What needs a subdevice to expose?

Can you answer that too please?


Sorry. So the initial version of the IDXD sub-device is represented with a 
single queue. It needs a command irq (emulated) and a completion irq 
that is backed by a device vector (currently IMS).





Thanks,

 tglx



Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Dave Jiang



On 12/1/2021 2:44 PM, Thomas Gleixner wrote:

On Wed, Dec 01 2021 at 14:21, Dave Jiang wrote:

On 12/1/2021 1:25 PM, Thomas Gleixner wrote:

The hardware implementation does not have enough MSIX vectors for
guests. There are only 9 MSIX vectors total (8 for queues) and 2048 IMS
vectors. So if we are to do MSI-X for all of them, then we need to do
the IMS backed MSIX scheme rather than passthrough IMS to guests.

Confused. Are you talking about passing a full IDXD device to the guest
or about passing a carved out subdevice, aka. queue?

I'm talking about carving out a subdevice. I had the impression that you
wanted IMS passed through for all variations. But it sounds like for a
sub-device, you are OK with the implementation of MSI-X backed by IMS?

I don't see anything wrong with that. A subdevice is its own entity and
VFIO can choose the most convenient representation of it to the guest,
obviously.

How that is backed on the host does not really matter. You can expose
MSI-X to the guest with a INTx backing as well.

I'm still failing to see the connection between the 9 MSIX vectors and
the 2048 IMS vectors which I assume that this is the limitation of the
physical device, right?


I think I was confused by what you were asking; I thought you were saying
why can't we just have MSI-X on the guest backed by the MSI-X on the
physical device, and that there would not be enough vectors to service
the many guests. I think I understand your position now with the
clarification above.





What needs a subdevice to expose?

Thanks,

 tglx






Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Dave Jiang



On 12/1/2021 1:25 PM, Thomas Gleixner wrote:

On Wed, Dec 01 2021 at 11:47, Dave Jiang wrote:

On 12/1/2021 11:41 AM, Thomas Gleixner wrote:

Hi Thomas. This is actually the IDXD usage for a mediated device passed
to a guest kernel when we plumb the pass-through of IMS to the guest,
rather than the previous implementation of having an MSI-X vector on the
guest backed by IMS.

Which makes a lot of sense.


The control block for the mediated device is emulated, and therefore an
emulated MSI-X vector will be surfaced as vector 0. However, the queues
will be backed by IMS vectors. So we end up needing MSI-X and IMS to
coexist on the guest kernel for the same device.

Why? What's wrong with using straight MSI-X for all of them?

The hardware implementation does not have enough MSIX vectors for
guests. There are only 9 MSIX vectors total (8 for queues) and 2048 IMS
vectors. So if we are to do MSI-X for all of them, then we need to do
the IMS backed MSIX scheme rather than passthrough IMS to guests.

Confused. Are you talking about passing a full IDXD device to the guest
or about passing a carved out subdevice, aka. queue?


I'm talking about carving out a subdevice. I had the impression that you 
wanted IMS passed through for all variations. But it sounds like for a 
sub-device, you are OK with the implementation of MSI-X backed by IMS?





Thanks,

 tglx




Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Dave Jiang



On 12/1/2021 11:41 AM, Thomas Gleixner wrote:

Dave,

please trim your replies.

On Wed, Dec 01 2021 at 09:28, Dave Jiang wrote:


On 12/1/2021 3:16 AM, Thomas Gleixner wrote:

Jason,

CC+ IOMMU folks

On Tue, Nov 30 2021 at 20:17, Jason Gunthorpe wrote:

On Tue, Nov 30, 2021 at 10:23:16PM +0100, Thomas Gleixner wrote:

Though I fear there is also a use case for MSI-X and IMS tied to the
same device. That network card you are talking about might end up using
MSI-X for a control block and then IMS for the actual network queues
when it is used as physical function device as a whole, but that's
conceptually a different case.

Hi Thomas. This is actually the IDXD usage for a mediated device passed
to a guest kernel when we plumb the pass-through of IMS to the guest,
rather than the previous implementation of having an MSI-X vector on the
guest backed by IMS.

Which makes a lot of sense.


The control block for the mediated device is emulated, and therefore an
emulated MSI-X vector will be surfaced as vector 0. However, the queues
will be backed by IMS vectors. So we end up needing MSI-X and IMS to
coexist on the guest kernel for the same device.

Why? What's wrong with using straight MSI-X for all of them?


The hardware implementation does not have enough MSIX vectors for 
guests. There are only 9 MSIX vectors total (8 for queues) and 2048 IMS 
vectors. So if we are to do MSI-X for all of them, then we need to do 
the IMS backed MSIX scheme rather than passthrough IMS to guests.





Thanks,

 tglx




Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()

2021-12-01 Thread Dave Jiang



On 12/1/2021 3:16 AM, Thomas Gleixner wrote:

Jason,

CC+ IOMMU folks

On Tue, Nov 30 2021 at 20:17, Jason Gunthorpe wrote:

On Tue, Nov 30, 2021 at 10:23:16PM +0100, Thomas Gleixner wrote:

The real problem is where to store the MSI descriptors because the PCI
device has its own real PCI/MSI-X interrupts which means it still shares
the storage space.

Er.. I never realized that just looking at the patches :|

That is relevant to all real "IMS" users. IDXD escaped this because
it, IMHO, wrongly used the mdev with the IRQ layer. The mdev is purely
a messy artifact of VFIO, it should not be required to make the IRQ
layers work.
I don't think it makes sense that the msi_desc would point to a mdev,
the iommu layer consumes the msi_desc_to_dev(), it really should point
to the physical device that originates the message with a proper
iommu ops/data/etc.

Looking at the device slices as subdevices with their own struct device
makes a lot of sense from the conceptual level. That makes it pretty
much obvious to manage the MSIs of those devices at this level like we
do for any other device.

Whether mdev is the right encapsulation for these subdevices is an
orthogonal problem.

I surely agree that msi_desc::dev is an interesting question, but we
already have this disconnect of msi_desc::dev and DMA today due to DMA
aliasing. I haven't looked at that in detail yet, but of course the
alias handling is substantially different across the various IOMMU
implementations.

Though I fear there is also a use case for MSI-X and IMS tied to the
same device. That network card you are talking about might end up using
MSI-X for a control block and then IMS for the actual network queues
when it is used as physical function device as a whole, but that's
conceptually a different case.


Hi Thomas. This is actually the IDXD usage for a mediated device passed 
to a guest kernel when we plumb the pass-through of IMS to the guest, 
rather than the previous implementation of having an MSI-X vector on the 
guest backed by IMS. The control block for the mediated device is 
emulated, and therefore an emulated MSI-X vector will be surfaced as 
vector 0. However, the queues will be backed by IMS vectors. So we end up 
needing MSI-X and IMS to coexist on the guest kernel for the same 
device.


DJ




I'm currently tending to partition the index space in the xarray:

  0x00000000 - 0x0000ffff  PCI/MSI-X
  0x00010000 - 0x0001ffff  NTB

It is OK, with some xarray work it can be range allocating & reserving
so that the msi_domain_alloc_irqs() flows can carve out chunks of the
number space..

Another view is the msi_domain_alloc_irqs() flows should have their
own xarrays..

Yes, I was thinking about that as well. The trivial way would be:

 struct xarray store[MSI_MAX_STORES];

and then have a store index for each allocation domain. With the
proposed encapsulation of the xarray handling that's definitely
feasible. Whether that buys much is a different question. Let me think
about it some more.


which is feasible now with the range modifications and way simpler to do
with xarray than with the linked list.

Indeed!

I'm glad you like the approach.
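
To make the partitioning concrete, a minimal sketch using xarray range allocation; the range macros and store name here are hypothetical:

	#include <linux/msi.h>
	#include <linux/xarray.h>

	/* Hypothetical per-domain index ranges, matching the split above */
	#define MSI_RANGE_PCI	XA_LIMIT(0x00000000, 0x0000ffff)
	#define MSI_RANGE_NTB	XA_LIMIT(0x00010000, 0x0001ffff)

	static DEFINE_XARRAY_ALLOC(msi_store);

	static int msi_store_desc(struct msi_desc *desc, struct xa_limit range)
	{
		u32 index;

		/* Carve this descriptor's index out of its domain's range */
		return xa_alloc(&msi_store, &index, desc, range, GFP_KERNEL);
	}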

Thanks,

 tglx





Re: [PATCH v9 05/10] uacce: Enable IOMMU_DEV_FEAT_IOPF

2021-01-22 Thread Dave Jiang


On 1/22/2021 4:53 AM, Zhou Wang wrote:

On 2021/1/21 4:47, Dave Jiang wrote:

On 1/8/2021 7:52 AM, Jean-Philippe Brucker wrote:

The IOPF (I/O Page Fault) feature is now enabled independently from the
SVA feature, because some IOPF implementations are device-specific and
do not require IOMMU support for PCIe PRI or Arm SMMU stall.

Enable IOPF unconditionally when enabling SVA for now. In the future, if
a device driver implementing a uacce interface doesn't need IOPF
support, it will need to tell the uacce module, for example with a new
flag.

Signed-off-by: Jean-Philippe Brucker 
---
Cc: Arnd Bergmann 
Cc: Greg Kroah-Hartman 
Cc: Zhangfei Gao 
Cc: Zhou Wang 
---
   drivers/misc/uacce/uacce.c | 32 +---
   1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/drivers/misc/uacce/uacce.c b/drivers/misc/uacce/uacce.c
index d07af4edfcac..41ef1eb62a14 100644
--- a/drivers/misc/uacce/uacce.c
+++ b/drivers/misc/uacce/uacce.c
@@ -385,6 +385,24 @@ static void uacce_release(struct device *dev)
   kfree(uacce);
   }
+static unsigned int uacce_enable_sva(struct device *parent, unsigned int flags)
+{
+	if (!(flags & UACCE_DEV_SVA))
+		return flags;
+
+	flags &= ~UACCE_DEV_SVA;
+
+	if (iommu_dev_enable_feature(parent, IOMMU_DEV_FEAT_IOPF))
+		return flags;
+
+	if (iommu_dev_enable_feature(parent, IOMMU_DEV_FEAT_SVA)) {
+		iommu_dev_disable_feature(parent, IOMMU_DEV_FEAT_IOPF);
+		return flags;
+	}

Sorry to jump in a bit late on this, and not specifically towards the
intent of this patch. But I'd like to start a discussion on whether we want
to push the iommu dev feature enabling to the device driver itself rather
than having UACCE control this. Maybe allow the device driver to manage
the feature bits and have UACCE only verify that they are enabled?

1. The device driver knows what platform it's on and what specific
feature bits its devices support. Maybe in the future there are
feature bits that are needed on one platform and not on another?

Hi Dave,

From the discussion in this series, the meaning of IOMMU_DEV_FEAT_IOPF here
is the IOPF capability of the iommu device itself. So I think checking it
in UACCE will be fine.


2. This allows the possibility of multiple uacce devices registered to one
pci dev, which is a desirable feature for a device with asymmetric queues
(Intel DSA/idxd driver). The current setup forces a single
uacce device per pdev. If additional uacce devs are registered, the
first removal of a uacce device will disable the feature bit for the
rest of the registered devices. With uacce managing the feature bit,
it would need to add device context to the parent pdev and ref
counting. It may be cleaner to just allow the device driver to manage
the feature bits; the driver should have all the information on
when the feature needs to be turned on and off.

Yes, we have this problem; however, it exists for IOMMU_DEV_FEAT_SVA
too. How about fixing it in another patch?


Hi Zhou,

Right, that's what I'm implying. I'm not pushing back on the IOPF feature 
set, just trying to survey opinions from people on moving the 
feature settings to the actual drivers rather than having them in UACCE. I 
will create some patches to show what I mean for comments.





Best,
Zhou


- DaveJ



+
+	return flags | UACCE_DEV_SVA;
+}
+
 /**
  * uacce_alloc() - alloc an accelerator
  * @parent: pointer of uacce parent device
@@ -404,11 +422,7 @@ struct uacce_device *uacce_alloc(struct device *parent,
 	if (!uacce)
 		return ERR_PTR(-ENOMEM);

-	if (flags & UACCE_DEV_SVA) {
-		ret = iommu_dev_enable_feature(parent, IOMMU_DEV_FEAT_SVA);
-		if (ret)
-			flags &= ~UACCE_DEV_SVA;
-	}
+	flags = uacce_enable_sva(parent, flags);

 	uacce->parent = parent;
 	uacce->flags = flags;
@@ -432,8 +446,10 @@ struct uacce_device *uacce_alloc(struct device *parent,
 	return uacce;

 err_with_uacce:
-	if (flags & UACCE_DEV_SVA)
+	if (flags & UACCE_DEV_SVA) {
 		iommu_dev_disable_feature(uacce->parent, IOMMU_DEV_FEAT_SVA);
+		iommu_dev_disable_feature(uacce->parent, IOMMU_DEV_FEAT_IOPF);
+	}
 	kfree(uacce);
 	return ERR_PTR(ret);
 }
@@ -487,8 +503,10 @@ void uacce_remove(struct uacce_device *uacce)
 	mutex_unlock(&uacce->queues_lock);

 	/* disable sva now since no opened queues */
-	if (uacce->flags & UACCE_DEV_SVA)
+	if (uacce->flags & UACCE_DEV_SVA) {
 		iommu_dev_disable_feature(uacce->parent, IOMMU_DEV_FEAT_SVA);
+		iommu_dev_disable_feature(uacce->parent, IOMMU_DEV_FEAT_IOPF);
+	}

 	if (uacce->cdev)
 		cdev_device_del(uacce->cdev, &uacce->dev);




Re: [PATCH v9 05/10] uacce: Enable IOMMU_DEV_FEAT_IOPF

2021-01-20 Thread Dave Jiang



On 1/8/2021 7:52 AM, Jean-Philippe Brucker wrote:

The IOPF (I/O Page Fault) feature is now enabled independently from the
SVA feature, because some IOPF implementations are device-specific and
do not require IOMMU support for PCIe PRI or Arm SMMU stall.

Enable IOPF unconditionally when enabling SVA for now. In the future, if
a device driver implementing a uacce interface doesn't need IOPF
support, it will need to tell the uacce module, for example with a new
flag.

Signed-off-by: Jean-Philippe Brucker 
---
Cc: Arnd Bergmann 
Cc: Greg Kroah-Hartman 
Cc: Zhangfei Gao 
Cc: Zhou Wang 
---
  drivers/misc/uacce/uacce.c | 32 +---
  1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/drivers/misc/uacce/uacce.c b/drivers/misc/uacce/uacce.c
index d07af4edfcac..41ef1eb62a14 100644
--- a/drivers/misc/uacce/uacce.c
+++ b/drivers/misc/uacce/uacce.c
@@ -385,6 +385,24 @@ static void uacce_release(struct device *dev)
kfree(uacce);
  }
  
+static unsigned int uacce_enable_sva(struct device *parent, unsigned int flags)

+{
+   if (!(flags & UACCE_DEV_SVA))
+   return flags;
+
+   flags &= ~UACCE_DEV_SVA;
+
+   if (iommu_dev_enable_feature(parent, IOMMU_DEV_FEAT_IOPF))
+   return flags;
+
+   if (iommu_dev_enable_feature(parent, IOMMU_DEV_FEAT_SVA)) {
+   iommu_dev_disable_feature(parent, IOMMU_DEV_FEAT_IOPF);
+   return flags;
+   }


Sorry to jump in a bit late on this, and not specifically towards the 
intent of this patch. But I'd like to start a discussion on whether we want 
to push the iommu dev feature enabling to the device driver itself 
rather than having UACCE control this. Maybe allow the device driver to 
manage the feature bits and have UACCE only verify that they are enabled?
A sketch of the driver-side alternative follows the list below.

1. The device driver knows what platform it's on and what specific
   feature bits its devices support. Maybe in the future there are
   feature bits that are needed on one platform and not on another?
2. This allows the possibility of multiple uacce devices registered to
   one pci dev, which is a desirable feature for a device with
   asymmetric queues (Intel DSA/idxd driver). The current setup forces
   a single uacce device per pdev. If additional uacce devs are
   registered, the first removal of a uacce device will disable the
   feature bit for the rest of the registered devices. With uacce
   managing the feature bit, it would need to add device context to
   the parent pdev and ref counting. It may be cleaner to just allow
   the device driver to manage the feature bits; the driver should
   have all the information on when the feature needs to be turned on
   and off.
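
A minimal sketch of that driver-managed alternative (hypothetical driver name, error handling trimmed):

	#include <linux/iommu.h>
	#include <linux/pci.h>

	/* The device driver, not uacce, owns the feature bits. */
	static int mydrv_enable_features(struct pci_dev *pdev)
	{
		int ret;

		ret = iommu_dev_enable_feature(&pdev->dev, IOMMU_DEV_FEAT_IOPF);
		if (ret)
			return ret;

		ret = iommu_dev_enable_feature(&pdev->dev, IOMMU_DEV_FEAT_SVA);
		if (ret) {
			iommu_dev_disable_feature(&pdev->dev, IOMMU_DEV_FEAT_IOPF);
			return ret;
		}

		/* now register uacce device(s); uacce would only verify */
		return 0;
	}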

- DaveJ



+
+   return flags | UACCE_DEV_SVA;
+}
+
  /**
   * uacce_alloc() - alloc an accelerator
   * @parent: pointer of uacce parent device
@@ -404,11 +422,7 @@ struct uacce_device *uacce_alloc(struct device *parent,
if (!uacce)
return ERR_PTR(-ENOMEM);
  
-	if (flags & UACCE_DEV_SVA) {
-		ret = iommu_dev_enable_feature(parent, IOMMU_DEV_FEAT_SVA);
-		if (ret)
-			flags &= ~UACCE_DEV_SVA;
-	}
+	flags = uacce_enable_sva(parent, flags);

 	uacce->parent = parent;
 	uacce->flags = flags;
@@ -432,8 +446,10 @@ struct uacce_device *uacce_alloc(struct device *parent,
return uacce;
  
  err_with_uacce:

-   if (flags & UACCE_DEV_SVA)
+   if (flags & UACCE_DEV_SVA) {
iommu_dev_disable_feature(uacce->parent, IOMMU_DEV_FEAT_SVA);
+   iommu_dev_disable_feature(uacce->parent, IOMMU_DEV_FEAT_IOPF);
+   }
kfree(uacce);
return ERR_PTR(ret);
  }
@@ -487,8 +503,10 @@ void uacce_remove(struct uacce_device *uacce)
 	mutex_unlock(&uacce->queues_lock);

 	/* disable sva now since no opened queues */
-	if (uacce->flags & UACCE_DEV_SVA)
+	if (uacce->flags & UACCE_DEV_SVA) {
 		iommu_dev_disable_feature(uacce->parent, IOMMU_DEV_FEAT_SVA);
+		iommu_dev_disable_feature(uacce->parent, IOMMU_DEV_FEAT_IOPF);
+	}

 	if (uacce->cdev)
 		cdev_device_del(uacce->cdev, &uacce->dev);



Re: [PATCH] MAINTAINERS: add maintainers for uacce

2020-02-25 Thread Dave Jiang



On 2/25/20 6:11 PM, zhangfei wrote:



On 2020/2/26 12:02 AM, Dave Jiang wrote:



On 2/24/20 11:17 PM, Zhangfei Gao wrote:

Add Zhangfei Gao and Zhou Wang as maintainers for uacce

Signed-off-by: Zhangfei Gao 
Signed-off-by: Zhou Wang 
---
  MAINTAINERS | 10 ++
  1 file changed, 10 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 38fe2f3..22e647f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17039,6 +17039,16 @@ W:    http://linuxtv.org
  S:    Maintained
  F:    drivers/media/pci/tw686x/
  +UACCE ACCELERATOR FRAMEWORK
+M:    Zhangfei Gao 
+M:    Zhou Wang 
+S:    Maintained
+F:    Documentation/ABI/testing/sysfs-driver-uacce
+F:    Documentation/misc-devices/uacce.rst
+F:    drivers/misc/uacce/
+F:    include/linux/uacce.h
+F:    include/uapi/misc/uacce/


Mailing list for patch submission?
+L: linux-accelerat...@lists.ozlabs.org ?


Thanks Dave

How about adding both
linux-accelerat...@lists.ozlabs.org
and linux-ker...@vger.kernel.org,
since the patches will go through the misc tree?


That is entirely up to you. But having guidance on where to submit 
patches would be good.




Thanks


Re: [PATCH] MAINTAINERS: add maintainers for uacce

2020-02-25 Thread Dave Jiang




On 2/24/20 11:17 PM, Zhangfei Gao wrote:

Add Zhangfei Gao and Zhou Wang as maintainers for uacce

Signed-off-by: Zhangfei Gao 
Signed-off-by: Zhou Wang 
---
  MAINTAINERS | 10 ++
  1 file changed, 10 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 38fe2f3..22e647f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17039,6 +17039,16 @@ W: http://linuxtv.org
  S:Maintained
  F:drivers/media/pci/tw686x/
  
+UACCE ACCELERATOR FRAMEWORK

+M: Zhangfei Gao 
+M: Zhou Wang 
+S: Maintained
+F: Documentation/ABI/testing/sysfs-driver-uacce
+F: Documentation/misc-devices/uacce.rst
+F: drivers/misc/uacce/
+F: include/linux/uacce.h
+F: include/uapi/misc/uacce/


Mailing list for patch submission?
+L: linux-accelerat...@lists.ozlabs.org ?


+
  UBI FILE SYSTEM (UBIFS)
  M:Richard Weinberger 
  L:linux-...@lists.infradead.org




Re: [PATCH v11 2/4] uacce: add uacce driver

2020-01-15 Thread Dave Jiang



On 1/15/20 4:18 AM, zhangfei wrote:

Hi, Greg

On 2020/1/14 10:59 PM, Greg Kroah-Hartman wrote:

On Mon, Jan 13, 2020 at 11:34:55AM +0800, zhangfei wrote:

Hi, Greg

Thanks for the review.

On 2020/1/12 3:40 AM, Greg Kroah-Hartman wrote:

On Sat, Jan 11, 2020 at 10:48:37AM +0800, Zhangfei Gao wrote:

+static int uacce_fops_open(struct inode *inode, struct file *filep)
+{
+    struct uacce_mm *uacce_mm = NULL;
+    struct uacce_device *uacce;
+    struct uacce_queue *q;
+    int ret = 0;
+
+    uacce = xa_load(&uacce_xa, iminor(inode));
+    if (!uacce)
+    return -ENODEV;
+
+    if (!try_module_get(uacce->parent->driver->owner))
+    return -ENODEV;

Why are you trying to grab the module reference of the parent device?
Why is that needed and what is that going to help with here?

This shouldn't be needed as the module reference of the owner of the
fileops for this module is incremented, and the "parent" module depends
on this module, so how could it be unloaded without this code being
unloaded?

Yes, if you build this code into the kernel and the "parent" driver is a
module, then you will not have a reference, but when you remove that
parent driver the device will be removed as it has to be unregistered
before that parent driver can be removed from the system, right?

Or what am I missing here?

The refcount here prevents rmmod of the "parent" module after the fd is
opened, since the user driver has mmapped kernel memory (like MMIO) to
user space, which may still be in use.

With the refcount protection, rmmod of the "parent" module will fail until
the application frees the fd.
Log like: rmmod: ERROR: Module hisi_zip is in use

But if the "parent" module is to be unloaded, it has to unregister the
"child" device and that will call the destructor in here and then you
will tear everything down and all should be good.

There's no need to "forbid" a module from being unloaded, even if it is
being used.  Look at all networking drivers, they work that way, right?

Thanks Greg for the kind suggestion.

I still have one uncertainty.
Does uacce have to block the process from continuing to access the mmapped
area when the "parent" module is removed?
Uacce can block the device from accessing the physical memory when the
parent module calls uacce_remove().
But the application is still running, and presumably it is not the kernel
driver's responsibility to call unmap.

I am looking for some examples in the kernel; it looks like vfio does not
block the process from continuing access when
vfio_unregister_iommu_driver() runs either.

In my test, the application will keep waiting after rmmod of the parent,
until ctrl+c, when unmap is called.

During the process, the kernel does not report any error.

Do you have any advice?


Would it work to call unmap_mapping_range() on the char dev 
inode->i_mapping? I think you need to set the vma->fault function pointer 
in the vm_operations_struct in the original mmap(). After the mappings 
are unmapped, you can set a state variable to trigger the return of 
VM_FAULT_SIGBUS in the ->fault function when the user app accesses the 
mmap region again and triggers a page fault. The user app needs to be 
programmed to catch exceptions to deal with that.
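
A minimal sketch of that idea; the queue state and pfn fields here are hypothetical:

	#include <linux/fs.h>
	#include <linux/mm.h>
	#include <linux/uacce.h>

	/* Fail faults with SIGBUS once the parent is gone. */
	static vm_fault_t uacce_vma_fault(struct vm_fault *vmf)
	{
		struct uacce_queue *q = vmf->vma->vm_private_data;

		if (q->state == UACCE_Q_ZOMBIE)	/* hypothetical "parent removed" state */
			return VM_FAULT_SIGBUS;

		/* live path: insert the backing pfn (hypothetical field) */
		return vmf_insert_pfn(vmf->vma, vmf->address, q->mmio_pfn);
	}

	static const struct vm_operations_struct uacce_vm_ops = {
		.fault = uacce_vma_fault,
	};

	/* On parent removal: zap existing PTEs so the next access faults. */
	static void uacce_zap_mappings(struct inode *inode)
	{
		unmap_mapping_range(inode->i_mapping, 0, 0, 1);
	}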





+static void uacce_release(struct device *dev)
+{
+    struct uacce_device *uacce = to_uacce_device(dev);
+
+    kfree(uacce);
+    uacce = NULL;

That line didn't do anything :)

Yes, this is a mistake.
It is up to the caller to set it to NULL to prevent releasing multiple times.

Release function is called by the driver core which will not touch the
value again.

Yes, I understand, it's my mistake. Will remove it.

Thanks


Re: [PATCH v10 0/4] Add uacce module for Accelerator

2020-01-08 Thread Dave Jiang

On 12/15/19 8:08 PM, Zhangfei Gao wrote:

Uacce (Unified/User-space-access-intended Accelerator Framework) targets
providing Shared Virtual Addressing (SVA) between accelerators and processes,
so an accelerator can access any data structure of the main CPU.
This differs from data sharing between the CPU and an IO device, which shares
data content rather than addresses.
Because of the unified address space, hardware and the user space of a process
can share the same virtual address in their communication.

Uacce is intended to be used with Jean-Philippe Brucker's SVA
patchset [1], which enables IO-side page fault and PASID support.
We have kept verifying with Jean's SVA patchset [2]
and we also keep verifying with Eric's SMMUv3 nested stage patches [3].


Looking forward to this common framework going upstream. I'm currently 
in the process of upstreaming the device driver (idxd) [1] for the 
recently announced Intel Data Streaming Accelerator [2] [3] that also 
supports SVA.


[1] https://lkml.org/lkml/2020/1/7/905
[2] https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
[3] 
https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification


And I think with this framework upstream I can potentially drop the
in-driver exported char device support code and use this framework directly.





This series and related zip & qm driver
https://github.com/Linaro/linux-kernel-warpdrive/tree/v5.5-rc1-uacce-v10

The library and user application:
https://github.com/Linaro/warpdrive/tree/wdprd-upstream-v10

References:
[1] http://jpbrucker.net/sva/
[2] http://jpbrucker.net/git/linux/log/?h=sva/zip-devel
[3] https://github.com/eauger/linux/tree/v5.3.0-rc0-2stage-v9

Change History:
v10:
Modify the include header to fix kbuild test error on other arches.

v9:
Suggested by Jonathan
1. Remove sysfs: numa_distance, node_id, id, also add is_visible callback
2. Split the api to solve the potential race
struct uacce_device *uacce_alloc(struct device *parent,
 struct uacce_interface *interface)
int uacce_register(struct uacce_device *uacce)
void uacce_remove(struct uacce_device *uacce)
3. Split clean up patch 03

v8:
Address some comments from Jonathan
Merge Jean's patch, using uacce_mm instead of pid for sva_exit

v7:
As suggested by Jean and Jerome
Only consider sva case and remove unused dma apis for the first patch.
Also add mm_exit for sva and vm_ops.close etc


v6: https://lkml.org/lkml/2019/10/16/231
Change sys qfrs_size to different file, suggested by Jonathan
Fix crypto daily build issue and based on crypto code base, also 5.4-rc1.

v5: https://lkml.org/lkml/2019/10/14/74
Add an example patch using the uacce interface, suggested by Greg
0003-crypto-hisilicon-register-zip-engine-to-uacce.patch

v4: https://lkml.org/lkml/2019/9/17/116
Based on 5.4-rc1
Considering other drivers integrating uacce:
if uacce is not compiled in, uacce_register() returns an error and
uacce_unregister() is empty.
Simplify uacce flag: UACCE_DEV_SVA.
Address Greg's comments:
Fix state machine, remove potential syslog triggered from user space etc.

v3: https://lkml.org/lkml/2019/9/2/990
Recommended by Greg: use struct uacce_device instead of struct uacce,
and use a struct cdev * in struct uacce_device; as a result,
the cdev can be released by itself when its refcount decreases to 0.
So the two structures are decoupled and self-maintained by themselves.
Also add dev.release for put_device.

v2: https://lkml.org/lkml/2019/8/28/565
Address comments from Greg and Jonathan
Modify interface uacce_register
Drop noiommu mode first

v1: https://lkml.org/lkml/2019/8/14/277
1. Rebase to 5.3-rc1
2. Build on iommu interface
3. Verifying with Jean's sva and Eric's nested mode iommu.
4. User library has developed a lot: support zlib, openssl etc.
5. Move to misc first

RFC3:
https://lkml.org/lkml/2018/11/12/1951

RFC2:
https://lwn.net/Articles/763990/


Background of why Uacce:
Von Neumann processors are not good at general data manipulation.
They are designed for control-bound rather than data-bound applications;
the latter need less control-path facility and more/specific ALUs.
So there are more and more heterogeneous processors these days, such as
encryption/decryption accelerators, TPUs, or
EDGE (Explicated Data Graph Execution) processors, introduced to gain
better performance or power efficiency for particular applications.

There are generally two ways to make use of these heterogeneous processors:

The first is to make them co-processors, just like the FPU.
This is good for some applications but it has its own cons:
it changes the ISA set permanently;
you must save all state elements when the process is switched out,
but most data-bound processors have a huge set of state elements;
and it makes the kernel scheduler more complex.

The second is the Accelerator.
It is taken as an IO device from the CPU's point of view
(but it need not be physically one). The process, running on the CPU,
holds a context of the accelerator and send

Re: [PATCH 0/9] Support using MSI interrupts in ntb_transport

2019-01-31 Thread Dave Jiang



On 1/31/2019 4:41 PM, Logan Gunthorpe wrote:


On 2019-01-31 3:46 p.m., Dave Jiang wrote:

I believe irqbalance writes to the file /proc/irq/N/smp_affinity. So
maybe take a look at the code that starts from there and see if it would
have any impact on your stuff.

Ok, well on my system I can write to the smp_affinity all day and the
MSI interrupts still work fine.


Maybe your code is ok then. If the stats show up in /proc/interrupts 
then you can see it moving to different cores.



The MSI code is a bit difficult to trace and audit with all the
different chips and the parent chips which I don't have a good
understanding of. But I can definitely see that it could be possible for
some chips to change the address as smp_affinitiy will eventually
sometimes call msi_domain_set_affinity() which does seem to recompose
the message and write it back to the chip.

So, I could relatively easily add a callback to msi_desc to catch this
and resend the MSI address/data. However, I'm not sure how this is ever
done atomically. It seems like there would be a race while the device
updates its address where old interrupts could be triggered. This race
would be much longer for us when sending this information over the NTB
link. Though, I guess if the only change is that it encodes CPU
information in the address then that would not be an issue. However, I'm
not sure I can say that for certain without a comprehensive
understanding of all the IRQ chips.

Any thoughts on this?


Yeah, I'm not sure what to do about it either, as I'm not super familiar 
with that area. Just making a note of what I encountered. And you 
are right, the updated info has to go over NTB for the other side to 
write to the updated place. So there's a lot of latency involved.
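
For reference, a minimal sketch of the kind of per-descriptor hook being discussed, assuming a write_msi_msg callback on struct msi_desc (roughly the shape this series proposes; the entry_nr field and scratchpad layout are assumptions):

	#include <linux/msi.h>
	#include <linux/ntb.h>

	/*
	 * Called whenever the core rewrites this vector's address/data,
	 * e.g. after an smp_affinity change; republish it to the peer.
	 */
	static void ntb_msi_write_msg(struct msi_desc *entry, void *data)
	{
		struct ntb_dev *ntb = data;
		int idx = entry->msi_attrib.entry_nr;

		ntb_peer_spad_write(ntb, 0, 2 * idx, entry->msg.address_lo);
		ntb_peer_spad_write(ntb, 0, 2 * idx + 1, entry->msg.data);
	}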






Logan



Re: [PATCH 0/9] Support using MSI interrupts in ntb_transport

2019-01-31 Thread Dave Jiang



On 1/31/2019 3:39 PM, Logan Gunthorpe wrote:


On 2019-01-31 1:58 p.m., Dave Jiang wrote:

On 1/31/2019 1:48 PM, Logan Gunthorpe wrote:

On 2019-01-31 1:20 p.m., Dave Jiang wrote:

Does this work when the system moves the MSI vector either via software
(irqbalance) or BIOS APIC programming (some modes cause round robin
behavior)?

I don't know how irqbalance works, and I'm not sure what you are
referring to by BIOS APIC programming, however I would expect these
things would not be a problem.

The MSI code I'm presenting here doesn't do anything crazy with the
interrupts, it allocates and uses them just as any PCI driver would. The
only real difference here is that instead of a piece of hardware sending
the IRQ TLP, it will be sent through the memory window (which, from the
OS's perspective, is just coming from an NTB hardware proxy alias).

Logan

Right. I did that as a hack a while back for some silicon errata
workaround. When the vector moves, the address for the LAPIC changes. So
unless it gets updated, you end up writing to the old location and lose
all the new interrupts. irqbalance is a user daemon that rotates the
system interrupts around to ensure that not all interrupts are pinned on
a single core.

Yes, that would be a problem if something changes the MSI vectors out
from under us. Seems like that would be a bit difficult to do even with
regular hardware. So far I haven't seen anything that would do that. If
you know of where in the kernel this happens I'd be interested in
getting a pointer to the flow in the code. If that is the case this MSI
stuff will need to get much more complicated...


I believe irqbalance writes to the file /proc/irq/N/smp_affinity. So 
maybe take a look at the code that starts from there and see if it would 
have any impact on your stuff.






I think it's enabled by default on several distros.
Although MSI-X has nothing to do with the IOAPIC, the mode in which the APIC
is programmed can have an influence on how the interrupts are delivered.
There are certain Intel platforms (I don't know if AMD does anything
like that) that put the IOAPIC in a configuration that causes the
interrupts to be moved in a round-robin fashion. I think it's physical
flat mode? I don't quite recall. Normally on the low-end Xeons. It's
probably worth doing a test run with the irqbalance daemon running and
making sure your traffic stream doesn't all of a sudden stop.

I've tested with irqbalance running and haven't found any noticeable
difference.

Logan



Re: [PATCH 0/9] Support using MSI interrupts in ntb_transport

2019-01-31 Thread Dave Jiang



On 1/31/2019 1:48 PM, Logan Gunthorpe wrote:


On 2019-01-31 1:20 p.m., Dave Jiang wrote:

Does this work when the system moves the MSI vector either via software
(irqbalance) or BIOS APIC programming (some modes cause round robin
behavior)?


I don't know how irqbalance works, and I'm not sure what you are
referring to by BIOS APIC programming, however I would expect these
things would not be a problem.

The MSI code I'm presenting here doesn't do anything crazy with the
interrupts, it allocates and uses them just as any PCI driver would. The
only real difference here is that instead of a piece of hardware sending
the IRQ TLP, it will be sent through the memory window (which, from the
OS's perspective, is just coming from an NTB hardware proxy alias).

Logan
Right. I did that as a hack a while back for some silicon errata 
workaround. When the vector moves, the address for the LAPIC changes. So 
unless it gets updated, you end up writing to the old location and lose 
all the new interrupts. irqbalance is a user daemon that rotates the 
system interrupts around to ensure that not all interrupts are pinned on 
a single core. I think it's enabled by default on several distros. 
Although MSI-X has nothing to do with the IOAPIC, the mode in which the 
APIC is programmed can have an influence on how the interrupts are 
delivered. There are certain Intel platforms (I don't know if AMD does 
anything like that) that put the IOAPIC in a configuration that causes 
the interrupts to be moved in a round-robin fashion. I think it's 
physical flat mode? I don't quite recall. Normally on the low-end Xeons. 
It's probably worth doing a test run with the irqbalance daemon running 
and making sure your traffic stream doesn't all of a sudden stop.





Re: [PATCH 0/9] Support using MSI interrupts in ntb_transport

2019-01-31 Thread Dave Jiang



On 1/31/2019 11:56 AM, Logan Gunthorpe wrote:

Hi,

This patch series adds optional support for using MSI interrupts instead
of NTB doorbells in ntb_transport. This is desirable seeing doorbells on
current hardware are quite slow and therefore switching to MSI interrupts
provides a significant performance gain. On switchtec hardware, a simple
apples-to-apples comparison shows ntb_netdev/iperf numbers going from
3.88Gb/s to 14.1Gb/s when switching to MSI interrupts.

To do this, a couple changes are required outside of the NTB tree:

1) The IOMMU must know to accept MSI requests from aliased bused numbers
seeing NTB hardware typically sends proxied request IDs through
additional requester IDs. The first patch in this series adds support
for the Intel IOMMU. A quirk to add these aliases for switchtec hardware
was already accepted. See commit ad281ecf1c7d ("PCI: Add DMA alias quirk
for Microsemi Switchtec NTB") for a description of NTB proxy IDs and why
this is necessary.

2) NTB transport (and other clients) may often need more MSI interrupts
than the NTB hardware actually advertises support for. However, seeing
these interrupts will not be triggered by the hardware but through an
NTB memory window, the hardware does not actually need support or need
to know about them. Therefore we add the concept of Virtual MSI
interrupts which are allocated just like any other MSI interrupt but
are not programmed into the hardware's MSI table. This is done in
Patch 2 and then made use of in Patch 3.
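
A minimal sketch of what a client using these virtual vectors might do, assuming the PCI_IRQ_VIRTUAL flag added by patch 2 (the vector count here is arbitrary):

	#include <linux/pci.h>

	/*
	 * Ask for more vectors than the hardware MSI table holds; the
	 * surplus are "virtual" and never programmed into hardware.
	 */
	static int request_ntb_vectors(struct pci_dev *pdev)
	{
		return pci_alloc_irq_vectors(pdev, 1, 16,
					     PCI_IRQ_MSI | PCI_IRQ_VIRTUAL);
	}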


Logan,

Does this work when the system moves the MSI vector either via software 
(irqbalance) or BIOS APIC programming (some modes cause round robin 
behavior)?






The remaining patches in this series add a library for dealing with MSI
interrupts, a test client and finally support in ntb_transport.

The series is based off of v5.0-rc4 and I've tested it on top of a
of the patches I've already sent to the NTB tree (though they are
independent changes). A git repo is available here:

https://github.com/sbates130272/linux-p2pmem/ ntb_transport_msi_v1

Thanks,

Logan

--

Logan Gunthorpe (9):
   iommu/vt-d: Allow interrupts from the entire bus for aliased devices
   PCI/MSI: Support allocating virtual MSI interrupts
   PCI/switchtec: Add module parameter to request more interrupts
   NTB: Introduce functions to calculate multi-port resource index
   NTB: Rename ntb.c to support multiple source files in the module
   NTB: Introduce MSI library
   NTB: Introduce NTB MSI Test Client
   NTB: Add ntb_msi_test support to ntb_test
   NTB: Add MSI interrupt support to ntb_transport

  drivers/iommu/intel_irq_remapping.c |  12 +
  drivers/ntb/Kconfig |  10 +
  drivers/ntb/Makefile|   3 +
  drivers/ntb/{ntb.c => core.c}   |   0
  drivers/ntb/msi.c   | 313 ++
  drivers/ntb/ntb_transport.c | 134 +++-
  drivers/ntb/test/Kconfig|   9 +
  drivers/ntb/test/Makefile   |   1 +
  drivers/ntb/test/ntb_msi_test.c | 416 
  drivers/pci/msi.c   |  51 ++-
  drivers/pci/switch/switchtec.c  |  12 +-
  include/linux/msi.h |   1 +
  include/linux/ntb.h | 139 
  include/linux/pci.h |   9 +
  tools/testing/selftests/ntb/ntb_test.sh |  54 ++-
  15 files changed, 1150 insertions(+), 14 deletions(-)
  rename drivers/ntb/{ntb.c => core.c} (100%)
  create mode 100644 drivers/ntb/msi.c
  create mode 100644 drivers/ntb/test/ntb_msi_test.c

--
2.19.0




Re: IOAT DMA w/IOMMU

2018-08-10 Thread Dave Jiang



On 08/10/2018 10:15 AM, Logan Gunthorpe wrote:
> 
> 
> On 10/08/18 11:01 AM, Dave Jiang wrote:
>> Or if the BIOS has provided mapping for the Intel NTB device
>> specifically? Is that a possibility? NTB does go through the IOMMU.
> 
> I don't know but if the BIOS is doing it, but that would only at best
> work for Intel NTB I see no hope in getting the BIOS to map the MW
> for a Switchtec NTB device...

Right. I'm just speculating about why it may possibly work. But yes, I think
the kernel will need appropriate mappings if that's not happening right now.
Hopefully Kit is onto something.

> 
> Logan
> 


Re: IOAT DMA w/IOMMU

2018-08-10 Thread Dave Jiang


On 08/10/2018 09:33 AM, Logan Gunthorpe wrote:
> 
> 
> On 10/08/18 10:31 AM, Dave Jiang wrote:
>>
>>
>> On 08/10/2018 09:24 AM, Logan Gunthorpe wrote:
>>>
>>>
>>> On 10/08/18 10:02 AM, Kit Chow wrote:
>>>> Turns out there is no dma_map_resource routine on x86. get_dma_ops 
>>>> returns intel_dma_ops which has map_resource pointing to NULL.
>>>
>>> Oh, yup. I wasn't aware of that. From a cursory view, it looks like it
>>> shouldn't be too hard to implement though.
>>>
>>>> Will poke around some in the intel_map_page code but can you actually 
>>>> get a valid struct page for a pci bar address (dma_map_single calls 
>>>> virt_to_page)?  If not, does a map_resource routine that can properly 
>>>> map a pci bar address need to be implemented?
>>>
>>> Yes, you can not get a struct page for a PCI bar address unless it's
>>> mapped with ZONE_DEVICE like in my p2p work. So that would explain why
>>> dma_map_single() didn't work.
>>>
>>> This all implies that ntb_transport doesn't work with DMA and the IOMMU
>>> turned on. I'm not sure I've ever tried that configuration myself but it
>>> is a bit surprising.
>>
>> Hmm... that's surprising because it seems to work on Skylake platform
>> when I tested it yesterday with Intel NTB. Kit is using a Haswell
>> platform at the moment I think. Although I'm curious if it works with
>> the PLX NTB he's using on Skylake.
> 
> Does that mean on Skylake the IOAT can bypass the IOMMU? Because it
> looks like the ntb_transport code doesn't map the physical address of
> the NTB MW into the IOMMU when doing DMA...

Or if the BIOS has provided mapping for the Intel NTB device
specifically? Is that a possibility? NTB does go through the IOMMU.

Re: IOAT DMA w/IOMMU

2018-08-10 Thread Dave Jiang


On 08/10/2018 09:24 AM, Logan Gunthorpe wrote:
> 
> 
> On 10/08/18 10:02 AM, Kit Chow wrote:
>> Turns out there is no dma_map_resource routine on x86. get_dma_ops 
>> returns intel_dma_ops which has map_resource pointing to NULL.
> 
> Oh, yup. I wasn't aware of that. From a cursory view, it looks like it
> shouldn't be too hard to implement though.
> 
>> Will poke around some in the intel_map_page code but can you actually 
>> get a valid struct page for a pci bar address (dma_map_single calls 
>> virt_to_page)?  If not, does a map_resource routine that can properly 
>> map a pci bar address need to be implemented?
> 
> Yes, you can not get a struct page for a PCI bar address unless it's
> mapped with ZONE_DEVICE like in my p2p work. So that would explain why
> dma_map_single() didn't work.
> 
> This all implies that ntb_transport doesn't work with DMA and the IOMMU
> turned on. I'm not sure I've ever tried that configuration myself but it
> is a bit surprising.

Hmm... that's surprising because it seems to work on Skylake platform
when I tested it yesterday with Intel NTB. Kit is using a Haswell
platform at the moment I think. Although I'm curious if it works with
the PLX NTB he's using on Skylake.
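
For reference, a minimal sketch of mapping an MMIO/BAR address (a resource, not a page) for DMA, the operation noted above as unimplemented in intel_dma_ops at the time; the names here are hypothetical:

	#include <linux/dma-mapping.h>

	static dma_addr_t map_mw_for_dma(struct device *dma_dev,
					 phys_addr_t mw_phys, size_t len)
	{
		dma_addr_t dma;

		/* dma_map_resource() takes a phys_addr_t, no struct page */
		dma = dma_map_resource(dma_dev, mw_phys, len,
				       DMA_FROM_DEVICE, 0);
		if (dma_mapping_error(dma_dev, dma))
			return DMA_MAPPING_ERROR;
		return dma;
	}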

> 
> Logan
> 

Re: [PATCH 03/44] dmaengine: ioat: don't use DMA_ERROR_CODE

2017-06-08 Thread Dave Jiang
On 06/08/2017 06:25 AM, Christoph Hellwig wrote:
> DMA_ERROR_CODE is not a public API and will go away.  Instead properly
> unwind based on the loop counter.
> 
> Signed-off-by: Christoph Hellwig <h...@lst.de>

Acked-by: Dave Jiang <dave.ji...@intel.com>

> ---
>  drivers/dma/ioat/init.c | 24 +++-
>  1 file changed, 7 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/dma/ioat/init.c b/drivers/dma/ioat/init.c
> index 6ad4384b3fa8..ed8ed1192775 100644
> --- a/drivers/dma/ioat/init.c
> +++ b/drivers/dma/ioat/init.c
> @@ -839,8 +839,6 @@ static int ioat_xor_val_self_test(struct ioatdma_device 
> *ioat_dma)
>   goto free_resources;
>   }
>  
> - for (i = 0; i < IOAT_NUM_SRC_TEST; i++)
> - dma_srcs[i] = DMA_ERROR_CODE;
>   for (i = 0; i < IOAT_NUM_SRC_TEST; i++) {
>   dma_srcs[i] = dma_map_page(dev, xor_srcs[i], 0, PAGE_SIZE,
>  DMA_TO_DEVICE);
> @@ -910,8 +908,6 @@ static int ioat_xor_val_self_test(struct ioatdma_device 
> *ioat_dma)
>  
>   xor_val_result = 1;
>  
> - for (i = 0; i < IOAT_NUM_SRC_TEST + 1; i++)
> - dma_srcs[i] = DMA_ERROR_CODE;
>   for (i = 0; i < IOAT_NUM_SRC_TEST + 1; i++) {
>   dma_srcs[i] = dma_map_page(dev, xor_val_srcs[i], 0, PAGE_SIZE,
>  DMA_TO_DEVICE);
> @@ -965,8 +961,6 @@ static int ioat_xor_val_self_test(struct ioatdma_device 
> *ioat_dma)
>   op = IOAT_OP_XOR_VAL;
>  
>   xor_val_result = 0;
> - for (i = 0; i < IOAT_NUM_SRC_TEST + 1; i++)
> - dma_srcs[i] = DMA_ERROR_CODE;
>   for (i = 0; i < IOAT_NUM_SRC_TEST + 1; i++) {
>   dma_srcs[i] = dma_map_page(dev, xor_val_srcs[i], 0, PAGE_SIZE,
>  DMA_TO_DEVICE);
> @@ -1017,18 +1011,14 @@ static int ioat_xor_val_self_test(struct 
> ioatdma_device *ioat_dma)
>   goto free_resources;
>  dma_unmap:
>   if (op == IOAT_OP_XOR) {
> - if (dest_dma != DMA_ERROR_CODE)
> - dma_unmap_page(dev, dest_dma, PAGE_SIZE,
> -DMA_FROM_DEVICE);
> - for (i = 0; i < IOAT_NUM_SRC_TEST; i++)
> - if (dma_srcs[i] != DMA_ERROR_CODE)
> - dma_unmap_page(dev, dma_srcs[i], PAGE_SIZE,
> -DMA_TO_DEVICE);
> + while (--i >= 0)
> + dma_unmap_page(dev, dma_srcs[i], PAGE_SIZE,
> +DMA_TO_DEVICE);
> + dma_unmap_page(dev, dest_dma, PAGE_SIZE, DMA_FROM_DEVICE);
>   } else if (op == IOAT_OP_XOR_VAL) {
> - for (i = 0; i < IOAT_NUM_SRC_TEST + 1; i++)
> - if (dma_srcs[i] != DMA_ERROR_CODE)
> - dma_unmap_page(dev, dma_srcs[i], PAGE_SIZE,
> -DMA_TO_DEVICE);
> + while (--i >= 0)
> + dma_unmap_page(dev, dma_srcs[i], PAGE_SIZE,
> +DMA_TO_DEVICE);
>   }
>  free_resources:
>   dma->device_free_chan_resources(dma_chan);
> 