Re: [PATCH 4/4] dmaengine: idxd: Use DMA API for in-kernel DMA with PASID
On 12/8/2021 4:39 PM, Jason Gunthorpe wrote:
On Wed, Dec 08, 2021 at 01:59:45PM -0800, Jacob Pan wrote:
Hi Jason,
On Wed, 8 Dec 2021 16:30:22 -0400, Jason Gunthorpe wrote:
On Wed, Dec 08, 2021 at 11:55:16AM -0800, Jacob Pan wrote:
Hi Jason,
On Wed, 8 Dec 2021 09:13:58 -0400, Jason Gunthorpe wrote:

This patch utilizes iommu_enable_pasid_dma() to enable DSA to perform DMA requests with PASID under the same mapping managed by the DMA mapping API. In addition, SVA-related bits for kernel DMA are removed. As a result, DSA users shall use the DMA mapping API to obtain DMA handles instead of using kernel virtual addresses.

Er, shouldn't this be adding dma_map/etc type calls? You can't really say a driver is using the DMA API without actually calling the DMA API..

The IDXD driver is not aware of the addressing mode; it is up to the user of the dmaengine API to prepare the buffer mappings. Here we only set up the PASID such that it can be picked up during DMA work submission. I tested with drivers/dma/dmatest.c, which does dma_map_page(), dma_map_single(), etc. Also tested with other pieces under development.

Ignoring the work, doesn't IDXD prepare the DMA queues itself? Don't those need the DMA API?

Do you mean the wq completion record address? It is already using the DMA API:

	wq->compls = dma_alloc_coherent(dev, wq->compls_size, &wq->compls_addr, GFP_KERNEL);
	desc->compl_dma = wq->compls_addr + idxd->data->compl_size * i;

I would have expected something on the queue submission side too?

DSA is different from typical DMA devices in the past. Instead of a software descriptor ring that the device DMAs from to fetch descriptors after software rings a doorbell or writes a head index, the descriptors are submitted directly to the device via a CPU instruction (i.e. MOVDIR64B or ENQCMD(S)). The CPU takes the KVA of the 64B descriptor and writes it to the device atomically. No DMA mapping is necessary in this case.

... the same thing, they do not use the same IOVA's. Did you test this with bypass mode off?

Yes, with dmatest. IOVA is the default; I separated out the SATC patch, which will put internal accelerators in bypass mode. It can also be verified by an iommu debugfs dump that the DMA PASID (2) and PASID 0 (RIDPASID) are pointing to the same default domain. e.g.

Well, OK then..

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 4/4] dmaengine: idxd: Use DMA API for in-kernel DMA with PASID
On 12/8/2021 6:13 AM, Jason Gunthorpe wrote:
On Tue, Dec 07, 2021 at 05:47:14AM -0800, Jacob Pan wrote:

In-kernel DMA should be managed by the DMA mapping API. The existing kernel PASID support is based on the SVA machinery in the SVA lib that is intended for user process SVA. The binding between a kernel PASID and kernel mapping has many flaws. See discussions in the link below.

This patch utilizes iommu_enable_pasid_dma() to enable DSA to perform DMA requests with PASID under the same mapping managed by the DMA mapping API. In addition, SVA-related bits for kernel DMA are removed. As a result, DSA users shall use the DMA mapping API to obtain DMA handles instead of using kernel virtual addresses.

Er, shouldn't this be adding dma_map/etc type calls? You can't really say a driver is using the DMA API without actually calling the DMA API..

+	/*
+	 * Try to enable both in-kernel and user DMA request with PASID.
+	 * PASID is supported unless both user and kernel PASID are
+	 * supported. Do not fail probe here in that idxd can still be
+	 * used w/o PASID or IOMMU.
+	 */
+	if (iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_SVA) ||
+	    idxd_enable_system_pasid(idxd)) {
+		dev_warn(dev, "Failed to enable PASID\n");
+	} else {
+		set_bit(IDXD_FLAG_PASID_ENABLED, &idxd->flags);
 	}

Huh? How can the driver keep going if PASID isn't supported? I thought the whole point of this was because the device cannot do DMA without PASID at all?

There are 2 types of WQ supported with the DSA devices: a dedicated WQ type and a shared WQ type. The dedicated WQ type can support DMA with and without PASID. The shared WQ type must have a PASID to operate. The driver can support the dedicated WQ type only, without PASID usage, when there is no PASID support.

Jason
Re: [PATCH 4/4] dmaengine: idxd: Use DMA API for in-kernel DMA with PASID
On 12/7/2021 6:47 AM, Jacob Pan wrote:

In-kernel DMA should be managed by the DMA mapping API. The existing kernel PASID support is based on the SVA machinery in the SVA lib that is intended for user process SVA. The binding between a kernel PASID and kernel mapping has many flaws. See discussions in the link below.

This patch utilizes iommu_enable_pasid_dma() to enable DSA to perform DMA requests with PASID under the same mapping managed by the DMA mapping API. In addition, SVA-related bits for kernel DMA are removed. As a result, DSA users shall use the DMA mapping API to obtain DMA handles instead of using kernel virtual addresses.

Link: https://lore.kernel.org/linux-iommu/20210511194726.gp1002...@nvidia.com/
Signed-off-by: Jacob Pan

Acked-by: Dave Jiang

Also cc Vinod and dmaengine@vger

 .../admin-guide/kernel-parameters.txt |  6 --
 drivers/dma/Kconfig                   | 10
 drivers/dma/idxd/idxd.h               |  1 -
 drivers/dma/idxd/init.c               | 59 ++-
 drivers/dma/idxd/sysfs.c              |  7 ---
 5 files changed, 19 insertions(+), 64 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 9725c546a0d4..fe73d02c62f3 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1751,12 +1751,6 @@
 			In such case C2/C3 won't be used again.
 			idle=nomwait: Disable mwait for CPU C-states

-	idxd.sva=	[HW]
-			Format:
-			Allow force disabling of Shared Virtual Memory (SVA)
-			support for the idxd driver. By default it is set to
-			true (1).
-
 	idxd.tc_override= [HW]
 			Format:
 			Allow override of default traffic class configuration

diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 6bcdb4e6a0d1..3b28bd720e7d 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -313,16 +313,6 @@
 	  If unsure, say N.

-# Config symbol that collects all the dependencies that's necessary to
-# support shared virtual memory for the devices supported by idxd.
-config INTEL_IDXD_SVM
-	bool "Accelerator Shared Virtual Memory Support"
-	depends on INTEL_IDXD
-	depends on INTEL_IOMMU_SVM
-	depends on PCI_PRI
-	depends on PCI_PASID
-	depends on PCI_IOV
-
 config INTEL_IDXD_PERFMON
 	bool "Intel Data Accelerators performance monitor support"
 	depends on INTEL_IDXD

diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 0cf8d3145870..3155e3a2d3ae 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -262,7 +262,6 @@ struct idxd_device {
 	struct idxd_wq **wqs;
 	struct idxd_engine **engines;

-	struct iommu_sva *sva;
 	unsigned int pasid;

 	int num_groups;

diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
index 7bf03f371ce1..44633f8113e2 100644
--- a/drivers/dma/idxd/init.c
+++ b/drivers/dma/idxd/init.c
@@ -16,6 +16,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include "../dmaengine.h"
@@ -28,10 +29,6 @@
 MODULE_LICENSE("GPL v2");
 MODULE_AUTHOR("Intel Corporation");
 MODULE_IMPORT_NS(IDXD);

-static bool sva = true;
-module_param(sva, bool, 0644);
-MODULE_PARM_DESC(sva, "Toggle SVA support on/off");
-
 bool tc_override;
 module_param(tc_override, bool, 0644);
 MODULE_PARM_DESC(tc_override, "Override traffic class defaults");
@@ -530,36 +527,22 @@ static struct idxd_device *idxd_alloc(struct pci_dev *pdev, struct idxd_driver_d

 static int idxd_enable_system_pasid(struct idxd_device *idxd)
 {
-	int flags;
-	unsigned int pasid;
-	struct iommu_sva *sva;
-
-	flags = SVM_FLAG_SUPERVISOR_MODE;
-
-	sva = iommu_sva_bind_device(&idxd->pdev->dev, NULL, &flags);
-	if (IS_ERR(sva)) {
-		dev_warn(&idxd->pdev->dev,
-			 "iommu sva bind failed: %ld\n", PTR_ERR(sva));
-		return PTR_ERR(sva);
-	}
+	u32 pasid;

-	pasid = iommu_sva_get_pasid(sva);
-	if (pasid == IOMMU_PASID_INVALID) {
-		iommu_sva_unbind_device(sva);
+	pasid = iommu_enable_pasid_dma(&idxd->pdev->dev);
+	if (pasid == INVALID_IOASID) {
+		dev_err(&idxd->pdev->dev, "No kernel DMA PASID\n");
 		return -ENODEV;
 	}
-
-	idxd->sva = sva;
 	idxd->pasid = pasid;
-	dev_dbg(&idxd->pdev->dev, "system pasid: %u\n", pasid);
+
 	return 0;
 }

 static void idxd_disable_system_pasid(struct idxd_device *idxd)
 {
-
-	iommu_sva_unbind_device(idxd->sva);
-	idxd->sva = NULL;
+	iommu_disa
Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
On 12/1/2021 3:03 PM, Thomas Gleixner wrote:
On Wed, Dec 01 2021 at 14:49, Dave Jiang wrote:
On 12/1/2021 2:44 PM, Thomas Gleixner wrote:

How that is backed on the host does not really matter. You can expose MSI-X to the guest with an INTx backing as well. I'm still failing to see the connection between the 9 MSI-X vectors and the 2048 IMS vectors, which I assume is the limitation of the physical device, right?

I think I was confused with what you were asking and was thinking you were saying why can't we just have MSI-X on the guest backed by the MSI-X on the physical device, and thought there would not be enough vectors to service the many guests. I think I understand what your position is now with the clarification above. This still depends on how this overall discussion about the representation of all of this stuff is resolved.

What needs a subdevice to expose? Can you answer that too please?

Sorry. So the initial version of the IDXD sub-device is represented with a single queue. It needs a command irq (emulated) and a completion irq that is backed by a device vector (currently IMS).

Thanks, tglx
Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
On 12/1/2021 2:44 PM, Thomas Gleixner wrote:
On Wed, Dec 01 2021 at 14:21, Dave Jiang wrote:
On 12/1/2021 1:25 PM, Thomas Gleixner wrote:

The hardware implementation does not have enough MSI-X vectors for guests. There are only 9 MSI-X vectors total (8 for queues) and 2048 IMS vectors. So if we are to do MSI-X for all of them, then we need to do the IMS-backed MSI-X scheme rather than passthrough IMS to guests.

Confused. Are you talking about passing a full IDXD device to the guest or about passing a carved-out subdevice, aka. queue?

I'm talking about carving out a subdevice. I had the impression of you wanting IMS passed through for all variations. But it sounds like for a sub-device, you are OK with the implementation of MSI-X backed by IMS?

I don't see anything wrong with that. A subdevice is its own entity and VFIO can choose the most convenient representation of it to the guest, obviously. How that is backed on the host does not really matter. You can expose MSI-X to the guest with an INTx backing as well.

I'm still failing to see the connection between the 9 MSI-X vectors and the 2048 IMS vectors, which I assume is the limitation of the physical device, right?

I think I was confused with what you were asking and was thinking you were saying why can't we just have MSI-X on the guest backed by the MSI-X on the physical device, and thought there would not be enough vectors to service the many guests. I think I understand what your position is now with the clarification above.

What needs a subdevice to expose?

Thanks, tglx
Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
On 12/1/2021 1:25 PM, Thomas Gleixner wrote:
On Wed, Dec 01 2021 at 11:47, Dave Jiang wrote:
On 12/1/2021 11:41 AM, Thomas Gleixner wrote:

Hi Thomas. This is actually the IDXD usage for a mediated device passed to a guest kernel, when we plumb the pass-through of IMS to the guest rather than doing the previous implementation of having an MSI-X vector on the guest backed by IMS.

Which makes a lot of sense.

The control block for the mediated device is emulated, and therefore an emulated MSI-X vector will be surfaced as vector 0. However the queues will be backed by IMS vectors. So we end up needing MSI-X and IMS to coexist running on the guest kernel for the same device.

Why? What's wrong with using straight MSI-X for all of them?

The hardware implementation does not have enough MSI-X vectors for guests. There are only 9 MSI-X vectors total (8 for queues) and 2048 IMS vectors. So if we are to do MSI-X for all of them, then we need to do the IMS-backed MSI-X scheme rather than passthrough IMS to guests.

Confused. Are you talking about passing a full IDXD device to the guest or about passing a carved-out subdevice, aka. queue?

I'm talking about carving out a subdevice. I had the impression of you wanting IMS passed through for all variations. But it sounds like for a sub-device, you are OK with the implementation of MSI-X backed by IMS?

Thanks, tglx
Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
On 12/1/2021 11:41 AM, Thomas Gleixner wrote:

Dave, please trim your replies.

On Wed, Dec 01 2021 at 09:28, Dave Jiang wrote:
On 12/1/2021 3:16 AM, Thomas Gleixner wrote:

Jason, CC+ IOMMU folks

On Tue, Nov 30 2021 at 20:17, Jason Gunthorpe wrote:
On Tue, Nov 30, 2021 at 10:23:16PM +0100, Thomas Gleixner wrote:

Though I fear there is also a use case for MSI-X and IMS tied to the same device. That network card you are talking about might end up using MSI-X for a control block and then IMS for the actual network queues when it is used as a physical function device as a whole, but that's conceptually a different case.

Hi Thomas. This is actually the IDXD usage for a mediated device passed to a guest kernel, when we plumb the pass-through of IMS to the guest rather than doing the previous implementation of having an MSI-X vector on the guest backed by IMS.

Which makes a lot of sense.

The control block for the mediated device is emulated, and therefore an emulated MSI-X vector will be surfaced as vector 0. However the queues will be backed by IMS vectors. So we end up needing MSI-X and IMS to coexist running on the guest kernel for the same device.

Why? What's wrong with using straight MSI-X for all of them?

The hardware implementation does not have enough MSI-X vectors for guests. There are only 9 MSI-X vectors total (8 for queues) and 2048 IMS vectors. So if we are to do MSI-X for all of them, then we need to do the IMS-backed MSI-X scheme rather than passthrough IMS to guests.

Thanks, tglx
Re: [patch 21/32] NTB/msi: Convert to msi_on_each_desc()
On 12/1/2021 3:16 AM, Thomas Gleixner wrote:

Jason, CC+ IOMMU folks

On Tue, Nov 30 2021 at 20:17, Jason Gunthorpe wrote:
On Tue, Nov 30, 2021 at 10:23:16PM +0100, Thomas Gleixner wrote:

The real problem is where to store the MSI descriptors, because the PCI device has its own real PCI/MSI-X interrupts, which means it still shares the storage space.

Er.. I never realized that just looking at the patches :| That is relevant to all real "IMS" users. IDXD escaped this because it, IMHO, wrongly used the mdev with the IRQ layer. The mdev is purely a messy artifact of VFIO; it should not be required to make the IRQ layers work. I don't think it makes sense that the msi_desc would point to an mdev; the iommu layer consumes msi_desc_to_dev(), and it really should point to the physical device that originates the message, with proper iommu ops/data/etc.

Looking at the device slices as subdevices with their own struct device makes a lot of sense from the conceptual level. That makes it pretty much obvious to manage the MSIs of those devices at this level like we do for any other device. Whether mdev is the right encapsulation for these subdevices is an orthogonal problem.

I surely agree that msi_desc::dev is an interesting question, but we already have this disconnect of msi_desc::dev and DMA today due to DMA aliasing. I haven't looked at that in detail yet, but of course the alias handling is substantially different across the various IOMMU implementations.

Though I fear there is also a use case for MSI-X and IMS tied to the same device. That network card you are talking about might end up using MSI-X for a control block and then IMS for the actual network queues when it is used as a physical function device as a whole, but that's conceptually a different case.

Hi Thomas. This is actually the IDXD usage for a mediated device passed to a guest kernel, when we plumb the pass-through of IMS to the guest rather than doing the previous implementation of having an MSI-X vector on the guest backed by IMS. The control block for the mediated device is emulated, and therefore an emulated MSI-X vector will be surfaced as vector 0. However the queues will be backed by IMS vectors. So we end up needing MSI-X and IMS to coexist running on the guest kernel for the same device.

DJ

I'm currently tending to partition the index space in the xarray:

	0x00000000 - 0x0000ffff	PCI/MSI-X
	0x00010000 - 0x0001ffff	NTB

It is OK; with some xarray work it can be range-allocating & reserving so that the msi_domain_alloc_irqs() flows can carve out chunks of the number space.. Another view is the msi_domain_alloc_irqs() flows should have their own xarrays..

Yes, I was thinking about that as well. The trivial way would be:

	struct xarray store[MSI_MAX_STORES];

and then have a store index for each allocation domain. With the proposed encapsulation of the xarray handling that's definitely feasible. Whether that buys much is a different question. Let me think about it some more.

which is feasible now with the range modifications and way simpler to do with xarray than with the linked list.

Indeed!

I'm glad you like the approach.

Thanks, tglx
Re: [PATCH v9 05/10] uacce: Enable IOMMU_DEV_FEAT_IOPF
On 1/22/2021 4:53 AM, Zhou Wang wrote:
On 2021/1/21 4:47, Dave Jiang wrote:
On 1/8/2021 7:52 AM, Jean-Philippe Brucker wrote:

The IOPF (I/O Page Fault) feature is now enabled independently from the SVA feature, because some IOPF implementations are device-specific and do not require IOMMU support for PCIe PRI or Arm SMMU stall. Enable IOPF unconditionally when enabling SVA for now. In the future, if a device driver implementing a uacce interface doesn't need IOPF support, it will need to tell the uacce module, for example with a new flag.

Signed-off-by: Jean-Philippe Brucker
---
Cc: Arnd Bergmann
Cc: Greg Kroah-Hartman
Cc: Zhangfei Gao
Cc: Zhou Wang
---
 drivers/misc/uacce/uacce.c | 32 +---
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/drivers/misc/uacce/uacce.c b/drivers/misc/uacce/uacce.c
index d07af4edfcac..41ef1eb62a14 100644
--- a/drivers/misc/uacce/uacce.c
+++ b/drivers/misc/uacce/uacce.c
@@ -385,6 +385,24 @@ static void uacce_release(struct device *dev)
 	kfree(uacce);
 }

+static unsigned int uacce_enable_sva(struct device *parent, unsigned int flags)
+{
+	if (!(flags & UACCE_DEV_SVA))
+		return flags;
+
+	flags &= ~UACCE_DEV_SVA;
+
+	if (iommu_dev_enable_feature(parent, IOMMU_DEV_FEAT_IOPF))
+		return flags;
+
+	if (iommu_dev_enable_feature(parent, IOMMU_DEV_FEAT_SVA)) {
+		iommu_dev_disable_feature(parent, IOMMU_DEV_FEAT_IOPF);
+		return flags;
+	}

Sorry to jump in a bit late on this, and not specifically towards the intent of this patch, but I'd like to start a discussion on whether we want to push the iommu dev feature enabling to the device driver itself rather than having UACCE control this. Maybe allow the device driver to manage the feature bits and have UACCE only verify that they are enabled?

1. The device driver knows what platform it's on and what specific feature bits its devices support. Maybe in the future there are feature bits that are needed on one platform and not on another?

Hi Dave,

From the discussion in this series, the meaning of IOMMU_DEV_FEAT_IOPF here is the IOPF capability of the iommu device itself. So I think checking it in UACCE will be fine.

2. This allows the possibility of multiple uacce devices registered to 1 pci dev, which for a device with asymmetric queues (Intel DSA/idxd driver) is a desirable feature. The current setup forces a single uacce device per pdev. If additional uacce devs are registered, the first removal of a uacce device will disable the feature bit for the rest of the registered devices. With uacce managing the feature bit, it would need to add device context to the parent pdev and ref counting. It may be cleaner to just allow the device driver to manage the feature bits; the driver should have all the information on when the feature needs to be turned on and off.

Yes, we have this problem; however, this problem exists for IOMMU_DEV_FEAT_SVA too. How about fixing it in another patch?

Hi Zhou, right, that's what I'm implying. I'm not pushing back on the IOPF feature set, just trying to survey opinions from people on moving the feature settings to the actual drivers rather than having it in UACCE. I will create some patches to show what I mean for comments.

Best, Zhou

- DaveJ

+
+	return flags | UACCE_DEV_SVA;
+}
+
 /**
  * uacce_alloc() - alloc an accelerator
  * @parent: pointer of uacce parent device
@@ -404,11 +422,7 @@ struct uacce_device *uacce_alloc(struct device *parent,
 	if (!uacce)
 		return ERR_PTR(-ENOMEM);

-	if (flags & UACCE_DEV_SVA) {
-		ret = iommu_dev_enable_feature(parent, IOMMU_DEV_FEAT_SVA);
-		if (ret)
-			flags &= ~UACCE_DEV_SVA;
-	}
+	flags = uacce_enable_sva(parent, flags);

 	uacce->parent = parent;
 	uacce->flags = flags;
@@ -432,8 +446,10 @@ struct uacce_device *uacce_alloc(struct device *parent,
 	return uacce;

 err_with_uacce:
-	if (flags & UACCE_DEV_SVA)
+	if (flags & UACCE_DEV_SVA) {
 		iommu_dev_disable_feature(uacce->parent, IOMMU_DEV_FEAT_SVA);
+		iommu_dev_disable_feature(uacce->parent, IOMMU_DEV_FEAT_IOPF);
+	}
 	kfree(uacce);
 	return ERR_PTR(ret);
 }
@@ -487,8 +503,10 @@ void uacce_remove(struct uacce_device *uacce)
 	mutex_unlock(&uacce->queues_lock);

 	/* disable sva now since no opened queues */
-	if (uacce->flags & UACCE_DEV_SVA)
+	if (uacce->flags & UACCE_DEV_SVA) {
 		iommu_dev_disable_feature(uacce->parent, IOMMU_DEV_FEAT_SVA);
+		iommu_dev_disable_feature(uacce->parent, IOMMU_DEV_FEAT_IOPF);
+	}

 	if (uacce->cdev)
 		cdev_device_del(uacce->cdev, &uacce->dev);
Re: [PATCH v9 05/10] uacce: Enable IOMMU_DEV_FEAT_IOPF
On 1/8/2021 7:52 AM, Jean-Philippe Brucker wrote:

The IOPF (I/O Page Fault) feature is now enabled independently from the SVA feature, because some IOPF implementations are device-specific and do not require IOMMU support for PCIe PRI or Arm SMMU stall. Enable IOPF unconditionally when enabling SVA for now. In the future, if a device driver implementing a uacce interface doesn't need IOPF support, it will need to tell the uacce module, for example with a new flag.

Signed-off-by: Jean-Philippe Brucker
---
Cc: Arnd Bergmann
Cc: Greg Kroah-Hartman
Cc: Zhangfei Gao
Cc: Zhou Wang
---
 drivers/misc/uacce/uacce.c | 32 +---
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/drivers/misc/uacce/uacce.c b/drivers/misc/uacce/uacce.c
index d07af4edfcac..41ef1eb62a14 100644
--- a/drivers/misc/uacce/uacce.c
+++ b/drivers/misc/uacce/uacce.c
@@ -385,6 +385,24 @@ static void uacce_release(struct device *dev)
 	kfree(uacce);
 }

+static unsigned int uacce_enable_sva(struct device *parent, unsigned int flags)
+{
+	if (!(flags & UACCE_DEV_SVA))
+		return flags;
+
+	flags &= ~UACCE_DEV_SVA;
+
+	if (iommu_dev_enable_feature(parent, IOMMU_DEV_FEAT_IOPF))
+		return flags;
+
+	if (iommu_dev_enable_feature(parent, IOMMU_DEV_FEAT_SVA)) {
+		iommu_dev_disable_feature(parent, IOMMU_DEV_FEAT_IOPF);
+		return flags;
+	}

Sorry to jump in a bit late on this, and not specifically towards the intent of this patch, but I'd like to start a discussion on whether we want to push the iommu dev feature enabling to the device driver itself rather than having UACCE control this. Maybe allow the device driver to manage the feature bits and have UACCE only verify that they are enabled?

1. The device driver knows what platform it's on and what specific feature bits its devices support. Maybe in the future there are feature bits that are needed on one platform and not on another?

2. This allows the possibility of multiple uacce devices registered to 1 pci dev, which for a device with asymmetric queues (Intel DSA/idxd driver) is a desirable feature. The current setup forces a single uacce device per pdev. If additional uacce devs are registered, the first removal of a uacce device will disable the feature bit for the rest of the registered devices. With uacce managing the feature bit, it would need to add device context to the parent pdev and ref counting. It may be cleaner to just allow the device driver to manage the feature bits; the driver should have all the information on when the feature needs to be turned on and off.

- DaveJ

+
+	return flags | UACCE_DEV_SVA;
+}
+
 /**
  * uacce_alloc() - alloc an accelerator
  * @parent: pointer of uacce parent device
@@ -404,11 +422,7 @@ struct uacce_device *uacce_alloc(struct device *parent,
 	if (!uacce)
 		return ERR_PTR(-ENOMEM);

-	if (flags & UACCE_DEV_SVA) {
-		ret = iommu_dev_enable_feature(parent, IOMMU_DEV_FEAT_SVA);
-		if (ret)
-			flags &= ~UACCE_DEV_SVA;
-	}
+	flags = uacce_enable_sva(parent, flags);

 	uacce->parent = parent;
 	uacce->flags = flags;
@@ -432,8 +446,10 @@ struct uacce_device *uacce_alloc(struct device *parent,
 	return uacce;

 err_with_uacce:
-	if (flags & UACCE_DEV_SVA)
+	if (flags & UACCE_DEV_SVA) {
 		iommu_dev_disable_feature(uacce->parent, IOMMU_DEV_FEAT_SVA);
+		iommu_dev_disable_feature(uacce->parent, IOMMU_DEV_FEAT_IOPF);
+	}
 	kfree(uacce);
 	return ERR_PTR(ret);
 }
@@ -487,8 +503,10 @@ void uacce_remove(struct uacce_device *uacce)
 	mutex_unlock(&uacce->queues_lock);

 	/* disable sva now since no opened queues */
-	if (uacce->flags & UACCE_DEV_SVA)
+	if (uacce->flags & UACCE_DEV_SVA) {
 		iommu_dev_disable_feature(uacce->parent, IOMMU_DEV_FEAT_SVA);
+		iommu_dev_disable_feature(uacce->parent, IOMMU_DEV_FEAT_IOPF);
+	}

 	if (uacce->cdev)
 		cdev_device_del(uacce->cdev, &uacce->dev);
Re: [PATCH] MAINTAINERS: add maintainers for uacce
On 2/25/20 6:11 PM, zhangfei wrote:
On 2020/2/26 12:02 AM, Dave Jiang wrote:
On 2/24/20 11:17 PM, Zhangfei Gao wrote:

Add Zhangfei Gao and Zhou Wang as maintainers for uacce

Signed-off-by: Zhangfei Gao
Signed-off-by: Zhou Wang
---
 MAINTAINERS | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 38fe2f3..22e647f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17039,6 +17039,16 @@
 W:	http://linuxtv.org
 S:	Maintained
 F:	drivers/media/pci/tw686x/

+UACCE ACCELERATOR FRAMEWORK
+M:	Zhangfei Gao
+M:	Zhou Wang
+S:	Maintained
+F:	Documentation/ABI/testing/sysfs-driver-uacce
+F:	Documentation/misc-devices/uacce.rst
+F:	drivers/misc/uacce/
+F:	include/linux/uacce.h
+F:	include/uapi/misc/uacce/

Mailing list for patch submission?

+L:	linux-accelerat...@lists.ozlabs.org

?

Thanks Dave. How about adding both linux-accelerat...@lists.ozlabs.org and linux-ker...@vger.kernel.org, since the patches will go to the misc tree?

That is entirely up to you. But having guidance on somewhere to submit patches would be good.

Thanks
Re: [PATCH] MAINTAINERS: add maintainers for uacce
On 2/24/20 11:17 PM, Zhangfei Gao wrote:

Add Zhangfei Gao and Zhou Wang as maintainers for uacce

Signed-off-by: Zhangfei Gao
Signed-off-by: Zhou Wang
---
 MAINTAINERS | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 38fe2f3..22e647f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17039,6 +17039,16 @@
 W:	http://linuxtv.org
 S:	Maintained
 F:	drivers/media/pci/tw686x/

+UACCE ACCELERATOR FRAMEWORK
+M:	Zhangfei Gao
+M:	Zhou Wang
+S:	Maintained
+F:	Documentation/ABI/testing/sysfs-driver-uacce
+F:	Documentation/misc-devices/uacce.rst
+F:	drivers/misc/uacce/
+F:	include/linux/uacce.h
+F:	include/uapi/misc/uacce/

Mailing list for patch submission?

+L:	linux-accelerat...@lists.ozlabs.org

?

 UBI FILE SYSTEM (UBIFS)
 M:	Richard Weinberger
 L:	linux-...@lists.infradead.org
Re: [PATCH v11 2/4] uacce: add uacce driver
On 1/15/20 4:18 AM, zhangfei wrote:

Hi, Greg

On 2020/1/14 10:59 PM, Greg Kroah-Hartman wrote:
On Mon, Jan 13, 2020 at 11:34:55AM +0800, zhangfei wrote:

Hi, Greg. Thanks for the review.

On 2020/1/12 3:40 AM, Greg Kroah-Hartman wrote:
On Sat, Jan 11, 2020 at 10:48:37AM +0800, Zhangfei Gao wrote:

+static int uacce_fops_open(struct inode *inode, struct file *filep)
+{
+	struct uacce_mm *uacce_mm = NULL;
+	struct uacce_device *uacce;
+	struct uacce_queue *q;
+	int ret = 0;
+
+	uacce = xa_load(&uacce_xa, iminor(inode));
+	if (!uacce)
+		return -ENODEV;
+
+	if (!try_module_get(uacce->parent->driver->owner))
+		return -ENODEV;

Why are you trying to grab the module reference of the parent device? Why is that needed and what is that going to help with here? This shouldn't be needed, as the module reference of the owner of the fileops for this module is incremented, and the "parent" module depends on this module, so how could it be unloaded without this code being unloaded? Yes, if you build this code into the kernel and the "parent" driver is a module, then you will not have a reference, but when you remove that parent driver the device will be removed, as it has to be unregistered before that parent driver can be removed from the system, right? Or what am I missing here?

The refcount here is preventing rmmod of the "parent" module after the fd is opened, since the user driver has mmapped kernel memory to user space, like mmio, which may still be in use. With the refcount protection, rmmod of the "parent" module will fail until the application frees the fd. Log like:

rmmod: ERROR: Module hisi_zip is in use

But if the "parent" module is to be unloaded, it has to unregister the "child" device, and that will call the destructor in here and then you will tear everything down and all should be good. There's no need to "forbid" a module from being unloaded, even if it is being used. Look at all networking drivers; they work that way, right?

Thanks Greg for the kind suggestion. I still have one uncertainty. Does uacce have to block the process from continuing to access the mmapped area when removing the "parent" module? Uacce can block the device from accessing the physical memory when the parent module calls uacce_remove. But the application is still running, and supposedly it is not the kernel driver's responsibility to call unmap. I am looking for some examples in the kernel; it looks like vfio does not block the process from continuing to access when vfio_unregister_iommu_driver runs either. In my test, the application will keep waiting after rmmod of the parent, until ctrl+c, when unmap is called. During the process, the kernel does not report any error. Do you have any advice?

Would it work to call unmap_mapping_range() on the char dev inode->i_mapping? I think you need to set the vma->fault function ptr for the vm_operations_struct in the original mmap(). After the mappings are unmapped, you can set a state variable to trigger the return of VM_FAULT_SIGBUS in the ->fault function when the user app accesses the mmap region again and triggers a page fault. The user app needs to be programmed to catch exceptions to deal with that.

+static void uacce_release(struct device *dev)
+{
+	struct uacce_device *uacce = to_uacce_device(dev);
+
+	kfree(uacce);
+	uacce = NULL;

That line didn't do anything :)

Yes, this is a mistake. It is up to the caller to set it to NULL to prevent releasing multiple times.

The release function is called by the driver core, which will not touch the value again.

Yes, I understand; it's my mistake. Will remove it.

Thanks
Re: [PATCH v10 0/4] Add uacce module for Accelerator
On 12/15/19 8:08 PM, Zhangfei Gao wrote: Uacce (Unified/User-space-access-intended Accelerator Framework) aims to provide Shared Virtual Addressing (SVA) between accelerators and processes, so an accelerator can access any data structure of the main CPU. This differs from data sharing between the CPU and an IO device, which shares data content rather than addresses. Because of the unified address space, the hardware and the user space of a process can share the same virtual address in their communication. Uacce is intended to be used with Jean-Philippe Brucker's SVA patchset [1], which enables IO-side page faults and PASID support. We have kept verifying with Jean's sva patchset [2]. We also keep verifying with Eric's SMMUv3 Nested Stage patches [3]. Looking forward to this common framework going upstream. I'm currently in the process of upstreaming the device driver (idxd) [1] for the recently announced Intel Data Streaming Accelerator [2] [3] that also supports SVA. [1] https://lkml.org/lkml/2020/1/7/905 [2] https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator [3] https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification And I think with this framework upstream I can potentially drop the in-driver exported char device support code and use this framework directly. This series and the related zip & qm driver: https://github.com/Linaro/linux-kernel-warpdrive/tree/v5.5-rc1-uacce-v10 The library and user application: https://github.com/Linaro/warpdrive/tree/wdprd-upstream-v10 References: [1] http://jpbrucker.net/sva/ [2] http://jpbrucker.net/git/linux/log/?h=sva/zip-devel [3] https://github.com/eauger/linux/tree/v5.3.0-rc0-2stage-v9 Change History: v10: Modify the include header to fix a kbuild test error on other arches. v9: Suggested by Jonathan 1. Remove sysfs: numa_distance, node_id, id; also add an is_visible callback 2.
Split the API to solve the potential race: struct uacce_device *uacce_alloc(struct device *parent, struct uacce_interface *interface) int uacce_register(struct uacce_device *uacce) void uacce_remove(struct uacce_device *uacce) 3. Split out cleanup patch 03 v8: Address some comments from Jonathan Merge Jean's patch, using uacce_mm instead of pid for sva_exit v7: As suggested by Jean and Jerome Only consider the sva case and remove unused dma apis for the first patch. Also add mm_exit for sva and vm_ops.close etc v6: https://lkml.org/lkml/2019/10/16/231 Change sys qfrs_size to a different file, suggested by Jonathan Fix the crypto daily build issue and base on the crypto code base, also 5.4-rc1. v5: https://lkml.org/lkml/2019/10/14/74 Add an example patch using the uacce interface, suggested by Greg 0003-crypto-hisilicon-register-zip-engine-to-uacce.patch v4: https://lkml.org/lkml/2019/9/17/116 Based on 5.4-rc1 Considering other drivers integrating uacce: if uacce is not compiled in, uacce_register returns an error and uacce_unregister is empty. Simplify the uacce flag: UACCE_DEV_SVA. Address Greg's comments: fix the state machine, remove potential syslog triggered from user space, etc. v3: https://lkml.org/lkml/2019/9/2/990 Recommended by Greg: use struct uacce_device instead of struct uacce, and use a struct cdev * in struct uacce_device; as a result, the cdev can be released by itself when its refcount decreases to 0. So the two structures are decoupled and self-maintained by themselves. Also add dev.release for put_device. v2: https://lkml.org/lkml/2019/8/28/565 Address comments from Greg and Jonathan Modify interface uacce_register Drop noiommu mode first v1: https://lkml.org/lkml/2019/8/14/277 1. Rebase to 5.3-rc1 2. Build on iommu interface 3. Verifying with Jean's sva and Eric's nested mode iommu. 4. User library has developed a lot: support zlib, openssl etc. 5.
Move to misc first RFC3: https://lkml.org/lkml/2018/11/12/1951 RFC2: https://lwn.net/Articles/763990/ Background of why Uacce: A Von Neumann processor is not good at general data manipulation. It is designed for control-bound rather than data-bound applications. The latter need less control-path facility and more/specific ALUs. So more and more heterogeneous processors, such as encryption/decryption accelerators, TPUs, or EDGE (Explicit Data Graph Execution) processors, are being introduced to gain better performance or power efficiency for particular applications these days. There are generally two ways to make use of these heterogeneous processors: The first is to make them co-processors, just like the FPU. This is good for some applications but it has its own cons: It changes the ISA set permanently. You must save all state elements when the process is switched out, but most data-bound processors have a huge set of state elements. It makes the kernel scheduler more complex. The second is the Accelerator. It is treated as an IO device from the CPU's point of view (though it need not be physically). The process, running on the CPU, holds a context of the accelerator and send
Re: [PATCH 0/9] Support using MSI interrupts in ntb_transport
On 1/31/2019 4:41 PM, Logan Gunthorpe wrote: On 2019-01-31 3:46 p.m., Dave Jiang wrote: I believe irqbalance writes to the file /proc/irq/N/smp_affinity. So maybe take a look at the code that starts from there and see if it would have any impact on your stuff. Ok, well on my system I can write to the smp_affinity all day and the MSI interrupts still work fine. Maybe your code is ok then. If the stats show up in /proc/interrupts then you can see it moving to different cores. The MSI code is a bit difficult to trace and audit with all the different chips and the parent chips which I don't have a good understanding of. But I can definitely see that it could be possible for some chips to change the address, as smp_affinity will eventually sometimes call msi_domain_set_affinity() which does seem to recompose the message and write it back to the chip. So, I could relatively easily add a callback to msi_desc to catch this and resend the MSI address/data. However, I'm not sure how this is ever done atomically. It seems like there would be a race while the device updates its address where old interrupts could be triggered. This race would be much longer for us when sending this information over the NTB link. Though, I guess if the only change is that it encodes CPU information in the address then that would not be an issue. However, I'm not sure I can say that for certain without a comprehensive understanding of all the IRQ chips. Any thoughts on this? Yeah, I'm not sure what to do about it either, as I'm not super familiar with that area. Just making note of what I encountered. And you are right, the updated info has to go over NTB for the other side to write to the updated place. So there's a lot of latency involved. Logan
Re: [PATCH 0/9] Support using MSI interrupts in ntb_transport
On 1/31/2019 3:39 PM, Logan Gunthorpe wrote: On 2019-01-31 1:58 p.m., Dave Jiang wrote: On 1/31/2019 1:48 PM, Logan Gunthorpe wrote: On 2019-01-31 1:20 p.m., Dave Jiang wrote: Does this work when the system moves the MSI vector either via software (irqbalance) or BIOS APIC programming (some modes cause round-robin behavior)? I don't know how irqbalance works, and I'm not sure what you are referring to by BIOS APIC programming, however I would expect these things would not be a problem. The MSI code I'm presenting here doesn't do anything crazy with the interrupts, it allocates and uses them just as any PCI driver would. The only real difference here is that instead of a piece of hardware sending the IRQ TLP, it will be sent through the memory window (which, from the OS's perspective, is just coming from an NTB hardware proxy alias). Logan Right. I did that as a hack a while back for some silicon errata workaround. When the vector moves, the address for the LAPIC changes. So unless it gets updated, you end up writing to the old location and lose all the new interrupts. irqbalance is a user daemon that rotates the system interrupts around to ensure that not all interrupts are pinned on a single core. Yes, that would be a problem if something changes the MSI vectors out from under us. Seems like that would be a bit difficult to do even with regular hardware. So far I haven't seen anything that would do that. If you know where in the kernel this happens I'd be interested in getting a pointer to the flow in the code. If that is the case this MSI stuff will need to get much more complicated... I believe irqbalance writes to the file /proc/irq/N/smp_affinity. So maybe take a look at the code that starts from there and see if it would have any impact on your stuff. I think it's enabled by default on several distros. Although MSI-X has nothing to do with the IOAPIC, the mode in which the APIC is programmed can have an influence on how the interrupts are delivered.
There are certain Intel platforms (I don't know if AMD does anything like that) that put the IOAPIC in a certain configuration that causes the interrupts to be moved in a round-robin fashion. I think it's physical flat mode? I don't quite recall. Normally on the low-end Xeons. It's probably worth doing a test run with the irqbalance daemon running and making sure your traffic stream doesn't all of a sudden stop. I've tested with irqbalance running and haven't found any noticeable difference. Logan
Re: [PATCH 0/9] Support using MSI interrupts in ntb_transport
On 1/31/2019 1:48 PM, Logan Gunthorpe wrote: On 2019-01-31 1:20 p.m., Dave Jiang wrote: Does this work when the system moves the MSI vector either via software (irqbalance) or BIOS APIC programming (some modes cause round-robin behavior)? I don't know how irqbalance works, and I'm not sure what you are referring to by BIOS APIC programming, however I would expect these things would not be a problem. The MSI code I'm presenting here doesn't do anything crazy with the interrupts, it allocates and uses them just as any PCI driver would. The only real difference here is that instead of a piece of hardware sending the IRQ TLP, it will be sent through the memory window (which, from the OS's perspective, is just coming from an NTB hardware proxy alias). Logan Right. I did that as a hack a while back for some silicon errata workaround. When the vector moves, the address for the LAPIC changes. So unless it gets updated, you end up writing to the old location and lose all the new interrupts. irqbalance is a user daemon that rotates the system interrupts around to ensure that not all interrupts are pinned on a single core. I think it's enabled by default on several distros. Although MSI-X has nothing to do with the IOAPIC, the mode in which the APIC is programmed can have an influence on how the interrupts are delivered. There are certain Intel platforms (I don't know if AMD does anything like that) that put the IOAPIC in a certain configuration that causes the interrupts to be moved in a round-robin fashion. I think it's physical flat mode? I don't quite recall. Normally on the low-end Xeons. It's probably worth doing a test run with the irqbalance daemon running and making sure your traffic stream doesn't all of a sudden stop.
Re: [PATCH 0/9] Support using MSI interrupts in ntb_transport
On 1/31/2019 11:56 AM, Logan Gunthorpe wrote: Hi, This patch series adds optional support for using MSI interrupts instead of NTB doorbells in ntb_transport. This is desirable seeing doorbells on current hardware are quite slow and therefore switching to MSI interrupts provides a significant performance gain. On switchtec hardware, a simple apples-to-apples comparison shows ntb_netdev/iperf numbers going from 3.88Gb/s to 14.1Gb/s when switching to MSI interrupts. To do this, a couple changes are required outside of the NTB tree: 1) The IOMMU must know to accept MSI requests from aliased bus numbers, seeing NTB hardware typically sends proxied request IDs through additional requester IDs. The first patch in this series adds support for the Intel IOMMU. A quirk to add these aliases for switchtec hardware was already accepted. See commit ad281ecf1c7d ("PCI: Add DMA alias quirk for Microsemi Switchtec NTB") for a description of NTB proxy IDs and why this is necessary. 2) NTB transport (and other clients) may often need more MSI interrupts than the NTB hardware actually advertises support for. However, seeing these interrupts will not be triggered by the hardware but through an NTB memory window, the hardware does not actually need to support or know about them. Therefore we add the concept of Virtual MSI interrupts, which are allocated just like any other MSI interrupt but are not programmed into the hardware's MSI table. This is done in Patch 2 and then made use of in Patch 3. Logan, Does this work when the system moves the MSI vector either via software (irqbalance) or BIOS APIC programming (some modes cause round-robin behavior)? The remaining patches in this series add a library for dealing with MSI interrupts, a test client and finally support in ntb_transport. The series is based off of v5.0-rc4 and I've tested it on top of some of the patches I've already sent to the NTB tree (though they are independent changes).
A git repo is available here: https://github.com/sbates130272/linux-p2pmem/ ntb_transport_msi_v1

Thanks,

Logan

--

Logan Gunthorpe (9):
  iommu/vt-d: Allow interrupts from the entire bus for aliased devices
  PCI/MSI: Support allocating virtual MSI interrupts
  PCI/switchtec: Add module parameter to request more interrupts
  NTB: Introduce functions to calculate multi-port resource index
  NTB: Rename ntb.c to support multiple source files in the module
  NTB: Introduce MSI library
  NTB: Introduce NTB MSI Test Client
  NTB: Add ntb_msi_test support to ntb_test
  NTB: Add MSI interrupt support to ntb_transport

 drivers/iommu/intel_irq_remapping.c     |  12 +
 drivers/ntb/Kconfig                     |  10 +
 drivers/ntb/Makefile                    |   3 +
 drivers/ntb/{ntb.c => core.c}           |   0
 drivers/ntb/msi.c                       | 313 ++
 drivers/ntb/ntb_transport.c             | 134 +++-
 drivers/ntb/test/Kconfig                |   9 +
 drivers/ntb/test/Makefile               |   1 +
 drivers/ntb/test/ntb_msi_test.c         | 416 
 drivers/pci/msi.c                       |  51 ++-
 drivers/pci/switch/switchtec.c          |  12 +-
 include/linux/msi.h                     |   1 +
 include/linux/ntb.h                     | 139 
 include/linux/pci.h                     |   9 +
 tools/testing/selftests/ntb/ntb_test.sh |  54 ++-
 15 files changed, 1150 insertions(+), 14 deletions(-)
 rename drivers/ntb/{ntb.c => core.c} (100%)
 create mode 100644 drivers/ntb/msi.c
 create mode 100644 drivers/ntb/test/ntb_msi_test.c

--
2.19.0
Re: IOAT DMA w/IOMMU
On 08/10/2018 10:15 AM, Logan Gunthorpe wrote: > > > On 10/08/18 11:01 AM, Dave Jiang wrote: >> Or if the BIOS has provided mapping for the Intel NTB device >> specifically? Is that a possibility? NTB does go through the IOMMU. > > I don't know if the BIOS is doing it, but that would at best only work > for Intel NTB. I see no hope in getting the BIOS to map the MW > for a Switchtec NTB device... Right. I'm just speculating why it may possibly work. But yes, I think the kernel will need the appropriate mapping if it's not happening right now. Hopefully Kit is onto something. > > Logan >
Re: IOAT DMA w/IOMMU
On 08/10/2018 09:33 AM, Logan Gunthorpe wrote: > > > On 10/08/18 10:31 AM, Dave Jiang wrote: >> >> >> On 08/10/2018 09:24 AM, Logan Gunthorpe wrote: >>> >>> >>> On 10/08/18 10:02 AM, Kit Chow wrote: >>>> Turns out there is no dma_map_resource routine on x86. get_dma_ops >>>> returns intel_dma_ops which has map_resource pointing to NULL. >>> >>> Oh, yup. I wasn't aware of that. From a cursory view, it looks like it >>> shouldn't be too hard to implement though. >>> >>>> Will poke around some in the intel_map_page code but can you actually >>>> get a valid struct page for a pci bar address (dma_map_single calls >>>> virt_to_page)? If not, does a map_resource routine that can properly >>>> map a pci bar address need to be implemented? >>> >>> Yes, you can not get a struct page for a PCI bar address unless it's >>> mapped with ZONE_DEVICE like in my p2p work. So that would explain why >>> dma_map_single() didn't work. >>> >>> This all implies that ntb_transport doesn't work with DMA and the IOMMU >>> turned on. I'm not sure I've ever tried that configuration myself but it >>> is a bit surprising. >> >> Hmm... that's surprising, because it seems to work on the Skylake platform >> when I tested it yesterday with Intel NTB. Kit is using a Haswell >> platform at the moment I think. Although I'm curious if it works with >> the PLX NTB he's using on Skylake. > > Does that mean on Skylake the IOAT can bypass the IOMMU? Because it > looks like the ntb_transport code doesn't map the physical address of > the NTB MW into the IOMMU when doing DMA... Or if the BIOS has provided mapping for the Intel NTB device specifically? Is that a possibility? NTB does go through the IOMMU.
Re: IOAT DMA w/IOMMU
On 08/10/2018 09:24 AM, Logan Gunthorpe wrote: > > > On 10/08/18 10:02 AM, Kit Chow wrote: >> Turns out there is no dma_map_resource routine on x86. get_dma_ops >> returns intel_dma_ops which has map_resource pointing to NULL. > > Oh, yup. I wasn't aware of that. From a cursory view, it looks like it > shouldn't be too hard to implement though. > >> Will poke around some in the intel_map_page code but can you actually >> get a valid struct page for a pci bar address (dma_map_single calls >> virt_to_page)? If not, does a map_resource routine that can properly >> map a pci bar address need to be implemented? > > Yes, you can not get a struct page for a PCI bar address unless it's > mapped with ZONE_DEVICE like in my p2p work. So that would explain why > dma_map_single() didn't work. > > This all implies that ntb_transport doesn't work with DMA and the IOMMU > turned on. I'm not sure I've ever tried that configuration myself but it > is a bit surprising. Hmm... that's surprising, because it seems to work on the Skylake platform when I tested it yesterday with Intel NTB. Kit is using a Haswell platform at the moment I think. Although I'm curious if it works with the PLX NTB he's using on Skylake. > > Logan >
Re: [PATCH 03/44] dmaengine: ioat: don't use DMA_ERROR_CODE
On 06/08/2017 06:25 AM, Christoph Hellwig wrote:
> DMA_ERROR_CODE is not a public API and will go away. Instead properly
> unwind based on the loop counter.
>
> Signed-off-by: Christoph Hellwig <h...@lst.de>

Acked-by: Dave Jiang <dave.ji...@intel.com>

> ---
>  drivers/dma/ioat/init.c | 24 +++-
>  1 file changed, 7 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/dma/ioat/init.c b/drivers/dma/ioat/init.c
> index 6ad4384b3fa8..ed8ed1192775 100644
> --- a/drivers/dma/ioat/init.c
> +++ b/drivers/dma/ioat/init.c
> @@ -839,8 +839,6 @@ static int ioat_xor_val_self_test(struct ioatdma_device *ioat_dma)
>  		goto free_resources;
>  	}
>
> -	for (i = 0; i < IOAT_NUM_SRC_TEST; i++)
> -		dma_srcs[i] = DMA_ERROR_CODE;
>  	for (i = 0; i < IOAT_NUM_SRC_TEST; i++) {
>  		dma_srcs[i] = dma_map_page(dev, xor_srcs[i], 0, PAGE_SIZE,
>  					   DMA_TO_DEVICE);
> @@ -910,8 +908,6 @@ static int ioat_xor_val_self_test(struct ioatdma_device *ioat_dma)
>
>  	xor_val_result = 1;
>
> -	for (i = 0; i < IOAT_NUM_SRC_TEST + 1; i++)
> -		dma_srcs[i] = DMA_ERROR_CODE;
>  	for (i = 0; i < IOAT_NUM_SRC_TEST + 1; i++) {
>  		dma_srcs[i] = dma_map_page(dev, xor_val_srcs[i], 0, PAGE_SIZE,
>  					   DMA_TO_DEVICE);
> @@ -965,8 +961,6 @@ static int ioat_xor_val_self_test(struct ioatdma_device *ioat_dma)
>  	op = IOAT_OP_XOR_VAL;
>
>  	xor_val_result = 0;
> -	for (i = 0; i < IOAT_NUM_SRC_TEST + 1; i++)
> -		dma_srcs[i] = DMA_ERROR_CODE;
>  	for (i = 0; i < IOAT_NUM_SRC_TEST + 1; i++) {
>  		dma_srcs[i] = dma_map_page(dev, xor_val_srcs[i], 0, PAGE_SIZE,
>  					   DMA_TO_DEVICE);
> @@ -1017,18 +1011,14 @@ static int ioat_xor_val_self_test(struct ioatdma_device *ioat_dma)
>  		goto free_resources;
>  dma_unmap:
>  	if (op == IOAT_OP_XOR) {
> -		if (dest_dma != DMA_ERROR_CODE)
> -			dma_unmap_page(dev, dest_dma, PAGE_SIZE,
> -				       DMA_FROM_DEVICE);
> -		for (i = 0; i < IOAT_NUM_SRC_TEST; i++)
> -			if (dma_srcs[i] != DMA_ERROR_CODE)
> -				dma_unmap_page(dev, dma_srcs[i], PAGE_SIZE,
> -					       DMA_TO_DEVICE);
> +		while (--i >= 0)
> +			dma_unmap_page(dev, dma_srcs[i], PAGE_SIZE,
> +				       DMA_TO_DEVICE);
> +		dma_unmap_page(dev, dest_dma, PAGE_SIZE, DMA_FROM_DEVICE);
>  	} else if (op == IOAT_OP_XOR_VAL) {
> -		for (i = 0; i < IOAT_NUM_SRC_TEST + 1; i++)
> -			if (dma_srcs[i] != DMA_ERROR_CODE)
> -				dma_unmap_page(dev, dma_srcs[i], PAGE_SIZE,
> -					       DMA_TO_DEVICE);
> +		while (--i >= 0)
> +			dma_unmap_page(dev, dma_srcs[i], PAGE_SIZE,
> +				       DMA_TO_DEVICE);
>  	}
>  free_resources:
>  	dma->device_free_chan_resources(dma_chan);