Re: [PATCH v10 01/17] iova: Export alloc_iova_fast() and free_iova_fast()

2021-08-03 Thread Yongji Xie
On Tue, Aug 3, 2021 at 6:54 PM Robin Murphy  wrote:
>
> On 2021-08-03 09:54, Yongji Xie wrote:
> > On Tue, Aug 3, 2021 at 3:41 PM Jason Wang  wrote:
> >>
> >>
> >> On 2021/7/29 at 3:34 PM, Xie Yongji wrote:
> >>> Export alloc_iova_fast() and free_iova_fast() so that
> >>> some modules can use it to improve iova allocation efficiency.
> >>
> >>
> >> It's better to explain why alloc_iova() is not sufficient here.
> >>
> >
> > Fine.
>
> What I fail to understand from the later patches is what the IOVA domain
> actually represents. If the "device" is a userspace process then
> logically the "IOVA" would be the userspace address, so presumably
> somewhere you're having to translate between this arbitrary address
> space and actual usable addresses - if you're worried about efficiency
> surely it would be even better to not do that?
>

Yes, the userspace daemon needs to translate the "IOVA" in a DMA
descriptor to a VA (from mmap(2)). But this doesn't actually hurt
performance, since it's an identity mapping in most cases.
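For illustration only, here is a minimal userspace sketch of that lookup; the names and the single-region layout are assumptions, not taken from the VDUSE code. With an identity layout the translation reduces to one bounds check plus an offset add, which is why it costs close to nothing:

```c
/* Minimal userspace sketch of the IOVA -> VA lookup a VDUSE-style
 * daemon would do.  The struct and helper names are hypothetical;
 * only the idea (offset translation over an mmap'ed region) comes
 * from the discussion above. */
#include <stdint.h>
#include <stddef.h>

struct bounce_region {
	uint64_t iova_start;	/* first IOVA covered by this mapping    */
	uint64_t size;		/* length of the region in bytes         */
	uint8_t *va_base;	/* VA returned by mmap(2) for the region */
};

/* Translate an IOVA from a DMA descriptor to a usable VA, or NULL
 * if [iova, iova + len) falls outside the region. */
static void *iova_to_va(const struct bounce_region *r, uint64_t iova, size_t len)
{
	if (iova < r->iova_start || iova - r->iova_start + len > r->size)
		return NULL;
	return r->va_base + (iova - r->iova_start);
}
```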

> Presumably userspace doesn't have any concern about alignment and the
> things we have to worry about for the DMA API in general, so it's pretty
> much just allocating slots in a buffer, and there are far more effective
> ways to do that than a full-blown address space manager.

Considering iova allocation efficiency, I think the iova allocator is
better here. In most cases, we don't even need to hold a spin lock
during iova allocation.

> If you're going
> to reuse any infrastructure I'd have expected it to be SWIOTLB rather
> than the IOVA allocator. Because, y'know, you're *literally implementing
> a software I/O TLB* ;)
>

But what we could actually reuse from SWIOTLB is its IOVA allocator, and
the IOVA management in SWIOTLB is not what we want. For example,
SWIOTLB allocates and uses contiguous memory for bouncing, which is
not necessary in the VDUSE case. And VDUSE needs coherent mappings,
which SWIOTLB does not support. Besides, SWIOTLB works in singleton
mode (designed for a platform IOMMU), while VDUSE is modeled on an
on-chip IOMMU (supporting multiple instances). So I still prefer to
reuse the IOVA allocator to implement an MMU-based software IOTLB.

Thanks,
Yongji
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v2 12/12] iommu: Do not allow IOMMU passthrough with Secure Launch

2021-08-03 Thread Daniel P. Smith

On 6/21/21 5:15 PM, Andy Lutomirski wrote:

On Mon, Jun 21, 2021 at 10:51 AM Ross Philipson
 wrote:


On 6/18/21 2:32 PM, Robin Murphy wrote:

On 2021-06-18 17:12, Ross Philipson wrote:

@@ -2761,7 +2762,10 @@ void iommu_set_default_passthrough(bool cmd_line)
   {
   if (cmd_line)
   iommu_cmd_line |= IOMMU_CMD_LINE_DMA_API;
-iommu_def_domain_type = IOMMU_DOMAIN_IDENTITY;
+
+/* Do not allow identity domain when Secure Launch is configured */
+if (!(slaunch_get_flags() & SL_FLAG_ACTIVE))
+iommu_def_domain_type = IOMMU_DOMAIN_IDENTITY;


Quietly ignoring the setting and possibly leaving iommu_def_domain_type
uninitialised (note that 0 is not actually a usable type) doesn't seem
great. AFAICS this probably warrants similar treatment to the


Ok so I guess it would be better to set it to IOMMU_DOMAIN_DMA even
though passthrough was requested. Or perhaps something more is needed here?


mem_encrypt_active() case - there doesn't seem to be a great deal of value in
trying to save users from themselves if they care about measured boot
yet explicitly pass options which may compromise measured boot. If you
really want to go down that route there's at least the sysfs interface
you'd need to nobble as well, not to mention the various ways of
completely disabling IOMMUs...


Doing a secure launch with the kernel is not a general purpose user use
case. A lot of work is done to secure the environment. Allowing
passthrough mode would leave the secure launch kernel exposed to DMA. I
think what we are trying to do here is what we intend though there may
be a better way or perhaps it is incomplete as you suggest.



I don't really like all these special cases.  Generically, what you're
trying to do is (AFAICT) to get the kernel to run in a mode in which
it does its best not to trust attached devices.  Nothing about this is
specific to Secure Launch.  There are plenty of scenarios in which
this is the case:

  - Virtual devices in a VM host outside the TCB, e.g. VDUSE, Xen
device domains (did I get the name right?), whatever tricks QEMU has,
etc.
  - SRTM / DRTM technologies (including but not limited to Secure
Launch -- plain old Secure Boot can work like this too).
  - Secure guest technologies, including but not limited to TDX and SEV.
  - Any computer with a USB-C port or other external DMA-capable port.
  - Regular computers in which the admin wants to enable this mode for
whatever reason.

Can you folks all please agree on a coordinated way for a Linux kernel
to configure itself appropriately?  Or to be configured via initramfs,
boot option, or some other trusted source of configuration supplied at
boot time?  We don't need a whole bunch of if (TDX), if (SEV), if
(secure launch), if (I have a USB-C port with PCIe exposed), if
(running on Xen), and similar checks all over the place.


Hey Andy,

On behalf of Ross and myself, I wanted to follow up on the points raised 
here. While there is an interest in ensuring a system is properly 
configured, we should not be blocking users from configuring the 
system as they desire. Instead, we are taking the approach of documenting 
the Secure Launch capability, in particular the recommended way to 
configure the kernel to use the capability appropriately via 
already existing methods such as kernel parameters. Hopefully that 
will address the concerns in the short term. Looking forward, we do have 
a vested interest in ensuring there is an ability to configure access 
control for security- and safety-critical solutions, and we would be grateful 
to be included in any discussions or working groups that might 
be looking into unifying how all these security technologies should 
configure the Linux kernel.


V/r,
Daniel P. Smith


RE: [PATCH] iommu/iova: kmemleak when disable SRIOV.

2021-08-03 Thread Zhou, Peng Ju via iommu
[Public]

Hi Alex,
Is there any doc that describes how these branches sync up?
I only know about drm-next, mainline, and the project branch, and the 
relationship between them; 
I have no idea about drm-tip or drm-misc.


-- 
BW
Pengju Zhou



> -Original Message-
> From: Deucher, Alexander 
> Sent: Tuesday, August 3, 2021 9:29 PM
> To: Zhou, Peng Ju ; Chris Wilson ; Robin Murphy ;
> iommu@lists.linux-foundation.org
> Cc: Wang, Yin ; w...@kernel.org; Chang, HaiJun
> ; Deng, Emily 
> Subject: RE: [PATCH] iommu/iova: kmemleak when disable SRIOV.
> 
> [Public]
> 
> > -Original Message-
> > From: Zhou, Peng Ju 
> > Sent: Tuesday, August 3, 2021 1:51 AM
> > To: Chris Wilson ; Robin Murphy
> > ; iommu@lists.linux-foundation.org
> > Cc: Deucher, Alexander ; Wang, Yin
> > ; w...@kernel.org; Chang, HaiJun
> > ; Deng, Emily 
> > Subject: RE: [PATCH] iommu/iova: kmemleak when disable SRIOV.
> >
> > [AMD Official Use Only]
> >
> > Hi Chris,
> > I hit a kmemleak with your following patch. Can you help fix it?
> >
> > According to the info in this thread, it seems the patch was not merged
> > into the iommu mainline branch, but I can see your patch in my kernel:
> > 5.11.0
> 
> If this patch is not upstream, it probably ended up in our tree via drm-tip or
> drm-misc last time we synced up.  If that is the case and the patch is not
> upstream, you can just revert the patch from our tree.
> 
> Alex
> 
> >
> >
> > commit 48a64dd561a53fb809e3f2210faf5dd442cfc56d
> > Author: Chris Wilson 
> > Date:   Sat Jan 16 11:10:35 2021 +
> >
> > iommu/iova: Use bottom-up allocation for DMA32
> >
> > Since DMA32 allocations are currently allocated top-down from the 4G
> > boundary, this causes fragmentation and reduces the maximum allocation
> > size. On some systems, the dma address space may be extremely
> > constrained by PCIe windows, requiring a stronger anti-fragmentation
> > strategy.
> >
> > Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2929
> > Signed-off-by: Chris Wilson 
> >
> >
> > --
> > BW
> > Pengju Zhou
> >
> >
> >
> >
> >
> > > -Original Message-
> > > From: Robin Murphy 
> > > Sent: Tuesday, July 27, 2021 10:23 PM
> > > To: Zhou, Peng Ju ; iommu@lists.linux-
> > > foundation.org
> > > Cc: Deucher, Alexander ; Wang, Yin
> > > ; w...@kernel.org
> > > Subject: Re: [PATCH] iommu/iova: kmemleak when disable SRIOV.
> > >
> > > On 2021-07-27 15:05, Zhou, Peng Ju wrote:
> > > > [AMD Official Use Only]
> > > >
> > > > Hi Robin
> > > > The patch which added "head":
> > > >
> > > > commit 48a64dd561a53fb809e3f2210faf5dd442cfc56d
> > > > Author: Chris Wilson 
> > > > Date:   Sat Jan 16 11:10:35 2021 +
> > > >
> > > >  iommu/iova: Use bottom-up allocation for DMA32
> > > >
> > > >  Since DMA32 allocations are currently allocated top-down from the 4G
> > > >  boundary, this causes fragmentation and reduces the maximum
> > > >  allocation size. On some systems, the dma address space may be extremely
> > > >  constrained by PCIe windows, requiring a stronger anti-fragmentation
> > > >  strategy.
> > > >
> > > >  Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2929
> > > >  Signed-off-by: Chris Wilson 
> > >
> > > ...which is not in mainline. I've never even seen it posted for review.
> > > In fact two search engines can't seem to find any trace of that SHA
> > > or patch subject on the internet at all.
> > >
> > > By all means tell Chris that his patch, wherever you got it from, is
> > > broken, but once again there's nothing the upstream maintainers/reviewers
> > > can do about code which isn't upstream.
> > >
> > > Thanks,
> > > Robin.
> > >
> > > > --
> > > > 
> > > > BW
> > > > Pengju Zhou
> > > >
> > > >
> > > >
> > > >> -Original Message-
> > > >> From: Robin Murphy 
> > > >> Sent: Tuesday, July 27, 2021 4:52 PM
> > > >> To: Zhou, Peng Ju ; iommu@lists.linux-
> > > >> foundation.org
> > > >> Cc: Deucher, Alexander ; Wang, Yin
> > > >> ; w...@kernel.org
> > > >> Subject: Re: [PATCH] iommu/iova: kmemleak when disable SRIOV.
> > > >>
> > > >> On 2021-07-27 05:46, Zhou, Peng Ju wrote:
> > > >>> [AMD Official Use Only]
> > > >>>
> > > >>> Hi Robin
> > > >>> 1. it is not a good manner to free a 

Re: [RFC v1 6/8] mshv: command line option to skip devices in PV-IOMMU

2021-08-03 Thread Wei Liu
On Wed, Aug 04, 2021 at 12:20:42AM +0530, Praveen Kumar wrote:
> On 09-07-2021 17:13, Wei Liu wrote:
> > Some devices may have been claimed by the hypervisor already. One such
> > example is when a user assigns a NIC for debugging purposes.
> > 
> > Ideally Linux should be able to retrieve that information, but
> > there is no way to do that yet. And designing that new mechanism is
> > going to take time.
> > 
> > Provide a command line option for skipping devices. This is a stopgap
> > solution, so it is intentionally undocumented. Hopefully we can retire
> > it in the future.
> > 
> > Signed-off-by: Wei Liu 
> > ---
> >  drivers/iommu/hyperv-iommu.c | 45 
> >  1 file changed, 45 insertions(+)
> > 
> > diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
> > index 043dcff06511..353da5036387 100644
> > --- a/drivers/iommu/hyperv-iommu.c
> > +++ b/drivers/iommu/hyperv-iommu.c
> > @@ -349,6 +349,16 @@ static const struct irq_domain_ops hyperv_root_ir_domain_ops = {
> >  
> >  #ifdef CONFIG_HYPERV_ROOT_PVIOMMU
> >  
> > +/* The IOMMU will not claim these PCI devices. */
> > +static char *pci_devs_to_skip;
> > +static int __init mshv_iommu_setup_skip(char *str)
> > +{
> > +   pci_devs_to_skip = str;
> > +
> > +   return 0;
> > +}
> > +/* mshv_iommu_skip=(:BB:DD.F)(:BB:DD.F) */
> > +__setup("mshv_iommu_skip=", mshv_iommu_setup_skip);
> > +
> >  /* DMA remapping support */
> >  struct hv_iommu_domain {
> > struct iommu_domain domain;
> > @@ -774,6 +784,41 @@ static struct iommu_device *hv_iommu_probe_device(struct device *dev)
> > if (!dev_is_pci(dev))
> > return ERR_PTR(-ENODEV);
> >  
> > +   /*
> > +* Skip the PCI device specified in `pci_devs_to_skip`. This is a
> > +* temporary solution until we figure out a way to extract information
> > +* from the hypervisor what devices it is already using.
> > +*/
> > +   if (pci_devs_to_skip && *pci_devs_to_skip) {
> > +   int pos = 0;
> > +   int parsed;
> > +   int segment, bus, slot, func;
> > +   struct pci_dev *pdev = to_pci_dev(dev);
> > +
> > +   do {
> > +   parsed = 0;
> > +
> > +   sscanf(pci_devs_to_skip + pos,
> > +   " (%x:%x:%x.%x) %n",
> > +   , , , , );
> > +
> > +   if (parsed <= 0)
> > +   break;
> > +
> > +   if (pci_domain_nr(pdev->bus) == segment &&
> > +   pdev->bus->number == bus &&
> > +   PCI_SLOT(pdev->devfn) == slot &&
> > +   PCI_FUNC(pdev->devfn) == func)
> > +   {
> > +   dev_info(dev, "skipped by MSHV IOMMU\n");
> > +   return ERR_PTR(-ENODEV);
> > +   }
> > +
> > +   pos += parsed;
> > +
> > +   } while (pci_devs_to_skip[pos]);
> 
> Is there a possibility that pci_devs_to_skip + pos runs past the end of the
> string while still pointing at valid memory?

pos should point to the last parsed position. If parsing fails, pos does
not get updated and the code breaks out of the loop. If parsing succeeds,
pos should point to either the start of the next element or '\0'
(end of string). To me this is good enough.
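To make that argument concrete, here is the same loop lifted into a plain userspace harness. The count_matches() helper is hypothetical; only the sscanf pattern and the pos/%n bookkeeping come from the patch:

```c
/* Userspace model of the "(SSSS:BB:DD.F)(SSSS:BB:DD.F)" parsing loop
 * from the patch: pos only advances by what sscanf actually consumed
 * (reported via %n), and the loop exits either when parsing fails or
 * when pos reaches the terminating NUL, so it never reads past the
 * end of the string. */
#include <stdio.h>

/* Returns how many entries in `list` match seg:bus:slot.func. */
static int count_matches(const char *list, int seg, int bus, int slot, int func)
{
	int pos = 0, matches = 0;

	do {
		int parsed = 0;
		int s, b, d, f;

		sscanf(list + pos, " (%x:%x:%x.%x) %n",
		       &s, &b, &d, &f, &parsed);
		if (parsed <= 0)
			break;		/* malformed entry: stop, as the driver does */

		if (s == seg && b == bus && d == slot && f == func)
			matches++;

		pos += parsed;		/* now at the next '(' or at '\0' */
	} while (list[pos]);

	return matches;
}
```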

> I would recommend adding a size check as well before accessing the
> array content, just to be safer about the memory being accessed.
> 

What check do you have in mind?

Wei.

> > +   }
> > +
> > vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
> > if (!vdev)
> > return ERR_PTR(-ENOMEM);
> > 
> 
> Regards,
> 
> ~Praveen.


Re: [RFC v1 5/8] mshv: add paravirtualized IOMMU support

2021-08-03 Thread Wei Liu
On Wed, Aug 04, 2021 at 12:10:45AM +0530, Praveen Kumar wrote:
> On 09-07-2021 17:13, Wei Liu wrote:
> > +static void hv_iommu_domain_free(struct iommu_domain *d)
> > +{
> > +   struct hv_iommu_domain *domain = to_hv_iommu_domain(d);
> > +   unsigned long flags;
> > +   u64 status;
> > +   struct hv_input_delete_device_domain *input;
> > +
> > +   if (is_identity_domain(domain) || is_null_domain(domain))
> > +   return;
> > +
> > +   local_irq_save(flags);
> > +   input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > +   memset(input, 0, sizeof(*input));
> > +
> > +   input->device_domain = domain->device_domain;
> > +
> > +   status = hv_do_hypercall(HVCALL_DELETE_DEVICE_DOMAIN, input, NULL);
> > +
> > +   local_irq_restore(flags);
> > +
> > +   if (!hv_result_success(status))
> > +   pr_err("%s: hypercall failed, status %lld\n", __func__, status);
> 
> Is it OK to deallocate the resources, if hypercall has failed ?

It should be fine. We leak some resources in the hypervisor, but Linux
is in a rather wedged state anyway. Refusing to free up resources in
Linux does not do much good.

> Do we have any specific error code, EBUSY (or the like), that we need to
> wait on?
> 

I have not found a circumstance in which that can happen.

> > +
> > +   ida_free(&domain->hv_iommu->domain_ids, domain->device_domain.domain_id.id);
> > +
> > +   iommu_put_dma_cookie(d);
> > +
> > +   kfree(domain);
> > +}
> > +
> > +static int hv_iommu_attach_dev(struct iommu_domain *d, struct device *dev)
> > +{
> > +   struct hv_iommu_domain *domain = to_hv_iommu_domain(d);
> > +   u64 status;
> > +   unsigned long flags;
> > +   struct hv_input_attach_device_domain *input;
> > +   struct pci_dev *pdev;
> > +   struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> > +
> > +   /* Only allow PCI devices for now */
> > +   if (!dev_is_pci(dev))
> > +   return -EINVAL;
> > +
> > +   pdev = to_pci_dev(dev);
> > +
> > +   dev_dbg(dev, "Attaching (%strusted) to %d\n", pdev->untrusted ? "un" : "",
> > +   domain->device_domain.domain_id.id);
> > +
> > +   local_irq_save(flags);
> > +   input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > +   memset(input, 0, sizeof(*input));
> > +
> > +   input->device_domain = domain->device_domain;
> > +   input->device_id = hv_build_pci_dev_id(pdev);
> > +
> > +   status = hv_do_hypercall(HVCALL_ATTACH_DEVICE_DOMAIN, input, NULL);
> > +   local_irq_restore(flags);
> > +
> > +   if (!hv_result_success(status))
> > +   pr_err("%s: hypercall failed, status %lld\n", __func__, status);
> 
> Does it make sense to set vdev->domain = NULL?
>

It is already NULL -- there is no other code path that sets it and the
detach path always sets the field to NULL.

> > +   else
> > +   vdev->domain = domain;
> > +
> > +   return hv_status_to_errno(status);
> > +}
> > +
[...]
> > +static size_t hv_iommu_unmap(struct iommu_domain *d, unsigned long iova,
> > +  size_t size, struct iommu_iotlb_gather *gather)
> > +{
> > +   size_t unmapped;
> > +   struct hv_iommu_domain *domain = to_hv_iommu_domain(d);
> > +   unsigned long flags, npages;
> > +   struct hv_input_unmap_device_gpa_pages *input;
> > +   u64 status;
> > +
> > +   unmapped = hv_iommu_del_mappings(domain, iova, size);
> > +   if (unmapped < size)
> > +   return 0;
> 
> Is there a case where unmapped > 0 && unmapped < size?
> 

There could be such a case -- hv_iommu_del_mappings' return value is >= 0.
Is there a problem with this predicate?
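For what it's worth, the no-split behaviour behind that partial-unmap case can be modelled in userspace like this; a sorted array stands in for the kernel interval tree, and the names are made up, with only the semantics following hv_iommu_del_mappings():

```c
/* Simplified model of hv_iommu_del_mappings(): walk mappings that
 * overlap [iova, iova + size - 1], refuse to split a mapping that
 * starts before iova, and return the number of bytes removed.  A
 * sorted array stands in for the kernel's interval tree. */
#include <stdint.h>
#include <stddef.h>

struct mapping {
	uint64_t start, last;	/* inclusive IOVA range */
	int live;		/* 1 while mapped       */
};

static size_t del_mappings(struct mapping *m, size_t n,
			   uint64_t iova, size_t size)
{
	uint64_t last = iova + size - 1;
	size_t unmapped = 0;

	for (size_t i = 0; i < n; i++) {
		if (!m[i].live || m[i].last < iova || m[i].start > last)
			continue;		/* no overlap */
		if (m[i].start < iova)
			break;			/* would need a split: stop */
		m[i].live = 0;
		unmapped += m[i].last - m[i].start + 1;
	}
	return unmapped;	/* may be 0 < unmapped < size: partial unmap */
}
```

In this model a request that covers only a mapping's tail returns 0, while a request spanning two whole mappings with a hole between them returns less than size, which is exactly the unmapped > 0 && unmapped < size case asked about above.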

Wei.


Re: [RFC v1 6/8] mshv: command line option to skip devices in PV-IOMMU

2021-08-03 Thread Praveen Kumar
On 09-07-2021 17:13, Wei Liu wrote:
> Some devices may have been claimed by the hypervisor already. One such
> example is when a user assigns a NIC for debugging purposes.
> 
> Ideally Linux should be able to retrieve that information, but
> there is no way to do that yet. And designing that new mechanism is
> going to take time.
> 
> Provide a command line option for skipping devices. This is a stopgap
> solution, so it is intentionally undocumented. Hopefully we can retire
> it in the future.
> 
> Signed-off-by: Wei Liu 
> ---
>  drivers/iommu/hyperv-iommu.c | 45 
>  1 file changed, 45 insertions(+)
> 
> diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
> index 043dcff06511..353da5036387 100644
> --- a/drivers/iommu/hyperv-iommu.c
> +++ b/drivers/iommu/hyperv-iommu.c
> @@ -349,6 +349,16 @@ static const struct irq_domain_ops hyperv_root_ir_domain_ops = {
>  
>  #ifdef CONFIG_HYPERV_ROOT_PVIOMMU
>  
> +/* The IOMMU will not claim these PCI devices. */
> +static char *pci_devs_to_skip;
> +static int __init mshv_iommu_setup_skip(char *str)
> +{
> + pci_devs_to_skip = str;
> +
> + return 0;
> +}
> +/* mshv_iommu_skip=(:BB:DD.F)(:BB:DD.F) */
> +__setup("mshv_iommu_skip=", mshv_iommu_setup_skip);
> +
>  /* DMA remapping support */
>  struct hv_iommu_domain {
>   struct iommu_domain domain;
> @@ -774,6 +784,41 @@ static struct iommu_device *hv_iommu_probe_device(struct device *dev)
>   if (!dev_is_pci(dev))
>   return ERR_PTR(-ENODEV);
>  
> + /*
> +  * Skip the PCI device specified in `pci_devs_to_skip`. This is a
> +  * temporary solution until we figure out a way to extract information
> +  * from the hypervisor what devices it is already using.
> +  */
> + if (pci_devs_to_skip && *pci_devs_to_skip) {
> + int pos = 0;
> + int parsed;
> + int segment, bus, slot, func;
> + struct pci_dev *pdev = to_pci_dev(dev);
> +
> + do {
> + parsed = 0;
> +
> + sscanf(pci_devs_to_skip + pos,
> + " (%x:%x:%x.%x) %n",
> + 			&segment, &bus, &slot, &func, &parsed);
> +
> + if (parsed <= 0)
> + break;
> +
> + if (pci_domain_nr(pdev->bus) == segment &&
> + pdev->bus->number == bus &&
> + PCI_SLOT(pdev->devfn) == slot &&
> + PCI_FUNC(pdev->devfn) == func)
> + {
> + dev_info(dev, "skipped by MSHV IOMMU\n");
> + return ERR_PTR(-ENODEV);
> + }
> +
> + pos += parsed;
> +
> + } while (pci_devs_to_skip[pos]);

Is there a possibility that pci_devs_to_skip + pos runs past the end of the 
string while still pointing at valid memory?
I would recommend adding a size check as well before accessing the array 
content, just to be safer about the memory being accessed.

> + }
> +
>   vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
>   if (!vdev)
>   return ERR_PTR(-ENOMEM);
> 

Regards,

~Praveen.


Re: [RFC v1 5/8] mshv: add paravirtualized IOMMU support

2021-08-03 Thread Praveen Kumar
On 09-07-2021 17:13, Wei Liu wrote:
> +static void hv_iommu_domain_free(struct iommu_domain *d)
> +{
> + struct hv_iommu_domain *domain = to_hv_iommu_domain(d);
> + unsigned long flags;
> + u64 status;
> + struct hv_input_delete_device_domain *input;
> +
> + if (is_identity_domain(domain) || is_null_domain(domain))
> + return;
> +
> + local_irq_save(flags);
> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> + memset(input, 0, sizeof(*input));
> +
> + input->device_domain = domain->device_domain;
> +
> + status = hv_do_hypercall(HVCALL_DELETE_DEVICE_DOMAIN, input, NULL);
> +
> + local_irq_restore(flags);
> +
> + if (!hv_result_success(status))
> + pr_err("%s: hypercall failed, status %lld\n", __func__, status);

Is it OK to deallocate the resources if the hypercall has failed?
Do we have any specific error code, EBUSY (or the like), that we need to wait on?

> +
> + ida_free(&domain->hv_iommu->domain_ids, domain->device_domain.domain_id.id);
> +
> + iommu_put_dma_cookie(d);
> +
> + kfree(domain);
> +}
> +
> +static int hv_iommu_attach_dev(struct iommu_domain *d, struct device *dev)
> +{
> + struct hv_iommu_domain *domain = to_hv_iommu_domain(d);
> + u64 status;
> + unsigned long flags;
> + struct hv_input_attach_device_domain *input;
> + struct pci_dev *pdev;
> + struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> +
> + /* Only allow PCI devices for now */
> + if (!dev_is_pci(dev))
> + return -EINVAL;
> +
> + pdev = to_pci_dev(dev);
> +
> + dev_dbg(dev, "Attaching (%strusted) to %d\n", pdev->untrusted ? "un" : "",
> + domain->device_domain.domain_id.id);
> +
> + local_irq_save(flags);
> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> + memset(input, 0, sizeof(*input));
> +
> + input->device_domain = domain->device_domain;
> + input->device_id = hv_build_pci_dev_id(pdev);
> +
> + status = hv_do_hypercall(HVCALL_ATTACH_DEVICE_DOMAIN, input, NULL);
> + local_irq_restore(flags);
> +
> + if (!hv_result_success(status))
> + pr_err("%s: hypercall failed, status %lld\n", __func__, status);

Does it make sense to set vdev->domain = NULL?

> + else
> + vdev->domain = domain;
> +
> + return hv_status_to_errno(status);
> +}
> +
> +static void hv_iommu_detach_dev(struct iommu_domain *d, struct device *dev)
> +{
> + u64 status;
> + unsigned long flags;
> + struct hv_input_detach_device_domain *input;
> + struct pci_dev *pdev;
> + struct hv_iommu_domain *domain = to_hv_iommu_domain(d);
> + struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> +
> + /* See the attach function, only PCI devices for now */
> + if (!dev_is_pci(dev))
> + return;
> +
> + pdev = to_pci_dev(dev);
> +
> + dev_dbg(dev, "Detaching from %d\n", domain->device_domain.domain_id.id);
> +
> + local_irq_save(flags);
> + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> + memset(input, 0, sizeof(*input));
> +
> + input->partition_id = HV_PARTITION_ID_SELF;
> + input->device_id = hv_build_pci_dev_id(pdev);
> +
> + status = hv_do_hypercall(HVCALL_DETACH_DEVICE_DOMAIN, input, NULL);
> + local_irq_restore(flags);
> +
> + if (!hv_result_success(status))
> + pr_err("%s: hypercall failed, status %lld\n", __func__, status);
> +
> + vdev->domain = NULL;
> +}
> +
> +static int hv_iommu_add_mapping(struct hv_iommu_domain *domain, unsigned long iova,
> + phys_addr_t paddr, size_t size, u32 flags)
> +{
> + unsigned long irqflags;
> + struct hv_iommu_mapping *mapping;
> +
> + mapping = kzalloc(sizeof(*mapping), GFP_ATOMIC);
> + if (!mapping)
> + return -ENOMEM;
> +
> + mapping->paddr = paddr;
> + mapping->iova.start = iova;
> + mapping->iova.last = iova + size - 1;
> + mapping->flags = flags;
> +
> + spin_lock_irqsave(&domain->mappings_lock, irqflags);
> + interval_tree_insert(&mapping->iova, &domain->mappings);
> + spin_unlock_irqrestore(&domain->mappings_lock, irqflags);
> +
> + return 0;
> +}
> +
> +static size_t hv_iommu_del_mappings(struct hv_iommu_domain *domain,
> + unsigned long iova, size_t size)
> +{
> + unsigned long flags;
> + size_t unmapped = 0;
> + unsigned long last = iova + size - 1;
> + struct hv_iommu_mapping *mapping = NULL;
> + struct interval_tree_node *node, *next;
> +
> + spin_lock_irqsave(&domain->mappings_lock, flags);
> + next = interval_tree_iter_first(&domain->mappings, iova, last);
> + while (next) {
> + node = next;
> + mapping = container_of(node, struct hv_iommu_mapping, iova);
> + next = interval_tree_iter_next(node, iova, last);
> +
> + /* Trying to split a mapping? Not supported for now. */
> + if (mapping->iova.start < iova)
> + break;
> +
> + unmapped += 

Re: [PATCH v2] iommu/amd: Use report_iommu_fault()

2021-08-03 Thread Lennert Buytenhek
On Fri, Jul 30, 2021 at 05:32:16AM +0300, Lennert Buytenhek wrote:

> > > This patch makes iommu/amd call report_iommu_fault() when an I/O page
> > > fault occurs, which has two effects:
> > > 
> > > 1) It allows device drivers to register a callback to be notified of
> > > I/O page faults, via the iommu_set_fault_handler() API.
> > > 
> > > 2) It triggers the io_page_fault tracepoint in report_iommu_fault()
> > > when an I/O page fault occurs.
> > > 
> > > I'm mainly interested in (2).  We have a daemon with some rasdaemon-like
> > > functionality for handling platform errors, and being able to be notified
> > > of I/O page faults for initiating corrective action is very useful -- and
> > > receiving such events via event tracing is a lot nicer than having to
> > > scrape them from kmsg.
> > > 
> > > A number of other IOMMU drivers already use report_iommu_fault(), and
> > > I/O page faults on those IOMMUs therefore already seem to trigger this
> > > tracepoint -- but this isn't (yet) the case for AMD-Vi and Intel DMAR.
> > > 
> > > I copied the logic from the other callers of report_iommu_fault(), where
> > > if that function returns zero, the driver will have handled the fault,
> > > in which case we avoid logging information about the fault to the printk
> > > buffer from the IOMMU driver.
> > > 
> > > With this patch I see io_page_fault event tracing entries as expected:
> > > 
> > > irq/24-AMD-Vi-48[002]    978.554289: io_page_fault: IOMMU:[drvname] :05:00.0 iova=0x91482640 flags=0x
> > > irq/24-AMD-Vi-48[002]    978.554294: io_page_fault: IOMMU:[drvname] :05:00.0 iova=0x91482650 flags=0x
> > > irq/24-AMD-Vi-48[002]    978.554299: io_page_fault: IOMMU:[drvname] :05:00.0 iova=0x91482660 flags=0x
> > > irq/24-AMD-Vi-48[002]    978.554305: io_page_fault: IOMMU:[drvname] :05:00.0 iova=0x91482670 flags=0x
> > > irq/24-AMD-Vi-48[002]    978.554310: io_page_fault: IOMMU:[drvname] :05:00.0 iova=0x91482680 flags=0x
> > > irq/24-AMD-Vi-48[002]    978.554315: io_page_fault: IOMMU:[drvname] :05:00.0 iova=0x914826a0 flags=0x
> > > 
> > > For determining IOMMU_FAULT_{READ,WRITE}, I followed the AMD IOMMU
> > > spec, but I haven't tested that bit of the code, as the page faults I
> > > encounter are all to non-present (!EVENT_FLAG_PR) mappings, in which
> > > case EVENT_FLAG_RW doesn't make sense.
> > > 
> > > Signed-off-by: Lennert Buytenhek 
> > > ---
> > > Changes since v1 RFC:
> > > 
> > > - Don't call report_iommu_fault() for IRQ remapping faults.
> > >(Suggested by Joerg Roedel.)
> > > 
> > >   drivers/iommu/amd/amd_iommu_types.h |  4 
> > >   drivers/iommu/amd/iommu.c   | 29 +
> > >   2 files changed, 33 insertions(+)
> > > 
> > > diff --git a/drivers/iommu/amd/amd_iommu_types.h b/drivers/iommu/amd/amd_iommu_types.h
> > > index 94c1a7a9876d..2f2c6630c24c 100644
> > > --- a/drivers/iommu/amd/amd_iommu_types.h
> > > +++ b/drivers/iommu/amd/amd_iommu_types.h
> > > @@ -138,6 +138,10 @@
> > >   #define EVENT_DOMID_MASK_HI 0xf
> > >   #define EVENT_FLAGS_MASK 0xfff
> > >   #define EVENT_FLAGS_SHIFT   0x10
> > > +#define EVENT_FLAG_TR0x100
> > > +#define EVENT_FLAG_RW0x020
> > > +#define EVENT_FLAG_PR0x010
> > > +#define EVENT_FLAG_I 0x008
> > >   /* feature control bits */
> > >   #define CONTROL_IOMMU_EN 0x00ULL
> > > diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> > > index a7d6d78147b7..d9fb2c22d44a 100644
> > > --- a/drivers/iommu/amd/iommu.c
> > > +++ b/drivers/iommu/amd/iommu.c
> > 
> > What if we introduce:
> > 
> > +/*
> > + * AMD I/O Virtualization Technology (IOMMU) Specification,
> > + * revision 3.00, section 2.5.3 ("IO_PAGE_FAULT Event") says
> > + * that the RW ("read-write") bit is only valid if the I/O
> > + * page fault was caused by a memory transaction request
> > + * referencing a page that was marked present.
> > + */
> > +#define IO_PAGE_FAULT_MEM_MASK \
> > +   (EVENT_FLAG_TR | EVENT_FLAG_PR | EVENT_FLAG_I)
> > +#define IS_IOMMU_MEM_TRANSACTION(x)\
> > +   ((x & IO_PAGE_FAULT_MEM_MASK) == EVENT_FLAG_PR)
> > 
> > Note that this should have already checked w/ EVENT_FLAG_I == 0.
> > 
> > 
> > > @@ -484,6 +484,34 @@ static void amd_iommu_report_page_fault(u16 devid, u16 domain_id,
> > >   if (pdev)
> > >   dev_data = dev_iommu_priv_get(>dev);
> > > + /*
> > > +  * If this is a DMA fault (for which the I(nterrupt) bit will
> > > +  * be unset), allow report_iommu_fault() to prevent logging it.
> > > +  */
> > > + if (dev_data && ((flags & EVENT_FLAG_I) == 0)) {
> > > + int report_flags;
> > > +
> > > + /*
> > > +  * AMD I/O Virtualization Technology (IOMMU) Specification,

[PATCH v3] iommu/amd: Use report_iommu_fault()

2021-08-03 Thread Lennert Buytenhek
This patch makes iommu/amd call report_iommu_fault() when an I/O page
fault occurs, which has two effects:

1) It allows device drivers to register a callback to be notified of
   I/O page faults, via the iommu_set_fault_handler() API.

2) It triggers the io_page_fault tracepoint in report_iommu_fault()
   when an I/O page fault occurs.

I'm mainly interested in (2).  We have a daemon with some rasdaemon-like
functionality for handling platform errors, and being able to be notified
of I/O page faults for initiating corrective action is very useful -- and
receiving such events via event tracing is a lot nicer than having to
scrape them from kmsg.

A number of other IOMMU drivers already use report_iommu_fault(), and
I/O page faults on those IOMMUs therefore already seem to trigger this
tracepoint -- but this isn't (yet) the case for AMD-Vi and Intel DMAR.

I copied the logic from the other callers of report_iommu_fault(), where
if that function returns zero, the driver will have handled the fault,
in which case we avoid logging information about the fault to the printk
buffer from the IOMMU driver.

With this patch I see io_page_fault event tracing entries as expected:

   irq/24-AMD-Vi-48[002]    978.554289: io_page_fault: IOMMU:[drvname] :05:00.0 iova=0x91482640 flags=0x
   irq/24-AMD-Vi-48[002]    978.554294: io_page_fault: IOMMU:[drvname] :05:00.0 iova=0x91482650 flags=0x
   irq/24-AMD-Vi-48[002]    978.554299: io_page_fault: IOMMU:[drvname] :05:00.0 iova=0x91482660 flags=0x
   irq/24-AMD-Vi-48[002]    978.554305: io_page_fault: IOMMU:[drvname] :05:00.0 iova=0x91482670 flags=0x
   irq/24-AMD-Vi-48[002]    978.554310: io_page_fault: IOMMU:[drvname] :05:00.0 iova=0x91482680 flags=0x
   irq/24-AMD-Vi-48[002]    978.554315: io_page_fault: IOMMU:[drvname] :05:00.0 iova=0x914826a0 flags=0x

For determining IOMMU_FAULT_{READ,WRITE}, I followed the AMD IOMMU
spec, but I haven't tested that bit of the code, as the page faults I
encounter are all to non-present (!EVENT_FLAG_PR) mappings, in which
case EVENT_FLAG_RW doesn't make sense.
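That decoding can at least be exercised in isolation; a minimal standalone check, with the constants and macros copied from the patch below:

```c
#include <assert.h>

/* Event flag bits, as defined in the patch. */
#define EVENT_FLAG_TR	0x100
#define EVENT_FLAG_RW	0x020
#define EVENT_FLAG_PR	0x010
#define EVENT_FLAG_I	0x008

/* A fault is a DMA memory transaction iff the I(nterrupt) bit is clear. */
#define IS_IOMMU_MEM_TRANSACTION(flags)	\
	(((flags) & EVENT_FLAG_I) == 0)

/* Per the spec, RW is only meaningful for a present (PR), non-translation
 * (TR) request. */
#define IS_RW_FLAG_VALID(flags)		\
	(((flags) & (EVENT_FLAG_TR | EVENT_FLAG_PR)) == EVENT_FLAG_PR)

#define IS_WRITE_REQUEST(flags)		\
	(IS_RW_FLAG_VALID(flags) && ((flags) & EVENT_FLAG_RW))
```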

Signed-off-by: Lennert Buytenhek 
---
Changes for v3:
- Test fault flags via macros.  (Suggested by Suravee Suthikulpanit.)

Changes for v2:
- Don't call report_iommu_fault() for IRQ remapping faults.
  (Suggested by Joerg Roedel.)

 drivers/iommu/amd/amd_iommu_types.h |  4 
 drivers/iommu/amd/iommu.c   | 29 +
 2 files changed, 33 insertions(+)

diff --git a/drivers/iommu/amd/amd_iommu_types.h b/drivers/iommu/amd/amd_iommu_types.h
index 94c1a7a9876d..2f2c6630c24c 100644
--- a/drivers/iommu/amd/amd_iommu_types.h
+++ b/drivers/iommu/amd/amd_iommu_types.h
@@ -138,6 +138,10 @@
 #define EVENT_DOMID_MASK_HI	0xf0000
 #define EVENT_FLAGS_MASK   0xfff
 #define EVENT_FLAGS_SHIFT  0x10
+#define EVENT_FLAG_TR  0x100
+#define EVENT_FLAG_RW  0x020
+#define EVENT_FLAG_PR  0x010
+#define EVENT_FLAG_I   0x008
 
 /* feature control bits */
 #define CONTROL_IOMMU_EN	0x00ULL
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index a7d6d78147b7..00975b08bd3f 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -473,6 +473,22 @@ static void amd_iommu_report_rmp_fault(volatile u32 *event)
pci_dev_put(pdev);
 }
 
+/*
+ * AMD I/O Virtualization Technology (IOMMU) Specification, revision
+ * 3.00, section 2.5.3 ("IO_PAGE_FAULT Event") says that the RW
+ * ("read-write") bit is only valid if the I/O page fault was caused
+ * by a memory transaction request referencing a page that was marked
+ * present.
+ */
+#define IS_IOMMU_MEM_TRANSACTION(flags)\
+   (((flags) & EVENT_FLAG_I) == 0)
+
+#define IS_RW_FLAG_VALID(flags)\
+   (((flags) & (EVENT_FLAG_TR | EVENT_FLAG_PR)) == EVENT_FLAG_PR)
+
+#define IS_WRITE_REQUEST(flags)\
+   (IS_RW_FLAG_VALID(flags) && ((flags) & EVENT_FLAG_RW))
+
 static void amd_iommu_report_page_fault(u16 devid, u16 domain_id,
u64 address, int flags)
 {
@@ -484,6 +500,18 @@ static void amd_iommu_report_page_fault(u16 devid, u16 domain_id,
 	if (pdev)
 		dev_data = dev_iommu_priv_get(&pdev->dev);
 
+   /*
+* If this is a DMA fault (for which the I(nterrupt) bit will
+* be unset), allow report_iommu_fault() to prevent logging it.
+*/
+   if (dev_data && IS_IOMMU_MEM_TRANSACTION(flags)) {
+		if (!report_iommu_fault(&dev_data->domain->domain,
+					&pdev->dev, address,
+					IS_WRITE_REQUEST(flags) ?
+					IOMMU_FAULT_WRITE : IOMMU_FAULT_READ))
+   goto out;
+   }
+
if (dev_data) {
if 

RE: [PATCH] iommu/iova: kmemleak when disable SRIOV.

2021-08-03 Thread Deucher, Alexander via iommu
[Public]

> -Original Message-
> From: Zhou, Peng Ju 
> Sent: Tuesday, August 3, 2021 1:51 AM
> To: Chris Wilson ; Robin Murphy
> ; iommu@lists.linux-foundation.org
> Cc: Deucher, Alexander ; Wang, Yin
> ; w...@kernel.org; Chang, HaiJun
> ; Deng, Emily 
> Subject: RE: [PATCH] iommu/iova: kmemleak when disable SRIOV.
> 
> [AMD Official Use Only]
> 
> Hi Chris
> I hit kmemleak with your following patch, Can you help to fix it?
> 
> According to the info in this thread, it seems the patch doesn't merge into
> iommu mainline branch, but I can get your patch from my kernel: 5.11.0

If this patch is not upstream, it probably ended up in our tree via drm-tip or 
drm-misc last time we synced up.  If that is the case and the patch is not 
upstream, you can just revert the patch from our tree.

Alex

> 
> 
> commit 48a64dd561a53fb809e3f2210faf5dd442cfc56d
> Author: Chris Wilson 
> Date:   Sat Jan 16 11:10:35 2021 +
> 
> iommu/iova: Use bottom-up allocation for DMA32
> 
> Since DMA32 allocations are currently allocated top-down from the 4G
> boundary, this causes fragmentation and reduces the maximum allocation
> size. On some systems, the dma address space may be extremely
> constrained by PCIe windows, requiring a stronger anti-fragmentation
> strategy.
> 
> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2929
> Signed-off-by: Chris Wilson 
> 
> 
> --
> BW
> Pengju Zhou
> 
> 
> 
> 
> 
> > -Original Message-
> > From: Robin Murphy 
> > Sent: Tuesday, July 27, 2021 10:23 PM
> > To: Zhou, Peng Ju ; iommu@lists.linux-
> > foundation.org
> > Cc: Deucher, Alexander ; Wang, Yin
> > ; w...@kernel.org
> > Subject: Re: [PATCH] iommu/iova: kmemleak when disable SRIOV.
> >
> > On 2021-07-27 15:05, Zhou, Peng Ju wrote:
> > > [AMD Official Use Only]
> > >
> > > Hi Robin
> > > The patch which add "head" :
> > >
> > > commit 48a64dd561a53fb809e3f2210faf5dd442cfc56d
> > > Author: Chris Wilson 
> > > Date:   Sat Jan 16 11:10:35 2021 +
> > >
> > >  iommu/iova: Use bottom-up allocation for DMA32
> > >
> > >  Since DMA32 allocations are currently allocated top-down from the 4G
> > >  boundary, this causes fragmentation and reduces the maximum allocation
> > >  size. On some systems, the dma address space may be extremely
> > >  constrained by PCIe windows, requiring a stronger anti-fragmentation
> > >  strategy.
> > >
> > >  Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2929
> > >  Signed-off-by: Chris Wilson 
> >
> > ...which is not in mainline. I've never even seen it posted for review.
> > In fact two search engines can't seem to find any trace of that SHA or patch
> > subject on the internet at all.
> >
> > By all means tell Chris that his patch, wherever you got it from, is broken,
> but
> > once again there's nothing the upstream maintainers/reviewers can do
> about
> > code which isn't upstream.
> >
> > Thanks,
> > Robin.
> >
> > > --
> > > BW
> > > Pengju Zhou
> > >
> > >
> > >
> > >> -Original Message-
> > >> From: Robin Murphy 
> > >> Sent: Tuesday, July 27, 2021 4:52 PM
> > >> To: Zhou, Peng Ju ; iommu@lists.linux-
> > >> foundation.org
> > >> Cc: Deucher, Alexander ; Wang, Yin
> > >> ; w...@kernel.org
> > >> Subject: Re: [PATCH] iommu/iova: kmemleak when disable SRIOV.
> > >>
> > >> On 2021-07-27 05:46, Zhou, Peng Ju wrote:
> > >>> [AMD Official Use Only]
> > >>>
> > >>> Hi Robin
> > >>> 1. it is not a good manner to free a statically allocated object(in
> > >>> this case, it
> > >> is iovad->head) dynamically even though the free only occurred when
> > >> shut down the OS in most cases.
> > >>> 2. the kmemleak occurred when disable SRIOV(remove a PCIE device),
> I
> > >>> post the log in the following, in the log, the line:" kmemleak:
> > >>> Found object by alias at 0x9221ae647050" means the OS frees a
> > >>> not existing object(iovad->head) which added to RB_TREE in the
> > >>> function init_iova_domain
> > >>
> > >> Sure, it was apparent enough what the bug was; my point is that the
> > >> bug does not exist in mainline. This is the current mainline
> > >> definition of struct
> > >> iova_domain:
> > >>
> > >>
> > >> /* holds all the iova translations for a domain */
> > >> struct iova_domain {
> > >> 	spinlock_t	iova_rbtree_lock; /* Lock to protect update of rbtree */
> > >> 	struct rb_root	rbroot;		/* iova domain rbtree root */

Re: [PATCH v2 23/24] iommu/arm-smmu: Allow non-strict in pgtable_quirks interface

2021-08-03 Thread Will Deacon
On Tue, Aug 03, 2021 at 01:13:12PM +0100, Robin Murphy wrote:
> On 2021-08-03 11:36, Will Deacon wrote:
> > Overall, I'm really nervous about the concurrency here and think we'd be
> > better off requiring the unbind as we have for the other domain changes.
> 
> Sure, the dynamic switch is what makes it ultimately work for Doug's
> use-case (where the unbind isn't viable), but I had every expectation that
> we might need to hold back those two patches for much deeper consideration.
> It's no accident that the first 22 patches can still be usefully applied
> without them!

Oh, the rest of the series looks great which is why I jumped on this bit!

> In all honesty I don't really like this particular patch much, mostly
> because I increasingly dislike IO_PGTABLE_QUIRK_NON_STRICT at all, but since
> the interface was there it made it super easy to prove the concept. I have a
> more significant refactoring of the io-pgtable code sketched out in my mind
> already, it's just going to be more work.

Intriguing... Move the smp_wmb() into the IOVA code instead?

> > With your changes, I think quite a few things can go wrong.
> > 
> >* cookie->fq_domain may be observed but iovad->fq could be NULL
> 
> Good point, I guess that already technically applies (if iovad->fq sat in a
> write buffer long enough), but it certainly becomes far easier to provoke.
> However a barrier after assigning fq_domain (as mentioned above) paired with
> the control dependency around the queue_iova() call would also fix that,
> right?
> 
> >* We can miss the smp_wmb() in the pgtable code but end up deferring the
> >  IOVA reclaim
> >* iommu_change_dev_def_domain() only holds the group mutex afaict, so can
> >  possibly run concurrently with itself on the same domain?
> >* iommu_dma_init_fq() can flip the domain type back from
> >  IOMMU_DOMAIN_DMA_FQ to IOMMU_DOMAIN_DMA on the error path
> >* set_pgtable_quirks() isn't atomic, which probably is ok for now, but 
> > the
> >  moment we use it anywhere else it's dangerous
> 
> In other words, IO_PGTABLE_QUIRK_NON_STRICT is definitely the problem. I'll
> have a hack on that this afternoon, and if it starts to look rabbit-holey
> I'll split this bit off and post v3 of the rest of the series.
> 
> If all the io-pgtable and fq behaviour for a given call could be consistent
> based on a single READ_ONCE(cookie->fq_domain) in iommu_dma_unmap(), do you
> see any remaining issues other than the point above?

I'd have to see the patches, and I didn't look exhaustively at the current
stuff. But yes, I think the basic flow needs to be that there is an atomic
flag (i.e. cookie->fq_domain) which indicates which mode is being used
and if you flip that concurrently then you need to guarantee that everybody
is either using the old mode or the new mode and not some halfway state in
between.

Will
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v5 3/3] iommu/dart: Add DART iommu driver

2021-08-03 Thread Sven Peter via iommu
Apple's new SoCs use iommus for almost all peripherals. These Device
Address Resolution Tables must be setup before these peripherals can
act as DMA masters.

Tested-by: Alyssa Rosenzweig 
Signed-off-by: Sven Peter 
---
 MAINTAINERS|   1 +
 drivers/iommu/Kconfig  |  14 +
 drivers/iommu/Makefile |   1 +
 drivers/iommu/apple-dart.c | 923 +
 4 files changed, 939 insertions(+)
 create mode 100644 drivers/iommu/apple-dart.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 0f450f7d5336..5f3ef4298594 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1267,6 +1267,7 @@ M:Sven Peter 
 L: iommu@lists.linux-foundation.org
 S: Maintained
 F: Documentation/devicetree/bindings/iommu/apple,dart.yaml
+F: drivers/iommu/apple-dart.c
 
 APPLE SMC DRIVER
 M: Henrik Rydberg 
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index c84da8205be7..dfe81da483e9 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -290,6 +290,20 @@ config SPAPR_TCE_IOMMU
  Enables bits of IOMMU API required by VFIO. The iommu_ops
  is not implemented as it is not necessary for VFIO.
 
+config APPLE_DART
+   tristate "Apple DART IOMMU Support"
+   depends on ARM64 || (COMPILE_TEST && !GENERIC_ATOMIC64)
+   select IOMMU_API
+   select IOMMU_IO_PGTABLE_LPAE
+   default ARCH_APPLE
+   help
+ Support for Apple DART (Device Address Resolution Table) IOMMUs
+ found in Apple ARM SoCs like the M1.
+ This IOMMU is required for most peripherals using DMA to access
+ the main memory.
+
+ Say Y here if you are using an Apple SoC.
+
 # ARM IOMMU support
 config ARM_SMMU
tristate "ARM Ltd. System MMU (SMMU) Support"
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index c0fb0ba88143..bc7f730edbb0 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -29,3 +29,4 @@ obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o
 obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
 obj-$(CONFIG_IOMMU_SVA_LIB) += iommu-sva-lib.o io-pgfault.o
 obj-$(CONFIG_SPRD_IOMMU) += sprd-iommu.o
+obj-$(CONFIG_APPLE_DART) += apple-dart.o
diff --git a/drivers/iommu/apple-dart.c b/drivers/iommu/apple-dart.c
new file mode 100644
index ..559db9259e65
--- /dev/null
+++ b/drivers/iommu/apple-dart.c
@@ -0,0 +1,923 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Apple DART (Device Address Resolution Table) IOMMU driver
+ *
+ * Copyright (C) 2021 The Asahi Linux Contributors
+ *
+ * Based on arm/arm-smmu/arm-smmu.c and arm/arm-smmu-v3/arm-smmu-v3.c
+ *  Copyright (C) 2013 ARM Limited
+ *  Copyright (C) 2015 ARM Limited
+ * and on exynos-iommu.c
+ *  Copyright (c) 2011,2016 Samsung Electronics Co., Ltd.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define DART_MAX_STREAMS 16
+#define DART_MAX_TTBR 4
+#define MAX_DARTS_PER_DEVICE 2
+
+#define DART_STREAM_ALL 0xffff
+
+#define DART_PARAMS1 0x00
+#define DART_PARAMS_PAGE_SHIFT GENMASK(27, 24)
+
+#define DART_PARAMS2 0x04
+#define DART_PARAMS_BYPASS_SUPPORT BIT(0)
+
+#define DART_STREAM_COMMAND 0x20
+#define DART_STREAM_COMMAND_BUSY BIT(2)
+#define DART_STREAM_COMMAND_INVALIDATE BIT(20)
+
+#define DART_STREAM_SELECT 0x34
+
+#define DART_ERROR 0x40
+#define DART_ERROR_STREAM GENMASK(27, 24)
+#define DART_ERROR_CODE GENMASK(11, 0)
+#define DART_ERROR_FLAG BIT(31)
+
+#define DART_ERROR_READ_FAULT BIT(4)
+#define DART_ERROR_WRITE_FAULT BIT(3)
+#define DART_ERROR_NO_PTE BIT(2)
+#define DART_ERROR_NO_PMD BIT(1)
+#define DART_ERROR_NO_TTBR BIT(0)
+
+#define DART_CONFIG 0x60
+#define DART_CONFIG_LOCK BIT(15)
+
+#define DART_STREAM_COMMAND_BUSY_TIMEOUT 100
+
+#define DART_ERROR_ADDR_HI 0x54
+#define DART_ERROR_ADDR_LO 0x50
+
+#define DART_TCR(sid) (0x100 + 4 * (sid))
+#define DART_TCR_TRANSLATE_ENABLE BIT(7)
+#define DART_TCR_BYPASS0_ENABLE BIT(8)
+#define DART_TCR_BYPASS1_ENABLE BIT(12)
+
+#define DART_TTBR(sid, idx) (0x200 + 16 * (sid) + 4 * (idx))
+#define DART_TTBR_VALID BIT(31)
+#define DART_TTBR_SHIFT 12
+
+/*
+ * Private structure associated with each DART device.
+ *
+ * @dev: device struct
+ * @regs: mapped MMIO region
+ * @irq: interrupt number, can be shared with other DARTs
+ * @clks: clocks associated with this DART
+ * @num_clks: number of @clks
+ * @lock: lock for hardware operations involving this dart
+ * @pgsize: pagesize supported by this DART
+ * @supports_bypass: indicates if this DART supports bypass mode
+ * @force_bypass: force bypass mode due to pagesize mismatch?
+ * @sid2group: maps stream ids to iommu_groups
+ * @iommu: iommu core device
+ */
+struct apple_dart {
+   struct device *dev;
+
+   void __iomem *regs;
+
+   int irq;
+   struct clk_bulk_data *clks;
+   int num_clks;
+
+   spinlock_t lock;
+
+

[PATCH v5 1/3] iommu/io-pgtable: Add DART pagetable format

2021-08-03 Thread Sven Peter via iommu
Apple's DART iommu uses a pagetable format that shares some
similarities with the ones already implemented by io-pgtable.c.
Add a new format variant to support the required differences
so that we don't have to duplicate the pagetable handling code.

Reviewed-by: Alexander Graf 
Reviewed-by: Alyssa Rosenzweig 
Reviewed-by: Robin Murphy 
Signed-off-by: Sven Peter 
---
 drivers/iommu/io-pgtable-arm.c | 63 ++
 drivers/iommu/io-pgtable.c |  1 +
 include/linux/io-pgtable.h |  7 
 3 files changed, 71 insertions(+)

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 053df4048a29..0779eb96bd29 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -130,6 +130,9 @@
 #define ARM_MALI_LPAE_MEMATTR_IMP_DEF  0x88ULL
 #define ARM_MALI_LPAE_MEMATTR_WRITE_ALLOC 0x8DULL
 
+#define APPLE_DART_PTE_PROT_NO_WRITE (1<<7)
+#define APPLE_DART_PTE_PROT_NO_READ (1<<8)
+
 /* IOPTE accessors */
 #define iopte_deref(pte,d) __va(iopte_to_paddr(pte, d))
 
@@ -402,6 +405,15 @@ static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
 {
arm_lpae_iopte pte;
 
+   if (data->iop.fmt == APPLE_DART) {
+   pte = 0;
+   if (!(prot & IOMMU_WRITE))
+   pte |= APPLE_DART_PTE_PROT_NO_WRITE;
+   if (!(prot & IOMMU_READ))
+   pte |= APPLE_DART_PTE_PROT_NO_READ;
+   return pte;
+   }
+
if (data->iop.fmt == ARM_64_LPAE_S1 ||
data->iop.fmt == ARM_32_LPAE_S1) {
pte = ARM_LPAE_PTE_nG;
@@ -1102,6 +1114,52 @@ arm_mali_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg, void *cookie)
return NULL;
 }
 
+static struct io_pgtable *
+apple_dart_alloc_pgtable(struct io_pgtable_cfg *cfg, void *cookie)
+{
+   struct arm_lpae_io_pgtable *data;
+   int i;
+
+   if (cfg->oas > 36)
+   return NULL;
+
+   data = arm_lpae_alloc_pgtable(cfg);
+   if (!data)
+   return NULL;
+
+   /*
+* The table format itself always uses two levels, but the total VA
+* space is mapped by four separate tables, making the MMIO registers
+* an effective "level 1". For simplicity, though, we treat this
+* equivalently to LPAE stage 2 concatenation at level 2, with the
+* additional TTBRs each just pointing at consecutive pages.
+*/
+   if (data->start_level < 1)
+   goto out_free_data;
+   if (data->start_level == 1 && data->pgd_bits > 2)
+   goto out_free_data;
+   if (data->start_level > 1)
+   data->pgd_bits = 0;
+   data->start_level = 2;
+   cfg->apple_dart_cfg.n_ttbrs = 1 << data->pgd_bits;
+   data->pgd_bits += data->bits_per_level;
+
+   data->pgd = __arm_lpae_alloc_pages(ARM_LPAE_PGD_SIZE(data), GFP_KERNEL,
+  cfg);
+   if (!data->pgd)
+   goto out_free_data;
+
+   for (i = 0; i < cfg->apple_dart_cfg.n_ttbrs; ++i)
+   cfg->apple_dart_cfg.ttbr[i] =
+   virt_to_phys(data->pgd + i * ARM_LPAE_GRANULE(data));
+
+   return &data->iop;
+
+out_free_data:
+   kfree(data);
+   return NULL;
+}
+
 struct io_pgtable_init_fns io_pgtable_arm_64_lpae_s1_init_fns = {
.alloc  = arm_64_lpae_alloc_pgtable_s1,
.free   = arm_lpae_free_pgtable,
@@ -1127,6 +1185,11 @@ struct io_pgtable_init_fns io_pgtable_arm_mali_lpae_init_fns = {
.free   = arm_lpae_free_pgtable,
 };
 
+struct io_pgtable_init_fns io_pgtable_apple_dart_init_fns = {
+   .alloc  = apple_dart_alloc_pgtable,
+   .free   = arm_lpae_free_pgtable,
+};
+
 #ifdef CONFIG_IOMMU_IO_PGTABLE_LPAE_SELFTEST
 
 static struct io_pgtable_cfg *cfg_cookie __initdata;
diff --git a/drivers/iommu/io-pgtable.c b/drivers/iommu/io-pgtable.c
index 6e9917ce980f..f4bfcef98297 100644
--- a/drivers/iommu/io-pgtable.c
+++ b/drivers/iommu/io-pgtable.c
@@ -20,6 +20,7 @@ io_pgtable_init_table[IO_PGTABLE_NUM_FMTS] = {
 	[ARM_64_LPAE_S1] = &io_pgtable_arm_64_lpae_s1_init_fns,
 	[ARM_64_LPAE_S2] = &io_pgtable_arm_64_lpae_s2_init_fns,
 	[ARM_MALI_LPAE] = &io_pgtable_arm_mali_lpae_init_fns,
+	[APPLE_DART] = &io_pgtable_apple_dart_init_fns,
 #endif
 #ifdef CONFIG_IOMMU_IO_PGTABLE_ARMV7S
 	[ARM_V7S] = &io_pgtable_arm_v7s_init_fns,
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index c43f3b899d2a..a738483fb4da 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -16,6 +16,7 @@ enum io_pgtable_fmt {
ARM_V7S,
ARM_MALI_LPAE,
AMD_IOMMU_V1,
+   APPLE_DART,
IO_PGTABLE_NUM_FMTS,
 };
 
@@ -136,6 +137,11 @@ struct io_pgtable_cfg {
u64 transtab;
u64 memattr;
} arm_mali_lpae_cfg;
+
+   struct {
+   u64 ttbr[4];
+   u32 n_ttbrs;
+   } apple_dart_cfg;

[PATCH v5 2/3] dt-bindings: iommu: add DART iommu bindings

2021-08-03 Thread Sven Peter via iommu
DART (Device Address Resolution Table) is the iommu found on Apple
ARM SoCs such as the M1.

Reviewed-by: Rob Herring 
Reviewed-by: Alyssa Rosenzweig 
Signed-off-by: Sven Peter 
---
 .../devicetree/bindings/iommu/apple,dart.yaml | 81 +++
 MAINTAINERS   |  6 ++
 2 files changed, 87 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/iommu/apple,dart.yaml

diff --git a/Documentation/devicetree/bindings/iommu/apple,dart.yaml b/Documentation/devicetree/bindings/iommu/apple,dart.yaml
new file mode 100644
index ..94aa9e9afa59
--- /dev/null
+++ b/Documentation/devicetree/bindings/iommu/apple,dart.yaml
@@ -0,0 +1,81 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/iommu/apple,dart.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Apple DART IOMMU
+
+maintainers:
+  - Sven Peter 
+
+description: |+
+  Apple SoCs may contain an implementation of their Device Address
+  Resolution Table which provides a mandatory layer of address
+  translations for various masters.
+
+  Each DART instance is capable of handling up to 16 different streams
+  with individual pagetables and page-level read/write protection flags.
+
+  This DART IOMMU also raises interrupts in response to various
+  fault conditions.
+
+properties:
+  compatible:
+const: apple,t8103-dart
+
+  reg:
+maxItems: 1
+
+  interrupts:
+maxItems: 1
+
+  clocks:
+description:
+  Reference to the gate clock phandle if required for this IOMMU.
+  Optional since not all IOMMUs are attached to a clock gate.
+
+  '#iommu-cells':
+const: 1
+description:
+  Has to be one. The single cell describes the stream id emitted by
+  a master to the IOMMU.
+
+required:
+  - compatible
+  - reg
+  - '#iommu-cells'
+  - interrupts
+
+additionalProperties: false
+
+examples:
+  - |+
+dart1: iommu@82f80000 {
+  compatible = "apple,t8103-dart";
+  reg = <0x82f80000 0x4000>;
+  interrupts = <1 781 4>;
+  #iommu-cells = <1>;
+};
+
+master1 {
+  iommus = <&dart1 0>;
+};
+
+  - |+
+dart2a: iommu@82f00000 {
+  compatible = "apple,t8103-dart";
+  reg = <0x82f00000 0x4000>;
+  interrupts = <1 781 4>;
+  #iommu-cells = <1>;
+};
+dart2b: iommu@82f80000 {
+  compatible = "apple,t8103-dart";
+  reg = <0x82f80000 0x4000>;
+  interrupts = <1 781 4>;
+  #iommu-cells = <1>;
+};
+
+master2 {
+  iommus = <&dart2a 0>, <&dart2b 1>;
+};
diff --git a/MAINTAINERS b/MAINTAINERS
index c9467d2839f5..0f450f7d5336 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1262,6 +1262,12 @@ L:   linux-in...@vger.kernel.org
 S: Odd fixes
 F: drivers/input/mouse/bcm5974.c
 
+APPLE DART IOMMU DRIVER
+M: Sven Peter 
+L: iommu@lists.linux-foundation.org
+S: Maintained
+F: Documentation/devicetree/bindings/iommu/apple,dart.yaml
+
 APPLE SMC DRIVER
 M: Henrik Rydberg 
 L: linux-hw...@vger.kernel.org
-- 
2.25.1



[PATCH v5 0/3] Apple M1 DART IOMMU driver

2021-08-03 Thread Sven Peter via iommu
Hi,

This is v5 of my Apple M1 DART IOMMU driver series as a follow up to the 
previous
versions [1][2][3][7].

Short summary: this series adds support for the iommu found in Apple's new M1
SoC which is required to use DMA on most peripherals like the display 
controller,
the USB ports or the internal PCIe bus (which is used for WiFi, Ethernet and
more USB ports).
So far this code has been tested by multiple people with dwc3 in host and
device mode (which both only require changes to the device tree after this
patchset) and PCIe (using a yet to be finalized patchset).

Note that this version has to be applied on top of iommu/core or iommu/next 
since
it now uses the new map_pages/unmap_pages API.


== Testing this patchset with the USB-C controller == 

The two USB-C ports on the M1 machines are exposed as two separate dwc3
controllers which are behind a DART. Now that my USB phy bringup code has been
merged into our bootloader m1n1 you can easily test this patchset yourself:

1) Follow the instructions at [4] to setup our bootloader m1n1 on your M1
   machine which will allow you to boot kernels using a normal USB cable.
   Note that you'll still need a special setup to expose the UART for very
   low-level debugging.

2) Apply this patchset and add the DART and dwc3 nodes as done in e.g. [5].

3) Boot the kernel through our bootloader m1n1. You'll need a version after
   commit [6] which enables the USB PHY and the USB PD chip.

Note that the dwc3 controller has a quirk where each root port can only be used
once right now. The most stable way to test is to have the USB
device(s) already connected before booting the kernel.

It's also possible to test the PCIe bus but this requires a more complex setup
for now. I can write a quick howto if anyone is interested though.
(tl;dr: Mark Kettenis has a u-boot fork that includes PCIe bringup code and
Marc Zyngier has a WIP patchset to add a PCIe driver)

== Project Blurb ==

Asahi Linux is an open community project dedicated to developing and
maintaining mainline support for Apple Silicon on Linux. Feel free to
drop by #asahi and #asahi-dev on OFTC to chat with us, or check
our website for more information on the project:

https://asahilinux.org/

== Changes ==

Changes for v5:
 - Added reviewed-by and tested-by tags (thanks!)
 - Rebased on top of iommu/core and replaced map/unmap with 
map_pages/unmap_pages
 - Removed software bypass hacks: I've tried a few different variants now
   and they all have drawbacks or incompatibilities that I am not comfortable
   with. This means that PCIe devices (for which there is no kernel support yet
   anyway) will not work correctly for now on 4K kernels. I plan to address this
   in a follow-up series where I want to modify the dma-iommu layer to support
   pagesize mismatches.
 - Removed reference to ARM from the constants for the io-pgtable code
 - Addressed the following comments by Robin Murphy, which resulted in some 
major
   changes to apple-dart.c
   - Correctly assign iommu_groups in apple_dart_device_group
   - Get rid of the fwspec-inspired of_xlate linked lists and replaced them with
 a simple static array with a streamid bitmap covering all known cases
   - Relax locking: Now only a single spinlock around TLB flushes and a mutex
 around domain initialization are required. attach_dev/detach_dev uses
 atomic64_t.
   - Set .suppress_bind_attrs to prevent manual unbinding
   - Get rid of .shutdown since there's no real need to clean anything up
   - Manage interrupts manually instead of using devm_* to prevent situations
 where a shared interrupt could trigger while clocks are disabled
   - apple_dart_irq now prints "unknown" when more than a single error bit
 has been set.
   - Use DL_FLAG_AUTOREMOVE_SUPPLIER so that there's no need to keep track
 of the pointer
   - Ignore any unknown protection flags instead of failing
   - Use dev_err_ratelimited and clk_bulk_disable_unprepare instead of
 open-coding them
   - Correctly set and clear iommu_bus_ops
   - Removed unhelpful WARN_ONs and duplicate sanity checks
   - Removed unnecessary identity stream map reset code 
   - Renamed apple-dart-iommu.c to apple-dart.c
   - Fixed commit style to use the correct subsystem style
 - Possibly some smaller fixes I forgot about

Changes for v4:
 - Addressed Rob Herring's remark about the incorrect phandles in the device
   tree binding example and added his reviewed-by tag
 - Take the software linear mapping range from the bus instead of hardcoding
   it in the driver
 - Use def_domain_type to force bypass mode if there's a pagesize mismatch
   between the DART (hardwired to 16KB) and the kernel (may use 4K)
 - Added lockdep_assert_held instead of comments as suggested by Rouven 
Czerwinski
 - rebased on 5.13-rc7

Changes for v3:
 - fixed name of the iommu node in the device tree binding example
   pointed out by Arnd Bergmann
 - remove hardware-specific checks from io-pgtable.c as pointed out by
   

Re: [PATCH v2 23/24] iommu/arm-smmu: Allow non-strict in pgtable_quirks interface

2021-08-03 Thread Robin Murphy

On 2021-08-03 11:36, Will Deacon wrote:

On Mon, Aug 02, 2021 at 03:15:50PM +0100, Robin Murphy wrote:

On 2021-08-02 14:04, Will Deacon wrote:

On Wed, Jul 28, 2021 at 04:58:44PM +0100, Robin Murphy wrote:

To make io-pgtable aware of a flush queue being dynamically enabled,
allow IO_PGTABLE_QUIRK_NON_STRICT to be set even after a domain has been
attached to, and hook up the final piece of the puzzle in iommu-dma.

Signed-off-by: Robin Murphy 
---
   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 15 +++
   drivers/iommu/arm/arm-smmu/arm-smmu.c   | 11 +++
   drivers/iommu/dma-iommu.c   |  3 +++
   3 files changed, 29 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 19400826eba7..40fa9cb382c3 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2711,6 +2711,20 @@ static int arm_smmu_enable_nesting(struct iommu_domain *domain)
return ret;
   }
+static int arm_smmu_set_pgtable_quirks(struct iommu_domain *domain,
+   unsigned long quirks)
+{
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+
+   if (quirks == IO_PGTABLE_QUIRK_NON_STRICT && smmu_domain->pgtbl_ops) {
+   struct io_pgtable *iop = io_pgtable_ops_to_pgtable(smmu_domain->pgtbl_ops);
+
+   iop->cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT;
+   return 0;
+   }
+   return -EINVAL;
+}


I don't see anything serialising this against a concurrent iommu_unmap(), so
the ordering and atomicity looks quite suspicious to me here. I don't think
it's just the page-table quirks either, but also setting cookie->fq_domain.


Heh, I confess to very much taking the cheeky "let's say nothing and see
what Will thinks about concurrency" approach here :)


Damnit, I fell for that didn't I?

Overall, I'm really nervous about the concurrency here and think we'd be
better off requiring the unbind as we have for the other domain changes.


Sure, the dynamic switch is what makes it ultimately work for Doug's 
use-case (where the unbind isn't viable), but I had every expectation 
that we might need to hold back those two patches for much deeper 
consideration. It's no accident that the first 22 patches can still be 
usefully applied without them!


In all honesty I don't really like this particular patch much, mostly 
because I increasingly dislike IO_PGTABLE_QUIRK_NON_STRICT at all, but 
since the interface was there it made it super easy to prove the 
concept. I have a more significant refactoring of the io-pgtable code 
sketched out in my mind already, it's just going to be more work.



The beauty of only allowing relaxation in the strict->non-strict direction
is that it shouldn't need serialising as such - it doesn't matter if the
update to cookie->fq_domain is observed between iommu_unmap() and
iommu_dma_free_iova(), since there's no correctness impact to queueing IOVAs
which may already have been invalidated and may or may not have been synced.
AFAICS the only condition which matters is that the setting of the
io-pgtable quirk must observe fq_domain being set. It feels like there must
be enough dependencies on the read side, but we might need an smp_wmb()
between the two in iommu_dma_init_fq()?

I've also flip-flopped a bit on whether fq_domain needs to be accessed with
READ_ONCE/WRITE_ONCE - by the time of posting I'd convinced myself that it
was probably OK, but looking again now I suppose this wacky reordering is
theoretically possible:


iommu_dma_unmap() {
bool free_fq = cookie->fq_domain; // == false

iommu_unmap();

if (!cookie->fq_domain) // observes new non-NULL value
iommu_tlb_sync(); // skipped

iommu_dma_free_iova { // inlined
if (free_fq) // false
queue_iova();
else
free_iova_fast(); // Uh-oh!
}
}

so although I still can't see atomicity being a problem I guess we do need
it for the sake of reordering at least.


With your changes, I think quite a few things can go wrong.

   * cookie->fq_domain may be observed but iovad->fq could be NULL


Good point, I guess that already technically applies (if iovad->fq sat 
in a write buffer long enough), but it certainly becomes far easier to 
provoke. However a barrier after assigning fq_domain (as mentioned 
above) paired with the control dependency around the queue_iova() call 
would also fix that, right?



   * We can miss the smp_wmb() in the pgtable code but end up deferring the
 IOVA reclaim
   * iommu_change_dev_def_domain() only holds the group mutex afaict, so can
 possibly run concurrently with itself on the same domain?
   * iommu_dma_init_fq() can flip the domain type back from
 IOMMU_DOMAIN_DMA_FQ to IOMMU_DOMAIN_DMA on the error path

Re: [PATCH v10 01/17] iova: Export alloc_iova_fast() and free_iova_fast()

2021-08-03 Thread Robin Murphy

On 2021-08-03 09:54, Yongji Xie wrote:

On Tue, Aug 3, 2021 at 3:41 PM Jason Wang  wrote:



On 2021/7/29 at 3:34 PM, Xie Yongji wrote:

Export alloc_iova_fast() and free_iova_fast() so that
some modules can use it to improve iova allocation efficiency.



It's better to explain why alloc_iova() is not sufficient here.



Fine.


What I fail to understand from the later patches is what the IOVA domain 
actually represents. If the "device" is a userspace process then 
logically the "IOVA" would be the userspace address, so presumably 
somewhere you're having to translate between this arbitrary address 
space and actual usable addresses - if you're worried about efficiency 
surely it would be even better to not do that?


Presumably userspace doesn't have any concern about alignment and the 
things we have to worry about for the DMA API in general, so it's pretty 
much just allocating slots in a buffer, and there are far more effective 
ways to do that than a full-blown address space manager. If you're going 
to reuse any infrastructure I'd have expected it to be SWIOTLB rather 
than the IOVA allocator. Because, y'know, you're *literally implementing 
a software I/O TLB* ;)


Robin.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v2 23/24] iommu/arm-smmu: Allow non-strict in pgtable_quirks interface

2021-08-03 Thread Will Deacon
On Mon, Aug 02, 2021 at 03:15:50PM +0100, Robin Murphy wrote:
> On 2021-08-02 14:04, Will Deacon wrote:
> > On Wed, Jul 28, 2021 at 04:58:44PM +0100, Robin Murphy wrote:
> > > To make io-pgtable aware of a flush queue being dynamically enabled,
> > > allow IO_PGTABLE_QUIRK_NON_STRICT to be set even after a domain has been
> > > attached to, and hook up the final piece of the puzzle in iommu-dma.
> > > 
> > > Signed-off-by: Robin Murphy 
> > > ---
> > >   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 15 +++
> > >   drivers/iommu/arm/arm-smmu/arm-smmu.c   | 11 +++
> > >   drivers/iommu/dma-iommu.c   |  3 +++
> > >   3 files changed, 29 insertions(+)
> > > 
> > > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
> > > b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > index 19400826eba7..40fa9cb382c3 100644
> > > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > @@ -2711,6 +2711,20 @@ static int arm_smmu_enable_nesting(struct 
> > > iommu_domain *domain)
> > >   return ret;
> > >   }
> > > +static int arm_smmu_set_pgtable_quirks(struct iommu_domain *domain,
> > > + unsigned long quirks)
> > > +{
> > > + struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> > > +
> > > + if (quirks == IO_PGTABLE_QUIRK_NON_STRICT && smmu_domain->pgtbl_ops) {
> > > + struct io_pgtable *iop = 
> > > io_pgtable_ops_to_pgtable(smmu_domain->pgtbl_ops);
> > > +
> > > + iop->cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT;
> > > + return 0;
> > > + }
> > > + return -EINVAL;
> > > +}
> > 
> > I don't see anything serialising this against a concurrent iommu_unmap(), so
> > the ordering and atomicity looks quite suspicious to me here. I don't think
> > it's just the page-table quirks either, but also setting cookie->fq_domain.
> 
> Heh, I confess to very much taking the cheeky "let's say nothing and see
> what Will thinks about concurrency" approach here :)

Damnit, I fell for that didn't I?

Overall, I'm really nervous about the concurrency here and think we'd be
better off requiring the unbind as we have for the other domain changes.

> The beauty of only allowing relaxation in the strict->non-strict direction
> is that it shouldn't need serialising as such - it doesn't matter if the
> update to cookie->fq_domain is observed between iommu_unmap() and
> iommu_dma_free_iova(), since there's no correctness impact to queueing IOVAs
> which may already have been invalidated and may or may not have been synced.
> AFAICS the only condition which matters is that the setting of the
> io-pgtable quirk must observe fq_domain being set. It feels like there must
> be enough dependencies on the read side, but we might need an smp_wmb()
> between the two in iommu_dma_init_fq()?
> 
> I've also flip-flopped a bit on whether fq_domain needs to be accessed with
> READ_ONCE/WRITE_ONCE - by the time of posting I'd convinced myself that it
> was probably OK, but looking again now I suppose this wacky reordering is
> theoretically possible:
> 
> 
>   iommu_dma_unmap() {
>   bool free_fq = cookie->fq_domain; // == false
> 
>   iommu_unmap();
> 
>   if (!cookie->fq_domain) // observes new non-NULL value
>   iommu_tlb_sync(); // skipped
> 
>   iommu_dma_free_iova { // inlined
>   if (free_fq) // false
>   queue_iova();
>   else
>   free_iova_fast(); // Uh-oh!
>   }
>   }
> 
> so although I still can't see atomicity being a problem I guess we do need
> it for the sake of reordering at least.

With your changes, I think quite a few things can go wrong.

  * cookie->fq_domain may be observed but iovad->fq could be NULL
  * We can miss the smp_wmb() in the pgtable code but end up deferring the
IOVA reclaim
  * iommu_change_dev_def_domain() only holds the group mutex afaict, so can
possibly run concurrently with itself on the same domain?
  * iommu_dma_init_fq() can flip the domain type back from
IOMMU_DOMAIN_DMA_FQ to IOMMU_DOMAIN_DMA on the error path
  * set_pgtable_quirks() isn't atomic, which probably is ok for now, but the
moment we use it anywhere else it's dangerous

and I suspect there are more :/

Will


Re: [PATCH v10 05/17] vhost-vdpa: Fail the vhost_vdpa_set_status() on reset failure

2021-08-03 Thread Yongji Xie
On Tue, Aug 3, 2021 at 4:10 PM Jason Wang  wrote:
>
>
> On 2021/7/29 at 3:34 PM, Xie Yongji wrote:
> > Re-read the device status to ensure it's set to zero during
> > resetting. Otherwise, fail the vhost_vdpa_set_status() after timeout.
> >
> > Signed-off-by: Xie Yongji 
> > ---
> >   drivers/vhost/vdpa.c | 11 ++-
> >   1 file changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> > index b07aa161f7ad..dd05c1e1133c 100644
> > --- a/drivers/vhost/vdpa.c
> > +++ b/drivers/vhost/vdpa.c
> > @@ -157,7 +157,7 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, 
> > u8 __user *statusp)
> >   struct vdpa_device *vdpa = v->vdpa;
> >   const struct vdpa_config_ops *ops = vdpa->config;
> >   u8 status, status_old;
> > - int nvqs = v->nvqs;
> > + int timeout = 0, nvqs = v->nvqs;
> >   u16 i;
> >
> >   if (copy_from_user(&status, statusp, sizeof(status)))
> > @@ -173,6 +173,15 @@ static long vhost_vdpa_set_status(struct vhost_vdpa 
> > *v, u8 __user *statusp)
> >   return -EINVAL;
> >
> >   ops->set_status(vdpa, status);
> > + if (status == 0) {
> > + while (ops->get_status(vdpa)) {
> > + timeout += 20;
> > + if (timeout > VDPA_RESET_TIMEOUT_MS)
> > + return -EIO;
> > +
> > + msleep(20);
> > + }
>
>
> The spec has introduced reset as one of the basic facilities. And consider
> that we defer reset here.
>
> This makes me think if it's better to introduce a dedicated vdpa ops for
> reset?
>

Do you mean replace the ops.set_status(vdev, 0) with the ops.reset()?
Then we can remove the timeout handling, which is device-specific.

Thanks,
Yongji

Re: [PATCH v10 10/17] virtio: Handle device reset failure in register_virtio_device()

2021-08-03 Thread Yongji Xie
On Tue, Aug 3, 2021 at 4:09 PM Jason Wang  wrote:
>
>
> On 2021/7/29 at 3:34 PM, Xie Yongji wrote:
> > The device reset may fail in virtio-vdpa case now, so add checks to
> > its return value and fail the register_virtio_device().
>
>
> So the reset() would be called by the driver during remove as well, or
> is it sufficient to deal only with the reset during probe?
>

Actually there is no way to handle failure during removal. And it
should be safe with the protection of software IOTLB even if the
reset() fails.

Thanks,
Yongji

Re: [PATCH v10 04/17] vdpa: Fail the vdpa_reset() if fail to set device status to zero

2021-08-03 Thread Yongji Xie
On Tue, Aug 3, 2021 at 3:58 PM Jason Wang  wrote:
>
>
> On 2021/7/29 at 3:34 PM, Xie Yongji wrote:
> > Re-read the device status to ensure it's set to zero during
> > resetting. Otherwise, fail the vdpa_reset() after timeout.
> >
> > Signed-off-by: Xie Yongji 
> > ---
> >   include/linux/vdpa.h | 15 ++-
> >   1 file changed, 14 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
> > index 406d53a606ac..d1a80ef05089 100644
> > --- a/include/linux/vdpa.h
> > +++ b/include/linux/vdpa.h
> > @@ -6,6 +6,7 @@
> >   #include 
> >   #include 
> >   #include 
> > +#include 
> >
> >   /**
> >* struct vdpa_calllback - vDPA callback definition.
> > @@ -340,12 +341,24 @@ static inline struct device *vdpa_get_dma_dev(struct 
> > vdpa_device *vdev)
> >   return vdev->dma_dev;
> >   }
> >
> > -static inline void vdpa_reset(struct vdpa_device *vdev)
> > +#define VDPA_RESET_TIMEOUT_MS 1000
> > +
> > +static inline int vdpa_reset(struct vdpa_device *vdev)
> >   {
> >   const struct vdpa_config_ops *ops = vdev->config;
> > + int timeout = 0;
> >
> >   vdev->features_valid = false;
> >   ops->set_status(vdev, 0);
> > + while (ops->get_status(vdev)) {
> > + timeout += 20;
> > + if (timeout > VDPA_RESET_TIMEOUT_MS)
> > + return -EIO;
> > +
> > + msleep(20);
> > + }
>
>
> I wonder if it's better to do this in the vDPA parent?
>
> Thanks
>

Sorry, I didn't get you here. Do you mean vDPA parent driver (e.g.
VDUSE)? Actually I didn't find any other place where I can do
set_status() and get_status().

Thanks,
Yongji

Re: [PATCH v10 07/17] virtio: Don't set FAILED status bit on device index allocation failure

2021-08-03 Thread Yongji Xie
On Tue, Aug 3, 2021 at 4:03 PM Jason Wang  wrote:
>
>
> On 2021/7/29 at 3:34 PM, Xie Yongji wrote:
> > We don't need to set FAILED status bit on device index allocation
> > failure since the device initialization hasn't been started yet.
> > This doesn't affect runtime, found in code review.
> >
> > Signed-off-by: Xie Yongji 
>
>
> Does it really harm?
>

Actually not. I think I can remove this patch if we don't need it.

Thanks,
Yongji

Re: [PATCH v10 03/17] vdpa: Fix code indentation

2021-08-03 Thread Yongji Xie
On Tue, Aug 3, 2021 at 3:51 PM Jason Wang  wrote:
>
>
> On 2021/7/29 at 3:34 PM, Xie Yongji wrote:
> > Use tabs to indent the code instead of spaces.
> >
> > Signed-off-by: Xie Yongji 
> > ---
> >   include/linux/vdpa.h | 29 ++---
> >   1 file changed, 14 insertions(+), 15 deletions(-)
>
>
> It looks to me that not all the warnings are addressed.
>
> Or did you silence checkpatch.pl -f?
>

This patch only fixes the code indent issue. I will address all
warnings in the next version.

Thanks,
Yongji

Re: [PATCH v10 02/17] file: Export receive_fd() to modules

2021-08-03 Thread Yongji Xie
On Tue, Aug 3, 2021 at 3:46 PM Jason Wang  wrote:
>
>
> On 2021/7/29 at 3:34 PM, Xie Yongji wrote:
> > Export receive_fd() so that some modules can use
> > it to pass a file descriptor between processes without
> > missing any security checks.
> >
> > Signed-off-by: Xie Yongji 
> > ---
> >   fs/file.c| 6 ++
> >   include/linux/file.h | 7 +++
> >   2 files changed, 9 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/file.c b/fs/file.c
> > index 86dc9956af32..210e540672aa 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -1134,6 +1134,12 @@ int receive_fd_replace(int new_fd, struct file 
> > *file, unsigned int o_flags)
> >   return new_fd;
> >   }
> >
> > +int receive_fd(struct file *file, unsigned int o_flags)
> > +{
> > + return __receive_fd(file, NULL, o_flags);
>
>
> Any reason that receive_fd_user() can live in the file.h?
>

Since no modules use it.

Thanks,
Yongji

Re: [PATCH v10 01/17] iova: Export alloc_iova_fast() and free_iova_fast()

2021-08-03 Thread Yongji Xie
On Tue, Aug 3, 2021 at 3:41 PM Jason Wang  wrote:
>
>
> On 2021/7/29 at 3:34 PM, Xie Yongji wrote:
> > Export alloc_iova_fast() and free_iova_fast() so that
> > some modules can use it to improve iova allocation efficiency.
>
>
> It's better to explain why alloc_iova() is not sufficient here.
>

Fine.

Thanks,
Yongji

Re: [PATCH v10 17/17] Documentation: Add documentation for VDUSE

2021-08-03 Thread Yongji Xie
On Tue, Aug 3, 2021 at 3:35 PM Jason Wang  wrote:
>
>
> On 2021/7/29 at 3:35 PM, Xie Yongji wrote:
> > VDUSE (vDPA Device in Userspace) is a framework to support
> > implementing software-emulated vDPA devices in userspace. This
> > document is intended to clarify the VDUSE design and usage.
> >
> > Signed-off-by: Xie Yongji 
> > ---
> >   Documentation/userspace-api/index.rst |   1 +
> >   Documentation/userspace-api/vduse.rst | 232 
> > ++
> >   2 files changed, 233 insertions(+)
> >   create mode 100644 Documentation/userspace-api/vduse.rst
> >
> > diff --git a/Documentation/userspace-api/index.rst 
> > b/Documentation/userspace-api/index.rst
> > index 0b5eefed027e..c432be070f67 100644
> > --- a/Documentation/userspace-api/index.rst
> > +++ b/Documentation/userspace-api/index.rst
> > @@ -27,6 +27,7 @@ place where this information is gathered.
> >  iommu
> >  media/index
> >  sysfs-platform_profile
> > +   vduse
> >
> >   .. only::  subproject and html
> >
> > diff --git a/Documentation/userspace-api/vduse.rst 
> > b/Documentation/userspace-api/vduse.rst
> > new file mode 100644
> > index ..30c9d1482126
> > --- /dev/null
> > +++ b/Documentation/userspace-api/vduse.rst
> > @@ -0,0 +1,232 @@
> > +==
> > +VDUSE - "vDPA Device in Userspace"
> > +==
> > +
> > +vDPA (virtio data path acceleration) device is a device that uses a
> > +datapath which complies with the virtio specifications with vendor
> > +specific control path. vDPA devices can be both physically located on
> > +the hardware or emulated by software. VDUSE is a framework that makes it
> > +possible to implement software-emulated vDPA devices in userspace. And
> > +to make the device emulation more secure, the emulated vDPA device's
> > +control path is handled in the kernel and only the data path is
> > +implemented in the userspace.
> > +
> > +Note that only the virtio block device is supported by the VDUSE framework now,
> > +which can reduce security risks when the userspace process that implements
> > +the data path is run by an unprivileged user. Support for other device
> > +types can be added after the security issues of the corresponding device
> > +drivers are clarified or fixed in the future.
> > +
> > +Create/Destroy VDUSE devices
> > +
> > +
> > +VDUSE devices are created as follows:
> > +
> > +1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
> > +   /dev/vduse/control.
> > +
> > +2. Setup each virtqueue with ioctl(VDUSE_VQ_SETUP) on /dev/vduse/$NAME.
> > +
> > +3. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
> > +   messages will arrive while attaching the VDUSE instance to vDPA bus.
> > +
> > +4. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
> > +   instance to vDPA bus.
> > +
> > +VDUSE devices are destroyed as follows:
> > +
> > +1. Send the VDPA_CMD_DEV_DEL netlink message to detach the VDUSE
> > +   instance from vDPA bus.
> > +
> > +2. Close the file descriptor referring to /dev/vduse/$NAME.
> > +
> > +3. Destroy the VDUSE instance with ioctl(VDUSE_DESTROY_DEV) on
> > +   /dev/vduse/control.
> > +
> > +The netlink messages can be sent via vdpa tool in iproute2 or use the
> > +below sample codes:
> > +
> > +.. code-block:: c
> > +
> > + static int netlink_add_vduse(const char *name, enum vdpa_command cmd)
> > + {
> > + struct nl_sock *nlsock;
> > + struct nl_msg *msg;
> > + int famid;
> > +
> > + nlsock = nl_socket_alloc();
> > + if (!nlsock)
> > + return -ENOMEM;
> > +
> > + if (genl_connect(nlsock))
> > + goto free_sock;
> > +
> > + famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
> > + if (famid < 0)
> > + goto close_sock;
> > +
> > + msg = nlmsg_alloc();
> > + if (!msg)
> > + goto close_sock;
> > +
> > + if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0, 
> > cmd, 0))
> > + goto nla_put_failure;
> > +
> > + NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
> > + if (cmd == VDPA_CMD_DEV_NEW)
> > + NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, 
> > "vduse");
> > +
> > + if (nl_send_sync(nlsock, msg))
> > + goto close_sock;
> > +
> > + nl_close(nlsock);
> > + nl_socket_free(nlsock);
> > +
> > + return 0;
> > + nla_put_failure:
> > + nlmsg_free(msg);
> > + close_sock:
> > + nl_close(nlsock);
> > + free_sock:
> > + nl_socket_free(nlsock);
> > + return -1;
> > + }
> > +
> > +How VDUSE works
> > +---
> > +
> > +As mentioned above, a VDUSE device is created by ioctl(VDUSE_CREATE_DEV) on
> > +/dev/vduse/control. With this ioctl, 

Re: [PATCH v10 16/17] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-08-03 Thread Yongji Xie
On Tue, Aug 3, 2021 at 3:30 PM Jason Wang  wrote:
>
>
> On 2021/7/29 at 3:35 PM, Xie Yongji wrote:
> > This VDUSE driver enables implementing software-emulated vDPA
> > devices in userspace. The vDPA device is created by
> > ioctl(VDUSE_CREATE_DEV) on /dev/vduse/control. Then a char device
> > interface (/dev/vduse/$NAME) is exported to userspace for device
> > emulation.
> >
> > In order to make the device emulation more secure, the device's
> > control path is handled in the kernel. A message mechanism is introduced
> > to forward some dataplane-related control messages to userspace.
> >
> > And in the data path, the DMA buffer will be mapped into the userspace
> > address space in different ways depending on the vDPA bus to
> > which the vDPA device is attached. In the virtio-vdpa case, the MMU-based
> > software IOTLB is used to achieve that. And in the vhost-vdpa case, the
> > DMA buffer resides in a userspace memory region which can be shared
> > with the VDUSE userspace process by transferring the shmfd.
> >
> > For more details on VDUSE design and usage, please see the follow-on
> > Documentation commit.
> >
> > Signed-off-by: Xie Yongji 
> > ---
> >   Documentation/userspace-api/ioctl/ioctl-number.rst |1 +
> >   drivers/vdpa/Kconfig   |   10 +
> >   drivers/vdpa/Makefile  |1 +
> >   drivers/vdpa/vdpa_user/Makefile|5 +
> >   drivers/vdpa/vdpa_user/vduse_dev.c | 1541 
> > 
> >   include/uapi/linux/vduse.h |  220 +++
> >   6 files changed, 1778 insertions(+)
> >   create mode 100644 drivers/vdpa/vdpa_user/Makefile
> >   create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
> >   create mode 100644 include/uapi/linux/vduse.h
> >
> > diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst 
> > b/Documentation/userspace-api/ioctl/ioctl-number.rst
> > index 1409e40e6345..293ca3aef358 100644
> > --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> > +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> > @@ -300,6 +300,7 @@ Code  Seq#Include File  
> >  Comments
> >   'z'   10-4F  drivers/s390/crypto/zcrypt_api.h
> > conflict!
> >   '|'   00-7F  linux/media.h
> >   0x80  00-1F  linux/fb.h
> > +0x81  00-1F  linux/vduse.h
> >   0x89  00-06  arch/x86/include/asm/sockios.h
> >   0x89  0B-DF  linux/sockios.h
> >   0x89  E0-EF  linux/sockios.h 
> > SIOCPROTOPRIVATE range
> > diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
> > index a503c1b2bfd9..6e23bce6433a 100644
> > --- a/drivers/vdpa/Kconfig
> > +++ b/drivers/vdpa/Kconfig
> > @@ -33,6 +33,16 @@ config VDPA_SIM_BLOCK
> > vDPA block device simulator which terminates IO request in a
> > memory buffer.
> >
> > +config VDPA_USER
> > + tristate "VDUSE (vDPA Device in Userspace) support"
> > + depends on EVENTFD && MMU && HAS_DMA
> > + select DMA_OPS
> > + select VHOST_IOTLB
> > + select IOMMU_IOVA
> > + help
> > +   With VDUSE it is possible to emulate a vDPA Device
> > +   in a userspace program.
> > +
> >   config IFCVF
> >   tristate "Intel IFC VF vDPA driver"
> >   depends on PCI_MSI
> > diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
> > index 67fe7f3d6943..f02ebed33f19 100644
> > --- a/drivers/vdpa/Makefile
> > +++ b/drivers/vdpa/Makefile
> > @@ -1,6 +1,7 @@
> >   # SPDX-License-Identifier: GPL-2.0
> >   obj-$(CONFIG_VDPA) += vdpa.o
> >   obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
> > +obj-$(CONFIG_VDPA_USER) += vdpa_user/
> >   obj-$(CONFIG_IFCVF)+= ifcvf/
> >   obj-$(CONFIG_MLX5_VDPA) += mlx5/
> >   obj-$(CONFIG_VP_VDPA)+= virtio_pci/
> > diff --git a/drivers/vdpa/vdpa_user/Makefile 
> > b/drivers/vdpa/vdpa_user/Makefile
> > new file mode 100644
> > index ..260e0b26af99
> > --- /dev/null
> > +++ b/drivers/vdpa/vdpa_user/Makefile
> > @@ -0,0 +1,5 @@
> > +# SPDX-License-Identifier: GPL-2.0
> > +
> > +vduse-y := vduse_dev.o iova_domain.o
> > +
> > +obj-$(CONFIG_VDPA_USER) += vduse.o
> > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> > b/drivers/vdpa/vdpa_user/vduse_dev.c
> > new file mode 100644
> > index ..6addc62e7de6
> > --- /dev/null
> > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > @@ -0,0 +1,1541 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * VDUSE: vDPA Device in Userspace
> > + *
> > + * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All 
> > rights reserved.
> > + *
> > + * Author: Xie Yongji 
> > + *
> > + */
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +#include "iova_domain.h"
> > +
> > +#define DRV_AUTHOR   

Re: [PATCH v2 0/3] iommu: Enable non-strict DMA on QCom SD/MMC

2021-08-03 Thread Robin Murphy

On 2021-08-03 01:09, Rajat Jain wrote:

Hi Robin, Doug,

On Wed, Jul 14, 2021 at 8:14 AM Doug Anderson  wrote:


Hi,

On Tue, Jul 13, 2021 at 11:07 AM Robin Murphy  wrote:


On 2021-07-08 15:36, Doug Anderson wrote:
[...]

Or document for the users that want performance how to
change the setting, so that they can decide.


Pushing this to the users can make sense for a Linux distribution but
probably less sense for an embedded platform. So I'm happy to make
some way for a user to override this (like via kernel command line),
but I also strongly believe there should be a default that users don't
have to futz with that we think is correct.


FYI I did make progress on the "punt it to userspace" approach. I'm not
posting it even as an RFC yet because I still need to set up a machine
to try actually testing any of it (it's almost certainly broken
somewhere), but in the end it comes out looking surprisingly not too bad
overall. If you're curious to take a look in the meantime I put it here:

https://gitlab.arm.com/linux-arm/linux-rm/-/commits/iommu/fq


I was wondering if you got any closer to testing / sending it out? I
looked at the patches and am trying to understand, would they also
make it possible to convert at runtime, an existing "non-strict"
domain (for a particular device) into a "strict" domain leaving the
other devices/domains as-is? Please let me know when you think your
patches are good to be tested, and I'd also be interested in trying
them out.


Yup, most recently here:

https://lore.kernel.org/linux-iommu/cover.1627468308.git.robin.mur...@arm.com/

I'm currently getting v3 ready, so I'll try to remember to add you to 
the CC list.



Being able to change this at runtime through sysfs sounds great and it
fills all the needs I'm aware of, thanks! In Chrome OS we can just use
this with some udev rules and get everything we need.


I still have another (inverse) use case where this does not work:
We have an Intel chromebook with the default domain type being
non-strict. There is an LTE modem (an internal PCI device which cannot
be marked external), which we'd like to be treated as a "Strict" DMA
domain.

Do I understand it right that using Rob's patches, I could potentially
switch the domain to "strict" *after* booting (since we don't use
initramfs), but by that time, the driver might have already attached
to the modem device (using "non-strict" domain), and thus the damage
may have already been done? So perhaps we still need a device property
that the firmware could use to indicate "strictness" for certain
devices at boot?


Well, in my view the "external facing" firmware property *should* 
effectively be the "I don't trust this device not to misbehave" 
property, but I guess it's a bit too conflated with other aspects of 
Thunderbolt root ports (at least in the ACPI definition) to abuse in 
that manner.


Ideas off the top of my head would be to flip the default domain type 
and manually relax all the other performance-sensitive devices instead, 
or module_blacklist the modem driver to load manually later after 
tweaking its group. However, if you think it's a sufficiently general 
concern, maybe a quirk to set pci_dev->untrusted might be worth 
exploring? It may make sense to drive such a thing from a command-line 
option rather than a hard-coded list, though, since trust is really down 
to the individual use-case.


[ re gitlab.arm.com, I understand it tends not to like large transfers - 
some colleagues have reported similar issues pushing large repos as 
well. I'd suggest cloning the base mainline repo from kernel.org or 
another reliable source, then fetching my branch into that. I've just 
tried that on a different machine (outside the work network) and it 
worked fine) ]


Thanks,
Robin.
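The udev approach mentioned earlier would presumably drive the runtime switch through the iommu group's `type` attribute in sysfs. The rule below is only an illustrative assumption — the vendor/device IDs are placeholders, and the exact attribute path and accepted tokens depend on the final version of the series:

```
# Hypothetical udev rule: put the LTE modem's IOMMU group back into
# strict (plain DMA) mode as soon as the device appears.
# 0xVENDOR_ID / 0xDEVICE_ID and the "DMA" token are assumptions.
ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0xVENDOR_ID", \
  ATTR{device}=="0xDEVICE_ID", \
  RUN+="/bin/sh -c 'echo DMA > /sys/%p/iommu_group/type'"
```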


Re: [PATCH v10 05/17] vhost-vdpa: Fail the vhost_vdpa_set_status() on reset failure

2021-08-03 Thread Jason Wang


On 2021/7/29 at 3:34 PM, Xie Yongji wrote:

Re-read the device status to ensure it's set to zero during
resetting. Otherwise, fail the vhost_vdpa_set_status() after timeout.

Signed-off-by: Xie Yongji 
---
  drivers/vhost/vdpa.c | 11 ++-
  1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index b07aa161f7ad..dd05c1e1133c 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -157,7 +157,7 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 
__user *statusp)
struct vdpa_device *vdpa = v->vdpa;
const struct vdpa_config_ops *ops = vdpa->config;
u8 status, status_old;
-   int nvqs = v->nvqs;
+   int timeout = 0, nvqs = v->nvqs;
u16 i;
  
  	if (copy_from_user(&status, statusp, sizeof(status)))

@@ -173,6 +173,15 @@ static long vhost_vdpa_set_status(struct vhost_vdpa *v, u8 
__user *statusp)
return -EINVAL;
  
  	ops->set_status(vdpa, status);

+   if (status == 0) {
+   while (ops->get_status(vdpa)) {
+   timeout += 20;
+   if (timeout > VDPA_RESET_TIMEOUT_MS)
+   return -EIO;
+
+   msleep(20);
+   }



The spec has introduced reset as one of the basic facilities. And consider
that we defer reset here.


This makes me think if it's better to introduce a dedicated vdpa ops for 
reset?


Thanks



+   }
  
  	if ((status & VIRTIO_CONFIG_S_DRIVER_OK) && !(status_old & VIRTIO_CONFIG_S_DRIVER_OK))

for (i = 0; i < nvqs; i++)



Re: [PATCH v10 10/17] virtio: Handle device reset failure in register_virtio_device()

2021-08-03 Thread Jason Wang


On 2021/7/29 at 3:34 PM, Xie Yongji wrote:

The device reset may fail in virtio-vdpa case now, so add checks to
its return value and fail the register_virtio_device().



So the reset() would be called by the driver during remove as well, or 
is it sufficient to deal only with the reset during probe?


Thanks




Signed-off-by: Xie Yongji 
---
  drivers/virtio/virtio.c | 15 ++-
  1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index a15beb6b593b..8df75425fb43 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -349,7 +349,9 @@ int register_virtio_device(struct virtio_device *dev)
  
  	/* We always start by resetting the device, in case a previous

 * driver messed it up.  This also tests that code path a little. */
-   dev->config->reset(dev);
+   err = dev->config->reset(dev);
+   if (err)
+   goto err_reset;
  
  	/* Acknowledge that we've seen the device. */

virtio_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE);
@@ -362,10 +364,13 @@ int register_virtio_device(struct virtio_device *dev)
 */
	err = device_add(&dev->dev);
if (err)
-   ida_simple_remove(&virtio_index_ida, dev->index);
-out:
-   if (err)
-   virtio_add_status(dev, VIRTIO_CONFIG_S_FAILED);
+   goto err_add;
+
+   return 0;
+err_add:
+   virtio_add_status(dev, VIRTIO_CONFIG_S_FAILED);
+err_reset:
+   ida_simple_remove(&virtio_index_ida, dev->index);
return err;
  }
  EXPORT_SYMBOL_GPL(register_virtio_device);



Re: [PATCH v10 07/17] virtio: Don't set FAILED status bit on device index allocation failure

2021-08-03 Thread Jason Wang


On 2021/7/29 at 3:34 PM, Xie Yongji wrote:

We don't need to set FAILED status bit on device index allocation
failure since the device initialization hasn't been started yet.
This doesn't affect runtime, found in code review.

Signed-off-by: Xie Yongji 



Does it really harm?

Thanks



---
  drivers/virtio/virtio.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index 4b15c00c0a0a..a15beb6b593b 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -338,7 +338,7 @@ int register_virtio_device(struct virtio_device *dev)
/* Assign a unique device index and hence name. */
	err = ida_simple_get(&virtio_index_ida, 0, 0, GFP_KERNEL);
if (err < 0)
-   goto out;
+   return err;
  
  	dev->index = err;

dev_set_name(>dev, "virtio%u", dev->index);



Re: [PATCH v10 04/17] vdpa: Fail the vdpa_reset() if fail to set device status to zero

2021-08-03 Thread Jason Wang


On 2021/7/29 at 3:34 PM, Xie Yongji wrote:

Re-read the device status to ensure it's set to zero during
resetting. Otherwise, fail the vdpa_reset() after timeout.

Signed-off-by: Xie Yongji 
---
  include/linux/vdpa.h | 15 ++-
  1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index 406d53a606ac..d1a80ef05089 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -6,6 +6,7 @@
  #include 
  #include 
  #include 
+#include 
  
  /**

   * struct vdpa_calllback - vDPA callback definition.
@@ -340,12 +341,24 @@ static inline struct device *vdpa_get_dma_dev(struct 
vdpa_device *vdev)
return vdev->dma_dev;
  }
  
-static inline void vdpa_reset(struct vdpa_device *vdev)

+#define VDPA_RESET_TIMEOUT_MS 1000
+
+static inline int vdpa_reset(struct vdpa_device *vdev)
  {
const struct vdpa_config_ops *ops = vdev->config;
+   int timeout = 0;
  
  	vdev->features_valid = false;

ops->set_status(vdev, 0);
+   while (ops->get_status(vdev)) {
+   timeout += 20;
+   if (timeout > VDPA_RESET_TIMEOUT_MS)
+   return -EIO;
+
+   msleep(20);
+   }



I wonder if it's better to do this in the vDPA parent?

Thanks



+
+   return 0;
  }
  
  static inline int vdpa_set_features(struct vdpa_device *vdev, u64 features)



Re: [PATCH v10 03/17] vdpa: Fix code indentation

2021-08-03 Thread Jason Wang


在 2021/7/29 下午3:34, Xie Yongji 写道:

Use tabs to indent the code instead of spaces.

Signed-off-by: Xie Yongji 
---
  include/linux/vdpa.h | 29 ++---
  1 file changed, 14 insertions(+), 15 deletions(-)



It looks to me that not all the warnings are addressed.

Or did you silence checkpatch.pl -f?

Thanks




diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index 7c49bc5a2b71..406d53a606ac 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -43,17 +43,17 @@ struct vdpa_vq_state_split {
   * @last_used_idx: used index
   */
  struct vdpa_vq_state_packed {
-u16last_avail_counter:1;
-u16last_avail_idx:15;
-u16last_used_counter:1;
-u16last_used_idx:15;
+   u16 last_avail_counter:1;
+   u16 last_avail_idx:15;
+   u16 last_used_counter:1;
+   u16 last_used_idx:15;
  };
  
  struct vdpa_vq_state {

- union {
-  struct vdpa_vq_state_split split;
-  struct vdpa_vq_state_packed packed;
- };
+   union {
+   struct vdpa_vq_state_split split;
+   struct vdpa_vq_state_packed packed;
+   };
  };
  
  struct vdpa_mgmt_dev;

@@ -131,7 +131,7 @@ struct vdpa_iova_range {
   *@vdev: vdpa device
   *@idx: virtqueue index
   *@state: pointer to returned state 
(last_avail_idx)
- * @get_vq_notification:   Get the notification area for a virtqueue
+ * @get_vq_notification:   Get the notification area for a virtqueue
   *@vdev: vdpa device
   *@idx: virtqueue index
   *Returns the notifcation area
@@ -342,25 +342,24 @@ static inline struct device *vdpa_get_dma_dev(struct 
vdpa_device *vdev)
  
  static inline void vdpa_reset(struct vdpa_device *vdev)

  {
-const struct vdpa_config_ops *ops = vdev->config;
+   const struct vdpa_config_ops *ops = vdev->config;
  
  	vdev->features_valid = false;

-ops->set_status(vdev, 0);
+   ops->set_status(vdev, 0);
  }
  
  static inline int vdpa_set_features(struct vdpa_device *vdev, u64 features)

  {
-const struct vdpa_config_ops *ops = vdev->config;
+   const struct vdpa_config_ops *ops = vdev->config;
  
  	vdev->features_valid = true;

-return ops->set_features(vdev, features);
+   return ops->set_features(vdev, features);
  }
  
-

  static inline void vdpa_get_config(struct vdpa_device *vdev, unsigned offset,
   void *buf, unsigned int len)
  {
-const struct vdpa_config_ops *ops = vdev->config;
+   const struct vdpa_config_ops *ops = vdev->config;
  
  	/*

 * Config accesses aren't supposed to trigger before features are set.



Re: [PATCH v10 02/17] file: Export receive_fd() to modules

2021-08-03 Thread Jason Wang


在 2021/7/29 下午3:34, Xie Yongji 写道:

Export receive_fd() so that modules can use it to pass a
file descriptor between processes without missing any
security checks.

Signed-off-by: Xie Yongji 
---
  fs/file.c| 6 ++
  include/linux/file.h | 7 +++
  2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 86dc9956af32..210e540672aa 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -1134,6 +1134,12 @@ int receive_fd_replace(int new_fd, struct file *file, 
unsigned int o_flags)
return new_fd;
  }
  
+int receive_fd(struct file *file, unsigned int o_flags)

+{
+   return __receive_fd(file, NULL, o_flags);



Any reason that receive_fd_user() can live in the file.h?

Thanks



+}
+EXPORT_SYMBOL_GPL(receive_fd);
+
  static int ksys_dup3(unsigned int oldfd, unsigned int newfd, int flags)
  {
int err = -EBADF;
diff --git a/include/linux/file.h b/include/linux/file.h
index 2de2e4613d7b..51e830b4fe3a 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -94,6 +94,9 @@ extern void fd_install(unsigned int fd, struct file *file);
  
  extern int __receive_fd(struct file *file, int __user *ufd,

unsigned int o_flags);
+
+extern int receive_fd(struct file *file, unsigned int o_flags);
+
  static inline int receive_fd_user(struct file *file, int __user *ufd,
  unsigned int o_flags)
  {
@@ -101,10 +104,6 @@ static inline int receive_fd_user(struct file *file, int 
__user *ufd,
return -EFAULT;
return __receive_fd(file, ufd, o_flags);
  }
-static inline int receive_fd(struct file *file, unsigned int o_flags)
-{
-   return __receive_fd(file, NULL, o_flags);
-}
  int receive_fd_replace(int new_fd, struct file *file, unsigned int o_flags);
  
  extern void flush_delayed_fput(void);



Re: [PATCH v10 01/17] iova: Export alloc_iova_fast() and free_iova_fast()

2021-08-03 Thread Jason Wang


在 2021/7/29 下午3:34, Xie Yongji 写道:

Export alloc_iova_fast() and free_iova_fast() so that
some modules can use it to improve iova allocation efficiency.



It's better to explain why alloc_iova() is not sufficient here.

Thanks




Signed-off-by: Xie Yongji 
---
  drivers/iommu/iova.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index b6cf5f16123b..3941ed6bb99b 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -521,6 +521,7 @@ alloc_iova_fast(struct iova_domain *iovad, unsigned long 
size,
  
  	return new_iova->pfn_lo;

  }
+EXPORT_SYMBOL_GPL(alloc_iova_fast);
  
  /**

   * free_iova_fast - free iova pfn range into rcache
@@ -538,6 +539,7 @@ free_iova_fast(struct iova_domain *iovad, unsigned long 
pfn, unsigned long size)
  
  	free_iova(iovad, pfn);

  }
+EXPORT_SYMBOL_GPL(free_iova_fast);
  
  #define fq_ring_for_each(i, fq) \

for ((i) = (fq)->head; (i) != (fq)->tail; (i) = ((i) + 1) % 
IOVA_FQ_SIZE)




Re: [PATCH v10 17/17] Documentation: Add documentation for VDUSE

2021-08-03 Thread Jason Wang


在 2021/7/29 下午3:35, Xie Yongji 写道:

VDUSE (vDPA Device in Userspace) is a framework to support
implementing software-emulated vDPA devices in userspace. This
document is intended to clarify the VDUSE design and usage.

Signed-off-by: Xie Yongji 
---
  Documentation/userspace-api/index.rst |   1 +
  Documentation/userspace-api/vduse.rst | 232 ++
  2 files changed, 233 insertions(+)
  create mode 100644 Documentation/userspace-api/vduse.rst

diff --git a/Documentation/userspace-api/index.rst 
b/Documentation/userspace-api/index.rst
index 0b5eefed027e..c432be070f67 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -27,6 +27,7 @@ place where this information is gathered.
 iommu
 media/index
 sysfs-platform_profile
+   vduse
  
  .. only::  subproject and html
  
diff --git a/Documentation/userspace-api/vduse.rst b/Documentation/userspace-api/vduse.rst

new file mode 100644
index ..30c9d1482126
--- /dev/null
+++ b/Documentation/userspace-api/vduse.rst
@@ -0,0 +1,232 @@
+==
+VDUSE - "vDPA Device in Userspace"
+==
+
+A vDPA (virtio data path acceleration) device is a device that uses a
+datapath complying with the virtio specification together with a
+vendor-specific control path. vDPA devices can either be physically
+located on the hardware or emulated by software. VDUSE is a framework
+that makes it possible to implement software-emulated vDPA devices in
+userspace. To make the device emulation more secure, the emulated vDPA
+device's control path is handled in the kernel and only the data path is
+implemented in userspace.
+
+Note that only the virtio block device is supported by the VDUSE framework
+now, which can reduce security risks when the userspace process that
+implements the data path is run by an unprivileged user. Support for other
+device types can be added after the security issues of the corresponding
+device drivers are clarified or fixed in the future.
+
+Create/Destroy VDUSE devices
+
+
+VDUSE devices are created as follows:
+
+1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
+   /dev/vduse/control.
+
+2. Setup each virtqueue with ioctl(VDUSE_VQ_SETUP) on /dev/vduse/$NAME.
+
+3. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
+   messages will arrive while attaching the VDUSE instance to the vDPA bus.
+
+4. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
+   instance to the vDPA bus.
+
+VDUSE devices are destroyed as follows:
+
+1. Send the VDPA_CMD_DEV_DEL netlink message to detach the VDUSE
+   instance from the vDPA bus.
+
+2. Close the file descriptor referring to /dev/vduse/$NAME.
+
+3. Destroy the VDUSE instance with ioctl(VDUSE_DESTROY_DEV) on
+   /dev/vduse/control.
+
+The netlink messages can be sent via the vdpa tool in iproute2 or with the
+sample code below:
+
+.. code-block:: c
+
+   static int netlink_add_vduse(const char *name, enum vdpa_command cmd)
+   {
+   struct nl_sock *nlsock;
+   struct nl_msg *msg;
+   int famid;
+
+   nlsock = nl_socket_alloc();
+   if (!nlsock)
+   return -ENOMEM;
+
+   if (genl_connect(nlsock))
+   goto free_sock;
+
+   famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
+   if (famid < 0)
+   goto close_sock;
+
+   msg = nlmsg_alloc();
+   if (!msg)
+   goto close_sock;
+
+   if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0,
+cmd, 0))
+   goto nla_put_failure;
+
+   NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
+   if (cmd == VDPA_CMD_DEV_NEW)
+   NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME,
+  "vduse");
+
+   if (nl_send_sync(nlsock, msg))
+   goto close_sock;
+
+   nl_close(nlsock);
+   nl_socket_free(nlsock);
+
+   return 0;
+   nla_put_failure:
+   nlmsg_free(msg);
+   close_sock:
+   nl_close(nlsock);
+   free_sock:
+   nl_socket_free(nlsock);
+   return -1;
+   }
+
+How VDUSE works
+---
+
+As mentioned above, a VDUSE device is created by ioctl(VDUSE_CREATE_DEV) on
+/dev/vduse/control. With this ioctl, userspace can specify some basic
+configuration for this emulated device, such as the device name (which
+uniquely identifies a VDUSE device), virtio features, virtio configuration
+space and the number of virtqueues. Then a char device interface
+(/dev/vduse/$NAME) is exported to userspace for device emulation.
+Userspace can use the VDUSE_VQ_SETUP ioctl on /dev/vduse/$NAME to add
+per-virtqueue configuration such as the max size of the virtqueue to the
+device.
+
+After the initialization, the VDUSE device can 

Re: [PATCH v10 16/17] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-08-03 Thread Jason Wang


在 2021/7/29 下午3:35, Xie Yongji 写道:

This VDUSE driver enables implementing software-emulated vDPA
devices in userspace. The vDPA device is created by
ioctl(VDUSE_CREATE_DEV) on /dev/vduse/control. Then a char device
interface (/dev/vduse/$NAME) is exported to userspace for device
emulation.

In order to make the device emulation more secure, the device's
control path is handled in the kernel. A message mechanism is introduced
to forward some dataplane related control messages to userspace.

And in the data path, the DMA buffer will be mapped into userspace
address space in different ways depending on the vDPA bus to
which the vDPA device is attached. In the virtio-vdpa case, the MMU-based
software IOTLB is used to achieve that. And in the vhost-vdpa case, the
DMA buffer resides in a userspace memory region which can be shared
with the VDUSE userspace process via transferring the shmfd.

For more details on VDUSE design and usage, please see the follow-on
Documentation commit.

Signed-off-by: Xie Yongji 
---
  Documentation/userspace-api/ioctl/ioctl-number.rst |1 +
  drivers/vdpa/Kconfig   |   10 +
  drivers/vdpa/Makefile  |1 +
  drivers/vdpa/vdpa_user/Makefile|5 +
  drivers/vdpa/vdpa_user/vduse_dev.c | 1541 
  include/uapi/linux/vduse.h |  220 +++
  6 files changed, 1778 insertions(+)
  create mode 100644 drivers/vdpa/vdpa_user/Makefile
  create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
  create mode 100644 include/uapi/linux/vduse.h

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst 
b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 1409e40e6345..293ca3aef358 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -300,6 +300,7 @@ Code  Seq#  Include File  Comments
  'z'   10-4F  drivers/s390/crypto/zcrypt_api.hconflict!
  '|'   00-7F  linux/media.h
  0x80  00-1F  linux/fb.h
+0x81  00-1F  linux/vduse.h
  0x89  00-06  arch/x86/include/asm/sockios.h
  0x89  0B-DF  linux/sockios.h
0x89  E0-EF  linux/sockios.h SIOCPROTOPRIVATE range
diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
index a503c1b2bfd9..6e23bce6433a 100644
--- a/drivers/vdpa/Kconfig
+++ b/drivers/vdpa/Kconfig
@@ -33,6 +33,16 @@ config VDPA_SIM_BLOCK
  vDPA block device simulator which terminates IO request in a
  memory buffer.
  
+config VDPA_USER

+   tristate "VDUSE (vDPA Device in Userspace) support"
+   depends on EVENTFD && MMU && HAS_DMA
+   select DMA_OPS
+   select VHOST_IOTLB
+   select IOMMU_IOVA
+   help
+ With VDUSE it is possible to emulate a vDPA Device
+ in a userspace program.
+
  config IFCVF
tristate "Intel IFC VF vDPA driver"
depends on PCI_MSI
diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
index 67fe7f3d6943..f02ebed33f19 100644
--- a/drivers/vdpa/Makefile
+++ b/drivers/vdpa/Makefile
@@ -1,6 +1,7 @@
  # SPDX-License-Identifier: GPL-2.0
  obj-$(CONFIG_VDPA) += vdpa.o
  obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
+obj-$(CONFIG_VDPA_USER) += vdpa_user/
  obj-$(CONFIG_IFCVF)+= ifcvf/
  obj-$(CONFIG_MLX5_VDPA) += mlx5/
  obj-$(CONFIG_VP_VDPA)+= virtio_pci/
diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
new file mode 100644
index ..260e0b26af99
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+
+vduse-y := vduse_dev.o iova_domain.o
+
+obj-$(CONFIG_VDPA_USER) += vduse.o
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
b/drivers/vdpa/vdpa_user/vduse_dev.c
new file mode 100644
index ..6addc62e7de6
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -0,0 +1,1541 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * VDUSE: vDPA Device in Userspace
+ *
+ * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights 
reserved.
+ *
+ * Author: Xie Yongji 
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "iova_domain.h"
+
+#define DRV_AUTHOR   "Yongji Xie "
+#define DRV_DESC "vDPA Device in Userspace"
+#define DRV_LICENSE  "GPL v2"
+
+#define VDUSE_DEV_MAX (1U << MINORBITS)
+#define VDUSE_BOUNCE_SIZE (64 * 1024 * 1024)
+#define VDUSE_IOVA_SIZE (128 * 1024 * 1024)
+#define VDUSE_REQUEST_TIMEOUT 30



I think we need to make this a module parameter. 0 probably means we 
need to wait forever.


This can help in the case when the userspace is attached by GDB. If 
Michael is still not happy, we can find another solution (e.g. only 
offload the datapath).


Other looks good.

Thanks



Re: [PATCH] iommu/arm-smmu: Add clk_bulk_{prepare/unprepare} to system pm callbacks

2021-08-03 Thread Sai Prakash Ranjan

On 2021-08-02 21:42, Will Deacon wrote:

On Tue, Jul 27, 2021 at 03:03:22PM +0530, Sai Prakash Ranjan wrote:
Some clocks for SMMU can have parent as XO such as gpu_cc_hub_cx_int_clk
of GPU SMMU in QTI SC7280 SoC and in order to enter deep sleep states in
such cases, we would need to drop the XO clock vote in unprepare call and
this unprepare callback for XO is in RPMh (Resource Power Manager-Hardened)
clock driver which controls RPMh managed clock resources for new QTI SoCs
and is a blocking call.

Given we cannot have sleeping calls such as clk_bulk_prepare() and
clk_bulk_unprepare() in arm-smmu runtime pm callbacks since the iommu
operations like map and unmap can be in atomic context and are in fast
path, add this prepare and unprepare call to drop the XO vote only for
system pm callbacks since it is not a fast path and we expect the system
to enter deep sleep states with system pm as opposed to runtime pm.
This is a similar sequence of clock requests (prepare,enable and
disable,unprepare) in arm-smmu probe and remove.

Signed-off-by: Sai Prakash Ranjan 
Co-developed-by: Rajendra Nayak 
Signed-off-by: Rajendra Nayak 
---
 drivers/iommu/arm/arm-smmu/arm-smmu.c | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)


[+Rob]

How does this work with that funny GPU which writes to the SMMU registers
directly? Does the SMMU need to remain independently clocked for that to
work or is it all in the same clock domain?



As Rob mentioned, device link should take care of all the dependencies
between SMMU and its consumers. But not sure how the question relates to
this patch as this change is for system pm and not runtime pm, so it is
exactly the sequence of SMMU probe/remove which if works currently for
that GPU SMMU, then it should work just fine for system suspend and
resume as well.

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c 
b/drivers/iommu/arm/arm-smmu/arm-smmu.c

index d3c6f54110a5..9561ba4c5d39 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -2277,6 +2277,13 @@ static int __maybe_unused 
arm_smmu_runtime_suspend(struct device *dev)


 static int __maybe_unused arm_smmu_pm_resume(struct device *dev)
 {
+   int ret;
+   struct arm_smmu_device *smmu = dev_get_drvdata(dev);
+
+   ret = clk_bulk_prepare(smmu->num_clks, smmu->clks);
+   if (ret)
+   return ret;
+
if (pm_runtime_suspended(dev))
return 0;


If we subsequently fail to enable the clks in arm_smmu_runtime_resume()
should we unprepare them again?



If we are unable to turn on the clks then it's fatal and we will not
live for long.


Thanks,
Sai


Will

@@ -2285,10 +2292,19 @@ static int __maybe_unused 
arm_smmu_pm_resume(struct device *dev)


 static int __maybe_unused arm_smmu_pm_suspend(struct device *dev)
 {
+   int ret = 0;
+   struct arm_smmu_device *smmu = dev_get_drvdata(dev);
+
if (pm_runtime_suspended(dev))
-   return 0;
+   goto clk_unprepare;

-   return arm_smmu_runtime_suspend(dev);
+   ret = arm_smmu_runtime_suspend(dev);
+   if (ret)
+   return ret;
+
+clk_unprepare:
+   clk_bulk_unprepare(smmu->num_clks, smmu->clks);
+   return ret;
 }

 static const struct dev_pm_ops arm_smmu_pm_ops = {
--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


