Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-09 Thread Kenneth Lee
On Thu, Aug 09, 2018 at 10:46:13AM -0400, Jerome Glisse wrote:
> Date: Thu, 9 Aug 2018 10:46:13 -0400
> From: Jerome Glisse 
> To: Kenneth Lee 
> CC: Kenneth Lee , "Tian, Kevin"
>  , Alex Williamson ,
>  Herbert Xu , "k...@vger.kernel.org"
>  , Jonathan Corbet , Greg
>  Kroah-Hartman , Zaibo Xu ,
>  "linux-...@vger.kernel.org" , "Kumar, Sanjay K"
>  , Hao Fang ,
>  "linux-ker...@vger.kernel.org" ,
>  "linux...@huawei.com" ,
>  "iommu@lists.linux-foundation.org" ,
>  "linux-cry...@vger.kernel.org" , Philippe
>  Ombredanne , Thomas Gleixner ,
>  "David S . Miller" ,
>  "linux-accelerat...@lists.ozlabs.org"
>  
> Subject: Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive
> User-Agent: Mutt/1.10.0 (2018-05-17)
> Message-ID: <20180809144613.gb3...@redhat.com>
> 
> On Thu, Aug 09, 2018 at 04:03:52PM +0800, Kenneth Lee wrote:
> > On Wed, Aug 08, 2018 at 11:18:35AM -0400, Jerome Glisse wrote:
> > > On Wed, Aug 08, 2018 at 09:08:42AM +0800, Kenneth Lee wrote:
> > > > 在 2018年08月06日 星期一 11:32 下午, Jerome Glisse 写道:
> > > > > On Mon, Aug 06, 2018 at 11:12:52AM +0800, Kenneth Lee wrote:
> > > > > > On Fri, Aug 03, 2018 at 10:39:44AM -0400, Jerome Glisse wrote:
> > > > > > > On Fri, Aug 03, 2018 at 11:47:21AM +0800, Kenneth Lee wrote:
> > > > > > > > On Thu, Aug 02, 2018 at 10:22:43AM -0400, Jerome Glisse wrote:
> > > > > > > > > On Thu, Aug 02, 2018 at 12:05:57PM +0800, Kenneth Lee wrote:
> 
> [...]
> 
> > > > > > > > > your mechanisms the userspace must have a specific userspace
> > > > > > > > > drivers for each hardware and thus there are virtually no
> > > > > > > > > differences between having this userspace driver open a device
> > > > > > > > > file in vfio or somewhere else in the device filesystem. This 
> > > > > > > > > is
> > > > > > > > > just a different path.
> > > > > > > > > 
> > > > > > > > The basic problem WarpDrive want to solve it to avoid syscall. 
> > > > > > > > This is important
> > > > > > > > to accelerators. We have some data here:
> > > > > > > > https://www.slideshare.net/linaroorg/progress-and-demonstration-of-wrapdrive-a-accelerator-framework-sfo17317
> > > > > > > > 
> > > > > > > > (see page 3)
> > > > > > > > 
> > > > > > > > The performance is different on using kernel and user drivers.
> > > > > > > Yes and example i point to is exactly that. You have a one time 
> > > > > > > setup
> > > > > > > cost (creating command buffer binding PASID with command buffer 
> > > > > > > and
> > > > > > > couple other setup steps). Then userspace no longer have to do any
> > > > > > > ioctl to schedule work on the GPU. It is all down from userspace 
> > > > > > > and
> > > > > > > it use a doorbell to notify hardware when it should go look at 
> > > > > > > command
> > > > > > > buffer for new thing to execute.
> > > > > > > 
> > > > > > > My point stands on that. You have existing driver already doing so
> > > > > > > with no new framework and in your scheme you need a userspace 
> > > > > > > driver.
> > > > > > > So i do not see the value add, using one path or the other in the
> > > > > > > userspace driver is litteraly one line to change.
> > > > > > > 
> > > > > > Sorry, I'd got confuse here. I partially agree that the user driver 
> > > > > > is
> > > > > > redundance of kernel driver. (But for WarpDrive, the kernel driver 
> > > > > > is a full
> > > > > > driver include all preparation and setup stuff for the hardware, 
> > > > > > the user driver
> > > > > > is simply to send request and receive answer). Yes, it is just a 
> > > > > > choice of path.
> > > > > > But the user path is faster if the request come from use space. And 
> > > > > > to do that,
> > > > > > we need user land DMA support. Then why is it invaluable to let 
> > > > > > VFIO involved?
> > > > > Some drivers in the kernel already do exactly what you said. The user
> > > > > space emit commands without ever going into kernel by directly 
> > > > > scheduling
> > > > > commands and ringing a doorbell. They do not need VFIO either and they
> > > > > can map userspace address into the DMA address space of the device and
> > > > > again they do not need VFIO for that.
> > > > Could you please directly point out which driver you refer to here? 
> > > > Thank
> > > > you.
> > > 
> > > drivers/gpu/drm/amd/
> > > 
> > > Sub-directory of interest is amdkfd
> > > 
> > > Because it is a big driver here is a highlevel overview of how it works
> > > (this is a simplification):
> > >   - Process can allocate GPUs buffer (through ioclt) and map them into
> > > its address space (through mmap of device file at buffer object
> > > specific offset).
> > >   - Process can map any valid range of virtual address space into device
> > > address space (IOMMU mapping). This must be regular memory ie not an
> > > mmap of a device file or any special file (this is the non PASID
> > > path)
> > >   - Process can create a command queue and bind its process to it aka
> > > PASID, this is done through an 

Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-09 Thread Kenneth Lee
On Thu, Aug 09, 2018 at 08:31:31AM +, Tian, Kevin wrote:
> Date: Thu, 9 Aug 2018 08:31:31 +
> From: "Tian, Kevin" 
> To: Kenneth Lee , Jerome Glisse 
> CC: Kenneth Lee , Alex Williamson
>  , Herbert Xu ,
>  "k...@vger.kernel.org" , Jonathan Corbet
>  , Greg Kroah-Hartman , Zaibo
>  Xu , "linux-...@vger.kernel.org"
>  , "Kumar, Sanjay K" ,
>  Hao Fang , "linux-ker...@vger.kernel.org"
>  , "linux...@huawei.com"
>  , "iommu@lists.linux-foundation.org"
>  , "linux-cry...@vger.kernel.org"
>  , Philippe Ombredanne
>  , Thomas Gleixner , "David S .
>  Miller" , "linux-accelerat...@lists.ozlabs.org"
>  
> Subject: RE: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive
> Message-ID: 
> 
> 
> > From: Kenneth Lee [mailto:liguo...@hisilicon.com]
> > Sent: Thursday, August 9, 2018 4:04 PM
> > 
> > But we have another requirement which is to combine some device
> > together to
> > share the same address space. This is a little like these kinds of solution:
> > 
> > http://tce.technion.ac.il/wp-content/uploads/sites/8/2015/06/SC-7.2-M.-
> > Silberstein.pdf
> > 
> > With that, the application can directly pass the NiC packet pointer to the
> > decryption accelerator, and get the bare data in place. This is the feature
> > that
> > the VFIO container can provide.
> 
> above is not a good argument, at least in the context of your discussion.
> If each device has their own interface (similar to GPU) for process to bind 
> with, then having the process binding to multiple devices one-by-one then
> you still get same address space shared cross them...

If we consider this from the VFIO container perspective, with a container, you
can do DMA to the container applying it to all devices, even the device is added
after the DMA operation.  

So your argument remains true only when SVM is enabled and the whole process 
space
is devoted to the devices. 

Yes, the process can do the same all by itself. But if we agree with that, it
makes no sense to keep the container concept in VFIO;)

> 
> Thanks
> Kevin

-- 
-Kenneth(Hisilicon)


本邮件及其附件含有华为公司的保密信息,仅限于发送给上面地址中列出的个人或群组。禁
止任何其他人以任何形式使用(包括但不限于全部或部分地泄露、复制、或散发)本邮件中
的信息。如果您错收了本邮件,请您立即电话或邮件通知发件人并删除本邮件!
This e-mail and its attachments contain confidential information from HUAWEI,
which is intended only for the person or entity whose address is listed above.
Any use of the 
information contained herein in any way (including, but not limited to, total or
partial disclosure, reproduction, or dissemination) by persons other than the
intended 
recipient(s) is prohibited. If you receive this e-mail in error, please notify
the sender by phone or email immediately and delete it!

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH] vfio/pci: Some buggy virtual functions incorrectly report 1 for intx.

2018-08-09 Thread Raj, Ashok
On Thu, Aug 09, 2018 at 01:44:17PM -0600, Alex Williamson wrote:
> On Thu,  9 Aug 2018 12:37:06 -0700
> Ashok Raj  wrote:
> 
> > PCI_INTERRUPT_PIN should always read  0 for SRIOV Virtual Functions.
> > 
> > Some SRIOV devices have some bugs in RTL and VF's end up reading 1
> > instead of 0 for the PIN.
> 
> Hi Ashok,
> 
> One question, can we identify which VFs are known to have this issue so
> that users (and downstreams) can know how to prioritize this patch?
> Thanks,

Hi Alex,

The hardware isn't public yet, so can't talk about it :-(. Once this patch gets 
merged, will let the OSV engagement folks drive it for inclusions. We 
could mark this for stable, but i would rather wait until we know the 
timeline when they are expecting it to be in. It shouldn't break anything
since we are just enforcing the spec.

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: IOAT DMA w/IOMMU

2018-08-09 Thread Kit Chow



On 08/09/2018 03:50 PM, Logan Gunthorpe wrote:


On 09/08/18 04:48 PM, Kit Chow wrote:

Based on Logan's comments, I am very hopeful that the dma_map_resource
will make things work on the older platforms...

Well, I *think* dma_map_single() would still work. So I'm not that
confident that's the root of your problem. I'd still like to see the
actual code snippet you are using.

Logan
Here's the code snippet - (ntbdebug & 4) path does dma_map_resource of 
the pci bar address.


It was:
    unmap->addr[1] = dma_map_single(device->dev, (void 
*)dest, len,

    DMA_TO_DEVICE);

Kit
---


static int ntb_async_tx_submit(struct ntb_transport_qp *qp,
   struct ntb_queue_entry *entry)
{
    struct dma_async_tx_descriptor *txd;
    struct dma_chan *chan = qp->tx_dma_chan;
    struct dma_device *device;
    size_t len = entry->len;
    void *buf = entry->buf;
    size_t dest_off, buff_off;
    struct dmaengine_unmap_data *unmap;
    dma_addr_t dest;
    dma_cookie_t cookie;
    int unmapcnt;

    device = chan->device;

    dest = qp->tx_mw_phys + qp->tx_max_frame * entry->tx_index;

    buff_off = (size_t)buf & ~PAGE_MASK;
    dest_off = (size_t)dest & ~PAGE_MASK;

    if (!is_dma_copy_aligned(device, buff_off, dest_off, len))
    goto err;


    if (ntbdebug & 0x4) {
    unmapcnt = 2;
    } else {
    unmapcnt = 1;
    }

    unmap = dmaengine_get_unmap_data(device->dev, unmapcnt, 
GFP_NOWAIT);

    if (!unmap)
    goto err;

    unmap->len = len;
    unmap->addr[0] = dma_map_page(device->dev, virt_to_page(buf),
  buff_off, len, DMA_TO_DEVICE);
    if (dma_mapping_error(device->dev, unmap->addr[0]))
    goto err_get_unmap;

    if (ntbdebug & 0x4) {
    unmap->addr[1] = dma_map_resource(device->dev,
    (phys_addr_t)dest, len, DMA_TO_DEVICE, 0);
    if (dma_mapping_error(device->dev, unmap->addr[1]))
    goto err_get_unmap;
    unmap->to_cnt = 2;
    } else {
    unmap->addr[1] = dest;
    unmap->to_cnt = 1;
    }

    txd = device->device_prep_dma_memcpy(chan, unmap->addr[1],
    unmap->addr[0], len, DMA_PREP_INTERRUPT);

    if (!txd)
    goto err_get_unmap;

    txd->callback_result = ntb_tx_copy_callback;
    txd->callback_param = entry;
    dma_set_unmap(txd, unmap);

    cookie = dmaengine_submit(txd);
    if (dma_submit_error(cookie))
    goto err_set_unmap;

    dmaengine_unmap_put(unmap);

    dma_async_issue_pending(chan);

    return 0;

err_set_unmap:
    dma_descriptor_unmap(txd);
    txd->desc_free(txd);
err_get_unmap:
    dmaengine_unmap_put(unmap);
err:
    return -ENXIO;
}

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: IOAT DMA w/IOMMU

2018-08-09 Thread Logan Gunthorpe



On 09/08/18 04:48 PM, Kit Chow wrote:
> Based on Logan's comments, I am very hopeful that the dma_map_resource 
> will make things work on the older platforms...

Well, I *think* dma_map_single() would still work. So I'm not that
confident that's the root of your problem. I'd still like to see the
actual code snippet you are using.

Logan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: IOAT DMA w/IOMMU

2018-08-09 Thread Kit Chow



On 08/09/2018 03:40 PM, Jiang, Dave wrote:

-Original Message-
From: linux-pci-ow...@vger.kernel.org [mailto:linux-pci-ow...@vger.kernel.org] 
On Behalf Of Kit Chow
Sent: Thursday, August 9, 2018 2:48 PM
To: Logan Gunthorpe ; Eric Pilmore ; Bjorn 
Helgaas 
Cc: linux-...@vger.kernel.org; David Woodhouse ; Alex Williamson 
;
iommu@lists.linux-foundation.org
Subject: Re: IOAT DMA w/IOMMU



On 08/09/2018 02:11 PM, Logan Gunthorpe wrote:

On 09/08/18 02:57 PM, Kit Chow wrote:

On 08/09/2018 01:11 PM, Logan Gunthorpe wrote:

On 09/08/18 01:47 PM, Kit Chow wrote:

I haven't tested this scenario but my guess would be that IOAT would
indeed go through the IOMMU and the PCI BAR address would need to be
properly mapped into the IOAT's IOVA. The fact that you see DMAR errors
is probably a good indication that this is the case. I really don't know
why you'd want to DMA something without mapping it.

I have experimented with changing ntb_async_tx_submit to dma_map the PCI
BAR
address. With this, I get a different DMAR error:

What code did you use to do this?

If you mean version of linux, it is 4.15.7.  Or specific dma_map call, I
believe it was dma_map_single.

I mean the complete code you use to do the mapping so we can see if it's
correct. dma_map_single() seems like an odd choice, I expected to see
dma_map_resource().

Thanks for the suggestion!  Will try out dma_map_resource and report back.

Kit

Kit, I was able to try this on the Skylake Xeon platform with Intel NTB and 
ioatdma and I did not encounter any issues. Unfortunately I do not have my 
pre-Skylake platform to test anymore to verify.
We are expecting to get skylakes within a couple of weeks and I'll have 
a chance to give it a go on there.


Based on Logan's comments, I am very hopeful that the dma_map_resource 
will make things work on the older platforms...


Thanks
Kit



DMAR: [DMA Write] Request device [07:00.4] fault addr ffd0
[fault reason 12] non-zero reserved fields in PTE

Also, what device corresponds to 07:00.4 on your system?

I believe 07.00.4 was the PLX dma device. I get the same error with ioat.

Using the mapping with the PLX dma device likely converts it from a pure
P2P request to one where the TLPs pass through the IOMMU. So the fact
that you get the same error with both means IOAT almost certainly goes
through the IOMMU and there's something wrong with the mapping setup.

Logan


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

RE: IOAT DMA w/IOMMU

2018-08-09 Thread Jiang, Dave
> -Original Message-
> From: linux-pci-ow...@vger.kernel.org 
> [mailto:linux-pci-ow...@vger.kernel.org] On Behalf Of Kit Chow
> Sent: Thursday, August 9, 2018 2:48 PM
> To: Logan Gunthorpe ; Eric Pilmore 
> ; Bjorn Helgaas 
> Cc: linux-...@vger.kernel.org; David Woodhouse ; Alex 
> Williamson ;
> iommu@lists.linux-foundation.org
> Subject: Re: IOAT DMA w/IOMMU
> 
> 
> 
> On 08/09/2018 02:11 PM, Logan Gunthorpe wrote:
> >
> > On 09/08/18 02:57 PM, Kit Chow wrote:
> >>
> >> On 08/09/2018 01:11 PM, Logan Gunthorpe wrote:
> >>> On 09/08/18 01:47 PM, Kit Chow wrote:
> > I haven't tested this scenario but my guess would be that IOAT would
> > indeed go through the IOMMU and the PCI BAR address would need to be
> > properly mapped into the IOAT's IOVA. The fact that you see DMAR errors
> > is probably a good indication that this is the case. I really don't know
> > why you'd want to DMA something without mapping it.
>  I have experimented with changing ntb_async_tx_submit to dma_map the PCI
>  BAR
>  address. With this, I get a different DMAR error:
> >>> What code did you use to do this?
> >> If you mean version of linux, it is 4.15.7.  Or specific dma_map call, I
> >> believe it was dma_map_single.
> > I mean the complete code you use to do the mapping so we can see if it's
> > correct. dma_map_single() seems like an odd choice, I expected to see
> > dma_map_resource().
> Thanks for the suggestion!  Will try out dma_map_resource and report back.
> 
> Kit

Kit, I was able to try this on the Skylake Xeon platform with Intel NTB and 
ioatdma and I did not encounter any issues. Unfortunately I do not have my 
pre-Skylake platform to test anymore to verify.

> 
>  DMAR: [DMA Write] Request device [07:00.4] fault addr ffd0
>  [fault reason 12] non-zero reserved fields in PTE
> >>> Also, what device corresponds to 07:00.4 on your system?
> >> I believe 07.00.4 was the PLX dma device. I get the same error with ioat.
> > Using the mapping with the PLX dma device likely converts it from a pure
> > P2P request to one where the TLPs pass through the IOMMU. So the fact
> > that you get the same error with both means IOAT almost certainly goes
> > through the IOMMU and there's something wrong with the mapping setup.
> >
> > Logan

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: IOAT DMA w/IOMMU

2018-08-09 Thread Kit Chow



On 08/09/2018 02:11 PM, Logan Gunthorpe wrote:


On 09/08/18 02:57 PM, Kit Chow wrote:


On 08/09/2018 01:11 PM, Logan Gunthorpe wrote:

On 09/08/18 01:47 PM, Kit Chow wrote:

I haven't tested this scenario but my guess would be that IOAT would
indeed go through the IOMMU and the PCI BAR address would need to be
properly mapped into the IOAT's IOVA. The fact that you see DMAR errors
is probably a good indication that this is the case. I really don't know
why you'd want to DMA something without mapping it.

I have experimented with changing ntb_async_tx_submit to dma_map the PCI
BAR
address. With this, I get a different DMAR error:

What code did you use to do this?

If you mean version of linux, it is 4.15.7.  Or specific dma_map call, I
believe it was dma_map_single.

I mean the complete code you use to do the mapping so we can see if it's
correct. dma_map_single() seems like an odd choice, I expected to see
dma_map_resource().

Thanks for the suggestion!  Will try out dma_map_resource and report back.

Kit


DMAR: [DMA Write] Request device [07:00.4] fault addr ffd0
[fault reason 12] non-zero reserved fields in PTE

Also, what device corresponds to 07:00.4 on your system?

I believe 07.00.4 was the PLX dma device. I get the same error with ioat.

Using the mapping with the PLX dma device likely converts it from a pure
P2P request to one where the TLPs pass through the IOMMU. So the fact
that you get the same error with both means IOAT almost certainly goes
through the IOMMU and there's something wrong with the mapping setup.

Logan


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: IOAT DMA w/IOMMU

2018-08-09 Thread Logan Gunthorpe



On 09/08/18 03:31 PM, Eric Pilmore wrote:
> On Thu, Aug 9, 2018 at 12:35 PM, Logan Gunthorpe  wrote:
>> Hey,
>>
>> On 09/08/18 12:51 PM, Eric Pilmore wrote:
> Was wondering if anybody here has used IOAT DMA engines with an
> IOMMU turned on (Xeon based system)? My specific question is really
> whether it is possible to DMA (w/IOAT) to a PCI BAR address as the
> destination without having to map that address to the IOVA space of
> the DMA engine first (assuming the IOMMU is on)?
>>
>> I haven't tested this scenario but my guess would be that IOAT would
>> indeed go through the IOMMU and the PCI BAR address would need to be
>> properly mapped into the IOAT's IOVA. The fact that you see DMAR errors
>> is probably a good indication that this is the case. I really don't know
>> why you'd want to DMA something without mapping it.
> 
> The thought was to avoid paying the price of having to go through yet another
> translation and also because it was not believed to be necessary anyway since
> the DMA device could go straight to a PCI BAR address without the need for a
> mapping.  We have been playing with two DMA engines, IOAT and PLX. The
> PLX does not have any issues going straight to the PCI BAR address, but unlike
> IOAT, PLX is sitting "in the PCI tree".

Yes, you can do true P2P transactions with the PLX DMA engine. (Though
you do have to be careful as technically you need to translate to a PCI
bus address not the CPU physical address which are the same on x86; and
you need to watch that ACS doesn't mess it up).

It doesn't sound like it's possible to avoid the IOMMU when using the
IOAT so you need the mapping. I would not expect the extra translation
in the IOMMU to be noticeable, performance wise.

Logan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: IOAT DMA w/IOMMU

2018-08-09 Thread Eric Pilmore
On Thu, Aug 9, 2018 at 12:35 PM, Logan Gunthorpe  wrote:
> Hey,
>
> On 09/08/18 12:51 PM, Eric Pilmore wrote:
 Was wondering if anybody here has used IOAT DMA engines with an
 IOMMU turned on (Xeon based system)? My specific question is really
 whether it is possible to DMA (w/IOAT) to a PCI BAR address as the
 destination without having to map that address to the IOVA space of
 the DMA engine first (assuming the IOMMU is on)?
>
> I haven't tested this scenario but my guess would be that IOAT would
> indeed go through the IOMMU and the PCI BAR address would need to be
> properly mapped into the IOAT's IOVA. The fact that you see DMAR errors
> is probably a good indication that this is the case. I really don't know
> why you'd want to DMA something without mapping it.

The thought was to avoid paying the price of having to go through yet another
translation and also because it was not believed to be necessary anyway since
the DMA device could go straight to a PCI BAR address without the need for a
mapping.  We have been playing with two DMA engines, IOAT and PLX. The
PLX does not have any issues going straight to the PCI BAR address, but unlike
IOAT, PLX is sitting "in the PCI tree".

>
>>> So is this a peer-to-peer DMA scenario?  You mention DMA, which would
>>> be a transaction initiated by a PCI device, to a PCI BAR address, so
>>> it doesn't sound like system memory is involved.
>>
>> No, not peer-to-peer.  This is from system memory (e.g. SKB buffer which
>> has had an IOMMU mapping created) to a PCI BAR address.
>
> It's definitely peer-to-peer in the case where you are using a DMA
> engine in the PCI tree. You have the DMA PCI device sending TLPs
> directly to the PCI BAR device. So, if everything is done right, the
> TLPs will avoid the root complex completely. (Though, ACS settings could
> also prevent this from working and you'd either get similar DMAR errors
> or they'd disappear into a black hole).
>
> When using the IOAT, it is part of the CPU so I wouldn't say it's really
> peer-to-peer. But an argument could be made that it is. Though, this is
> exactly what the existing ntb_transport is doing: DMAing from system
> memory to a PCI BAR and vice versa using IOAT.
>
> Logan
>



-- 
Eric Pilmore
epilm...@gigaio.com
http://gigaio.com
Phone: (858) 775 2514

This e-mail message is intended only for the individual(s) to whom it
is addressed and
may contain information that is privileged, confidential, proprietary,
or otherwise exempt
from disclosure under applicable law. If you believe you have received
this message in
error, please advise the sender by return e-mail and delete it from
your mailbox.
Thank you.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: IOAT DMA w/IOMMU

2018-08-09 Thread Logan Gunthorpe


On 09/08/18 02:57 PM, Kit Chow wrote:
> 
> 
> On 08/09/2018 01:11 PM, Logan Gunthorpe wrote:
>>
>> On 09/08/18 01:47 PM, Kit Chow wrote:
 I haven't tested this scenario but my guess would be that IOAT would
 indeed go through the IOMMU and the PCI BAR address would need to be
 properly mapped into the IOAT's IOVA. The fact that you see DMAR errors
 is probably a good indication that this is the case. I really don't know
 why you'd want to DMA something without mapping it.
>>> I have experimented with changing ntb_async_tx_submit to dma_map the PCI
>>> BAR
>>> address. With this, I get a different DMAR error:
>> What code did you use to do this?
> If you mean version of linux, it is 4.15.7.  Or specific dma_map call, I 
> believe it was dma_map_single.

I mean the complete code you use to do the mapping so we can see if it's
correct. dma_map_single() seems like an odd choice, I expected to see
dma_map_resource().

>>> DMAR: [DMA Write] Request device [07:00.4] fault addr ffd0
>>> [fault reason 12] non-zero reserved fields in PTE
>> Also, what device corresponds to 07:00.4 on your system?
> I believe 07.00.4 was the PLX dma device. I get the same error with ioat.

Using the mapping with the PLX dma device likely converts it from a pure
P2P request to one where the TLPs pass through the IOMMU. So the fact
that you get the same error with both means IOAT almost certainly goes
through the IOMMU and there's something wrong with the mapping setup.

Logan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: IOAT DMA w/IOMMU

2018-08-09 Thread Kit Chow



On 08/09/2018 01:11 PM, Logan Gunthorpe wrote:


On 09/08/18 01:47 PM, Kit Chow wrote:

I haven't tested this scenario but my guess would be that IOAT would
indeed go through the IOMMU and the PCI BAR address would need to be
properly mapped into the IOAT's IOVA. The fact that you see DMAR errors
is probably a good indication that this is the case. I really don't know
why you'd want to DMA something without mapping it.

I have experimented with changing ntb_async_tx_submit to dma_map the PCI
BAR
address. With this, I get a different DMAR error:

What code did you use to do this?
If you mean version of linux, it is 4.15.7.  Or specific dma_map call, I 
believe it was dma_map_single.





DMAR: [DMA Write] Request device [07:00.4] fault addr ffd0
[fault reason 12] non-zero reserved fields in PTE

Also, what device corresponds to 07:00.4 on your system?

I believe 07.00.4 was the PLX dma device. I get the same error with ioat.

Kit



Logan


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH] iommu/iova: Optimise attempts to allocate iova from 32bit address range

2018-08-09 Thread Robin Murphy

On 2018-08-09 6:49 PM, Ganapatrao Kulkarni wrote:

Hi Robin,

On Thu, Aug 9, 2018 at 9:54 PM, Robin Murphy  wrote:

On 07/08/18 09:54, Ganapatrao Kulkarni wrote:


As an optimisation for PCI devices, there is always first attempt
been made to allocate iova from SAC address range. This will lead
to unnecessary attempts/function calls, when there are no free ranges
available.

This patch optimises by adding flag to track previous failed attempts
and avoids further attempts until replenish happens.



Agh, what I overlooked is that this still suffers from the original problem,
wherein a large allocation which fails due to fragmentation then blocks all
subsequent smaller allocations, even if they may have succeeded.

For a minimal change, though, what I think we could do is instead of just
having a flag, track the size of the last 32-bit allocation which failed. If
we're happy to assume that nobody's likely to mix aligned and unaligned
allocations within the same domain, then that should be sufficiently robust
whilst being no more complicated than this version, i.e. (modulo thinking up
a better name for it):


I agree, it would be better to track size and attempt to allocate for
smaller chunks, if not for bigger one.





Signed-off-by: Ganapatrao Kulkarni 
---
This patch is based on comments from Robin Murphy 
for patch [1]

[1] https://lkml.org/lkml/2018/4/19/780

   drivers/iommu/iova.c | 11 ++-
   include/linux/iova.h |  1 +
   2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 83fe262..d97bb5a 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -56,6 +56,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned
long granule,
 iovad->granule = granule;
 iovad->start_pfn = start_pfn;
 iovad->dma_32bit_pfn = 1UL << (32 - iova_shift(iovad));
+   iovad->free_32bit_pfns = true;



 iovad->max_32bit_free = iovad->dma_32bit_pfn;


 iovad->flush_cb = NULL;
 iovad->fq = NULL;
 iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR;
@@ -139,8 +140,10 @@ __cached_rbnode_delete_update(struct iova_domain
*iovad, struct iova *free)
 cached_iova = rb_entry(iovad->cached32_node, struct iova, node);
 if (free->pfn_hi < iovad->dma_32bit_pfn &&
-   free->pfn_lo >= cached_iova->pfn_lo)
+   free->pfn_lo >= cached_iova->pfn_lo) {
 iovad->cached32_node = rb_next(>node);
+   iovad->free_32bit_pfns = true;



 iovad->max_32bit_free = iovad->dma_32bit_pfn;


i think, you intended to say,
   iovad->max_32bit_free += (free->pfn_hi - free->pfn_lo);


Nope, that's why I said it needed a better name ;)

(I nearly called it last_failed_32bit_alloc_size, but that's a bit much)

The point of this value (whetever it's called) is that at any given time 
it holds an upper bound on the size of the largest contiguous free area. 
It doesn't have to be the *smallest* upper bound, which is why we can 
keep things simple and avoid arithmetic - in realistic use-cases like 
yours when the allocations are a pretty constant size, this should work 
out directly equivalent to the boolean, only with values of "size" and 
"dma_32bit_pfn" instead of 0 and 1, so we don't do any more work than 
necessary. In the edge cases where allocations are all different sizes, 
it does mean that we will probably end up performing more failing 
allocations than if we actually tried to track all of the free space 
exactly, but I think that's reasonable - just because I want to make 
sure we handle such cases fairly gracefully, doesn't mean that we need 
to do extra work on the typical fast path to try and actually optimise 
for them (which is why I didn't really like the accounting 
implementation I came up with).





+   }
 cached_iova = rb_entry(iovad->cached_node, struct iova, node);
 if (free->pfn_lo >= cached_iova->pfn_lo)
@@ -290,6 +293,10 @@ alloc_iova(struct iova_domain *iovad, unsigned long
size,
 struct iova *new_iova;
 int ret;
   + if (limit_pfn <= iovad->dma_32bit_pfn &&
+   !iovad->free_32bit_pfns)



 size >= iovad->max_32bit_free)


+   return NULL;
+
 new_iova = alloc_iova_mem();
 if (!new_iova)
 return NULL;
@@ -299,6 +306,8 @@ alloc_iova(struct iova_domain *iovad, unsigned long
size,
 if (ret) {
 free_iova_mem(new_iova);
+   if (limit_pfn <= iovad->dma_32bit_pfn)
+   iovad->free_32bit_pfns = false;



 iovad->max_32bit_free = size;


same here, we should decrease available free range after successful allocation.
iovad->max_32bit_free -= size;


Equivalently, the simple assignment is strictly decreasing the upper 
bound already, since we can only get here if size < max_32bit_free in 
the first place. One more thing I've realised is 

Re: IOAT DMA w/IOMMU

2018-08-09 Thread Logan Gunthorpe



On 09/08/18 01:47 PM, Kit Chow wrote:
>> I haven't tested this scenario but my guess would be that IOAT would
>> indeed go through the IOMMU and the PCI BAR address would need to be
>> properly mapped into the IOAT's IOVA. The fact that you see DMAR errors
>> is probably a good indication that this is the case. I really don't know
>> why you'd want to DMA something without mapping it.
> I have experimented with changing ntb_async_tx_submit to dma_map the PCI 
> BAR
> address. With this, I get a different DMAR error:

What code did you use to do this?


> DMAR: [DMA Write] Request device [07:00.4] fault addr ffd0
> [fault reason 12] non-zero reserved fields in PTE

Also, what device corresponds to 07:00.4 on your system?

Logan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: IOAT DMA w/IOMMU

2018-08-09 Thread Logan Gunthorpe
Hey,

On 09/08/18 12:51 PM, Eric Pilmore wrote:
>>> Was wondering if anybody here has used IOAT DMA engines with an
>>> IOMMU turned on (Xeon based system)? My specific question is really
>>> whether it is possible to DMA (w/IOAT) to a PCI BAR address as the
>>> destination without having to map that address to the IOVA space of
>>> the DMA engine first (assuming the IOMMU is on)?

I haven't tested this scenario but my guess would be that IOAT would
indeed go through the IOMMU and the PCI BAR address would need to be
properly mapped into the IOAT's IOVA. The fact that you see DMAR errors
is probably a good indication that this is the case. I really don't know
why you'd want to DMA something without mapping it.

>> So is this a peer-to-peer DMA scenario?  You mention DMA, which would
>> be a transaction initiated by a PCI device, to a PCI BAR address, so
>> it doesn't sound like system memory is involved.
> 
> No, not peer-to-peer.  This is from system memory (e.g. SKB buffer which
> has had an IOMMU mapping created) to a PCI BAR address.

It's definitely peer-to-peer in the case where you are using a DMA
engine in the PCI tree. You have the DMA PCI device sending TLPs
directly to the PCI BAR device. So, if everything is done right, the
TLPs will avoid the root complex completely. (Though, ACS settings could
also prevent this from working and you'd either get similar DMAR errors
or they'd disappear into a black hole).

When using the IOAT, it is part of the CPU so I wouldn't say it's really
peer-to-peer. But an argument could be made that it is. Though, this is
exactly what the existing ntb_transport is doing: DMAing from system
memory to a PCI BAR and vice versa using IOAT.

Logan

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: IOAT DMA w/IOMMU

2018-08-09 Thread Kit Chow



On 08/09/2018 12:35 PM, Logan Gunthorpe wrote:

Hey,

On 09/08/18 12:51 PM, Eric Pilmore wrote:

Was wondering if anybody here has used IOAT DMA engines with an
IOMMU turned on (Xeon based system)? My specific question is really
whether it is possible to DMA (w/IOAT) to a PCI BAR address as the
destination without having to map that address to the IOVA space of
the DMA engine first (assuming the IOMMU is on)?

I haven't tested this scenario but my guess would be that IOAT would
indeed go through the IOMMU and the PCI BAR address would need to be
properly mapped into the IOAT's IOVA. The fact that you see DMAR errors
is probably a good indication that this is the case. I really don't know
why you'd want to DMA something without mapping it.
I have experimented with changing ntb_async_tx_submit to dma_map the PCI 
BAR

address. With this, I get a different DMAR error:

DMAR: [DMA Write] Request device [07:00.4] fault addr ffd0
[fault reason 12] non-zero reserved fields in PTE

Kit

So is this a peer-to-peer DMA scenario?  You mention DMA, which would
be a transaction initiated by a PCI device, to a PCI BAR address, so
it doesn't sound like system memory is involved.

No, not peer-to-peer.  This is from system memory (e.g. SKB buffer which
has had an IOMMU mapping created) to a PCI BAR address.

It's definitely peer-to-peer in the case where you are using a DMA
engine in the PCI tree. You have the DMA PCI device sending TLPs
directly to the PCI BAR device. So, if everything is done right, the
TLPs will avoid the root complex completely. (Though, ACS settings could
also prevent this from working and you'd either get similar DMAR errors
or they'd disappear into a black hole).

When using the IOAT, it is part of the CPU so I wouldn't say it's really
peer-to-peer. But an argument could be made that it is. Though, this is
exactly what the existing ntb_transport is doing: DMAing from system
memory to a PCI BAR and vice versa using IOAT.

Logan



___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] vfio/pci: Some buggy virtual functions incorrectly report 1 for intx.

2018-08-09 Thread Alex Williamson
On Thu,  9 Aug 2018 12:37:06 -0700
Ashok Raj  wrote:

> PCI_INTERRUPT_PIN should always read  0 for SRIOV Virtual Functions.
> 
> Some SRIOV devices have some bugs in RTL and VF's end up reading 1
> instead of 0 for the PIN.

Hi Ashok,

One question, can we identify which VFs are known to have this issue so
that users (and downstreams) can know how to prioritize this patch?
Thanks,

Alex

> Since this is a spec required value, rather than having a device specific
> quirk, we could fix it permanently in vfio.
> 
> Reworked suggestions from Alex https://lkml.org/lkml/2018/7/16/1052
> 
> Reported-by: Gage Eads 
> Tested-by: Gage Eads 
> Signed-off-by: Ashok Raj 
> Cc: k...@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: iommu@lists.linux-foundation.org
> Cc: Joerg Roedel 
> Cc: Bjorn Helgaas 
> Cc: Gage Eads 
> ---
>  drivers/vfio/pci/vfio_pci.c| 12 +---
>  drivers/vfio/pci/vfio_pci_config.c |  3 ++-
>  2 files changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index b423a30..32943dd 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -433,10 +433,16 @@ static int vfio_pci_get_irq_count(struct 
> vfio_pci_device *vdev, int irq_type)
>  {
>   if (irq_type == VFIO_PCI_INTX_IRQ_INDEX) {
>   u8 pin;
> - pci_read_config_byte(vdev->pdev, PCI_INTERRUPT_PIN, );
> - if (IS_ENABLED(CONFIG_VFIO_PCI_INTX) && !vdev->nointx && pin)
> - return 1;
> + /*
> +  * INTx must be 0 for all VF's. Enforce that for all
> +  * VF's since this is a spec requirement.
> +  */
> + if (!IS_ENABLED(CONFIG_VFIO_PCI_INTX) || vdev->nointx ||
> + vdev->pdev->is_virtfn)
> + return 0;
>  
> + pci_read_config_byte(vdev->pdev, PCI_INTERRUPT_PIN, );
> + return (pin ? 1 : 0);
>   } else if (irq_type == VFIO_PCI_MSI_IRQ_INDEX) {
>   u8 pos;
>   u16 flags;
> diff --git a/drivers/vfio/pci/vfio_pci_config.c 
> b/drivers/vfio/pci/vfio_pci_config.c
> index 115a36f..e36b7c3 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -1676,7 +1676,8 @@ int vfio_config_init(struct vfio_pci_device *vdev)
>   *(__le16 *)[PCI_DEVICE_ID] = cpu_to_le16(pdev->device);
>   }
>  
> - if (!IS_ENABLED(CONFIG_VFIO_PCI_INTX) || vdev->nointx)
> + if (!IS_ENABLED(CONFIG_VFIO_PCI_INTX) || vdev->nointx ||
> + pdev->is_virtfn)
>   vconfig[PCI_INTERRUPT_PIN] = 0;
>  
>   ret = vfio_cap_init(vdev);

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH] vfio/pci: Some buggy virtual functions incorrectly report 1 for intx.

2018-08-09 Thread Ashok Raj
PCI_INTERRUPT_PIN should always read  0 for SRIOV Virtual Functions.

Some SRIOV devices have some bugs in RTL and VF's end up reading 1
instead of 0 for the PIN.

Since this is a spec required value, rather than having a device specific
quirk, we could fix it permanently in vfio.

Reworked suggestions from Alex https://lkml.org/lkml/2018/7/16/1052

Reported-by: Gage Eads 
Tested-by: Gage Eads 
Signed-off-by: Ashok Raj 
Cc: k...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: iommu@lists.linux-foundation.org
Cc: Joerg Roedel 
Cc: Bjorn Helgaas 
Cc: Gage Eads 
---
 drivers/vfio/pci/vfio_pci.c| 12 +---
 drivers/vfio/pci/vfio_pci_config.c |  3 ++-
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index b423a30..32943dd 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -433,10 +433,16 @@ static int vfio_pci_get_irq_count(struct vfio_pci_device 
*vdev, int irq_type)
 {
if (irq_type == VFIO_PCI_INTX_IRQ_INDEX) {
u8 pin;
-   pci_read_config_byte(vdev->pdev, PCI_INTERRUPT_PIN, );
-   if (IS_ENABLED(CONFIG_VFIO_PCI_INTX) && !vdev->nointx && pin)
-   return 1;
+   /*
+* INTx must be 0 for all VF's. Enforce that for all
+* VF's since this is a spec requirement.
+*/
+   if (!IS_ENABLED(CONFIG_VFIO_PCI_INTX) || vdev->nointx ||
+   vdev->pdev->is_virtfn)
+   return 0;
 
+   pci_read_config_byte(vdev->pdev, PCI_INTERRUPT_PIN, );
+   return (pin ? 1 : 0);
} else if (irq_type == VFIO_PCI_MSI_IRQ_INDEX) {
u8 pos;
u16 flags;
diff --git a/drivers/vfio/pci/vfio_pci_config.c 
b/drivers/vfio/pci/vfio_pci_config.c
index 115a36f..e36b7c3 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -1676,7 +1676,8 @@ int vfio_config_init(struct vfio_pci_device *vdev)
*(__le16 *)[PCI_DEVICE_ID] = cpu_to_le16(pdev->device);
}
 
-   if (!IS_ENABLED(CONFIG_VFIO_PCI_INTX) || vdev->nointx)
+   if (!IS_ENABLED(CONFIG_VFIO_PCI_INTX) || vdev->nointx ||
+   pdev->is_virtfn)
vconfig[PCI_INTERRUPT_PIN] = 0;
 
ret = vfio_cap_init(vdev);
-- 
2.7.4

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: IOAT DMA w/IOMMU

2018-08-09 Thread Eric Pilmore
On Thu, Aug 9, 2018 at 11:43 AM, Bjorn Helgaas  wrote:
> [+cc David, Logan, Alex, iommu list]
>
> On Thu, Aug 09, 2018 at 11:14:13AM -0700, Eric Pilmore wrote:
>> Didn't get any response on the IRC channel so trying here.
>>
>> Was wondering if anybody here has used IOAT DMA engines with an
>> IOMMU turned on (Xeon based system)? My specific question is really
>> whether it is possible to DMA (w/IOAT) to a PCI BAR address as the
>> destination without having to map that address to the IOVA space of
>> the DMA engine first (assuming the IOMMU is on)?
>
> So is this a peer-to-peer DMA scenario?  You mention DMA, which would
> be a transaction initiated by a PCI device, to a PCI BAR address, so
> it doesn't sound like system memory is involved.

No, not peer-to-peer.  This is from system memory (e.g. SKB buffer which
has had an IOMMU mapping created) to a PCI BAR address.

>
> I copied some folks who know a lot more about this than I do.

Thank you!


>
>> I am encountering issues where I see PTE Errors reported from DMAR
>> in this scenario, but I do not if I use a different DMA engine
>> that's sitting downstream off the PCI tree. I'm wondering if the
>> IOAT DMA failure is some artifact of these engines sitting behind
>> the Host Bridge.
>>
>> Thanks in advance!
>> Eric
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: IOAT DMA w/IOMMU

2018-08-09 Thread Bjorn Helgaas
[+cc David, Logan, Alex, iommu list]

On Thu, Aug 09, 2018 at 11:14:13AM -0700, Eric Pilmore wrote:
> Didn't get any response on the IRC channel so trying here.
> 
> Was wondering if anybody here has used IOAT DMA engines with an
> IOMMU turned on (Xeon based system)? My specific question is really
> whether it is possible to DMA (w/IOAT) to a PCI BAR address as the
> destination without having to map that address to the IOVA space of
> the DMA engine first (assuming the IOMMU is on)?

So is this a peer-to-peer DMA scenario?  You mention DMA, which would
be a transaction initiated by a PCI device, to a PCI BAR address, so
it doesn't sound like system memory is involved.

I copied some folks who know a lot more about this than I do.

> I am encountering issues where I see PTE Errors reported from DMAR
> in this scenario, but I do not if I use a different DMA engine
> that's sitting downstream off the PCI tree. I'm wondering if the
> IOAT DMA failure is some artifact of these engines sitting behind
> the Host Bridge.
> 
> Thanks in advance!
> Eric
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] iommu/iova: Optimise attempts to allocate iova from 32bit address range

2018-08-09 Thread Ganapatrao Kulkarni
Hi Robin,

On Thu, Aug 9, 2018 at 9:54 PM, Robin Murphy  wrote:
> On 07/08/18 09:54, Ganapatrao Kulkarni wrote:
>>
>> As an optimisation for PCI devices, there is always first attempt
>> been made to allocate iova from SAC address range. This will lead
>> to unnecessary attempts/function calls, when there are no free ranges
>> available.
>>
>> This patch optimises by adding flag to track previous failed attempts
>> and avoids further attempts until replenish happens.
>
>
> Agh, what I overlooked is that this still suffers from the original problem,
> wherein a large allocation which fails due to fragmentation then blocks all
> subsequent smaller allocations, even if they may have succeeded.
>
> For a minimal change, though, what I think we could do is instead of just
> having a flag, track the size of the last 32-bit allocation which failed. If
> we're happy to assume that nobody's likely to mix aligned and unaligned
> allocations within the same domain, then that should be sufficiently robust
> whilst being no more complicated than this version, i.e. (modulo thinking up
> a better name for it):

I agree, it would be better to track size and attempt to allocate for
smaller chunks, if not for bigger one.

>
>>
>> Signed-off-by: Ganapatrao Kulkarni 
>> ---
>> This patch is based on comments from Robin Murphy 
>> for patch [1]
>>
>> [1] https://lkml.org/lkml/2018/4/19/780
>>
>>   drivers/iommu/iova.c | 11 ++-
>>   include/linux/iova.h |  1 +
>>   2 files changed, 11 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>> index 83fe262..d97bb5a 100644
>> --- a/drivers/iommu/iova.c
>> +++ b/drivers/iommu/iova.c
>> @@ -56,6 +56,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned
>> long granule,
>> iovad->granule = granule;
>> iovad->start_pfn = start_pfn;
>> iovad->dma_32bit_pfn = 1UL << (32 - iova_shift(iovad));
>> +   iovad->free_32bit_pfns = true;
>
>
> iovad->max_32bit_free = iovad->dma_32bit_pfn;
>
>> iovad->flush_cb = NULL;
>> iovad->fq = NULL;
>> iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR;
>> @@ -139,8 +140,10 @@ __cached_rbnode_delete_update(struct iova_domain
>> *iovad, struct iova *free)
>> cached_iova = rb_entry(iovad->cached32_node, struct iova, node);
>> if (free->pfn_hi < iovad->dma_32bit_pfn &&
>> -   free->pfn_lo >= cached_iova->pfn_lo)
>> +   free->pfn_lo >= cached_iova->pfn_lo) {
>> iovad->cached32_node = rb_next(>node);
>> +   iovad->free_32bit_pfns = true;
>
>
> iovad->max_32bit_free = iovad->dma_32bit_pfn;

i think, you intended to say,
  iovad->max_32bit_free += (free->pfn_hi - free->pfn_lo);

>
>> +   }
>> cached_iova = rb_entry(iovad->cached_node, struct iova, node);
>> if (free->pfn_lo >= cached_iova->pfn_lo)
>> @@ -290,6 +293,10 @@ alloc_iova(struct iova_domain *iovad, unsigned long
>> size,
>> struct iova *new_iova;
>> int ret;
>>   + if (limit_pfn <= iovad->dma_32bit_pfn &&
>> +   !iovad->free_32bit_pfns)
>
>
> size >= iovad->max_32bit_free)
>
>> +   return NULL;
>> +
>> new_iova = alloc_iova_mem();
>> if (!new_iova)
>> return NULL;
>> @@ -299,6 +306,8 @@ alloc_iova(struct iova_domain *iovad, unsigned long
>> size,
>> if (ret) {
>> free_iova_mem(new_iova);
>> +   if (limit_pfn <= iovad->dma_32bit_pfn)
>> +   iovad->free_32bit_pfns = false;
>
>
> iovad->max_32bit_free = size;

same here, we should decrease available free range after successful allocation.
iovad->max_32bit_free -= size;

>
> What do you think?

most likely this should work, i will try this and confirm at the earliest,

>
> Robin.
>
>
>> return NULL;
>> }
>>   diff --git a/include/linux/iova.h b/include/linux/iova.h
>> index 928442d..3810ba9 100644
>> --- a/include/linux/iova.h
>> +++ b/include/linux/iova.h
>> @@ -96,6 +96,7 @@ struct iova_domain {
>>flush-queues */
>> atomic_t fq_timer_on;   /* 1 when timer is active,
>> 0
>>when not */
>> +   boolfree_32bit_pfns;
>>   };
>> static inline unsigned long iova_size(struct iova *iova)
>>
>

thanks,
Ganapat
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] iommu/iova: Optimise attempts to allocate iova from 32bit address range

2018-08-09 Thread Robin Murphy

On 07/08/18 09:54, Ganapatrao Kulkarni wrote:

As an optimisation for PCI devices, there is always first attempt
been made to allocate iova from SAC address range. This will lead
to unnecessary attempts/function calls, when there are no free ranges
available.

This patch optimises by adding flag to track previous failed attempts
and avoids further attempts until replenish happens.


Agh, what I overlooked is that this still suffers from the original 
problem, wherein a large allocation which fails due to fragmentation 
then blocks all subsequent smaller allocations, even if they may have 
succeeded.


For a minimal change, though, what I think we could do is instead of 
just having a flag, track the size of the last 32-bit allocation which 
failed. If we're happy to assume that nobody's likely to mix aligned and 
unaligned allocations within the same domain, then that should be 
sufficiently robust whilst being no more complicated than this version, 
i.e. (modulo thinking up a better name for it):




Signed-off-by: Ganapatrao Kulkarni 
---
This patch is based on comments from Robin Murphy 
for patch [1]

[1] https://lkml.org/lkml/2018/4/19/780

  drivers/iommu/iova.c | 11 ++-
  include/linux/iova.h |  1 +
  2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 83fe262..d97bb5a 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -56,6 +56,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long 
granule,
iovad->granule = granule;
iovad->start_pfn = start_pfn;
iovad->dma_32bit_pfn = 1UL << (32 - iova_shift(iovad));
+   iovad->free_32bit_pfns = true;


iovad->max_32bit_free = iovad->dma_32bit_pfn;


iovad->flush_cb = NULL;
iovad->fq = NULL;
iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR;
@@ -139,8 +140,10 @@ __cached_rbnode_delete_update(struct iova_domain *iovad, 
struct iova *free)
  
  	cached_iova = rb_entry(iovad->cached32_node, struct iova, node);

if (free->pfn_hi < iovad->dma_32bit_pfn &&
-   free->pfn_lo >= cached_iova->pfn_lo)
+   free->pfn_lo >= cached_iova->pfn_lo) {
iovad->cached32_node = rb_next(>node);
+   iovad->free_32bit_pfns = true;


iovad->max_32bit_free = iovad->dma_32bit_pfn;


+   }
  
  	cached_iova = rb_entry(iovad->cached_node, struct iova, node);

if (free->pfn_lo >= cached_iova->pfn_lo)
@@ -290,6 +293,10 @@ alloc_iova(struct iova_domain *iovad, unsigned long size,
struct iova *new_iova;
int ret;
  
+	if (limit_pfn <= iovad->dma_32bit_pfn &&

+   !iovad->free_32bit_pfns)


size >= iovad->max_32bit_free)


+   return NULL;
+
new_iova = alloc_iova_mem();
if (!new_iova)
return NULL;
@@ -299,6 +306,8 @@ alloc_iova(struct iova_domain *iovad, unsigned long size,
  
  	if (ret) {

free_iova_mem(new_iova);
+   if (limit_pfn <= iovad->dma_32bit_pfn)
+   iovad->free_32bit_pfns = false;


iovad->max_32bit_free = size;

What do you think?

Robin.


return NULL;
}
  
diff --git a/include/linux/iova.h b/include/linux/iova.h

index 928442d..3810ba9 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -96,6 +96,7 @@ struct iova_domain {
   flush-queues */
atomic_t fq_timer_on;   /* 1 when timer is active, 0
   when not */
+   boolfree_32bit_pfns;
  };
  
  static inline unsigned long iova_size(struct iova *iova)



___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2 2/8] iommu/tegra: gart: Provide access to Memory Controller driver

2018-08-09 Thread Dmitry Osipenko
On Thursday, 9 August 2018 17:52:06 MSK Thierry Reding wrote:
> On Thu, Aug 09, 2018 at 05:22:59PM +0300, Dmitry Osipenko wrote:
> > On Thursday, 9 August 2018 16:59:24 MSK Thierry Reding wrote:
> > > On Thu, Aug 09, 2018 at 02:39:03PM +0300, Dmitry Osipenko wrote:
> > > > On Thursday, 9 August 2018 14:17:46 MSK Thierry Reding wrote:
> > > > > On Sat, Aug 04, 2018 at 05:29:57PM +0300, Dmitry Osipenko wrote:
> > > > > > GART contain registers needed by the Memory Controller driver,
> > > > > > provide
> > > > > > access to the MC driver by utilizing its GART-integration
> > > > > > facility.
> > > > > > 
> > > > > > Signed-off-by: Dmitry Osipenko 
> > > > > > ---
> > > > > > 
> > > > > >  drivers/iommu/tegra-gart.c | 23 +++
> > > > > >  1 file changed, 23 insertions(+)
> > > > > > 
> > > > > > diff --git a/drivers/iommu/tegra-gart.c
> > > > > > b/drivers/iommu/tegra-gart.c
> > > > > > index a004f6da35f2..f8b653e25914 100644
> > > > > > --- a/drivers/iommu/tegra-gart.c
> > > > > > +++ b/drivers/iommu/tegra-gart.c
> > > > > > @@ -31,6 +31,8 @@
> > > > > > 
> > > > > >  #include 
> > > > > >  #include 
> > > > > > 
> > > > > > +#include 
> > > > > > +
> > > > > > 
> > > > > >  #include 
> > > > > >  
> > > > > >  /* bitmap of the page sizes currently supported */
> > > > > > 
> > > > > > @@ -41,6 +43,8 @@
> > > > > > 
> > > > > >  #define GART_ENTRY_ADDR(0x28 - GART_REG_BASE)
> > > > > >  #define GART_ENTRY_DATA(0x2c - GART_REG_BASE)
> > > > > >  #define GART_ENTRY_PHYS_ADDR_VALID (1 << 31)
> > > > > > 
> > > > > > +#define GART_ERROR_REQ (0x30 - GART_REG_BASE)
> > > > > > +#define GART_ERROR_ADDR(0x34 - GART_REG_BASE)
> > > > > > 
> > > > > >  #define GART_PAGE_SHIFT12
> > > > > >  #define GART_PAGE_SIZE (1 << GART_PAGE_SHIFT)
> > > > > > 
> > > > > > @@ -63,6 +67,8 @@ struct gart_device {
> > > > > > 
> > > > > > struct device   *dev;
> > > > > > 
> > > > > > struct iommu_device iommu;  /* IOMMU Core handle */
> > > > > > 
> > > > > > +
> > > > > > +   struct tegra_mc_gart_handle mc_gart_handle;
> > > > > > 
> > > > > >  };
> > > > > >  
> > > > > >  struct gart_domain {
> > > > > > 
> > > > > > @@ -408,6 +414,20 @@ static int tegra_gart_resume(struct device
> > > > > > *dev)
> > > > > > 
> > > > > > return 0;
> > > > > >  
> > > > > >  }
> > > > > > 
> > > > > > +static u32 tegra_gart_error_addr(struct tegra_mc_gart_handle
> > > > > > *handle)
> > > > > > +{
> > > > > > +   struct gart_device *gart = container_of(handle, struct
> > 
> > gart_device,
> > 
> > > > > > +   mc_gart_handle);
> > > > > > +   return readl_relaxed(gart->regs + GART_ERROR_ADDR);
> > > > > > +}
> > > > > > +
> > > > > > +static u32 tegra_gart_error_req(struct tegra_mc_gart_handle
> > > > > > *handle)
> > > > > > +{
> > > > > > +   struct gart_device *gart = container_of(handle, struct
> > 
> > gart_device,
> > 
> > > > > > +   mc_gart_handle);
> > > > > > +   return readl_relaxed(gart->regs + GART_ERROR_REQ);
> > > > > > +}
> > > > > > +
> > > > > > 
> > > > > >  static int tegra_gart_probe(struct platform_device *pdev)
> > > > > >  {
> > > > > >  
> > > > > > struct gart_device *gart;
> > > > > > 
> > > > > > @@ -464,6 +484,8 @@ static int tegra_gart_probe(struct
> > > > > > platform_device
> > > > > > *pdev)>
> > > > > > 
> > > > > > gart->regs = gart_regs;
> > > > > > gart->iovmm_base = (dma_addr_t)res_remap->start;
> > > > > > gart->page_count = (resource_size(res_remap) >>
> > > > > > GART_PAGE_SHIFT);
> > > > > > 
> > > > > > +   gart->mc_gart_handle.error_addr = tegra_gart_error_addr;
> > > > > > +   gart->mc_gart_handle.error_req = tegra_gart_error_req;
> > > > > > 
> > > > > > gart->savedata = vmalloc(array_size(sizeof(u32),
> > > > > > gart->page_count));
> > > > > > if (!gart->savedata) {
> > > > > > 
> > > > > > @@ -475,6 +497,7 @@ static int tegra_gart_probe(struct
> > > > > > platform_device
> > > > > > *pdev)>
> > > > > > 
> > > > > > do_gart_setup(gart, NULL);
> > > > > > 
> > > > > > gart_handle = gart;
> > > > > > 
> > > > > > +   tegra_mc_register_gart(>mc_gart_handle);
> > > > > > 
> > > > > > return 0;
> > > > > >  
> > > > > >  }
> > > > > 
> > > > > I see now why you've did it this way. We have separate devices
> > > > > unlike
> > > > > with SMMU where it is properly modeled as part of the memory
> > > > > controller.
> > > > > I think we should consider breaking ABI at this point and properly
> > > > > merge
> > > > > both the memory controller and GART nodes. There's really no reason
> > > > > why
> > > > > they should be separate and we're jumping through multiple hoops to
> > > > > do
> > > > > what we need to do just because a few years back we made a mistake.
> > > > > 
> > > > > I know we're technically not supposed to break the DT ABI, but I
> > > > 

Re: [PATCH v2 2/8] iommu/tegra: gart: Provide access to Memory Controller driver

2018-08-09 Thread Thierry Reding
On Thu, Aug 09, 2018 at 05:22:59PM +0300, Dmitry Osipenko wrote:
> On Thursday, 9 August 2018 16:59:24 MSK Thierry Reding wrote:
> > On Thu, Aug 09, 2018 at 02:39:03PM +0300, Dmitry Osipenko wrote:
> > > On Thursday, 9 August 2018 14:17:46 MSK Thierry Reding wrote:
> > > > On Sat, Aug 04, 2018 at 05:29:57PM +0300, Dmitry Osipenko wrote:
> > > > > GART contain registers needed by the Memory Controller driver, provide
> > > > > access to the MC driver by utilizing its GART-integration facility.
> > > > > 
> > > > > Signed-off-by: Dmitry Osipenko 
> > > > > ---
> > > > > 
> > > > >  drivers/iommu/tegra-gart.c | 23 +++
> > > > >  1 file changed, 23 insertions(+)
> > > > > 
> > > > > diff --git a/drivers/iommu/tegra-gart.c b/drivers/iommu/tegra-gart.c
> > > > > index a004f6da35f2..f8b653e25914 100644
> > > > > --- a/drivers/iommu/tegra-gart.c
> > > > > +++ b/drivers/iommu/tegra-gart.c
> > > > > @@ -31,6 +31,8 @@
> > > > > 
> > > > >  #include 
> > > > >  #include 
> > > > > 
> > > > > +#include 
> > > > > +
> > > > > 
> > > > >  #include 
> > > > >  
> > > > >  /* bitmap of the page sizes currently supported */
> > > > > 
> > > > > @@ -41,6 +43,8 @@
> > > > > 
> > > > >  #define GART_ENTRY_ADDR  (0x28 - GART_REG_BASE)
> > > > >  #define GART_ENTRY_DATA  (0x2c - GART_REG_BASE)
> > > > >  #define GART_ENTRY_PHYS_ADDR_VALID   (1 << 31)
> > > > > 
> > > > > +#define GART_ERROR_REQ   (0x30 - GART_REG_BASE)
> > > > > +#define GART_ERROR_ADDR  (0x34 - GART_REG_BASE)
> > > > > 
> > > > >  #define GART_PAGE_SHIFT  12
> > > > >  #define GART_PAGE_SIZE   (1 << GART_PAGE_SHIFT)
> > > > > 
> > > > > @@ -63,6 +67,8 @@ struct gart_device {
> > > > > 
> > > > >   struct device   *dev;
> > > > >   
> > > > >   struct iommu_device iommu;  /* IOMMU Core handle */
> > > > > 
> > > > > +
> > > > > + struct tegra_mc_gart_handle mc_gart_handle;
> > > > > 
> > > > >  };
> > > > >  
> > > > >  struct gart_domain {
> > > > > 
> > > > > @@ -408,6 +414,20 @@ static int tegra_gart_resume(struct device *dev)
> > > > > 
> > > > >   return 0;
> > > > >  
> > > > >  }
> > > > > 
> > > > > +static u32 tegra_gart_error_addr(struct tegra_mc_gart_handle *handle)
> > > > > +{
> > > > > + struct gart_device *gart = container_of(handle, struct 
> gart_device,
> > > > > + mc_gart_handle);
> > > > > + return readl_relaxed(gart->regs + GART_ERROR_ADDR);
> > > > > +}
> > > > > +
> > > > > +static u32 tegra_gart_error_req(struct tegra_mc_gart_handle *handle)
> > > > > +{
> > > > > + struct gart_device *gart = container_of(handle, struct 
> gart_device,
> > > > > + mc_gart_handle);
> > > > > + return readl_relaxed(gart->regs + GART_ERROR_REQ);
> > > > > +}
> > > > > +
> > > > > 
> > > > >  static int tegra_gart_probe(struct platform_device *pdev)
> > > > >  {
> > > > >  
> > > > >   struct gart_device *gart;
> > > > > 
> > > > > @@ -464,6 +484,8 @@ static int tegra_gart_probe(struct platform_device
> > > > > *pdev)>
> > > > > 
> > > > >   gart->regs = gart_regs;
> > > > >   gart->iovmm_base = (dma_addr_t)res_remap->start;
> > > > >   gart->page_count = (resource_size(res_remap) >> 
> > > > > GART_PAGE_SHIFT);
> > > > > 
> > > > > + gart->mc_gart_handle.error_addr = tegra_gart_error_addr;
> > > > > + gart->mc_gart_handle.error_req = tegra_gart_error_req;
> > > > > 
> > > > >   gart->savedata = vmalloc(array_size(sizeof(u32), 
> > > > > gart->page_count));
> > > > >   if (!gart->savedata) {
> > > > > 
> > > > > @@ -475,6 +497,7 @@ static int tegra_gart_probe(struct platform_device
> > > > > *pdev)>
> > > > > 
> > > > >   do_gart_setup(gart, NULL);
> > > > >   
> > > > >   gart_handle = gart;
> > > > > 
> > > > > + tegra_mc_register_gart(>mc_gart_handle);
> > > > > 
> > > > >   return 0;
> > > > >  
> > > > >  }
> > > > 
> > > > I see now why you've did it this way. We have separate devices unlike
> > > > with SMMU where it is properly modeled as part of the memory controller.
> > > > I think we should consider breaking ABI at this point and properly merge
> > > > both the memory controller and GART nodes. There's really no reason why
> > > > they should be separate and we're jumping through multiple hoops to do
> > > > what we need to do just because a few years back we made a mistake.
> > > > 
> > > > I know we're technically not supposed to break the DT ABI, but I think
> > > > in this particular case we can make a good case for it. The current DT
> > > > bindings are plainly broken, and obviously so. Also, we don't currently
> > > > use the GART upstream for anything, so we can't break any existing
> > > > systems either.
> > > 
> > > IIUC, that will require to break the stable DT ABI of the tegra20-mc,
> > > which is working fine and does its job. I'm personally not 

Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-09 Thread Jerome Glisse
On Thu, Aug 09, 2018 at 04:03:52PM +0800, Kenneth Lee wrote:
> On Wed, Aug 08, 2018 at 11:18:35AM -0400, Jerome Glisse wrote:
> > On Wed, Aug 08, 2018 at 09:08:42AM +0800, Kenneth Lee wrote:
> > > 在 2018年08月06日 星期一 11:32 下午, Jerome Glisse 写道:
> > > > On Mon, Aug 06, 2018 at 11:12:52AM +0800, Kenneth Lee wrote:
> > > > > On Fri, Aug 03, 2018 at 10:39:44AM -0400, Jerome Glisse wrote:
> > > > > > On Fri, Aug 03, 2018 at 11:47:21AM +0800, Kenneth Lee wrote:
> > > > > > > On Thu, Aug 02, 2018 at 10:22:43AM -0400, Jerome Glisse wrote:
> > > > > > > > On Thu, Aug 02, 2018 at 12:05:57PM +0800, Kenneth Lee wrote:

[...]

> > > > > > > > your mechanisms the userspace must have a specific userspace
> > > > > > > > drivers for each hardware and thus there are virtually no
> > > > > > > > differences between having this userspace driver open a device
> > > > > > > > file in vfio or somewhere else in the device filesystem. This is
> > > > > > > > just a different path.
> > > > > > > > 
> > > > > > > The basic problem WarpDrive want to solve it to avoid syscall. 
> > > > > > > This is important
> > > > > > > to accelerators. We have some data here:
> > > > > > > https://www.slideshare.net/linaroorg/progress-and-demonstration-of-wrapdrive-a-accelerator-framework-sfo17317
> > > > > > > 
> > > > > > > (see page 3)
> > > > > > > 
> > > > > > > The performance is different on using kernel and user drivers.
> > > > > > Yes and example i point to is exactly that. You have a one time 
> > > > > > setup
> > > > > > cost (creating command buffer binding PASID with command buffer and
> > > > > > couple other setup steps). Then userspace no longer have to do any
> > > > > > ioctl to schedule work on the GPU. It is all down from userspace and
> > > > > > it use a doorbell to notify hardware when it should go look at 
> > > > > > command
> > > > > > buffer for new thing to execute.
> > > > > > 
> > > > > > My point stands on that. You have existing driver already doing so
> > > > > > with no new framework and in your scheme you need a userspace 
> > > > > > driver.
> > > > > > So i do not see the value add, using one path or the other in the
> > > > > > userspace driver is litteraly one line to change.
> > > > > > 
> > > > > Sorry, I'd got confuse here. I partially agree that the user driver is
> > > > > redundance of kernel driver. (But for WarpDrive, the kernel driver is 
> > > > > a full
> > > > > driver include all preparation and setup stuff for the hardware, the 
> > > > > user driver
> > > > > is simply to send request and receive answer). Yes, it is just a 
> > > > > choice of path.
> > > > > But the user path is faster if the request come from use space. And 
> > > > > to do that,
> > > > > we need user land DMA support. Then why is it invaluable to let VFIO 
> > > > > involved?
> > > > Some drivers in the kernel already do exactly what you said. The user
> > > > space emit commands without ever going into kernel by directly 
> > > > scheduling
> > > > commands and ringing a doorbell. They do not need VFIO either and they
> > > > can map userspace address into the DMA address space of the device and
> > > > again they do not need VFIO for that.
> > > Could you please directly point out which driver you refer to here? Thank
> > > you.
> > 
> > drivers/gpu/drm/amd/
> > 
> > Sub-directory of interest is amdkfd
> > 
> > Because it is a big driver here is a highlevel overview of how it works
> > (this is a simplification):
> >   - Process can allocate GPUs buffer (through ioclt) and map them into
> > its address space (through mmap of device file at buffer object
> > specific offset).
> >   - Process can map any valid range of virtual address space into device
> > address space (IOMMU mapping). This must be regular memory ie not an
> > mmap of a device file or any special file (this is the non PASID
> > path)
> >   - Process can create a command queue and bind its process to it aka
> > PASID, this is done through an ioctl.
> >   - Process can schedule commands onto queues it created from userspace
> > without ioctl. For that it just write command into a ring buffer
> > that it mapped during the command queue creation process and it
> > rings a doorbell when commands are ready to be consume by the
> > hardware.
> >   - Commands can reference (access) all 3 types of object above ie
> > either full GPUs buffer, process regular memory maped as object
> > (non PASID) and PASID memory all at the same time ie you can
> > mix all of the above in same commands queue.
> >   - Kernel can evict, unbind any process command queues, unbind commands
> > queue are still valid from process point of view but commands
> > process schedules on them will not be executed until kernel re-bind
> > the queue.
> >   - Kernel can schedule commands itself onto its dedicated command
> > queues (kernel driver create its own command queues).
> >   - Kernel can control priorities between all the 

Re: [PATCH v2 2/8] iommu/tegra: gart: Provide access to Memory Controller driver

2018-08-09 Thread Dmitry Osipenko
On Thursday, 9 August 2018 16:59:24 MSK Thierry Reding wrote:
> On Thu, Aug 09, 2018 at 02:39:03PM +0300, Dmitry Osipenko wrote:
> > On Thursday, 9 August 2018 14:17:46 MSK Thierry Reding wrote:
> > > On Sat, Aug 04, 2018 at 05:29:57PM +0300, Dmitry Osipenko wrote:
> > > > GART contain registers needed by the Memory Controller driver, provide
> > > > access to the MC driver by utilizing its GART-integration facility.
> > > > 
> > > > Signed-off-by: Dmitry Osipenko 
> > > > ---
> > > > 
> > > >  drivers/iommu/tegra-gart.c | 23 +++
> > > >  1 file changed, 23 insertions(+)
> > > > 
> > > > diff --git a/drivers/iommu/tegra-gart.c b/drivers/iommu/tegra-gart.c
> > > > index a004f6da35f2..f8b653e25914 100644
> > > > --- a/drivers/iommu/tegra-gart.c
> > > > +++ b/drivers/iommu/tegra-gart.c
> > > > @@ -31,6 +31,8 @@
> > > > 
> > > >  #include 
> > > >  #include 
> > > > 
> > > > +#include 
> > > > +
> > > > 
> > > >  #include 
> > > >  
> > > >  /* bitmap of the page sizes currently supported */
> > > > 
> > > > @@ -41,6 +43,8 @@
> > > > 
> > > >  #define GART_ENTRY_ADDR(0x28 - GART_REG_BASE)
> > > >  #define GART_ENTRY_DATA(0x2c - GART_REG_BASE)
> > > >  #define GART_ENTRY_PHYS_ADDR_VALID (1 << 31)
> > > > 
> > > > +#define GART_ERROR_REQ (0x30 - GART_REG_BASE)
> > > > +#define GART_ERROR_ADDR(0x34 - GART_REG_BASE)
> > > > 
> > > >  #define GART_PAGE_SHIFT12
> > > >  #define GART_PAGE_SIZE (1 << GART_PAGE_SHIFT)
> > > > 
> > > > @@ -63,6 +67,8 @@ struct gart_device {
> > > > 
> > > > struct device   *dev;
> > > > 
> > > > struct iommu_device iommu;  /* IOMMU Core handle */
> > > > 
> > > > +
> > > > +   struct tegra_mc_gart_handle mc_gart_handle;
> > > > 
> > > >  };
> > > >  
> > > >  struct gart_domain {
> > > > 
> > > > @@ -408,6 +414,20 @@ static int tegra_gart_resume(struct device *dev)
> > > > 
> > > > return 0;
> > > >  
> > > >  }
> > > > 
> > > > +static u32 tegra_gart_error_addr(struct tegra_mc_gart_handle *handle)
> > > > +{
> > > > +   struct gart_device *gart = container_of(handle, struct 
gart_device,
> > > > +   mc_gart_handle);
> > > > +   return readl_relaxed(gart->regs + GART_ERROR_ADDR);
> > > > +}
> > > > +
> > > > +static u32 tegra_gart_error_req(struct tegra_mc_gart_handle *handle)
> > > > +{
> > > > +   struct gart_device *gart = container_of(handle, struct 
gart_device,
> > > > +   mc_gart_handle);
> > > > +   return readl_relaxed(gart->regs + GART_ERROR_REQ);
> > > > +}
> > > > +
> > > > 
> > > >  static int tegra_gart_probe(struct platform_device *pdev)
> > > >  {
> > > >  
> > > > struct gart_device *gart;
> > > > 
> > > > @@ -464,6 +484,8 @@ static int tegra_gart_probe(struct platform_device
> > > > *pdev)>
> > > > 
> > > > gart->regs = gart_regs;
> > > > gart->iovmm_base = (dma_addr_t)res_remap->start;
> > > > gart->page_count = (resource_size(res_remap) >> 
> > > > GART_PAGE_SHIFT);
> > > > 
> > > > +   gart->mc_gart_handle.error_addr = tegra_gart_error_addr;
> > > > +   gart->mc_gart_handle.error_req = tegra_gart_error_req;
> > > > 
> > > > gart->savedata = vmalloc(array_size(sizeof(u32), 
> > > > gart->page_count));
> > > > if (!gart->savedata) {
> > > > 
> > > > @@ -475,6 +497,7 @@ static int tegra_gart_probe(struct platform_device
> > > > *pdev)>
> > > > 
> > > > do_gart_setup(gart, NULL);
> > > > 
> > > > gart_handle = gart;
> > > > 
> > > > +   tegra_mc_register_gart(>mc_gart_handle);
> > > > 
> > > > return 0;
> > > >  
> > > >  }
> > > 
> > > I see now why you've did it this way. We have separate devices unlike
> > > with SMMU where it is properly modeled as part of the memory controller.
> > > I think we should consider breaking ABI at this point and properly merge
> > > both the memory controller and GART nodes. There's really no reason why
> > > they should be separate and we're jumping through multiple hoops to do
> > > what we need to do just because a few years back we made a mistake.
> > > 
> > > I know we're technically not supposed to break the DT ABI, but I think
> > > in this particular case we can make a good case for it. The current DT
> > > bindings are plainly broken, and obviously so. Also, we don't currently
> > > use the GART upstream for anything, so we can't break any existing
> > > systems either.
> > 
> > IIUC, that will require to break the stable DT ABI of the tegra20-mc,
> > which is working fine and does its job. I'm personally not seeing the
> > slight lameness of the current DT as a good excuse to break the ABI.
> > Let's then break DT ABI on all Tegra's and convert them all to genpd and
> > other goodies like assigned clock parents and clock rate.
> 
> genpd and assigned clocks are complementary, 

Re: [PATCH v2 2/8] iommu/tegra: gart: Provide access to Memory Controller driver

2018-08-09 Thread Thierry Reding
On Thu, Aug 09, 2018 at 02:39:03PM +0300, Dmitry Osipenko wrote:
> On Thursday, 9 August 2018 14:17:46 MSK Thierry Reding wrote:
> > On Sat, Aug 04, 2018 at 05:29:57PM +0300, Dmitry Osipenko wrote:
> > > GART contain registers needed by the Memory Controller driver, provide
> > > access to the MC driver by utilizing its GART-integration facility.
> > > 
> > > Signed-off-by: Dmitry Osipenko 
> > > ---
> > > 
> > >  drivers/iommu/tegra-gart.c | 23 +++
> > >  1 file changed, 23 insertions(+)
> > > 
> > > diff --git a/drivers/iommu/tegra-gart.c b/drivers/iommu/tegra-gart.c
> > > index a004f6da35f2..f8b653e25914 100644
> > > --- a/drivers/iommu/tegra-gart.c
> > > +++ b/drivers/iommu/tegra-gart.c
> > > @@ -31,6 +31,8 @@
> > > 
> > >  #include 
> > >  #include 
> > > 
> > > +#include 
> > > +
> > > 
> > >  #include 
> > >  
> > >  /* bitmap of the page sizes currently supported */
> > > 
> > > @@ -41,6 +43,8 @@
> > > 
> > >  #define GART_ENTRY_ADDR  (0x28 - GART_REG_BASE)
> > >  #define GART_ENTRY_DATA  (0x2c - GART_REG_BASE)
> > >  #define GART_ENTRY_PHYS_ADDR_VALID   (1 << 31)
> > > 
> > > +#define GART_ERROR_REQ   (0x30 - GART_REG_BASE)
> > > +#define GART_ERROR_ADDR  (0x34 - GART_REG_BASE)
> > > 
> > >  #define GART_PAGE_SHIFT  12
> > >  #define GART_PAGE_SIZE   (1 << GART_PAGE_SHIFT)
> > > 
> > > @@ -63,6 +67,8 @@ struct gart_device {
> > > 
> > >   struct device   *dev;
> > >   
> > >   struct iommu_device iommu;  /* IOMMU Core handle */
> > > 
> > > +
> > > + struct tegra_mc_gart_handle mc_gart_handle;
> > > 
> > >  };
> > >  
> > >  struct gart_domain {
> > > 
> > > @@ -408,6 +414,20 @@ static int tegra_gart_resume(struct device *dev)
> > > 
> > >   return 0;
> > >  
> > >  }
> > > 
> > > +static u32 tegra_gart_error_addr(struct tegra_mc_gart_handle *handle)
> > > +{
> > > + struct gart_device *gart = container_of(handle, struct gart_device,
> > > + mc_gart_handle);
> > > + return readl_relaxed(gart->regs + GART_ERROR_ADDR);
> > > +}
> > > +
> > > +static u32 tegra_gart_error_req(struct tegra_mc_gart_handle *handle)
> > > +{
> > > + struct gart_device *gart = container_of(handle, struct gart_device,
> > > + mc_gart_handle);
> > > + return readl_relaxed(gart->regs + GART_ERROR_REQ);
> > > +}
> > > +
> > > 
> > >  static int tegra_gart_probe(struct platform_device *pdev)
> > >  {
> > >  
> > >   struct gart_device *gart;
> > > 
> > > @@ -464,6 +484,8 @@ static int tegra_gart_probe(struct platform_device
> > > *pdev)> 
> > >   gart->regs = gart_regs;
> > >   gart->iovmm_base = (dma_addr_t)res_remap->start;
> > >   gart->page_count = (resource_size(res_remap) >> GART_PAGE_SHIFT);
> > > 
> > > + gart->mc_gart_handle.error_addr = tegra_gart_error_addr;
> > > + gart->mc_gart_handle.error_req = tegra_gart_error_req;
> > > 
> > >   gart->savedata = vmalloc(array_size(sizeof(u32), gart->page_count));
> > >   if (!gart->savedata) {
> > > 
> > > @@ -475,6 +497,7 @@ static int tegra_gart_probe(struct platform_device
> > > *pdev)> 
> > >   do_gart_setup(gart, NULL);
> > >   
> > >   gart_handle = gart;
> > > 
> > > + tegra_mc_register_gart(>mc_gart_handle);
> > > 
> > >   return 0;
> > >  
> > >  }
> > 
> > I see now why you've did it this way. We have separate devices unlike
> > with SMMU where it is properly modeled as part of the memory controller.
> > I think we should consider breaking ABI at this point and properly merge
> > both the memory controller and GART nodes. There's really no reason why
> > they should be separate and we're jumping through multiple hoops to do
> > what we need to do just because a few years back we made a mistake.
> > 
> > I know we're technically not supposed to break the DT ABI, but I think
> > in this particular case we can make a good case for it. The current DT
> > bindings are plainly broken, and obviously so. Also, we don't currently
> > use the GART upstream for anything, so we can't break any existing
> > systems either.
> 
> IIUC, that will require to break the stable DT ABI of the tegra20-mc, which 
> is 
> working fine and does its job. I'm personally not seeing the slight lameness 
> of the current DT as a good excuse to break the ABI. Let's then break DT ABI 
> on all Tegra's and convert them all to genpd and other goodies like assigned 
> clock parents and clock rate.

genpd and assigned clocks are complementary, they can be switched to
without breaking ABI.

And that's also different from the memory controller on Tegra20 where we
just made the mistake of describing what is effectively one device as
two separate devices. From what I can tell, the only reason this was
done was because it mapped better to the Linux driver model where there
is a framework to represent an IOMMU and a misunderstanding of how to
work with the driver model and device tree.

As such, I would describe it as more of a bug in the 

Re: [PATCH v2 2/2] iommu/arm-smmu-v3: avoid redundant CMD_SYNCs if possible

2018-08-09 Thread Robin Murphy

On 09/08/18 12:48, Zhen Lei wrote:

More than two CMD_SYNCs maybe adjacent in the command queue, and the first
one has done what others want to do. Drop the redundant CMD_SYNCs can
improve IO performance especially under the pressure scene.

I did the statistics in my test environment, the number of CMD_SYNCs can
be reduced about 1/3. See below:
CMD_SYNCs reduced:  19542181
CMD_SYNCs total:58098548(include reduced)
CMDs total: 116197099   (TLBI:SYNC about 1:1)

Signed-off-by: Zhen Lei 
---
  drivers/iommu/arm-smmu-v3.c | 22 +++---
  1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index d17a9a7..b96d2d2 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -567,6 +567,7 @@ struct arm_smmu_device {
int gerr_irq;
int combined_irq;
u32 sync_nr;
+   u8  prev_cmd_opcode;

unsigned long   ias; /* IPA */
unsigned long   oas; /* PA */
@@ -775,6 +776,11 @@ static int queue_remove_raw(struct arm_smmu_queue *q, u64 
*ent)
return 0;
  }

+static inline u8 arm_smmu_cmd_opcode_get(u64 *cmd)
+{
+   return cmd[0] & CMDQ_0_OP;
+}
+
  /* High-level queue accessors */
  static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
  {
@@ -900,6 +906,8 @@ static void arm_smmu_cmdq_insert_cmd(struct arm_smmu_device 
*smmu, u64 *cmd)
struct arm_smmu_queue *q = >cmdq.q;
bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);

+   smmu->prev_cmd_opcode = arm_smmu_cmd_opcode_get(cmd);
+
while (queue_insert_raw(q, cmd) == -ENOSPC) {
if (queue_poll_cons(q, false, wfe))
dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
@@ -952,9 +960,17 @@ static int __arm_smmu_cmdq_issue_sync_msi(struct 
arm_smmu_device *smmu)
};

spin_lock_irqsave(>cmdq.lock, flags);
-   ent.sync.msidata = ++smmu->sync_nr;
-   arm_smmu_cmdq_build_cmd(cmd, );
-   arm_smmu_cmdq_insert_cmd(smmu, cmd);
+   if (smmu->prev_cmd_opcode == CMDQ_OP_CMD_SYNC) {
+   /*
+* Previous command is CMD_SYNC also, there is no need to add
+* one more. Just poll it.
+*/
+   ent.sync.msidata = smmu->sync_nr;


Aha! at the time I had pondered how to make multiple callers wait on a 
previous sync instead of issuing another back-to-back, but it seemed 
complicated precisely *because* of the counter being updated outside the 
lock. If only I'd realised... :)


Now I just need to figure out if we can do the same for the polling case.

Robin.


+   } else {
+   ent.sync.msidata = ++smmu->sync_nr;
+   arm_smmu_cmdq_build_cmd(cmd, );
+   arm_smmu_cmdq_insert_cmd(smmu, cmd);
+   }
spin_unlock_irqrestore(>cmdq.lock, flags);

return __arm_smmu_sync_poll_msi(smmu, ent.sync.msidata);
--
1.8.3



___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v2 1/2] iommu/arm-smmu-v3: fix unexpected CMD_SYNC timeout

2018-08-09 Thread Zhen Lei
The condition "(int)(VAL - sync_idx) >= 0" to break loop in function
__arm_smmu_sync_poll_msi requires that sync_idx must be increased
monotonously according to the sequence of the CMDs in the cmdq.

But ".msidata = atomic_inc_return_relaxed(>sync_nr)" is not protected
by spinlock, so the following scenarios may appear:
cpu0cpu1
msidata=0
msidata=1
insert cmd1
insert cmd0
smmu execute cmd1
smmu execute cmd0
poll timeout, because msidata=1 is overridden by
cmd0, that means VAL=0, sync_idx=1.

This is not a functional problem, just make the caller wait for a long
time until TIMEOUT. It's rare to happen, because any other CMD_SYNCs
during the waiting period will break it.

Signed-off-by: Zhen Lei 
---
 drivers/iommu/arm-smmu-v3.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 1d64710..d17a9a7 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -566,7 +566,7 @@ struct arm_smmu_device {

int gerr_irq;
int combined_irq;
-   atomic_tsync_nr;
+   u32 sync_nr;

unsigned long   ias; /* IPA */
unsigned long   oas; /* PA */
@@ -947,14 +947,13 @@ static int __arm_smmu_cmdq_issue_sync_msi(struct 
arm_smmu_device *smmu)
struct arm_smmu_cmdq_ent ent = {
.opcode = CMDQ_OP_CMD_SYNC,
.sync   = {
-   .msidata = atomic_inc_return_relaxed(>sync_nr),
.msiaddr = virt_to_phys(>sync_count),
},
};

-   arm_smmu_cmdq_build_cmd(cmd, );
-
spin_lock_irqsave(>cmdq.lock, flags);
+   ent.sync.msidata = ++smmu->sync_nr;
+   arm_smmu_cmdq_build_cmd(cmd, );
arm_smmu_cmdq_insert_cmd(smmu, cmd);
spin_unlock_irqrestore(>cmdq.lock, flags);

@@ -2179,7 +2178,6 @@ static int arm_smmu_init_structures(struct 
arm_smmu_device *smmu)
 {
int ret;

-   atomic_set(>sync_nr, 0);
ret = arm_smmu_init_queues(smmu);
if (ret)
return ret;
--
1.8.3


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v2 0/2] bugfix and optimization about CMD_SYNC

2018-08-09 Thread Zhen Lei
v1->v2:
1. move the call to arm_smmu_cmdq_build_cmd into the critical section,
   and keep itself unchange.
2. Although patch2 can make sure no two CMD_SYNCs will be adjacent,
but patch1 is still needed, see below:

cpu0cpu1cpu2
msidata=0
msidata=1
insert cmd1
insert a TLBI command
insert cmd0
smmu execute cmd1
smmu execute TLBI
smmu execute cmd0
poll timeout, because msidata=1 is overridden by
cmd0, that means VAL=0, sync_idx=1.

Zhen Lei (2):
  iommu/arm-smmu-v3: fix unexpected CMD_SYNC timeout
  iommu/arm-smmu-v3: avoid redundant CMD_SYNCs if possible

 drivers/iommu/arm-smmu-v3.c | 26 --
 1 file changed, 20 insertions(+), 6 deletions(-)

-- 
1.8.3


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v2 2/2] iommu/arm-smmu-v3: avoid redundant CMD_SYNCs if possible

2018-08-09 Thread Zhen Lei
More than two CMD_SYNCs maybe adjacent in the command queue, and the first
one has done what others want to do. Drop the redundant CMD_SYNCs can
improve IO performance especially under the pressure scene.

I did the statistics in my test environment, the number of CMD_SYNCs can
be reduced about 1/3. See below:
CMD_SYNCs reduced:  19542181
CMD_SYNCs total:58098548(include reduced)
CMDs total: 116197099   (TLBI:SYNC about 1:1)

Signed-off-by: Zhen Lei 
---
 drivers/iommu/arm-smmu-v3.c | 22 +++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index d17a9a7..b96d2d2 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -567,6 +567,7 @@ struct arm_smmu_device {
int gerr_irq;
int combined_irq;
u32 sync_nr;
+   u8  prev_cmd_opcode;

unsigned long   ias; /* IPA */
unsigned long   oas; /* PA */
@@ -775,6 +776,11 @@ static int queue_remove_raw(struct arm_smmu_queue *q, u64 
*ent)
return 0;
 }

+static inline u8 arm_smmu_cmd_opcode_get(u64 *cmd)
+{
+   return cmd[0] & CMDQ_0_OP;
+}
+
 /* High-level queue accessors */
 static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
 {
@@ -900,6 +906,8 @@ static void arm_smmu_cmdq_insert_cmd(struct arm_smmu_device 
*smmu, u64 *cmd)
struct arm_smmu_queue *q = >cmdq.q;
bool wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);

+   smmu->prev_cmd_opcode = arm_smmu_cmd_opcode_get(cmd);
+
while (queue_insert_raw(q, cmd) == -ENOSPC) {
if (queue_poll_cons(q, false, wfe))
dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
@@ -952,9 +960,17 @@ static int __arm_smmu_cmdq_issue_sync_msi(struct 
arm_smmu_device *smmu)
};

spin_lock_irqsave(>cmdq.lock, flags);
-   ent.sync.msidata = ++smmu->sync_nr;
-   arm_smmu_cmdq_build_cmd(cmd, );
-   arm_smmu_cmdq_insert_cmd(smmu, cmd);
+   if (smmu->prev_cmd_opcode == CMDQ_OP_CMD_SYNC) {
+   /*
+* Previous command is CMD_SYNC also, there is no need to add
+* one more. Just poll it.
+*/
+   ent.sync.msidata = smmu->sync_nr;
+   } else {
+   ent.sync.msidata = ++smmu->sync_nr;
+   arm_smmu_cmdq_build_cmd(cmd, );
+   arm_smmu_cmdq_insert_cmd(smmu, cmd);
+   }
spin_unlock_irqrestore(>cmdq.lock, flags);

return __arm_smmu_sync_poll_msi(smmu, ent.sync.msidata);
--
1.8.3


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2 2/8] iommu/tegra: gart: Provide access to Memory Controller driver

2018-08-09 Thread Dmitry Osipenko
On Thursday, 9 August 2018 14:17:46 MSK Thierry Reding wrote:
> On Sat, Aug 04, 2018 at 05:29:57PM +0300, Dmitry Osipenko wrote:
> > GART contain registers needed by the Memory Controller driver, provide
> > access to the MC driver by utilizing its GART-integration facility.
> > 
> > Signed-off-by: Dmitry Osipenko 
> > ---
> > 
> >  drivers/iommu/tegra-gart.c | 23 +++
> >  1 file changed, 23 insertions(+)
> > 
> > diff --git a/drivers/iommu/tegra-gart.c b/drivers/iommu/tegra-gart.c
> > index a004f6da35f2..f8b653e25914 100644
> > --- a/drivers/iommu/tegra-gart.c
> > +++ b/drivers/iommu/tegra-gart.c
> > @@ -31,6 +31,8 @@
> > 
> >  #include 
> >  #include 
> > 
> > +#include 
> > +
> > 
> >  #include 
> >  
> >  /* bitmap of the page sizes currently supported */
> > 
> > @@ -41,6 +43,8 @@
> > 
> >  #define GART_ENTRY_ADDR(0x28 - GART_REG_BASE)
> >  #define GART_ENTRY_DATA(0x2c - GART_REG_BASE)
> >  #define GART_ENTRY_PHYS_ADDR_VALID (1 << 31)
> > 
> > +#define GART_ERROR_REQ (0x30 - GART_REG_BASE)
> > +#define GART_ERROR_ADDR(0x34 - GART_REG_BASE)
> > 
> >  #define GART_PAGE_SHIFT12
> >  #define GART_PAGE_SIZE (1 << GART_PAGE_SHIFT)
> > 
> > @@ -63,6 +67,8 @@ struct gart_device {
> > 
> > struct device   *dev;
> > 
> > struct iommu_device iommu;  /* IOMMU Core handle */
> > 
> > +
> > +   struct tegra_mc_gart_handle mc_gart_handle;
> > 
> >  };
> >  
> >  struct gart_domain {
> > 
> > @@ -408,6 +414,20 @@ static int tegra_gart_resume(struct device *dev)
> > 
> > return 0;
> >  
> >  }
> > 
> > +static u32 tegra_gart_error_addr(struct tegra_mc_gart_handle *handle)
> > +{
> > +   struct gart_device *gart = container_of(handle, struct gart_device,
> > +   mc_gart_handle);
> > +   return readl_relaxed(gart->regs + GART_ERROR_ADDR);
> > +}
> > +
> > +static u32 tegra_gart_error_req(struct tegra_mc_gart_handle *handle)
> > +{
> > +   struct gart_device *gart = container_of(handle, struct gart_device,
> > +   mc_gart_handle);
> > +   return readl_relaxed(gart->regs + GART_ERROR_REQ);
> > +}
> > +
> > 
> >  static int tegra_gart_probe(struct platform_device *pdev)
> >  {
> >  
> > struct gart_device *gart;
> > 
> > @@ -464,6 +484,8 @@ static int tegra_gart_probe(struct platform_device
> > *pdev)> 
> > gart->regs = gart_regs;
> > gart->iovmm_base = (dma_addr_t)res_remap->start;
> > gart->page_count = (resource_size(res_remap) >> GART_PAGE_SHIFT);
> > 
> > +   gart->mc_gart_handle.error_addr = tegra_gart_error_addr;
> > +   gart->mc_gart_handle.error_req = tegra_gart_error_req;
> > 
> > gart->savedata = vmalloc(array_size(sizeof(u32), gart->page_count));
> > if (!gart->savedata) {
> > 
> > @@ -475,6 +497,7 @@ static int tegra_gart_probe(struct platform_device
> > *pdev)> 
> > do_gart_setup(gart, NULL);
> > 
> > gart_handle = gart;
> > 
> > +   tegra_mc_register_gart(>mc_gart_handle);
> > 
> > return 0;
> >  
> >  }
> 
> I see now why you've did it this way. We have separate devices unlike
> with SMMU where it is properly modeled as part of the memory controller.
> I think we should consider breaking ABI at this point and properly merge
> both the memory controller and GART nodes. There's really no reason why
> they should be separate and we're jumping through multiple hoops to do
> what we need to do just because a few years back we made a mistake.
> 
> I know we're technically not supposed to break the DT ABI, but I think
> in this particular case we can make a good case for it. The current DT
> bindings are plainly broken, and obviously so. Also, we don't currently
> use the GART upstream for anything, so we can't break any existing
> systems either.

IIUC, that will require to break the stable DT ABI of the tegra20-mc, which is 
working fine and does its job. I'm personally not seeing the slight lameness 
of the current DT as a good excuse to break the ABI. Let's then break DT ABI 
on all Tegra's and convert them all to genpd and other goodies like assigned 
clock parents and clock rate.



___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v4 3/5] iommu/io-pgtable-arm: add support for non-strict mode

2018-08-09 Thread Leizhen (ThunderTown)



On 2018/8/9 18:54, Robin Murphy wrote:
> On 06/08/18 13:27, Zhen Lei wrote:
>> To support the non-strict mode, now we only tlbi and sync for the strict
>> mode. But for the non-leaf case, always follow strict mode.
>>
>> Signed-off-by: Zhen Lei 
>> ---
>>   drivers/iommu/io-pgtable-arm.c | 27 ++-
>>   drivers/iommu/io-pgtable.h |  3 +++
>>   2 files changed, 21 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
>> index 010a254..bb61bef 100644
>> --- a/drivers/iommu/io-pgtable-arm.c
>> +++ b/drivers/iommu/io-pgtable-arm.c
>> @@ -292,7 +292,7 @@ static void __arm_lpae_set_pte(arm_lpae_iopte *ptep, 
>> arm_lpae_iopte pte,
>>
>>   static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>>  unsigned long iova, size_t size, int lvl,
>> -   arm_lpae_iopte *ptep);
>> +   arm_lpae_iopte *ptep, bool strict);
>>
>>   static void __arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
>>   phys_addr_t paddr, arm_lpae_iopte prot,
>> @@ -319,6 +319,7 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable 
>> *data,
>>arm_lpae_iopte prot, int lvl,
>>arm_lpae_iopte *ptep)
>>   {
>> +size_t unmapped;
>>   arm_lpae_iopte pte = *ptep;
>>
>>   if (iopte_leaf(pte, lvl)) {
>> @@ -334,7 +335,8 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable 
>> *data,
>>   size_t sz = ARM_LPAE_BLOCK_SIZE(lvl, data);
>>
>>   tblp = ptep - ARM_LPAE_LVL_IDX(iova, lvl, data);
>> -if (WARN_ON(__arm_lpae_unmap(data, iova, sz, lvl, tblp) != sz))
>> +unmapped = __arm_lpae_unmap(data, iova, sz, lvl, tblp, true);
>> +if (WARN_ON(unmapped != sz))
> 
> What's the extra local variable for?

in order to remove the warning: more than 80 characters a line

> 
>>   return -EINVAL;
>>   }
>>
>> @@ -576,15 +578,17 @@ static size_t arm_lpae_split_blk_unmap(struct 
>> arm_lpae_io_pgtable *data,
>>   }
>>
>>   if (unmap_idx < 0)
>> -return __arm_lpae_unmap(data, iova, size, lvl, tablep);
>> +return __arm_lpae_unmap(data, iova, size, lvl, tablep, true);
>>
>>   io_pgtable_tlb_add_flush(>iop, iova, size, size, true);
>> +io_pgtable_tlb_sync(>iop);
>> +
>>   return size;
>>   }
>>
>>   static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>>  unsigned long iova, size_t size, int lvl,
>> -   arm_lpae_iopte *ptep)
>> +   arm_lpae_iopte *ptep, bool strict)
>>   {
>>   arm_lpae_iopte pte;
>>   struct io_pgtable *iop = >iop;
>> @@ -609,7 +613,7 @@ static size_t __arm_lpae_unmap(struct 
>> arm_lpae_io_pgtable *data,
>>   io_pgtable_tlb_sync(iop);
>>   ptep = iopte_deref(pte, data);
>>   __arm_lpae_free_pgtable(data, lvl + 1, ptep);
>> -} else {
>> +} else if (strict) {
> 
> Since this is the only place we ever actually evaluate "strict", can't we 
> just test iop->cfg.quirks directly at this point instead of playing 
> pass-the-parcel with the extra argument?

Wonderful, you're right!

> 
> Robin.
> 
>>   io_pgtable_tlb_add_flush(iop, iova, size, size, true);
>>   }
>>
>> @@ -625,12 +629,13 @@ static size_t __arm_lpae_unmap(struct 
>> arm_lpae_io_pgtable *data,
>>
>>   /* Keep on walkin' */
>>   ptep = iopte_deref(pte, data);
>> -return __arm_lpae_unmap(data, iova, size, lvl + 1, ptep);
>> +return __arm_lpae_unmap(data, iova, size, lvl + 1, ptep, strict);
>>   }
>>
>>   static size_t arm_lpae_unmap(struct io_pgtable_ops *ops, unsigned long 
>> iova,
>>size_t size)
>>   {
>> +bool strict;
>>   struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
>>   arm_lpae_iopte *ptep = data->pgd;
>>   int lvl = ARM_LPAE_START_LVL(data);
>> @@ -638,7 +643,9 @@ static size_t arm_lpae_unmap(struct io_pgtable_ops *ops, 
>> unsigned long iova,
>>   if (WARN_ON(iova >= (1ULL << data->iop.cfg.ias)))
>>   return 0;
>>
>> -return __arm_lpae_unmap(data, iova, size, lvl, ptep);
>> +strict = !(data->iop.cfg.quirks & IO_PGTABLE_QUIRK_NON_STRICT);
>> +
>> +return __arm_lpae_unmap(data, iova, size, lvl, ptep, strict);
>>   }
>>
>>   static phys_addr_t arm_lpae_iova_to_phys(struct io_pgtable_ops *ops,
>> @@ -771,7 +778,8 @@ static void arm_lpae_restrict_pgsizes(struct 
>> io_pgtable_cfg *cfg)
>>   u64 reg;
>>   struct arm_lpae_io_pgtable *data;
>>
>> -if (cfg->quirks & ~(IO_PGTABLE_QUIRK_ARM_NS | IO_PGTABLE_QUIRK_NO_DMA))
>> +if (cfg->quirks & ~(IO_PGTABLE_QUIRK_ARM_NS | IO_PGTABLE_QUIRK_NO_DMA |
>> +IO_PGTABLE_QUIRK_NON_STRICT))
>>   return NULL;
>>
>>   data = arm_lpae_alloc_pgtable(cfg);
>> @@ -863,7 +871,8 @@ static void arm_lpae_restrict_pgsizes(struct 
>> io_pgtable_cfg *cfg)
>>   struct arm_lpae_io_pgtable 

Re: [PATCH v2 2/8] iommu/tegra: gart: Provide access to Memory Controller driver

2018-08-09 Thread Thierry Reding
On Sat, Aug 04, 2018 at 05:29:57PM +0300, Dmitry Osipenko wrote:
> GART contain registers needed by the Memory Controller driver, provide
> access to the MC driver by utilizing its GART-integration facility.
> 
> Signed-off-by: Dmitry Osipenko 
> ---
>  drivers/iommu/tegra-gart.c | 23 +++
>  1 file changed, 23 insertions(+)
> 
> diff --git a/drivers/iommu/tegra-gart.c b/drivers/iommu/tegra-gart.c
> index a004f6da35f2..f8b653e25914 100644
> --- a/drivers/iommu/tegra-gart.c
> +++ b/drivers/iommu/tegra-gart.c
> @@ -31,6 +31,8 @@
>  #include 
>  #include 
>  
> +#include 
> +
>  #include 
>  
>  /* bitmap of the page sizes currently supported */
> @@ -41,6 +43,8 @@
>  #define GART_ENTRY_ADDR  (0x28 - GART_REG_BASE)
>  #define GART_ENTRY_DATA  (0x2c - GART_REG_BASE)
>  #define GART_ENTRY_PHYS_ADDR_VALID   (1 << 31)
> +#define GART_ERROR_REQ   (0x30 - GART_REG_BASE)
> +#define GART_ERROR_ADDR  (0x34 - GART_REG_BASE)
>  
>  #define GART_PAGE_SHIFT  12
>  #define GART_PAGE_SIZE   (1 << GART_PAGE_SHIFT)
> @@ -63,6 +67,8 @@ struct gart_device {
>   struct device   *dev;
>  
>   struct iommu_device iommu;  /* IOMMU Core handle */
> +
> + struct tegra_mc_gart_handle mc_gart_handle;
>  };
>  
>  struct gart_domain {
> @@ -408,6 +414,20 @@ static int tegra_gart_resume(struct device *dev)
>   return 0;
>  }
>  
> +static u32 tegra_gart_error_addr(struct tegra_mc_gart_handle *handle)
> +{
> + struct gart_device *gart = container_of(handle, struct gart_device,
> + mc_gart_handle);
> + return readl_relaxed(gart->regs + GART_ERROR_ADDR);
> +}
> +
> +static u32 tegra_gart_error_req(struct tegra_mc_gart_handle *handle)
> +{
> + struct gart_device *gart = container_of(handle, struct gart_device,
> + mc_gart_handle);
> + return readl_relaxed(gart->regs + GART_ERROR_REQ);
> +}
> +
>  static int tegra_gart_probe(struct platform_device *pdev)
>  {
>   struct gart_device *gart;
> @@ -464,6 +484,8 @@ static int tegra_gart_probe(struct platform_device *pdev)
>   gart->regs = gart_regs;
>   gart->iovmm_base = (dma_addr_t)res_remap->start;
>   gart->page_count = (resource_size(res_remap) >> GART_PAGE_SHIFT);
> + gart->mc_gart_handle.error_addr = tegra_gart_error_addr;
> + gart->mc_gart_handle.error_req = tegra_gart_error_req;
>  
>   gart->savedata = vmalloc(array_size(sizeof(u32), gart->page_count));
>   if (!gart->savedata) {
> @@ -475,6 +497,7 @@ static int tegra_gart_probe(struct platform_device *pdev)
>   do_gart_setup(gart, NULL);
>  
>   gart_handle = gart;
> + tegra_mc_register_gart(>mc_gart_handle);
>  
>   return 0;
>  }

I see now why you've did it this way. We have separate devices unlike
with SMMU where it is properly modeled as part of the memory controller.
I think we should consider breaking ABI at this point and properly merge
both the memory controller and GART nodes. There's really no reason why
they should be separate and we're jumping through multiple hoops to do
what we need to do just because a few years back we made a mistake.

I know we're technically not supposed to break the DT ABI, but I think
in this particular case we can make a good case for it. The current DT
bindings are plainly broken, and obviously so. Also, we don't currently
use the GART upstream for anything, so we can't break any existing
systems either.

Thierry


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v2 1/8] memory: tegra: Provide facility for integration with the GART driver

2018-08-09 Thread Thierry Reding
On Sat, Aug 04, 2018 at 05:29:56PM +0300, Dmitry Osipenko wrote:
> In order to report clients name and access direction on GART's page fault,
> MC driver need to access GART registers. Add facility that provides access
> to the GART.
> 
> Signed-off-by: Dmitry Osipenko 
> ---
>  drivers/memory/tegra/mc.c | 26 +++---
>  include/soc/tegra/mc.h| 13 +
>  2 files changed, 36 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/memory/tegra/mc.c b/drivers/memory/tegra/mc.c
> index e56862495f36..4940d72b5263 100644
> --- a/drivers/memory/tegra/mc.c
> +++ b/drivers/memory/tegra/mc.c
> @@ -72,6 +72,8 @@ static const struct of_device_id tegra_mc_of_match[] = {
>  };
>  MODULE_DEVICE_TABLE(of, tegra_mc_of_match);
>  
> +static struct tegra_mc_gart_handle *gart_handle;
> +

Why the global variable? Can't this be part of struct tegra_mc? We
already do a very similar thing with the Tegra SMMU integration, why
invent something completely different? Can't we stick to a similar
mechanism?

Given that struct tegra_smmu is opaque at the memory controller level,
we could even go and store the GART related data in the same pointer.

How about the registration code goes into a struct tegra_gart_probe()
function that is called from tegra_mc_probe() right after the SMMU
block:

if (IS_ENABLED(CONFIG_TEGRA_IOMMU_SMMU)) {
mc->smmu = tegra_smmu_probe(...);
...
}

if (IS_ENABLED(CONFIG_TEGRA_IOMMU_GART)) {
mc->smmu = tegra_gart_probe(...);
...
}

?

Thierry


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v4 5/5] iommu/arm-smmu-v3: add bootup option "arm_iommu"

2018-08-09 Thread Robin Murphy

On 06/08/18 13:27, Zhen Lei wrote:

Add a bootup option to make the system manager can choose which mode to
be used. The default mode is strict.

Signed-off-by: Zhen Lei 
---
  Documentation/admin-guide/kernel-parameters.txt |  9 +
  drivers/iommu/arm-smmu-v3.c | 17 -
  2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 533ff5c..426e989 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1720,6 +1720,15 @@
nobypass[PPC/POWERNV]
Disable IOMMU bypass, using IOMMU for PCI devices.

+   arm_iommu=  [ARM64]
+   non-strict  [Default Off]


Again, I'd much rather have "iommu.non_strict= { "0" | "1" }" in line 
with the passthrough option.


Robin.


+   Put off TLBs invalidation and release memory first.
+   It's good for scatter-gather performance but lacks full
+   isolation, an untrusted device can access the reused
+   memory because the TLBs may still valid. Please take
+   full consideration before choosing this mode.
+   Note that, VFIO will always use strict mode.
+
iommu.passthrough=
[ARM64] Configure DMA to bypass the IOMMU by default.
Format: { "0" | "1" }
diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 904bc1e..9a30892 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -631,6 +631,21 @@ struct arm_smmu_option_prop {
{ 0, NULL},
  };

+static u32 smmu_non_strict __read_mostly;
+
+static int __init arm_smmu_setup(char *str)
+{
+   if (!strncmp(str, "non-strict", 10)) {
+   smmu_non_strict = 1;
+   pr_warn("WARNING: iommu non-strict mode is chosen.\n"
+   "It's good for scatter-gather performance but lacks full 
isolation\n");
+   add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
+   }
+
+   return 0;
+}
+early_param("arm_iommu", arm_smmu_setup);
+
  static inline void __iomem *arm_smmu_page1_fixup(unsigned long offset,
 struct arm_smmu_device *smmu)
  {
@@ -1622,7 +1637,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain 
*domain)
if (smmu->features & ARM_SMMU_FEAT_COHERENCY)
pgtbl_cfg.quirks = IO_PGTABLE_QUIRK_NO_DMA;

-   if (domain->type == IOMMU_DOMAIN_DMA) {
+   if ((domain->type == IOMMU_DOMAIN_DMA) && smmu_non_strict) {
domain->non_strict = 1;
pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT;
}
--
1.8.3



___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v4 4/5] iommu/arm-smmu-v3: add support for non-strict mode

2018-08-09 Thread Robin Murphy

On 06/08/18 13:27, Zhen Lei wrote:

Dynamically choose strict or non-strict mode for page table config based
on the iommu domain type.

Signed-off-by: Zhen Lei 
---
  drivers/iommu/arm-smmu-v3.c | 7 ++-
  1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 2f1304b..904bc1e 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1622,6 +1622,11 @@ static int arm_smmu_domain_finalise(struct iommu_domain 
*domain)
if (smmu->features & ARM_SMMU_FEAT_COHERENCY)
pgtbl_cfg.quirks = IO_PGTABLE_QUIRK_NO_DMA;

+   if (domain->type == IOMMU_DOMAIN_DMA) {
+   domain->non_strict = 1;
+   pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT;
+   }
+
pgtbl_ops = alloc_io_pgtable_ops(fmt, _cfg, smmu_domain);
if (!pgtbl_ops)
return -ENOMEM;
@@ -1782,7 +1787,7 @@ static void arm_smmu_iotlb_sync(struct iommu_domain 
*domain)
  {
struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;

-   if (smmu)
+   if (smmu && !domain->non_strict)


That doesn't smell right - even in non-strict domains we still need 
stuff like walk cache invalidations for non-leaf unmaps to be 
synchronous, so we can't just ignore all sync operations at the driver 
level. I think the right thing to do to elide the "normal" sync on unmap 
is to first convert __iommu_dma_unmap to use 
iommu_unmap_fast()/iommu_tlb_sync(), then make it not issue that sync at 
all for non-strict domains.


Robin.


__arm_smmu_tlb_sync(smmu);
  }

--
1.8.3



___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v4 2/5] iommu/dma: add support for non-strict mode

2018-08-09 Thread Leizhen (ThunderTown)



On 2018/8/9 18:46, Robin Murphy wrote:
> On 06/08/18 13:27, Zhen Lei wrote:
>> 1. Save the related domain pointer in struct iommu_dma_cookie, make iovad
>> capable call domain->ops->flush_iotlb_all to flush TLB.
>> 2. During the iommu domain initialization phase, base on domain->non_strict
>> field to check whether non-strict mode is supported or not. If so, call
>> init_iova_flush_queue to register iovad->flush_cb callback.
>> 3. All unmap(contains iova-free) APIs will finally invoke __iommu_dma_unmap
>> -->iommu_dma_free_iova. If the domain is non-strict, call queue_iova to
>> put off iova freeing.
>>
>> Signed-off-by: Zhen Lei 
>> ---
>>   drivers/iommu/dma-iommu.c | 23 +++
>>   drivers/iommu/iommu.c |  1 +
>>   include/linux/iommu.h |  1 +
>>   3 files changed, 25 insertions(+)
>>
>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
>> index ddcbbdb..213e62a 100644
>> --- a/drivers/iommu/dma-iommu.c
>> +++ b/drivers/iommu/dma-iommu.c
>> @@ -55,6 +55,7 @@ struct iommu_dma_cookie {
>>   };
>>   struct list_headmsi_page_list;
>>   spinlock_tmsi_lock;
>> +struct iommu_domain*domain;
>>   };
>>
>>   static inline size_t cookie_msi_granule(struct iommu_dma_cookie *cookie)
>> @@ -257,6 +258,17 @@ static int iova_reserve_iommu_regions(struct device 
>> *dev,
>>   return ret;
>>   }
>>
>> +static void iommu_dma_flush_iotlb_all(struct iova_domain *iovad)
>> +{
>> +struct iommu_dma_cookie *cookie;
>> +struct iommu_domain *domain;
>> +
>> +cookie = container_of(iovad, struct iommu_dma_cookie, iovad);
>> +domain = cookie->domain;
>> +
>> +domain->ops->flush_iotlb_all(domain);
>> +}
>> +
>>   /**
>>* iommu_dma_init_domain - Initialise a DMA mapping domain
>>* @domain: IOMMU domain previously prepared by iommu_get_dma_cookie()
>> @@ -308,6 +320,14 @@ int iommu_dma_init_domain(struct iommu_domain *domain, 
>> dma_addr_t base,
>>   }
>>
>>   init_iova_domain(iovad, 1UL << order, base_pfn);
>> +
>> +if (domain->non_strict) {
>> +BUG_ON(!domain->ops->flush_iotlb_all);
>> +
>> +cookie->domain = domain;
> 
> cookie->domain will only be non-NULL if domain->non_strict is true...
> 
>> +init_iova_flush_queue(iovad, iommu_dma_flush_iotlb_all, NULL);
>> +}
>> +
>>   if (!dev)
>>   return 0;
>>
>> @@ -390,6 +410,9 @@ static void iommu_dma_free_iova(struct iommu_dma_cookie 
>> *cookie,
>>   /* The MSI case is only ever cleaning up its most recent allocation */
>>   if (cookie->type == IOMMU_DMA_MSI_COOKIE)
>>   cookie->msi_iova -= size;
>> +else if (cookie->domain && cookie->domain->non_strict)
> 
> ...so we don't need to re-check non_strict every time here.

OK, I will change it to a comment.

> 
>> +queue_iova(iovad, iova_pfn(iovad, iova),
>> +size >> iova_shift(iovad), 0);
>>   else
>>   free_iova_fast(iovad, iova_pfn(iovad, iova),
>>   size >> iova_shift(iovad));
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index 63b3756..7811fde 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -1263,6 +1263,7 @@ static struct iommu_domain 
>> *__iommu_domain_alloc(struct bus_type *bus,
>>
>>   domain->ops  = bus->iommu_ops;
>>   domain->type = type;
>> +domain->non_strict = 0;
>>   /* Assume all sizes by default; the driver may override this later */
>>   domain->pgsize_bitmap  = bus->iommu_ops->pgsize_bitmap;
>>
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> index 19938ee..0a0fb48 100644
>> --- a/include/linux/iommu.h
>> +++ b/include/linux/iommu.h
>> @@ -88,6 +88,7 @@ struct iommu_domain_geometry {
>>
>>   struct iommu_domain {
>>   unsigned type;
>> +int non_strict;
> 
> bool?

OK

> 
> Robin.
> 
>>   const struct iommu_ops *ops;
>>   unsigned long pgsize_bitmap;/* Bitmap of page sizes in use */
>>   iommu_fault_handler_t handler;
>> -- 
>> 1.8.3
>>
>>
> 
> .
> 

-- 
Thanks!
BestRegards

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v4 3/5] iommu/io-pgtable-arm: add support for non-strict mode

2018-08-09 Thread Robin Murphy

On 06/08/18 13:27, Zhen Lei wrote:

To support the non-strict mode, now we only tlbi and sync for the strict
mode. But for the non-leaf case, always follow strict mode.

Signed-off-by: Zhen Lei 
---
  drivers/iommu/io-pgtable-arm.c | 27 ++-
  drivers/iommu/io-pgtable.h |  3 +++
  2 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 010a254..bb61bef 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -292,7 +292,7 @@ static void __arm_lpae_set_pte(arm_lpae_iopte *ptep, 
arm_lpae_iopte pte,

  static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
   unsigned long iova, size_t size, int lvl,
-  arm_lpae_iopte *ptep);
+  arm_lpae_iopte *ptep, bool strict);

  static void __arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
phys_addr_t paddr, arm_lpae_iopte prot,
@@ -319,6 +319,7 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable 
*data,
 arm_lpae_iopte prot, int lvl,
 arm_lpae_iopte *ptep)
  {
+   size_t unmapped;
arm_lpae_iopte pte = *ptep;

if (iopte_leaf(pte, lvl)) {
@@ -334,7 +335,8 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable 
*data,
size_t sz = ARM_LPAE_BLOCK_SIZE(lvl, data);

tblp = ptep - ARM_LPAE_LVL_IDX(iova, lvl, data);
-   if (WARN_ON(__arm_lpae_unmap(data, iova, sz, lvl, tblp) != sz))
+   unmapped = __arm_lpae_unmap(data, iova, sz, lvl, tblp, true);
+   if (WARN_ON(unmapped != sz))


What's the extra local variable for?


return -EINVAL;
}

@@ -576,15 +578,17 @@ static size_t arm_lpae_split_blk_unmap(struct 
arm_lpae_io_pgtable *data,
}

if (unmap_idx < 0)
-   return __arm_lpae_unmap(data, iova, size, lvl, tablep);
+   return __arm_lpae_unmap(data, iova, size, lvl, tablep, true);

io_pgtable_tlb_add_flush(>iop, iova, size, size, true);
+   io_pgtable_tlb_sync(>iop);
+
return size;
  }

  static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
   unsigned long iova, size_t size, int lvl,
-  arm_lpae_iopte *ptep)
+  arm_lpae_iopte *ptep, bool strict)
  {
arm_lpae_iopte pte;
struct io_pgtable *iop = >iop;
@@ -609,7 +613,7 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable 
*data,
io_pgtable_tlb_sync(iop);
ptep = iopte_deref(pte, data);
__arm_lpae_free_pgtable(data, lvl + 1, ptep);
-   } else {
+   } else if (strict) {


Since this is the only place we ever actually evaluate "strict", can't 
we just test iop->cfg.quirks directly at this point instead of playing 
pass-the-parcel with the extra argument?


Robin.


io_pgtable_tlb_add_flush(iop, iova, size, size, true);
}

@@ -625,12 +629,13 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable 
*data,

/* Keep on walkin' */
ptep = iopte_deref(pte, data);
-   return __arm_lpae_unmap(data, iova, size, lvl + 1, ptep);
+   return __arm_lpae_unmap(data, iova, size, lvl + 1, ptep, strict);
  }

  static size_t arm_lpae_unmap(struct io_pgtable_ops *ops, unsigned long iova,
 size_t size)
  {
+   bool strict;
struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
arm_lpae_iopte *ptep = data->pgd;
int lvl = ARM_LPAE_START_LVL(data);
@@ -638,7 +643,9 @@ static size_t arm_lpae_unmap(struct io_pgtable_ops *ops, 
unsigned long iova,
if (WARN_ON(iova >= (1ULL << data->iop.cfg.ias)))
return 0;

-   return __arm_lpae_unmap(data, iova, size, lvl, ptep);
+   strict = !(data->iop.cfg.quirks & IO_PGTABLE_QUIRK_NON_STRICT);
+
+   return __arm_lpae_unmap(data, iova, size, lvl, ptep, strict);
  }

  static phys_addr_t arm_lpae_iova_to_phys(struct io_pgtable_ops *ops,
@@ -771,7 +778,8 @@ static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg 
*cfg)
u64 reg;
struct arm_lpae_io_pgtable *data;

-   if (cfg->quirks & ~(IO_PGTABLE_QUIRK_ARM_NS | IO_PGTABLE_QUIRK_NO_DMA))
+   if (cfg->quirks & ~(IO_PGTABLE_QUIRK_ARM_NS | IO_PGTABLE_QUIRK_NO_DMA |
+   IO_PGTABLE_QUIRK_NON_STRICT))
return NULL;

data = arm_lpae_alloc_pgtable(cfg);
@@ -863,7 +871,8 @@ static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg 
*cfg)
struct arm_lpae_io_pgtable *data;

/* The NS quirk doesn't apply at stage 2 */
-   if (cfg->quirks & ~IO_PGTABLE_QUIRK_NO_DMA)
+   if (cfg->quirks & ~(IO_PGTABLE_QUIRK_NO_DMA |
+  

Re: [PATCH v4 2/5] iommu/dma: add support for non-strict mode

2018-08-09 Thread Robin Murphy

On 06/08/18 13:27, Zhen Lei wrote:

1. Save the related domain pointer in struct iommu_dma_cookie, make iovad
capable call domain->ops->flush_iotlb_all to flush TLB.
2. During the iommu domain initialization phase, base on domain->non_strict
field to check whether non-strict mode is supported or not. If so, call
init_iova_flush_queue to register iovad->flush_cb callback.
3. All unmap(contains iova-free) APIs will finally invoke __iommu_dma_unmap
-->iommu_dma_free_iova. If the domain is non-strict, call queue_iova to
put off iova freeing.

Signed-off-by: Zhen Lei 
---
  drivers/iommu/dma-iommu.c | 23 +++
  drivers/iommu/iommu.c |  1 +
  include/linux/iommu.h |  1 +
  3 files changed, 25 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index ddcbbdb..213e62a 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -55,6 +55,7 @@ struct iommu_dma_cookie {
};
struct list_headmsi_page_list;
spinlock_t  msi_lock;
+   struct iommu_domain *domain;
  };

  static inline size_t cookie_msi_granule(struct iommu_dma_cookie *cookie)
@@ -257,6 +258,17 @@ static int iova_reserve_iommu_regions(struct device *dev,
return ret;
  }

+static void iommu_dma_flush_iotlb_all(struct iova_domain *iovad)
+{
+   struct iommu_dma_cookie *cookie;
+   struct iommu_domain *domain;
+
+   cookie = container_of(iovad, struct iommu_dma_cookie, iovad);
+   domain = cookie->domain;
+
+   domain->ops->flush_iotlb_all(domain);
+}
+
  /**
   * iommu_dma_init_domain - Initialise a DMA mapping domain
   * @domain: IOMMU domain previously prepared by iommu_get_dma_cookie()
@@ -308,6 +320,14 @@ int iommu_dma_init_domain(struct iommu_domain *domain, 
dma_addr_t base,
}

init_iova_domain(iovad, 1UL << order, base_pfn);
+
+   if (domain->non_strict) {
+   BUG_ON(!domain->ops->flush_iotlb_all);
+
+   cookie->domain = domain;


cookie->domain will only be non-NULL if domain->non_strict is true...


+   init_iova_flush_queue(iovad, iommu_dma_flush_iotlb_all, NULL);
+   }
+
if (!dev)
return 0;

@@ -390,6 +410,9 @@ static void iommu_dma_free_iova(struct iommu_dma_cookie 
*cookie,
/* The MSI case is only ever cleaning up its most recent allocation */
if (cookie->type == IOMMU_DMA_MSI_COOKIE)
cookie->msi_iova -= size;
+   else if (cookie->domain && cookie->domain->non_strict)


...so we don't need to re-check non_strict every time here.


+   queue_iova(iovad, iova_pfn(iovad, iova),
+   size >> iova_shift(iovad), 0);
else
free_iova_fast(iovad, iova_pfn(iovad, iova),
size >> iova_shift(iovad));
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 63b3756..7811fde 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1263,6 +1263,7 @@ static struct iommu_domain *__iommu_domain_alloc(struct 
bus_type *bus,

domain->ops  = bus->iommu_ops;
domain->type = type;
+   domain->non_strict = 0;
/* Assume all sizes by default; the driver may override this later */
domain->pgsize_bitmap  = bus->iommu_ops->pgsize_bitmap;

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 19938ee..0a0fb48 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -88,6 +88,7 @@ struct iommu_domain_geometry {

  struct iommu_domain {
unsigned type;
+   int non_strict;


bool?

Robin.


const struct iommu_ops *ops;
unsigned long pgsize_bitmap;/* Bitmap of page sizes in use */
iommu_fault_handler_t handler;
--
1.8.3



___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v4 1/5] iommu/arm-smmu-v3: fix the implementation of flush_iotlb_all hook

2018-08-09 Thread Robin Murphy

On 06/08/18 13:27, Zhen Lei wrote:

.flush_iotlb_all can not just wait for previous tlbi operations to be
completed, but should also invalid all TLBs of the related domain.


I think it was like that because the only caller in practice was 
iommu_group_create_direct_mappings(), and at that point any relevant 
invalidations would have already been issued anyway. Once 
flush_iotlb_all actually does what it's supposed to we'll get a bit of 
unavoidable over-invalidation on that path, but it's no big deal.


Reviewed-by: Robin Murphy 


Signed-off-by: Zhen Lei 
---
  drivers/iommu/arm-smmu-v3.c | 10 +-
  1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 4810f61..2f1304b 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1770,6 +1770,14 @@ static int arm_smmu_map(struct iommu_domain *domain, 
unsigned long iova,
return ops->unmap(ops, iova, size);
  }

+static void arm_smmu_flush_iotlb_all(struct iommu_domain *domain)
+{
+   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+
+   if (smmu_domain->smmu)
+   arm_smmu_tlb_inv_context(smmu_domain);
+}
+
  static void arm_smmu_iotlb_sync(struct iommu_domain *domain)
  {
struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;
@@ -1998,7 +2006,7 @@ static void arm_smmu_put_resv_regions(struct device *dev,
.map= arm_smmu_map,
.unmap  = arm_smmu_unmap,
.map_sg = default_iommu_map_sg,
-   .flush_iotlb_all= arm_smmu_iotlb_sync,
+   .flush_iotlb_all= arm_smmu_flush_iotlb_all,
.iotlb_sync = arm_smmu_iotlb_sync,
.iova_to_phys   = arm_smmu_iova_to_phys,
.add_device = arm_smmu_add_device,
--
1.8.3



___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 1/1] iommu/arm-smmu-v3: fix unexpected CMD_SYNC timeout

2018-08-09 Thread Leizhen (ThunderTown)



On 2018/8/9 16:49, Will Deacon wrote:
> On Thu, Aug 09, 2018 at 09:30:51AM +0800, Leizhen (ThunderTown) wrote:
>> On 2018/8/8 18:12, Will Deacon wrote:
>>> On Mon, Aug 06, 2018 at 08:31:29PM +0800, Zhen Lei wrote:
 The condition "(int)(VAL - sync_idx) >= 0" to break loop in function
 __arm_smmu_sync_poll_msi requires that sync_idx must be increased
 monotonously according to the sequence of the CMDs in the cmdq.

 But ".msidata = atomic_inc_return_relaxed(>sync_nr)" is not protected
 by spinlock, so the following scenarios may appear:
 cpu0   cpu1
 msidata=0
msidata=1
insert cmd1
 insert cmd0
smmu execute cmd1
 smmu execute cmd0
poll timeout, because msidata=1 is overridden by
cmd0, that means VAL=0, sync_idx=1.
>>>
>>> Oh yuck, you're right! We probably want a CC stable on this. Did you see
>>> this go wrong in practice?
>> Just misreported and make the caller wait for a long time until TIMEOUT. It's
>> rare to happen, because any other CMD_SYNC during the waiting period will 
>> break
>> it.
> 
> Thanks. Please mention that in the commit message, because I think it's
> useful to know.

OK.

> 
 Signed-off-by: Zhen Lei 
 ---
  drivers/iommu/arm-smmu-v3.c | 7 +++
  1 file changed, 3 insertions(+), 4 deletions(-)

 diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
 index 1d64710..4810f61 100644
 --- a/drivers/iommu/arm-smmu-v3.c
 +++ b/drivers/iommu/arm-smmu-v3.c
 @@ -566,7 +566,7 @@ struct arm_smmu_device {

int gerr_irq;
int combined_irq;
 -  atomic_tsync_nr;
 +  u32 sync_nr;

unsigned long   ias; /* IPA */
unsigned long   oas; /* PA */
 @@ -836,7 +836,6 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct 
 arm_smmu_cmdq_ent *ent)
cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_CS, 
 CMDQ_SYNC_0_CS_SEV);
cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_MSH, ARM_SMMU_SH_ISH);
cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_MSIATTR, 
 ARM_SMMU_MEMATTR_OIWB);
 -  cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_MSIDATA, ent->sync.msidata);
cmd[1] |= ent->sync.msiaddr & CMDQ_SYNC_1_MSIADDR_MASK;
break;
default:
 @@ -947,7 +946,6 @@ static int __arm_smmu_cmdq_issue_sync_msi(struct 
 arm_smmu_device *smmu)
struct arm_smmu_cmdq_ent ent = {
.opcode = CMDQ_OP_CMD_SYNC,
.sync   = {
 -  .msidata = atomic_inc_return_relaxed(>sync_nr),
.msiaddr = virt_to_phys(>sync_count),
},
};
 @@ -955,6 +953,8 @@ static int __arm_smmu_cmdq_issue_sync_msi(struct 
 arm_smmu_device *smmu)
arm_smmu_cmdq_build_cmd(cmd, );

spin_lock_irqsave(>cmdq.lock, flags);
 +  ent.sync.msidata = ++smmu->sync_nr;
 +  cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_MSIDATA, ent.sync.msidata);
>>>
>>> I really don't like splitting this out from building the rest of the
>>> command. Can you just move the call to arm_smmu_cmdq_build_cmd into the
>>> critical section, please?
>> OK. I have considered that before, just worry it will increase the
>> compition of spinlock.
> 
> If you can provide numbers showing that it's a problem, then we could add
> a helper function e.g. arm_smmu_cmdq_sync_set_msidata(arm_smmu_cmdq_ent *cmd)

The performance data from my current test envirnoment is not stable now, I'm
trying to find anothor one. So I want to leave this problem for the time being.
If it'a problem, I will send a new patch.

> 
>> In addition, I will append a optimization patch: the adjacent CMD_SYNCs,
>> we only need one.
> 
> Ok, but please keep them separate, since I don't want to fix up fixes and
> optimisations.

OK

> 
> Thanks,
> 
> Will
> 
> .
> 

-- 
Thanks!
BestRegards

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 1/1] iommu/arm-smmu-v3: fix unexpected CMD_SYNC timeout

2018-08-09 Thread Will Deacon
On Thu, Aug 09, 2018 at 09:30:51AM +0800, Leizhen (ThunderTown) wrote:
> On 2018/8/8 18:12, Will Deacon wrote:
> > On Mon, Aug 06, 2018 at 08:31:29PM +0800, Zhen Lei wrote:
> >> The condition "(int)(VAL - sync_idx) >= 0" to break loop in function
> >> __arm_smmu_sync_poll_msi requires that sync_idx must be increased
> >> monotonously according to the sequence of the CMDs in the cmdq.
> >>
> >> But ".msidata = atomic_inc_return_relaxed(>sync_nr)" is not protected
> >> by spinlock, so the following scenarios may appear:
> >> cpu0   cpu1
> >> msidata=0
> >>msidata=1
> >>insert cmd1
> >> insert cmd0
> >>smmu execute cmd1
> >> smmu execute cmd0
> >>poll timeout, because msidata=1 is overridden by
> >>cmd0, that means VAL=0, sync_idx=1.
> > 
> > Oh yuck, you're right! We probably want a CC stable on this. Did you see
> > this go wrong in practice?
> Just misreported and make the caller wait for a long time until TIMEOUT. It's
> rare to happen, because any other CMD_SYNC during the waiting period will 
> break
> it.

Thanks. Please mention that in the commit message, because I think it's
useful to know.

> >> Signed-off-by: Zhen Lei 
> >> ---
> >>  drivers/iommu/arm-smmu-v3.c | 7 +++
> >>  1 file changed, 3 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> >> index 1d64710..4810f61 100644
> >> --- a/drivers/iommu/arm-smmu-v3.c
> >> +++ b/drivers/iommu/arm-smmu-v3.c
> >> @@ -566,7 +566,7 @@ struct arm_smmu_device {
> >>
> >>int gerr_irq;
> >>int combined_irq;
> >> -  atomic_tsync_nr;
> >> +  u32 sync_nr;
> >>
> >>unsigned long   ias; /* IPA */
> >>unsigned long   oas; /* PA */
> >> @@ -836,7 +836,6 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct 
> >> arm_smmu_cmdq_ent *ent)
> >>cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_CS, 
> >> CMDQ_SYNC_0_CS_SEV);
> >>cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_MSH, ARM_SMMU_SH_ISH);
> >>cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_MSIATTR, 
> >> ARM_SMMU_MEMATTR_OIWB);
> >> -  cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_MSIDATA, ent->sync.msidata);
> >>cmd[1] |= ent->sync.msiaddr & CMDQ_SYNC_1_MSIADDR_MASK;
> >>break;
> >>default:
> >> @@ -947,7 +946,6 @@ static int __arm_smmu_cmdq_issue_sync_msi(struct 
> >> arm_smmu_device *smmu)
> >>struct arm_smmu_cmdq_ent ent = {
> >>.opcode = CMDQ_OP_CMD_SYNC,
> >>.sync   = {
> >> -  .msidata = atomic_inc_return_relaxed(>sync_nr),
> >>.msiaddr = virt_to_phys(>sync_count),
> >>},
> >>};
> >> @@ -955,6 +953,8 @@ static int __arm_smmu_cmdq_issue_sync_msi(struct 
> >> arm_smmu_device *smmu)
> >>arm_smmu_cmdq_build_cmd(cmd, );
> >>
> >>spin_lock_irqsave(>cmdq.lock, flags);
> >> +  ent.sync.msidata = ++smmu->sync_nr;
> >> +  cmd[0] |= FIELD_PREP(CMDQ_SYNC_0_MSIDATA, ent.sync.msidata);
> > 
> > I really don't like splitting this out from building the rest of the
> > command. Can you just move the call to arm_smmu_cmdq_build_cmd into the
> > critical section, please?
> OK. I have considered that before, just worry it will increase the
> compition of spinlock.

If you can provide numbers showing that it's a problem, then we could add
a helper function e.g. arm_smmu_cmdq_sync_set_msidata(arm_smmu_cmdq_ent *cmd)

> In addition, I will append a optimization patch: the adjacent CMD_SYNCs,
> we only need one.

Ok, but please keep them separate, since I don't want to fix up fixes and
optimisations.

Thanks,

Will
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-09 Thread Tian, Kevin
> From: Kenneth Lee [mailto:liguo...@hisilicon.com]
> Sent: Thursday, August 9, 2018 4:04 PM
> 
> But we have another requirement which is to combine some device
> together to
> share the same address space. This is a little like these kinds of solution:
> 
> http://tce.technion.ac.il/wp-content/uploads/sites/8/2015/06/SC-7.2-M.-
> Silberstein.pdf
> 
> With that, the application can directly pass the NiC packet pointer to the
> decryption accelerator, and get the bare data in place. This is the feature
> that
> the VFIO container can provide.

above is not a good argument, at least in the context of your discussion.
If each device has their own interface (similar to GPU) for process to bind 
with, then having the process binding to multiple devices one-by-one then
you still get same address space shared cross them...

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-09 Thread Kenneth Lee
On Wed, Aug 08, 2018 at 11:18:35AM -0400, Jerome Glisse wrote:
> Date: Wed, 8 Aug 2018 11:18:35 -0400
> From: Jerome Glisse 
> To: Kenneth Lee 
> CC: Kenneth Lee , "Tian, Kevin"
>  , Alex Williamson ,
>  Herbert Xu , "k...@vger.kernel.org"
>  , Jonathan Corbet , Greg
>  Kroah-Hartman , Zaibo Xu ,
>  "linux-...@vger.kernel.org" , "Kumar, Sanjay K"
>  , Hao Fang ,
>  "linux-ker...@vger.kernel.org" ,
>  "linux...@huawei.com" ,
>  "iommu@lists.linux-foundation.org" ,
>  "linux-cry...@vger.kernel.org" , Philippe
>  Ombredanne , Thomas Gleixner ,
>  "David S . Miller" ,
>  "linux-accelerat...@lists.ozlabs.org"
>  
> Subject: Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive
> User-Agent: Mutt/1.10.0 (2018-05-17)
> Message-ID: <20180808151835.ga3...@redhat.com>
> 
> On Wed, Aug 08, 2018 at 09:08:42AM +0800, Kenneth Lee wrote:
> > 
> > 
> > 在 2018年08月06日 星期一 11:32 下午, Jerome Glisse 写道:
> > > On Mon, Aug 06, 2018 at 11:12:52AM +0800, Kenneth Lee wrote:
> > > > On Fri, Aug 03, 2018 at 10:39:44AM -0400, Jerome Glisse wrote:
> > > > > On Fri, Aug 03, 2018 at 11:47:21AM +0800, Kenneth Lee wrote:
> > > > > > On Thu, Aug 02, 2018 at 10:22:43AM -0400, Jerome Glisse wrote:
> > > > > > > On Thu, Aug 02, 2018 at 12:05:57PM +0800, Kenneth Lee wrote:
> > > > > > > > On Thu, Aug 02, 2018 at 02:33:12AM +, Tian, Kevin wrote:
> > > > > > > > > > On Wed, Aug 01, 2018 at 06:22:14PM +0800, Kenneth Lee wrote:
> 
> [...]
> 
> > > > > > > > > > My more general question is do we want to grow VFIO to 
> > > > > > > > > > become
> > > > > > > > > > a more generic device driver API. This patchset adds a 
> > > > > > > > > > command
> > > > > > > > > > queue concept to it (i don't think it exist today but i have
> > > > > > > > > > not follow VFIO closely).
> > > > > > > > > > 
> > > > > > > > The thing is, VFIO is the only place to support DMA from user 
> > > > > > > > land. If we don't
> > > > > > > > put it here, we have to create another similar facility to 
> > > > > > > > support the same.
> > > > > > > No it is not, network device, GPU, block device, ... they all do
> > > > > > > support DMA. The point i am trying to make here is that even in
> > > > > > Sorry, wait a minute, are we talking the same thing? I meant "DMA 
> > > > > > from user
> > > > > > land", not "DMA from kernel driver". To do that we have to 
> > > > > > manipulate the
> > > > > > IOMMU(Unit). I think it can only be done by default_domain or vfio 
> > > > > > domain. Or
> > > > > > the user space have to directly access the IOMMU.
> > > > > GPU do DMA in the sense that you pass to the kernel a valid
> > > > > virtual address (kernel driver do all the proper check) and
> > > > > then you can use the GPU to copy from or to that range of
> > > > > virtual address. Exactly how you want to use this compression
> > > > > engine. It does not rely on SVM but SVM going forward would
> > > > > still be the prefered option.
> > > > > 
> > > > No, SVM is not the reason why we rely on Jean's SVM(SVA) series. We 
> > > > rely on
> > > > Jean's series because of multi-process (PASID or substream ID) support.
> > > > 
> > > > But of couse, WarpDrive can still benefit from the SVM feature.
> > > We are getting side tracked here. PASID/ID do not require VFIO.
> > > 
> > Yes, PASID itself do not require VFIO. But what if:
> > 
> > 1. Support DMA from user space.
> > 2. The hardware makes use of standard IOMMU/SMMU for IO address translation.
> > 3. The IOMMU facility is shared by both kernel and user drivers.
> > 4. Support PASID with the current IOMMU facility
> 
> I do not see how any of this means it has to be in VFIO.
> Other devices do just that. GPUs driver for instance share
> DMA engine (that copy data around) between kernel and user
> space. Sometime kernel use it to move things around. Evict
> some memory to make room for a new process is the common
> example. Same DMA engines is often use by userspace itself
> during rendering or compute (program moving things on there
> own). So they are already kernel driver that do all 4 of
> the above and are not in VFIO.
> 

I think our divergence is on "it is common that some device drivers use IOMMU
for user land DMA operation". Let us dive into this in the AMD case.

> 
> > > > > > > your mechanisms the userspace must have a specific userspace
> > > > > > > drivers for each hardware and thus there are virtually no
> > > > > > > differences between having this userspace driver open a device
> > > > > > > file in vfio or somewhere else in the device filesystem. This is
> > > > > > > just a different path.
> > > > > > > 
> > > > > > The basic problem WarpDrive want to solve it to avoid syscall. This 
> > > > > > is important
> > > > > > to accelerators. We have some data here:
> > > > > > https://www.slideshare.net/linaroorg/progress-and-demonstration-of-wrapdrive-a-accelerator-framework-sfo17317
> > > > > > 
> > > > > > (see page 3)
> > > > > > 
> > > > > > The performance is different on using kernel and user