Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-13 Thread Kenneth Lee
On Mon, Aug 13, 2018 at 03:23:01PM -0400, Jerome Glisse wrote:
> Date: Mon, 13 Aug 2018 15:23:01 -0400
> From: Jerome Glisse 
> To: Kenneth Lee 
> CC: Kenneth Lee , Jean-Philippe Brucker
>  , Herbert Xu ,
>  "k...@vger.kernel.org" , Jonathan Corbet
>  , Greg Kroah-Hartman , Zaibo
>  Xu , "linux-...@vger.kernel.org"
>  , "Kumar, Sanjay K" ,
>  "Tian, Kevin" , "iommu@lists.linux-foundation.org"
>  , "linux-ker...@vger.kernel.org"
>  , "linux...@huawei.com"
>  , Alex Williamson ,
>  "linux-cry...@vger.kernel.org" , Philippe
>  Ombredanne , Thomas Gleixner ,
>  Hao Fang , "David S . Miller" ,
>  "linux-accelerat...@lists.ozlabs.org"
>  
> Subject: Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive
> Message-ID: <20180813192301.gc3...@redhat.com>
> References: <20180806031252.GG91035@Turing-Arch-b>
>  <20180806153257.gb6...@redhat.com>
>  <11bace0e-dc14-5d2c-f65c-25b852f4e...@gmail.com>
>  <20180808151835.ga3...@redhat.com> <20180809080352.GI91035@Turing-Arch-b>
>  <20180809144613.gb3...@redhat.com> <20180810033913.GK91035@Turing-Arch-b>
>  <0f6bac9b-8381-1874-9367-46b5f4cef...@arm.com>
>  <6ea4dcfd-d539-93e4-acf1-d09ea35f0...@gmail.com>
>  <20180813092931.GL91035@Turing-Arch-b>
> In-Reply-To: <20180813092931.GL91035@Turing-Arch-b>
> User-Agent: Mutt/1.10.0 (2018-05-17)
> 
> On Mon, Aug 13, 2018 at 05:29:31PM +0800, Kenneth Lee wrote:
> > 
> > I made a quick change basing on the RFCv1 here: 
> > 
> > https://github.com/Kenneth-Lee/linux-kernel-warpdrive/commits/warpdrive-v0.6
> > 
> > I just made it compilable and not test it yet. But it shows how the idea is
> > going to be.
> > 
> > The Pros is: most of the virtual device stuff can be removed. Resource
> > management is on the openned files only.
> > 
> > The Cons is: as Jean said, we have to redo something that has been done by 
> > VFIO.
> > These mainly are:
> > 
> > 1. Track the dma operation and remove them on resource releasing
> > 2. Pin the memory with gup and do accounting
> > 
> > It not going to be easy to make a decision...
> > 
> 
> Maybe it would be good to list things you want do. Looking at your tree
> it seems you are re-inventing what dma-buf is already doing.

My English is quite limited ;). I think I did not explain it well in the
WarpDrive document. Please let me try again here:

The requirement of WarpDrive is simple. Many acceleration requirements come
from user space, such as OpenSSL, AI, and so on. We want to provide a
framework for userland applications to summon the accelerator easily.

So the scenario is simple: The application i

Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-13 Thread Jerome Glisse
On Mon, Aug 13, 2018 at 05:29:31PM +0800, Kenneth Lee wrote:
> 
> I made a quick change basing on the RFCv1 here: 
> 
> https://github.com/Kenneth-Lee/linux-kernel-warpdrive/commits/warpdrive-v0.6
> 
> I just made it compilable and not test it yet. But it shows how the idea is
> going to be.
> 
> The Pros is: most of the virtual device stuff can be removed. Resource
> management is on the openned files only.
> 
> The Cons is: as Jean said, we have to redo something that has been done by 
> VFIO.
> These mainly are:
> 
> 1. Track the dma operation and remove them on resource releasing
> 2. Pin the memory with gup and do accounting
> 
> It not going to be easy to make a decision...
> 

Maybe it would be good to list the things you want to do. Looking at your tree
it seems you are re-inventing what dma-buf already does.

So here is what I understand for SVM/SVA:
(1) allow userspace to create a command buffer for a device and bind
it to its address space (PASID)
(2) allow userspace to directly schedule commands on its command buffer

No need to do tracking here, as SVM/SVA relies on PASID and something
like PCIe ATS (Address Translation Service). Userspace can shoot itself
in the foot but nothing harmful can happen.
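
For illustration, a rough kernel-side sketch of (1). The iommu_sva_* helpers
follow today's kernel API (this interface was still being proposed in
Jean-Philippe's SVA series when this thread was written, and its signature has
changed over time); struct wd_queue and the device-specific step are made up:

#include <linux/iommu.h>
#include <linux/sched.h>
#include <linux/err.h>
#include <linux/types.h>

struct wd_queue {
    struct device *dev;
    struct iommu_sva *sva;
    u32 pasid;
};

/* Bind the current process address space to the device: afterwards the
 * device can issue DMA tagged with q->pasid and the IOMMU walks the
 * process page table (ATS/PRI handle translation and faults). */
static int wd_queue_bind_mm(struct wd_queue *q)
{
    q->sva = iommu_sva_bind_device(q->dev, current->mm);
    if (IS_ERR(q->sva))
        return PTR_ERR(q->sva);

    q->pasid = iommu_sva_get_pasid(q->sva);
    /* device specific: program q->pasid into the hardware queue */
    return 0;
}

static void wd_queue_unbind_mm(struct wd_queue *q)
{
    iommu_sva_unbind_device(q->sva);
}

For (2) the process then writes commands into the queue it mmap()ed and rings
a doorbell; no further syscalls are needed.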


Non SVM/SVA:
(3) allow userspace to wrap a region of its memory into an object so
that it can be DMA-mapped (ie GUP + dma_map_page())
(4) have userspace schedule commands referencing objects created in (3)
using an ioctl (a rough sketch of (3) follows).
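
A hedged kernel-side sketch of what (3) boils down to. struct wd_dma_obj is
hypothetical, and the helper signatures (get_user_pages_fast(),
sg_alloc_table_from_pages(), dma_map_sg()) have shifted between kernel
versions, so treat it as an outline rather than actual RFC code:

#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>

struct wd_dma_obj {
    struct page **pages;
    int nr_pages;
    struct sg_table sgt;
};

/* Pin a user buffer (GUP) and map it into the device DMA address space. */
static int wd_dma_obj_map(struct device *dev, struct wd_dma_obj *obj,
                          unsigned long uaddr, size_t len)
{
    unsigned int off = offset_in_page(uaddr);
    int i, n, ret = -ENOMEM;

    obj->nr_pages = DIV_ROUND_UP(off + len, PAGE_SIZE);
    obj->pages = kvmalloc_array(obj->nr_pages, sizeof(*obj->pages), GFP_KERNEL);
    if (!obj->pages)
        return -ENOMEM;

    n = get_user_pages_fast(uaddr & PAGE_MASK, obj->nr_pages,
                            FOLL_WRITE, obj->pages);
    if (n < 0) {
        ret = n;
        goto err_free;
    }
    if (n != obj->nr_pages)
        goto err_put;

    ret = sg_alloc_table_from_pages(&obj->sgt, obj->pages, obj->nr_pages,
                                    off, len, GFP_KERNEL);
    if (ret)
        goto err_put;

    if (!dma_map_sg(dev, obj->sgt.sgl, obj->sgt.orig_nents,
                    DMA_BIDIRECTIONAL)) {
        ret = -EIO;
        goto err_free_sgt;
    }
    return 0;

err_free_sgt:
    sg_free_table(&obj->sgt);
err_put:
    for (i = 0; i < n; i++)
        put_page(obj->pages[i]);
err_free:
    kvfree(obj->pages);
    return ret;
}

/* This is the tracking discussed below: the unmap must run once the
 * hardware is done with the object, or at the latest on file release. */
static void wd_dma_obj_unmap(struct device *dev, struct wd_dma_obj *obj)
{
    dma_unmap_sg(dev, obj->sgt.sgl, obj->sgt.orig_nents, DMA_BIDIRECTIONAL);
    sg_free_table(&obj->sgt);
    while (obj->nr_pages--)
        put_page(obj->pages[obj->nr_pages]);
    kvfree(obj->pages);
}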

We need to keep track of object usage by the hardware so that we know
when it is safe to release resources (dma_unmap_page()). dma-buf
provides everything you want for (3) and (4). With dma-buf you create an
object, and each time it is used by a device you associate a fence with
it. When the fence is signaled it means the hardware is done using that
object. Fences also allow proper synchronization between multiple
devices, for instance making sure that the second device waits for the
first one before starting its own work. The dma-buf documentation
explains all of this much more thoroughly.


Now from an implementation point of view, maybe it would be a good idea
to create something like the virtual GEM driver. It is a virtual device
that allows creating GEM objects. So maybe we want a virtual device that
allows creating dma-buf objects from process memory and sharing those
dma-buf objects between multiple devices.

Userspace would only have to talk to this virtual device to create an
object and wrap its memory around it, then it could use this object
against many actual devices.
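
To make the virtual-device idea concrete, here is a hedged userspace sketch of
the dma-buf sharing flow that already exists with the virtual GEM driver
(vgem): allocate a buffer on the virtual device, export it as a dma-buf fd,
then hand that fd to whichever real device needs it. It allocates new memory
rather than wrapping existing process memory (the piece a new virtual device
would add), and the /dev/dri path is whatever node vgem got on the system:

/* Build with the libdrm uapi headers, e.g. -I/usr/include/libdrm. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <drm/drm.h>
#include <drm/drm_mode.h>

int main(void)
{
    int fd = open("/dev/dri/card0", O_RDWR);  /* assume this is vgem */
    if (fd < 0)
        return 1;

    /* Create a buffer object on the virtual device. */
    struct drm_mode_create_dumb create = {
        .width = 1024, .height = 1024, .bpp = 8,
    };
    if (ioctl(fd, DRM_IOCTL_MODE_CREATE_DUMB, &create))
        return 1;

    /* Export the GEM handle as a dma-buf file descriptor. */
    struct drm_prime_handle prime = {
        .handle = create.handle,
        .flags = DRM_CLOEXEC | DRM_RDWR,
    };
    if (ioctl(fd, DRM_IOCTL_PRIME_HANDLE_TO_FD, &prime))
        return 1;

    /* prime.fd can now be imported by any driver that speaks dma-buf;
     * when the last reference to the fd is dropped the kernel unmaps
     * and frees the backing pages, which is the lifetime handling
     * mentioned below. */
    printf("dma-buf fd %d, size %llu\n", prime.fd,
           (unsigned long long)create.size);

    close(prime.fd);
    close(fd);
    return 0;
}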


This decouples the memory management, which can be shared between all
devices, from the actual device driver, which is obviously specific to
every single device.

Note that dma-buf uses a file, so that once all file references are gone
the resource can be freed and cleanup can happen (dma_unmap_page() ...).
This properly handles the resource lifetime issue you seem to be worried
about.

Cheers,
Jérôme


Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-13 Thread Kenneth Lee
On Sat, Aug 11, 2018 at 11:26:48PM +0800, Kenneth Lee wrote:
> Date: Sat, 11 Aug 2018 23:26:48 +0800
> From: Kenneth Lee 
> To: Jean-Philippe Brucker , Kenneth Lee
>  , Jerome Glisse 
> CC: Herbert Xu , "k...@vger.kernel.org"
>  , Jonathan Corbet , Greg
>  Kroah-Hartman , Zaibo Xu ,
>  "linux-...@vger.kernel.org" , "Kumar, Sanjay K"
>  , "Tian, Kevin" ,
>  "iommu@lists.linux-foundation.org" ,
>  "linux-ker...@vger.kernel.org" ,
>  "linux...@huawei.com" , Alex Williamson
>  , "linux-cry...@vger.kernel.org"
>  , Philippe Ombredanne
>  , Thomas Gleixner , Hao Fang
>  , "David S . Miller" ,
>  "linux-accelerat...@lists.ozlabs.org"
>  
> Subject: Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive
> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
>  Thunderbird/52.9.1
> Message-ID: <6ea4dcfd-d539-93e4-acf1-d09ea35f0...@gmail.com>
> 
> 
> 
> On Friday, 10 August 2018, 09:12 PM, Jean-Philippe Brucker wrote:
> >Hi Kenneth,
> >
> >On 10/08/18 04:39, Kenneth Lee wrote:
> >>>You can achieve everything you want to achieve with existing upstream
> >>>solution. Re-inventing a whole new driver infrastructure should really
> >>>be motivated with strong and obvious reasons.
> >>I want to understand better of your idea. If I create some unified helper
> >>APIs in drivers/iommu/, say:
> >>
> >>wd_create_dev(parent_dev, wd_dev)
> >>wd_release_dev(wd_dev)
> >>
> >>The API create chrdev to take request from user space for open(resource
> >>allocation), iomap, epoll (irq), and dma_map(with pasid automatically).
> >>
> >>Do you think it is acceptable?
> >Maybe not drivers/iommu/ :) That subsystem only contains tools for
> >dealing with DMA, I don't think epoll, resource enumeration or iomap fit
> >in there.
> Yes. I should consider where to put it carefully.
> >
> >Creating new helpers seems to be precisely what we're trying to avoid in
> >this thread, and vfio-mdev does provide the components that you
> >describe, so I wouldn't discard it right away. When the GPU, net, block
> >or another subsystem doesn't fit your needs, either because your
> >accelerator provides some specialized function, or because for
> >performance reasons your client wants direct MMIO access, you can at
> >least build your driver and library on top of those existing VFIO
> >components:
> >
> >* open allocates a partition of an accelerator.
> >* vfio_device_info, vfio_region_info and vfio_irq_info enumerates
> >available resources.
> >* vfio_irq_set deals with epoll.
> >* mmap gives you a private MMIO doorbell.
> >* vfio_iommu_type1 provides the DMA operations.
> >
> >Currently missing:
> >
> >* Sharing the parent IOMMU between mdev, which is also what the "IOMMU
> >aware mediated device" series tackles, and seems like a logical addition
> >to VFIO. I'd argue that the existing IOMMU ops (or ones implemented by
> >the SVA series) can be used to deal with this
> >
> >* The interface to discover an accelerator near your memory node, or one
> >that you can chain with other devices. If I understood correctly the
> >conclusion was that the API (a topology description in sysfs?) should be
> >common to various subsystems, in which case vfio-mdev (or the mediating
> >driver) could also use it.
> >
> >* The queue abstraction discussed on patch 3/7. Perhaps the current vfio
> >resource description of MMIO and IRQ is sufficient here as well, since
> >vendors tend to each implement their own queue schemes. If you need
> >additional features, read/write fops give the mediating driver a lot of
> >freedom. To support features that are too specific for drivers/vfio/ you
> >can implement a config space with capabilities and registers of your
> >choice. If you're versioning the capabilities, the code to handle them
> >could even be shared between different accelerator drivers and libraries.
> Thank you, Jean,
> 
> The major reason that I want to remove dependency to VFIO is: I
> accepted that the whole logic of VFIO was built on the idea of
> creating virtual device.
> 
> Let's consider it in this way: We have hardware with IOMMU support.
> So we create a default_domain to the particular IOMMU (unit) in the
> group for the kernel driver to use it. Now the device is going to be
> used by a VM or a Container. So we unbind it from the original
> driver, and put the default_domain away,  create a new domain for
> this particular use case.  

Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-11 Thread Kenneth Lee



On Friday, 10 August 2018, 09:12 PM, Jean-Philippe Brucker wrote:

Hi Kenneth,

On 10/08/18 04:39, Kenneth Lee wrote:

You can achieve everything you want to achieve with existing upstream
solution. Re-inventing a whole new driver infrastructure should really
be motivated with strong and obvious reasons.

I want to understand better of your idea. If I create some unified helper
APIs in drivers/iommu/, say:

wd_create_dev(parent_dev, wd_dev)
wd_release_dev(wd_dev)

The API create chrdev to take request from user space for open(resource
allocation), iomap, epoll (irq), and dma_map(with pasid automatically).

Do you think it is acceptable?

Maybe not drivers/iommu/ :) That subsystem only contains tools for
dealing with DMA, I don't think epoll, resource enumeration or iomap fit
in there.

Yes. I should consider where to put it carefully.
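
For reference, a hedged sketch of what the proposed helpers could look like,
wherever they end up living. Every wd_* name here is hypothetical (the RFC
does not fix this API); only the standard miscdevice/file_operations
machinery is real:

#include <linux/miscdevice.h>
#include <linux/fs.h>
#include <linux/device.h>

struct wd_dev {
    struct device *parent;              /* the real accelerator */
    struct miscdevice misc;             /* exposes /dev/<name> */
    const struct file_operations *fops; /* supplied by the driver */
};

/* Register a per-accelerator character device whose fops cover the flow
 * described above: open = allocate a queue/resource, mmap = map the
 * doorbell or queue, poll = wait for an interrupt, ioctl = dma_map with
 * the PASID handled automatically. */
static int wd_create_dev(struct device *parent, struct wd_dev *wd,
                         const char *name)
{
    wd->parent = parent;
    wd->misc.minor = MISC_DYNAMIC_MINOR;
    wd->misc.name = name;
    wd->misc.fops = wd->fops;
    return misc_register(&wd->misc);
}

static void wd_release_dev(struct wd_dev *wd)
{
    misc_deregister(&wd->misc);
}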


Creating new helpers seems to be precisely what we're trying to avoid in
this thread, and vfio-mdev does provide the components that you
describe, so I wouldn't discard it right away. When the GPU, net, block
or another subsystem doesn't fit your needs, either because your
accelerator provides some specialized function, or because for
performance reasons your client wants direct MMIO access, you can at
least build your driver and library on top of those existing VFIO
components:

* open allocates a partition of an accelerator.
* vfio_device_info, vfio_region_info and vfio_irq_info enumerates
available resources.
* vfio_irq_set deals with epoll.
* mmap gives you a private MMIO doorbell.
* vfio_iommu_type1 provides the DMA operations.

Currently missing:

* Sharing the parent IOMMU between mdev, which is also what the "IOMMU
aware mediated device" series tackles, and seems like a logical addition
to VFIO. I'd argue that the existing IOMMU ops (or ones implemented by
the SVA series) can be used to deal with this

* The interface to discover an accelerator near your memory node, or one
that you can chain with other devices. If I understood correctly the
conclusion was that the API (a topology description in sysfs?) should be
common to various subsystems, in which case vfio-mdev (or the mediating
driver) could also use it.

* The queue abstraction discussed on patch 3/7. Perhaps the current vfio
resource description of MMIO and IRQ is sufficient here as well, since
vendors tend to each implement their own queue schemes. If you need
additional features, read/write fops give the mediating driver a lot of
freedom. To support features that are too specific for drivers/vfio/ you
can implement a config space with capabilities and registers of your
choice. If you're versioning the capabilities, the code to handle them
could even be shared between different accelerator drivers and libraries.

Thank you, Jean,

The major reason that I want to remove the dependency on VFIO is: I accepted
that the whole logic of VFIO was built on the idea of creating a virtual
device.


Let's consider it in this way: we have hardware with IOMMU support. So
we create a default_domain for the particular IOMMU (unit) in the group,
for the kernel driver to use. Now the device is going to be used by a
VM or a container, so we unbind it from the original driver, put the
default_domain away, and create a new domain for this particular use case.
Now the device shows up as a platform or PCI device to user space. This
is what VFIO tries to provide. Mdev extends the scenario but does not
change the intention. And I think that is why Alex emphasizes
pre-allocating resources to the mdev.


But what WarpDrive needs is to get service from the hardware itself and
set mappings in its current domain, aka the default_domain. If we did it in
VFIO-mdev, it would look like the VFIO framework takes all the effort to
put the default_domain away, create a new one, and get it ready for user
space to use, and then I tell it to stop using the new domain and try the
original one...


It is not reasonable, is it? :)

So why don't I just take the request and set it into the default_domain
directly? The true requirement of WarpDrive is to let a process set the
page table for a particular PASID or substream ID, so the hardware can
accept commands with addresses in the process space. It needs no device.


From this perspective, it seems there is no reason to keep it in VFIO.

Thanks
Kenneth


Thanks,
Jean




Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-11 Thread Kenneth Lee



On Friday, 10 August 2018, 10:32 PM, Jerome Glisse wrote:

On Fri, Aug 10, 2018 at 11:39:13AM +0800, Kenneth Lee wrote:

On Thu, Aug 09, 2018 at 10:46:13AM -0400, Jerome Glisse wrote:

Date: Thu, 9 Aug 2018 10:46:13 -0400
From: Jerome Glisse 
To: Kenneth Lee 
CC: Kenneth Lee , "Tian, Kevin"
  , Alex Williamson ,
  Herbert Xu , "k...@vger.kernel.org"
  , Jonathan Corbet , Greg
  Kroah-Hartman , Zaibo Xu ,
  "linux-...@vger.kernel.org" , "Kumar, Sanjay K"
  , Hao Fang ,
  "linux-ker...@vger.kernel.org" ,
  "linux...@huawei.com" ,
  "iommu@lists.linux-foundation.org" ,
  "linux-cry...@vger.kernel.org" , Philippe
  Ombredanne , Thomas Gleixner ,
  "David S . Miller" ,
  "linux-accelerat...@lists.ozlabs.org"
  
Subject: Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive
User-Agent: Mutt/1.10.0 (2018-05-17)
Message-ID: <20180809144613.gb3...@redhat.com>

On Thu, Aug 09, 2018 at 04:03:52PM +0800, Kenneth Lee wrote:

On Wed, Aug 08, 2018 at 11:18:35AM -0400, Jerome Glisse wrote:

On Wed, Aug 08, 2018 at 09:08:42AM +0800, Kenneth Lee wrote:

On Monday, 6 August 2018, 11:32 PM, Jerome Glisse wrote:

On Mon, Aug 06, 2018 at 11:12:52AM +0800, Kenneth Lee wrote:

On Fri, Aug 03, 2018 at 10:39:44AM -0400, Jerome Glisse wrote:

On Fri, Aug 03, 2018 at 11:47:21AM +0800, Kenneth Lee wrote:

On Thu, Aug 02, 2018 at 10:22:43AM -0400, Jerome Glisse wrote:

On Thu, Aug 02, 2018 at 12:05:57PM +0800, Kenneth Lee wrote:

[...]


your mechanisms the userspace must have a specific userspace
drivers for each hardware and thus there are virtually no
differences between having this userspace driver open a device
file in vfio or somewhere else in the device filesystem. This is
just a different path.


The basic problem WarpDrive want to solve it to avoid syscall. This is important
to accelerators. We have some data here:
https://www.slideshare.net/linaroorg/progress-and-demonstration-of-wrapdrive-a-accelerator-framework-sfo17317

(see page 3)

The performance is different on using kernel and user drivers.

Yes and example i point to is exactly that. You have a one time setup
cost (creating command buffer binding PASID with command buffer and
couple other setup steps). Then userspace no longer have to do any
ioctl to schedule work on the GPU. It is all down from userspace and
it use a doorbell to notify hardware when it should go look at command
buffer for new thing to execute.

My point stands on that. You have existing driver already doing so
with no new framework and in your scheme you need a userspace driver.
So i do not see the value add, using one path or the other in the
userspace driver is litteraly one line to change.


Sorry, I'd got confuse here. I partially agree that the user driver is
redundance of kernel driver. (But for WarpDrive, the kernel driver is a full
driver include all preparation and setup stuff for the hardware, the user driver
is simply to send request and receive answer). Yes, it is just a choice of path.
But the user path is faster if the request come from use space. And to do that,
we need user land DMA support. Then why is it invaluable to let VFIO involved?

Some drivers in the kernel already do exactly what you said. The user
space emit commands without ever going into kernel by directly scheduling
commands and ringing a doorbell. They do not need VFIO either and they
can map userspace address into the DMA address space of the device and
again they do not need VFIO for that.

Could you please directly point out which driver you refer to here? Thank
you.

drivers/gpu/drm/amd/

Sub-directory of interest is amdkfd

Because it is a big driver here is a highlevel overview of how it works
(this is a simplification):
   - Process can allocate GPUs buffer (through ioclt) and map them into
 its address space (through mmap of device file at buffer object
 specific offset).
   - Process can map any valid range of virtual address space into device
 address space (IOMMU mapping). This must be regular memory ie not an
 mmap of a device file or any special file (this is the non PASID
 path)
   - Process can create a command queue and bind its process to it aka
 PASID, this is done through an ioctl.
   - Process can schedule commands onto queues it created from userspace
 without ioctl. For that it just write command into a ring buffer
 that it mapped during the command queue creation process and it
 rings a doorbell when commands are ready to be consume by the
 hardware.
   - Commands can reference (access) all 3 types of object above ie
 either full GPUs buffer, process regular memory maped as object
 (non PASID) and PASID memory all at the same time ie you can
 mix all of the above in same commands queue.
   - Kernel can evict, unbind any process command queues, unbind commands
 queue are still valid from process point of view but commands
 process schedules 

Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-10 Thread Jerome Glisse
On Fri, Aug 10, 2018 at 11:39:13AM +0800, Kenneth Lee wrote:
> On Thu, Aug 09, 2018 at 10:46:13AM -0400, Jerome Glisse wrote:
> > Date: Thu, 9 Aug 2018 10:46:13 -0400
> > From: Jerome Glisse 
> > To: Kenneth Lee 
> > CC: Kenneth Lee , "Tian, Kevin"
> >  , Alex Williamson ,
> >  Herbert Xu , "k...@vger.kernel.org"
> >  , Jonathan Corbet , Greg
> >  Kroah-Hartman , Zaibo Xu ,
> >  "linux-...@vger.kernel.org" , "Kumar, Sanjay K"
> >  , Hao Fang ,
> >  "linux-ker...@vger.kernel.org" ,
> >  "linux...@huawei.com" ,
> >  "iommu@lists.linux-foundation.org" ,
> >  "linux-cry...@vger.kernel.org" , Philippe
> >  Ombredanne , Thomas Gleixner ,
> >  "David S . Miller" ,
> >  "linux-accelerat...@lists.ozlabs.org"
> >  
> > Subject: Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive
> > User-Agent: Mutt/1.10.0 (2018-05-17)
> > Message-ID: <20180809144613.gb3...@redhat.com>
> > 
> > On Thu, Aug 09, 2018 at 04:03:52PM +0800, Kenneth Lee wrote:
> > > On Wed, Aug 08, 2018 at 11:18:35AM -0400, Jerome Glisse wrote:
> > > > On Wed, Aug 08, 2018 at 09:08:42AM +0800, Kenneth Lee wrote:
> > > > > On Monday, 6 August 2018, 11:32 PM, Jerome Glisse wrote:
> > > > > > On Mon, Aug 06, 2018 at 11:12:52AM +0800, Kenneth Lee wrote:
> > > > > > > On Fri, Aug 03, 2018 at 10:39:44AM -0400, Jerome Glisse wrote:
> > > > > > > > On Fri, Aug 03, 2018 at 11:47:21AM +0800, Kenneth Lee wrote:
> > > > > > > > > On Thu, Aug 02, 2018 at 10:22:43AM -0400, Jerome Glisse wrote:
> > > > > > > > > > On Thu, Aug 02, 2018 at 12:05:57PM +0800, Kenneth Lee wrote:
> > 
> > [...]
> > 
> > > > > > > > > > your mechanisms the userspace must have a specific userspace
> > > > > > > > > > drivers for each hardware and thus there are virtually no
> > > > > > > > > > differences between having this userspace driver open a 
> > > > > > > > > > device
> > > > > > > > > > file in vfio or somewhere else in the device filesystem. 
> > > > > > > > > > This is
> > > > > > > > > > just a different path.
> > > > > > > > > > 
> > > > > > > > > The basic problem WarpDrive want to solve it to avoid 
> > > > > > > > > syscall. This is important
> > > > > > > > > to accelerators. We have some data here:
> > > > > > > > > https://www.slideshare.net/linaroorg/progress-and-demonstration-of-wrapdrive-a-accelerator-framework-sfo17317
> > > > > > > > > 
> > > > > > > > > (see page 3)
> > > > > > > > > 
> > > > > > > > > The performance is different on using kernel and user drivers.
> > > > > > > > Yes and example i point to is exactly that. You have a one time 
> > > > > > > > setup
> > > > > > > > cost (creating command buffer binding PASID with command buffer 
> > > > > > > > and
> > > > > > > > couple other setup steps). Then userspace no longer have to do 
> > > > > > > > any
> > > > > > > > ioctl to schedule work on the GPU. It is all down from 
> > > > > > > > userspace and
> > > > > > > > it use a doorbell to notify hardware when it should go look at 
> > > > > > > > command
> > > > > > > > buffer for new thing to execute.
> > > > > > > > 
> > > > > > > > My point stands on that. You have existing driver already doing 
> > > > > > > > so
> > > > > > > > with no new framework and in your scheme you need a userspace 
> > > > > > > > driver.
> > > > > > > > So i do not see the value add, using one path or the other in 
> > > > > > > > the
> > > > > > > > userspace driver is litteraly one line to change.
> > > > > > > > 
> > > > > > > Sorry, I'd got confuse here. I partially agree that the user 
> > > > > > > driver is
> > > > > > > redundance of kernel driver. (But for WarpDrive, the 

Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-10 Thread Jean-Philippe Brucker
Hi Kenneth,

On 10/08/18 04:39, Kenneth Lee wrote:
>> You can achieve everything you want to achieve with existing upstream
>> solution. Re-inventing a whole new driver infrastructure should really
>> be motivated with strong and obvious reasons.
> 
> I want to understand better of your idea. If I create some unified helper
> APIs in drivers/iommu/, say:
> 
>   wd_create_dev(parent_dev, wd_dev)
>   wd_release_dev(wd_dev)
> 
> The API create chrdev to take request from user space for open(resource
> allocation), iomap, epoll (irq), and dma_map(with pasid automatically).
> 
> Do you think it is acceptable?

Maybe not drivers/iommu/ :) That subsystem only contains tools for
dealing with DMA, I don't think epoll, resource enumeration or iomap fit
in there.

Creating new helpers seems to be precisely what we're trying to avoid in
this thread, and vfio-mdev does provide the components that you
describe, so I wouldn't discard it right away. When the GPU, net, block
or another subsystem doesn't fit your needs, either because your
accelerator provides some specialized function, or because for
performance reasons your client wants direct MMIO access, you can at
least build your driver and library on top of those existing VFIO
components:

* open allocates a partition of an accelerator.
* vfio_device_info, vfio_region_info and vfio_irq_info enumerates
available resources.
* vfio_irq_set deals with epoll.
* mmap gives you a private MMIO doorbell.
* vfio_iommu_type1 provides the DMA operations.
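
A hedged userspace sketch of those pieces glued together. The group number,
device name and region/IRQ indexes are placeholders for a real accelerator,
and error handling is omitted:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/eventfd.h>
#include <linux/vfio.h>

int main(void)
{
    int container = open("/dev/vfio/vfio", O_RDWR);
    int group = open("/dev/vfio/26", O_RDWR);   /* placeholder group */

    /* Attach the group and pick the type1 IOMMU backend. */
    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

    /* vfio_iommu_type1: map 1 MB of process memory at IOVA 0. */
    void *buf = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (unsigned long)buf,
        .iova = 0,
        .size = 1 << 20,
    };
    ioctl(container, VFIO_IOMMU_MAP_DMA, &map);

    /* "open allocates a partition": get a device fd from the group. */
    int dev = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:05:00.0");

    /* Enumerate region 0 and mmap it as the private MMIO doorbell. */
    struct vfio_region_info reg = { .argsz = sizeof(reg), .index = 0 };
    ioctl(dev, VFIO_DEVICE_GET_REGION_INFO, &reg);
    volatile void *doorbell = mmap(NULL, reg.size, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, dev, reg.offset);

    /* vfio_irq_set: back IRQ 0 with an eventfd the process can epoll(). */
    int efd = eventfd(0, EFD_CLOEXEC);
    char irqbuf[sizeof(struct vfio_irq_set) + sizeof(int)];
    struct vfio_irq_set *irq = (struct vfio_irq_set *)irqbuf;
    irq->argsz = sizeof(irqbuf);
    irq->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
    irq->index = 0;
    irq->start = 0;
    irq->count = 1;
    memcpy(irq->data, &efd, sizeof(efd));
    ioctl(dev, VFIO_DEVICE_SET_IRQS, irq);

    (void)doorbell;
    return 0;
}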

Currently missing:

* Sharing the parent IOMMU between mdev, which is also what the "IOMMU
aware mediated device" series tackles, and seems like a logical addition
to VFIO. I'd argue that the existing IOMMU ops (or ones implemented by
the SVA series) can be used to deal with this

* The interface to discover an accelerator near your memory node, or one
that you can chain with other devices. If I understood correctly the
conclusion was that the API (a topology description in sysfs?) should be
common to various subsystems, in which case vfio-mdev (or the mediating
driver) could also use it.

* The queue abstraction discussed on patch 3/7. Perhaps the current vfio
resource description of MMIO and IRQ is sufficient here as well, since
vendors tend to each implement their own queue schemes. If you need
additional features, read/write fops give the mediating driver a lot of
freedom. To support features that are too specific for drivers/vfio/ you
can implement a config space with capabilities and registers of your
choice. If you're versioning the capabilities, the code to handle them
could even be shared between different accelerator drivers and libraries.

Thanks,
Jean


Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-09 Thread Kenneth Lee
On Thu, Aug 09, 2018 at 10:46:13AM -0400, Jerome Glisse wrote:
> Date: Thu, 9 Aug 2018 10:46:13 -0400
> From: Jerome Glisse 
> To: Kenneth Lee 
> CC: Kenneth Lee , "Tian, Kevin"
>  , Alex Williamson ,
>  Herbert Xu , "k...@vger.kernel.org"
>  , Jonathan Corbet , Greg
>  Kroah-Hartman , Zaibo Xu ,
>  "linux-...@vger.kernel.org" , "Kumar, Sanjay K"
>  , Hao Fang ,
>  "linux-ker...@vger.kernel.org" ,
>  "linux...@huawei.com" ,
>  "iommu@lists.linux-foundation.org" ,
>  "linux-cry...@vger.kernel.org" , Philippe
>  Ombredanne , Thomas Gleixner ,
>  "David S . Miller" ,
>  "linux-accelerat...@lists.ozlabs.org"
>  
> Subject: Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive
> User-Agent: Mutt/1.10.0 (2018-05-17)
> Message-ID: <20180809144613.gb3...@redhat.com>
> 
> On Thu, Aug 09, 2018 at 04:03:52PM +0800, Kenneth Lee wrote:
> > On Wed, Aug 08, 2018 at 11:18:35AM -0400, Jerome Glisse wrote:
> > > On Wed, Aug 08, 2018 at 09:08:42AM +0800, Kenneth Lee wrote:
> > > > On Monday, 6 August 2018, 11:32 PM, Jerome Glisse wrote:
> > > > > On Mon, Aug 06, 2018 at 11:12:52AM +0800, Kenneth Lee wrote:
> > > > > > On Fri, Aug 03, 2018 at 10:39:44AM -0400, Jerome Glisse wrote:
> > > > > > > On Fri, Aug 03, 2018 at 11:47:21AM +0800, Kenneth Lee wrote:
> > > > > > > > On Thu, Aug 02, 2018 at 10:22:43AM -0400, Jerome Glisse wrote:
> > > > > > > > > On Thu, Aug 02, 2018 at 12:05:57PM +0800, Kenneth Lee wrote:
> 
> [...]
> 
> > > > > > > > > your mechanisms the userspace must have a specific userspace
> > > > > > > > > drivers for each hardware and thus there are virtually no
> > > > > > > > > differences between having this userspace driver open a device
> > > > > > > > > file in vfio or somewhere else in the device filesystem. This 
> > > > > > > > > is
> > > > > > > > > just a different path.
> > > > > > > > > 
> > > > > > > > The basic problem WarpDrive want to solve it to avoid syscall. 
> > > > > > > > This is important
> > > > > > > > to accelerators. We have some data here:
> > > > > > > > https://www.slideshare.net/linaroorg/progress-and-demonstration-of-wrapdrive-a-accelerator-framework-sfo17317
> > > > > > > > 
> > > > > > > > (see page 3)
> > > > > > > > 
> > > > > > > > The performance is different on using kernel and user drivers.
> > > > > > > Yes and example i point to is exactly that. You have a one time 
> > > > > > > setup
> > > > > > > cost (creating command buffer binding PASID with command buffer 
> > > > > > > and
> > > > > > > couple other setup steps). Then userspace no longer have to do any
> > > > > > > ioctl to schedule work on the GPU. It is all down from userspace 
> > > > > > > and
> > > > > > > it use a doorbell to notify hardware when it should go look at 
> > > > > > > command
> > > > > > > buffer for new thing to execute.
> > > > > > > 
> > > > > > > My point stands on that. You have existing driver already doing so
> > > > > > > with no new framework and in your scheme you need a userspace 
> > > > > > > driver.
> > > > > > > So i do not see the value add, using one path or the other in the
> > > > > > > userspace driver is litteraly one line to change.
> > > > > > > 
> > > > > > Sorry, I'd got confuse here. I partially agree that the user driver 
> > > > > > is
> > > > > > redundance of kernel driver. (But for WarpDrive, the kernel driver 
> > > > > > is a full
> > > > > > driver include all preparation and setup stuff for the hardware, 
> > > > > > the user driver
> > > > > > is simply to send request and receive answer). Yes, it is just a 
> > > > > > choice of path.
> > > > > > But the user path is faster if the request come from use space. And 
> > > > > > to do that,
> > > > > > we need user land DMA support. Then why is it invaluable to let 
> > > &

Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-09 Thread Kenneth Lee
On Thu, Aug 09, 2018 at 08:31:31AM +, Tian, Kevin wrote:
> Date: Thu, 9 Aug 2018 08:31:31 +
> From: "Tian, Kevin" 
> To: Kenneth Lee , Jerome Glisse 
> CC: Kenneth Lee , Alex Williamson
>  , Herbert Xu ,
>  "k...@vger.kernel.org" , Jonathan Corbet
>  , Greg Kroah-Hartman , Zaibo
>  Xu , "linux-...@vger.kernel.org"
>  , "Kumar, Sanjay K" ,
>  Hao Fang , "linux-ker...@vger.kernel.org"
>  , "linux...@huawei.com"
>  , "iommu@lists.linux-foundation.org"
>  , "linux-cry...@vger.kernel.org"
>  , Philippe Ombredanne
>  , Thomas Gleixner , "David S .
>  Miller" , "linux-accelerat...@lists.ozlabs.org"
>  
> Subject: RE: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive
> Message-ID: 
> 
> 
> > From: Kenneth Lee [mailto:liguo...@hisilicon.com]
> > Sent: Thursday, August 9, 2018 4:04 PM
> > 
> > But we have another requirement which is to combine some device
> > together to
> > share the same address space. This is a little like these kinds of solution:
> > 
> > http://tce.technion.ac.il/wp-content/uploads/sites/8/2015/06/SC-7.2-M.-
> > Silberstein.pdf
> > 
> > With that, the application can directly pass the NiC packet pointer to the
> > decryption accelerator, and get the bare data in place. This is the feature
> > that
> > the VFIO container can provide.
> 
> above is not a good argument, at least in the context of your discussion.
> If each device has their own interface (similar to GPU) for process to bind 
> with, then having the process binding to multiple devices one-by-one then
> you still get same address space shared cross them...

If we consider this from the VFIO container perspective: with a container, you
can set up DMA mappings on the container and they apply to all its devices,
even a device that is added after the DMA operation.

So your argument only holds when SVM is enabled and the whole process address
space is devoted to the devices.

Yes, the process can do the same all by itself. But if we agree with that, it
makes no sense to keep the container concept in VFIO ;)
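
A hedged sketch of that container behaviour: the DMA mapping is made once
against the container, and vfio_iommu_type1 replays existing mappings when a
group is attached later, so the second device sees the same IOVA without a
second map call. Group numbers are placeholders and error handling is omitted:

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

int main(void)
{
    int container = open("/dev/vfio/vfio", O_RDWR);
    int group_a = open("/dev/vfio/26", O_RDWR);  /* e.g. the NIC */
    int group_b = open("/dev/vfio/27", O_RDWR);  /* e.g. the crypto engine */

    ioctl(group_a, VFIO_GROUP_SET_CONTAINER, &container);
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

    /* Map a shared buffer for DMA while only group A is attached. */
    void *buf = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (unsigned long)buf,
        .iova = 0x100000,
        .size = 1 << 20,
    };
    ioctl(container, VFIO_IOMMU_MAP_DMA, &map);

    /* Attach the second device group afterwards: the mapping above is
     * applied to it as well, so both devices can DMA to the buffer. */
    ioctl(group_b, VFIO_GROUP_SET_CONTAINER, &container);
    return 0;
}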

> 
> Thanks
> Kevin

-- 
-Kenneth(Hisilicon)



Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-09 Thread Jerome Glisse
On Thu, Aug 09, 2018 at 04:03:52PM +0800, Kenneth Lee wrote:
> On Wed, Aug 08, 2018 at 11:18:35AM -0400, Jerome Glisse wrote:
> > On Wed, Aug 08, 2018 at 09:08:42AM +0800, Kenneth Lee wrote:
> > > On Monday, 6 August 2018, 11:32 PM, Jerome Glisse wrote:
> > > > On Mon, Aug 06, 2018 at 11:12:52AM +0800, Kenneth Lee wrote:
> > > > > On Fri, Aug 03, 2018 at 10:39:44AM -0400, Jerome Glisse wrote:
> > > > > > On Fri, Aug 03, 2018 at 11:47:21AM +0800, Kenneth Lee wrote:
> > > > > > > On Thu, Aug 02, 2018 at 10:22:43AM -0400, Jerome Glisse wrote:
> > > > > > > > On Thu, Aug 02, 2018 at 12:05:57PM +0800, Kenneth Lee wrote:

[...]

> > > > > > > > your mechanisms the userspace must have a specific userspace
> > > > > > > > drivers for each hardware and thus there are virtually no
> > > > > > > > differences between having this userspace driver open a device
> > > > > > > > file in vfio or somewhere else in the device filesystem. This is
> > > > > > > > just a different path.
> > > > > > > > 
> > > > > > > The basic problem WarpDrive want to solve it to avoid syscall. 
> > > > > > > This is important
> > > > > > > to accelerators. We have some data here:
> > > > > > > https://www.slideshare.net/linaroorg/progress-and-demonstration-of-wrapdrive-a-accelerator-framework-sfo17317
> > > > > > > 
> > > > > > > (see page 3)
> > > > > > > 
> > > > > > > The performance is different on using kernel and user drivers.
> > > > > > Yes and example i point to is exactly that. You have a one time 
> > > > > > setup
> > > > > > cost (creating command buffer binding PASID with command buffer and
> > > > > > couple other setup steps). Then userspace no longer have to do any
> > > > > > ioctl to schedule work on the GPU. It is all down from userspace and
> > > > > > it use a doorbell to notify hardware when it should go look at 
> > > > > > command
> > > > > > buffer for new thing to execute.
> > > > > > 
> > > > > > My point stands on that. You have existing driver already doing so
> > > > > > with no new framework and in your scheme you need a userspace 
> > > > > > driver.
> > > > > > So i do not see the value add, using one path or the other in the
> > > > > > userspace driver is litteraly one line to change.
> > > > > > 
> > > > > Sorry, I'd got confuse here. I partially agree that the user driver is
> > > > > redundance of kernel driver. (But for WarpDrive, the kernel driver is 
> > > > > a full
> > > > > driver include all preparation and setup stuff for the hardware, the 
> > > > > user driver
> > > > > is simply to send request and receive answer). Yes, it is just a 
> > > > > choice of path.
> > > > > But the user path is faster if the request come from use space. And 
> > > > > to do that,
> > > > > we need user land DMA support. Then why is it invaluable to let VFIO 
> > > > > involved?
> > > > Some drivers in the kernel already do exactly what you said. The user
> > > > space emit commands without ever going into kernel by directly 
> > > > scheduling
> > > > commands and ringing a doorbell. They do not need VFIO either and they
> > > > can map userspace address into the DMA address space of the device and
> > > > again they do not need VFIO for that.
> > > Could you please directly point out which driver you refer to here? Thank
> > > you.
> > 
> > drivers/gpu/drm/amd/
> > 
> > Sub-directory of interest is amdkfd
> > 
> > Because it is a big driver here is a highlevel overview of how it works
> > (this is a simplification):
> >   - Process can allocate GPUs buffer (through ioclt) and map them into
> > its address space (through mmap of device file at buffer object
> > specific offset).
> >   - Process can map any valid range of virtual address space into device
> > address space (IOMMU mapping). This must be regular memory ie not an
> > mmap of a device file or any special file (this is the non PASID
> > path)
> >   - Process can create a command queue and bind its process to it aka
> > PASID, this is done through an ioctl.
> >   - Process can schedule commands onto queues it created from userspace
> > without ioctl. For that it just write command into a ring buffer
> > that it mapped during the command queue creation process and it
> > rings a doorbell when commands are ready to be consume by the
> > hardware.
> >   - Commands can reference (access) all 3 types of object above ie
> > either full GPUs buffer, process regular memory maped as object
> > (non PASID) and PASID memory all at the same time ie you can
> > mix all of the above in same commands queue.
> >   - Kernel can evict, unbind any process command queues, unbind commands
> > queue are still valid from process point of view but commands
> > process schedules on them will not be executed until kernel re-bind
> > the queue.
> >   - Kernel can schedule commands itself onto its dedicated command
> > queues (kernel driver create its own command queues).
> >   - Kernel can control priorities between all the 

RE: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-09 Thread Tian, Kevin
> From: Kenneth Lee [mailto:liguo...@hisilicon.com]
> Sent: Thursday, August 9, 2018 4:04 PM
> 
> But we have another requirement which is to combine some device
> together to
> share the same address space. This is a little like these kinds of solution:
> 
> http://tce.technion.ac.il/wp-content/uploads/sites/8/2015/06/SC-7.2-M.-
> Silberstein.pdf
> 
> With that, the application can directly pass the NiC packet pointer to the
> decryption accelerator, and get the bare data in place. This is the feature
> that
> the VFIO container can provide.

The above is not a good argument, at least in the context of your discussion.
If each device has its own interface (similar to a GPU) for a process to bind
with, then by having the process bind to multiple devices one by one you
still get the same address space shared across them...

Thanks
Kevin


Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-09 Thread Kenneth Lee
On Wed, Aug 08, 2018 at 11:18:35AM -0400, Jerome Glisse wrote:
> Date: Wed, 8 Aug 2018 11:18:35 -0400
> From: Jerome Glisse 
> To: Kenneth Lee 
> CC: Kenneth Lee , "Tian, Kevin"
>  , Alex Williamson ,
>  Herbert Xu , "k...@vger.kernel.org"
>  , Jonathan Corbet , Greg
>  Kroah-Hartman , Zaibo Xu ,
>  "linux-...@vger.kernel.org" , "Kumar, Sanjay K"
>  , Hao Fang ,
>  "linux-ker...@vger.kernel.org" ,
>  "linux...@huawei.com" ,
>  "iommu@lists.linux-foundation.org" ,
>  "linux-cry...@vger.kernel.org" , Philippe
>  Ombredanne , Thomas Gleixner ,
>  "David S . Miller" ,
>  "linux-accelerat...@lists.ozlabs.org"
>  
> Subject: Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive
> User-Agent: Mutt/1.10.0 (2018-05-17)
> Message-ID: <20180808151835.ga3...@redhat.com>
> 
> On Wed, Aug 08, 2018 at 09:08:42AM +0800, Kenneth Lee wrote:
> > 
> > 
> > On Monday, 6 August 2018, 11:32 PM, Jerome Glisse wrote:
> > > On Mon, Aug 06, 2018 at 11:12:52AM +0800, Kenneth Lee wrote:
> > > > On Fri, Aug 03, 2018 at 10:39:44AM -0400, Jerome Glisse wrote:
> > > > > On Fri, Aug 03, 2018 at 11:47:21AM +0800, Kenneth Lee wrote:
> > > > > > On Thu, Aug 02, 2018 at 10:22:43AM -0400, Jerome Glisse wrote:
> > > > > > > On Thu, Aug 02, 2018 at 12:05:57PM +0800, Kenneth Lee wrote:
> > > > > > > > On Thu, Aug 02, 2018 at 02:33:12AM +, Tian, Kevin wrote:
> > > > > > > > > > On Wed, Aug 01, 2018 at 06:22:14PM +0800, Kenneth Lee wrote:
> 
> [...]
> 
> > > > > > > > > > My more general question is do we want to grow VFIO to 
> > > > > > > > > > become
> > > > > > > > > > a more generic device driver API. This patchset adds a 
> > > > > > > > > > command
> > > > > > > > > > queue concept to it (i don't think it exist today but i have
> > > > > > > > > > not follow VFIO closely).
> > > > > > > > > > 
> > > > > > > > The thing is, VFIO is the only place to support DMA from user 
> > > > > > > > land. If we don't
> > > > > > > > put it here, we have to create another similar facility to 
> > > > > > > > support the same.
> > > > > > > No it is not, network device, GPU, block device, ... they all do
> > > > > > > support DMA. The point i am trying to make here is that even in
> > > > > > Sorry, wait a minute, are we talking the same thing? I meant "DMA 
> > > > > > from user
> > > > > > land", not "DMA from kernel driver". To do that we have to 
> > > > > > manipulate the
> > > > > > IOMMU(Unit). I think it can only be done by default_domain or vfio 
> > > > > > domain. Or
> > > > > > the user space have to directly access the IOMMU.
> > > > > GPU do DMA in the sense that you pass to the kernel a valid
> > > > > virtual address (kernel driver do all the proper check) and
> > > > > then you can use the GPU to copy from or to that range of
> > > > > virtual address. Exactly how you want to use this compression
> > > > > engine. It does not rely on SVM but SVM going forward would
> > > > > still be the prefered option.
> > > > > 
> > > > No, SVM is not the reason why we rely on Jean's SVM(SVA) series. We 
> > > > rely on
> > > > Jean's series because of multi-process (PASID or substream ID) support.
> > > > 
> > > > But of couse, WarpDrive can still benefit from the SVM feature.
> > > We are getting side tracked here. PASID/ID do not require VFIO.
> > > 
> > Yes, PASID itself do not require VFIO. But what if:
> > 
> > 1. Support DMA from user space.
> > 2. The hardware makes use of standard IOMMU/SMMU for IO address translation.
> > 3. The IOMMU facility is shared by both kernel and user drivers.
> > 4. Support PASID with the current IOMMU facility
> 
> I do not see how any of this means it has to be in VFIO.
> Other devices do just that. GPUs driver for instance share
> DMA engine (that copy data around) between kernel and user
> space. Sometime kernel use it to move things around. Evict
> some memory to make room for a new process is the common
> example. Same DMA engines is often use by userspace its

Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-08 Thread Jerome Glisse
On Wed, Aug 08, 2018 at 09:08:42AM +0800, Kenneth Lee wrote:
> 
> 
> On Monday, 6 August 2018, 11:32 PM, Jerome Glisse wrote:
> > On Mon, Aug 06, 2018 at 11:12:52AM +0800, Kenneth Lee wrote:
> > > On Fri, Aug 03, 2018 at 10:39:44AM -0400, Jerome Glisse wrote:
> > > > On Fri, Aug 03, 2018 at 11:47:21AM +0800, Kenneth Lee wrote:
> > > > > On Thu, Aug 02, 2018 at 10:22:43AM -0400, Jerome Glisse wrote:
> > > > > > On Thu, Aug 02, 2018 at 12:05:57PM +0800, Kenneth Lee wrote:
> > > > > > > On Thu, Aug 02, 2018 at 02:33:12AM +, Tian, Kevin wrote:
> > > > > > > > > On Wed, Aug 01, 2018 at 06:22:14PM +0800, Kenneth Lee wrote:

[...]

> > > > > > > > > My more general question is do we want to grow VFIO to become
> > > > > > > > > a more generic device driver API. This patchset adds a command
> > > > > > > > > queue concept to it (i don't think it exist today but i have
> > > > > > > > > not follow VFIO closely).
> > > > > > > > > 
> > > > > > > The thing is, VFIO is the only place to support DMA from user 
> > > > > > > land. If we don't
> > > > > > > put it here, we have to create another similar facility to 
> > > > > > > support the same.
> > > > > > No it is not, network device, GPU, block device, ... they all do
> > > > > > support DMA. The point i am trying to make here is that even in
> > > > > Sorry, wait a minute, are we talking the same thing? I meant "DMA 
> > > > > from user
> > > > > land", not "DMA from kernel driver". To do that we have to manipulate 
> > > > > the
> > > > > IOMMU(Unit). I think it can only be done by default_domain or vfio 
> > > > > domain. Or
> > > > > the user space have to directly access the IOMMU.
> > > > GPU do DMA in the sense that you pass to the kernel a valid
> > > > virtual address (kernel driver do all the proper check) and
> > > > then you can use the GPU to copy from or to that range of
> > > > virtual address. Exactly how you want to use this compression
> > > > engine. It does not rely on SVM but SVM going forward would
> > > > still be the prefered option.
> > > > 
> > > No, SVM is not the reason why we rely on Jean's SVM(SVA) series. We rely 
> > > on
> > > Jean's series because of multi-process (PASID or substream ID) support.
> > > 
> > > But of couse, WarpDrive can still benefit from the SVM feature.
> > We are getting side tracked here. PASID/ID do not require VFIO.
> > 
> Yes, PASID itself do not require VFIO. But what if:
> 
> 1. Support DMA from user space.
> 2. The hardware makes use of standard IOMMU/SMMU for IO address translation.
> 3. The IOMMU facility is shared by both kernel and user drivers.
> 4. Support PASID with the current IOMMU facility

I do not see how any of this means it has to be in VFIO.
Other devices do just that. GPU drivers, for instance, share a
DMA engine (that copies data around) between kernel and user
space. Sometimes the kernel uses it to move things around;
evicting some memory to make room for a new process is the common
example. The same DMA engine is often used by userspace itself
during rendering or compute (the program moving things on its
own). So there are already kernel drivers that do all 4 of
the above and are not in VFIO.


> > > > > > your mechanisms the userspace must have a specific userspace
> > > > > > drivers for each hardware and thus there are virtually no
> > > > > > differences between having this userspace driver open a device
> > > > > > file in vfio or somewhere else in the device filesystem. This is
> > > > > > just a different path.
> > > > > > 
> > > > > The basic problem WarpDrive want to solve it to avoid syscall. This 
> > > > > is important
> > > > > to accelerators. We have some data here:
> > > > > https://www.slideshare.net/linaroorg/progress-and-demonstration-of-wrapdrive-a-accelerator-framework-sfo17317
> > > > > 
> > > > > (see page 3)
> > > > > 
> > > > > The performance is different on using kernel and user drivers.
> > > > Yes and example i point to is exactly that. You have a one time setup
> > > > cost (creating command buffer binding PASID with command buffer and
> > > > couple other setup steps). Then userspace no longer have to do any
> > > > ioctl to schedule work on the GPU. It is all down from userspace and
> > > > it use a doorbell to notify hardware when it should go look at command
> > > > buffer for new thing to execute.
> > > > 
> > > > My point stands on that. You have existing driver already doing so
> > > > with no new framework and in your scheme you need a userspace driver.
> > > > So i do not see the value add, using one path or the other in the
> > > > userspace driver is litteraly one line to change.
> > > > 
> > > Sorry, I'd got confuse here. I partially agree that the user driver is
> > > redundance of kernel driver. (But for WarpDrive, the kernel driver is a 
> > > full
> > > driver include all preparation and setup stuff for the hardware, the user 
> > > driver
> > > is simply to send request and receive answer). Yes, it is just a choice 
> > > of path.
> > > But the user path is faster if 

Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-07 Thread Kenneth Lee



On Monday, 6 August 2018, 11:32 PM, Jerome Glisse wrote:

On Mon, Aug 06, 2018 at 11:12:52AM +0800, Kenneth Lee wrote:

On Fri, Aug 03, 2018 at 10:39:44AM -0400, Jerome Glisse wrote:

On Fri, Aug 03, 2018 at 11:47:21AM +0800, Kenneth Lee wrote:

On Thu, Aug 02, 2018 at 10:22:43AM -0400, Jerome Glisse wrote:

On Thu, Aug 02, 2018 at 12:05:57PM +0800, Kenneth Lee wrote:

On Thu, Aug 02, 2018 at 02:33:12AM +, Tian, Kevin wrote:

On Wed, Aug 01, 2018 at 06:22:14PM +0800, Kenneth Lee wrote:

[...]


But doorbell is just a notification. Except for DOS (to make hardware busy) it
cannot actually take or change anything from the kernel space. And the DOS
problem can be always taken as the problem that a group of processes share the
same kernel entity.

In the coming HIP09 hardware, the doorbell will come with a random number so
only the process who allocated the queue can knock it correctly.

When doorbell is ring the hardware start fetching commands from
the queue and execute them ? If so than a rogue process B might
ring the doorbell of process A which would starts execution of
random commands (ie whatever random memory value there is left
inside the command buffer memory, could be old commands i guess).

If this is not how this doorbell works then, yes it can only do
a denial of service i guess. Issue i have with doorbell is that
i have seen 10 differents implementations in 10 differents hw
and each are different as to what ringing or value written to the
doorbell does. It is painfull to track what is what for each hw.


In our implementation, doorbell is simply a notification, just like an interrupt
to the accelerator. The command is all about what's in the queue.

I agree that there is no simple and standard way to track the shared IO space.
But I think we have to trust the driver in some way. If the driver is malicious,
even a simple ioctl can become an attack.

Trusting kernel space driver is fine, trusting user space driver is
not in my view. AFAICT every driver developer so far always made
sure that someone could not abuse its device to do harmfull thing to
other process.


Fully agree. That is why this driver shares only the doorbell space. There is
only the doorbell is shared in the whole page, nothing else.

Maybe you are concerning the user driver will give malicious command to the
hardware? But these commands cannot influence the other process. If we can trust
the hardware design, the process cannot do any harm.

My questions was what happens if a process B ring the doorbell of
process A.

On some hardware the value written in the doorbell is use as an
index in command buffer. On other it just wakeup the hardware to go
look at a structure private to the process. They are other variations
of those themes.

If it is the former ie the value is use to advance in the command
buffer then a rogue process can force another process to advance its
command buffer and what is in the command buffer can be some random
old memory values which can be more harmfull than just Denial Of
Service.


Yes. We have considered that. There is no other information in the 
doorbell. The indexes, such as head and tail pointers, are all in the 
shared memory between the hardware and the user process. The other 
process cannot touch it.
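
To illustrate that design, a hedged sketch of a notification-only doorbell.
All structures, fields and the doorbell semantics here are hypothetical; the
point is only that the indexes live in memory shared between the process and
the hardware, and the doorbell write carries no state of its own:

#include <stdint.h>

struct wd_cmd {                  /* hypothetical command descriptor */
    uint64_t src, dst, len;
    uint32_t opcode, flags;
};

struct wd_queue_hdr {            /* shared by process and hardware only */
    volatile uint32_t head;      /* advanced by hardware on completion */
    volatile uint32_t tail;      /* advanced by the process on submit */
    uint32_t depth;
};

static void wd_submit(struct wd_queue_hdr *hdr, struct wd_cmd *ring,
                      volatile uint32_t *doorbell, const struct wd_cmd *cmd)
{
    uint32_t tail = hdr->tail;

    ring[tail % hdr->depth] = *cmd;          /* 1. write the command      */
    __atomic_thread_fence(__ATOMIC_RELEASE); /* 2. order it before...     */
    hdr->tail = tail + 1;                    /* 3. ...publishing the tail */
    *doorbell = 1;                           /* 4. knock: pure wake-up    */
}

A write to the doorbell by another process can at worst wake the hardware
spuriously; it cannot change what is in the queue.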



My more general question is do we want to grow VFIO to become
a more generic device driver API. This patchset adds a command
queue concept to it (i don't think it exist today but i have
not follow VFIO closely).


The thing is, VFIO is the only place to support DMA from user land. If we don't
put it here, we have to create another similar facility to support the same.

No it is not, network device, GPU, block device, ... they all do
support DMA. The point i am trying to make here is that even in

Sorry, wait a minute, are we talking the same thing? I meant "DMA from user
land", not "DMA from kernel driver". To do that we have to manipulate the
IOMMU(Unit). I think it can only be done by default_domain or vfio domain. Or
the user space have to directly access the IOMMU.

GPU do DMA in the sense that you pass to the kernel a valid
virtual address (kernel driver do all the proper check) and
then you can use the GPU to copy from or to that range of
virtual address. Exactly how you want to use this compression
engine. It does not rely on SVM but SVM going forward would
still be the prefered option.


No, SVM is not the reason why we rely on Jean's SVM(SVA) series. We rely on
Jean's series because of multi-process (PASID or substream ID) support.

But of couse, WarpDrive can still benefit from the SVM feature.

We are getting side tracked here. PASID/ID do not require VFIO.


Yes, PASID itself does not require VFIO. But what if we need to:

1. Support DMA from user space.
2. The hardware makes use of standard IOMMU/SMMU for IO address translation.
3. The IOMMU facility is shared by both kernel and user drivers.
4. Support PASID with the current IOMMU facility

your mechanisms the userspace must have a specific 

Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-06 Thread Jerome Glisse
On Mon, Aug 06, 2018 at 11:12:52AM +0800, Kenneth Lee wrote:
> On Fri, Aug 03, 2018 at 10:39:44AM -0400, Jerome Glisse wrote:
> > On Fri, Aug 03, 2018 at 11:47:21AM +0800, Kenneth Lee wrote:
> > > On Thu, Aug 02, 2018 at 10:22:43AM -0400, Jerome Glisse wrote:
> > > > On Thu, Aug 02, 2018 at 12:05:57PM +0800, Kenneth Lee wrote:
> > > > > On Thu, Aug 02, 2018 at 02:33:12AM +, Tian, Kevin wrote:
> > > > > > > On Wed, Aug 01, 2018 at 06:22:14PM +0800, Kenneth Lee wrote:
> > > > > > > >

[...]

> > > > > But doorbell is just a notification. Except for DOS (to make hardware 
> > > > > busy) it
> > > > > cannot actually take or change anything from the kernel space. And 
> > > > > the DOS
> > > > > problem can be always taken as the problem that a group of processes 
> > > > > share the
> > > > > same kernel entity.
> > > > > 
> > > > > In the coming HIP09 hardware, the doorbell will come with a random 
> > > > > number so
> > > > > only the process who allocated the queue can knock it correctly.
> > > > 
> > > > When doorbell is ring the hardware start fetching commands from
> > > > the queue and execute them ? If so than a rogue process B might
> > > > ring the doorbell of process A which would starts execution of
> > > > random commands (ie whatever random memory value there is left
> > > > inside the command buffer memory, could be old commands i guess).
> > > > 
> > > > If this is not how this doorbell works then, yes it can only do
> > > > a denial of service i guess. Issue i have with doorbell is that
> > > > i have seen 10 differents implementations in 10 differents hw
> > > > and each are different as to what ringing or value written to the
> > > > doorbell does. It is painfull to track what is what for each hw.
> > > > 
> > > 
> > > In our implementation, doorbell is simply a notification, just like an 
> > > interrupt
> > > to the accelerator. The command is all about what's in the queue.
> > > 
> > > I agree that there is no simple and standard way to track the shared IO 
> > > space.
> > > But I think we have to trust the driver in some way. If the driver is 
> > > malicious,
> > > even a simple ioctl can become an attack.
> > 
> > Trusting kernel space driver is fine, trusting user space driver is
> > not in my view. AFAICT every driver developer so far always made
> > sure that someone could not abuse its device to do harmfull thing to
> > other process.
> > 
> 
> Fully agree. That is why this driver shares only the doorbell space. There is
> only the doorbell is shared in the whole page, nothing else.
> 
> Maybe you are concerning the user driver will give malicious command to the
> hardware? But these commands cannot influence the other process. If we can 
> trust
> the hardware design, the process cannot do any harm.

My question was: what happens if a process B rings the doorbell of
process A?

On some hardware the value written to the doorbell is used as an
index into the command buffer. On others it just wakes up the hardware to go
look at a structure private to the process. There are other variations
on those themes.

If it is the former, ie the value is used to advance in the command
buffer, then a rogue process can force another process to advance its
command buffer, and what is in the command buffer can be some random
old memory values, which can be more harmful than just Denial Of
Service.
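For illustration, a minimal sketch of the two styles described above, with a
purely hypothetical register layout (not any particular hardware):

#include <stdint.h>

struct cmd_queue {
        volatile uint32_t *doorbell;    /* mmap'ed MMIO doorbell register */
        uint32_t tail;                  /* software producer index        */
};

/* Style 1: the written value advances the hardware consumer up to 'tail'.
 * A rogue ring with a bogus value makes the hardware execute stale entries. */
static void ring_indexed(struct cmd_queue *q)
{
        *q->doorbell = q->tail;
}

/* Style 2: the value is ignored; the write only wakes the hardware, which
 * then reads a descriptor private to the owning process. A rogue ring is at
 * worst a spurious wakeup. */
static void ring_wakeup(struct cmd_queue *q)
{
        *q->doorbell = 1;
}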


> > > > > > > My more general question is do we want to grow VFIO to become
> > > > > > > a more generic device driver API. This patchset adds a command
> > > > > > > queue concept to it (i don't think it exist today but i have
> > > > > > > not follow VFIO closely).
> > > > > > > 
> > > > > 
> > > > > The thing is, VFIO is the only place to support DMA from user land. 
> > > > > If we don't
> > > > > put it here, we have to create another similar facility to support 
> > > > > the same.
> > > > 
> > > > No it is not, network device, GPU, block device, ... they all do
> > > > support DMA. The point i am trying to make here is that even in
> > > 
> > > Sorry, wait a minute, are we talking the same thing? I meant "DMA from 
> > > user
> > > land", not "DMA from kernel driver". To do that we have to manipulate the
> > > IOMMU(Unit). I think it can only be done by default_domain or vfio 
> > > domain. Or
> > > the user space have to directly access the IOMMU.
> > 
> > GPU do DMA in the sense that you pass to the kernel a valid
> > virtual address (kernel driver do all the proper check) and
> > then you can use the GPU to copy from or to that range of
> > virtual address. Exactly how you want to use this compression
> > engine. It does not rely on SVM but SVM going forward would
> > still be the prefered option.
> > 
> 
> No, SVM is not the reason why we rely on Jean's SVM(SVA) series. We rely on
> Jean's series because of multi-process (PASID or substream ID) support.
> 
> But of couse, WarpDrive can still benefit from the SVM feature.

We are getting side tracked here. PASID/ID do not require VFIO.


> > > > your 

Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-05 Thread Kenneth Lee
On Fri, Aug 03, 2018 at 10:39:44AM -0400, Jerome Glisse wrote:
> On Fri, Aug 03, 2018 at 11:47:21AM +0800, Kenneth Lee wrote:
> > On Thu, Aug 02, 2018 at 10:22:43AM -0400, Jerome Glisse wrote:
> > > On Thu, Aug 02, 2018 at 12:05:57PM +0800, Kenneth Lee wrote:
> > > > On Thu, Aug 02, 2018 at 02:33:12AM +, Tian, Kevin wrote:
> > > > > Date: Thu, 2 Aug 2018 02:33:12 +
> > > > > > From: Jerome Glisse
> > > > > > On Wed, Aug 01, 2018 at 06:22:14PM +0800, Kenneth Lee wrote:
> > > > > > > From: Kenneth Lee 
> > > > > > >
> > > > > > > WarpDrive is an accelerator framework to expose the hardware
> > > > > > capabilities
> > > > > > > directly to the user space. It makes use of the exist vfio and 
> > > > > > > vfio-mdev
> > > > > > > facilities. So the user application can send request and DMA to 
> > > > > > > the
> > > > > > > hardware without interaction with the kernel. This remove the 
> > > > > > > latency
> > > > > > > of syscall and context switch.
> > > > > > >
> > > > > > > The patchset contains documents for the detail. Please refer to 
> > > > > > > it for
> > > > > > more
> > > > > > > information.
> > > > > > >
> > > > > > > This patchset is intended to be used with Jean Philippe Brucker's 
> > > > > > > SVA
> > > > > > > patch [1] (Which is also in RFC stage). But it is not mandatory. 
> > > > > > > This
> > > > > > > patchset is tested in the latest mainline kernel without the SVA 
> > > > > > > patches.
> > > > > > > So it support only one process for each accelerator.
> > > > > > >
> > > > > > > With SVA support, WarpDrive can support multi-process in the same
> > > > > > > accelerator device.  We tested it in our SoC integrated 
> > > > > > > Accelerator (board
> > > > > > > ID: D06, Chip ID: HIP08). A reference work tree can be found 
> > > > > > > here: [2].
> > > > > > 
> > > > > > I have not fully inspected things nor do i know enough about
> > > > > > this Hisilicon ZIP accelerator to ascertain, but from glimpsing
> > > > > > at the code it seems that it is unsafe to use even with SVA due
> > > > > > to the doorbell. There is a comment talking about safetyness
> > > > > > in patch 7.
>

Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-05 Thread Kenneth Lee
On Fri, Aug 03, 2018 at 03:20:43PM +0100, Alan Cox wrote:
> 
> > If we are going to have any kind of general purpose accelerator API then
> > > it has to be able to implement things like  
> > 
> > Why is the existing driver model not good enough ? So you want
> > a device with function X you look into /dev/X (for instance
> > for GPU you look in /dev/dri)
> 
> Except when my GPU is in an FPGA in which case it might be somewhere else
> or it's a general purpose accelerator that happens to be usable as a GPU.
> Unusual today in big computer space but you'll find it in
> microcontrollers.
> 
> > Each of those device need a userspace driver and thus this
> > user space driver can easily knows where to look. I do not
> > expect that every application will reimplement those drivers
> > but instead use some kind of library that provide a high
> > level API for each of those devices.
> 
> Think about it from the user level. You have a pipeline of things you
> wish to execute, you need to get the right accelerator combinations and
> they need to fit together to meet system constraints like number of
> IOMMU ids the accelerator supports, where they are connected.
> 
> > Now you have a hierarchy of memory for the CPU (HBM, local
> > node main memory aka you DDR dimm, persistent memory) each
> 
> It's not a heirarchy, it's a graph. There's no fundamental reason two
> accelerators can't be close to two different CPU cores but have shared
> HBM that is far from each processor. There are physical reasons it tends
> to look more like a heirarchy today.
> 
> > Anyway i think finding devices and finding relation between
> > devices and memory is 2 separate problems and as such should
> > be handled separatly.
> 
> At a certain level they are deeply intertwined because you need a common
> API. It's not good if I want a particular accelerator and need to then
> see which API its under on this machine and which interface I have to
> use, and maybe have a mix of FPGA, WarpDrive and Google ASIC interfaces
> all different.
> 
> The job of the kernel is to impose some kind of sanity and unity on this
> lot.
> 
> All of it in the end comes down to
> 
> 'Somehow glue some chunk of memory into my address space and find any
> supporting driver I need'
> 

Agree. This is also our intention with WarpDrive. And it looks like VFIO is the
best place to fulfill this requirement.
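For illustration, a minimal sketch of that "glue a chunk of memory into my
address space" step through an existing VFIO device fd (the region index is an
assumption, and setup and error handling are omitted):

#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

/* device_fd would come from VFIO_GROUP_GET_DEVICE_FD on the device's group. */
void *map_region(int device_fd, unsigned int index)
{
        struct vfio_region_info info = {
                .argsz = sizeof(info),
                .index = index,         /* e.g. the queue/doorbell region */
        };

        if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info))
                return NULL;
        if (!(info.flags & VFIO_REGION_INFO_FLAG_MMAP))
                return NULL;

        /* The kernel driver decides what backs this offset (doorbell page,
         * queue memory, ...); the process just maps it. */
        return mmap(NULL, info.size, PROT_READ | PROT_WRITE,
                    MAP_SHARED, device_fd, info.offset);
}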

> plus virtualization of the above.
> 
> That bit's easy - but making it usable is a different story.
> 
> Alan

-- 
-Kenneth(Hisilicon)




Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-03 Thread Jerome Glisse
On Fri, Aug 03, 2018 at 03:20:43PM +0100, Alan Cox wrote:
> > If we are going to have any kind of general purpose accelerator API then
> > > it has to be able to implement things like  
> > 
> > Why is the existing driver model not good enough ? So you want
> > a device with function X you look into /dev/X (for instance
> > for GPU you look in /dev/dri)
> 
> Except when my GPU is in an FPGA in which case it might be somewhere else
> or it's a general purpose accelerator that happens to be usable as a GPU.
> Unusual today in big computer space but you'll find it in
> microcontrollers.

You do need a specific userspace driver for each of those devices,
correct?

You definitely do for GPU and i do not see that going away any time
soon. I doubt Xilinx and Altera will ever compile down to the same bit-
stream format either.

So a userspace application is bound to use some kind of library that
implements the userspace side of the driver, and that library can
easily provide helpers to enumerate all the devices it supports.

For instance that is what OpenCL allows for both GPU and FPGA.
One single API and multiple different pieces of hardware you can target from
it.
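For illustration, a minimal sketch of that single-API enumeration using stock
OpenCL calls (array sizes arbitrary, error handling trimmed):

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
        cl_platform_id platforms[8];
        cl_uint nplat = 0, ndev, i, j;

        clGetPlatformIDs(8, platforms, &nplat);
        for (i = 0; i < nplat; i++) {
                cl_device_id devs[16];

                /* Same call whether the platform drives a GPU or an FPGA. */
                if (clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL,
                                   16, devs, &ndev) != CL_SUCCESS)
                        continue;
                for (j = 0; j < ndev; j++) {
                        char name[256];

                        clGetDeviceInfo(devs[j], CL_DEVICE_NAME,
                                        sizeof(name), name, NULL);
                        printf("%s\n", name);
                }
        }
        return 0;
}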


> > Each of those device need a userspace driver and thus this
> > user space driver can easily knows where to look. I do not
> > expect that every application will reimplement those drivers
> > but instead use some kind of library that provide a high
> > level API for each of those devices.
> 
> Think about it from the user level. You have a pipeline of things you
> wish to execute, you need to get the right accelerator combinations and
> they need to fit together to meet system constraints like number of
> IOMMU ids the accelerator supports, where they are connected.

Creating a pipe of devices, ie one device consuming the work of
the previous one, is a problem on its own and it should be solved
separately and not inside VFIO.

GPUs (on ARM) already have this pipe thing because the IP block
that does overlay, the IP block that pushes pixels to the screen and
the IP block that does 3D rendering all come from different
companies.

I do not see the value of having all the devices enumerated through
VFIO to address this problem. I can definitely understand having a
specific kernel mechanism to expose to userspace what is do-able,
but i believe this should be its own thing that allows any device
(a VFIO one, a network one, a GPU, a FPGA, ...) to be used in "pipe"
mode.


> > Now you have a hierarchy of memory for the CPU (HBM, local
> > node main memory aka you DDR dimm, persistent memory) each
> 
> It's not a heirarchy, it's a graph. There's no fundamental reason two
> accelerators can't be close to two different CPU cores but have shared
> HBM that is far from each processor. There are physical reasons it tends
> to look more like a heirarchy today.

Yes you are right i used the wrong word.

> 
> > Anyway i think finding devices and finding relation between
> > devices and memory is 2 separate problems and as such should
> > be handled separatly.
> 
> At a certain level they are deeply intertwined because you need a common
> API. It's not good if I want a particular accelerator and need to then
> see which API its under on this machine and which interface I have to
> use, and maybe have a mix of FPGA, WarpDrive and Google ASIC interfaces
> all different.
> 
> The job of the kernel is to impose some kind of sanity and unity on this
> lot.
> 
> All of it in the end comes down to
> 
> 'Somehow glue some chunk of memory into my address space and find any
> supporting driver I need'
> 
> plus virtualization of the above.
> 
> That bit's easy - but making it usable is a different story.

My point is that for all those devices you will have userspace drivers,
and thus all the complexity of finding the best memory for a given combination
of hardware can be pushed to userspace. The kernel only has to expose the
topology of the different memories and their relation with each of the
devices. Very much like you have NUMA nodes for the CPU.
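For illustration, one bit of topology the kernel already exposes today; the
PCI address below is made up:

#include <stdio.h>

int main(void)
{
        /* Which NUMA node does a (hypothetical) accelerator sit on? */
        FILE *f = fopen("/sys/bus/pci/devices/0000:75:00.0/numa_node", "r");
        int node = -1;

        if (f) {
                fscanf(f, "%d", &node);
                fclose(f);
        }
        printf("device is on NUMA node %d\n", node);
        return 0;
}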


I would rather see the kernel having one API to expose topology and one API to
expose device mixing capabilities (ie how you can pipe devices
together, if at all), than having to update every single existing
upstream driver that wants to participate in the above to now become
a VFIO driver.


I have nothing against VFIO ;) It is just that to me it seems that this is all
reinventing a new device driver infrastructure under it, while the
existing one is good enough and can evolve to support all the cases
discussed here.

Cheers,
Jérôme


Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-03 Thread Jerome Glisse
On Fri, Aug 03, 2018 at 11:47:21AM +0800, Kenneth Lee wrote:
> On Thu, Aug 02, 2018 at 10:22:43AM -0400, Jerome Glisse wrote:
> > On Thu, Aug 02, 2018 at 12:05:57PM +0800, Kenneth Lee wrote:
> > > On Thu, Aug 02, 2018 at 02:33:12AM +, Tian, Kevin wrote:
> > > > Date: Thu, 2 Aug 2018 02:33:12 +
> > > > > From: Jerome Glisse
> > > > > On Wed, Aug 01, 2018 at 06:22:14PM +0800, Kenneth Lee wrote:
> > > > > > From: Kenneth Lee 
> > > > > >
> > > > > > WarpDrive is an accelerator framework to expose the hardware
> > > > > capabilities
> > > > > > directly to the user space. It makes use of the exist vfio and 
> > > > > > vfio-mdev
> > > > > > facilities. So the user application can send request and DMA to the
> > > > > > hardware without interaction with the kernel. This remove the 
> > > > > > latency
> > > > > > of syscall and context switch.
> > > > > >
> > > > > > The patchset contains documents for the detail. Please refer to it 
> > > > > > for
> > > > > more
> > > > > > information.
> > > > > >
> > > > > > This patchset is intended to be used with Jean Philippe Brucker's 
> > > > > > SVA
> > > > > > patch [1] (Which is also in RFC stage). But it is not mandatory. 
> > > > > > This
> > > > > > patchset is tested in the latest mainline kernel without the SVA 
> > > > > > patches.
> > > > > > So it support only one process for each accelerator.
> > > > > >
> > > > > > With SVA support, WarpDrive can support multi-process in the same
> > > > > > accelerator device.  We tested it in our SoC integrated Accelerator 
> > > > > > (board
> > > > > > ID: D06, Chip ID: HIP08). A reference work tree can be found here: 
> > > > > > [2].
> > > > > 
> > > > > I have not fully inspected things nor do i know enough about
> > > > > this Hisilicon ZIP accelerator to ascertain, but from glimpsing
> > > > > at the code it seems that it is unsafe to use even with SVA due
> > > > > to the doorbell. There is a comment talking about safetyness
> > > > > in patch 7.
> > > > > 
> > > > > Exposing thing to userspace is always enticing, but if it is
> > > > > a security risk then it should clearly say so and maybe a
> > > > > kernel boot flag should be necessary to allow such device to
> > > > > be use.
> > > > > 
> > > 
> > > But doorbell is just a notification. Except for DOS (to make hardware 
> > > busy) it
> > > cannot actually take or change anything from the kernel space. And the DOS
> > > problem can be always taken as the problem that a group of processes 
> > > share the
> > > same kernel entity.
> > > 
> > > In the coming HIP09 hardware, the doorbell will come with a random number 
> > > so
> > > only the process who allocated the queue can knock it correctly.
> > 
> > When doorbell is ring the hardware start fetching commands from
> > the queue and execute them ? If so than a rogue process B might
> > ring the doorbell of process A which would starts execution of
> > random commands (ie whatever random memory value there is left
> > inside the command buffer memory, could be old commands i guess).
> > 
> > If this is not how this doorbell works th

Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-03 Thread Alan Cox
> If we are going to have any kind of general purpose accelerator API then
> > it has to be able to implement things like  
> 
> Why is the existing driver model not good enough ? So you want
> a device with function X you look into /dev/X (for instance
> for GPU you look in /dev/dri)

Except when my GPU is in an FPGA in which case it might be somewhere else
or it's a general purpose accelerator that happens to be usable as a GPU.
Unusual today in big computer space but you'll find it in
microcontrollers.

> Each of those device need a userspace driver and thus this
> user space driver can easily knows where to look. I do not
> expect that every application will reimplement those drivers
> but instead use some kind of library that provide a high
> level API for each of those devices.

Think about it from the user level. You have a pipeline of things you
wish to execute, you need to get the right accelerator combinations and
they need to fit together to meet system constraints like number of
IOMMU ids the accelerator supports, where they are connected.

> Now you have a hierarchy of memory for the CPU (HBM, local
> node main memory aka you DDR dimm, persistent memory) each

It's not a hierarchy, it's a graph. There's no fundamental reason two
accelerators can't be close to two different CPU cores but have shared
HBM that is far from each processor. There are physical reasons it tends
to look more like a hierarchy today.

> Anyway i think finding devices and finding relation between
> devices and memory is 2 separate problems and as such should
> be handled separatly.

At a certain level they are deeply intertwined because you need a common
API. It's not good if I want a particular accelerator and need to then
see which API it's under on this machine and which interface I have to
use, and maybe have a mix of FPGA, WarpDrive and Google ASIC interfaces,
all different.

The job of the kernel is to impose some kind of sanity and unity on this
lot.

All of it in the end comes down to

'Somehow glue some chunk of memory into my address space and find any
supporting driver I need'

plus virtualization of the above.

That bit's easy - but making it usable is a different story.

Alan


Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-02 Thread Jerome Glisse
On Thu, Aug 02, 2018 at 11:10:00AM +0100, Alan Cox wrote:
> > One motivation I guess, is that most accelerators lack of a 
> > well-abstracted high level APIs similar to GPU side (e.g. OpenCL 
> > clearly defines Shared Virtual Memory models). VFIO mdev
> > might be an alternative common interface to enable SVA usages 
> > on various accelerators...
> 
> SVA is not IMHO the hard bit from a user level API perspective. The hard
> bit is describing what you have and enumerating the contents of the device
> especially when those can be quite dynamic and in the FPGA case can
> change on the fly.
> 
> Right now we've got
> - FPGA manager
> - Google's recently posted ASIC patches
> - WarpDrive
> 
> all trying to be bits of the same thing, and really there needs to be a
> single solution that handles all of this stuff properly.
> 
> If we are going to have any kind of general purpose accelerator API then
> it has to be able to implement things like

Why is the existing driver model not good enough? If you want
a device with function X you look into /dev/X (for instance
for a GPU you look in /dev/dri).

Each of those devices needs a userspace driver, and thus this
user space driver easily knows where to look. I do not
expect that every application will reimplement those drivers
but instead use some kind of library that provides a high
level API for each of those devices.
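For illustration, a trivial sketch of the "knows where to look" part for the
DRM case (render nodes only, no error handling):

#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
        DIR *d = opendir("/dev/dri");
        struct dirent *e;

        if (!d)
                return 1;
        /* A GPU userspace driver/library simply scans its well-known path. */
        while ((e = readdir(d)) != NULL)
                if (!strncmp(e->d_name, "renderD", 7))
                        printf("/dev/dri/%s\n", e->d_name);
        closedir(d);
        return 0;
}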

>   'find me an accelerator with function X that is nearest my memory'
>   'find me accelerator functions X and Y that share HBM'
>   'find me accelerator functions X and Y than can be chained'
> 
> If instead we have three API's depending upon whose accelerator you are
> using and whether it's FPGA or ASIC this is going to be a mess on a grand
> scale.

I see the enumeration as an orthogonal problem. There have
been talks within various circles about it because new system
buses (CAPI, CCIX, ...) imply that things like the NUMA topology
you have in sysfs are not up to the task.

Now you have a hierarchy of memory for the CPU (HBM, local
node main memory aka your DDR DIMMs, persistent memory), each
with different properties (bandwidth, latency, ...). But the same
goes for device memory. Some devices can have many types
of memory too. For instance a GPU might have HBM, GDDR,
persistent and multiple other types of memory. Those devices are
on a given CPU node but can also have interconnects of their
own (AMD Infinity, NVLink, ...).


You have HMAT, which i was hoping would provide a new sysfs
layout that supersedes the old NUMA one, but it seems it is going back to
NUMA nodes for compatibility reasons i guess (ccing Ross).


Anyway i think finding devices and finding the relations between
devices and memory are 2 separate problems and as such should
be handled separately.


Cheers,
Jérôme


Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-02 Thread Jerome Glisse
On Thu, Aug 02, 2018 at 12:05:57PM +0800, Kenneth Lee wrote:
> On Thu, Aug 02, 2018 at 02:33:12AM +, Tian, Kevin wrote:
> > Date: Thu, 2 Aug 2018 02:33:12 +
> > > From: Jerome Glisse
> > > On Wed, Aug 01, 2018 at 06:22:14PM +0800, Kenneth Lee wrote:
> > > > From: Kenneth Lee 
> > > >
> > > > WarpDrive is an accelerator framework to expose the hardware
> > > capabilities
> > > > directly to the user space. It makes use of the exist vfio and vfio-mdev
> > > > facilities. So the user application can send request and DMA to the
> > > > hardware without interaction with the kernel. This remove the latency
> > > > of syscall and context switch.
> > > >
> > > > The patchset contains documents for the detail. Please refer to it for
> > > more
> > > > information.
> > > >
> > > > This patchset is intended to be used with Jean Philippe Brucker's SVA
> > > > patch [1] (Which is also in RFC stage). But it is not mandatory. This
> > > > patchset is tested in the latest mainline kernel without the SVA 
> > > > patches.
> > > > So it support only one process for each accelerator.
> > > >
> > > > With SVA support, WarpDrive can support multi-process in the same
> > > > accelerator device.  We tested it in our SoC integrated Accelerator 
> > > > (board
> > > > ID: D06, Chip ID: HIP08). A reference work tree can be found here: [2].
> > > 
> > > I have not fully inspected things nor do i know enough about
> > > this Hisilicon ZIP accelerator to ascertain, but from glimpsing
> > > at the code it seems that it is unsafe to use even with SVA due
> > > to the doorbell. There is a comment talking about safetyness
> > > in patch 7.
> > > 
> > > Exposing thing to userspace is always enticing, but if it is
> > > a security risk then it should clearly say so and maybe a
> > > kernel boot flag should be necessary to allow such device to
> > > be use.
> > > 
> 
> But doorbell is just a notification. Except for DOS (to make hardware busy) it
> cannot actually take or change anything from the kernel space. And the DOS
> problem can be always taken as the problem that a group of processes share the
> same kernel entity.
> 
> In the coming HIP09 hardware, the doorbell will come with a random number so
> only the process who allocated the queue can knock it correctly.

When the doorbell is rung, does the hardware start fetching commands from
the queue and executing them? If so then a rogue process B might
ring the doorbell of process A, which would start execution of
random commands (ie whatever random memory values are left
inside the command buffer memory, could be old commands i guess).

If this is not how this doorbell works then, yes, it can only do
a denial of service i guess. The issue i have with doorbells is that
i have seen 10 different implementations in 10 different pieces of hw,
and each is different as to what ringing it or the value written to the
doorbell does. It is painful to track what is what for each hw.


> > > My more general question is do we want to grow VFIO to become
> > > a more generic device driver API. This patchset adds a command
> > > queue concept to it (i don't think it exist today but i have
> > > not follow VFIO closely).
> > > 
> 
> The thing is, VFIO is the only place to support DMA from user land. If we 
> don't
> put it here, we have to create another similar facility to support the same.

No it is not; network devices, GPUs, block devices, ... they all do
support DMA. The point i am trying to make here is that even in
your mechanism the userspace must have a specific userspace
driver for each piece of hardware, and thus there are virtually no
differences between having this userspace driver open a device
file in vfio or somewhere else in the device filesystem. This is
just a different path.

So this is why i do not see any benefit to having all drivers with
SVM (can we please use SVM and not SVA, as SVM is what has been used
in more places so far).


Cheers,
Jérôme


Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-02 Thread Xu Zaibo

Hi,

On 2018/8/2 18:10, Alan Cox wrote:

One motivation I guess, is that most accelerators lack of a
well-abstracted high level APIs similar to GPU side (e.g. OpenCL
clearly defines Shared Virtual Memory models). VFIO mdev
might be an alternative common interface to enable SVA usages
on various accelerators...

SVA is not IMHO the hard bit from a user level API perspective. The hard
bit is describing what you have and enumerating the contents of the device
especially when those can be quite dynamic and in the FPGA case can
change on the fly.

Right now we've got
- FPGA manager
- Google's recently posted ASIC patches
- WarpDrive

all trying to be bits of the same thing, and really there needs to be a
single solution that handles all of this stuff properly.

If we are going to have any kind of general purpose accelerator API then
it has to be able to implement things like

'find me an accelerator with function X that is nearest my memory'
'find me accelerator functions X and Y that share HBM'
'find me accelerator functions X and Y than can be chained'

If instead we have three API's depending upon whose accelerator you are
using and whether it's FPGA or ASIC this is going to be a mess on a grand
scale.

Agree. At the beginning, we tried to bring in a notion of 'capability' which
describes 'algorithms, memory access methods, etc.',
but then we came to realize that the first thing is to come
to a single solution on these things, such as
memory/device access, IOMMU, etc.

Thanks,
Zaibo









Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-02 Thread Alan Cox
> One motivation I guess, is that most accelerators lack of a 
> well-abstracted high level APIs similar to GPU side (e.g. OpenCL 
> clearly defines Shared Virtual Memory models). VFIO mdev
> might be an alternative common interface to enable SVA usages 
> on various accelerators...

SVA is not IMHO the hard bit from a user level API perspective. The hard
bit is describing what you have and enumerating the contents of the device
especially when those can be quite dynamic and in the FPGA case can
change on the fly.

Right now we've got
- FPGA manager
- Google's recently posted ASIC patches
- WarpDrive

all trying to be bits of the same thing, and really there needs to be a
single solution that handles all of this stuff properly.

If we are going to have any kind of general purpose accelerator API then
it has to be able to implement things like

'find me an accelerator with function X that is nearest my memory'
'find me accelerator functions X and Y that share HBM'
'find me accelerator functions X and Y that can be chained'

If instead we have three APIs depending upon whose accelerator you are
using and whether it's FPGA or ASIC, this is going to be a mess on a grand
scale.

Alan


Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-01 Thread Kenneth Lee
On Thu, Aug 02, 2018 at 04:36:52AM +, Tian, Kevin wrote:
> 
> > From: Kenneth Lee
> > Sent: Thursday, August 2, 2018 11:40 AM
> > 
> > On Thu, Aug 02, 2018 at 02:59:33AM +, Tian, Kevin wrote:
> > > > From: Kenneth Lee
> > > > Sent: Wednesday, August 1, 2018 6:22 PM
> > > >
> > > > From: Kenneth Lee 
> > > >
> > > > WarpDrive is an accelerator framework to expose the hardware
> > capabilities
> > > > directly to the user space. It makes use of the exist vfio and vfio-mdev
> > > > facilities. So the user application can send request and DMA to the
> > > > hardware without interaction with the kernel. This remove the latency
> > > > of syscall and context switch.
> > > >
> > > > The patchset contains documents for the detail. Please refer to it for
> > more
> > > > information.
> > > >
> > > > This patchset is intended to be used with Jean Philippe Brucker's SVA
> > > > patch [1] (Which is also in RFC stage). But it is not mandatory. This
> > > > patchset is tested in the latest mainline kernel without the SVA 
> > > > patches.
> > > > So it support only one process for each accelerator.
> > >
> > > If no sharing, then why not just assigning the whole parent device to
> > > the process? IMO if SVA usage is the clear goal of your series, it
> > > might be made clearly so then Jean's series is mandatory dependency...
> > >
> > 
> > We don't know how SVA will be finally. But the feature, "make use of
> > per-PASID/substream ID IOMMU page table", should be able to be enabled
> > in the
> > kernel. So we don't want to enforce it here. After we have this serial 
> > ready,
> > it
> > can be hooked to any implementation.
> 
> "any" or "only queue-based" implementation? some devices may not
> have queue concept, e.g. GPU.

Here, "any implementation" refer to the "per PASID page table" implementatin:)

> 
> > 
> > Further more, even without "per-PASID IOMMU page table", this series has
> > its
> > value. It is not simply dedicate the whole device to the process. It 
> > "shares"
> > the device with the kernel driver. So you can support crypto and a user
> > application at the same time.
> 
> OK.
> 
> Thanks
> Kevin

-- 
-Kenneth(Hisilicon)




RE: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-01 Thread Tian, Kevin
> From: Kenneth Lee
> Sent: Thursday, August 2, 2018 11:40 AM
> 
> On Thu, Aug 02, 2018 at 02:59:33AM +, Tian, Kevin wrote:
> > > From: Kenneth Lee
> > > Sent: Wednesday, August 1, 2018 6:22 PM
> > >
> > > From: Kenneth Lee 
> > >
> > > WarpDrive is an accelerator framework to expose the hardware
> capabilities
> > > directly to the user space. It makes use of the exist vfio and vfio-mdev
> > > facilities. So the user application can send request and DMA to the
> > > hardware without interaction with the kernel. This remove the latency
> > > of syscall and context switch.
> > >
> > > The patchset contains documents for the detail. Please refer to it for
> more
> > > information.
> > >
> > > This patchset is intended to be used with Jean Philippe Brucker's SVA
> > > patch [1] (Which is also in RFC stage). But it is not mandatory. This
> > > patchset is tested in the latest mainline kernel without the SVA patches.
> > > So it support only one process for each accelerator.
> >
> > If no sharing, then why not just assigning the whole parent device to
> > the process? IMO if SVA usage is the clear goal of your series, it
> > might be made clearly so then Jean's series is mandatory dependency...
> >
> 
> We don't know how SVA will be finally. But the feature, "make use of
> per-PASID/substream ID IOMMU page table", should be able to be enabled
> in the
> kernel. So we don't want to enforce it here. After we have this serial ready,
> it
> can be hooked to any implementation.

"any" or "only queue-based" implementation? some devices may not
have queue concept, e.g. GPU.

> 
> Further more, even without "per-PASID IOMMU page table", this series has
> its
> value. It is not simply dedicate the whole device to the process. It "shares"
> the device with the kernel driver. So you can support crypto and a user
> application at the same time.

OK.

Thanks
Kevin


Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-01 Thread Kenneth Lee
On Thu, Aug 02, 2018 at 02:33:12AM +, Tian, Kevin wrote:
> 
> > From: Jerome Glisse
> > Sent: Thursday, August 2, 2018 12:57 AM
> > 
> > On Wed, Aug 01, 2018 at 06:22:14PM +0800, Kenneth Lee wrote:
> > > From: Kenneth Lee 
> > >
> > > WarpDrive is an accelerator framework to expose the hardware
> > capabilities
> > > directly to the user space. It makes use of the exist vfio and vfio-mdev
> > > facilities. So the user application can send request and DMA to the
> > > hardware without interaction with the kernel. This remove the latency
> > > of syscall and context switch.
> > >
> > > The patchset contains documents for the detail. Please refer to it for
> > more
> > > information.
> > >
> > > This patchset is intended to be used with Jean Philippe Brucker's SVA
> > > patch [1] (Which is also in RFC stage). But it is not mandatory. This
> > > patchset is tested in the latest mainline kernel without the SVA patches.
> > > So it support only one process for each accelerator.
> > >
> > > With SVA support, WarpDrive can support multi-process in the same
> > > accelerator device.  We tested it in our SoC integrated Accelerator (board
> > > ID: D06, Chip ID: HIP08). A reference work tree can be found here: [2].
> > 
> > I have not fully inspected things nor do i know enough about
> > this Hisilicon ZIP accelerator to ascertain, but from glimpsing
> > at the code it seems that it is unsafe to use even with SVA due
> > to the doorbell. There is a comment talking about safetyness
> > in patch 7.
> > 
> > Exposing thing to userspace is always enticing, but if it is
> > a security risk then it should clearly say so and maybe a
> > kernel boot flag should be necessary to allow such device to
> > be use.
> > 

But the doorbell is just a notification. Except for DoS (making the hardware
busy) it cannot actually take or change anything from the kernel space. And the
DoS problem can always be treated as the problem of a group of processes sharing
the same kernel entity.

In the coming HIP09 hardware, the doorbell will come with a random number, so
only the process that allocated the queue can knock it correctly.
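As a purely conceptual sketch of that idea (not the actual HIP09 doorbell
layout), the random token could simply be packed next to the producer index in
the doorbell write, so a ring with the wrong token is ignored by the hardware:

#include <stdint.h>

struct user_queue {
        volatile uint64_t *doorbell;    /* mmap'ed doorbell register          */
        uint32_t token;                 /* random value given at queue alloc  */
        uint32_t tail;                  /* software producer index            */
};

static void knock(struct user_queue *q)
{
        /* Hardware compares the high half against the queue's own token. */
        *q->doorbell = ((uint64_t)q->token << 32) | q->tail;
}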
> > 
> > My more general question is do we want to grow VFIO to become
> > a more generic device driver API. This patchset adds a command
> > queue concept to it (i don't think it exist today but i have
> > not follow VFIO closely).
> > 

The thing is, VFIO is the only place to support DMA from user land. If we don't
put it here, we have to create another similar facility to support the same.

> > Why is that any better that existing driver model ? Where a
> > device create a device file (can be character device, block
> > device, ...). Such models also allow for direct hardware
> > access from userspace. For instance see the AMD KFD driver
> > inside drivers/gpu/drm/amd
> 
> One motivation I guess, is that most accelerators lack of a 
> well-abstracted high level APIs similar to GPU side (e.g. OpenCL 
> clearly defines Shared Virtual Memory models). VFIO mdev
> might be an alternative common interface to enable SVA usages 
> on various accelerators...
> 
Yes.
> > 
> > So you can already do what you are doing with the Hisilicon
> > driver today without this new infrastructure. This only need
> > hardware that have command queue and doorbell like mechanisms.
> > 
> > 
> > Unlike mdev which unify a very high level concept, it seems
> > to me spimdev just introduce low level concept (namely command
> > queue) and i don't see the intrinsic value here.
> > 
As I have said, VFIO is the only place supporting DMA from user space now.
> > 
> > Cheers,
> > Jérôme

Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-01 Thread Kenneth Lee
On Thu, Aug 02, 2018 at 02:59:33AM +, Tian, Kevin wrote:
> 
> > From: Kenneth Lee
> > Sent: Wednesday, August 1, 2018 6:22 PM
> > 
> > From: Kenneth Lee 
> > 
> > WarpDrive is an accelerator framework to expose the hardware capabilities
> > directly to the user space. It makes use of the exist vfio and vfio-mdev
> > facilities. So the user application can send request and DMA to the
> > hardware without interaction with the kernel. This remove the latency
> > of syscall and context switch.
> > 
> > The patchset contains documents for the detail. Please refer to it for more
> > information.
> > 
> > This patchset is intended to be used with Jean Philippe Brucker's SVA
> > patch [1] (Which is also in RFC stage). But it is not mandatory. This
> > patchset is tested in the latest mainline kernel without the SVA patches.
> > So it support only one process for each accelerator.
> 
> If no sharing, then why not just assigning the whole parent device to
> the process? IMO if SVA usage is the clear goal of your series, it
> might be made clearly so then Jean's series is mandatory dependency...
> 

We don't know how SVA will look in the end. But the feature, "make use of
a per-PASID/substream ID IOMMU page table", should be able to be enabled in the
kernel. So we don't want to enforce it here. After we have this series ready, it
can be hooked to any implementation.

Furthermore, even without the "per-PASID IOMMU page table", this series has its
value. It does not simply dedicate the whole device to the process. It "shares"
the device with the kernel driver, so you can support crypto and a user
application at the same time.

> > 
> > With SVA support, WarpDrive can support multi-process in the same
> > accelerator device.  We tested it in our SoC integrated Accelerator (board
> > ID: D06, Chip ID: HIP08). A reference work tree can be found here: [2].
> > 
> > We have noticed the IOMMU aware mdev RFC announced recently [3].
> > 
> > The IOMMU aware mdev has similar idea but different intention comparing
> > to
> > WarpDrive. It intends to dedicate part of the hardware resource to a VM.
> 
> Not just to VM, though I/O Virtualization is in the name. You can assign
> such mdev to either VMs, containers, or bare metal processes. It's just
> a fully-isolated device from user space p.o.v.

Oh, yes. Thank you for clarification.

> 
> > And the design is supposed to be used with Scalable I/O Virtualization.
> > While spimdev is intended to share the hardware resource with a big
> > amount
> > of processes.  It just requires the hardware supporting address
> > translation per process (PCIE's PASID or ARM SMMU's substream ID).
> > 
> > But we don't see serious confliction on both design. We believe they can be
> > normalized as one.
> 
> yes there are something which can be shared, e.g. regarding to
> the interface to IOMMU.
> 
> Conceptually I see them different mindset on device resource sharing:
> 
> WarpDrive more aims to provide a generic framework to enable SVA
> usages on various accelerators, which lack of a well-abstracted user
> API like OpenCL. SVA is a hardware capability - sort of exposing resources
> composing ONE capability to user space through mdev framework. It is
> not like a VF which naturally carries most capabilities as PF.
> 

Yes. But we believe the user abstraction layer will be enabled soon once the
channel is opened. WarpDrive gives the hardware the chance to serve the
application directly. For example, an AI engine can be called by many processes
for inference. The resource need not be dedicated to one particular process.

> Intel Scalable I/O virtualization is a thorough design to partition the
> device into minimal sharable copies (queue, queue pair, context), 
> while each copy carries most PF capabilities (including SVA) similar to
> VF. Also with IOMMU scalable mode support, the copy can be 
> independently assigned to any client (process, container, VM, etc.)
> 
Yes, we can see this intention.
> Thanks
> Kevin

Thank you.

-- 
-Kenneth(Hisilicon)


RE: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-01 Thread Tian, Kevin
> From: Kenneth Lee
> Sent: Wednesday, August 1, 2018 6:22 PM
> 
> From: Kenneth Lee 
> 
> WarpDrive is an accelerator framework to expose the hardware capabilities
> directly to the user space. It makes use of the exist vfio and vfio-mdev
> facilities. So the user application can send request and DMA to the
> hardware without interaction with the kernel. This remove the latency
> of syscall and context switch.
> 
> The patchset contains documents for the detail. Please refer to it for more
> information.
> 
> This patchset is intended to be used with Jean Philippe Brucker's SVA
> patch [1] (Which is also in RFC stage). But it is not mandatory. This
> patchset is tested in the latest mainline kernel without the SVA patches.
> So it support only one process for each accelerator.

If there is no sharing, then why not just assign the whole parent device to
the process? IMO if SVA usage is the clear goal of your series, it
might be stated clearly, and then Jean's series is a mandatory dependency...

> 
> With SVA support, WarpDrive can support multi-process in the same
> accelerator device.  We tested it in our SoC integrated Accelerator (board
> ID: D06, Chip ID: HIP08). A reference work tree can be found here: [2].
> 
> We have noticed the IOMMU aware mdev RFC announced recently [3].
> 
> The IOMMU aware mdev has similar idea but different intention comparing
> to
> WarpDrive. It intends to dedicate part of the hardware resource to a VM.

Not just to VM, though I/O Virtualization is in the name. You can assign
such mdev to either VMs, containers, or bare metal processes. It's just
a fully-isolated device from user space p.o.v.

> And the design is supposed to be used with Scalable I/O Virtualization.
> While spimdev is intended to share the hardware resource with a big
> amount
> of processes.  It just requires the hardware supporting address
> translation per process (PCIE's PASID or ARM SMMU's substream ID).
> 
> But we don't see serious confliction on both design. We believe they can be
> normalized as one.

Yes, there are some things which can be shared, e.g. regarding
the interface to the IOMMU.

Conceptually I see them as different mindsets on device resource sharing:

WarpDrive aims more to provide a generic framework to enable SVA
usage on various accelerators, which lack a well-abstracted user
API like OpenCL. SVA is a hardware capability - sort of exposing the resources
composing ONE capability to user space through the mdev framework. It is
not like a VF, which naturally carries most capabilities of the PF.

Intel Scalable I/O virtualization is a thorough design to partition the
device into minimal sharable copies (queue, queue pair, context), 
while each copy carries most PF capabilities (including SVA) similar to
VF. Also with IOMMU scalable mode support, the copy can be 
independently assigned to any client (process, container, VM, etc.)

Thanks
Kevin


RE: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-01 Thread Tian, Kevin
> From: Jerome Glisse
> Sent: Thursday, August 2, 2018 12:57 AM
> 
> On Wed, Aug 01, 2018 at 06:22:14PM +0800, Kenneth Lee wrote:
> > From: Kenneth Lee 
> >
> > WarpDrive is an accelerator framework to expose the hardware
> capabilities
> > directly to the user space. It makes use of the exist vfio and vfio-mdev
> > facilities. So the user application can send request and DMA to the
> > hardware without interaction with the kernel. This remove the latency
> > of syscall and context switch.
> >
> > The patchset contains documents for the detail. Please refer to it for
> more
> > information.
> >
> > This patchset is intended to be used with Jean Philippe Brucker's SVA
> > patch [1] (Which is also in RFC stage). But it is not mandatory. This
> > patchset is tested in the latest mainline kernel without the SVA patches.
> > So it support only one process for each accelerator.
> >
> > With SVA support, WarpDrive can support multi-process in the same
> > accelerator device.  We tested it in our SoC integrated Accelerator (board
> > ID: D06, Chip ID: HIP08). A reference work tree can be found here: [2].
> 
> I have not fully inspected things nor do i know enough about
> this Hisilicon ZIP accelerator to ascertain, but from glimpsing
> at the code it seems that it is unsafe to use even with SVA due
> to the doorbell. There is a comment talking about safetyness
> in patch 7.
> 
> Exposing thing to userspace is always enticing, but if it is
> a security risk then it should clearly say so and maybe a
> kernel boot flag should be necessary to allow such device to
> be use.
> 
> 
> My more general question is do we want to grow VFIO to become
> a more generic device driver API. This patchset adds a command
> queue concept to it (i don't think it exist today but i have
> not follow VFIO closely).
> 
> Why is that any better that existing driver model ? Where a
> device create a device file (can be character device, block
> device, ...). Such models also allow for direct hardware
> access from userspace. For instance see the AMD KFD driver
> inside drivers/gpu/drm/amd

One motivation, I guess, is that most accelerators lack
well-abstracted high level APIs similar to the GPU side (e.g. OpenCL
clearly defines Shared Virtual Memory models). VFIO mdev
might be an alternative common interface to enable SVA usage
on various accelerators...

> 
> So you can already do what you are doing with the Hisilicon
> driver today without this new infrastructure. This only need
> hardware that have command queue and doorbell like mechanisms.
> 
> 
> Unlike mdev which unify a very high level concept, it seems
> to me spimdev just introduce low level concept (namely command
> queue) and i don't see the intrinsic value here.
> 
> 
> Cheers,
> Jérôme


Re: [RFC PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-08-01 Thread Jerome Glisse
On Wed, Aug 01, 2018 at 06:22:14PM +0800, Kenneth Lee wrote:
> From: Kenneth Lee 
> 
> WarpDrive is an accelerator framework to expose the hardware capabilities
> directly to the user space. It makes use of the exist vfio and vfio-mdev
> facilities. So the user application can send request and DMA to the
> hardware without interaction with the kernel. This remove the latency
> of syscall and context switch.
> 
> The patchset contains documents for the detail. Please refer to it for more
> information.
> 
> This patchset is intended to be used with Jean Philippe Brucker's SVA
> patch [1] (Which is also in RFC stage). But it is not mandatory. This
> patchset is tested in the latest mainline kernel without the SVA patches.
> So it support only one process for each accelerator.
> 
> With SVA support, WarpDrive can support multi-process in the same
> accelerator device.  We tested it in our SoC integrated Accelerator (board
> ID: D06, Chip ID: HIP08). A reference work tree can be found here: [2].

I have not fully inspected things, nor do i know enough about
this Hisilicon ZIP accelerator to ascertain it, but from glimpsing
at the code it seems that it is unsafe to use even with SVA due
to the doorbell. There is a comment talking about safety
in patch 7.

Exposing things to userspace is always enticing, but if it is
a security risk then it should clearly say so, and maybe a
kernel boot flag should be necessary to allow such a device to
be used.


My more general question is: do we want to grow VFIO to become
a more generic device driver API? This patchset adds a command
queue concept to it (i don't think it exists today but i have
not followed VFIO closely).

Why is that any better than the existing driver model, where a
device creates a device file (can be a character device, block
device, ...)? Such models also allow for direct hardware
access from userspace. For instance see the AMD KFD driver
inside drivers/gpu/drm/amd.
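For illustration, a minimal sketch of that existing model: a made-up misc
character device whose mmap() hands a page of "queue" memory straight to
userspace (no real hardware behind it, page reservation and size checks
omitted):

#include <linux/module.h>
#include <linux/miscdevice.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/io.h>
#include <linux/gfp.h>

static unsigned long queue_page;        /* stands in for device queue memory */

static int demo_mmap(struct file *filp, struct vm_area_struct *vma)
{
        /* Map the page straight into the calling process. */
        return remap_pfn_range(vma, vma->vm_start,
                               virt_to_phys((void *)queue_page) >> PAGE_SHIFT,
                               PAGE_SIZE, vma->vm_page_prot);
}

static const struct file_operations demo_fops = {
        .owner = THIS_MODULE,
        .mmap  = demo_mmap,
};

static struct miscdevice demo_dev = {
        .minor = MISC_DYNAMIC_MINOR,
        .name  = "demo_accel",          /* shows up as /dev/demo_accel */
        .fops  = &demo_fops,
};

static int __init demo_init(void)
{
        queue_page = get_zeroed_page(GFP_KERNEL);
        if (!queue_page)
                return -ENOMEM;
        return misc_register(&demo_dev);
}

static void __exit demo_exit(void)
{
        misc_deregister(&demo_dev);
        free_page(queue_page);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");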

So you can already do what you are doing with the Hisilicon
driver today without this new infrastructure. This only needs
hardware that has command queue and doorbell like mechanisms.


Unlike mdev, which unifies a very high level concept, it seems
to me spimdev just introduces a low level concept (namely the command
queue) and i don't see the intrinsic value here.


Cheers,
Jérôme