Re: [PATCH kernel v9 29/32] vfio: powerpc/spapr: Register memory and define IOMMU v2

2015-05-05 Thread David Gibson
On Fri, May 01, 2015 at 04:27:47PM +1000, Alexey Kardashevskiy wrote:
> On 05/01/2015 03:23 PM, David Gibson wrote:
> >On Fri, May 01, 2015 at 02:35:23PM +1000, Alexey Kardashevskiy wrote:
> >>On 04/30/2015 04:55 PM, David Gibson wrote:
> >>>On Sat, Apr 25, 2015 at 10:14:53PM +1000, Alexey Kardashevskiy wrote:
> The existing implementation accounts the whole DMA window in
> the locked_vm counter. This is going to be worse with multiple
> containers and huge DMA windows. Also, real-time accounting would requite
> additional tracking of accounted pages due to the page size difference -
> IOMMU uses 4K pages and system uses 4K or 64K pages.
> 
> Another issue is that actual pages pinning/unpinning happens on every
> DMA map/unmap request. This does not affect the performance much now as
> we spend way too much time now on switching context between
> guest/userspace/host but this will start to matter when we add in-kernel
> DMA map/unmap acceleration.
> 
> This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
> New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
> 2 new ioctls to register/unregister DMA memory -
> VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
> which receive user space address and size of a memory region which
> needs to be pinned/unpinned and counted in locked_vm.
> New IOMMU splits physical pages pinning and TCE table update into 2 
> different
> operations. It requires 1) guest pages to be registered first 2) 
> consequent
> map/unmap requests to work only with pre-registered memory.
> For the default single window case this means that the entire guest
> (instead of 2GB) needs to be pinned before using VFIO.
> When a huge DMA window is added, no additional pinning will be
> required, otherwise it would be guest RAM + 2GB.
> 
> The new memory registration ioctls are not supported by
> VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
> will require memory to be preregistered in order to work.
> 
> The accounting is done per the user process.
> 
> This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
> can do with v1 or v2 IOMMUs.
> 
> Signed-off-by: Alexey Kardashevskiy 
> [aw: for the vfio related changes]
> Acked-by: Alex Williamson 
> ---
> Changes:
> v9:
> * s/tce_get_hva_cached/tce_iommu_use_page_v2/
> 
> v7:
> * now memory is registered per mm (i.e. process)
> * moved memory registration code to powerpc/mmu
> * merged "vfio: powerpc/spapr: Define v2 IOMMU" into this
> * limited new ioctls to v2 IOMMU
> * updated doc
> * unsupported ioclts return -ENOTTY instead of -EPERM
> 
> v6:
> * tce_get_hva_cached() returns hva via a pointer
> 
> v4:
> * updated docs
> * s/kzmalloc/vzalloc/
> * in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
> replaced offset with index
> * renamed vfio_iommu_type_register_memory to 
> vfio_iommu_spapr_register_memory
> and removed duplicating vfio_iommu_spapr_register_memory
> ---
>   Documentation/vfio.txt  |  23 
>   drivers/vfio/vfio_iommu_spapr_tce.c | 230 
>  +++-
>   include/uapi/linux/vfio.h   |  27 +
>   3 files changed, 274 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> index 96978ec..94328c8 100644
> --- a/Documentation/vfio.txt
> +++ b/Documentation/vfio.txt
> @@ -427,6 +427,29 @@ The code flow from the example above should be 
> slightly changed:
> 
>   
> 
> +5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
> +VFIO_IOMMU_DISABLE and implements 2 new ioctls:
> +VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
> +(which are unsupported in v1 IOMMU).
> >>>
> >>>A summary of the semantic differeces between v1 and v2 would be nice.
> >>>At this point it's not really clear to me if there's a case for
> >>>creating v2, or if this could just be done by adding (optional)
> >>>functionality to v1.
> >>
> >>v1: memory preregistration is not supported; explicit enable/disable ioctls
> >>are required
> >>
> >>v2: memory preregistration is required; explicit enable/disable are
> >>prohibited (as they are not needed).
> >>
> >>Mixing these in one IOMMU type caused a lot of problems like should I
> >>increment locked_vm by the 32bit window size on enable() or not; what do I
> >>do about pages pinning when map/map (check if it is from registered memory
> >>and do not pin?).
> >>
> >>Having 2 IOMMU models makes everything a lot simpler.
> >
> >Ok.  Would it simplify it further if you made v2 only usable on IODA2
> >hardware?
> 
> 
> Very little. V2 addresses memory 

Re: [PATCH kernel v9 29/32] vfio: powerpc/spapr: Register memory and define IOMMU v2

2015-05-05 Thread David Gibson
On Fri, May 01, 2015 at 04:27:47PM +1000, Alexey Kardashevskiy wrote:
 On 05/01/2015 03:23 PM, David Gibson wrote:
 On Fri, May 01, 2015 at 02:35:23PM +1000, Alexey Kardashevskiy wrote:
 On 04/30/2015 04:55 PM, David Gibson wrote:
 On Sat, Apr 25, 2015 at 10:14:53PM +1000, Alexey Kardashevskiy wrote:
 The existing implementation accounts the whole DMA window in
 the locked_vm counter. This is going to be worse with multiple
 containers and huge DMA windows. Also, real-time accounting would requite
 additional tracking of accounted pages due to the page size difference -
 IOMMU uses 4K pages and system uses 4K or 64K pages.
 
 Another issue is that actual pages pinning/unpinning happens on every
 DMA map/unmap request. This does not affect the performance much now as
 we spend way too much time now on switching context between
 guest/userspace/host but this will start to matter when we add in-kernel
 DMA map/unmap acceleration.
 
 This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
 New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
 2 new ioctls to register/unregister DMA memory -
 VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
 which receive user space address and size of a memory region which
 needs to be pinned/unpinned and counted in locked_vm.
 New IOMMU splits physical pages pinning and TCE table update into 2 
 different
 operations. It requires 1) guest pages to be registered first 2) 
 consequent
 map/unmap requests to work only with pre-registered memory.
 For the default single window case this means that the entire guest
 (instead of 2GB) needs to be pinned before using VFIO.
 When a huge DMA window is added, no additional pinning will be
 required, otherwise it would be guest RAM + 2GB.
 
 The new memory registration ioctls are not supported by
 VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
 will require memory to be preregistered in order to work.
 
 The accounting is done per the user process.
 
 This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
 can do with v1 or v2 IOMMUs.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 [aw: for the vfio related changes]
 Acked-by: Alex Williamson alex.william...@redhat.com
 ---
 Changes:
 v9:
 * s/tce_get_hva_cached/tce_iommu_use_page_v2/
 
 v7:
 * now memory is registered per mm (i.e. process)
 * moved memory registration code to powerpc/mmu
 * merged vfio: powerpc/spapr: Define v2 IOMMU into this
 * limited new ioctls to v2 IOMMU
 * updated doc
 * unsupported ioclts return -ENOTTY instead of -EPERM
 
 v6:
 * tce_get_hva_cached() returns hva via a pointer
 
 v4:
 * updated docs
 * s/kzmalloc/vzalloc/
 * in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
 replaced offset with index
 * renamed vfio_iommu_type_register_memory to 
 vfio_iommu_spapr_register_memory
 and removed duplicating vfio_iommu_spapr_register_memory
 ---
   Documentation/vfio.txt  |  23 
   drivers/vfio/vfio_iommu_spapr_tce.c | 230 
  +++-
   include/uapi/linux/vfio.h   |  27 +
   3 files changed, 274 insertions(+), 6 deletions(-)
 
 diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
 index 96978ec..94328c8 100644
 --- a/Documentation/vfio.txt
 +++ b/Documentation/vfio.txt
 @@ -427,6 +427,29 @@ The code flow from the example above should be 
 slightly changed:
 
   
 
 +5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
 +VFIO_IOMMU_DISABLE and implements 2 new ioctls:
 +VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
 +(which are unsupported in v1 IOMMU).
 
 A summary of the semantic differeces between v1 and v2 would be nice.
 At this point it's not really clear to me if there's a case for
 creating v2, or if this could just be done by adding (optional)
 functionality to v1.
 
 v1: memory preregistration is not supported; explicit enable/disable ioctls
 are required
 
 v2: memory preregistration is required; explicit enable/disable are
 prohibited (as they are not needed).
 
 Mixing these in one IOMMU type caused a lot of problems like should I
 increment locked_vm by the 32bit window size on enable() or not; what do I
 do about pages pinning when map/map (check if it is from registered memory
 and do not pin?).
 
 Having 2 IOMMU models makes everything a lot simpler.
 
 Ok.  Would it simplify it further if you made v2 only usable on IODA2
 hardware?
 
 
 Very little. V2 addresses memory pinning issue which is handled the same way
 on ioda2 and older hardware, including KVM acceleration. Whether enable DDW
 or not - this is handled just fine via extra properties in the GET_INFO
 ioctl().
 
 IODA2 and others are different in handling multiple groups per container but
 this does not require changes to userspace API.
 
 And remember, the only machine I can use 100% of time is POWER7/P5IOC2 so it
 is really useful if at least some bits of the 

Re: [PATCH kernel v9 29/32] vfio: powerpc/spapr: Register memory and define IOMMU v2

2015-05-01 Thread Alexey Kardashevskiy

On 05/01/2015 03:23 PM, David Gibson wrote:

On Fri, May 01, 2015 at 02:35:23PM +1000, Alexey Kardashevskiy wrote:

On 04/30/2015 04:55 PM, David Gibson wrote:

On Sat, Apr 25, 2015 at 10:14:53PM +1000, Alexey Kardashevskiy wrote:

The existing implementation accounts the whole DMA window in
the locked_vm counter. This is going to be worse with multiple
containers and huge DMA windows. Also, real-time accounting would requite
additional tracking of accounted pages due to the page size difference -
IOMMU uses 4K pages and system uses 4K or 64K pages.

Another issue is that actual pages pinning/unpinning happens on every
DMA map/unmap request. This does not affect the performance much now as
we spend way too much time now on switching context between
guest/userspace/host but this will start to matter when we add in-kernel
DMA map/unmap acceleration.

This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
2 new ioctls to register/unregister DMA memory -
VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
which receive user space address and size of a memory region which
needs to be pinned/unpinned and counted in locked_vm.
New IOMMU splits physical pages pinning and TCE table update into 2 different
operations. It requires 1) guest pages to be registered first 2) consequent
map/unmap requests to work only with pre-registered memory.
For the default single window case this means that the entire guest
(instead of 2GB) needs to be pinned before using VFIO.
When a huge DMA window is added, no additional pinning will be
required, otherwise it would be guest RAM + 2GB.

The new memory registration ioctls are not supported by
VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
will require memory to be preregistered in order to work.

The accounting is done per the user process.

This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
can do with v1 or v2 IOMMUs.

Signed-off-by: Alexey Kardashevskiy 
[aw: for the vfio related changes]
Acked-by: Alex Williamson 
---
Changes:
v9:
* s/tce_get_hva_cached/tce_iommu_use_page_v2/

v7:
* now memory is registered per mm (i.e. process)
* moved memory registration code to powerpc/mmu
* merged "vfio: powerpc/spapr: Define v2 IOMMU" into this
* limited new ioctls to v2 IOMMU
* updated doc
* unsupported ioclts return -ENOTTY instead of -EPERM

v6:
* tce_get_hva_cached() returns hva via a pointer

v4:
* updated docs
* s/kzmalloc/vzalloc/
* in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
replaced offset with index
* renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory
and removed duplicating vfio_iommu_spapr_register_memory
---
  Documentation/vfio.txt  |  23 
  drivers/vfio/vfio_iommu_spapr_tce.c | 230 +++-
  include/uapi/linux/vfio.h   |  27 +
  3 files changed, 274 insertions(+), 6 deletions(-)

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index 96978ec..94328c8 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -427,6 +427,29 @@ The code flow from the example above should be slightly 
changed:



+5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
+VFIO_IOMMU_DISABLE and implements 2 new ioctls:
+VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
+(which are unsupported in v1 IOMMU).


A summary of the semantic differeces between v1 and v2 would be nice.
At this point it's not really clear to me if there's a case for
creating v2, or if this could just be done by adding (optional)
functionality to v1.


v1: memory preregistration is not supported; explicit enable/disable ioctls
are required

v2: memory preregistration is required; explicit enable/disable are
prohibited (as they are not needed).

Mixing these in one IOMMU type caused a lot of problems like should I
increment locked_vm by the 32bit window size on enable() or not; what do I
do about pages pinning when map/map (check if it is from registered memory
and do not pin?).

Having 2 IOMMU models makes everything a lot simpler.


Ok.  Would it simplify it further if you made v2 only usable on IODA2
hardware?



Very little. V2 addresses memory pinning issue which is handled the same 
way on ioda2 and older hardware, including KVM acceleration. Whether enable 
DDW or not - this is handled just fine via extra properties in the GET_INFO 
ioctl().


IODA2 and others are different in handling multiple groups per container 
but this does not require changes to userspace API.


And remember, the only machine I can use 100% of time is POWER7/P5IOC2 so 
it is really useful if at least some bits of the patchset can be tested 
there; if it was a bit less different from IODA2, I would have even 
implemented DDW there too :)




+PPC64 paravirtualized guests generate a lot of map/unmap requests,
+and the handling of those 

Re: [PATCH kernel v9 29/32] vfio: powerpc/spapr: Register memory and define IOMMU v2

2015-05-01 Thread Alexey Kardashevskiy

On 05/01/2015 03:23 PM, David Gibson wrote:

On Fri, May 01, 2015 at 02:35:23PM +1000, Alexey Kardashevskiy wrote:

On 04/30/2015 04:55 PM, David Gibson wrote:

On Sat, Apr 25, 2015 at 10:14:53PM +1000, Alexey Kardashevskiy wrote:

The existing implementation accounts the whole DMA window in
the locked_vm counter. This is going to be worse with multiple
containers and huge DMA windows. Also, real-time accounting would requite
additional tracking of accounted pages due to the page size difference -
IOMMU uses 4K pages and system uses 4K or 64K pages.

Another issue is that actual pages pinning/unpinning happens on every
DMA map/unmap request. This does not affect the performance much now as
we spend way too much time now on switching context between
guest/userspace/host but this will start to matter when we add in-kernel
DMA map/unmap acceleration.

This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
2 new ioctls to register/unregister DMA memory -
VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
which receive user space address and size of a memory region which
needs to be pinned/unpinned and counted in locked_vm.
New IOMMU splits physical pages pinning and TCE table update into 2 different
operations. It requires 1) guest pages to be registered first 2) consequent
map/unmap requests to work only with pre-registered memory.
For the default single window case this means that the entire guest
(instead of 2GB) needs to be pinned before using VFIO.
When a huge DMA window is added, no additional pinning will be
required, otherwise it would be guest RAM + 2GB.

The new memory registration ioctls are not supported by
VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
will require memory to be preregistered in order to work.

The accounting is done per the user process.

This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
can do with v1 or v2 IOMMUs.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
[aw: for the vfio related changes]
Acked-by: Alex Williamson alex.william...@redhat.com
---
Changes:
v9:
* s/tce_get_hva_cached/tce_iommu_use_page_v2/

v7:
* now memory is registered per mm (i.e. process)
* moved memory registration code to powerpc/mmu
* merged vfio: powerpc/spapr: Define v2 IOMMU into this
* limited new ioctls to v2 IOMMU
* updated doc
* unsupported ioclts return -ENOTTY instead of -EPERM

v6:
* tce_get_hva_cached() returns hva via a pointer

v4:
* updated docs
* s/kzmalloc/vzalloc/
* in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
replaced offset with index
* renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory
and removed duplicating vfio_iommu_spapr_register_memory
---
  Documentation/vfio.txt  |  23 
  drivers/vfio/vfio_iommu_spapr_tce.c | 230 +++-
  include/uapi/linux/vfio.h   |  27 +
  3 files changed, 274 insertions(+), 6 deletions(-)

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index 96978ec..94328c8 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -427,6 +427,29 @@ The code flow from the example above should be slightly 
changed:



+5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
+VFIO_IOMMU_DISABLE and implements 2 new ioctls:
+VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
+(which are unsupported in v1 IOMMU).


A summary of the semantic differeces between v1 and v2 would be nice.
At this point it's not really clear to me if there's a case for
creating v2, or if this could just be done by adding (optional)
functionality to v1.


v1: memory preregistration is not supported; explicit enable/disable ioctls
are required

v2: memory preregistration is required; explicit enable/disable are
prohibited (as they are not needed).

Mixing these in one IOMMU type caused a lot of problems like should I
increment locked_vm by the 32bit window size on enable() or not; what do I
do about pages pinning when map/map (check if it is from registered memory
and do not pin?).

Having 2 IOMMU models makes everything a lot simpler.


Ok.  Would it simplify it further if you made v2 only usable on IODA2
hardware?



Very little. V2 addresses memory pinning issue which is handled the same 
way on ioda2 and older hardware, including KVM acceleration. Whether enable 
DDW or not - this is handled just fine via extra properties in the GET_INFO 
ioctl().


IODA2 and others are different in handling multiple groups per container 
but this does not require changes to userspace API.


And remember, the only machine I can use 100% of time is POWER7/P5IOC2 so 
it is really useful if at least some bits of the patchset can be tested 
there; if it was a bit less different from IODA2, I would have even 
implemented DDW there too :)




+PPC64 paravirtualized guests generate a lot of 

Re: [PATCH kernel v9 29/32] vfio: powerpc/spapr: Register memory and define IOMMU v2

2015-04-30 Thread David Gibson
On Fri, May 01, 2015 at 02:35:23PM +1000, Alexey Kardashevskiy wrote:
> On 04/30/2015 04:55 PM, David Gibson wrote:
> >On Sat, Apr 25, 2015 at 10:14:53PM +1000, Alexey Kardashevskiy wrote:
> >>The existing implementation accounts the whole DMA window in
> >>the locked_vm counter. This is going to be worse with multiple
> >>containers and huge DMA windows. Also, real-time accounting would requite
> >>additional tracking of accounted pages due to the page size difference -
> >>IOMMU uses 4K pages and system uses 4K or 64K pages.
> >>
> >>Another issue is that actual pages pinning/unpinning happens on every
> >>DMA map/unmap request. This does not affect the performance much now as
> >>we spend way too much time now on switching context between
> >>guest/userspace/host but this will start to matter when we add in-kernel
> >>DMA map/unmap acceleration.
> >>
> >>This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
> >>New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
> >>2 new ioctls to register/unregister DMA memory -
> >>VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
> >>which receive user space address and size of a memory region which
> >>needs to be pinned/unpinned and counted in locked_vm.
> >>New IOMMU splits physical pages pinning and TCE table update into 2 
> >>different
> >>operations. It requires 1) guest pages to be registered first 2) consequent
> >>map/unmap requests to work only with pre-registered memory.
> >>For the default single window case this means that the entire guest
> >>(instead of 2GB) needs to be pinned before using VFIO.
> >>When a huge DMA window is added, no additional pinning will be
> >>required, otherwise it would be guest RAM + 2GB.
> >>
> >>The new memory registration ioctls are not supported by
> >>VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
> >>will require memory to be preregistered in order to work.
> >>
> >>The accounting is done per the user process.
> >>
> >>This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
> >>can do with v1 or v2 IOMMUs.
> >>
> >>Signed-off-by: Alexey Kardashevskiy 
> >>[aw: for the vfio related changes]
> >>Acked-by: Alex Williamson 
> >>---
> >>Changes:
> >>v9:
> >>* s/tce_get_hva_cached/tce_iommu_use_page_v2/
> >>
> >>v7:
> >>* now memory is registered per mm (i.e. process)
> >>* moved memory registration code to powerpc/mmu
> >>* merged "vfio: powerpc/spapr: Define v2 IOMMU" into this
> >>* limited new ioctls to v2 IOMMU
> >>* updated doc
> >>* unsupported ioclts return -ENOTTY instead of -EPERM
> >>
> >>v6:
> >>* tce_get_hva_cached() returns hva via a pointer
> >>
> >>v4:
> >>* updated docs
> >>* s/kzmalloc/vzalloc/
> >>* in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
> >>replaced offset with index
> >>* renamed vfio_iommu_type_register_memory to 
> >>vfio_iommu_spapr_register_memory
> >>and removed duplicating vfio_iommu_spapr_register_memory
> >>---
> >>  Documentation/vfio.txt  |  23 
> >>  drivers/vfio/vfio_iommu_spapr_tce.c | 230 
> >> +++-
> >>  include/uapi/linux/vfio.h   |  27 +
> >>  3 files changed, 274 insertions(+), 6 deletions(-)
> >>
> >>diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> >>index 96978ec..94328c8 100644
> >>--- a/Documentation/vfio.txt
> >>+++ b/Documentation/vfio.txt
> >>@@ -427,6 +427,29 @@ The code flow from the example above should be 
> >>slightly changed:
> >>
> >>
> >>
> >>+5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
> >>+VFIO_IOMMU_DISABLE and implements 2 new ioctls:
> >>+VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
> >>+(which are unsupported in v1 IOMMU).
> >
> >A summary of the semantic differeces between v1 and v2 would be nice.
> >At this point it's not really clear to me if there's a case for
> >creating v2, or if this could just be done by adding (optional)
> >functionality to v1.
> 
> v1: memory preregistration is not supported; explicit enable/disable ioctls
> are required
> 
> v2: memory preregistration is required; explicit enable/disable are
> prohibited (as they are not needed).
> 
> Mixing these in one IOMMU type caused a lot of problems like should I
> increment locked_vm by the 32bit window size on enable() or not; what do I
> do about pages pinning when map/map (check if it is from registered memory
> and do not pin?).
> 
> Having 2 IOMMU models makes everything a lot simpler.

Ok.  Would it simplify it further if you made v2 only usable on IODA2
hardware?

> >>+PPC64 paravirtualized guests generate a lot of map/unmap requests,
> >>+and the handling of those includes pinning/unpinning pages and updating
> >>+mm::locked_vm counter to make sure we do not exceed the rlimit.
> >>+The v2 IOMMU splits accounting and pinning into separate operations:
> >>+
> >>+- VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY 
> >>ioctls
> 

Re: [PATCH kernel v9 29/32] vfio: powerpc/spapr: Register memory and define IOMMU v2

2015-04-30 Thread Alexey Kardashevskiy

On 04/30/2015 04:55 PM, David Gibson wrote:

On Sat, Apr 25, 2015 at 10:14:53PM +1000, Alexey Kardashevskiy wrote:

The existing implementation accounts the whole DMA window in
the locked_vm counter. This is going to be worse with multiple
containers and huge DMA windows. Also, real-time accounting would requite
additional tracking of accounted pages due to the page size difference -
IOMMU uses 4K pages and system uses 4K or 64K pages.

Another issue is that actual pages pinning/unpinning happens on every
DMA map/unmap request. This does not affect the performance much now as
we spend way too much time now on switching context between
guest/userspace/host but this will start to matter when we add in-kernel
DMA map/unmap acceleration.

This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
2 new ioctls to register/unregister DMA memory -
VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
which receive user space address and size of a memory region which
needs to be pinned/unpinned and counted in locked_vm.
New IOMMU splits physical pages pinning and TCE table update into 2 different
operations. It requires 1) guest pages to be registered first 2) consequent
map/unmap requests to work only with pre-registered memory.
For the default single window case this means that the entire guest
(instead of 2GB) needs to be pinned before using VFIO.
When a huge DMA window is added, no additional pinning will be
required, otherwise it would be guest RAM + 2GB.

The new memory registration ioctls are not supported by
VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
will require memory to be preregistered in order to work.

The accounting is done per the user process.

This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
can do with v1 or v2 IOMMUs.

Signed-off-by: Alexey Kardashevskiy 
[aw: for the vfio related changes]
Acked-by: Alex Williamson 
---
Changes:
v9:
* s/tce_get_hva_cached/tce_iommu_use_page_v2/

v7:
* now memory is registered per mm (i.e. process)
* moved memory registration code to powerpc/mmu
* merged "vfio: powerpc/spapr: Define v2 IOMMU" into this
* limited new ioctls to v2 IOMMU
* updated doc
* unsupported ioclts return -ENOTTY instead of -EPERM

v6:
* tce_get_hva_cached() returns hva via a pointer

v4:
* updated docs
* s/kzmalloc/vzalloc/
* in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
replaced offset with index
* renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory
and removed duplicating vfio_iommu_spapr_register_memory
---
  Documentation/vfio.txt  |  23 
  drivers/vfio/vfio_iommu_spapr_tce.c | 230 +++-
  include/uapi/linux/vfio.h   |  27 +
  3 files changed, 274 insertions(+), 6 deletions(-)

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index 96978ec..94328c8 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -427,6 +427,29 @@ The code flow from the example above should be slightly 
changed:



+5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
+VFIO_IOMMU_DISABLE and implements 2 new ioctls:
+VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
+(which are unsupported in v1 IOMMU).


A summary of the semantic differeces between v1 and v2 would be nice.
At this point it's not really clear to me if there's a case for
creating v2, or if this could just be done by adding (optional)
functionality to v1.


v1: memory preregistration is not supported; explicit enable/disable ioctls 
are required


v2: memory preregistration is required; explicit enable/disable are 
prohibited (as they are not needed).


Mixing these in one IOMMU type caused a lot of problems like should I 
increment locked_vm by the 32bit window size on enable() or not; what do I 
do about pages pinning when map/map (check if it is from registered memory 
and do not pin?).


Having 2 IOMMU models makes everything a lot simpler.



+PPC64 paravirtualized guests generate a lot of map/unmap requests,
+and the handling of those includes pinning/unpinning pages and updating
+mm::locked_vm counter to make sure we do not exceed the rlimit.
+The v2 IOMMU splits accounting and pinning into separate operations:
+
+- VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
+receive a user space address and size of the block to be pinned.
+Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to
+be called with the exact address and size used for registering
+the memory block. The userspace is not expected to call these often.
+The ranges are stored in a linked list in a VFIO container.
+
+- VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual
+IOMMU table and do not do pinning; instead these check that the userspace
+address is from pre-registered range.
+
+This separation 

Re: [PATCH kernel v9 29/32] vfio: powerpc/spapr: Register memory and define IOMMU v2

2015-04-30 Thread David Gibson
On Sat, Apr 25, 2015 at 10:14:53PM +1000, Alexey Kardashevskiy wrote:
> The existing implementation accounts the whole DMA window in
> the locked_vm counter. This is going to be worse with multiple
> containers and huge DMA windows. Also, real-time accounting would requite
> additional tracking of accounted pages due to the page size difference -
> IOMMU uses 4K pages and system uses 4K or 64K pages.
> 
> Another issue is that actual pages pinning/unpinning happens on every
> DMA map/unmap request. This does not affect the performance much now as
> we spend way too much time now on switching context between
> guest/userspace/host but this will start to matter when we add in-kernel
> DMA map/unmap acceleration.
> 
> This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
> New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
> 2 new ioctls to register/unregister DMA memory -
> VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
> which receive user space address and size of a memory region which
> needs to be pinned/unpinned and counted in locked_vm.
> New IOMMU splits physical pages pinning and TCE table update into 2 different
> operations. It requires 1) guest pages to be registered first 2) consequent
> map/unmap requests to work only with pre-registered memory.
> For the default single window case this means that the entire guest
> (instead of 2GB) needs to be pinned before using VFIO.
> When a huge DMA window is added, no additional pinning will be
> required, otherwise it would be guest RAM + 2GB.
> 
> The new memory registration ioctls are not supported by
> VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
> will require memory to be preregistered in order to work.
> 
> The accounting is done per the user process.
> 
> This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
> can do with v1 or v2 IOMMUs.
> 
> Signed-off-by: Alexey Kardashevskiy 
> [aw: for the vfio related changes]
> Acked-by: Alex Williamson 
> ---
> Changes:
> v9:
> * s/tce_get_hva_cached/tce_iommu_use_page_v2/
> 
> v7:
> * now memory is registered per mm (i.e. process)
> * moved memory registration code to powerpc/mmu
> * merged "vfio: powerpc/spapr: Define v2 IOMMU" into this
> * limited new ioctls to v2 IOMMU
> * updated doc
> * unsupported ioclts return -ENOTTY instead of -EPERM
> 
> v6:
> * tce_get_hva_cached() returns hva via a pointer
> 
> v4:
> * updated docs
> * s/kzmalloc/vzalloc/
> * in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
> replaced offset with index
> * renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory
> and removed duplicating vfio_iommu_spapr_register_memory
> ---
>  Documentation/vfio.txt  |  23 
>  drivers/vfio/vfio_iommu_spapr_tce.c | 230 
> +++-
>  include/uapi/linux/vfio.h   |  27 +
>  3 files changed, 274 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> index 96978ec..94328c8 100644
> --- a/Documentation/vfio.txt
> +++ b/Documentation/vfio.txt
> @@ -427,6 +427,29 @@ The code flow from the example above should be slightly 
> changed:
>  
>   
>  
> +5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
> +VFIO_IOMMU_DISABLE and implements 2 new ioctls:
> +VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
> +(which are unsupported in v1 IOMMU).

A summary of the semantic differeces between v1 and v2 would be nice.
At this point it's not really clear to me if there's a case for
creating v2, or if this could just be done by adding (optional)
functionality to v1.

> +PPC64 paravirtualized guests generate a lot of map/unmap requests,
> +and the handling of those includes pinning/unpinning pages and updating
> +mm::locked_vm counter to make sure we do not exceed the rlimit.
> +The v2 IOMMU splits accounting and pinning into separate operations:
> +
> +- VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
> +receive a user space address and size of the block to be pinned.
> +Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to
> +be called with the exact address and size used for registering
> +the memory block. The userspace is not expected to call these often.
> +The ranges are stored in a linked list in a VFIO container.
> +
> +- VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual
> +IOMMU table and do not do pinning; instead these check that the userspace
> +address is from pre-registered range.
> +
> +This separation helps in optimizing DMA for guests.
> +
>  
> ---
>  
>  [1] VFIO was originally an acronym for "Virtual Function I/O" in its
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 892a584..4cfc2c1 100644
> --- 

Re: [PATCH kernel v9 29/32] vfio: powerpc/spapr: Register memory and define IOMMU v2

2015-04-30 Thread David Gibson
On Sat, Apr 25, 2015 at 10:14:53PM +1000, Alexey Kardashevskiy wrote:
 The existing implementation accounts the whole DMA window in
 the locked_vm counter. This is going to be worse with multiple
 containers and huge DMA windows. Also, real-time accounting would requite
 additional tracking of accounted pages due to the page size difference -
 IOMMU uses 4K pages and system uses 4K or 64K pages.
 
 Another issue is that actual pages pinning/unpinning happens on every
 DMA map/unmap request. This does not affect the performance much now as
 we spend way too much time now on switching context between
 guest/userspace/host but this will start to matter when we add in-kernel
 DMA map/unmap acceleration.
 
 This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
 New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
 2 new ioctls to register/unregister DMA memory -
 VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
 which receive user space address and size of a memory region which
 needs to be pinned/unpinned and counted in locked_vm.
 New IOMMU splits physical pages pinning and TCE table update into 2 different
 operations. It requires 1) guest pages to be registered first 2) consequent
 map/unmap requests to work only with pre-registered memory.
 For the default single window case this means that the entire guest
 (instead of 2GB) needs to be pinned before using VFIO.
 When a huge DMA window is added, no additional pinning will be
 required, otherwise it would be guest RAM + 2GB.
 
 The new memory registration ioctls are not supported by
 VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
 will require memory to be preregistered in order to work.
 
 The accounting is done per the user process.
 
 This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
 can do with v1 or v2 IOMMUs.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 [aw: for the vfio related changes]
 Acked-by: Alex Williamson alex.william...@redhat.com
 ---
 Changes:
 v9:
 * s/tce_get_hva_cached/tce_iommu_use_page_v2/
 
 v7:
 * now memory is registered per mm (i.e. process)
 * moved memory registration code to powerpc/mmu
 * merged vfio: powerpc/spapr: Define v2 IOMMU into this
 * limited new ioctls to v2 IOMMU
 * updated doc
 * unsupported ioclts return -ENOTTY instead of -EPERM
 
 v6:
 * tce_get_hva_cached() returns hva via a pointer
 
 v4:
 * updated docs
 * s/kzmalloc/vzalloc/
 * in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
 replaced offset with index
 * renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory
 and removed duplicating vfio_iommu_spapr_register_memory
 ---
  Documentation/vfio.txt  |  23 
  drivers/vfio/vfio_iommu_spapr_tce.c | 230 
 +++-
  include/uapi/linux/vfio.h   |  27 +
  3 files changed, 274 insertions(+), 6 deletions(-)
 
 diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
 index 96978ec..94328c8 100644
 --- a/Documentation/vfio.txt
 +++ b/Documentation/vfio.txt
 @@ -427,6 +427,29 @@ The code flow from the example above should be slightly 
 changed:
  
   
  
 +5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
 +VFIO_IOMMU_DISABLE and implements 2 new ioctls:
 +VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
 +(which are unsupported in v1 IOMMU).

A summary of the semantic differeces between v1 and v2 would be nice.
At this point it's not really clear to me if there's a case for
creating v2, or if this could just be done by adding (optional)
functionality to v1.

 +PPC64 paravirtualized guests generate a lot of map/unmap requests,
 +and the handling of those includes pinning/unpinning pages and updating
 +mm::locked_vm counter to make sure we do not exceed the rlimit.
 +The v2 IOMMU splits accounting and pinning into separate operations:
 +
 +- VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
 +receive a user space address and size of the block to be pinned.
 +Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to
 +be called with the exact address and size used for registering
 +the memory block. The userspace is not expected to call these often.
 +The ranges are stored in a linked list in a VFIO container.
 +
 +- VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual
 +IOMMU table and do not do pinning; instead these check that the userspace
 +address is from pre-registered range.
 +
 +This separation helps in optimizing DMA for guests.
 +
  
 ---
  
  [1] VFIO was originally an acronym for Virtual Function I/O in its
 diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
 b/drivers/vfio/vfio_iommu_spapr_tce.c
 index 892a584..4cfc2c1 100644
 --- a/drivers/vfio/vfio_iommu_spapr_tce.c
 +++ b/drivers/vfio/vfio_iommu_spapr_tce.c

So, from 

Re: [PATCH kernel v9 29/32] vfio: powerpc/spapr: Register memory and define IOMMU v2

2015-04-30 Thread Alexey Kardashevskiy

On 04/30/2015 04:55 PM, David Gibson wrote:

On Sat, Apr 25, 2015 at 10:14:53PM +1000, Alexey Kardashevskiy wrote:

The existing implementation accounts the whole DMA window in
the locked_vm counter. This is going to be worse with multiple
containers and huge DMA windows. Also, real-time accounting would requite
additional tracking of accounted pages due to the page size difference -
IOMMU uses 4K pages and system uses 4K or 64K pages.

Another issue is that actual pages pinning/unpinning happens on every
DMA map/unmap request. This does not affect the performance much now as
we spend way too much time now on switching context between
guest/userspace/host but this will start to matter when we add in-kernel
DMA map/unmap acceleration.

This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
2 new ioctls to register/unregister DMA memory -
VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
which receive user space address and size of a memory region which
needs to be pinned/unpinned and counted in locked_vm.
New IOMMU splits physical pages pinning and TCE table update into 2 different
operations. It requires 1) guest pages to be registered first 2) consequent
map/unmap requests to work only with pre-registered memory.
For the default single window case this means that the entire guest
(instead of 2GB) needs to be pinned before using VFIO.
When a huge DMA window is added, no additional pinning will be
required, otherwise it would be guest RAM + 2GB.

The new memory registration ioctls are not supported by
VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
will require memory to be preregistered in order to work.

The accounting is done per the user process.

This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
can do with v1 or v2 IOMMUs.

Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
[aw: for the vfio related changes]
Acked-by: Alex Williamson alex.william...@redhat.com
---
Changes:
v9:
* s/tce_get_hva_cached/tce_iommu_use_page_v2/

v7:
* now memory is registered per mm (i.e. process)
* moved memory registration code to powerpc/mmu
* merged vfio: powerpc/spapr: Define v2 IOMMU into this
* limited new ioctls to v2 IOMMU
* updated doc
* unsupported ioclts return -ENOTTY instead of -EPERM

v6:
* tce_get_hva_cached() returns hva via a pointer

v4:
* updated docs
* s/kzmalloc/vzalloc/
* in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
replaced offset with index
* renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory
and removed duplicating vfio_iommu_spapr_register_memory
---
  Documentation/vfio.txt  |  23 
  drivers/vfio/vfio_iommu_spapr_tce.c | 230 +++-
  include/uapi/linux/vfio.h   |  27 +
  3 files changed, 274 insertions(+), 6 deletions(-)

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index 96978ec..94328c8 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -427,6 +427,29 @@ The code flow from the example above should be slightly 
changed:



+5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
+VFIO_IOMMU_DISABLE and implements 2 new ioctls:
+VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
+(which are unsupported in v1 IOMMU).


A summary of the semantic differeces between v1 and v2 would be nice.
At this point it's not really clear to me if there's a case for
creating v2, or if this could just be done by adding (optional)
functionality to v1.


v1: memory preregistration is not supported; explicit enable/disable ioctls 
are required


v2: memory preregistration is required; explicit enable/disable are 
prohibited (as they are not needed).


Mixing these in one IOMMU type caused a lot of problems like should I 
increment locked_vm by the 32bit window size on enable() or not; what do I 
do about pages pinning when map/map (check if it is from registered memory 
and do not pin?).


Having 2 IOMMU models makes everything a lot simpler.



+PPC64 paravirtualized guests generate a lot of map/unmap requests,
+and the handling of those includes pinning/unpinning pages and updating
+mm::locked_vm counter to make sure we do not exceed the rlimit.
+The v2 IOMMU splits accounting and pinning into separate operations:
+
+- VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
+receive a user space address and size of the block to be pinned.
+Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to
+be called with the exact address and size used for registering
+the memory block. The userspace is not expected to call these often.
+The ranges are stored in a linked list in a VFIO container.
+
+- VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual
+IOMMU table and do not do pinning; instead these check that the userspace
+address is from 

Re: [PATCH kernel v9 29/32] vfio: powerpc/spapr: Register memory and define IOMMU v2

2015-04-30 Thread David Gibson
On Fri, May 01, 2015 at 02:35:23PM +1000, Alexey Kardashevskiy wrote:
 On 04/30/2015 04:55 PM, David Gibson wrote:
 On Sat, Apr 25, 2015 at 10:14:53PM +1000, Alexey Kardashevskiy wrote:
 The existing implementation accounts the whole DMA window in
 the locked_vm counter. This is going to be worse with multiple
 containers and huge DMA windows. Also, real-time accounting would requite
 additional tracking of accounted pages due to the page size difference -
 IOMMU uses 4K pages and system uses 4K or 64K pages.
 
 Another issue is that actual pages pinning/unpinning happens on every
 DMA map/unmap request. This does not affect the performance much now as
 we spend way too much time now on switching context between
 guest/userspace/host but this will start to matter when we add in-kernel
 DMA map/unmap acceleration.
 
 This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
 New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
 2 new ioctls to register/unregister DMA memory -
 VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
 which receive user space address and size of a memory region which
 needs to be pinned/unpinned and counted in locked_vm.
 New IOMMU splits physical pages pinning and TCE table update into 2 
 different
 operations. It requires 1) guest pages to be registered first 2) consequent
 map/unmap requests to work only with pre-registered memory.
 For the default single window case this means that the entire guest
 (instead of 2GB) needs to be pinned before using VFIO.
 When a huge DMA window is added, no additional pinning will be
 required, otherwise it would be guest RAM + 2GB.
 
 The new memory registration ioctls are not supported by
 VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
 will require memory to be preregistered in order to work.
 
 The accounting is done per the user process.
 
 This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
 can do with v1 or v2 IOMMUs.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 [aw: for the vfio related changes]
 Acked-by: Alex Williamson alex.william...@redhat.com
 ---
 Changes:
 v9:
 * s/tce_get_hva_cached/tce_iommu_use_page_v2/
 
 v7:
 * now memory is registered per mm (i.e. process)
 * moved memory registration code to powerpc/mmu
 * merged vfio: powerpc/spapr: Define v2 IOMMU into this
 * limited new ioctls to v2 IOMMU
 * updated doc
 * unsupported ioclts return -ENOTTY instead of -EPERM
 
 v6:
 * tce_get_hva_cached() returns hva via a pointer
 
 v4:
 * updated docs
 * s/kzmalloc/vzalloc/
 * in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and
 replaced offset with index
 * renamed vfio_iommu_type_register_memory to 
 vfio_iommu_spapr_register_memory
 and removed duplicating vfio_iommu_spapr_register_memory
 ---
   Documentation/vfio.txt  |  23 
   drivers/vfio/vfio_iommu_spapr_tce.c | 230 
  +++-
   include/uapi/linux/vfio.h   |  27 +
   3 files changed, 274 insertions(+), 6 deletions(-)
 
 diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
 index 96978ec..94328c8 100644
 --- a/Documentation/vfio.txt
 +++ b/Documentation/vfio.txt
 @@ -427,6 +427,29 @@ The code flow from the example above should be 
 slightly changed:
 
 
 
 +5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
 +VFIO_IOMMU_DISABLE and implements 2 new ioctls:
 +VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
 +(which are unsupported in v1 IOMMU).
 
 A summary of the semantic differeces between v1 and v2 would be nice.
 At this point it's not really clear to me if there's a case for
 creating v2, or if this could just be done by adding (optional)
 functionality to v1.
 
 v1: memory preregistration is not supported; explicit enable/disable ioctls
 are required
 
 v2: memory preregistration is required; explicit enable/disable are
 prohibited (as they are not needed).
 
 Mixing these in one IOMMU type caused a lot of problems like should I
 increment locked_vm by the 32bit window size on enable() or not; what do I
 do about pages pinning when map/map (check if it is from registered memory
 and do not pin?).
 
 Having 2 IOMMU models makes everything a lot simpler.

Ok.  Would it simplify it further if you made v2 only usable on IODA2
hardware?

 +PPC64 paravirtualized guests generate a lot of map/unmap requests,
 +and the handling of those includes pinning/unpinning pages and updating
 +mm::locked_vm counter to make sure we do not exceed the rlimit.
 +The v2 IOMMU splits accounting and pinning into separate operations:
 +
 +- VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY 
 ioctls
 +receive a user space address and size of the block to be pinned.
 +Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to
 +be called with the exact address and size used for registering
 +the memory block. The userspace is not expected