Re: [PATCH v2 0/9] Add dynamic iommu backed bounce buffers

2022-07-01 Thread Niklas Schnelle

Re: [PATCH v2 0/9] Add dynamic iommu backed bounce buffers

2022-06-05 Thread David Stevens

Re: [PATCH v2 0/9] Add dynamic iommu backed bounce buffers

2022-06-03 Thread Niklas Schnelle

Re: [PATCH v2 0/9] Add dynamic iommu backed bounce buffers

2022-05-26 Thread David Stevens

Re: [PATCH v2 0/9] Add dynamic iommu backed bounce buffers

2022-05-24 Thread Niklas Schnelle

[PATCH v2 0/9] Add dynamic iommu backed bounce buffers

2021-08-06 Thread David Stevens
From: David Stevens 

This patch series adds support for per-domain dynamic pools of iommu
bounce buffers to the dma-iommu API. This allows iommu mappings to be
reused while still maintaining strict iommu protection.

This bounce buffer support is used to add a new config option that, when
enabled, causes all non-direct streaming mappings below a configurable
size to go through the bounce buffers. This serves as an optimization on
systems where manipulating iommu mappings is very expensive. For
example, virtio-iommu operations in a guest on a Linux host require a
vmexit, involvement of the VMM, and a VFIO syscall. For relatively small
DMA operations, memcpy can be significantly faster.
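
As a rough sketch of the routing decision described above (the helper
names and the threshold below are illustrative assumptions, not
identifiers from this series), the map path amounts to a size check in
front of the usual dynamic iommu mapping:

#include <linux/dma-mapping.h>

/* Hypothetical sketch only: helper names and the threshold are
 * assumptions for illustration, not code from this series. */
#define BOUNCE_BUFFER_THRESHOLD	(64 * 1024) /* assumed configurable size */

static dma_addr_t iommu_bounce_map_page(struct device *dev,
		struct page *page, unsigned long offset, size_t size,
		enum dma_data_direction dir);
static dma_addr_t __iommu_dma_map_page(struct device *dev,
		struct page *page, unsigned long offset, size_t size,
		enum dma_data_direction dir, unsigned long attrs);

static dma_addr_t bounce_or_map_page(struct device *dev, struct page *page,
		unsigned long offset, size_t size,
		enum dma_data_direction dir, unsigned long attrs)
{
	/* Small streaming mappings reuse a cached, already-mapped bounce
	 * buffer, avoiding an expensive iommu map operation. */
	if (size <= BOUNCE_BUFFER_THRESHOLD)
		return iommu_bounce_map_page(dev, page, offset, size, dir);

	/* Larger mappings take the normal dynamic iommu mapping path. */
	return __iommu_dma_map_page(dev, page, offset, size, dir, attrs);
}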

As a performance comparison, on a device with an i5-10210U, I ran fio
with a VFIO passthrough NVMe drive and virtio-iommu with '--direct=1
--rw=read --ioengine=libaio --iodepth=64' and block sizes 4k, 16k, 64k,
and 128k. Test throughput increased by 2.8x, 4.7x, 3.6x, and 3.6x. Time
spent in iommu_dma_unmap_(page|sg) per GB processed decreased by 97%,
94%, 90%, and 87%. Time spent in iommu_dma_map_(page|sg) decreased
by >99%, as bounce buffers don't require syncing here in the read case.
Running with multiple jobs doesn't serve as a useful performance
comparison because virtio-iommu and vfio_iommu_type1 both have big
locks that significantly limit multithreaded DMA performance.

These pooled bounce buffers are also used for subgranule mappings with
untrusted devices, replacing the single use bounce buffers used
currently. The biggest difference here is that the new implementation
maps a whole sglist using a single bounce buffer. The new implementation
does not support using bounce buffers for only some segments of the
sglist, so it may require more copying. However, the current
implementation requires per-segment iommu map/unmap operations for all
untrusted sglist mappings (fully aligned sglists included). On an
i5-10210U laptop with the internal NVMe drive made to appear untrusted,
fio --direct=1 --rw=read --ioengine=libaio --iodepth=64 --bs=64k showed
a statistically significant decrease in CPU load from 2.28% -> 2.17%
with the new iommu bounce buffer optimization enabled.
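
Conceptually, syncing a whole sglist through one contiguous bounce
buffer looks like the sketch below (the helper name and bounce buffer
bookkeeping are assumptions for illustration, not the actual patch
code):

#include <linux/highmem.h>
#include <linux/scatterlist.h>

/* Sketch: copy every segment of an sglist into a single contiguous
 * bounce buffer before handing it to the device (the DMA_TO_DEVICE
 * sync direction). One iommu mapping then covers the whole buffer. */
static void bounce_sync_sg_for_device(void *bounce_vaddr,
				      struct scatterlist *sgl, int nents)
{
	struct scatterlist *sg;
	size_t offset = 0;
	int i;

	for_each_sg(sgl, sg, nents, i) {
		memcpy_from_page(bounce_vaddr + offset, sg_page(sg),
				 sg->offset, sg->length);
		offset += sg->length;
	}
}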

Each domain's buffer pool is split into multiple power-of-2 size
classes. Each class allocates a fixed number of buffer slot metadata entries. A
large iova range is allocated, and each slot is assigned an iova from
the range. This allows the iova to be easily mapped back to the slot,
and allows the critical section of most pool operations to be constant
time. The one exception is finding a cached buffer to reuse. These are
only separated according to R/W permissions - the use of other
permissions such as IOMMU_PRIV may require a linear search through the
cache. However, these other permissions are rare and likely exhibit high
locality, so they should not be a bottleneck in practice.
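
A minimal sketch of why the iova-to-slot lookup is constant time (the
structures and field names are assumptions for illustration, not the
series' actual layout):

#include <linux/types.h>

struct bounce_slot {
	struct page **pages;	/* backing memory for this slot */
	int prot;		/* permissions of the cached mapping */
};

/* One pool per power-of-2 size class: a fixed array of slot metadata
 * plus one contiguous iova range carved into equal-sized pieces. */
struct bounce_pool {
	dma_addr_t iova_base;
	size_t slot_size;
	struct bounce_slot *slots;
};

/* Slots are assigned iovas linearly from the range, so recovering a
 * slot from an iova is simple arithmetic rather than a tree walk. */
static struct bounce_slot *iova_to_slot(struct bounce_pool *pool,
					dma_addr_t iova)
{
	return &pool->slots[(iova - pool->iova_base) / pool->slot_size];
}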

Since untrusted devices may require bounce buffers, each domain has a
fallback rbtree to manage single use buffers. This may be necessary if a
very large number of DMA operations are simultaneously in-flight, or for
very large individual DMA operations.
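
For that fallback path, a lookup in an iova-keyed rbtree might look
like the following sketch (the structure and field names are
illustrative assumptions):

#include <linux/rbtree.h>
#include <linux/types.h>

/* Sketch: single-use fallback buffers tracked in an rbtree keyed by
 * their iova range. */
struct fallback_buffer {
	struct rb_node node;
	dma_addr_t iova;
	size_t size;
};

static struct fallback_buffer *fallback_find(struct rb_root *root,
					     dma_addr_t iova)
{
	struct rb_node *n = root->rb_node;

	while (n) {
		struct fallback_buffer *buf =
			rb_entry(n, struct fallback_buffer, node);

		if (iova < buf->iova)
			n = n->rb_left;
		else if (iova >= buf->iova + buf->size)
			n = n->rb_right;
		else
			return buf;
	}
	return NULL;
}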

This patch set does not use swiotlb. There are two primary ways in which
swiotlb isn't compatible with per-domain buffer pools. First, swiotlb
allocates buffers to be compatible with a single device, whereas
per-domain buffer pools don't handle that during buffer allocation as a
single buffer may end up being used by multiple devices. Second, swiotlb
allocation establishes the original-to-bounce-buffer mapping, which
again doesn't work if buffers can be reused. Effectively the only code
that can be shared between the two use cases is allocating slots from
the swiotlb's memory. However, given that we're going to be allocating
memory for use with an iommu, allocating memory from a block of memory
explicitly set aside to deal with a lack of iommu seems kind of
contradictory. At best there might be a small performance improvement if 
swiotlb allocation is faster than regular page allocation, but buffer
allocation isn't on the hot path anyway.

Not using the swiotlb has the benefit that memory doesn't have to be
preallocated. Instead, bounce buffers consume memory only for in-flight
DMA transactions (ignoring temporarily cached buffers), which is the
smallest amount possible. This makes it easier to use bounce buffers as
an optimization on systems with large numbers of devices or in
situations where devices are unknown, since it is not necessary to try
to tune how much memory needs to be set aside to achieve good
performance without costing too much memory.

Finally, this series adds a new DMA_ATTR_PERSISTENT_STREAMING flag. This
is meant to address devices which create long-lived streaming mappings
but manage CPU cache coherency without using the dma_sync_* APIs.
Currently, these devices don't function properly with swiotlb=force. The
new flag is used to bypass bounce buffers so such devices will function
when the new bounce buffer optimization is enabled.
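
For a driver with such a mapping, opting out is a one-attribute change
at map time; a minimal sketch (DMA_ATTR_PERSISTENT_STREAMING is the
flag proposed by this series, everything else is the standard DMA API):

#include <linux/dma-mapping.h>

/* Sketch: a long-lived streaming mapping whose owner manages CPU cache
 * coherency itself bypasses the bounce buffers with the proposed
 * attribute. */
static dma_addr_t map_persistent_ring(struct device *dev,
				      struct page *page, size_t size)
{
	return dma_map_page_attrs(dev, page, 0, size, DMA_BIDIRECTIONAL,
				  DMA_ATTR_PERSISTENT_STREAMING);
}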