RE: [RFC PATCH V2 1/2] swiotlb: Add Child IO TLB mem support

2022-05-31 Thread Michael Kelley (LINUX) via iommu
From: h...@lst.de  Sent: Tuesday, May 31, 2022 12:17 AM
> 
> On Mon, May 30, 2022 at 01:52:37AM +, Michael Kelley (LINUX) wrote:
> > B) The contents of the memory buffer must transition between
> > encrypted and not encrypted.  The hardware doesn't provide
> > any mechanism to do such a transition "in place".  The only
> > way to transition is for the CPU to copy the contents between
> > an encrypted area and an unencrypted area of memory.
> >
> > Because of (B), we're stuck needing bounce buffers.  There's no
> > way to avoid them with the current hardware.  Tianyu also pointed
> > out not wanting to expose uninitialized guest memory to the host,
> > so, for example, sharing a read buffer with the host requires that
> > it first be initialized to zero.
> 
> Ok, B is a deal breaker.  I just brought this in because I've received
> review comments that state bouncing is just the easiest option for
> now and we could map it into the hypervisor in the future.  But at
> least for SEV that does not seem like an option without hardware
> changes.
> 
> > We should reset and make sure we agree on the top-level approach.
> > 1) We want general scalability improvements to swiotlb.  These
> > improvements should scale to high CPU counts (> 100) and for
> > multiple NUMA nodes.
> > 2) Drivers should not require any special knowledge of swiotlb to
> > benefit from the improvements.  No new swiotlb APIs should
> > need to be used by drivers -- the swiotlb scalability improvements
> > should be transparent.
> > 3) The scalability improvements should not be based on device
> > boundaries since a single device may have multiple channels
> > doing bounce buffering on multiple CPUs in parallel.
> 
> Agreed on all counts.
> 
> > The patch from Andi Kleen [3] (not submitted upstream, but referenced
> > by Tianyu as the basis for his patches) seems like a good starting point
> > for meeting the top-level approach.
> 
> Yes, I think doing per-cpu and/or per-node scaling sounds like the
> right general approach. Why was this never sent out?

I'll defer to Andi on what his thinking/plan is for this patch.

> 
> > Andi and Robin had some
> > back-and-forth about Andi's patch that I haven't delved into yet, but
> > getting that worked out seems like a better overall approach.  I had
> > an offline chat with Tianyu, and he would agree as well.
> 
> Where was this discussion?

I was just referring to this thread that you are already on:
https://lore.kernel.org/lkml/e7b644f0-6c90-fe99-792d-75c38505d...@arm.com/

Michael

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH V2 1/2] swiotlb: Add Child IO TLB mem support

2022-05-31 Thread h...@lst.de
On Mon, May 30, 2022 at 01:52:37AM +, Michael Kelley (LINUX) wrote:
> B) The contents of the memory buffer must transition between
> encrypted and not encrypted.  The hardware doesn't provide
> any mechanism to do such a transition "in place".  The only
> way to transition is for the CPU to copy the contents between
> an encrypted area and an unencrypted area of memory.
> 
> Because of (B), we're stuck needing bounce buffers.  There's no
> way to avoid them with the current hardware.  Tianyu also pointed
> out not wanting to expose uninitialized guest memory to the host,
> so, for example, sharing a read buffer with the host requires that
> it first be initialized to zero.

Ok, B is a deal breaker.  I just brought this in because I've received
review comments that state bouncing is just the easiest option for
now and we could map it into the hypervisor in the future.  But at
least for SEV that does not seem like an option without hardware
changes.

> We should reset and make sure we agree on the top-level approach.
> 1) We want general scalability improvements to swiotlb.  These
> improvements should scale to high CPU counts (> 100) and for
> multiple NUMA nodes.
> 2) Drivers should not require any special knowledge of swiotlb to
> benefit from the improvements.  No new swiotlb APIs should
> need to be used by drivers -- the swiotlb scalability improvements
> should be transparent.
> 3) The scalability improvements should not be based on device
> boundaries since a single device may have multiple channels
> doing bounce buffering on multiple CPUs in parallel.

Agreed on all counts.

> The patch from Andi Kleen [3] (not submitted upstream, but referenced
> by Tianyu as the basis for his patches) seems like a good starting point
> for meeting the top-level approach.

Yes, I think doing per-cpu and/or per-node scaling sounds like the
right general approach. Why was this never sent out?

> Andi and Robin had some
> back-and-forth about Andi's patch that I haven't delved into yet, but
> getting that worked out seems like a better overall approach.  I had
> an offline chat with Tianyu, and he would agree as well.

Where was this discussion?


RE: [RFC PATCH V2 1/2] swiotlb: Add Child IO TLB mem support

2022-05-29 Thread Michael Kelley (LINUX) via iommu
From: Christoph Hellwig  Sent: Monday, May 16, 2022 12:35 AM
> 
> I don't really understand how 'childs' fit in here.  The code also
> doesn't seem to be usable without patch 2 and a caller of the
> new functions added in patch 2, so it is rather impossible to review.
> 
> Also:
> 
>  1) why is SEV/TDX so different from other cases that need bounce
> buffering to treat it differently and we can't work on a general
> scalability improvement
>  2) per previous discussions about how swiotlb itself works, it is
> clear that another option is to just make pages we DMA to
> shared with the hypervisor.  Why don't we try that at least
> for larger I/O?

Tianyu already responded, but let me offer an expanded view.
I have better knowledge of AMD's SEV-SNP than of Intel's TDX,
so my details might be off for TDX.

Taking your question (2) first, two things must be done when guest
memory pages transition between the "shared with the hypervisor"
and the "private to the guest" states:

A) Some bookkeeping between the guest and host, which requires
a hypercall.  Doing a hypercall isn't super-fast, but for large I/Os,
it could be a reasonable tradeoff if we could avoid bounce buffer
copying.

B) The contents of the memory buffer must transition between
encrypted and not encrypted.  The hardware doesn't provide
any mechanism to do such a transition "in place".  The only
way to transition is for the CPU to copy the contents between
an encrypted area and an unencrypted area of memory.

Because of (B), we're stuck needing bounce buffers.  There's no
way to avoid them with the current hardware.  Tianyu also pointed
out not wanting to expose uninitialized guest memory to the host,
so, for example, sharing a read buffer with the host requires that
it first be initialized to zero.

For your question (1), I think we all would agree that SEV-SNP and
TDX usage of bounce buffers isn't fundamentally different from other
uses -- they just put a lot more load on the bounce buffering
mechanism. If done well, general swiotlb scalability improvements
should be sufficient and are much preferred.

You made a recent comment about almost being done removing
all knowledge of swiotlb from drivers [1].  I agree with that goal.
However, Tianyu's recent patches for improving swiotlb scalability
don't align with that goal.  A while back, you suggested using
per-device swiotlb regions [2], and I think Tianyu's patch sets have
taken that approach.  But that approach requires going beyond the
existing per-device swiotlb regions to get scalability with multi-channel
devices, and that's leading us in the wrong direction.

We should reset and make sure we agree on the top-level approach.
1) We want general scalability improvements to swiotlb.  These
improvements should scale to high CPU counts (> 100) and for
multiple NUMA nodes.
2) Drivers should not require any special knowledge of swiotlb to
benefit from the improvements.  No new swiotlb APIs should
need to be used by drivers -- the swiotlb scalability improvements
should be transparent.
3) The scalability improvements should not be based on device
boundaries since a single device may have multiple channels
doing bounce buffering on multiple CPUs in parallel.

Anything else?

The patch from Andi Kleen [3] (not submitted upstream, but referenced
by Tianyu as the basis for his patches) seems like a good starting point
for meeting the top-level approach.  Andi and Robin had some
back-and-forth about Andi's patch that I haven't delved into yet, but
getting that worked out seems like a better overall approach.  I had
an offline chat with Tianyu, and he would agree as well.

Agree?  Disagree?

Michael

[1]  https://lore.kernel.org/lkml/ymqonhkbt8fty...@infradead.org/
[2] https://lore.kernel.org/lkml/20220222080543.ga5...@lst.de/
[3] https://github.com/intel/tdx/commit/4529b578


Re: [RFC PATCH V2 1/2] swiotlb: Add Child IO TLB mem support

2022-05-16 Thread Tianyu Lan

On 5/16/2022 3:34 PM, Christoph Hellwig wrote:

I don't really understand how 'childs' fit in here.  The code also
doesn't seem to be usable without patch 2 and a caller of the
new functions added in patch 2, so it is rather impossible to review.


Hi Christoph:
 OK. I will merge the two patches and add a caller patch. The motivation
is to avoid the global spin lock taken when devices use the swiotlb bounce
buffer, which introduces overhead in high-throughput cases. In my test
environment, the current code achieves about 24 Gb/s of network throughput
with SWIOTLB forced on, versus about 40 Gb/s without SWIOTLB force.
Storage has the same issue.
 Per-device IO TLB mem may resolve the global spin lock contention
among devices, but a device may still have multiple queues, and those
queues would still share one spin lock. That is why the previous patches
introduce child IO TLB mems (i.e., IO TLB areas): each device queue gets
a separate child IO TLB mem with its own spin lock to manage its IO TLB
buffers.
 Otherwise, the global spin lock still costs CPU time under high
throughput, on top of the performance regression, since each device queue
spins on a different CPU to acquire the global lock. Child IO TLB mem
should also resolve that CPU overhead.



Also:

  1) why is SEV/TDX so different from other cases that need bounce
 buffering to treat it differently and we can't work on a general
 scalability improvement


Other cases also have the global spin lock issue, but it depends on
whether they hit the bottleneck; if not, the CPU usage cost may be negligible.


  2) per previous discussions about how swiotlb itself works, it is
 clear that another option is to just make pages we DMA to
 shared with the hypervisor.  Why don't we try that at least
 for larger I/O?


For confidential VMs (both TDX and SEV), we need the bounce
buffer to copy between private memory, which the hypervisor
can't access directly, and shared memory. For security reasons,
a confidential VM should not share the IO stack's DMA pages
with the hypervisor directly, to avoid attacks from the
hypervisor while the IO stack handles the DMA data.



Re: [RFC PATCH V2 1/2] swiotlb: Add Child IO TLB mem support

2022-05-16 Thread Christoph Hellwig
I don't really understand how 'childs' fit in here.  The code also
doesn't seem to be usable without patch 2 and a caller of the
new functions added in patch 2, so it is rather impossible to review.

Also:

 1) why is SEV/TDX so different from other cases that need bounce
buffering to treat it differently and we can't work on a general
scalability improvement
 2) per previous discussions about how swiotlb itself works, it is
clear that another option is to just make pages we DMA to
shared with the hypervisor.  Why don't we try that at least
for larger I/O?


[RFC PATCH V2 1/2] swiotlb: Add Child IO TLB mem support

2022-05-02 Thread Tianyu Lan
From: Tianyu Lan 

Traditionally swiotlb was not performance critical because it was only
used for slow devices. But in some setups, like TDX/SEV confidential
guests, all IO has to go through swiotlb. Currently swiotlb only has a
single lock. Under high IO load with multiple CPUs this can lead to
significant lock contention on the swiotlb lock.

This patch adds child IO TLB mem support to reduce spinlock contention
among a device's queues. Each device may allocate its own IO TLB mem and
set up child IO TLB mems according to its queue count. The swiotlb code
then allocates bounce buffers from the child IO TLB mems in round-robin
fashion.

Signed-off-by: Tianyu Lan 
---
 include/linux/swiotlb.h |  7 +++
 kernel/dma/swiotlb.c| 97 -
 2 files changed, 94 insertions(+), 10 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 7ed35dd3de6e..4a3f6a7b4b7e 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -89,6 +89,9 @@ extern enum swiotlb_force swiotlb_force;
  * @late_alloc:%true if allocated using the page allocator
  * @force_bounce: %true if swiotlb bouncing is forced
  * @for_alloc:  %true if the pool is used for memory allocation
+ * @child_nslot:The number of IO TLB slot in the child IO TLB mem.
+ * @num_child:  The child io tlb mem number in the pool.
+ * @child_start:The child index to start searching in the next round.
  */
 struct io_tlb_mem {
phys_addr_t start;
@@ -102,6 +105,10 @@ struct io_tlb_mem {
bool late_alloc;
bool force_bounce;
bool for_alloc;
+   unsigned int num_child;
+   unsigned int child_nslot;
+   unsigned int child_start;
+   struct io_tlb_mem *child;
struct io_tlb_slot {
phys_addr_t orig_addr;
size_t alloc_size;
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index e2ef0864eb1e..32e8f42530b6 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -207,6 +207,26 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
mem->force_bounce = true;
 
spin_lock_init(&mem->lock);
+
+   if (mem->num_child) {
+   mem->child_nslot = nslabs / mem->num_child;
+   mem->child_start = 0;
+
+   /*
+* Initialize child IO TLB mem, divide IO TLB pool
+* into child number. Reuse parent mem->slot in the
+* child mem->slot.
+*/
+   for (i = 0; i < mem->num_child; i++) {
+   mem->child[i].slots = mem->slots + i * mem->child_nslot;
+   mem->child[i].num_child = 0;
+
+   swiotlb_init_io_tlb_mem(&mem->child[i],
+   start + ((i * mem->child_nslot) << IO_TLB_SHIFT),
+   mem->child_nslot, late_alloc);
+   }
+   }
+
for (i = 0; i < mem->nslabs; i++) {
mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i);
mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
@@ -336,16 +356,18 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
 
mem->slots = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
get_order(array_size(sizeof(*mem->slots), nslabs)));
-   if (!mem->slots) {
-   free_pages((unsigned long)vstart, order);
-   return -ENOMEM;
-   }
+   if (!mem->slots)
+   goto error_slots;
 
set_memory_decrypted((unsigned long)vstart, bytes >> PAGE_SHIFT);
swiotlb_init_io_tlb_mem(mem, virt_to_phys(vstart), nslabs, true);
 
swiotlb_print_info();
return 0;
+
+error_slots:
+   free_pages((unsigned long)vstart, order);
+   return -ENOMEM;
 }
 
 void __init swiotlb_exit(void)
@@ -483,10 +505,11 @@ static unsigned int wrap_index(struct io_tlb_mem *mem, unsigned int index)
  * Find a suitable number of IO TLB entries size that will fit this request and
  * allocate a buffer from that IO TLB pool.
  */
-static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
- size_t alloc_size, unsigned int alloc_align_mask)
+static int swiotlb_do_find_slots(struct io_tlb_mem *mem,
+struct device *dev, phys_addr_t orig_addr,
+size_t alloc_size,
+unsigned int alloc_align_mask)
 {
-   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
unsigned long boundary_mask = dma_get_seg_boundary(dev);
dma_addr_t tbl_dma_addr =
phys_to_dma_unencrypted(dev, mem->start) & boundary_mask;
@@ -565,6 +588,46 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
return index;
 }
 
+static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
+ size_t alloc_size, unsigned int alloc_align_mask)
+{
+   struct io_tlb_mem *mem = dev->dm