Re: [PATCH v2 12/18] AMD/IOMMU: allow use of superpage mappings

2021-12-13 Thread Jan Beulich
On 13.12.2021 11:33, Roger Pau Monné wrote:
> On Mon, Dec 13, 2021 at 11:00:23AM +0100, Jan Beulich wrote:
>> On 13.12.2021 10:45, Roger Pau Monné wrote:
>>> It would be better if we could somehow account this in a per-vCPU way,
>>> kind of similar to what we do with vPCI BAR mappings.
>>
>> But recording them per-vCPU wouldn't make any difference to the
>> number of pages that could accumulate in a single run. Maybe I'm
>> missing something in what you're thinking about here ...
> 
> If Xen somehow did the free in guest vCPU context before resuming
> guest execution then you could yield when events are pending and thus
> allow other guests to run without hogging the pCPU, and the freeing
> would be accounted to vCPU sched slice time.

That's an interesting thought. Shouldn't be difficult to arrange for
HVM (from {svm,vmx}_vmenter_helper()), but I can't immediately see a
clean way of having the same for PV (short of an ad hoc call out of
assembly code somewhere after test_all_events).
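
For illustration, a conceptual sketch of what such deferred, per-vCPU freeing might look like; the structure and the hook below are made up, and the Xen helpers are used loosely rather than as a definitive implementation:

/*
 * Conceptual sketch only, not actual Xen code: page tables replaced by a
 * map/unmap would be queued on a per-vCPU list and freed in that vCPU's
 * context right before resuming the guest (e.g. from the HVM vmenter
 * helpers), so the work is accounted to the vCPU's scheduling slice and
 * can be preempted.
 */
struct vcpu_deferred_frees {
    struct page_list_head pgtables;   /* filled by the IOMMU map/unmap paths */
};

static void drain_deferred_frees(struct vcpu_deferred_frees *df)
{
    struct page_info *pg;

    while ( (pg = page_list_remove_head(&df->pgtables)) != NULL )
    {
        free_domheap_page(pg);

        /* Stop early and reschedule if other work is pending on this pCPU. */
        if ( hypercall_preempt_check() )
        {
            raise_softirq(SCHEDULE_SOFTIRQ);
            break;
        }
    }
}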

Jan




Re: [PATCH v2 12/18] AMD/IOMMU: allow use of superpage mappings

2021-12-13 Thread Roger Pau Monné
On Mon, Dec 13, 2021 at 11:00:23AM +0100, Jan Beulich wrote:
> On 13.12.2021 10:45, Roger Pau Monné wrote:
> > On Mon, Dec 13, 2021 at 09:49:50AM +0100, Jan Beulich wrote:
> >> On 10.12.2021 16:06, Roger Pau Monné wrote:
> >>> On Fri, Sep 24, 2021 at 11:52:14AM +0200, Jan Beulich wrote:
>  ---
>  I'm not fully sure about allowing 512G mappings: The scheduling-for-
>  freeing of intermediate page tables can take quite a while when
>  replacing a tree of 4k mappings by a single 512G one. Plus (or otoh)
>  there's no present code path via which 512G chunks of memory could be
>  allocated (and hence mapped) anyway.
> >>>
> >>> I would limit to 1G, which is what we support for CPU page tables
> >>> also.
> >>
> >> I'm not sure I buy comparing with CPU side support when not sharing
> >> page tables. Not the least with PV in mind.
> > 
> > Hm, my thinking was that similar reasons that don't allow us to do
> > 512G mappings for the CPU side would also apply to IOMMU. Regardless
> > of that, given the current way in which replaced page table entries
> > are freed, I'm not sure it's fine to allow 512G mappings as the
> > freeing of the possible huge amount of 4K entries could allow guests
> > to hog a CPU for a long time.
> 
> This huge amount can occur only when replacing a hierarchy with
> sufficiently many 4k leaves by a single 512G page. Yet there's no
> way - afaics - that such an operation can be initiated right now.
> That's, as said in the remark, because there's no way to allocate
> a 512G chunk of memory in one go. When re-coalescing, the worst
> that can happen is one L1 table worth of 4k mappings, one L2
> table worth of 2M mappings, and one L3 table worth of 1G mappings.
> All other mappings already need to have been superpage ones at the
> time of the checking. Hence the total upper bound (for the
> enclosing map / unmap) is again primarily determined by there not
> being any way to establish 512G mappings.
> 
> Actually, thinking about it, there's one path where 512G mappings
> could be established, but that's Dom0-reachable only
> (XEN_DOMCTL_memory_mapping) and would assume gigantic BARs in a
> PCI device. Even if such a device existed, I think we're fine to
> assume that Dom0 won't establish such mappings to replace
> existing ones, but only ever put them in place when nothing was
> mapped in that range yet.
> 
> > It would be better if we could somehow account this in a per-vCPU way,
> > kind of similar to what we do with vPCI BAR mappings.
> 
> But recording them per-vCPU wouldn't make any difference to the
> number of pages that could accumulate in a single run. Maybe I'm
> missing something in what you're thinking about here ...

If Xen somehow did the free in guest vCPU context before resuming
guest execution then you could yield when events are pending and thus
allow other guests to run without hogging the pCPU, and the freeing
would be accounted to vCPU sched slice time.

Thanks, Roger.



Re: [PATCH v2 12/18] AMD/IOMMU: allow use of superpage mappings

2021-12-13 Thread Jan Beulich
On 13.12.2021 10:45, Roger Pau Monné wrote:
> On Mon, Dec 13, 2021 at 09:49:50AM +0100, Jan Beulich wrote:
>> On 10.12.2021 16:06, Roger Pau Monné wrote:
>>> On Fri, Sep 24, 2021 at 11:52:14AM +0200, Jan Beulich wrote:
 ---
 I'm not fully sure about allowing 512G mappings: The scheduling-for-
 freeing of intermediate page tables can take quite a while when
 replacing a tree of 4k mappings by a single 512G one. Plus (or otoh)
 there's no present code path via which 512G chunks of memory could be
 allocated (and hence mapped) anyway.
>>>
>>> I would limit to 1G, which is what we support for CPU page tables
>>> also.
>>
>> I'm not sure I buy comparing with CPU side support when not sharing
>> page tables. Not the least with PV in mind.
> 
> Hm, my thinking was that similar reasons that don't allow us to do
> 512G mappings for the CPU side would also apply to IOMMU. Regardless
> of that, given the current way in which replaced page table entries
> are freed, I'm not sure it's fine to allow 512G mappings as the
> freeing of the possible huge amount of 4K entries could allow guests
> to hog a CPU for a long time.

This huge amount can occur only when replacing a hierarchy with
sufficiently many 4k leaves by a single 512G page. Yet there's no
way - afaics - that such an operation can be initiated right now.
That's, as said in the remark, because there's no way to allocate
a 512G chunk of memory in one go. When re-coalescing, the worst
that can happen is one L1 table worth of 4k mappings, one L2
table worth of 2M mappings, and one L3 table worth of 1G mappings.
All other mappings already need to have been superpage ones at the
time of the checking. Hence the total upper bound (for the
enclosing map / unmap) is again primarily determined by there not
being any way to establish 512G mappings.
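
To put rough numbers on this (a back-of-the-envelope sketch, not from the thread itself, assuming 512 entries per table level as implied by PTE_PER_TABLE_SHIFT == 9 in the patch below):

/*
 * Replacing a fully 4k-populated range by a single superpage frees the
 * whole subtree of intermediate page tables.
 */
#include <stdio.h>

int main(void)
{
    const unsigned long entries = 512;       /* 1 << PTE_PER_TABLE_SHIFT */

    /* 1G leaf over a 4k-populated range: one L2 table plus its L1 tables. */
    unsigned long freed_1g = 1 + entries;                       /* 513 */

    /* 512G leaf over a 4k-populated range: one L3, its L2s, and their L1s. */
    unsigned long freed_512g = 1 + entries + entries * entries; /* 262657 */

    printf("worst-case tables freed: 1G=%lu, 512G=%lu\n", freed_1g, freed_512g);
    return 0;
}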

Actually, thinking about it, there's one path where 512G mappings
could be established, but that's Dom0-reachable only
(XEN_DOMCTL_memory_mapping) and would assume gigantic BARs in a
PCI device. Even if such a device existed, I think we're fine to
assume that Dom0 won't establish such mappings to replace
existing ones, but only ever put them in place when nothing was
mapped in that range yet.

> It would be better if we could somehow account this in a per-vCPU way,
> kind of similar to what we do with vPCI BAR mappings.

But recording them per-vCPU wouldn't make any difference to the
number of pages that could accumulate in a single run. Maybe I'm
missing something in what you're thinking about here ...

Jan




Re: [PATCH v2 12/18] AMD/IOMMU: allow use of superpage mappings

2021-12-13 Thread Roger Pau Monné
On Mon, Dec 13, 2021 at 09:49:50AM +0100, Jan Beulich wrote:
> On 10.12.2021 16:06, Roger Pau Monné wrote:
> > On Fri, Sep 24, 2021 at 11:52:14AM +0200, Jan Beulich wrote:
> >> ---
> >> I'm not fully sure about allowing 512G mappings: The scheduling-for-
> >> freeing of intermediate page tables can take quite a while when
> >> replacing a tree of 4k mappings by a single 512G one. Plus (or otoh)
> >> there's no present code path via which 512G chunks of memory could be
> >> allocated (and hence mapped) anyway.
> > 
> > I would limit to 1G, which is what we support for CPU page tables
> > also.
> 
> I'm not sure I buy comparing with CPU side support when not sharing
> page tables. Not the least with PV in mind.

Hm, my thinking was that similar reasons that don't allow us to do
512G mappings for the CPU side would also apply to IOMMU. Regardless
of that, given the current way in which replaced page table entries
are freed, I'm not sure it's fine to allow 512G mappings as the
freeing of the possible huge amount of 4K entries could allow guests
to hog a CPU for a long time.

It would be better if we could somehow account this in a per-vCPU way,
kind of similar to what we do with vPCI BAR mappings.

> > Should we also assert that level (or next_level) is always != 0 for
> > extra safety?
> 
> As said elsewhere - if this wasn't a static helper, I'd agree. But all
> call sites have respective conditionals around the call. If anything
> I'd move those checks into the function (but only if you think that
> would improve things, as to me having them at the call sites is more
> logical).

I'm fine with leaving the checks in the callers; it was just a suggestion in
case we gain new callers that forget to do the checks themselves.
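
For illustration, this is roughly what the helper from the patch at the end of this thread would look like with both remarks folded in (last parameter renamed to "level", assertion added); a sketch only, not what was actually posted:

static void queue_free_pt(struct domain *d, mfn_t mfn, unsigned int level)
{
    ASSERT(level);              /* leaf (level 0) entries are never queued */

    if ( level > 1 )
    {
        union amd_iommu_pte *pt = map_domain_page(mfn);
        unsigned int i;

        for ( i = 0; i < PTE_PER_TABLE_SIZE; ++i )
            if ( pt[i].pr && pt[i].next_level )
            {
                ASSERT(pt[i].next_level < level);
                queue_free_pt(d, _mfn(pt[i].mfn), pt[i].next_level);
            }

        unmap_domain_page(pt);
    }

    iommu_queue_free_pgtable(d, mfn_to_page(mfn));
}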

Thanks, Roger.



Re: [PATCH v2 12/18] AMD/IOMMU: allow use of superpage mappings

2021-12-13 Thread Jan Beulich
On 10.12.2021 16:06, Roger Pau Monné wrote:
> On Fri, Sep 24, 2021 at 11:52:14AM +0200, Jan Beulich wrote:
>> ---
>> I'm not fully sure about allowing 512G mappings: The scheduling-for-
>> freeing of intermediate page tables can take quite a while when
>> replacing a tree of 4k mappings by a single 512G one. Plus (or otoh)
>> there's no present code path via which 512G chunks of memory could be
>> allocated (and hence mapped) anyway.
> 
> I would limit to 1G, which is what we support for CPU page tables
> also.

I'm not sure I buy comparing with CPU side support when not sharing
page tables. Not the least with PV in mind.

>> @@ -288,10 +289,31 @@ static int iommu_pde_from_dfn(struct dom
>>  return 0;
>>  }
>>  
>> +static void queue_free_pt(struct domain *d, mfn_t mfn, unsigned int next_level)
> 
> Nit: should the last parameter be named level rather than next_level?
> AFAICT it's the level of the mfn parameter.

Yeah, might make sense.

> Should we also assert that level (or next_level) is always != 0 for
> extra safety?

As said elsewhere - if this wasn't a static helper, I'd agree. But all
call sites have respective conditionals around the call. If anything
I'd move those checks into the function (but only if you think that
would improve things, as to me having them at the call sites is more
logical).

Jan




Re: [PATCH v2 12/18] AMD/IOMMU: allow use of superpage mappings

2021-12-10 Thread Roger Pau Monné
On Fri, Sep 24, 2021 at 11:52:14AM +0200, Jan Beulich wrote:
> No separate feature flags exist which would control availability of
> these; the only restriction is HATS (establishing the maximum number of
> page table levels in general), and even that has a lower bound of 4.
> Thus we can unconditionally announce 2M, 1G, and 512G mappings. (Via
> non-default page sizes the implementation in principle permits arbitrary
> size mappings, but these require multiple identical leaf PTEs to be
> written, which isn't all that different from having to write multiple
> consecutive PTEs with increasing frame numbers. IMO that's therefore
> beneficial only on hardware where suitable TLBs exist; I'm unaware of
> such hardware.)

Also replacing/shattering such non-standard page sizes will require
more logic, so unless there's a performance benefit I would just skip
using them.

> 
> Signed-off-by: Jan Beulich 
> ---
> I'm not fully sure about allowing 512G mappings: The scheduling-for-
> freeing of intermediate page tables can take quite a while when
> replacing a tree of 4k mappings by a single 512G one. Plus (or otoh)
> there's no present code path via which 512G chunks of memory could be
> allocated (and hence mapped) anyway.

I would limit to 1G, which is what we support for CPU page tables
also.

> --- a/xen/drivers/passthrough/amd/iommu_map.c
> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> @@ -32,12 +32,13 @@ static unsigned int pfn_to_pde_idx(unsig
>  }
>  
>  static union amd_iommu_pte clear_iommu_pte_present(unsigned long l1_mfn,
> -   unsigned long dfn)
> +   unsigned long dfn,
> +   unsigned int level)
>  {
>  union amd_iommu_pte *table, *pte, old;
>  
>  table = map_domain_page(_mfn(l1_mfn));
> -pte = &table[pfn_to_pde_idx(dfn, 1)];
> +pte = &table[pfn_to_pde_idx(dfn, level)];
>  old = *pte;
>  
>  write_atomic(&pte->raw, 0);
> @@ -288,10 +289,31 @@ static int iommu_pde_from_dfn(struct dom
>  return 0;
>  }
>  
> +static void queue_free_pt(struct domain *d, mfn_t mfn, unsigned int next_level)

Nit: should the last parameter be named level rather than next_level?
AFAICT it's the level of the mfn parameter.

Should we also assert that level (or next_level) is always != 0 for
extra safety?

Thanks, Roger.



[PATCH v2 12/18] AMD/IOMMU: allow use of superpage mappings

2021-09-24 Thread Jan Beulich
No separate feature flags exist which would control availability of
these; the only restriction is HATS (establishing the maximum number of
page table levels in general), and even that has a lower bound of 4.
Thus we can unconditionally announce 2M, 1G, and 512G mappings. (Via
non-default page sizes the implementation in principle permits arbitrary
size mappings, but these require multiple identical leaf PTEs to be
written, which isn't all that different from having to write multiple
consecutive PTEs with increasing frame numbers. IMO that's therefore
beneficial only on hardware where suitable TLBs exist; I'm unaware of
such hardware.)

Signed-off-by: Jan Beulich 
---
I'm not fully sure about allowing 512G mappings: The scheduling-for-
freeing of intermediate page tables can take quite a while when
replacing a tree of 4k mappings by a single 512G one. Plus (or otoh)
there's no present code path via which 512G chunks of memory could be
allocated (and hence mapped) anyway.
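
For reference, a standalone illustration (not part of the patch) of how a mapping order translates to an AMD IOMMU page-table level with 9 bits per level, matching the "order / PTE_PER_TABLE_SHIFT + 1" computation in the hunks below; the constants and output are illustrative only:

#include <stdio.h>

#define PTE_PER_TABLE_SHIFT 9   /* 512 entries per page table */

int main(void)
{
    const unsigned int orders[] = { 0, 9, 18, 27 };    /* 4k, 2M, 1G, 512G */
    unsigned int i;

    for ( i = 0; i < sizeof(orders) / sizeof(orders[0]); ++i )
    {
        unsigned int level = orders[i] / PTE_PER_TABLE_SHIFT + 1;
        unsigned long long bytes = 4096ULL << orders[i];

        printf("order %2u -> level %u leaf, %llu bytes\n",
               orders[i], level, bytes);
    }

    return 0;
}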

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -32,12 +32,13 @@ static unsigned int pfn_to_pde_idx(unsig
 }
 
 static union amd_iommu_pte clear_iommu_pte_present(unsigned long l1_mfn,
-   unsigned long dfn)
+   unsigned long dfn,
+   unsigned int level)
 {
 union amd_iommu_pte *table, *pte, old;
 
 table = map_domain_page(_mfn(l1_mfn));
-pte = &table[pfn_to_pde_idx(dfn, 1)];
+pte = &table[pfn_to_pde_idx(dfn, level)];
 old = *pte;
 
 write_atomic(&pte->raw, 0);
@@ -288,10 +289,31 @@ static int iommu_pde_from_dfn(struct dom
 return 0;
 }
 
+static void queue_free_pt(struct domain *d, mfn_t mfn, unsigned int next_level)
+{
+if ( next_level > 1 )
+{
+union amd_iommu_pte *pt = map_domain_page(mfn);
+unsigned int i;
+
+for ( i = 0; i < PTE_PER_TABLE_SIZE; ++i )
+if ( pt[i].pr && pt[i].next_level )
+{
+ASSERT(pt[i].next_level < next_level);
+queue_free_pt(d, _mfn(pt[i].mfn), pt[i].next_level);
+}
+
+unmap_domain_page(pt);
+}
+
+iommu_queue_free_pgtable(d, mfn_to_page(mfn));
+}
+
 int amd_iommu_map_page(struct domain *d, dfn_t dfn, mfn_t mfn,
unsigned int flags, unsigned int *flush_flags)
 {
 struct domain_iommu *hd = dom_iommu(d);
+unsigned int level = (IOMMUF_order(flags) / PTE_PER_TABLE_SHIFT) + 1;
 int rc;
 unsigned long pt_mfn = 0;
 union amd_iommu_pte old;
@@ -320,7 +342,7 @@ int amd_iommu_map_page(struct domain *d,
 return rc;
 }
 
-if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, true) ||
+if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags, true) ||
  !pt_mfn )
 {
 spin_unlock(&hd->arch.mapping_lock);
@@ -330,8 +352,8 @@ int amd_iommu_map_page(struct domain *d,
 return -EFAULT;
 }
 
-/* Install 4k mapping */
-old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), 1,
+/* Install mapping */
+old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), level,
 (flags & IOMMUF_writable),
 (flags & IOMMUF_readable));
 
@@ -339,8 +361,13 @@ int amd_iommu_map_page(struct domain *d,
 
 *flush_flags |= IOMMU_FLUSHF_added;
 if ( old.pr )
+{
 *flush_flags |= IOMMU_FLUSHF_modified;
 
+if ( level > 1 && old.next_level )
+queue_free_pt(d, _mfn(old.mfn), old.next_level);
+}
+
 return 0;
 }
 
@@ -349,6 +376,7 @@ int amd_iommu_unmap_page(struct domain *
 {
 unsigned long pt_mfn = 0;
 struct domain_iommu *hd = dom_iommu(d);
+unsigned int level = (order / PTE_PER_TABLE_SHIFT) + 1;
 union amd_iommu_pte old = {};
 
 spin_lock(&hd->arch.mapping_lock);
@@ -359,7 +387,7 @@ int amd_iommu_unmap_page(struct domain *
 return 0;
 }
 
-if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, false) )
+if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags, false) )
 {
 spin_unlock(&hd->arch.mapping_lock);
 AMD_IOMMU_DEBUG("Invalid IO pagetable entry dfn = %"PRI_dfn"\n",
@@ -371,14 +399,19 @@ int amd_iommu_unmap_page(struct domain *
 if ( pt_mfn )
 {
 /* Mark PTE as 'page not present'. */
-old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn));
+old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level);
 }
 
 spin_unlock(&hd->arch.mapping_lock);
 
 if ( old.pr )
+{
 *flush_flags |= IOMMU_FLUSHF_modified;
 
+if ( level > 1 && old.next_level )
+queue_free_pt(d, _mfn(old.mfn), old.next_level);
+}
+
 return 0;
 }
 
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -630,7 +630,7 @@