On Mon, Apr 25, 2022 at 10:38:06AM +0200, Jan Beulich wrote:
> No separate feature flags exist which would control availability of
> these; the only restriction is HATS (establishing the maximum number of
> page table levels in general), and even that has a lower bound of 4.
> Thus we can unconditionally announce 2M, 1G, and 512G mappings. (Via
> non-default page sizes the implementation in principle permits arbitrary
> size mappings, but these require multiple identical leaf PTEs to be
> written, which isn't all that different from having to write multiple
> consecutive PTEs with increasing frame numbers. IMO that's therefore
> beneficial only on hardware where suitable TLBs exist; I'm unaware of
> such hardware.)
>
> Signed-off-by: Jan Beulich <jbeul...@suse.com>
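
For anyone reading along, the sizes being announced here correspond
directly to leaf PTEs at levels 1-4 of the IOMMU page tables. Just to
spell out the arithmetic the patch relies on (an illustration only,
assuming PTE_PER_TABLE_SHIFT == 9 as in iommu_map.c):

    /*
     * Illustration, not part of the patch: how a mapping order selects
     * the page-table level that will hold the leaf PTE.
     *
     *   order  0  ->  level 1  (4k)
     *   order  9  ->  level 2  (2M)
     *   order 18  ->  level 3  (1G)
     *   order 27  ->  level 4  (512G)
     */
    unsigned int level = (order / PTE_PER_TABLE_SHIFT) + 1;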

Reviewed-by: Roger Pau Monné <roger....@citrix.com>

> ---
> I'm not fully sure about allowing 512G mappings: The scheduling-for-
> freeing of intermediate page tables would take quite a while when
> replacing a tree of 4k mappings by a single 512G one. Yet then again
> there's no present code path via which 512G chunks of memory could be
> allocated (and hence mapped) anyway, so this would only benefit huge
> systems where 512 1G mappings could be re-coalesced (once suitable code
> is in place) into a single L4 entry. And re-coalescing wouldn't result
> in scheduling-for-freeing of full trees of lower level pagetables.

I would think part of this should go into the commit message, to explain
why enabling 512G superpages is fine.

> ---
> v4: Change type of queue_free_pt()'s 1st parameter. Re-base.
> v3: Rename queue_free_pt()'s last parameter. Replace "level > 1" checks
>     where possible.
>
> --- a/xen/drivers/passthrough/amd/iommu_map.c
> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> @@ -32,12 +32,13 @@ static unsigned int pfn_to_pde_idx(unsig
>  }
>  
>  static union amd_iommu_pte clear_iommu_pte_present(unsigned long l1_mfn,
> -                                                   unsigned long dfn)
> +                                                   unsigned long dfn,
> +                                                   unsigned int level)
>  {
>      union amd_iommu_pte *table, *pte, old;
>  
>      table = map_domain_page(_mfn(l1_mfn));
> -    pte = &table[pfn_to_pde_idx(dfn, 1)];
> +    pte = &table[pfn_to_pde_idx(dfn, level)];
>      old = *pte;
>  
>      write_atomic(&pte->raw, 0);
> @@ -351,11 +352,32 @@ static int iommu_pde_from_dfn(struct dom
>      return 0;
>  }
>  
> +static void queue_free_pt(struct domain_iommu *hd, mfn_t mfn, unsigned int level)
> +{
> +    if ( level > 1 )
> +    {
> +        union amd_iommu_pte *pt = map_domain_page(mfn);
> +        unsigned int i;
> +
> +        for ( i = 0; i < PTE_PER_TABLE_SIZE; ++i )
> +            if ( pt[i].pr && pt[i].next_level )
> +            {
> +                ASSERT(pt[i].next_level < level);
> +                queue_free_pt(hd, _mfn(pt[i].mfn), pt[i].next_level);
> +            }
> +
> +        unmap_domain_page(pt);
> +    }
> +
> +    iommu_queue_free_pgtable(hd, mfn_to_page(mfn));
> +}
> +
>  int cf_check amd_iommu_map_page(
>      struct domain *d, dfn_t dfn, mfn_t mfn, unsigned int flags,
>      unsigned int *flush_flags)
>  {
>      struct domain_iommu *hd = dom_iommu(d);
> +    unsigned int level = (IOMMUF_order(flags) / PTE_PER_TABLE_SHIFT) + 1;
>      int rc;
>      unsigned long pt_mfn = 0;
>      union amd_iommu_pte old;
> @@ -384,7 +406,7 @@ int cf_check amd_iommu_map_page(
>          return rc;
>      }
>  

I think it might be helpful to assert or otherwise check that the input
order is supported by the IOMMU, just to be on the safe side.
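
Something along these lines is what I have in mind, purely as a sketch
using the names already visible in the hunks above; whether a plain
ASSERT() is sufficient or a graceful error return is preferable I'll
leave to you:

    /*
     * Sketch only: the orders the AMD IOMMU can install as a single leaf
     * PTE are exactly the multiples of PTE_PER_TABLE_SHIFT, up to the
     * 512G (level 4) leaves announced via .page_sizes.
     */
    ASSERT(!(IOMMUF_order(flags) % PTE_PER_TABLE_SHIFT));
    ASSERT(IOMMUF_order(flags) <= 3 * PTE_PER_TABLE_SHIFT);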

> -    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, true) ||
> +    if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags, true) ||
>           !pt_mfn )
>      {
>          spin_unlock(&hd->arch.mapping_lock);
> @@ -394,8 +416,8 @@ int cf_check amd_iommu_map_page(
>          return -EFAULT;
>      }
>  
> -    /* Install 4k mapping */
> -    old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), 1,
> +    /* Install mapping */
> +    old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), level,
>                                  (flags & IOMMUF_writable),
>                                  (flags & IOMMUF_readable));
>  
> @@ -403,8 +425,13 @@
>  
>      *flush_flags |= IOMMU_FLUSHF_added;
>      if ( old.pr )
> +    {
>          *flush_flags |= IOMMU_FLUSHF_modified;
>  
> +        if ( IOMMUF_order(flags) && old.next_level )
> +            queue_free_pt(hd, _mfn(old.mfn), old.next_level);
> +    }
> +
>      return 0;
>  }
>  
> @@ -413,6 +440,7 @@ int cf_check amd_iommu_unmap_page(
>  {
>      unsigned long pt_mfn = 0;
>      struct domain_iommu *hd = dom_iommu(d);
> +    unsigned int level = (order / PTE_PER_TABLE_SHIFT) + 1;
>      union amd_iommu_pte old = {};
>  
>      spin_lock(&hd->arch.mapping_lock);
> @@ -423,7 +451,7 @@ int cf_check amd_iommu_unmap_page(
>          return 0;
>      }
>  
> -    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, false) )
> +    if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags, false) )
>      {
>          spin_unlock(&hd->arch.mapping_lock);
>          AMD_IOMMU_ERROR("invalid IO pagetable entry dfn = %"PRI_dfn"\n",
> @@ -435,14 +463,19 @@ int cf_check amd_iommu_unmap_page(
>      if ( pt_mfn )
>      {
>          /* Mark PTE as 'page not present'. */
> -        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn));
> +        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level);
>      }
>  
>      spin_unlock(&hd->arch.mapping_lock);
>  
>      if ( old.pr )
> +    {
>          *flush_flags |= IOMMU_FLUSHF_modified;
>  
> +        if ( order && old.next_level )
> +            queue_free_pt(hd, _mfn(old.mfn), old.next_level);
> +    }
> +
>      return 0;
>  }
>  
> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -747,7 +747,7 @@ static void cf_check amd_dump_page_table
>  }
>  
>  static const struct iommu_ops __initconst_cf_clobber _iommu_ops = {
> -    .page_sizes = PAGE_SIZE_4K,
> +    .page_sizes = PAGE_SIZE_4K | PAGE_SIZE_2M | PAGE_SIZE_1G | PAGE_SIZE_512G,

As mentioned in a previous email, I'm worried about what happens if we
ever get to replace an entry populated with 4K pages with a single 512G
superpage, as the freeing cost of the associated pagetables would be
quite high. I guess we will have to implement a more preemptive freeing
behavior if issues arise.

Thanks, Roger.