On 03.05.2022 16:49, Roger Pau Monné wrote:
> On Mon, Apr 25, 2022 at 10:34:59AM +0200, Jan Beulich wrote:
>> For large page mappings to be easily usable (i.e. in particular without
>> un-shattering of smaller page mappings) and for mapping operations to
>> then also be more efficient, pass batches of Dom0 memory to iommu_map().
>> In dom0_construct_pv() and its helpers (covering strict mode) this
>> additionally requires establishing the type of those pages (albeit with
>> zero type references).
> 
> I think it's possible I've already asked this.  Would it make sense to
> add the IOMMU mappings in alloc_domheap_pages(), maybe by passing a
> specific flag?

I don't think you did ask, but now that you do: This would look like a
layering violation to me. I don't think allocation should ever have
mapping (into the IOMMU or elsewhere) as a "side effect", no matter
that ...

> It would seem to me that doing it that way would also allow the
> mappings to get established in blocks for domUs.

... then this would perhaps be possible.
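
(Purely to make that idea concrete - a hypothetical sketch, not
something I'd propose; MEMF_iommu_map is invented here, no such flag
exists:)

    struct page_info *pg = alloc_domheap_pages(d, order,
                                               memflags | MEMF_iommu_map);

    /* ... and inside the allocator, as a side effect of success: */
    if ( pg && (memflags & MEMF_iommu_map) )
        rc = iommu_map(d, _dfn(mfn_x(page_to_mfn(pg))), page_to_mfn(pg),
                       1ul << order, IOMMUF_readable | IOMMUF_writable,
                       &flush_flags);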

>> The installing of zero-ref writable types has in fact shown (observed
>> while putting together the change) that, despite the intention of the
>> XSA-288 changes (affecting DomU-s only), for Dom0 a number of
>> sufficiently ordinary pages (at the very least initrd and P2M ones, as
>> well as pages that are part of the initial allocation but not part of
>> the initial mapping) still start out as PGT_none, meaning they would
>> gain IOMMU mappings only the first time they get mapped writably.
>> Consequently an open question is whether iommu_memory_setup() should
>> set the pages to PGT_writable_page independent of
>> need_iommu_pt_sync().
> 
> I think I'm confused: doesn't the setting of PGT_writable_page happen
> as a result of need_iommu_pt_sync() and having those pages added to
> the IOMMU page tables? (so they can be properly tracked and IOMMU
> mappings are removed if the page is also removed)

In principle yes - in guest_physmap_add_page(). But this function isn't
called for the pages I did enumerate in the remark. XSA-288 really only
cared about getting this right for DomU-s.

> If the pages are not added here (because dom0 is not running in strict
> mode) then setting PGT_writable_page is not required?

Correct - in that case we skip fiddling with IOMMU mappings on
transitions to/from PGT_writable_page, and hence putting this type in
place would be benign (while improving consistency).
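
Roughly speaking (simplified and from memory - see the
PGT_writable_page handling in _get_page_type() in xen/arch/x86/mm.c),
the logic gated by need_iommu_pt_sync() is:

    if ( unlikely(need_iommu_pt_sync(d)) )
    {
        mfn_t mfn = page_to_mfn(page);

        /* x: old type_info; type: the type being acquired. */
        if ( (x & PGT_type_mask) == PGT_writable_page )
            rc = iommu_unmap(d, _dfn(mfn_x(mfn)), 1, &flush_flags);
        else if ( (type & PGT_type_mask) == PGT_writable_page )
            rc = iommu_map(d, _dfn(mfn_x(mfn)), mfn, 1,
                           IOMMUF_readable | IOMMUF_writable,
                           &flush_flags);
    }

With need_iommu_pt_sync(d) false neither branch is reached, so a page
already holding PGT_writable_page simply changes nothing here.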

>> Note also that strictly speaking the iommu_iotlb_flush_all() here (as
>> well as the pre-existing one in arch_iommu_hwdom_init()) shouldn't be
>> needed: Actual hooking up (AMD) or enabling of translation (VT-d)
>> occurs only afterwards anyway, so nothing can have made it into TLBs
>> just yet.
> 
> Hm, indeed. I think the one in arch_iommu_hwdom_init can surely go
> away, as we must strictly do the hwdom init before enabling the iommu
> itself.

Why would that be? That's imo as much of an implementation detail as
...

> The one in the dom0 build I'm less convinced about - best to be on the
> safe side in case we ever change the order of IOMMU init and memory
> setup.

... this. Just like we populate tables with the IOMMU already enabled
for DomU-s, I think the same would be valid to do for Dom0.

> I would expect flushing an empty TLB to not be very expensive?

I wouldn't "expect" this - it might be this way, but it surely depends
on whether an implementation can easily tell whether the TLB is empty,
and whether its emptiness actually makes a difference for the latency
of a flush operation.

>> --- a/xen/drivers/passthrough/x86/iommu.c
>> +++ b/xen/drivers/passthrough/x86/iommu.c
>> @@ -347,8 +347,8 @@ static unsigned int __hwdom_init hwdom_i
>>  
>>  void __hwdom_init arch_iommu_hwdom_init(struct domain *d)
>>  {
>> -    unsigned long i, top, max_pfn;
>> -    unsigned int flush_flags = 0;
>> +    unsigned long i, top, max_pfn, start, count;
>> +    unsigned int flush_flags = 0, start_perms = 0;
>>  
>>      BUG_ON(!is_hardware_domain(d));
>>  
>> @@ -379,9 +379,9 @@ void __hwdom_init arch_iommu_hwdom_init(
>>       * First Mb will get mapped in one go by pvh_populate_p2m(). Avoid
>>       * setting up potentially conflicting mappings here.
>>       */
>> -    i = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
>> +    start = paging_mode_translate(d) ? PFN_DOWN(MB(1)) : 0;
>>  
>> -    for ( ; i < top; i++ )
>> +    for ( i = start, count = 0; i < top; )
>>      {
>>          unsigned long pfn = pdx_to_pfn(i);
>>          unsigned int perms = hwdom_iommu_map(d, pfn, max_pfn);
>> @@ -390,20 +390,41 @@ void __hwdom_init arch_iommu_hwdom_init(
>>          if ( !perms )
>>              rc = 0;
>>          else if ( paging_mode_translate(d) )
>> +        {
>>              rc = p2m_add_identity_entry(d, pfn,
>>                                          perms & IOMMUF_writable ? p2m_access_rw
>>                                                                  : p2m_access_r,
>>                                          0);
>> +            if ( rc )
>> +                printk(XENLOG_WARNING
>> +                       "%pd: identity mapping of %lx failed: %d\n",
>> +                       d, pfn, rc);
>> +        }
>> +        else if ( pfn != start + count || perms != start_perms )
>> +        {
>> +        commit:
>> +            rc = iommu_map(d, _dfn(start), _mfn(start), count, start_perms,
>> +                           &flush_flags);
>> +            if ( rc )
>> +                printk(XENLOG_WARNING
>> +                       "%pd: IOMMU identity mapping of [%lx,%lx) failed: %d\n",
>> +                       d, pfn, pfn + count, rc);
>> +            SWAP(start, pfn);
>> +            start_perms = perms;
>> +            count = 1;
>> +        }
>>          else
>> -            rc = iommu_map(d, _dfn(pfn), _mfn(pfn), 1ul << PAGE_ORDER_4K,
>> -                           perms, &flush_flags);
>> +        {
>> +            ++count;
>> +            rc = 0;
> 
> Seeing as we want to process this in blocks now, I wonder whether it
> would make sense to take a different approach, and use a rangeset to
> track which regions need to be mapped.  What gets added would be based
> on the host e820 plus the options
> iommu_hwdom_{strict,inclusive,reserved}.  We would then punch holes
> based on the logic in hwdom_iommu_map() and finally we could iterate
> over the regions afterwards using rangeset_consume_ranges().
> 
> Not that you strictly need to do it here; I just think the end result
> would be clearer.

The end result might indeed be clearer, but it would mean more of a
change right here. Hence I'd prefer to leave that out of the series
for now.
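
For the record, my understanding of the suggested shape (a hypothetical
sketch only - rangeset_new(), rangeset_add_range(),
rangeset_remove_range() and rangeset_consume_ranges() all exist in
xen/include/xen/rangeset.h, but the consume callback signature here is
from memory, and start_pfn/end_pfn/hole_start/hole_end are mere
placeholders):

    struct map_data {
        struct domain *d;
        unsigned int flush_flags;
    };

    static int __hwdom_init map_cb(unsigned long s, unsigned long e,
                                   void *arg, unsigned long *c)
    {
        struct map_data *data = arg;

        *c = e - s + 1; /* consume the whole range in one go */
        return iommu_map(data->d, _dfn(s), _mfn(s), e - s + 1,
                         IOMMUF_readable | IOMMUF_writable,
                         &data->flush_flags);
    }

    /* In arch_iommu_hwdom_init(): */
    struct map_data data = { .d = d };
    struct rangeset *map = rangeset_new(d, "iommu-hwdom", 0);

    /* 1) Seed from the host E820 plus iommu_hwdom_* adjustments. */
    rc = rangeset_add_range(map, start_pfn, end_pfn);
    /* 2) Punch holes wherever hwdom_iommu_map() refuses a mapping. */
    rc = rangeset_remove_range(map, hole_start, hole_end);
    /* 3) Map what remains, one maximal contiguous block at a time. */
    rc = rangeset_consume_ranges(map, map_cb, &data);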

Jan

