On Wed, May 20, 2026 at 07:26:24PM +0000, Michael Kelley wrote:
> From: Yu Zhang <[email protected]> Sent: Wednesday, May 20, 2026 
> 10:15 AM
> > 
> > On Fri, May 15, 2026 at 07:35:45PM -0300, Jason Gunthorpe wrote:
> > > On Tue, May 12, 2026 at 12:24:08AM +0800, Yu Zhang wrote:
> > > > +static inline u16 hv_iommu_fill_iova_list(union hv_iommu_flush_va 
> > > > *iova_list,
> > > > +                                         unsigned long start,
> > > > +                                         unsigned long end)
> > > > +{
> > > > +       unsigned long start_pfn = start >> PAGE_SHIFT;
> > > > +       unsigned long end_pfn = PAGE_ALIGN(end) >> PAGE_SHIFT;
> > > > +       unsigned long nr_pages = end_pfn - start_pfn;
> > > > +       u16 count = 0;
> > > > +
> > > > +       while (nr_pages > 0) {
> > > > +               unsigned long flush_pages;
> > > > +               int order;
> > > > +               unsigned long pfn_align;
> > > > +               unsigned long size_align;
> > > > +
> > > > +               if (count >= HV_IOMMU_MAX_FLUSH_VA_COUNT) {
> > > > +                       count = HV_IOMMU_FLUSH_VA_OVERFLOW;
> > > > +                       break;
> > > > +               }
> > > > +
> > > > +               if (start_pfn)
> > > > +                       pfn_align = __ffs(start_pfn);
> > > > +               else
> > > > +                       pfn_align = BITS_PER_LONG - 1;
> > > > +
> > > > +               size_align = __fls(nr_pages);
> > > > +               order = min(pfn_align, size_align);
> > > > +               iova_list[count].page_mask_shift = order;
> > > > +               iova_list[count].page_number = start_pfn;
> > > > +
> > > > +               flush_pages = 1UL << order;
> > > > +               start_pfn += flush_pages;
> > > > +               nr_pages -= flush_pages;
> > > > +               count++;
> > > > +       }
> > >
> > > This seems like a really silly hypervisor interface. Why doesn't it
> > > just accept a normal range? Splitting it into power of two aligned
> > > ranges is very inefficient.
> > 
> > Fair point. I'm not sure how much flexibility we have to change
> > this hypercall interface at the moment - it predates the pvIOMMU
> > work and may have other consumers beyond Linux guest. On the other
> > hand, having the guest specify 2^N-aligned blocks does save the
> > hypervisor from having to decompose ranges itself before issuing
> > hardware invalidation commands - the guest-provided entries can be
> > fed to the HW more or less directly.
> > 
> > That said, the way I'm currently using this interface may be
> > more precise than necessary. Maybe we have 2 options:
> > 
> > 1) Current approach: decompose the range into multiple exact
> >    2^N-aligned blocks with no over-flush, but at the cost of
> >    more complex calculations and more entries.
> > 
> > 2) Follow what Intel/AMD drivers do: find a single minimal
> >    2^N-aligned block that covers the entire range, but may
> >    over-flush.
> > 
> > Any preference?
> > 
> > @Michael, since you've also been reviewing this patch, I'd
> > appreciate your thoughts on the above as well. :)
> > 
> 
> I'm just guessing, but perhaps flushing an aligned power-of-2
> range can be processed by the hypervisor at a relatively fixed
> cost, regardless of the size. Having the guest do the decomposing
> of an arbitrary range allows the hypervisor to make use of the
> existing "rep" hypercall mechanism if the hypercall is taking
> "too long". The hypervisor can pause its processing, return to
> the guest temporarily, and then continue the hypercall. If the
> arbitrary range were passed into the hypercall for the hypervisor
> to do the decomposing, that pause-and-restart mechanism
> wouldn't be available.
> 
> Of course, Linux doesn't really take advantage of the pause to
> reduce guest interrupt latency because the Hyper-V code in
> Linux typically disable interrupts around a hypercall due to the
> way the hypercall input page is allocated. But other guest
> operating systems might benefit from such a pause. And we could
> probably fix the Hyper-V code in Linux to allow interrupts during a
> hypercall pause/restart if long-running hypercalls turn out to be
> a problem.
> 
> Regarding proposal (1) vs. (2), perhaps you could confirm with
> the Hyper-V team that flushing an aligned power-of-2 range
> has relatively fixed cost, regardless of the size. And what do the
> flush requests coming from the generic IOMMU subsystem look
> like? Do they match dma_unmap() ranges, which are probably
> dominated by relatively small ranges of a few pages at most,
> with a few outliers for disk I/O requests of 1 MiB or some such?
> If the dominant flush request is for a few pages at most, then
> doing (2) seems reasonable.

Thanks for the thoughtful suggestions, Michael!

I believe the time might be dominated by the number of descriptors,
instead of the size of each range, especially when device TLB
invalidations are involved.

Here's my understanding of what hypervisor does in its handler:

Hyper-V constructs one IOTLB invalidation descriptor (and possibly
a Device TLB invalidation descriptor as well) per iova_list entry
and submits them to the HW invalidation queue, then synchronously
waits for completion. So multiple 2^N-aligned entries should be less
efficient than a single larger 2^N aligned one.

Since both options submit 2^N-aligned entries to the hypervisor,
either one single coarser-grained entry or a precise decomposition,
I'm now also leaning towards option 2, which is also what Intel/AMD
drivers do for page-selective IOTLB flush. Simpler guest code, faster
flush, and the hypervisor can feed the single entry almost directly
to HW.

Yu

Reply via email to