From: Yu Zhang <[email protected]> Sent: Wednesday, May 20, 2026 
10:15 AM
> 
> On Fri, May 15, 2026 at 07:35:45PM -0300, Jason Gunthorpe wrote:
> > On Tue, May 12, 2026 at 12:24:08AM +0800, Yu Zhang wrote:
> > > +static inline u16 hv_iommu_fill_iova_list(union hv_iommu_flush_va 
> > > *iova_list,
> > > +                                   unsigned long start,
> > > +                                   unsigned long end)
> > > +{
> > > + unsigned long start_pfn = start >> PAGE_SHIFT;
> > > + unsigned long end_pfn = PAGE_ALIGN(end) >> PAGE_SHIFT;
> > > + unsigned long nr_pages = end_pfn - start_pfn;
> > > + u16 count = 0;
> > > +
> > > + while (nr_pages > 0) {
> > > +         unsigned long flush_pages;
> > > +         int order;
> > > +         unsigned long pfn_align;
> > > +         unsigned long size_align;
> > > +
> > > +         if (count >= HV_IOMMU_MAX_FLUSH_VA_COUNT) {
> > > +                 count = HV_IOMMU_FLUSH_VA_OVERFLOW;
> > > +                 break;
> > > +         }
> > > +
> > > +         if (start_pfn)
> > > +                 pfn_align = __ffs(start_pfn);
> > > +         else
> > > +                 pfn_align = BITS_PER_LONG - 1;
> > > +
> > > +         size_align = __fls(nr_pages);
> > > +         order = min(pfn_align, size_align);
> > > +         iova_list[count].page_mask_shift = order;
> > > +         iova_list[count].page_number = start_pfn;
> > > +
> > > +         flush_pages = 1UL << order;
> > > +         start_pfn += flush_pages;
> > > +         nr_pages -= flush_pages;
> > > +         count++;
> > > + }
> >
> > This seems like a really silly hypervisor interface. Why doesn't it
> > just accept a normal range? Splitting it into power of two aligned
> > ranges is very inefficient.
> 
> Fair point. I'm not sure how much flexibility we have to change
> this hypercall interface at the moment - it predates the pvIOMMU
> work and may have other consumers beyond Linux guest. On the other
> hand, having the guest specify 2^N-aligned blocks does save the
> hypervisor from having to decompose ranges itself before issuing
> hardware invalidation commands - the guest-provided entries can be
> fed to the HW more or less directly.
> 
> That said, the way I'm currently using this interface may be
> more precise than necessary. Maybe we have 2 options:
> 
> 1) Current approach: decompose the range into multiple exact
>    2^N-aligned blocks with no over-flush, but at the cost of
>    more complex calculations and more entries.
> 
> 2) Follow what Intel/AMD drivers do: find a single minimal
>    2^N-aligned block that covers the entire range, but may
>    over-flush.
> 
> Any preference?
> 
> @Michael, since you've also been reviewing this patch, I'd
> appreciate your thoughts on the above as well. :)
> 

I'm just guessing, but perhaps flushing an aligned power-of-2
range can be processed by the hypervisor at a relatively fixed
cost, regardless of the size. Having the guest do the decomposing
of an arbitrary range allows the hypervisor to make use of the
existing "rep" hypercall mechanism if the hypercall is taking
"too long". The hypervisor can pause its processing, return to
the guest temporarily, and then continue the hypercall. If the
arbitrary range were passed into the hypercall for the hypervisor
to do the decomposing, that pause-and-restart mechanism
wouldn't be available.

Of course, Linux doesn't really take advantage of the pause to
reduce guest interrupt latency because the Hyper-V code in
Linux typically disable interrupts around a hypercall due to the
way the hypercall input page is allocated. But other guest
operating systems might benefit from such a pause. And we could
probably fix the Hyper-V code in Linux to allow interrupts during a
hypercall pause/restart if long-running hypercalls turn out to be
a problem.

Regarding proposal (1) vs. (2), perhaps you could confirm with
the Hyper-V team that flushing an aligned power-of-2 range
has relatively fixed cost, regardless of the size. And what do the
flush requests coming from the generic IOMMU subsystem look
like? Do they match dma_unmap() ranges, which are probably
dominated by relatively small ranges of a few pages at most,
with a few outliers for disk I/O requests of 1 MiB or some such?
If the dominant flush request is for a few pages at most, then
doing (2) seems reasonable.

Michael

Reply via email to