From: Jacob Pan <[email protected]> Sent: Wednesday, May 20, 2026 1:40 PM
> 
> Hi Michael,
> 
> On Wed, 20 May 2026 19:26:24 +0000
> Michael Kelley <[email protected]> wrote:
> 
> > From: Michael Kelley <[email protected]> To: Yu Zhang 
> > <[email protected]>, Jason Gunthorpe
> >
> > From: Yu Zhang <[email protected]> Sent: Wednesday, May 20, 2026 
> > 10:15 AM
> > >
> > > On Fri, May 15, 2026 at 07:35:45PM -0300, Jason Gunthorpe wrote:
> > > > On Tue, May 12, 2026 at 12:24:08AM +0800, Yu Zhang wrote:
> > > > > > +static inline u16 hv_iommu_fill_iova_list(union hv_iommu_flush_va *iova_list,
> > > > > +                                       unsigned long start,
> > > > > +                                       unsigned long end)
> > > > > +{
> > > > > +     unsigned long start_pfn = start >> PAGE_SHIFT;
> > > > > +     unsigned long end_pfn = PAGE_ALIGN(end) >> PAGE_SHIFT;
> > > > > +     unsigned long nr_pages = end_pfn - start_pfn;
> > > > > +     u16 count = 0;
> > > > > +
> > > > > +     while (nr_pages > 0) {
> > > > > +             unsigned long flush_pages;
> > > > > +             int order;
> > > > > +             unsigned long pfn_align;
> > > > > +             unsigned long size_align;
> > > > > +
> > > > > +             if (count >= HV_IOMMU_MAX_FLUSH_VA_COUNT) {
> > > > > +                     count = HV_IOMMU_FLUSH_VA_OVERFLOW;
> > > > > +                     break;
> > > > > +             }
> > > > > +
> > > > > +             if (start_pfn)
> > > > > +                     pfn_align = __ffs(start_pfn);
> > > > > +             else
> > > > > +                     pfn_align = BITS_PER_LONG - 1;
> > > > > +
> > > > > +             size_align = __fls(nr_pages);
> > > > > +             order = min(pfn_align, size_align);
> > > > > +             iova_list[count].page_mask_shift = order;
> > > > > +             iova_list[count].page_number = start_pfn;
> > > > > +
> > > > > +             flush_pages = 1UL << order;
> > > > > +             start_pfn += flush_pages;
> > > > > +             nr_pages -= flush_pages;
> > > > > +             count++;
> > > > > +     }
> > > >
> > > > This seems like a really silly hypervisor interface. Why doesn't
> > > > it just accept a normal range? Splitting it into power of two
> > > > aligned ranges is very inefficient.
> > >
> > > Fair point. I'm not sure how much flexibility we have to change
> > > this hypercall interface at the moment - it predates the pvIOMMU
> > > work and may have other consumers beyond Linux guest. On the other
> > > hand, having the guest specify 2^N-aligned blocks does save the
> > > hypervisor from having to decompose ranges itself before issuing
> > > hardware invalidation commands - the guest-provided entries can be
> > > fed to the HW more or less directly.
> > >
> > > That said, the way I'm currently using this interface may be
> > > more precise than necessary. Maybe we have 2 options:
> > >
> > > 1) Current approach: decompose the range into multiple exact
> > >    2^N-aligned blocks with no over-flush, but at the cost of
> > >    more complex calculations and more entries.
> > >
> > > 2) Follow what Intel/AMD drivers do: find a single minimal
> > >    2^N-aligned block that covers the entire range, but may
> > >    over-flush.
> > >
> > > Any preference?
> > >
> > > @Michael, since you've also been reviewing this patch, I'd
> > > appreciate your thoughts on the above as well. :)
> > >
> >
> > I'm just guessing, but perhaps flushing an aligned power-of-2
> > range can be processed by the hypervisor at a relatively fixed
> > cost, regardless of the size. Having the guest do the decomposing
> > of an arbitrary range allows the hypervisor to make use of the
> > existing "rep" hypercall mechanism if the hypercall is taking
> > "too long". The hypervisor can pause its processing, return to
> > the guest temporarily, and then continue the hypercall. If the
> > arbitrary range were passed into the hypercall for the hypervisor
> > to do the decomposing, that pause-and-restart mechanism
> > wouldn't be available.
> >
> > Of course, Linux doesn't really take advantage of the pause to
> > reduce guest interrupt latency because the Hyper-V code in
> > Linux typically disables interrupts around a hypercall due to the
> > way the hypercall input page is allocated. But other guest
> > operating systems might benefit from such a pause. And we could
> > probably fix the Hyper-V code in Linux to allow interrupts during a
> > hypercall pause/restart if long-running hypercalls turn out to be
> > a problem.

> I am not sure if this pause feature is suitable for IOTLB flush at all
> since it is inherently synchronous — the caller must block until all
> invalidations complete. Pausing mid-flush to return to the guest
> doesn't help if the guest can't make forward progress anyway.

I agree that hypercall pause/resume doesn't help with
forward progress. But it could help with interrupt latency in the
guest if the hypercall executes with interrupts enabled in the
guest. During the pause when control returns to the guest,
the guest could take an interrupt, versus the interrupt having
to wait until the entire hypercall completes. And if preemption
is enabled in the guest thread executing the hypercall, the thread
could be descheduled, potentially improving scheduling latency.

At least that's my understanding of why Hyper-V has this
pause/resume mechanism for "rep" hypercalls. :-)
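
To make the latency argument concrete, here is a rough user-space
model of a "rep" style call that is paused and resumed. It is only a
sketch of the effect, not the Hyper-V ABI: model_hypercall(),
guest_rep_flush(), struct rep_result, the per-call budget, and the
take_pending_interrupt() hook are all hypothetical names made up for
illustration. How the real continuation is surfaced to the guest is
up to the hypervisor.

```c
#include <assert.h>

/* Hypothetical result of one call into the model hypervisor. */
struct rep_result {
	int ok;                 /* nonzero on success */
	unsigned int reps_done; /* total entries completed so far */
};

/* Hypothetical stand-in for "the guest takes an interrupt". */
static unsigned int interrupts_taken;
static void take_pending_interrupt(void)
{
	interrupts_taken++;
}

/*
 * Model hypervisor: processes at most `budget` entries per call,
 * starting at index `rep_start`, then returns to the caller.
 * Each entry is a block order; the block covers 2^order pages.
 */
static struct rep_result model_hypercall(const int *orders,
					 unsigned int rep_count,
					 unsigned int rep_start,
					 unsigned int budget,
					 unsigned long *flushed_pages)
{
	struct rep_result r = { 1, rep_start };
	unsigned int done_now;

	for (done_now = 0;
	     done_now < budget && r.reps_done < rep_count;
	     done_now++, r.reps_done++)
		*flushed_pages += 1UL << orders[r.reps_done];

	return r;
}

/*
 * Model guest: re-issues the call from the last completed index
 * until every entry is processed. Between calls the guest is
 * running again, so a pending interrupt can be taken there.
 * Returns the number of calls that were needed.
 */
static unsigned int guest_rep_flush(const int *orders,
				    unsigned int rep_count,
				    unsigned int budget,
				    unsigned long *flushed_pages,
				    void (*on_pause)(void))
{
	unsigned int rep_start = 0;
	unsigned int calls = 0;

	assert(budget > 0); /* otherwise no forward progress */

	while (rep_start < rep_count) {
		struct rep_result r = model_hypercall(orders, rep_count,
						      rep_start, budget,
						      flushed_pages);
		calls++;
		if (!r.ok)
			break;
		rep_start = r.reps_done;
		if (rep_start < rep_count && on_pause)
			on_pause(); /* the pause window */
	}
	return calls;
}
```

The flush as a whole is still synchronous in this model, since
guest_rep_flush() does not return until every entry is done. The
only gain is the window between calls, and that window only exists
if the guest runs the hypercall with interrupts enabled.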

Michael
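
P.S. For comparing the two options Yu listed earlier in the thread,
a small user-space sketch may help. decompose_exact() follows the
loop in the quoted patch (option 1), and cover_single() is one way
to compute a single aligned block covering the whole range
(option 2). The names, struct flush_block, and the use of the
GCC/Clang __builtin_ctzl()/__builtin_clzl() builtins in place of the
kernel's __ffs()/__fls() are my assumptions for illustration. This
is not the actual Intel/AMD driver code.

```c
#include <assert.h>

#define BITS_PER_LONG ((int)(sizeof(unsigned long) * 8))

struct flush_block {
	unsigned long pfn; /* first page frame of the block */
	int order;         /* block covers 2^order pages */
};

/* Index of the lowest set bit; x must be nonzero. */
static int lowest_bit(unsigned long x)
{
	assert(x != 0);
	return __builtin_ctzl(x);
}

/* Index of the highest set bit; x must be nonzero. */
static int highest_bit(unsigned long x)
{
	assert(x != 0);
	return BITS_PER_LONG - 1 - __builtin_clzl(x);
}

/*
 * Option 1: split [start_pfn, start_pfn + nr_pages) into naturally
 * aligned power-of-two blocks with no over-flush. Returns the number
 * of blocks written, or -1 if more than `max` would be needed.
 */
static int decompose_exact(unsigned long start_pfn, unsigned long nr_pages,
			   struct flush_block *out, int max)
{
	int count = 0;

	while (nr_pages > 0) {
		int pfn_align, size_align, order;

		if (count >= max)
			return -1;

		/* Largest block the start address is aligned to. */
		pfn_align = start_pfn ? lowest_bit(start_pfn)
				      : BITS_PER_LONG - 1;
		/* Largest block that still fits in what is left. */
		size_align = highest_bit(nr_pages);
		order = pfn_align < size_align ? pfn_align : size_align;

		out[count].pfn = start_pfn;
		out[count].order = order;
		start_pfn += 1UL << order;
		nr_pages -= 1UL << order;
		count++;
	}
	return count;
}

/*
 * Option 2: smallest naturally aligned power-of-two block that
 * contains the whole range. May cover pages outside the range.
 */
static struct flush_block cover_single(unsigned long start_pfn,
				       unsigned long nr_pages)
{
	struct flush_block b;
	unsigned long last_pfn, diff;

	assert(nr_pages > 0);
	last_pfn = start_pfn + nr_pages - 1;
	/* Bits where first and last pfn differ must be inside the block. */
	diff = start_pfn ^ last_pfn;
	b.order = diff ? highest_bit(diff) + 1 : 0;
	b.pfn = b.order >= BITS_PER_LONG ? 0
		: start_pfn & ~((1UL << b.order) - 1);
	return b;
}
```

For pages 3..7 (start_pfn 3, nr_pages 5), decompose_exact() yields
two entries, one page at pfn 3 and four pages at pfn 4, while
cover_single() yields one order-3 block at pfn 0, which flushes
three extra pages. So the trade-off is entry count versus
over-flush.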


