Hi Michael, On Wed, 20 May 2026 19:26:24 +0000 Michael Kelley <[email protected]> wrote:
> From: Michael Kelley <[email protected]> > To: Yu Zhang <[email protected]>, Jason Gunthorpe > <[email protected]> CC: "[email protected]" > <[email protected]>, "[email protected]" > <[email protected]>, "[email protected]" > <[email protected]>, "[email protected]" > <[email protected]>, "[email protected]" > <[email protected]>, "[email protected]" > <[email protected]>, "[email protected]" <[email protected]>, > "[email protected]" <[email protected]>, > "[email protected]" <[email protected]>, "[email protected]" > <[email protected]>, "[email protected]" <[email protected]>, > "[email protected]" <[email protected]>, "[email protected]" > <[email protected]>, "[email protected]" <[email protected]>, > "[email protected]" <[email protected]>, > "[email protected]" <[email protected]>, "[email protected]" > <[email protected]>, "[email protected]" <[email protected]>, > "[email protected]" <[email protected]>, "[email protected]" > <[email protected]>, "[email protected]" > <[email protected]>, > "[email protected]" > <[email protected]> Subject: RE: [PATCH v1 4/4] > iommu/hyperv: Add page-selective IOTLB flush support Date: Wed, 20 > May 2026 19:26:24 +0000 > > From: Yu Zhang <[email protected]> Sent: Wednesday, May > 20, 2026 10:15 AM > > > > On Fri, May 15, 2026 at 07:35:45PM -0300, Jason Gunthorpe wrote: > > > On Tue, May 12, 2026 at 12:24:08AM +0800, Yu Zhang wrote: > > > > +static inline u16 hv_iommu_fill_iova_list(union > > > > hv_iommu_flush_va *iova_list, > > > > + unsigned long start, > > > > + unsigned long end) > > > > +{ > > > > + unsigned long start_pfn = start >> PAGE_SHIFT; > > > > + unsigned long end_pfn = PAGE_ALIGN(end) >> PAGE_SHIFT; > > > > + unsigned long nr_pages = end_pfn - start_pfn; > > > > + u16 count = 0; > > > > + > > > > + while (nr_pages > 0) { > > > > + unsigned long flush_pages; > > > > + int order; > > > > + unsigned long pfn_align; > > > > + unsigned long size_align; > > > > + > > > > + if (count >= HV_IOMMU_MAX_FLUSH_VA_COUNT) { > > > > + count = HV_IOMMU_FLUSH_VA_OVERFLOW; > > > > + break; > > > > + } > > > > + > > > > + if (start_pfn) > > > > + pfn_align = __ffs(start_pfn); > > > > + else > > > > + pfn_align = BITS_PER_LONG - 1; > > > > + > > > > + size_align = __fls(nr_pages); > > > > + order = min(pfn_align, size_align); > > > > + iova_list[count].page_mask_shift = order; > > > > + iova_list[count].page_number = start_pfn; > > > > + > > > > + flush_pages = 1UL << order; > > > > + start_pfn += flush_pages; > > > > + nr_pages -= flush_pages; > > > > + count++; > > > > + } > > > > > > This seems like a really silly hypervisor interface. Why doesn't > > > it just accept a normal range? Splitting it into power of two > > > aligned ranges is very inefficient. > > > > Fair point. I'm not sure how much flexibility we have to change > > this hypercall interface at the moment - it predates the pvIOMMU > > work and may have other consumers beyond Linux guest. On the other > > hand, having the guest specify 2^N-aligned blocks does save the > > hypervisor from having to decompose ranges itself before issuing > > hardware invalidation commands - the guest-provided entries can be > > fed to the HW more or less directly. > > > > That said, the way I'm currently using this interface may be > > more precise than necessary. Maybe we have 2 options: > > > > 1) Current approach: decompose the range into multiple exact > > 2^N-aligned blocks with no over-flush, but at the cost of > > more complex calculations and more entries. > > > > 2) Follow what Intel/AMD drivers do: find a single minimal > > 2^N-aligned block that covers the entire range, but may > > over-flush. > > > > Any preference? > > > > @Michael, since you've also been reviewing this patch, I'd > > appreciate your thoughts on the above as well. :) > > > > I'm just guessing, but perhaps flushing an aligned power-of-2 > range can be processed by the hypervisor at a relatively fixed > cost, regardless of the size. Having the guest do the decomposing > of an arbitrary range allows the hypervisor to make use of the > existing "rep" hypercall mechanism if the hypercall is taking > "too long". The hypervisor can pause its processing, return to > the guest temporarily, and then continue the hypercall. If the > arbitrary range were passed into the hypercall for the hypervisor > to do the decomposing, that pause-and-restart mechanism > wouldn't be available. > > Of course, Linux doesn't really take advantage of the pause to > reduce guest interrupt latency because the Hyper-V code in > Linux typically disable interrupts around a hypercall due to the > way the hypercall input page is allocated. But other guest > operating systems might benefit from such a pause. And we could > probably fix the Hyper-V code in Linux to allow interrupts during a > hypercall pause/restart if long-running hypercalls turn out to be > a problem. I am not sure if this pause feature is suitable for IOTLB flush at all since it is inherently synchronous — the caller must block until all invalidations complete. Pausing mid-flush to return to the guest doesn't help if the guest can't make forward progress anyway.

