On Fri, May 15, 2026 at 07:35:45PM -0300, Jason Gunthorpe wrote:
> On Tue, May 12, 2026 at 12:24:08AM +0800, Yu Zhang wrote:
> > +static inline u16 hv_iommu_fill_iova_list(union hv_iommu_flush_va *iova_list,
> > +                                     unsigned long start,
> > +                                     unsigned long end)
> > +{
> > +   unsigned long start_pfn = start >> PAGE_SHIFT;
> > +   unsigned long end_pfn = PAGE_ALIGN(end) >> PAGE_SHIFT;
> > +   unsigned long nr_pages = end_pfn - start_pfn;
> > +   u16 count = 0;
> > +
> > +   while (nr_pages > 0) {
> > +           unsigned long flush_pages;
> > +           int order;
> > +           unsigned long pfn_align;
> > +           unsigned long size_align;
> > +
> > +           if (count >= HV_IOMMU_MAX_FLUSH_VA_COUNT) {
> > +                   count = HV_IOMMU_FLUSH_VA_OVERFLOW;
> > +                   break;
> > +           }
> > +
> > +           if (start_pfn)
> > +                   pfn_align = __ffs(start_pfn);
> > +           else
> > +                   pfn_align = BITS_PER_LONG - 1;
> > +
> > +           size_align = __fls(nr_pages);
> > +           order = min(pfn_align, size_align);
> > +           iova_list[count].page_mask_shift = order;
> > +           iova_list[count].page_number = start_pfn;
> > +
> > +           flush_pages = 1UL << order;
> > +           start_pfn += flush_pages;
> > +           nr_pages -= flush_pages;
> > +           count++;
> > +   }
> 
> This seems like a really silly hypervisor interface. Why doesn't it
> just accept a normal range? Splitting it into power of two aligned
> ranges is very inefficient.

Fair point. I'm not sure how much flexibility we have to change
this hypercall interface at the moment - it predates the pvIOMMU
work and may have consumers other than Linux guests. On the other
hand, having the guest specify 2^N-aligned blocks does save the
hypervisor from decomposing ranges itself before issuing hardware
invalidation commands - the guest-provided entries can be fed to
the HW more or less directly.

That said, the way I'm currently using this interface may be
more precise than necessary. I see two options:

1) Current approach: decompose the range into multiple exact
   2^N-aligned blocks with no over-flush, but at the cost of
   more complex calculations and more entries.

2) Follow what the Intel/AMD drivers do: find the smallest single
   2^N-aligned block that covers the entire range, at the cost
   of possibly over-flushing.
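
To make the trade-off concrete, here is a rough userspace sketch of
the two options. It is only an illustration under stated assumptions:
struct flush_block, cover_block() and exact_entries() are made-up
names, not part of the patch or the hypercall ABI, and the entry
limit (HV_IOMMU_MAX_FLUSH_VA_COUNT) is deliberately left out.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for one flush entry; not the real hypercall layout. */
struct flush_block {
	unsigned long page_number;	/* first PFN of the block */
	unsigned int page_mask_shift;	/* log2 of the block size in pages */
};

/* Index of the lowest set bit; x must be non-zero. */
static unsigned int lowest_bit(unsigned long x)
{
	unsigned int n = 0;

	while (!(x & 1UL)) {
		x >>= 1;
		n++;
	}
	return n;
}

/* Index of the highest set bit; x must be non-zero. */
static unsigned int highest_bit(unsigned long x)
{
	unsigned int n = 0;

	while (x >>= 1)
		n++;
	return n;
}

/*
 * Option 2: smallest single 2^N-aligned block containing both
 * first_pfn and last_pfn (inclusive). The order is the bit length
 * of (first_pfn ^ last_pfn): every bit above it is shared.
 */
static struct flush_block cover_block(unsigned long first_pfn,
				      unsigned long last_pfn)
{
	struct flush_block b;
	unsigned long diff;

	assert(first_pfn <= last_pfn);
	diff = first_pfn ^ last_pfn;
	b.page_mask_shift = diff ? highest_bit(diff) + 1 : 0;
	if (b.page_mask_shift >= BITS_PER_ULONG)
		b.page_number = 0;	/* block spans the whole PFN space */
	else
		b.page_number = first_pfn & ~((1UL << b.page_mask_shift) - 1);
	return b;
}

/*
 * Option 1: number of exact 2^N-aligned entries produced by the same
 * greedy split as the quoted loop (no overflow cap applied here).
 */
static unsigned int exact_entries(unsigned long start_pfn,
				  unsigned long nr_pages)
{
	unsigned int count = 0;

	while (nr_pages > 0) {
		unsigned int pfn_align = start_pfn ? lowest_bit(start_pfn)
						   : BITS_PER_ULONG - 1;
		unsigned int size_align = highest_bit(nr_pages);
		unsigned int order = pfn_align < size_align ? pfn_align
							    : size_align;

		start_pfn += 1UL << order;
		nr_pages -= 1UL << order;
		count++;
	}
	return count;
}
```

For example, for the six pages at PFNs 0x1003..0x1008, option 1
yields three entries (orders 0, 2, 0) with no over-flush, while
option 2 yields one order-4 block at PFN 0x1000, i.e. 16 pages
flushed for a 6-page range.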

Any preference?

@Michael, since you've also been reviewing this patch, I'd
appreciate your thoughts on the above as well. :)

Yu
