From: Michael Kelley <[email protected]> Sent: Friday, May 15, 2026 11:00 AM
>
> From: Yu Zhang <[email protected]> Sent: Friday, May 15, 2026 9:24 AM
> >
> > On Thu, May 14, 2026 at 06:14:22PM +0000, Michael Kelley wrote:
> > > From: Yu Zhang <[email protected]> Sent: Monday, May 11, 2026 9:24 AM
> > > >
>
> [....]
>
> > > > + unsigned long nr_pages = end_pfn - start_pfn;
> > > > + u16 count = 0;
> > > > +
> > > > + while (nr_pages > 0) {
> > > > + unsigned long flush_pages;
> > > > + int order;
> > > > + unsigned long pfn_align;
> > > > + unsigned long size_align;
> > > > +
> > > > + if (count >= HV_IOMMU_MAX_FLUSH_VA_COUNT) {
> > > > + count = HV_IOMMU_FLUSH_VA_OVERFLOW;
> > > > + break;
> > > > + }
> > > > +
> > > > + if (start_pfn)
> > > > + pfn_align = __ffs(start_pfn);
> > >
> > > I don't understand why __ffs() is correct here. I would expect
> > > __fls() so it is consistent with the calculation of size_align. But I
> > > can only surmise how the hypercall works since there's no
> > > documentation, so maybe my understanding of the hypercall is
> > > wrong. If __ffs really is correct, a comment explaining why
> > > would help. :-)
> > >
> >
> > The use of __ffs() is intentional. Each flush entry invalidates a
> > naturally aligned 2^N page block, and the hypervisor requires the
> > page_number to be aligned to 2^page_mask_shift.
> >
> > Here __ffs() and __fls() serve different purposes:
> > - __ffs(start_pfn) is about the alignment constraint, i.e., how large
> > a block can this address support?
> > - __fls(nr_pages) is about the size constraint, i.e., how large a
> > block can the remaining range hold?
> >
> > Taking min() of both ensures each entry is both properly aligned
> > and within bounds.
> >
> > Thanks for raising this - it definitely deserves a comment. I had to
> > stare at it for a while myself to remember why. :)
>
> Hmmm. Something about this still nags at me. I'll run some
> experiments to either convince myself that you are right, or to
> come up with a counterexample.
I have resolved what was nagging at me. My understanding of how
__ffs() and __fls() work was wrong. :-( Your code is correct.
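
In case it helps anyone else following the thread, here is roughly the
user-space mock I used to convince myself. It is illustration only, not
the driver code; __builtin_ctzl() and __builtin_clzl() stand in for the
kernel's __ffs() and __fls():

/* Mock of the splitting loop; constants and helpers are stand-ins */
#include <stdio.h>

#define BITS_PER_LONG 64

int main(void)
{
	unsigned long start_pfn = 3, nr_pages = 8;	/* example range */

	while (nr_pages > 0) {
		unsigned long pfn_align, size_align, order;

		/* __ffs() equivalent: largest alignment this PFN supports */
		pfn_align = start_pfn ? __builtin_ctzl(start_pfn) :
					BITS_PER_LONG - 1;
		/* __fls() equivalent: largest block the remaining count holds */
		size_align = BITS_PER_LONG - 1 - __builtin_clzl(nr_pages);
		order = pfn_align < size_align ? pfn_align : size_align;

		printf("entry: pfn %lu, 2^%lu pages\n", start_pfn, order);

		start_pfn += 1UL << order;
		nr_pages -= 1UL << order;
	}
	return 0;
}

For PFN 3 with 8 pages it emits 1 page at 3, 4 pages at 4, 2 pages at 8,
and 1 page at 10 -- every block naturally aligned, which is exactly the
property the __ffs() term provides.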
>
> A related thought occurred to me. If each flush entry that is passed
> to Hyper-V describes a naturally aligned 2^N page block, I don't
> think the HV_IOMMU_MAX_FLUSH_VA_COUNT can ever
> be reached. The number of entries is limited by the number of
> bits in a PFN and the pages count, both of which are 64. And with
> 52 bit physical addressing and 4KiB pages, the actual size of
> a PFN and pages count is even smaller than 64.
> HV_IOMMU_MAX_FLUSH_VA_COUNT is the number of 8 byte
> union hv_iommu_flush_va entries that fit in a 4KiB page, so
> it's ~500.
>
> My statement applies to a single flush range. If multiple flush
> ranges were strung together in a single hypercall, a larger count
> could be reached, but hv_flush_device_domain_list() only does
> a single range. So I don't think the overflow case in
> hv_flush_device_domain_list() can ever happen. But let me
> do my experiments, and I will also look at this aspect to confirm
> if it's right.
My experiments also convince me that the overflow case can't
happen as long as the hypercall is being populated with entries
derived from a single range.
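For the record, the bound as best I can tell: each pass through the loop
either raises the alignment of start_pfn (its __ffs() strictly increases,
so at most 64 such steps) or clears the top set bit of nr_pages (its
__fls() strictly decreases, so again at most 64 steps). That caps a
single range at 2 x 64 = 128 entries in the absolute worst case,
comfortably under the ~500 entries that fit in the hypercall page.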
Michael
>
> >
> > > > + else
> > > > + pfn_align = BITS_PER_LONG - 1;
> > > > +
> > > > + size_align = __fls(nr_pages);
> > > > + order = min(pfn_align, size_align);
> > > > + iova_list[count].page_mask_shift = order;
> > > > + iova_list[count].page_number = start_pfn;
> > > > +
> > > > + flush_pages = 1UL << order;
> > > > + start_pfn += flush_pages;
> > > > + nr_pages -= flush_pages;
> > > > + count++;
> > > > + }
> > > > +
> > > > + return count;
> > > > +}
> > > > +
> > > > +static void hv_flush_device_domain_list(struct hv_iommu_domain *hv_domain,
> > > > + struct iommu_iotlb_gather *iotlb_gather)
> > > > +{
> > > > + u64 status;
> > > > + u16 count;
> > > > + unsigned long flags;
> > > > + struct hv_input_flush_device_domain_list *input;
> > > > +
> > > > + local_irq_save(flags);
> > > > +
> > > > + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > > > + memset(input, 0, sizeof(*input));
> > > > +
> > > > + input->device_domain = hv_domain->device_domain;
> > > > + input->flags |= HV_FLUSH_DEVICE_DOMAIN_LIST_IOMMU_FORMAT;
> > >
> > > I would suggest moving the memset() and setting the input fields down
> > > under the "else" below so that they are parallel with the flush all case.
> > >
> >
> > I agree the structure should be more symmetric. But the memset() and the
> > hv_iommu_fill_iova_list() call need to stay before the branch, since the
> > fill writes directly into input->iova_list[] and its return value decides
> > which branch to take. :)
>
> Agreed.
>
> >
> > > > + count = hv_iommu_fill_iova_list(input->iova_list,
> > > > + iotlb_gather->start,
> > > > + iotlb_gather->end);
> > > > + if (count == HV_IOMMU_FLUSH_VA_OVERFLOW) {
> > > > + /*
> > > > + * Range exceeds hypercall page capacity. Fall back to a full
> > > > + * domain flush.
> > > > + */
> > > > + struct hv_input_flush_device_domain *flush_all = (void *)input;
> > > > +
> > > > + memset(flush_all, 0, sizeof(*flush_all));
> > > > + flush_all->device_domain = hv_domain->device_domain;
> > > > + status = hv_do_hypercall(HVCALL_FLUSH_DEVICE_DOMAIN,
> > > > + flush_all, NULL);
> > > > + } else {
> > > > + status = hv_do_rep_hypercall(
> > > > + HVCALL_FLUSH_DEVICE_DOMAIN_LIST,
> > > > + count, 0, input, NULL);
> > > > + }
> > > > +
> > > > + local_irq_restore(flags);
> > > > +
> > > > + if (!hv_result_success(status))
> > > > + pr_err("HVCALL_FLUSH_DEVICE_DOMAIN_LIST failed, status %lld\n", status);
> > >
> > > As Sashiko pointed out, a failure here can lead to all kinds of trouble
> > > because of leaving unflushed entries. Maybe a WARN() is more appropriate?
> > > Also, maybe a failure in the list flush should try a flush all as a
> > > fallback, with the WARN() only if the flush all fails.
> > >
> >
> > Good idea. How about we restructure this routine to something like this:
> >
> >
> > memset(input, 0, sizeof(*input));
> > count = hv_iommu_fill_iova_list(...);
> >
> > if (count != HV_IOMMU_FLUSH_VA_OVERFLOW) {
> > input->device_domain = ...;
> > ...
> > status = hv_do_rep_hypercall(FLUSH_DEVICE_DOMAIN_LIST, ...);
> > if (hv_result_success(status))
> > goto out;
> > }
> >
> > /* overflow or list flush failed: fallback to full domain flush */
> > flush_all = (void *)input;
> > memset(flush_all, 0, sizeof(*flush_all));
> > flush_all->device_domain = ...;
> > status = hv_do_hypercall(FLUSH_DEVICE_DOMAIN, ...);
> > WARN(!hv_result_success(status), "IOTLB flush failed, status %lld\n", status);
> >
> > out:
> > local_irq_restore(flags);
> >
>
> Yes, I think this works. But per my earlier comment, if I'm right that
> the overflow case never occurs, it could be simplified further to just
> do the list flush with the full flush as the error fallback. Then WARN
> if the full flush fails.
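
To make that concrete, below is an untested sketch of the simplified
structure I have in mind. It reuses the names from the patch and from the
pseudocode above, and it assumes the overflow case really is unreachable,
so treat it as illustrative rather than final:

static void hv_flush_device_domain_list(struct hv_iommu_domain *hv_domain,
					struct iommu_iotlb_gather *iotlb_gather)
{
	struct hv_input_flush_device_domain_list *input;
	struct hv_input_flush_device_domain *flush_all;
	unsigned long flags;
	u64 status;
	u16 count;

	local_irq_save(flags);

	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
	memset(input, 0, sizeof(*input));

	input->device_domain = hv_domain->device_domain;
	input->flags |= HV_FLUSH_DEVICE_DOMAIN_LIST_IOMMU_FORMAT;

	count = hv_iommu_fill_iova_list(input->iova_list,
					iotlb_gather->start, iotlb_gather->end);

	status = hv_do_rep_hypercall(HVCALL_FLUSH_DEVICE_DOMAIN_LIST,
				     count, 0, input, NULL);
	if (hv_result_success(status))
		goto out;

	/* List flush failed: fall back to a full domain flush */
	flush_all = (void *)input;
	memset(flush_all, 0, sizeof(*flush_all));
	flush_all->device_domain = hv_domain->device_domain;
	status = hv_do_hypercall(HVCALL_FLUSH_DEVICE_DOMAIN, flush_all, NULL);
	WARN(!hv_result_success(status),
	     "IOTLB flush failed, status %lld\n", status);
out:
	local_irq_restore(flags);
}

Mostly this is just to show the control flow with the single fallback
path; the details are yours to adjust.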
>
> Michael