From: Michael Kelley <[email protected]> Sent: Friday, May 15, 2026 11:00 AM
>
> From: Yu Zhang <[email protected]> Sent: Friday, May 15, 2026 9:24 AM
> >
> > On Thu, May 14, 2026 at 06:14:22PM +0000, Michael Kelley wrote:
> > > From: Yu Zhang <[email protected]> Sent: Monday, May 11, 2026 9:24 AM
> > > >
>
> [....]
>
> > > > + unsigned long nr_pages = end_pfn - start_pfn;
> > > > + u16 count = 0;
> > > > +
> > > > + while (nr_pages > 0) {
> > > > + unsigned long flush_pages;
> > > > + int order;
> > > > + unsigned long pfn_align;
> > > > + unsigned long size_align;
> > > > +
> > > > + if (count >= HV_IOMMU_MAX_FLUSH_VA_COUNT) {
> > > > + count = HV_IOMMU_FLUSH_VA_OVERFLOW;
> > > > + break;
> > > > + }
> > > > +
> > > > + if (start_pfn)
> > > > + pfn_align = __ffs(start_pfn);
> > >
> > > I don't understand why __ffs() is correct here. I would expect
> > > __fls() so it is consistent with the calculation of size_align. But I
> > > can only surmise how the hypercall works since there's no
> > > documentation, so maybe my understanding of the hypercall is
> > > wrong. If __ffs really is correct, a comment explaining why
> > > would help. :-)
> > >
> >
> > The use of __ffs() is intentional. Each flush entry invalidates a
> > naturally aligned 2^N page block, and the hypervisor requires the
> > page_number to be aligned to 2^page_mask_shift.
> >
> > Here __ffs() and __fls() serve different purposes:
> > - __ffs(start_pfn) is about the alignment constraint, i.e., how large
> > a block can this address support?
> > - __fls(nr_pages) is about the size constraint, i.e., how large a
> > block can the remaining range hold?
> >
> > Taking min() of both ensures each entry is both properly aligned
> > and within bounds.
> >
> > Thanks for raising this - it definitely deserves a comment. I had to
> > stare at it for a while myself to remember why. :)
>
> Hmmm. Something about this still nags at me. I'll run some
> experiments to either convince myself that you are right, or to
> come up with a counterexample.
I have resolved what was nagging at me. My understanding of how
__ffs() and __fls() work was wrong. :-( Your code is correct.
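
In case it helps anyone else following the thread, here is roughly the
user-space mock I used to convince myself. It is illustration only, not
the driver code; __builtin_ctzl() and __builtin_clzl() stand in for the
kernel's __ffs() and __fls():

/* Mock of the splitting loop; constants and helpers are stand-ins */
#include <stdio.h>

#define BITS_PER_LONG 64

int main(void)
{
	unsigned long start_pfn = 3, nr_pages = 8;	/* example range */

	while (nr_pages > 0) {
		unsigned long pfn_align, size_align, order;

		/* __ffs() equivalent: largest alignment this PFN supports */
		pfn_align = start_pfn ? __builtin_ctzl(start_pfn) :
					BITS_PER_LONG - 1;
		/* __fls() equivalent: largest block the remaining count holds */
		size_align = BITS_PER_LONG - 1 - __builtin_clzl(nr_pages);
		order = pfn_align < size_align ? pfn_align : size_align;

		printf("entry: pfn %lu, 2^%lu pages\n", start_pfn, order);

		start_pfn += 1UL << order;
		nr_pages -= 1UL << order;
	}
	return 0;
}

For PFN 3 with 8 pages it emits 1 page at 3, 4 pages at 4, 2 pages at 8,
and 1 page at 10 -- every block naturally aligned, which is exactly the
property the __ffs() term provides.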
>
> A related thought occurred to me. If each flush entry that is passed
> to Hyper-V describes a naturally aligned 2^N page block, I don't
> think the HV_IOMMU_MAX_FLUSH_VA_COUNT can ever
> be reached. The number of entries is limited by the number of
> bits in a PFN and the pages count, both of which are 64. And with
> 52 bit physical addressing and 4KiB pages, the actual size of
> a PFN and pages count is even smaller than 64.
> HV_IOMMU_MAX_FLUSH_VA_COUNT is the number of 8 byte
> union hv_iommu_flush_va entries that fit in a 4KiB page, so
> it's ~500.
>
> My statement applies to a single flush range. If multiple flush
> ranges were strung together in a single hypercall, a larger count
> could be reached, but hv_flush_device_domain_list() only does
> a single range. So I don't think the overflow case in
> hv_flush_device_domain_list() can ever happen. But let me
> do my experiments, and I will also look at this aspect to confirm
> if it's right.
My experiments also convince me that the overflow case can't
happen as long as the hypercall is being populated with entries
derived from a single range.
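For the record, the bound as best I can tell: each pass through the loop
either raises the alignment of start_pfn (its __ffs() strictly increases,
so at most 64 such steps) or clears the top set bit of nr_pages (its
__fls() strictly decreases, so again at most 64 steps). That caps a
single range at 2 x 64 = 128 entries in the absolute worst case,
comfortably under the ~500 entries that fit in the hypercall page.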
Michael
>
> >
> > > > + else
> > > > + pfn_align = BITS_PER_LONG - 1;
> > > > +
> > > > + size_align = __fls(nr_pages);
> > > > + order = min(pfn_align, size_align);
> > > > + iova_list[count].page_mask_shift = order;
> > > > + iova_list[count].page_number = start_pfn;
> > > > +
> > > > + flush_pages = 1UL << order;
> > > > + start_pfn += flush_pages;
> > > > + nr_pages -= flush_pages;
> > > > + count++;
> > > > + }
> > > > +
> > > > + return count;
> > > > +}
> > > > +
> > > > +static void hv_flush_device_domain_list(struct hv_iommu_domain *hv_domain,
> > > > + struct iommu_iotlb_gather *iotlb_gather)
> > > > +{
> > > > + u64 status;
> > > > + u16 count;
> > > > + unsigned long flags;
> > > > + struct hv_input_flush_device_domain_list *input;
> > > > +
> > > > + local_irq_save(flags);
> > > > +
> > > > + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > > > + memset(input, 0, sizeof(*input));
> > > > +
> > > > + input->device_domain = hv_domain->device_domain;
> > > > + input->flags |= HV_FLUSH_DEVICE_DOMAIN_LIST_IOMMU_FORMAT;
> > >
> > > I would suggest moving the memset() and setting the input fields down
> > > under the "else" below so that they are parallel with the flush all case.
> > >
> >
> > I agree the structure should be more symmetric. But the memset() and the
> > hv_iommu_fill_iova_list() call need to stay before the branch, since the
> > fill writes directly into input->iova_list[] and its return value decides
> > which branch to take. :)
>
> Agreed.
>
> >
> > > > + count = hv_iommu_fill_iova_list(input->iova_list,
> > > > + iotlb_gather->start,
> > > > + iotlb_gather->end);
> > > > + if (count == HV_IOMMU_FLUSH_VA_OVERFLOW) {
> > > > + /*
> > > > + * Range exceeds hypercall page capacity. Fall back to a full
> > > > + * domain flush.
> > > > + */
> > > > + struct hv_input_flush_device_domain *flush_all = (void *)input;
> > > > +
> > > > + memset(flush_all, 0, sizeof(*flush_all));
> > > > + flush_all->device_domain = hv_domain->device_domain;
> > > > + status = hv_do_hypercall(HVCALL_FLUSH_DEVICE_DOMAIN,
> > > > + flush_all, NULL);
> > > > + } else {
> > > > + status = hv_do_rep_hypercall(
> > > > + HVCALL_FLUSH_DEVICE_DOMAIN_LIST,
> > > > + count, 0, input, NULL);
> > > > + }
> > > > +
> > > > + local_irq_restore(flags);
> > > > +
> > > > + if (!hv_result_success(status))
> > > > + pr_err("HVCALL_FLUSH_DEVICE_DOMAIN_LIST failed, status %lld\n", status);
> > >
> > > As Sashiko pointed out, a failure here can lead to all kinds of trouble
> > > because of leaving unflushed entries. Maybe a WARN() is more appropriate?
> > > Also, maybe a failure in the list flush should try a flush all as a
> > > fallback, with the WARN() only if the flush all fails.
> > >
> >
> > Good idea. How about we restructure this routine to something like this:
> >
> >
> > memset(input, 0, sizeof(*input));
> > count = hv_iommu_fill_iova_list(...);
> >
> > if (count != HV_IOMMU_FLUSH_VA_OVERFLOW) {
> > input->device_domain = ...;
> > ...
> > status = hv_do_rep_hypercall(FLUSH_DEVICE_DOMAIN_LIST, ...);
> > if (hv_result_success(status))
> > goto out;
> > }
> >
> > /* overflow or list flush failed: fallback to full domain flush */
> > flush_all = (void *)input;
> > memset(flush_all, 0, sizeof(*flush_all));
> > flush_all->device_domain = ...;
> > status = hv_do_hypercall(FLUSH_DEVICE_DOMAIN, ...);
> > WARN(!hv_result_success(status), "IOTLB flush failed, status %lld\n", status);
> >
> > out:
> > local_irq_restore(flags);
> >
>
> Yes, I think this works. But per my earlier comment, if I'm right that
> the overflow case never occurs, it could be simplified further to just
> do the list flush with the full flush as the error fallback. Then WARN
> if the full flush fails.
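
To make that concrete, below is an untested sketch of the simplified
structure I have in mind. It reuses the names from the patch and from the
pseudocode above, and it assumes the overflow case really is unreachable,
so treat it as illustrative rather than final:

static void hv_flush_device_domain_list(struct hv_iommu_domain *hv_domain,
					struct iommu_iotlb_gather *iotlb_gather)
{
	struct hv_input_flush_device_domain_list *input;
	struct hv_input_flush_device_domain *flush_all;
	unsigned long flags;
	u64 status;
	u16 count;

	local_irq_save(flags);

	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
	memset(input, 0, sizeof(*input));

	input->device_domain = hv_domain->device_domain;
	input->flags |= HV_FLUSH_DEVICE_DOMAIN_LIST_IOMMU_FORMAT;

	count = hv_iommu_fill_iova_list(input->iova_list,
					iotlb_gather->start, iotlb_gather->end);

	status = hv_do_rep_hypercall(HVCALL_FLUSH_DEVICE_DOMAIN_LIST,
				     count, 0, input, NULL);
	if (hv_result_success(status))
		goto out;

	/* List flush failed: fall back to a full domain flush */
	flush_all = (void *)input;
	memset(flush_all, 0, sizeof(*flush_all));
	flush_all->device_domain = hv_domain->device_domain;
	status = hv_do_hypercall(HVCALL_FLUSH_DEVICE_DOMAIN, flush_all, NULL);
	WARN(!hv_result_success(status),
	     "IOTLB flush failed, status %lld\n", status);
out:
	local_irq_restore(flags);
}

Mostly this is just to show the control flow with the single fallback
path; the details are yours to adjust.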
>
> Michael