From: Yu Zhang <[email protected]> Sent: Monday, January 12, 2026 
8:56 AM
> 
> On Thu, Jan 08, 2026 at 06:48:59PM +0000, Michael Kelley wrote:
> > From: Yu Zhang <[email protected]> Sent: Monday, December 8, 2025 9:11 PM
> 
> <snip>
> Thank you so much, Michael, for the thorough review!
> 
> I've snipped some comments I fully agree with and will address in
> next version. Actually, I have to admit I agree with your remaining
> comments below as well. :)
> 
> > > +struct hv_iommu_dev *hv_iommu_device;
> > > +static struct hv_iommu_domain hv_identity_domain;
> > > +static struct hv_iommu_domain hv_blocking_domain;
> >
> > Why is hv_iommu_device allocated dynamically while the two
> > domains are allocated statically? Seems like the approach could
> > be consistent, though maybe there's some reason I'm missing.
> >
> 
> On second thought, `hv_identity_domain` and `hv_blocking_domain` should
> likely be allocated dynamically as well, consistent with `hv_iommu_device`.

I don't know if there's a strong rationale either way (static allocation vs.
dynamic). If the long-term expectation is that there is never more than one
PV IOMMU in a guest, then static would be OK. If the future direction allows
for multiple PV IOMMUs in a guest, then doing dynamic allocation from the
start is justifiable (though the current PV IOMMU hypercalls seem to assume
only one PV IOMMU). But either way, being consistent is desirable.
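
For example, if the single-PV-IOMMU assumption holds, the fully static
variant is just the following (only a sketch; call sites would then use
&hv_iommu_device.iommu and the like, instead of dereferencing a pointer):

static struct hv_iommu_dev hv_iommu_device;	/* no kzalloc()/kfree() needed */
static struct hv_iommu_domain hv_identity_domain;
static struct hv_iommu_domain hv_blocking_domain;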

> 
> <snip>
> > > +static int hv_iommu_get_logical_device_property(struct device *dev,
> > > +                                 enum hv_logical_device_property_code code,
> > > +                                 struct hv_output_get_logical_device_property *property)
> > > +{
> > > + u64 status;
> > > + unsigned long flags;
> > > + struct hv_input_get_logical_device_property *input;
> > > + struct hv_output_get_logical_device_property *output;
> > > +
> > > + local_irq_save(flags);
> > > +
> > > + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > > + output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> > > + memset(input, 0, sizeof(*input));
> > > + memset(output, 0, sizeof(*output));
> >
> > General practice is to *not* zero the output area prior to a hypercall.
> > The hypervisor should be correctly setting all the output bits. There are
> > a couple of cases in the new MSHV code where the output is zero'ed, but
> > I'm planning to submit a patch to remove those so that hypercall call
> > sites that have output are consistent across the code base. Of course,
> > it's possible to have a Hyper-V bug where it doesn't do the right thing,
> > and zero'ing the output could be done as a workaround. But such cases
> > should be explicitly known with code comments indicating the reason for
> > the zero'ing.
> >
> > Same applies in hv_iommu_detect().
> >
> 
> Thanks for the information! Just to clarify: this is only because Hyper-V is
> supposed to zero the output page, while for the input page, memset() is still
> needed. Am I correct?

Yes, you are correct.

The general TLFS requirement for hypercall input is that unused fields and
bits are set to zero. This requirement ensures forward compatibility if a
later version of the hypervisor assigns some meaning to previously unused
fields/bits. So best practice for hypercall call sites is to use memset()
to zero the entire input area, and then specific field values are set on
top of that. Any fields/bits that aren't explicitly set then meet the TLFS
requirement.

It would be OK if a hypercall call site explicitly set every field/bit
instead of using memset(), but it's easy to unintentionally miss a
field/bit and create a forward compatibility problem. However, when the
hypercall input contains a large array, the code usually does *not* do
memset() on the large array because of the perf impact, but instead the
code populating the large array must be careful to not leave any bits
uninitialized.

For hypercall output, the hypervisor essentially has the same requirement.
It should make sure that any unused fields/bits in the output area are
zero, so that the Linux guest can properly deal with a future hypervisor
version that assigns meaning to previously unused fields/bits.
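
To make the pattern concrete, here's a minimal sketch of the call site
shape I have in mind (illustrative only: the hypercall code constant and
the property_code field are placeholders based on your naming, and the
errno mapping is simplified):

        u64 status;
        unsigned long flags;
        struct hv_input_get_logical_device_property *input;
        struct hv_output_get_logical_device_property *output;

        local_irq_save(flags);
        input = *this_cpu_ptr(hyperv_pcpu_input_arg);
        output = *this_cpu_ptr(hyperv_pcpu_output_arg);

        /* Zero the entire input so unused fields/bits satisfy the TLFS rule */
        memset(input, 0, sizeof(*input));
        input->property_code = code;    /* then set only the fields needed */

        /* Note: no memset() of the output area */
        status = hv_do_hypercall(HVCALL_GET_LOGICAL_DEVICE_PROPERTY,
                                 input, output);
        if (hv_result_success(status))
                *property = *output;    /* copy out while the per-cpu page is still ours */

        local_irq_restore(flags);
        return hv_result_success(status) ? 0 : -EIO;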

> 
> <snip>
> 
> > > +static void hv_iommu_shutdown(void)
> > > +{
> > > + iommu_device_sysfs_remove(&hv_iommu_device->iommu);
> > > +
> > > + kfree(hv_iommu_device);
> > > +}
> > > +
> > > +static struct syscore_ops hv_iommu_syscore_ops = {
> > > + .shutdown = hv_iommu_shutdown,
> > > +};
> >
> > Why is a shutdown needed at all?  hv_iommu_shutdown() doesn't do anything
> > that's really needed, since sysfs entries are transient, and freeing
> > memory isn't relevant for a shutdown.
> >
> 
> For iommu_device_sysfs_remove(), I guess it is not necessary, and
> I will need to do some homework to better understand sysfs. :)
> Originally, we wanted a shutdown routine to trigger some hypercall,
> so that Hyper-V would disable the DMA translation, e.g., during the VM
> reboot process.

I would presume that if Hyper-V reboots the VM, Hyper-V automatically
resets the PV IOMMU and prevents any further DMA operations. But
consider kexec(), where a new kernel gets loaded without going through
the hypervisor "reboot-this-VM" path. There have been problems in the
past with kexec() where parts of Hyper-V state for the guest didn't get
reset, and the PV IOMMU is likely something in that category. So there
may indeed be a need to tell the hypervisor to reset everything related
to the PV IOMMU. There are already functions to do Hyper-V cleanup: see
vmbus_initiate_unload() and hyperv_cleanup(). These existing functions
may be a better place to do PV IOMMU cleanup/reset if needed.
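
If it turns out a hypervisor-side reset really is needed, the shape might
be something like the following, called from hyperv_cleanup() or next to
vmbus_initiate_unload() rather than from a syscore shutdown op
(hv_iommu_reset() is a hypothetical name, and which hypercall(s) to issue
depends on what the PV IOMMU interface defines):

/* Hypothetical hook -- the name and the hypercall are placeholders */
void hv_iommu_reset(void)
{
        if (!hv_iommu_device)
                return;

        /*
         * Issue whatever PV IOMMU hypercall(s) detach all endpoints and
         * disable DMA translation, so a kexec'ed kernel doesn't inherit
         * stale mappings or in-flight DMA.
         */
}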

> 
> <snip>
> 
> > > +device_initcall(hv_iommu_init);
> >
> > I'm concerned about the timing of this initialization. VMBus is
> > initialized with subsys_initcall(), which is initcall level 4, while
> > device_initcall() is initcall level 6. So VMBus initialization happens
> > quite a bit earlier, and the hypervisor starts offering devices to the
> > guest, including PCI pass-thru devices, before the IOMMU initialization
> > starts. I cobbled together a way to make this IOMMU code run in an Azure
> > VM using the identity domain. The VM has an NVMe OS disk, two NVMe data
> > disks, and a MANA NIC. The NVMe devices were offered, and completed
> > hv_pci_probe() before this IOMMU initialization was started. When IOMMU
> > initialization did run, it went back and found the NVMe devices. But I'm
> > unsure if that's OK because my hacked-together environment obviously
> > couldn't do real IOMMU mapping. It appears that the NVMe device driver
> > didn't start its initialization until after the IOMMU driver was set up,
> > which would probably make everything OK. But that might be just timing
> > luck, or maybe there's something that affirmatively prevents the native
> > PCI driver (like NVMe) from getting started until after all the initcalls
> > have finished.
> >
> 
> This is yet another immature attempt by me to do the hv_iommu_init() in
> an arch-independent path, and I do not think using device_initcall() is
> harmless. This patch set was tested using an assigned Intel DSA device,
> and the DMA tests succeeded w/o any error. But that is not enough to
> justify using device_initcall(): I made the idxd driver a kernel builtin
> and realized that, just like you said, both hv_pci_probe() and
> idxd_pci_probe() were triggered before hv_iommu_init(), and when pvIOMMU
> tries to probe the endpoint device, a warning is printed:
> 
> [    3.609697] idxd 13d7:00:00.0: late IOMMU probe at driver bind, something fishy here!
> 

You succeeded in doing what I was going to try! I won't spend time on it now.

> > I'm planning to look at this further to see if there's a way for a PCI
> > driver to try initializing a pass-thru device *before* this IOMMU driver
> > has initialized. If so, a different way to do the IOMMU initialization
> > will be needed that is linked to VMBus initialization so things can't
> > happen out-of-order. Establishing such a linkage is probably a good idea
> > regardless.
> >
> > FWIW, the Azure VM with the 3 NVMe devices and MANA, and operating with
> > the identity IOMMU domain, all seemed to work fine! Got 4 IOMMU groups,
> > and devices coming and going dynamically all worked correctly. When a
> > device was removed, it was moved to the blocking domain, and then flushed
> > before being finally removed. All good! I wish I had a way to test with
> > an IOMMU paging domain that was doing real translation.
> >
> 
> Thank you, Michael! I really appreciate you running these extra experiments!
> 
> My tests on this DSA device passed (using a paging domain) too, with no
> DMA errors observed (regardless of whether its driver is builtin or a
> kernel module). But that doesn't make me confident about using
> `device_initcall`. I believe your concern is valid. E.g., an endpoint
> device might allocate a DMA address (using a raw GPA, instead of a gIOVA)
> before pvIOMMU is initialized, and then use that address for DMA later,
> after a paging domain is attached?

Yes, that's exactly my concern.
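
One possible interim mitigation (just a sketch, not something I've tested)
would be to register the IOMMU driver at an earlier initcall level, so it
is set up before the device_initcall()-level PCI drivers like nvme and
idxd can bind:

/*
 * Sketch only: subsys_initcall_sync() runs at the end of initcall level 4,
 * after VMBus's subsys_initcall() but before device_initcall() (level 6).
 * A real linkage to VMBus initialization would still be more robust, since
 * VMBus device offers are processed asynchronously.
 */
subsys_initcall_sync(hv_iommu_init);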

> 
> > > diff --git a/drivers/iommu/hyperv/iommu.h b/drivers/iommu/hyperv/iommu.h
> > > new file mode 100644
> > > index 000000000000..c8657e791a6e
> > > --- /dev/null
> > > +++ b/drivers/iommu/hyperv/iommu.h
> > > @@ -0,0 +1,53 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +
> > > +/*
> > > + * Hyper-V IOMMU driver.
> > > + *
> > > + * Copyright (C) 2024-2025, Microsoft, Inc.
> > > + *
> > > + */
> > > +
> > > +#ifndef _HYPERV_IOMMU_H
> > > +#define _HYPERV_IOMMU_H
> > > +
> > > +struct hv_iommu_dev {
> > > + struct iommu_device iommu;
> > > + struct ida domain_ids;
> > > +
> > > + /* Device configuration */
> > > + u8  max_iova_width;
> > > + u8  max_pasid_width;
> > > + u64 cap;
> > > + u64 pgsize_bitmap;
> > > +
> > > + struct iommu_domain_geometry geometry;
> > > + u64 first_domain;
> > > + u64 last_domain;
> > > +};
> > > +
> > > +struct hv_iommu_domain {
> > > + union {
> > > +         struct iommu_domain    domain;
> > > +         struct pt_iommu        pt_iommu;
> > > +         struct pt_iommu_x86_64 pt_iommu_x86_64;
> > > + };
> > > + struct hv_iommu_dev *hv_iommu;
> > > + struct hv_input_device_domain device_domain;
> > > + u64             pgsize_bitmap;
> > > +
> > > + spinlock_t lock; /* protects dev_list and TLB flushes */
> > > + /* List of devices in this DMA domain */
> >
> > It appears that this list is really a list of endpoints (i.e., struct
> > hv_iommu_endpoint), not devices (which I read to be struct
> > hv_iommu_dev).
> >
> > But that said, what is the list used for?  I see code to add
> > endpoints to the list, and to remove them, but the list is never
> > walked by any code in this patch set. If there is an anticipated
> > future use, it would be better to add the list as part of the code
> > for that future use.
> >
> 
> Yes, we do not really need this list for this patch set. Thanks!
> 
> B.R.
> Yu
