On Thu, Jan 08, 2026 at 06:48:59PM +0000, Michael Kelley wrote:
> From: Yu Zhang <[email protected]> Sent: Monday, December 8, 2025 9:11 PM
<snip>
Thank you so much, Michael, for the thorough review!
I've snipped the comments I fully agree with and will address in the
next version. Actually, I have to admit I agree with your remaining
comments below as well. :)
> > +struct hv_iommu_dev *hv_iommu_device;
> > +static struct hv_iommu_domain hv_identity_domain;
> > +static struct hv_iommu_domain hv_blocking_domain;
>
> Why is hv_iommu_device allocated dynamically while the two
> domains are allocated statically? Seems like the approach could
> be consistent, though maybe there's some reason I'm missing.
>
On second thought, `hv_identity_domain` and `hv_blocking_domain` should
likely be allocated dynamically as well, consistent with `hv_iommu_device`.
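Roughly something like the sketch below (purely illustrative: it turns the
two domains into pointers, the helper name hv_iommu_alloc_special_domains()
is made up, it assumes <linux/slab.h> plus this driver's iommu.h are in
scope, and the rest of the domain setup is left out):

static struct hv_iommu_domain *hv_identity_domain;
static struct hv_iommu_domain *hv_blocking_domain;

/* Allocate the two special domains the same way hv_iommu_device is
 * allocated, instead of defining them statically.
 */
static int __init hv_iommu_alloc_special_domains(void)
{
        hv_identity_domain = kzalloc(sizeof(*hv_identity_domain), GFP_KERNEL);
        hv_blocking_domain = kzalloc(sizeof(*hv_blocking_domain), GFP_KERNEL);
        if (!hv_identity_domain || !hv_blocking_domain) {
                kfree(hv_identity_domain);      /* kfree(NULL) is a no-op */
                kfree(hv_blocking_domain);
                return -ENOMEM;
        }

        /* The remaining setup (domain type, ops, etc.) would stay as in
         * the current static definitions.
         */
        return 0;
}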
<snip>
> > +static int hv_iommu_get_logical_device_property(struct device *dev,
> > + enum hv_logical_device_property_code code,
> > + struct hv_output_get_logical_device_property *property)
> > +{
> > + u64 status;
> > + unsigned long flags;
> > + struct hv_input_get_logical_device_property *input;
> > + struct hv_output_get_logical_device_property *output;
> > +
> > + local_irq_save(flags);
> > +
> > + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > + output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> > + memset(input, 0, sizeof(*input));
> > + memset(output, 0, sizeof(*output));
>
> General practice is to *not* zero the output area prior to a hypercall.
> The hypervisor should be correctly setting all the output bits. There are
> a couple of cases in the new MSHV code where the output is zero'ed, but
> I'm planning to submit a patch to remove those so that hypercall call
> sites that have output are consistent across the code base. Of course,
> it's possible to have a Hyper-V bug where it doesn't do the right thing,
> and zero'ing the output could be done as a workaround. But such cases
> should be explicitly known with code comments indicating the reason for
> the zero'ing.
>
> Same applies in hv_iommu_detect().
>
Thanks for the information! Just to clarify: this is because the hypervisor
is expected to fill in the output page itself, while the memset is still
needed for the input page. Am I correct?
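If that understanding is right, I will drop the output memset at this call
site (and the one in hv_iommu_detect()) in the next version, keeping only
the input memset, i.e.:

        input = *this_cpu_ptr(hyperv_pcpu_input_arg);
        output = *this_cpu_ptr(hyperv_pcpu_output_arg);
        memset(input, 0, sizeof(*input));
-       memset(output, 0, sizeof(*output));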
<snip>
> > +static void hv_iommu_shutdown(void)
> > +{
> > + iommu_device_sysfs_remove(&hv_iommu_device->iommu);
> > +
> > + kfree(hv_iommu_device);
> > +}
> > +
> > +static struct syscore_ops hv_iommu_syscore_ops = {
> > + .shutdown = hv_iommu_shutdown,
> > +};
>
> Why is a shutdown needed at all? hv_iommu_shutdown() doesn't do anything
> that is really needed, since sysfs entries are transient, and freeing
> memory isn't relevant for a shutdown.
>
For iommu_device_sysfs_remove(), I guess it is not necessary, and I will
need to do some homework to better understand sysfs. :)
Originally, we wanted a shutdown routine to trigger some hypercall, so
that Hyper-V disables DMA translation, e.g., during the VM reboot process.
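If the hook is kept for that purpose, I would expect it to shrink to
something like the sketch below (nothing here is final; the hypercall to
disable translation is not part of this patch and is only marked as a TODO):

static void hv_iommu_shutdown(void)
{
        /*
         * No iommu_device_sysfs_remove() or kfree() here: sysfs entries are
         * transient and freeing memory is not relevant at shutdown.
         *
         * TODO: issue the hypercall(s) that ask Hyper-V to stop DMA
         * translation (e.g. move endpoints back to a blocking/identity
         * domain) before the VM reboots.
         */
}

static struct syscore_ops hv_iommu_syscore_ops = {
        .shutdown = hv_iommu_shutdown,
};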
<snip>
> > +device_initcall(hv_iommu_init);
>
> I'm concerned about the timing of this initialization. VMBus is
> initialized with subsys_initcall(), which is initcall level 4 while
> device_initcall() is initcall level 6. So VMBus initialization happens
> quite a bit earlier, and the hypervisor starts offering devices to the
> guest, including PCI pass-thru devices, before the IOMMU initialization
> starts. I cobbled together a way to make this IOMMU code run in an Azure
> VM using the identity domain. The VM has an NVMe OS disk, two NVMe data
> disks, and a MANA NIC. The NVMe devices were offered, and completed
> hv_pci_probe() before this IOMMU initialization was started. When IOMMU
> initialization did run, it went back and found the NVMe devices. But I'm
> unsure if that's OK because my hacked together environment obviously
> couldn't do real IOMMU mapping. It appears that the NVMe device driver
> didn't start its initialization until after the IOMMU driver was setup,
> which would probably make everything OK. But that might be just timing
> luck, or maybe there's something that affirmatively prevents the native
> PCI driver (like NVMe) from getting started until after all the initcalls
> have finished.
>
This is yet another immature attempt by me to do hv_iommu_init() in an
arch-independent path, and I do not think using device_initcall() is
harmless. This patch set was tested using an assigned Intel DSA device,
and the DMA tests succeeded without any error. But that is not enough to
justify using device_initcall(): I rebuilt the idxd driver as a kernel
builtin and realized that, just as you said, both hv_pci_probe() and
idxd_pci_probe() were triggered before hv_iommu_init(), and when the
pvIOMMU tried to probe the endpoint device, a warning was printed:

[ 3.609697] idxd 13d7:00:00.0: late IOMMU probe at driver bind, something fishy here!
> I'm planning to look at this further to see if there's a way for a PCI
> driver to try initializing a pass-thru device *before* this IOMMU driver
> has initialized. If so, a different way to do the IOMMU initialization
> will be needed that is linked to VMBus initialization so things can't
> happen out-of-order. Establishing such a linkage is probably a good idea
> regardless.
>
> FWIW, the Azure VM with the 3 NVMe devices and MANA, and operating with
> the identity IOMMU domain, all seemed to work fine! Got 4 IOMMU groups,
> and devices coming and going dynamically all worked correctly. When a device
> was removed, it was moved to the blocking domain, and then flushed before
> being finally removed. All good! I wish I had a way to test with an IOMMU
> paging domain that was doing real translation.
>
Thank you, Michael! I really appreciate you running these extra experiments!
My tests on this DSA device passed (using a paging domain) too, with no DMA
errors observed, regardless of whether its driver was builtin or a kernel
module. But that does not make me confident about using `device_initcall`.
I believe your concern is valid. E.g., couldn't an endpoint device allocate
a DMA address (using a raw GPA instead of a gIOVA) before the pvIOMMU is
initialized, and then use that address for DMA later, after a paging domain
has been attached?
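As for fixing the ordering, one naive option (purely illustrative, not what
this patch does, and probably not the final answer) would be to register the
IOMMU at an initcall level that runs before VMBus's subsys_initcall(), e.g.:

/*
 * Illustrative only: arch_initcall() is level 3, so it runs before VMBus's
 * subsys_initcall() at level 4, i.e. before any VMBus device offers (and
 * hence hv_pci_probe()) can be processed.  Explicitly calling
 * hv_iommu_init() from the VMBus init path, as you suggested, is probably
 * more robust than relying on initcall levels, though.
 */
arch_initcall(hv_iommu_init);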
> > diff --git a/drivers/iommu/hyperv/iommu.h b/drivers/iommu/hyperv/iommu.h
> > new file mode 100644
> > index 000000000000..c8657e791a6e
> > --- /dev/null
> > +++ b/drivers/iommu/hyperv/iommu.h
> > @@ -0,0 +1,53 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +
> > +/*
> > + * Hyper-V IOMMU driver.
> > + *
> > + * Copyright (C) 2024-2025, Microsoft, Inc.
> > + *
> > + */
> > +
> > +#ifndef _HYPERV_IOMMU_H
> > +#define _HYPERV_IOMMU_H
> > +
> > +struct hv_iommu_dev {
> > + struct iommu_device iommu;
> > + struct ida domain_ids;
> > +
> > + /* Device configuration */
> > + u8 max_iova_width;
> > + u8 max_pasid_width;
> > + u64 cap;
> > + u64 pgsize_bitmap;
> > +
> > + struct iommu_domain_geometry geometry;
> > + u64 first_domain;
> > + u64 last_domain;
> > +};
> > +
> > +struct hv_iommu_domain {
> > + union {
> > + struct iommu_domain domain;
> > + struct pt_iommu pt_iommu;
> > + struct pt_iommu_x86_64 pt_iommu_x86_64;
> > + };
> > + struct hv_iommu_dev *hv_iommu;
> > + struct hv_input_device_domain device_domain;
> > + u64 pgsize_bitmap;
> > +
> > + spinlock_t lock; /* protects dev_list and TLB flushes */
> > + /* List of devices in this DMA domain */
>
> It appears that this list is really a list of endpoints (i.e., struct
> hv_iommu_endpoint), not devices (which I read to be struct
> hv_iommu_dev).
>
> But that said, what is the list used for? I see code to add
> endpoints to the list, and to remove them, but the list is never
> walked by any code in this patch set. If there is an anticipated
> future use, it would be better to add the list as part of the code
> for that future use.
>
Yes, we do not really need this list for this patch set. Thanks!
B.R.
Yu