On Thu, Jan 08, 2026 at 06:48:59PM +0000, Michael Kelley wrote:
> From: Yu Zhang <[email protected]> Sent: Monday, December 8, 2025 9:11 PM
<snip>
Thank you so much, Michael, for the thorough review!
I've snipped the comments I fully agree with and will address in the
next version. Actually, I have to admit I agree with your remaining
comments below as well. :)
> > +struct hv_iommu_dev *hv_iommu_device;
> > +static struct hv_iommu_domain hv_identity_domain;
> > +static struct hv_iommu_domain hv_blocking_domain;
>
> Why is hv_iommu_device allocated dynamically while the two
> domains are allocated statically? Seems like the approach could
> be consistent, though maybe there's some reason I'm missing.
>
On second thought, `hv_identity_domain` and `hv_blocking_domain` should
likely be allocated dynamically as well, consistent with `hv_iommu_device`.
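Roughly something like the sketch below (purely illustrative: it turns the
two domains into pointers, the helper name hv_iommu_alloc_special_domains()
is made up, it assumes <linux/slab.h> plus this driver's iommu.h are in
scope, and the rest of the domain setup is left out):

static struct hv_iommu_domain *hv_identity_domain;
static struct hv_iommu_domain *hv_blocking_domain;

/* Allocate the two special domains the same way hv_iommu_device is
 * allocated, instead of defining them statically.
 */
static int __init hv_iommu_alloc_special_domains(void)
{
        hv_identity_domain = kzalloc(sizeof(*hv_identity_domain), GFP_KERNEL);
        hv_blocking_domain = kzalloc(sizeof(*hv_blocking_domain), GFP_KERNEL);
        if (!hv_identity_domain || !hv_blocking_domain) {
                kfree(hv_identity_domain);      /* kfree(NULL) is a no-op */
                kfree(hv_blocking_domain);
                return -ENOMEM;
        }

        /* The remaining setup (domain type, ops, etc.) would stay as in
         * the current static definitions.
         */
        return 0;
}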
<snip>
> > +static int hv_iommu_get_logical_device_property(struct device *dev,
> > + enum hv_logical_device_property_code code,
> > + struct hv_output_get_logical_device_property *property)
> > +{
> > + u64 status;
> > + unsigned long flags;
> > + struct hv_input_get_logical_device_property *input;
> > + struct hv_output_get_logical_device_property *output;
> > +
> > + local_irq_save(flags);
> > +
> > + input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> > + output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> > + memset(input, 0, sizeof(*input));
> > + memset(output, 0, sizeof(*output));
>
> General practice is to *not* zero the output area prior to a hypercall.
> The hypervisor should be correctly setting all the output bits. There are
> a couple of cases in the new MSHV code where the output is zero'ed, but
> I'm planning to submit a patch to remove those so that hypercall call
> sites that have output are consistent across the code base. Of course,
> it's possible to have a Hyper-V bug where it doesn't do the right thing,
> and zero'ing the output could be done as a workaround. But such cases
> should be explicitly known with code comments indicating the reason for
> the zero'ing.
>
> Same applies in hv_iommu_detect().
>
Thanks for the information! Just to clarify: this is because the hypervisor
is expected to fill in the output page itself, while the memset is still
needed for the input page. Am I correct?
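If that understanding is right, I will drop the output memset at this call
site (and the one in hv_iommu_detect()) in the next version, keeping only
the input memset, i.e.:

        input = *this_cpu_ptr(hyperv_pcpu_input_arg);
        output = *this_cpu_ptr(hyperv_pcpu_output_arg);
        memset(input, 0, sizeof(*input));
-       memset(output, 0, sizeof(*output));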
<snip>
> > +static void hv_iommu_shutdown(void)
> > +{
> > + iommu_device_sysfs_remove(&hv_iommu_device->iommu);
> > +
> > + kfree(hv_iommu_device);
> > +}
> > +
> > +static struct syscore_ops hv_iommu_syscore_ops = {
> > + .shutdown = hv_iommu_shutdown,
> > +};
>
> Why is a shutdown needed at all? hv_iommu_shutdown() doesn't do anything
> that is really needed, since sysfs entries are transient, and freeing
> memory isn't relevant for a shutdown.
>
For iommu_device_sysfs_remove(), I guess it is not necessary, and I will
need to do some homework to better understand sysfs. :)
Originally, we wanted a shutdown routine to trigger some hypercall, so
that Hyper-V disables DMA translation, e.g., during the VM reboot process.
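If the hook is kept for that purpose, I would expect it to shrink to
something like the sketch below (nothing here is final; the hypercall to
disable translation is not part of this patch and is only marked as a TODO):

static void hv_iommu_shutdown(void)
{
        /*
         * No iommu_device_sysfs_remove() or kfree() here: sysfs entries are
         * transient and freeing memory is not relevant at shutdown.
         *
         * TODO: issue the hypercall(s) that ask Hyper-V to stop DMA
         * translation (e.g. move endpoints back to a blocking/identity
         * domain) before the VM reboots.
         */
}

static struct syscore_ops hv_iommu_syscore_ops = {
        .shutdown = hv_iommu_shutdown,
};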
<snip>
> > +device_initcall(hv_iommu_init);
>
> I'm concerned about the timing of this initialization. VMBus is
> initialized with subsys_initcall(), which is initcall level 4 while
> device_initcall() is initcall level 6. So VMBus initialization happens
> quite a bit earlier, and the hypervisor starts offering devices to the
> guest, including PCI pass-thru devices, before the IOMMU initialization
> starts. I cobbled together a way to make this IOMMU code run in an Azure
> VM using the identity domain. The VM has an NVMe OS disk, two NVMe data
> disks, and a MANA NIC. The NVMe devices were offered, and completed
> hv_pci_probe() before this IOMMU initialization was started. When IOMMU
> initialization did run, it went back and found the NVMe devices. But I'm
> unsure if that's OK because my hacked together environment obviously
> couldn't do real IOMMU mapping. It appears that the NVMe device driver
> didn't start its initialization until after the IOMMU driver was setup,
> which would probably make everything OK. But that might be just timing
> luck, or maybe there's something that affirmatively prevents the native
> PCI driver (like NVMe) from getting started until after all the initcalls
> have finished.
>
This is yet another immature attempt by me to do hv_iommu_init() in an
arch-independent path, and I do not think using device_initcall() is
harmless. This patch set was tested using an assigned Intel DSA device,
and the DMA tests succeeded without any error. But that is not enough to
justify using device_initcall(): I rebuilt the idxd driver as a kernel
builtin and realized that, just as you said, both hv_pci_probe() and
idxd_pci_probe() were triggered before hv_iommu_init(), and when the
pvIOMMU tried to probe the endpoint device, a warning was printed:

[ 3.609697] idxd 13d7:00:00.0: late IOMMU probe at driver bind, something fishy here!
> I'm planning to look at this further to see if there's a way for a PCI
> driver to try initializing a pass-thru device *before* this IOMMU driver
> has initialized. If so, a different way to do the IOMMU initialization
> will be needed that is linked to VMBus initialization so things can't
> happen out-of-order. Establishing such a linkage is probably a good idea
> regardless.
>
> FWIW, the Azure VM with the 3 NVMe devices and MANA, and operating with
> the identity IOMMU domain, all seemed to work fine! Got 4 IOMMU groups,
> and devices coming and going dynamically all worked correctly. When a device
> was removed, it was moved to the blocking domain, and then flushed before
> being finally removed. All good! I wish I had a way to test with an IOMMU
> paging domain that was doing real translation.
>
Thank you, Michael! I really appreciate you running these extra experiments!
My tests on this DSA device passed (using a paging domain) too, with no DMA
errors observed, regardless of whether its driver was builtin or a kernel
module. But that does not make me confident about using `device_initcall`.
I believe your concern is valid. E.g., couldn't an endpoint device allocate
a DMA address (using a raw GPA instead of a gIOVA) before the pvIOMMU is
initialized, and then use that address for DMA later, after a paging domain
has been attached?
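As for fixing the ordering, one naive option (purely illustrative, not what
this patch does, and probably not the final answer) would be to register the
IOMMU at an initcall level that runs before VMBus's subsys_initcall(), e.g.:

/*
 * Illustrative only: arch_initcall() is level 3, so it runs before VMBus's
 * subsys_initcall() at level 4, i.e. before any VMBus device offers (and
 * hence hv_pci_probe()) can be processed.  Explicitly calling
 * hv_iommu_init() from the VMBus init path, as you suggested, is probably
 * more robust than relying on initcall levels, though.
 */
arch_initcall(hv_iommu_init);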
> > diff --git a/drivers/iommu/hyperv/iommu.h b/drivers/iommu/hyperv/iommu.h
> > new file mode 100644
> > index 000000000000..c8657e791a6e
> > --- /dev/null
> > +++ b/drivers/iommu/hyperv/iommu.h
> > @@ -0,0 +1,53 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +
> > +/*
> > + * Hyper-V IOMMU driver.
> > + *
> > + * Copyright (C) 2024-2025, Microsoft, Inc.
> > + *
> > + */
> > +
> > +#ifndef _HYPERV_IOMMU_H
> > +#define _HYPERV_IOMMU_H
> > +
> > +struct hv_iommu_dev {
> > + struct iommu_device iommu;
> > + struct ida domain_ids;
> > +
> > + /* Device configuration */
> > + u8 max_iova_width;
> > + u8 max_pasid_width;
> > + u64 cap;
> > + u64 pgsize_bitmap;
> > +
> > + struct iommu_domain_geometry geometry;
> > + u64 first_domain;
> > + u64 last_domain;
> > +};
> > +
> > +struct hv_iommu_domain {
> > + union {
> > + struct iommu_domain domain;
> > + struct pt_iommu pt_iommu;
> > + struct pt_iommu_x86_64 pt_iommu_x86_64;
> > + };
> > + struct hv_iommu_dev *hv_iommu;
> > + struct hv_input_device_domain device_domain;
> > + u64 pgsize_bitmap;
> > +
> > + spinlock_t lock; /* protects dev_list and TLB flushes */
> > + /* List of devices in this DMA domain */
>
> It appears that this list is really a list of endpoints (i.e., struct
> hv_iommu_endpoint), not devices (which I read to be struct
> hv_iommu_dev).
>
> But that said, what is the list used for? I see code to add
> endpoints to the list, and to remove them, but the list is never
> walked by any code in this patch set. If there is an anticipated
> future use, it would be better to add the list as part of the code
> for that future use.
>
Yes, we do not really need this list for this patch set. Thanks!
B.R.
Yu