From: Yu Zhang <[email protected]> Sent: Monday, December 8, 2025 9:11 PM
> 
> Add a para-virtualized IOMMU driver for Linux guests running on Hyper-V.
> This driver implements stage-1 IO translation within the guest OS.
> It integrates with the Linux IOMMU core, utilizing Hyper-V hypercalls
> for:
>  - Capability discovery
>  - Domain allocation, configuration, and deallocation
>  - Device attachment and detachment
>  - IOTLB invalidation
> 
> The driver constructs x86-compatible stage-1 IO page tables in the
> guest memory using consolidated IO page table helpers. This allows
> the guest to manage stage-1 translations independently of vendor-
> specific drivers (like Intel VT-d or AMD IOMMU).
> 
> Hyper-v consumes this stage-1 IO page table, when a device domain is

s/Hyper-v/Hyper-V/

> created and configured, and nests it with the host's stage-2 IO page
> tables, therefore elemenating the VM exits for guest IOMMU mapping

s/elemenating/eliminating/

> operations.
> 
> For guest IOMMU unmapping operations, VM exits to perform the IOTLB
> flush(and possibly the device TLB flush) is still unavoidable. For

Typo: Add a space after "flush" and before the open parenthesis.

> now, HVCALL_FLUSH_DEVICE_DOMAIN       is used to implement a domain-selective

Typo: Extra whitespace after HVCALL_FLUSH_DEVICE_DOMAIN.

> IOTLB flush. New hypercalls for finer-grained hypercall will be provided
> in future patches.
> 
> Co-developed-by: Wei Liu <[email protected]>
> Signed-off-by: Wei Liu <[email protected]>
> Co-developed-by: Jacob Pan <[email protected]>
> Signed-off-by: Jacob Pan <[email protected]>
> Co-developed-by: Easwar Hariharan <[email protected]>
> Signed-off-by: Easwar Hariharan <[email protected]>
> Signed-off-by: Yu Zhang <[email protected]>
> ---
>  drivers/iommu/hyperv/Kconfig  |  14 +
>  drivers/iommu/hyperv/Makefile |   1 +
>  drivers/iommu/hyperv/iommu.c  | 608 ++++++++++++++++++++++++++++++++++
>  drivers/iommu/hyperv/iommu.h  |  53 +++
>  4 files changed, 676 insertions(+)
>  create mode 100644 drivers/iommu/hyperv/iommu.c
>  create mode 100644 drivers/iommu/hyperv/iommu.h
> 
> diff --git a/drivers/iommu/hyperv/Kconfig b/drivers/iommu/hyperv/Kconfig
> index 30f40d867036..fa3c77752d7b 100644
> --- a/drivers/iommu/hyperv/Kconfig
> +++ b/drivers/iommu/hyperv/Kconfig
> @@ -8,3 +8,17 @@ config HYPERV_IOMMU
>       help
>         Stub IOMMU driver to handle IRQs to support Hyper-V Linux
>         guest and root partitions.
> +
> +if HYPERV_IOMMU
> +config HYPERV_PVIOMMU
> +     bool "Microsoft Hypervisor para-virtualized IOMMU support"
> +     depends on X86 && HYPERV && PCI_HYPERV

Depending on PCI_HYPERV is problematic as pointed out in my comments
on Patch 1 of this series.

> +     depends on IOMMU_PT

Use "select IOMMU_PT" instead of "depends"? Other IOMMU drivers use
"select".

> +     select IOMMU_API
> +     select IOMMU_DMA

IOMMU_DMA is enabled by default on x86 and arm64 architectures.
Other IOMMU drivers don't select it, so maybe this could be dropped.

> +     select DMA_OPS

DMA_OPS doesn't exist.  I'm not sure what this is supposed to be.

> +     select IOMMU_IOVA
> +     default HYPERV
> +     help
> +       A para-virtualized IOMMU for Microsoft Hypervisor guest.
> +endif
> diff --git a/drivers/iommu/hyperv/Makefile b/drivers/iommu/hyperv/Makefile
> index 9f557bad94ff..8669741c0a51 100644
> --- a/drivers/iommu/hyperv/Makefile
> +++ b/drivers/iommu/hyperv/Makefile
> @@ -1,2 +1,3 @@
>  # SPDX-License-Identifier: GPL-2.0
>  obj-$(CONFIG_HYPERV_IOMMU) += irq_remapping.o
> +obj-$(CONFIG_HYPERV_PVIOMMU) += iommu.o
> diff --git a/drivers/iommu/hyperv/iommu.c b/drivers/iommu/hyperv/iommu.c
> new file mode 100644
> index 000000000000..3d0aff868e16
> --- /dev/null
> +++ b/drivers/iommu/hyperv/iommu.c
> @@ -0,0 +1,608 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Hyper-V IOMMU driver.
> + *
> + * Copyright (C) 2019, 2024-2025 Microsoft, Inc.
> + */
> +
> +#include <linux/iommu.h>
> +#include <linux/pci.h>
> +#include <linux/dma-map-ops.h>
> +#include <linux/generic_pt/iommu.h>
> +#include <linux/syscore_ops.h>
> +#include <linux/pci-ats.h>
> +
> +#include <asm/iommu.h>
> +#include <asm/hypervisor.h>
> +#include <asm/mshyperv.h>
> +
> +#include "iommu.h"
> +#include "../dma-iommu.h"
> +#include "../iommu-pages.h"
> +
> +static void hv_iommu_detach_dev(struct iommu_domain *domain, struct device *dev);
> +static void hv_flush_device_domain(struct hv_iommu_domain *hv_domain);

With some fairly simple reordering of code in this source file, these
two declarations could go away. Generally, the best practice is to order the
code so that such declarations aren't needed, though that's not always possible.

> +struct hv_iommu_dev *hv_iommu_device;
> +static struct hv_iommu_domain hv_identity_domain;
> +static struct hv_iommu_domain hv_blocking_domain;

Why is hv_iommu_device allocated dynamically while the two
domains are allocated statically? Seems like the approach could
be consistent, though maybe there's some reason I'm missing.

> +static const struct iommu_domain_ops hv_iommu_identity_domain_ops;
> +static const struct iommu_domain_ops hv_iommu_blocking_domain_ops;
> +static struct iommu_ops hv_iommu_ops;

I'm wondering if this declaration could also be eliminated by some
reordering, though I didn't take time to figure out the details. Maybe
this is one of those cases that can't be avoided.

> +
> +#define hv_iommu_present(iommu_cap) (iommu_cap & HV_IOMMU_CAP_PRESENT)
> +#define hv_iommu_s1_domain_supported(iommu_cap) (iommu_cap & HV_IOMMU_CAP_S1)
> +#define hv_iommu_5lvl_supported(iommu_cap) (iommu_cap & HV_IOMMU_CAP_S1_5LVL)
> +#define hv_iommu_ats_supported(iommu_cap) (iommu_cap & HV_IOMMU_CAP_ATS)
> +
> +static int hv_create_device_domain(struct hv_iommu_domain *hv_domain, u32 domain_stage)
> +{
> +     int ret;
> +     u64 status;
> +     unsigned long flags;
> +     struct hv_input_create_device_domain *input;
> +
> +     ret = ida_alloc_range(&hv_iommu_device->domain_ids,
> +                     hv_iommu_device->first_domain, hv_iommu_device->last_domain,
> +                     GFP_KERNEL);
> +     if (ret < 0)
> +             return ret;
> +
> +     hv_domain->device_domain.partition_id = HV_PARTITION_ID_SELF;
> +     hv_domain->device_domain.domain_id.type = domain_stage;
> +     hv_domain->device_domain.domain_id.id = ret;
> +     hv_domain->hv_iommu = hv_iommu_device;
> +
> +     local_irq_save(flags);
> +
> +     input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +     memset(input, 0, sizeof(*input));
> +     input->device_domain = hv_domain->device_domain;
> +     input->create_device_domain_flags.forward_progress_required = 1;
> +     input->create_device_domain_flags.inherit_owning_vtl = 0;
> +     status = hv_do_hypercall(HVCALL_CREATE_DEVICE_DOMAIN, input, NULL);
> +
> +     local_irq_restore(flags);
> +
> +     if (!hv_result_success(status)) {
> +             pr_err("%s: hypercall failed, status %lld\n", __func__, status);
> +             ida_free(&hv_iommu_device->domain_ids, hv_domain->device_domain.domain_id.id);
> +     }
> +
> +     return hv_result_to_errno(status);
> +}
> +
> +static void hv_delete_device_domain(struct hv_iommu_domain *hv_domain)
> +{
> +     u64 status;
> +     unsigned long flags;
> +     struct hv_input_delete_device_domain *input;
> +
> +     local_irq_save(flags);
> +
> +     input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +     memset(input, 0, sizeof(*input));
> +     input->device_domain = hv_domain->device_domain;
> +     status = hv_do_hypercall(HVCALL_DELETE_DEVICE_DOMAIN, input, NULL);
> +
> +     local_irq_restore(flags);
> +
> +     if (!hv_result_success(status))
> +             pr_err("%s: hypercall failed, status %lld\n", __func__, status);
> +
> +     ida_free(&hv_domain->hv_iommu->domain_ids, hv_domain->device_domain.domain_id.id);
> +}
> +
> +static bool hv_iommu_capable(struct device *dev, enum iommu_cap cap)
> +{
> +     switch (cap) {
> +     case IOMMU_CAP_CACHE_COHERENCY:
> +             return true;
> +     case IOMMU_CAP_DEFERRED_FLUSH:
> +             return true;
> +     default:
> +             return false;
> +     }
> +}
> +
> +static int hv_iommu_attach_dev(struct iommu_domain *domain, struct device *dev)
> +{
> +     u64 status;
> +     unsigned long flags;
> +     struct pci_dev *pdev;
> +     struct hv_input_attach_device_domain *input;
> +     struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> +     struct hv_iommu_domain *hv_domain = to_hv_iommu_domain(domain);
> +
> +     /* Only allow PCI devices for now */
> +     if (!dev_is_pci(dev))
> +             return -EINVAL;
> +
> +     if (vdev->hv_domain == hv_domain)
> +             return 0;
> +
> +     if (vdev->hv_domain)
> +             hv_iommu_detach_dev(&vdev->hv_domain->domain, dev);
> +
> +     pdev = to_pci_dev(dev);
> +     dev_dbg(dev, "Attaching (%strusted) to %d\n", pdev->untrusted ? "un" : "",
> +             hv_domain->device_domain.domain_id.id);
> +
> +     local_irq_save(flags);
> +
> +     input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +     memset(input, 0, sizeof(*input));
> +     input->device_domain = hv_domain->device_domain;
> +     input->device_id.as_uint64 = hv_build_logical_dev_id(pdev);
> +     status = hv_do_hypercall(HVCALL_ATTACH_DEVICE_DOMAIN, input, NULL);
> +
> +     local_irq_restore(flags);
> +
> +     if (!hv_result_success(status)) {
> +             pr_err("%s: hypercall failed, status %lld\n", __func__, status);
> +     } else {
> +             vdev->hv_domain = hv_domain;
> +             spin_lock_irqsave(&hv_domain->lock, flags);
> +             list_add(&vdev->list, &hv_domain->dev_list);
> +             spin_unlock_irqrestore(&hv_domain->lock, flags);
> +     }
> +
> +     return hv_result_to_errno(status);
> +}
> +
> +static void hv_iommu_detach_dev(struct iommu_domain *domain, struct device *dev)
> +{
> +     u64 status;
> +     unsigned long flags;
> +     struct hv_input_detach_device_domain *input;
> +     struct pci_dev *pdev;
> +     struct hv_iommu_domain *hv_domain = to_hv_iommu_domain(domain);
> +     struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> +
> +     /* See the attach function, only PCI devices for now */
> +     if (!dev_is_pci(dev) || vdev->hv_domain != hv_domain)
> +             return;
> +
> +     pdev = to_pci_dev(dev);
> +
> +     dev_dbg(dev, "Detaching from %d\n", hv_domain->device_domain.domain_id.id);
> +
> +     local_irq_save(flags);
> +
> +     input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +     memset(input, 0, sizeof(*input));
> +     input->partition_id = HV_PARTITION_ID_SELF;
> +     input->device_id.as_uint64 = hv_build_logical_dev_id(pdev);
> +     status = hv_do_hypercall(HVCALL_DETACH_DEVICE_DOMAIN, input, NULL);
> +
> +     local_irq_restore(flags);
> +
> +     if (!hv_result_success(status))
> +             pr_err("%s: hypercall failed, status %lld\n", __func__, status);
> +
> +     spin_lock_irqsave(&hv_domain->lock, flags);
> +     hv_flush_device_domain(hv_domain);
> +     list_del(&vdev->list);
> +     spin_unlock_irqrestore(&hv_domain->lock, flags);
> +
> +     vdev->hv_domain = NULL;
> +}
> +
> +static int hv_iommu_get_logical_device_property(struct device *dev,
> +                                     enum hv_logical_device_property_code code,
> +                                     struct hv_output_get_logical_device_property *property)
> +{
> +     u64 status;
> +     unsigned long flags;
> +     struct hv_input_get_logical_device_property *input;
> +     struct hv_output_get_logical_device_property *output;
> +
> +     local_irq_save(flags);
> +
> +     input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +     output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> +     memset(input, 0, sizeof(*input));
> +     memset(output, 0, sizeof(*output));

General practice is to *not* zero the output area prior to a hypercall. The
hypervisor should be correctly setting all the output bits. There are a couple
of cases in the new MSHV code where the output is zeroed, but I'm planning to
submit a patch to remove those so that hypercall call sites that have output
are consistent across the code base. Of course, it's possible to have a
Hyper-V bug where it doesn't do the right thing, and zeroing the output could
be done as a workaround. But such cases should be explicitly known, with code
comments indicating the reason for the zeroing.

Same applies in hv_iommu_detect().
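In other words, something like this (sketch of the suggested change, assuming
the hypervisor fully populates the output area on success):

	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
	output = *this_cpu_ptr(hyperv_pcpu_output_arg);
	memset(input, 0, sizeof(*input));
	/* No memset() of the output area -- the hypervisor fills it in */
	input->partition_id = HV_PARTITION_ID_SELF;
	input->logical_device_id = hv_build_logical_dev_id(to_pci_dev(dev));
	input->code = code;
	status = hv_do_hypercall(HVCALL_GET_LOGICAL_DEVICE_PROPERTY, input, output);
	*property = *output;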

> +     input->partition_id = HV_PARTITION_ID_SELF;
> +     input->logical_device_id = hv_build_logical_dev_id(to_pci_dev(dev));
> +     input->code = code;
> +     status = hv_do_hypercall(HVCALL_GET_LOGICAL_DEVICE_PROPERTY, input, output);
> +     *property = *output;
> +
> +     local_irq_restore(flags);
> +
> +     if (!hv_result_success(status))
> +             pr_err("%s: hypercall failed, status %lld\n", __func__, status);
> +
> +     return hv_result_to_errno(status);
> +}
> +
> +static struct iommu_device *hv_iommu_probe_device(struct device *dev)
> +{
> +     struct pci_dev *pdev;
> +     struct hv_iommu_endpoint *vdev;
> +     struct hv_output_get_logical_device_property device_iommu_property = {0};
> +
> +     if (!dev_is_pci(dev))
> +             return ERR_PTR(-ENODEV);
> +
> +     if (hv_iommu_get_logical_device_property(dev,
> +                                              HV_LOGICAL_DEVICE_PROPERTY_PVIOMMU,
> +                                              &device_iommu_property) ||
> +         !(device_iommu_property.device_iommu & HV_DEVICE_IOMMU_ENABLED))
> +             return ERR_PTR(-ENODEV);
> +
> +     pdev = to_pci_dev(dev);
> +     vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
> +     if (!vdev)
> +             return ERR_PTR(-ENOMEM);
> +
> +     vdev->dev = dev;
> +     vdev->hv_iommu = hv_iommu_device;
> +     dev_iommu_priv_set(dev, vdev);
> +
> +     if (hv_iommu_ats_supported(hv_iommu_device->cap) &&
> +         pci_ats_supported(pdev))
> +             pci_enable_ats(pdev, __ffs(hv_iommu_device->pgsize_bitmap));
> +
> +     return &vdev->hv_iommu->iommu;
> +}
> +
> +static void hv_iommu_release_device(struct device *dev)
> +{
> +     struct hv_iommu_endpoint *vdev = dev_iommu_priv_get(dev);
> +
> +     if (vdev->hv_domain)
> +             hv_iommu_detach_dev(&vdev->hv_domain->domain, dev);
> +
> +     dev_iommu_priv_set(dev, NULL);
> +     set_dma_ops(dev, NULL);
> +
> +     kfree(vdev);
> +}
> +
> +static struct iommu_group *hv_iommu_device_group(struct device *dev)
> +{
> +     if (dev_is_pci(dev))
> +             return pci_device_group(dev);
> +     else
> +             return generic_device_group(dev);
> +}
> +
> +static int hv_configure_device_domain(struct hv_iommu_domain *hv_domain, u32 domain_type)
> +{
> +     u64 status;
> +     unsigned long flags;
> +     struct pt_iommu_x86_64_hw_info pt_info;
> +     struct hv_input_configure_device_domain *input;
> +
> +     local_irq_save(flags);
> +
> +     input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +     memset(input, 0, sizeof(*input));
> +     input->device_domain = hv_domain->device_domain;
> +     input->settings.flags.blocked = (domain_type == IOMMU_DOMAIN_BLOCKED);
> +     input->settings.flags.translation_enabled = (domain_type != IOMMU_DOMAIN_IDENTITY);
> +
> +     if (domain_type & __IOMMU_DOMAIN_PAGING) {
> +             pt_iommu_x86_64_hw_info(&hv_domain->pt_iommu_x86_64, &pt_info);
> +             input->settings.page_table_root = pt_info.gcr3_pt;
> +             input->settings.flags.first_stage_paging_mode =
> +                     pt_info.levels == 5;
> +     }
> +     status = hv_do_hypercall(HVCALL_CONFIGURE_DEVICE_DOMAIN, input, NULL);
> +
> +     local_irq_restore(flags);
> +
> +     if (!hv_result_success(status))
> +             pr_err("%s: hypercall failed, status %lld\n", __func__, status);
> +
> +     return hv_result_to_errno(status);
> +}
> +
> +static int __init hv_initialize_static_domains(void)
> +{
> +     int ret;
> +     struct hv_iommu_domain *hv_domain;
> +
> +     /* Default stage-1 identity domain */
> +     hv_domain = &hv_identity_domain;
> +     memset(hv_domain, 0, sizeof(*hv_domain));

The memset() isn't necessary. hv_identity_domain is a static variable, so it
is already initialized to zero.

> +
> +     ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1);
> +     if (ret)
> +             return ret;
> +
> +     ret = hv_configure_device_domain(hv_domain, IOMMU_DOMAIN_IDENTITY);
> +     if (ret)
> +             goto delete_identity_domain;
> +
> +     hv_domain->domain.type = IOMMU_DOMAIN_IDENTITY;
> +     hv_domain->domain.ops = &hv_iommu_identity_domain_ops;
> +     hv_domain->domain.owner = &hv_iommu_ops;
> +     hv_domain->domain.geometry = hv_iommu_device->geometry;
> +     hv_domain->domain.pgsize_bitmap = hv_iommu_device->pgsize_bitmap;
> +     INIT_LIST_HEAD(&hv_domain->dev_list);
> +
> +     /* Default stage-1 blocked domain */
> +     hv_domain = &hv_blocking_domain;
> +     memset(hv_domain, 0, sizeof(*hv_domain));

Same here.

> +
> +     ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1);
> +     if (ret)
> +             goto delete_identity_domain;
> +
> +     ret = hv_configure_device_domain(hv_domain, IOMMU_DOMAIN_BLOCKED);
> +     if (ret)
> +             goto delete_blocked_domain;
> +
> +     hv_domain->domain.type = IOMMU_DOMAIN_BLOCKED;
> +     hv_domain->domain.ops = &hv_iommu_blocking_domain_ops;
> +     hv_domain->domain.owner = &hv_iommu_ops;
> +     hv_domain->domain.geometry = hv_iommu_device->geometry;
> +     hv_domain->domain.pgsize_bitmap = hv_iommu_device->pgsize_bitmap;
> +     INIT_LIST_HEAD(&hv_domain->dev_list);
> +
> +     return 0;
> +
> +delete_blocked_domain:
> +     hv_delete_device_domain(&hv_blocking_domain);
> +delete_identity_domain:
> +     hv_delete_device_domain(&hv_identity_domain);
> +     return ret;
> +}
> +
> +#define INTERRUPT_RANGE_START        (0xfee00000)
> +#define INTERRUPT_RANGE_END  (0xfeefffff)
> +static void hv_iommu_get_resv_regions(struct device *dev,
> +             struct list_head *head)
> +{
> +     struct iommu_resv_region *region;
> +
> +     region = iommu_alloc_resv_region(INTERRUPT_RANGE_START,
> +                                   INTERRUPT_RANGE_END - INTERRUPT_RANGE_START + 1,
> +                                   0, IOMMU_RESV_MSI, GFP_KERNEL);
> +     if (!region)
> +             return;
> +
> +     list_add_tail(&region->list, head);
> +}
> +
> +static void hv_flush_device_domain(struct hv_iommu_domain *hv_domain)
> +{
> +     u64 status;
> +     unsigned long flags;
> +     struct hv_input_flush_device_domain *input;
> +
> +     local_irq_save(flags);
> +
> +     input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +     memset(input, 0, sizeof(*input));
> +     input->device_domain.partition_id = hv_domain->device_domain.partition_id;
> +     input->device_domain.owner_vtl = hv_domain->device_domain.owner_vtl;
> +     input->device_domain.domain_id.type = hv_domain->device_domain.domain_id.type;
> +     input->device_domain.domain_id.id = hv_domain->device_domain.domain_id.id;
> +     status = hv_do_hypercall(HVCALL_FLUSH_DEVICE_DOMAIN, input, NULL);
> +
> +     local_irq_restore(flags);
> +
> +     if (!hv_result_success(status))
> +             pr_err("%s: hypercall failed, status %lld\n", __func__, status);
> +}
> +
> +static void hv_iommu_flush_iotlb_all(struct iommu_domain *domain)
> +{
> +     hv_flush_device_domain(to_hv_iommu_domain(domain));
> +}
> +
> +static void hv_iommu_iotlb_sync(struct iommu_domain *domain,
> +                             struct iommu_iotlb_gather *iotlb_gather)
> +{
> +     hv_flush_device_domain(to_hv_iommu_domain(domain));
> +
> +     iommu_put_pages_list(&iotlb_gather->freelist);
> +}
> +
> +static void hv_iommu_paging_domain_free(struct iommu_domain *domain)
> +{
> +     struct hv_iommu_domain *hv_domain = to_hv_iommu_domain(domain);
> +
> +     /* Free all remaining mappings */
> +     pt_iommu_deinit(&hv_domain->pt_iommu);
> +
> +     hv_delete_device_domain(hv_domain);
> +
> +     kfree(hv_domain);
> +}
> +
> +static const struct iommu_domain_ops hv_iommu_identity_domain_ops = {
> +     .attach_dev     = hv_iommu_attach_dev,
> +};
> +
> +static const struct iommu_domain_ops hv_iommu_blocking_domain_ops = {
> +     .attach_dev     = hv_iommu_attach_dev,
> +};
> +
> +static const struct iommu_domain_ops hv_iommu_paging_domain_ops = {
> +     .attach_dev     = hv_iommu_attach_dev,
> +     IOMMU_PT_DOMAIN_OPS(x86_64),
> +     .flush_iotlb_all = hv_iommu_flush_iotlb_all,
> +     .iotlb_sync = hv_iommu_iotlb_sync,
> +     .free = hv_iommu_paging_domain_free,
> +};
> +
> +static struct iommu_domain *hv_iommu_domain_alloc_paging(struct device *dev)
> +{
> +     int ret;
> +     struct hv_iommu_domain *hv_domain;
> +     struct pt_iommu_x86_64_cfg cfg = {};
> +
> +     hv_domain = kzalloc(sizeof(*hv_domain), GFP_KERNEL);
> +     if (!hv_domain)
> +             return ERR_PTR(-ENOMEM);
> +
> +     ret = hv_create_device_domain(hv_domain, HV_DEVICE_DOMAIN_TYPE_S1);
> +     if (ret) {
> +             kfree(hv_domain);
> +             return ERR_PTR(ret);
> +     }
> +
> +     hv_domain->domain.pgsize_bitmap = hv_iommu_device->pgsize_bitmap;
> +     hv_domain->domain.geometry = hv_iommu_device->geometry;
> +     hv_domain->pt_iommu.nid = dev_to_node(dev);
> +     INIT_LIST_HEAD(&hv_domain->dev_list);
> +     spin_lock_init(&hv_domain->lock);
> +
> +     cfg.common.hw_max_vasz_lg2 = hv_iommu_device->max_iova_width;
> +     cfg.common.hw_max_oasz_lg2 = 52;

FYI, when this code is rebased to the latest linux-next, cfg.top_level needs
to be set as well.

> +
> +     ret = pt_iommu_x86_64_init(&hv_domain->pt_iommu_x86_64, &cfg, GFP_KERNEL);
> +     if (ret) {
> +             hv_delete_device_domain(hv_domain);
> +             return ERR_PTR(ret);
> +     }
> +
> +     hv_domain->domain.ops = &hv_iommu_paging_domain_ops;
> +
> +     ret = hv_configure_device_domain(hv_domain, __IOMMU_DOMAIN_PAGING);
> +     if (ret) {
> +             pt_iommu_deinit(&hv_domain->pt_iommu);
> +             hv_delete_device_domain(hv_domain);
> +             return ERR_PTR(ret);
> +     }
> +
> +     return &hv_domain->domain;
> +}
> +
> +static struct iommu_ops hv_iommu_ops = {
> +     .capable                  = hv_iommu_capable,
> +     .domain_alloc_paging      = hv_iommu_domain_alloc_paging,
> +     .probe_device             = hv_iommu_probe_device,
> +     .release_device           = hv_iommu_release_device,
> +     .device_group             = hv_iommu_device_group,
> +     .get_resv_regions         = hv_iommu_get_resv_regions,
> +     .owner                    = THIS_MODULE,
> +     .identity_domain          = &hv_identity_domain.domain,
> +     .blocked_domain           = &hv_blocking_domain.domain,
> +     .release_domain           = &hv_blocking_domain.domain,
> +};
> +
> +static void hv_iommu_shutdown(void)
> +{
> +     iommu_device_sysfs_remove(&hv_iommu_device->iommu);
> +
> +     kfree(hv_iommu_device);
> +}
> +
> +static struct syscore_ops hv_iommu_syscore_ops = {
> +     .shutdown = hv_iommu_shutdown,
> +};

Why is a shutdown needed at all?  hv_iommu_shutdown() doesn't do anything
that's really needed, since sysfs entries are transient, and freeing memory
isn't relevant for a shutdown.

> +
> +static int hv_iommu_detect(struct hv_output_get_iommu_capabilities *hv_iommu_cap)
> +{
> +     u64 status;
> +     unsigned long flags;
> +     struct hv_input_get_iommu_capabilities *input;
> +     struct hv_output_get_iommu_capabilities *output;
> +
> +     local_irq_save(flags);
> +
> +     input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +     output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> +     memset(input, 0, sizeof(*input));
> +     memset(output, 0, sizeof(*output));
> +     input->partition_id = HV_PARTITION_ID_SELF;
> +     status = hv_do_hypercall(HVCALL_GET_IOMMU_CAPABILITIES, input, output);
> +     *hv_iommu_cap = *output;
> +
> +     local_irq_restore(flags);
> +
> +     if (!hv_result_success(status))
> +             pr_err("%s: hypercall failed, status %lld\n", __func__, status);
> +
> +     return hv_result_to_errno(status);
> +}
> +
> +static void __init hv_init_iommu_device(struct hv_iommu_dev *hv_iommu,
> +                     struct hv_output_get_iommu_capabilities *hv_iommu_cap)
> +{
> +     ida_init(&hv_iommu->domain_ids);
> +
> +     hv_iommu->cap = hv_iommu_cap->iommu_cap;
> +     hv_iommu->max_iova_width = hv_iommu_cap->max_iova_width;
> +     if (!hv_iommu_5lvl_supported(hv_iommu->cap) &&
> +         hv_iommu->max_iova_width > 48) {
> +             pr_err("5-level paging not supported, limiting iova width to 48.\n");
> +             hv_iommu->max_iova_width = 48;
> +     }
> +
> +     hv_iommu->geometry = (struct iommu_domain_geometry) {
> +             .aperture_start = 0,
> +             .aperture_end = (((u64)1) << hv_iommu_cap->max_iova_width) - 1,
> +             .force_aperture = true,
> +     };
> +
> +     hv_iommu->first_domain = HV_DEVICE_DOMAIN_ID_DEFAULT + 1;
> +     hv_iommu->last_domain = HV_DEVICE_DOMAIN_ID_NULL - 1;
> +     hv_iommu->pgsize_bitmap = hv_iommu_cap->pgsize_bitmap;
> +     hv_iommu_device = hv_iommu;
> +}
> +
> +static int __init hv_iommu_init(void)
> +{
> +     int ret = 0;
> +     struct hv_iommu_dev *hv_iommu = NULL;
> +     struct hv_output_get_iommu_capabilities hv_iommu_cap = {0};
> +
> +     if (no_iommu || iommu_detected)
> +             return -ENODEV;
> +
> +     if (!hv_is_hyperv_initialized())
> +             return -ENODEV;
> +
> +     if (hv_iommu_detect(&hv_iommu_cap) ||
> +         !hv_iommu_present(hv_iommu_cap.iommu_cap) ||
> +         !hv_iommu_s1_domain_supported(hv_iommu_cap.iommu_cap))
> +             return -ENODEV;
> +
> +     iommu_detected = 1;
> +     pci_request_acs();
> +
> +     hv_iommu = kzalloc(sizeof(*hv_iommu), GFP_KERNEL);
> +     if (!hv_iommu)
> +             return -ENOMEM;
> +
> +     hv_init_iommu_device(hv_iommu, &hv_iommu_cap);
> +
> +     ret = hv_initialize_static_domains();
> +     if (ret) {
> +             pr_err("hv_initialize_static_domains failed: %d\n", ret);
> +             goto err_sysfs_remove;
> +     }
> +
> +     ret = iommu_device_sysfs_add(&hv_iommu->iommu, NULL, NULL, "%s", "hv-iommu");
> +     if (ret) {
> +             pr_err("iommu_device_sysfs_add failed: %d\n", ret);
> +             goto err_free;
> +     }
> +

Extra blank line.

> +
> +     ret = iommu_device_register(&hv_iommu->iommu, &hv_iommu_ops, NULL);
> +     if (ret) {
> +             pr_err("iommu_device_register failed: %d\n", ret);
> +             goto err_sysfs_remove;
> +     }
> +
> +     register_syscore_ops(&hv_iommu_syscore_ops);

Per above, not sure why this is needed.

> +
> +     pr_info("Microsoft Hypervisor IOMMU initialized\n");

Could this be changed to fit the "standardized" messages that are output
about Hyper-V specific code? They all start with "Hyper-V: ", such as these:

[    0.000000] Hyper-V: privilege flags low 0xae7f, high 0x3b8030, ext 0x62, hints 0xa0e24, misc 0xe0bed7b2
[    0.000000] Hyper-V: Nested features: 0x0
[    0.000000] Hyper-V: LAPIC Timer Frequency: 0xc3500
[    0.000000] Hyper-V: Using hypercall for remote TLB flush
[    0.019223] Hyper-V: PV spinlocks enabled
[    0.052575] Hyper-V: Hypervisor Build 10.0.26100.7462-7-0
[    0.052577] Hyper-V: enabling crash_kexec_post_notifiers
[    0.052633] Hyper-V: Using IPI hypercalls

Maybe "Hyper-V: PV IOMMU initialized"?

> +     return 0;
> +
> +err_sysfs_remove:
> +     iommu_device_sysfs_remove(&hv_iommu->iommu);
> +err_free:
> +     kfree(hv_iommu);
> +     return ret;
> +}
> +
> +device_initcall(hv_iommu_init);

I'm concerned about the timing of this initialization. VMBus is initialized with
subsys_initcall(), which is initcall level 4, while device_initcall() is
initcall level 6.
So VMBus initialization happens quite a bit earlier, and the hypervisor starts
offering devices to the guest, including PCI pass-thru devices, before the
IOMMU initialization starts. I cobbled together a way to make this IOMMU code
run in an Azure VM using the identity domain. The VM has an NVMe OS disk,
two NVMe data disks, and a MANA NIC. The NVMe devices were offered, and
completed hv_pci_probe() before this IOMMU initialization was started. When
IOMMU initialization did run, it went back and found the NVMe devices. But
I'm unsure if that's OK because my hacked-together environment obviously
couldn't do real IOMMU mapping. It appears that the NVMe device driver
didn't start its initialization until after the IOMMU driver was set up, which
would probably make everything OK. But that might be just timing luck, or
maybe there's something that affirmatively prevents the native PCI driver
(like NVMe) from getting started until after all the initcalls have finished.

I'm planning to look at this further to see if there's a way for a PCI driver
to try initializing a pass-thru device *before* this IOMMU driver has initialized.
If so, a different way to do the IOMMU initialization will be needed that is
linked to VMBus initialization so things can't happen out-of-order. Establishing
such a linkage is probably a good idea regardless.

FWIW, the Azure VM with the 3 NVMe devices and the MANA NIC, operating with
the identity IOMMU domain, seemed to work fine! Got 4 IOMMU groups,
and devices coming and going dynamically all worked correctly. When a device
was removed, it was moved to the blocking domain, and then flushed before
being finally removed. All good! I wish I had a way to test with an IOMMU
paging domain that was doing real translation.

> diff --git a/drivers/iommu/hyperv/iommu.h b/drivers/iommu/hyperv/iommu.h
> new file mode 100644
> index 000000000000..c8657e791a6e
> --- /dev/null
> +++ b/drivers/iommu/hyperv/iommu.h
> @@ -0,0 +1,53 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +/*
> + * Hyper-V IOMMU driver.
> + *
> + * Copyright (C) 2024-2025, Microsoft, Inc.
> + *
> + */
> +
> +#ifndef _HYPERV_IOMMU_H
> +#define _HYPERV_IOMMU_H
> +
> +struct hv_iommu_dev {
> +     struct iommu_device iommu;
> +     struct ida domain_ids;
> +
> +     /* Device configuration */
> +     u8  max_iova_width;
> +     u8  max_pasid_width;
> +     u64 cap;
> +     u64 pgsize_bitmap;
> +
> +     struct iommu_domain_geometry geometry;
> +     u64 first_domain;
> +     u64 last_domain;
> +};
> +
> +struct hv_iommu_domain {
> +     union {
> +             struct iommu_domain    domain;
> +             struct pt_iommu        pt_iommu;
> +             struct pt_iommu_x86_64 pt_iommu_x86_64;
> +     };
> +     struct hv_iommu_dev *hv_iommu;
> +     struct hv_input_device_domain device_domain;
> +     u64             pgsize_bitmap;
> +
> +     spinlock_t lock; /* protects dev_list and TLB flushes */
> +     /* List of devices in this DMA domain */

It appears that this list is really a list of endpoints (i.e., struct
hv_iommu_endpoint), not devices (which I read to be struct
hv_iommu_dev). 

But that said, what is the list used for?  I see code to add
endpoints to the list, and to remove then, but the list is never
walked by any code in this patch set. If there is an anticipated
future use, it would be better to add the list as part of the code
for that future use.

> +     struct list_head dev_list;
> +};
> +
> +struct hv_iommu_endpoint {
> +     struct device *dev;
> +     struct hv_iommu_dev *hv_iommu;
> +     struct hv_iommu_domain *hv_domain;
> +     struct list_head list; /* For domain->dev_list */
> +};
> +
> +#define to_hv_iommu_domain(d) \
> +     container_of(d, struct hv_iommu_domain, domain)
> +
> +#endif /* _HYPERV_IOMMU_H */
> --
> 2.49.0


