Script 'mail_helper' called by obssrc

Hello community,

here is the log from the commit of package xen for openSUSE:Factory
checked in at 2022-04-08 00:26:39
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Factory/xen (Old)
 and      /work/SRC/openSUSE:Factory/.xen.new.1900 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Package is "xen"

Fri Apr 8 00:26:39 2022 rev:315 rq:967124 version:4.16.0_08

Changes:
--------
--- /work/SRC/openSUSE:Factory/xen/xen.changes  2022-03-16 21:30:17.503390239 +0100
+++ /work/SRC/openSUSE:Factory/.xen.new.1900/xen.changes        2022-04-08 00:27:01.482789653 +0200
@@ -1,0 +2,30 @@
+Mon Apr 4 09:58:24 MDT 2022 - carn...@suse.com
+
+- bsc#1197423 - VUL-0: CVE-2022-26356: xen: Racy interactions
+  between dirty vram tracking and paging log dirty hypercalls
+  (XSA-397)
+  xsa397.patch
+- bsc#1197425 - VUL-0: CVE-2022-26357: xen: race in VT-d domain ID
+  cleanup (XSA-399)
+  xsa399.patch
+- bsc#1197426 - VUL-0: CVE-2022-26358,CVE-2022-26359,
+  CVE-2022-26360,CVE-2022-26361: xen: IOMMU: RMRR (VT-d) and unity
+  map (AMD-Vi) handling issues (XSA-400)
+  xsa400-01.patch
+  xsa400-02.patch
+  xsa400-03.patch
+  xsa400-04.patch
+  xsa400-05.patch
+  xsa400-06.patch
+  xsa400-07.patch
+  xsa400-08.patch
+  xsa400-09.patch
+  xsa400-10.patch
+  xsa400-11.patch
+  xsa400-12.patch
+- Additional upstream bug fixes for XSA-400 (bsc#1027519)
+  61d6ea2d-VT-d-split-domid-map-cleanup-check-into-a-function.patch
+  61d6ea7b-VT-d-dont-leak-domid-mapping-on-error-path.patch
+  6229ba46-VT-d-drop-undue-address-of-from-check_cleanup_domid_map.patch
+
+-------------------------------------------------------------------
@@ -78 +108 @@
-  list not giving any output
+  list not giving any output (see also bsc#1194267)

New:
----
  61d6ea2d-VT-d-split-domid-map-cleanup-check-into-a-function.patch
  61d6ea7b-VT-d-dont-leak-domid-mapping-on-error-path.patch
  6229ba46-VT-d-drop-undue-address-of-from-check_cleanup_domid_map.patch
  xsa397.patch
  xsa399.patch
  xsa400-01.patch
  xsa400-02.patch
  xsa400-03.patch
  xsa400-04.patch
  xsa400-05.patch
  xsa400-06.patch
  xsa400-07.patch
  xsa400-08.patch
  xsa400-09.patch
  xsa400-10.patch
  xsa400-11.patch
  xsa400-12.patch

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Other differences:
------------------
++++++ xen.spec ++++++
--- /var/tmp/diff_new_pack.0b60M3/_old  2022-04-08 00:27:04.294758093 +0200
+++ /var/tmp/diff_new_pack.0b60M3/_new  2022-04-08 00:27:04.294758093 +0200
@@ -119,7 +119,7 @@
 %endif
 Provides:       installhint(reboot-needed)
-Version:        4.16.0_06
+Version:        4.16.0_08
 Release:        0
 Summary:        Xen Virtualization: Hypervisor (aka VMM aka Microkernel)
 License:        GPL-2.0-only
@@ -195,7 +195,24 @@
 Patch38:        6227866a-Arm-Spectre-BHB-handling.patch
 Patch39:        6227866b-Arm-allow-SMCCC_ARCH_WORKAROUND_3-use.patch
 Patch40:        6227866c-x86-AMD-cease-using-thunk-lfence.patch
+Patch41:        61d6ea2d-VT-d-split-domid-map-cleanup-check-into-a-function.patch
+Patch42:        61d6ea7b-VT-d-dont-leak-domid-mapping-on-error-path.patch
+Patch43:        6229ba46-VT-d-drop-undue-address-of-from-check_cleanup_domid_map.patch
 # EMBARGOED security fixes
+Patch97:        xsa397.patch
+Patch99:        xsa399.patch
+Patch101:       xsa400-01.patch
+Patch102:       xsa400-02.patch
+Patch103:       xsa400-03.patch
+Patch104:       xsa400-04.patch
+Patch105:       xsa400-05.patch
+Patch106:       xsa400-06.patch
+Patch107:       xsa400-07.patch
+Patch108:       xsa400-08.patch
+Patch109:       xsa400-09.patch
+Patch110:       xsa400-10.patch
+Patch111:       xsa400-11.patch
+Patch112:       xsa400-12.patch
 # libxc
 Patch301:       libxc-bitmap-long.patch
 Patch302:       libxc-sr-xl-migration-debug.patch

++++++ 61d6ea2d-VT-d-split-domid-map-cleanup-check-into-a-function.patch ++++++
Subject: VT-d: split domid map cleanup check into a function
From: Jan Beulich jbeul...@suse.com Thu Jan 6 14:10:05 2022 +0100
Date: Thu Jan 6 14:10:05 2022 +0100:
Git: fa45f6b5560e738955993fe061a04d64c6f71c14

This logic will want invoking from elsewhere.

No functional change intended.

Signed-off-by: Jan Beulich <jbeul...@suse.com>
Reviewed-by: Roger Pau Monné <roger....@citrix.com>
Reviewed-by: Kevin Tian <kevin.t...@intel.com>
master commit: 9fdc10abe9457e4c9879a266f82372cb08e88ffb
master date: 2021-11-24 11:06:20 +0100

diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index f9ce402f22..de11c258ca 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -157,6 +157,51 @@ static void cleanup_domid_map(struct domain *domain, struct vtd_iommu *iommu)
     }
 }
 
+static bool any_pdev_behind_iommu(const struct domain *d,
+                                  const struct pci_dev *exclude,
+                                  const struct vtd_iommu *iommu)
+{
+    const struct pci_dev *pdev;
+
+    for_each_pdev ( d, pdev )
+    {
+        const struct acpi_drhd_unit *drhd;
+
+        if ( pdev == exclude )
+            continue;
+
+        drhd = acpi_find_matched_drhd_unit(pdev);
+        if ( drhd && drhd->iommu == iommu )
+            return true;
+    }
+
+    return false;
+}
+
+/*
+ * If no other devices under the same iommu owned by this domain,
+ * clear iommu in iommu_bitmap and clear domain_id in domid_bitmap.
+ */
+static void check_cleanup_domid_map(struct domain *d,
+                                    const struct pci_dev *exclude,
+                                    struct vtd_iommu *iommu)
+{
+    bool found = any_pdev_behind_iommu(d, exclude, iommu);
+
+    /*
+     * Hidden devices are associated with DomXEN but usable by the hardware
+     * domain. Hence they need considering here as well.
+     */
+    if ( !found && is_hardware_domain(d) )
+        found = any_pdev_behind_iommu(dom_xen, exclude, iommu);
+
+    if ( !found )
+    {
+        clear_bit(iommu->index, &dom_iommu(d)->arch.vtd.iommu_bitmap);
+        cleanup_domid_map(d, iommu);
+    }
+}
+
 static void sync_cache(const void *addr, unsigned int size)
 {
     static unsigned long clflush_size = 0;
@@ -1674,27 +1719,6 @@ int domain_context_unmap_one(
     return rc;
 }
 
-static bool any_pdev_behind_iommu(const struct domain *d,
-                                  const struct pci_dev *exclude,
-                                  const struct vtd_iommu *iommu)
-{
-    const struct pci_dev *pdev;
-
-    for_each_pdev ( d, pdev )
-    {
-        const struct acpi_drhd_unit *drhd;
-
-        if ( pdev == exclude )
-            continue;
-
-        drhd = acpi_find_matched_drhd_unit(pdev);
-        if ( drhd && drhd->iommu == iommu )
-            return true;
-    }
-
-    return false;
-}
-
 static int domain_context_unmap(struct domain *domain, u8 devfn,
                                 struct pci_dev *pdev)
 {
@@ -1703,7 +1727,6 @@ static int domain_context_unmap(struct domain *domain, u8 devfn,
     int ret;
     uint16_t seg = pdev->seg;
     uint8_t bus = pdev->bus, tmp_bus, tmp_devfn, secbus;
-    bool found;
 
     switch ( pdev->type )
     {
@@ -1779,28 +1802,10 @@ static int domain_context_unmap(struct domain *domain, u8 devfn,
         return -EINVAL;
     }
 
-    if ( ret || QUARANTINE_SKIP(domain) || pdev->devfn != devfn )
-        return ret;
+    if ( !ret && !QUARANTINE_SKIP(domain) && pdev->devfn == devfn )
+        check_cleanup_domid_map(domain, pdev, iommu);
 
-    /*
-     * If no other devices under the same iommu owned by this domain,
-     * clear iommu in iommu_bitmap and clear domain_id in domid_bitmap.
-     */
-    found = any_pdev_behind_iommu(domain, pdev, iommu);
-    /*
-     * Hidden devices are associated with DomXEN but usable by the hardware
-     * domain. Hence they need considering here as well.
-     */
-    if ( !found && is_hardware_domain(domain) )
-        found = any_pdev_behind_iommu(dom_xen, pdev, iommu);
-
-    if ( !found )
-    {
-        clear_bit(iommu->index, dom_iommu(domain)->arch.vtd.iommu_bitmap);
-        cleanup_domid_map(domain, iommu);
-    }
-
-    return 0;
+    return ret;
 }
 
 static void iommu_clear_root_pgtable(struct domain *d)

++++++ 61d6ea7b-VT-d-dont-leak-domid-mapping-on-error-path.patch ++++++
Subject: VT-d: don't leak domid mapping on error path
From: Jan Beulich jbeul...@suse.com Thu Jan 6 14:11:23 2022 +0100
Date: Thu Jan 6 14:11:23 2022 +0100:
Git: 84977e8b53935de9a1123f677213f1b146843a0e

While domain_context_mapping() invokes domain_context_unmap() in a sub-
case of handling DEV_TYPE_PCI when encountering an error, thus avoiding
a leak, individual calls to domain_context_mapping_one() aren't
similarly covered. Such a leak might persist until domain destruction.

Leverage that these cases can be recognized by pdev being non-NULL.

Fixes: dec403cc668f ("VT-d: fix iommu_domid for PCI/PCIx devices assignment")
Signed-off-by: Jan Beulich <jbeul...@suse.com>
Reviewed-by: Kevin Tian <kevin.t...@intel.com>
master commit: e6252a51faf42c892eb5fc71f8a2617580832196
master date: 2021-11-24 11:07:11 +0100

diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index de11c258ca..3b37bad25e 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1517,7 +1517,12 @@ int domain_context_mapping_one(
         rc = me_wifi_quirk(domain, bus, devfn, MAP_ME_PHANTOM_FUNC);
 
     if ( rc )
-        domain_context_unmap_one(domain, iommu, bus, devfn);
+    {
+        ret = domain_context_unmap_one(domain, iommu, bus, devfn);
+
+        if ( !ret && pdev && pdev->devfn == devfn )
+            check_cleanup_domid_map(domain, pdev, iommu);
+    }
 
     return rc;
 }

++++++ 6229ba46-VT-d-drop-undue-address-of-from-check_cleanup_domid_map.patch ++++++
Subject: VT-d: drop undue address-of from check_cleanup_domid_map()
From: Jan Beulich jbeul...@suse.com Thu Mar 10 09:43:50 2022 +0100
Date: Thu Mar 10 09:43:50 2022 +0100:
Git: b2db518e952c3a8fe5b9ec6a2d007cda73fd05a4

For an unknown reason I added back the operator while backporting,
despite 4.16 having c06e3d810314 ("VT-d: per-domain IOMMU bitmap needs
to have dynamic size"). I can only assume that I mistakenly took the
4.15 backport as basis and/or reference.

Fixes: fa45f6b5560e ("VT-d: split domid map cleanup check into a function")
Signed-off-by: Jan Beulich <jbeul...@suse.com>
Reviewed-by: Roger Pau Monné <roger....@citrix.com>

diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 3b37bad25e..ead12db6a4 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -197,7 +197,7 @@ static void check_cleanup_domid_map(struct domain *d,
 
     if ( !found )
     {
-        clear_bit(iommu->index, &dom_iommu(d)->arch.vtd.iommu_bitmap);
+        clear_bit(iommu->index, dom_iommu(d)->arch.vtd.iommu_bitmap);
         cleanup_domid_map(d, iommu);
     }
 }

++++++ xsa397.patch ++++++
From: Roger Pau Monne <roger....@citrix.com>
Subject: x86/hap: do not switch on log dirty for VRAM tracking

XEN_DMOP_track_dirty_vram possibly calls into paging_log_dirty_enable
when using HAP mode, and it can interact badly with other ongoing
paging domctls, as XEN_DMOP_track_dirty_vram is not holding the domctl
lock.
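In pattern form the race looks like this (a stand-alone C sketch with
illustrative names, not Xen's actual symbols; it only models the
locking asymmetry between the two hypercall paths):

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdlib.h>

    static pthread_mutex_t domctl_lock = PTHREAD_MUTEX_INITIALIZER;
    static bool log_dirty_active;
    static void *log_dirty_bitmap;

    /* Dom0's XEN_DOMCTL_SHADOW_OP_OFF path: teardown under the lock. */
    static void domctl_shadow_op_off(void)
    {
        pthread_mutex_lock(&domctl_lock);
        log_dirty_active = false;
        free(log_dirty_bitmap);        /* retire the old structures */
        log_dirty_bitmap = NULL;
        pthread_mutex_unlock(&domctl_lock);
    }

    /* The stubdom's XEN_DMOP_track_dirty_vram path takes no domctl
     * lock, so this enable can interleave with the teardown above and
     * repopulate slots that are in the middle of being freed. */
    static void dmop_track_dirty_vram(void)
    {
        if ( !log_dirty_active )
        {
            log_dirty_bitmap = calloc(1, 4096);
            log_dirty_active = true;
        }
    }

With that asymmetry, enabling log dirty from the unlocked path can run
concurrently with the locked teardown, which is what the assert below
caught.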
This was detected as a result of the following assert triggering when
doing repeated migrations of a HAP HVM domain with a stubdom:

Assertion 'd->arch.paging.log_dirty.allocs == 0' failed at paging.c:198
----[ Xen-4.17-unstable x86_64 debug=y Not tainted ]----
CPU:    34
RIP:    e008:[<ffff82d040314b3b>] arch/x86/mm/paging.c#paging_free_log_dirty_bitmap+0x606/0x6
RFLAGS: 0000000000010206 CONTEXT: hypervisor (d0v23)
[...]
Xen call trace:
   [<ffff82d040314b3b>] R arch/x86/mm/paging.c#paging_free_log_dirty_bitmap+0x606/0x63a
   [<ffff82d040279f96>] S xsm/flask/hooks.c#domain_has_perm+0x5a/0x67
   [<ffff82d04031577f>] F paging_domctl+0x251/0xd41
   [<ffff82d04031640c>] F paging_domctl_continuation+0x19d/0x202
   [<ffff82d0403202fa>] F pv_hypercall+0x150/0x2a7
   [<ffff82d0403a729d>] F lstar_enter+0x12d/0x140

Such assert triggered because the stubdom used XEN_DMOP_track_dirty_vram
while dom0 was in the middle of executing XEN_DOMCTL_SHADOW_OP_OFF, and
so log dirty became enabled while retiring the old structures, thus
leading to new entries being populated in already clear slots.

Fix this by not enabling log dirty for VRAM tracking, similar to what
is done when using shadow instead of HAP. Call
p2m_enable_hardware_log_dirty when enabling VRAM tracking in order to
get some hardware assistance if available. As a side effect the memory
pressure on the p2m pool should go down if only VRAM tracking is
enabled, as the dirty bitmap is no longer allocated.

Note that paging_log_dirty_range (used to get the dirty bitmap for VRAM
tracking) doesn't use the log dirty bitmap, and instead relies on
checking whether each gfn on the range has been switched from
p2m_ram_logdirty to p2m_ram_rw in order to account for dirty pages.

This is CVE-2022-26356 / XSA-397.

Signed-off-by: Roger Pau Monné <roger....@citrix.com>
Reviewed-by: Jan Beulich <jbeul...@suse.com>

--- a/xen/include/asm-x86/paging.h
+++ b/xen/include/asm-x86/paging.h
@@ -162,9 +162,6 @@ void paging_log_dirty_range(struct domai
                             unsigned long nr,
                             uint8_t *dirty_bitmap);
 
-/* enable log dirty */
-int paging_log_dirty_enable(struct domain *d, bool log_global);
-
 /* log dirty initialization */
 void paging_log_dirty_init(struct domain *d, const struct log_dirty_ops *ops);

--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -69,13 +69,6 @@ int hap_track_dirty_vram(struct domain *
     {
         unsigned int size = DIV_ROUND_UP(nr_frames, BITS_PER_BYTE);
 
-        if ( !paging_mode_log_dirty(d) )
-        {
-            rc = paging_log_dirty_enable(d, false);
-            if ( rc )
-                goto out;
-        }
-
         rc = -ENOMEM;
         dirty_bitmap = vzalloc(size);
         if ( !dirty_bitmap )
@@ -107,6 +100,10 @@ int hap_track_dirty_vram(struct domain *
 
         paging_unlock(d);
 
+        domain_pause(d);
+        p2m_enable_hardware_log_dirty(d);
+        domain_unpause(d);
+
         if ( oend > ostart )
             p2m_change_type_range(d, ostart, oend,
                                   p2m_ram_logdirty, p2m_ram_rw);

--- a/xen/arch/x86/mm/paging.c
+++ b/xen/arch/x86/mm/paging.c
@@ -211,7 +211,7 @@ static int paging_free_log_dirty_bitmap(
     return rc;
 }
 
-int paging_log_dirty_enable(struct domain *d, bool log_global)
+static int paging_log_dirty_enable(struct domain *d, bool log_global)
 {
     int ret;

++++++ xsa399.patch ++++++
From: Jan Beulich <jbeul...@suse.com>
Subject: VT-d: correct ordering of operations in cleanup_domid_map()

The function may be called without any locks held (leaving aside the
domctl one, which we surely don't want to depend on here), so needs to
play safe wrt other accesses to domid_map[] and domid_bitmap[].
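The safe publication order can be modelled stand-alone with C11 atomics
(slot_map, slot_bitmap and both helpers are illustrative stand-ins for
the domid_map[]/domid_bitmap[] accesses, not Xen's code; only the store
ordering matters here):

    #include <stdatomic.h>
    #include <stdint.h>

    #define DOMID_INVALID 0x7ff4U  /* sentinel, as in Xen's public ABI */

    static _Atomic uint16_t slot_map[64];
    static _Atomic uint64_t slot_bitmap;

    /* Retire a DID: invalidate the map entry *before* clearing the
     * allocation bit, so a racing reader that still sees the bit set
     * reads DOMID_INVALID rather than a value about to be recycled. */
    static void retire_did(unsigned int did)
    {
        atomic_store(&slot_map[did], DOMID_INVALID);
        atomic_fetch_and(&slot_bitmap, ~(UINT64_C(1) << did));
    }

    /* Look up a DID, trusting slot_map[] only while the bit is set. */
    static uint16_t lookup_did(unsigned int did)
    {
        if ( !(atomic_load(&slot_bitmap) & (UINT64_C(1) << did)) )
            return DOMID_INVALID;
        return atomic_load(&slot_map[did]);
    }

Retiring writes the sentinel first and frees the slot second; with the
old (reversed) order a concurrent allocator's write to the map could be
clobbered by the retiring side's late store.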
This is to avoid context_set_domain_id()'s writing of domid_map[] to
be reset to zero right away in the case of it racing the freeing of a
DID.

For the interaction with context_set_domain_id() and ->domid_map[]
reads see the code comment.

{check_,}cleanup_domid_map() are called with pcidevs_lock held or
during domain cleanup only (and pcidevs_lock is also held around
context_set_domain_id()), i.e. racing calls with the same (dom, iommu)
tuple cannot occur.

domain_iommu_domid(), besides its use by cleanup_domid_map(), has its
result used only to control flushing, and hence a stale result would
only lead to a stray extra flush.

This is CVE-2022-26357 / XSA-399.

Fixes: b9c20c78789f ("VT-d: per-iommu domain-id")
Signed-off-by: Jan Beulich <jbeul...@suse.com>
Reviewed-by: Roger Pau Monné <roger....@citrix.com>

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -152,8 +152,14 @@ static void cleanup_domid_map(struct dom
 
     if ( iommu_domid >= 0 )
     {
+        /*
+         * Update domid_map[] /before/ domid_bitmap[] to avoid a race with
+         * context_set_domain_id(), setting the slot to DOMID_INVALID for
+         * ->domid_map[] reads to produce a suitable value while the bit is
+         * still set.
+         */
+        iommu->domid_map[iommu_domid] = DOMID_INVALID;
         clear_bit(iommu_domid, iommu->domid_bitmap);
-        iommu->domid_map[iommu_domid] = 0;
     }
 }

++++++ xsa400-01.patch ++++++
From: Jan Beulich <jbeul...@suse.com>
Subject: VT-d: fix (de)assign ordering when RMRRs are in use

In the event that the RMRR mappings are essential for device operation,
they should be established before updating the device's context entry,
while they should be torn down only after the device's context entry
was successfully updated.

Also adjust a related log message.

This is CVE-2022-26358 / part of XSA-400.

Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling")
Signed-off-by: Jan Beulich <jbeul...@suse.com>
Reviewed-by: Roger Pau Monné <roger....@citrix.com>
Reviewed-by: Paul Durrant <p...@xen.org>
Reviewed-by: Kevin Tian <kevin.t...@intel.com>

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2419,6 +2419,10 @@ static int reassign_device_ownership(
 {
     int ret;
 
+    ret = domain_context_unmap(source, devfn, pdev);
+    if ( ret )
+        return ret;
+
     /*
      * Devices assigned to untrusted domains (here assumed to be any domU)
      * can attempt to send arbitrary LAPIC/MSI messages.
We are unprotected @@ -2455,10 +2459,6 @@ static int reassign_device_ownership( } } - ret = domain_context_unmap(source, devfn, pdev); - if ( ret ) - return ret; - if ( devfn == pdev->devfn && pdev->domain != dom_io ) { list_move(&pdev->domain_list, &dom_io->pdev_list); @@ -2534,9 +2534,8 @@ static int intel_iommu_assign_device( } } - ret = reassign_device_ownership(s, d, devfn, pdev); - if ( ret || d == dom_io ) - return ret; + if ( d == dom_io ) + return reassign_device_ownership(s, d, devfn, pdev); /* Setup rmrr identity mapping */ for_each_rmrr_device( rmrr, bdf, i ) @@ -2549,20 +2548,37 @@ static int intel_iommu_assign_device( rmrr->end_address, flag); if ( ret ) { - int rc; - - rc = reassign_device_ownership(d, s, devfn, pdev); printk(XENLOG_G_ERR VTDPREFIX - " cannot map reserved region (%"PRIx64",%"PRIx64"] for Dom%d (%d)\n", - rmrr->base_address, rmrr->end_address, - d->domain_id, ret); - if ( rc ) - { - printk(XENLOG_ERR VTDPREFIX - " failed to reclaim %pp from %pd (%d)\n", - &PCI_SBDF3(seg, bus, devfn), d, rc); - domain_crash(d); - } + "%pd: cannot map reserved region [%"PRIx64",%"PRIx64"]: %d\n", + d, rmrr->base_address, rmrr->end_address, ret); + break; + } + } + } + + if ( !ret ) + ret = reassign_device_ownership(s, d, devfn, pdev); + + /* See reassign_device_ownership() for the hwdom aspect. */ + if ( !ret || is_hardware_domain(d) ) + return ret; + + for_each_rmrr_device( rmrr, bdf, i ) + { + if ( rmrr->segment == seg && + PCI_BUS(bdf) == bus && + PCI_DEVFN2(bdf) == devfn ) + { + int rc = iommu_identity_mapping(d, p2m_access_x, + rmrr->base_address, + rmrr->end_address, 0); + + if ( rc && rc != -ENOENT ) + { + printk(XENLOG_ERR VTDPREFIX + "%pd: cannot unmap reserved region [%"PRIx64",%"PRIx64"]: %d\n", + d, rmrr->base_address, rmrr->end_address, rc); + domain_crash(d); break; } } ++++++ xsa400-02.patch ++++++ From: Jan Beulich <jbeul...@suse.com> Subject: VT-d: fix add/remove ordering when RMRRs are in use In the event that the RMRR mappings are essential for device operation, they should be established before updating the device's context entry, while they should be torn down only after the device's context entry was successfully cleared. Also switch to %pd in related log messages. Fixes: fa88cfadf918 ("vt-d: Map RMRR in intel_iommu_add_device() if the device has RMRR") Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling") Signed-off-by: Jan Beulich <jbeul...@suse.com> Reviewed-by: Roger Pau Monn?? 
<roger....@citrix.com> Reviewed-by: Kevin Tian <kevin.t...@intel.com> --- a/xen/drivers/passthrough/vtd/iommu.c +++ b/xen/drivers/passthrough/vtd/iommu.c @@ -1997,14 +1997,6 @@ static int intel_iommu_add_device(u8 dev if ( !pdev->domain ) return -EINVAL; - ret = domain_context_mapping(pdev->domain, devfn, pdev); - if ( ret ) - { - dprintk(XENLOG_ERR VTDPREFIX, "d%d: context mapping failed\n", - pdev->domain->domain_id); - return ret; - } - for_each_rmrr_device ( rmrr, bdf, i ) { if ( rmrr->segment == pdev->seg && @@ -2021,12 +2013,17 @@ static int intel_iommu_add_device(u8 dev rmrr->base_address, rmrr->end_address, 0); if ( ret ) - dprintk(XENLOG_ERR VTDPREFIX, "d%d: RMRR mapping failed\n", - pdev->domain->domain_id); + dprintk(XENLOG_ERR VTDPREFIX, "%pd: RMRR mapping failed\n", + pdev->domain); } } - return 0; + ret = domain_context_mapping(pdev->domain, devfn, pdev); + if ( ret ) + dprintk(XENLOG_ERR VTDPREFIX, "%pd: context mapping failed\n", + pdev->domain); + + return ret; } static int intel_iommu_enable_device(struct pci_dev *pdev) @@ -2048,11 +2045,15 @@ static int intel_iommu_remove_device(u8 { struct acpi_rmrr_unit *rmrr; u16 bdf; - int i; + int ret, i; if ( !pdev->domain ) return -EINVAL; + ret = domain_context_unmap(pdev->domain, devfn, pdev); + if ( ret ) + return ret; + for_each_rmrr_device ( rmrr, bdf, i ) { if ( rmrr->segment != pdev->seg || @@ -2068,7 +2069,7 @@ static int intel_iommu_remove_device(u8 rmrr->end_address, 0); } - return domain_context_unmap(pdev->domain, devfn, pdev); + return 0; } static int __hwdom_init setup_hwdom_device(u8 devfn, struct pci_dev *pdev) ++++++ xsa400-03.patch ++++++ From: Jan Beulich <jbeul...@suse.com> Subject: IOMMU/x86: tighten iommu_alloc_pgtable()'s parameter This is to make more obvious that nothing outside of domain_iommu(d) actually changes or is otherwise needed by the function. No functional change intended. Signed-off-by: Jan Beulich <jbeul...@suse.com> Reviewed-by: Roger Pau Monn?? 
<roger....@citrix.com> Reviewed-by: Paul Durrant <p...@xen.org> Reviewed-by: Kevin Tian <kevin.t...@intel.com> --- a/xen/include/asm-x86/iommu.h +++ b/xen/include/asm-x86/iommu.h @@ -142,7 +142,8 @@ int pi_update_irte(const struct pi_desc }) int __must_check iommu_free_pgtables(struct domain *d); -struct page_info *__must_check iommu_alloc_pgtable(struct domain *d); +struct domain_iommu; +struct page_info *__must_check iommu_alloc_pgtable(struct domain_iommu *hd); #endif /* !__ARCH_X86_IOMMU_H__ */ /* --- a/xen/drivers/passthrough/amd/iommu_map.c +++ b/xen/drivers/passthrough/amd/iommu_map.c @@ -184,7 +184,7 @@ static int iommu_pde_from_dfn(struct dom unsigned long next_table_mfn; unsigned int level; struct page_info *table; - const struct domain_iommu *hd = dom_iommu(d); + struct domain_iommu *hd = dom_iommu(d); table = hd->arch.amd.root_table; level = hd->arch.amd.paging_mode; @@ -219,7 +219,7 @@ static int iommu_pde_from_dfn(struct dom mfn = next_table_mfn; /* allocate lower level page table */ - table = iommu_alloc_pgtable(d); + table = iommu_alloc_pgtable(hd); if ( table == NULL ) { AMD_IOMMU_ERROR("cannot allocate I/O page table\n"); @@ -249,7 +249,7 @@ static int iommu_pde_from_dfn(struct dom if ( next_table_mfn == 0 ) { - table = iommu_alloc_pgtable(d); + table = iommu_alloc_pgtable(hd); if ( table == NULL ) { AMD_IOMMU_ERROR("cannot allocate I/O page table\n"); @@ -553,7 +553,7 @@ int __init amd_iommu_quarantine_init(str spin_lock(&hd->arch.mapping_lock); - hd->arch.amd.root_table = iommu_alloc_pgtable(d); + hd->arch.amd.root_table = iommu_alloc_pgtable(hd); if ( !hd->arch.amd.root_table ) goto out; @@ -568,7 +568,7 @@ int __init amd_iommu_quarantine_init(str * page table pages, and the resulting allocations are always * zeroed. */ - pg = iommu_alloc_pgtable(d); + pg = iommu_alloc_pgtable(hd); if ( !pg ) break; --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c @@ -242,7 +242,7 @@ int amd_iommu_alloc_root(struct domain * if ( unlikely(!hd->arch.amd.root_table) ) { - hd->arch.amd.root_table = iommu_alloc_pgtable(d); + hd->arch.amd.root_table = iommu_alloc_pgtable(hd); if ( !hd->arch.amd.root_table ) return -ENOMEM; } --- a/xen/drivers/passthrough/vtd/iommu.c +++ b/xen/drivers/passthrough/vtd/iommu.c @@ -330,7 +330,7 @@ static u64 addr_to_dma_page_maddr(struct { struct page_info *pg; - if ( !alloc || !(pg = iommu_alloc_pgtable(domain)) ) + if ( !alloc || !(pg = iommu_alloc_pgtable(hd)) ) goto out; hd->arch.vtd.pgd_maddr = page_to_maddr(pg); @@ -350,7 +350,7 @@ static u64 addr_to_dma_page_maddr(struct if ( !alloc ) break; - pg = iommu_alloc_pgtable(domain); + pg = iommu_alloc_pgtable(hd); if ( !pg ) break; @@ -2766,7 +2766,7 @@ static int __init intel_iommu_quarantine goto out; } - pg = iommu_alloc_pgtable(d); + pg = iommu_alloc_pgtable(hd); rc = -ENOMEM; if ( !pg ) @@ -2785,7 +2785,7 @@ static int __init intel_iommu_quarantine * page table pages, and the resulting allocations are always * zeroed. 
*/ - pg = iommu_alloc_pgtable(d); + pg = iommu_alloc_pgtable(hd); if ( !pg ) goto out; --- a/xen/drivers/passthrough/x86/iommu.c +++ b/xen/drivers/passthrough/x86/iommu.c @@ -416,9 +416,8 @@ int iommu_free_pgtables(struct domain *d return 0; } -struct page_info *iommu_alloc_pgtable(struct domain *d) +struct page_info *iommu_alloc_pgtable(struct domain_iommu *hd) { - struct domain_iommu *hd = dom_iommu(d); unsigned int memflags = 0; struct page_info *pg; void *p; ++++++ xsa400-04.patch ++++++ From: Jan Beulich <jbeul...@suse.com> Subject: VT-d: drop ownership checking from domain_context_mapping_one() Despite putting in quite a bit of effort it was not possible to establish why exactly this code exists (beyond possibly sanity checking). Instead of a subsequent change further complicating this logic, simply get rid of it. Take the opportunity and move the respective unmap_vtd_domain_page() out of the locked region. Signed-off-by: Jan Beulich <jbeul...@suse.com> Reviewed-by: Roger Pau Monn?? <roger....@citrix.com> Reviewed-by: Paul Durrant <p...@xen.org> Reviewed-by: Kevin Tian <kevin.t...@intel.com> --- a/xen/drivers/passthrough/vtd/iommu.c +++ b/xen/drivers/passthrough/vtd/iommu.c @@ -124,28 +124,6 @@ static int context_set_domain_id(struct return 0; } -static int context_get_domain_id(struct context_entry *context, - struct vtd_iommu *iommu) -{ - unsigned long dom_index, nr_dom; - int domid = -1; - - if (iommu && context) - { - nr_dom = cap_ndoms(iommu->cap); - - dom_index = context_domain_id(*context); - - if ( dom_index < nr_dom && iommu->domid_map ) - domid = iommu->domid_map[dom_index]; - else - dprintk(XENLOG_DEBUG VTDPREFIX, - "dom_index %lu exceeds nr_dom %lu or iommu has no domid_map\n", - dom_index, nr_dom); - } - return domid; -} - static void cleanup_domid_map(struct domain *domain, struct vtd_iommu *iommu) { int iommu_domid = domain_iommu_domid(domain, iommu); @@ -1416,44 +1394,9 @@ int domain_context_mapping_one( if ( context_present(*context) ) { - int res = 0; - - /* Try to get domain ownership from device structure. If that's - * not available, try to read it from the context itself. */ - if ( pdev ) - { - if ( pdev->domain != domain ) - { - printk(XENLOG_G_INFO VTDPREFIX "%pd: %pp owned by %pd", - domain, &PCI_SBDF3(seg, bus, devfn), - pdev->domain); - res = -EINVAL; - } - } - else - { - int cdomain; - cdomain = context_get_domain_id(context, iommu); - - if ( cdomain < 0 ) - { - printk(XENLOG_G_WARNING VTDPREFIX - "%pd: %pp mapped, but can't find owner\n", - domain, &PCI_SBDF3(seg, bus, devfn)); - res = -EINVAL; - } - else if ( cdomain != domain->domain_id ) - { - printk(XENLOG_G_INFO VTDPREFIX - "%pd: %pp already mapped to d%d", - domain, &PCI_SBDF3(seg, bus, devfn), cdomain); - res = -EINVAL; - } - } - - unmap_vtd_domain_page(context_entries); spin_unlock(&iommu->lock); - return res; + unmap_vtd_domain_page(context_entries); + return 0; } if ( iommu_hwdom_passthrough && is_hardware_domain(domain) ) ++++++ xsa400-05.patch ++++++ From: Jan Beulich <jbeul...@suse.com> Subject: VT-d: re-assign devices directly Devices with RMRRs, due to it being unspecified how/when the specified memory regions may get accessed, may not be left disconnected from their respective mappings (as long as it's not certain that the device has been fully quiesced). Hence rather than unmapping the old context and then mapping the new one, re-assignment needs to be done in a single step. This is CVE-2022-26359 / part of XSA-400. 
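Doing the update "in a single step" ultimately means rewriting both
64-bit halves of the context entry as one 128-bit quantity. A
stand-alone sketch of that shape (GCC/Clang builtins standing in for
Xen's cmpxchg16b() helper; as the text below notes, this is only
possible where CMPXCHG16B is available, so build with -mcx16 on
x86-64):

    #include <stdbool.h>
    #include <stdint.h>

    /* Mirrors the union this patch adds to struct context_entry. */
    struct context_entry {
        union {
            struct { uint64_t lo, hi; };
            unsigned __int128 full;
        };
    };

    /* Install a fully-formed new entry only if the live entry still
     * matches the value the new one was derived from; returns false
     * if something else modified the entry in the meantime. */
    static bool ctx_update_atomic(struct context_entry *ctx,
                                  unsigned __int128 old,
                                  unsigned __int128 new)
    {
        return __atomic_compare_exchange_n(&ctx->full, &old, new, false,
                                           __ATOMIC_SEQ_CST,
                                           __ATOMIC_SEQ_CST);
    }

Where the compare-exchange is unavailable, the patch falls back to two
ordered 64-bit stores, which is why it has to reason separately about
which half (and which field) to write last.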
Similarly quarantining scratch-page mode relies on page tables to be continuously wired up. To avoid complicating things more than necessary, treat all devices mostly equally, i.e. regardless of their association with any RMRRs. The main difference is when it comes to updating context entries, which need to be atomic when there are RMRRs. Yet atomicity can only be achieved with CMPXCHG16B, availability of which we can't take for given. The seemingly complicated choice of non-negative return values for domain_context_mapping_one() is to limit code churn: This way callers passing NULL for pdev don't need fiddling with. Signed-off-by: Jan Beulich <jbeul...@suse.com> Reviewed-by: Kevin Tian <kevin.t...@intel.com> Reviewed-by: Roger Pau Monn?? <roger....@citrix.com> --- a/xen/drivers/passthrough/vtd/extern.h +++ b/xen/drivers/passthrough/vtd/extern.h @@ -84,7 +84,8 @@ void free_pgtable_maddr(u64 maddr); void *map_vtd_domain_page(u64 maddr); void unmap_vtd_domain_page(const void *va); int domain_context_mapping_one(struct domain *domain, struct vtd_iommu *iommu, - u8 bus, u8 devfn, const struct pci_dev *); + uint8_t bus, uint8_t devfn, + const struct pci_dev *pdev, unsigned int mode); int domain_context_unmap_one(struct domain *domain, struct vtd_iommu *iommu, u8 bus, u8 devfn); int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt); @@ -104,8 +105,8 @@ bool is_azalia_tlb_enabled(const struct void platform_quirks_init(void); void vtd_ops_preamble_quirk(struct vtd_iommu *iommu); void vtd_ops_postamble_quirk(struct vtd_iommu *iommu); -int __must_check me_wifi_quirk(struct domain *domain, - u8 bus, u8 devfn, int map); +int __must_check me_wifi_quirk(struct domain *domain, uint8_t bus, + uint8_t devfn, unsigned int mode); void pci_vtd_quirk(const struct pci_dev *); void quirk_iommu_caps(struct vtd_iommu *iommu); --- a/xen/drivers/passthrough/vtd/iommu.c +++ b/xen/drivers/passthrough/vtd/iommu.c @@ -120,6 +120,7 @@ static int context_set_domain_id(struct } set_bit(i, iommu->domid_bitmap); + context->hi &= ~(((1 << DID_FIELD_WIDTH) - 1) << DID_HIGH_OFFSET); context->hi |= (i & ((1 << DID_FIELD_WIDTH) - 1)) << DID_HIGH_OFFSET; return 0; } @@ -1371,15 +1372,27 @@ static void __hwdom_init intel_iommu_hwd } } +/* + * This function returns + * - a negative errno value upon error, + * - zero upon success when previously the entry was non-present, or this isn't + * the "main" request for a device (pdev == NULL), or for no-op quarantining + * assignments, + * - positive (one) upon success when previously the entry was present and this + * is the "main" request for a device (pdev != NULL). 
+ */ int domain_context_mapping_one( struct domain *domain, struct vtd_iommu *iommu, - u8 bus, u8 devfn, const struct pci_dev *pdev) + uint8_t bus, uint8_t devfn, const struct pci_dev *pdev, + unsigned int mode) { struct domain_iommu *hd = dom_iommu(domain); - struct context_entry *context, *context_entries; + struct context_entry *context, *context_entries, lctxt; + __uint128_t old; u64 maddr, pgd_maddr; - u16 seg = iommu->drhd->segment; + uint16_t seg = iommu->drhd->segment, prev_did = 0; + struct domain *prev_dom = NULL; int rc, ret; bool_t flush_dev_iotlb; @@ -1391,17 +1404,32 @@ int domain_context_mapping_one( maddr = bus_to_context_maddr(iommu, bus); context_entries = (struct context_entry *)map_vtd_domain_page(maddr); context = &context_entries[devfn]; + old = (lctxt = *context).full; - if ( context_present(*context) ) + if ( context_present(lctxt) ) { - spin_unlock(&iommu->lock); - unmap_vtd_domain_page(context_entries); - return 0; + domid_t domid; + + prev_did = context_domain_id(lctxt); + domid = iommu->domid_map[prev_did]; + if ( domid < DOMID_FIRST_RESERVED ) + prev_dom = rcu_lock_domain_by_id(domid); + else if ( domid == DOMID_IO ) + prev_dom = rcu_lock_domain(dom_io); + if ( !prev_dom ) + { + spin_unlock(&iommu->lock); + unmap_vtd_domain_page(context_entries); + dprintk(XENLOG_DEBUG VTDPREFIX, + "no domain for did %u (nr_dom %u)\n", + prev_did, cap_ndoms(iommu->cap)); + return -ESRCH; + } } if ( iommu_hwdom_passthrough && is_hardware_domain(domain) ) { - context_set_translation_type(*context, CONTEXT_TT_PASS_THRU); + context_set_translation_type(lctxt, CONTEXT_TT_PASS_THRU); } else { @@ -1413,36 +1441,107 @@ int domain_context_mapping_one( spin_unlock(&hd->arch.mapping_lock); spin_unlock(&iommu->lock); unmap_vtd_domain_page(context_entries); + if ( prev_dom ) + rcu_unlock_domain(prev_dom); return -ENOMEM; } - context_set_address_root(*context, pgd_maddr); + context_set_address_root(lctxt, pgd_maddr); if ( ats_enabled && ecap_dev_iotlb(iommu->ecap) ) - context_set_translation_type(*context, CONTEXT_TT_DEV_IOTLB); + context_set_translation_type(lctxt, CONTEXT_TT_DEV_IOTLB); else - context_set_translation_type(*context, CONTEXT_TT_MULTI_LEVEL); + context_set_translation_type(lctxt, CONTEXT_TT_MULTI_LEVEL); spin_unlock(&hd->arch.mapping_lock); } - if ( context_set_domain_id(context, domain, iommu) ) + if ( context_set_domain_id(&lctxt, domain, iommu) ) { + unlock: spin_unlock(&iommu->lock); unmap_vtd_domain_page(context_entries); + if ( prev_dom ) + rcu_unlock_domain(prev_dom); return -EFAULT; } - context_set_address_width(*context, level_to_agaw(iommu->nr_pt_levels)); - context_set_fault_enable(*context); - context_set_present(*context); + if ( !prev_dom ) + { + context_set_address_width(lctxt, level_to_agaw(iommu->nr_pt_levels)); + context_set_fault_enable(lctxt); + context_set_present(lctxt); + } + else if ( prev_dom == domain ) + { + ASSERT(lctxt.full == context->full); + rc = !!pdev; + goto unlock; + } + else + { + ASSERT(context_address_width(lctxt) == + level_to_agaw(iommu->nr_pt_levels)); + ASSERT(!context_fault_disable(lctxt)); + } + + if ( cpu_has_cx16 ) + { + __uint128_t res = cmpxchg16b(context, &old, &lctxt.full); + + /* + * Hardware does not update the context entry behind our backs, + * so the return value should match "old". 
+ */ + if ( res != old ) + { + if ( pdev ) + check_cleanup_domid_map(domain, pdev, iommu); + printk(XENLOG_ERR + "%pp: unexpected context entry %016lx_%016lx (expected %016lx_%016lx)\n", + &PCI_SBDF3(pdev->seg, pdev->bus, devfn), + (uint64_t)(res >> 64), (uint64_t)res, + (uint64_t)(old >> 64), (uint64_t)old); + rc = -EILSEQ; + goto unlock; + } + } + else if ( !prev_dom || !(mode & MAP_WITH_RMRR) ) + { + context_clear_present(*context); + iommu_sync_cache(context, sizeof(*context)); + + write_atomic(&context->hi, lctxt.hi); + /* No barrier should be needed between these two. */ + write_atomic(&context->lo, lctxt.lo); + } + else /* Best effort, updating DID last. */ + { + /* + * By non-atomically updating the context entry's DID field last, + * during a short window in time TLB entries with the old domain ID + * but the new page tables may be inserted. This could affect I/O + * of other devices using this same (old) domain ID. Such updating + * therefore is not a problem if this was the only device associated + * with the old domain ID. Diverting I/O of any of a dying domain's + * devices to the quarantine page tables is intended anyway. + */ + if ( !(mode & (MAP_OWNER_DYING | MAP_SINGLE_DEVICE)) ) + printk(XENLOG_WARNING VTDPREFIX + " %pp: reassignment may cause %pd data corruption\n", + &PCI_SBDF3(seg, bus, devfn), prev_dom); + + write_atomic(&context->lo, lctxt.lo); + /* No barrier should be needed between these two. */ + write_atomic(&context->hi, lctxt.hi); + } + iommu_sync_cache(context, sizeof(struct context_entry)); spin_unlock(&iommu->lock); - /* Context entry was previously non-present (with domid 0). */ - rc = iommu_flush_context_device(iommu, 0, PCI_BDF2(bus, devfn), - DMA_CCMD_MASK_NOBIT, 1); + rc = iommu_flush_context_device(iommu, prev_did, PCI_BDF2(bus, devfn), + DMA_CCMD_MASK_NOBIT, !prev_dom); flush_dev_iotlb = !!find_ats_dev_drhd(iommu); - ret = iommu_flush_iotlb_dsi(iommu, 0, 1, flush_dev_iotlb); + ret = iommu_flush_iotlb_dsi(iommu, prev_did, !prev_dom, flush_dev_iotlb); /* * The current logic for returns: @@ -1463,17 +1562,26 @@ int domain_context_mapping_one( unmap_vtd_domain_page(context_entries); if ( !seg && !rc ) - rc = me_wifi_quirk(domain, bus, devfn, MAP_ME_PHANTOM_FUNC); + rc = me_wifi_quirk(domain, bus, devfn, mode); if ( rc ) { - ret = domain_context_unmap_one(domain, iommu, bus, devfn); + if ( !prev_dom ) + ret = domain_context_unmap_one(domain, iommu, bus, devfn); + else if ( prev_dom != domain ) /* Avoid infinite recursion. 
*/ + ret = domain_context_mapping_one(prev_dom, iommu, bus, devfn, pdev, + mode & MAP_WITH_RMRR) < 0; + else + ret = 1; if ( !ret && pdev && pdev->devfn == devfn ) check_cleanup_domid_map(domain, pdev, iommu); } - return rc; + if ( prev_dom ) + rcu_unlock_domain(prev_dom); + + return rc ?: pdev && prev_dom; } static int domain_context_unmap(struct domain *d, uint8_t devfn, @@ -1483,8 +1591,10 @@ static int domain_context_mapping(struct struct pci_dev *pdev) { const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev); + const struct acpi_rmrr_unit *rmrr; int ret = 0; - uint16_t seg = pdev->seg; + unsigned int i, mode = 0; + uint16_t seg = pdev->seg, bdf; uint8_t bus = pdev->bus, secbus; /* @@ -1500,8 +1610,29 @@ static int domain_context_mapping(struct ASSERT(pcidevs_locked()); + for_each_rmrr_device( rmrr, bdf, i ) + { + if ( rmrr->segment != pdev->seg || bdf != pdev->sbdf.bdf ) + continue; + + mode |= MAP_WITH_RMRR; + break; + } + + if ( domain != pdev->domain ) + { + if ( pdev->domain->is_dying ) + mode |= MAP_OWNER_DYING; + else if ( drhd && + !any_pdev_behind_iommu(pdev->domain, pdev, drhd->iommu) && + !pdev->phantom_stride ) + mode |= MAP_SINGLE_DEVICE; + } + switch ( pdev->type ) { + bool prev_present; + case DEV_TYPE_PCI_HOST_BRIDGE: if ( iommu_debug ) printk(VTDPREFIX "%pd:Hostbridge: skip %pp map\n", @@ -1523,7 +1654,9 @@ static int domain_context_mapping(struct printk(VTDPREFIX "%pd:PCIe: map %pp\n", domain, &PCI_SBDF3(seg, bus, devfn)); ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn, - pdev); + pdev, mode); + if ( ret > 0 ) + ret = 0; if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 ) enable_ats_device(pdev, &drhd->iommu->ats_devices); @@ -1538,9 +1671,10 @@ static int domain_context_mapping(struct domain, &PCI_SBDF3(seg, bus, devfn)); ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn, - pdev); - if ( ret ) + pdev, mode); + if ( ret < 0 ) break; + prev_present = ret; if ( (ret = find_upstream_bridge(seg, &bus, &devfn, &secbus)) < 1 ) { @@ -1548,6 +1682,15 @@ static int domain_context_mapping(struct break; ret = -ENXIO; } + /* + * Strictly speaking if the device is the only one behind this bridge + * and the only one with this (secbus,0,0) tuple, it could be allowed + * to be re-assigned regardless of RMRR presence. But let's deal with + * that case only if it is actually found in the wild. + */ + else if ( prev_present && (mode & MAP_WITH_RMRR) && + domain != pdev->domain ) + ret = -EOPNOTSUPP; /* * Mapping a bridge should, if anything, pass the struct pci_dev of @@ -1556,7 +1699,7 @@ static int domain_context_mapping(struct */ if ( ret >= 0 ) ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn, - NULL); + NULL, mode); /* * Devices behind PCIe-to-PCI/PCIx bridge may generate different @@ -1571,10 +1714,15 @@ static int domain_context_mapping(struct if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE && (secbus != pdev->bus || pdev->devfn != 0) ) ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0, - NULL); + NULL, mode); if ( ret ) - domain_context_unmap(domain, devfn, pdev); + { + if ( !prev_present ) + domain_context_unmap(domain, devfn, pdev); + else if ( pdev->domain != domain ) /* Avoid infinite recursion. 
*/ + domain_context_mapping(pdev->domain, devfn, pdev); + } break; @@ -2363,17 +2511,46 @@ static int reassign_device_ownership( { int ret; - ret = domain_context_unmap(source, devfn, pdev); + if ( !QUARANTINE_SKIP(target) ) + { + if ( !has_arch_pdevs(target) ) + vmx_pi_hooks_assign(target); + + /* + * Devices assigned to untrusted domains (here assumed to be any domU) + * can attempt to send arbitrary LAPIC/MSI messages. We are unprotected + * by the root complex unless interrupt remapping is enabled. + */ + if ( (target != hardware_domain) && !iommu_intremap ) + untrusted_msi = true; + + ret = domain_context_mapping(target, devfn, pdev); + + if ( !ret && !QUARANTINE_SKIP(source) && pdev->devfn == devfn ) + { + const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev); + + if ( drhd ) + check_cleanup_domid_map(source, pdev, drhd->iommu); + } + } + else + ret = domain_context_unmap(source, devfn, pdev); if ( ret ) + { + if ( !has_arch_pdevs(target) ) + vmx_pi_hooks_deassign(target); return ret; + } - /* - * Devices assigned to untrusted domains (here assumed to be any domU) - * can attempt to send arbitrary LAPIC/MSI messages. We are unprotected - * by the root complex unless interrupt remapping is enabled. - */ - if ( (target != hardware_domain) && !iommu_intremap ) - untrusted_msi = true; + if ( devfn == pdev->devfn && pdev->domain != target ) + { + list_move(&pdev->domain_list, &target->pdev_list); + pdev->domain = target; + } + + if ( !has_arch_pdevs(source) ) + vmx_pi_hooks_deassign(source); /* * If the device belongs to the hardware domain, and it has RMRR, don't @@ -2403,34 +2580,7 @@ static int reassign_device_ownership( } } - if ( devfn == pdev->devfn && pdev->domain != dom_io ) - { - list_move(&pdev->domain_list, &dom_io->pdev_list); - pdev->domain = dom_io; - } - - if ( !has_arch_pdevs(source) ) - vmx_pi_hooks_deassign(source); - - if ( !has_arch_pdevs(target) ) - vmx_pi_hooks_assign(target); - - ret = domain_context_mapping(target, devfn, pdev); - if ( ret ) - { - if ( !has_arch_pdevs(target) ) - vmx_pi_hooks_deassign(target); - - return ret; - } - - if ( devfn == pdev->devfn && pdev->domain != target ) - { - list_move(&pdev->domain_list, &target->pdev_list); - pdev->domain = target; - } - - return ret; + return 0; } static int intel_iommu_assign_device( --- a/xen/drivers/passthrough/vtd/iommu.h +++ b/xen/drivers/passthrough/vtd/iommu.h @@ -202,8 +202,12 @@ struct root_entry { do {(root).val |= ((value) & PAGE_MASK_4K);} while(0) struct context_entry { - u64 lo; - u64 hi; + union { + struct { + uint64_t lo, hi; + }; + __uint128_t full; + }; }; #define ROOT_ENTRY_NR (PAGE_SIZE_4K/sizeof(struct root_entry)) #define context_present(c) ((c).lo & 1) --- a/xen/drivers/passthrough/vtd/quirks.c +++ b/xen/drivers/passthrough/vtd/quirks.c @@ -407,7 +407,8 @@ void __init platform_quirks_init(void) */ static int __must_check map_me_phantom_function(struct domain *domain, - u32 dev, int map) + unsigned int dev, + unsigned int mode) { struct acpi_drhd_unit *drhd; struct pci_dev *pdev; @@ -418,9 +419,9 @@ static int __must_check map_me_phantom_f drhd = acpi_find_matched_drhd_unit(pdev); /* map or unmap ME phantom function */ - if ( map ) + if ( !(mode & UNMAP_ME_PHANTOM_FUNC) ) rc = domain_context_mapping_one(domain, drhd->iommu, 0, - PCI_DEVFN(dev, 7), NULL); + PCI_DEVFN(dev, 7), NULL, mode); else rc = domain_context_unmap_one(domain, drhd->iommu, 0, PCI_DEVFN(dev, 7)); @@ -428,7 +429,8 @@ static int __must_check map_me_phantom_f return rc; } -int me_wifi_quirk(struct domain 
*domain, u8 bus, u8 devfn, int map) +int me_wifi_quirk(struct domain *domain, uint8_t bus, uint8_t devfn, + unsigned int mode) { u32 id; int rc = 0; @@ -452,7 +454,7 @@ int me_wifi_quirk(struct domain *domain, case 0x423b8086: case 0x423c8086: case 0x423d8086: - rc = map_me_phantom_function(domain, 3, map); + rc = map_me_phantom_function(domain, 3, mode); break; default: break; @@ -478,7 +480,7 @@ int me_wifi_quirk(struct domain *domain, case 0x42388086: /* Puma Peak */ case 0x422b8086: case 0x422c8086: - rc = map_me_phantom_function(domain, 22, map); + rc = map_me_phantom_function(domain, 22, mode); break; default: break; --- a/xen/drivers/passthrough/vtd/vtd.h +++ b/xen/drivers/passthrough/vtd/vtd.h @@ -22,8 +22,14 @@ #include <xen/iommu.h> -#define MAP_ME_PHANTOM_FUNC 1 -#define UNMAP_ME_PHANTOM_FUNC 0 +/* + * Values for domain_context_mapping_one()'s and me_wifi_quirk()'s "mode" + * parameters. + */ +#define MAP_WITH_RMRR (1u << 0) +#define MAP_OWNER_DYING (1u << 1) +#define MAP_SINGLE_DEVICE (1u << 2) +#define UNMAP_ME_PHANTOM_FUNC (1u << 3) /* Allow for both IOAPIC and IOSAPIC. */ #define IO_xAPIC_route_entry IO_APIC_route_entry ++++++ xsa400-06.patch ++++++ From: Jan Beulich <jbeul...@suse.com> Subject: AMD/IOMMU: re-assign devices directly Devices with unity map ranges, due to it being unspecified how/when these memory ranges may get accessed, may not be left disconnected from their unity mappings (as long as it's not certain that the device has been fully quiesced). Hence rather than tearing down the old root page table pointer and then establishing the new one, re-assignment needs to be done in a single step. This is CVE-2022-26360 / part of XSA-400. Similarly quarantining scratch-page mode relies on page tables to be continuously wired up. To avoid complicating things more than necessary, treat all devices mostly equally, i.e. regardless of their association with any unity map ranges. The main difference is when it comes to updating DTEs, which need to be atomic when there are unity mappings. Yet atomicity can only be achieved with CMPXCHG16B, availability of which we can't take for given. Signed-off-by: Jan Beulich <jbeul...@suse.com> Reviewed-by: Paul Durrant <p...@xen.org> Reviewed-by: Roger Pau Monn?? <roger....@citrix.com> --- a/xen/drivers/passthrough/amd/iommu.h +++ b/xen/drivers/passthrough/amd/iommu.h @@ -262,9 +262,13 @@ void amd_iommu_set_intremap_table(struct const void *ptr, const struct amd_iommu *iommu, bool valid); -void amd_iommu_set_root_page_table(struct amd_iommu_dte *dte, - uint64_t root_ptr, uint16_t domain_id, - uint8_t paging_mode, bool valid); +#define SET_ROOT_VALID (1u << 0) +#define SET_ROOT_WITH_UNITY_MAP (1u << 1) +int __must_check amd_iommu_set_root_page_table(struct amd_iommu_dte *dte, + uint64_t root_ptr, + uint16_t domain_id, + uint8_t paging_mode, + unsigned int flags); void iommu_dte_add_device_entry(struct amd_iommu_dte *dte, const struct ivrs_mappings *ivrs_dev); --- a/xen/drivers/passthrough/amd/iommu_map.c +++ b/xen/drivers/passthrough/amd/iommu_map.c @@ -114,10 +114,69 @@ static unsigned int set_iommu_ptes_prese return flush_flags; } -void amd_iommu_set_root_page_table(struct amd_iommu_dte *dte, - uint64_t root_ptr, uint16_t domain_id, - uint8_t paging_mode, bool valid) +/* + * This function returns + * - -errno for errors, + * - 0 for a successful update, atomic when necessary + * - 1 for a successful but non-atomic update, which may need to be warned + * about by the caller. 
+ */ +int amd_iommu_set_root_page_table(struct amd_iommu_dte *dte, + uint64_t root_ptr, uint16_t domain_id, + uint8_t paging_mode, unsigned int flags) { + bool valid = flags & SET_ROOT_VALID; + + if ( dte->v && dte->tv && + (cpu_has_cx16 || (flags & SET_ROOT_WITH_UNITY_MAP)) ) + { + union { + struct amd_iommu_dte dte; + uint64_t raw64[4]; + __uint128_t raw128[2]; + } ldte = { .dte = *dte }; + __uint128_t old = ldte.raw128[0]; + int ret = 0; + + ldte.dte.domain_id = domain_id; + ldte.dte.pt_root = paddr_to_pfn(root_ptr); + ldte.dte.iw = true; + ldte.dte.ir = true; + ldte.dte.paging_mode = paging_mode; + ldte.dte.v = valid; + + if ( cpu_has_cx16 ) + { + __uint128_t res = cmpxchg16b(dte, &old, &ldte.raw128[0]); + + /* + * Hardware does not update the DTE behind our backs, so the + * return value should match "old". + */ + if ( res != old ) + { + printk(XENLOG_ERR + "Dom%d: unexpected DTE %016lx_%016lx (expected %016lx_%016lx)\n", + domain_id, + (uint64_t)(res >> 64), (uint64_t)res, + (uint64_t)(old >> 64), (uint64_t)old); + ret = -EILSEQ; + } + } + else /* Best effort, updating domain_id last. */ + { + uint64_t *ptr = (void *)dte; + + write_atomic(ptr + 0, ldte.raw64[0]); + /* No barrier should be needed between these two. */ + write_atomic(ptr + 1, ldte.raw64[1]); + + ret = 1; + } + + return ret; + } + if ( valid || dte->v ) { dte->tv = false; @@ -132,6 +191,8 @@ void amd_iommu_set_root_page_table(struc smp_wmb(); dte->tv = true; dte->v = valid; + + return 0; } void amd_iommu_set_intremap_table( --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c @@ -96,13 +96,32 @@ static int __must_check allocate_domain_ return rc; } +static bool any_pdev_behind_iommu(const struct domain *d, + const struct pci_dev *exclude, + const struct amd_iommu *iommu) +{ + const struct pci_dev *pdev; + + for_each_pdev ( d, pdev ) + { + if ( pdev == exclude ) + continue; + + if ( find_iommu_for_device(pdev->seg, pdev->sbdf.bdf) == iommu ) + return true; + } + + return false; +} + static int __must_check amd_iommu_setup_domain_device( struct domain *domain, struct amd_iommu *iommu, uint8_t devfn, struct pci_dev *pdev) { struct amd_iommu_dte *table, *dte; unsigned long flags; - int req_id, valid = 1, rc; + unsigned int req_id, sr_flags; + int rc; u8 bus = pdev->bus; struct domain_iommu *hd = dom_iommu(domain); const struct ivrs_mappings *ivrs_dev; @@ -116,8 +135,11 @@ static int __must_check amd_iommu_setup_ if ( rc ) return rc; - if ( iommu_hwdom_passthrough && is_hardware_domain(domain) ) - valid = 0; + req_id = get_dma_requestor_id(iommu->seg, pdev->sbdf.bdf); + ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id]; + sr_flags = (iommu_hwdom_passthrough && is_hardware_domain(domain) + ? 0 : SET_ROOT_VALID) + | (ivrs_dev->unity_map ? SET_ROOT_WITH_UNITY_MAP : 0); /* get device-table entry */ req_id = get_dma_requestor_id(iommu->seg, PCI_BDF2(bus, devfn)); @@ -130,9 +152,15 @@ static int __must_check amd_iommu_setup_ if ( !dte->v || !dte->tv ) { /* bind DTE to domain page-tables */ - amd_iommu_set_root_page_table( - dte, page_to_maddr(hd->arch.amd.root_table), - domain->domain_id, hd->arch.amd.paging_mode, valid); + rc = amd_iommu_set_root_page_table( + dte, page_to_maddr(hd->arch.amd.root_table), + domain->domain_id, hd->arch.amd.paging_mode, sr_flags); + if ( rc ) + { + ASSERT(rc < 0); + spin_unlock_irqrestore(&iommu->lock, flags); + return rc; + } /* Undo what amd_iommu_disable_domain_device() may have done. 
*/ if ( dte->it_root ) @@ -152,17 +180,76 @@ static int __must_check amd_iommu_setup_ spin_unlock_irqrestore(&iommu->lock, flags); amd_iommu_flush_device(iommu, req_id); + } + else if ( dte->pt_root != mfn_x(page_to_mfn(hd->arch.amd.root_table)) ) + { + /* + * Strictly speaking if the device is the only one with this requestor + * ID, it could be allowed to be re-assigned regardless of unity map + * presence. But let's deal with that case only if it is actually + * found in the wild. + */ + if ( req_id != PCI_BDF2(bus, devfn) && + (sr_flags & SET_ROOT_WITH_UNITY_MAP) ) + rc = -EOPNOTSUPP; + else + rc = amd_iommu_set_root_page_table( + dte, page_to_maddr(hd->arch.amd.root_table), + domain->domain_id, hd->arch.amd.paging_mode, sr_flags); + if ( rc < 0 ) + { + spin_unlock_irqrestore(&iommu->lock, flags); + return rc; + } + if ( rc && + domain != pdev->domain && + /* + * By non-atomically updating the DTE's domain ID field last, + * during a short window in time TLB entries with the old domain + * ID but the new page tables may have been inserted. This could + * affect I/O of other devices using this same (old) domain ID. + * Such updating therefore is not a problem if this was the only + * device associated with the old domain ID. Diverting I/O of any + * of a dying domain's devices to the quarantine page tables is + * intended anyway. + */ + !pdev->domain->is_dying && + (any_pdev_behind_iommu(pdev->domain, pdev, iommu) || + pdev->phantom_stride) ) + AMD_IOMMU_WARN(" %pp: reassignment may cause %pd data corruption\n", + &PCI_SBDF3(pdev->seg, bus, devfn), pdev->domain); + + /* + * Check remaining settings are still in place from an earlier call + * here. They're all independent of the domain, so should not have + * changed. + */ + if ( dte->it_root ) + ASSERT(dte->int_ctl == IOMMU_DEV_TABLE_INT_CONTROL_TRANSLATED); + ASSERT(dte->iv == iommu_intremap); + ASSERT(dte->ex == ivrs_dev->dte_allow_exclusion); + ASSERT(dte->sys_mgt == MASK_EXTR(ivrs_dev->device_flags, + ACPI_IVHD_SYSTEM_MGMT)); - AMD_IOMMU_DEBUG("Setup I/O page table: device id = %#x, type = %#x, " - "root table = %#"PRIx64", " - "domain = %d, paging mode = %d\n", - req_id, pdev->type, - page_to_maddr(hd->arch.amd.root_table), - domain->domain_id, hd->arch.amd.paging_mode); + if ( pci_ats_device(iommu->seg, bus, pdev->devfn) && + !ivrs_dev->block_ats && + iommu_has_cap(iommu, PCI_CAP_IOTLB_SHIFT) ) + ASSERT(dte->i == ats_enabled); + + spin_unlock_irqrestore(&iommu->lock, flags); + + amd_iommu_flush_device(iommu, req_id); } else spin_unlock_irqrestore(&iommu->lock, flags); + AMD_IOMMU_DEBUG("Setup I/O page table: device id = %#x, type = %#x, " + "root table = %#"PRIx64", " + "domain = %d, paging mode = %d\n", + req_id, pdev->type, + page_to_maddr(hd->arch.amd.root_table), + domain->domain_id, hd->arch.amd.paging_mode); + ASSERT(pcidevs_locked()); if ( pci_ats_device(iommu->seg, bus, pdev->devfn) && @@ -366,7 +453,20 @@ static int reassign_device(struct domain return -ENODEV; } - amd_iommu_disable_domain_device(source, iommu, devfn, pdev); + if ( !QUARANTINE_SKIP(target) ) + { + rc = amd_iommu_setup_domain_device(target, iommu, devfn, pdev); + if ( rc ) + return rc; + } + else + amd_iommu_disable_domain_device(source, iommu, devfn, pdev); + + if ( devfn == pdev->devfn && pdev->domain != target ) + { + list_move(&pdev->domain_list, &target->pdev_list); + pdev->domain = target; + } /* * If the device belongs to the hardware domain, and it has a unity mapping, @@ -382,25 +482,9 @@ static int reassign_device(struct domain return rc; } - if 
( devfn == pdev->devfn && pdev->domain != dom_io ) - { - list_move(&pdev->domain_list, &dom_io->pdev_list); - pdev->domain = dom_io; - } - - rc = amd_iommu_setup_domain_device(target, iommu, devfn, pdev); - if ( rc ) - return rc; - AMD_IOMMU_DEBUG("Re-assign %pp from dom%d to dom%d\n", &pdev->sbdf, source->domain_id, target->domain_id); - if ( devfn == pdev->devfn && pdev->domain != target ) - { - list_move(&pdev->domain_list, &target->pdev_list); - pdev->domain = target; - } - return 0; } ++++++ xsa400-07.patch ++++++ From: Jan Beulich <jbeul...@suse.com> Subject: VT-d: prepare for per-device quarantine page tables (part I) Arrange for domain ID and page table root to be passed around, the latter in particular to domain_pgd_maddr() such that taking it from the per-domain fields can be overridden. No functional change intended. Signed-off-by: Jan Beulich <jbeul...@suse.com> Reviewed-by: Paul Durrant <p...@xen.org> Reviewed-by: Roger Pau Monn?? <roger....@citrix.com> Reviewed-by: Kevin Tian <kevin.t...@intel.com> --- a/xen/drivers/passthrough/vtd/extern.h +++ b/xen/drivers/passthrough/vtd/extern.h @@ -85,9 +85,10 @@ void *map_vtd_domain_page(u64 maddr); void unmap_vtd_domain_page(const void *va); int domain_context_mapping_one(struct domain *domain, struct vtd_iommu *iommu, uint8_t bus, uint8_t devfn, - const struct pci_dev *pdev, unsigned int mode); + const struct pci_dev *pdev, domid_t domid, + paddr_t pgd_maddr, unsigned int mode); int domain_context_unmap_one(struct domain *domain, struct vtd_iommu *iommu, - u8 bus, u8 devfn); + uint8_t bus, uint8_t devfn, domid_t domid); int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt); unsigned int io_apic_read_remap_rte(unsigned int apic, unsigned int reg); @@ -106,7 +107,8 @@ void platform_quirks_init(void); void vtd_ops_preamble_quirk(struct vtd_iommu *iommu); void vtd_ops_postamble_quirk(struct vtd_iommu *iommu); int __must_check me_wifi_quirk(struct domain *domain, uint8_t bus, - uint8_t devfn, unsigned int mode); + uint8_t devfn, domid_t domid, paddr_t pgd_maddr, + unsigned int mode); void pci_vtd_quirk(const struct pci_dev *); void quirk_iommu_caps(struct vtd_iommu *iommu); --- a/xen/drivers/passthrough/vtd/iommu.c +++ b/xen/drivers/passthrough/vtd/iommu.c @@ -43,7 +43,7 @@ #include "../ats.h" /* dom_io is used as a sentinel for quarantined devices */ -#define QUARANTINE_SKIP(d) ((d) == dom_io && !dom_iommu(d)->arch.vtd.pgd_maddr) +#define QUARANTINE_SKIP(d, pgd_maddr) ((d) == dom_io && !(pgd_maddr)) /* Possible unfiltered LAPIC/MSI messages from untrusted sources? 
*/ bool __read_mostly untrusted_msi; @@ -358,15 +358,17 @@ static u64 addr_to_dma_page_maddr(struct return pte_maddr; } -static uint64_t domain_pgd_maddr(struct domain *d, unsigned int nr_pt_levels) +static paddr_t domain_pgd_maddr(struct domain *d, paddr_t pgd_maddr, + unsigned int nr_pt_levels) { struct domain_iommu *hd = dom_iommu(d); - uint64_t pgd_maddr; unsigned int agaw; ASSERT(spin_is_locked(&hd->arch.mapping_lock)); - if ( iommu_use_hap_pt(d) ) + if ( pgd_maddr ) + /* nothing */; + else if ( iommu_use_hap_pt(d) ) { pagetable_t pgt = p2m_get_pagetable(p2m_get_hostp2m(d)); @@ -1385,18 +1387,18 @@ int domain_context_mapping_one( struct domain *domain, struct vtd_iommu *iommu, uint8_t bus, uint8_t devfn, const struct pci_dev *pdev, - unsigned int mode) + domid_t domid, paddr_t pgd_maddr, unsigned int mode) { struct domain_iommu *hd = dom_iommu(domain); struct context_entry *context, *context_entries, lctxt; __uint128_t old; - u64 maddr, pgd_maddr; + uint64_t maddr; uint16_t seg = iommu->drhd->segment, prev_did = 0; struct domain *prev_dom = NULL; int rc, ret; bool_t flush_dev_iotlb; - if ( QUARANTINE_SKIP(domain) ) + if ( QUARANTINE_SKIP(domain, pgd_maddr) ) return 0; ASSERT(pcidevs_locked()); @@ -1433,10 +1435,12 @@ int domain_context_mapping_one( } else { + paddr_t root; + spin_lock(&hd->arch.mapping_lock); - pgd_maddr = domain_pgd_maddr(domain, iommu->nr_pt_levels); - if ( !pgd_maddr ) + root = domain_pgd_maddr(domain, pgd_maddr, iommu->nr_pt_levels); + if ( !root ) { spin_unlock(&hd->arch.mapping_lock); spin_unlock(&iommu->lock); @@ -1446,7 +1450,7 @@ int domain_context_mapping_one( return -ENOMEM; } - context_set_address_root(lctxt, pgd_maddr); + context_set_address_root(lctxt, root); if ( ats_enabled && ecap_dev_iotlb(iommu->ecap) ) context_set_translation_type(lctxt, CONTEXT_TT_DEV_IOTLB); else @@ -1562,15 +1566,21 @@ int domain_context_mapping_one( unmap_vtd_domain_page(context_entries); if ( !seg && !rc ) - rc = me_wifi_quirk(domain, bus, devfn, mode); + rc = me_wifi_quirk(domain, bus, devfn, domid, pgd_maddr, mode); if ( rc ) { if ( !prev_dom ) - ret = domain_context_unmap_one(domain, iommu, bus, devfn); + ret = domain_context_unmap_one(domain, iommu, bus, devfn, + domain->domain_id); else if ( prev_dom != domain ) /* Avoid infinite recursion. 
*/ + { + hd = dom_iommu(prev_dom); ret = domain_context_mapping_one(prev_dom, iommu, bus, devfn, pdev, + domain->domain_id, + hd->arch.vtd.pgd_maddr, mode & MAP_WITH_RMRR) < 0; + } else ret = 1; @@ -1592,6 +1602,7 @@ static int domain_context_mapping(struct { const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev); const struct acpi_rmrr_unit *rmrr; + paddr_t pgd_maddr = dom_iommu(domain)->arch.vtd.pgd_maddr; int ret = 0; unsigned int i, mode = 0; uint16_t seg = pdev->seg, bdf; @@ -1654,7 +1665,8 @@ static int domain_context_mapping(struct printk(VTDPREFIX "%pd:PCIe: map %pp\n", domain, &PCI_SBDF3(seg, bus, devfn)); ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn, - pdev, mode); + pdev, domain->domain_id, pgd_maddr, + mode); if ( ret > 0 ) ret = 0; if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 ) @@ -1671,7 +1683,8 @@ static int domain_context_mapping(struct domain, &PCI_SBDF3(seg, bus, devfn)); ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn, - pdev, mode); + pdev, domain->domain_id, pgd_maddr, + mode); if ( ret < 0 ) break; prev_present = ret; @@ -1699,7 +1712,8 @@ static int domain_context_mapping(struct */ if ( ret >= 0 ) ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn, - NULL, mode); + NULL, domain->domain_id, pgd_maddr, + mode); /* * Devices behind PCIe-to-PCI/PCIx bridge may generate different @@ -1714,7 +1728,8 @@ static int domain_context_mapping(struct if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE && (secbus != pdev->bus || pdev->devfn != 0) ) ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0, - NULL, mode); + NULL, domain->domain_id, pgd_maddr, + mode); if ( ret ) { @@ -1742,14 +1757,14 @@ static int domain_context_mapping(struct int domain_context_unmap_one( struct domain *domain, struct vtd_iommu *iommu, - u8 bus, u8 devfn) + uint8_t bus, uint8_t devfn, domid_t domid) { struct context_entry *context, *context_entries; u64 maddr; int iommu_domid, rc, ret; bool_t flush_dev_iotlb; - if ( QUARANTINE_SKIP(domain) ) + if ( QUARANTINE_SKIP(domain, dom_iommu(domain)->arch.vtd.pgd_maddr) ) return 0; ASSERT(pcidevs_locked()); @@ -1803,7 +1818,7 @@ int domain_context_unmap_one( unmap_vtd_domain_page(context_entries); if ( !iommu->drhd->segment && !rc ) - rc = me_wifi_quirk(domain, bus, devfn, UNMAP_ME_PHANTOM_FUNC); + rc = me_wifi_quirk(domain, bus, devfn, domid, 0, UNMAP_ME_PHANTOM_FUNC); if ( rc && !is_hardware_domain(domain) && domain != dom_io ) { @@ -1850,7 +1865,8 @@ static int domain_context_unmap(struct d if ( iommu_debug ) printk(VTDPREFIX "%pd:PCIe: unmap %pp\n", domain, &PCI_SBDF3(seg, bus, devfn)); - ret = domain_context_unmap_one(domain, iommu, bus, devfn); + ret = domain_context_unmap_one(domain, iommu, bus, devfn, + domain->domain_id); if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 ) disable_ats_device(pdev); @@ -1863,7 +1879,8 @@ static int domain_context_unmap(struct d if ( iommu_debug ) printk(VTDPREFIX "%pd:PCI: unmap %pp\n", domain, &PCI_SBDF3(seg, bus, devfn)); - ret = domain_context_unmap_one(domain, iommu, bus, devfn); + ret = domain_context_unmap_one(domain, iommu, bus, devfn, + domain->domain_id); if ( ret ) break; @@ -1889,12 +1906,15 @@ static int domain_context_unmap(struct d /* PCIe to PCI/PCIx bridge */ if ( pdev_type(seg, tmp_bus, tmp_devfn) == DEV_TYPE_PCIe2PCI_BRIDGE ) { - ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn); + ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn, + 
domain->domain_id); if ( !ret ) - ret = domain_context_unmap_one(domain, iommu, secbus, 0); + ret = domain_context_unmap_one(domain, iommu, secbus, 0, + domain->domain_id); } else /* Legacy PCI bridge */ - ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn); + ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn, + domain->domain_id); break; @@ -1904,7 +1924,8 @@ static int domain_context_unmap(struct d return -EINVAL; } - if ( !ret && !QUARANTINE_SKIP(domain) && pdev->devfn == devfn ) + if ( !ret && pdev->devfn == devfn && + !QUARANTINE_SKIP(domain, dom_iommu(domain)->arch.vtd.pgd_maddr) ) check_cleanup_domid_map(domain, pdev, iommu); return ret; @@ -2511,7 +2532,7 @@ static int reassign_device_ownership( { int ret; - if ( !QUARANTINE_SKIP(target) ) + if ( !QUARANTINE_SKIP(target, dom_iommu(target)->arch.vtd.pgd_maddr) ) { if ( !has_arch_pdevs(target) ) vmx_pi_hooks_assign(target); @@ -2526,7 +2547,8 @@ static int reassign_device_ownership( ret = domain_context_mapping(target, devfn, pdev); - if ( !ret && !QUARANTINE_SKIP(source) && pdev->devfn == devfn ) + if ( !ret && pdev->devfn == devfn && + !QUARANTINE_SKIP(source, dom_iommu(source)->arch.vtd.pgd_maddr) ) { const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev); --- a/xen/drivers/passthrough/vtd/quirks.c +++ b/xen/drivers/passthrough/vtd/quirks.c @@ -408,6 +408,8 @@ void __init platform_quirks_init(void) static int __must_check map_me_phantom_function(struct domain *domain, unsigned int dev, + domid_t domid, + paddr_t pgd_maddr, unsigned int mode) { struct acpi_drhd_unit *drhd; @@ -421,16 +423,17 @@ static int __must_check map_me_phantom_f /* map or unmap ME phantom function */ if ( !(mode & UNMAP_ME_PHANTOM_FUNC) ) rc = domain_context_mapping_one(domain, drhd->iommu, 0, - PCI_DEVFN(dev, 7), NULL, mode); + PCI_DEVFN(dev, 7), NULL, + domid, pgd_maddr, mode); else rc = domain_context_unmap_one(domain, drhd->iommu, 0, - PCI_DEVFN(dev, 7)); + PCI_DEVFN(dev, 7), domid); return rc; } int me_wifi_quirk(struct domain *domain, uint8_t bus, uint8_t devfn, - unsigned int mode) + domid_t domid, paddr_t pgd_maddr, unsigned int mode) { u32 id; int rc = 0; @@ -454,7 +457,7 @@ int me_wifi_quirk(struct domain *domain, case 0x423b8086: case 0x423c8086: case 0x423d8086: - rc = map_me_phantom_function(domain, 3, mode); + rc = map_me_phantom_function(domain, 3, domid, pgd_maddr, mode); break; default: break; @@ -480,7 +483,7 @@ int me_wifi_quirk(struct domain *domain, case 0x42388086: /* Puma Peak */ case 0x422b8086: case 0x422c8086: - rc = map_me_phantom_function(domain, 22, mode); + rc = map_me_phantom_function(domain, 22, domid, pgd_maddr, mode); break; default: break; ++++++ xsa400-08.patch ++++++ From: Jan Beulich <jbeul...@suse.com> Subject: VT-d: prepare for per-device quarantine page tables (part II) Replace the passing of struct domain * by domid_t in preparation of per-device quarantine page tables also requiring per-device pseudo domain IDs, which aren't going to be associated with any struct domain instances. No functional change intended (except for slightly adjusted log message text). Signed-off-by: Jan Beulich <jbeul...@suse.com> Reviewed-by: Paul Durrant <p...@xen.org> Reviewed-by: Kevin Tian <kevin.t...@intel.com> Reviewed-by: Roger Pau Monné
<roger....@citrix.com> --- a/xen/drivers/passthrough/vtd/iommu.c +++ b/xen/drivers/passthrough/vtd/iommu.c @@ -62,8 +62,8 @@ static struct tasklet vtd_fault_tasklet; static int setup_hwdom_device(u8 devfn, struct pci_dev *); static void setup_hwdom_rmrr(struct domain *d); -static int domain_iommu_domid(struct domain *d, - struct vtd_iommu *iommu) +static int get_iommu_did(domid_t domid, const struct vtd_iommu *iommu, + bool warn) { unsigned long nr_dom, i; @@ -71,16 +71,16 @@ static int domain_iommu_domid(struct dom i = find_first_bit(iommu->domid_bitmap, nr_dom); while ( i < nr_dom ) { - if ( iommu->domid_map[i] == d->domain_id ) + if ( iommu->domid_map[i] == domid ) return i; i = find_next_bit(iommu->domid_bitmap, nr_dom, i+1); } - if ( !d->is_dying ) + if ( warn ) dprintk(XENLOG_ERR VTDPREFIX, - "Cannot get valid iommu %u domid: %pd\n", - iommu->index, d); + "No valid iommu %u domid for Dom%d\n", + iommu->index, domid); return -1; } @@ -88,8 +88,7 @@ static int domain_iommu_domid(struct dom #define DID_FIELD_WIDTH 16 #define DID_HIGH_OFFSET 8 static int context_set_domain_id(struct context_entry *context, - struct domain *d, - struct vtd_iommu *iommu) + domid_t domid, struct vtd_iommu *iommu) { unsigned long nr_dom, i; int found = 0; @@ -100,7 +99,7 @@ static int context_set_domain_id(struct i = find_first_bit(iommu->domid_bitmap, nr_dom); while ( i < nr_dom ) { - if ( iommu->domid_map[i] == d->domain_id ) + if ( iommu->domid_map[i] == domid ) { found = 1; break; @@ -116,7 +115,7 @@ static int context_set_domain_id(struct dprintk(XENLOG_ERR VTDPREFIX, "IOMMU: no free domain ids\n"); return -EFAULT; } - iommu->domid_map[i] = d->domain_id; + iommu->domid_map[i] = domid; } set_bit(i, iommu->domid_bitmap); @@ -125,9 +124,9 @@ static int context_set_domain_id(struct return 0; } -static void cleanup_domid_map(struct domain *domain, struct vtd_iommu *iommu) +static void cleanup_domid_map(domid_t domid, struct vtd_iommu *iommu) { - int iommu_domid = domain_iommu_domid(domain, iommu); + int iommu_domid = get_iommu_did(domid, iommu, false); if ( iommu_domid >= 0 ) { @@ -167,7 +166,7 @@ static bool any_pdev_behind_iommu(const * If no other devices under the same iommu owned by this domain, * clear iommu in iommu_bitmap and clear domain_id in domid_bitmap. 
*/ -static void check_cleanup_domid_map(struct domain *d, +static void check_cleanup_domid_map(const struct domain *d, const struct pci_dev *exclude, struct vtd_iommu *iommu) { @@ -183,7 +182,7 @@ static void check_cleanup_domid_map(stru if ( !found ) { clear_bit(iommu->index, dom_iommu(d)->arch.vtd.iommu_bitmap); - cleanup_domid_map(d, iommu); + cleanup_domid_map(d->domain_id, iommu); } } @@ -683,7 +682,7 @@ static int __must_check iommu_flush_iotl continue; flush_dev_iotlb = !!find_ats_dev_drhd(iommu); - iommu_domid= domain_iommu_domid(d, iommu); + iommu_domid = get_iommu_did(d->domain_id, iommu, !d->is_dying); if ( iommu_domid == -1 ) continue; @@ -1459,7 +1458,7 @@ int domain_context_mapping_one( spin_unlock(&hd->arch.mapping_lock); } - if ( context_set_domain_id(&lctxt, domain, iommu) ) + if ( context_set_domain_id(&lctxt, domid, iommu) ) { unlock: spin_unlock(&iommu->lock); @@ -1785,7 +1784,7 @@ int domain_context_unmap_one( context_clear_entry(*context); iommu_sync_cache(context, sizeof(struct context_entry)); - iommu_domid= domain_iommu_domid(domain, iommu); + iommu_domid = get_iommu_did(domid, iommu, !domain->is_dying); if ( iommu_domid == -1 ) { spin_unlock(&iommu->lock); @@ -1953,7 +1952,7 @@ static void iommu_domain_teardown(struct ASSERT(!hd->arch.vtd.pgd_maddr); for_each_drhd_unit ( drhd ) - cleanup_domid_map(d, drhd->iommu); + cleanup_domid_map(d->domain_id, drhd->iommu); XFREE(hd->arch.vtd.iommu_bitmap); } ++++++ xsa400-09.patch ++++++ From: Jan Beulich <jbeul...@suse.com> Subject: IOMMU/x86: maintain a per-device pseudo domain ID In order to subsequently enable per-device quarantine page tables, we'll need domain-ID-like identifiers to be inserted in the respective device (AMD) or context (Intel) table entries alongside the per-device page table root addresses. Make use of "real" domain IDs occupying only half of the value range coverable by domid_t. Note that in VT-d's iommu_alloc() I didn't want to introduce new memory leaks in case of error, but existing ones don't get plugged - that'll be the subject of a later change. The VT-d changes are slightly asymmetric, but this way we can avoid assigning pseudo domain IDs to devices which would never be mapped while still avoiding to add a new parameter to domain_context_unmap(). Signed-off-by: Jan Beulich <jbeul...@suse.com> Reviewed-by: Paul Durrant <p...@xen.org> Reviewed-by: Kevin Tian <kevin.t...@intel.com> Reviewed-by: Roger Pau Monné <roger....@citrix.com> --- a/xen/include/asm-x86/iommu.h +++ b/xen/include/asm-x86/iommu.h @@ -141,6 +141,10 @@ int pi_update_irte(const struct pi_desc iommu_vcall(ops, sync_cache, addr, size); \ }) +unsigned long *iommu_init_domid(void); +domid_t iommu_alloc_domid(unsigned long *map); +void iommu_free_domid(domid_t domid, unsigned long *map); + int __must_check iommu_free_pgtables(struct domain *d); struct domain_iommu; struct page_info *__must_check iommu_alloc_pgtable(struct domain_iommu *hd); --- a/xen/include/asm-x86/pci.h +++ b/xen/include/asm-x86/pci.h @@ -13,6 +13,12 @@ struct arch_pci_dev { vmask_t used_vectors; + /* + * These fields are (de)initialized under pcidevs-lock. Other uses of + * them don't race (de)initialization and hence don't strictly need any + * locking.
+ */ + domid_t pseudo_domid; }; int pci_conf_write_intercept(unsigned int seg, unsigned int bdf, @@ -36,6 +42,6 @@ static always_inline bool is_pci_passthr return true; } -static inline void arch_pci_init_pdev(struct pci_dev *pdev) {} +void arch_pci_init_pdev(struct pci_dev *pdev); #endif /* __X86_PCI_H__ */ --- a/xen/drivers/passthrough/amd/iommu.h +++ b/xen/drivers/passthrough/amd/iommu.h @@ -96,6 +96,7 @@ struct amd_iommu { struct ring_buffer cmd_buffer; struct ring_buffer event_log; struct ring_buffer ppr_log; + unsigned long *domid_map; int exclusion_enable; int exclusion_allow_all; --- a/xen/drivers/passthrough/amd/iommu_detect.c +++ b/xen/drivers/passthrough/amd/iommu_detect.c @@ -223,6 +223,11 @@ int __init amd_iommu_detect_one_acpi( if ( rt ) goto out; + iommu->domid_map = iommu_init_domid(); + rt = -ENOMEM; + if ( !iommu->domid_map ) + goto out; + rt = pci_ro_device(iommu->seg, bus, PCI_DEVFN(dev, func)); if ( rt ) printk(XENLOG_ERR "Could not mark config space of %pp read-only (%d)\n", @@ -233,7 +238,10 @@ int __init amd_iommu_detect_one_acpi( out: if ( rt ) + { + xfree(iommu->domid_map); xfree(iommu); + } return rt; } --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c @@ -539,6 +539,8 @@ static int amd_iommu_add_device(u8 devfn struct amd_iommu *iommu; u16 bdf; struct ivrs_mappings *ivrs_mappings; + bool fresh_domid = false; + int ret; if ( !pdev->domain ) return -EINVAL; @@ -606,7 +608,22 @@ static int amd_iommu_add_device(u8 devfn AMD_IOMMU_WARN("%pd: unity mapping failed for %pp\n", pdev->domain, &pdev->sbdf); - return amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev); + if ( iommu_quarantine && pdev->arch.pseudo_domid == DOMID_INVALID ) + { + pdev->arch.pseudo_domid = iommu_alloc_domid(iommu->domid_map); + if ( pdev->arch.pseudo_domid == DOMID_INVALID ) + return -ENOSPC; + fresh_domid = true; + } + + ret = amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev); + if ( ret && fresh_domid ) + { + iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map); + pdev->arch.pseudo_domid = DOMID_INVALID; + } + + return ret; } static int amd_iommu_remove_device(u8 devfn, struct pci_dev *pdev) @@ -638,6 +655,9 @@ static int amd_iommu_remove_device(u8 de AMD_IOMMU_WARN("%pd: unity unmapping failed for %pp\n", pdev->domain, &pdev->sbdf); + iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map); + pdev->arch.pseudo_domid = DOMID_INVALID; + if ( amd_iommu_perdev_intremap && ivrs_mappings[bdf].dte_requestor_id == bdf && ivrs_mappings[bdf].intremap_table ) --- a/xen/drivers/passthrough/pci.c +++ b/xen/drivers/passthrough/pci.c @@ -1271,9 +1271,14 @@ static int _dump_pci_devices(struct pci_ list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list ) { - printk("%pp - %pd - node %-3d", - &pdev->sbdf, pdev->domain, - (pdev->node != NUMA_NO_NODE) ? pdev->node : -1); + printk("%pp - ", &pdev->sbdf); +#ifdef CONFIG_X86 + if ( pdev->domain == dom_io ) + printk("DomIO:%x", pdev->arch.pseudo_domid); + else +#endif + printk("%pd", pdev->domain); + printk(" - node %-3d", (pdev->node != NUMA_NO_NODE) ? 
pdev->node : -1); pdev_dump_msi(pdev); printk("\n"); } --- a/xen/drivers/passthrough/vtd/iommu.c +++ b/xen/drivers/passthrough/vtd/iommu.c @@ -22,6 +22,7 @@ #include <xen/sched.h> #include <xen/xmalloc.h> #include <xen/domain_page.h> +#include <xen/err.h> #include <xen/iocap.h> #include <xen/iommu.h> #include <xen/numa.h> @@ -1215,7 +1216,7 @@ int __init iommu_alloc(struct acpi_drhd_ { struct vtd_iommu *iommu; unsigned long sagaw, nr_dom; - int agaw; + int agaw, rc; iommu = xzalloc(struct vtd_iommu); if ( iommu == NULL ) @@ -1301,7 +1302,16 @@ int __init iommu_alloc(struct acpi_drhd_ if ( !iommu->domid_map ) return -ENOMEM; + iommu->pseudo_domid_map = iommu_init_domid(); + rc = -ENOMEM; + if ( !iommu->pseudo_domid_map ) + goto free; + return 0; + + free: + iommu_free(drhd); + return rc; } void __init iommu_free(struct acpi_drhd_unit *drhd) @@ -1324,6 +1334,7 @@ void __init iommu_free(struct acpi_drhd_ xfree(iommu->domid_bitmap); xfree(iommu->domid_map); + xfree(iommu->pseudo_domid_map); if ( iommu->msi.irq >= 0 ) destroy_irq(iommu->msi.irq); @@ -1593,8 +1604,8 @@ int domain_context_mapping_one( return rc ?: pdev && prev_dom; } -static int domain_context_unmap(struct domain *d, uint8_t devfn, - struct pci_dev *pdev); +static const struct acpi_drhd_unit *domain_context_unmap( + struct domain *d, uint8_t devfn, struct pci_dev *pdev); static int domain_context_mapping(struct domain *domain, u8 devfn, struct pci_dev *pdev) @@ -1602,6 +1613,7 @@ static int domain_context_mapping(struct const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev); const struct acpi_rmrr_unit *rmrr; paddr_t pgd_maddr = dom_iommu(domain)->arch.vtd.pgd_maddr; + domid_t orig_domid = pdev->arch.pseudo_domid; int ret = 0; unsigned int i, mode = 0; uint16_t seg = pdev->seg, bdf; @@ -1660,6 +1672,14 @@ static int domain_context_mapping(struct if ( !drhd ) return -ENODEV; + if ( iommu_quarantine && orig_domid == DOMID_INVALID ) + { + pdev->arch.pseudo_domid = + iommu_alloc_domid(drhd->iommu->pseudo_domid_map); + if ( pdev->arch.pseudo_domid == DOMID_INVALID ) + return -ENOSPC; + } + if ( iommu_debug ) printk(VTDPREFIX "%pd:PCIe: map %pp\n", domain, &PCI_SBDF3(seg, bus, devfn)); @@ -1677,6 +1697,14 @@ static int domain_context_mapping(struct if ( !drhd ) return -ENODEV; + if ( iommu_quarantine && orig_domid == DOMID_INVALID ) + { + pdev->arch.pseudo_domid = + iommu_alloc_domid(drhd->iommu->pseudo_domid_map); + if ( pdev->arch.pseudo_domid == DOMID_INVALID ) + return -ENOSPC; + } + if ( iommu_debug ) printk(VTDPREFIX "%pd:PCI: map %pp\n", domain, &PCI_SBDF3(seg, bus, devfn)); @@ -1750,6 +1778,13 @@ static int domain_context_mapping(struct if ( !ret && devfn == pdev->devfn ) pci_vtd_quirk(pdev); + if ( ret && drhd && orig_domid == DOMID_INVALID ) + { + iommu_free_domid(pdev->arch.pseudo_domid, + drhd->iommu->pseudo_domid_map); + pdev->arch.pseudo_domid = DOMID_INVALID; + } + return ret; } @@ -1835,8 +1870,10 @@ int domain_context_unmap_one( return rc; } -static int domain_context_unmap(struct domain *domain, u8 devfn, - struct pci_dev *pdev) +static const struct acpi_drhd_unit *domain_context_unmap( + struct domain *domain, + uint8_t devfn, + struct pci_dev *pdev) { const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev); struct vtd_iommu *iommu = drhd ? drhd->iommu : NULL; @@ -1850,16 +1887,16 @@ static int domain_context_unmap(struct d if ( iommu_debug ) printk(VTDPREFIX "%pd:Hostbridge: skip %pp unmap\n", domain, &PCI_SBDF3(seg, bus, devfn)); - return is_hardware_domain(domain) ? 
0 : -EPERM; + return ERR_PTR(is_hardware_domain(domain) ? 0 : -EPERM); case DEV_TYPE_PCIe_BRIDGE: case DEV_TYPE_PCIe2PCI_BRIDGE: case DEV_TYPE_LEGACY_PCI_BRIDGE: - return 0; + return ERR_PTR(0); case DEV_TYPE_PCIe_ENDPOINT: if ( !iommu ) - return -ENODEV; + return ERR_PTR(-ENODEV); if ( iommu_debug ) printk(VTDPREFIX "%pd:PCIe: unmap %pp\n", @@ -1873,7 +1910,7 @@ static int domain_context_unmap(struct d case DEV_TYPE_PCI: if ( !iommu ) - return -ENODEV; + return ERR_PTR(-ENODEV); if ( iommu_debug ) printk(VTDPREFIX "%pd:PCI: unmap %pp\n", @@ -1920,14 +1957,14 @@ static int domain_context_unmap(struct d default: dprintk(XENLOG_ERR VTDPREFIX, "%pd:unknown(%u): %pp\n", domain, pdev->type, &PCI_SBDF3(seg, bus, devfn)); - return -EINVAL; + return ERR_PTR(-EINVAL); } if ( !ret && pdev->devfn == devfn && !QUARANTINE_SKIP(domain, dom_iommu(domain)->arch.vtd.pgd_maddr) ) check_cleanup_domid_map(domain, pdev, iommu); - return ret; + return drhd; } static void iommu_clear_root_pgtable(struct domain *d) @@ -2154,16 +2191,17 @@ static int intel_iommu_enable_device(str static int intel_iommu_remove_device(u8 devfn, struct pci_dev *pdev) { + const struct acpi_drhd_unit *drhd; struct acpi_rmrr_unit *rmrr; u16 bdf; - int ret, i; + unsigned int i; if ( !pdev->domain ) return -EINVAL; - ret = domain_context_unmap(pdev->domain, devfn, pdev); - if ( ret ) - return ret; + drhd = domain_context_unmap(pdev->domain, devfn, pdev); + if ( IS_ERR(drhd) ) + return PTR_ERR(drhd); for_each_rmrr_device ( rmrr, bdf, i ) { @@ -2180,6 +2218,13 @@ static int intel_iommu_remove_device(u8 rmrr->end_address, 0); } + if ( drhd ) + { + iommu_free_domid(pdev->arch.pseudo_domid, + drhd->iommu->pseudo_domid_map); + pdev->arch.pseudo_domid = DOMID_INVALID; + } + return 0; } @@ -2556,7 +2601,12 @@ static int reassign_device_ownership( } } else - ret = domain_context_unmap(source, devfn, pdev); + { + const struct acpi_drhd_unit *drhd; + + drhd = domain_context_unmap(source, devfn, pdev); + ret = IS_ERR(drhd) ? PTR_ERR(drhd) : 0; + } if ( ret ) { if ( !has_arch_pdevs(target) ) --- a/xen/drivers/passthrough/vtd/iommu.h +++ b/xen/drivers/passthrough/vtd/iommu.h @@ -508,6 +508,7 @@ struct vtd_iommu { } flush; struct list_head ats_devices; + unsigned long *pseudo_domid_map; /* "pseudo" domain id bitmap */ unsigned long *domid_bitmap; /* domain id bitmap */ u16 *domid_map; /* domain id mapping array */ uint32_t version; --- a/xen/drivers/passthrough/x86/iommu.c +++ b/xen/drivers/passthrough/x86/iommu.c @@ -387,6 +387,58 @@ void __hwdom_init arch_iommu_hwdom_init( return; } +void arch_pci_init_pdev(struct pci_dev *pdev) +{ + pdev->arch.pseudo_domid = DOMID_INVALID; +} + +unsigned long *__init iommu_init_domid(void) +{ + if ( !iommu_quarantine ) + return ZERO_BLOCK_PTR; + + BUILD_BUG_ON(DOMID_MASK * 2U >= UINT16_MAX); + + return xzalloc_array(unsigned long, + BITS_TO_LONGS(UINT16_MAX - DOMID_MASK)); +} + +domid_t iommu_alloc_domid(unsigned long *map) +{ + /* + * This is used uniformly across all IOMMUs, such that on typical + * systems we wouldn't re-use the same ID very quickly (perhaps never). 
+ */ + static unsigned int start; + unsigned int idx = find_next_zero_bit(map, UINT16_MAX - DOMID_MASK, start); + + ASSERT(pcidevs_locked()); + + if ( idx >= UINT16_MAX - DOMID_MASK ) + idx = find_first_zero_bit(map, UINT16_MAX - DOMID_MASK); + if ( idx >= UINT16_MAX - DOMID_MASK ) + return DOMID_INVALID; + + __set_bit(idx, map); + + start = idx + 1; + + return idx | (DOMID_MASK + 1); +} + +void iommu_free_domid(domid_t domid, unsigned long *map) +{ + ASSERT(pcidevs_locked()); + + if ( domid == DOMID_INVALID ) + return; + + ASSERT(domid > DOMID_MASK); + + if ( !__test_and_clear_bit(domid & DOMID_MASK, map) ) + BUG(); +} + int iommu_free_pgtables(struct domain *d) { struct domain_iommu *hd = dom_iommu(d); ++++++ xsa400-10.patch ++++++ From: Jan Beulich <jbeul...@suse.com> Subject: IOMMU/x86: drop TLB flushes from quarantine_init() hooks The page tables just created aren't hooked up yet anywhere, so there's nothing that could be present in any TLB, and hence nothing to flush. Dropping this flush is, at least on the VT-d side, a prereq to per- device domain ID use when quarantining devices, as dom_io isn't going to be assigned a DID anymore: The warning in get_iommu_did() would trigger. Signed-off-by: Jan Beulich <jbeul...@suse.com> Reviewed-by: Paul Durrant <p...@xen.org> Reviewed-by: Roger Pau Monné <roger....@citrix.com> Reviewed-by: Kevin Tian <kevin.t...@intel.com> --- a/xen/drivers/passthrough/amd/iommu_map.c +++ b/xen/drivers/passthrough/amd/iommu_map.c @@ -654,8 +654,6 @@ int __init amd_iommu_quarantine_init(str out: spin_unlock(&hd->arch.mapping_lock); - amd_iommu_flush_all_pages(d); - /* Pages leaked in failure case */ return level ? -ENOMEM : 0; } --- a/xen/drivers/passthrough/vtd/iommu.c +++ b/xen/drivers/passthrough/vtd/iommu.c @@ -2975,9 +2975,6 @@ static int __init intel_iommu_quarantine out: spin_unlock(&hd->arch.mapping_lock); - if ( !rc ) - rc = iommu_flush_iotlb_all(d); - /* Pages may be leaked in failure case */ return rc; } ++++++ xsa400-11.patch ++++++ From: Jan Beulich <jbeul...@suse.com> Subject: AMD/IOMMU: abstract maximum number of page table levels We will want to use the constant elsewhere. Signed-off-by: Jan Beulich <jbeul...@suse.com> Reviewed-by: Paul Durrant <p...@xen.org> --- a/xen/drivers/passthrough/amd/iommu-defs.h +++ b/xen/drivers/passthrough/amd/iommu-defs.h @@ -106,6 +106,7 @@ struct amd_iommu_dte { bool tv:1; unsigned int :5; unsigned int had:2; +#define IOMMU_MAX_PT_LEVELS 6 unsigned int paging_mode:3; uint64_t pt_root:40; bool ppr:1; --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c @@ -337,7 +337,7 @@ int amd_iommu_alloc_root(struct domain * return 0; } -unsigned int __read_mostly amd_iommu_max_paging_mode = 6; +unsigned int __read_mostly amd_iommu_max_paging_mode = IOMMU_MAX_PT_LEVELS; int __read_mostly amd_iommu_min_paging_mode = 1; static int amd_iommu_domain_init(struct domain *d) ++++++ xsa400-12.patch ++++++ ++++ 946 lines (skipped)
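
The pseudo domain ID scheme that xsa400-09 introduces above amounts to splitting the 16-bit domid_t space in half: real domains keep IDs in 0..DOMID_MASK, while iommu_alloc_domid() hands out per-device IDs with bit 15 set, so the two ranges can never collide and a quarantined device's context entry gets a DID no real domain can own. A minimal standalone sketch of that arithmetic follows; the constants mirror xen/include/public/xen.h, but the open-coded bit helpers and the simplified round-robin scan are stand-ins for Xen's bitmap primitives (find_next_zero_bit() and friends), not the in-tree implementation.

/* Sketch only: simplified stand-in for the iommu_alloc_domid() logic above. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define DOMID_MASK     0x7fffU                    /* real domain IDs: 0..0x7fff */
#define DOMID_INVALID  0x7ff4U
#define NR_PSEUDO_IDS  (UINT16_MAX - DOMID_MASK)  /* 0x8000 allocatable slots */

typedef uint16_t domid_t;

static unsigned long pseudo_map[NR_PSEUDO_IDS / (8 * sizeof(unsigned long))];

/* Set bit idx in the map, returning its previous value. */
static bool test_and_set(unsigned int idx)
{
    unsigned long *word = &pseudo_map[idx / (8 * sizeof(unsigned long))];
    unsigned long bit = 1UL << (idx % (8 * sizeof(unsigned long)));
    bool old = *word & bit;

    *word |= bit;

    return old;
}

static domid_t pseudo_domid_alloc(void)
{
    /* Round-robin cursor, like the static "start" in the patch, to delay re-use. */
    static unsigned int start;
    unsigned int n;

    for ( n = 0; n < NR_PSEUDO_IDS; ++n )
    {
        unsigned int idx = (start + n) % NR_PSEUDO_IDS;

        if ( !test_and_set(idx) )
        {
            start = idx + 1;
            /* Bit 15 moves the ID into the upper half: 0x8000..0xffff. */
            return idx | (DOMID_MASK + 1);
        }
    }

    return DOMID_INVALID;                         /* map exhausted */
}

static void pseudo_domid_free(domid_t domid)
{
    if ( domid > DOMID_MASK )                     /* DOMID_INVALID falls through */
    {
        unsigned int idx = domid & DOMID_MASK;

        pseudo_map[idx / (8 * sizeof(unsigned long))] &=
            ~(1UL << (idx % (8 * sizeof(unsigned long))));
    }
}

int main(void)
{
    domid_t a = pseudo_domid_alloc(), b = pseudo_domid_alloc();

    printf("allocated %#x and %#x\n", (unsigned int)a, (unsigned int)b);
    pseudo_domid_free(a);
    pseudo_domid_free(b);

    return 0;
}

Running this prints 0x8000 and 0x8001 for the first two allocations, i.e. values just past the real-ID range, which is also what the BUILD_BUG_ON(DOMID_MASK * 2U >= UINT16_MAX) in iommu_init_domid() guards: the bit-15 trick is only safe while both halves fit in 16 bits.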