On 3/12/26 14:20, CLEMENT MATHIEU--DRIF wrote:
Hi Yi, Zhenzhong,
More details here: https://www.kernel.org/doc/html/next/x86/shstk.html
During my debugging sessions, I got the following flow:
- Device sends an ATS request to the IOMMU without the no-write flag (so it
requests RW permissions)
- IOMMU walks the first-stage page table, finds the leaf pte and notices that
the W permission is not set (this page is not really RO; I prepared it to
make sure it is COW)
- Because of the condition I update in this patch, the IOMMU sets the dirty bit
despite the absence of the W bit
- The device gets the translation completion and decides to issue a PRI request
- The kernel runs the handler, schedules the work, and ends up calling
iommu_sva_handle_mm (with flag FAULT_FLAG_WRITE)
- It then calls handle_mm_fault, __handle_mm_fault and handle_pte_fault
- In handle_pte_fault, it runs the following fragment:
```
	if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) { /* This is true */
		if (!pte_write(entry))
			return do_wp_page(vmf);
		else if (likely(vmf->flags & FAULT_FLAG_WRITE))
			entry = pte_mkdirty(entry);
	}
```
- pte_write is defined as follows:
```
bool pte_write(pte_t pte)
{
	/*
	 * Shadow stack pages are logically writable, but do not have
	 * _PAGE_RW. Check for them separately from _PAGE_RW itself.
	 */
	return (pte_flags(pte) & _PAGE_RW) || pte_shstk(pte);
}

bool pte_shstk(pte_t pte)
{
	return cpu_feature_enabled(X86_FEATURE_SHSTK) &&
	       (pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
}
```
- The pte has D=1 W=0 so the first half of the pte_write condition is false
- As the CPU I used for testing has the shstk feature, pte_shstk is true
- Hence, pte_write is true, which means we will run `entry =
pte_mkdirty(entry);` in handle_pte_fault
- After the fault handler has completed, the write permission is still not
granted
- The PRI request returns a success completion code
- Device requests the translation again to the IOMMU, and goes back to the
first step, trapped forever.
Note that the second patch you reviewed was masking this behavior, as it
prevented the dirty bit from being added to the pte most of the time.
This is why I fixed them in this order o.O
Understood. You may want to refine the commit message slightly. In my
opinion, regardless of any bugs, it's essential to set the dirty bit
only when write permission is granted AND the device requests write access.
If you wish to highlight the specific issue: setting the dirty bit
without the write bit in vIOMMU creates a spoofed shadow stack PTE that
misleads the guest OS. This results in infinite PRI requests from the
device, as no actual write permission is ever granted - the guest OS
treats the PTE as a legitimate shadow stack entry and considers the
permissions already correct.
On Thu, 2026-03-12 at 11:30 +0800, Yi Liu wrote:
On 3/10/26 14:57, CLEMENT MATHIEU--DRIF wrote:
Avoid unintentionally flagging the entry as a shadow stack entry.
hmmm. not quite get shadow stack...
In the current implementation, the dirty bit is always set in the pte.
Hence, an ATS device requesting an RW translation for a copy-on-write
page is likely to trigger a PRI request for a region that has just been
marked as dirty by the IOMMU. However, CPUs that support shadow stacks
give special meaning to PTEs reporting W=0 and D=1. Setting these values
can cause PRI requests to complete successfully without granting the
expected write permission, which leads the device to enter an infinite
loop of ATS/PRI requests.
appreciate a clearer explanation here.
Fixes: 65c4f0999991 ("intel_iommu: Set accessed and dirty bits during stage-1 translation")
Signed-off-by: Clement Mathieu--Drif <[email protected]>
---
hw/i386/intel_iommu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index d24ba989bf..56146aafc1 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2066,7 +2066,7 @@ static int vtd_iova_to_fspte(IntelIOMMUState *s, VTDContextEntry *ce,
return -VTD_FR_FS_PAGING_ENTRY_RSVD;
}
- if (vtd_is_last_pte(fspte, *fspte_level) && is_write) {
+ if (vtd_is_last_pte(fspte, *fspte_level) && *writes && is_write) {
flag_ad |= VTD_FS_D;
}
the change looks reasonable.
Reviewed-by: Yi Liu <[email protected]>