On Fri, 2026-03-13 at 16:24 +0800, Yi Liu wrote:
> On 3/12/26 14:20, CLEMENT MATHIEU--DRIF wrote:
>
> > Hi Yi, Zhenzhong,
> >
> > More details here:
> > [https://www.kernel.org/doc/html/next/x86/shstk.html](https://www.kernel.org/doc/html/next/x86/shstk.html)
> >
> > During my debugging sessions, I got the following flow:
> >
> > - Device sends an ATS request to the IOMMU without the no-write flag (so it
> > requests RW permissions)
> > - IOMMU walks the first stage page table, finds the last pte and notices
> > that the W permission is not set (I know this page not really RO, I
> > prepared it to make sure it is COW)
> > - Because of the condition I update in the patch, the IOMMU sets the dirty
> > bit despite the absence of the R/W bit
> > - The devices gets the translation completion and decides to issue a PRI
> > request
> > - The kernel runs the handler, schedules the work, and ends up calling
> > iommu_sva_handle_mm (with flag FAULT_FLAG_WRITE)
> > - It then calls handle_mm_fault, __handle_mm_fault and handle_pte_fault
> > - In handle_pte_fault, it runs the following fragment:
> >
> > ```
> > if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) { /* This is true */
> > if (!pte_write(entry))
> > return do_wp_page(vmf);
> > else if (likely(vmf->flags & FAULT_FLAG_WRITE))
> > entry = pte_mkdirty(entry);
> > }
> > ```
> >
> > - pte_write is define as follows
> >
> > ```
> > int pte_write(pte_t pte)
> > {
> > /*
> > * Shadow stack pages are logically writable, but do not have
> > * _PAGE_RW. Check for them separately from _PAGE_RW itself.
> > */
> > return (pte_flags(pte) & _PAGE_RW) || pte_shstk(pte);
> > }
> >
> > bool pte_shstk(pte_t pte)
> > {
> > return cpu_feature_enabled(X86_FEATURE_SHSTK) &&
> > (pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
> > }
> > ```
> >
> > - The pte has D=1 W=0 so the first half of the pte_write condition is false
> > - As the CPU I used for testing has the shstk feature, pte_shstk is true
> > - Hence, pte_write is true, which means we will run `entry =
> > pte_mkdirty(entry);` in handle_pte_fault
> > - After the fault handler has completed, the write permission is still not
> > granted
> > - The PRI request returns a success completion code
> > - Device requests the translation again to the IOMMU, and goes back to the
> > first step, trapped forever.
> >
> > Note that the second patch you reviewed was masking this behavior as it was
> > preventing the dirty bit from being added to the pte most of the time.
> > This is the reason why I fix them in this order o.O
>
>
> Understood. You may want to refine the commit message slightly. In my
> opinion, regardless of any bugs, it's essential to set the dirty bit
> only when write permission is granted AND the device requests write access.
Yes, that's my understanding as well
>
> If you wish to highlight the specific issue: setting the dirty bit
> without the write bit in vIOMMU creates a spoofed shadow stack PTE that
> misleads the guest OS. This results in infinite PRI requests from the
> device, as no actual write permission is ever granted - the guest OS
> treats the PTE as a legitimate shadow stack entry and considers the
> permissions already correct.
Crystal clear, will repost with the updated message
>
>
> > On Thu, 2026-03-12 at 11:30 +0800, Yi Liu wrote:
> >
> > > On 3/10/26 14:57, CLEMENT MATHIEU--DRIF wrote:
> > >
> > >
> > > > Avoid unintentionally flagging the entry as a shadow stack entry.
> > >
> > >
> > >
> > > hmmm. not quite get shadow stack...
> > >
> > >
> > >
> > > > In the current implementation, the dirty bit is always set in the pte.
> > > > Hence, an ATS device requesting an RW translation for a copy-on-write
> > > > page is likely to trigger a PRI request for a region that has just been
> > > > marked as dirty by the IOMMU. However, CPUs that support shadow stacks
> > > > give special meaning to PTEs reporting W=0 and D=1. Setting these values
> > > > can cause PRI requests to complete successfully without granting the
> > > > expected write permission, which leads the device to enter an infinite
> > > > loop of ATS/PRI requests.
> > >
> > >
> > >
> > > appreciate a clearer explanation here.
> > >
> > >
> > >
> > >
> > > > Fixes: 65c4f0999991 ("intel_iommu: Set accessed and dirty bits during
> > > > stage-1 translation")
> > > > Signed-off-by: Clement Mathieu--Drif
> > > > <[[[email protected]](mailto:[email protected])](mailto:[[email protected]](mailto:[email protected]))>
> > > > ---
> > > > hw/i386/intel_iommu.c | 2 +-
> > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > index d24ba989bf..56146aafc1 100644
> > > > --- a/hw/i386/intel_iommu.c
> > > > +++ b/hw/i386/intel_iommu.c
> > > > @@ -2066,7 +2066,7 @@ static int vtd_iova_to_fspte(IntelIOMMUState *s,
> > > > VTDContextEntry *ce,
> > > > return -VTD_FR_FS_PAGING_ENTRY_RSVD;
> > > > }
> > > >
> > > > - if (vtd_is_last_pte(fspte, *fspte_level) && is_write) {
> > > > + if (vtd_is_last_pte(fspte, *fspte_level) && *writes &&
> > > > is_write) {
> > > > flag_ad |= VTD_FS_D;
> > > > }
> > > >
> > >
> > >
> > >
> > > the change looks reasonable.
> >
>
>
> Reviewed-by: Yi Liu <[[email protected]](mailto:[email protected])>
>
>