Hi Yi, Zhenzhong,

More details here: https://www.kernel.org/doc/html/next/x86/shstk.html

During my debugging sessions, I got the following flow:

- Device sends an ATS request to the IOMMU without the no-write flag (so it 
requests RW permissions)
- IOMMU walks the first stage page table, finds the last pte and notices that 
the W permission is not set (I know this page not really RO, I prepared it to 
make sure it is COW)
- Because of the condition I update in the patch, the IOMMU sets the dirty bit 
despite the absence of the R/W bit
- The devices gets the translation completion and decides to issue a PRI request
- The kernel runs the handler, schedules the work, and ends up calling 
iommu_sva_handle_mm (with flag FAULT_FLAG_WRITE)
- It then calls handle_mm_fault, __handle_mm_fault and handle_pte_fault
- In handle_pte_fault, it runs the following fragment:

```
if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) { /* This is true */
        if (!pte_write(entry))
                return do_wp_page(vmf);
        else if (likely(vmf->flags & FAULT_FLAG_WRITE))
                entry = pte_mkdirty(entry);
}
```

- pte_write is define as follows

```
int pte_write(pte_t pte)
{
        /*
         * Shadow stack pages are logically writable, but do not have
         * _PAGE_RW.  Check for them separately from _PAGE_RW itself.
         */
        return (pte_flags(pte) & _PAGE_RW) || pte_shstk(pte);
}

bool pte_shstk(pte_t pte)
{
        return cpu_feature_enabled(X86_FEATURE_SHSTK) &&
               (pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
}
```

- The pte has D=1 W=0 so the first half of the pte_write condition is false
- As the CPU I used for testing has the shstk feature, pte_shstk is true
- Hence, pte_write is true, which means we will run `entry = 
pte_mkdirty(entry);` in handle_pte_fault
- After the fault handler has completed, the write permission is still not 
granted
- The PRI request returns a success completion code
- Device requests the translation again to the IOMMU, and goes back to the 
first step, trapped forever.

Note that the second patch you reviewed was masking this behavior as it was 
preventing the dirty bit from being added to the pte most of the time.  
This is the reason why I fix them in this order o.O

cmd


On Thu, 2026-03-12 at 11:30 +0800, Yi Liu wrote:
> On 3/10/26 14:57, CLEMENT MATHIEU--DRIF wrote:
> 
> > Avoid unintentionally flagging the entry as a shadow stack entry.
> 
> 
> hmmm. not quite get shadow stack...
> 
> 
> > In the current implementation, the dirty bit is always set in the pte.
> > Hence, an ATS device requesting an RW translation for a copy-on-write
> > page is likely to trigger a PRI request for a region that has just been
> > marked as dirty by the IOMMU. However, CPUs that support shadow stacks
> > give special meaning to PTEs reporting W=0 and D=1. Setting these values
> > can cause PRI requests to complete successfully without granting the
> > expected write permission, which leads the device to enter an infinite
> > loop of ATS/PRI requests.
> 
> 
> appreciate a clearer explanation here.
> 
> 
> 
> > Fixes: 65c4f0999991 ("intel_iommu: Set accessed and dirty bits during 
> > stage-1 translation")
> > Signed-off-by: Clement Mathieu--Drif 
> > <[[email protected]](mailto:[email protected])>
> > ---
> >   hw/i386/intel_iommu.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > index d24ba989bf..56146aafc1 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -2066,7 +2066,7 @@ static int vtd_iova_to_fspte(IntelIOMMUState *s, 
> > VTDContextEntry *ce,
> >               return -VTD_FR_FS_PAGING_ENTRY_RSVD;
> >           }
> >   
> > -        if (vtd_is_last_pte(fspte, *fspte_level) && is_write) {
> > +        if (vtd_is_last_pte(fspte, *fspte_level) && *writes && is_write) {
> >               flag_ad |= VTD_FS_D;
> >           }
> >  
> 
> 
> the change looks reasonable.

Reply via email to