Re: [Intel-gfx] [cache coherency bug] i915 and PAT attributes

2023-01-01 Thread Demi Marie Obenour
On Sun, Jan 01, 2023 at 08:17:52PM -0500, Demi Marie Obenour wrote:

Re: [Intel-gfx] [cache coherency bug] i915 and PAT attributes

2023-01-01 Thread Demi Marie Obenour
On Mon, Jan 02, 2023 at 02:00:51AM +0100, Marek Marczykowski-Górecki wrote:

Re: [Intel-gfx] [cache coherency bug] i915 and PAT attributes

2023-01-01 Thread Marek Marczykowski-Górecki
On Sun, Jan 01, 2023 at 07:03:18PM -0500, Demi Marie Obenour wrote:

Re: [Intel-gfx] [cache coherency bug] i915 and PAT attributes

2023-01-01 Thread Demi Marie Obenour
On Mon, Jan 02, 2023 at 12:24:54AM +0100, Marek Marczykowski-Górecki wrote:
> On Thu, Dec 22, 2022 at 10:29:57AM +0200, Ville Syrjälä wrote:
> > On Fri, Dec 16, 2022 at 03:30:13PM +, Andrew Cooper wrote:
> > > On 08/12/2022 1:55 pm, Marek Marczykowski-Górecki wrote:
> > > > Hi,
> > > >
> > > > There is an issue with i915 on Xen PV (dom0). The end result is a lot of
> > > > glitches, like here: 
> > > > https://openqa.qubes-os.org/tests/54748#step/startup/8
> > > > (this one is on ADL, Linux 6.1-rc7 as a Xen PV dom0). It's using Xorg
> > > > with the "modesetting" driver.
> > > >
> > > > After some iterations of debugging, we narrowed it down to how i915
> > > > handles caching. The main difference is that PAT is set up differently
> > > > on Xen PV than on native Linux. Normally, Linux has an appropriate
> > > > abstraction for that, but apparently something related to i915 doesn't
> > > > play well with it. The specific difference is:
> > > > native linux:
> > > > x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
> > > > xen pv:
> > > > x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WC  WP  UC  UC
> > > >                                   ~~          ~~      ~~  ~~
> > > >
> > > > The specific impact depends on the kernel version and the hardware. I
> > > > see the most severe issues on >=ADL, but some older hardware is affected
> > > > too, sometimes only if composition is disabled in the window manager.
> > > > Some more information is collected at
> > > > https://github.com/QubesOS/qubes-issues/issues/4782 (and a few linked
> > > > duplicates...).
> > > >
> > > > A somewhat related commit is here:
> > > > https://github.com/torvalds/linux/commit/bdd8b6c98239cad ("drm/i915:
> > > > replace X86_FEATURE_PAT with pat_enabled()") - it is the place where
> > > > i915 explicitly checks for PAT support, so I'm cc-ing people mentioned
> > > > there too.
> > > >
> > > > Any ideas?
> > > >
> > > > The issue can be easily reproduced without Xen too, by adjusting PAT in
> > > > Linux:
> > > > -8<-
> > > > diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> > > > index 66a209f7eb86..319ab60c8d8c 100644
> > > > --- a/arch/x86/mm/pat/memtype.c
> > > > +++ b/arch/x86/mm/pat/memtype.c
> > > > @@ -400,8 +400,8 @@ void pat_init(void)
> > > >                   * The reserved slots are unused, but mapped to their
> > > >                   * corresponding types in the presence of PAT errata.
> > > >                   */
> > > > -                pat = PAT(0, WB) | PAT(1, WC) | PAT(2, UC_MINUS) | PAT(3, UC) |
> > > > -                      PAT(4, WB) | PAT(5, WP) | PAT(6, UC_MINUS) | PAT(7, WT);
> > > > +                pat = PAT(0, WB) | PAT(1, WT) | PAT(2, UC_MINUS) | PAT(3, UC) |
> > > > +                      PAT(4, WC) | PAT(5, WP) | PAT(6, UC)   | PAT(7, UC);
> > > >          }
> > > >  
> > > >          if (!pat_bp_initialized) {
> > > > -8<-
> > > >
> > > 
> > > Hello, can anyone help please?
> > > 
> > > Intel's CI has taken this reproducer of the bug, and confirmed the
> > > regression. 
> > > https://lore.kernel.org/intel-gfx/Y5Hst0bCxQDTN7lK@mail-itl/T/#m4480c15a0d117dce6210562eb542875e757647fb
> > > 
> > > We're reasonably confident that it is an i915 bug (given the repro with
> > > no Xen in the mix), but we're out of any further ideas.
> > 
> > I don't think we have any code that assumes anything about the PAT,
> > apart from WC being available (which seems like it should still be
> > the case with your modified PAT). I suppose you'll just have to 
> > start digging from pgprot_writecombine()/noncached() and make sure
> > everything ends up using the correct PAT entry.
> 
> I tried several approaches to this, without success. Here is an update on
> debugging (reported also on #intel-gfx live):
> 
> I did several tests with different PAT configurations (by modifying the
> Xen code that sets the MSR). The full table is at
> https://pad.itl.space/sheet/#/2/sheet/view/HD1qT2Zf44Ha36TJ3wj2YL+PchsTidyNTFepW5++ZKM/
> Some highlights:
> - 1=WC, 4=WT - good
> - 1=WT, 4=WC - bad
> - 1=WT, 3=WC (4=WC too) - good
> - 1=WT, 5=WC - good
> 
> So, to me it seems that WC at index 4 is problematic for some reason.
> 
> Next, I tried to trap all the places in arch/x86/xen/mmu_pv.c that
> write PTEs and verify the requested cache attributes. There, it seems all
> the requested WC mappings are properly translated (using either index 1,
> 3, 4, or 5 according to the PAT settings). And then after reading the PTE
> back, it indeed seems to be correctly set. I didn't add a read-back after
> HYPERVISOR_update_va_mapping, but verified it isn't used for setting WC.
> 
> Using the same method, I also checked that indexes that aren't supposed
> to be used (for example index 4 when both 3 and 4 are WC) indeed are not
> used. So, the hypothesis that specific indexes are hardcoded somewhere
> is unlikely.
> 
> This all looks very weird to me. Any ideas?

Old CPUs have had hardware errata that caused the top bit of th

Re: [Intel-gfx] [cache coherency bug] i915 and PAT attributes

2023-01-01 Thread Marek Marczykowski-Górecki
On Thu, Dec 22, 2022 at 10:29:57AM +0200, Ville Syrjälä wrote:
> On Fri, Dec 16, 2022 at 03:30:13PM +, Andrew Cooper wrote:
> > On 08/12/2022 1:55 pm, Marek Marczykowski-Górecki wrote:
> > > Hi,
> > >
> > > There is an issue with i915 on Xen PV (dom0). The end result is a lot of
> > > glitches, like here: 
> > > https://openqa.qubes-os.org/tests/54748#step/startup/8
> > > (this one is on ADL, Linux 6.1-rc7 as a Xen PV dom0). It's using Xorg
> > > with the "modesetting" driver.
> > >
> > > After some iterations of debugging, we narrowed it down to how i915
> > > handles caching. The main difference is that PAT is set up differently
> > > on Xen PV than on native Linux. Normally, Linux has an appropriate
> > > abstraction for that, but apparently something related to i915 doesn't
> > > play well with it. The specific difference is:
> > > native linux:
> > > x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
> > > xen pv:
> > > x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WC  WP  UC  UC
> > >                                   ~~          ~~      ~~  ~~
> > >
> > > The specific impact depends on the kernel version and the hardware. I
> > > see the most severe issues on >=ADL, but some older hardware is affected
> > > too, sometimes only if composition is disabled in the window manager.
> > > Some more information is collected at
> > > https://github.com/QubesOS/qubes-issues/issues/4782 (and a few linked
> > > duplicates...).
> > >
> > > A somewhat related commit is here:
> > > https://github.com/torvalds/linux/commit/bdd8b6c98239cad ("drm/i915:
> > > replace X86_FEATURE_PAT with pat_enabled()") - it is the place where
> > > i915 explicitly checks for PAT support, so I'm cc-ing people mentioned
> > > there too.
> > >
> > > Any ideas?
> > >
> > > The issue can be easily reproduced without Xen too, by adjusting PAT in
> > > Linux:
> > > -8<-
> > > diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> > > index 66a209f7eb86..319ab60c8d8c 100644
> > > --- a/arch/x86/mm/pat/memtype.c
> > > +++ b/arch/x86/mm/pat/memtype.c
> > > @@ -400,8 +400,8 @@ void pat_init(void)
> > >                   * The reserved slots are unused, but mapped to their
> > >                   * corresponding types in the presence of PAT errata.
> > >                   */
> > > -                pat = PAT(0, WB) | PAT(1, WC) | PAT(2, UC_MINUS) | PAT(3, UC) |
> > > -                      PAT(4, WB) | PAT(5, WP) | PAT(6, UC_MINUS) | PAT(7, WT);
> > > +                pat = PAT(0, WB) | PAT(1, WT) | PAT(2, UC_MINUS) | PAT(3, UC) |
> > > +                      PAT(4, WC) | PAT(5, WP) | PAT(6, UC)   | PAT(7, UC);
> > >          }
> > >  
> > >          if (!pat_bp_initialized) {
> > > -8<-
> > >
> > 
> > Hello, can anyone help please?
> > 
> > Intel's CI has taken this reproducer of the bug, and confirmed the
> > regression. 
> > https://lore.kernel.org/intel-gfx/Y5Hst0bCxQDTN7lK@mail-itl/T/#m4480c15a0d117dce6210562eb542875e757647fb
> > 
> > We're reasonably confident that it is an i915 bug (given the repro with
> > no Xen in the mix), but we're out of any further ideas.
> 
> I don't think we have any code that assumes anything about the PAT,
> apart from WC being available (which seems like it should still be
> the case with your modified PAT). I suppose you'll just have to 
> start digging from pgprot_writecombine()/noncached() and make sure
> everything ends up using the correct PAT entry.

I tried several approaches to this, without success. Here is an update on
debugging (reported also on #intel-gfx live):

I did several tests with different PAT configurations (by modifying the
Xen code that sets the MSR). The full table is at
https://pad.itl.space/sheet/#/2/sheet/view/HD1qT2Zf44Ha36TJ3wj2YL+PchsTidyNTFepW5++ZKM/
Some highlights:
- 1=WC, 4=WT - good
- 1=WT, 4=WC - bad
- 1=WT, 3=WC (4=WC too) - good
- 1=WT, 5=WC - good

So, to me it seems that WC at index 4 is problematic for some reason.
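
For reference, a minimal sketch in C of how a PAT layout is decoded and how the three PTE cache bits select a slot. The two MSR constants are derived from the "x86/PAT: Configuration" lines quoted above; the helper names (pat_slot_type, pat_slot_from_bits, pat_print) are made up for illustration and are not kernel or Xen functions. It only shows the arithmetic and makes no claim about the cause of the glitches.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Memory-type encodings stored in each IA32_PAT byte (per the Intel SDM). */
enum pat_type {
    PAT_UC = 0, PAT_WC = 1, PAT_WT = 4, PAT_WP = 5, PAT_WB = 6, PAT_UC_MINUS = 7
};

/* Layouts quoted in this thread, with slot 0 in the lowest byte. */
#define PAT_NATIVE_LINUX 0x0407050600070106ULL /* WB WC UC- UC WB WP UC- WT */
#define PAT_XEN_PV       0x0000050100070406ULL /* WB WT UC- UC WC WP UC  UC */

/* Memory type held in PAT slot `slot` (0..7). */
static unsigned pat_slot_type(uint64_t pat_msr, unsigned slot)
{
    assert(slot < 8);
    return (unsigned)(pat_msr >> (slot * 8)) & 7u;
}

/* Slot selected by a PTE: the PAT bit is index bit 2, PCD bit 1, PWT bit 0. */
static unsigned pat_slot_from_bits(unsigned pat_bit, unsigned pcd, unsigned pwt)
{
    return ((pat_bit & 1u) << 2) | ((pcd & 1u) << 1) | (pwt & 1u);
}

static const char *pat_type_name(unsigned type)
{
    switch (type) {
    case PAT_UC:       return "UC";
    case PAT_WC:       return "WC";
    case PAT_WT:       return "WT";
    case PAT_WP:       return "WP";
    case PAT_WB:       return "WB";
    case PAT_UC_MINUS: return "UC-";
    default:           return "??";
    }
}

/* Print a layout in roughly the shape of the kernel's configuration line. */
static void pat_print(const char *label, uint64_t pat_msr)
{
    printf("%-12s [0-7]:", label);
    for (unsigned slot = 0; slot < 8; slot++)
        printf(" %-3s", pat_type_name(pat_slot_type(pat_msr, slot)));
    printf("\n");
}
```

With the Xen PV layout, pat_slot_from_bits(1, 0, 0) selects slot 4, which holds WC, while the same PCD/PWT bits with the PAT bit clear select slot 0, which holds WB.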

Next, I tried to trap all the places in arch/x86/xen/mmu_pv.c that
write PTEs and verify the requested cache attributes. There, it seems all
the requested WC mappings are properly translated (using either index 1,
3, 4, or 5 according to the PAT settings). And then after reading the PTE
back, it indeed seems to be correctly set. I didn't add a read-back after
HYPERVISOR_update_va_mapping, but verified it isn't used for setting WC.

Using the same method, I also checked that indexes that aren't supposed
to be used (for example index 4 when both 3 and 4 are WC) indeed are not
used. So, the hypothesis that specific indexes are hardcoded somewhere
is unlikely.
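
The kind of check described above can be sketched roughly as follows. This is a userspace illustration only, not the tracing code used in these tests. All helper names are hypothetical. It assumes 4 KiB PTEs (PAT in bit 7; large pages keep it in bit 12), and it assumes that the kernel prefers the lowest-numbered slot holding a given type, which matches the observation that index 3 is used when both 3 and 4 are WC.

```c
#include <assert.h>
#include <stdint.h>

/* x86 cache-control bits in a 4 KiB page-table entry (per the Intel SDM). */
#define PTE_PWT (1ULL << 3)
#define PTE_PCD (1ULL << 4)
#define PTE_PAT (1ULL << 7)

#define MT_WC 1u /* IA32_PAT encoding for write-combining */

/* PAT slot (0..7) selected by a 4 KiB PTE value. */
static unsigned pte_pat_slot(uint64_t pte)
{
    return (pte & PTE_PAT ? 4u : 0u) |
           (pte & PTE_PCD ? 2u : 0u) |
           (pte & PTE_PWT ? 1u : 0u);
}

/* Memory type (3-bit IA32_PAT encoding) the PTE resolves to under `pat_msr`. */
static unsigned pte_memory_type(uint64_t pat_msr, uint64_t pte)
{
    return (unsigned)(pat_msr >> (pte_pat_slot(pte) * 8)) & 7u;
}

/* Lowest-numbered slot holding `type`, or -1 if the layout has none. */
static int pat_lowest_slot(uint64_t pat_msr, unsigned type)
{
    assert(type < 8);
    for (int slot = 0; slot < 8; slot++)
        if (((unsigned)(pat_msr >> (slot * 8)) & 7u) == type)
            return slot;
    return -1;
}

/* Hypothetical check: does this PTE use exactly the slot expected for WC? */
static int pte_is_expected_wc(uint64_t pat_msr, uint64_t pte)
{
    int want = pat_lowest_slot(pat_msr, MT_WC);
    return want >= 0 && pte_pat_slot(pte) == (unsigned)want;
}
```

For a layout with WC in both slots 3 and 4, a PTE with only the PAT bit set still resolves to WC through pte_memory_type(), but pte_is_expected_wc() rejects it, because slot 3 is the one expected to be used.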

This all looks very weird to me. Any ideas?

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: [Intel-gfx] [cache coherency bug] i915 and PAT attributes

2022-12-22 Thread Ville Syrjälä
On Fri, Dec 16, 2022 at 03:30:13PM +, Andrew Cooper wrote:
> On 08/12/2022 1:55 pm, Marek Marczykowski-Górecki wrote:
> > Hi,
> >
> > There is an issue with i915 on Xen PV (dom0). The end result is a lot of
> > glitches, like here: https://openqa.qubes-os.org/tests/54748#step/startup/8
> > (this one is on ADL, Linux 6.1-rc7 as a Xen PV dom0). It's using Xorg
> > with the "modesetting" driver.
> >
> > After some iterations of debugging, we narrowed it down to how i915
> > handles caching. The main difference is that PAT is set up differently
> > on Xen PV than on native Linux. Normally, Linux has an appropriate
> > abstraction for that, but apparently something related to i915 doesn't
> > play well with it. The specific difference is:
> > native linux:
> > x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
> > xen pv:
> > x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WC  WP  UC  UC
> >                                   ~~          ~~      ~~  ~~
> >
> > The specific impact depends on the kernel version and the hardware. I
> > see the most severe issues on >=ADL, but some older hardware is affected
> > too, sometimes only if composition is disabled in the window manager.
> > Some more information is collected at
> > https://github.com/QubesOS/qubes-issues/issues/4782 (and a few linked
> > duplicates...).
> >
> > A somewhat related commit is here:
> > https://github.com/torvalds/linux/commit/bdd8b6c98239cad ("drm/i915:
> > replace X86_FEATURE_PAT with pat_enabled()") - it is the place where
> > i915 explicitly checks for PAT support, so I'm cc-ing people mentioned
> > there too.
> >
> > Any ideas?
> >
> > The issue can be easily reproduced without Xen too, by adjusting PAT in
> > Linux:
> > -8<-
> > diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> > index 66a209f7eb86..319ab60c8d8c 100644
> > --- a/arch/x86/mm/pat/memtype.c
> > +++ b/arch/x86/mm/pat/memtype.c
> > @@ -400,8 +400,8 @@ void pat_init(void)
> >                   * The reserved slots are unused, but mapped to their
> >                   * corresponding types in the presence of PAT errata.
> >                   */
> > -                pat = PAT(0, WB) | PAT(1, WC) | PAT(2, UC_MINUS) | PAT(3, UC) |
> > -                      PAT(4, WB) | PAT(5, WP) | PAT(6, UC_MINUS) | PAT(7, WT);
> > +                pat = PAT(0, WB) | PAT(1, WT) | PAT(2, UC_MINUS) | PAT(3, UC) |
> > +                      PAT(4, WC) | PAT(5, WP) | PAT(6, UC)   | PAT(7, UC);
> >          }
> >  
> >          if (!pat_bp_initialized) {
> > -8<-
> >
> 
> Hello, can anyone help please?
> 
> Intel's CI has taken this reproducer of the bug, and confirmed the
> regression. 
> https://lore.kernel.org/intel-gfx/Y5Hst0bCxQDTN7lK@mail-itl/T/#m4480c15a0d117dce6210562eb542875e757647fb
> 
> We're reasonably confident that it is an i915 bug (given the repro with
> no Xen in the mix), but we're out of any further ideas.

I don't think we have any code that assumes anything about the PAT,
apart from WC being available (which seems like it should still be
the case with your modified PAT). I suppose you'll just have to
start digging from pgprot_writecombine()/pgprot_noncached() and make sure
everything ends up using the correct PAT entry.

-- 
Ville Syrjälä
Intel