Re: [Intel-gfx] [cache coherency bug] i915 and PAT attributes
On Mon, Jan 02, 2023 at 12:24:54AM +0100, Marek Marczykowski-Górecki wrote:
> I tried several approaches to this, without success. Here is an update on
> debugging (also reported live on #intel-gfx):
>
> I did several tests with different PAT configurations (by modifying the
> Xen code that sets the MSR). The full table is at
> https://pad.itl.space/sheet/#/2/sheet/view/HD1qT2Zf44Ha36TJ3wj2YL+PchsTidyNTFepW5++ZKM/
> Some highlights:
> - 1=WC, 4=WT - good
> - 1=WT, 4=WC - bad
> - 1=WT, 3=WC (4=WC too) - good
> - 1=WT, 5=WC - good
>
> So, to me it seems WC at index 4 is problematic for some reason.
>
> Next, I tried to trap all the places in arch/x86/xen/mmu_pv.c that
> write PTEs and verify the requested cache attributes. There, it seems all
> the requested WC mappings are properly translated (using either index 1,
> 3, 4, or 5 according to PAT settings). And after reading the PTE back, it
> indeed seems to be correctly set. I didn't add reading back after
> HYPERVISOR_update_va_mapping, but verified it isn't used for setting WC.
>
> Using the same method, I also checked that indexes that aren't supposed
> to be used (for example index 4 when both 3 and 4 are WC) indeed are not
> used. So, the hypothesis that specific indexes are hardcoded somewhere
> is unlikely.
>
> This all looks very weird to me. Any ideas?

Old CPUs have had hardware errata that caused the top bit of th
Re: [Intel-gfx] [cache coherency bug] i915 and PAT attributes
On Thu, Dec 22, 2022 at 10:29:57AM +0200, Ville Syrjälä wrote:
> On Fri, Dec 16, 2022 at 03:30:13PM +, Andrew Cooper wrote:
> > Hello, can anyone help please?
> >
> > Intel's CI has taken this reproducer of the bug, and confirmed the
> > regression.
> > https://lore.kernel.org/intel-gfx/Y5Hst0bCxQDTN7lK@mail-itl/T/#m4480c15a0d117dce6210562eb542875e757647fb
> >
> > We're reasonably confident that it is an i915 bug (given the repro with
> > no Xen in the mix), but we're out of any further ideas.
>
> I don't think we have any code that assumes anything about the PAT,
> apart from WC being available (which seems like it should still be
> the case with your modified PAT). I suppose you'll just have to
> start digging from pgprot_writecombine()/noncached() and make sure
> everything ends up using the correct PAT entry.

I tried several approaches to this, without success. Here is an update on
debugging (also reported live on #intel-gfx):

I did several tests with different PAT configurations (by modifying the
Xen code that sets the MSR). The full table is at
https://pad.itl.space/sheet/#/2/sheet/view/HD1qT2Zf44Ha36TJ3wj2YL+PchsTidyNTFepW5++ZKM/
Some highlights:
- 1=WC, 4=WT - good
- 1=WT, 4=WC - bad
- 1=WT, 3=WC (4=WC too) - good
- 1=WT, 5=WC - good

So, to me it seems WC at index 4 is problematic for some reason.

Next, I tried to trap all the places in arch/x86/xen/mmu_pv.c that
write PTEs and verify the requested cache attributes. There, it seems all
the requested WC mappings are properly translated (using either index 1,
3, 4, or 5 according to PAT settings). And after reading the PTE back, it
indeed seems to be correctly set. I didn't add reading back after
HYPERVISOR_update_va_mapping, but verified it isn't used for setting WC.

Using the same method, I also checked that indexes that aren't supposed
to be used (for example index 4 when both 3 and 4 are WC) indeed are not
used. So, the hypothesis that specific indexes are hardcoded somewhere
is unlikely.

This all looks very weird to me. Any ideas?

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
Re: [Intel-gfx] [cache coherency bug] i915 and PAT attributes
On Fri, Dec 16, 2022 at 03:30:13PM +, Andrew Cooper wrote:
> On 08/12/2022 1:55 pm, Marek Marczykowski-Górecki wrote:
> > Hi,
> >
> > There is an issue with i915 on Xen PV (dom0). The end result is a lot of
> > glitches, like here: https://openqa.qubes-os.org/tests/54748#step/startup/8
> > (this one is on ADL, Linux 6.1-rc7 as a Xen PV dom0). It's using Xorg
> > with the "modesetting" driver.
> >
> > After some iterations of debugging, we narrowed it down to i915 handling
> > caching. The main difference is that PAT is set up differently on Xen PV
> > than on native Linux. Normally, Linux does have an appropriate abstraction
> > for that, but apparently something related to i915 doesn't play well
> > with it. The specific difference is:
> > native linux:
> > x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
> > xen pv:
> > x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WC  WP  UC  UC
> >                                   ~~          ~~      ~~  ~~
> >
> > The specific impact depends on kernel version and the hardware. The most
> > severe issues I see on >=ADL, but some older hardware is affected too -
> > sometimes only if composition is disabled in the window manager.
> > Some more information is collected at
> > https://github.com/QubesOS/qubes-issues/issues/4782 (and a few linked
> > duplicates...).
> >
> > A kind-of related commit is here:
> > https://github.com/torvalds/linux/commit/bdd8b6c98239cad ("drm/i915:
> > replace X86_FEATURE_PAT with pat_enabled()") - it is the place where
> > i915 explicitly checks for PAT support, so I'm cc-ing people mentioned
> > there too.
> >
> > Any ideas?
> >
> > The issue can be easily reproduced without Xen too, by adjusting PAT in
> > Linux:
> > -8<-
> > diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> > index 66a209f7eb86..319ab60c8d8c 100644
> > --- a/arch/x86/mm/pat/memtype.c
> > +++ b/arch/x86/mm/pat/memtype.c
> > @@ -400,8 +400,8 @@ void pat_init(void)
> >                   * The reserved slots are unused, but mapped to their
> >                   * corresponding types in the presence of PAT errata.
> >                   */
> > -                pat = PAT(0, WB) | PAT(1, WC) | PAT(2, UC_MINUS) | PAT(3, UC) |
> > -                      PAT(4, WB) | PAT(5, WP) | PAT(6, UC_MINUS) | PAT(7, WT);
> > +                pat = PAT(0, WB) | PAT(1, WT) | PAT(2, UC_MINUS) | PAT(3, UC) |
> > +                      PAT(4, WC) | PAT(5, WP) | PAT(6, UC) | PAT(7, UC);
> >          }
> >
> >          if (!pat_bp_initialized) {
> > -8<-
> >
>
> Hello, can anyone help please?
>
> Intel's CI has taken this reproducer of the bug, and confirmed the
> regression.
> https://lore.kernel.org/intel-gfx/Y5Hst0bCxQDTN7lK@mail-itl/T/#m4480c15a0d117dce6210562eb542875e757647fb
>
> We're reasonably confident that it is an i915 bug (given the repro with
> no Xen in the mix), but we're out of any further ideas.

I don't think we have any code that assumes anything about the PAT,
apart from WC being available (which seems like it should still be
the case with your modified PAT). I suppose you'll just have to
start digging from pgprot_writecombine()/noncached() and make sure
everything ends up using the correct PAT entry.

-- 
Ville Syrjälä
Intel