Re: [Intel-gfx] [RFC PATCH] x86/mm: Fix PAT bit missing from page protection modify mask

2023-04-24 Thread Marek Marczykowski-Górecki
On Mon, Apr 24, 2023 at 02:35:24PM +0200, Janusz Krzysztofik wrote:
> Visible glitches have been observed when running graphics applications on
> Linux under Xen hypervisor.  Those observations have been confirmed with
> failures from kms_pwrite_crc Intel GPU test that verifies data coherency
> of DRM frame buffer objects using hardware CRC checksums calculated by
> display controllers, exposed to userspace via debugfs.  Affected
> processing paths have then been identified with new test variants that
> mmap the objects using different methods and caching modes.
> 
> When running as a Xen PV guest, Linux uses Xen provided PAT configuration
> which is different from its native one.  In particular, Xen specific PTE
> encoding of write-combining caching, likely used by graphics applications,
> differs from the Linux default one found among statically defined minimal
> set of supported modes.  Since Xen defines PTE encoding of the WC mode as
> _PAGE_PAT, it no longer belongs to the minimal set, depends on correct
> handling of _PAGE_PAT bit, and can be mismatched with write-back caching.
> 
> When a user calls mmap() for a DRM buffer object, DRM device specific
> .mmap file operation, called from mmap_region(), takes care of setting PTE
> encoding bits in a vm_page_prot field of an associated virtual memory area
> structure.  Unfortunately, _PAGE_PAT bit is not preserved when the vma's
> .vm_flags are then applied to .vm_page_prot via vm_set_page_prot().  Bits
> to be preserved are determined with _PAGE_CHG_MASK symbol that doesn't
> cover _PAGE_PAT.  As a consequence, WB caching is requested instead of WC
> when running under Xen (also, WP is silently changed to WT, and UC
> downgraded to UC_MINUS).  When running on bare metal, WC is not affected,
> but WP and WT extra modes are unintentionally replaced with WC and UC,
> respectively.
> 
> WP and WT modes, encoded with _PAGE_PAT bit set, were introduced by commit
> 281d4078bec3 ("x86: Make page cache mode a real type").  Care was taken
> to extend _PAGE_CACHE_MASK symbol with that additional bit, but that
> symbol has never been used for identification of bits preserved when
> applying page protection flags.  Support for all cache modes under Xen,
> including the problematic WC mode, was then introduced by commit
> 47591df50512 ("xen: Support Xen pv-domains using PAT").
> 
> Extend bitmask used by pgprot_modify() for selecting bits to be preserved
> with _PAGE_PAT bit.  However, since that bit can be reused as _PAGE_PSE,
> and the _PAGE_CHG_MASK symbol, primarily used by pte_modify(), is likely
> intentionally defined with that bit not set, keep that symbol unchanged.
> 
> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/7648
> Fixes: 281d4078bec3 ("x86: Make page cache mode a real type")
> Signed-off-by: Janusz Krzysztofik 
> Cc: sta...@vger.kernel.org # v3.19+

I can confirm it fixes the issue, thanks!

Tested-by: Marek Marczykowski-Górecki 
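
To illustrate the masking problem described in the commit message, here is
a minimal userspace sketch in C. It is not the kernel code: CHG_MASK_MODEL
is a simplified stand-in for _PAGE_CHG_MASK that keeps only the cache bits,
and modify_prot() and pat_index() are hypothetical helpers modelled on
pgprot_modify(). The bit positions are the usual x86 ones for 4 KiB pages,
and the new protection value is assumed to carry no cache bits.

```c
#include <assert.h>
#include <stdint.h>

/* x86 PTE cache-control bits for 4 KiB pages. */
#define PAGE_PWT (UINT64_C(1) << 3)
#define PAGE_PCD (UINT64_C(1) << 4)
#define PAGE_PAT (UINT64_C(1) << 7)

/* Assumption: simplified model of _PAGE_CHG_MASK, cache bits only. */
#define CHG_MASK_MODEL     (PAGE_PCD | PAGE_PWT)
/* The same model mask extended with the PAT bit, as the patch proposes. */
#define CHG_MASK_MODEL_PAT (CHG_MASK_MODEL | PAGE_PAT)

/*
 * Model of pgprot_modify(): bits covered by the mask are preserved from
 * oldprot, all other bits are taken from newprot.
 */
static uint64_t modify_prot(uint64_t oldprot, uint64_t newprot, uint64_t mask)
{
    return (oldprot & mask) | (newprot & ~mask);
}

/* PAT index selected by a PTE: PAT, PCD and PWT form a 3-bit number. */
static unsigned pat_index(uint64_t prot)
{
    return (unsigned)(((prot & PAGE_PAT) ? 4 : 0) |
                      ((prot & PAGE_PCD) ? 2 : 0) |
                      ((prot & PAGE_PWT) ? 1 : 0));
}
```

With the original model mask, the Xen WC encoding (PAT bit only, index 4)
comes out as index 0, which is WB in both PAT layouts. With the PAT bit
added to the mask, index 4 survives.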

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab



Re: [Intel-gfx] [cache coherency bug] i915 and PAT attributes

2023-01-01 Thread Marek Marczykowski-Górecki
On Thu, Dec 22, 2022 at 10:29:57AM +0200, Ville Syrjälä wrote:
> On Fri, Dec 16, 2022 at 03:30:13PM +, Andrew Cooper wrote:
> > On 08/12/2022 1:55 pm, Marek Marczykowski-Górecki wrote:
> > > Hi,
> > >
> > > There is an issue with i915 on Xen PV (dom0). The end result is a lot of
> > > glitches, like here: 
> > > https://openqa.qubes-os.org/tests/54748#step/startup/8
> > > (this one is on ADL, Linux 6.1-rc7 as a Xen PV dom0). It's using Xorg
> > > with "modesetting" driver.
> > >
> > > After some iterations of debugging, we narrowed it down to i915 handling
> > > > caching. The main difference is that PAT is set up differently on Xen PV
> > > > than on native Linux. Normally, Linux does have an appropriate abstraction
> > > for that, but apparently something related to i915 doesn't play well
> > > with it. The specific difference is:
> > > native linux:
> > > x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
> > > xen pv:
> > > x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WC  WP  UC  UC
> > >   ~~  ~~  ~~  ~~
> > >
> > > The specific impact depends on kernel version and the hardware. The most
> > > severe issues I see on >=ADL, but some older hardware is affected too -
> > > sometimes only if composition is disabled in the window manager.
> > > Some more information is collected at
> > > > https://github.com/QubesOS/qubes-issues/issues/4782 (and a few linked
> > > duplicates...).
> > >
> > > Kind-of related commit is here:
> > > https://github.com/torvalds/linux/commit/bdd8b6c98239cad ("drm/i915:
> > > replace X86_FEATURE_PAT with pat_enabled()") - it is the place where
> > > i915 explicitly checks for PAT support, so I'm cc-ing people mentioned
> > > there too.
> > >
> > > Any ideas?
> > >
> > > The issue can be easily reproduced without Xen too, by adjusting PAT in
> > > Linux:
> > > -8<-
> > > diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> > > index 66a209f7eb86..319ab60c8d8c 100644
> > > --- a/arch/x86/mm/pat/memtype.c
> > > +++ b/arch/x86/mm/pat/memtype.c
> > > @@ -400,8 +400,8 @@ void pat_init(void)
> > > >  		 * The reserved slots are unused, but mapped to their
> > > >  		 * corresponding types in the presence of PAT errata.
> > > >  		 */
> > > > -		pat = PAT(0, WB) | PAT(1, WC) | PAT(2, UC_MINUS) | PAT(3, UC) |
> > > > -		      PAT(4, WB) | PAT(5, WP) | PAT(6, UC_MINUS) | PAT(7, WT);
> > > > +		pat = PAT(0, WB) | PAT(1, WT) | PAT(2, UC_MINUS) | PAT(3, UC) |
> > > > +		      PAT(4, WC) | PAT(5, WP) | PAT(6, UC)   | PAT(7, UC);
> > > >  	}
> > > >  
> > > >  	if (!pat_bp_initialized) {
> > > -8<-
> > >
> > 
> > Hello, can anyone help please?
> > 
> > Intel's CI has taken this reproducer of the bug, and confirmed the
> > regression. 
> > https://lore.kernel.org/intel-gfx/Y5Hst0bCxQDTN7lK@mail-itl/T/#m4480c15a0d117dce6210562eb542875e757647fb
> > 
> > We're reasonably confident that it is an i915 bug (given the repro with
> > no Xen in the mix), but we're out of any further ideas.
> 
> I don't think we have any code that assumes anything about the PAT,
> apart from WC being available (which seems like it should still be
> the case with your modified PAT). I suppose you'll just have to 
> start digging from pgprot_writecombine()/noncached() and make sure
> everything ends up using the correct PAT entry.

I tried several approaches to this, without success. Here is an update on
debugging (reported also on #intel-gfx live):

I ran several tests with different PAT configurations (by modifying Xen,
which sets the MSR). The full table is at
https://pad.itl.space/sheet/#/2/sheet/view/HD1qT2Zf44Ha36TJ3wj2YL+PchsTidyNTFepW5++ZKM/
Some highlights:
- 1=WC, 4=WT - good
- 1=WT, 4=WC - bad
- 1=WT, 3=WC (4=WC too) - good
- 1=WT, 5=WC - good

So, for me it seems WC at index 4 is problematic for some reason.

Next, I tried to trap all the places in arch/x86/xen/mmu_pv.c that
write PTEs and verify requested cache attributes. There, it seems all
the requested WC are properly translated (using either index 1, 3, 4, or
5 according to PAT settings). And then after reading PTE back, it indeed
seems to be correctly set. I didn't add reading back after
HYPERVISOR_update_va_mapping, but verified it isn't used for setting WC.
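
The kind of check described above (take a PTE and see which memory type it
really selects) can be sketched in userspace C as follows. This is only an
illustration, not the instrumentation I used: pte_pat_index(), pat_entry()
and pte_memtype() are hypothetical helpers, the bit positions are the
standard x86 ones for 4 KiB pages, and the two PAT values are derived from
the "x86/PAT: Configuration" lines quoted above using the standard memory
type encodings (UC=0, WC=1, WT=4, WP=5, WB=6, UC-=7).

```c
#include <assert.h>
#include <stdint.h>

/* Memory type encodings as stored in each 8-bit PAT entry. */
enum memtype {
    MT_UC = 0, MT_WC = 1, MT_WT = 4, MT_WP = 5, MT_WB = 6, MT_UC_MINUS = 7
};

/* x86 PTE cache-control bits for 4 KiB pages. */
#define PTE_PWT (UINT64_C(1) << 3)
#define PTE_PCD (UINT64_C(1) << 4)
#define PTE_PAT (UINT64_C(1) << 7)

/* native linux: WB  WC  UC- UC  WB  WP  UC- WT (entry 0 in the low byte) */
#define NATIVE_PAT UINT64_C(0x0407050600070106)
/* xen pv:       WB  WT  UC- UC  WC  WP  UC  UC */
#define XEN_PV_PAT UINT64_C(0x0000050100070406)

/* PAT index selected by a PTE: PAT, PCD and PWT form a 3-bit number. */
static unsigned pte_pat_index(uint64_t pte)
{
    return (unsigned)(((pte & PTE_PAT) ? 4 : 0) |
                      ((pte & PTE_PCD) ? 2 : 0) |
                      ((pte & PTE_PWT) ? 1 : 0));
}

/* Memory type stored in entry idx (0-7) of a PAT MSR value. */
static enum memtype pat_entry(uint64_t pat_msr, unsigned idx)
{
    return (enum memtype)((pat_msr >> (idx * 8)) & 0x7);
}

/* Effective memory type of a PTE under a given PAT MSR value. */
static enum memtype pte_memtype(uint64_t pte, uint64_t pat_msr)
{
    return pat_entry(pat_msr, pte_pat_index(pte));
}
```

For example, a PTE with only the PAT bit set resolves to WC under the Xen PV
layout but to WB under the native one, while a PTE with only PWT set
resolves to WT under Xen PV and WC natively.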

Using the same method, I also checked that indexes that aren't supposed
to be used (for example index 4 when both 3 and 4 are WC) indeed are not
used. So, the hypothesis that specific indexes are hardcoded somewhere
is unlikely.

This all looks very weird to me. Any ideas?

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab



[Intel-gfx] i915 and PAT attributes on Xen PV

2022-12-08 Thread Marek Marczykowski-Górecki
Hi,

There is an issue with i915 on Xen PV (dom0). The end result is a lot of
glitches, like here: https://openqa.qubes-os.org/tests/54748#step/startup/8
(this one is on ADL, Linux 6.1-rc7 as a Xen PV dom0). It's using Xorg
with "modesetting" driver.

After some iterations of debugging, we narrowed it down to i915 handling
caching. The main difference is that PAT is set up differently on Xen PV
than on native Linux. Normally, Linux does have an appropriate abstraction
for that, but apparently something related to i915 doesn't play well
with it. The specific difference is:
native linux:
x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
xen pv:
x86/PAT: Configuration [0-7]: WB  WT  UC- UC  WC  WP  UC  UC
                                  ~~          ~~      ~~  ~~

The specific impact depends on kernel version and the hardware. The most
severe issues I see on >=ADL, but some older hardware is affected too -
sometimes only if composition is disabled in the window manager.
Some more information is collected at
https://github.com/QubesOS/qubes-issues/issues/4782 (and a few linked
duplicates...).

Kind-of related commit is here:
https://github.com/torvalds/linux/commit/bdd8b6c98239cad ("drm/i915:
replace X86_FEATURE_PAT with pat_enabled()") - it is the place where
i915 explicitly checks for PAT support, so I'm cc-ing people mentioned
there too.

Any ideas?

The issue can be easily reproduced without Xen too, by adjusting PAT in
Linux:
-8<-
diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index 66a209f7eb86..319ab60c8d8c 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -400,8 +400,8 @@ void pat_init(void)
 		 * The reserved slots are unused, but mapped to their
 		 * corresponding types in the presence of PAT errata.
 		 */
-		pat = PAT(0, WB) | PAT(1, WC) | PAT(2, UC_MINUS) | PAT(3, UC) |
-		      PAT(4, WB) | PAT(5, WP) | PAT(6, UC_MINUS) | PAT(7, WT);
+		pat = PAT(0, WB) | PAT(1, WT) | PAT(2, UC_MINUS) | PAT(3, UC) |
+		      PAT(4, WC) | PAT(5, WP) | PAT(6, UC)   | PAT(7, UC);
 	}
 
 	if (!pat_bp_initialized) {
-8<-
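
For anyone who wants to compare the two layouts without booting a patched
kernel, the packing done by the PAT() expressions in the diff can be
modelled in userspace C. This is a rough sketch only: pat_slot(),
pat_native(), pat_xen_pv() and pat_diff_slots() are hypothetical helpers,
not kernel functions, and the memory type encodings are the standard ones
(UC=0, WC=1, WT=4, WP=5, WB=6, UC-=7).

```c
#include <assert.h>
#include <stdint.h>

/* Memory type encodings used in PAT MSR entries. */
enum pat_type {
    PAT_UC = 0, PAT_WC = 1, PAT_WT = 4, PAT_WP = 5, PAT_WB = 6,
    PAT_UC_MINUS = 7
};

/* Hypothetical helper: put memory type `type` into PAT slot `slot` (0-7). */
static uint64_t pat_slot(unsigned slot, enum pat_type type)
{
    return (uint64_t)type << (slot * 8);
}

/* Layout from the removed lines of the diff (native Linux). */
static uint64_t pat_native(void)
{
    return pat_slot(0, PAT_WB) | pat_slot(1, PAT_WC) |
           pat_slot(2, PAT_UC_MINUS) | pat_slot(3, PAT_UC) |
           pat_slot(4, PAT_WB) | pat_slot(5, PAT_WP) |
           pat_slot(6, PAT_UC_MINUS) | pat_slot(7, PAT_WT);
}

/* Layout from the added lines of the diff (same as Xen PV). */
static uint64_t pat_xen_pv(void)
{
    return pat_slot(0, PAT_WB) | pat_slot(1, PAT_WT) |
           pat_slot(2, PAT_UC_MINUS) | pat_slot(3, PAT_UC) |
           pat_slot(4, PAT_WC) | pat_slot(5, PAT_WP) |
           pat_slot(6, PAT_UC) | pat_slot(7, PAT_UC);
}

/* Bitmap of slots whose memory type differs between two PAT values. */
static unsigned pat_diff_slots(uint64_t a, uint64_t b)
{
    unsigned slot, diff = 0;

    for (slot = 0; slot < 8; slot++)
        if (((a >> (slot * 8)) & 0x7) != ((b >> (slot * 8)) & 0x7))
            diff |= 1u << slot;
    return diff;
}
```

Comparing the two values with pat_diff_slots() flags slots 1, 4, 6 and 7,
the same entries marked with ~~ in the configuration listing above.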

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

