Re: [Mesa-dev] [PATCH 0/7] i965 L3 caching and pull constant improvements.

2015-02-12 Thread Kenneth Graunke
On Thursday, February 12, 2015 04:13:06 PM Francisco Jerez wrote:
 Francisco Jerez curroje...@riseup.net writes:
  Kenneth Graunke kenn...@whitecape.org writes:
  On Sunday, January 18, 2015 01:04:02 AM Francisco Jerez wrote:
  This is the first part of a series meant to improve our usage of the L3 
  cache.
  Currently it's far from ideal since the following objects aren't taking 
  any
  advantage of it:
   - Pull constants (i.e. UBOs and demoted uniforms)
   - Buffer textures
   - Shader scratch space (i.e. register spills and fills)
   - Atomic counters
   - (Soon) Images
  
  This first series addresses the first two issues.  Fixing the last three 
  is
  going to be a bit more difficult because we need to modify the 
  partitioning of
  the L3 cache in order to increase the number of ways assigned to the DC, 
  which
  happens to be zero on boot until Gen8.  That's likely to require kernel
  changes because we don't have any extremely satisfactory API to change 
  that
  from userspace right now.
  
  The first patch in the series sets the MOCS L3 cacheability bit in the 
  surface
  state structure for buffers so the mentioned memory objects (except the 
  shader
  scratch space that gets its MOCS from elsewhere) have a chance of getting
  cached in L3.
  
  The fourth patch in the series switches to using the constant cache 
  (which,
  unlike the data cache that was used years ago before we started using the
  sampler, is cached on L3 with the default partitioning on all gens) for
  uniform pull constants loads.  The overall performance numbers I've 
  collected
  are included in the commit message of the same patch for future reference.
  Most of it points at the constant cache being faster than the sampler in a
  number of cases (assuming the L3 caching settings are correct), it's also
  likely to alleviate some cache thrashing caused by the competition with
  textures for the L1/L2 sampler caches, and it allows fetching up to eight
  consecutive owords (128B) with just one message.
  
  The sixth patch enables 4 oword loads because they're basically for free 
  and
  they avoid some of the shortcomings of the 1 and 2 oword messages (see the
  commit message for more details).  I'll have a look into enabling 8 oword
  loads but it's going to require an analysis pass to avoid wasting 
  bandwidth
  and increasing the register pressure unnecessarily when the shader doesn't
  actually need as many constants.
  
  We could do something similar for non-uniform offset pull constant loads 
  and
  for both kinds of pull constant loads on the vec4 back-end, but I don't 
  have
  enough performance data to support that yet.
 
  Hi Curro!
 
  Hi Ken,
 
  Technically, I believe we /are/ taking advantage of the L3 today - the 
  sampler
  should be part of the All Clients and Read Only Client Pool portions 
  of the
  L3.  I believe the data port's Constant Cache is part of the same L3 
  region.
  However, the sampler has an additional L1/L2 cache.
 
  If you're referring to pull constants, nope we aren't, because it's also
  necessary to have set the MOCS bits to cacheable in L3, and that wasn't
  the case for any of the memory objects I mentioned except shader scratch
  space (the latter goes through the data cache so it's still not cached
  until Gen8).
 
  When you say you don't have enough performance data to support doing 
  this in
  the vec4 backend, or for non-uniform offset pull loads, do you mean that 
  you
  tried it and it wasn't useful, or you just haven't tried it yet?
 
  I tried it on the VS and didn't see any significant change in the
  benchmarks I had at hand.  For non-uniform pull constant loads it's a
  bit trickier because performance may be dependent on how non-uniform
  the offsets are, I don't have any convincing benchmark data yet but I'll
  look into it.
 
  In my experience, the VS matters a *lot* - skinning shaders tend to use 
  large
  arrays of matrices, which get demoted to pull constants.  For example, I
  observed huge speedups in GLBenchmark 2.7 EgyptHD (commit 5d8e246ac86b4a94,
  in the VS backend), GLBenchmark 2.1 PRO (commit 259b65e2e79), and Trine
  (commit 04f5b2f4e454d6 - in the ARBvp backend) by moving from the data 
  cache
  to the sampler.
 
  I'd love to see data for applying your new approach in the VS backend.
 
  Sure, I'll try running those to see if it makes any difference.  If it
  does it can be fixed later on as a follow-up in any case.
 
  --Ken
 
 Ken, I don't see any reason to put this series on hold until the changes
 for the other cases are ready instead of going through it incrementally.
 The VS changes themselves are trivial and completely orthogonal to this
 series, but the amount of testing and benchmarking to be done to make
 sure that they don't incur a performance penalty on any of the other
 platforms is overwhelming, and the expected benefit (according to my
 previous observations) will be considerably lower than what we get from
 the 

Re: [Mesa-dev] [PATCH 0/7] i965 L3 caching and pull constant improvements.

2015-02-12 Thread Francisco Jerez
Francisco Jerez curroje...@riseup.net writes:

 Kenneth Graunke kenn...@whitecape.org writes:

 On Sunday, January 18, 2015 01:04:02 AM Francisco Jerez wrote:
 This is the first part of a series meant to improve our usage of the L3 
 cache.
 Currently it's far from ideal since the following objects aren't taking any
 advantage of it:
  - Pull constants (i.e. UBOs and demoted uniforms)
  - Buffer textures
  - Shader scratch space (i.e. register spills and fills)
  - Atomic counters
  - (Soon) Images
 
 This first series addresses the first two issues.  Fixing the last three is
 going to be a bit more difficult because we need to modify the partitioning 
 of
 the L3 cache in order to increase the number of ways assigned to the DC, 
 which
 happens to be zero on boot until Gen8.  That's likely to require kernel
 changes because we don't have any extremely satisfactory API to change that
 from userspace right now.
 
 The first patch in the series sets the MOCS L3 cacheability bit in the 
 surface
 state structure for buffers so the mentioned memory objects (except the 
 shader
 scratch space that gets its MOCS from elsewhere) have a chance of getting
 cached in L3.
 
 The fourth patch in the series switches to using the constant cache (which,
 unlike the data cache that was used years ago before we started using the
 sampler, is cached on L3 with the default partitioning on all gens) for
 uniform pull constants loads.  The overall performance numbers I've 
 collected
 are included in the commit message of the same patch for future reference.
 Most of it points at the constant cache being faster than the sampler in a
 number of cases (assuming the L3 caching settings are correct), it's also
 likely to alleviate some cache thrashing caused by the competition with
 textures for the L1/L2 sampler caches, and it allows fetching up to eight
 consecutive owords (128B) with just one message.
 
 The sixth patch enables 4 oword loads because they're basically for free and
 they avoid some of the shortcomings of the 1 and 2 oword messages (see the
 commit message for more details).  I'll have a look into enabling 8 oword
 loads but it's going to require an analysis pass to avoid wasting bandwidth
 and increasing the register pressure unnecessarily when the shader doesn't
 actually need as many constants.
 
 We could do something similar for non-uniform offset pull constant loads and
 for both kinds of pull constant loads on the vec4 back-end, but I don't have
 enough performance data to support that yet.

 Hi Curro!

 Hi Ken,

 Technically, I believe we /are/ taking advantage of the L3 today - the 
 sampler
 should be part of the All Clients and Read Only Client Pool portions of 
 the
 L3.  I believe the data port's Constant Cache is part of the same L3 
 region.
 However, the sampler has an additional L1/L2 cache.

 If you're referring to pull constants, nope we aren't, because it's also
 necessary to have set the MOCS bits to cacheable in L3, and that wasn't
 the case for any of the memory objects I mentioned except shader scratch
 space (the latter goes through the data cache so it's still not cached
 until Gen8).

 When you say you don't have enough performance data to support doing this 
 in
 the vec4 backend, or for non-uniform offset pull loads, do you mean that you
 tried it and it wasn't useful, or you just haven't tried it yet?

 I tried it on the VS and didn't see any significant change in the
 benchmarks I had at hand.  For non-uniform pull constant loads it's a
 bit trickier because performance may be dependent on how non-uniform
 the offsets are, I don't have any convincing benchmark data yet but I'll
 look into it.

 In my experience, the VS matters a *lot* - skinning shaders tend to use large
 arrays of matrices, which get demoted to pull constants.  For example, I
 observed huge speedups in GLBenchmark 2.7 EgyptHD (commit 5d8e246ac86b4a94,
 in the VS backend), GLBenchmark 2.1 PRO (commit 259b65e2e79), and Trine
 (commit 04f5b2f4e454d6 - in the ARBvp backend) by moving from the data cache
 to the sampler.

 I'd love to see data for applying your new approach in the VS backend.

 Sure, I'll try running those to see if it makes any difference.  If it
 does it can be fixed later on as a follow-up in any case.

 --Ken

Ken, I don't see any reason to put this series on hold until the changes
for the other cases are ready instead of going through it incrementally.
The VS changes themselves are trivial and completely orthogonal to this
series, but the amount of testing and benchmarking to be done to make
sure that they don't incur a performance penalty on any of the other
platforms is overwhelming, and the expected benefit (according to my
previous observations) will be considerably lower than what we get from
the FS changes, if any, so it's not a high priority for me at this
point.

I'll get to it, I promise ;), but can we land this before it starts
bit-rotting?



Re: [Mesa-dev] [PATCH 0/7] i965 L3 caching and pull constant improvements.

2015-02-05 Thread Francisco Jerez
Francisco Jerez curroje...@riseup.net writes:

 This is the first part of a series meant to improve our usage of the L3 cache.
 Currently it's far from ideal since the following objects aren't taking any
 advantage of it:
  - Pull constants (i.e. UBOs and demoted uniforms)
  - Buffer textures
  - Shader scratch space (i.e. register spills and fills)
  - Atomic counters
  - (Soon) Images

 This first series addresses the first two issues.  Fixing the last three is
 going to be a bit more difficult because we need to modify the partitioning of
 the L3 cache in order to increase the number of ways assigned to the DC, which
 happens to be zero on boot until Gen8.  That's likely to require kernel
 changes because we don't have any extremely satisfactory API to change that
 from userspace right now.

 The first patch in the series sets the MOCS L3 cacheability bit in the surface
 state structure for buffers so the mentioned memory objects (except the shader
 scratch space that gets its MOCS from elsewhere) have a chance of getting
 cached in L3.

 The fourth patch in the series switches to using the constant cache (which,
 unlike the data cache that was used years ago before we started using the
 sampler, is cached on L3 with the default partitioning on all gens) for
 uniform pull constants loads.  The overall performance numbers I've collected
 are included in the commit message of the same patch for future reference.
 Most of it points at the constant cache being faster than the sampler in a
 number of cases (assuming the L3 caching settings are correct), it's also
 likely to alleviate some cache thrashing caused by the competition with
 textures for the L1/L2 sampler caches, and it allows fetching up to eight
 consecutive owords (128B) with just one message.

 The sixth patch enables 4 oword loads because they're basically for free and
 they avoid some of the shortcomings of the 1 and 2 oword messages (see the
 commit message for more details).  I'll have a look into enabling 8 oword
 loads but it's going to require an analysis pass to avoid wasting bandwidth
 and increasing the register pressure unnecessarily when the shader doesn't
 actually need as many constants.

 We could do something similar for non-uniform offset pull constant loads and
 for both kinds of pull constant loads on the vec4 back-end, but I don't have
 enough performance data to support that yet.

 [PATCH 1/7] i965: Enable L3 caching of buffer surfaces.
 [PATCH 2/7] i965: Remove the create_raw_surface vtbl hook.
 [PATCH 3/7] i965: Let the caller of brw_set_dp_write/read_message control the 
 target cache.
 [PATCH 4/7] i965/fs: Switch to the constant cache for uniform pull constants.
 [PATCH 5/7] i965/fs: Less broken handling of force_writemask_all in 
 lower_load_payload().
 [PATCH 6/7] i965/fs: Fetch one cacheline of pull constants at a time.
 [PATCH 7/7] i965/fs: Remove the FS_OPCODE_SET_SIMD4X2_OFFSET virtual opcode.
 ___
 mesa-dev mailing list
 mesa-dev@lists.freedesktop.org
 http://lists.freedesktop.org/mailman/listinfo/mesa-dev

Any volunteer to review the rest of this performance-improving series
before the merge window closes?




Re: [Mesa-dev] [PATCH 0/7] i965 L3 caching and pull constant improvements.

2015-02-01 Thread Syrja, Harri
OK, I see. I am not sure it was a good choice by VPG not to support a dedicated
region for the constant cache, but if that is how it is, there is little we can
do about it.

Thanks,
Harri

-----Original Message-----
From: Kenneth Graunke [mailto:kenn...@whitecape.org] 
Sent: Wednesday, January 28, 2015 7:18 PM
To: Syrja, Harri
Cc: mesa-dev@lists.freedesktop.org; Francisco Jerez
Subject: Re: [Mesa-dev] [PATCH 0/7] i965 L3 caching and pull constant 
improvements.

On Wednesday, January 28, 2015 01:14:08 PM Syrja, Harri wrote:
 Hi Kenneth,
 
 The constant cache could and should be allocated a separate region in $L3.
 The main point of having a separate constant region is to avoid thrashing of
 texture data due to pull constant loads. In an optimal solution the constant
 region is allocated only when a shader uses pull constants, but that is not
 so easy as the $L3 config reg is not part of per constant regs.
 
 BR,
 Harri

I agree that constants and textures /could/ be allocated to a separate region 
of $L3, but I haven't found any evidence in the documentation to confirm that.

It looks like you can set it up that way on Haswell (and we don't), but on 
other chips, everything I've found suggests that read-only clients are lumped 
together in the same region...

--Ken
-
Intel Finland Oy
Registered Address: PL 281, 00181 Helsinki 
Business Identity Code: 0357606 - 4 
Domiciled in Helsinki 

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.



Re: [Mesa-dev] [PATCH 0/7] i965 L3 caching and pull constant improvements.

2015-01-28 Thread Francisco Jerez
Kenneth Graunke kenn...@whitecape.org writes:

 On Sunday, January 18, 2015 01:04:02 AM Francisco Jerez wrote:
 This is the first part of a series meant to improve our usage of the L3 
 cache.
 Currently it's far from ideal since the following objects aren't taking any
 advantage of it:
  - Pull constants (i.e. UBOs and demoted uniforms)
  - Buffer textures
  - Shader scratch space (i.e. register spills and fills)
  - Atomic counters
  - (Soon) Images
 
 This first series addresses the first two issues.  Fixing the last three is
 going to be a bit more difficult because we need to modify the partitioning 
 of
 the L3 cache in order to increase the number of ways assigned to the DC, 
 which
 happens to be zero on boot until Gen8.  That's likely to require kernel
 changes because we don't have any extremely satisfactory API to change that
 from userspace right now.
 
 The first patch in the series sets the MOCS L3 cacheability bit in the 
 surface
 state structure for buffers so the mentioned memory objects (except the 
 shader
 scratch space that gets its MOCS from elsewhere) have a chance of getting
 cached in L3.
 
 The fourth patch in the series switches to using the constant cache (which,
 unlike the data cache that was used years ago before we started using the
 sampler, is cached on L3 with the default partitioning on all gens) for
 uniform pull constants loads.  The overall performance numbers I've collected
 are included in the commit message of the same patch for future reference.
 Most of it points at the constant cache being faster than the sampler in a
 number of cases (assuming the L3 caching settings are correct), it's also
 likely to alleviate some cache thrashing caused by the competition with
 textures for the L1/L2 sampler caches, and it allows fetching up to eight
 consecutive owords (128B) with just one message.
 
 The sixth patch enables 4 oword loads because they're basically for free and
 they avoid some of the shortcomings of the 1 and 2 oword messages (see the
 commit message for more details).  I'll have a look into enabling 8 oword
 loads but it's going to require an analysis pass to avoid wasting bandwidth
 and increasing the register pressure unnecessarily when the shader doesn't
 actually need as many constants.
 
 We could do something similar for non-uniform offset pull constant loads and
 for both kinds of pull constant loads on the vec4 back-end, but I don't have
 enough performance data to support that yet.

 Hi Curro!

Hi Ken,

 Technically, I believe we /are/ taking advantage of the L3 today - the sampler
 should be part of the All Clients and Read Only Client Pool portions of 
 the
 L3.  I believe the data port's Constant Cache is part of the same L3 region.
 However, the sampler has an additional L1/L2 cache.

If you're referring to pull constants, nope we aren't, because it's also
necessary to have set the MOCS bits to cacheable in L3, and that wasn't
the case for any of the memory objects I mentioned except shader scratch
space (the latter goes through the data cache so it's still not cached
until Gen8).

 When you say you don't have enough performance data to support doing this in
 the vec4 backend, or for non-uniform offset pull loads, do you mean that you
 tried it and it wasn't useful, or you just haven't tried it yet?

I tried it on the VS and didn't see any significant change in the
benchmarks I had at hand.  For non-uniform pull constant loads it's a
bit trickier because performance may depend on how non-uniform the
offsets are.  I don't have any convincing benchmark data yet, but I'll
look into it.

 In my experience, the VS matters a *lot* - skinning shaders tend to use large
 arrays of matrices, which get demoted to pull constants.  For example, I
 observed huge speedups in GLBenchmark 2.7 EgyptHD (commit 5d8e246ac86b4a94,
 in the VS backend), GLBenchmark 2.1 PRO (commit 259b65e2e79), and Trine
 (commit 04f5b2f4e454d6 - in the ARBvp backend) by moving from the data cache
 to the sampler.

 I'd love to see data for applying your new approach in the VS backend.

Sure, I'll try running those to see if it makes any difference.  If it
does it can be fixed later on as a follow-up in any case.

 --Ken




Re: [Mesa-dev] [PATCH 0/7] i965 L3 caching and pull constant improvements.

2015-01-28 Thread Syrja, Harri
Hi Kenneth,

The constant cache could and should be allocated a separate region in $L3. The
main point of having a separate constant region is to avoid thrashing of
texture data due to pull constant loads. In an optimal solution the constant
region is allocated only when a shader uses pull constants, but that is not so
easy as the $L3 config reg is not part of per constant regs.

BR,
Harri

-----Original Message-----
From: Kenneth Graunke [mailto:kenn...@whitecape.org] 
Sent: Wednesday, January 28, 2015 7:09 AM
To: mesa-dev@lists.freedesktop.org; Francisco Jerez
Cc: Syrja, Harri
Subject: Re: [Mesa-dev] [PATCH 0/7] i965 L3 caching and pull constant 
improvements.

On Sunday, January 18, 2015 01:04:02 AM Francisco Jerez wrote:
 This is the first part of a series meant to improve our usage of the L3 cache.
 Currently it's far from ideal since the following objects aren't 
 taking any advantage of it:
  - Pull constants (i.e. UBOs and demoted uniforms)
  - Buffer textures
  - Shader scratch space (i.e. register spills and fills)
  - Atomic counters
  - (Soon) Images
 
 This first series addresses the first two issues.  Fixing the last 
 three is going to be a bit more difficult because we need to modify 
 the partitioning of the L3 cache in order to increase the number of 
 ways assigned to the DC, which happens to be zero on boot until Gen8.  
 That's likely to require kernel changes because we don't have any 
 extremely satisfactory API to change that from userspace right now.
 
 The first patch in the series sets the MOCS L3 cacheability bit in the 
 surface state structure for buffers so the mentioned memory objects 
 (except the shader scratch space that gets its MOCS from elsewhere) 
 have a chance of getting cached in L3.
 
 The fourth patch in the series switches to using the constant cache 
 (which, unlike the data cache that was used years ago before we 
 started using the sampler, is cached on L3 with the default 
 partitioning on all gens) for uniform pull constants loads.  The 
 overall performance numbers I've collected are included in the commit message 
 of the same patch for future reference.
 Most of it points at the constant cache being faster than the sampler 
 in a number of cases (assuming the L3 caching settings are correct), 
 it's also likely to alleviate some cache thrashing caused by the 
 competition with textures for the L1/L2 sampler caches, and it allows 
 fetching up to eight consecutive owords (128B) with just one message.
 
 The sixth patch enables 4 oword loads because they're basically for 
 free and they avoid some of the shortcomings of the 1 and 2 oword 
 messages (see the commit message for more details).  I'll have a look 
 into enabling 8 oword loads but it's going to require an analysis pass 
 to avoid wasting bandwidth and increasing the register pressure 
 unnecessarily when the shader doesn't actually need as many constants.
 
 We could do something similar for non-uniform offset pull constant 
 loads and for both kinds of pull constant loads on the vec4 back-end, 
 but I don't have enough performance data to support that yet.

Hi Curro!

Technically, I believe we /are/ taking advantage of the L3 today - the sampler 
should be part of the All Clients and Read Only Client Pool portions of the 
L3.  I believe the data port's Constant Cache is part of the same L3 region.
However, the sampler has an additional L1/L2 cache.

When you say you don't have enough performance data to support doing this in 
the vec4 backend, or for non-uniform offset pull loads, do you mean that you 
tried it and it wasn't useful, or you just haven't tried it yet?

In my experience, the VS matters a *lot* - skinning shaders tend to use large 
arrays of matrices, which get demoted to pull constants.  For example, I 
observed huge speedups in GLBenchmark 2.7 EgyptHD (commit 5d8e246ac86b4a94, in 
the VS backend), GLBenchmark 2.1 PRO (commit 259b65e2e79), and Trine (commit 
04f5b2f4e454d6 - in the ARBvp backend) by moving from the data cache to the 
sampler.

I'd love to see data for applying your new approach in the VS backend.

--Ken


Re: [Mesa-dev] [PATCH 0/7] i965 L3 caching and pull constant improvements.

2015-01-28 Thread Kenneth Graunke
On Wednesday, January 28, 2015 01:14:08 PM Syrja, Harri wrote:
 Hi Kenneth,
 
 The constant cache could and should be allocated a separate region in $L3.
 The main point of having a separate constant region is to avoid thrashing of
 texture data due to pull constant loads. In an optimal solution the constant
 region is allocated only when a shader uses pull constants, but that is not
 so easy as the $L3 config reg is not part of per constant regs.
 
 BR,
 Harri

I agree that constants and textures /could/ be allocated to a separate region
of $L3, but I haven't found any evidence in the documentation to confirm that.

It looks like you can set it up that way on Haswell (and we don't), but on
other chips, everything I've found suggests that read-only clients are lumped
together in the same region...

--Ken



Re: [Mesa-dev] [PATCH 0/7] i965 L3 caching and pull constant improvements.

2015-01-27 Thread Kenneth Graunke
On Sunday, January 18, 2015 01:04:02 AM Francisco Jerez wrote:
 This is the first part of a series meant to improve our usage of the L3 cache.
 Currently it's far from ideal since the following objects aren't taking any
 advantage of it:
  - Pull constants (i.e. UBOs and demoted uniforms)
  - Buffer textures
  - Shader scratch space (i.e. register spills and fills)
  - Atomic counters
  - (Soon) Images
 
 This first series addresses the first two issues.  Fixing the last three is
 going to be a bit more difficult because we need to modify the partitioning of
 the L3 cache in order to increase the number of ways assigned to the DC, which
 happens to be zero on boot until Gen8.  That's likely to require kernel
 changes because we don't have any extremely satisfactory API to change that
 from userspace right now.
 
 The first patch in the series sets the MOCS L3 cacheability bit in the surface
 state structure for buffers so the mentioned memory objects (except the shader
 scratch space that gets its MOCS from elsewhere) have a chance of getting
 cached in L3.
 
 The fourth patch in the series switches to using the constant cache (which,
 unlike the data cache that was used years ago before we started using the
 sampler, is cached on L3 with the default partitioning on all gens) for
 uniform pull constants loads.  The overall performance numbers I've collected
 are included in the commit message of the same patch for future reference.
 Most of it points at the constant cache being faster than the sampler in a
 number of cases (assuming the L3 caching settings are correct), it's also
 likely to alleviate some cache thrashing caused by the competition with
 textures for the L1/L2 sampler caches, and it allows fetching up to eight
 consecutive owords (128B) with just one message.
 
 The sixth patch enables 4 oword loads because they're basically for free and
 they avoid some of the shortcomings of the 1 and 2 oword messages (see the
 commit message for more details).  I'll have a look into enabling 8 oword
 loads but it's going to require an analysis pass to avoid wasting bandwidth
 and increasing the register pressure unnecessarily when the shader doesn't
 actually need as many constants.
 
 We could do something similar for non-uniform offset pull constant loads and
 for both kinds of pull constant loads on the vec4 back-end, but I don't have
 enough performance data to support that yet.

Hi Curro!

Technically, I believe we /are/ taking advantage of the L3 today - the sampler
should be part of the All Clients and Read Only Client Pool portions of the
L3.  I believe the data port's Constant Cache is part of the same L3 region.
However, the sampler has an additional L1/L2 cache.

When you say you don't have enough performance data to support doing this in
the vec4 backend, or for non-uniform offset pull loads, do you mean that you
tried it and it wasn't useful, or you just haven't tried it yet?

In my experience, the VS matters a *lot* - skinning shaders tend to use large
arrays of matrices, which get demoted to pull constants.  For example, I
observed huge speedups in GLBenchmark 2.7 EgyptHD (commit 5d8e246ac86b4a94,
in the VS backend), GLBenchmark 2.1 PRO (commit 259b65e2e79), and Trine
(commit 04f5b2f4e454d6 - in the ARBvp backend) by moving from the data cache
to the sampler.

I'd love to see data for applying your new approach in the VS backend.

--Ken



[Mesa-dev] [PATCH 0/7] i965 L3 caching and pull constant improvements.

2015-01-17 Thread Francisco Jerez
This is the first part of a series meant to improve our usage of the L3 cache.
Currently it's far from ideal since the following objects aren't taking any
advantage of it:
 - Pull constants (i.e. UBOs and demoted uniforms)
 - Buffer textures
 - Shader scratch space (i.e. register spills and fills)
 - Atomic counters
 - (Soon) Images

This first series addresses the first two issues.  Fixing the last three is
going to be a bit more difficult because we need to modify the partitioning of
the L3 cache in order to increase the number of ways assigned to the DC, which
happens to be zero on boot until Gen8.  That's likely to require kernel
changes because we don't have any extremely satisfactory API to change that
from userspace right now.

The first patch in the series sets the MOCS L3 cacheability bit in the surface
state structure for buffers so the mentioned memory objects (except the shader
scratch space that gets its MOCS from elsewhere) have a chance of getting
cached in L3.

The fourth patch in the series switches to using the constant cache (which,
unlike the data cache that was used years ago before we started using the
sampler, is cached on L3 with the default partitioning on all gens) for
uniform pull constant loads.  The overall performance numbers I've collected
are included in the commit message of the same patch for future reference.
Most of them point at the constant cache being faster than the sampler in a
number of cases (assuming the L3 caching settings are correct).  It's also
likely to alleviate some cache thrashing caused by the competition with
textures for the L1/L2 sampler caches, and it allows fetching up to eight
consecutive owords (128B) with just one message.
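
To make the message-size arithmetic concrete, here is a minimal sketch that is
not taken from the series.  It only assumes that an oword is 16 bytes (which is
what "eight consecutive owords (128B)" implies); the helper names oword_bytes()
and messages_needed() are hypothetical and exist only for illustration.

```c
#include <assert.h>

/* Size of one oword in bytes: eight owords are the 128B quoted above. */
#define OWORD_BYTES 16u

/* Hypothetical helper: payload size in bytes of a block read that
 * returns `owords` consecutive owords. */
static unsigned
oword_bytes(unsigned owords)
{
   return owords * OWORD_BYTES;
}

/* Hypothetical helper: number of block-read messages needed to fetch
 * `bytes` of contiguous, block-aligned constants when each message
 * returns `owords_per_msg` owords.  Rounds up to whole messages. */
static unsigned
messages_needed(unsigned bytes, unsigned owords_per_msg)
{
   assert(owords_per_msg > 0);
   const unsigned block = oword_bytes(owords_per_msg);
   return (bytes + block - 1) / block;
}
```

Under these assumptions, fetching 128 bytes of constants takes eight 1-oword
messages, two 4-oword messages, or a single 8-oword message.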

The sixth patch enables 4 oword loads because they're basically for free and
they avoid some of the shortcomings of the 1 and 2 oword messages (see the
commit message for more details).  I'll have a look into enabling 8 oword
loads but it's going to require an analysis pass to avoid wasting bandwidth
and increasing the register pressure unnecessarily when the shader doesn't
actually need as many constants.
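
The trade-off behind the 4 versus 8 oword choice can be sketched as follows.
This is an illustration under stated assumptions, not the code from the patch:
it assumes a 64B cacheline (4 owords of 16 bytes each, as the title of patch 6
suggests), and the helpers pull_block_start(), pull_block_index() and
wasted_bytes() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define OWORD_BYTES     16u
/* Assumed cacheline size: 4 owords, i.e. 64 bytes. */
#define CACHELINE_BYTES (4u * OWORD_BYTES)

/* Byte offset of the cacheline-aligned block that contains the
 * constant at byte_offset. */
static uint32_t
pull_block_start(uint32_t byte_offset)
{
   return byte_offset & ~(CACHELINE_BYTES - 1u);
}

/* Byte position of that constant inside the fetched block. */
static uint32_t
pull_block_index(uint32_t byte_offset)
{
   return byte_offset & (CACHELINE_BYTES - 1u);
}

/* Bytes fetched into registers but never read when the shader only
 * uses used_bytes out of a block of `owords` owords. */
static uint32_t
wasted_bytes(uint32_t owords, uint32_t used_bytes)
{
   const uint32_t fetched = owords * OWORD_BYTES;
   assert(used_bytes <= fetched);
   return fetched - used_bytes;
}
```

For a shader that reads a single vec4 (16 bytes) at byte offset 100, the block
starts at offset 64 and the constant sits 36 bytes into it.  A 4 oword load
leaves 48 bytes unused, while an 8 oword load would leave 112 bytes unused,
which is the kind of waste an analysis pass would need to rule out.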

We could do something similar for non-uniform offset pull constant loads and
for both kinds of pull constant loads on the vec4 back-end, but I don't have
enough performance data to support that yet.

[PATCH 1/7] i965: Enable L3 caching of buffer surfaces.
[PATCH 2/7] i965: Remove the create_raw_surface vtbl hook.
[PATCH 3/7] i965: Let the caller of brw_set_dp_write/read_message control the 
target cache.
[PATCH 4/7] i965/fs: Switch to the constant cache for uniform pull constants.
[PATCH 5/7] i965/fs: Less broken handling of force_writemask_all in 
lower_load_payload().
[PATCH 6/7] i965/fs: Fetch one cacheline of pull constants at a time.
[PATCH 7/7] i965/fs: Remove the FS_OPCODE_SET_SIMD4X2_OFFSET virtual opcode.