Re: [Mesa-dev] [Intel-gfx] [PATCH] drm/i915: Enable the HiZ RAW Stall Optimization on Gen8.

2015-01-12 Thread Ville Syrjälä
On Sun, Jan 11, 2015 at 07:14:57PM -0800, Ben Widawsky wrote:
 On Sun, Jan 11, 2015 at 07:05:21PM -0800, Kenneth Graunke wrote:
  On Sunday, January 11, 2015 05:46:09 PM Ben Widawsky wrote:
   On Sun, Jan 11, 2015 at 04:05:25PM -0800, Kenneth Graunke wrote:
On Sunday, January 11, 2015 01:49:41 PM Ben Widawsky wrote:
 On Sat, Jan 10, 2015 at 06:44:49PM -0800, Kenneth Graunke wrote:
  This is an important optimization for avoiding read-after-write 
  (RAW)
  stalls in the HiZ buffer.  Certain workloads would run very slowly 
  with
  HiZ enabled, but run much faster with the hiz=false driconf 
  option.
  With this patch, they run at full speed even with HiZ.
  
  Improves performance in OglVSInstancing by 3.2x on Broadwell GT3e
  (Iris Pro 6200).
  
  Thanks to Jesse Barnes for finding this missing bit!
  Thanks to Chris Wilson for helping me find where to set it.
  
  Signed-off-by: Kenneth Graunke kenn...@whitecape.org
  Cc: Jesse Barnes jbar...@virtuousgeek.org
  ---
   drivers/gpu/drm/i915/intel_ringbuffer.c | 15 +++
   1 file changed, 15 insertions(+)
  
  Here's an alternate patch which implements the workaround in the 
  kernel
  instead of Mesa.  It's probably better to do it there, since the 
  kernel
  does it on Haswell already.
  
  diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c 
  b/drivers/gpu/drm/i915/intel_ringbuffer.c
  index dabc1d8..23020d6 100644
  --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
  +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
  @@ -796,6 +796,16 @@ static int bdw_init_workarounds(struct 
  intel_engine_cs *ring)
HDC_DONOT_FETCH_MEM_WHEN_MASKED |
(IS_BDW_GT3(dev) ? HDC_FENCE_DEST_SLM_DISABLE 
  : 0));
   
  +   /* From the Haswell PRM, Command Reference: Registers, 
  CACHE_MODE_0:
  +* The Hierarchical Z RAW Stall Optimization allows 
  non-overlapping
  +*  polygons in the same 8x4 pixel/sample area to be processed 
  without
  +*  stalling waiting for the earlier ones to write to 
  Hierarchical Z
  +*  buffer.
  +*
  +* This optimization is off by default for Broadwell; turn it 
  on.
  +*/
  +   WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
  +
  /* Wa4x4STCOptimizationDisable:bdw */
  WA_SET_BIT_MASKED(CACHE_MODE_1,
GEN8_4x4_STC_OPTIMIZATION_DISABLE);
  @@ -836,6 +846,11 @@ static int chv_init_workarounds(struct 
  intel_engine_cs *ring)
HDC_FORCE_NON_COHERENT |
HDC_DONOT_FETCH_MEM_WHEN_MASKED);
   
  +   /* According to the CACHE_MODE_0 default value documentation, 
  some
  +* CHV platforms disable this optimization by default.  Turn it 
  on.
  +*/
  +   WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
  +
  /* Improve HiZ throughput on CHV. */
  WA_SET_BIT_MASKED(HIZ_CHICKEN, CHV_HZ_8X8_MODE_IN_1X);
   
 
 I think you should do this as two separate patches, 1 per platform. 
 For the BSW
 patch (given that I had the same functionality in the kernel patch I 
 asked you
 to look at ;-) and FWIW, Jordan has numbers on BSW B-step with my 
 kernel patch
 which we can use for the commit):
 Signed-off-by: Ben Widawsky b...@bwidawsk.net

Huh, I don't recall seeing that kernel patch.  Sorry.  I guess I'll 
split it
and resubmit...

   
   It's not my call, it's just nice to have platform specific bisection. And 
   the
   patch wasn't on the list, it was the one I kept asking you to look at in 
   my
   branch :-)
   
 I haven't looked at Broadwell docs, so I'll let someone else take 
 care of that.
 
 I don't know if I agree with Chris that we should call these in the 
 workaround
 section, but whatever. init_clock_gating is equally sucky.

init_clock_gating doesn't work.  The register writes don't stick and 
they have
no effect at all.  Setting them here makes them actually take effect in 
the
context.

--Ken
   
   Separate thread now, but are you sure? We're setting at least two context
   specific registers in there today, among them: GEN7_FF_THREAD_MODE (which 
   is
   important to performance).
  
  It looks like we're setting:
  - [BDW] RC_SLEEP_PSMI_CONTROL - 0x2050
 
 dword offset 0x1c in the context image

power context, not logical context

  - [BDW, CHV] FF_THREAD_MODE - 0x20a0
 
 dword offset 0x2a in the context image

Also power context

  - [CHV] RC_SLEEP_PSMI_CONTROL - 0x12050
 
 Kinda surprised this one isn't there. I'm not sure how it can work correctly.

We're not frobbing with this anywhere but gen6_bsd_ring_write_tail(). In any
case it's 

Re: [Mesa-dev] [Intel-gfx] [PATCH] drm/i915: Enable the HiZ RAW Stall Optimization on Gen8.

2015-01-12 Thread Ville Syrjälä
On Sat, Jan 10, 2015 at 06:44:49PM -0800, Kenneth Graunke wrote:
 This is an important optimization for avoiding read-after-write (RAW)
 stalls in the HiZ buffer.  Certain workloads would run very slowly with
 HiZ enabled, but run much faster with the hiz=false driconf option.
 With this patch, they run at full speed even with HiZ.
 
 Improves performance in OglVSInstancing by 3.2x on Broadwell GT3e
 (Iris Pro 6200).
 
 Thanks to Jesse Barnes for finding this missing bit!
 Thanks to Chris Wilson for helping me find where to set it.
 
 Signed-off-by: Kenneth Graunke kenn...@whitecape.org
 Cc: Jesse Barnes jbar...@virtuousgeek.org
 ---
  drivers/gpu/drm/i915/intel_ringbuffer.c | 15 +++
  1 file changed, 15 insertions(+)
 
 Here's an alternate patch which implements the workaround in the kernel
 instead of Mesa.  It's probably better to do it there, since the kernel
 does it on Haswell already.
 
 diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c 
 b/drivers/gpu/drm/i915/intel_ringbuffer.c
 index dabc1d8..23020d6 100644
 --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
 +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
 @@ -796,6 +796,16 @@ static int bdw_init_workarounds(struct intel_engine_cs 
 *ring)
 HDC_DONOT_FETCH_MEM_WHEN_MASKED |
 (IS_BDW_GT3(dev) ? HDC_FENCE_DEST_SLM_DISABLE : 0));
  
 + /* From the Haswell PRM, Command Reference: Registers, CACHE_MODE_0:
 +  * The Hierarchical Z RAW Stall Optimization allows non-overlapping
 +  *  polygons in the same 8x4 pixel/sample area to be processed without
 +  *  stalling waiting for the earlier ones to write to Hierarchical Z
 +  *  buffer.
 +  *
 +  * This optimization is off by default for Broadwell; turn it on.
 +  */
 + WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
 +
   /* Wa4x4STCOptimizationDisable:bdw */
   WA_SET_BIT_MASKED(CACHE_MODE_1,
 GEN8_4x4_STC_OPTIMIZATION_DISABLE);
 @@ -836,6 +846,11 @@ static int chv_init_workarounds(struct intel_engine_cs 
 *ring)
 HDC_FORCE_NON_COHERENT |
 HDC_DONOT_FETCH_MEM_WHEN_MASKED);
  
 + /* According to the CACHE_MODE_0 default value documentation, some
 +  * CHV platforms disable this optimization by default.  Turn it on.
 +  */
 + WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
 +

I remember looking at this when the HSW version was done, and at the
time the docs claimed that the default value on gen8 was already good.
Looks like the docs have been updated since then. Also my BSW agrees
that the disable bit is now set by default.

I suppose we could assume that since it works on HSW, it'll work on
gen8. However, I find it a bit suspicious that the later steppings seem
to have changed the default to disable the optimization. IVB suffered
from real problems with the optimization enabled and hence the IVB patch
was reverted. IIRC Chia-I wrote some kind of test to demonstrate the
problem on IVB, so maybe you want to run it and make sure it still
works correctly on gen8. Here's the test:
http://permalink.gmane.org/gmane.comp.freedesktop.xorg.drivers.intel/35399

   /* Improve HiZ throughput on CHV. */
   WA_SET_BIT_MASKED(HIZ_CHICKEN, CHV_HZ_8X8_MODE_IN_1X);
  
 -- 
 2.2.1
 
 ___
 Intel-gfx mailing list
 intel-...@lists.freedesktop.org
 http://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Ville Syrjälä
Intel OTC
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [Intel-gfx] [PATCH] drm/i915: Enable the HiZ RAW Stall Optimization on Gen8.

2015-01-12 Thread Kenneth Graunke
On Monday, January 12, 2015 02:32:20 PM Ville Syrjälä wrote:
 On Sat, Jan 10, 2015 at 06:44:49PM -0800, Kenneth Graunke wrote:
  This is an important optimization for avoiding read-after-write (RAW)
  stalls in the HiZ buffer.  Certain workloads would run very slowly with
  HiZ enabled, but run much faster with the hiz=false driconf option.
  With this patch, they run at full speed even with HiZ.
  
  Improves performance in OglVSInstancing by 3.2x on Broadwell GT3e
  (Iris Pro 6200).
  
  Thanks to Jesse Barnes for finding this missing bit!
  Thanks to Chris Wilson for helping me find where to set it.
  
  Signed-off-by: Kenneth Graunke kenn...@whitecape.org
  Cc: Jesse Barnes jbar...@virtuousgeek.org
  ---
   drivers/gpu/drm/i915/intel_ringbuffer.c | 15 +++
   1 file changed, 15 insertions(+)
  
  Here's an alternate patch which implements the workaround in the kernel
  instead of Mesa.  It's probably better to do it there, since the kernel
  does it on Haswell already.
  
  diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c 
  b/drivers/gpu/drm/i915/intel_ringbuffer.c
  index dabc1d8..23020d6 100644
  --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
  +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
  @@ -796,6 +796,16 @@ static int bdw_init_workarounds(struct intel_engine_cs 
  *ring)
HDC_DONOT_FETCH_MEM_WHEN_MASKED |
(IS_BDW_GT3(dev) ? HDC_FENCE_DEST_SLM_DISABLE : 0));
   
  +   /* From the Haswell PRM, Command Reference: Registers, CACHE_MODE_0:
  +* The Hierarchical Z RAW Stall Optimization allows non-overlapping
  +*  polygons in the same 8x4 pixel/sample area to be processed without
  +*  stalling waiting for the earlier ones to write to Hierarchical Z
  +*  buffer.
  +*
  +* This optimization is off by default for Broadwell; turn it on.
  +*/
  +   WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
  +
  /* Wa4x4STCOptimizationDisable:bdw */
  WA_SET_BIT_MASKED(CACHE_MODE_1,
GEN8_4x4_STC_OPTIMIZATION_DISABLE);
  @@ -836,6 +846,11 @@ static int chv_init_workarounds(struct intel_engine_cs 
  *ring)
HDC_FORCE_NON_COHERENT |
HDC_DONOT_FETCH_MEM_WHEN_MASKED);
   
  +   /* According to the CACHE_MODE_0 default value documentation, some
  +* CHV platforms disable this optimization by default.  Turn it on.
  +*/
  +   WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
  +
 
 I remember looking at this when the HSW version was done, and at the
 time the docs claimed that the default value on gen8 was already good.
 Looks like the docs have been updated since then. Also my BSW agrees
 that the disable bit is now set by default.
 
 I suppose we could assume that since it works on HSW, it'll work on
 gen8. However, I find it a bit suspicious that the later steppings seem
 to have changed the default to disable the optimization. IVB suffered
 from real problems with the optimization enabled and hence the IVB patch
 was reverted. IIRC Chia-I wrote some kind of test to demonstrate the
 problem on IVB, so maybe you want to run it and make sure it still
 works correctly on gen8. Here's the test:
 http://permalink.gmane.org/gmane.comp.freedesktop.xorg.drivers.intel/35399

I just ran Chia-I's program on BSW, and it appears to be working fine.
Classic swrast and i965/hardware both produced an identical image.

--Ken

signature.asc
Description: This is a digitally signed message part.
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [Intel-gfx] [PATCH] drm/i915: Enable the HiZ RAW Stall Optimization on Gen8.

2015-01-12 Thread Ben Widawsky
On Mon, Jan 12, 2015 at 02:02:34PM +0200, Ville Syrjälä wrote:
 On Sun, Jan 11, 2015 at 07:14:57PM -0800, Ben Widawsky wrote:
  On Sun, Jan 11, 2015 at 07:05:21PM -0800, Kenneth Graunke wrote:
   On Sunday, January 11, 2015 05:46:09 PM Ben Widawsky wrote:
On Sun, Jan 11, 2015 at 04:05:25PM -0800, Kenneth Graunke wrote:
 On Sunday, January 11, 2015 01:49:41 PM Ben Widawsky wrote:
  On Sat, Jan 10, 2015 at 06:44:49PM -0800, Kenneth Graunke wrote:
   This is an important optimization for avoiding read-after-write 
   (RAW)
   stalls in the HiZ buffer.  Certain workloads would run very 
   slowly with
   HiZ enabled, but run much faster with the hiz=false driconf 
   option.
   With this patch, they run at full speed even with HiZ.
   
   Improves performance in OglVSInstancing by 3.2x on Broadwell GT3e
   (Iris Pro 6200).
   
   Thanks to Jesse Barnes for finding this missing bit!
   Thanks to Chris Wilson for helping me find where to set it.
   
   Signed-off-by: Kenneth Graunke kenn...@whitecape.org
   Cc: Jesse Barnes jbar...@virtuousgeek.org
   ---
drivers/gpu/drm/i915/intel_ringbuffer.c | 15 +++
1 file changed, 15 insertions(+)
   
   Here's an alternate patch which implements the workaround in the 
   kernel
   instead of Mesa.  It's probably better to do it there, since the 
   kernel
   does it on Haswell already.
   
   diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c 
   b/drivers/gpu/drm/i915/intel_ringbuffer.c
   index dabc1d8..23020d6 100644
   --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
   +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
   @@ -796,6 +796,16 @@ static int bdw_init_workarounds(struct 
   intel_engine_cs *ring)
   HDC_DONOT_FETCH_MEM_WHEN_MASKED |
   (IS_BDW_GT3(dev) ? HDC_FENCE_DEST_SLM_DISABLE 
   : 0));

   + /* From the Haswell PRM, Command Reference: Registers, 
   CACHE_MODE_0:
   +  * The Hierarchical Z RAW Stall Optimization allows 
   non-overlapping
   +  *  polygons in the same 8x4 pixel/sample area to be processed 
   without
   +  *  stalling waiting for the earlier ones to write to 
   Hierarchical Z
   +  *  buffer.
   +  *
   +  * This optimization is off by default for Broadwell; turn it 
   on.
   +  */
   + WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
   +
 /* Wa4x4STCOptimizationDisable:bdw */
 WA_SET_BIT_MASKED(CACHE_MODE_1,
   GEN8_4x4_STC_OPTIMIZATION_DISABLE);
   @@ -836,6 +846,11 @@ static int chv_init_workarounds(struct 
   intel_engine_cs *ring)
   HDC_FORCE_NON_COHERENT |
   HDC_DONOT_FETCH_MEM_WHEN_MASKED);

   + /* According to the CACHE_MODE_0 default value documentation, 
   some
   +  * CHV platforms disable this optimization by default.  Turn it 
   on.
   +  */
   + WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
   +
 /* Improve HiZ throughput on CHV. */
 WA_SET_BIT_MASKED(HIZ_CHICKEN, CHV_HZ_8X8_MODE_IN_1X);

  
  I think you should do this as two separate patches, 1 per platform. 
  For the BSW
  patch (given that I had the same functionality in the kernel patch 
  I asked you
  to look at ;-) and FWIW, Jordan has numbers on BSW B-step with my 
  kernel patch
  which we can use for the commit):
  Signed-off-by: Ben Widawsky b...@bwidawsk.net
 
 Huh, I don't recall seeing that kernel patch.  Sorry.  I guess I'll 
 split it
 and resubmit...
 

It's not my call, it's just nice to have platform specific bisection. 
And the
patch wasn't on the list, it was the one I kept asking you to look at 
in my
branch :-)

  I haven't looked at Broadwell docs, so I'll let someone else take 
  care of that.
  
  I don't know if I agree with Chris that we should call these in the 
  workaround
  section, but whatever. init_clock_gating is equally sucky.
 
 init_clock_gating doesn't work.  The register writes don't stick and 
 they have
 no effect at all.  Setting them here makes them actually take effect 
 in the
 context.
 
 --Ken

Separate thread now, but are you sure? We're setting at least two 
context
specific registers in there today, among them: GEN7_FF_THREAD_MODE 
(which is
important to performance).
   
   It looks like we're setting:
   - [BDW] RC_SLEEP_PSMI_CONTROL - 0x2050
  
  dword offset 0x1c in the context image
 
 power context, not logical context
 
   - [BDW, CHV] FF_THREAD_MODE - 0x20a0
  
  dword offset 0x2a in the context image
 
 Also power context
 
   - [CHV] RC_SLEEP_PSMI_CONTROL - 0x12050
  
  Kinda