Re: [Intel-gfx] [Mesa-dev] [PATCH] drm/i915: Enable the HiZ RAW Stall Optimization on Gen8.

2015-01-13 Thread Ville Syrjälä
On Mon, Jan 12, 2015 at 06:07:26PM -0800, Ben Widawsky wrote:
 On Mon, Jan 12, 2015 at 06:09:12PM +, Dave Gordon wrote:
  On 12/01/15 18:02, Ben Widawsky wrote:
   On Mon, Jan 12, 2015 at 02:02:34PM +0200, Ville Syrjälä wrote:
   On Sun, Jan 11, 2015 at 07:14:57PM -0800, Ben Widawsky wrote:
   On Sun, Jan 11, 2015 at 07:05:21PM -0800, Kenneth Graunke wrote:
   On Sunday, January 11, 2015 05:46:09 PM Ben Widawsky wrote:
   On Sun, Jan 11, 2015 at 04:05:25PM -0800, Kenneth Graunke wrote:
   On Sunday, January 11, 2015 01:49:41 PM Ben Widawsky wrote:
   On Sat, Jan 10, 2015 at 06:44:49PM -0800, Kenneth Graunke wrote:
   This is an important optimization for avoiding read-after-write (RAW)
   stalls in the HiZ buffer.  Certain workloads would run very slowly with
   HiZ enabled, but run much faster with the hiz=false driconf option.
   With this patch, they run at full speed even with HiZ.
  
   Improves performance in OglVSInstancing by 3.2x on Broadwell GT3e
   (Iris Pro 6200).
  
   Thanks to Jesse Barnes for finding this missing bit!
   Thanks to Chris Wilson for helping me find where to set it.
  
   Signed-off-by: Kenneth Graunke kenn...@whitecape.org
   Cc: Jesse Barnes jbar...@virtuousgeek.org
   ---
drivers/gpu/drm/i915/intel_ringbuffer.c | 15 +++
1 file changed, 15 insertions(+)
  
   Here's an alternate patch which implements the workaround in the kernel
   instead of Mesa.  It's probably better to do it there, since the kernel
   does it on Haswell already.

   diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
   index dabc1d8..23020d6 100644
   --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
   +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
   @@ -796,6 +796,16 @@ static int bdw_init_workarounds(struct intel_engine_cs *ring)
                             HDC_DONOT_FETCH_MEM_WHEN_MASKED |
                             (IS_BDW_GT3(dev) ? HDC_FENCE_DEST_SLM_DISABLE : 0));

   +       /* From the Haswell PRM, Command Reference: Registers, CACHE_MODE_0:
   +        * The Hierarchical Z RAW Stall Optimization allows non-overlapping
   +        *  polygons in the same 8x4 pixel/sample area to be processed without
   +        *  stalling waiting for the earlier ones to write to Hierarchical Z
   +        *  buffer.
   +        *
   +        * This optimization is off by default for Broadwell; turn it on.
   +        */
   +       WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
   +
           /* Wa4x4STCOptimizationDisable:bdw */
           WA_SET_BIT_MASKED(CACHE_MODE_1,
                             GEN8_4x4_STC_OPTIMIZATION_DISABLE);
   @@ -836,6 +846,11 @@ static int chv_init_workarounds(struct intel_engine_cs *ring)
                             HDC_FORCE_NON_COHERENT |
                             HDC_DONOT_FETCH_MEM_WHEN_MASKED);

   +       /* According to the CACHE_MODE_0 default value documentation, some
   +        * CHV platforms disable this optimization by default.  Turn it on.
   +        */
   +       WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
   +
           /* Improve HiZ throughput on CHV. */
           WA_SET_BIT_MASKED(HIZ_CHICKEN, CHV_HZ_8X8_MODE_IN_1X);

  
   I think you should do this as two separate patches, 1 per platform. For the BSW
   patch (given that I had the same functionality in the kernel patch I asked you
   to look at ;-) and FWIW, Jordan has numbers on BSW B-step with my kernel patch
   which we can use for the commit):
   Signed-off-by: Ben Widawsky b...@bwidawsk.net

   Huh, I don't recall seeing that kernel patch.  Sorry.  I guess I'll split it
   and resubmit...

   It's not my call, it's just nice to have platform specific bisection. And the
   patch wasn't on the list, it was the one I kept asking you to look at in my
   branch :-)

   I haven't looked at Broadwell docs, so I'll let someone else take care of that.

   I don't know if I agree with Chris that we should call these in the workaround
   section, but whatever. init_clock_gating is equally sucky.

   init_clock_gating doesn't work.  The register writes don't stick and they have
   no effect at all.  Setting them here makes them actually take effect in the
   context.

   --Ken

   Separate thread now, but are you sure? We're setting at least two context
   specific registers in there today, among them: GEN7_FF_THREAD_MODE (which is
   important to performance).
  
   It looks like we're setting:
   - [BDW] RC_SLEEP_PSMI_CONTROL - 0x2050
  
   dword offset 0x1c in the context image
  
   power context, not logical context
  
   - [BDW, CHV] FF_THREAD_MODE - 0x20a0
  
   dword offset 0x2a in the context image
  
   Also power context
  
   - [CHV] RC_SLEEP_PSMI_CONTROL - 0x12050
  
   Kinda surprised this one isn't there. I'm not sure how it can work correctly.

   We're not frobbing with this anywhere but gen6_bsd_ring_write_tail(). In any
   case it's a VCS 

Re: [Intel-gfx] [Mesa-dev] [PATCH] drm/i915: Enable the HiZ RAW Stall Optimization on Gen8.

2015-01-12 Thread Ville Syrjälä
On Sun, Jan 11, 2015 at 07:14:57PM -0800, Ben Widawsky wrote:
 On Sun, Jan 11, 2015 at 07:05:21PM -0800, Kenneth Graunke wrote:
  On Sunday, January 11, 2015 05:46:09 PM Ben Widawsky wrote:
   On Sun, Jan 11, 2015 at 04:05:25PM -0800, Kenneth Graunke wrote:
On Sunday, January 11, 2015 01:49:41 PM Ben Widawsky wrote:
 On Sat, Jan 10, 2015 at 06:44:49PM -0800, Kenneth Graunke wrote:
  This is an important optimization for avoiding read-after-write (RAW)
  stalls in the HiZ buffer.  Certain workloads would run very slowly with
  HiZ enabled, but run much faster with the hiz=false driconf option.
  With this patch, they run at full speed even with HiZ.
  
  Improves performance in OglVSInstancing by 3.2x on Broadwell GT3e
  (Iris Pro 6200).
  
  Thanks to Jesse Barnes for finding this missing bit!
  Thanks to Chris Wilson for helping me find where to set it.
  
  Signed-off-by: Kenneth Graunke kenn...@whitecape.org
  Cc: Jesse Barnes jbar...@virtuousgeek.org
  ---
   drivers/gpu/drm/i915/intel_ringbuffer.c | 15 +++
   1 file changed, 15 insertions(+)
  
  Here's an alternate patch which implements the workaround in the kernel
  instead of Mesa.  It's probably better to do it there, since the kernel
  does it on Haswell already.

  diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
  index dabc1d8..23020d6 100644
  --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
  +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
  @@ -796,6 +796,16 @@ static int bdw_init_workarounds(struct intel_engine_cs *ring)
                            HDC_DONOT_FETCH_MEM_WHEN_MASKED |
                            (IS_BDW_GT3(dev) ? HDC_FENCE_DEST_SLM_DISABLE : 0));

  +       /* From the Haswell PRM, Command Reference: Registers, CACHE_MODE_0:
  +        * The Hierarchical Z RAW Stall Optimization allows non-overlapping
  +        *  polygons in the same 8x4 pixel/sample area to be processed without
  +        *  stalling waiting for the earlier ones to write to Hierarchical Z
  +        *  buffer.
  +        *
  +        * This optimization is off by default for Broadwell; turn it on.
  +        */
  +       WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
  +
          /* Wa4x4STCOptimizationDisable:bdw */
          WA_SET_BIT_MASKED(CACHE_MODE_1,
                            GEN8_4x4_STC_OPTIMIZATION_DISABLE);
  @@ -836,6 +846,11 @@ static int chv_init_workarounds(struct intel_engine_cs *ring)
                            HDC_FORCE_NON_COHERENT |
                            HDC_DONOT_FETCH_MEM_WHEN_MASKED);

  +       /* According to the CACHE_MODE_0 default value documentation, some
  +        * CHV platforms disable this optimization by default.  Turn it on.
  +        */
  +       WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
  +
          /* Improve HiZ throughput on CHV. */
          WA_SET_BIT_MASKED(HIZ_CHICKEN, CHV_HZ_8X8_MODE_IN_1X);
   
 
 I think you should do this as two separate patches, 1 per platform. For the BSW
 patch (given that I had the same functionality in the kernel patch I asked you
 to look at ;-) and FWIW, Jordan has numbers on BSW B-step with my kernel patch
 which we can use for the commit):
 Signed-off-by: Ben Widawsky b...@bwidawsk.net

Huh, I don't recall seeing that kernel patch.  Sorry.  I guess I'll split it
and resubmit...

   It's not my call, it's just nice to have platform specific bisection. And the
   patch wasn't on the list, it was the one I kept asking you to look at in my
   branch :-)

 I haven't looked at Broadwell docs, so I'll let someone else take care of that.

 I don't know if I agree with Chris that we should call these in the workaround
 section, but whatever. init_clock_gating is equally sucky.

init_clock_gating doesn't work.  The register writes don't stick and they have
no effect at all.  Setting them here makes them actually take effect in the
context.

--Ken

   Separate thread now, but are you sure? We're setting at least two context
   specific registers in there today, among them: GEN7_FF_THREAD_MODE (which is
   important to performance).
  
  It looks like we're setting:
  - [BDW] RC_SLEEP_PSMI_CONTROL - 0x2050
 
 dword offset 0x1c in the context image

power context, not logical context

  - [BDW, CHV] FF_THREAD_MODE - 0x20a0
 
 dword offset 0x2a in the context image

Also power context

  - [CHV] RC_SLEEP_PSMI_CONTROL - 0x12050
 
 Kinda surprised this one isn't there. I'm not sure how it can work correctly.

We're not frobbing with this anywhere but gen6_bsd_ring_write_tail(). In any
case it's 

Re: [Intel-gfx] [Mesa-dev] [PATCH] drm/i915: Enable the HiZ RAW Stall Optimization on Gen8.

2015-01-12 Thread Ben Widawsky
On Mon, Jan 12, 2015 at 06:09:12PM +, Dave Gordon wrote:
 On 12/01/15 18:02, Ben Widawsky wrote:
  On Mon, Jan 12, 2015 at 02:02:34PM +0200, Ville Syrjälä wrote:
  On Sun, Jan 11, 2015 at 07:14:57PM -0800, Ben Widawsky wrote:
  On Sun, Jan 11, 2015 at 07:05:21PM -0800, Kenneth Graunke wrote:
  On Sunday, January 11, 2015 05:46:09 PM Ben Widawsky wrote:
  On Sun, Jan 11, 2015 at 04:05:25PM -0800, Kenneth Graunke wrote:
  On Sunday, January 11, 2015 01:49:41 PM Ben Widawsky wrote:
  On Sat, Jan 10, 2015 at 06:44:49PM -0800, Kenneth Graunke wrote:
  This is an important optimization for avoiding read-after-write (RAW)
  stalls in the HiZ buffer.  Certain workloads would run very slowly with
  HiZ enabled, but run much faster with the hiz=false driconf option.
  With this patch, they run at full speed even with HiZ.
 
  Improves performance in OglVSInstancing by 3.2x on Broadwell GT3e
  (Iris Pro 6200).
 
  Thanks to Jesse Barnes for finding this missing bit!
  Thanks to Chris Wilson for helping me find where to set it.
 
  Signed-off-by: Kenneth Graunke kenn...@whitecape.org
  Cc: Jesse Barnes jbar...@virtuousgeek.org
  ---
   drivers/gpu/drm/i915/intel_ringbuffer.c | 15 +++
   1 file changed, 15 insertions(+)
 
  Here's an alternate patch which implements the workaround in the kernel
  instead of Mesa.  It's probably better to do it there, since the kernel
  does it on Haswell already.

  diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
  index dabc1d8..23020d6 100644
  --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
  +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
  @@ -796,6 +796,16 @@ static int bdw_init_workarounds(struct intel_engine_cs *ring)
                            HDC_DONOT_FETCH_MEM_WHEN_MASKED |
                            (IS_BDW_GT3(dev) ? HDC_FENCE_DEST_SLM_DISABLE : 0));

  +       /* From the Haswell PRM, Command Reference: Registers, CACHE_MODE_0:
  +        * The Hierarchical Z RAW Stall Optimization allows non-overlapping
  +        *  polygons in the same 8x4 pixel/sample area to be processed without
  +        *  stalling waiting for the earlier ones to write to Hierarchical Z
  +        *  buffer.
  +        *
  +        * This optimization is off by default for Broadwell; turn it on.
  +        */
  +       WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
  +
          /* Wa4x4STCOptimizationDisable:bdw */
          WA_SET_BIT_MASKED(CACHE_MODE_1,
                            GEN8_4x4_STC_OPTIMIZATION_DISABLE);
  @@ -836,6 +846,11 @@ static int chv_init_workarounds(struct intel_engine_cs *ring)
                            HDC_FORCE_NON_COHERENT |
                            HDC_DONOT_FETCH_MEM_WHEN_MASKED);

  +       /* According to the CACHE_MODE_0 default value documentation, some
  +        * CHV platforms disable this optimization by default.  Turn it on.
  +        */
  +       WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
  +
          /* Improve HiZ throughput on CHV. */
          WA_SET_BIT_MASKED(HIZ_CHICKEN, CHV_HZ_8X8_MODE_IN_1X);
   
 
  I think you should do this as two separate patches, 1 per platform. For the BSW
  patch (given that I had the same functionality in the kernel patch I asked you
  to look at ;-) and FWIW, Jordan has numbers on BSW B-step with my kernel patch
  which we can use for the commit):
  Signed-off-by: Ben Widawsky b...@bwidawsk.net

  Huh, I don't recall seeing that kernel patch.  Sorry.  I guess I'll split it
  and resubmit...

  It's not my call, it's just nice to have platform specific bisection. And the
  patch wasn't on the list, it was the one I kept asking you to look at in my
  branch :-)

  I haven't looked at Broadwell docs, so I'll let someone else take care of that.

  I don't know if I agree with Chris that we should call these in the workaround
  section, but whatever. init_clock_gating is equally sucky.

  init_clock_gating doesn't work.  The register writes don't stick and they have
  no effect at all.  Setting them here makes them actually take effect in the
  context.

  --Ken

  Separate thread now, but are you sure? We're setting at least two context
  specific registers in there today, among them: GEN7_FF_THREAD_MODE (which is
  important to performance).
 
  It looks like we're setting:
  - [BDW] RC_SLEEP_PSMI_CONTROL - 0x2050
 
  dword offset 0x1c in the context image
 
  power context, not logical context
 
  - [BDW, CHV] FF_THREAD_MODE - 0x20a0
 
  dword offset 0x2a in the context image
 
  Also power context
 
  - [CHV] RC_SLEEP_PSMI_CONTROL - 0x12050
 
  Kinda surprised this one isn't there. I'm not sure how it can work correctly.

  We're not frobbing with this anywhere but gen6_bsd_ring_write_tail(). In any
  case it's a VCS register. Sadly I've not found any documentation for !RCS
  power context, but I'm assuming every engine has a power context of some sort.
 
  
  Yeah, Ken and I 

Re: [Intel-gfx] [Mesa-dev] [PATCH] drm/i915: Enable the HiZ RAW Stall Optimization on Gen8.

2015-01-12 Thread Dave Gordon
On 12/01/15 18:02, Ben Widawsky wrote:
 On Mon, Jan 12, 2015 at 02:02:34PM +0200, Ville Syrjälä wrote:
 On Sun, Jan 11, 2015 at 07:14:57PM -0800, Ben Widawsky wrote:
 On Sun, Jan 11, 2015 at 07:05:21PM -0800, Kenneth Graunke wrote:
 On Sunday, January 11, 2015 05:46:09 PM Ben Widawsky wrote:
 On Sun, Jan 11, 2015 at 04:05:25PM -0800, Kenneth Graunke wrote:
 On Sunday, January 11, 2015 01:49:41 PM Ben Widawsky wrote:
 On Sat, Jan 10, 2015 at 06:44:49PM -0800, Kenneth Graunke wrote:
 This is an important optimization for avoiding read-after-write (RAW)
 stalls in the HiZ buffer.  Certain workloads would run very slowly with
 HiZ enabled, but run much faster with the hiz=false driconf option.
 With this patch, they run at full speed even with HiZ.

 Improves performance in OglVSInstancing by 3.2x on Broadwell GT3e
 (Iris Pro 6200).

 Thanks to Jesse Barnes for finding this missing bit!
 Thanks to Chris Wilson for helping me find where to set it.

 Signed-off-by: Kenneth Graunke kenn...@whitecape.org
 Cc: Jesse Barnes jbar...@virtuousgeek.org
 ---
  drivers/gpu/drm/i915/intel_ringbuffer.c | 15 +++
  1 file changed, 15 insertions(+)

 Here's an alternate patch which implements the workaround in the kernel
 instead of Mesa.  It's probably better to do it there, since the kernel
 does it on Haswell already.

 diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
 index dabc1d8..23020d6 100644
 --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
 +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
 @@ -796,6 +796,16 @@ static int bdw_init_workarounds(struct intel_engine_cs *ring)
                           HDC_DONOT_FETCH_MEM_WHEN_MASKED |
                           (IS_BDW_GT3(dev) ? HDC_FENCE_DEST_SLM_DISABLE : 0));

 +       /* From the Haswell PRM, Command Reference: Registers, CACHE_MODE_0:
 +        * The Hierarchical Z RAW Stall Optimization allows non-overlapping
 +        *  polygons in the same 8x4 pixel/sample area to be processed without
 +        *  stalling waiting for the earlier ones to write to Hierarchical Z
 +        *  buffer.
 +        *
 +        * This optimization is off by default for Broadwell; turn it on.
 +        */
 +       WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
 +
         /* Wa4x4STCOptimizationDisable:bdw */
         WA_SET_BIT_MASKED(CACHE_MODE_1,
                           GEN8_4x4_STC_OPTIMIZATION_DISABLE);
 @@ -836,6 +846,11 @@ static int chv_init_workarounds(struct intel_engine_cs *ring)
                           HDC_FORCE_NON_COHERENT |
                           HDC_DONOT_FETCH_MEM_WHEN_MASKED);

 +       /* According to the CACHE_MODE_0 default value documentation, some
 +        * CHV platforms disable this optimization by default.  Turn it on.
 +        */
 +       WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
 +
         /* Improve HiZ throughput on CHV. */
         WA_SET_BIT_MASKED(HIZ_CHICKEN, CHV_HZ_8X8_MODE_IN_1X);
  

 I think you should do this as two separate patches, 1 per platform. For the BSW
 patch (given that I had the same functionality in the kernel patch I asked you
 to look at ;-) and FWIW, Jordan has numbers on BSW B-step with my kernel patch
 which we can use for the commit):
 Signed-off-by: Ben Widawsky b...@bwidawsk.net

 Huh, I don't recall seeing that kernel patch.  Sorry.  I guess I'll split it
 and resubmit...

 It's not my call, it's just nice to have platform specific bisection. And the
 patch wasn't on the list, it was the one I kept asking you to look at in my
 branch :-)

 I haven't looked at Broadwell docs, so I'll let someone else take care of that.

 I don't know if I agree with Chris that we should call these in the workaround
 section, but whatever. init_clock_gating is equally sucky.

 init_clock_gating doesn't work.  The register writes don't stick and they have
 no effect at all.  Setting them here makes them actually take effect in the
 context.

 --Ken

 Separate thread now, but are you sure? We're setting at least two context
 specific registers in there today, among them: GEN7_FF_THREAD_MODE (which is
 important to performance).

 It looks like we're setting:
 - [BDW] RC_SLEEP_PSMI_CONTROL - 0x2050

 dword offset 0x1c in the context image

 power context, not logical context

 - [BDW, CHV] FF_THREAD_MODE - 0x20a0

 dword offset 0x2a in the context image

 Also power context

 - [CHV] RC_SLEEP_PSMI_CONTROL - 0x12050

 Kinda surprised this one isn't there. I'm not sure how it can work correctly.

 We're not frobbing with this anywhere but gen6_bsd_ring_write_tail(). In any
 case it's a VCS register. Sadly I've not found any documentation for !RCS
 power context, but I'm assuming every engine has a power context of some sort.

 
 Yeah, Ken and I resolved this offline. Any idea why the bits don't stick when
 written via MMIO?
 
 -- 
 Ville Syrjälä
 Intel OTC

Doesn't BSpec say writing via MMIO is unreliable if the 

Re: [Intel-gfx] [Mesa-dev] [PATCH] drm/i915: Enable the HiZ RAW Stall Optimization on Gen8.

2015-01-12 Thread Ben Widawsky
On Mon, Jan 12, 2015 at 02:02:34PM +0200, Ville Syrjälä wrote:
 On Sun, Jan 11, 2015 at 07:14:57PM -0800, Ben Widawsky wrote:
  On Sun, Jan 11, 2015 at 07:05:21PM -0800, Kenneth Graunke wrote:
   On Sunday, January 11, 2015 05:46:09 PM Ben Widawsky wrote:
On Sun, Jan 11, 2015 at 04:05:25PM -0800, Kenneth Graunke wrote:
 On Sunday, January 11, 2015 01:49:41 PM Ben Widawsky wrote:
  On Sat, Jan 10, 2015 at 06:44:49PM -0800, Kenneth Graunke wrote:
   This is an important optimization for avoiding read-after-write (RAW)
   stalls in the HiZ buffer.  Certain workloads would run very slowly with
   HiZ enabled, but run much faster with the hiz=false driconf option.
   With this patch, they run at full speed even with HiZ.
   
   Improves performance in OglVSInstancing by 3.2x on Broadwell GT3e
   (Iris Pro 6200).
   
   Thanks to Jesse Barnes for finding this missing bit!
   Thanks to Chris Wilson for helping me find where to set it.
   
   Signed-off-by: Kenneth Graunke kenn...@whitecape.org
   Cc: Jesse Barnes jbar...@virtuousgeek.org
   ---
drivers/gpu/drm/i915/intel_ringbuffer.c | 15 +++
1 file changed, 15 insertions(+)
   
   Here's an alternate patch which implements the workaround in the kernel
   instead of Mesa.  It's probably better to do it there, since the kernel
   does it on Haswell already.

   diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
   index dabc1d8..23020d6 100644
   --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
   +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
   @@ -796,6 +796,16 @@ static int bdw_init_workarounds(struct intel_engine_cs *ring)
                             HDC_DONOT_FETCH_MEM_WHEN_MASKED |
                             (IS_BDW_GT3(dev) ? HDC_FENCE_DEST_SLM_DISABLE : 0));

   +       /* From the Haswell PRM, Command Reference: Registers, CACHE_MODE_0:
   +        * The Hierarchical Z RAW Stall Optimization allows non-overlapping
   +        *  polygons in the same 8x4 pixel/sample area to be processed without
   +        *  stalling waiting for the earlier ones to write to Hierarchical Z
   +        *  buffer.
   +        *
   +        * This optimization is off by default for Broadwell; turn it on.
   +        */
   +       WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
   +
           /* Wa4x4STCOptimizationDisable:bdw */
           WA_SET_BIT_MASKED(CACHE_MODE_1,
                             GEN8_4x4_STC_OPTIMIZATION_DISABLE);
   @@ -836,6 +846,11 @@ static int chv_init_workarounds(struct intel_engine_cs *ring)
                             HDC_FORCE_NON_COHERENT |
                             HDC_DONOT_FETCH_MEM_WHEN_MASKED);

   +       /* According to the CACHE_MODE_0 default value documentation, some
   +        * CHV platforms disable this optimization by default.  Turn it on.
   +        */
   +       WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
   +
           /* Improve HiZ throughput on CHV. */
           WA_SET_BIT_MASKED(HIZ_CHICKEN, CHV_HZ_8X8_MODE_IN_1X);

  
  I think you should do this as two separate patches, 1 per platform. For the BSW
  patch (given that I had the same functionality in the kernel patch I asked you
  to look at ;-) and FWIW, Jordan has numbers on BSW B-step with my kernel patch
  which we can use for the commit):
  Signed-off-by: Ben Widawsky b...@bwidawsk.net

 Huh, I don't recall seeing that kernel patch.  Sorry.  I guess I'll split it
 and resubmit...

It's not my call, it's just nice to have platform specific bisection. And the
patch wasn't on the list, it was the one I kept asking you to look at in my
branch :-)

  I haven't looked at Broadwell docs, so I'll let someone else take care of that.

  I don't know if I agree with Chris that we should call these in the workaround
  section, but whatever. init_clock_gating is equally sucky.

 init_clock_gating doesn't work.  The register writes don't stick and they have
 no effect at all.  Setting them here makes them actually take effect in the
 context.

 --Ken

Separate thread now, but are you sure? We're setting at least two context
specific registers in there today, among them: GEN7_FF_THREAD_MODE (which is
important to performance).
   
   It looks like we're setting:
   - [BDW] RC_SLEEP_PSMI_CONTROL - 0x2050
  
  dword offset 0x1c in the context image
 
 power context, not logical context
 
   - [BDW, CHV] FF_THREAD_MODE - 0x20a0
  
  dword offset 0x2a in the context image
 
 Also power context
 
   - [CHV] RC_SLEEP_PSMI_CONTROL - 0x12050
  
  Kinda 

Re: [Intel-gfx] [Mesa-dev] [PATCH] drm/i915: Enable the HiZ RAW Stall Optimization on Gen8.

2015-01-11 Thread Kenneth Graunke
On Sunday, January 11, 2015 05:46:09 PM Ben Widawsky wrote:
 On Sun, Jan 11, 2015 at 04:05:25PM -0800, Kenneth Graunke wrote:
  On Sunday, January 11, 2015 01:49:41 PM Ben Widawsky wrote:
   On Sat, Jan 10, 2015 at 06:44:49PM -0800, Kenneth Graunke wrote:
This is an important optimization for avoiding read-after-write (RAW)
stalls in the HiZ buffer.  Certain workloads would run very slowly with
HiZ enabled, but run much faster with the hiz=false driconf option.
With this patch, they run at full speed even with HiZ.

Improves performance in OglVSInstancing by 3.2x on Broadwell GT3e
(Iris Pro 6200).

Thanks to Jesse Barnes for finding this missing bit!
Thanks to Chris Wilson for helping me find where to set it.

Signed-off-by: Kenneth Graunke kenn...@whitecape.org
Cc: Jesse Barnes jbar...@virtuousgeek.org
---
 drivers/gpu/drm/i915/intel_ringbuffer.c | 15 +++
 1 file changed, 15 insertions(+)

Here's an alternate patch which implements the workaround in the kernel
instead of Mesa.  It's probably better to do it there, since the kernel
does it on Haswell already.

diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index dabc1d8..23020d6 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -796,6 +796,16 @@ static int bdw_init_workarounds(struct intel_engine_cs *ring)
                          HDC_DONOT_FETCH_MEM_WHEN_MASKED |
                          (IS_BDW_GT3(dev) ? HDC_FENCE_DEST_SLM_DISABLE : 0));

+       /* From the Haswell PRM, Command Reference: Registers, CACHE_MODE_0:
+        * The Hierarchical Z RAW Stall Optimization allows non-overlapping
+        *  polygons in the same 8x4 pixel/sample area to be processed without
+        *  stalling waiting for the earlier ones to write to Hierarchical Z
+        *  buffer.
+        *
+        * This optimization is off by default for Broadwell; turn it on.
+        */
+       WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
+
        /* Wa4x4STCOptimizationDisable:bdw */
        WA_SET_BIT_MASKED(CACHE_MODE_1,
                          GEN8_4x4_STC_OPTIMIZATION_DISABLE);
@@ -836,6 +846,11 @@ static int chv_init_workarounds(struct intel_engine_cs *ring)
                          HDC_FORCE_NON_COHERENT |
                          HDC_DONOT_FETCH_MEM_WHEN_MASKED);

+       /* According to the CACHE_MODE_0 default value documentation, some
+        * CHV platforms disable this optimization by default.  Turn it on.
+        */
+       WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
+
        /* Improve HiZ throughput on CHV. */
        WA_SET_BIT_MASKED(HIZ_CHICKEN, CHV_HZ_8X8_MODE_IN_1X);
 
   
   I think you should do this as two separate patches, 1 per platform. For the BSW
   patch (given that I had the same functionality in the kernel patch I asked you
   to look at ;-) and FWIW, Jordan has numbers on BSW B-step with my kernel patch
   which we can use for the commit):
   Signed-off-by: Ben Widawsky b...@bwidawsk.net

  Huh, I don't recall seeing that kernel patch.  Sorry.  I guess I'll split it
  and resubmit...

 It's not my call, it's just nice to have platform specific bisection. And the
 patch wasn't on the list, it was the one I kept asking you to look at in my
 branch :-)

   I haven't looked at Broadwell docs, so I'll let someone else take care of that.

   I don't know if I agree with Chris that we should call these in the workaround
   section, but whatever. init_clock_gating is equally sucky.

  init_clock_gating doesn't work.  The register writes don't stick and they have
  no effect at all.  Setting them here makes them actually take effect in the
  context.
  
  --Ken
 
 Separate thread now, but are you sure? We're setting at least two context
 specific registers in there today, among them: GEN7_FF_THREAD_MODE (which is
 important to performance).

It looks like we're setting:

- [BDW] GAM_ECOCHK - 0x4090
- [BDW] CHICKEN_PAR1_1 - 0x42080
- [BDW] RC_SLEEP_PSMI_CONTROL - 0x2050
- [BDW, CHV] UCGCTL6 - 0x9430
- [BDW, CHV] FF_THREAD_MODE - 0x20a0
- [CHV] DSPCLK_GATE_D - display
- [CHV] MI_ARB_VLV - display
- [CHV] RC_SLEEP_PSMI_CONTROL - 0x12050
- [CHV] UCGCTL1 - 0x9400

I searched for all of these in the Register State Context tables for BDW
and CHV, and I didn't see any of them listed (including FF_THREAD_MODE or
0x20a0).  So I'm pretty sure these are not part of the context, and so they
should work.
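
The check described above (looking each register offset up in the Register State Context tables) amounts to a membership test. A minimal sketch in C, assuming the table is available as a plain array of offsets; the function name reg_in_table and the sample table in the usage note are hypothetical and are not the real BDW/CHV context tables:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true if `offset` appears in `table` (an array of `n` MMIO
 * offsets that are assumed to be saved and restored with the context).
 * The table contents are supplied by the caller; nothing here encodes
 * the real hardware tables.
 */
static bool reg_in_table(const uint32_t *table, size_t n, uint32_t offset)
{
	for (size_t i = 0; i < n; i++) {
		if (table[i] == offset)
			return true;
	}
	return false;
}
```

For example, with a placeholder table { 0x7000, 0x7004 }, reg_in_table() returns false for 0x20a0 (FF_THREAD_MODE) and true for 0x7000. A register that is absent from the table is not overwritten on context restore, which is the reasoning behind "so they should work".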

--Ken

Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org

Re: [Intel-gfx] [Mesa-dev] [PATCH] drm/i915: Enable the HiZ RAW Stall Optimization on Gen8.

2015-01-11 Thread Kenneth Graunke
On Sunday, January 11, 2015 05:46:09 PM Ben Widawsky wrote:
 On Sun, Jan 11, 2015 at 04:05:25PM -0800, Kenneth Graunke wrote:
  On Sunday, January 11, 2015 01:49:41 PM Ben Widawsky wrote:
   On Sat, Jan 10, 2015 at 06:44:49PM -0800, Kenneth Graunke wrote:
This is an important optimization for avoiding read-after-write (RAW)
stalls in the HiZ buffer.  Certain workloads would run very slowly with
HiZ enabled, but run much faster with the hiz=false driconf option.
With this patch, they run at full speed even with HiZ.

Improves performance in OglVSInstancing by 3.2x on Broadwell GT3e
(Iris Pro 6200).

Thanks to Jesse Barnes for finding this missing bit!
Thanks to Chris Wilson for helping me find where to set it.

Signed-off-by: Kenneth Graunke kenn...@whitecape.org
Cc: Jesse Barnes jbar...@virtuousgeek.org
---
 drivers/gpu/drm/i915/intel_ringbuffer.c | 15 +++
 1 file changed, 15 insertions(+)

Here's an alternate patch which implements the workaround in the kernel
instead of Mesa.  It's probably better to do it there, since the kernel
does it on Haswell already.

diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index dabc1d8..23020d6 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -796,6 +796,16 @@ static int bdw_init_workarounds(struct intel_engine_cs *ring)
                          HDC_DONOT_FETCH_MEM_WHEN_MASKED |
                          (IS_BDW_GT3(dev) ? HDC_FENCE_DEST_SLM_DISABLE : 0));

+       /* From the Haswell PRM, Command Reference: Registers, CACHE_MODE_0:
+        * The Hierarchical Z RAW Stall Optimization allows non-overlapping
+        *  polygons in the same 8x4 pixel/sample area to be processed without
+        *  stalling waiting for the earlier ones to write to Hierarchical Z
+        *  buffer.
+        *
+        * This optimization is off by default for Broadwell; turn it on.
+        */
+       WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
+
        /* Wa4x4STCOptimizationDisable:bdw */
        WA_SET_BIT_MASKED(CACHE_MODE_1,
                          GEN8_4x4_STC_OPTIMIZATION_DISABLE);
@@ -836,6 +846,11 @@ static int chv_init_workarounds(struct intel_engine_cs *ring)
                          HDC_FORCE_NON_COHERENT |
                          HDC_DONOT_FETCH_MEM_WHEN_MASKED);

+       /* According to the CACHE_MODE_0 default value documentation, some
+        * CHV platforms disable this optimization by default.  Turn it on.
+        */
+       WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
+
        /* Improve HiZ throughput on CHV. */
        WA_SET_BIT_MASKED(HIZ_CHICKEN, CHV_HZ_8X8_MODE_IN_1X);
 
   
   I think you should do this as two separate patches, 1 per platform. For 
   the BSW
   patch (given that I had the same functionality in the kernel patch I 
   asked you
   to look at ;-) and FWIW, Jordan has numbers on BSW B-step with my kernel 
   patch
   which we can use for the commit):
   Signed-off-by: Ben Widawsky b...@bwidawsk.net
  
  Huh, I don't recall seeing that kernel patch.  Sorry.  I guess I'll split it
  and resubmit...
  
 
 It's not my call, it's just nice to have platform specific bisection. And the
 patch wasn't on the list, it was the one I kept asking you to look at in my
 branch :-)

   I haven't looked at Broadwell docs, so I'll let someone else take care of that.
   
   I don't know if I agree with Chris that we should call these in the workaround
   section, but whatever. init_clock_gating is equally sucky.
  
  init_clock_gating doesn't work.  The register writes don't stick and they have
  no effect at all.  Setting them here makes them actually take effect in the
  context.
  
  --Ken
 
 Separate thread now, but are you sure? We're setting at least two context
 specific registers in there today, among them: GEN7_FF_THREAD_MODE (which is
 important to performance).
 
 AFAIK it should stick, and if it doesn't it's not expected behavior. Unless you
 know something I do not?

Jesse had suggested setting it in broadwell_init_clock_gating on January 5th,
and Valtteri tried it on January 7th.  He found no noticeable difference.
I tried it again, and confirmed his result: there was zero performance impact.

Setting it via an LRI in Mesa did have a performance impact.  I reverted my
Mesa patch, and tried setting it here, and it had the same performance impact.
I rebooted between kernels several times to confirm.  It works here, but it
doesn't there.

I'm pretty sure I confirmed the same result with this bit.  Feel free to try.

Perhaps we should move the rest of the per-context bits here instead of
*_init_clock_gating.  We 

Re: [Intel-gfx] [Mesa-dev] [PATCH] drm/i915: Enable the HiZ RAW Stall Optimization on Gen8.

2015-01-11 Thread Ben Widawsky
On Sun, Jan 11, 2015 at 07:05:21PM -0800, Kenneth Graunke wrote:
 On Sunday, January 11, 2015 05:46:09 PM Ben Widawsky wrote:
  On Sun, Jan 11, 2015 at 04:05:25PM -0800, Kenneth Graunke wrote:
   On Sunday, January 11, 2015 01:49:41 PM Ben Widawsky wrote:
On Sat, Jan 10, 2015 at 06:44:49PM -0800, Kenneth Graunke wrote:
 This is an important optimization for avoiding read-after-write (RAW)
 stalls in the HiZ buffer.  Certain workloads would run very slowly with
 HiZ enabled, but run much faster with the hiz=false driconf option.
 With this patch, they run at full speed even with HiZ.
 
 Improves performance in OglVSInstancing by 3.2x on Broadwell GT3e
 (Iris Pro 6200).
 
 Thanks to Jesse Barnes for finding this missing bit!
 Thanks to Chris Wilson for helping me find where to set it.
 
 Signed-off-by: Kenneth Graunke kenn...@whitecape.org
 Cc: Jesse Barnes jbar...@virtuousgeek.org
 ---
  drivers/gpu/drm/i915/intel_ringbuffer.c | 15 +++
  1 file changed, 15 insertions(+)
 
 Here's an alternate patch which implements the workaround in the kernel
 instead of Mesa.  It's probably better to do it there, since the kernel
 does it on Haswell already.
 
 diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
 index dabc1d8..23020d6 100644
 --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
 +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
 @@ -796,6 +796,16 @@ static int bdw_init_workarounds(struct intel_engine_cs *ring)
 HDC_DONOT_FETCH_MEM_WHEN_MASKED |
 (IS_BDW_GT3(dev) ? HDC_FENCE_DEST_SLM_DISABLE : 0));
  
 + /* From the Haswell PRM, Command Reference: Registers, CACHE_MODE_0:
 +  * The Hierarchical Z RAW Stall Optimization allows non-overlapping
 +  *  polygons in the same 8x4 pixel/sample area to be processed without
 +  *  stalling waiting for the earlier ones to write to Hierarchical Z
 +  *  buffer.
 +  *
 +  * This optimization is off by default for Broadwell; turn it on.
 +  */
 + WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
 +
   /* Wa4x4STCOptimizationDisable:bdw */
   WA_SET_BIT_MASKED(CACHE_MODE_1,
 GEN8_4x4_STC_OPTIMIZATION_DISABLE);
 @@ -836,6 +846,11 @@ static int chv_init_workarounds(struct intel_engine_cs *ring)
 HDC_FORCE_NON_COHERENT |
 HDC_DONOT_FETCH_MEM_WHEN_MASKED);
  
 + /* According to the CACHE_MODE_0 default value documentation, some
 +  * CHV platforms disable this optimization by default.  Turn it on.
 +  */
 + WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
 +
   /* Improve HiZ throughput on CHV. */
   WA_SET_BIT_MASKED(HIZ_CHICKEN, CHV_HZ_8X8_MODE_IN_1X);
  

I think you should do this as two separate patches, 1 per platform. For the BSW
patch (given that I had the same functionality in the kernel patch I asked you
to look at ;-) and FWIW, Jordan has numbers on BSW B-step with my kernel patch
which we can use for the commit):
Signed-off-by: Ben Widawsky b...@bwidawsk.net
   
   Huh, I don't recall seeing that kernel patch.  Sorry.  I guess I'll split it
   and resubmit...
   
  
  It's not my call, it's just nice to have platform specific bisection. And the
  patch wasn't on the list, it was the one I kept asking you to look at in my
  branch :-)
  
I haven't looked at Broadwell docs, so I'll let someone else take care of that.

I don't know if I agree with Chris that we should call these in the workaround
section, but whatever. init_clock_gating is equally sucky.
   
   init_clock_gating doesn't work.  The register writes don't stick and they have
   no effect at all.  Setting them here makes them actually take effect in the
   context.
   
   --Ken
  
  Separate thread now, but are you sure? We're setting at least two context
  specific registers in there today, among them: GEN7_FF_THREAD_MODE (which is
  important to performance).
 
 It looks like we're setting:
 
 - [BDW] GAM_ECOCHK - 0x4090

ECO registers are never ctx, I think

 - [BDW] CHICKEN_PAR1_1 - 0x42080

Display registers are never ctx either.

 - [BDW] RC_SLEEP_PSMI_CONTROL - 0x2050

dword offset 0x1c in the context image

 - [BDW, CHV] UCGCTL6 - 0x9430
 - [BDW, CHV] FF_THREAD_MODE - 0x20a0

dword offset 0x2a in the context image

 - [CHV] DSPCLK_GATE_D - display
 - [CHV] MI_ARB_VLV - display

More display...

 - [CHV] RC_SLEEP_PSMI_CONTROL - 0x12050

Kinda surprised this one isn't there. I'm not sure how it can work correctly.

 - [CHV] UCGCTL1 - 0x9400
 
 I searched for all of these in 

Re: [Intel-gfx] [Mesa-dev] [PATCH] drm/i915: Enable the HiZ RAW Stall Optimization on Gen8.

2015-01-11 Thread Ben Widawsky
On Sat, Jan 10, 2015 at 06:44:49PM -0800, Kenneth Graunke wrote:
 This is an important optimization for avoiding read-after-write (RAW)
 stalls in the HiZ buffer.  Certain workloads would run very slowly with
 HiZ enabled, but run much faster with the hiz=false driconf option.
 With this patch, they run at full speed even with HiZ.
 
 Improves performance in OglVSInstancing by 3.2x on Broadwell GT3e
 (Iris Pro 6200).
 
 Thanks to Jesse Barnes for finding this missing bit!
 Thanks to Chris Wilson for helping me find where to set it.
 
 Signed-off-by: Kenneth Graunke kenn...@whitecape.org
 Cc: Jesse Barnes jbar...@virtuousgeek.org
 ---
  drivers/gpu/drm/i915/intel_ringbuffer.c | 15 +++
  1 file changed, 15 insertions(+)
 
 Here's an alternate patch which implements the workaround in the kernel
 instead of Mesa.  It's probably better to do it there, since the kernel
 does it on Haswell already.
 
 diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
 index dabc1d8..23020d6 100644
 --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
 +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
 @@ -796,6 +796,16 @@ static int bdw_init_workarounds(struct intel_engine_cs *ring)
 HDC_DONOT_FETCH_MEM_WHEN_MASKED |
 (IS_BDW_GT3(dev) ? HDC_FENCE_DEST_SLM_DISABLE : 0));
  
 + /* From the Haswell PRM, Command Reference: Registers, CACHE_MODE_0:
 +  * The Hierarchical Z RAW Stall Optimization allows non-overlapping
 +  *  polygons in the same 8x4 pixel/sample area to be processed without
 +  *  stalling waiting for the earlier ones to write to Hierarchical Z
 +  *  buffer.
 +  *
 +  * This optimization is off by default for Broadwell; turn it on.
 +  */
 + WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
 +
   /* Wa4x4STCOptimizationDisable:bdw */
   WA_SET_BIT_MASKED(CACHE_MODE_1,
 GEN8_4x4_STC_OPTIMIZATION_DISABLE);
 @@ -836,6 +846,11 @@ static int chv_init_workarounds(struct intel_engine_cs *ring)
 HDC_FORCE_NON_COHERENT |
 HDC_DONOT_FETCH_MEM_WHEN_MASKED);
  
 + /* According to the CACHE_MODE_0 default value documentation, some
 +  * CHV platforms disable this optimization by default.  Turn it on.
 +  */
 + WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
 +
   /* Improve HiZ throughput on CHV. */
   WA_SET_BIT_MASKED(HIZ_CHICKEN, CHV_HZ_8X8_MODE_IN_1X);
  

I think you should do this as two separate patches, 1 per platform. For the BSW
patch (given that I had the same functionality in the kernel patch I asked you
to look at ;-) and FWIW, Jordan has numbers on BSW B-step with my kernel patch
which we can use for the commit):
Signed-off-by: Ben Widawsky b...@bwidawsk.net

I haven't looked at Broadwell docs, so I'll let someone else take care of that.

I don't know if I agree with Chris that we should call these in the workaround
section, but whatever. init_clock_gating is equally sucky.
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [Mesa-dev] [PATCH] drm/i915: Enable the HiZ RAW Stall Optimization on Gen8.

2015-01-11 Thread Ben Widawsky
On Sun, Jan 11, 2015 at 06:53:32PM -0800, Kenneth Graunke wrote:

[snip]

 
 Jesse had suggested setting it in broadwell_init_clock_gating on January 5th,
 and Valtteri tried it on January 7th.  He found no noticeable difference.
 I tried it again, and confirmed his result: there was zero performance impact.
 
 Setting it via an LRI in Mesa did have a performance impact.  I reverted my
 Mesa patch, and tried setting it here, and it had the same performance impact.
 I rebooted between kernels several times to confirm.  It works here, but it
 doesn't there.
 
 I'm pretty sure I confirmed the same result with this bit.  Feel free to try.
 

That's okay. I believe you, I just thought you may have known something I didn't.

 Perhaps we should move the rest of the per-context bits here instead of
 *_init_clock_gating.  We should also confirm that the other bits are actually
 having an effect.

If this is the behavior we're getting, we should absolutely do this.

 
 I don't know why it works on Haswell, but it does there - the HiZ RAW stall
 bit is set via haswell_init_clock_gating, and it's clearly having an impact.
 Maybe it has something to do with the golden context, which is new on BDW.
 But I'm probably wrong about that.  Setting it when a context is active does
 seem more reliable...

The golden context stuff has been backported to all gens that support hardware
contexts and/or execlists, so I don't think it's that.

The big difference between the workarounds and init clock gating is that the
former are applied via LRI and the latter via MMIO. (Also, the latter is run
again on resume, and I don't think the former is.)
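
For anyone following along, here is a minimal sketch of what "via LRI" means in practice: the write is encoded as an MI_LOAD_REGISTER_IMM command in the ring, so the command streamer executes it while a context is active, instead of the CPU poking the register over MMIO. The opcode encoding below is what I believe i915_reg.h uses, but treat it as an assumption; emit_lri() is a hypothetical helper for illustration, not a function in the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed encodings, modeled on the i915 MI_INSTR() convention. */
#define MI_INSTR(opcode, flags)  (((uint32_t)(opcode) << 23) | (flags))
#define MI_LOAD_REGISTER_IMM(n)  MI_INSTR(0x22, 2 * (n) - 1)

/*
 * Hypothetical helper: write one LRI command (header, register offset,
 * value) into a command buffer and return the number of dwords emitted.
 */
static size_t emit_lri(uint32_t *cs, uint32_t reg, uint32_t value)
{
	assert(cs != NULL);

	cs[0] = MI_LOAD_REGISTER_IMM(1); /* one register/value pair follows */
	cs[1] = reg;                     /* MMIO offset of the target register */
	cs[2] = value;                   /* value the command streamer writes */
	return 3;
}
```

The point of the sketch is only the ordering: the register write happens as part of the command stream, which is presumably why it lands in the context that is running at the time.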

Thinking out loud - what's the default setting for execlists on BDW now?  For
execlists my plan (when it was my plan to have) had always been to manually set
the register in the context image before loading it. We don't do that with the
existing code; we use the old ringbuffer style of hoping it preserves the
contents. I wonder if that's the distinction from HSW.

 
 --Ken




Re: [Intel-gfx] [Mesa-dev] [PATCH] drm/i915: Enable the HiZ RAW Stall Optimization on Gen8.

2015-01-11 Thread Kenneth Graunke
On Sunday, January 11, 2015 01:49:41 PM Ben Widawsky wrote:
 On Sat, Jan 10, 2015 at 06:44:49PM -0800, Kenneth Graunke wrote:
  This is an important optimization for avoiding read-after-write (RAW)
  stalls in the HiZ buffer.  Certain workloads would run very slowly with
  HiZ enabled, but run much faster with the hiz=false driconf option.
  With this patch, they run at full speed even with HiZ.
  
  Improves performance in OglVSInstancing by 3.2x on Broadwell GT3e
  (Iris Pro 6200).
  
  Thanks to Jesse Barnes for finding this missing bit!
  Thanks to Chris Wilson for helping me find where to set it.
  
  Signed-off-by: Kenneth Graunke kenn...@whitecape.org
  Cc: Jesse Barnes jbar...@virtuousgeek.org
  ---
   drivers/gpu/drm/i915/intel_ringbuffer.c | 15 +++
   1 file changed, 15 insertions(+)
  
  Here's an alternate patch which implements the workaround in the kernel
  instead of Mesa.  It's probably better to do it there, since the kernel
  does it on Haswell already.
  
  diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
  index dabc1d8..23020d6 100644
  --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
  +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
  @@ -796,6 +796,16 @@ static int bdw_init_workarounds(struct intel_engine_cs *ring)
HDC_DONOT_FETCH_MEM_WHEN_MASKED |
(IS_BDW_GT3(dev) ? HDC_FENCE_DEST_SLM_DISABLE : 0));
   
  +   /* From the Haswell PRM, Command Reference: Registers, CACHE_MODE_0:
  +* The Hierarchical Z RAW Stall Optimization allows non-overlapping
  +*  polygons in the same 8x4 pixel/sample area to be processed without
  +*  stalling waiting for the earlier ones to write to Hierarchical Z
  +*  buffer.
  +*
  +* This optimization is off by default for Broadwell; turn it on.
  +*/
  +   WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
  +
  /* Wa4x4STCOptimizationDisable:bdw */
  WA_SET_BIT_MASKED(CACHE_MODE_1,
GEN8_4x4_STC_OPTIMIZATION_DISABLE);
  @@ -836,6 +846,11 @@ static int chv_init_workarounds(struct intel_engine_cs *ring)
HDC_FORCE_NON_COHERENT |
HDC_DONOT_FETCH_MEM_WHEN_MASKED);
   
  +   /* According to the CACHE_MODE_0 default value documentation, some
  +* CHV platforms disable this optimization by default.  Turn it on.
  +*/
  +   WA_CLR_BIT_MASKED(CACHE_MODE_0_GEN7, HIZ_RAW_STALL_OPT_DISABLE);
  +
  /* Improve HiZ throughput on CHV. */
  WA_SET_BIT_MASKED(HIZ_CHICKEN, CHV_HZ_8X8_MODE_IN_1X);
   
 
 I think you should do this as two separate patches, 1 per platform. For the BSW
 patch (given that I had the same functionality in the kernel patch I asked you
 to look at ;-) and FWIW, Jordan has numbers on BSW B-step with my kernel patch
 which we can use for the commit):
 Signed-off-by: Ben Widawsky b...@bwidawsk.net

Huh, I don't recall seeing that kernel patch.  Sorry.  I guess I'll split it
and resubmit...

 I haven't looked at Broadwell docs, so I'll let someone else take care of that.
 
 I don't know if I agree with Chris that we should call these in the workaround
 section, but whatever. init_clock_gating is equally sucky.

init_clock_gating doesn't work.  The register writes don't stick and they have
no effect at all.  Setting them here makes them actually take effect in the
context.
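
To illustrate what the WA_CLR_BIT_MASKED() line in the patch ends up writing, here is a minimal sketch of the masked-register convention, assuming CACHE_MODE_0 follows it: the upper 16 bits of the written value select which of the lower 16 bits may change. The helper names are hypothetical (modeled on the driver's masked-bit macros), and the register offset and bit position are my recollection of i915_reg.h, so treat them as assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values; check against i915_reg.h before relying on them. */
#define CACHE_MODE_0_GEN7          0x7000u
#define HIZ_RAW_STALL_OPT_DISABLE  (1u << 2)

/* Value to write so that 'bit' is set: mask bit and data bit both 1. */
static inline uint32_t masked_bit_enable(uint32_t bit)
{
	assert((bit & 0xffff0000u) == 0); /* only bits 15:0 are data bits */
	return (bit << 16) | bit;
}

/* Value to write so that 'bit' is cleared: mask bit 1, data bit 0. */
static inline uint32_t masked_bit_disable(uint32_t bit)
{
	assert((bit & 0xffff0000u) == 0);
	return bit << 16;
}

/* Software model of how a masked write updates the 16 data bits. */
static inline uint32_t apply_masked_write(uint32_t old, uint32_t write)
{
	uint32_t mask = write >> 16;

	return ((old & ~mask) | (write & mask)) & 0xffffu;
}
```

Clearing HIZ_RAW_STALL_OPT_DISABLE this way leaves every other bit in the register untouched, which is why no read-modify-write is needed.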

--Ken
