Re: [Mesa-dev] [RFC PATCH] nir: Transform 4*x into x << 2 during late optimizations.

2015-05-18 Thread Matt Turner
On Fri, May 8, 2015 at 3:36 AM, Kenneth Graunke kenn...@whitecape.org wrote:
 According to Glenn, shifts on R600 have 5x the throughput as multiplies.

 Intel GPUs have strange integer multiplication restrictions - on most
 hardware, MUL actually only does a 32-bit x 16-bit multiply.  This
 means the arguments aren't commutative, which can limit our constant
 propagation options.  SHL has no such restrictions.

 Shifting is probably reasonable on most people's hardware, so let's just
 do that.

 i965 shader-db results (using NIR for VS):
 total instructions in shared programs: 7432587 -> 7388982 (-0.59%)
 instructions in affected programs: 1360411 -> 1316806 (-3.21%)
 helped:5772
 HURT:  0

Just to close the loop, I ran shader-db with this patch on top of my
integer multiplication series, and it doesn't change any instruction
counts on i965. (I also tried with all other power-of-two
multiplications for shift values < 31.)

We may want to do it for other reasons though.
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [RFC PATCH] nir: Transform 4*x into x << 2 during late optimizations.

2015-05-18 Thread Jason Ekstrand
On Mon, May 18, 2015 at 3:28 PM, Kenneth Graunke kenn...@whitecape.org wrote:
 On Monday, May 18, 2015 11:26:05 AM Matt Turner wrote:
 On Fri, May 8, 2015 at 3:36 AM, Kenneth Graunke kenn...@whitecape.org 
 wrote:
  According to Glenn, shifts on R600 have 5x the throughput as multiplies.
 
  Intel GPUs have strange integer multiplication restrictions - on most
  hardware, MUL actually only does a 32-bit x 16-bit multiply.  This
  means the arguments aren't commutative, which can limit our constant
  propagation options.  SHL has no such restrictions.
 
  Shifting is probably reasonable on most people's hardware, so let's just
  do that.
 
  i965 shader-db results (using NIR for VS):
  total instructions in shared programs: 7432587 -> 7388982 (-0.59%)
  instructions in affected programs: 1360411 -> 1316806 (-3.21%)
  helped:5772
  HURT:  0

 Just to close the loop, I ran shader-db with this patch on top of my
 integer multiplication series, and it doesn't change any instruction
 counts on i965. (I also tried with all other power-of-two
 multiplications for shift values < 31.)

 We may want to do it for other reasons though.

 If we're going to do it because shifts are faster/nicer than multiplies,
 then we should probably just do it for powers-of-two in general.
 Unfortunately, opt_algebraic doesn't really lend itself to that without
 adding some sort of power of two infrastructure.

 I guess we could optimize things like:
 a * 2^n  =>  a << n
 a % 2^n  =>  a & (2^n - 1)
 a / 2^n  =>  a >> n (possibly only for unsigned? (*))
 ...others?

 The first is a clear win on r600, and the latter are clear wins on i965,
 though they may be rather rare...

 We could add a custom NIR pass.  Or we could just have backends
 check for an immediate second operand and do this sort of stuff.  Or
 optimize it themselves.  *shrug*

Or we could just do

for i in range(32):
    optimizations.append((('imul', a, 1 << i), ('ishl', a, i)))
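A minimal sketch of what that loop produces, assuming a plain string stands in for the pattern variable `a` and that 32-bit wraparound is modeled with a mask. The helper names (`power_of_two_rules`, `imul32`, `ishl32`) are hypothetical and not part of nir_opt_algebraic.py; the loop at the end only checks that each generated rule preserves the low 32 bits of the result.

```python
# Hypothetical stand-in for the pattern variable 'a' in nir_opt_algebraic.py.
a = 'a'

MASK32 = 0xFFFFFFFF


def power_of_two_rules(bits=32):
    """Build one (search, replace) pair per power-of-two multiplier."""
    return [(('imul', a, 1 << i), ('ishl', a, i)) for i in range(bits)]


def imul32(x, y):
    """32-bit integer multiply with wraparound."""
    return (x * y) & MASK32


def ishl32(x, n):
    """32-bit left shift with wraparound."""
    return (x << n) & MASK32


rules = power_of_two_rules()

# Check that every rule is a valid rewrite on a few sample bit patterns.
samples = [0, 1, 3, 0x7FFFFFFF, 0x80000001, 0xFFFFFFFF]
for (_, _, const), (_, _, shift) in rules:
    for x in samples:
        assert imul32(x, const) == ishl32(x, shift)
```

Generating the rules up front keeps the rule table declarative, at the cost of 32 extra entries for the matcher to walk.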

--Jason


Re: [Mesa-dev] [RFC PATCH] nir: Transform 4*x into x << 2 during late optimizations.

2015-05-18 Thread Kenneth Graunke
On Monday, May 18, 2015 11:26:05 AM Matt Turner wrote:
 On Fri, May 8, 2015 at 3:36 AM, Kenneth Graunke kenn...@whitecape.org wrote:
  According to Glenn, shifts on R600 have 5x the throughput as multiplies.
 
  Intel GPUs have strange integer multiplication restrictions - on most
  hardware, MUL actually only does a 32-bit x 16-bit multiply.  This
  means the arguments aren't commutative, which can limit our constant
  propagation options.  SHL has no such restrictions.
 
  Shifting is probably reasonable on most people's hardware, so let's just
  do that.
 
  i965 shader-db results (using NIR for VS):
  total instructions in shared programs: 7432587 -> 7388982 (-0.59%)
  instructions in affected programs: 1360411 -> 1316806 (-3.21%)
  helped:5772
  HURT:  0
 
 Just to close the loop, I ran shader-db with this patch on top of my
 integer multiplication series, and it doesn't change any instruction
 counts on i965. (I also tried with all other power-of-two
 multiplications for shift values < 31.)
 
 We may want to do it for other reasons though.

If we're going to do it because shifts are faster/nicer than multiplies,
then we should probably just do it for powers-of-two in general.
Unfortunately, opt_algebraic doesn't really lend itself to that without
adding some sort of power of two infrastructure.

I guess we could optimize things like:
a * 2^n  =>  a << n
a % 2^n  =>  a & (2^n - 1)
a / 2^n  =>  a >> n (possibly only for unsigned? (*))
...others?

The first is a clear win on r600, and the latter are clear wins on i965,
though they may be rather rare...

We could add a custom NIR pass.  Or we could just have backends
check for an immediate second operand and do this sort of stuff.  Or
optimize it themselves.  *shrug*

(*) http://lists.freedesktop.org/archives/mesa-dev/2014-April/057364.html
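The "possibly only for unsigned" caveat can be illustrated with a small sketch. It assumes signed division truncates toward zero (as in C) and that a signed right shift is arithmetic; the function names are made up for illustration. For unsigned 32-bit values all three identities hold, but for a negative dividend the shift rounds toward negative infinity while the division rounds toward zero.

```python
MASK32 = 0xFFFFFFFF


def idiv_trunc(x, y):
    """Signed division that truncates toward zero, as in C."""
    q = abs(x) // abs(y)
    return -q if (x < 0) != (y < 0) else q


def ishr(x, n):
    """Arithmetic shift right; Python's >> on ints rounds toward -infinity."""
    return x >> n


# Unsigned 32-bit values: multiply, modulo and divide all map onto bit ops.
for n in range(32):
    p = 1 << n
    for x in (0, 1, 5, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF):
        assert (x * p) & MASK32 == (x << n) & MASK32
        assert x % p == x & (p - 1)
        assert x // p == x >> n

# Signed values: the divide and the shift disagree for negative inputs.
signed_div = idiv_trunc(-7, 4)   # truncates toward zero
signed_shift = ishr(-7, 2)       # rounds toward -infinity
assert signed_div != signed_shift
```

A signed rewrite would therefore need a fix-up for negative inputs, which is why restricting the division rule to unsigned operands is the simpler option.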




[Mesa-dev] [RFC PATCH] nir: Transform 4*x into x << 2 during late optimizations.

2015-05-08 Thread Kenneth Graunke
According to Glenn, shifts on R600 have 5x the throughput as multiplies.

Intel GPUs have strange integer multiplication restrictions - on most
hardware, MUL actually only does a 32-bit x 16-bit multiply.  This
means the arguments aren't commutative, which can limit our constant
propagation options.  SHL has no such restrictions.
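
To make the restriction concrete, here is a toy model, assuming unsigned 32-bit arithmetic and a hypothetical `mul_32x16` primitive whose second operand must fit in 16 bits. It is only a sketch of the arithmetic, not a description of how the i965 backend actually lowers multiplies: a full 32x32 multiply needs two narrow multiplies plus a shift and an add, while a shift needs no such decomposition.

```python
MASK32 = 0xFFFFFFFF


def mul_32x16(x, y16):
    """Model of a multiplier taking a 32-bit and a 16-bit operand."""
    assert 0 <= y16 <= 0xFFFF, "second operand must fit in 16 bits"
    return (x * y16) & MASK32


def mul_32x32_low(x, y):
    """Low 32 bits of x * y, built from two 32x16 multiplies."""
    lo = mul_32x16(x, y & 0xFFFF)
    hi = mul_32x16(x, (y >> 16) & 0xFFFF)
    return (lo + (hi << 16)) & MASK32


x, y = 0x12345678, 0x9ABCDEF0
assert mul_32x32_low(x, y) == (x * y) & MASK32

# A small constant only fits the narrow slot, so operand order matters.
assert mul_32x16(x, 4) == (x << 2) & MASK32
```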

Shifting is probably reasonable on most people's hardware, so let's just
do that.

i965 shader-db results (using NIR for VS):
total instructions in shared programs: 7432587 -> 7388982 (-0.59%)
instructions in affected programs: 1360411 -> 1316806 (-3.21%)
helped:5772
HURT:  0

Signed-off-by: Kenneth Graunke kenn...@whitecape.org
Cc: matts...@gmail.com
Cc: ja...@jlekstrand.net
---
 src/glsl/nir/nir_opt_algebraic.py | 5 +
 1 file changed, 5 insertions(+)

So...I found a bizarre issue with this patch.

   (('imul', 4, a), ('ishl', a, 2)),

totally optimizes things.  However...

   (('imul', a, 4), ('ishl', a, 2)),

doesn't seem to do anything, even though imul is commutative, and nir_search
should totally handle that...

 ▄▄  ▄▄▄▄    ▄   ▄▄
 ██  ██   ▀▀▀██▀▀▀  ███  ██
 ▀█▄ ██ ▄█▀      ██ ▄█▀  ██
  ██ ██ ██   ██  ██  ██   ▄██▀   ██
  ███▀▀███   ██  ██   ██ ▀▀
  ███  ███  ▄██  ██▄ ██   ▄▄ ▄▄
  ▀▀▀  ▀▀▀  ▀▀▀▀ ▀▀   ▀▀ ▀▀

If you know why, let me know, otherwise I may have to look into it when more
awake.
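
For reference, this is the behavior one would expect from a matcher that honors commutativity, written as a toy sketch. It is not nir_search: the `COMMUTATIVE` set, the `match` function and the tuple encoding of expressions are all assumptions made for illustration. The matcher tries the operands in the written order first and then swapped, so both spellings of the rule should hit.

```python
# Hypothetical set of opcodes treated as commutative by this toy matcher.
COMMUTATIVE = {'imul', 'iadd'}


def match(pattern, expr, bindings):
    """Return True if expr matches pattern, recording variable bindings."""
    if isinstance(pattern, str):            # a variable such as 'a'
        if pattern in bindings:
            return bindings[pattern] == expr
        bindings[pattern] = expr
        return True
    if isinstance(pattern, int):            # a constant operand
        return pattern == expr
    # Otherwise pattern is a tuple: (opcode, src0, src1).
    if not isinstance(expr, tuple) or len(expr) != len(pattern):
        return False
    if expr[0] != pattern[0]:
        return False
    orders = [expr[1:]]
    if pattern[0] in COMMUTATIVE and len(expr) == 3:
        orders.append((expr[2], expr[1]))   # also try swapped operands
    for order in orders:
        trial = dict(bindings)              # do not leak failed bindings
        if all(match(p, e, trial) for p, e in zip(pattern[1:], order)):
            bindings.clear()
            bindings.update(trial)
            return True
    return False


rule = ('imul', 'a', 4)
found_direct, found_swapped = {}, {}
assert match(rule, ('imul', 'x', 4), found_direct)
assert match(rule, ('imul', 4, 'x'), found_swapped)
```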

diff --git a/src/glsl/nir/nir_opt_algebraic.py b/src/glsl/nir/nir_opt_algebraic.py
index 400d60e..350471f 100644
--- a/src/glsl/nir/nir_opt_algebraic.py
+++ b/src/glsl/nir/nir_opt_algebraic.py
@@ -247,6 +247,11 @@ late_optimizations = [
    (('fge', ('fadd', a, b), 0.0), ('fge', a, ('fneg', b))),
    (('feq', ('fadd', a, b), 0.0), ('feq', a, ('fneg', b))),
    (('fne', ('fadd', a, b), 0.0), ('fne', a, ('fneg', b))),
+
+   # Multiplication by 4 comes up fairly often in indirect offset calculations.
+   # Some GPUs have weird integer multiplication limitations, but shifts should work
+   # equally well everywhere.
+   (('imul', 4, a), ('ishl', a, 2)),
 ]
 
 print nir_algebraic.AlgebraicPass(nir_opt_algebraic, optimizations).render()
-- 
2.4.0



Re: [Mesa-dev] [RFC PATCH] nir: Transform 4*x into x << 2 during late optimizations.

2015-05-08 Thread Jason Ekstrand
On Fri, May 8, 2015 at 3:36 AM, Kenneth Graunke kenn...@whitecape.org wrote:
 According to Glenn, shifts on R600 have 5x the throughput as multiplies.

 Intel GPUs have strange integer multiplication restrictions - on most
 hardware, MUL actually only does a 32-bit x 16-bit multiply.  This
 means the arguments aren't commutative, which can limit our constant
 propagation options.  SHL has no such restrictions.

 Shifting is probably reasonable on most people's hardware, so let's just
 do that.

 i965 shader-db results (using NIR for VS):
 total instructions in shared programs: 7432587 -> 7388982 (-0.59%)
 instructions in affected programs: 1360411 -> 1316806 (-3.21%)
 helped:5772
 HURT:  0

 Signed-off-by: Kenneth Graunke kenn...@whitecape.org
 Cc: matts...@gmail.com
 Cc: ja...@jlekstrand.net
 ---
  src/glsl/nir/nir_opt_algebraic.py | 5 +
  1 file changed, 5 insertions(+)

 So...I found a bizarre issue with this patch.

(('imul', 4, a), ('ishl', a, 2)),

 totally optimizes things.  However...

(('imul', a, 4), ('ishl', a, 2)),

 doesn't seem to do anything, even though imul is commutative, and nir_search
 should totally handle that...

  ▄▄  ▄▄▄▄    ▄   ▄▄
  ██  ██   ▀▀▀██▀▀▀  ███  ██
  ▀█▄ ██ ▄█▀      ██ ▄█▀  ██
   ██ ██ ██   ██  ██  ██   ▄██▀   ██
   ███▀▀███   ██  ██   ██ ▀▀
   ███  ███  ▄██  ██▄ ██   ▄▄ ▄▄
   ▀▀▀  ▀▀▀  ▀▀▀▀ ▀▀   ▀▀ ▀▀

 If you know why, let me know, otherwise I may have to look into it when more
 awake.

I figured it out and I have a patch.  Unfortunately, it regresses a
few programs and loses 8 SIMD8 programs so I'm doing some more
investigation.  I'll send it out soon.

 diff --git a/src/glsl/nir/nir_opt_algebraic.py b/src/glsl/nir/nir_opt_algebraic.py
 index 400d60e..350471f 100644
 --- a/src/glsl/nir/nir_opt_algebraic.py
 +++ b/src/glsl/nir/nir_opt_algebraic.py
 @@ -247,6 +247,11 @@ late_optimizations = [
 (('fge', ('fadd', a, b), 0.0), ('fge', a, ('fneg', b))),
 (('feq', ('fadd', a, b), 0.0), ('feq', a, ('fneg', b))),
 (('fne', ('fadd', a, b), 0.0), ('fne', a, ('fneg', b))),
 +
 +   # Multiplication by 4 comes up fairly often in indirect offset calculations.
 +   # Some GPUs have weird integer multiplication limitations, but shifts should work
 +   # equally well everywhere.
 +   (('imul', 4, a), ('ishl', a, 2)),
  ]

  print nir_algebraic.AlgebraicPass(nir_opt_algebraic, optimizations).render()
 --
 2.4.0



Re: [Mesa-dev] [RFC PATCH] nir: Transform 4*x into x << 2 during late optimizations.

2015-05-08 Thread Jason Ekstrand
On Fri, May 8, 2015 at 11:11 AM, Ian Romanick i...@freedesktop.org wrote:
 On 05/08/2015 03:36 AM, Kenneth Graunke wrote:
 According to Glenn, shifts on R600 have 5x the throughput as multiplies.

 Intel GPUs have strange integer multiplication restrictions - on most
 hardware, MUL actually only does a 32-bit x 16-bit multiply.  This
 means the arguments aren't commutative, which can limit our constant
 propagation options.  SHL has no such restrictions.

 Shifting is probably reasonable on most people's hardware, so let's just
 do that.

 i965 shader-db results (using NIR for VS):
 total instructions in shared programs: 7432587 -> 7388982 (-0.59%)
 instructions in affected programs: 1360411 -> 1316806 (-3.21%)
 helped:5772
 HURT:  0

 Signed-off-by: Kenneth Graunke kenn...@whitecape.org
 Cc: matts...@gmail.com
 Cc: ja...@jlekstrand.net
 ---
  src/glsl/nir/nir_opt_algebraic.py | 5 +
  1 file changed, 5 insertions(+)

 So...I found a bizarre issue with this patch.

(('imul', 4, a), ('ishl', a, 2)),

 totally optimizes things.  However...

(('imul', a, 4), ('ishl', a, 2)),

 doesn't seem to do anything, even though imul is commutative, and nir_search
 should totally handle that...

  ▄▄  ▄▄▄▄    ▄   ▄▄
  ██  ██   ▀▀▀██▀▀▀  ███  ██
  ▀█▄ ██ ▄█▀      ██ ▄█▀  ██
   ██ ██ ██   ██  ██  ██   ▄██▀   ██
   ███▀▀███   ██  ██   ██ ▀▀
   ███  ███  ▄██  ██▄ ██   ▄▄ ▄▄
   ▀▀▀  ▀▀▀  ▀▀▀▀ ▀▀   ▀▀ ▀▀

 If you know why, let me know, otherwise I may have to look into it when more
 awake.

 I've noticed a couple other weird things that I have been unable to
 understand.  Shaders like the one below end with fmul/ffma instead of
 flrp, for example.  I understand why that happens from GLSL IR
 opt_algebraic, but it seems like nir_opt_algebraic should handle it.

Just a guess, but it's quite possibly due to the commutative
operations bug I just sent a patch for.
--Jason

 [require]
 GLSL >= 1.30

 [vertex shader]
 in vec4 v;
 in vec2 tc_in;

 out vec2 tc;

 void main() {
 gl_Position = v;
 tc = tc_in;
 }

 [fragment shader]
 in vec2 tc;

 out vec4 color;

 uniform sampler2D s;
 uniform float a;
 uniform vec3 base_color;

 void main() {
 vec3 tex_color = texture(s, tc).xyz;

 color.xyz = (base_color * a) + (tex_color * (1.0 - a));
 color.a = 1.0;
 }



 diff --git a/src/glsl/nir/nir_opt_algebraic.py b/src/glsl/nir/nir_opt_algebraic.py
 index 400d60e..350471f 100644
 --- a/src/glsl/nir/nir_opt_algebraic.py
 +++ b/src/glsl/nir/nir_opt_algebraic.py
 @@ -247,6 +247,11 @@ late_optimizations = [
 (('fge', ('fadd', a, b), 0.0), ('fge', a, ('fneg', b))),
 (('feq', ('fadd', a, b), 0.0), ('feq', a, ('fneg', b))),
 (('fne', ('fadd', a, b), 0.0), ('fne', a, ('fneg', b))),
 +
 +   # Multiplication by 4 comes up fairly often in indirect offset calculations.
 +   # Some GPUs have weird integer multiplication limitations, but shifts should work
 +   # equally well everywhere.
 +   (('imul', 4, a), ('ishl', a, 2)),

 This should be conditionalized on whether the platform has native integers.

  ]

  print nir_algebraic.AlgebraicPass(nir_opt_algebraic, optimizations).render()




Re: [Mesa-dev] [RFC PATCH] nir: Transform 4*x into x << 2 during late optimizations.

2015-05-08 Thread Eric Anholt
Ilia Mirkin imir...@alum.mit.edu writes:

 On Fri, May 8, 2015 at 6:36 AM, Kenneth Graunke kenn...@whitecape.org wrote:
 +   # Multiplication by 4 comes up fairly often in indirect offset calculations.
 +   # Some GPUs have weird integer multiplication limitations, but shifts should work
 +   # equally well everywhere.
 +   (('imul', 4, a), ('ishl', a, 2)),

 Not sure what the cost of doing it this way is, but really you want all
 powers of 2... and also udiv -> shr. Since this is python, should be
 easy enough to append onto that list. AFAIK all GPUs prefer a shift
 over a mul. Adreno doesn't have 32-bit imul in the first place (and no
 idiv either).

I can confirm that I'd love shifts instead of imul/udiv on vc4.




Re: [Mesa-dev] [RFC PATCH] nir: Transform 4*x into x << 2 during late optimizations.

2015-05-08 Thread Ilia Mirkin
On Fri, May 8, 2015 at 6:36 AM, Kenneth Graunke kenn...@whitecape.org wrote:
 +   # Multiplication by 4 comes up fairly often in indirect offset calculations.
 +   # Some GPUs have weird integer multiplication limitations, but shifts should work
 +   # equally well everywhere.
 +   (('imul', 4, a), ('ishl', a, 2)),

Not sure what the cost of doing it this way is, but really you want all
powers of 2... and also udiv -> shr. Since this is python, should be
easy enough to append onto that list. AFAIK all GPUs prefer a shift
over a mul. Adreno doesn't have 32-bit imul in the first place (and no
idiv either).

In nouveau/codegen we just have a single check for whether the
immediate is a power of 2, perhaps that can be encoded here in some
way.
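
A rough sketch of what such a single check could look like, written in Python to match nir_opt_algebraic.py rather than taken from nouveau/codegen. The function names and the `'ushr'` opcode spelling for the unsigned shift are assumptions: the immediate is tested with the usual `v & (v - 1)` trick and, if it is a power of two, the multiply or unsigned divide becomes a shift by its log2.

```python
def is_power_of_two(value):
    """True for 1, 2, 4, 8, ... (exactly one bit set)."""
    return value > 0 and (value & (value - 1)) == 0


def log2_exact(value):
    """Bit index of the single set bit of a power of two."""
    assert is_power_of_two(value)
    return value.bit_length() - 1


def strength_reduce(op, src, imm):
    """Rewrite (op, src, imm) into a shift when imm is a power of two."""
    if is_power_of_two(imm):
        if op == 'imul':
            return ('ishl', src, log2_exact(imm))
        if op == 'udiv':
            return ('ushr', src, log2_exact(imm))
    return (op, src, imm)


reduced_mul = strength_reduce('imul', 'x', 4)
reduced_div = strength_reduce('udiv', 'x', 8)
untouched = strength_reduce('imul', 'x', 6)
```

Compared with enumerating 32 separate rules, one predicate covers every shift amount and extends to other bit sizes without growing the rule table.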

Cheers,

  -ilia


Re: [Mesa-dev] [RFC PATCH] nir: Transform 4*x into x << 2 during late optimizations.

2015-05-08 Thread Ian Romanick
On 05/08/2015 03:36 AM, Kenneth Graunke wrote:
 According to Glenn, shifts on R600 have 5x the throughput as multiplies.
 
 Intel GPUs have strange integer multiplication restrictions - on most
 hardware, MUL actually only does a 32-bit x 16-bit multiply.  This
 means the arguments aren't commutative, which can limit our constant
 propagation options.  SHL has no such restrictions.
 
 Shifting is probably reasonable on most people's hardware, so let's just
 do that.
 
 i965 shader-db results (using NIR for VS):
 total instructions in shared programs: 7432587 -> 7388982 (-0.59%)
 instructions in affected programs: 1360411 -> 1316806 (-3.21%)
 helped:5772
 HURT:  0
 
 Signed-off-by: Kenneth Graunke kenn...@whitecape.org
 Cc: matts...@gmail.com
 Cc: ja...@jlekstrand.net
 ---
  src/glsl/nir/nir_opt_algebraic.py | 5 +
  1 file changed, 5 insertions(+)
 
 So...I found a bizarre issue with this patch.
 
(('imul', 4, a), ('ishl', a, 2)),
 
 totally optimizes things.  However...
 
(('imul', a, 4), ('ishl', a, 2)),
 
 doesn't seem to do anything, even though imul is commutative, and nir_search
 should totally handle that...
 
  ▄▄  ▄▄▄▄    ▄   ▄▄
  ██  ██   ▀▀▀██▀▀▀  ███  ██
  ▀█▄ ██ ▄█▀      ██ ▄█▀  ██
   ██ ██ ██   ██  ██  ██   ▄██▀   ██
   ███▀▀███   ██  ██   ██ ▀▀
   ███  ███  ▄██  ██▄ ██   ▄▄ ▄▄
   ▀▀▀  ▀▀▀  ▀▀▀▀ ▀▀   ▀▀ ▀▀
 
 If you know why, let me know, otherwise I may have to look into it when more
 awake.

I've noticed a couple other weird things that I have been unable to
understand.  Shaders like the one below end with fmul/ffma instead of
flrp, for example.  I understand why that happens from GLSL IR
opt_algebraic, but it seems like nir_opt_algebraic should handle it.

[require]
GLSL >= 1.30

[vertex shader]
in vec4 v;
in vec2 tc_in;

out vec2 tc;

void main() {
gl_Position = v;
tc = tc_in;
}

[fragment shader]
in vec2 tc;

out vec4 color;

uniform sampler2D s;
uniform float a;
uniform vec3 base_color;

void main() {
vec3 tex_color = texture(s, tc).xyz;

color.xyz = (base_color * a) + (tex_color * (1.0 - a));
color.a = 1.0;
}
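
The expression in that fragment shader has the shape of a linear interpolation. A small numeric sketch, assuming flrp(x, y, t) is defined as x * (1 - t) + y * t and using scalar stand-ins for the vec3 values; the sample inputs are chosen to be exactly representable so the comparison is exact.

```python
def flrp(x, y, t):
    """Linear interpolation: x * (1 - t) + y * t."""
    return x * (1.0 - t) + y * t


def shader_expr(base_color, tex_color, a):
    """Scalar version of (base_color * a) + (tex_color * (1.0 - a))."""
    return (base_color * a) + (tex_color * (1.0 - a))


# Exactly representable sample values, one channel only.
base_color, tex_color, a = 0.5, 2.0, 0.25
assert shader_expr(base_color, tex_color, a) == flrp(tex_color, base_color, a)
```

Note that the interpolation runs from tex_color to base_color, so recognizing it requires matching the fadd and fmul operands in either order.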



 diff --git a/src/glsl/nir/nir_opt_algebraic.py b/src/glsl/nir/nir_opt_algebraic.py
 index 400d60e..350471f 100644
 --- a/src/glsl/nir/nir_opt_algebraic.py
 +++ b/src/glsl/nir/nir_opt_algebraic.py
 @@ -247,6 +247,11 @@ late_optimizations = [
 (('fge', ('fadd', a, b), 0.0), ('fge', a, ('fneg', b))),
 (('feq', ('fadd', a, b), 0.0), ('feq', a, ('fneg', b))),
 (('fne', ('fadd', a, b), 0.0), ('fne', a, ('fneg', b))),
 +
 +   # Multiplication by 4 comes up fairly often in indirect offset calculations.
 +   # Some GPUs have weird integer multiplication limitations, but shifts should work
 +   # equally well everywhere.
 +   (('imul', 4, a), ('ishl', a, 2)),

This should be conditionalized on whether the platform has native integers.

  ]
  
  print nir_algebraic.AlgebraicPass(nir_opt_algebraic, optimizations).render()
 
