Re: [Mesa-dev] [RFC PATCH] nir: Transform 4*x into x << 2 during late optimizations.

2015-05-18 Thread Jason Ekstrand
On Mon, May 18, 2015 at 3:28 PM, Kenneth Graunke  wrote:
> On Monday, May 18, 2015 11:26:05 AM Matt Turner wrote:
>> On Fri, May 8, 2015 at 3:36 AM, Kenneth Graunke wrote:
>> > According to Glenn, shifts on R600 have 5x the throughput as multiplies.
>> >
>> > Intel GPUs have strange integer multiplication restrictions - on most
>> > hardware, MUL actually only does a 32-bit x 16-bit multiply.  This
>> > means the arguments aren't commutative, which can limit our constant
>> > propagation options.  SHL has no such restrictions.
>> >
>> > Shifting is probably reasonable on most people's hardware, so let's just
>> > do that.
>> >
>> > i965 shader-db results (using NIR for VS):
>> > total instructions in shared programs: 7432587 -> 7388982 (-0.59%)
>> > instructions in affected programs: 1360411 -> 1316806 (-3.21%)
>> > helped:5772
>> > HURT:  0
>>
>> Just to close the loop, I ran shader-db with this patch on top of my
>> integer multiplication series, and it doesn't change any instruction
>> counts on i965. (I also tried with all other power-of-two
>> multiplications for shift values < 31.)
>>
>> We may want to do it for other reasons though.
>
> If we're going to do it because shifts are faster/nicer than multiplies,
> then we should probably just do it for powers-of-two in general.
> Unfortunately, opt_algebraic doesn't really lend itself to that without
> adding some sort of "power of two" infrastructure.
>
> I guess we could optimize things like:
> a * 2^n  =>  a << n
> a % 2^n  =>  a & (2^n - 1)
> a / 2^n  =>  a >> n (possibly only for unsigned? (*))
> ...others?
>
> The first is a clear win on r600, and the latter are clear wins on i965,
> though they may be rather rare...
>
> We could add a custom NIR pass.  Or we could just have backends
> check for an immediate second operand and do this sort of stuff.  Or
> optimize it themselves.  *shrug*

Or we could just do

for i in range(32):
    optimizations.append((('imul', a, (1 << i)), ('ishl', a, i)))
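
A self-contained sketch of the loop above, for anyone who wants to run it outside the Mesa tree. It assumes `a` is a plain string placeholder and `optimizations` an ordinary list, as stand-ins for the names in nir_opt_algebraic.py. The `udiv`/`ushr` rules follow Ilia's suggestion elsewhere in the thread and are not part of the posted patch.

```python
# Stand-ins for the names used in nir_opt_algebraic.py (assumed here).
a = 'a'
optimizations = []

# One rule per power of two: a * 2^i  =>  a << i.
for i in range(32):
    optimizations.append((('imul', a, 1 << i), ('ishl', a, i)))

# Unsigned division by a power of two: a / 2^i  =>  a >> i (logical shift).
# Not in the posted patch; included only to illustrate the suggestion.
for i in range(32):
    optimizations.append((('udiv', a, 1 << i), ('ushr', a, i)))

# The third generated rule is the multiply-by-4 case from the patch.
print(optimizations[2])
```

The rules are ordinary tuples, so generating 32 of them costs nothing at the Python level; whether 32 extra patterns measurably slow the generated matcher is a separate question this sketch does not answer.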

--Jason
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [RFC PATCH] nir: Transform 4*x into x << 2 during late optimizations.

2015-05-18 Thread Kenneth Graunke
On Monday, May 18, 2015 11:26:05 AM Matt Turner wrote:
> On Fri, May 8, 2015 at 3:36 AM, Kenneth Graunke  wrote:
> > According to Glenn, shifts on R600 have 5x the throughput as multiplies.
> >
> > Intel GPUs have strange integer multiplication restrictions - on most
> > hardware, MUL actually only does a 32-bit x 16-bit multiply.  This
> > means the arguments aren't commutative, which can limit our constant
> > propagation options.  SHL has no such restrictions.
> >
> > Shifting is probably reasonable on most people's hardware, so let's just
> > do that.
> >
> > i965 shader-db results (using NIR for VS):
> > total instructions in shared programs: 7432587 -> 7388982 (-0.59%)
> > instructions in affected programs: 1360411 -> 1316806 (-3.21%)
> > helped:5772
> > HURT:  0
> 
> Just to close the loop, I ran shader-db with this patch on top of my
> integer multiplication series, and it doesn't change any instruction
> counts on i965. (I also tried with all other power-of-two
> multiplications for shift values < 31.)
> 
> We may want to do it for other reasons though.

If we're going to do it because shifts are faster/nicer than multiplies,
then we should probably just do it for powers-of-two in general.
Unfortunately, opt_algebraic doesn't really lend itself to that without
adding some sort of "power of two" infrastructure.

I guess we could optimize things like:
a * 2^n  =>  a << n
a % 2^n  =>  a & (2^n - 1)
a / 2^n  =>  a >> n (possibly only for unsigned? (*))
...others?

The first is a clear win on r600, and the latter are clear wins on i965,
though they may be rather rare...

We could add a custom NIR pass.  Or we could just have backends
check for an immediate second operand and do this sort of stuff.  Or
optimize it themselves.  *shrug*

(*) http://lists.freedesktop.org/archives/mesa-dev/2014-April/057364.html
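
The identities listed above can be sanity-checked with a few lines of Python. This is only a sketch: it models 32-bit unsigned values with a mask, the helper `trunc_div` is a hypothetical name for C-style division that rounds toward zero, and the mask for the modulo case is 2^n - 1. The last two lines show why the division rule is suspect for signed operands.

```python
import random

MASK32 = 0xFFFFFFFF  # model 32-bit unsigned wraparound


def trunc_div(a, b):
    """Signed division rounding toward zero, as in C."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


random.seed(0)
for _ in range(1000):
    a = random.getrandbits(32)
    n = random.randrange(32)
    p = 1 << n
    assert (a * p) & MASK32 == (a << n) & MASK32  # a * 2^n == a << n
    assert a % p == a & (p - 1)                   # unsigned a % 2^n
    assert a // p == a >> n                       # unsigned a / 2^n

# Signed counterexample: -1 / 2 truncates to 0, but an arithmetic
# shift of -1 by one bit stays -1.
signed_div = trunc_div(-1, 2)
signed_shift = -1 >> 1
print(signed_div, signed_shift)
```

So the shift-for-divide rewrite is safe for unsigned operands, while signed division by a power of two needs a fix-up for negative values if truncating semantics are required.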




Re: [Mesa-dev] [RFC PATCH] nir: Transform 4*x into x << 2 during late optimizations.

2015-05-18 Thread Matt Turner
On Fri, May 8, 2015 at 3:36 AM, Kenneth Graunke  wrote:
> According to Glenn, shifts on R600 have 5x the throughput as multiplies.
>
> Intel GPUs have strange integer multiplication restrictions - on most
> hardware, MUL actually only does a 32-bit x 16-bit multiply.  This
> means the arguments aren't commutative, which can limit our constant
> propagation options.  SHL has no such restrictions.
>
> Shifting is probably reasonable on most people's hardware, so let's just
> do that.
>
> i965 shader-db results (using NIR for VS):
> total instructions in shared programs: 7432587 -> 7388982 (-0.59%)
> instructions in affected programs: 1360411 -> 1316806 (-3.21%)
> helped:5772
> HURT:  0

Just to close the loop, I ran shader-db with this patch on top of my
integer multiplication series, and it doesn't change any instruction
counts on i965. (I also tried with all other power-of-two
multiplications for shift values < 31.)

We may want to do it for other reasons though.


Re: [Mesa-dev] [RFC PATCH] nir: Transform 4*x into x << 2 during late optimizations.

2015-05-08 Thread Eric Anholt
Ilia Mirkin  writes:

> On Fri, May 8, 2015 at 6:36 AM, Kenneth Graunke  wrote:
>> +   # Multiplication by 4 comes up fairly often in indirect offset calculations.
>> +   # Some GPUs have weird integer multiplication limitations, but shifts should work
>> +   # equally well everywhere.
>> +   (('imul', 4, a), ('ishl', a, 2)),
>
> Not sure what the cost of doing it this way is, but really you want all
> powers of 2... and also udiv -> shr. Since this is python, it should be
> easy enough to append onto that list. AFAIK all GPUs prefer a shift
> over a mul. Adreno doesn't have 32-bit imul in the first place (and no
> idiv either).

I can confirm that I'd love shifts instead of imul/udiv on vc4.




Re: [Mesa-dev] [RFC PATCH] nir: Transform 4*x into x << 2 during late optimizations.

2015-05-08 Thread Jason Ekstrand
On Fri, May 8, 2015 at 11:11 AM, Ian Romanick  wrote:
> On 05/08/2015 03:36 AM, Kenneth Graunke wrote:
>> According to Glenn, shifts on R600 have 5x the throughput as multiplies.
>>
>> Intel GPUs have strange integer multiplication restrictions - on most
>> hardware, MUL actually only does a 32-bit x 16-bit multiply.  This
>> means the arguments aren't commutative, which can limit our constant
>> propagation options.  SHL has no such restrictions.
>>
>> Shifting is probably reasonable on most people's hardware, so let's just
>> do that.
>>
>> i965 shader-db results (using NIR for VS):
>> total instructions in shared programs: 7432587 -> 7388982 (-0.59%)
>> instructions in affected programs: 1360411 -> 1316806 (-3.21%)
>> helped:5772
>> HURT:  0
>>
>> Signed-off-by: Kenneth Graunke 
>> Cc: matts...@gmail.com
>> Cc: ja...@jlekstrand.net
>> ---
>>  src/glsl/nir/nir_opt_algebraic.py | 5 +
>>  1 file changed, 5 insertions(+)
>>
>> So...I found a bizarre issue with this patch.
>>
>>(('imul', 4, a), ('ishl', a, 2)),
>>
>> totally optimizes things.  However...
>>
>>(('imul', a, 4), ('ishl', a, 2)),
>>
>> doesn't seem to do anything, even though imul is commutative, and nir_search
>> should totally handle that...
>>
>>  ▄▄  ▄▄▄▄    ▄   ▄▄
>>  ██  ██   ▀▀▀██▀▀▀  ███  ██
>>  ▀█▄ ██ ▄█▀      ██ ▄█▀  ██
>>   ██ ██ ██   ██  ██  ██   ▄██▀   ██
>>   ███▀▀███   ██  ██   ██ ▀▀
>>   ███  ███  ▄██  ██▄ ██   ▄▄ ▄▄
>>   ▀▀▀  ▀▀▀  ▀▀▀▀ ▀▀   ▀▀ ▀▀
>>
>> If you know why, let me know, otherwise I may have to look into it when more
>> awake.
>
> I've noticed a couple of other weird things that I have been unable to
> understand.  Shaders like the one below end with fmul/ffma instead of
> flrp, for example.  I understand why that happens from GLSL IR
> opt_algebraic, but it seems like nir_opt_algebraic should handle it.

Just a guess, but it's quite possibly due to the commutative
operations bug I just sent a patch for.
--Jason

> [require]
> GLSL >= 1.30
>
> [vertex shader]
> in vec4 v;
> in vec2 tc_in;
>
> out vec2 tc;
>
> void main() {
> gl_Position = v;
> tc = tc_in;
> }
>
> [fragment shader]
> in vec2 tc;
>
> out vec4 color;
>
> uniform sampler2D s;
> uniform float a;
> uniform vec3 base_color;
>
> void main() {
> vec3 tex_color = texture(s, tc).xyz;
>
> color.xyz = (base_color * a) + (tex_color * (1.0 - a));
> color.a = 1.0;
> }
>
>
>
>> diff --git a/src/glsl/nir/nir_opt_algebraic.py b/src/glsl/nir/nir_opt_algebraic.py
>> index 400d60e..350471f 100644
>> --- a/src/glsl/nir/nir_opt_algebraic.py
>> +++ b/src/glsl/nir/nir_opt_algebraic.py
>> @@ -247,6 +247,11 @@ late_optimizations = [
>> (('fge', ('fadd', a, b), 0.0), ('fge', a, ('fneg', b))),
>> (('feq', ('fadd', a, b), 0.0), ('feq', a, ('fneg', b))),
>> (('fne', ('fadd', a, b), 0.0), ('fne', a, ('fneg', b))),
>> +
>> +   # Multiplication by 4 comes up fairly often in indirect offset calculations.
>> +   # Some GPUs have weird integer multiplication limitations, but shifts should work
>> +   # equally well everywhere.
>> +   (('imul', 4, a), ('ishl', a, 2)),
>
> This should be conditionalized on whether the platform has native integers.
>
>>  ]
>>
>>  print nir_algebraic.AlgebraicPass("nir_opt_algebraic", optimizations).render()
>>
>


Re: [Mesa-dev] [RFC PATCH] nir: Transform 4*x into x << 2 during late optimizations.

2015-05-08 Thread Ilia Mirkin
On Fri, May 8, 2015 at 6:36 AM, Kenneth Graunke  wrote:
> +   # Multiplication by 4 comes up fairly often in indirect offset calculations.
> +   # Some GPUs have weird integer multiplication limitations, but shifts should work
> +   # equally well everywhere.
> +   (('imul', 4, a), ('ishl', a, 2)),

Not sure what the cost of doing it this way is, but really you want all
powers of 2... and also udiv -> shr. Since this is python, it should be
easy enough to append onto that list. AFAIK all GPUs prefer a shift
over a mul. Adreno doesn't have 32-bit imul in the first place (and no
idiv either).

In nouveau/codegen we just have a single check for whether the
immediate is a power of 2, perhaps that can be encoded here in some
way.
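
For illustration, a power-of-two check of the kind described above fits in one expression. The sketch below is not nouveau's code; the function names are hypothetical, and expressions are modelled as plain tuples in the style of the nir_opt_algebraic.py rules.

```python
def is_power_of_two(value):
    """True for 1, 2, 4, ...; false for zero and negative values."""
    return value > 0 and (value & (value - 1)) == 0


def shift_for_multiplier(value):
    """Return n such that value == 1 << n, or None if there is none."""
    if not is_power_of_two(value):
        return None
    return value.bit_length() - 1


def rewrite_imul(var, constant):
    """Rewrite ('imul', var, constant) to a shift when the constant allows it."""
    n = shift_for_multiplier(constant)
    if n is None:
        return ('imul', var, constant)
    return ('ishl', var, n)


print(rewrite_imul('a', 4))
print(rewrite_imul('a', 6))
```

A single predicate like this avoids enumerating one pattern per shift count, at the price of needing some way to attach a condition on the constant to a pattern.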

Cheers,

  -ilia


Re: [Mesa-dev] [RFC PATCH] nir: Transform 4*x into x << 2 during late optimizations.

2015-05-08 Thread Ian Romanick
On 05/08/2015 03:36 AM, Kenneth Graunke wrote:
> According to Glenn, shifts on R600 have 5x the throughput as multiplies.
> 
> Intel GPUs have strange integer multiplication restrictions - on most
> hardware, MUL actually only does a 32-bit x 16-bit multiply.  This
> means the arguments aren't commutative, which can limit our constant
> propagation options.  SHL has no such restrictions.
> 
> Shifting is probably reasonable on most people's hardware, so let's just
> do that.
> 
> i965 shader-db results (using NIR for VS):
> total instructions in shared programs: 7432587 -> 7388982 (-0.59%)
> instructions in affected programs: 1360411 -> 1316806 (-3.21%)
> helped:5772
> HURT:  0
> 
> Signed-off-by: Kenneth Graunke 
> Cc: matts...@gmail.com
> Cc: ja...@jlekstrand.net
> ---
>  src/glsl/nir/nir_opt_algebraic.py | 5 +
>  1 file changed, 5 insertions(+)
> 
> So...I found a bizarre issue with this patch.
> 
>(('imul', 4, a), ('ishl', a, 2)),
> 
> totally optimizes things.  However...
> 
>(('imul', a, 4), ('ishl', a, 2)),
> 
> doesn't seem to do anything, even though imul is commutative, and nir_search
> should totally handle that...
> 
>  ▄▄  ▄▄▄▄    ▄   ▄▄
>  ██  ██   ▀▀▀██▀▀▀  ███  ██
>  ▀█▄ ██ ▄█▀      ██ ▄█▀  ██
>   ██ ██ ██   ██  ██  ██   ▄██▀   ██
>   ███▀▀███   ██  ██   ██ ▀▀
>   ███  ███  ▄██  ██▄ ██   ▄▄ ▄▄
>   ▀▀▀  ▀▀▀  ▀▀▀▀ ▀▀   ▀▀ ▀▀
> 
> If you know why, let me know, otherwise I may have to look into it when more
> awake.

I've noticed a couple of other weird things that I have been unable to
understand.  Shaders like the one below end with fmul/ffma instead of
flrp, for example.  I understand why that happens from GLSL IR
opt_algebraic, but it seems like nir_opt_algebraic should handle it.

[require]
GLSL >= 1.30

[vertex shader]
in vec4 v;
in vec2 tc_in;

out vec2 tc;

void main() {
gl_Position = v;
tc = tc_in;
}

[fragment shader]
in vec2 tc;

out vec4 color;

uniform sampler2D s;
uniform float a;
uniform vec3 base_color;

void main() {
vec3 tex_color = texture(s, tc).xyz;

color.xyz = (base_color * a) + (tex_color * (1.0 - a));
color.a = 1.0;
}
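
To make the expected result concrete: the blend in the fragment shader is a linear interpolation with the operands in a particular order. The sketch below assumes `flrp(x, y, t)` follows the GLSL `mix()` convention, x * (1 - t) + y * t, and checks one channel at a time with ordinary Python floats; `shader_blend` is a hypothetical name for the shader expression.

```python
def flrp(x, y, t):
    """Linear interpolation in the mix() convention: x * (1 - t) + y * t."""
    return x * (1.0 - t) + y * t


def shader_blend(base_color, tex_color, a):
    """The expression from the fragment shader, for a single channel."""
    return (base_color * a) + (tex_color * (1.0 - a))


samples = [(0.25, 0.75, 0.5), (1.0, 0.0, 0.25), (0.5, 2.0, 0.0)]
for base, tex, a in samples:
    # Same two products, added in the opposite order.
    assert shader_blend(base, tex, a) == flrp(tex, base, a)
```

Under that convention the shader computes flrp(tex_color, base_color, a), so recognising it means matching an fadd of two fmuls where one factor is (1.0 - a), in either operand order.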



> diff --git a/src/glsl/nir/nir_opt_algebraic.py b/src/glsl/nir/nir_opt_algebraic.py
> index 400d60e..350471f 100644
> --- a/src/glsl/nir/nir_opt_algebraic.py
> +++ b/src/glsl/nir/nir_opt_algebraic.py
> @@ -247,6 +247,11 @@ late_optimizations = [
> (('fge', ('fadd', a, b), 0.0), ('fge', a, ('fneg', b))),
> (('feq', ('fadd', a, b), 0.0), ('feq', a, ('fneg', b))),
> (('fne', ('fadd', a, b), 0.0), ('fne', a, ('fneg', b))),
> +
> +   # Multiplication by 4 comes up fairly often in indirect offset calculations.
> +   # Some GPUs have weird integer multiplication limitations, but shifts should work
> +   # equally well everywhere.
> +   (('imul', 4, a), ('ishl', a, 2)),

This should be conditionalized on whether the platform has native integers.

>  ]
>  
>  print nir_algebraic.AlgebraicPass("nir_opt_algebraic", optimizations).render()
> 



Re: [Mesa-dev] [RFC PATCH] nir: Transform 4*x into x << 2 during late optimizations.

2015-05-08 Thread Jason Ekstrand
On Fri, May 8, 2015 at 3:36 AM, Kenneth Graunke  wrote:
> According to Glenn, shifts on R600 have 5x the throughput as multiplies.
>
> Intel GPUs have strange integer multiplication restrictions - on most
> hardware, MUL actually only does a 32-bit x 16-bit multiply.  This
> means the arguments aren't commutative, which can limit our constant
> propagation options.  SHL has no such restrictions.
>
> Shifting is probably reasonable on most people's hardware, so let's just
> do that.
>
> i965 shader-db results (using NIR for VS):
> total instructions in shared programs: 7432587 -> 7388982 (-0.59%)
> instructions in affected programs: 1360411 -> 1316806 (-3.21%)
> helped:5772
> HURT:  0
>
> Signed-off-by: Kenneth Graunke 
> Cc: matts...@gmail.com
> Cc: ja...@jlekstrand.net
> ---
>  src/glsl/nir/nir_opt_algebraic.py | 5 +
>  1 file changed, 5 insertions(+)
>
> So...I found a bizarre issue with this patch.
>
>(('imul', 4, a), ('ishl', a, 2)),
>
> totally optimizes things.  However...
>
>(('imul', a, 4), ('ishl', a, 2)),
>
> doesn't seem to do anything, even though imul is commutative, and nir_search
> should totally handle that...
>
>  ▄▄  ▄▄▄▄    ▄   ▄▄
>  ██  ██   ▀▀▀██▀▀▀  ███  ██
>  ▀█▄ ██ ▄█▀      ██ ▄█▀  ██
>   ██ ██ ██   ██  ██  ██   ▄██▀   ██
>   ███▀▀███   ██  ██   ██ ▀▀
>   ███  ███  ▄██  ██▄ ██   ▄▄ ▄▄
>   ▀▀▀  ▀▀▀  ▀▀▀▀ ▀▀   ▀▀ ▀▀
>
> If you know why, let me know, otherwise I may have to look into it when more
> awake.

I figured it out and I have a patch.  Unfortunately, it regresses a
few programs and loses 8 SIMD8 programs, so I'm doing some more
investigation.  I'll send it out soon.

> diff --git a/src/glsl/nir/nir_opt_algebraic.py b/src/glsl/nir/nir_opt_algebraic.py
> index 400d60e..350471f 100644
> --- a/src/glsl/nir/nir_opt_algebraic.py
> +++ b/src/glsl/nir/nir_opt_algebraic.py
> @@ -247,6 +247,11 @@ late_optimizations = [
> (('fge', ('fadd', a, b), 0.0), ('fge', a, ('fneg', b))),
> (('feq', ('fadd', a, b), 0.0), ('feq', a, ('fneg', b))),
> (('fne', ('fadd', a, b), 0.0), ('fne', a, ('fneg', b))),
> +
> +   # Multiplication by 4 comes up fairly often in indirect offset calculations.
> +   # Some GPUs have weird integer multiplication limitations, but shifts should work
> +   # equally well everywhere.
> +   (('imul', 4, a), ('ishl', a, 2)),
>  ]
>
>  print nir_algebraic.AlgebraicPass("nir_opt_algebraic", optimizations).render()
> --
> 2.4.0
>


[Mesa-dev] [RFC PATCH] nir: Transform 4*x into x << 2 during late optimizations.

2015-05-08 Thread Kenneth Graunke
According to Glenn, shifts on R600 have 5x the throughput as multiplies.

Intel GPUs have strange integer multiplication restrictions - on most
hardware, MUL actually only does a 32-bit x 16-bit multiply.  This
means the arguments aren't commutative, which can limit our constant
propagation options.  SHL has no such restrictions.
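
A rough model of why a 32-bit x 16-bit multiplier is not commutative, and how a full 32-bit product can still be assembled from it. This is a simplified sketch, not the actual hardware instruction sequence: which operand is truncated and the helper names `mul_32x16` and `mul_32x32` are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF


def mul_32x16(a, b):
    """Model of a multiplier that only reads the low 16 bits of b."""
    return (a * (b & 0xFFFF)) & MASK32


def mul_32x32(a, b):
    """Low 32 bits of a * b, built from two 32x16 multiplies."""
    low = mul_32x16(a, b)         # a * b[15:0]
    high = mul_32x16(a, b >> 16)  # a * b[31:16]
    return (low + (high << 16)) & MASK32


# Swapping the operands changes the result of the narrow multiply.
print(hex(mul_32x16(3, 0x10000)), hex(mul_32x16(0x10000, 3)))

a, b = 0x12345678, 0x00030004
assert mul_32x32(a, b) == (a * b) & MASK32
```

With a small constant such as 4, the operand order decides whether one narrow multiply is enough; a shift sidesteps the question entirely.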

Shifting is probably reasonable on most people's hardware, so let's just
do that.

i965 shader-db results (using NIR for VS):
total instructions in shared programs: 7432587 -> 7388982 (-0.59%)
instructions in affected programs: 1360411 -> 1316806 (-3.21%)
helped:5772
HURT:  0
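
For reference, the percentages above follow directly from the instruction counts. A small check, with `percent_change` as a hypothetical helper name:

```python
def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


total = percent_change(7432587, 7388982)
affected = percent_change(1360411, 1316806)
print("{:.2f}% {:.2f}%".format(total, affected))
```

Both rows drop by the same 43605 instructions, as expected, since programs that are not affected contribute no change to the total.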

Signed-off-by: Kenneth Graunke 
Cc: matts...@gmail.com
Cc: ja...@jlekstrand.net
---
 src/glsl/nir/nir_opt_algebraic.py | 5 +
 1 file changed, 5 insertions(+)

So...I found a bizarre issue with this patch.

   (('imul', 4, a), ('ishl', a, 2)),

totally optimizes things.  However...

   (('imul', a, 4), ('ishl', a, 2)),

doesn't seem to do anything, even though imul is commutative, and nir_search
should totally handle that...

 ▄▄  ▄▄▄▄    ▄   ▄▄
 ██  ██   ▀▀▀██▀▀▀  ███  ██
 ▀█▄ ██ ▄█▀      ██ ▄█▀  ██
  ██ ██ ██   ██  ██  ██   ▄██▀   ██
  ███▀▀███   ██  ██   ██ ▀▀
  ███  ███  ▄██  ██▄ ██   ▄▄ ▄▄
  ▀▀▀  ▀▀▀  ▀▀▀▀ ▀▀   ▀▀ ▀▀

If you know why, let me know, otherwise I may have to look into it when more
awake.
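
For readers unfamiliar with the idea, here is a toy matcher showing what "handling commutativity" means for a pattern like the ones above. It is not nir_search's implementation: the `COMMUTATIVE` set, the `match` function and the tuple encoding of expressions are all assumptions of this sketch, and it does not backtrack across nested commutative operations.

```python
COMMUTATIVE = {'imul', 'iadd', 'fmul', 'fadd'}


def match(pattern, expr, bindings):
    """Match expr against pattern, recording variable bindings on success."""
    if isinstance(pattern, str):            # pattern variable such as 'a'
        if pattern in bindings:
            return bindings[pattern] == expr
        bindings[pattern] = expr
        return True
    if not isinstance(pattern, tuple):      # constant such as 4
        return pattern == expr
    if (not isinstance(expr, tuple) or len(expr) != len(pattern)
            or expr[0] != pattern[0]):
        return False
    orders = [expr[1:]]
    if pattern[0] in COMMUTATIVE and len(expr) == 3:
        orders.append((expr[2], expr[1]))   # also try swapped operands
    for operands in orders:
        trial = dict(bindings)              # discard bindings from failed tries
        if all(match(p, e, trial) for p, e in zip(pattern[1:], operands)):
            bindings.clear()
            bindings.update(trial)
            return True
    return False


rule = ('imul', 'a', 4)
print(match(rule, ('imul', 'x', 4), {}), match(rule, ('imul', 4, 'x'), {}))
```

With the swapped-operand attempt in place, both spellings of the rule match the same expressions, which is the behaviour the two patterns above were expected to share.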

diff --git a/src/glsl/nir/nir_opt_algebraic.py b/src/glsl/nir/nir_opt_algebraic.py
index 400d60e..350471f 100644
--- a/src/glsl/nir/nir_opt_algebraic.py
+++ b/src/glsl/nir/nir_opt_algebraic.py
@@ -247,6 +247,11 @@ late_optimizations = [
    (('fge', ('fadd', a, b), 0.0), ('fge', a, ('fneg', b))),
    (('feq', ('fadd', a, b), 0.0), ('feq', a, ('fneg', b))),
    (('fne', ('fadd', a, b), 0.0), ('fne', a, ('fneg', b))),
+
+   # Multiplication by 4 comes up fairly often in indirect offset calculations.
+   # Some GPUs have weird integer multiplication limitations, but shifts should work
+   # equally well everywhere.
+   (('imul', 4, a), ('ishl', a, 2)),
 ]
 
 print nir_algebraic.AlgebraicPass("nir_opt_algebraic", optimizations).render()
-- 
2.4.0
