Re: [Mesa-dev] V2 radeonsi use STD430 packing of UBOs by default

2017-09-14 Thread Nicolai Hähnle

On 14.09.2017 15:14, Marek Olšák wrote:

On Thu, Sep 14, 2017 at 12:31 PM, Timothy Arceri  wrote:



On 31/08/17 01:55, Marek Olšák wrote:


On Wed, Aug 30, 2017 at 2:22 PM, Timothy Arceri 
wrote:


On 30/08/17 20:07, Marek Olšák wrote:



If LLVM was fixed to do the correct thing, we could enable CONSTBUF
LOAD for LLVM 6.0 and later.




You seem to think that the compiler *should* be placing them near where
they are used? What part of LLVM were you expecting to do this? I'm happy
to do some digging around but don't know where I should start looking.



I think the LLVM machine instruction scheduler should do that. The
starting point would be to add "-print-after-all" to llc or LLVM
arguments in Mesa to have visibility into what LLVM is doing. From
that point it's just about learning to understand that. By default,
LLVM assumes that most or all loads may be affected by any store. LLVM
might also think that the instruction order is OK and doesn't need
changes. I don't know what the exact issue is.

If Natural Selection 2 is the only game showing small changes in
shader-db stats and there are no differences in *real performance* of
NS2 and other apps, I'd say let's merge this.



Retesting with master and more recent LLVM I'm getting:

MaxWaves -1.68% (previously was -2.94%) with -1.60% for NS2.

My care factor for NS2 has officially dropped to 0. I got a copy of it for
testing but I noticed:

  1. OpenGL support is still marked as beta
  2. It crashes when I try to load the tutorial; I assume it's related to
     this bug [1].

Since this is the case I'd rather not hold up this work based on the results
of a buggy game. Marek, is patch 4 OK with you? Everything else has your r-b
(once I split patch 7).


Can you remind me what the name of patch 4 was?


"radeonsi: make use of LOAD for UBOs"

The advantages of using a real e-mail client ;-)

Cheers,
Nicolai



Thanks,
Marek




--
Lerne, wie die Welt wirklich ist,
Aber vergiss niemals, wie sie sein sollte.
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] V2 radeonsi use STD430 packing of UBOs by default

2017-09-14 Thread Marek Olšák
On Thu, Sep 14, 2017 at 12:31 PM, Timothy Arceri  wrote:
>
>
> On 31/08/17 01:55, Marek Olšák wrote:
>>
>> On Wed, Aug 30, 2017 at 2:22 PM, Timothy Arceri 
>> wrote:
>>>
>>> On 30/08/17 20:07, Marek Olšák wrote:


>>>> If LLVM was fixed to do the correct thing, we could enable CONSTBUF
>>>> LOAD for LLVM 6.0 and later.
>>>
>>>
>>>
>>> You seem to think that the compiler *should* be placing them near where
>>> they are used? What part of LLVM were you expecting to do this? I'm happy
>>> to do some digging around but don't know where I should start looking.
>>
>>
>> I think the LLVM machine instruction scheduler should do that. The
>> starting point would be to add "-print-after-all" to llc or LLVM
>> arguments in Mesa to have visibility into what LLVM is doing. From
>> that point it's just about learning to understand that. By default,
>> LLVM assumes that most or all loads may be affected by any store. LLVM
>> might also think that the instruction order is OK and doesn't need
>> changes. I don't know what the exact issue is.
>>
>> If Natural Selection 2 is the only game showing small changes in
>> shader-db stats and there are no differences in *real performance* of
>> NS2 and other apps, I'd say let's merge this.
>
>
> Retesting with master and more recent LLVM I'm getting:
>
> MaxWaves -1.68% (previously was -2.94%) with -1.60% for NS2.
>
> My care factor for NS2 has officially dropped to 0. I got a copy of it for
> testing but I noticed:
>
>  1. OpenGL support is still marked as beta
>  2. It crashes when I try to load the tutorial; I assume it's related to
>     this bug [1].
>
> Since this is the case I'd rather not hold up this work based on the results
> of a buggy game. Marek, is patch 4 OK with you? Everything else has your r-b
> (once I split patch 7).

Can you remind me what the name of patch 4 was?

Thanks,
Marek


Re: [Mesa-dev] V2 radeonsi use STD430 packing of UBOs by default

2017-09-14 Thread Timothy Arceri



On 31/08/17 01:55, Marek Olšák wrote:

On Wed, Aug 30, 2017 at 2:22 PM, Timothy Arceri  wrote:

On 30/08/17 20:07, Marek Olšák wrote:


If LLVM was fixed to do the correct thing, we could enable CONSTBUF
LOAD for LLVM 6.0 and later.



You seem to think that the compiler *should* be placing them near where they
are used? What part of LLVM were you expecting to do this? I'm happy to do
some digging around but don't know where I should start looking.


I think the LLVM machine instruction scheduler should do that. The
starting point would be to add "-print-after-all" to llc or LLVM
arguments in Mesa to have visibility into what LLVM is doing. From
that point it's just about learning to understand that. By default,
LLVM assumes that most or all loads may be affected by any store. LLVM
might also think that the instruction order is OK and doesn't need
changes. I don't know what the exact issue is.

If Natural Selection 2 is the only game showing small changes in
shader-db stats and there are no differences in *real performance* of
NS2 and other apps, I'd say let's merge this.


Retesting with master and more recent LLVM I'm getting:

MaxWaves -1.68% (previously was -2.94%) with -1.60% for NS2.

My care factor for NS2 has officially dropped to 0. I got a copy of it 
for testing but I noticed:


 1. OpenGL support is still marked as beta
 1. OpenGL support is still marked as beta
 2. It crashes when I try to load the tutorial; I assume it's related to
    this bug [1].

Since this is the case I'd rather not hold up this work based on the
results of a buggy game. Marek, is patch 4 OK with you? Everything else
has your r-b (once I split patch 7).


Thanks,
Tim

[1] https://bugs.freedesktop.org/show_bug.cgi?id=93301



Marek




Re: [Mesa-dev] V2 radeonsi use STD430 packing of UBOs by default

2017-08-30 Thread Marek Olšák
On Wed, Aug 30, 2017 at 2:22 PM, Timothy Arceri  wrote:
> On 30/08/17 20:07, Marek Olšák wrote:
>>
>> If LLVM was fixed to do the correct thing, we could enable CONSTBUF
>> LOAD for LLVM 6.0 and later.
>
>
> You seem to think that the compiler *should* be placing them near where they
> are used? What part of LLVM were you expecting to do this? I'm happy to do
> some digging around but don't know where I should start looking.

I think the LLVM machine instruction scheduler should do that. The
starting point would be to add "-print-after-all" to llc or LLVM
arguments in Mesa to have visibility into what LLVM is doing. From
that point it's just about learning to understand that. By default,
LLVM assumes that most or all loads may be affected by any store. LLVM
might also think that the instruction order is OK and doesn't need
changes. I don't know what the exact issue is.
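
To illustrate the aliasing point above, here is a minimal toy model in Python. It is not LLVM's actual dependency logic; the MemOp class, the invariant flag and must_stay_ordered are hypothetical names used only for this sketch. It shows why a scheduler that assumes any store may affect any load cannot move loads freely, while a load known to read memory that never changes can be reordered around stores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemOp:
    """Hypothetical memory operation; not an LLVM data structure."""
    name: str
    kind: str                # "load" or "store"
    invariant: bool = False  # assumed flag: the memory read never changes


def must_stay_ordered(first: MemOp, second: MemOp) -> bool:
    """Return True if a conservative scheduler must keep first before second."""
    # Two loads never conflict with each other.
    if first.kind == "load" and second.kind == "load":
        return False
    # At least one store is involved. A load from memory that never changes
    # cannot observe the store, so the pair may be reordered.
    if any(op.kind == "load" and op.invariant for op in (first, second)):
        return False
    # Otherwise assume the store may affect the other operation.
    return True


store = MemOp("store_result", "store")
plain_load = MemOp("ubo_load", "load")
const_load = MemOp("ubo_load_invariant", "load", invariant=True)

print(must_stay_ordered(store, plain_load))  # True
print(must_stay_ordered(store, const_load))  # False
```

In this model the plain load is pinned relative to the store, whereas the load flagged as invariant is free to move, which is the effect being asked for on s_buffer_load_dword.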

If Natural Selection 2 is the only game showing small changes in
shader-db stats and there are no differences in *real performance* of
NS2 and other apps, I'd say let's merge this.

Marek


Re: [Mesa-dev] V2 radeonsi use STD430 packing of UBOs by default

2017-08-30 Thread Timothy Arceri

On 30/08/17 20:07, Marek Olšák wrote:

If LLVM was fixed to do the correct thing, we could enable CONSTBUF
LOAD for LLVM 6.0 and later.


You seem to think that the compiler *should* be placing them near where 
they are used? What part of LLVM were you expecting to do this? I'm 
happy to do some digging around but don't know where I should start looking.




Marek

On Wed, Aug 30, 2017 at 9:18 AM, Timothy Arceri  wrote:

On 30/08/17 10:25, Marek Olšák wrote:


I have to conclude that I don't see a way to use LOAD with CONSTBUF
and keep the same performance as before. It looks like there are some
deficiencies in our compiler stack that are unfixable in Mesa alone.



Well that's frustrating :( Pretty much makes finishing off uniform packing
[1] pointless. Besides an issue with matrices and some tidy ups it was
mostly done.

[1] https://github.com/tarceri/Mesa/compare/uniform_packing5




Marek

On Wed, Aug 30, 2017 at 2:11 AM, Marek Olšák  wrote:


Related IRC discussion:

00:01 < mareko> arsenm: what are the chances I can convince you to
allow me to set mayLoad = 0 on s_buffer_load_dword? :) the instruction
always reads from read-only memory with Mesa
00:02 < mareko> apparently, readnone doesn't get through
00:02 < arsenm> mareko: you should get the same effect by having
invariant on the MMO
00:03 < mareko> arsenm: and how would I set invariant on SI.load.const?
00:04 < arsenm> mareko: we create MMOs for a few other intrinsics
already, it should be the same
00:05 < mareko> if only I had time to play with LLVM
00:05 < arsenm> mareko: it looks like that is already done so it might
be a more specific problem
00:05 < arsenm> that rematerializable scalar loads patch is probably
OK now though
00:07 < arsenm> https://reviews.llvm.org/D11621

Marek


On Wed, Aug 30, 2017 at 1:58 AM, Marek Olšák  wrote:


Interesting. It may be that glsl_to_tgsi uses copy propagation to fold
those CONST loads into operands, which puts them next to their uses in
LLVM.

I guess LLVM doesn't understand that s_buffer_load_dword loads from
immutable dereferenceable memory. It would benefit from mayLoad = 0 in
this case I think.

Marek

On Thu, Aug 24, 2017 at 11:48 AM, Timothy Arceri 
wrote:




On 24/08/17 18:12, Nicolai Hähnle wrote:



On 24.08.2017 09:45, Timothy Arceri wrote:





On 22/08/17 22:14, Timothy Arceri wrote:



I'm a little unsure what to do with this now. Below is my shader-db
results, the majority of negative changes are from Natural Selection
2.

I looked at some dumps of the worst Natural Selection 2 shaders and
it seems to just be scheduling differences causing the regressions.

I tested with sisched but that just made things even worse.

Obviously we should be aiming to improve the scheduler, but since
this regresses things and I have no evidence of it helping anything
it makes the case for adding it pretty weak.

Thoughts??

PERCENTAGE DELTAS    Shaders    SGPRs    VGPRs  SpillSGPR  MaxWaves
All affected            5797   2.92 %   3.05 %     5.04 %   -2.94 %
-------------------------------------------------------------------
Total                  72287   0.28 %   0.34 %     0.33 %   -0.21 %





As far as I can tell this is because after this change we end up with
large sections of consecutive loads. Any thoughts on avoiding this?


Odd. Do you see the same change in TGSI?

This is one of those things that ideally LLVM would be smart about, but
unfortunately it isn't really.


Yeah I assume it's very doable since SSA makes this stuff reasonably easy to
deal with. However I'm not really sure where to begin, or how welcome a pass
to do this sorting would be. We have a similar pass in nir for moving
comparisons to where they are first used.

The TGSI introduces an extra temp to store the value of the LOAD; this is
probably what triggers the difference in LLVM.

eg.

   LOAD TEMP[61], UBO[2], IMM[2].
   LOAD TEMP[62], UBO[2], IMM[1].
   LOAD TEMP[63], UBO[2], IMM[1].
   LOAD TEMP[64], UBO[2], IMM[2].
   DP4 TEMP[65].x, TEMP[60], TEMP[61]
   DP4 TEMP[66].x, TEMP[60], TEMP[62]
   MOV TEMP[65].y, TEMP[66].
   DP4 TEMP[67].x, TEMP[60], TEMP[63]
   MOV TEMP[65].z, TEMP[67].
   DP4 TEMP[68].x, TEMP[60], TEMP[64]
   MOV TEMP[69].w, TEMP[68].
   MOV TEMP[69].xyz, TEMP[65].xyzx
   LOAD TEMP[70], UBO[1], IMM[6].
   LOAD TEMP[71], UBO[1], IMM[6].
   DP4 TEMP[72].x, TEMP[69], TEMP[70]
   DP4 TEMP[73].x, TEMP[69], TEMP[71]
   LOAD TEMP[74], UBO[1], IMM[6].
   LOAD TEMP[75], UBO[1], IMM[7].
   LOAD TEMP[76], UBO[1], IMM[7].
   LOAD TEMP[77], UBO[1], IMM[7].
   DP4 TEMP[78].x, TEMP[69], TEMP[74]
   DP4 TEMP[79].x, TEMP[69], TEMP[75]
   MOV TEMP[78].y, 

Re: [Mesa-dev] V2 radeonsi use STD430 packing of UBOs by default

2017-08-30 Thread Marek Olšák
If LLVM was fixed to do the correct thing, we could enable CONSTBUF
LOAD for LLVM 6.0 and later.

Marek

On Wed, Aug 30, 2017 at 9:18 AM, Timothy Arceri  wrote:
> On 30/08/17 10:25, Marek Olšák wrote:
>>
>> I have to conclude that I don't see a way to use LOAD with CONSTBUF
>> and keep the same performance as before. It looks like there are some
>> deficiencies in our compiler stack that are unfixable in Mesa alone.
>
>
> Well that's frustrating :( Pretty much makes finishing off uniform packing
> [1] pointless. Besides an issue with matrices and some tidy ups it was
> mostly done.
>
> [1] https://github.com/tarceri/Mesa/compare/uniform_packing5
>
>
>>
>> Marek
>>
>> On Wed, Aug 30, 2017 at 2:11 AM, Marek Olšák  wrote:
>>>
>>> Related IRC discussion:
>>>
>>> 00:01 < mareko> arsenm: what are the chances I can convince you to
>>> allow me to set mayLoad = 0 on s_buffer_load_dword? :) the instruction
>>> always reads from read-only memory with Mesa
>>> 00:02 < mareko> apparently, readnone doesn't get through
>>> 00:02 < arsenm> mareko: you should get the same effect by having
>>> invariant on the MMO
>>> 00:03 < mareko> arsenm: and how would I set invariant on SI.load.const?
>>> 00:04 < arsenm> mareko: we create MMOs for a few other intrinsics
>>> already, it should be the same
>>> 00:05 < mareko> if only I had time to play with LLVM
>>> 00:05 < arsenm> mareko: it looks like that is already done so it might
>>> be a more specific problem
>>> 00:05 < arsenm> that rematerializable scalar loads patch is probably
>>> OK now though
>>> 00:07 < arsenm> https://reviews.llvm.org/D11621
>>>
>>> Marek
>>>
>>>
>>> On Wed, Aug 30, 2017 at 1:58 AM, Marek Olšák  wrote:

 Interesting. It may be that glsl_to_tgsi uses copy propagation to fold
 those CONST loads into operands, which puts them next to their uses in
 LLVM.

 I guess LLVM doesn't understand that s_buffer_load_dword loads from
 immutable dereferenceable memory. It would benefit from mayLoad = 0 in
 this case I think.

 Marek

 On Thu, Aug 24, 2017 at 11:48 AM, Timothy Arceri 
 wrote:
>
>
>
> On 24/08/17 18:12, Nicolai Hähnle wrote:
>>
>>
>> On 24.08.2017 09:45, Timothy Arceri wrote:
>>>
>>>
>>>
>>>
>>> On 22/08/17 22:14, Timothy Arceri wrote:


 I'm a little unsure what to do with this now. Below is my shader-db
 results, the majority of negative changes are from Natural Selection
 2.

 I looked at some dumps of the worst Natural Selection 2 shaders and
 it seems to just be scheduling differences causing the regressions.

 I tested with sisched but that just made things even worse.

 Obviously we should be aiming to improve the scheduler, but since
 this regresses things and I have no evidence of it helping anything
 it makes the case for adding it pretty weak.

 Thoughts??

 PERCENTAGE DELTAS    Shaders    SGPRs    VGPRs  SpillSGPR  MaxWaves
 All affected            5797   2.92 %   3.05 %     5.04 %   -2.94 %
 -------------------------------------------------------------------
 Total                  72287   0.28 %   0.34 %     0.33 %   -0.21 %


>>>
>>>
>>> As far as I can tell this is because after this change we end up with
>>> large sections of consecutive loads. Any thoughts on avoiding this?
>>
>>
>> Odd. Do you see the same change in TGSI?
>>
>> This is one of those things that ideally LLVM would be smart about, but
>> unfortunately it isn't really.
>
>
> Yeah I assume it's very doable since SSA makes this stuff reasonably easy to
> deal with. However I'm not really sure where to begin, or how welcome a pass
> to do this sorting would be. We have a similar pass in nir for moving
> comparisons to where they are first used.
>
> The TGSI introduces an extra temp to store the value of the LOAD; this is
> probably what triggers the difference in LLVM.
>
> eg.
>
>   LOAD TEMP[61], UBO[2], IMM[2].
>   LOAD TEMP[62], UBO[2], IMM[1].
>   LOAD TEMP[63], UBO[2], IMM[1].
>   LOAD TEMP[64], UBO[2], IMM[2].
>   DP4 TEMP[65].x, TEMP[60], TEMP[61]
>   DP4 TEMP[66].x, TEMP[60], TEMP[62]
>   MOV TEMP[65].y, TEMP[66].
>   DP4 TEMP[67].x, TEMP[60], TEMP[63]
>   MOV TEMP[65].z, TEMP[67].
>   DP4 TEMP[68].x, TEMP[60], TEMP[64]
>   MOV 

Re: [Mesa-dev] V2 radeonsi use STD430 packing of UBOs by default

2017-08-30 Thread Timothy Arceri

On 30/08/17 10:25, Marek Olšák wrote:

I have to conclude that I don't see a way to use LOAD with CONSTBUF
and keep the same performance as before. It looks like there are some
deficiencies in our compiler stack that are unfixable in Mesa alone.


Well that's frustrating :( Pretty much makes finishing off uniform 
packing [1] pointless. Besides an issue with matrices and some tidy ups 
it was mostly done.


[1] https://github.com/tarceri/Mesa/compare/uniform_packing5



Marek

On Wed, Aug 30, 2017 at 2:11 AM, Marek Olšák  wrote:

Related IRC discussion:

00:01 < mareko> arsenm: what are the chances I can convince you to
allow me to set mayLoad = 0 on s_buffer_load_dword? :) the instruction
always reads from read-only memory with Mesa
00:02 < mareko> apparently, readnone doesn't get through
00:02 < arsenm> mareko: you should get the same effect by having
invariant on the MMO
00:03 < mareko> arsenm: and how would I set invariant on SI.load.const?
00:04 < arsenm> mareko: we create MMOs for a few other intrinsics
already, it should be the same
00:05 < mareko> if only I had time to play with LLVM
00:05 < arsenm> mareko: it looks like that is already done so it might
be a more specific problem
00:05 < arsenm> that rematerializable scalar loads patch is probably
OK now though
00:07 < arsenm> https://reviews.llvm.org/D11621

Marek


On Wed, Aug 30, 2017 at 1:58 AM, Marek Olšák  wrote:

Interesting. It may be that glsl_to_tgsi uses copy propagation to fold
those CONST loads into operands, which puts them next to their uses in LLVM.

I guess LLVM doesn't understand that s_buffer_load_dword loads from
immutable dereferenceable memory. It would benefit from mayLoad = 0 in
this case I think.

Marek

On Thu, Aug 24, 2017 at 11:48 AM, Timothy Arceri  wrote:



On 24/08/17 18:12, Nicolai Hähnle wrote:


On 24.08.2017 09:45, Timothy Arceri wrote:




On 22/08/17 22:14, Timothy Arceri wrote:


I'm a little unsure what to do with this now. Below is my shader-db
results, the majority of negative changes are from Natural Selection
2.

I looked at some dumps of the worst Natural Selection 2 shaders and
it seems to just be scheduling differences causing the regressions.

I tested with sisched but that just made things even worse.

Obviously we should be aiming to improve the scheduler, but since
this regresses things and I have no evidence of it helping anything
it makes the case for adding it pretty weak.

Thoughts??

PERCENTAGE DELTAS    Shaders    SGPRs    VGPRs  SpillSGPR  MaxWaves
All affected            5797   2.92 %   3.05 %     5.04 %   -2.94 %
-------------------------------------------------------------------
Total                  72287   0.28 %   0.34 %     0.33 %   -0.21 %





As far as I can tell this is because after this change we end up with
large sections of consecutive loads. Any thoughts on avoiding this?


Odd. Do you see the same change in TGSI?

This is one of those things that ideally LLVM would be smart about, but
unfortunately it isn't really.


Yeah I assume it's very doable since SSA makes this stuff reasonably easy to
deal with. However I'm not really sure where to begin, or how welcome a pass
to do this sorting would be. We have a similar pass in nir for moving
comparisons to where they are first used.

The TGSI introduces an extra temp to store the value of the LOAD; this is
probably what triggers the difference in LLVM.

eg.

  LOAD TEMP[61], UBO[2], IMM[2].
  LOAD TEMP[62], UBO[2], IMM[1].
  LOAD TEMP[63], UBO[2], IMM[1].
  LOAD TEMP[64], UBO[2], IMM[2].
  DP4 TEMP[65].x, TEMP[60], TEMP[61]
  DP4 TEMP[66].x, TEMP[60], TEMP[62]
  MOV TEMP[65].y, TEMP[66].
  DP4 TEMP[67].x, TEMP[60], TEMP[63]
  MOV TEMP[65].z, TEMP[67].
  DP4 TEMP[68].x, TEMP[60], TEMP[64]
  MOV TEMP[69].w, TEMP[68].
  MOV TEMP[69].xyz, TEMP[65].xyzx
  LOAD TEMP[70], UBO[1], IMM[6].
  LOAD TEMP[71], UBO[1], IMM[6].
  DP4 TEMP[72].x, TEMP[69], TEMP[70]
  DP4 TEMP[73].x, TEMP[69], TEMP[71]
  LOAD TEMP[74], UBO[1], IMM[6].
  LOAD TEMP[75], UBO[1], IMM[7].
  LOAD TEMP[76], UBO[1], IMM[7].
  LOAD TEMP[77], UBO[1], IMM[7].
  DP4 TEMP[78].x, TEMP[69], TEMP[74]
  DP4 TEMP[79].x, TEMP[69], TEMP[75]
  MOV TEMP[78].y, TEMP[79].
  DP4 TEMP[80].x, TEMP[69], TEMP[76]
  MOV TEMP[78].z, TEMP[80].
  DP4 TEMP[81].x, TEMP[69], TEMP[77]
  MOV TEMP[78].w, TEMP[81].

vs

  DP4 TEMP[63].x, TEMP[62], CONST[2][0]
  DP4 TEMP[64].x, TEMP[62], CONST[2][1]
  MOV TEMP[63].y, TEMP[64].
  DP4 TEMP[65].x, TEMP[62], CONST[2][2]
  MOV TEMP[63].z, TEMP[65].
  DP4 TEMP[66].x, TEMP[62], CONST[2][3]
  MOV TEMP[67].w, TEMP[66].
  MOV TEMP[67].xyz, TEMP[63].xyzx
  DP4 TEMP[68].x, TEMP[67], CONST[1][14]
  

Re: [Mesa-dev] V2 radeonsi use STD430 packing of UBOs by default

2017-08-29 Thread Marek Olšák
I have to conclude that I don't see a way to use LOAD with CONSTBUF
and keep the same performance as before. It looks like there are some
deficiencies in our compiler stack that are unfixable in Mesa alone.

Marek

On Wed, Aug 30, 2017 at 2:11 AM, Marek Olšák  wrote:
> Related IRC discussion:
>
> 00:01 < mareko> arsenm: what are the chances I can convince you to
> allow me to set mayLoad = 0 on s_buffer_load_dword? :) the instruction
> always reads from read-only memory with Mesa
> 00:02 < mareko> apparently, readnone doesn't get through
> 00:02 < arsenm> mareko: you should get the same effect by having
> invariant on the MMO
> 00:03 < mareko> arsenm: and how would I set invariant on SI.load.const?
> 00:04 < arsenm> mareko: we create MMOs for a few other intrinsics
> already, it should be the same
> 00:05 < mareko> if only I had time to play with LLVM
> 00:05 < arsenm> mareko: it looks like that is already done so it might
> be a more specific problem
> 00:05 < arsenm> that rematerializable scalar loads patch is probably
> OK now though
> 00:07 < arsenm> https://reviews.llvm.org/D11621
>
> Marek
>
>
> On Wed, Aug 30, 2017 at 1:58 AM, Marek Olšák  wrote:
>> Interesting. It may be that glsl_to_tgsi uses copy propagation to fold
>> those CONST loads into operands, which puts them next to their uses in LLVM.
>>
>> I guess LLVM doesn't understand that s_buffer_load_dword loads from
>> immutable dereferenceable memory. It would benefit from mayLoad = 0 in
>> this case I think.
>>
>> Marek
>>
>> On Thu, Aug 24, 2017 at 11:48 AM, Timothy Arceri  
>> wrote:
>>>
>>>
>>> On 24/08/17 18:12, Nicolai Hähnle wrote:

 On 24.08.2017 09:45, Timothy Arceri wrote:
>
>
>
> On 22/08/17 22:14, Timothy Arceri wrote:
>>
>> I'm a little unsure what to do with this now. Below is my shader-db
>> results, the majority of negative changes are from Natural Selection
>> 2.
>>
>> I looked at some dumps of the worst Natural Selection 2 shaders and
>> it seems to just be scheduling differences causing the regressions.
>>
>> I tested with sisched but that just made things even worse.
>>
>> Obviously we should be aiming to improve the scheduler, but since
>> this regresses things and I have no evidence of it helping anything
>> it makes the case for adding it pretty weak.
>>
>> Thoughts??
>>
>> PERCENTAGE DELTAS    Shaders    SGPRs    VGPRs  SpillSGPR  MaxWaves
>> All affected            5797   2.92 %   3.05 %     5.04 %   -2.94 %
>> -------------------------------------------------------------------
>> Total                  72287   0.28 %   0.34 %     0.33 %   -0.21 %
>>
>>
>
>
> As far as I can tell this is because after this change we end up with
> large sections of consecutive loads. Any thoughts on avoiding this?


 Odd. Do you see the same change in TGSI?

 This is one of those things that ideally LLVM would be smart about, but
 unfortunately it isn't really.
>>>
>>>
>>> Yeah I assume it's very doable since SSA makes this stuff reasonably easy to
>>> deal with. However I'm not really sure where to begin, or how welcome a pass
>>> to do this sorting would be. We have a similar pass in nir for moving
>>> comparisons to where they are first used.
>>>
>>> The TGSI introduces an extra temp to store the value of the LOAD; this is
>>> probably what triggers the difference in LLVM.
>>>
>>> eg.
>>>
>>>  LOAD TEMP[61], UBO[2], IMM[2].
>>>  LOAD TEMP[62], UBO[2], IMM[1].
>>>  LOAD TEMP[63], UBO[2], IMM[1].
>>>  LOAD TEMP[64], UBO[2], IMM[2].
>>>  DP4 TEMP[65].x, TEMP[60], TEMP[61]
>>>  DP4 TEMP[66].x, TEMP[60], TEMP[62]
>>>  MOV TEMP[65].y, TEMP[66].
>>>  DP4 TEMP[67].x, TEMP[60], TEMP[63]
>>>  MOV TEMP[65].z, TEMP[67].
>>>  DP4 TEMP[68].x, TEMP[60], TEMP[64]
>>>  MOV TEMP[69].w, TEMP[68].
>>>  MOV TEMP[69].xyz, TEMP[65].xyzx
>>>  LOAD TEMP[70], UBO[1], IMM[6].
>>>  LOAD TEMP[71], UBO[1], IMM[6].
>>>  DP4 TEMP[72].x, TEMP[69], TEMP[70]
>>>  DP4 TEMP[73].x, TEMP[69], TEMP[71]
>>>  LOAD TEMP[74], UBO[1], IMM[6].
>>>  LOAD TEMP[75], UBO[1], IMM[7].
>>>  LOAD TEMP[76], UBO[1], IMM[7].
>>>  LOAD TEMP[77], UBO[1], IMM[7].
>>>  DP4 TEMP[78].x, TEMP[69], TEMP[74]
>>>  DP4 TEMP[79].x, TEMP[69], TEMP[75]
>>>  MOV TEMP[78].y, TEMP[79].
>>>  DP4 TEMP[80].x, TEMP[69], TEMP[76]
>>>  MOV TEMP[78].z, TEMP[80].
>>>  DP4 TEMP[81].x, TEMP[69], TEMP[77]
>>>  MOV TEMP[78].w, TEMP[81].
>>>
>>> vs
>>>
>>>  DP4 TEMP[63].x, TEMP[62], CONST[2][0]
>>>  DP4 TEMP[64].x, TEMP[62], CONST[2][1]
>>>  MOV TEMP[63].y, TEMP[64].
>>>  DP4 

Re: [Mesa-dev] V2 radeonsi use STD430 packing of UBOs by default

2017-08-29 Thread Marek Olšák
Related IRC discussion:

00:01 < mareko> arsenm: what are the chances I can convince you to
allow me to set mayLoad = 0 on s_buffer_load_dword? :) the instruction
always reads from read-only memory with Mesa
00:02 < mareko> apparently, readnone doesn't get through
00:02 < arsenm> mareko: you should get the same effect by having
invariant on the MMO
00:03 < mareko> arsenm: and how would I set invariant on SI.load.const?
00:04 < arsenm> mareko: we create MMOs for a few other intrinsics
already, it should be the same
00:05 < mareko> if only I had time to play with LLVM
00:05 < arsenm> mareko: it looks like that is already done so it might
be a more specific problem
00:05 < arsenm> that rematerializable scalar loads patch is probably
OK now though
00:07 < arsenm> https://reviews.llvm.org/D11621

Marek


On Wed, Aug 30, 2017 at 1:58 AM, Marek Olšák  wrote:
> Interesting. It may be that glsl_to_tgsi uses copy propagation to fold
> those CONST loads into operands, which puts them next to their uses in LLVM.
>
> I guess LLVM doesn't understand that s_buffer_load_dword loads from
> immutable dereferenceable memory. It would benefit from mayLoad = 0 in
> this case I think.
>
> Marek
>
> On Thu, Aug 24, 2017 at 11:48 AM, Timothy Arceri  
> wrote:
>>
>>
>> On 24/08/17 18:12, Nicolai Hähnle wrote:
>>>
>>> On 24.08.2017 09:45, Timothy Arceri wrote:



 On 22/08/17 22:14, Timothy Arceri wrote:
>
> I'm a little unsure what to do with this now. Below is my shader-db
> results, the majority of negative changes are from Natural Selection
> 2.
>
> I looked at some dumps of the worst Natural Selection 2 shaders and
> it seems to just be scheduling differences causing the regressions.
>
> I tested with sisched but that just made things even worse.
>
> Obviously we should be aiming to improve the scheduler, but since
> this regresses things and I have no evidence of it helping anything
> it makes the case for adding it pretty weak.
>
> Thoughts??
>
> PERCENTAGE DELTAS    Shaders    SGPRs    VGPRs  SpillSGPR  MaxWaves
> All affected            5797   2.92 %   3.05 %     5.04 %   -2.94 %
> -------------------------------------------------------------------
> Total                  72287   0.28 %   0.34 %     0.33 %   -0.21 %
>
>


 As far as I can tell this is because after this change we end up with
 large sections of consecutive loads. Any thoughts on avoiding this?
>>>
>>>
>>> Odd. Do you see the same change in TGSI?
>>>
>>> This is one of those things that ideally LLVM would be smart about, but
>>> unfortunately it isn't really.
>>
>>
>> Yeah I assume it's very doable since SSA makes this stuff reasonably easy to
>> deal with. However I'm not really sure where to begin, or how welcome a pass
>> to do this sorting would be. We have a similar pass in nir for moving
>> comparisons to where they are first used.
>>
>> The TGSI introduces an extra temp to store the value of the LOAD; this is
>> probably what triggers the difference in LLVM.
>>
>> eg.
>>
>>  LOAD TEMP[61], UBO[2], IMM[2].
>>  LOAD TEMP[62], UBO[2], IMM[1].
>>  LOAD TEMP[63], UBO[2], IMM[1].
>>  LOAD TEMP[64], UBO[2], IMM[2].
>>  DP4 TEMP[65].x, TEMP[60], TEMP[61]
>>  DP4 TEMP[66].x, TEMP[60], TEMP[62]
>>  MOV TEMP[65].y, TEMP[66].
>>  DP4 TEMP[67].x, TEMP[60], TEMP[63]
>>  MOV TEMP[65].z, TEMP[67].
>>  DP4 TEMP[68].x, TEMP[60], TEMP[64]
>>  MOV TEMP[69].w, TEMP[68].
>>  MOV TEMP[69].xyz, TEMP[65].xyzx
>>  LOAD TEMP[70], UBO[1], IMM[6].
>>  LOAD TEMP[71], UBO[1], IMM[6].
>>  DP4 TEMP[72].x, TEMP[69], TEMP[70]
>>  DP4 TEMP[73].x, TEMP[69], TEMP[71]
>>  LOAD TEMP[74], UBO[1], IMM[6].
>>  LOAD TEMP[75], UBO[1], IMM[7].
>>  LOAD TEMP[76], UBO[1], IMM[7].
>>  LOAD TEMP[77], UBO[1], IMM[7].
>>  DP4 TEMP[78].x, TEMP[69], TEMP[74]
>>  DP4 TEMP[79].x, TEMP[69], TEMP[75]
>>  MOV TEMP[78].y, TEMP[79].
>>  DP4 TEMP[80].x, TEMP[69], TEMP[76]
>>  MOV TEMP[78].z, TEMP[80].
>>  DP4 TEMP[81].x, TEMP[69], TEMP[77]
>>  MOV TEMP[78].w, TEMP[81].
>>
>> vs
>>
>>  DP4 TEMP[63].x, TEMP[62], CONST[2][0]
>>  DP4 TEMP[64].x, TEMP[62], CONST[2][1]
>>  MOV TEMP[63].y, TEMP[64].
>>  DP4 TEMP[65].x, TEMP[62], CONST[2][2]
>>  MOV TEMP[63].z, TEMP[65].
>>  DP4 TEMP[66].x, TEMP[62], CONST[2][3]
>>  MOV TEMP[67].w, TEMP[66].
>>  MOV TEMP[67].xyz, TEMP[63].xyzx
>>  DP4 TEMP[68].x, TEMP[67], CONST[1][14]
>>  DP4 TEMP[69].x, TEMP[67], CONST[1][15]
>>  DP4 TEMP[70].x, TEMP[67], CONST[1][8]
>>  DP4 TEMP[71].x, TEMP[67], CONST[1][9]
>>  MOV TEMP[70].y, TEMP[71].
>>  DP4 TEMP[72].x, TEMP[67], CONST[1][10]
>>  MOV 

Re: [Mesa-dev] V2 radeonsi use STD430 packing of UBOs by default

2017-08-29 Thread Marek Olšák
Interesting. It may be that glsl_to_tgsi uses copy propagation to fold
those CONST loads into operands, which puts them next to their uses in LLVM.

I guess LLVM doesn't understand that s_buffer_load_dword loads from
immutable dereferenceable memory. It would benefit from mayLoad = 0 in
this case I think.

Marek

On Thu, Aug 24, 2017 at 11:48 AM, Timothy Arceri  wrote:
>
>
> On 24/08/17 18:12, Nicolai Hähnle wrote:
>>
>> On 24.08.2017 09:45, Timothy Arceri wrote:
>>>
>>>
>>>
>>> On 22/08/17 22:14, Timothy Arceri wrote:

 I'm a little unsure what to do with this now. Below is my shader-db
 results, the majority of negative changes are from Natural Selection
 2.

 I looked at some dumps of the worst Natural Selection 2 shaders and
 it seems to just be scheduling differences causing the regressions.

 I tested with sisched but that just made things even worse.

 Obviously we should be aiming to improve the scheduler, but since
 this regresses things and I have no evidence of it helping anything
 it makes the case for adding it pretty weak.

 Thoughts??

 PERCENTAGE DELTAS    Shaders    SGPRs    VGPRs  SpillSGPR  MaxWaves
 All affected            5797   2.92 %   3.05 %     5.04 %   -2.94 %
 -------------------------------------------------------------------
 Total                  72287   0.28 %   0.34 %     0.33 %   -0.21 %


>>>
>>>
>>> As far as I can tell this is because after this change we end up with
>>> large sections of consecutive loads. Any thoughts on avoiding this?
>>
>>
>> Odd. Do you see the same change in TGSI?
>>
>> This is one of those things that ideally LLVM would be smart about, but
>> unfortunately it isn't really.
>
>
> Yeah I assume it's very doable since SSA makes this stuff reasonably easy to
> deal with. However I'm not really sure where to begin, or how welcome a pass
> to do this sorting would be. We have a similar pass in nir for moving
> comparisons to where they are first used.
>
> The TGSI introduces an extra temp to store the value of the LOAD; this is
> probably what triggers the difference in LLVM.
>
> e.g.
>
>  LOAD TEMP[61], UBO[2], IMM[2].
>  LOAD TEMP[62], UBO[2], IMM[1].
>  LOAD TEMP[63], UBO[2], IMM[1].
>  LOAD TEMP[64], UBO[2], IMM[2].
>  DP4 TEMP[65].x, TEMP[60], TEMP[61]
>  DP4 TEMP[66].x, TEMP[60], TEMP[62]
>  MOV TEMP[65].y, TEMP[66].
>  DP4 TEMP[67].x, TEMP[60], TEMP[63]
>  MOV TEMP[65].z, TEMP[67].
>  DP4 TEMP[68].x, TEMP[60], TEMP[64]
>  MOV TEMP[69].w, TEMP[68].
>  MOV TEMP[69].xyz, TEMP[65].xyzx
>  LOAD TEMP[70], UBO[1], IMM[6].
>  LOAD TEMP[71], UBO[1], IMM[6].
>  DP4 TEMP[72].x, TEMP[69], TEMP[70]
>  DP4 TEMP[73].x, TEMP[69], TEMP[71]
>  LOAD TEMP[74], UBO[1], IMM[6].
>  LOAD TEMP[75], UBO[1], IMM[7].
>  LOAD TEMP[76], UBO[1], IMM[7].
>  LOAD TEMP[77], UBO[1], IMM[7].
>  DP4 TEMP[78].x, TEMP[69], TEMP[74]
>  DP4 TEMP[79].x, TEMP[69], TEMP[75]
>  MOV TEMP[78].y, TEMP[79].
>  DP4 TEMP[80].x, TEMP[69], TEMP[76]
>  MOV TEMP[78].z, TEMP[80].
>  DP4 TEMP[81].x, TEMP[69], TEMP[77]
>  MOV TEMP[78].w, TEMP[81].
>
> vs
>
>  DP4 TEMP[63].x, TEMP[62], CONST[2][0]
>  DP4 TEMP[64].x, TEMP[62], CONST[2][1]
>  MOV TEMP[63].y, TEMP[64].
>  DP4 TEMP[65].x, TEMP[62], CONST[2][2]
>  MOV TEMP[63].z, TEMP[65].
>  DP4 TEMP[66].x, TEMP[62], CONST[2][3]
>  MOV TEMP[67].w, TEMP[66].
>  MOV TEMP[67].xyz, TEMP[63].xyzx
>  DP4 TEMP[68].x, TEMP[67], CONST[1][14]
>  DP4 TEMP[69].x, TEMP[67], CONST[1][15]
>  DP4 TEMP[70].x, TEMP[67], CONST[1][8]
>  DP4 TEMP[71].x, TEMP[67], CONST[1][9]
>  MOV TEMP[70].y, TEMP[71].
>  DP4 TEMP[72].x, TEMP[67], CONST[1][10]
>  MOV TEMP[70].z, TEMP[72].
>  DP4 TEMP[73].x, TEMP[67], CONST[1][11]
>  MOV TEMP[70].w, TEMP[73].
>  MOV TEMP[74].xyw, TEMP[70].xyxw
>
>>
>> Cheers,
>> Nicolai
>>


Re: [Mesa-dev] V2 radeonsi use STD430 packing of UBOs by default

2017-08-24 Thread Timothy Arceri



On 24/08/17 18:12, Nicolai Hähnle wrote:

On 24.08.2017 09:45, Timothy Arceri wrote:



On 22/08/17 22:14, Timothy Arceri wrote:

I'm a little unsure what to do with this now. Below are my shader-db
results; the majority of negative changes are from Natural Selection
2.

I looked at some dumps of the worst Natural Selection 2 shaders and
it seems to just be scheduling differences causing the regressions.

I tested with sisched but that just made things even worse.

Obviously we should be aiming to improve the scheduler, but since
this regresses things and I have no evidence of it helping anything
it makes the case for adding it pretty weak.

Thoughts??

PERCENTAGE DELTAS      Shaders     SGPRs     VGPRs SpillSGPR  MaxWaves

  All affected            5797    2.92 %    3.05 %    5.04 %   -2.94 %
  --------------------------------------------------------------------
  Total                  72287    0.28 %    0.34 %    0.33 %   -0.21 %





As far as I can tell this is because after this change we end up with
large sections of consecutive loads. Any thoughts on avoiding this?


Odd. Do you see the same change in TGSI?

This is one of those things that ideally LLVM would be smart about, but 
unfortunately it isn't really.


Yeah I assume it's very doable since SSA makes this stuff reasonably 
easy to deal with. However I'm not really sure where to begin, or how 
welcome a pass to do this sorting would be. We have a similar pass in 
nir for moving comparisons to where they are first used.
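
As a rough illustration of the kind of sorting pass I mean, here is a minimal Python sketch that moves each load to just before its first use. The tuple-based IR and the sink_loads name are hypothetical, and it assumes a single basic block in SSA form with loads from immutable memory, so that moving a load is always legal. It is not how NIR or LLVM implement anything.

```python
def sink_loads(instrs):
    """Move every 'load' to just before the first instruction that uses it.

    instrs is a list of (dest, opcode, sources) tuples for one basic block.
    Assumes SSA (each dest defined once) and loads from immutable memory.
    """
    loads = {dest: (dest, op, srcs) for dest, op, srcs in instrs if op == "load"}
    placed = set()
    out = []

    def emit_load(name):
        # Emit a pending load, first emitting any loads it depends on.
        if name in loads and name not in placed:
            placed.add(name)
            for src in loads[name][2]:
                emit_load(src)
            out.append(loads[name])

    for dest, op, srcs in instrs:
        if op == "load":
            continue  # emitted lazily, right before its first user
        for src in srcs:
            emit_load(src)
        out.append((dest, op, srcs))

    # Loads without any user are kept, in original order, at the end.
    for name in loads:
        emit_load(name)
    return out


program = [
    ("c0", "load", ("ubo",)),
    ("c1", "load", ("ubo",)),
    ("r0", "dp4", ("v", "c0")),
    ("r1", "dp4", ("v", "c1")),
]
for instr in sink_loads(program):
    print(instr)
```

For the four-instruction example the result order is c0, r0, c1, r1, i.e. each load ends up directly in front of the DP4 that consumes it.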


The TGSI introduces an extra temp to store the value of the LOAD; 
this is probably what triggers the difference in LLVM.


e.g.

 LOAD TEMP[61], UBO[2], IMM[2].
 LOAD TEMP[62], UBO[2], IMM[1].
 LOAD TEMP[63], UBO[2], IMM[1].
 LOAD TEMP[64], UBO[2], IMM[2].
 DP4 TEMP[65].x, TEMP[60], TEMP[61]
 DP4 TEMP[66].x, TEMP[60], TEMP[62]
 MOV TEMP[65].y, TEMP[66].
 DP4 TEMP[67].x, TEMP[60], TEMP[63]
 MOV TEMP[65].z, TEMP[67].
 DP4 TEMP[68].x, TEMP[60], TEMP[64]
 MOV TEMP[69].w, TEMP[68].
 MOV TEMP[69].xyz, TEMP[65].xyzx
 LOAD TEMP[70], UBO[1], IMM[6].
 LOAD TEMP[71], UBO[1], IMM[6].
 DP4 TEMP[72].x, TEMP[69], TEMP[70]
 DP4 TEMP[73].x, TEMP[69], TEMP[71]
 LOAD TEMP[74], UBO[1], IMM[6].
 LOAD TEMP[75], UBO[1], IMM[7].
 LOAD TEMP[76], UBO[1], IMM[7].
 LOAD TEMP[77], UBO[1], IMM[7].
 DP4 TEMP[78].x, TEMP[69], TEMP[74]
 DP4 TEMP[79].x, TEMP[69], TEMP[75]
 MOV TEMP[78].y, TEMP[79].
 DP4 TEMP[80].x, TEMP[69], TEMP[76]
 MOV TEMP[78].z, TEMP[80].
 DP4 TEMP[81].x, TEMP[69], TEMP[77]
 MOV TEMP[78].w, TEMP[81].

vs

 DP4 TEMP[63].x, TEMP[62], CONST[2][0]
 DP4 TEMP[64].x, TEMP[62], CONST[2][1]
 MOV TEMP[63].y, TEMP[64].
 DP4 TEMP[65].x, TEMP[62], CONST[2][2]
 MOV TEMP[63].z, TEMP[65].
 DP4 TEMP[66].x, TEMP[62], CONST[2][3]
 MOV TEMP[67].w, TEMP[66].
 MOV TEMP[67].xyz, TEMP[63].xyzx
 DP4 TEMP[68].x, TEMP[67], CONST[1][14]
 DP4 TEMP[69].x, TEMP[67], CONST[1][15]
 DP4 TEMP[70].x, TEMP[67], CONST[1][8]
 DP4 TEMP[71].x, TEMP[67], CONST[1][9]
 MOV TEMP[70].y, TEMP[71].
 DP4 TEMP[72].x, TEMP[67], CONST[1][10]
 MOV TEMP[70].z, TEMP[72].
 DP4 TEMP[73].x, TEMP[67], CONST[1][11]
 MOV TEMP[70].w, TEMP[73].
 MOV TEMP[74].xyw, TEMP[70].xyxw
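
The cost of grouping the loads as in the first listing can be estimated by counting how many values are live at once. The sketch below is a simplified Python model, not a description of what LLVM's register allocator does: the peak_live helper and the tuple IR are hypothetical, every value is assumed to occupy one register, and the DP4 results are unused in this toy input, so they die immediately.

```python
def peak_live(instrs):
    """Return the maximum number of simultaneously live values.

    instrs is a list of (dest, opcode, sources) tuples for one basic block.
    A value is live from its definition up to its last use. Sources that
    die at an instruction are still counted together with its result,
    which is slightly pessimistic.
    """
    last_use = {}
    for index, (_dest, _op, srcs) in enumerate(instrs):
        for src in srcs:
            last_use[src] = index

    live = set()
    peak = 0
    for index, (dest, _op, _srcs) in enumerate(instrs):
        live.add(dest)
        peak = max(peak, len(live))
        # Keep only values that are still needed by a later instruction.
        live = {v for v in live if last_use.get(v, index) > index}
    return peak


# All loads first, then all uses (the LOAD/UBO pattern).
grouped = [
    ("c0", "load", ("ubo",)),
    ("c1", "load", ("ubo",)),
    ("c2", "load", ("ubo",)),
    ("c3", "load", ("ubo",)),
    ("r0", "dp4", ("v", "c0")),
    ("r1", "dp4", ("v", "c1")),
    ("r2", "dp4", ("v", "c2")),
    ("r3", "dp4", ("v", "c3")),
]

# Each load directly before its use (the CONST operand pattern).
interleaved = [
    ("c0", "load", ("ubo",)),
    ("r0", "dp4", ("v", "c0")),
    ("c1", "load", ("ubo",)),
    ("r1", "dp4", ("v", "c1")),
    ("c2", "load", ("ubo",)),
    ("r2", "dp4", ("v", "c2")),
    ("c3", "load", ("ubo",)),
    ("r3", "dp4", ("v", "c3")),
]

print(peak_live(grouped), peak_live(interleaved))  # 5 2
```

In this model the grouped order needs five values live at its peak and the interleaved order only two, which is the kind of pressure difference that would show up as lower MaxWaves.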



Cheers,
Nicolai




Re: [Mesa-dev] V2 radeonsi use STD430 packing of UBOs by default

2017-08-24 Thread Nicolai Hähnle

On 24.08.2017 09:45, Timothy Arceri wrote:



On 22/08/17 22:14, Timothy Arceri wrote:

I'm a little unsure what to do with this now. Below are my shader-db
results; the majority of negative changes are from Natural Selection
2.

I looked at some dumps of the worst Natural Selection 2 shaders and
it seems to just be scheduling differences causing the regressions.

I tested with sisched but that just made things even worse.

Obviously we should be aiming to improve the scheduler, but since
this regresses things and I have no evidence of it helping anything
it makes the case for adding it pretty weak.

Thoughts??

PERCENTAGE DELTAS      Shaders     SGPRs     VGPRs SpillSGPR  MaxWaves

  All affected            5797    2.92 %    3.05 %    5.04 %   -2.94 %
  --------------------------------------------------------------------
  Total                  72287    0.28 %    0.34 %    0.33 %   -0.21 %





As far as I can tell this is because after this change we end up with
large sections of consecutive loads. Any thoughts on avoiding this?


Odd. Do you see the same change in TGSI?

This is one of those things that ideally LLVM would be smart about, but 
unfortunately it isn't really.


Cheers,
Nicolai



  e.g.

   %234 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 0)
   %235 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 4)
   %236 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 8)
   %237 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 12)
   %238 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 16)
   %239 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 20)
   %240 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 24)
   %241 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 28)
   %242 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 32)
   %243 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 36)
   %244 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 40)
   %245 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 44)
   %246 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 48)
   %247 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 52)
   %248 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 56)
   %249 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 60)
   %250 = fmul nsz float %227, %234
   %251 = fmul nsz float %229, %235
   %252 = fadd nsz float %250, %251
   %253 = fmul nsz float %231, %236
   %254 = fadd nsz float %252, %253
   %255 = fadd nsz float %254, %237
   %256 = fmul nsz float %227, %238
   %257 = fmul nsz float %229, %239
   %258 = fadd nsz float %256, %257
   %259 = fmul nsz float %231, %240
   %260 = fadd nsz float %258, %259
   %261 = fadd nsz float %260, %241
   %262 = fmul nsz float %227, %242
   %263 = fmul nsz float %229, %243
   %264 = fadd nsz float %262, %263
   %265 = fmul nsz float %231, %244
   %266 = fadd nsz float %264, %265
   %267 = fadd nsz float %266, %245
   %268 = fmul nsz float %227, %246
   %269 = fmul nsz float %229, %247
   %270 = fadd nsz float %268, %269
   %271 = fmul nsz float %231, %248
   %272 = fadd nsz float %270, %271
   %273 = fadd nsz float %272, %249


vs


   %234 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 0)
   %235 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 4)
   %236 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 8)
   %237 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 12)
   %238 = fmul nsz float %227, %234
   %239 = fmul nsz float %229, %235
   %240 = fadd nsz float %238, %239
   %241 = fmul nsz float %231, %236
   %242 = fadd nsz float %240, %241
   %243 = fadd nsz float %242, %237
   %244 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 16)
   %245 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 20)
   %246 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 24)
   %247 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 28)
   %248 = fmul nsz float %227, %244
   %249 = fmul nsz float %229, %245
   %250 = fadd nsz float %248, %249
   %251 = fmul nsz float %231, %246
   %252 = fadd nsz float %250, %251
   %253 = fadd nsz float %252, %247
   %254 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 32)
   %255 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 36)
   %256 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 40)
   %257 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 44)
   %258 = fmul nsz float %227, %254
   %259 = fmul nsz float %229, %255
   %260 = fadd nsz float %258, %259
   %261 = fmul nsz float %231, %256
   %262 = fadd nsz 

Re: [Mesa-dev] V2 radeonsi use STD430 packing of UBOs by default

2017-08-24 Thread Timothy Arceri



On 22/08/17 22:14, Timothy Arceri wrote:

I'm a little unsure what to do with this now. Below are my shader-db
results; the majority of negative changes are from Natural Selection
2.

I looked at some dumps of the worst Natural Selection 2 shaders and
it seems to just be scheduling differences causing the regressions.

I tested with sisched but that just made things even worse.

Obviously we should be aiming to improve the scheduler, but since
this regresses things and I have no evidence of it helping anything
it makes the case for adding it pretty weak.

Thoughts??

PERCENTAGE DELTAS      Shaders     SGPRs     VGPRs SpillSGPR  MaxWaves

  All affected            5797    2.92 %    3.05 %    5.04 %   -2.94 %
  --------------------------------------------------------------------
  Total                  72287    0.28 %    0.34 %    0.33 %   -0.21 %





As far as I can tell this is because after this change we end up with
large sections of consecutive loads. Any thoughts on avoiding this?


 e.g.

  %234 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 0)
  %235 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 4)
  %236 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 8)
  %237 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 12)
  %238 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 16)
  %239 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 20)
  %240 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 24)
  %241 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 28)
  %242 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 32)
  %243 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 36)
  %244 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 40)
  %245 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 44)
  %246 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 48)
  %247 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 52)
  %248 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 56)
  %249 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 60)
  %250 = fmul nsz float %227, %234
  %251 = fmul nsz float %229, %235
  %252 = fadd nsz float %250, %251
  %253 = fmul nsz float %231, %236
  %254 = fadd nsz float %252, %253
  %255 = fadd nsz float %254, %237
  %256 = fmul nsz float %227, %238
  %257 = fmul nsz float %229, %239
  %258 = fadd nsz float %256, %257
  %259 = fmul nsz float %231, %240
  %260 = fadd nsz float %258, %259
  %261 = fadd nsz float %260, %241
  %262 = fmul nsz float %227, %242
  %263 = fmul nsz float %229, %243
  %264 = fadd nsz float %262, %263
  %265 = fmul nsz float %231, %244
  %266 = fadd nsz float %264, %265
  %267 = fadd nsz float %266, %245
  %268 = fmul nsz float %227, %246
  %269 = fmul nsz float %229, %247
  %270 = fadd nsz float %268, %269
  %271 = fmul nsz float %231, %248
  %272 = fadd nsz float %270, %271
  %273 = fadd nsz float %272, %249


vs


  %234 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 0)
  %235 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 4)
  %236 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 8)
  %237 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 12)
  %238 = fmul nsz float %227, %234
  %239 = fmul nsz float %229, %235
  %240 = fadd nsz float %238, %239
  %241 = fmul nsz float %231, %236
  %242 = fadd nsz float %240, %241
  %243 = fadd nsz float %242, %237
  %244 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 16)
  %245 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 20)
  %246 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 24)
  %247 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 28)
  %248 = fmul nsz float %227, %244
  %249 = fmul nsz float %229, %245
  %250 = fadd nsz float %248, %249
  %251 = fmul nsz float %231, %246
  %252 = fadd nsz float %250, %251
  %253 = fadd nsz float %252, %247
  %254 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 32)
  %255 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 36)
  %256 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 40)
  %257 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 44)
  %258 = fmul nsz float %227, %254
  %259 = fmul nsz float %229, %255
  %260 = fadd nsz float %258, %259
  %261 = fmul nsz float %231, %256
  %262 = fadd nsz float %260, %261
  %263 = fadd nsz float %262, %257
  %264 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 48)
  %265 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 52)
  %266 = call nsz float @llvm.SI.load.const.v4i32(<4 x i32> %233, i32 56)
  %267