Re: [Mesa-dev] [PATCH 41/47] i965/fs: Add reuse_16bit_conversions_register optimization

2017-09-07 Thread Francisco Jerez
Jason Ekstrand  writes:

> On Wed, Sep 6, 2017 at 3:58 PM, Chema Casanova 
> wrote:
>
>> Hi Connor and Curro,
>>
>> On 28/08/17 12:24, Alejandro Piñeiro wrote:
>> > On 27/08/17 20:24, Connor Abbott wrote:
>> >> Hi,
>> >>
>> >> On Aug 25, 2017 9:28 AM, "Alejandro Piñeiro" > >> > wrote:
>> >>
>> >> On 24/08/17 21:07, Connor Abbott wrote:
>> >> >
>> >> > Hi Alejandro,
>> >>
>> >> Hi Connor,
>> >>
>> >> >
>> >> > This seems really suspicious. If the live ranges are really
>> >> > independent, then the register allocator should be able to
>> >> assign the
>> >> > two virtual registers to the same physical register if it needs
>> to.
>> >>
>> >> Yes, it is true, the register allocator should be able to assign two
>> >> virtual registers to the same physical register. But that is done
>> >> at the
>> >> end (or really near the end), so late for the problem this
>> >> optimization
>> >> is trying to fix.
>> >>
>> >>
>> >> Well, my understanding is that the problem is long compilation times
>> >> due to spilling and our not-so-great implementation of it. So no,
>> >> register allocation is not late for the problem. As both Curro and I
>> >> explained, the change by itself can only pessimise register
>> >> allocation, so if it helps then it must be due to a bug in the
>> >> register allocator or a problem in a subsequent pass that's getting
>> >> hidden by this one.
>> >
>> > Ok.
>> >
>> >>
>> >> We are also reducing the amount of instructions used.
>> >>
>> >>
>> >> The comments in the source code say otherwise. Any instructions
>> >> eliminated were from spilling, which this pass only accidentally
>> reduces.
>> >
>> > Yes, sorry, I explained myself poorly. The optimization itself doesn't
>> > remove any instructions. But using it reduces the final number of
>> > instructions, although as you say, they are likely due reducing the
>> > spilling.
>> >
>> >>
>> >>
>> >>
>> >> Probably not really clear on the commit message. When I say
>> >> "reduce the
>> >> pressure of the register allocator" I mean having a code that the
>> >> register allocator would be able to handle without using too much
>> >> time.
>> >> The problem this optimization tries to solve is that for some 16
>> >> bit CTS
>> >> tests (some with matrices and geometry shaders), the amount of
>> virtual
>> >> registers used and instructions was really big. For the record,
>> >> initially, some tests needed 24 min just to compile. Right now,
>> thanks
>> >> to other optimizations, the slower test without this optimization
>> >> needs
>> >> 1min 30 seconds. Adding some hacky timestamps, the time used  at
>> >> fs_visitor::allocate_registers (brw_fs.cpp:6096) is:
>> >>
>> >> * While trying to schedule using the three available pre mode
>> >> heuristics: 7 seconds
>> >> * Allocation with spilling: 63 seconds
>> >> * Final schedule using SCHEDULE_POST: 19 seconds
>> >>
>> >> With this optimization, the total time goes down to 14 seconds (10
>> >> + 0 +
>> >> 3 on the previous bullet point list).
>> >>
>> >> One could argue that 1min 30 seconds is okish. But taking into
>> account
>> >> that it goes down to 14 seconds, even with some caveats (see
>> below), I
>> >> still think that it is worth to use the optimization.
>> >>
>> >> And a final comment. For that same test, this is the final stats
>> >> (using
>> >> INTEL_DEBUG):
>> >>
>> >>  * With the optimization: SIMD8 shader: 4610 instructions. 0 loops.
>> >> 130320 cycles. 15:9 spills:fills.
>> >>  * Without the optimization: SIMD8 shader: 12312 instructions. 0
>> >> loops.
>> >> 174816 cycles. 751:1851 spills:fills.
>> >>
>> >>
>> >> So, the fact that it helps at all with SIMD8 shows that my theory is
>> >> wrong, but since your pass reduces spilling, it clearly must be
>> >> avoiding a bug somewhere else. You need to compare the IR for a shader
>> >> with the problem with and without this pass right before register
>> >> allocation. Maybe the sources and destinations of the conversion
>> >> instructions interfere without the change due to some other pass
>> >> that's increasing register pressure, in which case that's the problem,
>> >> but I doubt it.
>> >
>> > Ok, thanks for the hints.
>>
>> After some research we found that we need to adapt the live_variables
>> algorithm to support 32 to 16-bit conversions. Because of the HW
>> alignment restrictions these conversions need that the result register
>> uses stride=2, so it is not continuous (stride!=1) so by definition
>> is_partial_write returns true. Any of the next last 3 conditions could
>> be true when we use 16-bit types.
>>
>> bool
>> fs_inst::is_partial_write() const
>> {
>>return ((this->predicate && this->opcode != BRW_OPCODE_SEL) ||
>>

Re: [Mesa-dev] [PATCH 41/47] i965/fs: Add reuse_16bit_conversions_register optimization

2017-09-07 Thread Jason Ekstrand
On Wed, Sep 6, 2017 at 3:58 PM, Chema Casanova 
wrote:

> Hi Connor and Curro,
>
> On 28/08/17 12:24, Alejandro Piñeiro wrote:
> > On 27/08/17 20:24, Connor Abbott wrote:
> >> Hi,
> >>
> >> On Aug 25, 2017 9:28 AM, "Alejandro Piñeiro"  >> > wrote:
> >>
> >> On 24/08/17 21:07, Connor Abbott wrote:
> >> >
> >> > Hi Alejandro,
> >>
> >> Hi Connor,
> >>
> >> >
> >> > This seems really suspicious. If the live ranges are really
> >> > independent, then the register allocator should be able to
> >> assign the
> >> > two virtual registers to the same physical register if it needs
> to.
> >>
> >> Yes, it is true, the register allocator should be able to assign two
> >> virtual registers to the same physical register. But that is done
> >> at the
> >> end (or really near the end), so late for the problem this
> >> optimization
> >> is trying to fix.
> >>
> >>
> >> Well, my understanding is that the problem is long compilation times
> >> due to spilling and our not-so-great implementation of it. So no,
> >> register allocation is not late for the problem. As both Curro and I
> >> explained, the change by itself can only pessimise register
> >> allocation, so if it helps then it must be due to a bug in the
> >> register allocator or a problem in a subsequent pass that's getting
> >> hidden by this one.
> >
> > Ok.
> >
> >>
> >> We are also reducing the amount of instructions used.
> >>
> >>
> >> The comments in the source code say otherwise. Any instructions
> >> eliminated were from spilling, which this pass only accidentally
> reduces.
> >
> > Yes, sorry, I explained myself poorly. The optimization itself doesn't
> > remove any instructions. But using it reduces the final number of
> > instructions, although as you say, they are likely due reducing the
> > spilling.
> >
> >>
> >>
> >>
> >> Probably not really clear on the commit message. When I say
> >> "reduce the
> >> pressure of the register allocator" I mean having a code that the
> >> register allocator would be able to handle without using too much
> >> time.
> >> The problem this optimization tries to solve is that for some 16
> >> bit CTS
> >> tests (some with matrices and geometry shaders), the amount of
> virtual
> >> registers used and instructions was really big. For the record,
> >> initially, some tests needed 24 min just to compile. Right now,
> thanks
> >> to other optimizations, the slower test without this optimization
> >> needs
> >> 1min 30 seconds. Adding some hacky timestamps, the time used  at
> >> fs_visitor::allocate_registers (brw_fs.cpp:6096) is:
> >>
> >> * While trying to schedule using the three available pre mode
> >> heuristics: 7 seconds
> >> * Allocation with spilling: 63 seconds
> >> * Final schedule using SCHEDULE_POST: 19 seconds
> >>
> >> With this optimization, the total time goes down to 14 seconds (10
> >> + 0 +
> >> 3 on the previous bullet point list).
> >>
> >> One could argue that 1min 30 seconds is okish. But taking into
> account
> >> that it goes down to 14 seconds, even with some caveats (see
> below), I
> >> still think that it is worth to use the optimization.
> >>
> >> And a final comment. For that same test, this is the final stats
> >> (using
> >> INTEL_DEBUG):
> >>
> >>  * With the optimization: SIMD8 shader: 4610 instructions. 0 loops.
> >> 130320 cycles. 15:9 spills:fills.
> >>  * Without the optimization: SIMD8 shader: 12312 instructions. 0
> >> loops.
> >> 174816 cycles. 751:1851 spills:fills.
> >>
> >>
> >> So, the fact that it helps at all with SIMD8 shows that my theory is
> >> wrong, but since your pass reduces spilling, it clearly must be
> >> avoiding a bug somewhere else. You need to compare the IR for a shader
> >> with the problem with and without this pass right before register
> >> allocation. Maybe the sources and destinations of the conversion
> >> instructions interfere without the change due to some other pass
> >> that's increasing register pressure, in which case that's the problem,
> >> but I doubt it.
> >
> > Ok, thanks for the hints.
>
> After some research we found that we need to adapt the live_variables
> algorithm to support 32 to 16-bit conversions. Because of the HW
> alignment restrictions these conversions need that the result register
> uses stride=2, so it is not continuous (stride!=1) so by definition
> is_partial_write returns true. Any of the next last 3 conditions could
> be true when we use 16-bit types.
>
> bool
> fs_inst::is_partial_write() const
> {
>return ((this->predicate && this->opcode != BRW_OPCODE_SEL) ||
>(this->exec_size * type_sz(this->dst.type)) < 32 ||
>!this->dst.is_contiguous() ||
>this->dst.offset % REG_SIZE != 0);
> }
>
> So at the check on the 

Re: [Mesa-dev] [PATCH 41/47] i965/fs: Add reuse_16bit_conversions_register optimization

2017-09-06 Thread Chema Casanova
Hi Connor and Curro,

On 28/08/17 12:24, Alejandro Piñeiro wrote:
> On 27/08/17 20:24, Connor Abbott wrote:
>> Hi,
>>
>> On Aug 25, 2017 9:28 AM, "Alejandro Piñeiro" > > wrote:
>>
>> On 24/08/17 21:07, Connor Abbott wrote:
>> >
>> > Hi Alejandro,
>>
>> Hi Connor,
>>
>> >
>> > This seems really suspicious. If the live ranges are really
>> > independent, then the register allocator should be able to
>> assign the
>> > two virtual registers to the same physical register if it needs to.
>>
>> Yes, it is true, the register allocator should be able to assign two
>> virtual registers to the same physical register. But that is done
>> at the
>> end (or really near the end), so late for the problem this
>> optimization
>> is trying to fix.
>>
>>
>> Well, my understanding is that the problem is long compilation times
>> due to spilling and our not-so-great implementation of it. So no,
>> register allocation is not late for the problem. As both Curro and I
>> explained, the change by itself can only pessimise register
>> allocation, so if it helps then it must be due to a bug in the
>> register allocator or a problem in a subsequent pass that's getting
>> hidden by this one.
> 
> Ok.
> 
>>
>> We are also reducing the amount of instructions used.
>>
>>
>> The comments in the source code say otherwise. Any instructions
>> eliminated were from spilling, which this pass only accidentally reduces.
> 
> Yes, sorry, I explained myself poorly. The optimization itself doesn't
> remove any instructions. But using it reduces the final number of
> instructions, although as you say, they are likely due reducing the
> spilling.
> 
>>
>>
>>
>> Probably not really clear on the commit message. When I say
>> "reduce the
>> pressure of the register allocator" I mean having a code that the
>> register allocator would be able to handle without using too much
>> time.
>> The problem this optimization tries to solve is that for some 16
>> bit CTS
>> tests (some with matrices and geometry shaders), the amount of virtual
>> registers used and instructions was really big. For the record,
>> initially, some tests needed 24 min just to compile. Right now, thanks
>> to other optimizations, the slower test without this optimization
>> needs
>> 1min 30 seconds. Adding some hacky timestamps, the time used  at
>> fs_visitor::allocate_registers (brw_fs.cpp:6096) is:
>>
>> * While trying to schedule using the three available pre mode
>> heuristics: 7 seconds
>> * Allocation with spilling: 63 seconds
>> * Final schedule using SCHEDULE_POST: 19 seconds
>>
>> With this optimization, the total time goes down to 14 seconds (10
>> + 0 +
>> 3 on the previous bullet point list).
>>
>> One could argue that 1min 30 seconds is okish. But taking into account
>> that it goes down to 14 seconds, even with some caveats (see below), I
>> still think that it is worth to use the optimization.
>>
>> And a final comment. For that same test, this is the final stats
>> (using
>> INTEL_DEBUG):
>>
>>  * With the optimization: SIMD8 shader: 4610 instructions. 0 loops.
>> 130320 cycles. 15:9 spills:fills.
>>  * Without the optimization: SIMD8 shader: 12312 instructions. 0
>> loops.
>> 174816 cycles. 751:1851 spills:fills.
>>
>>
>> So, the fact that it helps at all with SIMD8 shows that my theory is
>> wrong, but since your pass reduces spilling, it clearly must be
>> avoiding a bug somewhere else. You need to compare the IR for a shader
>> with the problem with and without this pass right before register
>> allocation. Maybe the sources and destinations of the conversion
>> instructions interfere without the change due to some other pass
>> that's increasing register pressure, in which case that's the problem,
>> but I doubt it.
> 
> Ok, thanks for the hints.

After some research we found that we need to adapt the live_variables
algorithm to support 32 to 16-bit conversions. Because of the HW
alignment restrictions these conversions need that the result register
uses stride=2, so it is not continuous (stride!=1) so by definition
is_partial_write returns true. Any of the next last 3 conditions could
be true when we use 16-bit types.

bool
fs_inst::is_partial_write() const
{
   return ((this->predicate && this->opcode != BRW_OPCODE_SEL) ||
   (this->exec_size * type_sz(this->dst.type)) < 32 ||
   !this->dst.is_contiguous() ||
   this->dst.offset % REG_SIZE != 0);
}

So at the check on the setup_one_write function at
brw_fs_live_variables.cpp the variable isn't marked as defined
completely in the block.

   if (inst->dst.file == VGRF && !inst->is_partial_write()) {
  if (!BITSET_TEST(bd->use, var))
 BITSET_SET(bd->def, var);
   }

That makes that the live start of the variable is expected to defined

Re: [Mesa-dev] [PATCH 41/47] i965/fs: Add reuse_16bit_conversions_register optimization

2017-08-28 Thread Alejandro Piñeiro
On 27/08/17 20:24, Connor Abbott wrote:
> Hi,
>
> On Aug 25, 2017 9:28 AM, "Alejandro Piñeiro"  > wrote:
>
> On 24/08/17 21:07, Connor Abbott wrote:
> >
> > Hi Alejandro,
>
> Hi Connor,
>
> >
> > This seems really suspicious. If the live ranges are really
> > independent, then the register allocator should be able to
> assign the
> > two virtual registers to the same physical register if it needs to.
>
> Yes, it is true, the register allocator should be able to assign two
> virtual registers to the same physical register. But that is done
> at the
> end (or really near the end), so late for the problem this
> optimization
> is trying to fix.
>
>
> Well, my understanding is that the problem is long compilation times
> due to spilling and our not-so-great implementation of it. So no,
> register allocation is not late for the problem. As both Curro and I
> explained, the change by itself can only pessimise register
> allocation, so if it helps then it must be due to a bug in the
> register allocator or a problem in a subsequent pass that's getting
> hidden by this one.

Ok.

>
> We are also reducing the amount of instructions used.
>
>
> The comments in the source code say otherwise. Any instructions
> eliminated were from spilling, which this pass only accidentally reduces.

Yes, sorry, I explained myself poorly. The optimization itself doesn't
remove any instructions. But using it reduces the final number of
instructions, although as you say, they are likely due reducing the
spilling.

>
>
>
> Probably not really clear on the commit message. When I say
> "reduce the
> pressure of the register allocator" I mean having a code that the
> register allocator would be able to handle without using too much
> time.
> The problem this optimization tries to solve is that for some 16
> bit CTS
> tests (some with matrices and geometry shaders), the amount of virtual
> registers used and instructions was really big. For the record,
> initially, some tests needed 24 min just to compile. Right now, thanks
> to other optimizations, the slower test without this optimization
> needs
> 1min 30 seconds. Adding some hacky timestamps, the time used  at
> fs_visitor::allocate_registers (brw_fs.cpp:6096) is:
>
> * While trying to schedule using the three available pre mode
> heuristics: 7 seconds
> * Allocation with spilling: 63 seconds
> * Final schedule using SCHEDULE_POST: 19 seconds
>
> With this optimization, the total time goes down to 14 seconds (10
> + 0 +
> 3 on the previous bullet point list).
>
> One could argue that 1min 30 seconds is okish. But taking into account
> that it goes down to 14 seconds, even with some caveats (see below), I
> still think that it is worth to use the optimization.
>
> And a final comment. For that same test, this is the final stats
> (using
> INTEL_DEBUG):
>
>  * With the optimization: SIMD8 shader: 4610 instructions. 0 loops.
> 130320 cycles. 15:9 spills:fills.
>  * Without the optimization: SIMD8 shader: 12312 instructions. 0
> loops.
> 174816 cycles. 751:1851 spills:fills.
>
>
> So, the fact that it helps at all with SIMD8 shows that my theory is
> wrong, but since your pass reduces spilling, it clearly must be
> avoiding a bug somewhere else. You need to compare the IR for a shader
> with the problem with and without this pass right before register
> allocation. Maybe the sources and destinations of the conversion
> instructions interfere without the change due to some other pass
> that's increasing register pressure, in which case that's the problem,
> but I doubt it.

Ok, thanks for the hints.

> (IIRC there's a debug option to print out the register pressure for
> each instruction, which will be helpful here).

I tried to use that option back when I was working on this patch,
without too much luck. Will try again.

> If that's not the case, you should check to see if the register
> allocator thinks the sources and destinations of the conversion
> instructions interfere, and if so figure out why - that will likely be
> the real bug.

Ok.

Thanks Connor and Curro for the comments. I will work on a different
solution.

>
>
> > This change forces the two to be the same, which constrains the
> > register allocator unecessarily and should make it worse, so I'm
> > confused as to why this would help at all.
>
> I didn't check that issue specifically, but I recently found that this
> optimization affects copy propagation/dead code eliminate. So
> there are
> still some room for improvement. But in any case, take into
> account that
> this custom optimization is only used if there is a 32 to 16 bit
> conversion, so only affects shaders with this specific feature.
>
> >
> > IIRC there were some issues where 

Re: [Mesa-dev] [PATCH 41/47] i965/fs: Add reuse_16bit_conversions_register optimization

2017-08-27 Thread Connor Abbott
Hi,

On Aug 25, 2017 9:28 AM, "Alejandro Piñeiro"  wrote:

On 24/08/17 21:07, Connor Abbott wrote:
>
> Hi Alejandro,

Hi Connor,

>
> This seems really suspicious. If the live ranges are really
> independent, then the register allocator should be able to assign the
> two virtual registers to the same physical register if it needs to.

Yes, it is true, the register allocator should be able to assign two
virtual registers to the same physical register. But that is done at the
end (or really near the end), so late for the problem this optimization
is trying to fix.


Well, my understanding is that the problem is long compilation times due to
spilling and our not-so-great implementation of it. So no, register
allocation is not late for the problem. As both Curro and I explained, the
change by itself can only pessimise register allocation, so if it helps
then it must be due to a bug in the register allocator or a problem in a
subsequent pass that's getting hidden by this one.

We are also reducing the amount of instructions used.


The comments in the source code say otherwise. Any instructions eliminated
were from spilling, which this pass only accidentally reduces.



Probably not really clear on the commit message. When I say "reduce the
pressure of the register allocator" I mean having a code that the
register allocator would be able to handle without using too much time.
The problem this optimization tries to solve is that for some 16 bit CTS
tests (some with matrices and geometry shaders), the amount of virtual
registers used and instructions was really big. For the record,
initially, some tests needed 24 min just to compile. Right now, thanks
to other optimizations, the slower test without this optimization needs
1min 30 seconds. Adding some hacky timestamps, the time used  at
fs_visitor::allocate_registers (brw_fs.cpp:6096) is:

* While trying to schedule using the three available pre mode
heuristics: 7 seconds
* Allocation with spilling: 63 seconds
* Final schedule using SCHEDULE_POST: 19 seconds

With this optimization, the total time goes down to 14 seconds (10 + 0 +
3 on the previous bullet point list).

One could argue that 1min 30 seconds is okish. But taking into account
that it goes down to 14 seconds, even with some caveats (see below), I
still think that it is worth to use the optimization.

And a final comment. For that same test, this is the final stats (using
INTEL_DEBUG):

 * With the optimization: SIMD8 shader: 4610 instructions. 0 loops.
130320 cycles. 15:9 spills:fills.
 * Without the optimization: SIMD8 shader: 12312 instructions. 0 loops.
174816 cycles. 751:1851 spills:fills.


So, the fact that it helps at all with SIMD8 shows that my theory is wrong,
but since your pass reduces spilling, it clearly must be avoiding a bug
somewhere else. You need to compare the IR for a shader with the problem
with and without this pass right before register allocation. Maybe the
sources and destinations of the conversion instructions interfere without
the change due to some other pass that's increasing register pressure, in
which case that's the problem, but I doubt it. (IIRC there's a debug option
to print out the register pressure for each instruction, which will be
helpful here). If that's not the case, you should check to see if the
register allocator thinks the sources and destinations of the conversion
instructions interfere, and if so figure out why - that will likely be the
real bug.


> This change forces the two to be the same, which constrains the
> register allocator unecessarily and should make it worse, so I'm
> confused as to why this would help at all.

I didn't check that issue specifically, but I recently found that this
optimization affects copy propagation/dead code eliminate. So there are
still some room for improvement. But in any case, take into account that
this custom optimization is only used if there is a 32 to 16 bit
conversion, so only affects shaders with this specific feature.

>
> IIRC there were some issues where we unnecessarily made the sources
> and destination of an instruction interefere with each other, but if
> that's what's causing this, then we should fix that underlying issue.
>
> (From what I remember, a lot of SIMD16 were expanded to SIMD8 in the
> generator, in which case the second half of the source is read after
> the first half of the destination is written, and we falsely thought
> that the HW did that too, so we had some code to add a fake
> interference between them, but a while ago Curro moved the expansion
> to happen before register allocation. I don't have the code in front
> of me, but I think we still have this useless code lying around, and I
> would guess this is the source of the problem.)

Taking into account what I explained before, I don't think that the
problem is the interference or this code you mention (although perhaps
Im wrong).

> 
> Connor
>
> On Aug 24, 2017 2:59 PM, "Alejandro Piñeiro" 

Re: [Mesa-dev] [PATCH 41/47] i965/fs: Add reuse_16bit_conversions_register optimization

2017-08-25 Thread Francisco Jerez
Alejandro Piñeiro  writes:

> On 24/08/17 21:07, Connor Abbott wrote:
>>
>> Hi Alejandro,
>
> Hi Connor,
>
>>
>> This seems really suspicious. If the live ranges are really
>> independent, then the register allocator should be able to assign the
>> two virtual registers to the same physical register if it needs to.
>
> Yes, it is true, the register allocator should be able to assign two
> virtual registers to the same physical register. But that is done at the
> end (or really near the end), so late for the problem this optimization
> is trying to fix. We are also reducing the amount of instructions used.
>
> Probably not really clear on the commit message. When I say "reduce the
> pressure of the register allocator" I mean having a code that the
> register allocator would be able to handle without using too much time.
> The problem this optimization tries to solve is that for some 16 bit CTS
> tests (some with matrices and geometry shaders), the amount of virtual
> registers used and instructions was really big. For the record,
> initially, some tests needed 24 min just to compile. Right now, thanks
> to other optimizations, the slower test without this optimization needs
> 1min 30 seconds. Adding some hacky timestamps, the time used  at
> fs_visitor::allocate_registers (brw_fs.cpp:6096) is:
>
> * While trying to schedule using the three available pre mode
> heuristics: 7 seconds
> * Allocation with spilling: 63 seconds
> * Final schedule using SCHEDULE_POST: 19 seconds
>
> With this optimization, the total time goes down to 14 seconds (10 + 0 +
> 3 on the previous bullet point list).
>
> One could argue that 1min 30 seconds is okish. But taking into account
> that it goes down to 14 seconds, even with some caveats (see below), I
> still think that it is worth to use the optimization.
>
> And a final comment. For that same test, this is the final stats (using
> INTEL_DEBUG):
>
>  * With the optimization: SIMD8 shader: 4610 instructions. 0 loops.
> 130320 cycles. 15:9 spills:fills.
>  * Without the optimization: SIMD8 shader: 12312 instructions. 0 loops.
> 174816 cycles. 751:1851 spills:fills.
>
>> This change forces the two to be the same, which constrains the
>> register allocator unecessarily and should make it worse, so I'm
>> confused as to why this would help at all.
>
> I didn't check that issue specifically, but I recently found that this
> optimization affects copy propagation/dead code eliminate. So there are
> still some room for improvement. But in any case, take into account that
> this custom optimization is only used if there is a 32 to 16 bit
> conversion, so only affects shaders with this specific feature.
>
>>
>> IIRC there were some issues where we unnecessarily made the sources
>> and destination of an instruction interefere with each other, but if
>> that's what's causing this, then we should fix that underlying issue.
>>
>> (From what I remember, a lot of SIMD16 were expanded to SIMD8 in the
>> generator, in which case the second half of the source is read after
>> the first half of the destination is written, and we falsely thought
>> that the HW did that too, so we had some code to add a fake
>> interference between them, but a while ago Curro moved the expansion
>> to happen before register allocation. I don't have the code in front
>> of me, but I think we still have this useless code lying around, and I
>> would guess this is the source of the problem.)
>
> Taking into account what I explained before, I don't think that the
> problem is the interference or this code you mention (although perhaps
> Im wrong).
>

I agree with Connor's feed-back on this change, this really smells like
a hack working around register allocator brokenness.  If the register
allocator is failing to assign two variables with disjoint live ranges
to the same register it has a bug.  If you forcefully merge the live
ranges of source and destination it might turn out that that wasn't the
optimal decision to take after all register pressure-wise (because it's
frequently harder to find room in the GRF for a variable with 2x the
live range than for two independent variables), so you will be
pessimizing register usage in some cases -- The register allocator
should know better than you.

>> 
>> Connor
>>
>> On Aug 24, 2017 2:59 PM, "Alejandro Piñeiro" > > wrote:
>>
>> When dealing with HF/U/UW, it is usual having a register with a
>> F/D/UD, and then convert it to HF/U/UW, and not use again the F/D/UD
>> value. In those cases it would be possible to reuse the register where
>> the F value is initially stored instead of having two. Take also into
>> account that when operating with HF/U/UW, you would need to use the
>> full register (so stride 2). Packs/unpacks would be only useful when
>> loading/storing several HF/W/UW.
>>
>> Note that no instruction is removed. The main benefict is reducing the
>> 

Re: [Mesa-dev] [PATCH 41/47] i965/fs: Add reuse_16bit_conversions_register optimization

2017-08-25 Thread Alejandro Piñeiro
On 24/08/17 21:07, Connor Abbott wrote:
>
> Hi Alejandro,

Hi Connor,

>
> This seems really suspicious. If the live ranges are really
> independent, then the register allocator should be able to assign the
> two virtual registers to the same physical register if it needs to.

Yes, it is true, the register allocator should be able to assign two
virtual registers to the same physical register. But that is done at the
end (or really near the end), so late for the problem this optimization
is trying to fix. We are also reducing the amount of instructions used.

Probably not really clear on the commit message. When I say "reduce the
pressure of the register allocator" I mean having a code that the
register allocator would be able to handle without using too much time.
The problem this optimization tries to solve is that for some 16 bit CTS
tests (some with matrices and geometry shaders), the amount of virtual
registers used and instructions was really big. For the record,
initially, some tests needed 24 min just to compile. Right now, thanks
to other optimizations, the slower test without this optimization needs
1min 30 seconds. Adding some hacky timestamps, the time used  at
fs_visitor::allocate_registers (brw_fs.cpp:6096) is:

* While trying to schedule using the three available pre mode
heuristics: 7 seconds
* Allocation with spilling: 63 seconds
* Final schedule using SCHEDULE_POST: 19 seconds

With this optimization, the total time goes down to 14 seconds (10 + 0 +
3 on the previous bullet point list).

One could argue that 1min 30 seconds is okish. But taking into account
that it goes down to 14 seconds, even with some caveats (see below), I
still think that it is worth to use the optimization.

And a final comment. For that same test, this is the final stats (using
INTEL_DEBUG):

 * With the optimization: SIMD8 shader: 4610 instructions. 0 loops.
130320 cycles. 15:9 spills:fills.
 * Without the optimization: SIMD8 shader: 12312 instructions. 0 loops.
174816 cycles. 751:1851 spills:fills.

> This change forces the two to be the same, which constrains the
> register allocator unecessarily and should make it worse, so I'm
> confused as to why this would help at all.

I didn't check that issue specifically, but I recently found that this
optimization affects copy propagation/dead code eliminate. So there are
still some room for improvement. But in any case, take into account that
this custom optimization is only used if there is a 32 to 16 bit
conversion, so only affects shaders with this specific feature.

>
> IIRC there were some issues where we unnecessarily made the sources
> and destination of an instruction interefere with each other, but if
> that's what's causing this, then we should fix that underlying issue.
>
> (From what I remember, a lot of SIMD16 were expanded to SIMD8 in the
> generator, in which case the second half of the source is read after
> the first half of the destination is written, and we falsely thought
> that the HW did that too, so we had some code to add a fake
> interference between them, but a while ago Curro moved the expansion
> to happen before register allocation. I don't have the code in front
> of me, but I think we still have this useless code lying around, and I
> would guess this is the source of the problem.)

Taking into account what I explained before, I don't think that the
problem is the interference or this code you mention (although perhaps
Im wrong).

> 
> Connor
>
> On Aug 24, 2017 2:59 PM, "Alejandro Piñeiro"  > wrote:
>
> When dealing with HF/U/UW, it is usual having a register with a
> F/D/UD, and then convert it to HF/U/UW, and not use again the F/D/UD
> value. In those cases it would be possible to reuse the register where
> the F value is initially stored instead of having two. Take also into
> account that when operating with HF/U/UW, you would need to use the
> full register (so stride 2). Packs/unpacks would be only useful when
> loading/storing several HF/W/UW.
>
> Note that no instruction is removed. The main benefict is reducing the
> amoung of registers used, so the pressure on the register allocator is
> decreased with big shaders.
>
> Possibly this could be integrated into an existing optimization, at it
> is even already done by the register allocator, but was far easier to
> write and cleaner to read as a separate optimization.
>
> We found this issue when dealing with some Vulkan CTS tests that
> needed several minutes to compile. Most of the time was spent on the
> register allocator.
>
> Right now the optimization only handles 32 to 16 bit conversion. It
> could be possible to do the equivalent for 16 to 32 bit too, but in
> practice, we didn't need it.
> ---
>  src/intel/compiler/brw_fs.cpp | 77
> +++
>  src/intel/compiler/brw_fs.h   |  1 +
>  2 

Re: [Mesa-dev] [PATCH 41/47] i965/fs: Add reuse_16bit_conversions_register optimization

2017-08-24 Thread Connor Abbott
Hi Alejandro,

This seems really suspicious. If the live ranges are really independent,
then the register allocator should be able to assign the two virtual
registers to the same physical register if it needs to. This change forces
the two to be the same, which constrains the register allocator
unecessarily and should make it worse, so I'm confused as to why this would
help at all.

IIRC there were some issues where we unnecessarily made the sources and
destination of an instruction interefere with each other, but if that's
what's causing this, then we should fix that underlying issue.

(From what I remember, a lot of SIMD16 were expanded to SIMD8 in the
generator, in which case the second half of the source is read after the
first half of the destination is written, and we falsely thought that the
HW did that too, so we had some code to add a fake interference between
them, but a while ago Curro moved the expansion to happen before register
allocation. I don't have the code in front of me, but I think we still have
this useless code lying around, and I would guess this is the source of the
problem.)

Connor

On Aug 24, 2017 2:59 PM, "Alejandro Piñeiro"  wrote:

When dealing with HF/U/UW, it is usual having a register with a
F/D/UD, and then convert it to HF/U/UW, and not use again the F/D/UD
value. In those cases it would be possible to reuse the register where
the F value is initially stored instead of having two. Take also into
account that when operating with HF/U/UW, you would need to use the
full register (so stride 2). Packs/unpacks would be only useful when
loading/storing several HF/W/UW.

Note that no instruction is removed. The main benefict is reducing the
amoung of registers used, so the pressure on the register allocator is
decreased with big shaders.

Possibly this could be integrated into an existing optimization, at it
is even already done by the register allocator, but was far easier to
write and cleaner to read as a separate optimization.

We found this issue when dealing with some Vulkan CTS tests that
needed several minutes to compile. Most of the time was spent on the
register allocator.

Right now the optimization only handles 32 to 16 bit conversion. It
could be possible to do the equivalent for 16 to 32 bit too, but in
practice, we didn't need it.
---
 src/intel/compiler/brw_fs.cpp | 77 ++
+
 src/intel/compiler/brw_fs.h   |  1 +
 2 files changed, 78 insertions(+)

diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
index b6013a5ce85..1342150b44e 100644
--- a/src/intel/compiler/brw_fs.cpp
+++ b/src/intel/compiler/brw_fs.cpp
@@ -39,6 +39,7 @@
 #include "compiler/glsl_types.h"
 #include "compiler/nir/nir_builder.h"
 #include "program/prog_parameter.h"
+#include "brw_fs_live_variables.h"

 using namespace brw;

@@ -3133,6 +3134,81 @@ fs_visitor::remove_extra_rounding_modes()
return progress;
 }

+/**
+ * When dealing with HF/W/UW, it is usual having a register with a F/D/UD,
and
+ * then convert it to HF/W/UW, and not use again the F/D/UD value. In those
+ * cases it would be possible to reuse the register where the F value is
+ * initially stored instead of having two. Take also into account that when
+ * operating with HF/W/UW, you would need to use the full register (so
stride
+ * 2). Packs/unpacks would be only useful when loading/storing several
+ * HF/W/UWs.
+ *
+ * So something like this:
+ *  mov(8) vgrf14<2>:HF, vgrf39:F
+ *
+ * Became:
+ *  mov(8) vgrf39<2>:HF, vgrf39:F
+ *
+ * Note that no instruction is removed. The main benefict is reducing the
+ * amoung of registers used, so the pressure on the register allocator is
+ * decreased with big shaders.
+ */
+bool
+fs_visitor::reuse_16bit_conversions_vgrf()
+{
+   bool progress = false;
+   int ip = -1;
+
+   calculate_live_intervals();
+
+   foreach_block_and_inst_safe (block, fs_inst, inst, cfg) {
+  ip++;
+
+  if (inst->dst.file != VGRF || inst->src[0].file != VGRF)
+ continue;
+
+  if (inst->opcode != BRW_OPCODE_MOV)
+ continue;
+
+  if (type_sz(inst->dst.type) != 2 || inst->dst.stride != 2 ||
+  type_sz(inst->src[0].type) != 4 || inst->src[0].stride != 1) {
+ continue;
+  }
+
+  int src_reg = inst->src[0].nr;
+  int src_offset = inst->src[0].offset;
+  unsigned src_var = live_intervals->var_from_vgrf[src_reg];
+  int src_end = live_intervals->end[src_var];
+  int dst_reg = inst->dst.nr;
+
+  if (src_end > ip)
+ continue;
+
+  foreach_block_and_inst(block, fs_inst, scan_inst, cfg) {
+ if (scan_inst->dst.file == VGRF &&
+ scan_inst->dst.nr == dst_reg) {
+scan_inst->dst.nr = src_reg;
+scan_inst->dst.offset = src_offset;
+progress = true;
+ }
+
+ for (int i = 0; i < scan_inst->sources; i++) {
+if (scan_inst->src[i].file == VGRF &&
+scan_inst->src[i].nr 

[Mesa-dev] [PATCH 41/47] i965/fs: Add reuse_16bit_conversions_register optimization

2017-08-24 Thread Alejandro Piñeiro
When dealing with HF/U/UW, it is usual having a register with a
F/D/UD, and then convert it to HF/U/UW, and not use again the F/D/UD
value. In those cases it would be possible to reuse the register where
the F value is initially stored instead of having two. Take also into
account that when operating with HF/U/UW, you would need to use the
full register (so stride 2). Packs/unpacks would be only useful when
loading/storing several HF/W/UW.

Note that no instruction is removed. The main benefict is reducing the
amoung of registers used, so the pressure on the register allocator is
decreased with big shaders.

Possibly this could be integrated into an existing optimization, at it
is even already done by the register allocator, but was far easier to
write and cleaner to read as a separate optimization.

We found this issue when dealing with some Vulkan CTS tests that
needed several minutes to compile. Most of the time was spent on the
register allocator.

Right now the optimization only handles 32 to 16 bit conversion. It
could be possible to do the equivalent for 16 to 32 bit too, but in
practice, we didn't need it.
---
 src/intel/compiler/brw_fs.cpp | 77 +++
 src/intel/compiler/brw_fs.h   |  1 +
 2 files changed, 78 insertions(+)

diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
index b6013a5ce85..1342150b44e 100644
--- a/src/intel/compiler/brw_fs.cpp
+++ b/src/intel/compiler/brw_fs.cpp
@@ -39,6 +39,7 @@
 #include "compiler/glsl_types.h"
 #include "compiler/nir/nir_builder.h"
 #include "program/prog_parameter.h"
+#include "brw_fs_live_variables.h"
 
 using namespace brw;
 
@@ -3133,6 +3134,81 @@ fs_visitor::remove_extra_rounding_modes()
return progress;
 }
 
+/**
+ * When dealing with HF/W/UW, it is usual having a register with a F/D/UD, and
+ * then convert it to HF/W/UW, and not use again the F/D/UD value. In those
+ * cases it would be possible to reuse the register where the F value is
+ * initially stored instead of having two. Take also into account that when
+ * operating with HF/W/UW, you would need to use the full register (so stride
+ * 2). Packs/unpacks would be only useful when loading/storing several
+ * HF/W/UWs.
+ *
+ * So something like this:
+ *  mov(8) vgrf14<2>:HF, vgrf39:F
+ *
+ * Became:
+ *  mov(8) vgrf39<2>:HF, vgrf39:F
+ *
+ * Note that no instruction is removed. The main benefict is reducing the
+ * amoung of registers used, so the pressure on the register allocator is
+ * decreased with big shaders.
+ */
+bool
+fs_visitor::reuse_16bit_conversions_vgrf()
+{
+   bool progress = false;
+   int ip = -1;
+
+   calculate_live_intervals();
+
+   foreach_block_and_inst_safe (block, fs_inst, inst, cfg) {
+  ip++;
+
+  if (inst->dst.file != VGRF || inst->src[0].file != VGRF)
+ continue;
+
+  if (inst->opcode != BRW_OPCODE_MOV)
+ continue;
+
+  if (type_sz(inst->dst.type) != 2 || inst->dst.stride != 2 ||
+  type_sz(inst->src[0].type) != 4 || inst->src[0].stride != 1) {
+ continue;
+  }
+
+  int src_reg = inst->src[0].nr;
+  int src_offset = inst->src[0].offset;
+  unsigned src_var = live_intervals->var_from_vgrf[src_reg];
+  int src_end = live_intervals->end[src_var];
+  int dst_reg = inst->dst.nr;
+
+  if (src_end > ip)
+ continue;
+
+  foreach_block_and_inst(block, fs_inst, scan_inst, cfg) {
+ if (scan_inst->dst.file == VGRF &&
+ scan_inst->dst.nr == dst_reg) {
+scan_inst->dst.nr = src_reg;
+scan_inst->dst.offset = src_offset;
+progress = true;
+ }
+
+ for (int i = 0; i < scan_inst->sources; i++) {
+if (scan_inst->src[i].file == VGRF &&
+scan_inst->src[i].nr == dst_reg) {
+   scan_inst->src[i].nr = src_reg;
+   scan_inst->src[i].offset = src_offset;
+   progress = true;
+}
+ }
+  }
+   }
+
+   if (progress)
+  invalidate_live_intervals();
+
+   return progress;
+}
+
 
 static void
 clear_deps_for_inst_src(fs_inst *inst, bool *deps, int first_grf, int grf_len)
@@ -5829,6 +5905,7 @@ fs_visitor::optimize()
 
OPT(opt_drop_redundant_mov_to_flags);
OPT(remove_extra_rounding_modes);
+   OPT(reuse_16bit_conversions_vgrf);
 
do {
   progress = false;
diff --git a/src/intel/compiler/brw_fs.h b/src/intel/compiler/brw_fs.h
index b9476e69edb..2685f748b87 100644
--- a/src/intel/compiler/brw_fs.h
+++ b/src/intel/compiler/brw_fs.h
@@ -151,6 +151,7 @@ public:
bool dead_code_eliminate();
bool remove_duplicate_mrf_writes();
bool remove_extra_rounding_modes();
+   bool reuse_16bit_conversions_vgrf();
 
bool opt_sampler_eot();
bool virtual_grf_interferes(int a, int b);
-- 
2.11.0

___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev