Re: [Mesa-dev] [PATCH 10/95] i965/vec4: handle 32 and 64 bit channels in liveness analysis

2016-08-08 Thread Francisco Jerez
"Juan A. Suarez Romero"  writes:

> On Mon, 2016-08-08 at 16:12 +0200, Juan A. Suarez Romero wrote:
>> Hmm... what about the case of exec_size == 4 and writing just a
>> float? 
>> 
>> I understand in this case we only should mark one word, so the loop
>> should  not be 2*inst->regs_written.
>> 
>> Note that in the original patch that was the case for using offs ==
>> 0.
>> So I guess we should compute if we should use 1 or 2 times
>> regs_written.
>
>
> Thinking more carefully, I think it is fine using 2*regs_written/read
> in this case.
>

The fundamental problem is that regs_written is expressed in units of
whole GRFs, but SIMD4 single-precision instructions typically write less
than one GRF, which will cause you to overestimate the number of dwords
written or read by the instruction (that's exactly what I was referring
to earlier with losing accuracy with sub-GRF writes).

I wouldn't worry too much about this right now, because it will only
cause you to calculate slightly pessimistic live intervals for the
variables used in such instructions, and shouldn't ever lead to program
miscompilation.  Overriding the dwords-per-vec4-component factor
depending on the SIMD width of the instruction seems like a hack that
could potentially hurt more than it would help (What if a
single-precision SIMD4 instruction actually reads or writes the whole
GRF register because it has multiple output phases?  How do you know?).

The right solution would be to represent regs_written in units that give
you sufficient accuracy, but I think there is enough going on in this
series that I wouldn't bother to fix that in addition.  We hadn't been
using SIMD4 instructions in the vec4 back-end until now, which means
that this inaccuracy shouldn't affect existing code or lead to
performance regressions, and the FS back-end has exactly the same
problem so you could potentially fix both at the same time. ;)

> So forget about my comment!
>
>
>   J.A.


signature.asc
Description: PGP signature
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH 10/95] i965/vec4: handle 32 and 64 bit channels in liveness analysis

2016-08-08 Thread Juan A. Suarez Romero
On Mon, 2016-08-08 at 16:12 +0200, Juan A. Suarez Romero wrote:
> Hmm... what about the case of exec_size == 4 and writing just a
> float? 
> 
> I understand in this case we only should mark one word, so the loop
> should  not be 2*inst->regs_written.
> 
> Note that in the original patch that was the case for using offs ==
> 0.
> So I guess we should compute if we should use 1 or 2 times
> regs_written.


Thinking more carefully, I think it is fine using 2*regs_written/read
in this case.

So forget about my comment!


J.A.


signature.asc
Description: This is a digitally signed message part
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH 10/95] i965/vec4: handle 32 and 64 bit channels in liveness analysis

2016-08-08 Thread Juan A. Suarez Romero
On Fri, 2016-07-29 at 12:59 -0700, Francisco Jerez wrote:
> |   for (unsigned i = 0; i < 2 * inst->regs_written; i++)
> {
> |  for (int c = 0; c < 4; c++)
> | result_live[c] |= BITSET_TEST(
> |    live, var_from_reg(alloc, inst->dst, c, i));
> |   }
> 
> Note that the offset call goes away.  The factor of two makes sense
> because you need to check 4 * 2 * regs_written bits in total, since
> you're keeping track of 8 bits per GRF.  It would likely make sense
> to

Hmm... what about the case of exec_size == 4 and writing just a float? 

I understand in this case we only should mark one word, so the loop
should  not be 2*inst->regs_written.

Note that in the original patch that was the case for using offs == 0.
So I guess we should compute if we should use 1 or 2 times
regs_written.


J.A.


signature.asc
Description: This is a digitally signed message part
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH 10/95] i965/vec4: handle 32 and 64 bit channels in liveness analysis

2016-08-01 Thread Juan A. Suarez Romero
On Fri, 2016-07-29 at 12:59 -0700, Francisco Jerez wrote:
> Iago Toral Quiroga  writes:
> 
> > 
> > From: "Juan A. Suarez Romero" 
> > 
> > Our current data flow analysis does not take into account that
> > channels
> > on 64-bit operands are 64-bit. This is a problem when the same
> > register
> > is accessed using both 64-bit and 32-bit channels. This is very
> > common
> > in operations where we need to access 64-bit data in 32-bit chunks,
> > such as the double packing and packing operations.
> > 
> > This patch changes the analysis by checking the bits that each
> > source
> > or destination datatype needs. Actually, rather than bits, we use
> > blocks of 32bits, which is the minimum channel size.
> > 
> > Because a vgrf can contain a dvec4 (256 bits), we reserve 8
> > 32-bit blocks to map the channels.
> > ---
> >  .../dri/i965/brw_vec4_dead_code_eliminate.cpp  | 30
> > +-
> >  .../drivers/dri/i965/brw_vec4_live_variables.cpp   | 36
> > +-
> >  .../drivers/dri/i965/brw_vec4_live_variables.h |  7 +++--
> >  3 files changed, 55 insertions(+), 18 deletions(-)
> > 
> > diff --git
> > a/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp
> > b/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp
> > index c643212..1f7cd49 100644
> > --- a/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp
> > +++ b/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp
> > @@ -59,10 +59,16 @@ vec4_visitor::dead_code_eliminate()
> >  bool result_live[4] = { false };
> >  
> >  if (inst->dst.file == VGRF) {
> > +   const unsigned offs = type_sz(inst->dst.type) == 8
> > ?
> > +  1 : (inst->exec_size == 4 ? 0 : 4);
> 
> I suggest that instead of duplicating this calculation in every
> single
> user of the live sets you handle this in the var_from_reg helpers by
> adding a new argument to specify which dword of the register
> component
> you want to take, like:
> 
> > 
> > inline unsigned
> > var_from_reg(const simple_allocator &alloc, const src_reg ®,
> >  unsigned c = 0, unsigned k = 0);
> > 
> > inline unsigned
> > var_from_reg(const simple_allocator &alloc, const dst_reg ®,
> >  unsigned c = 0, unsigned k = 0);
> 
> This would give you the variable index for the k-th dword of the c-th
> component of register reg (feel free to write it in a comment for the
> sake of documentation).  Doing it in a single place would be
> advantageous because we are likely to have to change the calculation
> in
> the future (e.g. to account for strides), and going through every
> user
> will be a pain.  You could implement it as follows:
> 
> > 
> > inline unsigned
> > var_from_reg(const simple_allocator &alloc, const src_reg ®,
> >  unsigned c = 0, unsigned i = 0)
> > {
> >    /* ... */
> >    const unsigned csize = DIV_ROUND_UP(type_sz(reg.type), 4);
> >    return 8 * (alloc.offsets[reg.nr] + reg.reg_offset) +
> >   (BRW_GET_SWZ(reg.swizzle, c) + i / csize * 4) * csize +
> >   i % csize;
> > }
> 
> > 
> > for (unsigned i = 0; i < inst->regs_written; i++) {
> > -  for (int c = 0; c < 4; c++)
> > - result_live[c] |= BITSET_TEST(
> > -live, var_from_reg(alloc, offset(inst-
> > >dst, i), c));
> > +  for (int c = 0; c < 4; c++) {
> > + const unsigned v =
> > +var_from_reg(alloc, offset(inst->dst, i),
> > c);
> > + result_live[c] |= BITSET_TEST(live, v);
> > + if (offs)
> > +result_live[c] |= BITSET_TEST(live, v +
> > offs);
> > +  }
> 
> Then you could simplify this and most of the following hunks
> massively,
> like:
> 
> > 
> >   for (unsigned i = 0; i < 2 * inst->regs_written; i++)
> > {
> >  for (int c = 0; c < 4; c++)
> > result_live[c] |= BITSET_TEST(
> >    live, var_from_reg(alloc, inst->dst, c, i));
> >   }
> 
> Note that the offset call goes away.  The factor of two makes sense
> because you need to check 4 * 2 * regs_written bits in total, since
> you're keeping track of 8 bits per GRF.  It would likely make sense
> to
> represent vec4_instruction::regs_written and ::regs_read with smaller
> granularity (e.g. in bytes) to avoid losing accuracy with sub-GRF
> writes
> and reads, but that can probably be done in a separate change.

Thanks for the suggestion!! I'll proceed as you said.


> 
> > 
> > }
> >  } else {
> > for (unsigned c = 0; c < 4; c++)
> > @@ -110,11 +116,16 @@ vec4_visitor::dead_code_eliminate()
> >   }
> >  
> >   if (inst->dst.file == VGRF && !inst->predicate) {
> > +const unsigned offs = type_sz(inst->dst.type) == 8 ?
> > +   1 : (inst->exec_size == 4 ? 0 : 4);
> >   

Re: [Mesa-dev] [PATCH 10/95] i965/vec4: handle 32 and 64 bit channels in liveness analysis

2016-07-29 Thread Francisco Jerez
Iago Toral Quiroga  writes:

> From: "Juan A. Suarez Romero" 
>
> Our current data flow analysis does not take into account that channels
> on 64-bit operands are 64-bit. This is a problem when the same register
> is accessed using both 64-bit and 32-bit channels. This is very common
> in operations where we need to access 64-bit data in 32-bit chunks,
> such as the double packing and packing operations.
>
> This patch changes the analysis by checking the bits that each source
> or destination datatype needs. Actually, rather than bits, we use
> blocks of 32bits, which is the minimum channel size.
>
> Because a vgrf can contain a dvec4 (256 bits), we reserve 8
> 32-bit blocks to map the channels.
> ---
>  .../dri/i965/brw_vec4_dead_code_eliminate.cpp  | 30 +-
>  .../drivers/dri/i965/brw_vec4_live_variables.cpp   | 36 
> +-
>  .../drivers/dri/i965/brw_vec4_live_variables.h |  7 +++--
>  3 files changed, 55 insertions(+), 18 deletions(-)
>
> diff --git a/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp 
> b/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp
> index c643212..1f7cd49 100644
> --- a/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp
> @@ -59,10 +59,16 @@ vec4_visitor::dead_code_eliminate()
>  bool result_live[4] = { false };
>  
>  if (inst->dst.file == VGRF) {
> +   const unsigned offs = type_sz(inst->dst.type) == 8 ?
> +  1 : (inst->exec_size == 4 ? 0 : 4);

I suggest that instead of duplicating this calculation in every single
user of the live sets you handle this in the var_from_reg helpers by
adding a new argument to specify which dword of the register component
you want to take, like:

| inline unsigned
| var_from_reg(const simple_allocator &alloc, const src_reg ®,
|  unsigned c = 0, unsigned k = 0);
|
| inline unsigned
| var_from_reg(const simple_allocator &alloc, const dst_reg ®,
|  unsigned c = 0, unsigned k = 0);

This would give you the variable index for the k-th dword of the c-th
component of register reg (feel free to write it in a comment for the
sake of documentation).  Doing it in a single place would be
advantageous because we are likely to have to change the calculation in
the future (e.g. to account for strides), and going through every user
will be a pain.  You could implement it as follows:

| inline unsigned
| var_from_reg(const simple_allocator &alloc, const src_reg ®,
|  unsigned c = 0, unsigned i = 0)
| {
|/* ... */
|const unsigned csize = DIV_ROUND_UP(type_sz(reg.type), 4);
|return 8 * (alloc.offsets[reg.nr] + reg.reg_offset) +
|   (BRW_GET_SWZ(reg.swizzle, c) + i / csize * 4) * csize +
|   i % csize;
| }

> for (unsigned i = 0; i < inst->regs_written; i++) {
> -  for (int c = 0; c < 4; c++)
> - result_live[c] |= BITSET_TEST(
> -live, var_from_reg(alloc, offset(inst->dst, i), c));
> +  for (int c = 0; c < 4; c++) {
> + const unsigned v =
> +var_from_reg(alloc, offset(inst->dst, i), c);
> + result_live[c] |= BITSET_TEST(live, v);
> + if (offs)
> +result_live[c] |= BITSET_TEST(live, v + offs);
> +  }

Then you could simplify this and most of the following hunks massively,
like:

|   for (unsigned i = 0; i < 2 * inst->regs_written; i++) {
|  for (int c = 0; c < 4; c++)
| result_live[c] |= BITSET_TEST(
|live, var_from_reg(alloc, inst->dst, c, i));
|   }

Note that the offset call goes away.  The factor of two makes sense
because you need to check 4 * 2 * regs_written bits in total, since
you're keeping track of 8 bits per GRF.  It would likely make sense to
represent vec4_instruction::regs_written and ::regs_read with smaller
granularity (e.g. in bytes) to avoid losing accuracy with sub-GRF writes
and reads, but that can probably be done in a separate change.

> }
>  } else {
> for (unsigned c = 0; c < 4; c++)
> @@ -110,11 +116,16 @@ vec4_visitor::dead_code_eliminate()
>   }
>  
>   if (inst->dst.file == VGRF && !inst->predicate) {
> +const unsigned offs = type_sz(inst->dst.type) == 8 ?
> +   1 : (inst->exec_size == 4 ? 0 : 4);
>  for (unsigned i = 0; i < inst->regs_written; i++) {
> for (int c = 0; c < 4; c++) {
>if (inst->dst.writemask & (1 << c)) {
> - BITSET_CLEAR(live, var_from_reg(alloc,
> - offset(inst->dst, i), 
> c));
> + const unsigned v =
> +var_from_reg(alloc, offset(inst->d

Re: [Mesa-dev] [PATCH 10/95] i965/vec4: handle 32 and 64 bit channels in liveness analysis

2016-07-26 Thread Iago Toral
This patch should not be so early in the series. It uses the execsize
information in the IR which is not available until patch 38, so it
should really go after that.

Iago

On Tue, 2016-07-19 at 12:40 +0200, Iago Toral Quiroga wrote:
> From: "Juan A. Suarez Romero" 
> 
> Our current data flow analysis does not take into account that
> channels
> on 64-bit operands are 64-bit. This is a problem when the same
> register
> is accessed using both 64-bit and 32-bit channels. This is very
> common
> in operations where we need to access 64-bit data in 32-bit chunks,
> such as the double packing and packing operations.
> 
> This patch changes the analysis by checking the bits that each source
> or destination datatype needs. Actually, rather than bits, we use
> blocks of 32bits, which is the minimum channel size.
> 
> Because a vgrf can contain a dvec4 (256 bits), we reserve 8
> 32-bit blocks to map the channels.
> ---
>  .../dri/i965/brw_vec4_dead_code_eliminate.cpp  | 30
> +-
>  .../drivers/dri/i965/brw_vec4_live_variables.cpp   | 36
> +-
>  .../drivers/dri/i965/brw_vec4_live_variables.h |  7 +++--
>  3 files changed, 55 insertions(+), 18 deletions(-)
> 
> diff --git
> a/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp
> b/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp
> index c643212..1f7cd49 100644
> --- a/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp
> @@ -59,10 +59,16 @@ vec4_visitor::dead_code_eliminate()
>  bool result_live[4] = { false };
>  
>  if (inst->dst.file == VGRF) {
> +   const unsigned offs = type_sz(inst->dst.type) == 8 ?
> +  1 : (inst->exec_size == 4 ? 0 : 4);
> for (unsigned i = 0; i < inst->regs_written; i++) {
> -  for (int c = 0; c < 4; c++)
> - result_live[c] |= BITSET_TEST(
> -live, var_from_reg(alloc, offset(inst->dst,
> i), c));
> +  for (int c = 0; c < 4; c++) {
> + const unsigned v =
> +var_from_reg(alloc, offset(inst->dst, i),
> c);
> + result_live[c] |= BITSET_TEST(live, v);
> + if (offs)
> +result_live[c] |= BITSET_TEST(live, v +
> offs);
> +  }
> }
>  } else {
> for (unsigned c = 0; c < 4; c++)
> @@ -110,11 +116,16 @@ vec4_visitor::dead_code_eliminate()
>   }
>  
>   if (inst->dst.file == VGRF && !inst->predicate) {
> +const unsigned offs = type_sz(inst->dst.type) == 8 ?
> +   1 : (inst->exec_size == 4 ? 0 : 4);
>  for (unsigned i = 0; i < inst->regs_written; i++) {
> for (int c = 0; c < 4; c++) {
>    if (inst->dst.writemask & (1 << c)) {
> - BITSET_CLEAR(live, var_from_reg(alloc,
> - offset(inst-
> >dst, i), c));
> + const unsigned v =
> +var_from_reg(alloc, offset(inst->dst, i),
> c);
> + BITSET_CLEAR(live, v);
> + if (offs)
> +BITSET_CLEAR(live, v + offs);
>    }
> }
>  }
> @@ -132,10 +143,15 @@ vec4_visitor::dead_code_eliminate()
>  
>   for (int i = 0; i < 3; i++) {
>  if (inst->src[i].file == VGRF) {
> +   const unsigned offs = type_sz(inst->src[i].type) == 8
> ?
> +  1 : (inst->exec_size == 4 ? 0 : 4);
> for (unsigned j = 0; j < inst->regs_read(i); j++) {
>    for (int c = 0; c < 4; c++) {
> - BITSET_SET(live, var_from_reg(alloc,
> -   offset(inst-
> >src[i], j), c));
> + const unsigned v =
> +var_from_reg(alloc, offset(inst->src[i], j),
> c);
> + BITSET_SET(live, v);
> + if (offs)
> +BITSET_SET(live, v + offs);
>    }
> }
>  }
> diff --git a/src/mesa/drivers/dri/i965/brw_vec4_live_variables.cpp
> b/src/mesa/drivers/dri/i965/brw_vec4_live_variables.cpp
> index 57d5fbb..bfc5e44 100644
> --- a/src/mesa/drivers/dri/i965/brw_vec4_live_variables.cpp
> +++ b/src/mesa/drivers/dri/i965/brw_vec4_live_variables.cpp
> @@ -76,12 +76,16 @@ vec4_live_variables::setup_def_use()
>    /* Set use[] for this instruction */
>    for (unsigned int i = 0; i < 3; i++) {
>   if (inst->src[i].file == VGRF) {
> +   const unsigned offs = type_sz(inst->dst.type) == 8 ?
> +  1 : (inst->exec_size == 4 ? 0 : 4);
> for (unsigned j = 0; j < inst->regs_read(i); j++) {
>    for (int c

[Mesa-dev] [PATCH 10/95] i965/vec4: handle 32 and 64 bit channels in liveness analysis

2016-07-19 Thread Iago Toral Quiroga
From: "Juan A. Suarez Romero" 

Our current data flow analysis does not take into account that channels
on 64-bit operands are 64-bit. This is a problem when the same register
is accessed using both 64-bit and 32-bit channels. This is very common
in operations where we need to access 64-bit data in 32-bit chunks,
such as the double packing and packing operations.

This patch changes the analysis by checking the bits that each source
or destination datatype needs. Actually, rather than bits, we use
blocks of 32bits, which is the minimum channel size.

Because a vgrf can contain a dvec4 (256 bits), we reserve 8
32-bit blocks to map the channels.
---
 .../dri/i965/brw_vec4_dead_code_eliminate.cpp  | 30 +-
 .../drivers/dri/i965/brw_vec4_live_variables.cpp   | 36 +-
 .../drivers/dri/i965/brw_vec4_live_variables.h |  7 +++--
 3 files changed, 55 insertions(+), 18 deletions(-)

diff --git a/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp 
b/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp
index c643212..1f7cd49 100644
--- a/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp
+++ b/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp
@@ -59,10 +59,16 @@ vec4_visitor::dead_code_eliminate()
 bool result_live[4] = { false };
 
 if (inst->dst.file == VGRF) {
+   const unsigned offs = type_sz(inst->dst.type) == 8 ?
+  1 : (inst->exec_size == 4 ? 0 : 4);
for (unsigned i = 0; i < inst->regs_written; i++) {
-  for (int c = 0; c < 4; c++)
- result_live[c] |= BITSET_TEST(
-live, var_from_reg(alloc, offset(inst->dst, i), c));
+  for (int c = 0; c < 4; c++) {
+ const unsigned v =
+var_from_reg(alloc, offset(inst->dst, i), c);
+ result_live[c] |= BITSET_TEST(live, v);
+ if (offs)
+result_live[c] |= BITSET_TEST(live, v + offs);
+  }
}
 } else {
for (unsigned c = 0; c < 4; c++)
@@ -110,11 +116,16 @@ vec4_visitor::dead_code_eliminate()
  }
 
  if (inst->dst.file == VGRF && !inst->predicate) {
+const unsigned offs = type_sz(inst->dst.type) == 8 ?
+   1 : (inst->exec_size == 4 ? 0 : 4);
 for (unsigned i = 0; i < inst->regs_written; i++) {
for (int c = 0; c < 4; c++) {
   if (inst->dst.writemask & (1 << c)) {
- BITSET_CLEAR(live, var_from_reg(alloc,
- offset(inst->dst, i), c));
+ const unsigned v =
+var_from_reg(alloc, offset(inst->dst, i), c);
+ BITSET_CLEAR(live, v);
+ if (offs)
+BITSET_CLEAR(live, v + offs);
   }
}
 }
@@ -132,10 +143,15 @@ vec4_visitor::dead_code_eliminate()
 
  for (int i = 0; i < 3; i++) {
 if (inst->src[i].file == VGRF) {
+   const unsigned offs = type_sz(inst->src[i].type) == 8 ?
+  1 : (inst->exec_size == 4 ? 0 : 4);
for (unsigned j = 0; j < inst->regs_read(i); j++) {
   for (int c = 0; c < 4; c++) {
- BITSET_SET(live, var_from_reg(alloc,
-   offset(inst->src[i], j), 
c));
+ const unsigned v =
+var_from_reg(alloc, offset(inst->src[i], j), c);
+ BITSET_SET(live, v);
+ if (offs)
+BITSET_SET(live, v + offs);
   }
}
 }
diff --git a/src/mesa/drivers/dri/i965/brw_vec4_live_variables.cpp 
b/src/mesa/drivers/dri/i965/brw_vec4_live_variables.cpp
index 57d5fbb..bfc5e44 100644
--- a/src/mesa/drivers/dri/i965/brw_vec4_live_variables.cpp
+++ b/src/mesa/drivers/dri/i965/brw_vec4_live_variables.cpp
@@ -76,12 +76,16 @@ vec4_live_variables::setup_def_use()
 /* Set use[] for this instruction */
 for (unsigned int i = 0; i < 3; i++) {
if (inst->src[i].file == VGRF) {
+   const unsigned offs = type_sz(inst->dst.type) == 8 ?
+  1 : (inst->exec_size == 4 ? 0 : 4);
for (unsigned j = 0; j < inst->regs_read(i); j++) {
   for (int c = 0; c < 4; c++) {
  const unsigned v =
 var_from_reg(alloc, offset(inst->src[i], j), c);
  if (!BITSET_TEST(bd->def, v))
 BITSET_SET(bd->use, v);
+ if (offs && !BITSET_TEST(bd->def, v + offs))
+BITSET_SET(bd->use, v + offs);
   }
}
}
@@ -99,6 +103,8 @@ vec4_live_vari