Re: [Mesa-dev] [PATCH 10/95] i965/vec4: handle 32 and 64 bit channels in liveness analysis
"Juan A. Suarez Romero" writes: > On Mon, 2016-08-08 at 16:12 +0200, Juan A. Suarez Romero wrote: >> Hmm... what about the case of exec_size == 4 and writing just a >> float? >> >> I understand in this case we only should mark one word, so the loop >> should not be 2*inst->regs_written. >> >> Note that in the original patch that was the case for using offs == >> 0. >> So I guess we should compute if we should use 1 or 2 times >> regs_written. > > > Thinking more carefully, I think it is fine using 2*regs_written/read > in this case. > The fundamental problem is that regs_written is expressed in units of whole GRFs, but SIMD4 single-precision instructions typically write less than one GRF, which will cause you to overestimate the number of dwords written or read by the instruction (that's exactly what I was referring to earlier with losing accuracy with sub-GRF writes). I wouldn't worry too much about this right now, because it will only cause you to calculate slightly pessimistic live intervals for the variables used in such instructions, and shouldn't ever lead to program miscompilation. Overriding the dwords-per-vec4-component factor depending on the SIMD width of the instruction seems like a hack that could potentially hurt more than it would help (What if a single-precision SIMD4 instruction actually reads or writes the whole GRF register because it has multiple output phases? How do you know?). The right solution would be to represent regs_written in units that give you sufficient accuracy, but I think there is enough going on in this series that I wouldn't bother to fix that in addition. We hadn't been using SIMD4 instructions in the vec4 back-end until now, which means that this inaccuracy shouldn't affect existing code or lead to performance regressions, and the FS back-end has exactly the same problem so you could potentially fix both at the same time. ;) > So forget about my comment! > > > J.A. signature.asc Description: PGP signature ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH 10/95] i965/vec4: handle 32 and 64 bit channels in liveness analysis
On Mon, 2016-08-08 at 16:12 +0200, Juan A. Suarez Romero wrote: > Hmm... what about the case of exec_size == 4 and writing just a > float? > > I understand in this case we only should mark one word, so the loop > should not be 2*inst->regs_written. > > Note that in the original patch that was the case for using offs == > 0. > So I guess we should compute if we should use 1 or 2 times > regs_written. Thinking more carefully, I think it is fine using 2*regs_written/read in this case. So forget about my comment! J.A. signature.asc Description: This is a digitally signed message part ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH 10/95] i965/vec4: handle 32 and 64 bit channels in liveness analysis
On Fri, 2016-07-29 at 12:59 -0700, Francisco Jerez wrote: > | for (unsigned i = 0; i < 2 * inst->regs_written; i++) > { > | for (int c = 0; c < 4; c++) > | result_live[c] |= BITSET_TEST( > | live, var_from_reg(alloc, inst->dst, c, i)); > | } > > Note that the offset call goes away. The factor of two makes sense > because you need to check 4 * 2 * regs_written bits in total, since > you're keeping track of 8 bits per GRF. It would likely make sense > to Hmm... what about the case of exec_size == 4 and writing just a float? I understand in this case we only should mark one word, so the loop should not be 2*inst->regs_written. Note that in the original patch that was the case for using offs == 0. So I guess we should compute if we should use 1 or 2 times regs_written. J.A. signature.asc Description: This is a digitally signed message part ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH 10/95] i965/vec4: handle 32 and 64 bit channels in liveness analysis
On Fri, 2016-07-29 at 12:59 -0700, Francisco Jerez wrote: > Iago Toral Quiroga writes: > > > > > From: "Juan A. Suarez Romero" > > > > Our current data flow analysis does not take into account that > > channels > > on 64-bit operands are 64-bit. This is a problem when the same > > register > > is accessed using both 64-bit and 32-bit channels. This is very > > common > > in operations where we need to access 64-bit data in 32-bit chunks, > > such as the double packing and packing operations. > > > > This patch changes the analysis by checking the bits that each > > source > > or destination datatype needs. Actually, rather than bits, we use > > blocks of 32bits, which is the minimum channel size. > > > > Because a vgrf can contain a dvec4 (256 bits), we reserve 8 > > 32-bit blocks to map the channels. > > --- > > .../dri/i965/brw_vec4_dead_code_eliminate.cpp | 30 > > +- > > .../drivers/dri/i965/brw_vec4_live_variables.cpp | 36 > > +- > > .../drivers/dri/i965/brw_vec4_live_variables.h | 7 +++-- > > 3 files changed, 55 insertions(+), 18 deletions(-) > > > > diff --git > > a/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp > > b/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp > > index c643212..1f7cd49 100644 > > --- a/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp > > +++ b/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp > > @@ -59,10 +59,16 @@ vec4_visitor::dead_code_eliminate() > > bool result_live[4] = { false }; > > > > if (inst->dst.file == VGRF) { > > + const unsigned offs = type_sz(inst->dst.type) == 8 > > ? > > + 1 : (inst->exec_size == 4 ? 0 : 4); > > I suggest that instead of duplicating this calculation in every > single > user of the live sets you handle this in the var_from_reg helpers by > adding a new argument to specify which dword of the register > component > you want to take, like: > > > > > inline unsigned > > var_from_reg(const simple_allocator &alloc, const src_reg ®, > > unsigned c = 0, unsigned k = 0); > > > > inline unsigned > > var_from_reg(const simple_allocator &alloc, const dst_reg ®, > > unsigned c = 0, unsigned k = 0); > > This would give you the variable index for the k-th dword of the c-th > component of register reg (feel free to write it in a comment for the > sake of documentation). Doing it in a single place would be > advantageous because we are likely to have to change the calculation > in > the future (e.g. to account for strides), and going through every > user > will be a pain. You could implement it as follows: > > > > > inline unsigned > > var_from_reg(const simple_allocator &alloc, const src_reg ®, > > unsigned c = 0, unsigned i = 0) > > { > > /* ... */ > > const unsigned csize = DIV_ROUND_UP(type_sz(reg.type), 4); > > return 8 * (alloc.offsets[reg.nr] + reg.reg_offset) + > > (BRW_GET_SWZ(reg.swizzle, c) + i / csize * 4) * csize + > > i % csize; > > } > > > > > for (unsigned i = 0; i < inst->regs_written; i++) { > > - for (int c = 0; c < 4; c++) > > - result_live[c] |= BITSET_TEST( > > -live, var_from_reg(alloc, offset(inst- > > >dst, i), c)); > > + for (int c = 0; c < 4; c++) { > > + const unsigned v = > > +var_from_reg(alloc, offset(inst->dst, i), > > c); > > + result_live[c] |= BITSET_TEST(live, v); > > + if (offs) > > +result_live[c] |= BITSET_TEST(live, v + > > offs); > > + } > > Then you could simplify this and most of the following hunks > massively, > like: > > > > > for (unsigned i = 0; i < 2 * inst->regs_written; i++) > > { > > for (int c = 0; c < 4; c++) > > result_live[c] |= BITSET_TEST( > > live, var_from_reg(alloc, inst->dst, c, i)); > > } > > Note that the offset call goes away. The factor of two makes sense > because you need to check 4 * 2 * regs_written bits in total, since > you're keeping track of 8 bits per GRF. It would likely make sense > to > represent vec4_instruction::regs_written and ::regs_read with smaller > granularity (e.g. in bytes) to avoid losing accuracy with sub-GRF > writes > and reads, but that can probably be done in a separate change. Thanks for the suggestion!! I'll proceed as you said. > > > > > } > > } else { > > for (unsigned c = 0; c < 4; c++) > > @@ -110,11 +116,16 @@ vec4_visitor::dead_code_eliminate() > > } > > > > if (inst->dst.file == VGRF && !inst->predicate) { > > +const unsigned offs = type_sz(inst->dst.type) == 8 ? > > + 1 : (inst->exec_size == 4 ? 0 : 4); > >
Re: [Mesa-dev] [PATCH 10/95] i965/vec4: handle 32 and 64 bit channels in liveness analysis
Iago Toral Quiroga writes: > From: "Juan A. Suarez Romero" > > Our current data flow analysis does not take into account that channels > on 64-bit operands are 64-bit. This is a problem when the same register > is accessed using both 64-bit and 32-bit channels. This is very common > in operations where we need to access 64-bit data in 32-bit chunks, > such as the double packing and packing operations. > > This patch changes the analysis by checking the bits that each source > or destination datatype needs. Actually, rather than bits, we use > blocks of 32bits, which is the minimum channel size. > > Because a vgrf can contain a dvec4 (256 bits), we reserve 8 > 32-bit blocks to map the channels. > --- > .../dri/i965/brw_vec4_dead_code_eliminate.cpp | 30 +- > .../drivers/dri/i965/brw_vec4_live_variables.cpp | 36 > +- > .../drivers/dri/i965/brw_vec4_live_variables.h | 7 +++-- > 3 files changed, 55 insertions(+), 18 deletions(-) > > diff --git a/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp > b/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp > index c643212..1f7cd49 100644 > --- a/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp > +++ b/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp > @@ -59,10 +59,16 @@ vec4_visitor::dead_code_eliminate() > bool result_live[4] = { false }; > > if (inst->dst.file == VGRF) { > + const unsigned offs = type_sz(inst->dst.type) == 8 ? > + 1 : (inst->exec_size == 4 ? 0 : 4); I suggest that instead of duplicating this calculation in every single user of the live sets you handle this in the var_from_reg helpers by adding a new argument to specify which dword of the register component you want to take, like: | inline unsigned | var_from_reg(const simple_allocator &alloc, const src_reg ®, | unsigned c = 0, unsigned k = 0); | | inline unsigned | var_from_reg(const simple_allocator &alloc, const dst_reg ®, | unsigned c = 0, unsigned k = 0); This would give you the variable index for the k-th dword of the c-th component of register reg (feel free to write it in a comment for the sake of documentation). Doing it in a single place would be advantageous because we are likely to have to change the calculation in the future (e.g. to account for strides), and going through every user will be a pain. You could implement it as follows: | inline unsigned | var_from_reg(const simple_allocator &alloc, const src_reg ®, | unsigned c = 0, unsigned i = 0) | { |/* ... */ |const unsigned csize = DIV_ROUND_UP(type_sz(reg.type), 4); |return 8 * (alloc.offsets[reg.nr] + reg.reg_offset) + | (BRW_GET_SWZ(reg.swizzle, c) + i / csize * 4) * csize + | i % csize; | } > for (unsigned i = 0; i < inst->regs_written; i++) { > - for (int c = 0; c < 4; c++) > - result_live[c] |= BITSET_TEST( > -live, var_from_reg(alloc, offset(inst->dst, i), c)); > + for (int c = 0; c < 4; c++) { > + const unsigned v = > +var_from_reg(alloc, offset(inst->dst, i), c); > + result_live[c] |= BITSET_TEST(live, v); > + if (offs) > +result_live[c] |= BITSET_TEST(live, v + offs); > + } Then you could simplify this and most of the following hunks massively, like: | for (unsigned i = 0; i < 2 * inst->regs_written; i++) { | for (int c = 0; c < 4; c++) | result_live[c] |= BITSET_TEST( |live, var_from_reg(alloc, inst->dst, c, i)); | } Note that the offset call goes away. The factor of two makes sense because you need to check 4 * 2 * regs_written bits in total, since you're keeping track of 8 bits per GRF. It would likely make sense to represent vec4_instruction::regs_written and ::regs_read with smaller granularity (e.g. in bytes) to avoid losing accuracy with sub-GRF writes and reads, but that can probably be done in a separate change. > } > } else { > for (unsigned c = 0; c < 4; c++) > @@ -110,11 +116,16 @@ vec4_visitor::dead_code_eliminate() > } > > if (inst->dst.file == VGRF && !inst->predicate) { > +const unsigned offs = type_sz(inst->dst.type) == 8 ? > + 1 : (inst->exec_size == 4 ? 0 : 4); > for (unsigned i = 0; i < inst->regs_written; i++) { > for (int c = 0; c < 4; c++) { >if (inst->dst.writemask & (1 << c)) { > - BITSET_CLEAR(live, var_from_reg(alloc, > - offset(inst->dst, i), > c)); > + const unsigned v = > +var_from_reg(alloc, offset(inst->d
Re: [Mesa-dev] [PATCH 10/95] i965/vec4: handle 32 and 64 bit channels in liveness analysis
This patch should not be so early in the series. It uses the execsize information in the IR which is not available until patch 38, so it should really go after that. Iago On Tue, 2016-07-19 at 12:40 +0200, Iago Toral Quiroga wrote: > From: "Juan A. Suarez Romero" > > Our current data flow analysis does not take into account that > channels > on 64-bit operands are 64-bit. This is a problem when the same > register > is accessed using both 64-bit and 32-bit channels. This is very > common > in operations where we need to access 64-bit data in 32-bit chunks, > such as the double packing and packing operations. > > This patch changes the analysis by checking the bits that each source > or destination datatype needs. Actually, rather than bits, we use > blocks of 32bits, which is the minimum channel size. > > Because a vgrf can contain a dvec4 (256 bits), we reserve 8 > 32-bit blocks to map the channels. > --- > .../dri/i965/brw_vec4_dead_code_eliminate.cpp | 30 > +- > .../drivers/dri/i965/brw_vec4_live_variables.cpp | 36 > +- > .../drivers/dri/i965/brw_vec4_live_variables.h | 7 +++-- > 3 files changed, 55 insertions(+), 18 deletions(-) > > diff --git > a/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp > b/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp > index c643212..1f7cd49 100644 > --- a/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp > +++ b/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp > @@ -59,10 +59,16 @@ vec4_visitor::dead_code_eliminate() > bool result_live[4] = { false }; > > if (inst->dst.file == VGRF) { > + const unsigned offs = type_sz(inst->dst.type) == 8 ? > + 1 : (inst->exec_size == 4 ? 0 : 4); > for (unsigned i = 0; i < inst->regs_written; i++) { > - for (int c = 0; c < 4; c++) > - result_live[c] |= BITSET_TEST( > -live, var_from_reg(alloc, offset(inst->dst, > i), c)); > + for (int c = 0; c < 4; c++) { > + const unsigned v = > +var_from_reg(alloc, offset(inst->dst, i), > c); > + result_live[c] |= BITSET_TEST(live, v); > + if (offs) > +result_live[c] |= BITSET_TEST(live, v + > offs); > + } > } > } else { > for (unsigned c = 0; c < 4; c++) > @@ -110,11 +116,16 @@ vec4_visitor::dead_code_eliminate() > } > > if (inst->dst.file == VGRF && !inst->predicate) { > +const unsigned offs = type_sz(inst->dst.type) == 8 ? > + 1 : (inst->exec_size == 4 ? 0 : 4); > for (unsigned i = 0; i < inst->regs_written; i++) { > for (int c = 0; c < 4; c++) { > if (inst->dst.writemask & (1 << c)) { > - BITSET_CLEAR(live, var_from_reg(alloc, > - offset(inst- > >dst, i), c)); > + const unsigned v = > +var_from_reg(alloc, offset(inst->dst, i), > c); > + BITSET_CLEAR(live, v); > + if (offs) > +BITSET_CLEAR(live, v + offs); > } > } > } > @@ -132,10 +143,15 @@ vec4_visitor::dead_code_eliminate() > > for (int i = 0; i < 3; i++) { > if (inst->src[i].file == VGRF) { > + const unsigned offs = type_sz(inst->src[i].type) == 8 > ? > + 1 : (inst->exec_size == 4 ? 0 : 4); > for (unsigned j = 0; j < inst->regs_read(i); j++) { > for (int c = 0; c < 4; c++) { > - BITSET_SET(live, var_from_reg(alloc, > - offset(inst- > >src[i], j), c)); > + const unsigned v = > +var_from_reg(alloc, offset(inst->src[i], j), > c); > + BITSET_SET(live, v); > + if (offs) > +BITSET_SET(live, v + offs); > } > } > } > diff --git a/src/mesa/drivers/dri/i965/brw_vec4_live_variables.cpp > b/src/mesa/drivers/dri/i965/brw_vec4_live_variables.cpp > index 57d5fbb..bfc5e44 100644 > --- a/src/mesa/drivers/dri/i965/brw_vec4_live_variables.cpp > +++ b/src/mesa/drivers/dri/i965/brw_vec4_live_variables.cpp > @@ -76,12 +76,16 @@ vec4_live_variables::setup_def_use() > /* Set use[] for this instruction */ > for (unsigned int i = 0; i < 3; i++) { > if (inst->src[i].file == VGRF) { > + const unsigned offs = type_sz(inst->dst.type) == 8 ? > + 1 : (inst->exec_size == 4 ? 0 : 4); > for (unsigned j = 0; j < inst->regs_read(i); j++) { > for (int c
[Mesa-dev] [PATCH 10/95] i965/vec4: handle 32 and 64 bit channels in liveness analysis
From: "Juan A. Suarez Romero" Our current data flow analysis does not take into account that channels on 64-bit operands are 64-bit. This is a problem when the same register is accessed using both 64-bit and 32-bit channels. This is very common in operations where we need to access 64-bit data in 32-bit chunks, such as the double packing and packing operations. This patch changes the analysis by checking the bits that each source or destination datatype needs. Actually, rather than bits, we use blocks of 32bits, which is the minimum channel size. Because a vgrf can contain a dvec4 (256 bits), we reserve 8 32-bit blocks to map the channels. --- .../dri/i965/brw_vec4_dead_code_eliminate.cpp | 30 +- .../drivers/dri/i965/brw_vec4_live_variables.cpp | 36 +- .../drivers/dri/i965/brw_vec4_live_variables.h | 7 +++-- 3 files changed, 55 insertions(+), 18 deletions(-) diff --git a/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp b/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp index c643212..1f7cd49 100644 --- a/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp +++ b/src/mesa/drivers/dri/i965/brw_vec4_dead_code_eliminate.cpp @@ -59,10 +59,16 @@ vec4_visitor::dead_code_eliminate() bool result_live[4] = { false }; if (inst->dst.file == VGRF) { + const unsigned offs = type_sz(inst->dst.type) == 8 ? + 1 : (inst->exec_size == 4 ? 0 : 4); for (unsigned i = 0; i < inst->regs_written; i++) { - for (int c = 0; c < 4; c++) - result_live[c] |= BITSET_TEST( -live, var_from_reg(alloc, offset(inst->dst, i), c)); + for (int c = 0; c < 4; c++) { + const unsigned v = +var_from_reg(alloc, offset(inst->dst, i), c); + result_live[c] |= BITSET_TEST(live, v); + if (offs) +result_live[c] |= BITSET_TEST(live, v + offs); + } } } else { for (unsigned c = 0; c < 4; c++) @@ -110,11 +116,16 @@ vec4_visitor::dead_code_eliminate() } if (inst->dst.file == VGRF && !inst->predicate) { +const unsigned offs = type_sz(inst->dst.type) == 8 ? + 1 : (inst->exec_size == 4 ? 0 : 4); for (unsigned i = 0; i < inst->regs_written; i++) { for (int c = 0; c < 4; c++) { if (inst->dst.writemask & (1 << c)) { - BITSET_CLEAR(live, var_from_reg(alloc, - offset(inst->dst, i), c)); + const unsigned v = +var_from_reg(alloc, offset(inst->dst, i), c); + BITSET_CLEAR(live, v); + if (offs) +BITSET_CLEAR(live, v + offs); } } } @@ -132,10 +143,15 @@ vec4_visitor::dead_code_eliminate() for (int i = 0; i < 3; i++) { if (inst->src[i].file == VGRF) { + const unsigned offs = type_sz(inst->src[i].type) == 8 ? + 1 : (inst->exec_size == 4 ? 0 : 4); for (unsigned j = 0; j < inst->regs_read(i); j++) { for (int c = 0; c < 4; c++) { - BITSET_SET(live, var_from_reg(alloc, - offset(inst->src[i], j), c)); + const unsigned v = +var_from_reg(alloc, offset(inst->src[i], j), c); + BITSET_SET(live, v); + if (offs) +BITSET_SET(live, v + offs); } } } diff --git a/src/mesa/drivers/dri/i965/brw_vec4_live_variables.cpp b/src/mesa/drivers/dri/i965/brw_vec4_live_variables.cpp index 57d5fbb..bfc5e44 100644 --- a/src/mesa/drivers/dri/i965/brw_vec4_live_variables.cpp +++ b/src/mesa/drivers/dri/i965/brw_vec4_live_variables.cpp @@ -76,12 +76,16 @@ vec4_live_variables::setup_def_use() /* Set use[] for this instruction */ for (unsigned int i = 0; i < 3; i++) { if (inst->src[i].file == VGRF) { + const unsigned offs = type_sz(inst->dst.type) == 8 ? + 1 : (inst->exec_size == 4 ? 0 : 4); for (unsigned j = 0; j < inst->regs_read(i); j++) { for (int c = 0; c < 4; c++) { const unsigned v = var_from_reg(alloc, offset(inst->src[i], j), c); if (!BITSET_TEST(bd->def, v)) BITSET_SET(bd->use, v); + if (offs && !BITSET_TEST(bd->def, v + offs)) +BITSET_SET(bd->use, v + offs); } } } @@ -99,6 +103,8 @@ vec4_live_vari