Re: [Mesa-dev] [PATCH 09/13] SQUASH: i965/fs: Rework fs_visitor::lower_load_payload

2015-05-05 Thread Matt Turner
On Tue, May 5, 2015 at 4:13 PM, Jason Ekstrand ja...@jlekstrand.net wrote:
 On Wed, Apr 15, 2015 at 6:44 PM, Matt Turner matts...@gmail.com wrote:
 On Wed, Apr 15, 2015 at 5:13 PM, Jason Ekstrand ja...@jlekstrand.net wrote:
 On Wed, Apr 15, 2015 at 1:31 PM, Matt Turner matts...@gmail.com wrote:
 On Wed, Apr 1, 2015 at 6:19 PM, Jason Ekstrand ja...@jlekstrand.net 
 wrote:
 Instead of the complicated and broken-by-design pile of heuristics we had
 before, we now have a straightforward lowering:

  1) All header sources are copied directly using force_writemask_all and,
 since they are guaranteed to be a single register, there are no
 force_sechalf issues.

  2) All non-header sources are copied using the exact same force_sechalf
 and saturate modifiers as the LOAD_PAYLOAD operation itself.

 Let's not do this. Nothing puts a saturate modifier on LOAD_PAYLOAD
 today, and it is kind of confusing about what it means. Can't we have
 fbwrites that write depth as well. I wouldn't think we wanted to
 saturate that.

 Sure.  I can drop saturate and just assert that it's not set.  We do
 want to keep force_sechalf and force_writemask_all though.

 I didn't think about those before, but I don't know how a load_payload
 could have force_writemask_all set. Have I missed something?

 No, no one (to my knowlege) sets force_writemask_all on it but I see
 no reason why it shouldn't be respected.  As for saturate, we do for
 fb_writes when key-clamp_fragment_color is set.
 --Jason

In your last email in this thread, I thought we'd agreed not to do
anything (i.e., not allow) saturate on load_payload. Handling it seems
confusing and moreover, unnecessary.
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH 09/13] SQUASH: i965/fs: Rework fs_visitor::lower_load_payload

2015-05-05 Thread Jason Ekstrand
On Tue, May 5, 2015 at 4:20 PM, Matt Turner matts...@gmail.com wrote:
 On Tue, May 5, 2015 at 4:13 PM, Jason Ekstrand ja...@jlekstrand.net wrote:
 On Wed, Apr 15, 2015 at 6:44 PM, Matt Turner matts...@gmail.com wrote:
 On Wed, Apr 15, 2015 at 5:13 PM, Jason Ekstrand ja...@jlekstrand.net 
 wrote:
 On Wed, Apr 15, 2015 at 1:31 PM, Matt Turner matts...@gmail.com wrote:
 On Wed, Apr 1, 2015 at 6:19 PM, Jason Ekstrand ja...@jlekstrand.net 
 wrote:
 Instead of the complicated and broken-by-design pile of heuristics we had
 before, we now have a straightforward lowering:

  1) All header sources are copied directly using force_writemask_all and,
 since they are guaranteed to be a single register, there are no
 force_sechalf issues.

  2) All non-header sources are copied using the exact same force_sechalf
 and saturate modifiers as the LOAD_PAYLOAD operation itself.

 Let's not do this. Nothing puts a saturate modifier on LOAD_PAYLOAD
 today, and it is kind of confusing about what it means. Can't we have
 fbwrites that write depth as well. I wouldn't think we wanted to
 saturate that.

 Sure.  I can drop saturate and just assert that it's not set.  We do
 want to keep force_sechalf and force_writemask_all though.

 I didn't think about those before, but I don't know how a load_payload
 could have force_writemask_all set. Have I missed something?

 No, no one (to my knowlege) sets force_writemask_all on it but I see
 no reason why it shouldn't be respected.  As for saturate, we do for
 fb_writes when key-clamp_fragment_color is set.
 --Jason

 In your last email in this thread, I thought we'd agreed not to do
 anything (i.e., not allow) saturate on load_payload. Handling it seems
 confusing and moreover, unnecessary.

Right.  It'll be a bit of a pain, but I can get that fixed...
--Jason
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH 09/13] SQUASH: i965/fs: Rework fs_visitor::lower_load_payload

2015-05-05 Thread Jason Ekstrand
On Wed, Apr 15, 2015 at 6:44 PM, Matt Turner matts...@gmail.com wrote:
 On Wed, Apr 15, 2015 at 5:13 PM, Jason Ekstrand ja...@jlekstrand.net wrote:
 On Wed, Apr 15, 2015 at 1:31 PM, Matt Turner matts...@gmail.com wrote:
 On Wed, Apr 1, 2015 at 6:19 PM, Jason Ekstrand ja...@jlekstrand.net wrote:
 Instead of the complicated and broken-by-design pile of heuristics we had
 before, we now have a straightforward lowering:

  1) All header sources are copied directly using force_writemask_all and,
 since they are guaranteed to be a single register, there are no
 force_sechalf issues.

  2) All non-header sources are copied using the exact same force_sechalf
 and saturate modifiers as the LOAD_PAYLOAD operation itself.

 Let's not do this. Nothing puts a saturate modifier on LOAD_PAYLOAD
 today, and it is kind of confusing about what it means. Can't we have
 fbwrites that write depth as well. I wouldn't think we wanted to
 saturate that.

 Sure.  I can drop saturate and just assert that it's not set.  We do
 want to keep force_sechalf and force_writemask_all though.

 I didn't think about those before, but I don't know how a load_payload
 could have force_writemask_all set. Have I missed something?

No, no one (to my knowlege) sets force_writemask_all on it but I see
no reason why it shouldn't be respected.  As for saturate, we do for
fb_writes when key-clamp_fragment_color is set.
--Jason

 I see that setup_color_payload sets force_sechalf for dual-source
 fbwrites -- that's the only case we're going to have force_sechalf
 set, right?

 That is, the Gen  6 case is going to be handled by passing 16-channel
 sources to load_payload and letting it do compr4?

 I don't think it buys us anything. If we just run copy propagation
 after lower_load_payload() we'll get the code we want.
 [snip]
 +/* The COMPR4 code took care of the first 4 sources.  We'll 
 let
 + * the regular path handle any remaining sources.  Yes, we are
 + * modifying the instruction but we're about to delete it so
 + * this really doesn't hurt anything.
 + */
 +inst-header_size += 4;

 I mean, while the comment is a true statement, why is doing this any
 better than just...

 + }
 +
 + for (uint8_t i = inst-header_size; i  inst-sources; i++) {

 ... changing this to inst-header_size + 4?

 Because the inst-header_size += 4 is predicated on it being a COMPR4
 destination while the code below handles both the remaining sources
 (in the COMPR4 case) and the regular non-COMPR4 case.

 Ahh, right. It'd be fewer lines (no commenting necessary) to just have
 a 'start' variable that you set to header_size at the top and +=4
 here.
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH 09/13] SQUASH: i965/fs: Rework fs_visitor::lower_load_payload

2015-04-15 Thread Matt Turner
On Wed, Apr 1, 2015 at 6:19 PM, Jason Ekstrand ja...@jlekstrand.net wrote:
 Instead of the complicated and broken-by-design pile of heuristics we had
 before, we now have a straightforward lowering:

  1) All header sources are copied directly using force_writemask_all and,
 since they are guaranteed to be a single register, there are no
 force_sechalf issues.

  2) All non-header sources are copied using the exact same force_sechalf
 and saturate modifiers as the LOAD_PAYLOAD operation itself.

Let's not do this. Nothing puts a saturate modifier on LOAD_PAYLOAD
today, and it is kind of confusing about what it means. Can't we have
fbwrites that write depth as well. I wouldn't think we wanted to
saturate that.

I don't think it buys us anything. If we just run copy propagation
after lower_load_payload() we'll get the code we want.

  3) In order to accommodate older gens that need interleaved colors,
 lower_load_payload detects when the destination is a COMPR4 register
 and automatically interleaves the non-header sources.  The
 lower_load_payload pass does the right thing here regardless of whether
 or not the hardware actually supports COMPR4.
 ---
  src/mesa/drivers/dri/i965/brw_fs.cpp | 154 
 ++-
  src/mesa/drivers/dri/i965/brw_fs_visitor.cpp |   3 -
  2 files changed, 80 insertions(+), 77 deletions(-)

 diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp 
 b/src/mesa/drivers/dri/i965/brw_fs.cpp
 index f8e26c0..bc45a38 100644
 --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
 +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
 @@ -3201,93 +3201,99 @@ fs_visitor::lower_load_payload()
  {
 bool progress = false;

 -   int vgrf_to_reg[alloc.count];
 -   int reg_count = 0;
 -   for (unsigned i = 0; i  alloc.count; ++i) {
 -  vgrf_to_reg[i] = reg_count;
 -  reg_count += alloc.sizes[i];
 -   }
 -
 -   struct {
 -  bool written:1; /* Whether this register has ever been written */
 -  bool force_writemask_all:1;
 -  bool force_sechalf:1;
 -   } metadata[reg_count];
 -   memset(metadata, 0, sizeof(metadata));
 -
 foreach_block_and_inst_safe (block, fs_inst, inst, cfg) {
 -  if (inst-dst.file == GRF) {
 - const int dst_reg = vgrf_to_reg[inst-dst.reg] + 
 inst-dst.reg_offset;
 - bool force_sechalf = inst-force_sechalf 
 -  !inst-force_writemask_all;
 - bool toggle_sechalf = inst-dst.width == 16 
 -   type_sz(inst-dst.type) == 4 
 -   !inst-force_writemask_all;
 - for (int i = 0; i  inst-regs_written; ++i) {
 -metadata[dst_reg + i].written = true;
 -metadata[dst_reg + i].force_sechalf = force_sechalf;
 -metadata[dst_reg + i].force_writemask_all = 
 inst-force_writemask_all;
 -force_sechalf = (toggle_sechalf != force_sechalf);
 - }
 -  }
 -
if (inst-opcode == SHADER_OPCODE_LOAD_PAYLOAD) {

We should invert this condition and 'continue' so that we can avoid an
extra level of indentation which only makes the code harder to read.

   assert(inst-dst.file == MRF || inst-dst.file == GRF);
 +
   fs_reg dst = inst-dst;

 - for (int i = 0; i  inst-sources; i++) {
 -dst.width = inst-src[i].effective_width;
 -dst.type = inst-src[i].type;
 -
 -if (inst-src[i].file == BAD_FILE) {
 -   /* Do nothing but otherwise increment as normal */
 -} else if (dst.file == MRF 
 -   dst.width == 8 
 -   brw-has_compr4 
 -   i + 4  inst-sources 
 -   inst-src[i + 4].equals(horiz_offset(inst-src[i], 
 8))) {
 -   fs_reg compr4_dst = dst;
 -   compr4_dst.reg += BRW_MRF_COMPR4;
 -   compr4_dst.width = 16;
 -   fs_reg compr4_src = inst-src[i];
 -   compr4_src.width = 16;
 -   fs_inst *mov = MOV(compr4_dst, compr4_src);
 + /* Get rid of COMPR4.  We'll add it back in if we need it */
 + if (dst.file == MRF  dst.reg  BRW_MRF_COMPR4)
 +dst.reg = dst.reg  ~BRW_MRF_COMPR4;

No point in checking whether the BRW_MRF_COMPR4 bit is set before
clearing it. Just clear it.

 +
 + dst.width = 8;
 + for (uint8_t i = 0; i  inst-header_size; i++) {
 +if (inst-src[i].file != BAD_FILE) {
 +   fs_reg mov_dst = retype(dst, BRW_REGISTER_TYPE_UD);
 +   fs_reg mov_src = retype(inst-src[i], BRW_REGISTER_TYPE_UD);
 +   mov_src.width = 8;
 +   fs_inst *mov = MOV(mov_dst, mov_src);
 mov-force_writemask_all = true;
 inst-insert_before(block, mov);
 -   /* Mark i+4 as BAD_FILE so we don't emit a MOV for it */
 -   inst-src[i + 4].file = BAD_FILE;
 -} else {
 -   fs_inst *mov = MOV(dst, inst-src[i]);
 -   

Re: [Mesa-dev] [PATCH 09/13] SQUASH: i965/fs: Rework fs_visitor::lower_load_payload

2015-04-15 Thread Matt Turner
On Fri, Apr 3, 2015 at 7:28 AM, Francisco Jerez curroje...@riseup.net wrote:
 Jason Ekstrand ja...@jlekstrand.net writes:

 On Thu, Apr 2, 2015 at 3:01 AM, Francisco Jerez curroje...@riseup.net 
 wrote:
 Jason Ekstrand ja...@jlekstrand.net writes:

 Instead of the complicated and broken-by-design pile of heuristics we had
 before, we now have a straightforward lowering:

  1) All header sources are copied directly using force_writemask_all and,
 since they are guaranteed to be a single register, there are no
 force_sechalf issues.

  2) All non-header sources are copied using the exact same force_sechalf
 and saturate modifiers as the LOAD_PAYLOAD operation itself.

  3) In order to accommodate older gens that need interleaved colors,
 lower_load_payload detects when the destination is a COMPR4 register
 and automatically interleaves the non-header sources.  The
 lower_load_payload pass does the right thing here regardless of whether
 or not the hardware actually supports COMPR4.

 I had a quick glance at the series and it seems to be going in the right
 direction.

 One thing I honestly don't like is the ad-hoc and IMHO premature
 treatment of payload headers, it still feels like the LOAD_PAYLOAD
 instruction has more complex semantics than necessary and the benefit is
 unclear to me.

 I suppose that your motivation was to avoid setting force_writemask_all
 in LOAD_PAYLOAD instructions with header.  The optimizer should be able
 to cope with those flags and get rid of them from the resulting moves
 where they are redundant, and if it's not able to it looks like
 something that should be fixed anyway.  The explicit handling of headers
 is responsible for much of the churn in this series and is likely to
 complicate users of LOAD_PAYLOAD and optimization passes that have to
 manipulate them.

 Avoiding force_writemask_all is only half of the motivation and the
 small half at that.  A header source, more properly defined, is a
 single physical register that, conceptually, applies to all channels.
 Effectively, a header source (I should have stated this clearly) has
 two properties:

 1) It has force_writemask_all set
 2) It is exactly one physical hardware register.

 This second property is the more important of the two.  Most of the
 disaster of the previous LOAD_PAYLOAD implementation was that we did a
 pile of guesswork and had a ill-conceved effective width think in
 order to figure out how big the register actually was.  Making the
 user specify which sources are header sources eliminates that
 guesswork.  It also has the nice side-effect that we can do the right
 force_writemask_all and we can properly handle COMPR4 for the the
 user.

 Yeah, true, but this seems like the least orthogonal and most annoying
 to use solution for this problem, it forces the caller to provide
 redundant information, it takes into account the saturate flag on some
 arguments and not others,

There's no way to get a saturated load_payload, as far as I can tell.

Let's just avoid the problem you mention all together and continue not
doing it. I agree that this patch adding support for that is not a
good idea.

 it shuffles sources with respect to the
 specified order when COMPR4 is set, but only for the first four
 non-header sources.

To confirm I understand -- the existing lower_load_payload takes 2x
8-channel color sources and recognizes that src[i] and src[i+4] can be
emitted as a compr4 MOV. The proposed patch changes that to take 1x
16-channel color source and just emits a compr4 MOV.

I don't really find that change particularly contentious. In fact, I
kind of wonder what case we were trying to handle by allowing separate
sources for the first 8-channels and the second 8-channels. Is there
some case where these are going to be produced by different
instructions that can't do compr4? Texturing or math instructions come
to mind, but they can't have MRF destinations anyway.


In another email, you mention that compr4 change in this patch means
that optimization passes need to know some extra information.

I'm not able to think of any places where we'd need that information
in optimization passes. The only passes that really operate on
load_payload are CSE, copy propagate, and register coalescing; and I
don't think any of these will touch things writing to MRFs. Am I
forgetting something?

I think I understand the concern in general -- the sources are copied
into different destination registers dependent on (dst.reg 
BRW_MRF_COMPR4) == 0. That's a little strange, but that strangeness is
in the hardware. I mean, if it's strange for load_payload to do this,
isn't it strange for other hardware instructions to do it? Or is it
just that it's a virtual opcode doing a hardware-specific thing?

I don't know. It seems like it's just hiding a little bit of
complexity (give me 16-channels and I'll emit compr4 MOVs to put them
in the right hardware-specific places).

 I think any of the following 

Re: [Mesa-dev] [PATCH 09/13] SQUASH: i965/fs: Rework fs_visitor::lower_load_payload

2015-04-15 Thread Jason Ekstrand
On Wed, Apr 15, 2015 at 1:31 PM, Matt Turner matts...@gmail.com wrote:
 On Wed, Apr 1, 2015 at 6:19 PM, Jason Ekstrand ja...@jlekstrand.net wrote:
 Instead of the complicated and broken-by-design pile of heuristics we had
 before, we now have a straightforward lowering:

  1) All header sources are copied directly using force_writemask_all and,
 since they are guaranteed to be a single register, there are no
 force_sechalf issues.

  2) All non-header sources are copied using the exact same force_sechalf
 and saturate modifiers as the LOAD_PAYLOAD operation itself.

 Let's not do this. Nothing puts a saturate modifier on LOAD_PAYLOAD
 today, and it is kind of confusing about what it means. Can't we have
 fbwrites that write depth as well. I wouldn't think we wanted to
 saturate that.

Sure.  I can drop saturate and just assert that it's not set.  We do
want to keep force_sechalf and force_writemask_all though.

 I don't think it buys us anything. If we just run copy propagation
 after lower_load_payload() we'll get the code we want.

  3) In order to accommodate older gens that need interleaved colors,
 lower_load_payload detects when the destination is a COMPR4 register
 and automatically interleaves the non-header sources.  The
 lower_load_payload pass does the right thing here regardless of whether
 or not the hardware actually supports COMPR4.
 ---
  src/mesa/drivers/dri/i965/brw_fs.cpp | 154 
 ++-
  src/mesa/drivers/dri/i965/brw_fs_visitor.cpp |   3 -
  2 files changed, 80 insertions(+), 77 deletions(-)

 diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp 
 b/src/mesa/drivers/dri/i965/brw_fs.cpp
 index f8e26c0..bc45a38 100644
 --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
 +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
 @@ -3201,93 +3201,99 @@ fs_visitor::lower_load_payload()
  {
 bool progress = false;

 -   int vgrf_to_reg[alloc.count];
 -   int reg_count = 0;
 -   for (unsigned i = 0; i  alloc.count; ++i) {
 -  vgrf_to_reg[i] = reg_count;
 -  reg_count += alloc.sizes[i];
 -   }
 -
 -   struct {
 -  bool written:1; /* Whether this register has ever been written */
 -  bool force_writemask_all:1;
 -  bool force_sechalf:1;
 -   } metadata[reg_count];
 -   memset(metadata, 0, sizeof(metadata));
 -
 foreach_block_and_inst_safe (block, fs_inst, inst, cfg) {
 -  if (inst-dst.file == GRF) {
 - const int dst_reg = vgrf_to_reg[inst-dst.reg] + 
 inst-dst.reg_offset;
 - bool force_sechalf = inst-force_sechalf 
 -  !inst-force_writemask_all;
 - bool toggle_sechalf = inst-dst.width == 16 
 -   type_sz(inst-dst.type) == 4 
 -   !inst-force_writemask_all;
 - for (int i = 0; i  inst-regs_written; ++i) {
 -metadata[dst_reg + i].written = true;
 -metadata[dst_reg + i].force_sechalf = force_sechalf;
 -metadata[dst_reg + i].force_writemask_all = 
 inst-force_writemask_all;
 -force_sechalf = (toggle_sechalf != force_sechalf);
 - }
 -  }
 -
if (inst-opcode == SHADER_OPCODE_LOAD_PAYLOAD) {

 We should invert this condition and 'continue' so that we can avoid an
 extra level of indentation which only makes the code harder to read.

Sure.  I can do that.

   assert(inst-dst.file == MRF || inst-dst.file == GRF);
 +
   fs_reg dst = inst-dst;

 - for (int i = 0; i  inst-sources; i++) {
 -dst.width = inst-src[i].effective_width;
 -dst.type = inst-src[i].type;
 -
 -if (inst-src[i].file == BAD_FILE) {
 -   /* Do nothing but otherwise increment as normal */
 -} else if (dst.file == MRF 
 -   dst.width == 8 
 -   brw-has_compr4 
 -   i + 4  inst-sources 
 -   inst-src[i + 4].equals(horiz_offset(inst-src[i], 
 8))) {
 -   fs_reg compr4_dst = dst;
 -   compr4_dst.reg += BRW_MRF_COMPR4;
 -   compr4_dst.width = 16;
 -   fs_reg compr4_src = inst-src[i];
 -   compr4_src.width = 16;
 -   fs_inst *mov = MOV(compr4_dst, compr4_src);
 + /* Get rid of COMPR4.  We'll add it back in if we need it */
 + if (dst.file == MRF  dst.reg  BRW_MRF_COMPR4)
 +dst.reg = dst.reg  ~BRW_MRF_COMPR4;

 No point in checking whether the BRW_MRF_COMPR4 bit is set before
 clearing it. Just clear it.

Sure.

 +
 + dst.width = 8;
 + for (uint8_t i = 0; i  inst-header_size; i++) {
 +if (inst-src[i].file != BAD_FILE) {
 +   fs_reg mov_dst = retype(dst, BRW_REGISTER_TYPE_UD);
 +   fs_reg mov_src = retype(inst-src[i], BRW_REGISTER_TYPE_UD);
 +   mov_src.width = 8;
 +   fs_inst *mov = MOV(mov_dst, mov_src);
 mov-force_writemask_all = true;
 

Re: [Mesa-dev] [PATCH 09/13] SQUASH: i965/fs: Rework fs_visitor::lower_load_payload

2015-04-15 Thread Matt Turner
On Wed, Apr 15, 2015 at 5:13 PM, Jason Ekstrand ja...@jlekstrand.net wrote:
 On Wed, Apr 15, 2015 at 1:31 PM, Matt Turner matts...@gmail.com wrote:
 On Wed, Apr 1, 2015 at 6:19 PM, Jason Ekstrand ja...@jlekstrand.net wrote:
 Instead of the complicated and broken-by-design pile of heuristics we had
 before, we now have a straightforward lowering:

  1) All header sources are copied directly using force_writemask_all and,
 since they are guaranteed to be a single register, there are no
 force_sechalf issues.

  2) All non-header sources are copied using the exact same force_sechalf
 and saturate modifiers as the LOAD_PAYLOAD operation itself.

 Let's not do this. Nothing puts a saturate modifier on LOAD_PAYLOAD
 today, and it is kind of confusing about what it means. Can't we have
 fbwrites that write depth as well. I wouldn't think we wanted to
 saturate that.

 Sure.  I can drop saturate and just assert that it's not set.  We do
 want to keep force_sechalf and force_writemask_all though.

I didn't think about those before, but I don't know how a load_payload
could have force_writemask_all set. Have I missed something?

I see that setup_color_payload sets force_sechalf for dual-source
fbwrites -- that's the only case we're going to have force_sechalf
set, right?

That is, the Gen  6 case is going to be handled by passing 16-channel
sources to load_payload and letting it do compr4?

 I don't think it buys us anything. If we just run copy propagation
 after lower_load_payload() we'll get the code we want.
[snip]
 +/* The COMPR4 code took care of the first 4 sources.  We'll let
 + * the regular path handle any remaining sources.  Yes, we are
 + * modifying the instruction but we're about to delete it so
 + * this really doesn't hurt anything.
 + */
 +inst-header_size += 4;

 I mean, while the comment is a true statement, why is doing this any
 better than just...

 + }
 +
 + for (uint8_t i = inst-header_size; i  inst-sources; i++) {

 ... changing this to inst-header_size + 4?

 Because the inst-header_size += 4 is predicated on it being a COMPR4
 destination while the code below handles both the remaining sources
 (in the COMPR4 case) and the regular non-COMPR4 case.

Ahh, right. It'd be fewer lines (no commenting necessary) to just have
a 'start' variable that you set to header_size at the top and +=4
here.
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH 09/13] SQUASH: i965/fs: Rework fs_visitor::lower_load_payload

2015-04-03 Thread Jason Ekstrand
On Fri, Apr 3, 2015 at 7:28 AM, Francisco Jerez curroje...@riseup.net wrote:
 Jason Ekstrand ja...@jlekstrand.net writes:

 On Thu, Apr 2, 2015 at 3:01 AM, Francisco Jerez curroje...@riseup.net 
 wrote:
 Jason Ekstrand ja...@jlekstrand.net writes:

 Instead of the complicated and broken-by-design pile of heuristics we had
 before, we now have a straightforward lowering:

  1) All header sources are copied directly using force_writemask_all and,
 since they are guaranteed to be a single register, there are no
 force_sechalf issues.

  2) All non-header sources are copied using the exact same force_sechalf
 and saturate modifiers as the LOAD_PAYLOAD operation itself.

  3) In order to accommodate older gens that need interleaved colors,
 lower_load_payload detects when the destination is a COMPR4 register
 and automatically interleaves the non-header sources.  The
 lower_load_payload pass does the right thing here regardless of whether
 or not the hardware actually supports COMPR4.

 I had a quick glance at the series and it seems to be going in the right
 direction.

 One thing I honestly don't like is the ad-hoc and IMHO premature
 treatment of payload headers, it still feels like the LOAD_PAYLOAD
 instruction has more complex semantics than necessary and the benefit is
 unclear to me.

 I suppose that your motivation was to avoid setting force_writemask_all
 in LOAD_PAYLOAD instructions with header.  The optimizer should be able
 to cope with those flags and get rid of them from the resulting moves
 where they are redundant, and if it's not able to it looks like
 something that should be fixed anyway.  The explicit handling of headers
 is responsible for much of the churn in this series and is likely to
 complicate users of LOAD_PAYLOAD and optimization passes that have to
 manipulate them.

 Avoiding force_writemask_all is only half of the motivation and the
 small half at that.  A header source, more properly defined, is a
 single physical register that, conceptually, applies to all channels.
 Effectively, a header source (I should have stated this clearly) has
 two properties:

 1) It has force_writemask_all set
 2) It is exactly one physical hardware register.

 This second property is the more important of the two.  Most of the
 disaster of the previous LOAD_PAYLOAD implementation was that we did a
 pile of guesswork and had a ill-conceved effective width think in
 order to figure out how big the register actually was.  Making the
 user specify which sources are header sources eliminates that
 guesswork.  It also has the nice side-effect that we can do the right
 force_writemask_all and we can properly handle COMPR4 for the the
 user.

 Yeah, true, but this seems like the least orthogonal and most annoying
 to use solution for this problem, it forces the caller to provide
 redundant information, it takes into account the saturate flag on some
 arguments and not others, it shuffles sources with respect to the
 specified order when COMPR4 is set, but only for the first four
 non-header sources.  I think any of the following solutions would be
 better-behaved than the current approach:

I don't know that saying which sources are headers is really
redundant.  It's explicit which is what we want.  Yes, the COMPR4
thing is a bit magical but we have to do COMPR4 in lower_load_payload
so we have to have some way of doing it and this method puts the
interleving code in one place instead of two.

 1/ Use the source width to determine the size of each copy.  This would
imply that the source width carries semantic information and hence
would have to be left alone by e.g. copy propagation.

That's what do now and it's terrible.  The effective width field was
basically a width that gets kept.

 2/ Use the instruction exec size and flags to determine the properties
of *all* copies.  This means that if a header is present the exec
size would necessarily have to be 8 and the halves of a 16-wide
register would have to be specified separately, which sounds annoying
at first but in practice wouldn't necessarily be because it could be
handled by the LOAD_PAYLOAD() helper based on the argument widths
without running into problems with optimization passes changing the
meaning of the instruction.  The semantics of the instruction itself
would be as stupid as possible, but the implementation could still
trivially recognise 16-wide and COMPR4 copies using the exact same
mechanism you are using now.

Yes, that might work.  I'll try and take a swing at it today.  It will
*hopefully* have less code churn than the solution in this series
because the magic will still happen, just in a different place.  Of
course, this solution also requires that we lower everything with
force_writemask_all and *hopefully* we can get rid of those and set
force_sechalf appropriately in optimization passes.

 3/ Split LOAD_PAYLOAD into two separate instructions, each of them 

Re: [Mesa-dev] [PATCH 09/13] SQUASH: i965/fs: Rework fs_visitor::lower_load_payload

2015-04-03 Thread Jason Ekstrand
On Fri, Apr 3, 2015 at 8:37 AM, Francisco Jerez curroje...@riseup.net wrote:
 Jason Ekstrand ja...@jlekstrand.net writes:

 On Fri, Apr 3, 2015 at 7:28 AM, Francisco Jerez curroje...@riseup.net 
 wrote:
 Jason Ekstrand ja...@jlekstrand.net writes:

 On Thu, Apr 2, 2015 at 3:01 AM, Francisco Jerez curroje...@riseup.net 
 wrote:
 Jason Ekstrand ja...@jlekstrand.net writes:

 Instead of the complicated and broken-by-design pile of heuristics we had
 before, we now have a straightforward lowering:

  1) All header sources are copied directly using force_writemask_all and,
 since they are guaranteed to be a single register, there are no
 force_sechalf issues.

  2) All non-header sources are copied using the exact same force_sechalf
 and saturate modifiers as the LOAD_PAYLOAD operation itself.

  3) In order to accommodate older gens that need interleaved colors,
 lower_load_payload detects when the destination is a COMPR4 register
 and automatically interleaves the non-header sources.  The
 lower_load_payload pass does the right thing here regardless of 
 whether
 or not the hardware actually supports COMPR4.

 I had a quick glance at the series and it seems to be going in the right
 direction.

 One thing I honestly don't like is the ad-hoc and IMHO premature
 treatment of payload headers, it still feels like the LOAD_PAYLOAD
 instruction has more complex semantics than necessary and the benefit is
 unclear to me.

 I suppose that your motivation was to avoid setting force_writemask_all
 in LOAD_PAYLOAD instructions with header.  The optimizer should be able
 to cope with those flags and get rid of them from the resulting moves
 where they are redundant, and if it's not able to it looks like
 something that should be fixed anyway.  The explicit handling of headers
 is responsible for much of the churn in this series and is likely to
 complicate users of LOAD_PAYLOAD and optimization passes that have to
 manipulate them.

 Avoiding force_writemask_all is only half of the motivation and the
 small half at that.  A header source, more properly defined, is a
 single physical register that, conceptually, applies to all channels.
 Effectively, a header source (I should have stated this clearly) has
 two properties:

 1) It has force_writemask_all set
 2) It is exactly one physical hardware register.

 This second property is the more important of the two.  Most of the
 disaster of the previous LOAD_PAYLOAD implementation was that we did a
 pile of guesswork and had a ill-conceved effective width think in
 order to figure out how big the register actually was.  Making the
 user specify which sources are header sources eliminates that
 guesswork.  It also has the nice side-effect that we can do the right
 force_writemask_all and we can properly handle COMPR4 for the the
 user.

Ok, Allow me to be a bit more explicit as to what all we need to keep track of:

 1) How big is the source register for real.  Even immediates can end
up being two registers in the copy.
 2) Do we want force_writemask_all?
 3) If not, do we want force_sechalf?
 4) On g45 and gen5, we want to use COMPR4 for interlaced movs
 5) When lowering, we want to use 16-wide moves when possible in SIMD16

With the patch series I sent, all of this is explicit except for
COMPR4 which is, admittedly, kind of magic.  Which of these sources
are headers? is a reasonable question for the caller to answer.  It
knows explicitly and it would take the LOAD_PAYLOAD helper some work
to guess it correctly.  Another option would be to guess that based on
exec sizes but then the caller has to know not to pass in the wrong
register type or the guess will be wrong.  I like explicit.

 Yeah, true, but this seems like the least orthogonal and most annoying
 to use solution for this problem, it forces the caller to provide
 redundant information, it takes into account the saturate flag on some
 arguments and not others, it shuffles sources with respect to the
 specified order when COMPR4 is set, but only for the first four
 non-header sources.  I think any of the following solutions would be
 better-behaved than the current approach:

 I don't know that saying which sources are headers is really
 redundant.  It's explicit which is what we want.  Yes, the COMPR4
 thing is a bit magical but we have to do COMPR4 in lower_load_payload
 so we have to have some way of doing it and this method puts the
 interleving code in one place instead of two.

 Well, at least with the previous approach LOAD_PAYLOAD had consistent
 (if broken) semantics across its arguments, and regardless of COMPR4
 being used or not, which IMHO is preferable to the modest code saving.

To be clear, I don't really like the way I did COMPR4 either.  I just
couldn't come up with anything better.

 1/ Use the source width to determine the size of each copy.  This would
imply that the source width carries semantic information and hence
would have to be left alone by e.g. copy 

Re: [Mesa-dev] [PATCH 09/13] SQUASH: i965/fs: Rework fs_visitor::lower_load_payload

2015-04-03 Thread Francisco Jerez
Jason Ekstrand ja...@jlekstrand.net writes:

 On Fri, Apr 3, 2015 at 8:37 AM, Francisco Jerez curroje...@riseup.net wrote:
 Jason Ekstrand ja...@jlekstrand.net writes:

 On Fri, Apr 3, 2015 at 7:28 AM, Francisco Jerez curroje...@riseup.net 
 wrote:
 Jason Ekstrand ja...@jlekstrand.net writes:

 On Thu, Apr 2, 2015 at 3:01 AM, Francisco Jerez curroje...@riseup.net 
 wrote:
 Jason Ekstrand ja...@jlekstrand.net writes:

 Instead of the complicated and broken-by-design pile of heuristics we 
 had
 before, we now have a straightforward lowering:

  1) All header sources are copied directly using force_writemask_all 
 and,
 since they are guaranteed to be a single register, there are no
 force_sechalf issues.

  2) All non-header sources are copied using the exact same force_sechalf
 and saturate modifiers as the LOAD_PAYLOAD operation itself.

  3) In order to accommodate older gens that need interleaved colors,
 lower_load_payload detects when the destination is a COMPR4 register
 and automatically interleaves the non-header sources.  The
 lower_load_payload pass does the right thing here regardless of 
 whether
 or not the hardware actually supports COMPR4.

 I had a quick glance at the series and it seems to be going in the right
 direction.

 One thing I honestly don't like is the ad-hoc and IMHO premature
 treatment of payload headers, it still feels like the LOAD_PAYLOAD
 instruction has more complex semantics than necessary and the benefit is
 unclear to me.

 I suppose that your motivation was to avoid setting force_writemask_all
 in LOAD_PAYLOAD instructions with header.  The optimizer should be able
 to cope with those flags and get rid of them from the resulting moves
 where they are redundant, and if it's not able to it looks like
 something that should be fixed anyway.  The explicit handling of headers
 is responsible for much of the churn in this series and is likely to
 complicate users of LOAD_PAYLOAD and optimization passes that have to
 manipulate them.

 Avoiding force_writemask_all is only half of the motivation and the
 small half at that.  A header source, more properly defined, is a
 single physical register that, conceptually, applies to all channels.
 Effectively, a header source (I should have stated this clearly) has
 two properties:

 1) It has force_writemask_all set
 2) It is exactly one physical hardware register.

 This second property is the more important of the two.  Most of the
 disaster of the previous LOAD_PAYLOAD implementation was that we did a
 pile of guesswork and had a ill-conceved effective width think in
 order to figure out how big the register actually was.  Making the
 user specify which sources are header sources eliminates that
 guesswork.  It also has the nice side-effect that we can do the right
 force_writemask_all and we can properly handle COMPR4 for the the
 user.

 Ok, Allow me to be a bit more explicit as to what all we need to keep track 
 of:

  1) How big is the source register for real.  Even immediates can end
 up being two registers in the copy.
  2) Do we want force_writemask_all?
  3) If not, do we want force_sechalf?
  4) On g45 and gen5, we want to use COMPR4 for interlaced movs
  5) When lowering, we want to use 16-wide moves when possible in SIMD16

 With the patch series I sent, all of this is explicit except for
 COMPR4 which is, admittedly, kind of magic.  Which of these sources
 are headers? is a reasonable question for the caller to answer.  It
 knows explicitly and it would take the LOAD_PAYLOAD helper some work
 to guess it correctly.  Another option would be to guess that based on
 exec sizes but then the caller has to know not to pass in the wrong
 register type or the guess will be wrong.  I like explicit.

I'm not advocating LOAD_PAYLOAD to perform any guesswork, I'm advocating
it to be more stupid -- Let it do as much as it can sensibly do with the
information it already has, and no more.

 Yeah, true, but this seems like the least orthogonal and most annoying
 to use solution for this problem, it forces the caller to provide
 redundant information, it takes into account the saturate flag on some
 arguments and not others, it shuffles sources with respect to the
 specified order when COMPR4 is set, but only for the first four
 non-header sources.  I think any of the following solutions would be
 better-behaved than the current approach:

 I don't know that saying which sources are headers is really
 redundant.  It's explicit which is what we want.  Yes, the COMPR4
 thing is a bit magical but we have to do COMPR4 in lower_load_payload
 so we have to have some way of doing it and this method puts the
 interleving code in one place instead of two.

 Well, at least with the previous approach LOAD_PAYLOAD had consistent
 (if broken) semantics across its arguments, and regardless of COMPR4
 being used or not, which IMHO is preferable to the modest code saving.

 To be clear, I don't really like the way I 

Re: [Mesa-dev] [PATCH 09/13] SQUASH: i965/fs: Rework fs_visitor::lower_load_payload

2015-04-03 Thread Francisco Jerez
Jason Ekstrand ja...@jlekstrand.net writes:

 On Thu, Apr 2, 2015 at 3:01 AM, Francisco Jerez curroje...@riseup.net wrote:
 Jason Ekstrand ja...@jlekstrand.net writes:

 Instead of the complicated and broken-by-design pile of heuristics we had
 before, we now have a straightforward lowering:

  1) All header sources are copied directly using force_writemask_all and,
 since they are guaranteed to be a single register, there are no
 force_sechalf issues.

  2) All non-header sources are copied using the exact same force_sechalf
 and saturate modifiers as the LOAD_PAYLOAD operation itself.

  3) In order to accommodate older gens that need interleaved colors,
 lower_load_payload detects when the destination is a COMPR4 register
 and automatically interleaves the non-header sources.  The
 lower_load_payload pass does the right thing here regardless of whether
 or not the hardware actually supports COMPR4.

 I had a quick glance at the series and it seems to be going in the right
 direction.

 One thing I honestly don't like is the ad-hoc and IMHO premature
 treatment of payload headers, it still feels like the LOAD_PAYLOAD
 instruction has more complex semantics than necessary and the benefit is
 unclear to me.

 I suppose that your motivation was to avoid setting force_writemask_all
 in LOAD_PAYLOAD instructions with header.  The optimizer should be able
 to cope with those flags and get rid of them from the resulting moves
 where they are redundant, and if it's not able to it looks like
 something that should be fixed anyway.  The explicit handling of headers
 is responsible for much of the churn in this series and is likely to
 complicate users of LOAD_PAYLOAD and optimization passes that have to
 manipulate them.

 Avoiding force_writemask_all is only half of the motivation and the
 small half at that.  A header source, more properly defined, is a
 single physical register that, conceptually, applies to all channels.
 Effectively, a header source (I should have stated this clearly) has
 two properties:

 1) It has force_writemask_all set
 2) It is exactly one physical hardware register.

 This second property is the more important of the two.  Most of the
 disaster of the previous LOAD_PAYLOAD implementation was that we did a
 pile of guesswork and had a ill-conceved effective width think in
 order to figure out how big the register actually was.  Making the
 user specify which sources are header sources eliminates that
 guesswork.  It also has the nice side-effect that we can do the right
 force_writemask_all and we can properly handle COMPR4 for the the
 user.

Yeah, true, but this seems like the least orthogonal and most annoying
to use solution for this problem, it forces the caller to provide
redundant information, it takes into account the saturate flag on some
arguments and not others, it shuffles sources with respect to the
specified order when COMPR4 is set, but only for the first four
non-header sources.  I think any of the following solutions would be
better-behaved than the current approach:

1/ Use the source width to determine the size of each copy.  This would
   imply that the source width carries semantic information and hence
   would have to be left alone by e.g. copy propagation.

2/ Use the instruction exec size and flags to determine the properties
   of *all* copies.  This means that if a header is present the exec
   size would necessarily have to be 8 and the halves of a 16-wide
   register would have to be specified separately, which sounds annoying
   at first but in practice wouldn't necessarily be because it could be
   handled by the LOAD_PAYLOAD() helper based on the argument widths
   without running into problems with optimization passes changing the
   meaning of the instruction.  The semantics of the instruction itself
   would be as stupid as possible, but the implementation could still
   trivially recognise 16-wide and COMPR4 copies using the exact same
   mechanism you are using now.

3/ Split LOAD_PAYLOAD into two separate instructions, each of them dead
   simple, say COLLECT and LOAD_HEADER.  COLLECT dst, src0, ..., srcn
   would be equivalent to the LOAD_PAYLOAD instruction described in 2,
   but it would only be used to load full-width non-header sources of
   the payload, so you would avoid the need to split 16-wide registers
   in half.  LOAD_HEADER dst, header, payload would handle the
   asymmetric requirements of prepending a header, like using 8-wide
   instead of exec_size-wide copies and setting force_writemask_all.
   You could use mlen to specify the size of payload as is usual for
   instructions taking a payload source.

 --Jason

 Thanks for looking into this BTW.

 ---
  src/mesa/drivers/dri/i965/brw_fs.cpp | 154 
 ++-
  src/mesa/drivers/dri/i965/brw_fs_visitor.cpp |   3 -
  2 files changed, 80 insertions(+), 77 deletions(-)

 diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp 
 

Re: [Mesa-dev] [PATCH 09/13] SQUASH: i965/fs: Rework fs_visitor::lower_load_payload

2015-04-03 Thread Francisco Jerez
Jason Ekstrand ja...@jlekstrand.net writes:

 On Fri, Apr 3, 2015 at 7:28 AM, Francisco Jerez curroje...@riseup.net wrote:
 Jason Ekstrand ja...@jlekstrand.net writes:

 On Thu, Apr 2, 2015 at 3:01 AM, Francisco Jerez curroje...@riseup.net 
 wrote:
 Jason Ekstrand ja...@jlekstrand.net writes:

 Instead of the complicated and broken-by-design pile of heuristics we had
 before, we now have a straightforward lowering:

  1) All header sources are copied directly using force_writemask_all and,
 since they are guaranteed to be a single register, there are no
 force_sechalf issues.

  2) All non-header sources are copied using the exact same force_sechalf
 and saturate modifiers as the LOAD_PAYLOAD operation itself.

  3) In order to accommodate older gens that need interleaved colors,
 lower_load_payload detects when the destination is a COMPR4 register
 and automatically interleaves the non-header sources.  The
 lower_load_payload pass does the right thing here regardless of 
 whether
 or not the hardware actually supports COMPR4.

 I had a quick glance at the series and it seems to be going in the right
 direction.

 One thing I honestly don't like is the ad-hoc and IMHO premature
 treatment of payload headers, it still feels like the LOAD_PAYLOAD
 instruction has more complex semantics than necessary and the benefit is
 unclear to me.

 I suppose that your motivation was to avoid setting force_writemask_all
 in LOAD_PAYLOAD instructions with header.  The optimizer should be able
 to cope with those flags and get rid of them from the resulting moves
 where they are redundant, and if it's not able to it looks like
 something that should be fixed anyway.  The explicit handling of headers
 is responsible for much of the churn in this series and is likely to
 complicate users of LOAD_PAYLOAD and optimization passes that have to
 manipulate them.

 Avoiding force_writemask_all is only half of the motivation and the
 small half at that.  A header source, more properly defined, is a
 single physical register that, conceptually, applies to all channels.
 Effectively, a header source (I should have stated this clearly) has
 two properties:

 1) It has force_writemask_all set
 2) It is exactly one physical hardware register.

 This second property is the more important of the two.  Most of the
 disaster of the previous LOAD_PAYLOAD implementation was that we did a
 pile of guesswork and had a ill-conceved effective width think in
 order to figure out how big the register actually was.  Making the
 user specify which sources are header sources eliminates that
 guesswork.  It also has the nice side-effect that we can do the right
 force_writemask_all and we can properly handle COMPR4 for the the
 user.

 Yeah, true, but this seems like the least orthogonal and most annoying
 to use solution for this problem, it forces the caller to provide
 redundant information, it takes into account the saturate flag on some
 arguments and not others, it shuffles sources with respect to the
 specified order when COMPR4 is set, but only for the first four
 non-header sources.  I think any of the following solutions would be
 better-behaved than the current approach:

 I don't know that saying which sources are headers is really
 redundant.  It's explicit which is what we want.  Yes, the COMPR4
 thing is a bit magical but we have to do COMPR4 in lower_load_payload
 so we have to have some way of doing it and this method puts the
 interleving code in one place instead of two.

Well, at least with the previous approach LOAD_PAYLOAD had consistent
(if broken) semantics across its arguments, and regardless of COMPR4
being used or not, which IMHO is preferable to the modest code saving.

 1/ Use the source width to determine the size of each copy.  This would
imply that the source width carries semantic information and hence
would have to be left alone by e.g. copy propagation.

 That's what do now and it's terrible.  The effective width field was
 basically a width that gets kept.

Couldn't you just prevent copy propagation from propagating a source of
different width into a LOAD_PAYLOAD?  You would still be able to get rid
of the metadata guess and effective_width, but you may have to re-run
copy propagate after lower_load_payload() for the case you missed any
optimization oportunities.

 2/ Use the instruction exec size and flags to determine the properties
of *all* copies.  This means that if a header is present the exec
size would necessarily have to be 8 and the halves of a 16-wide
register would have to be specified separately, which sounds annoying
at first but in practice wouldn't necessarily be because it could be
handled by the LOAD_PAYLOAD() helper based on the argument widths
without running into problems with optimization passes changing the
meaning of the instruction.  The semantics of the instruction itself
would be as stupid as possible, but the 

Re: [Mesa-dev] [PATCH 09/13] SQUASH: i965/fs: Rework fs_visitor::lower_load_payload

2015-04-02 Thread Francisco Jerez
Jason Ekstrand ja...@jlekstrand.net writes:

 Instead of the complicated and broken-by-design pile of heuristics we had
 before, we now have a straightforward lowering:

  1) All header sources are copied directly using force_writemask_all and,
 since they are guaranteed to be a single register, there are no
 force_sechalf issues.

  2) All non-header sources are copied using the exact same force_sechalf
 and saturate modifiers as the LOAD_PAYLOAD operation itself.

  3) In order to accommodate older gens that need interleaved colors,
 lower_load_payload detects when the destination is a COMPR4 register
 and automatically interleaves the non-header sources.  The
 lower_load_payload pass does the right thing here regardless of whether
 or not the hardware actually supports COMPR4.

I had a quick glance at the series and it seems to be going in the right
direction.

One thing I honestly don't like is the ad-hoc and IMHO premature
treatment of payload headers, it still feels like the LOAD_PAYLOAD
instruction has more complex semantics than necessary and the benefit is
unclear to me.

I suppose that your motivation was to avoid setting force_writemask_all
in LOAD_PAYLOAD instructions with header.  The optimizer should be able
to cope with those flags and get rid of them from the resulting moves
where they are redundant, and if it's not able to it looks like
something that should be fixed anyway.  The explicit handling of headers
is responsible for much of the churn in this series and is likely to
complicate users of LOAD_PAYLOAD and optimization passes that have to
manipulate them.

Thanks for looking into this BTW.

 ---
  src/mesa/drivers/dri/i965/brw_fs.cpp | 154 
 ++-
  src/mesa/drivers/dri/i965/brw_fs_visitor.cpp |   3 -
  2 files changed, 80 insertions(+), 77 deletions(-)

 diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp 
 b/src/mesa/drivers/dri/i965/brw_fs.cpp
 index f8e26c0..bc45a38 100644
 --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
 +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
 @@ -3201,93 +3201,99 @@ fs_visitor::lower_load_payload()
  {
 bool progress = false;
  
 -   int vgrf_to_reg[alloc.count];
 -   int reg_count = 0;
 -   for (unsigned i = 0; i  alloc.count; ++i) {
 -  vgrf_to_reg[i] = reg_count;
 -  reg_count += alloc.sizes[i];
 -   }
 -
 -   struct {
 -  bool written:1; /* Whether this register has ever been written */
 -  bool force_writemask_all:1;
 -  bool force_sechalf:1;
 -   } metadata[reg_count];
 -   memset(metadata, 0, sizeof(metadata));
 -
 foreach_block_and_inst_safe (block, fs_inst, inst, cfg) {
 -  if (inst-dst.file == GRF) {
 - const int dst_reg = vgrf_to_reg[inst-dst.reg] + 
 inst-dst.reg_offset;
 - bool force_sechalf = inst-force_sechalf 
 -  !inst-force_writemask_all;
 - bool toggle_sechalf = inst-dst.width == 16 
 -   type_sz(inst-dst.type) == 4 
 -   !inst-force_writemask_all;
 - for (int i = 0; i  inst-regs_written; ++i) {
 -metadata[dst_reg + i].written = true;
 -metadata[dst_reg + i].force_sechalf = force_sechalf;
 -metadata[dst_reg + i].force_writemask_all = 
 inst-force_writemask_all;
 -force_sechalf = (toggle_sechalf != force_sechalf);
 - }
 -  }
 -
if (inst-opcode == SHADER_OPCODE_LOAD_PAYLOAD) {
   assert(inst-dst.file == MRF || inst-dst.file == GRF);
 +
   fs_reg dst = inst-dst;
  
 - for (int i = 0; i  inst-sources; i++) {
 -dst.width = inst-src[i].effective_width;
 -dst.type = inst-src[i].type;
 -
 -if (inst-src[i].file == BAD_FILE) {
 -   /* Do nothing but otherwise increment as normal */
 -} else if (dst.file == MRF 
 -   dst.width == 8 
 -   brw-has_compr4 
 -   i + 4  inst-sources 
 -   inst-src[i + 4].equals(horiz_offset(inst-src[i], 
 8))) {
 -   fs_reg compr4_dst = dst;
 -   compr4_dst.reg += BRW_MRF_COMPR4;
 -   compr4_dst.width = 16;
 -   fs_reg compr4_src = inst-src[i];
 -   compr4_src.width = 16;
 -   fs_inst *mov = MOV(compr4_dst, compr4_src);
 + /* Get rid of COMPR4.  We'll add it back in if we need it */
 + if (dst.file == MRF  dst.reg  BRW_MRF_COMPR4)
 +dst.reg = dst.reg  ~BRW_MRF_COMPR4;
 +
 + dst.width = 8;
 + for (uint8_t i = 0; i  inst-header_size; i++) {
 +if (inst-src[i].file != BAD_FILE) {
 +   fs_reg mov_dst = retype(dst, BRW_REGISTER_TYPE_UD);
 +   fs_reg mov_src = retype(inst-src[i], BRW_REGISTER_TYPE_UD);
 +   mov_src.width = 8;
 +   fs_inst *mov = MOV(mov_dst, mov_src);
 mov-force_writemask_all = true;

Re: [Mesa-dev] [PATCH 09/13] SQUASH: i965/fs: Rework fs_visitor::lower_load_payload

2015-04-02 Thread Jason Ekstrand
On Thu, Apr 2, 2015 at 3:01 AM, Francisco Jerez curroje...@riseup.net wrote:
 Jason Ekstrand ja...@jlekstrand.net writes:

 Instead of the complicated and broken-by-design pile of heuristics we had
 before, we now have a straightforward lowering:

  1) All header sources are copied directly using force_writemask_all and,
 since they are guaranteed to be a single register, there are no
 force_sechalf issues.

  2) All non-header sources are copied using the exact same force_sechalf
 and saturate modifiers as the LOAD_PAYLOAD operation itself.

  3) In order to accommodate older gens that need interleaved colors,
 lower_load_payload detects when the destination is a COMPR4 register
 and automatically interleaves the non-header sources.  The
 lower_load_payload pass does the right thing here regardless of whether
 or not the hardware actually supports COMPR4.

 I had a quick glance at the series and it seems to be going in the right
 direction.

 One thing I honestly don't like is the ad-hoc and IMHO premature
 treatment of payload headers, it still feels like the LOAD_PAYLOAD
 instruction has more complex semantics than necessary and the benefit is
 unclear to me.

 I suppose that your motivation was to avoid setting force_writemask_all
 in LOAD_PAYLOAD instructions with header.  The optimizer should be able
 to cope with those flags and get rid of them from the resulting moves
 where they are redundant, and if it's not able to it looks like
 something that should be fixed anyway.  The explicit handling of headers
 is responsible for much of the churn in this series and is likely to
 complicate users of LOAD_PAYLOAD and optimization passes that have to
 manipulate them.

Avoiding force_writemask_all is only half of the motivation and the
small half at that.  A header source, more properly defined, is a
single physical register that, conceptually, applies to all channels.
Effectively, a header source (I should have stated this clearly) has
two properties:

1) It has force_writemask_all set
2) It is exactly one physical hardware register.

This second property is the more important of the two.  Most of the
disaster of the previous LOAD_PAYLOAD implementation was that we did a
pile of guesswork and had a ill-conceved effective width think in
order to figure out how big the register actually was.  Making the
user specify which sources are header sources eliminates that
guesswork.  It also has the nice side-effect that we can do the right
force_writemask_all and we can properly handle COMPR4 for the the
user.
--Jason

 Thanks for looking into this BTW.

 ---
  src/mesa/drivers/dri/i965/brw_fs.cpp | 154 
 ++-
  src/mesa/drivers/dri/i965/brw_fs_visitor.cpp |   3 -
  2 files changed, 80 insertions(+), 77 deletions(-)

 diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp 
 b/src/mesa/drivers/dri/i965/brw_fs.cpp
 index f8e26c0..bc45a38 100644
 --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
 +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
 @@ -3201,93 +3201,99 @@ fs_visitor::lower_load_payload()
  {
 bool progress = false;

 -   int vgrf_to_reg[alloc.count];
 -   int reg_count = 0;
 -   for (unsigned i = 0; i  alloc.count; ++i) {
 -  vgrf_to_reg[i] = reg_count;
 -  reg_count += alloc.sizes[i];
 -   }
 -
 -   struct {
 -  bool written:1; /* Whether this register has ever been written */
 -  bool force_writemask_all:1;
 -  bool force_sechalf:1;
 -   } metadata[reg_count];
 -   memset(metadata, 0, sizeof(metadata));
 -
 foreach_block_and_inst_safe (block, fs_inst, inst, cfg) {
 -  if (inst-dst.file == GRF) {
 - const int dst_reg = vgrf_to_reg[inst-dst.reg] + 
 inst-dst.reg_offset;
 - bool force_sechalf = inst-force_sechalf 
 -  !inst-force_writemask_all;
 - bool toggle_sechalf = inst-dst.width == 16 
 -   type_sz(inst-dst.type) == 4 
 -   !inst-force_writemask_all;
 - for (int i = 0; i  inst-regs_written; ++i) {
 -metadata[dst_reg + i].written = true;
 -metadata[dst_reg + i].force_sechalf = force_sechalf;
 -metadata[dst_reg + i].force_writemask_all = 
 inst-force_writemask_all;
 -force_sechalf = (toggle_sechalf != force_sechalf);
 - }
 -  }
 -
if (inst-opcode == SHADER_OPCODE_LOAD_PAYLOAD) {
   assert(inst-dst.file == MRF || inst-dst.file == GRF);
 +
   fs_reg dst = inst-dst;

 - for (int i = 0; i  inst-sources; i++) {
 -dst.width = inst-src[i].effective_width;
 -dst.type = inst-src[i].type;
 -
 -if (inst-src[i].file == BAD_FILE) {
 -   /* Do nothing but otherwise increment as normal */
 -} else if (dst.file == MRF 
 -   dst.width == 8 
 -   brw-has_compr4 
 -   i + 4  inst-sources 
 -   inst-src[i + 

[Mesa-dev] [PATCH 09/13] SQUASH: i965/fs: Rework fs_visitor::lower_load_payload

2015-04-01 Thread Jason Ekstrand
Instead of the complicated and broken-by-design pile of heuristics we had
before, we now have a straightforward lowering:

 1) All header sources are copied directly using force_writemask_all and,
since they are guaranteed to be a single register, there are no
force_sechalf issues.

 2) All non-header sources are copied using the exact same force_sechalf
and saturate modifiers as the LOAD_PAYLOAD operation itself.

 3) In order to accommodate older gens that need interleaved colors,
lower_load_payload detects when the destination is a COMPR4 register
and automatically interleaves the non-header sources.  The
lower_load_payload pass does the right thing here regardless of whether
or not the hardware actually supports COMPR4.
---
 src/mesa/drivers/dri/i965/brw_fs.cpp | 154 ++-
 src/mesa/drivers/dri/i965/brw_fs_visitor.cpp |   3 -
 2 files changed, 80 insertions(+), 77 deletions(-)

diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp 
b/src/mesa/drivers/dri/i965/brw_fs.cpp
index f8e26c0..bc45a38 100644
--- a/src/mesa/drivers/dri/i965/brw_fs.cpp
+++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
@@ -3201,93 +3201,99 @@ fs_visitor::lower_load_payload()
 {
bool progress = false;
 
-   int vgrf_to_reg[alloc.count];
-   int reg_count = 0;
-   for (unsigned i = 0; i  alloc.count; ++i) {
-  vgrf_to_reg[i] = reg_count;
-  reg_count += alloc.sizes[i];
-   }
-
-   struct {
-  bool written:1; /* Whether this register has ever been written */
-  bool force_writemask_all:1;
-  bool force_sechalf:1;
-   } metadata[reg_count];
-   memset(metadata, 0, sizeof(metadata));
-
foreach_block_and_inst_safe (block, fs_inst, inst, cfg) {
-  if (inst-dst.file == GRF) {
- const int dst_reg = vgrf_to_reg[inst-dst.reg] + inst-dst.reg_offset;
- bool force_sechalf = inst-force_sechalf 
-  !inst-force_writemask_all;
- bool toggle_sechalf = inst-dst.width == 16 
-   type_sz(inst-dst.type) == 4 
-   !inst-force_writemask_all;
- for (int i = 0; i  inst-regs_written; ++i) {
-metadata[dst_reg + i].written = true;
-metadata[dst_reg + i].force_sechalf = force_sechalf;
-metadata[dst_reg + i].force_writemask_all = 
inst-force_writemask_all;
-force_sechalf = (toggle_sechalf != force_sechalf);
- }
-  }
-
   if (inst-opcode == SHADER_OPCODE_LOAD_PAYLOAD) {
  assert(inst-dst.file == MRF || inst-dst.file == GRF);
+
  fs_reg dst = inst-dst;
 
- for (int i = 0; i  inst-sources; i++) {
-dst.width = inst-src[i].effective_width;
-dst.type = inst-src[i].type;
-
-if (inst-src[i].file == BAD_FILE) {
-   /* Do nothing but otherwise increment as normal */
-} else if (dst.file == MRF 
-   dst.width == 8 
-   brw-has_compr4 
-   i + 4  inst-sources 
-   inst-src[i + 4].equals(horiz_offset(inst-src[i], 8))) 
{
-   fs_reg compr4_dst = dst;
-   compr4_dst.reg += BRW_MRF_COMPR4;
-   compr4_dst.width = 16;
-   fs_reg compr4_src = inst-src[i];
-   compr4_src.width = 16;
-   fs_inst *mov = MOV(compr4_dst, compr4_src);
+ /* Get rid of COMPR4.  We'll add it back in if we need it */
+ if (dst.file == MRF  dst.reg  BRW_MRF_COMPR4)
+dst.reg = dst.reg  ~BRW_MRF_COMPR4;
+
+ dst.width = 8;
+ for (uint8_t i = 0; i  inst-header_size; i++) {
+if (inst-src[i].file != BAD_FILE) {
+   fs_reg mov_dst = retype(dst, BRW_REGISTER_TYPE_UD);
+   fs_reg mov_src = retype(inst-src[i], BRW_REGISTER_TYPE_UD);
+   mov_src.width = 8;
+   fs_inst *mov = MOV(mov_dst, mov_src);
mov-force_writemask_all = true;
inst-insert_before(block, mov);
-   /* Mark i+4 as BAD_FILE so we don't emit a MOV for it */
-   inst-src[i + 4].file = BAD_FILE;
-} else {
-   fs_inst *mov = MOV(dst, inst-src[i]);
-   if (inst-src[i].file == GRF) {
-  int src_reg = vgrf_to_reg[inst-src[i].reg] +
-inst-src[i].reg_offset;
-  mov-force_sechalf = metadata[src_reg].force_sechalf;
-  mov-force_writemask_all = 
metadata[src_reg].force_writemask_all;
-   } else {
-  /* We don't have any useful metadata for immediates or
-   * uniforms.  Assume that any of the channels of the
-   * destination may be used.
-   */
-  assert(inst-src[i].file == IMM ||
- inst-src[i].file == UNIFORM);
-  mov-force_writemask_all = true;
-   }
+}