compiler: split is_partial_write() into two variants

Chema Casanova Thu, 13 Dec 2018 03:48:26 -0800


El 13/12/18 a las 11:49, Pohjolainen, Topi escribió:
> On Thu, Dec 13, 2018 at 09:10:24AM +0100, Iago Toral wrote:
>> On Wed, 2018-12-12 at 14:15 +0200, Pohjolainen, Topi wrote:
>>> On Wed, Dec 12, 2018 at 09:48:20AM +0100, Iago Toral wrote:
>>>> On Tue, 2018-12-11 at 18:59 +0200, Pohjolainen, Topi wrote:
>>>>> On Fri, Dec 07, 2018 at 03:30:11PM +0200, Pohjolainen, Topi
>>>>> wrote:
>>>>>> On Tue, Dec 04, 2018 at 08:17:05AM +0100, Iago Toral Quiroga
>>>>>> wrote:
>>>>>>> This function is used in two different scenarios that for 32-
>>>>>>> bit
>>>>>>> instructions are the same, but for 16-bit instructions are
>>>>>>> not.
>>>>>>>
>>>>>>> One scenario is that in which we are working at a SIMD8
>>>>>>> register
>>>>>>> level and we need to know if a register is fully defined or
>>>>>>> written.
>>>>>>> This is useful, for example, in the context of liveness
>>>>>>> analysis
>>>>>>> or
>>>>>>> register allocation, where we work with units of registers.
>>>>>>>
>>>>>>> The other scenario is that in which we want to know if an
>>>>>>> instruction
>>>>>>> is writing a full scalar component or just some subset of it.
>>>>>>> This is
>>>>>>> useful, for example, in the context of some optimization
>>>>>>> passes
>>>>>>> like copy propagation.
>>>>>>>
>>>>>>> For 32-bit instructions (or larger), a SIMD8 dispatch will
>>>>>>> always
>>>>>>> write
>>>>>>> at least a full SIMD8 register (32B) if the write is not
>>>>>>> partial.
>>>>>>> The
>>>>>>> function is_partial_write() checks this to determine if we
>>>>>>> have a
>>>>>>> partial
>>>>>>> write. However, when we deal with 16-bit instructions, that
>>>>>>> logic
>>>>>>> disables
>>>>>>> some optimizations that should be safe. For example, a SIMD8
>>>>>>> 16-
>>>>>>> bit MOV will
>>>>>>> only update half of a SIMD register, but it is still a
>>>>>>> complete
>>>>>>> write of the
>>>>>>> variable for a SIMD8 dispatch, so we should not prevent copy
>>>>>>> propagation in
>>>>>>> this scenario because we don't write all 32 bytes in the SIMD
>>>>>>> register
>>>>>>> or because the write starts at offset 16B (wehere we pack
>>>>>>> components Y or
>>>>>>> W of 16-bit vectors).
>>>>>>>
>>>>>>> This is a problem for SIMD8 executions (VS, TCS, TES, GS) of
>>>>>>> 16-
>>>>>>> bit
>>>>>>> instructions, which lose a number of optimizations because of
>>>>>>> this, most
>>>>>>> important of which is copy-propagation.
>>>>>>>
>>>>>>> This patch splits is_partial_write() into
>>>>>>> is_partial_reg_write(),
>>>>>>> which
>>>>>>> represents the current is_partial_write(), useful for things
>>>>>>> like
>>>>>>> liveness analysis, and is_partial_var_write(), which
>>>>>>> considers
>>>>>>> the dispatch size to check if we are writing a full variable
>>>>>>> (rather
>>>>>>> than a full register) to decide if the write is partial or
>>>>>>> not,
>>>>>>> which
>>>>>>> is what we really want in many optimization passes.
>>>>>
>>>>> I actually started wondering why would liveness analysis and
>>>>> register
>>>>> coalescing need to treat the 16-bit SIMD8 case differently than
>>>>> optimizations.
>>>>> In virtual register space nothing would read or write the unused
>>>>> second half
>>>>> of the register in case of 16-bit type and SIMD8.
>>>>
>>>> True, we might be able to use the "variable" version in more cases.
>>>> I
>>>> was trying to be conservative when I implemented this because I
>>>> don't
>>>> think the half-float CTS tests provides a good testing ground for
>>>> all
>>>> aspects of the compiler. I can try that and see if it breaks
>>>> anything
>>>> though.
>>>>  
>>>>> Real register allocation in turn should be orthogonal to how
>>>>> things
>>>>> are
>>>>> allocated in virtual space. And I guess even there we wouldn't be
>>>>> interested
>>>>> of packing two 16-bit typed SIMD8 variables into one and same
>>>>> hardware
>>>>> register. It is SIMD16 where we get more pressure into register
>>>>> space
>>>>> anyway
>>>>> and there the 16-bit typed variables occupy full registers. In
>>>>> other
>>>>> words,
>>>>> if things fit in SIMD16, would we bother packing things more
>>>>> tightly
>>>>> in
>>>>> SIMD8? Or even if SIMD8 was the only option, would we be
>>>>> interested
>>>>> packing
>>>>> channels for two variables in one hw reg even then?
>>>>>
>>>>> Jason, we discussed this a little in the spring time.
>>>>>
>>>>> As a recap my approach shortly. Instead of ignoring the second
>>>>> half
>>>>> of
>>>>> registers case by case I addressed it more generally:
>>>>>
>>>>> - changed all the open coded checks to use helpers,
>>>>> - added a padding bit into fs_reg telling about the unused space,
>>>>> - change nir -> fs step to set that bit for 16-bit typed regs
>>>>> - and finally changed the helpers to consider the padding bit.
>>>>
>>>> So if I understand how this works, you mostly make the vgrf
>>>> infrastructure think that half-float registers actually use twice
>>>> the
>>>> space they require by including the padding into the
>>>> component_size()
>>>> helper, right? (We also need to cover 8-bit by the way, but I
>>>> suppose
>>>> it just means increasing the padding we use for 8-bit variables).
>>>> And
>>>> then you modify is_partial_write() to use that helper so it
>>>> accounts
>>>> for that padding and just check if the (padded) size is less than a
>>>> full SIMD register. Basically, we are making all smaller types look
>>>> like they use 32-bit per channel in the vgrf space.
>>>
>>> Actually not 32-bit per channel but that there is unused data towards
>>> the
>>> end of the register, cut-paste from the diff:
>>>
>>> diff --git a/src/intel/compiler/brw_ir_fs.h
>>> b/src/intel/compiler/brw_ir_fs.h
>>> index 02c91b246a3..a1add69a319 100644
>>> --- a/src/intel/compiler/brw_ir_fs.h
>>> +++ b/src/intel/compiler/brw_ir_fs.h
>>> @@ -52,6 +52,9 @@ public:
>>>  
>>>     /** Register region horizontal stride */
>>>     uint8_t stride;
>>> +
>>> +   /* Needed, for example, for SIMD8 half float payloads. */
>>> +   uint8_t pad_per_component;
>>>  };
>>>  
>>>  static inline fs_reg
>>>
>>>
>>> I should start by the history. Payloads for sampler messages, sampler
>>> return
>>> data and data port writes operate with full registers even in case of
>>> SIMD8.
>>> For example, in case of data port write, each component has all the
>>> channels
>>> packed in the beginning of the reg. The second half is simply ignored
>>> by the
>>> hardware. The hardware just expects each component to be aligned on
>>> reg
>>> boundary. The same applies for 16-bit sampler message coordinates,
>>> each
>>> component takes full reg and only the first half contains meaningful
>>> data.
>>> And when the hardware returns 16-bit typed data from the sampler in
>>> SIMD8 the
>>> data is once again padded. 
>>> This is where the notion of padding originally comes from.
>>>
>>> But then I started thinking why stop at payloads only - we could
>>> extend the
>>> same for all 16-bit typed variables in SIMD8. I thought that if the
>>> hardware
>>> wants to use padding for things such as sampler data coming in and
>>> pixel data
>>> going out then why should we use tighter packing for anything else -
>>> at least
>>> not in the virtual register space.
>>>
>>> And as answer to below, the padding is only used in SIMD8 mode,
>>> SIMD16 doesn't
>>> need any of that, for 16-bit types, I mean.
>>
>> Ok, so you just add the padding to the total size and only for SIMD8
>> dispatches of 16-bit.
>>
>> I guess we also need to account for horizontal strides right? For
>> example, a SIMD8 half-float register with a stride of 2 uses a full
>> register and should not have any padding in that case. We will have
>> this for half-float when there are alignment restrictions (for
>> conversions, etc). We also have this for pretty much all 8-bit
>> instructions, since there is a 16-bit alignment requierement for pretty
>> much everything in that case.
>>
>> Also, what would happen if someone  changes the stride of the register
>> (or retypes it), would the size calculations adapt the padding
>> correspondingly and take it into account automatically?
> 
> Well, this is just my way of seeing things. I thought that marking padding
> explicitly in the fs_reg makes all sorts of asserts and debugging clearer
> as the padding tells explicitly that at least the allocator of the reg meant
> that only the first half is ever written (and hence only the first half should
> be read). Now, the logic that adds workarounds for special cases, such as the
> stride you mentioned, needs to modify the register (at least the stride) for
> producer and every consumer. Now that logic would additionally need to reset
> the padding as it doesn't apply anymore. Adding an assert allowing only
> padding or stride != 1 but not both sounds like a thing to add on top.
> 
>>
>> I guess that in general I have two types of concerns; one is whether
>> this is robust enough: on the one hand people can still try to open
>> code size calculations in future work and they are very likely to not
>> going get it right for SIMD8 with non-32 bit types if they do this. I
>> guess this can be fixed with very careful reviews about these things in
>> the future but I think this is going to be easy to miss. The other one
>> is that as I mention above, we would be returning fake (padded) sizes
>> and then using these sizes for things where we are expecting the real
>> size, such as when checking for hardware restrictions, etc I am less
>> worried about this issue though, since as long as we keep this for
>> SIMD8 only I don't think that using padded sizes instead of real sizes
>> would usually lead to different behaviors, as we usually only run into
>> limitations when things exceed the size of a SIMD register, but I am
>> not sure if this is always true.
>>
>>> Now, I haven't though about 8-bit types. At least there aren't
>>> hardware
>>> supported payloads for those, right?
>>
>> We need to do UBO load and SSBO load/store with 8-bit types. Chema
>> implemented that so maybe he can share more details about what is
>> needed here and if there are any particular restrictions or not.
> 
> I had a quick look and initially it seems to me that those payloads are quite
> different from sampler/fb write. They are rather fixed to 32-bit slots and
> instead of any padding one interleaves things. I'll try to study this a little
> more...


In the case of 8-bit types, we use the byte_scattered read/write
messages that use a full register for the return/payload registers. But
as you comment for every value read/written this message we use the fist
8/16 bits of each Double Word of the register. So we have undefined
values interleaved that we shouldn't care about. So we need to
read/write with stride=4 for 8-bit and stride=2 for 16-bit types. That
is different for Samplers and Render Target Write for 16-bits that have
an specific flag for 16-bits.

BTW, In order to improve the register allocation I have a series related
to "intel/fs: Liveness range improvements with partial writes" that I
have some pending feedback to address at
https://patchwork.freedesktop.org/series/46476/

Those patches solve for in the same block operations at the register
allocator how to deal with partial read/write. They care for cases of
padding and interleaved values. That are problematic because of the
shuffle/unshuffle needed for 8/16/64 load/store operations.

Chema

>>
>>>>
>>>> One thing that worries me a bit is that we are faking the real size
>>>> of
>>>> things and that can lead to parts of the compiler that rely on
>>>> these
>>>> helpers to provide the real size to do things we don't want. For
>>>> example: are you limiting this to SIMD8 only? I ask because if we
>>>> do
>>>> this for SIMD16 16-bit dispatches then we'd computing a size of two
>>>> registers for things that only really use one, and I don't think we
>>>> want that since that would have actual implications like SIMD-width
>>>> lowering kicking in for things it should not need to. There are a
>>>> lot
>>>> of uses of the component_size helper in other parts of the
>>>> compiler,
>>>> including the generator, where we want to know the real size of
>>>> things
>>>> because they use these sizes to check hardware restrictions so we
>>>> want
>>>> to know the real size without padding...
>>>>
>>>> Also, I presume that all this is orthogonal to actual register
>>>> allocation and that you are still storing the Y and W components of
>>>> SIMD8 16-bit vectors in the second halves of registers, right?
>>>>
>>>>> Now, I'm fine doing this case by case the way it is done here.
>>>>> I'm
>>>>> just
>>>>> wondering if the split is needed, i.e., considering in some cases
>>>>> 16-
>>>>> bit SIMD8
>>>>> virtual registers as half written and in some cases just ignoring
>>>>> the
>>>>> fact.
>>>>>>> Then the patch goes on and rewrites all uses of
>>>>>>> is_partial_write() to use
>>>>>>> one or the other version. Specifically, we use
>>>>>>> is_partial_var_write()
>>>>>>> in the following places: copy propagation, cmod propagation,
>>>>>>> common
>>>>>>> subexpression elimination, saturate propagation and sel
>>>>>>> peephole.
>>>>>>>
>>>>>>> Notice that the semantics of is_partial_var_write() exactly
>>>>>>> match
>>>>>>> the
>>>>>>> current implementation of is_partial_write() for anything
>>>>>>> that is
>>>>>>> 32-bit or larger, so no changes are expected for 32-bit
>>>>>>> instructions.
>>>>>>>
>>>>>>> Tested against ~5000 tests involving 16-bit instructions in
>>>>>>> CTS
>>>>>>> produced
>>>>>>> the following changes in instruction counts:
>>>>>>>
>>>>>>>             Patched  |     Master    |    %    |
>>>>>>> ================================================
>>>>>>> SIMD8  |    621,900  |    706,721    | -12.00% |
>>>>>>> ================================================
>>>>>>> SIMD16 |     93,252  |     93,252    |   0.00% |
>>>>>>> ================================================
>>>>>>>
>>>>>>> As expected, the change only affects SIMD8 dispatches.
>>>>>>
>>>>>> I like this. But I think I want to try and rebase my fp16 work
>>>>>> on
>>>>>> top to see
>>>>>> if there are any differences in the final assembly between this
>>>>>> and
>>>>>> my
>>>>>> "register padding" scheme.
>>>>>>
>>>>>>> ---
>>>>>>>  src/intel/compiler/brw_fs.cpp                 | 31
>>>>>>> +++++++++++++++----
>>>>>>>  .../compiler/brw_fs_cmod_propagation.cpp      | 20 ++++++---
>>>>>>> ---
>>>>>>>  .../compiler/brw_fs_copy_propagation.cpp      |  8 ++---
>>>>>>>  src/intel/compiler/brw_fs_cse.cpp             |  3 +-
>>>>>>>  .../compiler/brw_fs_dead_code_eliminate.cpp   |  2 +-
>>>>>>>  src/intel/compiler/brw_fs_live_variables.cpp  |  2 +-
>>>>>>>  src/intel/compiler/brw_fs_reg_allocate.cpp    |  2 +-
>>>>>>>  .../compiler/brw_fs_register_coalesce.cpp     |  2 +-
>>>>>>>  .../compiler/brw_fs_saturate_propagation.cpp  |  7 +++--
>>>>>>>  src/intel/compiler/brw_fs_sel_peephole.cpp    |  4 +--
>>>>>>>  src/intel/compiler/brw_ir_fs.h                |  3 +-
>>>>>>>  11 files changed, 54 insertions(+), 30 deletions(-)
>>>>>>>
>>>>>>> diff --git a/src/intel/compiler/brw_fs.cpp
>>>>>>> b/src/intel/compiler/brw_fs.cpp
>>>>>>> index 1d5d1dd0d22..9ea67975e1e 100644
>>>>>>> --- a/src/intel/compiler/brw_fs.cpp
>>>>>>> +++ b/src/intel/compiler/brw_fs.cpp
>>>>>>> @@ -698,14 +698,33 @@
>>>>>>> fs_visitor::limit_dispatch_width(unsigned
>>>>>>> n, const char *msg)
>>>>>>>   * it.
>>>>>>>   */
>>>>>>>  bool
>>>>>>> -fs_inst::is_partial_write() const
>>>>>>> +fs_inst::is_partial_reg_write() const
>>>>>>>  {
>>>>>>>     return ((this->predicate && this->opcode !=
>>>>>>> BRW_OPCODE_SEL)
>>>>>>>>>
>>>>>>>
>>>>>>> -           (this->exec_size * type_sz(this->dst.type)) < 32
>>>>>>> ||
>>>>>>>             !this->dst.is_contiguous() ||
>>>>>>> +           (this->exec_size * type_sz(this->dst.type)) <
>>>>>>> REG_SIZE ||
>>>>>>>             this->dst.offset % REG_SIZE != 0);
>>>>>>>  }
>>>>>>>  
>>>>>>> +/**
>>>>>>> + * Returns true if the instruction has a flag that means it
>>>>>>> won't
>>>>>>> + * update an entire variable for the given dispatch width.
>>>>>>> + *
>>>>>>> + * This is only different from is_partial_reg_write() for
>>>>>>> SIMD8
>>>>>>> + * dispatches of 16-bit (or smaller) instructions.
>>>>>>> + */
>>>>>>> +bool
>>>>>>> +fs_inst::is_partial_var_write(uint32_t dispatch_width) const
>>>>>>> +{
>>>>>>> +   const uint32_t type_size = type_sz(this->dst.type);
>>>>>>> +   uint32_t var_size = MIN2(REG_SIZE, dispatch_width *
>>>>>>> type_size);
>>>>>>> +
>>>>>>> +   return ((this->predicate && this->opcode !=
>>>>>>> BRW_OPCODE_SEL)
>>>>>>>>>
>>>>>>>
>>>>>>> +           !this->dst.is_contiguous() ||
>>>>>>> +           (this->exec_size * type_sz(this->dst.type)) <
>>>>>>> var_size ||
>>>>>>> +           this->dst.offset % var_size != 0);
>>>>>>> +}
>>>>>>> +
>>>>>>>  unsigned
>>>>>>>  fs_inst::components_read(unsigned i) const
>>>>>>>  {
>>>>>>> @@ -2847,7 +2866,7 @@ fs_visitor::opt_register_renaming()
>>>>>>>        if (depth == 0 &&
>>>>>>>            inst->dst.file == VGRF &&
>>>>>>>            alloc.sizes[inst->dst.nr] * REG_SIZE == inst-
>>>>>>>> size_written &&
>>>>>>>
>>>>>>> -          !inst->is_partial_write()) {
>>>>>>> +          !inst->is_partial_reg_write()) {
>>>>>>>           if (remap[dst] == -1) {
>>>>>>>              remap[dst] = dst;
>>>>>>>           } else {
>>>>>>> @@ -3050,7 +3069,7 @@ fs_visitor::compute_to_mrf()
>>>>>>>        next_ip++;
>>>>>>>  
>>>>>>>        if (inst->opcode != BRW_OPCODE_MOV ||
>>>>>>> -         inst->is_partial_write() ||
>>>>>>> +         inst->is_partial_reg_write() ||
>>>>>>>           inst->dst.file != MRF || inst->src[0].file != VGRF ||
>>>>>>>           inst->dst.type != inst->src[0].type ||
>>>>>>>           inst->src[0].abs || inst->src[0].negate ||
>>>>>>> @@ -3083,7 +3102,7 @@ fs_visitor::compute_to_mrf()
>>>>>>>              * that writes that reg, but it would require
>>>>>>> smarter
>>>>>>>              * tracking.
>>>>>>>              */
>>>>>>> -           if (scan_inst->is_partial_write())
>>>>>>> +           if (scan_inst->is_partial_reg_write())
>>>>>>>                break;
>>>>>>>  
>>>>>>>              /* Handling things not fully contained in the
>>>>>>> source
>>>>>>> of the copy
>>>>>>> @@ -3395,7 +3414,7 @@
>>>>>>> fs_visitor::remove_duplicate_mrf_writes()
>>>>>>>        if (inst->opcode == BRW_OPCODE_MOV &&
>>>>>>>           inst->dst.file == MRF &&
>>>>>>>           inst->src[0].file != ARF &&
>>>>>>> -         !inst->is_partial_write()) {
>>>>>>> +         !inst->is_partial_reg_write()) {
>>>>>>>           last_mrf_move[inst->dst.nr] = inst;
>>>>>>>        }
>>>>>>>     }
>>>>>>> diff --git a/src/intel/compiler/brw_fs_cmod_propagation.cpp
>>>>>>> b/src/intel/compiler/brw_fs_cmod_propagation.cpp
>>>>>>> index 5fb522f810f..7bb5c9afbc9 100644
>>>>>>> --- a/src/intel/compiler/brw_fs_cmod_propagation.cpp
>>>>>>> +++ b/src/intel/compiler/brw_fs_cmod_propagation.cpp
>>>>>>> @@ -50,13 +50,13 @@
>>>>>>>  
>>>>>>>  static bool
>>>>>>>  cmod_propagate_cmp_to_add(const gen_device_info *devinfo,
>>>>>>> bblock_t *block,
>>>>>>> -                          fs_inst *inst)
>>>>>>> +                          fs_inst *inst, unsigned
>>>>>>> dispatch_width)
>>>>>>>  {
>>>>>>>     bool read_flag = false;
>>>>>>>  
>>>>>>>     foreach_inst_in_block_reverse_starting_from(fs_inst,
>>>>>>> scan_inst, inst) {
>>>>>>>        if (scan_inst->opcode == BRW_OPCODE_ADD &&
>>>>>>> -          !scan_inst->is_partial_write() &&
>>>>>>> +          !scan_inst->is_partial_var_write(dispatch_width)
>>>>>>> &&
>>>>>>>            scan_inst->exec_size == inst->exec_size) {
>>>>>>>           bool negate;
>>>>>>>  
>>>>>>> @@ -126,7 +126,7 @@ cmod_propagate_cmp_to_add(const
>>>>>>> gen_device_info *devinfo, bblock_t *block,
>>>>>>>   */
>>>>>>>  static bool
>>>>>>>  cmod_propagate_not(const gen_device_info *devinfo, bblock_t
>>>>>>> *block,
>>>>>>> -                   fs_inst *inst)
>>>>>>> +                   fs_inst *inst, unsigned dispatch_width)
>>>>>>>  {
>>>>>>>     const enum brw_conditional_mod cond =
>>>>>>> brw_negate_cmod(inst-
>>>>>>>> conditional_mod);
>>>>>>>
>>>>>>>     bool read_flag = false;
>>>>>>> @@ -141,7 +141,7 @@ cmod_propagate_not(const gen_device_info
>>>>>>> *devinfo, bblock_t *block,
>>>>>>>               scan_inst->opcode != BRW_OPCODE_AND)
>>>>>>>              break;
>>>>>>>  
>>>>>>> -         if (scan_inst->is_partial_write() ||
>>>>>>> +         if (scan_inst->is_partial_var_write(dispatch_width) 
>>>>>>> ||
>>>>>>>               scan_inst->dst.offset != inst->src[0].offset ||
>>>>>>>               scan_inst->exec_size != inst->exec_size)
>>>>>>>              break;
>>>>>>> @@ -166,7 +166,9 @@ cmod_propagate_not(const gen_device_info
>>>>>>> *devinfo, bblock_t *block,
>>>>>>>  }
>>>>>>>  
>>>>>>>  static bool
>>>>>>> -opt_cmod_propagation_local(const gen_device_info *devinfo,
>>>>>>> bblock_t *block)
>>>>>>> +opt_cmod_propagation_local(const gen_device_info *devinfo,
>>>>>>> +                           bblock_t *block,
>>>>>>> +                           unsigned dispatch_width)
>>>>>>>  {
>>>>>>>     bool progress = false;
>>>>>>>     int ip = block->end_ip + 1;
>>>>>>> @@ -219,14 +221,14 @@ opt_cmod_propagation_local(const
>>>>>>> gen_device_info *devinfo, bblock_t *block)
>>>>>>>         */
>>>>>>>        if (inst->opcode == BRW_OPCODE_CMP && !inst-
>>>>>>>> src[1].is_zero()) {
>>>>>>>
>>>>>>>           if (brw_reg_type_is_floating_point(inst-
>>>>>>>> src[0].type)
>>>>>>> &&
>>>>>>> -             cmod_propagate_cmp_to_add(devinfo, block,
>>>>>>> inst))
>>>>>>> +             cmod_propagate_cmp_to_add(devinfo, block, inst,
>>>>>>> dispatch_width))
>>>>>>>              progress = true;
>>>>>>>  
>>>>>>>           continue;
>>>>>>>        }
>>>>>>>  
>>>>>>>        if (inst->opcode == BRW_OPCODE_NOT) {
>>>>>>> -         progress = cmod_propagate_not(devinfo, block, inst)
>>>>>>> ||
>>>>>>> progress;
>>>>>>> +         progress = cmod_propagate_not(devinfo, block, inst,
>>>>>>> dispatch_width) || progress;
>>>>>>>           continue;
>>>>>>>        }
>>>>>>>  
>>>>>>> @@ -234,7 +236,7 @@ opt_cmod_propagation_local(const
>>>>>>> gen_device_info *devinfo, bblock_t *block)
>>>>>>>        foreach_inst_in_block_reverse_starting_from(fs_inst,
>>>>>>> scan_inst, inst) {
>>>>>>>           if (regions_overlap(scan_inst->dst, scan_inst-
>>>>>>>> size_written,
>>>>>>>
>>>>>>>                               inst->src[0], inst-
>>>>>>>> size_read(0)))
>>>>>>> {
>>>>>>> -            if (scan_inst->is_partial_write() ||
>>>>>>> +            if (scan_inst-
>>>>>>>> is_partial_var_write(dispatch_width)
>>>>>>>>>
>>>>>>>
>>>>>>>                  scan_inst->dst.offset != inst->src[0].offset 
>>>>>>> ||
>>>>>>>                  scan_inst->exec_size != inst->exec_size)
>>>>>>>                 break;
>>>>>>> @@ -342,7 +344,7 @@ fs_visitor::opt_cmod_propagation()
>>>>>>>     bool progress = false;
>>>>>>>  
>>>>>>>     foreach_block_reverse(block, cfg) {
>>>>>>> -      progress = opt_cmod_propagation_local(devinfo, block)
>>>>>>> ||
>>>>>>> progress;
>>>>>>> +      progress = opt_cmod_propagation_local(devinfo, block,
>>>>>>> dispatch_width) || progress;
>>>>>>>     }
>>>>>>>  
>>>>>>>     if (progress)
>>>>>>> diff --git a/src/intel/compiler/brw_fs_copy_propagation.cpp
>>>>>>> b/src/intel/compiler/brw_fs_copy_propagation.cpp
>>>>>>> index c01d4ec4a4f..0f0284115fb 100644
>>>>>>> --- a/src/intel/compiler/brw_fs_copy_propagation.cpp
>>>>>>> +++ b/src/intel/compiler/brw_fs_copy_propagation.cpp
>>>>>>> @@ -505,7 +505,7 @@ fs_visitor::try_copy_propagate(fs_inst
>>>>>>> *inst,
>>>>>>> int arg, acp_entry *entry)
>>>>>>>     /* Compute the first component of the copy that the
>>>>>>> instruction is
>>>>>>>      * reading, and the base byte offset within that
>>>>>>> component.
>>>>>>>      */
>>>>>>> -   assert(entry->dst.offset % REG_SIZE == 0 && entry-
>>>>>>>> dst.stride 
>>>>>>> == 1);
>>>>>>> +   assert(entry->dst.stride == 1);
>>>>>>>     const unsigned component = rel_offset / type_sz(entry-
>>>>>>>> dst.type);
>>>>>>>
>>>>>>>     const unsigned suboffset = rel_offset % type_sz(entry-
>>>>>>>> dst.type);
>>>>>>>
>>>>>>>  
>>>>>>> @@ -783,7 +783,7 @@
>>>>>>> fs_visitor::try_constant_propagate(fs_inst
>>>>>>> *inst, acp_entry *entry)
>>>>>>>  }
>>>>>>>  
>>>>>>>  static bool
>>>>>>> -can_propagate_from(fs_inst *inst)
>>>>>>> +can_propagate_from(fs_inst *inst, unsigned dispatch_width)
>>>>>>>  {
>>>>>>>     return (inst->opcode == BRW_OPCODE_MOV &&
>>>>>>>             inst->dst.file == VGRF &&
>>>>>>> @@ -794,7 +794,7 @@ can_propagate_from(fs_inst *inst)
>>>>>>>              inst->src[0].file == UNIFORM ||
>>>>>>>              inst->src[0].file == IMM) &&
>>>>>>>             inst->src[0].type == inst->dst.type &&
>>>>>>> -           !inst->is_partial_write());
>>>>>>> +           !inst->is_partial_var_write(dispatch_width));
>>>>>>>  }
>>>>>>>  
>>>>>>>  /* Walks a basic block and does copy propagation on it using
>>>>>>> the
>>>>>>> acp
>>>>>>> @@ -846,7 +846,7 @@
>>>>>>> fs_visitor::opt_copy_propagation_local(void
>>>>>>> *copy_prop_ctx, bblock_t *block,
>>>>>>>        /* If this instruction's source could potentially be
>>>>>>> folded into the
>>>>>>>         * operand of another instruction, add it to the ACP.
>>>>>>>         */
>>>>>>> -      if (can_propagate_from(inst)) {
>>>>>>> +      if (can_propagate_from(inst, dispatch_width)) {
>>>>>>>           acp_entry *entry = ralloc(copy_prop_ctx,
>>>>>>> acp_entry);
>>>>>>>           entry->dst = inst->dst;
>>>>>>>           entry->src = inst->src[0];
>>>>>>> diff --git a/src/intel/compiler/brw_fs_cse.cpp
>>>>>>> b/src/intel/compiler/brw_fs_cse.cpp
>>>>>>> index 6859733d58c..56813df2d2a 100644
>>>>>>> --- a/src/intel/compiler/brw_fs_cse.cpp
>>>>>>> +++ b/src/intel/compiler/brw_fs_cse.cpp
>>>>>>> @@ -247,7 +247,8 @@ fs_visitor::opt_cse_local(bblock_t
>>>>>>> *block)
>>>>>>>     int ip = block->start_ip;
>>>>>>>     foreach_inst_in_block(fs_inst, inst, block) {
>>>>>>>        /* Skip some cases. */
>>>>>>> -      if (is_expression(this, inst) && !inst-
>>>>>>>> is_partial_write() 
>>>>>>> &&
>>>>>>> +      if (is_expression(this, inst) &&
>>>>>>> +          !inst->is_partial_var_write(dispatch_width) &&
>>>>>>>            ((inst->dst.file != ARF && inst->dst.file !=
>>>>>>> FIXED_GRF) ||
>>>>>>>             inst->dst.is_null()))
>>>>>>>        {
>>>>>>> diff --git
>>>>>>> a/src/intel/compiler/brw_fs_dead_code_eliminate.cpp
>>>>>>> b/src/intel/compiler/brw_fs_dead_code_eliminate.cpp
>>>>>>> index eeb71dd2b92..f24fd0643b8 100644
>>>>>>> --- a/src/intel/compiler/brw_fs_dead_code_eliminate.cpp
>>>>>>> +++ b/src/intel/compiler/brw_fs_dead_code_eliminate.cpp
>>>>>>> @@ -110,7 +110,7 @@ fs_visitor::dead_code_eliminate()
>>>>>>>           }
>>>>>>>  
>>>>>>>           if (inst->dst.file == VGRF) {
>>>>>>> -            if (!inst->is_partial_write()) {
>>>>>>> +            if (!inst->is_partial_reg_write()) {
>>>>>>>                 int var = live_intervals->var_from_reg(inst-
>>>>>>>> dst);
>>>>>>>
>>>>>>>                 for (unsigned i = 0; i < regs_written(inst);
>>>>>>> i++)
>>>>>>> {
>>>>>>>                    BITSET_CLEAR(live, var + i);
>>>>>>> diff --git a/src/intel/compiler/brw_fs_live_variables.cpp
>>>>>>> b/src/intel/compiler/brw_fs_live_variables.cpp
>>>>>>> index 059f076fa51..30625aa586a 100644
>>>>>>> --- a/src/intel/compiler/brw_fs_live_variables.cpp
>>>>>>> +++ b/src/intel/compiler/brw_fs_live_variables.cpp
>>>>>>> @@ -84,7 +84,7 @@ fs_live_variables::setup_one_write(struct
>>>>>>> block_data *bd, fs_inst *inst,
>>>>>>>      * screens off previous updates of that variable (VGRF
>>>>>>> channel).
>>>>>>>      */
>>>>>>>     if (inst->dst.file == VGRF) {
>>>>>>> -      if (!inst->is_partial_write() && !BITSET_TEST(bd->use,
>>>>>>> var))
>>>>>>> +      if (!inst->is_partial_reg_write() && !BITSET_TEST(bd-
>>>>>>>> use, 
>>>>>>> var))
>>>>>>>           BITSET_SET(bd->def, var);
>>>>>>>  
>>>>>>>        BITSET_SET(bd->defout, var);
>>>>>>> diff --git a/src/intel/compiler/brw_fs_reg_allocate.cpp
>>>>>>> b/src/intel/compiler/brw_fs_reg_allocate.cpp
>>>>>>> index 73b8b7841f5..5147e8b4426 100644
>>>>>>> --- a/src/intel/compiler/brw_fs_reg_allocate.cpp
>>>>>>> +++ b/src/intel/compiler/brw_fs_reg_allocate.cpp
>>>>>>> @@ -1026,7 +1026,7 @@ fs_visitor::spill_reg(int spill_reg)
>>>>>>>            * write, there should be no need for the unspill
>>>>>>> since
>>>>>>> the
>>>>>>>            * instruction will be overwriting the whole
>>>>>>> destination in any case.
>>>>>>>           */
>>>>>>> -         if (inst->is_partial_write() ||
>>>>>>> +         if (inst->is_partial_reg_write() ||
>>>>>>>               (!inst->force_writemask_all && !per_channel))
>>>>>>>              emit_unspill(ubld, spill_src,
>>>>>>> subset_spill_offset,
>>>>>>>                           regs_written(inst));
>>>>>>> diff --git a/src/intel/compiler/brw_fs_register_coalesce.cpp
>>>>>>> b/src/intel/compiler/brw_fs_register_coalesce.cpp
>>>>>>> index 952276faed8..b27956023c6 100644
>>>>>>> --- a/src/intel/compiler/brw_fs_register_coalesce.cpp
>>>>>>> +++ b/src/intel/compiler/brw_fs_register_coalesce.cpp
>>>>>>> @@ -70,7 +70,7 @@ is_coalesce_candidate(const fs_visitor *v,
>>>>>>> const fs_inst *inst)
>>>>>>>  {
>>>>>>>     if ((inst->opcode != BRW_OPCODE_MOV &&
>>>>>>>          inst->opcode != SHADER_OPCODE_LOAD_PAYLOAD) ||
>>>>>>> -       inst->is_partial_write() ||
>>>>>>> +       inst->is_partial_reg_write() ||
>>>>>>>         inst->saturate ||
>>>>>>>         inst->src[0].file != VGRF ||
>>>>>>>         inst->src[0].negate ||
>>>>>>> diff --git
>>>>>>> a/src/intel/compiler/brw_fs_saturate_propagation.cpp
>>>>>>> b/src/intel/compiler/brw_fs_saturate_propagation.cpp
>>>>>>> index d6cfa79a618..1e1461063ae 100644
>>>>>>> --- a/src/intel/compiler/brw_fs_saturate_propagation.cpp
>>>>>>> +++ b/src/intel/compiler/brw_fs_saturate_propagation.cpp
>>>>>>> @@ -43,7 +43,8 @@
>>>>>>>   */
>>>>>>>  
>>>>>>>  static bool
>>>>>>> -opt_saturate_propagation_local(fs_visitor *v, bblock_t
>>>>>>> *block)
>>>>>>> +opt_saturate_propagation_local(fs_visitor *v, bblock_t
>>>>>>> *block,
>>>>>>> +                               unsigned dispatch_width)
>>>>>>>  {
>>>>>>>     bool progress = false;
>>>>>>>     int ip = block->end_ip + 1;
>>>>>>> @@ -66,7 +67,7 @@ opt_saturate_propagation_local(fs_visitor
>>>>>>> *v,
>>>>>>> bblock_t *block)
>>>>>>>        foreach_inst_in_block_reverse_starting_from(fs_inst,
>>>>>>> scan_inst, inst) {
>>>>>>>           if (regions_overlap(scan_inst->dst, scan_inst-
>>>>>>>> size_written,
>>>>>>>
>>>>>>>                               inst->src[0], inst-
>>>>>>>> size_read(0)))
>>>>>>> {
>>>>>>> -            if (scan_inst->is_partial_write() ||
>>>>>>> +            if (scan_inst-
>>>>>>>> is_partial_var_write(dispatch_width)
>>>>>>>>>
>>>>>>>
>>>>>>>                  (scan_inst->dst.type != inst->dst.type &&
>>>>>>>                   !scan_inst->can_change_types()))
>>>>>>>                 break;
>>>>>>> @@ -153,7 +154,7 @@ fs_visitor::opt_saturate_propagation()
>>>>>>>     calculate_live_intervals();
>>>>>>>  
>>>>>>>     foreach_block (block, cfg) {
>>>>>>> -      progress = opt_saturate_propagation_local(this, block)
>>>>>>> ||
>>>>>>> progress;
>>>>>>> +      progress = opt_saturate_propagation_local(this, block,
>>>>>>> dispatch_width) || progress;
>>>>>>>     }
>>>>>>>  
>>>>>>>     /* Live intervals are still valid. */
>>>>>>> diff --git a/src/intel/compiler/brw_fs_sel_peephole.cpp
>>>>>>> b/src/intel/compiler/brw_fs_sel_peephole.cpp
>>>>>>> index 6395b409b7c..98d640a3bfe 100644
>>>>>>> --- a/src/intel/compiler/brw_fs_sel_peephole.cpp
>>>>>>> +++ b/src/intel/compiler/brw_fs_sel_peephole.cpp
>>>>>>> @@ -167,8 +167,8 @@ fs_visitor::opt_peephole_sel()
>>>>>>>               then_mov[i]->exec_size != else_mov[i]-
>>>>>>>> exec_size ||
>>>>>>>               then_mov[i]->group != else_mov[i]->group ||
>>>>>>>               then_mov[i]->force_writemask_all !=
>>>>>>> else_mov[i]-
>>>>>>>> force_writemask_all ||
>>>>>>>
>>>>>>> -             then_mov[i]->is_partial_write() ||
>>>>>>> -             else_mov[i]->is_partial_write() ||
>>>>>>> +             then_mov[i]-
>>>>>>>> is_partial_var_write(dispatch_width)
>>>>>>>>>
>>>>>>>
>>>>>>> +             else_mov[i]-
>>>>>>>> is_partial_var_write(dispatch_width)
>>>>>>>>>
>>>>>>>
>>>>>>>               then_mov[i]->conditional_mod !=
>>>>>>> BRW_CONDITIONAL_NONE ||
>>>>>>>               else_mov[i]->conditional_mod !=
>>>>>>> BRW_CONDITIONAL_NONE) {
>>>>>>>              movs = i;
>>>>>>> diff --git a/src/intel/compiler/brw_ir_fs.h
>>>>>>> b/src/intel/compiler/brw_ir_fs.h
>>>>>>> index 07e7224e0f8..02c91b246a3 100644
>>>>>>> --- a/src/intel/compiler/brw_ir_fs.h
>>>>>>> +++ b/src/intel/compiler/brw_ir_fs.h
>>>>>>> @@ -349,7 +349,8 @@ public:
>>>>>>>  
>>>>>>>     bool equals(fs_inst *inst) const;
>>>>>>>     bool is_send_from_grf() const;
>>>>>>> -   bool is_partial_write() const;
>>>>>>> +   bool is_partial_reg_write() const;
>>>>>>> +   bool is_partial_var_write(unsigned dispatch_width) const;
>>>>>>>     bool is_copy_payload(const brw::simple_allocator
>>>>>>> &grf_alloc)
>>>>>>> const;
>>>>>>>     unsigned components_read(unsigned i) const;
>>>>>>>     unsigned size_read(int arg) const;
>>>>>>> -- 
>>>>>>> 2.17.1
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> mesa-dev mailing list
>>>>>>> mesa-dev@lists.freedesktop.org
>>>>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>>>
>>>>>
>>>
>>>
>>
> 
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev

Re: [Mesa-dev] [PATCH 41/59] intel/compiler: split is_partial_write() into two variants

Reply via email to