fs: New method for register_byte_use_pattern for fs_inst

Chema Casanova Sun, 29 Jul 2018 10:48:43 -0700

El 28/07/18 a las 01:45, Francisco Jerez escribió:
> Chema Casanova <jmcasan...@igalia.com> writes:
> 
>> El 27/07/18 a las 02:44, Francisco Jerez escribió:
>>> Chema Casanova <jmcasan...@igalia.com> writes:
>>>
>>>> El 26/07/18 a las 20:02, Francisco Jerez escribió:
>>>>> Chema Casanova <jmcasan...@igalia.com> writes:
>>>>>
>>>>>> El 20/07/18 a las 22:10, Francisco Jerez escribió:
>>>>>>> Chema Casanova <jmcasan...@igalia.com> writes:
>>>>>>>
>>>>>>>> El 20/07/18 a las 00:34, Francisco Jerez escribió:
>>>>>>>>> Chema Casanova <jmcasan...@igalia.com> writes:
>>>>>>>>>
>>>>>>>>>> El 14/07/18 a las 00:14, Francisco Jerez escribió:
>>>>>>>>>>> Jose Maria Casanova Crespo <jmcasan...@igalia.com> writes:
>>>>>>>>>>>
>>>>>>>>>>>> For a register source/destination of an instruction the function 
>>>>>>>>>>>> returns
>>>>>>>>>>>> the read/write byte pattern of a 32-byte registers as a unsigned 
>>>>>>>>>>>> int.
>>>>>>>>>>>>
>>>>>>>>>>>> The returned pattern takes into account the exec_size of the 
>>>>>>>>>>>> instruction,
>>>>>>>>>>>> the type bitsize, the stride and if the register is source or 
>>>>>>>>>>>> destination.
>>>>>>>>>>>>
>>>>>>>>>>>> The objective of the functions if to help to know the read/written 
>>>>>>>>>>>> bytes
>>>>>>>>>>>> of the instructions to improve the liveness analysis for partial 
>>>>>>>>>>>> read/writes.
>>>>>>>>>>>>
>>>>>>>>>>>> We manage special cases for 
>>>>>>>>>>>> SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL
>>>>>>>>>>>> and SHADER_OPCODE_BYTE_SCATTERED_WRITE because depending of the 
>>>>>>>>>>>> bitsize
>>>>>>>>>>>> parameter they have a different read pattern.
>>>>>>>>>>>> ---
>>>>>>>>>>>>  src/intel/compiler/brw_fs.cpp  | 183 
>>>>>>>>>>>> +++++++++++++++++++++++++++++++++
>>>>>>>>>>>>  src/intel/compiler/brw_ir_fs.h |   1 +
>>>>>>>>>>>>  2 files changed, 184 insertions(+)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/src/intel/compiler/brw_fs.cpp 
>>>>>>>>>>>> b/src/intel/compiler/brw_fs.cpp
>>>>>>>>>>>> index 2b8363ca362..f3045c4ff6c 100644
>>>>>>>>>>>> --- a/src/intel/compiler/brw_fs.cpp
>>>>>>>>>>>> +++ b/src/intel/compiler/brw_fs.cpp
>>>>>>>>>>>> @@ -687,6 +687,189 @@ fs_inst::is_partial_write() const
>>>>>>>>>>>>             this->dst.offset % REG_SIZE != 0);
>>>>>>>>>>>>  }
>>>>>>>>>>>>  
>>>>>>>>>>>> +/**
>>>>>>>>>>>> + * Returns a 32-bit uint whose bits represent if the associated 
>>>>>>>>>>>> register byte
>>>>>>>>>>>> + * has been read/written by the instruction. The returned pattern 
>>>>>>>>>>>> takes into
>>>>>>>>>>>> + * account the exec_size of the instruction, the type bitsize and 
>>>>>>>>>>>> the register
>>>>>>>>>>>> + * stride and the register is source or destination for the 
>>>>>>>>>>>> instruction.
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * The objective of this function is to identify which parts of 
>>>>>>>>>>>> the register
>>>>>>>>>>>> + * are read or written for operations that don't read/write a 
>>>>>>>>>>>> full register.
>>>>>>>>>>>> + * So we can identify in live range variable analysis if a 
>>>>>>>>>>>> partial write has
>>>>>>>>>>>> + * completelly defined the part of the register used by a partial 
>>>>>>>>>>>> read. So we
>>>>>>>>>>>> + * avoid extending the liveness range because all data read was 
>>>>>>>>>>>> already
>>>>>>>>>>>> + * defined although the wasn't completely written.
>>>>>>>>>>>> + */
>>>>>>>>>>>> +unsigned
>>>>>>>>>>>> +fs_inst::register_byte_use_pattern(const fs_reg &r, boolean 
>>>>>>>>>>>> is_dst) const
>>>>>>>>>>>> +{
>>>>>>>>>>>> +   if (is_dst) {
>>>>>>>>>>
>>>>>>>>>>> Please split into two functions (like fs_inst::src_read and
>>>>>>>>>>> ::src_written) since that would make the call-sites of this method 
>>>>>>>>>>> more
>>>>>>>>>>> self-documenting than a boolean parameter.  You should be able to 
>>>>>>>>>>> share
>>>>>>>>>>> code by refactoring the common logic into a separate function (see 
>>>>>>>>>>> below
>>>>>>>>>>> for some suggestions on how that could be achieved).
>>>>>>>>>>
>>>>>>>>>> Sure, it would improve readability and simplifies the logic, I've 
>>>>>>>>>> chosen
>>>>>>>>>> dst_write_pattern and src_read_pattern.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> +      /* We don't know what is written so we return the worts 
>>>>>>>>>>>> case */
>>>>>>>>>>>
>>>>>>>>>>> "worst"
>>>>>>>>>>
>>>>>>>>>> Fixed.
>>>>>>>>>>
>>>>>>>>>>>> +      if (this->predicate && this->opcode != BRW_OPCODE_SEL)
>>>>>>>>>>>> +         return 0;
>>>>>>>>>>>> +      /* We assume that send destinations are completely written 
>>>>>>>>>>>> */
>>>>>>>>>>>> +      if (this->is_send_from_grf())
>>>>>>>>>>>> +         return ~0u;
>>>>>>>>>>>
>>>>>>>>>>> Some send-like instructions won't be caught by this condition, you
>>>>>>>>>>> should check for this->mlen != 0 in addition.
>>>>>>>>>>
>>>>>>>>>> Would it be enough to check for (this->mlen > 0) and forget about
>>>>>>>>>> is_send_from_grf? I am using this approach in v2 I am sending.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I don't think the mlen > 0 condition would catch all cases either...
>>>>>>>>> E.g. FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD IIRC.  You probably need 
>>>>>>>>> both
>>>>>>>>> conditions.  Sucks...
>>>>>>>>
>>>>>>>> That is true, so now we have the:
>>>>>>>>  (this->is_send_from_grf() || this->mlen != 0)
>>>>>>>>
>>>>>>>>>>>> +   } else {
>>>>>>>>>>>> +      /* byte_scattered_write_logical pattern of src[1] is 32-bit 
>>>>>>>>>>>> aligned
>>>>>>>>>>>> +       * so the read pattern depends on the bitsize stored at 
>>>>>>>>>>>> src[4]
>>>>>>>>>>>> +       */
>>>>>>>>>>>> +      if (this->opcode == 
>>>>>>>>>>>> SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL &&
>>>>>>>>>>>> +          this->src[1].nr == r.nr) {
>>>>>>>>>>
>>>>>>>>>>> I feel uncomfortable about attempting to guess the source the 
>>>>>>>>>>> caller is
>>>>>>>>>>> referring to by comparing the registers for equality.  E.g.  you 
>>>>>>>>>>> could
>>>>>>>>>>> potentially end up with two sources that compare equal but have
>>>>>>>>>>> different semantics (e.g. as a result of CSE) which might cause it 
>>>>>>>>>>> to
>>>>>>>>>>> get the wrong answer.  It would probably be better to pass a source
>>>>>>>>>>> index and a byte offset as argument instead of an fs_reg.
>>>>>>>>>>
>>>>>>>>>> I've didn't thought about CSE, I'm now receiving the number of source
>>>>>>>>>> and the reg_offset. I'm using reg_offset instead of byte offsets as 
>>>>>>>>>> it
>>>>>>>>>> simplifies the logic. Now we are using always the base src register 
>>>>>>>>>> to
>>>>>>>>>> do all the calculation
>>>>>>>>>>>> +         switch (this->src[4].ud) {
>>>>>>>>>>>> +         case 32:
>>>>>>>>>>>> +            return ~0u;
>>>>>>>>>>>> +         case 16:
>>>>>>>>>>>> +            return 0x33333333;
>>>>>>>>>>>> +         case 8:
>>>>>>>>>>>> +            return 0x11111111;
>>>>>>>>>>>> +         default:
>>>>>>>>>>>> +            unreachable("Unsupported bitsize at 
>>>>>>>>>>>> byte_scattered_write_logical");
>>>>>>>>>>>> +         }
>>>>>>>>>>>
>>>>>>>>>>> Replace the above switch statement with a call to "periodic_mask(8, 
>>>>>>>>>>> 4,
>>>>>>>>>>> this->src[4].ud / 8)" (see below for the definition).
>>>>>>>>>>
>>>>>>>>>> Ok.
>>>>>>>>>>
>>>>>>>>>>>> +      }
>>>>>>>>>>>> +      /* As for byte_scattered_write_logical but we need to take 
>>>>>>>>>>>> into account
>>>>>>>>>>>> +       * that data written are in the payload offset 32 with 
>>>>>>>>>>>> SIMD8 and offset
>>>>>>>>>>>> +       * 64 with SIMD16.
>>>>>>>>>>>> +       */
>>>>>>>>>>>> +      if (this->opcode == SHADER_OPCODE_BYTE_SCATTERED_WRITE &&
>>>>>>>>>>>> +          this->src[0].nr == r.nr) {
>>>>>>>>>>>> +         fs_reg payload = this->src[0];
>>>>>>>>>>>> +         payload.offset = REG_SIZE * this->exec_size / 8;
>>>>>>>>>>>
>>>>>>>>>>> byte_offset() is your friend.
>>>>>>>>>>
>>>>>>>>>> I've removed the overlap logic, and I'm just checking if we are in 
>>>>>>>>>> the
>>>>>>>>>> reg_offset 1 on SIMD8 or reg_offset 2-3 in SIMD16.
>>>>>>>>>>
>>>>>>>>>>>> +         if (regions_overlap(r, REG_SIZE,
>>>>>>>>>>>> +                             payload, REG_SIZE * this->exec_size 
>>>>>>>>>>>> / 8)) {
>>>>>>>>>>>> +            switch (this->src[2].ud) {
>>>>>>>>>>>> +            case 32:
>>>>>>>>>>>> +               return ~0u;
>>>>>>>>>>>> +            case 16:
>>>>>>>>>>>> +               return 0x33333333;
>>>>>>>>>>>> +            case 8:
>>>>>>>>>>>> +               return 0x11111111;
>>>>>>>>>>>> +            default:
>>>>>>>>>>>> +               unreachable("Unsupported bitsize at 
>>>>>>>>>>>> byte_scattered_write");
>>>>>>>>>>>> +            }
>>>>>>>>>>>
>>>>>>>>>>> Replace the above switch statement with a call to "periodic_mask(8, 
>>>>>>>>>>> 4,
>>>>>>>>>>> this->src[2].ud / 8)".
>>>>>>>>>>
>>>>>>>>>> Ok.
>>>>>>>>>>
>>>>>>>>>>>> +         } else {
>>>>>>>>>>>> +            return ~0u;
>>>>>>>>>>>> +         }
>>>>>>>>>>>> +      }
>>>>>>>>>>>> +   }
>>>>>>>>>>>> +
>>>>>>>>>>>> +   /* We define the most conservative value in order to calculate 
>>>>>>>>>>>> liveness
>>>>>>>>>>>> +    * range. If it is a destination nothing is defined and if is 
>>>>>>>>>>>> a source
>>>>>>>>>>>> +    * all the bytes of the register could be read. So for release 
>>>>>>>>>>>> builds
>>>>>>>>>>>> +    * the unreachables would have always safe return value. */
>>>>>>>>>>>> +   unsigned pattern =  is_dst ? 0 : ~0u;
>>>>>>>>>>>> +
>>>>>>>>>>>> +   /* In the general case we calculate the pattern for a specific 
>>>>>>>>>>>> register
>>>>>>>>>>>> +    * on base of the type_size and stride. We calculate the SIMD8 
>>>>>>>>>>>> pattern
>>>>>>>>>>>> +    * and then we adjust the patter if needed for different 
>>>>>>>>>>>> exec_sizes
>>>>>>>>>>>> +    * and offset
>>>>>>>>>>>> +    */
>>>>>>>>>>>> +   switch (type_sz(r.type)){
>>>>>>>>>>>> +   case 1:
>>>>>>>>>>>> +      switch (r.stride) {
>>>>>>>>>>>> +      case 0:
>>>>>>>>>>>> +         pattern = 0X1;
>>>>>>>>>>>> +         break;
>>>>>>>>>>>> +      case 1:
>>>>>>>>>>>> +         pattern = 0xff;
>>>>>>>>>>>> +         break;
>>>>>>>>>>>> +      case 2:
>>>>>>>>>>>> +         pattern = 0x5555;
>>>>>>>>>>>> +         break;
>>>>>>>>>>>> +      case 4:
>>>>>>>>>>>> +         pattern = 0x11111111;
>>>>>>>>>>>> +         break;
>>>>>>>>>>>> +      case 8:
>>>>>>>>>>>> +         pattern = 0x01010101;
>>>>>>>>>>>> +         break;
>>>>>>>>>>>> +      default:
>>>>>>>>>>>> +         unreachable("Unknown pattern unsupported 8-bit stride");
>>>>>>>>>>>> +      }
>>>>>>>>>>>> +      break;
>>>>>>>>>>>> +   case 2:
>>>>>>>>>>>> +      switch (r.stride) {
>>>>>>>>>>>> +      case 0:
>>>>>>>>>>>> +         pattern = 0X3;
>>>>>>>>>>>> +         break;
>>>>>>>>>>>> +      case 1:
>>>>>>>>>>>> +         pattern = 0xffff;
>>>>>>>>>>>> +         break;
>>>>>>>>>>>> +      case 2:
>>>>>>>>>>>> +         pattern = 0x33333333;
>>>>>>>>>>>> +         break;
>>>>>>>>>>>> +      case 4:
>>>>>>>>>>>> +         pattern = 0x03030303;
>>>>>>>>>>>> +         break;
>>>>>>>>>>>> +      case 8:
>>>>>>>>>>>> +         pattern = 0x00030003;
>>>>>>>>>>>> +         break;
>>>>>>>>>>>> +      default:
>>>>>>>>>>>> +         unreachable("Unknown pattern unsupported 16-bit stride");
>>>>>>>>>>>> +      }
>>>>>>>>>>>> +      break;
>>>>>>>>>>>> +   case 4:
>>>>>>>>>>>> +      switch (r.stride) {
>>>>>>>>>>>> +      case 0:
>>>>>>>>>>>> +         pattern = 0Xf;
>>>>>>>>>>>> +         break;
>>>>>>>>>>>> +      case 1:
>>>>>>>>>>>> +         pattern = ~0u;
>>>>>>>>>>>> +         break;
>>>>>>>>>>>> +      case 2:
>>>>>>>>>>>> +         pattern = 0x0f0f0f0f;
>>>>>>>>>>>> +         break;
>>>>>>>>>>>> +      case 4:
>>>>>>>>>>>> +         pattern = 0x000f000f;
>>>>>>>>>>>> +         break;
>>>>>>>>>>>> +      default:
>>>>>>>>>>>> +         unreachable("Unknown pattern unsupported 32-bit stride");
>>>>>>>>>>>> +      }
>>>>>>>>>>>> +      break;
>>>>>>>>>>>> +   case 8:
>>>>>>>>>>>> +      switch (r.stride) {
>>>>>>>>>>>> +      case 0:
>>>>>>>>>>>> +         pattern = 0Xff;
>>>>>>>>>>>> +         break;
>>>>>>>>>>>> +      case 1:
>>>>>>>>>>>> +         pattern = ~0u;
>>>>>>>>>>>> +         break;
>>>>>>>>>>>> +      case 2:
>>>>>>>>>>>> +         pattern = 0x00ff00ff;
>>>>>>>>>>>> +         break;
>>>>>>>>>>>> +      case 4:
>>>>>>>>>>>> +         pattern = 0xff;
>>>>>>>>>>>> +         break;
>>>>>>>>>>>> +      default:
>>>>>>>>>>>> +         unreachable("Unknown pattern unsupported 64-bit stride");
>>>>>>>>>>>> +      }
>>>>>>>>>>>> +      break;
>>>>>>>>>>>> +   default:
>>>>>>>>>>>> +      unreachable("Unknown pattern for unsupported bitsize ");
>>>>>>>>>>>> +   }
>>>>>>>>>>>> +
>>>>>>>>>>>> +   if (this->exec_size > 8  && r.stride * type_sz(r.type) * 8 < 
>>>>>>>>>>>> REG_SIZE) {
>>>>>>>>>>>> +      /* For exec_size greater than SIMD8 we repeat the pattern 
>>>>>>>>>>>> until it
>>>>>>>>>>>> +       * represents a full register already represent a full 
>>>>>>>>>>>> register */
>>>>>>>>>>>> +      pattern = pattern | (pattern << (8 * r.stride * 
>>>>>>>>>>>> type_sz(r.type)));
>>>>>>>>>>>> +      if (this->exec_size > 16 && r.stride * type_sz(r.type) * 16 
>>>>>>>>>>>> < REG_SIZE)
>>>>>>>>>>>> +         pattern = pattern | (pattern << (16 * r.stride * 
>>>>>>>>>>>> type_sz(r.type)));
>>>>>>>>>>>> +   } else if (this->exec_size < 8 &&
>>>>>>>>>>>> +              r.stride * type_sz(r.type) * this->exec_size < 
>>>>>>>>>>>> REG_SIZE) {
>>>>>>>>>>>> +      /* For exec_size smaller than SIMD8 we reduce the pattern 
>>>>>>>>>>>> if its size
>>>>>>>>>>>> +       * is smaller than a full register. */
>>>>>>>>>>>> +      pattern = pattern >> (MIN2(REG_SIZE, 8 * type_sz(r.type) * 
>>>>>>>>>>>> r.stride) -
>>>>>>>>>>>> +                            this->exec_size * type_sz(r.type) * 
>>>>>>>>>>>> r.stride);
>>>>>>>>>>>> +   }
>>>>>>>>>>>> +
>>>>>>>>>>
>>>>>>>>>>> This seems really mad, no clue whether it's correct.   Why not 
>>>>>>>>>>> replace
>>>>>>>>>>> the above ~110 lines with a call to the following (fully
>>>>>>>>>>> untested) 5-LOC function:
>>>>>>>>>>
>>>>>>>>>> Your suggestion seems to work perfectly fine, my original approach 
>>>>>>>>>> was
>>>>>>>>>> trying to avoid the loop of creating the read/write pattern but after
>>>>>>>>>> testing my v2 I wasn't able to notice any performance difference 
>>>>>>>>>> running
>>>>>>>>>> shader-db and having the same results. I was originally trying that
>>>>>>>>>> SIMD8 patterns were already constants. Sorry for the added 
>>>>>>>>>> complexity.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The loop runs for a logarithmic number of iterations though, so it has
>>>>>>>>> the exact same run-time complexity as your original patch, roughly the
>>>>>>>>> same amount of branches at run-time (but possibly less indirect
>>>>>>>>> branches!), and it should be compiled into a substantially lower 
>>>>>>>>> number
>>>>>>>>> of instructions, which may actually cause it to perform better due to
>>>>>>>>> more favourable caching.  It's hard to tell though which one will
>>>>>>>>> perform better in practice without benchmarking them, and as you
>>>>>>>>> probably realized this is so far from being a bottleneck that whatever
>>>>>>>>> the difference was is likely lost in the noise.  So it really doesn't
>>>>>>>>> matter which one performs better...
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> | uint32_t
>>>>>>>>>>> | periodic_mask(unsigned count, unsigned step, unsigned bits)
>>>>>>>>>>> | {
>>>>>>>>>>> |    uint32_t m = (count ? (1 << bits) - 1 : 0);
>>>>>>>>>>> |    const unsigined max = MIN2(count * step, sizeof(m) * 
>>>>>>>>>>> CHAR_BITS);
>>>>>>>>>>> | 
>>>>>>>>>>> |    for (unsigned shift = step; shift < max; shift *= 2)
>>>>>>>>>>> |       m |= m << shift;
>>>>>>>>>>> | 
>>>>>>>>>>> |    return m;
>>>>>>>>>>> | }
>>>>>>>>>>
>>>>>>>>>> I've used your function just changing the sizeof(m) * CHAR_BITS for 
>>>>>>>>>> the
>>>>>>>>>> REG_SIZE, to not include the limits.h.
>>>>>>>>>
>>>>>>>>> My intention was to make the function as agnostic to IR details as
>>>>>>>>> possible: the only reason there is a limit of 32 bits is because 
>>>>>>>>> that's
>>>>>>>>> the size of the type used to hold the return value.  Using sizeof 
>>>>>>>>> makes
>>>>>>>>> sure that e.g. extending the code to 64 bits is as simple as changing
>>>>>>>>> the datatype to uint64_t.
>>>>>>>>>
>>>>>>>>>> And I've also included an offset parameter that allows us to shift 
>>>>>>>>>> the
>>>>>>>>>> bits of the pattern when we have an offset inside the register.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> That sounds fine.
>>>>>>>>>
>>>>>>>>>>>> +   /* We adjust the pattern to the byte_offset of the register */
>>>>>>>>>>>> +   pattern = pattern << (r.offset % REG_SIZE);
>>>>>>>>>>>> +
>>>>>>>>>>>
>>>>>>>>>>> This doesn't really work except for r equal to the first GRF read 
>>>>>>>>>>> by the
>>>>>>>>>>> source.  Regions with non-zero sub-GRF offset that straddle multiple
>>>>>>>>>>> GRFs are not really very common at the IR level, but they're 
>>>>>>>>>>> allowed by
>>>>>>>>>>> the hardware with some restrictions, so it would make sense to at 
>>>>>>>>>>> least
>>>>>>>>>>> handle them conservatively in order to keep things from breaking 
>>>>>>>>>>> silently.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I included an assert in the periodic_mask function to detect if the
>>>>>>>>>> combination of offset and mask goes over the REG_SIZE, so we would
>>>>>>>>>> detect an straddle at this level of the IR.
>>>>>>>>>>
>>>>>>>>>> assert(offset + max - (step - bits) <= REG_SIZE);
>>>>>>>>
>>>>>>>>> Uhm...  What's the definition of "offset" here?  You seem to be 
>>>>>>>>> passing
>>>>>>>>> the offset relative to the start of the VGRF modulo REG_SIZE but that
>>>>>>>>> doesn't really make sense to me whenever "reg_offset" is non-zero.
>>>>>>>>
>>>>>>>> Yes I agree on removing this assert in the periodic_mask as it a 
>>>>>>>> generic
>>>>>>>> function. I was to focus on seen the integer as the register :)
>>>>>>>>
>>>>>>>> The motivation was because of the following cases that happen with
>>>>>>>> 8/16-bit more usually:
>>>>>>>>
>>>>>>>> (1) mov(16)    g12<2>HF    g1<16,8,2>HF                  { align1 1H };
>>>>>>>> (2) mov(16)    g12.1<2>HF  g2<16,8,2>HF                  { align1 1H };
>>>>>>>>
>>>>>>>> In the dst_write_pattern, at point of calling the general periodic_mask
>>>>>>>> case we know that instruction is not expected to be a send message
>>>>>>>> destination. So the pattern is the same for reg_offset = 0 and
>>>>>>>> reg_offset = 1.
>>>>>>>>
>>>>>>>
>>>>>>> In this specific example, yes, but in general the pattern will not be
>>>>>>> the same for multiple registers for arbitrary fs_reg::offset values.
>>>>>>
>>>>>> I've been running some tests. I've found that the three opcodes reach
>>>>>> this situation in shader-db (MOV, INDIRECT_MOV and OR).
>>>>>>
>>>>>> If we just manage the MOV opcode that we agree that the same pattern is
>>>>>> repeated for the same source/dst used registers have the same spilling
>>>>>> reduction that with my original patch.
>>>>>>
>>>>>>>> 0x3333333 for case (1) and 0xccccccc for case (2). But without 
>>>>>>>> repeating
>>>>>>>> the pattern offset we would get 0xcccccccc for reg_offset=0 but
>>>>>>>> 0x333333333 for reg_offset=1 witch would be incorrect.
>>>>>>>>
>>>>>>>>> I think you want to pass "src[i].offset % REG_SIZE - reg_offset *
>>>>>>>>> REG_SIZE" as a signed integer in order to get the offset of the first
>>>>>>>>> byte actually written by the instruction relative to the first byte of
>>>>>>>>> the GRF window of the pattern.  You don't really need to assert-fail
>>>>>>>>> when the offset is greater or equal to 32 (which shouldn't actually
>>>>>>>>> happen in practice), "return 0" gives you the correct behavior.  For
>>>>>>>>> negative offsets (which means the pattern starts after the first byte
>>>>>>>>> written by the instruction) you can just return ~0u conservatively
>>>>>>>>> whenever the current logic wouldn't work [assuming you don't feel like
>>>>>>>>> implementing the code to handle that case accurately ;)].
>>>>>>>>
>>>>>>>> Maybe you feel more comfortable with the following approach for
>>>>>>>> src_read_pattern.:
>>>>>>>>
>>>>>>>> For read sources we also assume that SENDs sources are completely
>>>>>>>> read so return ~0u except the byte_scattered_write source exceptions.
>>>>>>>>
>>>>>>>> if (this->is_send_from_grf() || this->mlen != 0)
>>>>>>>>       return ~0u;
>>>>>>>>
>>>>>>>> So after this I think that we can assume that the following condition
>>>>>>>> should be always correct where source operand must reside within two
>>>>>>>> adjacent 256-bit registers so the pattern would be periodic for
>>>>>>>> all reg_offsets and we can use the (this->src[i].offset % REG_SIZE).
>>>>>>>>
>>>>>>>
>>>>>>> That's not a particularly futureproof assumption.  I'm okay with
>>>>>>> handling non-reg_offset-periodic cases inaccurately for the moment for
>>>>>>> the sake of simplicity, but there is no reason to have the code blow up
>>>>>>> in such cases if you could just return ~0u.
>>>>>>
>>>>>> I've added some modifications in the v3, we return ~0u for reads and 0u
>>>>>> for writes by default, but i added the case of !this->is_partial_write()
>>>>>> that will return ~0u for writes, that won't we reached with current use,
>>>>>> but to be coherent with any future uses of this function.
>>>>>>
>>>>>>>> assert(reg_offset < DIV_ROUND_UP(this->src[i].stride ?
>>>>>>>>     this->src[i].stride * type_sz(this->src[i].type) * this->exec_size 
>>>>>>>> :
>>>>>>>>     type_sz(this->src[i].type), REG_SIZE));
>>>>>>>>
>>>>>>
>>>>>>> I don't think it's useful to assert-fail on this condition, since it
>>>>>>> doesn't handle all cases where the code below will be broken for a
>>>>>>> certain valid IR, and the condition it does protect against can be
>>>>>>> handled trivially in periodic_mask(), by returning zero when the 32-byte
>>>>>>> pattern window falls outside the region read or written by the
>>>>>>> instruction -- But currently you can't even know whether that's the
>>>>>>> case, because periodic_mask() is unaware where the 32-byte window starts
>>>>>>> relative to the source region, which is *all* it cares about, instead
>>>>>>> you're passing an offset to it that is equal to that in some special
>>>>>>> cases -- In all other cases the result of the function will be bogus.
>>>>>>
>>>>>> I understand your concerns, so in my V3, I am using only the
>>>>>> periodic_mask function when reg_offset is 0, and for the MOV special
>>>>>> case that I understand we agree that the internal offset for
>>>>>> reg_offset=1 is generally reg.offset % REG_SIZE. With that two cases
>>>>>> covered we solve the register_pressure on current supported 64/16/8 bit
>>>>>> cases.
>>>>>>
>>>>
>>>>> I don't understand.  The MOV opcode is not substantially different from
>>>>> any other ALU opcode with regards to regioning.  The assumptions this is
>>>>> based on are invalid for general Align1 regions, whether the opcode is a
>>>>> MOV or not.
>>>>
>>>> Let's try step by step to find my misunderstanding :-)
>>>>
>>>> My original understanding was that for a general ALU operations to
>>>> define witch parts of the VGRF are written/read we need to use 4
>>>> different variables in the scalar backend for src/dst registers.
>>>>
>>>> - The type size.
>>>> - The horizontal stride.
>>>> - The execution size.
>>>> - The initial byte offset of the first register of src[i]/dst
>>>>
>>>> If stride==1 && type_size * execution_size>=REG_SIZE && byte_offset== 0
>>>> we have a full/write of one or two registers. In that case the solution
>>>> is easy everything is written ~0u.
>>>>
>>>
>>> Not necessarily, if "reg_offset * REG_SIZE >= byte_offset + type_size *
>>> MAX2(1, stride * exec_size)" the correct answer is 0.
>>
>> Ok, I got what you mean. I was incorrectly assuming as precondition that
>> reg_offset was always in range because that is what happens on
>> setup_def_used at brw_fs_live_variables.cpp.
>>
>> I think we can use directly regs_written and regs_read that do the same
>> calculation.
>>
>> if (reg_offset >= regs_written(this))
>>      return 0;
>> --------------------------------------
>> if (reg_offset >= regs_read(this, i))
>>      return 0;
>>
> 
> Yes, that should work.
> 
>>
>>>> If we have a partial write/read:
>>>>
>>>> I understood that you my initial patter proposal would only be ok for
>>>> the first GRF of src[i]/dst (reg_offset == 0)
>>>>
>>>> periodic_mask(this->exec_size,                           /* count */
>>>>        this->src[i].stride * type_sz(this->src[i].type), /*step */
>>>>        type_sz(this->src[i].type),                       /* bits */
>>>>        this->src[i].offset % REG_SIZE);                  /* offset */
>>>>
>>>> In the case we manage only reg_offset == 0 we get a huge improvement
>>>> reducing all problems many of the register_pressure we have now on all
>>>> SIMD8 shaders with 8/16bits test cases.
>>>>
>>>> I understood that you didn't agree that for cases where src/destination
>>>> use more than 1 GRF (reg_offset == 1) we can not guarantee that we can
>>>> apply the same internal offset (this->src[i].offset % REG_SIZE) as the
>>>> base register to calculate a patter. So It would be better to return ~0u
>>>> on reads or 0u in writes.
>>>>
>>
>>> Yes, but you could easily determine whether the mask is going to be
>>> invariant with respect to reg_offset (where reg_offset is within bounds)
>>> and in that case return the periodic_mask() expression above, otherwise
>>> return 0/~0u depending on whether reg_offset is within bounds.
>>
>> Ok, so we are within bounds, we don't have a predicated write, we are
>> not a send message. Then we have an ALU opcode and we return the
>> periodic_mask.
>>
> 
> Those are all necessary but not sufficient conditions for the
> periodic_mask() expression above to give you the correct answer for any
> in-bounds reg_offset > 0, you should check that byte_offset < type_size
> * stride in addition.


That's true. Fixed in v5.

If we don't satisfy the condition then we return 0 on writes and ~0u on
reads.

Chema

> 
>>>> That is the point were I couldn't found a counter-case with the search I
>>>> did to identify the different register region that appear in shaderdb
>>>> generated MOV instructions and 16/8 bit CTS tests, that used two
>>>> registers with a region pattern that implied a different byte_pattern,
>>>> for the first and the second register. But it should exist :)
>>>
>>> I don't see how special-casing MOV could help you in any way except by
>>> luck.  If your assumptions were correct you would be able to apply them
>>> to all ALU instructions with Align1 regions.
>>
>> It worked because all the current uses are in range, but future uses of
>> this function would have return incorrect values as you pointed.
>>
>> I've already send v4. Let's see if we are converging :)
>>
>> Chema
>>
>>>> If my bad understanding is before this point, I should back to re-review
>>>> the Regioning PRM section. Or maybe I messing the IR level and the
>>>> emission level that is another option with different semantics :-?
>>>>
>>>> Thank you for your time.
>>>>
>>>> Chema
>>>>
>>>>>>>> So we could call in this scenario.
>>>>>>>>
>>>>>>>> return periodic_mask(this->exec_size,
>>>>>>>>                      this->src[i].stride * type_sz(this->src[i].type),
>>>>>>>>                      type_sz(this->src[i].type),
>>>>>>>>                      this->src[i].offset % REG_SIZE);
>>>>>>>>
>>>>>>>> The only issue I could think about that could generate issues would a
>>>>>>>> case that I didn't found in our current code where a source is defined
>>>>>>>> like r5.0<1;8;2>:w explained at PRM KBL vol07 Page 790 "A 16-element
>>>>>>>> register region with interleaved rows (r5.0<1;8,2>:w)".
>>>>>>>>
>>>>>>>
>>>>>>> You couldn't really handle that even if you wanted to, because there is
>>>>>>> no way to represent such a thing at the IR level, since the horizontal
>>>>>>> and vertical strides cannot be controlled independently, only the stride
>>>>>>> member is meaningful for VGRFs.
>>>>>>
>>>>>> Good, that simplifies things.
>>>>>>
>>>>>> Thanks again.
>>>>>>
>>>>>> Chema
>>>>>>
>>>>>>>> What do you think?
>>>>>>>>
>>>>>>>> Chema
>>>>>>>>
>>>>>>>>
>>>>>>>>>> Thanks for the review, I think that v2 has better shape.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> No problem.
>>>>>>>>>
>>>>>>>>>> Chema
>>>>>>>>>>
>>>>>>>>>>>> +   assert(pattern);
>>>>>>>>>>>> +
>>>>>>>>>>>> +   return pattern;
>>>>>>>>>>>> +}
>>>>>>>>>>>> +
>>>>>>>>>>>> +
>>>>>>>>>>>>  unsigned
>>>>>>>>>>>>  fs_inst::components_read(unsigned i) const
>>>>>>>>>>>>  {
>>>>>>>>>>>> diff --git a/src/intel/compiler/brw_ir_fs.h 
>>>>>>>>>>>> b/src/intel/compiler/brw_ir_fs.h
>>>>>>>>>>>> index 92dad269a34..5ea6294b8ad 100644
>>>>>>>>>>>> --- a/src/intel/compiler/brw_ir_fs.h
>>>>>>>>>>>> +++ b/src/intel/compiler/brw_ir_fs.h
>>>>>>>>>>>> @@ -350,6 +350,7 @@ public:
>>>>>>>>>>>>     bool equals(fs_inst *inst) const;
>>>>>>>>>>>>     bool is_send_from_grf() const;
>>>>>>>>>>>>     bool is_partial_write() const;
>>>>>>>>>>>> +   unsigned register_byte_use_pattern(const fs_reg &r, boolean 
>>>>>>>>>>>> is_dst) const;
>>>>>>>>>>>>     bool is_copy_payload(const brw::simple_allocator &grf_alloc) 
>>>>>>>>>>>> const;
>>>>>>>>>>>>     unsigned components_read(unsigned i) const;
>>>>>>>>>>>>     unsigned size_read(int arg) const;
>>>>>>>>>>>> -- 
>>>>>>>>>>>> 2.17.1
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> mesa-dev mailing list
>>>>>>>>>>>> mesa-dev@lists.freedesktop.org
>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> mesa-dev mailing list
>>>>>>>>>>>> mesa-dev@lists.freedesktop.org
>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> mesa-dev mailing list
>>>>>>>>>>>> mesa-dev@lists.freedesktop.org
>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> mesa-dev mailing list
>>>>>>>>>>>> mesa-dev@lists.freedesktop.org
>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> mesa-dev mailing list
>>>>>>>>>>>> mesa-dev@lists.freedesktop.org
>>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev

Re: [Mesa-dev] [PATCH 1/2] intel/fs: New method for register_byte_use_pattern for fs_inst

Reply via email to