El 28/07/18 a las 01:45, Francisco Jerez escribió: > Chema Casanova <jmcasan...@igalia.com> writes: > >> El 27/07/18 a las 02:44, Francisco Jerez escribió: >>> Chema Casanova <jmcasan...@igalia.com> writes: >>> >>>> El 26/07/18 a las 20:02, Francisco Jerez escribió: >>>>> Chema Casanova <jmcasan...@igalia.com> writes: >>>>> >>>>>> El 20/07/18 a las 22:10, Francisco Jerez escribió: >>>>>>> Chema Casanova <jmcasan...@igalia.com> writes: >>>>>>> >>>>>>>> El 20/07/18 a las 00:34, Francisco Jerez escribió: >>>>>>>>> Chema Casanova <jmcasan...@igalia.com> writes: >>>>>>>>> >>>>>>>>>> El 14/07/18 a las 00:14, Francisco Jerez escribió: >>>>>>>>>>> Jose Maria Casanova Crespo <jmcasan...@igalia.com> writes: >>>>>>>>>>> >>>>>>>>>>>> For a register source/destination of an instruction the function >>>>>>>>>>>> returns >>>>>>>>>>>> the read/write byte pattern of a 32-byte registers as a unsigned >>>>>>>>>>>> int. >>>>>>>>>>>> >>>>>>>>>>>> The returned pattern takes into account the exec_size of the >>>>>>>>>>>> instruction, >>>>>>>>>>>> the type bitsize, the stride and if the register is source or >>>>>>>>>>>> destination. >>>>>>>>>>>> >>>>>>>>>>>> The objective of the functions if to help to know the read/written >>>>>>>>>>>> bytes >>>>>>>>>>>> of the instructions to improve the liveness analysis for partial >>>>>>>>>>>> read/writes. >>>>>>>>>>>> >>>>>>>>>>>> We manage special cases for >>>>>>>>>>>> SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL >>>>>>>>>>>> and SHADER_OPCODE_BYTE_SCATTERED_WRITE because depending of the >>>>>>>>>>>> bitsize >>>>>>>>>>>> parameter they have a different read pattern. >>>>>>>>>>>> --- >>>>>>>>>>>> src/intel/compiler/brw_fs.cpp | 183 >>>>>>>>>>>> +++++++++++++++++++++++++++++++++ >>>>>>>>>>>> src/intel/compiler/brw_ir_fs.h | 1 + >>>>>>>>>>>> 2 files changed, 184 insertions(+) >>>>>>>>>>>> >>>>>>>>>>>> diff --git a/src/intel/compiler/brw_fs.cpp >>>>>>>>>>>> b/src/intel/compiler/brw_fs.cpp >>>>>>>>>>>> index 2b8363ca362..f3045c4ff6c 100644 >>>>>>>>>>>> --- a/src/intel/compiler/brw_fs.cpp >>>>>>>>>>>> +++ b/src/intel/compiler/brw_fs.cpp >>>>>>>>>>>> @@ -687,6 +687,189 @@ fs_inst::is_partial_write() const >>>>>>>>>>>> this->dst.offset % REG_SIZE != 0); >>>>>>>>>>>> } >>>>>>>>>>>> >>>>>>>>>>>> +/** >>>>>>>>>>>> + * Returns a 32-bit uint whose bits represent if the associated >>>>>>>>>>>> register byte >>>>>>>>>>>> + * has been read/written by the instruction. The returned pattern >>>>>>>>>>>> takes into >>>>>>>>>>>> + * account the exec_size of the instruction, the type bitsize and >>>>>>>>>>>> the register >>>>>>>>>>>> + * stride and the register is source or destination for the >>>>>>>>>>>> instruction. >>>>>>>>>>>> + * >>>>>>>>>>>> + * The objective of this function is to identify which parts of >>>>>>>>>>>> the register >>>>>>>>>>>> + * are read or written for operations that don't read/write a >>>>>>>>>>>> full register. >>>>>>>>>>>> + * So we can identify in live range variable analysis if a >>>>>>>>>>>> partial write has >>>>>>>>>>>> + * completelly defined the part of the register used by a partial >>>>>>>>>>>> read. So we >>>>>>>>>>>> + * avoid extending the liveness range because all data read was >>>>>>>>>>>> already >>>>>>>>>>>> + * defined although the wasn't completely written. >>>>>>>>>>>> + */ >>>>>>>>>>>> +unsigned >>>>>>>>>>>> +fs_inst::register_byte_use_pattern(const fs_reg &r, boolean >>>>>>>>>>>> is_dst) const >>>>>>>>>>>> +{ >>>>>>>>>>>> + if (is_dst) { >>>>>>>>>> >>>>>>>>>>> Please split into two functions (like fs_inst::src_read and >>>>>>>>>>> ::src_written) since that would make the call-sites of this method >>>>>>>>>>> more >>>>>>>>>>> self-documenting than a boolean parameter. You should be able to >>>>>>>>>>> share >>>>>>>>>>> code by refactoring the common logic into a separate function (see >>>>>>>>>>> below >>>>>>>>>>> for some suggestions on how that could be achieved). >>>>>>>>>> >>>>>>>>>> Sure, it would improve readability and simplifies the logic, I've >>>>>>>>>> chosen >>>>>>>>>> dst_write_pattern and src_read_pattern. >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> + /* We don't know what is written so we return the worts >>>>>>>>>>>> case */ >>>>>>>>>>> >>>>>>>>>>> "worst" >>>>>>>>>> >>>>>>>>>> Fixed. >>>>>>>>>> >>>>>>>>>>>> + if (this->predicate && this->opcode != BRW_OPCODE_SEL) >>>>>>>>>>>> + return 0; >>>>>>>>>>>> + /* We assume that send destinations are completely written >>>>>>>>>>>> */ >>>>>>>>>>>> + if (this->is_send_from_grf()) >>>>>>>>>>>> + return ~0u; >>>>>>>>>>> >>>>>>>>>>> Some send-like instructions won't be caught by this condition, you >>>>>>>>>>> should check for this->mlen != 0 in addition. >>>>>>>>>> >>>>>>>>>> Would it be enough to check for (this->mlen > 0) and forget about >>>>>>>>>> is_send_from_grf? I am using this approach in v2 I am sending. >>>>>>>>>> >>>>>>>>> >>>>>>>>> I don't think the mlen > 0 condition would catch all cases either... >>>>>>>>> E.g. FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD IIRC. You probably need >>>>>>>>> both >>>>>>>>> conditions. Sucks... >>>>>>>> >>>>>>>> That is true, so now we have the: >>>>>>>> (this->is_send_from_grf() || this->mlen != 0) >>>>>>>> >>>>>>>>>>>> + } else { >>>>>>>>>>>> + /* byte_scattered_write_logical pattern of src[1] is 32-bit >>>>>>>>>>>> aligned >>>>>>>>>>>> + * so the read pattern depends on the bitsize stored at >>>>>>>>>>>> src[4] >>>>>>>>>>>> + */ >>>>>>>>>>>> + if (this->opcode == >>>>>>>>>>>> SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL && >>>>>>>>>>>> + this->src[1].nr == r.nr) { >>>>>>>>>> >>>>>>>>>>> I feel uncomfortable about attempting to guess the source the >>>>>>>>>>> caller is >>>>>>>>>>> referring to by comparing the registers for equality. E.g. you >>>>>>>>>>> could >>>>>>>>>>> potentially end up with two sources that compare equal but have >>>>>>>>>>> different semantics (e.g. as a result of CSE) which might cause it >>>>>>>>>>> to >>>>>>>>>>> get the wrong answer. It would probably be better to pass a source >>>>>>>>>>> index and a byte offset as argument instead of an fs_reg. >>>>>>>>>> >>>>>>>>>> I've didn't thought about CSE, I'm now receiving the number of source >>>>>>>>>> and the reg_offset. I'm using reg_offset instead of byte offsets as >>>>>>>>>> it >>>>>>>>>> simplifies the logic. Now we are using always the base src register >>>>>>>>>> to >>>>>>>>>> do all the calculation >>>>>>>>>>>> + switch (this->src[4].ud) { >>>>>>>>>>>> + case 32: >>>>>>>>>>>> + return ~0u; >>>>>>>>>>>> + case 16: >>>>>>>>>>>> + return 0x33333333; >>>>>>>>>>>> + case 8: >>>>>>>>>>>> + return 0x11111111; >>>>>>>>>>>> + default: >>>>>>>>>>>> + unreachable("Unsupported bitsize at >>>>>>>>>>>> byte_scattered_write_logical"); >>>>>>>>>>>> + } >>>>>>>>>>> >>>>>>>>>>> Replace the above switch statement with a call to "periodic_mask(8, >>>>>>>>>>> 4, >>>>>>>>>>> this->src[4].ud / 8)" (see below for the definition). >>>>>>>>>> >>>>>>>>>> Ok. >>>>>>>>>> >>>>>>>>>>>> + } >>>>>>>>>>>> + /* As for byte_scattered_write_logical but we need to take >>>>>>>>>>>> into account >>>>>>>>>>>> + * that data written are in the payload offset 32 with >>>>>>>>>>>> SIMD8 and offset >>>>>>>>>>>> + * 64 with SIMD16. >>>>>>>>>>>> + */ >>>>>>>>>>>> + if (this->opcode == SHADER_OPCODE_BYTE_SCATTERED_WRITE && >>>>>>>>>>>> + this->src[0].nr == r.nr) { >>>>>>>>>>>> + fs_reg payload = this->src[0]; >>>>>>>>>>>> + payload.offset = REG_SIZE * this->exec_size / 8; >>>>>>>>>>> >>>>>>>>>>> byte_offset() is your friend. >>>>>>>>>> >>>>>>>>>> I've removed the overlap logic, and I'm just checking if we are in >>>>>>>>>> the >>>>>>>>>> reg_offset 1 on SIMD8 or reg_offset 2-3 in SIMD16. >>>>>>>>>> >>>>>>>>>>>> + if (regions_overlap(r, REG_SIZE, >>>>>>>>>>>> + payload, REG_SIZE * this->exec_size >>>>>>>>>>>> / 8)) { >>>>>>>>>>>> + switch (this->src[2].ud) { >>>>>>>>>>>> + case 32: >>>>>>>>>>>> + return ~0u; >>>>>>>>>>>> + case 16: >>>>>>>>>>>> + return 0x33333333; >>>>>>>>>>>> + case 8: >>>>>>>>>>>> + return 0x11111111; >>>>>>>>>>>> + default: >>>>>>>>>>>> + unreachable("Unsupported bitsize at >>>>>>>>>>>> byte_scattered_write"); >>>>>>>>>>>> + } >>>>>>>>>>> >>>>>>>>>>> Replace the above switch statement with a call to "periodic_mask(8, >>>>>>>>>>> 4, >>>>>>>>>>> this->src[2].ud / 8)". >>>>>>>>>> >>>>>>>>>> Ok. >>>>>>>>>> >>>>>>>>>>>> + } else { >>>>>>>>>>>> + return ~0u; >>>>>>>>>>>> + } >>>>>>>>>>>> + } >>>>>>>>>>>> + } >>>>>>>>>>>> + >>>>>>>>>>>> + /* We define the most conservative value in order to calculate >>>>>>>>>>>> liveness >>>>>>>>>>>> + * range. If it is a destination nothing is defined and if is >>>>>>>>>>>> a source >>>>>>>>>>>> + * all the bytes of the register could be read. So for release >>>>>>>>>>>> builds >>>>>>>>>>>> + * the unreachables would have always safe return value. */ >>>>>>>>>>>> + unsigned pattern = is_dst ? 0 : ~0u; >>>>>>>>>>>> + >>>>>>>>>>>> + /* In the general case we calculate the pattern for a specific >>>>>>>>>>>> register >>>>>>>>>>>> + * on base of the type_size and stride. We calculate the SIMD8 >>>>>>>>>>>> pattern >>>>>>>>>>>> + * and then we adjust the patter if needed for different >>>>>>>>>>>> exec_sizes >>>>>>>>>>>> + * and offset >>>>>>>>>>>> + */ >>>>>>>>>>>> + switch (type_sz(r.type)){ >>>>>>>>>>>> + case 1: >>>>>>>>>>>> + switch (r.stride) { >>>>>>>>>>>> + case 0: >>>>>>>>>>>> + pattern = 0X1; >>>>>>>>>>>> + break; >>>>>>>>>>>> + case 1: >>>>>>>>>>>> + pattern = 0xff; >>>>>>>>>>>> + break; >>>>>>>>>>>> + case 2: >>>>>>>>>>>> + pattern = 0x5555; >>>>>>>>>>>> + break; >>>>>>>>>>>> + case 4: >>>>>>>>>>>> + pattern = 0x11111111; >>>>>>>>>>>> + break; >>>>>>>>>>>> + case 8: >>>>>>>>>>>> + pattern = 0x01010101; >>>>>>>>>>>> + break; >>>>>>>>>>>> + default: >>>>>>>>>>>> + unreachable("Unknown pattern unsupported 8-bit stride"); >>>>>>>>>>>> + } >>>>>>>>>>>> + break; >>>>>>>>>>>> + case 2: >>>>>>>>>>>> + switch (r.stride) { >>>>>>>>>>>> + case 0: >>>>>>>>>>>> + pattern = 0X3; >>>>>>>>>>>> + break; >>>>>>>>>>>> + case 1: >>>>>>>>>>>> + pattern = 0xffff; >>>>>>>>>>>> + break; >>>>>>>>>>>> + case 2: >>>>>>>>>>>> + pattern = 0x33333333; >>>>>>>>>>>> + break; >>>>>>>>>>>> + case 4: >>>>>>>>>>>> + pattern = 0x03030303; >>>>>>>>>>>> + break; >>>>>>>>>>>> + case 8: >>>>>>>>>>>> + pattern = 0x00030003; >>>>>>>>>>>> + break; >>>>>>>>>>>> + default: >>>>>>>>>>>> + unreachable("Unknown pattern unsupported 16-bit stride"); >>>>>>>>>>>> + } >>>>>>>>>>>> + break; >>>>>>>>>>>> + case 4: >>>>>>>>>>>> + switch (r.stride) { >>>>>>>>>>>> + case 0: >>>>>>>>>>>> + pattern = 0Xf; >>>>>>>>>>>> + break; >>>>>>>>>>>> + case 1: >>>>>>>>>>>> + pattern = ~0u; >>>>>>>>>>>> + break; >>>>>>>>>>>> + case 2: >>>>>>>>>>>> + pattern = 0x0f0f0f0f; >>>>>>>>>>>> + break; >>>>>>>>>>>> + case 4: >>>>>>>>>>>> + pattern = 0x000f000f; >>>>>>>>>>>> + break; >>>>>>>>>>>> + default: >>>>>>>>>>>> + unreachable("Unknown pattern unsupported 32-bit stride"); >>>>>>>>>>>> + } >>>>>>>>>>>> + break; >>>>>>>>>>>> + case 8: >>>>>>>>>>>> + switch (r.stride) { >>>>>>>>>>>> + case 0: >>>>>>>>>>>> + pattern = 0Xff; >>>>>>>>>>>> + break; >>>>>>>>>>>> + case 1: >>>>>>>>>>>> + pattern = ~0u; >>>>>>>>>>>> + break; >>>>>>>>>>>> + case 2: >>>>>>>>>>>> + pattern = 0x00ff00ff; >>>>>>>>>>>> + break; >>>>>>>>>>>> + case 4: >>>>>>>>>>>> + pattern = 0xff; >>>>>>>>>>>> + break; >>>>>>>>>>>> + default: >>>>>>>>>>>> + unreachable("Unknown pattern unsupported 64-bit stride"); >>>>>>>>>>>> + } >>>>>>>>>>>> + break; >>>>>>>>>>>> + default: >>>>>>>>>>>> + unreachable("Unknown pattern for unsupported bitsize "); >>>>>>>>>>>> + } >>>>>>>>>>>> + >>>>>>>>>>>> + if (this->exec_size > 8 && r.stride * type_sz(r.type) * 8 < >>>>>>>>>>>> REG_SIZE) { >>>>>>>>>>>> + /* For exec_size greater than SIMD8 we repeat the pattern >>>>>>>>>>>> until it >>>>>>>>>>>> + * represents a full register already represent a full >>>>>>>>>>>> register */ >>>>>>>>>>>> + pattern = pattern | (pattern << (8 * r.stride * >>>>>>>>>>>> type_sz(r.type))); >>>>>>>>>>>> + if (this->exec_size > 16 && r.stride * type_sz(r.type) * 16 >>>>>>>>>>>> < REG_SIZE) >>>>>>>>>>>> + pattern = pattern | (pattern << (16 * r.stride * >>>>>>>>>>>> type_sz(r.type))); >>>>>>>>>>>> + } else if (this->exec_size < 8 && >>>>>>>>>>>> + r.stride * type_sz(r.type) * this->exec_size < >>>>>>>>>>>> REG_SIZE) { >>>>>>>>>>>> + /* For exec_size smaller than SIMD8 we reduce the pattern >>>>>>>>>>>> if its size >>>>>>>>>>>> + * is smaller than a full register. */ >>>>>>>>>>>> + pattern = pattern >> (MIN2(REG_SIZE, 8 * type_sz(r.type) * >>>>>>>>>>>> r.stride) - >>>>>>>>>>>> + this->exec_size * type_sz(r.type) * >>>>>>>>>>>> r.stride); >>>>>>>>>>>> + } >>>>>>>>>>>> + >>>>>>>>>> >>>>>>>>>>> This seems really mad, no clue whether it's correct. Why not >>>>>>>>>>> replace >>>>>>>>>>> the above ~110 lines with a call to the following (fully >>>>>>>>>>> untested) 5-LOC function: >>>>>>>>>> >>>>>>>>>> Your suggestion seems to work perfectly fine, my original approach >>>>>>>>>> was >>>>>>>>>> trying to avoid the loop of creating the read/write pattern but after >>>>>>>>>> testing my v2 I wasn't able to notice any performance difference >>>>>>>>>> running >>>>>>>>>> shader-db and having the same results. I was originally trying that >>>>>>>>>> SIMD8 patterns were already constants. Sorry for the added >>>>>>>>>> complexity. >>>>>>>>>> >>>>>>>>> >>>>>>>>> The loop runs for a logarithmic number of iterations though, so it has >>>>>>>>> the exact same run-time complexity as your original patch, roughly the >>>>>>>>> same amount of branches at run-time (but possibly less indirect >>>>>>>>> branches!), and it should be compiled into a substantially lower >>>>>>>>> number >>>>>>>>> of instructions, which may actually cause it to perform better due to >>>>>>>>> more favourable caching. It's hard to tell though which one will >>>>>>>>> perform better in practice without benchmarking them, and as you >>>>>>>>> probably realized this is so far from being a bottleneck that whatever >>>>>>>>> the difference was is likely lost in the noise. So it really doesn't >>>>>>>>> matter which one performs better... >>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> | uint32_t >>>>>>>>>>> | periodic_mask(unsigned count, unsigned step, unsigned bits) >>>>>>>>>>> | { >>>>>>>>>>> | uint32_t m = (count ? (1 << bits) - 1 : 0); >>>>>>>>>>> | const unsigined max = MIN2(count * step, sizeof(m) * >>>>>>>>>>> CHAR_BITS); >>>>>>>>>>> | >>>>>>>>>>> | for (unsigned shift = step; shift < max; shift *= 2) >>>>>>>>>>> | m |= m << shift; >>>>>>>>>>> | >>>>>>>>>>> | return m; >>>>>>>>>>> | } >>>>>>>>>> >>>>>>>>>> I've used your function just changing the sizeof(m) * CHAR_BITS for >>>>>>>>>> the >>>>>>>>>> REG_SIZE, to not include the limits.h. >>>>>>>>> >>>>>>>>> My intention was to make the function as agnostic to IR details as >>>>>>>>> possible: the only reason there is a limit of 32 bits is because >>>>>>>>> that's >>>>>>>>> the size of the type used to hold the return value. Using sizeof >>>>>>>>> makes >>>>>>>>> sure that e.g. extending the code to 64 bits is as simple as changing >>>>>>>>> the datatype to uint64_t. >>>>>>>>> >>>>>>>>>> And I've also included an offset parameter that allows us to shift >>>>>>>>>> the >>>>>>>>>> bits of the pattern when we have an offset inside the register. >>>>>>>>>> >>>>>>>>> >>>>>>>>> That sounds fine. >>>>>>>>> >>>>>>>>>>>> + /* We adjust the pattern to the byte_offset of the register */ >>>>>>>>>>>> + pattern = pattern << (r.offset % REG_SIZE); >>>>>>>>>>>> + >>>>>>>>>>> >>>>>>>>>>> This doesn't really work except for r equal to the first GRF read >>>>>>>>>>> by the >>>>>>>>>>> source. Regions with non-zero sub-GRF offset that straddle multiple >>>>>>>>>>> GRFs are not really very common at the IR level, but they're >>>>>>>>>>> allowed by >>>>>>>>>>> the hardware with some restrictions, so it would make sense to at >>>>>>>>>>> least >>>>>>>>>>> handle them conservatively in order to keep things from breaking >>>>>>>>>>> silently. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> I included an assert in the periodic_mask function to detect if the >>>>>>>>>> combination of offset and mask goes over the REG_SIZE, so we would >>>>>>>>>> detect an straddle at this level of the IR. >>>>>>>>>> >>>>>>>>>> assert(offset + max - (step - bits) <= REG_SIZE); >>>>>>>> >>>>>>>>> Uhm... What's the definition of "offset" here? You seem to be >>>>>>>>> passing >>>>>>>>> the offset relative to the start of the VGRF modulo REG_SIZE but that >>>>>>>>> doesn't really make sense to me whenever "reg_offset" is non-zero. >>>>>>>> >>>>>>>> Yes I agree on removing this assert in the periodic_mask as it a >>>>>>>> generic >>>>>>>> function. I was to focus on seen the integer as the register :) >>>>>>>> >>>>>>>> The motivation was because of the following cases that happen with >>>>>>>> 8/16-bit more usually: >>>>>>>> >>>>>>>> (1) mov(16) g12<2>HF g1<16,8,2>HF { align1 1H }; >>>>>>>> (2) mov(16) g12.1<2>HF g2<16,8,2>HF { align1 1H }; >>>>>>>> >>>>>>>> In the dst_write_pattern, at point of calling the general periodic_mask >>>>>>>> case we know that instruction is not expected to be a send message >>>>>>>> destination. So the pattern is the same for reg_offset = 0 and >>>>>>>> reg_offset = 1. >>>>>>>> >>>>>>> >>>>>>> In this specific example, yes, but in general the pattern will not be >>>>>>> the same for multiple registers for arbitrary fs_reg::offset values. >>>>>> >>>>>> I've been running some tests. I've found that the three opcodes reach >>>>>> this situation in shader-db (MOV, INDIRECT_MOV and OR). >>>>>> >>>>>> If we just manage the MOV opcode that we agree that the same pattern is >>>>>> repeated for the same source/dst used registers have the same spilling >>>>>> reduction that with my original patch. >>>>>> >>>>>>>> 0x3333333 for case (1) and 0xccccccc for case (2). But without >>>>>>>> repeating >>>>>>>> the pattern offset we would get 0xcccccccc for reg_offset=0 but >>>>>>>> 0x333333333 for reg_offset=1 witch would be incorrect. >>>>>>>> >>>>>>>>> I think you want to pass "src[i].offset % REG_SIZE - reg_offset * >>>>>>>>> REG_SIZE" as a signed integer in order to get the offset of the first >>>>>>>>> byte actually written by the instruction relative to the first byte of >>>>>>>>> the GRF window of the pattern. You don't really need to assert-fail >>>>>>>>> when the offset is greater or equal to 32 (which shouldn't actually >>>>>>>>> happen in practice), "return 0" gives you the correct behavior. For >>>>>>>>> negative offsets (which means the pattern starts after the first byte >>>>>>>>> written by the instruction) you can just return ~0u conservatively >>>>>>>>> whenever the current logic wouldn't work [assuming you don't feel like >>>>>>>>> implementing the code to handle that case accurately ;)]. >>>>>>>> >>>>>>>> Maybe you feel more comfortable with the following approach for >>>>>>>> src_read_pattern.: >>>>>>>> >>>>>>>> For read sources we also assume that SENDs sources are completely >>>>>>>> read so return ~0u except the byte_scattered_write source exceptions. >>>>>>>> >>>>>>>> if (this->is_send_from_grf() || this->mlen != 0) >>>>>>>> return ~0u; >>>>>>>> >>>>>>>> So after this I think that we can assume that the following condition >>>>>>>> should be always correct where source operand must reside within two >>>>>>>> adjacent 256-bit registers so the pattern would be periodic for >>>>>>>> all reg_offsets and we can use the (this->src[i].offset % REG_SIZE). >>>>>>>> >>>>>>> >>>>>>> That's not a particularly futureproof assumption. I'm okay with >>>>>>> handling non-reg_offset-periodic cases inaccurately for the moment for >>>>>>> the sake of simplicity, but there is no reason to have the code blow up >>>>>>> in such cases if you could just return ~0u. >>>>>> >>>>>> I've added some modifications in the v3, we return ~0u for reads and 0u >>>>>> for writes by default, but i added the case of !this->is_partial_write() >>>>>> that will return ~0u for writes, that won't we reached with current use, >>>>>> but to be coherent with any future uses of this function. >>>>>> >>>>>>>> assert(reg_offset < DIV_ROUND_UP(this->src[i].stride ? >>>>>>>> this->src[i].stride * type_sz(this->src[i].type) * this->exec_size >>>>>>>> : >>>>>>>> type_sz(this->src[i].type), REG_SIZE)); >>>>>>>> >>>>>> >>>>>>> I don't think it's useful to assert-fail on this condition, since it >>>>>>> doesn't handle all cases where the code below will be broken for a >>>>>>> certain valid IR, and the condition it does protect against can be >>>>>>> handled trivially in periodic_mask(), by returning zero when the 32-byte >>>>>>> pattern window falls outside the region read or written by the >>>>>>> instruction -- But currently you can't even know whether that's the >>>>>>> case, because periodic_mask() is unaware where the 32-byte window starts >>>>>>> relative to the source region, which is *all* it cares about, instead >>>>>>> you're passing an offset to it that is equal to that in some special >>>>>>> cases -- In all other cases the result of the function will be bogus. >>>>>> >>>>>> I understand your concerns, so in my V3, I am using only the >>>>>> periodic_mask function when reg_offset is 0, and for the MOV special >>>>>> case that I understand we agree that the internal offset for >>>>>> reg_offset=1 is generally reg.offset % REG_SIZE. With that two cases >>>>>> covered we solve the register_pressure on current supported 64/16/8 bit >>>>>> cases. >>>>>> >>>> >>>>> I don't understand. The MOV opcode is not substantially different from >>>>> any other ALU opcode with regards to regioning. The assumptions this is >>>>> based on are invalid for general Align1 regions, whether the opcode is a >>>>> MOV or not. >>>> >>>> Let's try step by step to find my misunderstanding :-) >>>> >>>> My original understanding was that for a general ALU operations to >>>> define witch parts of the VGRF are written/read we need to use 4 >>>> different variables in the scalar backend for src/dst registers. >>>> >>>> - The type size. >>>> - The horizontal stride. >>>> - The execution size. >>>> - The initial byte offset of the first register of src[i]/dst >>>> >>>> If stride==1 && type_size * execution_size>=REG_SIZE && byte_offset== 0 >>>> we have a full/write of one or two registers. In that case the solution >>>> is easy everything is written ~0u. >>>> >>> >>> Not necessarily, if "reg_offset * REG_SIZE >= byte_offset + type_size * >>> MAX2(1, stride * exec_size)" the correct answer is 0. >> >> Ok, I got what you mean. I was incorrectly assuming as precondition that >> reg_offset was always in range because that is what happens on >> setup_def_used at brw_fs_live_variables.cpp. >> >> I think we can use directly regs_written and regs_read that do the same >> calculation. >> >> if (reg_offset >= regs_written(this)) >> return 0; >> -------------------------------------- >> if (reg_offset >= regs_read(this, i)) >> return 0; >> > > Yes, that should work. > >> >>>> If we have a partial write/read: >>>> >>>> I understood that you my initial patter proposal would only be ok for >>>> the first GRF of src[i]/dst (reg_offset == 0) >>>> >>>> periodic_mask(this->exec_size, /* count */ >>>> this->src[i].stride * type_sz(this->src[i].type), /*step */ >>>> type_sz(this->src[i].type), /* bits */ >>>> this->src[i].offset % REG_SIZE); /* offset */ >>>> >>>> In the case we manage only reg_offset == 0 we get a huge improvement >>>> reducing all problems many of the register_pressure we have now on all >>>> SIMD8 shaders with 8/16bits test cases. >>>> >>>> I understood that you didn't agree that for cases where src/destination >>>> use more than 1 GRF (reg_offset == 1) we can not guarantee that we can >>>> apply the same internal offset (this->src[i].offset % REG_SIZE) as the >>>> base register to calculate a patter. So It would be better to return ~0u >>>> on reads or 0u in writes. >>>> >> >>> Yes, but you could easily determine whether the mask is going to be >>> invariant with respect to reg_offset (where reg_offset is within bounds) >>> and in that case return the periodic_mask() expression above, otherwise >>> return 0/~0u depending on whether reg_offset is within bounds. >> >> Ok, so we are within bounds, we don't have a predicated write, we are >> not a send message. Then we have an ALU opcode and we return the >> periodic_mask. >> > > Those are all necessary but not sufficient conditions for the > periodic_mask() expression above to give you the correct answer for any > in-bounds reg_offset > 0, you should check that byte_offset < type_size > * stride in addition.
That's true. Fixed in v5. If we don't satisfy the condition then we return 0 on writes and ~0u on reads. Chema > >>>> That is the point were I couldn't found a counter-case with the search I >>>> did to identify the different register region that appear in shaderdb >>>> generated MOV instructions and 16/8 bit CTS tests, that used two >>>> registers with a region pattern that implied a different byte_pattern, >>>> for the first and the second register. But it should exist :) >>> >>> I don't see how special-casing MOV could help you in any way except by >>> luck. If your assumptions were correct you would be able to apply them >>> to all ALU instructions with Align1 regions. >> >> It worked because all the current uses are in range, but future uses of >> this function would have return incorrect values as you pointed. >> >> I've already send v4. Let's see if we are converging :) >> >> Chema >> >>>> If my bad understanding is before this point, I should back to re-review >>>> the Regioning PRM section. Or maybe I messing the IR level and the >>>> emission level that is another option with different semantics :-? >>>> >>>> Thank you for your time. >>>> >>>> Chema >>>> >>>>>>>> So we could call in this scenario. >>>>>>>> >>>>>>>> return periodic_mask(this->exec_size, >>>>>>>> this->src[i].stride * type_sz(this->src[i].type), >>>>>>>> type_sz(this->src[i].type), >>>>>>>> this->src[i].offset % REG_SIZE); >>>>>>>> >>>>>>>> The only issue I could think about that could generate issues would a >>>>>>>> case that I didn't found in our current code where a source is defined >>>>>>>> like r5.0<1;8;2>:w explained at PRM KBL vol07 Page 790 "A 16-element >>>>>>>> register region with interleaved rows (r5.0<1;8,2>:w)". >>>>>>>> >>>>>>> >>>>>>> You couldn't really handle that even if you wanted to, because there is >>>>>>> no way to represent such a thing at the IR level, since the horizontal >>>>>>> and vertical strides cannot be controlled independently, only the stride >>>>>>> member is meaningful for VGRFs. >>>>>> >>>>>> Good, that simplifies things. >>>>>> >>>>>> Thanks again. >>>>>> >>>>>> Chema >>>>>> >>>>>>>> What do you think? >>>>>>>> >>>>>>>> Chema >>>>>>>> >>>>>>>> >>>>>>>>>> Thanks for the review, I think that v2 has better shape. >>>>>>>>>> >>>>>>>>> >>>>>>>>> No problem. >>>>>>>>> >>>>>>>>>> Chema >>>>>>>>>> >>>>>>>>>>>> + assert(pattern); >>>>>>>>>>>> + >>>>>>>>>>>> + return pattern; >>>>>>>>>>>> +} >>>>>>>>>>>> + >>>>>>>>>>>> + >>>>>>>>>>>> unsigned >>>>>>>>>>>> fs_inst::components_read(unsigned i) const >>>>>>>>>>>> { >>>>>>>>>>>> diff --git a/src/intel/compiler/brw_ir_fs.h >>>>>>>>>>>> b/src/intel/compiler/brw_ir_fs.h >>>>>>>>>>>> index 92dad269a34..5ea6294b8ad 100644 >>>>>>>>>>>> --- a/src/intel/compiler/brw_ir_fs.h >>>>>>>>>>>> +++ b/src/intel/compiler/brw_ir_fs.h >>>>>>>>>>>> @@ -350,6 +350,7 @@ public: >>>>>>>>>>>> bool equals(fs_inst *inst) const; >>>>>>>>>>>> bool is_send_from_grf() const; >>>>>>>>>>>> bool is_partial_write() const; >>>>>>>>>>>> + unsigned register_byte_use_pattern(const fs_reg &r, boolean >>>>>>>>>>>> is_dst) const; >>>>>>>>>>>> bool is_copy_payload(const brw::simple_allocator &grf_alloc) >>>>>>>>>>>> const; >>>>>>>>>>>> unsigned components_read(unsigned i) const; >>>>>>>>>>>> unsigned size_read(int arg) const; >>>>>>>>>>>> -- >>>>>>>>>>>> 2.17.1 >>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> mesa-dev mailing list >>>>>>>>>>>> mesa-dev@lists.freedesktop.org >>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> mesa-dev mailing list >>>>>>>>>>>> mesa-dev@lists.freedesktop.org >>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> mesa-dev mailing list >>>>>>>>>>>> mesa-dev@lists.freedesktop.org >>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> mesa-dev mailing list >>>>>>>>>>>> mesa-dev@lists.freedesktop.org >>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> mesa-dev mailing list >>>>>>>>>>>> mesa-dev@lists.freedesktop.org >>>>>>>>>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev