On 20/07/18 at 22:10, Francisco Jerez wrote:
> Chema Casanova <jmcasan...@igalia.com> writes:
>
>> On 20/07/18 at 00:34, Francisco Jerez wrote:
>>> Chema Casanova <jmcasan...@igalia.com> writes:
>>>
>>>> On 14/07/18 at 00:14, Francisco Jerez wrote:
>>>>> Jose Maria Casanova Crespo <jmcasan...@igalia.com> writes:
>>>>>
>>>>>> For a register source/destination of an instruction the function returns
>>>>>> the read/write byte pattern of a 32-byte register as an unsigned int.
>>>>>>
>>>>>> The returned pattern takes into account the exec_size of the instruction,
>>>>>> the type bitsize, the stride and whether the register is source or
>>>>>> destination.
>>>>>>
>>>>>> The objective of the function is to help to know the read/written bytes
>>>>>> of the instructions to improve the liveness analysis for partial
>>>>>> read/writes.
>>>>>>
>>>>>> We manage special cases for SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL
>>>>>> and SHADER_OPCODE_BYTE_SCATTERED_WRITE because depending on the bitsize
>>>>>> parameter they have a different read pattern.
>>>>>> ---
>>>>>>  src/intel/compiler/brw_fs.cpp  | 183 +++++++++++++++++++++++++++++++++
>>>>>>  src/intel/compiler/brw_ir_fs.h |   1 +
>>>>>>  2 files changed, 184 insertions(+)
>>>>>>
>>>>>> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp
>>>>>> index 2b8363ca362..f3045c4ff6c 100644
>>>>>> --- a/src/intel/compiler/brw_fs.cpp
>>>>>> +++ b/src/intel/compiler/brw_fs.cpp
>>>>>> @@ -687,6 +687,189 @@ fs_inst::is_partial_write() const
>>>>>>             this->dst.offset % REG_SIZE != 0);
>>>>>>  }
>>>>>>
>>>>>> +/**
>>>>>> + * Returns a 32-bit uint whose bits represent if the associated register byte
>>>>>> + * has been read/written by the instruction. The returned pattern takes into
>>>>>> + * account the exec_size of the instruction, the type bitsize and the register
>>>>>> + * stride and whether the register is source or destination for the instruction.
>>>>>> + * >>>>>> + * The objective of this function is to identify which parts of the >>>>>> register >>>>>> + * are read or written for operations that don't read/write a full >>>>>> register. >>>>>> + * So we can identify in live range variable analysis if a partial >>>>>> write has >>>>>> + * completelly defined the part of the register used by a partial read. >>>>>> So we >>>>>> + * avoid extending the liveness range because all data read was already >>>>>> + * defined although the wasn't completely written. >>>>>> + */ >>>>>> +unsigned >>>>>> +fs_inst::register_byte_use_pattern(const fs_reg &r, boolean is_dst) >>>>>> const >>>>>> +{ >>>>>> + if (is_dst) { >>>> >>>>> Please split into two functions (like fs_inst::src_read and >>>>> ::src_written) since that would make the call-sites of this method more >>>>> self-documenting than a boolean parameter. You should be able to share >>>>> code by refactoring the common logic into a separate function (see below >>>>> for some suggestions on how that could be achieved). >>>> >>>> Sure, it would improve readability and simplifies the logic, I've chosen >>>> dst_write_pattern and src_read_pattern. >>>> >>>>> >>>>>> + /* We don't know what is written so we return the worts case */ >>>>> >>>>> "worst" >>>> >>>> Fixed. >>>> >>>>>> + if (this->predicate && this->opcode != BRW_OPCODE_SEL) >>>>>> + return 0; >>>>>> + /* We assume that send destinations are completely written */ >>>>>> + if (this->is_send_from_grf()) >>>>>> + return ~0u; >>>>> >>>>> Some send-like instructions won't be caught by this condition, you >>>>> should check for this->mlen != 0 in addition. >>>> >>>> Would it be enough to check for (this->mlen > 0) and forget about >>>> is_send_from_grf? I am using this approach in v2 I am sending. >>>> >>> >>> I don't think the mlen > 0 condition would catch all cases either... >>> E.g. FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD IIRC. You probably need both >>> conditions. Sucks... 
>> >> That is true, so now we have the: >> (this->is_send_from_grf() || this->mlen != 0) >> >>>>>> + } else { >>>>>> + /* byte_scattered_write_logical pattern of src[1] is 32-bit >>>>>> aligned >>>>>> + * so the read pattern depends on the bitsize stored at src[4] >>>>>> + */ >>>>>> + if (this->opcode == SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL && >>>>>> + this->src[1].nr == r.nr) { >>>> >>>>> I feel uncomfortable about attempting to guess the source the caller is >>>>> referring to by comparing the registers for equality. E.g. you could >>>>> potentially end up with two sources that compare equal but have >>>>> different semantics (e.g. as a result of CSE) which might cause it to >>>>> get the wrong answer. It would probably be better to pass a source >>>>> index and a byte offset as argument instead of an fs_reg. >>>> >>>> I've didn't thought about CSE, I'm now receiving the number of source >>>> and the reg_offset. I'm using reg_offset instead of byte offsets as it >>>> simplifies the logic. Now we are using always the base src register to >>>> do all the calculation >>>>>> + switch (this->src[4].ud) { >>>>>> + case 32: >>>>>> + return ~0u; >>>>>> + case 16: >>>>>> + return 0x33333333; >>>>>> + case 8: >>>>>> + return 0x11111111; >>>>>> + default: >>>>>> + unreachable("Unsupported bitsize at >>>>>> byte_scattered_write_logical"); >>>>>> + } >>>>> >>>>> Replace the above switch statement with a call to "periodic_mask(8, 4, >>>>> this->src[4].ud / 8)" (see below for the definition). >>>> >>>> Ok. >>>> >>>>>> + } >>>>>> + /* As for byte_scattered_write_logical but we need to take into >>>>>> account >>>>>> + * that data written are in the payload offset 32 with SIMD8 and >>>>>> offset >>>>>> + * 64 with SIMD16. >>>>>> + */ >>>>>> + if (this->opcode == SHADER_OPCODE_BYTE_SCATTERED_WRITE && >>>>>> + this->src[0].nr == r.nr) { >>>>>> + fs_reg payload = this->src[0]; >>>>>> + payload.offset = REG_SIZE * this->exec_size / 8; >>>>> >>>>> byte_offset() is your friend. 
>>>> >>>> I've removed the overlap logic, and I'm just checking if we are in the >>>> reg_offset 1 on SIMD8 or reg_offset 2-3 in SIMD16. >>>> >>>>>> + if (regions_overlap(r, REG_SIZE, >>>>>> + payload, REG_SIZE * this->exec_size / 8)) { >>>>>> + switch (this->src[2].ud) { >>>>>> + case 32: >>>>>> + return ~0u; >>>>>> + case 16: >>>>>> + return 0x33333333; >>>>>> + case 8: >>>>>> + return 0x11111111; >>>>>> + default: >>>>>> + unreachable("Unsupported bitsize at >>>>>> byte_scattered_write"); >>>>>> + } >>>>> >>>>> Replace the above switch statement with a call to "periodic_mask(8, 4, >>>>> this->src[2].ud / 8)". >>>> >>>> Ok. >>>> >>>>>> + } else { >>>>>> + return ~0u; >>>>>> + } >>>>>> + } >>>>>> + } >>>>>> + >>>>>> + /* We define the most conservative value in order to calculate >>>>>> liveness >>>>>> + * range. If it is a destination nothing is defined and if is a >>>>>> source >>>>>> + * all the bytes of the register could be read. So for release builds >>>>>> + * the unreachables would have always safe return value. */ >>>>>> + unsigned pattern = is_dst ? 0 : ~0u; >>>>>> + >>>>>> + /* In the general case we calculate the pattern for a specific >>>>>> register >>>>>> + * on base of the type_size and stride. 
We calculate the SIMD8 >>>>>> pattern >>>>>> + * and then we adjust the patter if needed for different exec_sizes >>>>>> + * and offset >>>>>> + */ >>>>>> + switch (type_sz(r.type)){ >>>>>> + case 1: >>>>>> + switch (r.stride) { >>>>>> + case 0: >>>>>> + pattern = 0X1; >>>>>> + break; >>>>>> + case 1: >>>>>> + pattern = 0xff; >>>>>> + break; >>>>>> + case 2: >>>>>> + pattern = 0x5555; >>>>>> + break; >>>>>> + case 4: >>>>>> + pattern = 0x11111111; >>>>>> + break; >>>>>> + case 8: >>>>>> + pattern = 0x01010101; >>>>>> + break; >>>>>> + default: >>>>>> + unreachable("Unknown pattern unsupported 8-bit stride"); >>>>>> + } >>>>>> + break; >>>>>> + case 2: >>>>>> + switch (r.stride) { >>>>>> + case 0: >>>>>> + pattern = 0X3; >>>>>> + break; >>>>>> + case 1: >>>>>> + pattern = 0xffff; >>>>>> + break; >>>>>> + case 2: >>>>>> + pattern = 0x33333333; >>>>>> + break; >>>>>> + case 4: >>>>>> + pattern = 0x03030303; >>>>>> + break; >>>>>> + case 8: >>>>>> + pattern = 0x00030003; >>>>>> + break; >>>>>> + default: >>>>>> + unreachable("Unknown pattern unsupported 16-bit stride"); >>>>>> + } >>>>>> + break; >>>>>> + case 4: >>>>>> + switch (r.stride) { >>>>>> + case 0: >>>>>> + pattern = 0Xf; >>>>>> + break; >>>>>> + case 1: >>>>>> + pattern = ~0u; >>>>>> + break; >>>>>> + case 2: >>>>>> + pattern = 0x0f0f0f0f; >>>>>> + break; >>>>>> + case 4: >>>>>> + pattern = 0x000f000f; >>>>>> + break; >>>>>> + default: >>>>>> + unreachable("Unknown pattern unsupported 32-bit stride"); >>>>>> + } >>>>>> + break; >>>>>> + case 8: >>>>>> + switch (r.stride) { >>>>>> + case 0: >>>>>> + pattern = 0Xff; >>>>>> + break; >>>>>> + case 1: >>>>>> + pattern = ~0u; >>>>>> + break; >>>>>> + case 2: >>>>>> + pattern = 0x00ff00ff; >>>>>> + break; >>>>>> + case 4: >>>>>> + pattern = 0xff; >>>>>> + break; >>>>>> + default: >>>>>> + unreachable("Unknown pattern unsupported 64-bit stride"); >>>>>> + } >>>>>> + break; >>>>>> + default: >>>>>> + unreachable("Unknown pattern for unsupported bitsize "); >>>>>> + } 
>>>>>> + >>>>>> + if (this->exec_size > 8 && r.stride * type_sz(r.type) * 8 < >>>>>> REG_SIZE) { >>>>>> + /* For exec_size greater than SIMD8 we repeat the pattern until it >>>>>> + * represents a full register already represent a full register */ >>>>>> + pattern = pattern | (pattern << (8 * r.stride * type_sz(r.type))); >>>>>> + if (this->exec_size > 16 && r.stride * type_sz(r.type) * 16 < >>>>>> REG_SIZE) >>>>>> + pattern = pattern | (pattern << (16 * r.stride * >>>>>> type_sz(r.type))); >>>>>> + } else if (this->exec_size < 8 && >>>>>> + r.stride * type_sz(r.type) * this->exec_size < REG_SIZE) { >>>>>> + /* For exec_size smaller than SIMD8 we reduce the pattern if its >>>>>> size >>>>>> + * is smaller than a full register. */ >>>>>> + pattern = pattern >> (MIN2(REG_SIZE, 8 * type_sz(r.type) * >>>>>> r.stride) - >>>>>> + this->exec_size * type_sz(r.type) * >>>>>> r.stride); >>>>>> + } >>>>>> + >>>> >>>>> This seems really mad, no clue whether it's correct. Why not replace >>>>> the above ~110 lines with a call to the following (fully >>>>> untested) 5-LOC function: >>>> >>>> Your suggestion seems to work perfectly fine, my original approach was >>>> trying to avoid the loop of creating the read/write pattern but after >>>> testing my v2 I wasn't able to notice any performance difference running >>>> shader-db and having the same results. I was originally trying that >>>> SIMD8 patterns were already constants. Sorry for the added complexity. >>>> >>> >>> The loop runs for a logarithmic number of iterations though, so it has >>> the exact same run-time complexity as your original patch, roughly the >>> same amount of branches at run-time (but possibly less indirect >>> branches!), and it should be compiled into a substantially lower number >>> of instructions, which may actually cause it to perform better due to >>> more favourable caching. 
>>> It's hard to tell though which one will perform better in practice
>>> without benchmarking them, and as you probably realized this is so far
>>> from being a bottleneck that whatever the difference was is likely
>>> lost in the noise. So it really doesn't matter which one performs
>>> better...
>>>
>>>>>
>>>>> | uint32_t
>>>>> | periodic_mask(unsigned count, unsigned step, unsigned bits)
>>>>> | {
>>>>> |    uint32_t m = (count ? (1 << bits) - 1 : 0);
>>>>> |    const unsigned max = MIN2(count * step, sizeof(m) * CHAR_BIT);
>>>>> |
>>>>> |    for (unsigned shift = step; shift < max; shift *= 2)
>>>>> |       m |= m << shift;
>>>>> |
>>>>> |    return m;
>>>>> | }
>>>>
>>>> I've used your function, just changing sizeof(m) * CHAR_BIT to
>>>> REG_SIZE, to avoid including limits.h.
>>>
>>> My intention was to make the function as agnostic to IR details as
>>> possible: the only reason there is a limit of 32 bits is because that's
>>> the size of the type used to hold the return value. Using sizeof makes
>>> sure that e.g. extending the code to 64 bits is as simple as changing
>>> the datatype to uint64_t.
>>>
>>>> And I've also included an offset parameter that allows us to shift the
>>>> bits of the pattern when we have an offset inside the register.
>>>
>>> That sounds fine.
>>>
>>>>>> +   /* We adjust the pattern to the byte_offset of the register */
>>>>>> +   pattern = pattern << (r.offset % REG_SIZE);
>>>>>> +
>>>>>
>>>>> This doesn't really work except for r equal to the first GRF read by
>>>>> the source. Regions with non-zero sub-GRF offset that straddle
>>>>> multiple GRFs are not really very common at the IR level, but they're
>>>>> allowed by the hardware with some restrictions, so it would make sense
>>>>> to at least handle them conservatively in order to keep things from
>>>>> breaking silently.
>>>>
>>>> I included an assert in the periodic_mask function to detect if the
>>>> combination of offset and mask goes over REG_SIZE, so we would detect
>>>> a straddle at this level of the IR:
>>>>
>>>>    assert(offset + max - (step - bits) <= REG_SIZE);
>>
>>> Uhm... What's the definition of "offset" here? You seem to be passing
>>> the offset relative to the start of the VGRF modulo REG_SIZE but that
>>> doesn't really make sense to me whenever "reg_offset" is non-zero.
>>
>> Yes, I agree on removing this assert from periodic_mask as it is a
>> generic function. I was too focused on seeing the integer as the
>> register :)
>>
>> The motivation was the following cases, which happen more usually with
>> 8/16-bit types:
>>
>> (1) mov(16) g12<2>HF   g1<16,8,2>HF  { align1 1H };
>> (2) mov(16) g12.1<2>HF g2<16,8,2>HF  { align1 1H };
>>
>> In dst_write_pattern, at the point of calling the general periodic_mask
>> case we know that the instruction is not expected to be a send message
>> destination. So the pattern is the same for reg_offset = 0 and
>> reg_offset = 1.
>>
> In this specific example, yes, but in general the pattern will not be
> the same for multiple registers for arbitrary fs_reg::offset values.
I've been running some tests. I found that three opcodes reach this
situation in shader-db (MOV, INDIRECT_MOV and OR). If we just handle the
MOV opcode, where we agree that the same pattern is repeated across the
registers used by the source/dst, we get the same spilling reduction as
with my original patch.

>> 0x33333333 for case (1) and 0xcccccccc for case (2). But without
>> repeating the pattern offset we would get 0xcccccccc for reg_offset=0
>> but 0x33333333 for reg_offset=1, which would be incorrect.
>>
>>> I think you want to pass "src[i].offset % REG_SIZE - reg_offset *
>>> REG_SIZE" as a signed integer in order to get the offset of the first
>>> byte actually written by the instruction relative to the first byte of
>>> the GRF window of the pattern. You don't really need to assert-fail
>>> when the offset is greater or equal to 32 (which shouldn't actually
>>> happen in practice), "return 0" gives you the correct behavior. For
>>> negative offsets (which means the pattern starts after the first byte
>>> written by the instruction) you can just return ~0u conservatively
>>> whenever the current logic wouldn't work [assuming you don't feel like
>>> implementing the code to handle that case accurately ;)].
>>
>> Maybe you feel more comfortable with the following approach for
>> src_read_pattern:
>>
>> For read sources we also assume that SEND sources are completely read,
>> so return ~0u except for the byte_scattered_write source exceptions:
>>
>>    if (this->is_send_from_grf() || this->mlen != 0)
>>       return ~0u;
>>
>> After this I think we can assume that the following condition should
>> always be correct: the source operand must reside within two adjacent
>> 256-bit registers, so the pattern would be periodic for all reg_offsets
>> and we can use (this->src[i].offset % REG_SIZE).
>>
> That's not a particularly futureproof assumption.
> I'm okay with handling non-reg_offset-periodic cases inaccurately for
> the moment for the sake of simplicity, but there is no reason to have
> the code blow up in such cases if you could just return ~0u.

I've added some modifications in v3: we return ~0u for reads and 0u for
writes by default, but I added the case of !this->is_partial_write(),
which returns ~0u for writes. That case won't be reached with the
current use, but it keeps the function coherent for any future uses.

>> assert(reg_offset < DIV_ROUND_UP(this->src[i].stride ?
>>        this->src[i].stride * type_sz(this->src[i].type) * this->exec_size :
>>        type_sz(this->src[i].type), REG_SIZE));
>>
> I don't think it's useful to assert-fail on this condition, since it
> doesn't handle all cases where the code below will be broken for a
> certain valid IR, and the condition it does protect against can be
> handled trivially in periodic_mask(), by returning zero when the 32-byte
> pattern window falls outside the region read or written by the
> instruction -- But currently you can't even know whether that's the
> case, because periodic_mask() is unaware where the 32-byte window starts
> relative to the source region, which is *all* it cares about, instead
> you're passing an offset to it that is equal to that in some special
> cases -- In all other cases the result of the function will be bogus.

I understand your concerns, so in v3 I only use the periodic_mask
function when reg_offset is 0, plus the MOV special case, where I
understand we agree that the internal offset for reg_offset=1 is
generally reg.offset % REG_SIZE. With those two cases covered we solve
the register pressure on the currently supported 64/16/8-bit cases.

>> So we could call in this scenario:
>>
>>    return periodic_mask(this->exec_size,
>>                         this->src[i].stride * type_sz(this->src[i].type),
>>                         type_sz(this->src[i].type),
>>                         this->src[i].offset % REG_SIZE);
>>
>> The only case I could think of that could generate issues, which I
>> didn't find in our current code, is a source defined like
>> r5.0<1;8,2>:w, explained in PRM KBL vol07 page 790: "A 16-element
>> register region with interleaved rows (r5.0<1;8,2>:w)".
>>
> You couldn't really handle that even if you wanted to, because there is
> no way to represent such a thing at the IR level, since the horizontal
> and vertical strides cannot be controlled independently, only the stride
> member is meaningful for VGRFs.

Good, that simplifies things. Thanks again.

Chema

>> What do you think?
>>
>> Chema
>>
>>>> Thanks for the review, I think that v2 is in better shape.
>>>
>>> No problem.
>>>
>>>> Chema
>>>>
>>>>>> +   assert(pattern);
>>>>>> +
>>>>>> +   return pattern;
>>>>>> +}
>>>>>> +
>>>>>> +
>>>>>>  unsigned
>>>>>>  fs_inst::components_read(unsigned i) const
>>>>>>  {
>>>>>> diff --git a/src/intel/compiler/brw_ir_fs.h b/src/intel/compiler/brw_ir_fs.h
>>>>>> index 92dad269a34..5ea6294b8ad 100644
>>>>>> --- a/src/intel/compiler/brw_ir_fs.h
>>>>>> +++ b/src/intel/compiler/brw_ir_fs.h
>>>>>> @@ -350,6 +350,7 @@ public:
>>>>>>     bool equals(fs_inst *inst) const;
>>>>>>     bool is_send_from_grf() const;
>>>>>>     bool is_partial_write() const;
>>>>>> +   unsigned register_byte_use_pattern(const fs_reg &r, boolean is_dst) const;
>>>>>>     bool is_copy_payload(const brw::simple_allocator &grf_alloc) const;
>>>>>>     unsigned components_read(unsigned i) const;
>>>>>>     unsigned size_read(int arg) const;
>>>>>> --
>>>>>> 2.17.1
>>>>>>
>>>>>> _______________________________________________
>>>>>> mesa-dev mailing list
>>>>>> mesa-dev@lists.freedesktop.org
>>>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev