El 20/07/18 a las 00:34, Francisco Jerez escribió: > Chema Casanova <jmcasan...@igalia.com> writes: > >> El 14/07/18 a las 00:14, Francisco Jerez escribió: >>> Jose Maria Casanova Crespo <jmcasan...@igalia.com> writes: >>> >>>> For a register source/destination of an instruction the function returns >>>> the read/write byte pattern of a 32-byte registers as a unsigned int. >>>> >>>> The returned pattern takes into account the exec_size of the instruction, >>>> the type bitsize, the stride and if the register is source or destination. >>>> >>>> The objective of the functions if to help to know the read/written bytes >>>> of the instructions to improve the liveness analysis for partial >>>> read/writes. >>>> >>>> We manage special cases for SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL >>>> and SHADER_OPCODE_BYTE_SCATTERED_WRITE because depending of the bitsize >>>> parameter they have a different read pattern. >>>> --- >>>> src/intel/compiler/brw_fs.cpp | 183 +++++++++++++++++++++++++++++++++ >>>> src/intel/compiler/brw_ir_fs.h | 1 + >>>> 2 files changed, 184 insertions(+) >>>> >>>> diff --git a/src/intel/compiler/brw_fs.cpp b/src/intel/compiler/brw_fs.cpp >>>> index 2b8363ca362..f3045c4ff6c 100644 >>>> --- a/src/intel/compiler/brw_fs.cpp >>>> +++ b/src/intel/compiler/brw_fs.cpp >>>> @@ -687,6 +687,189 @@ fs_inst::is_partial_write() const >>>> this->dst.offset % REG_SIZE != 0); >>>> } >>>> >>>> +/** >>>> + * Returns a 32-bit uint whose bits represent if the associated register >>>> byte >>>> + * has been read/written by the instruction. The returned pattern takes >>>> into >>>> + * account the exec_size of the instruction, the type bitsize and the >>>> register >>>> + * stride and the register is source or destination for the instruction. >>>> + * >>>> + * The objective of this function is to identify which parts of the >>>> register >>>> + * are read or written for operations that don't read/write a full >>>> register. >>>> + * So we can identify in live range variable analysis if a partial write >>>> has >>>> + * completelly defined the part of the register used by a partial read. >>>> So we >>>> + * avoid extending the liveness range because all data read was already >>>> + * defined although the wasn't completely written. >>>> + */ >>>> +unsigned >>>> +fs_inst::register_byte_use_pattern(const fs_reg &r, boolean is_dst) const >>>> +{ >>>> + if (is_dst) { >> >>> Please split into two functions (like fs_inst::src_read and >>> ::src_written) since that would make the call-sites of this method more >>> self-documenting than a boolean parameter. You should be able to share >>> code by refactoring the common logic into a separate function (see below >>> for some suggestions on how that could be achieved). >> >> Sure, it would improve readability and simplifies the logic, I've chosen >> dst_write_pattern and src_read_pattern. >> >>> >>>> + /* We don't know what is written so we return the worts case */ >>> >>> "worst" >> >> Fixed. >> >>>> + if (this->predicate && this->opcode != BRW_OPCODE_SEL) >>>> + return 0; >>>> + /* We assume that send destinations are completely written */ >>>> + if (this->is_send_from_grf()) >>>> + return ~0u; >>> >>> Some send-like instructions won't be caught by this condition, you >>> should check for this->mlen != 0 in addition. >> >> Would it be enough to check for (this->mlen > 0) and forget about >> is_send_from_grf? I am using this approach in v2 I am sending. >> > > I don't think the mlen > 0 condition would catch all cases either... > E.g. FS_OPCODE_UNIFORM_PULL_CONSTANT_LOAD IIRC. You probably need both > conditions. Sucks...
That is true, so now we have the: (this->is_send_from_grf() || this->mlen != 0) >>>> + } else { >>>> + /* byte_scattered_write_logical pattern of src[1] is 32-bit aligned >>>> + * so the read pattern depends on the bitsize stored at src[4] >>>> + */ >>>> + if (this->opcode == SHADER_OPCODE_BYTE_SCATTERED_WRITE_LOGICAL && >>>> + this->src[1].nr == r.nr) { >> >>> I feel uncomfortable about attempting to guess the source the caller is >>> referring to by comparing the registers for equality. E.g. you could >>> potentially end up with two sources that compare equal but have >>> different semantics (e.g. as a result of CSE) which might cause it to >>> get the wrong answer. It would probably be better to pass a source >>> index and a byte offset as argument instead of an fs_reg. >> >> I've didn't thought about CSE, I'm now receiving the number of source >> and the reg_offset. I'm using reg_offset instead of byte offsets as it >> simplifies the logic. Now we are using always the base src register to >> do all the calculation >>>> + switch (this->src[4].ud) { >>>> + case 32: >>>> + return ~0u; >>>> + case 16: >>>> + return 0x33333333; >>>> + case 8: >>>> + return 0x11111111; >>>> + default: >>>> + unreachable("Unsupported bitsize at >>>> byte_scattered_write_logical"); >>>> + } >>> >>> Replace the above switch statement with a call to "periodic_mask(8, 4, >>> this->src[4].ud / 8)" (see below for the definition). >> >> Ok. >> >>>> + } >>>> + /* As for byte_scattered_write_logical but we need to take into >>>> account >>>> + * that data written are in the payload offset 32 with SIMD8 and >>>> offset >>>> + * 64 with SIMD16. >>>> + */ >>>> + if (this->opcode == SHADER_OPCODE_BYTE_SCATTERED_WRITE && >>>> + this->src[0].nr == r.nr) { >>>> + fs_reg payload = this->src[0]; >>>> + payload.offset = REG_SIZE * this->exec_size / 8; >>> >>> byte_offset() is your friend. >> >> I've removed the overlap logic, and I'm just checking if we are in the >> reg_offset 1 on SIMD8 or reg_offset 2-3 in SIMD16. >> >>>> + if (regions_overlap(r, REG_SIZE, >>>> + payload, REG_SIZE * this->exec_size / 8)) { >>>> + switch (this->src[2].ud) { >>>> + case 32: >>>> + return ~0u; >>>> + case 16: >>>> + return 0x33333333; >>>> + case 8: >>>> + return 0x11111111; >>>> + default: >>>> + unreachable("Unsupported bitsize at byte_scattered_write"); >>>> + } >>> >>> Replace the above switch statement with a call to "periodic_mask(8, 4, >>> this->src[2].ud / 8)". >> >> Ok. >> >>>> + } else { >>>> + return ~0u; >>>> + } >>>> + } >>>> + } >>>> + >>>> + /* We define the most conservative value in order to calculate liveness >>>> + * range. If it is a destination nothing is defined and if is a source >>>> + * all the bytes of the register could be read. So for release builds >>>> + * the unreachables would have always safe return value. */ >>>> + unsigned pattern = is_dst ? 0 : ~0u; >>>> + >>>> + /* In the general case we calculate the pattern for a specific register >>>> + * on base of the type_size and stride. We calculate the SIMD8 pattern >>>> + * and then we adjust the patter if needed for different exec_sizes >>>> + * and offset >>>> + */ >>>> + switch (type_sz(r.type)){ >>>> + case 1: >>>> + switch (r.stride) { >>>> + case 0: >>>> + pattern = 0X1; >>>> + break; >>>> + case 1: >>>> + pattern = 0xff; >>>> + break; >>>> + case 2: >>>> + pattern = 0x5555; >>>> + break; >>>> + case 4: >>>> + pattern = 0x11111111; >>>> + break; >>>> + case 8: >>>> + pattern = 0x01010101; >>>> + break; >>>> + default: >>>> + unreachable("Unknown pattern unsupported 8-bit stride"); >>>> + } >>>> + break; >>>> + case 2: >>>> + switch (r.stride) { >>>> + case 0: >>>> + pattern = 0X3; >>>> + break; >>>> + case 1: >>>> + pattern = 0xffff; >>>> + break; >>>> + case 2: >>>> + pattern = 0x33333333; >>>> + break; >>>> + case 4: >>>> + pattern = 0x03030303; >>>> + break; >>>> + case 8: >>>> + pattern = 0x00030003; >>>> + break; >>>> + default: >>>> + unreachable("Unknown pattern unsupported 16-bit stride"); >>>> + } >>>> + break; >>>> + case 4: >>>> + switch (r.stride) { >>>> + case 0: >>>> + pattern = 0Xf; >>>> + break; >>>> + case 1: >>>> + pattern = ~0u; >>>> + break; >>>> + case 2: >>>> + pattern = 0x0f0f0f0f; >>>> + break; >>>> + case 4: >>>> + pattern = 0x000f000f; >>>> + break; >>>> + default: >>>> + unreachable("Unknown pattern unsupported 32-bit stride"); >>>> + } >>>> + break; >>>> + case 8: >>>> + switch (r.stride) { >>>> + case 0: >>>> + pattern = 0Xff; >>>> + break; >>>> + case 1: >>>> + pattern = ~0u; >>>> + break; >>>> + case 2: >>>> + pattern = 0x00ff00ff; >>>> + break; >>>> + case 4: >>>> + pattern = 0xff; >>>> + break; >>>> + default: >>>> + unreachable("Unknown pattern unsupported 64-bit stride"); >>>> + } >>>> + break; >>>> + default: >>>> + unreachable("Unknown pattern for unsupported bitsize "); >>>> + } >>>> + >>>> + if (this->exec_size > 8 && r.stride * type_sz(r.type) * 8 < REG_SIZE) >>>> { >>>> + /* For exec_size greater than SIMD8 we repeat the pattern until it >>>> + * represents a full register already represent a full register */ >>>> + pattern = pattern | (pattern << (8 * r.stride * type_sz(r.type))); >>>> + if (this->exec_size > 16 && r.stride * type_sz(r.type) * 16 < >>>> REG_SIZE) >>>> + pattern = pattern | (pattern << (16 * r.stride * >>>> type_sz(r.type))); >>>> + } else if (this->exec_size < 8 && >>>> + r.stride * type_sz(r.type) * this->exec_size < REG_SIZE) { >>>> + /* For exec_size smaller than SIMD8 we reduce the pattern if its >>>> size >>>> + * is smaller than a full register. */ >>>> + pattern = pattern >> (MIN2(REG_SIZE, 8 * type_sz(r.type) * >>>> r.stride) - >>>> + this->exec_size * type_sz(r.type) * r.stride); >>>> + } >>>> + >> >>> This seems really mad, no clue whether it's correct. Why not replace >>> the above ~110 lines with a call to the following (fully >>> untested) 5-LOC function: >> >> Your suggestion seems to work perfectly fine, my original approach was >> trying to avoid the loop of creating the read/write pattern but after >> testing my v2 I wasn't able to notice any performance difference running >> shader-db and having the same results. I was originally trying that >> SIMD8 patterns were already constants. Sorry for the added complexity. >> > > The loop runs for a logarithmic number of iterations though, so it has > the exact same run-time complexity as your original patch, roughly the > same amount of branches at run-time (but possibly less indirect > branches!), and it should be compiled into a substantially lower number > of instructions, which may actually cause it to perform better due to > more favourable caching. It's hard to tell though which one will > perform better in practice without benchmarking them, and as you > probably realized this is so far from being a bottleneck that whatever > the difference was is likely lost in the noise. So it really doesn't > matter which one performs better... > >>> >>> | uint32_t >>> | periodic_mask(unsigned count, unsigned step, unsigned bits) >>> | { >>> | uint32_t m = (count ? (1 << bits) - 1 : 0); >>> | const unsigined max = MIN2(count * step, sizeof(m) * CHAR_BITS); >>> | >>> | for (unsigned shift = step; shift < max; shift *= 2) >>> | m |= m << shift; >>> | >>> | return m; >>> | } >> >> I've used your function just changing the sizeof(m) * CHAR_BITS for the >> REG_SIZE, to not include the limits.h. > > My intention was to make the function as agnostic to IR details as > possible: the only reason there is a limit of 32 bits is because that's > the size of the type used to hold the return value. Using sizeof makes > sure that e.g. extending the code to 64 bits is as simple as changing > the datatype to uint64_t. > >> And I've also included an offset parameter that allows us to shift the >> bits of the pattern when we have an offset inside the register. >> > > That sounds fine. > >>>> + /* We adjust the pattern to the byte_offset of the register */ >>>> + pattern = pattern << (r.offset % REG_SIZE); >>>> + >>> >>> This doesn't really work except for r equal to the first GRF read by the >>> source. Regions with non-zero sub-GRF offset that straddle multiple >>> GRFs are not really very common at the IR level, but they're allowed by >>> the hardware with some restrictions, so it would make sense to at least >>> handle them conservatively in order to keep things from breaking silently. >> >> >> I included an assert in the periodic_mask function to detect if the >> combination of offset and mask goes over the REG_SIZE, so we would >> detect an straddle at this level of the IR. >> >> assert(offset + max - (step - bits) <= REG_SIZE); > Uhm... What's the definition of "offset" here? You seem to be passing > the offset relative to the start of the VGRF modulo REG_SIZE but that > doesn't really make sense to me whenever "reg_offset" is non-zero. Yes I agree on removing this assert in the periodic_mask as it a generic function. I was to focus on seen the integer as the register :) The motivation was because of the following cases that happen with 8/16-bit more usually: (1) mov(16) g12<2>HF g1<16,8,2>HF { align1 1H }; (2) mov(16) g12.1<2>HF g2<16,8,2>HF { align1 1H }; In the dst_write_pattern, at point of calling the general periodic_mask case we know that instruction is not expected to be a send message destination. So the pattern is the same for reg_offset = 0 and reg_offset = 1. 0x3333333 for case (1) and 0xccccccc for case (2). But without repeating the pattern offset we would get 0xcccccccc for reg_offset=0 but 0x333333333 for reg_offset=1 witch would be incorrect. > I think you want to pass "src[i].offset % REG_SIZE - reg_offset * > REG_SIZE" as a signed integer in order to get the offset of the first > byte actually written by the instruction relative to the first byte of > the GRF window of the pattern. You don't really need to assert-fail > when the offset is greater or equal to 32 (which shouldn't actually > happen in practice), "return 0" gives you the correct behavior. For > negative offsets (which means the pattern starts after the first byte > written by the instruction) you can just return ~0u conservatively > whenever the current logic wouldn't work [assuming you don't feel like > implementing the code to handle that case accurately ;)]. Maybe you feel more comfortable with the following approach for src_read_pattern.: For read sources we also assume that SENDs sources are completely read so return ~0u except the byte_scattered_write source exceptions. if (this->is_send_from_grf() || this->mlen != 0) return ~0u; So after this I think that we can assume that the following condition should be always correct where source operand must reside within two adjacent 256-bit registers so the pattern would be periodic for all reg_offsets and we can use the (this->src[i].offset % REG_SIZE). assert(reg_offset < DIV_ROUND_UP(this->src[i].stride ? this->src[i].stride * type_sz(this->src[i].type) * this->exec_size : type_sz(this->src[i].type), REG_SIZE)); So we could call in this scenario. return periodic_mask(this->exec_size, this->src[i].stride * type_sz(this->src[i].type), type_sz(this->src[i].type), this->src[i].offset % REG_SIZE); The only issue I could think about that could generate issues would a case that I didn't found in our current code where a source is defined like r5.0<1;8;2>:w explained at PRM KBL vol07 Page 790 "A 16-element register region with interleaved rows (r5.0<1;8,2>:w)". What do you think? Chema >> Thanks for the review, I think that v2 has better shape. >> > > No problem. > >> Chema >> >>>> + assert(pattern); >>>> + >>>> + return pattern; >>>> +} >>>> + >>>> + >>>> unsigned >>>> fs_inst::components_read(unsigned i) const >>>> { >>>> diff --git a/src/intel/compiler/brw_ir_fs.h >>>> b/src/intel/compiler/brw_ir_fs.h >>>> index 92dad269a34..5ea6294b8ad 100644 >>>> --- a/src/intel/compiler/brw_ir_fs.h >>>> +++ b/src/intel/compiler/brw_ir_fs.h >>>> @@ -350,6 +350,7 @@ public: >>>> bool equals(fs_inst *inst) const; >>>> bool is_send_from_grf() const; >>>> bool is_partial_write() const; >>>> + unsigned register_byte_use_pattern(const fs_reg &r, boolean is_dst) >>>> const; >>>> bool is_copy_payload(const brw::simple_allocator &grf_alloc) const; >>>> unsigned components_read(unsigned i) const; >>>> unsigned size_read(int arg) const; >>>> -- >>>> 2.17.1 >>>> >>>> _______________________________________________ >>>> mesa-dev mailing list >>>> mesa-dev@lists.freedesktop.org >>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev >>>> >>>> >>>> _______________________________________________ >>>> mesa-dev mailing list >>>> mesa-dev@lists.freedesktop.org >>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev