On 21/02/17 21:07, Francisco Jerez wrote: > Samuel Iglesias Gonsálvez <sigles...@igalia.com> writes: > >> On 20/02/17 21:31, Francisco Jerez wrote: >>> Samuel Iglesias Gonsálvez <sigles...@igalia.com> writes: >>> >>>> On Mon, 2017-02-20 at 08:58 +0100, Samuel Iglesias Gonsálvez wrote: >>>>> On Sat, 2017-02-18 at 18:58 -0800, Francisco Jerez wrote: >>>>>> Samuel Iglesias Gonsálvez <sigles...@igalia.com> writes: >>>>>> >>>>>>> The lowered BSW/BXT indirect move instructions had incorrect >>>>>>> source types, which luckily wasn't causing incorrect assembly to >>>>>>> be >>>>>>> generated due to the bug fixed in the next patch, but would have >>>>>>> confused the remaining back-end IR infrastructure due to the >>>>>>> mismatch >>>>>>> between the IR source types and the emitted machine code. >>>>>>> >>>>>>> v2: >>>>>>> - Improve commit log (Curro) >>>>>>> - Fix read_size (Curro) >>>>>>> - Fix DF uniform array detection in assign_constant_locations() >>>>>>> when >>>>>>> it is acceded with 32-bit MOV_INDIRECTs in BSW/BXT. >>>>>>> >>>>>>> Signed-off-by: Samuel Iglesias Gonsálvez <sigles...@igalia.com> >>>>>>> Cc: "17.0" <mesa-sta...@lists.freedesktop.org> >>>>>>> --- >>>>>>> src/mesa/drivers/dri/i965/brw_fs.cpp | 11 ++++++++- >>>>>>> src/mesa/drivers/dri/i965/brw_fs_nir.cpp | 41 ++++++++++++++++-- >>>>>>> -------------- >>>>>>> 2 files changed, 30 insertions(+), 22 deletions(-) >>>>>>> >>>>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp >>>>>>> b/src/mesa/drivers/dri/i965/brw_fs.cpp >>>>>>> index c348bc7138d..93ab84b5845 100644 >>>>>>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp >>>>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp >>>>>>> @@ -1944,10 +1944,19 @@ fs_visitor::assign_constant_locations() >>>>>>> unsigned last = constant_nr + (inst->src[2].ud / 4) >>>>>>> - >>>>>>> 1; >>>>>>> assert(last < uniforms); >>>>>>> >>>>>>> + bool supports_64bit_indirects = >>>>>>> + !devinfo->is_cherryview && !devinfo->is_broxton; >>>>>>> + /* Detect if this is as a result 64-bit MOV >>>>>>> INDIRECT. >>>>>>> In case >>>>>>> + * of BSW/BXT, we substitute each DF MOV INDIRECT by >>>>>>> two 32-bit MOV >>>>>>> + * INDIRECT. >>>>>>> + */ >>>>>>> + bool mov_indirect_64bit = (type_sz(inst- >>>>>>>> src[i].type) >>>>>>> == 8) || >>>>>>> + (!supports_64bit_indirects && inst->dst.type == >>>>>>> BRW_REGISTER_TYPE_UD && >>>>>>> + inst->src[0].type == BRW_REGISTER_TYPE_UD && >>>>>>> inst- >>>>>>>> dst.stride == 2); >>>>>> >>>>>> This seems kind of fragile, I don't think the optimizer gives you >>>>>> any >>>>>> guarantees that the stride of a lowered 64-bit indirect move will >>>>>> remain >>>>>> equal to two, or that the destination stride of an actual 32-bit >>>>>> indirect uniform load will never end up being two as well. That >>>>>> said, >>>>>> because you access these with 32-bit indirect moves, I don't see >>>>>> why >>>>>> you'd need to treat them as 64-bit uniforms here, the usual >>>>>> alignment >>>>>> requirements for 64-bit uniforms no longer apply, so you can treat >>>>>> them >>>>>> as regular 32-bit uniforms AFAICT. Why did you add this hunk? >>>>>> >>>>> >>>>> I added it because of this case: if we access to one DF uniform array >>>>> element with a normal MOV and the rest with MOV INDIRECT, we will >>>>> mark >>>>> the former as a live 64bit variable. Then we have the array scattered >>>>> as part of it is uploaded as a 64-bits uniform and the other as 32- >>>>> bits. Even if we avoid this by uploading everything together as 32- >>>>> bits, then the access to that DF could not be aligned to 64-bits. >>>>> >>>>> So my idea was to find a way to identify somehow those MOV INDIRECT >>>>> in >>>>> BSW to mark all the array as a 64-bit one. >>>>> >>>> >>>> Mmm, maybe I can fix this case without the hack I did. I can add the >>>> following code after marking all live variables accessed by the >>>> instructions. >>>> >>>> It is very similar to the one to detect live variables but it is fixing >>>> the case where any MOV INDIRECT in BSW is accessing to an uniform array >>>> of DF elements where one of these elements is directly accessed by >>>> another instruction. >>>> >>>> What do you think? >>>> >>> >>> Looks somewhat better, but I don't think this is correct if you have >>> multiple overlapping indirect loads of the same uniform array and only >>> one of them overlaps with a direct 64-bit load. >> >> In that case, I think this is correct. The 2 32-bit MOV INDIRECTs where >> emitted as a "lowering" of an unsupported 64-bit MOV INDIRECT. They both >> keep the 'read_size' of the original one, so they both overlap to any >> other direct 64-bit load to that array like with the original >> instruction. If none of them overlap to the direct 64-bit access, then I >> think they can be handled as non-contiguous to the latter without any issue. >> > > What if you have two lowered indirect loads with overlapping but not > identical range, and only the second one contains elements accessed with > regular 64-bit moves? Wouldn't your change mark part of the array > 64-bit aligned and the rest 32-bit aligned? >
Right. What we can do is similar to the loop of optimizations: put this code in a loop and if one of the iterations marked some area as 64-bit one, we repeat the process until no more progress is done. Then we are sure we mark all overlapping areas with the same block size. >>> Apparently >>> assign_constant_locations() already has code to attempt to push >>> indirectly accessed uniform sections contiguously, but the logic seems >>> broken in multiple ways, even on platforms other than CHV/BXT... The >>> first is_live_64bit location assignment loop doesn't consider non-64bit >>> uniforms even if they're marked as contiguous with a 64-bit uniform, >>> because the contiguous block tracking is done from within >>> set_push_pull_constant_loc() which may not be called at all if the >>> continue condition evaluates to true at the top of the loop. >> >> The idea for that block was to push all 64-bit uniforms first and >> together so they are aligned to 64-bit, there are no gaps between them >> and assuming, of course, they are not contiguous to non-64-bit data. >> >> They cannot be contiguous to non-64-bit data due to a declaration in a >> GLSL shader because it is not allowed to have two uniforms with the same >> location, > > They definitely can right now due to mixed 32-bit and 64-bit access to > uniform arrays on CHV/BXT and due to the last element of a 64-bit array > being marked incorrectly as 32-bit on AFAICT any platform. Also I don't > see anything that would prevent the same from happening due to the > compiler propagating uniform loads into 32-bit bitwise operations. > What I was talking about is that this is not due to GLSL shader declarations itself, but because of the way we are doing things inside the driver due to HW limitations. >> i.e. we cannot have two "views" of the same uniform variable >> (for view, I mean read the same uniform as 32-bits type in one, and as >> 64-bits type in other). See "4.4.3 Uniform Variable Layout Qualifiers" >> section in GLSL 4.50 spec. And if we have two consecutive arrays, we can >> safely upload them in different locations of the push constant buffer. >> >> So, this can only happen with instructions accessing the same area, with >> a different bit-size and generated by the driver itself. I am going to >> assume here than we are talking about a uniform variable defined with a >> type of X-bit size and we access to it as Y-bit size, be X > Y. If the >> vice-versa is possible (I am not aware of any case though), then the >> solution would be more complex as we need to take into account alignments. >> >> Knowing that, we can do one of these solutions: >> >> 1) Fix case by case. This is what I have proposed in BSW. >> The idea is to see if any overlapped area is accessed with >> different bit-sizes and mark the whole area with the bigger bit-size >> and upload it in the proper place in the push constant buffer. >> With BSW, we know this is happening due to a lowered DF-MOV INDIRECT >> so we can safely mark the whole area as DF in case of mixed accesses. > > This sounds like a partial and rather fragile solution, the compiler > gives you no guarantees that the indirect or direct move instruction > loading from the uniform file is going to look the same as when you > inserted it into the probgram by the end of the optimization loop. > >> 2) Generalize 1) to all bit-sizes and generations. >> > I don't think we need to generalize things to other bit sizes for the > moment, but none of this problem is generation-dependent, so introducing > generation-dependent hacks only complicates the uniform assignment code > if you don't consider it crazy enough already... Really, the uniform > allocation code should only guarantee two things which are fully > generation-independent in principle and we currently fail at: > > - Assign uniforms accessed contiguously to consecutive locations in the > uniform buffer (whether it's a pull or push constant buffer). > > - Align the starting address of any contiguously accessed block (which > may be the union of the ranges of multiple instructions) to the bit > size of the largest access in the block. > Agreed. >>> Also >>> AFAICT this could potentially lead to a very large amount of data being >>> considered as contiguous if you have a contiguous 64-bit block, then >>> random unrelated 32-bit data, and then another contiguous 64-bit block, >>> because the very last element of a MOV_INDIRECT block doesn't seem to be >>> marked is_live_64bit, which is also the one that would terminate the >>> 64-bit contiguous block and actually cause it to be assigned a location. >> >> Right. Marking the last element as is_live_64_bit is a trivial fix, I >> will do it too. >> > > Thank you. > >>> I don't think the current code would even guarantee proper alignment for >>> subsequent 64-blocks in that case because it may be pushing unrelated >>> 32-bit data in between. >>> Worse, if the UNIFORM space ends with a 64-bit >>> contiguous uniform block it may be considered contiguous with whatever >>> unrelated data is handled first by the next 32-bit uniform assignment >>> pass. >>> >> >> I don't see how this can happen. 'last' is marking the end of the area >> and its position is not marked as contiguous except with overlapped >> accesses, where we know are done in purpose by the driver (e.g BSW's MOV >> INDIRECT). The whole contiguous area is marked with the same bit-size, >> after the aforementioned solutions. >> > It isn't necessarily the same bit-size right now (because of CHV/BXT, > because of the is_live_64bit[last] bug and potentially other things), > which can cause the location assignment loop to skip over some elements > which may include the terminator element of a contiguous block, so we > may continue growing the block including random non-64-bit-aligned data. > >> What do you think about looping on contiguous[], look for contiguous >> uniforms and assert if they all have the same bit-size, just before >> setting pull/push constant locations? Then we can easily know if we are >> wrong before uploading anything. >> > > I guess an assertion won't fix the problem when it actually happens. ;) > >>> Sigh... Seems like this was only scratching the surface... >>> >> >> :-( >> >> As I might be missing something, I am open to suggestions/opinions... >> > > I think the contiguous block tracking logic needs to be reworked and > pulled outside of set_push_pull_constant_loc() so the location > assignment loop can figure out what the correct alignment is for the > whole block. > Right. What do you think about a solution like: fix the is_live_64_bit bug with last element; take out the contiguous block tracking logic from set_push_pull_constant_loc(); have a loop that looks for contiguous areas, mark them with the largest block size accessed in each area, repeat the process until no more progress can be done. I am going to develop this solution and see if I am missing something else. Sam >> Sam >> >>>> Sam >>>> >>>> + /* Detect if we accessed to an array of DF uniforms in both direct >>>> and >>>> + * indirect ways on BSW/BXT. If this is the case, then we need >>>> + * to take into account in order to push all the elements together. >>>> + */ >>>> + if (devinfo->is_cherryview || devinfo->is_broxton) { >>>> + foreach_block_and_inst_safe(block, fs_inst, inst, cfg) { >>>> + for (int i = 0 ; i < inst->sources; i++) { >>>> + if (inst->src[i].file != UNIFORM) >>>> + continue; >>>> + >>>> + int constant_nr = inst->src[i].nr + inst->src[i].offset / >>>> 4; >>>> + >>>> + if (inst->opcode == SHADER_OPCODE_MOV_INDIRECT && i == 0) >>>> { >>>> + assert(inst->src[2].ud % 4 == 0); >>>> + unsigned last = constant_nr + (inst->src[2].ud / 4) - >>>> 1; >>>> + assert(last < uniforms); >>>> + >>>> + bool is_df_array = false; >>>> + for (unsigned j = constant_nr; j < last; j++) { >>>> + if (is_live_64bit[j]) { >>>> + is_df_array = true; >>>> + break; >>>> + } >>>> + } >>>> + >>>> + if (is_df_array) { >>>> + /* If part of it is accessed with direct addressing >>>> as a DF >>>> + * then we mark the whole array as DF. >>>> + */ >>>> + for (unsigned j = constant_nr; j < last; j++) >>>> + is_live_64bit[j] = true; >>>> + } >>>> + } >>>> + } >>>> + } >>>> + } >>>> + >>>> >>>>>> Other than that patch looks good to me. >>>>>> >>>>> >>>>> OK, thanks. >>>>> >>>>> Sam >>>>> >>>>>>> for (unsigned j = constant_nr; j < last; j++) { >>>>>>> is_live[j] = true; >>>>>>> contiguous[j] = true; >>>>>>> - if (type_sz(inst->src[i].type) == 8) { >>>>>>> + if (mov_indirect_64bit) { >>>>>>> is_live_64bit[j] = true; >>>>>>> } >>>>>>> } >>>>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_nir.cpp >>>>>>> b/src/mesa/drivers/dri/i965/brw_fs_nir.cpp >>>>>>> index 991c20fad62..4aaf84964e9 100644 >>>>>>> --- a/src/mesa/drivers/dri/i965/brw_fs_nir.cpp >>>>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs_nir.cpp >>>>>>> @@ -3878,31 +3878,30 @@ fs_visitor::nir_emit_intrinsic(const >>>>>>> fs_builder &bld, nir_intrinsic_instr *instr >>>>>>> unsigned read_size = instr->const_index[1] - >>>>>>> (instr->num_components - 1) * type_sz(dest.type); >>>>>>> >>>>>>> - fs_reg indirect_chv_high_32bit; >>>>>>> - bool is_chv_bxt_64bit = >>>>>>> - (devinfo->is_cherryview || devinfo->is_broxton) && >>>>>>> - type_sz(dest.type) == 8; >>>>>>> - if (is_chv_bxt_64bit) { >>>>>>> - indirect_chv_high_32bit = >>>>>>> vgrf(glsl_type::uint_type); >>>>>>> - /* Calculate indirect address to read high 32 bits >>>>>>> */ >>>>>>> - bld.ADD(indirect_chv_high_32bit, indirect, >>>>>>> brw_imm_ud(4)); >>>>>>> - } >>>>>>> + bool supports_64bit_indirects = >>>>>>> + !devinfo->is_cherryview && !devinfo->is_broxton; >>>>>>> >>>>>>> - for (unsigned j = 0; j < instr->num_components; j++) { >>>>>>> - if (!is_chv_bxt_64bit) { >>>>>>> + if (type_sz(dest.type) != 8 || >>>>>>> supports_64bit_indirects) >>>>>>> { >>>>>>> + for (unsigned j = 0; j < instr->num_components; j++) >>>>>>> { >>>>>>> bld.emit(SHADER_OPCODE_MOV_INDIRECT, >>>>>>> offset(dest, bld, j), offset(src, bld, >>>>>>> j), >>>>>>> indirect, brw_imm_ud(read_size)); >>>>>>> - } else { >>>>>>> - bld.emit(SHADER_OPCODE_MOV_INDIRECT, >>>>>>> - subscript(offset(dest, bld, j), >>>>>>> BRW_REGISTER_TYPE_UD, 0), >>>>>>> - offset(src, bld, j), >>>>>>> - indirect, brw_imm_ud(read_size)); >>>>>>> - >>>>>>> - bld.emit(SHADER_OPCODE_MOV_INDIRECT, >>>>>>> - subscript(offset(dest, bld, j), >>>>>>> BRW_REGISTER_TYPE_UD, 1), >>>>>>> - offset(src, bld, j), >>>>>>> - indirect_chv_high_32bit, >>>>>>> brw_imm_ud(read_size)); >>>>>>> + } >>>>>>> + } else { >>>>>>> + const unsigned num_mov_indirects = >>>>>>> + type_sz(dest.type) / >>>>>>> type_sz(BRW_REGISTER_TYPE_UD); >>>>>>> + /* We read a little bit less per MOV INDIRECT, as >>>>>>> they >>>>>>> are now >>>>>>> + * 32-bits ones instead of 64-bit. Fix read_size >>>>>>> then. >>>>>>> + */ >>>>>>> + const unsigned read_size_32bit = read_size - >>>>>>> + (num_mov_indirects - 1) * >>>>>>> type_sz(BRW_REGISTER_TYPE_UD); >>>>>>> + for (unsigned j = 0; j < instr->num_components; j++) >>>>>>> { >>>>>>> + for (unsigned i = 0; i < num_mov_indirects; i++) >>>>>>> { >>>>>>> + bld.emit(SHADER_OPCODE_MOV_INDIRECT, >>>>>>> + subscript(offset(dest, bld, j), >>>>>>> BRW_REGISTER_TYPE_UD, i), >>>>>>> + subscript(offset(src, bld, j), >>>>>>> BRW_REGISTER_TYPE_UD, i), >>>>>>> + indirect, >>>>>>> brw_imm_ud(read_size_32bit)); >>>>>>> + } >>>>>>> } >>>>>>> } >>>>>>> } >>>>>>> -- >>>>>>> 2.11.0 >>>>> >>>>> _______________________________________________ >>>>> mesa-dev mailing list >>>>> mesa-dev@lists.freedesktop.org >>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev >> >> _______________________________________________ >> mesa-stable mailing list >> mesa-sta...@lists.freedesktop.org >> https://lists.freedesktop.org/mailman/listinfo/mesa-stable
signature.asc
Description: OpenPGP digital signature
_______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev