On Wed, Jul 22, 2015 at 10:05 AM, Francisco Jerez <curroje...@riseup.net> wrote: > Jason Ekstrand <ja...@jlekstrand.net> writes: > >> On Wed, Jul 22, 2015 at 12:31 AM, Francisco Jerez <curroje...@riseup.net> >> wrote: >>> Jason Ekstrand <ja...@jlekstrand.net> writes: >>> >>>> A few comments below. Mostly just asking for explanation. >>>> >>>> 1-3 are >>>> >>>> Reviewed-by: Jason Ekstrand <jason.ekstr...@intel.com> >>>> >>>> Obviously, don't merge 4/4 until it actually has users. >>>> --Jason >>> >>> Thanks. >>> >>>> >>>> On Thu, Jul 16, 2015 at 8:35 AM, Francisco Jerez <curroje...@riseup.net> >>>> wrote: >>>>> This lowering pass implements an algorithm to expand SIMDN >>>>> instructions into a sequence of SIMDM instructions in cases where the >>>>> hardware doesn't support the original execution size natively for some >>>>> particular instruction. The most important use-cases are: >>>>> >>>>> - Lowering send message instructions that don't support SIMD16 >>>>> natively into SIMD8 (several texturing, framebuffer write and typed >>>>> surface operations). >>>>> >>>>> - Lowering messages that don't support SIMD8 natively into SIMD16 >>>>> (*cough*gen4*cough*). >>>>> >>>>> - 64-bit precision operations (e.g. FP64 and 64-bit integer >>>>> multiplication). >>>>> >>>>> - SIMD32. >>>>> >>>>> The algorithm works by splitting the sources of the original >>>>> instruction into chunks of width appropriate for the lowered >>>>> instructions, and then interleaving the results component-wise into >>>>> the destination of the original instruction. The pass is controlled >>>>> by the get_lowered_simd_width() function that currently just returns >>>>> the original execution size making the whole pass a no-op for the >>>>> moment until some user is introduced. >>>>> --- >>>>> src/mesa/drivers/dri/i965/brw_fs.cpp | 142 >>>>> +++++++++++++++++++++++++++++++++++ >>>>> src/mesa/drivers/dri/i965/brw_fs.h | 1 + >>>>> 2 files changed, 143 insertions(+) >>>>> >>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp >>>>> b/src/mesa/drivers/dri/i965/brw_fs.cpp >>>>> index d031352..eeb6938 100644 >>>>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp >>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp >>>>> @@ -3204,6 +3204,147 @@ fs_visitor::lower_logical_sends() >>>>> return progress; >>>>> } >>>>> >>>>> +/** >>>>> + * Get the closest native SIMD width supported by the hardware for >>>>> instruction >>>>> + * \p inst. The instruction will be left untouched by >>>>> + * fs_visitor::lower_simd_width() if the returned value is equal to the >>>>> + * original execution size. >>>>> + */ >>>>> +static unsigned >>>>> +get_lowered_simd_width(const struct brw_device_info *devinfo, >>>>> + const fs_inst *inst) >>>>> +{ >>>>> + switch (inst->opcode) { >>>>> + default: >>>>> + return inst->exec_size; >>>>> + } >>>>> +} >>>>> + >>>>> +/** >>>>> + * The \p rows array of registers represents a \p num_rows by \p >>>>> num_columns >>>>> + * matrix in row-major order, write it in column-major order into the >>>>> register >>>>> + * passed as destination. \p stride gives the separation between matrix >>>>> + * elements in the input in fs_builder::dispatch_width() units. >>>>> + */ >>>>> +static void >>>>> +emit_transpose(const fs_builder &bld, >>>>> + const fs_reg &dst, const fs_reg *rows, >>>>> + unsigned num_rows, unsigned num_columns, unsigned stride) >>>> >>>> I'm not sure what I think about calling this "emit_transpose". I >>>> guess it is kind of a transpose, but maybe it's more of a "gather". >>>> I'm not going to quibble about it though. >>> >>> *Shrug*, it doesn't only gather the vectors of the rows array (that >>> would have been emit_collect :P), it copies them out in vertical, just >>> like a matrix transpose -- assuming you're not horrified by the idea of >>> considering the argument a matrix. >> >> That's fine. Feel free to ignore that suggestion. >> >>>> >>>>> +{ >>>>> + fs_reg *const components = new fs_reg[num_rows * num_columns]; >>>>> + >>>>> + for (unsigned i = 0; i < num_columns; ++i) { >>>>> + for (unsigned j = 0; j < num_rows; ++j) >>>>> + components[num_rows * i + j] = offset(rows[j], bld, stride * i); >>>>> + } >>>>> + >>>>> + bld.LOAD_PAYLOAD(dst, components, num_rows * num_columns, 0); >>>>> + >>>>> + delete[] components; >>>>> +} >>>>> + >>>>> +bool >>>>> +fs_visitor::lower_simd_width() >>>>> +{ >>>>> + bool progress = false; >>>>> + >>>>> + foreach_block_and_inst_safe(block, fs_inst, inst, cfg) { >>>>> + const unsigned lower_width = get_lowered_simd_width(devinfo, inst); >>>>> + >>>>> + if (lower_width != inst->exec_size) { >>>>> + /* Builder matching the original instruction. */ >>>>> + const fs_builder ibld = bld.at(block, inst) >>>>> + .exec_all(inst->force_writemask_all) >>>>> + .group(inst->exec_size, >>>>> inst->force_sechalf); >>>>> + >>>>> + /* Split the copies in chunks of the execution width of either >>>>> the >>>>> + * original or the lowered instruction, whichever is lower. >>>>> + */ >>>>> + const unsigned copy_width = MIN2(lower_width, inst->exec_size); >>>>> + const unsigned n = inst->exec_size / copy_width; >>>>> + const unsigned dst_size = inst->regs_written * REG_SIZE / >>>>> + inst->dst.component_size(inst->exec_size); >>>>> + fs_reg dsts[4]; >>>>> + >>>>> + assert(n > 0 && n <= ARRAY_SIZE(dsts) && >>>>> + !inst->writes_accumulator && !inst->mlen); >>>>> + >>>>> + for (unsigned i = 0; i < n; i++) { >>>>> + /* Emit a copy of the original instruction with the lowered >>>>> width. >>>>> + * If the EOT flag was set throw it away except for the last >>>>> + * instruction to avoid killing the thread prematurely. >>>>> + */ >>>>> + fs_inst tmp_inst = *inst; >>>>> + tmp_inst.exec_size = lower_width; >>>>> + tmp_inst.eot = inst->eot && i == n - 1; >>>>> + >>>>> + /* Set exec_all if the lowered width is higher than the >>>>> original >>>>> + * to avoid breaking the compiler invariant that no control >>>>> + * flow-masked instruction is wider than the shader's >>>>> + * dispatch_width. Then emit the lowered instruction. >> >> I don't think this is a standard invariant. The standard invariant is >> "set exec_all if you're doing something that doesn't match the natural >> exec mask". In fact, in the gen4 texturing code (which is what this >> is primarily for), we leave the writemask alone. > > The gen4 texturing code would have likely caused problems with > optimization passes that use the builder to emit new instructions based > on the channel enables of already existing instructions, precisely > because it used to break the invariant you mention (do something with an > exec mask higher than the natural without force_writemask_all).
What do you mean by "higher than the natural"? What is an optimization pass doing with exec_size that would get messed up? The optimization passes and most of the compiler beyond the initial NIR -> FS pass shouldn't care about dispatch_width. It should only ever care about the exec_sizes of particular instructions. Hey, look, a SIMD16 instruction! No big deal. I really don't know what these mythical optimization problems would be. >> If you do a SIMD16 instruction in SIMD8 mode without exec_all, you >> get garbage in the other 8 channels but it's otherwise fine. For >> instructions we need to "expand", this should be fine since we're >> throwing away the extra channels. >> >> On the other hand, we don't know what the other half of the writemask >> will be (experimentation indicates that it's *not* a duplicate of the >> first half) so not setting exec_all doesn't really gain us anything. >> Unless, of course, we could maybe save some memory bandwidth in >> texturing instructions if some of the channels are disabled. I doubt >> that matters much. >> > Yeah, in any case it can only have an effect on Gen4, and it will change > From writing garbage nondeterministically to some channels of the second > half to writing garbage deterministically. > >> My inclination would be to leave the exec_all alone since it's not >> needed, but I guess I'm not going to fight it too hard. >> > I wouldn't feel comfortable with doing that, it's surely going lead to > assertion failures in the future which will only be reproducible on Gen4 > and will therefore not be obvious for the casual contributor unless it's > being tested on that specifically. I guess the alternative would be to > relax the invariant, but it's proven to catch legitimate mistakes and > what the Gen4 code was doing had rather dubious semantics anyway, so I > doubt it's justified to change it... > >>>>> + */ >>>>> + const fs_builder lbld = ibld.exec_all(lower_width > >>>>> inst->exec_size) >>>>> + .group(lower_width, i); >>>>> + fs_inst *split_inst = lbld.emit(tmp_inst); >>>> >>>> Why are you emitting the split instruction before the source >>>> transposes? Don't they need to go first? Maybe I'm missing >>>> something. >>>> >>> I guess it doesn't really matter in which order they are inserted as >>> long as they end up in the correct order inside the program (and as you >>> can see below there is an at() call making sure that the builder used to >>> emit the transposes is pointing before the split instruction). >>> >>> No important reason really, other than to be able to assign the >>> temporaries allocated below to the split instruction sources directly. >> >> That's fine. It would be nice if you left a comment to that effect. >> Otherwise, it looks like you're emitting the send and then the moves. >> It's really easy when reading the code to miss the at() call below. >> > Sure. It would also be easy to swap the order of the transposes and the > split_inst emission if you find it easier to understand that way. I'll > just go do that instead of the comment if you have no objection. > >>>>> + >>>>> + for (unsigned j = 0; j < inst->sources; j++) { >>>>> + if (inst->src[j].file != BAD_FILE && >>>>> + !is_uniform(inst->src[j])) { >>>>> + /* Get the i-th copy_width-wide chunk of the source. */ >>>>> + const fs_reg src = horiz_offset(inst->src[j], >>>>> copy_width * i); >>>>> + const unsigned src_size = inst->regs_read(j) * >>>>> REG_SIZE / >>>>> + inst->src[j].component_size(inst->exec_size); >>>>> + >>>>> + /* Use a trivial transposition to copy one every n >>>>> + * copy_width-wide components of the register into a >>>>> + * temporary passed as source to the lowered >>>>> instruction. >>>>> + */ >>>>> + split_inst->src[j] = lbld.vgrf(inst->src[j].type, >>>>> src_size); >>>>> + emit_transpose(lbld.group(copy_width, 0).at(block, >>>>> split_inst), >>>>> + split_inst->src[j], &src, 1, src_size, >>>>> n); >>>> >>>> Shouldn't this be group(copy_width, i)? >>>> >>> >>> I'm already choosing the i-th channel group in the definition of lbld >>> while emitting the split instruction, this group() call is just to make >>> sure that the transposes don't copy data into the garbage channels of >>> the temporary in cases where copy_width is less than lower_width. >> >> Right. That makes sense. >> >>>>> + } >>>>> + } >>>>> + >>>>> + if (inst->regs_written) { >>>>> + /* Allocate enough space to hold the result of the lowered >>>>> + * instruction and fix up the number of registers written. >>>>> + */ >>>>> + split_inst->dst = dsts[i] = >>>>> + lbld.vgrf(inst->dst.type, dst_size); >>>>> + split_inst->regs_written = >>>>> + DIV_ROUND_UP(inst->regs_written * lower_width, >>>>> + inst->exec_size); >>>>> + } >>>>> + } >>>>> + >>>>> + if (inst->regs_written) { >>>>> + /* Distance between useful channels in the temporaries, >>>>> skipping >>>>> + * garbage if the lowered instruction is wider than the >>>>> original. >>>>> + */ >>>>> + const unsigned m = lower_width / copy_width; >>>>> + >>>>> + /* Interleave the components of the result from the lowered >>>>> + * instructions. We need to set exec_all() when copying >>>>> more than >>>>> + * one half per component, because LOAD_PAYLOAD (in terms of >>>>> which >>>>> + * emit_transpose is implemented) can only use the same >>>>> channel >>>>> + * enable signals for all of its non-header sources. >>>>> + */ >>>>> + emit_transpose(ibld.exec_all(inst->exec_size > copy_width) >>>>> + .group(copy_width, 0), >>>>> + inst->dst, dsts, n, dst_size, m); >> >> While we're on the exec_all topic, I don't think it's needed here >> either. In the case where exec_size > copy_width, we're throwing away >> the second half. Therefore, the channels with the wrong exec mask >> will either not get emitted at all or dead-code will delete them (I'm >> not sure which without thinking about it harder). In either case, the >> MOV's that do matter will have the right exec mask and exec_all >> doesn't gain us anything. > > Nope, the "exec_size > copy_width" case is the usual (non-gen4) case. > Nothing can be thrown away in that case because both halves have to be > copied interleaved into the destination register, so whatever execution > mask LOAD_PAYLOAD uses it has to work for both halves, which is only > possible if force_writemask_all is set. Right... Not a huge fan of exec_all'd LOAD_PAYLOAD, but we don't have a ZIP instruction so I won't complain too much. >>>>> + } >>>>> + >>>>> + inst->remove(block); >>>>> + progress = true; >>>>> + } >>>>> + } >>>>> + >>>>> + if (progress) >>>>> + invalidate_live_intervals(); >>>>> + >>>>> + return progress; >>>>> +} >>>>> + >>>>> void >>>>> fs_visitor::dump_instructions() >>>>> { >>>>> @@ -3655,6 +3796,7 @@ fs_visitor::optimize() >>>>> int iteration = 0; >>>>> int pass_num = 0; >>>>> >>>>> + OPT(lower_simd_width); >>>>> OPT(lower_logical_sends); >>>>> >>>>> do { >>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.h >>>>> b/src/mesa/drivers/dri/i965/brw_fs.h >>>>> index f3850d1..9582648 100644 >>>>> --- a/src/mesa/drivers/dri/i965/brw_fs.h >>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.h >>>>> @@ -184,6 +184,7 @@ public: >>>>> bool lower_load_payload(); >>>>> bool lower_logical_sends(); >>>>> bool lower_integer_multiplication(); >>>>> + bool lower_simd_width(); >>>>> bool opt_combine_constants(); >>>>> >>>>> void emit_dummy_fs(); >>>>> -- >>>>> 2.4.3 >>>>> _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev