Samuel Iglesias Gonsálvez <sigles...@igalia.com> writes: > On 07/03/17 22:24, Francisco Jerez wrote: >> Samuel Iglesias Gonsálvez <sigles...@igalia.com> writes: >> >>> On 04/03/17 01:04, Francisco Jerez wrote: >>>> Samuel Iglesias Gonsálvez <sigles...@igalia.com> writes: >>>> >>>>> From: "Juan A. Suarez Romero" <jasua...@igalia.com> >>>>> >>>>> When splitting VEC4_OPCODE_FROM_DOUBLE in Ivybridge/Baytrail, the second >>>>> part should use a temporal register, and then move the values to the >>>>> second half of the original destination, so we get all the results in the >>>>> same register. >>>>> >>>>> v2: >>>>> - Fix typos (Matt). >>>>> --- >>>>> src/mesa/drivers/dri/i965/brw_vec4.cpp | 17 +++++++++++++---- >>>>> src/mesa/drivers/dri/i965/brw_vec4_generator.cpp | 1 + >>>>> 2 files changed, 14 insertions(+), 4 deletions(-) >>>>> >>>>> diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp >>>>> b/src/mesa/drivers/dri/i965/brw_vec4.cpp >>>>> index 64b435f3ec4..adcde085305 100644 >>>>> --- a/src/mesa/drivers/dri/i965/brw_vec4.cpp >>>>> +++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp >>>>> @@ -2191,9 +2191,15 @@ vec4_visitor::lower_simd_width() >>>>> linst->group = channel_offset; >>>>> linst->size_written = size_written; >>>>> >>>>> + /* When splitting VEC4_OPCODE_FROM_DOUBLE on Ivybridge, the >>>>> second part >>>>> + * should use in a temporal register. Later we will move the >>>>> values >>>>> + * to the second half of the original destination, so we get >>>>> all the >>>>> + * results in the same register. We use d2f_pass to detect this >>>>> case. >>>>> + */ >>>>> + bool d2f_pass = (inst->opcode == VEC4_OPCODE_FROM_DOUBLE && n > >>>>> 0); >>>>> /* Compute split dst region */ >>>>> dst_reg dst; >>>>> - if (needs_temp) { >>>>> + if (needs_temp || d2f_pass) { >>>>> unsigned num_regs = DIV_ROUND_UP(size_written, REG_SIZE); >>>>> dst = retype(dst_reg(VGRF, alloc.allocate(num_regs)), >>>>> inst->dst.type); >>>>> @@ -2226,9 +2232,12 @@ vec4_visitor::lower_simd_width() >>>>> /* If we used a temporary to store the result of the split >>>>> * instruction, copy the result to the original destination >>>>> */ >>>>> - if (needs_temp) { >>>>> - vec4_instruction *mov = >>>>> - MOV(offset(inst->dst, lowered_width, n), src_reg(dst)); >>>>> + if (needs_temp || d2f_pass) { >>>>> + vec4_instruction *mov; >>>>> + if (d2f_pass) >>>>> + mov = MOV(horiz_offset(inst->dst, n * >>>>> type_sz(inst->dst.type)), src_reg(dst)); >>>> >>>> I have no idea how this could possibly work... horiz_offset() expects a >>>> number of scalar components, not bytes. Anyway I have a hunch this is >>>> trying to workaround the bug I pointed out in PATCH 15... >>>> >>> >>> This worked by luck. We want an horizontal offset of 'lowered_width' >>> components per split. As inst->dst.type is F, type_sz(inst->dst.type) = >>> 4 and we set the maximum 'lowered_width' value to 4, it matches in the >>> cases we tested... however, this is a bug. >>> >>> Actually it works too if we use offset() instead, so this hunk is not >>> needed. >>> >> >> This patch as a whole makes no sense to me... It's working around the >> dubious behavior of VEC4_OPCODE_FROM_DOUBLE which doesn't behave like >> any other arithmetic instruction of the back-end because it corrupts >> data beyond its destination region, which means any reasoning done about >> it within the optimizer is suspect of leading to data corruption in the >> same way that you got the SIMD lowering pass to inadvertently cause >> corruption. The problem IMO is that you're pretending that something >> that doesn't behave like a back-end instruction is an instruction. I >> suggest you split the datatype conversion from the gathering/scattering >> of 32-bit fields as separate instructions -- Turns out that you already >> have the latter implemented as VEC4_OPCODE_PICK/SET_LOW_32BIT. >> >> Incidentally doing things that way would allow better performance >> because the two operations would become visible to the scheduler and an >> EU pipeline stall could potentially be avoided (though performance is by >> no means my primary motivation for NAKing this patch, but rather >> preserving our sanity in the long run...). >> > > I am wondering if we can just fix the data corruption produced by > VEC4_OPCODE_FROM_DOUBLE in the generator, instead of splitting the > instruction. This has the advantage that we have the result of the > opcode already shrank instead of needing VEC4_OPCODE_PICK_LOW_32BIT, so > if any lowering/optimization emits a new VEC4_OPCODE_FROM_DOUBLE, we > don't need to remember to emit VEC4_OPCODE_PICK_LOW_32BIT together with it. >
Lowering/optimization passes aren't going to emit a new VEC4_OPCODE_FROM_DOUBLE opcode unless one was already there, in which case there will certainly be an accompanying VEC4_OPCODE_PICK_LOW_32BIT as long as the NIR translation pass is emitting valid code. You shouldn't need to introduce any additional complexity into the optimizer in order to fix this. > This would be the diff (against v3, not v4): > > diff --git a/src/mesa/drivers/dri/i965/brw_vec4_generator.cpp > b/src/mesa/drivers/dri/i965/brw_vec4_generator.cpp > index f6034bc..dc5f437 100644 > --- a/src/mesa/drivers/dri/i965/brw_vec4_generator.cpp > +++ b/src/mesa/drivers/dri/i965/brw_vec4_generator.cpp > @@ -1973,7 +1973,9 @@ generate_code(struct brw_codegen *p, > > dst.hstride = BRW_HORIZONTAL_STRIDE_1; > dst.width = BRW_WIDTH_8; > + brw_set_default_exec_size(p, cvt(inst->exec_size) - 1); > brw_MOV(p, dst, dst_as_src); > + brw_set_default_exec_size(p, cvt(doubled_exec_size) - 1); > > brw_set_default_access_mode(p, BRW_ALIGN_16); > break; > This is fixing *one* problem but the MOV emitted immediately before that hunk will still corrupt data beyond the 32B bounds of the align16 float destination region. > However, the EU pipeline stall issue you mention would be still there. > > Nevertheless, I am OK with your suggestion, no strong opinion against > it. So, while waiting for your answer, I am going to implement your > suggestion and add it to v4 unless you prefer this one. > > Sam > >>> Sam >>> >>>>> + else >>>>> + mov = MOV(offset(inst->dst, lowered_width, n), >>>>> src_reg(dst)); >>>>> mov->exec_size = lowered_width; >>>>> mov->group = channel_offset; >>>>> mov->size_written = size_written; >>>>> diff --git a/src/mesa/drivers/dri/i965/brw_vec4_generator.cpp >>>>> b/src/mesa/drivers/dri/i965/brw_vec4_generator.cpp >>>>> index 7fa1afc9073..b570792badd 100644 >>>>> --- a/src/mesa/drivers/dri/i965/brw_vec4_generator.cpp >>>>> +++ b/src/mesa/drivers/dri/i965/brw_vec4_generator.cpp >>>>> @@ -1532,6 +1532,7 @@ generate_code(struct brw_codegen *p, >>>>> is_ivb_df); >>>>> >>>>> assert(inst->group % 8 == 0 || >>>>> + (inst->exec_size == 4 && inst->group % 4 == 0) || >>>>> inst->dst.type == BRW_REGISTER_TYPE_DF || >>>>> inst->src[0].type == BRW_REGISTER_TYPE_DF || >>>>> inst->src[1].type == BRW_REGISTER_TYPE_DF || >>>>> -- >>>>> 2.11.0 >>>>> >>>>> _______________________________________________ >>>>> mesa-dev mailing list >>>>> mesa-dev@lists.freedesktop.org >>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
signature.asc
Description: PGP signature
_______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev