Jason Ekstrand <ja...@jlekstrand.net> writes: > On Fri, Apr 3, 2015 at 7:28 AM, Francisco Jerez <curroje...@riseup.net> wrote: >> Jason Ekstrand <ja...@jlekstrand.net> writes: >> >>> On Thu, Apr 2, 2015 at 3:01 AM, Francisco Jerez <curroje...@riseup.net> >>> wrote: >>>> Jason Ekstrand <ja...@jlekstrand.net> writes: >>>> >>>>> Instead of the complicated and broken-by-design pile of heuristics we had >>>>> before, we now have a straightforward lowering: >>>>> >>>>> 1) All header sources are copied directly using force_writemask_all and, >>>>> since they are guaranteed to be a single register, there are no >>>>> force_sechalf issues. >>>>> >>>>> 2) All non-header sources are copied using the exact same force_sechalf >>>>> and saturate modifiers as the LOAD_PAYLOAD operation itself. >>>>> >>>>> 3) In order to accommodate older gens that need interleaved colors, >>>>> lower_load_payload detects when the destination is a COMPR4 register >>>>> and automatically interleaves the non-header sources. The >>>>> lower_load_payload pass does the right thing here regardless of >>>>> whether >>>>> or not the hardware actually supports COMPR4. >>>> >>>> I had a quick glance at the series and it seems to be going in the right >>>> direction. >>>> >>>> One thing I honestly don't like is the ad-hoc and IMHO premature >>>> treatment of payload headers, it still feels like the LOAD_PAYLOAD >>>> instruction has more complex semantics than necessary and the benefit is >>>> unclear to me. >>>> >>>> I suppose that your motivation was to avoid setting force_writemask_all >>>> in LOAD_PAYLOAD instructions with header. The optimizer should be able >>>> to cope with those flags and get rid of them from the resulting moves >>>> where they are redundant, and if it's not able to it looks like >>>> something that should be fixed anyway. The explicit handling of headers >>>> is responsible for much of the churn in this series and is likely to >>>> complicate users of LOAD_PAYLOAD and optimization passes that have to >>>> manipulate them. >>> >>> Avoiding force_writemask_all is only half of the motivation and the >>> small half at that. A header source, more properly defined, is a >>> single physical register that, conceptually, applies to all channels. >>> Effectively, a header source (I should have stated this clearly) has >>> two properties: >>> >>> 1) It has force_writemask_all set >>> 2) It is exactly one physical hardware register. >>> >>> This second property is the more important of the two. Most of the >>> disaster of the previous LOAD_PAYLOAD implementation was that we did a >>> pile of guesswork and had a ill-conceved "effective width" think in >>> order to figure out how big the register actually was. Making the >>> user specify which sources are header sources eliminates that >>> guesswork. It also has the nice side-effect that we can do the right >>> force_writemask_all and we can properly handle COMPR4 for the the >>> user. >> >> Yeah, true, but this seems like the least orthogonal and most annoying >> to use solution for this problem, it forces the caller to provide >> redundant information, it takes into account the saturate flag on some >> arguments and not others, it shuffles sources with respect to the >> specified order when COMPR4 is set, but only for the first four >> non-header sources. I think any of the following solutions would be >> better-behaved than the current approach: > > I don't know that saying which sources are headers is really > redundant. It's explicit which is what we want. Yes, the COMPR4 > thing is a bit magical but we have to do COMPR4 in lower_load_payload > so we have to have some way of doing it and this method puts the > interleving code in one place instead of two. > Well, at least with the previous approach LOAD_PAYLOAD had consistent (if broken) semantics across its arguments, and regardless of COMPR4 being used or not, which IMHO is preferable to the modest code saving.
>> 1/ Use the source width to determine the size of each copy. This would >> imply that the source width carries semantic information and hence >> would have to be left alone by e.g. copy propagation. > > That's what do now and it's terrible. The "effective width" field was > basically a width that gets kept. > Couldn't you just prevent copy propagation from propagating a source of different width into a LOAD_PAYLOAD? You would still be able to get rid of the metadata guess and effective_width, but you may have to re-run copy propagate after lower_load_payload() for the case you missed any optimization oportunities. >> 2/ Use the instruction exec size and flags to determine the properties >> of *all* copies. This means that if a header is present the exec >> size would necessarily have to be 8 and the halves of a 16-wide >> register would have to be specified separately, which sounds annoying >> at first but in practice wouldn't necessarily be because it could be >> handled by the LOAD_PAYLOAD() helper based on the argument widths >> without running into problems with optimization passes changing the >> meaning of the instruction. The semantics of the instruction itself >> would be as stupid as possible, but the implementation could still >> trivially recognise 16-wide and COMPR4 copies using the exact same >> mechanism you are using now. > > Yes, that might work. I'll try and take a swing at it today. It will > *hopefully* have less code churn than the solution in this series > because the magic will still happen, just in a different place. Of > course, this solution also requires that we lower everything with > force_writemask_all and *hopefully* we can get rid of those and set > force_sechalf appropriately in optimization passes. > >> 3/ Split LOAD_PAYLOAD into two separate instructions, each of them dead >> simple, say COLLECT and LOAD_HEADER. "COLLECT dst, src0, ..., srcn" >> would be equivalent to the LOAD_PAYLOAD instruction described in 2, >> but it would only be used to load full-width non-header sources of >> the payload, so you would avoid the need to split 16-wide registers >> in half. "LOAD_HEADER dst, header, payload" would handle the >> asymmetric requirements of prepending a header, like using 8-wide >> instead of exec_size-wide copies and setting force_writemask_all. >> You could use mlen to specify the size of payload as is usual for >> instructions taking a payload source. > > I thought about doing something like that but it seems more convoluted > than would be at all worth it. > It's not so obvious to me that 3 would necessarily be more difficult than getting 2 right, but yeah it would definitely be more effort than 1. >>> --Jason >>> >>>> Thanks for looking into this BTW. >>>> >>>>> --- >>>>> src/mesa/drivers/dri/i965/brw_fs.cpp | 154 >>>>> ++++++++++++++------------- >>>>> src/mesa/drivers/dri/i965/brw_fs_visitor.cpp | 3 - >>>>> 2 files changed, 80 insertions(+), 77 deletions(-) >>>>> >>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp >>>>> b/src/mesa/drivers/dri/i965/brw_fs.cpp >>>>> index f8e26c0..bc45a38 100644 >>>>> --- a/src/mesa/drivers/dri/i965/brw_fs.cpp >>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp >>>>> @@ -3201,93 +3201,99 @@ fs_visitor::lower_load_payload() >>>>> { >>>>> bool progress = false; >>>>> >>>>> - int vgrf_to_reg[alloc.count]; >>>>> - int reg_count = 0; >>>>> - for (unsigned i = 0; i < alloc.count; ++i) { >>>>> - vgrf_to_reg[i] = reg_count; >>>>> - reg_count += alloc.sizes[i]; >>>>> - } >>>>> - >>>>> - struct { >>>>> - bool written:1; /* Whether this register has ever been written */ >>>>> - bool force_writemask_all:1; >>>>> - bool force_sechalf:1; >>>>> - } metadata[reg_count]; >>>>> - memset(metadata, 0, sizeof(metadata)); >>>>> - >>>>> foreach_block_and_inst_safe (block, fs_inst, inst, cfg) { >>>>> - if (inst->dst.file == GRF) { >>>>> - const int dst_reg = vgrf_to_reg[inst->dst.reg] + >>>>> inst->dst.reg_offset; >>>>> - bool force_sechalf = inst->force_sechalf && >>>>> - !inst->force_writemask_all; >>>>> - bool toggle_sechalf = inst->dst.width == 16 && >>>>> - type_sz(inst->dst.type) == 4 && >>>>> - !inst->force_writemask_all; >>>>> - for (int i = 0; i < inst->regs_written; ++i) { >>>>> - metadata[dst_reg + i].written = true; >>>>> - metadata[dst_reg + i].force_sechalf = force_sechalf; >>>>> - metadata[dst_reg + i].force_writemask_all = >>>>> inst->force_writemask_all; >>>>> - force_sechalf = (toggle_sechalf != force_sechalf); >>>>> - } >>>>> - } >>>>> - >>>>> if (inst->opcode == SHADER_OPCODE_LOAD_PAYLOAD) { >>>>> assert(inst->dst.file == MRF || inst->dst.file == GRF); >>>>> + >>>>> fs_reg dst = inst->dst; >>>>> >>>>> - for (int i = 0; i < inst->sources; i++) { >>>>> - dst.width = inst->src[i].effective_width; >>>>> - dst.type = inst->src[i].type; >>>>> - >>>>> - if (inst->src[i].file == BAD_FILE) { >>>>> - /* Do nothing but otherwise increment as normal */ >>>>> - } else if (dst.file == MRF && >>>>> - dst.width == 8 && >>>>> - brw->has_compr4 && >>>>> - i + 4 < inst->sources && >>>>> - inst->src[i + >>>>> 4].equals(horiz_offset(inst->src[i], 8))) { >>>>> - fs_reg compr4_dst = dst; >>>>> - compr4_dst.reg += BRW_MRF_COMPR4; >>>>> - compr4_dst.width = 16; >>>>> - fs_reg compr4_src = inst->src[i]; >>>>> - compr4_src.width = 16; >>>>> - fs_inst *mov = MOV(compr4_dst, compr4_src); >>>>> + /* Get rid of COMPR4. We'll add it back in if we need it */ >>>>> + if (dst.file == MRF && dst.reg & BRW_MRF_COMPR4) >>>>> + dst.reg = dst.reg & ~BRW_MRF_COMPR4; >>>>> + >>>>> + dst.width = 8; >>>>> + for (uint8_t i = 0; i < inst->header_size; i++) { >>>>> + if (inst->src[i].file != BAD_FILE) { >>>>> + fs_reg mov_dst = retype(dst, BRW_REGISTER_TYPE_UD); >>>>> + fs_reg mov_src = retype(inst->src[i], >>>>> BRW_REGISTER_TYPE_UD); >>>>> + mov_src.width = 8; >>>>> + fs_inst *mov = MOV(mov_dst, mov_src); >>>>> mov->force_writemask_all = true; >>>>> inst->insert_before(block, mov); >>>>> - /* Mark i+4 as BAD_FILE so we don't emit a MOV for it */ >>>>> - inst->src[i + 4].file = BAD_FILE; >>>>> - } else { >>>>> - fs_inst *mov = MOV(dst, inst->src[i]); >>>>> - if (inst->src[i].file == GRF) { >>>>> - int src_reg = vgrf_to_reg[inst->src[i].reg] + >>>>> - inst->src[i].reg_offset; >>>>> - mov->force_sechalf = metadata[src_reg].force_sechalf; >>>>> - mov->force_writemask_all = >>>>> metadata[src_reg].force_writemask_all; >>>>> - } else { >>>>> - /* We don't have any useful metadata for immediates or >>>>> - * uniforms. Assume that any of the channels of the >>>>> - * destination may be used. >>>>> - */ >>>>> - assert(inst->src[i].file == IMM || >>>>> - inst->src[i].file == UNIFORM); >>>>> - mov->force_writemask_all = true; >>>>> - } >>>>> + } >>>>> + dst = offset(dst, 1); >>>>> + } >>>>> >>>>> - if (dst.file == GRF) { >>>>> - const int dst_reg = vgrf_to_reg[dst.reg] + >>>>> dst.reg_offset; >>>>> - const bool force_writemask = mov->force_writemask_all; >>>>> - metadata[dst_reg].force_writemask_all = >>>>> force_writemask; >>>>> - metadata[dst_reg].force_sechalf = mov->force_sechalf; >>>>> - if (dst.width * type_sz(dst.type) > 32) { >>>>> - assert(!mov->force_sechalf); >>>>> - metadata[dst_reg + 1].force_writemask_all = >>>>> force_writemask; >>>>> - metadata[dst_reg + 1].force_sechalf = >>>>> !force_writemask; >>>>> + dst.width = inst->exec_size; >>>>> + if (inst->dst.file == MRF && (inst->dst.reg & BRW_MRF_COMPR4) && >>>>> + inst->exec_size > 8) { >>>>> + /* In this case, the payload portion of the LOAD_PAYLOAD >>>>> isn't >>>>> + * a straightforward copy. Instead, the result of the >>>>> + * LOAD_PAYLOAD is treated as interlaced and the first four >>>>> + * non-header sources are unpacked as: >>>>> + * >>>>> + * m + 0: r0 >>>>> + * m + 1: g0 >>>>> + * m + 2: b0 >>>>> + * m + 3: a0 >>>>> + * m + 4: r1 >>>>> + * m + 5: g1 >>>>> + * m + 6: b1 >>>>> + * m + 7: a1 >>>>> + * >>>>> + * This is used for gen <= 5 fb writes. >>>>> + */ >>>>> + assert(inst->exec_size == 16); >>>>> + assert(inst->header_size + 4 <= inst->sources); >>>>> + for (uint8_t i = inst->header_size; i < inst->header_size + >>>>> 4; i++) { >>>>> + if (inst->src[i].file != BAD_FILE) { >>>>> + if (brw->has_compr4) { >>>>> + fs_reg compr4_dst = retype(dst, inst->src[i].type); >>>>> + compr4_dst.reg |= BRW_MRF_COMPR4; >>>>> + >>>>> + fs_inst *mov = MOV(compr4_dst, inst->src[i]); >>>>> + mov->force_writemask_all = >>>>> inst->force_writemask_all; >>>>> + mov->saturate = inst->saturate; >>>>> + inst->insert_before(block, mov); >>>>> + } else { >>>>> + /* Platform doesn't have COMPR4. We have to fake >>>>> it */ >>>>> + fs_reg mov_dst = retype(dst, inst->src[i].type); >>>>> + mov_dst.width = 8; >>>>> + >>>>> + fs_inst *mov = MOV(mov_dst, half(inst->src[i], 0)); >>>>> + mov->force_writemask_all = >>>>> inst->force_writemask_all; >>>>> + mov->saturate = inst->saturate; >>>>> + inst->insert_before(block, mov); >>>>> + >>>>> + mov = MOV(offset(mov_dst, 4), half(inst->src[i], >>>>> 1)); >>>>> + mov->force_writemask_all = >>>>> inst->force_writemask_all; >>>>> + mov->saturate = inst->saturate; >>>>> + mov->force_sechalf = true; >>>>> + inst->insert_before(block, mov); >>>>> } >>>>> } >>>>> >>>>> - inst->insert_before(block, mov); >>>>> + dst.reg++; >>>>> } >>>>> >>>>> + dst.reg += 4; >>>>> + >>>>> + /* The COMPR4 code took care of the first 4 sources. We'll >>>>> let >>>>> + * the regular path handle any remaining sources. Yes, we >>>>> are >>>>> + * modifying the instruction but we're about to delete it so >>>>> + * this really doesn't hurt anything. >>>>> + */ >>>>> + inst->header_size += 4; >>>>> + } >>>>> + >>>>> + for (uint8_t i = inst->header_size; i < inst->sources; i++) { >>>>> + if (inst->src[i].file != BAD_FILE) { >>>>> + fs_inst *mov = MOV(retype(dst, inst->src[i].type), >>>>> + inst->src[i]); >>>>> + mov->force_writemask_all = inst->force_writemask_all; >>>>> + mov->saturate = inst->saturate; >>>>> + inst->insert_before(block, mov); >>>>> + } >>>>> dst = offset(dst, 1); >>>>> } >>>>> >>>>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp >>>>> b/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp >>>>> index b7eeb47..a0d8b798 100644 >>>>> --- a/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp >>>>> +++ b/src/mesa/drivers/dri/i965/brw_fs_visitor.cpp >>>>> @@ -3471,9 +3471,6 @@ fs_visitor::setup_color_payload(fs_reg *dst, fs_reg >>>>> color, unsigned components, >>>>> color.type, color.width); >>>>> inst = emit(MOV(dst[len], offset(color, i))); >>>>> inst->saturate = key->clamp_fragment_color; >>>>> - } else if (color.width == 16) { >>>>> - /* We need two BAD_FILE slots for a 16-wide color */ >>>>> - len++; >>>>> } >>>>> len++; >>>>> } >>>>> -- >>>>> 2.3.4 >>>>> >>>>> _______________________________________________ >>>>> mesa-dev mailing list >>>>> mesa-dev@lists.freedesktop.org >>>>> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
signature.asc
Description: PGP signature
_______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev