On Tuesday, February 17, 2015 04:59:37 PM Matt Turner wrote:
> On Tue, Feb 17, 2015 at 4:44 PM, Ben Widawsky
> <benjamin.widaw...@intel.com> wrote:
> > With scalar VS, it so happens that many vertex shaders will line up in a such a
> > way that two SIMD8 instructions can be collapsed into 1 SIMD16 instruction. For
> > example
> >
> > The following two MOVs
> > mov(8)   g124<1>F   g6<8,8,1>F   { align1 1Q compacted };
> > mov(8)   g125<1>F   g7<8,8,1>F   { align1 1Q compacted };
> >
> > Could be represented as a single MOV
> > mov(16)  g124<1>F   g6<8,8,1>F   { align1 1H compacted };
> >
> > The basic algorithm is very simple. For two consecutive instructions, check if
> > all source, and dst registers are adjacent. If so, reuse the first instruction
> > by adjusting the compression bits and then killing the second instruction. The
> > caveat is (shown above) is 1Q->1H is insufficient. As mentioned in the comments,
> > the second quarter of the DMask is invalid for us, so we actually must generate
> > the follow if possible:
> > mov(16)  g124<1>F   g6<8,8,1>F   { align1 WE_all 1H compacted };
> >
> > The next step would be to try informing the instruction scheduler and register
> > allocator to make this happen more often. Anecdotally the most often occurance
> > is for the blit shader generated by meta, and it always leaves things in good
> > order for us.
> >
> > The scalar VS is only available on later platforms. This same thing could be
> > applied to the FS, but there we hope to be using SIMD16 already for most
> > instructions. It shouldn't hurt to throw this same optimization at the FS for
> > cases where we have to fall back though.
> >
> > Cc: Kenneth Graunke <kenn...@whitecape.org>
> > Cc: Kristian Høgsberg <k...@bitplanet.net>
> > Signed-off-by: Ben Widawsky <b...@bwidawsk.net>
> > ---
> >
> > I have no had time to benchmark this very much, nor run piglit on it. I am just
> > sending it out before it bitrots too much further.
>
> I would be surprised if it had a measurable effect. Compressed
> instructions (i.e., SIMD16) are apparently just split into a pair of
> SIMD8 instructions by the instruction decoder. So, this should
> basically just be reducing code size, like instruction compaction.
>
> >
> > ---
> >
> >  src/mesa/drivers/dri/i965/brw_fs.cpp | 74 ++++++++++++++++++++++++++++++++++++
> >  src/mesa/drivers/dri/i965/brw_fs.h   |  1 +
> >  2 files changed, 75 insertions(+)
> >
> > diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp b/src/mesa/drivers/dri/i965/brw_fs.cpp
> > index 200a494..cc21cdf 100644
> > --- a/src/mesa/drivers/dri/i965/brw_fs.cpp
> > +++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
> > @@ -3716,6 +3716,78 @@ fs_visitor::allocate_registers()
> >        prog_data->total_scratch = brw_get_scratch_size(last_scratch);
> >  }
> >
> > +static bool
> > +is_ops_adjacent(fs_inst *a, fs_inst *b)
> > +{
> > +   if (a->opcode != b->opcode)
> > +      return false;
> > +
> > +   if (a->dst.reg != b->dst.reg - 1)
> > +      return false;
> > +
> > +   assert(a->sources == b->sources);
> > +
> > +   for (int i = 0; i < a->sources; i++) {
> > +      if (a->src[i].file != b->src[i].file)
> > +         return false;
> > +
> > +      if (a->src[i].file == HW_REG &&
> > +          (a->src[i].fixed_hw_reg.nr == b->src[i].fixed_hw_reg.nr - 1))
> > +         continue;
> > +      else if (a->src[i].file == GRF &&
> > +               (a->src[i].reg == b->src[i].reg - 1))
> > +         continue;
> > +      else if (a->src[i].file == IMM &&
> > +               a->src[i].fixed_hw_reg.dw1.ud == b->src[i].fixed_hw_reg.dw1.ud)
> > +         continue;
> > +
> > +      return false;
> > +   }
> > +
> > +   return true;
> > +}
> > +
> > +/* Try to upconvert a SIMD8 instruction into a fake SIMD16 instruction.
> > + *
> > + * If we have two operations in sequence, and they are using sequentially
> > + * contiguous operands, the two SIMD8 instructions may be combined into 1 SIMD16
> > + * instruction. For example:
> > + *   mov(8)   g124<1>F   g6<8,8,1>F
> > + *   mov(8)   g125<1>F   g7<8,8,1>F
> > + *
> > + * Is the same as:
> > + *   mov(16)  g124<1>F   g6<8,8,1>F
> > + *
> > + * This is trickier than it initially sounds. On the surface it sounds like a
> > + * good idea to simply combine the instructions as shown above, and convert
> > + * 1Q->1H. The main problem is that we're executing the shader with SIMD8 mode.
> > + * This means that 1/4 of the DMask is useful, and the rest is junk. All we can
> > + * do therefore is use WE_all if possible.
>
> Oh, wow.
>
> Presumably you tried without setting WE_all and it failed piglit?
>
> I've never figured out what the high bits of the execution mask
> contains in a SIMD8 shader. Something I read made me think the low 8
> bits simply repeated (seems like a useful behavior), and other text
> makes me think they're undefined. From the IVB PRM:
>
>    Note: When branching instructions are predicated, branching is
>    evaluated on all channels enabled at dispatch. This means, the
>    appropriate number of flag register bits must be initialized or used
>    in predication depending on the execution mask (EMask). Uninitalized
>    flags may result in undesired branching. For example, if using DMask
>    as EMask and if all 32 channels of DMask are enabled, a SIMD8 kernel
>    must initialize unused flag bits so that predication on branching is
>    evaluated correctly.

Another thing of note: on Gen8+, we use VMask as EMask. Previously, we used DMask. I believe I saw failures in glsl-fs-derivs without the VECTOR_MASK_ENABLE flag in gen8_ps_state.c. Just in case it's related.
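
For anyone following along, here is a rough standalone sketch of the merge being discussed and of why the mask matters. It is not the patch and does not use the real i965 IR: the Inst struct, merge_pairs() and channels_written() are hypothetical names invented for illustration, only GRF sources are modelled, and the mask model assumes the upper bits of the execution mask are zero in a SIMD8 dispatch (the thread itself notes they may be undefined).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for an IR instruction; the real
// fs_inst carries far more state (register files, immediates, regions).
struct Inst {
   int opcode;
   int dst_reg;
   std::vector<int> src_regs;   // GRF sources only, for brevity
   unsigned exec_size;          // 8 or 16 channels
   bool we_all;                 // models the WE_all (ignore mask) control
};

// Same opcode, and dst plus every source sit exactly one register apart.
static bool
is_adjacent(const Inst &a, const Inst &b)
{
   if (a.opcode != b.opcode || a.dst_reg != b.dst_reg - 1)
      return false;

   // Mirrors the assert in the quoted patch: same opcode, same source count.
   assert(a.src_regs.size() == b.src_regs.size());

   for (std::size_t i = 0; i < a.src_regs.size(); i++) {
      if (a.src_regs[i] != b.src_regs[i] - 1)
         return false;
   }
   return true;
}

// Fold each adjacent SIMD8 pair into one SIMD16 instruction marked WE_all.
// A real pass would also have to prove that WE_all is safe for the pair.
static std::vector<Inst>
merge_pairs(const std::vector<Inst> &in)
{
   std::vector<Inst> out;
   for (std::size_t i = 0; i < in.size(); i++) {
      Inst cur = in[i];
      if (i + 1 < in.size() && cur.exec_size == 8 &&
          in[i + 1].exec_size == 8 && is_adjacent(cur, in[i + 1])) {
         cur.exec_size = 16;
         cur.we_all = true;
         i++;   // the second instruction of the pair is dropped
      }
      out.push_back(cur);
   }
   return out;
}

// Toy model of channel enables: WE_all ignores the mask entirely, otherwise
// only channels that are set in the execution mask get written.
static uint32_t
channels_written(uint32_t emask, unsigned exec_size, bool we_all)
{
   uint32_t width = (exec_size >= 32) ? 0xffffffffu : ((1u << exec_size) - 1);
   return we_all ? width : (emask & width);
}
```

With a mask of 0x00ff, channels_written() reports that a plain SIMD16 instruction only touches the low 8 channels, so the second register of the merged MOV would never be written; with WE_all set, all 16 channels are written. That is the toy version of why 1Q->1H alone is not enough.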
_______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev