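Re: [Mesa-dev] [PATCH 2/6] i965/vec4/generator: use 1-Oword Block Read/Write messages for DF scratch writes/reads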

Francisco Jerez Mon, 26 Jun 2017 10:49:05 -0700

Samuel Iglesias Gonsálvez <sigles...@igalia.com> writes:

> On Fri, 2017-06-23 at 11:06 -0700, Francisco Jerez wrote:
>> Samuel Iglesias Gonsálvez <sigles...@igalia.com> writes:
>> 
>> > On Thu, 2017-06-22 at 16:25 -0700, Francisco Jerez wrote:
>> > > Samuel Iglesias Gonsálvez <sigles...@igalia.com> writes:
>> > > 
>> > > > Signed-off-by: Samuel Iglesias Gonsálvez <sigles...@igalia.com>
>> > > > ---
>> > > >  src/intel/compiler/brw_eu_defines.h          |   2 +
>> > > >  src/intel/compiler/brw_shader.cpp            |   5 +
>> > > >  src/intel/compiler/brw_vec4.cpp              |   7 ++
>> > > >  src/intel/compiler/brw_vec4.h                |   8 ++
>> > > >  src/intel/compiler/brw_vec4_generator.cpp    | 136
>> > > > +++++++++++++++++++++++++++
>> > > >  src/intel/compiler/brw_vec4_reg_allocate.cpp |   6 +-
>> > > >  src/intel/compiler/brw_vec4_visitor.cpp      |  49 ++++++++++
>> > > >  7 files changed, 212 insertions(+), 1 deletion(-)
>> > > > 
>> > > > diff --git a/src/intel/compiler/brw_eu_defines.h
>> > > > b/src/intel/compiler/brw_eu_defines.h
>> > > > index 1af835d47e..3c148de0fa 100644
>> > > > --- a/src/intel/compiler/brw_eu_defines.h
>> > > > +++ b/src/intel/compiler/brw_eu_defines.h
>> > > > @@ -436,6 +436,8 @@ enum opcode {
>> > > >     VEC4_OPCODE_PICK_HIGH_32BIT,
>> > > >     VEC4_OPCODE_SET_LOW_32BIT,
>> > > >     VEC4_OPCODE_SET_HIGH_32BIT,
>> > > > +   VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW,
>> > > > +   VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH,
>> > > >  
>> > > 
>> > > What's the point of introducing two different opcodes with
>> > > essentially
>> > > the same semantics (read 32B worth of data) as the current
>> > > SHADER_OPCODE_GEN4_SCRATCH_READ?
>> > 
>> > Originally I had only SHADER_OPCODE_GEN4_SCRATCH_READ, but I changed
>> > it to avoid allocating more registers than needed when emitting the
>> > scratch write for a partial DF write. Let me explain:
>> > 
>> > When doing spilling, as DF instructions are both split and
>> > scalarized,
>> > we read the existing contents in scratch memory, overwrite them
>> > with
>> > the destination of the instruction, then emit scratch write.
>> > Together
>> > with the fact that I am not shuffling DF data, we only need to
>> > allocate
>> > 1 GRF to do so, instead of 2 (if I had emitted
>> > SHADER_OPCODE_GEN4_SCRATCH_READ), when doing spilling on partial DF
>> > writes.
>> > 
>> 
>> Why would you need to allocate more GRFs for
>> SHADER_OPCODE_GEN4_SCRATCH_READ?  It also only reads one register,
>> which
>> should be sufficient for a single scalarized instruction as long as
>> you
>> don't shuffle data around -- Have a look at how the FS back-end
>> addresses this problem.
>> 
>
> OK
>
>> > >   Is there any downside from using the
>> > > current opcode with force_writemask_all?  If anything it would
>> > > give
>> > > you
>> > > better performance because you'd only have to set up one header
>> > > (which
>> > > stalls the EU pipeline twice), send down one message to the
>> > > dataport,
>> > > and avoid stalling to shuffle the data around in the return
>> > > payload
>> > > (which prevents your two 1OWORD messages from being pipelined at
>> > > all).
>> > > 
>> > 
>> > Sorry, I am confused here. Do you mean using
>> > SHADER_OPCODE_GEN4_SCRATCH_READ as-is, which emits a "OWord Dual
>> > Block
>> > Read" message (so only one message)?
>> > 
>> > If that's the case, then I should shuffle the destination data of
>> > the
>> > partial DF write, change the 1-Oword block write offsets and so
>> > on...
>> 
>> Why would you need to shuffle any spilled data?  I don't think
>> there's
>> much of a benefit from shuffling since scratch overwrites need to read
>> the original data for the most part anyway because of writemasking.  In
>> fact
>> shuffling DF data is probably the reason things blow up right now
>> whenever you have mixed DF and single-precision reads or writes to
>> the
>> same spilled variable, which I guess is the reason you need to look
>> for
>> those cases and mark them as no_spill...
>> 
>
> Right, I don't need to shuffle data for the scratch write.
>
>> > in order to save it inside scratch memory in the proper place to
>> > make OWord Dual Block Read work. That would require some extra
>> > instructions, and I don't know whether it would perform better
>> > than the current implementation.
>> > 
>> 
>> I expect the most serious performance issue with the approach of this
>> patch will be the sequence of non-pipelined single-oword reads, which
>> means you get to pay for the EU-dataport roundtrip latency twice
>> instead
>> of once.
>> 
>> > Then, why do I need force_writemask_all=true when emitting
>> > SHADER_OPCODE_GEN4_SCRATCH_READ?
>> > 
>> 
>> Because you probably don't want to shuffle data in your scratch
>> buffer,
>> and you don't want the dataport to apply bogus 16B channel enables to
>> your reads and writes.
>> 
>
> If we save the dvec4 data of a vertex together in 32 consecutive
> bytes in scratch memory (i.e. no need for shuffling, and we use
> force_writemask_all as you said), then we need to create a special case
> for IVB and for partial DF reads on HSW+ when unspilling the data.
>
> What I am thinking now is that, if the scratch write is done wisely, we
> can write the data in the proper places for the two
> SHADER_OPCODE_GEN4_SCRATCH_READ instructions we use to unspill DF data:
> write each pair of XY components with its respective 1-OWord scratch
> write message, and the ZW components with other 1-OWord scratch write
> messages at an offset of 32 bytes. Thanks to this, we don't need to
> touch the current unspilling code (which does data shuffling), and it
> lets us unspill on IVB and do partial DF reads on HSW+ without any
> special case.
>
> The no-shuffling-at-all solution would be an improvement over what I
> sent in this v1, but I am leaning toward the solution in the previous
> paragraph because it reuses existing code and simplifies the changes,
> although it has some data-shuffling overhead.
>
> What do you think?
>


Can't we just drop the shuffling on HSW+ too?  AFAIA it has the same
drawbacks on HSW+ as it has on IVB, so I don't see any reason to
support both codepaths.

> Sam
>
>
>> > I can try this alternative solution if this is what you meant. It
>> > has
>> > the advantage of simplifying the changes a lot, which is always
>> > great.
>> > 
>> > Sam
>> > 
>> > > >     FS_OPCODE_DDX_COARSE,
>> > > >     FS_OPCODE_DDX_FINE,
>> > > > diff --git a/src/intel/compiler/brw_shader.cpp
>> > > > b/src/intel/compiler/brw_shader.cpp
>> > > > index 53d0742d2e..248feacbd2 100644
>> > > > --- a/src/intel/compiler/brw_shader.cpp
>> > > > +++ b/src/intel/compiler/brw_shader.cpp
>> > > > @@ -296,6 +296,11 @@ brw_instruction_name(const struct
>> > > > gen_device_info *devinfo, enum opcode op)
>> > > >     case FS_OPCODE_PACK:
>> > > >        return "pack";
>> > > >  
>> > > > +
>> > > > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW:
>> > > > +      return "gen4_scratch_read_1oword_low";
>> > > > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH:
>> > > > +      return "gen4_scratch_read_1oword_high";
>> > > >     case SHADER_OPCODE_GEN4_SCRATCH_READ:
>> > > >        return "gen4_scratch_read";
>> > > >     case SHADER_OPCODE_GEN4_SCRATCH_WRITE:
>> > > > diff --git a/src/intel/compiler/brw_vec4.cpp
>> > > > b/src/intel/compiler/brw_vec4.cpp
>> > > > index b443effca9..b6d409eea2 100644
>> > > > --- a/src/intel/compiler/brw_vec4.cpp
>> > > > +++ b/src/intel/compiler/brw_vec4.cpp
>> > > > @@ -259,6 +259,8 @@ bool
>> > > >  vec4_instruction::can_do_writemask(const struct
>> > > > gen_device_info
>> > > > *devinfo)
>> > > >  {
>> > > >     switch (opcode) {
>> > > > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW:
>> > > > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH:
>> > > >     case SHADER_OPCODE_GEN4_SCRATCH_READ:
>> > > >     case VEC4_OPCODE_DOUBLE_TO_F32:
>> > > >     case VEC4_OPCODE_DOUBLE_TO_D32:
>> > > > @@ -335,6 +337,9 @@
>> > > > vec4_visitor::implied_mrf_writes(vec4_instruction *inst)
>> > > >        return 1;
>> > > >     case VS_OPCODE_PULL_CONSTANT_LOAD:
>> > > >        return 2;
>> > > > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW:
>> > > > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH:
>> > > > +      return 1;
>> > > >     case SHADER_OPCODE_GEN4_SCRATCH_READ:
>> > > >        return 2;
>> > > >     case SHADER_OPCODE_GEN4_SCRATCH_WRITE:
>> > > > @@ -2091,6 +2096,8 @@ get_lowered_simd_width(const struct
>> > > > gen_device_info *devinfo,
>> > > >  {
>> > > >     /* Do not split some instructions that require special
>> > > > handling
>> > > > */
>> > > >     switch (inst->opcode) {
>> > > > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW:
>> > > > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH:
>> > > >     case SHADER_OPCODE_GEN4_SCRATCH_READ:
>> > > >     case SHADER_OPCODE_GEN4_SCRATCH_WRITE:
>> > > >        return inst->exec_size;
>> > > > diff --git a/src/intel/compiler/brw_vec4.h
>> > > > b/src/intel/compiler/brw_vec4.h
>> > > > index d828da02ea..a5b45aca21 100644
>> > > > --- a/src/intel/compiler/brw_vec4.h
>> > > > +++ b/src/intel/compiler/brw_vec4.h
>> > > > @@ -214,6 +214,9 @@ public:
>> > > >                          enum brw_conditional_mod condition);
>> > > >     vec4_instruction *IF(enum brw_predicate predicate);
>> > > >     EMIT1(SCRATCH_READ)
>> > > > +   vec4_instruction *DF_IVB_SCRATCH_READ(const dst_reg &dst,
>> > > > const
>> > > > src_reg &src0,
>> > > > +                                         bool low);
>> > > > +
>> > > >     EMIT2(SCRATCH_WRITE)
>> > > >     EMIT3(LRP)
>> > > >     EMIT1(BFREV)
>> > > > @@ -294,6 +297,11 @@ public:
>> > > >                          dst_reg dst,
>> > > >                          src_reg orig_src,
>> > > >                          int base_offset);
>> > > > +   void emit_1grf_df_ivb_scratch_read(bblock_t *block,
>> > > > +                                      vec4_instruction *inst,
>> > > > +                                      dst_reg temp, src_reg
>> > > > orig_src,
>> > > > +                                      int base_offset, bool
>> > > > first_grf);
>> > > > +
>> > > >     void emit_scratch_write(bblock_t *block, vec4_instruction
>> > > > *inst,
>> > > >                           int base_offset);
>> > > >     void emit_pull_constant_load(bblock_t *block,
>> > > > vec4_instruction
>> > > > *inst,
>> > > > diff --git a/src/intel/compiler/brw_vec4_generator.cpp
>> > > > b/src/intel/compiler/brw_vec4_generator.cpp
>> > > > index 334933d15a..3bb931385a 100644
>> > > > --- a/src/intel/compiler/brw_vec4_generator.cpp
>> > > > +++ b/src/intel/compiler/brw_vec4_generator.cpp
>> > > > @@ -1133,6 +1133,73 @@ generate_unpack_flags(struct brw_codegen
>> > > > *p,
>> > > >  }
>> > > >  
>> > > >  static void
>> > > > +generate_scratch_read_1oword(struct brw_codegen *p,
>> > > > +                             vec4_instruction *inst,
>> > > > +                             struct brw_reg dst,
>> > > > +                             struct brw_reg index,
>> > > > +                             bool low)
>> > > > +{
>> > > > +   const struct gen_device_info *devinfo = p->devinfo;
>> > > > +
>> > > > +   assert(devinfo->gen >= 7 && inst->exec_size == 4 &&
>> > > > +          type_sz(dst.type) == 8);
>> > > > +   brw_set_default_access_mode(p, BRW_ALIGN_1);
>> > > > +   brw_set_default_exec_size(p, BRW_EXECUTE_8);
>> > > > +
>> > > > +   if (!low) {
>> > > > +      /* Read second GRF (offset in OWORDs) */
>> > > > +      for (int i = 0; i < 2; i++) {
>> > > > +         brw_oword_block_read_scratch(p,
>> > > > +                                      dst,
>> > > > +                                      brw_message_reg(inst->base_mrf),
>> > > > +                                      1, 32*inst->offset + 16*i + 32, false, true);
>> > > > +         if (i == 0) {
>> > > > +            /* The scratch read message writes the 128 MSB (OWORD1 HIGH) of
>> > > > +             * the destination. We need to move them to dst.0 so we can
>> > > > +             * read the pending 128 bits without using a temporary register.
>> > > > +             */
>> > > > +            brw_set_default_exec_size(p, BRW_EXECUTE_4);
>> > > > +            struct brw_reg tmp =
>> > > > +               stride(suboffset(dst, 16 / type_sz(dst.type)),
>> > > > +                      4, 4, 1);
>> > > > +
>> > > > +            brw_set_default_mask_control(p, true);
>> > > > +            brw_MOV(p, dst, tmp);
>> > > > +            brw_set_default_mask_control(p, inst->force_writemask_all);
>> > > > +            brw_set_default_exec_size(p, BRW_EXECUTE_8);
>> > > > +         }
>> > > > +      }
>> > > > +   } else {
>> > > > +      /* Read first GRF (offset in OWORDs) */
>> > > > +      for (int i = 1; i >= 0; i--) {
>> > > > +         brw_oword_block_read_scratch(p,
>> > > > +                                      dst,
>> > > > +                                      brw_message_reg(inst->base_mrf),
>> > > > +                                      1, 32*inst->offset + 16*i, true, false);
>> > > > +
>> > > > +         if (i == 1) {
>> > > > +            /* The scratch read message writes the 128 LSB (OWORD1 LOW) of
>> > > > +             * the destination. We need to move them to dst.4 so we can
>> > > > +             * read the pending 128 bits without using a temporary register.
>> > > > +             */
>> > > > +            struct brw_reg tmp = stride(dst, 4, 4, 1);
>> > > > +            brw_set_default_exec_size(p, BRW_EXECUTE_4);
>> > > > +            brw_set_default_mask_control(p, true);
>> > > > +            brw_MOV(p,
>> > > > +                    suboffset(dst, 16 / type_sz(dst.type)),
>> > > > +                    tmp);
>> > > > +            brw_set_default_mask_control(p, inst->force_writemask_all);
>> > > > +            brw_set_default_exec_size(p, BRW_EXECUTE_8);
>> > > > +         }
>> > > > +      }
>> > > > +   }
>> > > > +
>> > > > +   brw_set_default_exec_size(p, cvt(inst->exec_size) - 1);
>> > > > +   brw_set_default_access_mode(p, BRW_ALIGN_16);
>> > > > +   return;
>> > > > +}
>> > > > +
>> > > > +static void
>> > > >  generate_scratch_read(struct brw_codegen *p,
>> > > >                        vec4_instruction *inst,
>> > > >                        struct brw_reg dst,
>> > > > @@ -1143,6 +1210,16 @@ generate_scratch_read(struct brw_codegen
>> > > > *p,
>> > > >  
>> > > >     gen6_resolve_implied_move(p, &header, inst->base_mrf);
>> > > >  
>> > > > +   if (devinfo->gen >= 7 && inst->exec_size == 4 &&
>> > > > +       type_sz(dst.type) == 8) {
>> > > > +      /* First read second GRF (offset in OWORDs) */
>> > > > +      struct brw_reg dst_high = suboffset(dst, 32 /
>> > > > type_sz(dst.type));
>> > > > +      generate_scratch_read_1oword(p, inst, dst_high, index,
>> > > > false);
>> > > > +      /* Now read first GRF (data from first vertex) */
>> > > > +      generate_scratch_read_1oword(p, inst, dst, index, true);
>> > > > +      return;
>> > > > +   }
>> > > > +
>> > > >     generate_oword_dual_block_offsets(p, brw_message_reg(inst->base_mrf + 1),
>> > > >                                     index);
>> > > >  
>> > > > @@ -1192,6 +1269,57 @@ generate_scratch_write(struct
>> > > > brw_codegen
>> > > > *p,
>> > > >     struct brw_reg header = brw_vec8_grf(0, 0);
>> > > >     bool write_commit;
>> > > >  
>> > > > +   if (devinfo->gen >= 7 && inst->exec_size == 4 &&
>> > > > +       type_sz(src.type) == 8) {
>> > > > +      brw_set_default_access_mode(p, BRW_ALIGN_1);
>> > > > +
>> > > > +      /* The messages only work with group == 0; we use the group to
>> > > > +       * know which message to emit (1-OWORD LOW or 1-OWORD HIGH).
>> > > > +       */
>> > > > +      brw_set_default_group(p, 0);
>> > > > +
>> > > > +      if (inst->group == 0) {
>> > > > +         for (int i = 0; i < 2; i++) {
>> > > > +            brw_set_default_exec_size(p, BRW_EXECUTE_4);
>> > > > +            brw_set_default_mask_control(p, true);
>> > > > +            struct brw_reg temp =
>> > > > +               retype(suboffset(src, i * 16 /
>> > > > type_sz(src.type)),
>> > > > BRW_REGISTER_TYPE_UD);
>> > > > +            temp = stride(temp, 4, 4, 1);
>> > > > +
>> > > > +            brw_MOV(p, brw_uvec_mrf(4, inst->base_mrf + 1, 0),
>> > > > +                    temp);
>> > > > +            brw_set_default_mask_control(p, inst->force_writemask_all);
>> > > > +            brw_set_default_exec_size(p, BRW_EXECUTE_8);
>> > > > +
>> > > > +            /* Offset in OWORDs */
>> > > > +            brw_oword_block_write_scratch(p, brw_message_reg(inst->base_mrf),
>> > > > +                                          1, 32*inst->offset + 16*i, true, false);
>> > > > +         }
>> > > > +      } else {
>> > > > +         for (int i = 0; i < 2; i++) {
>> > > > +            brw_set_default_exec_size(p, BRW_EXECUTE_4);
>> > > > +
>> > > > +            brw_set_default_mask_control(p, true);
>> > > > +            struct brw_reg temp =
>> > > > +               retype(suboffset(src, i * 16 /
>> > > > type_sz(src.type)),
>> > > > BRW_REGISTER_TYPE_UD);
>> > > > +            temp = stride(temp, 4, 4, 1);
>> > > > +
>> > > > +            brw_MOV(p, brw_uvec_mrf(4, inst->base_mrf + 1, 4),
>> > > > +                    temp);
>> > > > +
>> > > > +            brw_set_default_mask_control(p, inst->force_writemask_all);
>> > > > +            brw_set_default_exec_size(p, BRW_EXECUTE_8);
>> > > > +
>> > > > +            /* Offset in OWORDs */
>> > > > +            brw_oword_block_write_scratch(p, brw_message_reg(inst->base_mrf),
>> > > > +                                          1, 32*inst->offset + 16*i + 32, false, true);
>> > > > +         }
>> > > > +      }
>> > > > +      brw_set_default_exec_size(p, cvt(inst->exec_size) - 1);
>> > > > +      brw_set_default_access_mode(p, BRW_ALIGN_16);
>> > > > +      return;
>> > > > +   }
>> > > > +
>> > > >     /* If the instruction is predicated, we'll predicate the
>> > > > send,
>> > > > not
>> > > >      * the header setup.
>> > > >      */
>> > > > @@ -1780,6 +1908,14 @@ generate_code(struct brw_codegen *p,
>> > > >           generate_vs_urb_write(p, inst);
>> > > >           break;
>> > > >  
>> > > > +      case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW:
>> > > > +         generate_scratch_read_1oword(p, inst, dst, src[0],
>> > > > true);
>> > > > +         fill_count++;
>> > > > +         break;
>> > > > +      case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH:
>> > > > +         generate_scratch_read_1oword(p, inst, dst, src[0],
>> > > > false);
>> > > > +         fill_count++;
>> > > > +         break;
>> > > >        case SHADER_OPCODE_GEN4_SCRATCH_READ:
>> > > >           generate_scratch_read(p, inst, dst, src[0]);
>> > > >           fill_count++;
>> > > > diff --git a/src/intel/compiler/brw_vec4_reg_allocate.cpp
>> > > > b/src/intel/compiler/brw_vec4_reg_allocate.cpp
>> > > > index a0ba77b867..ec5ba10e86 100644
>> > > > --- a/src/intel/compiler/brw_vec4_reg_allocate.cpp
>> > > > +++ b/src/intel/compiler/brw_vec4_reg_allocate.cpp
>> > > > @@ -332,7 +332,9 @@ can_use_scratch_for_source(const
>> > > > vec4_instruction *inst, unsigned i,
>> > > >         * reusing scratch_reg for this instruction.
>> > > >         */
>> > > >        if (prev_inst->opcode ==
>> > > > SHADER_OPCODE_GEN4_SCRATCH_WRITE ||
>> > > > -          prev_inst->opcode ==
>> > > > SHADER_OPCODE_GEN4_SCRATCH_READ)
>> > > > +          prev_inst->opcode == SHADER_OPCODE_GEN4_SCRATCH_READ
>> > > > ||
>> > > > +          prev_inst->opcode ==
>> > > > VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW ||
>> > > > +          prev_inst->opcode ==
>> > > > VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH)
>> > > >           continue;
>> > > >  
>> > > >        /* If the previous instruction does not write to
>> > > > scratch_reg, then check
>> > > > @@ -467,6 +469,8 @@ vec4_visitor::evaluate_spill_costs(float
>> > > > *spill_costs, bool *no_spill)
>> > > >           loop_scale /= 10;
>> > > >           break;
>> > > >  
>> > > > +      case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW:
>> > > > +      case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH:
>> > > >        case SHADER_OPCODE_GEN4_SCRATCH_READ:
>> > > >        case SHADER_OPCODE_GEN4_SCRATCH_WRITE:
>> > > >           for (int i = 0; i < 3; i++) {
>> > > > diff --git a/src/intel/compiler/brw_vec4_visitor.cpp
>> > > > b/src/intel/compiler/brw_vec4_visitor.cpp
>> > > > index 22ee4dd1c4..37ae31c0d5 100644
>> > > > --- a/src/intel/compiler/brw_vec4_visitor.cpp
>> > > > +++ b/src/intel/compiler/brw_vec4_visitor.cpp
>> > > > @@ -264,6 +264,24 @@ vec4_visitor::SCRATCH_READ(const dst_reg
>> > > > &dst,
>> > > > const src_reg &index)
>> > > >  }
>> > > >  
>> > > >  vec4_instruction *
>> > > > +vec4_visitor::DF_IVB_SCRATCH_READ(const dst_reg &dst,
>> > > > +                                  const src_reg &index,
>> > > > +                                  bool first_grf)
>> > > > +{
>> > > > +   vec4_instruction *inst;
>> > > > +   enum opcode op = first_grf ?
>> > > > +      VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW :
>> > > > +      VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH;
>> > > > +
>> > > > +   inst = new(mem_ctx) vec4_instruction(op,
>> > > > +                                        dst, index);
>> > > > +   inst->base_mrf = FIRST_SPILL_MRF(devinfo->gen) + 1;
>> > > > +   inst->mlen = 1;
>> > > > +
>> > > > +   return inst;
>> > > > +}
>> > > > +
>> > > > +vec4_instruction *
>> > > >  vec4_visitor::SCRATCH_WRITE(const dst_reg &dst, const src_reg
>> > > > &src,
>> > > >                              const src_reg &index)
>> > > >  {
>> > > > @@ -1472,6 +1490,37 @@
>> > > > vec4_visitor::get_scratch_offset(bblock_t
>> > > > *block, vec4_instruction *inst,
>> > > >  
>> > > >  /**
>> > > >   * Emits an instruction before @inst to load the value named by @orig_src
>> > > > + * from scratch space at @base_offset to @temp. This instruction only reads
>> > > > + * DF values on IVB, one GRF at a time.
>> > > > + *
>> > > > + * @base_offset is measured in 32-byte units (the size of a register).
>> > > > + * @first_grf indicates whether we want to read the first vertex data (true)
>> > > > + * or the second (false).
>> > > > + */
>> > > > +void
>> > > > +vec4_visitor::emit_1grf_df_ivb_scratch_read(bblock_t *block,
>> > > > +                                            vec4_instruction
>> > > > *inst,
>> > > > +                                            dst_reg temp,
>> > > > src_reg
>> > > > orig_src,
>> > > > +                                            int base_offset,
>> > > > bool
>> > > > first_grf)
>> > > > +{
>> > > > +   assert(orig_src.offset % REG_SIZE == 0);
>> > > > +   src_reg index = get_scratch_offset(block, inst, 0,
>> > > > base_offset);
>> > > > +
>> > > > +   assert(devinfo->gen == 7 && !devinfo->is_haswell &&
>> > > > type_sz(temp.type) == 8);
>> > > > +   temp.offset = 0;
>> > > > +   vec4_instruction *read = DF_IVB_SCRATCH_READ(temp, index,
>> > > > first_grf);
>> > > > +   read->exec_size = 4;
>> > > > +   /* The instruction will use group 0 but a different message depending on the
>> > > > +    * vertex data to load.
>> > > > +    */
>> > > > +   read->group = 0;
>> > > > +   read->offset = base_offset;
>> > > > +   read->size_written = 1;
>> > > > +   emit_before(block, inst, read);
>> > > > +}
>> > > > +
>> > > > +/**
>> > > > + * Emits an instruction before @inst to load the value named
>> > > > by
>> > > > @orig_src
>> > > >   * from scratch space at @base_offset to @temp.
>> > > >   *
>> > > >   * @base_offset is measured in 32-byte units (the size of a
>> > > > register).
>> > > > -- 
>> > > > 2.11.0

Re: [Mesa-dev] [PATCH 2/6] i965/vec4/generator: use 1-Oword Block Read/Write messages for DF scratch writes/reads
