Samuel Iglesias Gonsálvez <sigles...@igalia.com> writes:

> On Thu, 2017-06-22 at 16:25 -0700, Francisco Jerez wrote:
>> Samuel Iglesias Gonsálvez <sigles...@igalia.com> writes:
>> 
>> > Signed-off-by: Samuel Iglesias Gonsálvez <sigles...@igalia.com>
>> > ---
>> >  src/intel/compiler/brw_eu_defines.h          |   2 +
>> >  src/intel/compiler/brw_shader.cpp            |   5 +
>> >  src/intel/compiler/brw_vec4.cpp              |   7 ++
>> >  src/intel/compiler/brw_vec4.h                |   8 ++
>> >  src/intel/compiler/brw_vec4_generator.cpp    | 136
>> > +++++++++++++++++++++++++++
>> >  src/intel/compiler/brw_vec4_reg_allocate.cpp |   6 +-
>> >  src/intel/compiler/brw_vec4_visitor.cpp      |  49 ++++++++++
>> >  7 files changed, 212 insertions(+), 1 deletion(-)
>> > 
>> > diff --git a/src/intel/compiler/brw_eu_defines.h
>> > b/src/intel/compiler/brw_eu_defines.h
>> > index 1af835d47e..3c148de0fa 100644
>> > --- a/src/intel/compiler/brw_eu_defines.h
>> > +++ b/src/intel/compiler/brw_eu_defines.h
>> > @@ -436,6 +436,8 @@ enum opcode {
>> >     VEC4_OPCODE_PICK_HIGH_32BIT,
>> >     VEC4_OPCODE_SET_LOW_32BIT,
>> >     VEC4_OPCODE_SET_HIGH_32BIT,
>> > +   VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW,
>> > +   VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH,
>> >  
>> 
>> What's the point of introducing two different opcodes with
>> essentially
>> the same semantics (read 32B worth of data) as the current
>> SHADER_OPCODE_GEN4_SCRATCH_READ?
>
> Originally I had only SHADER_OPCODE_GEN4_SCRATCH_READ but I changed it
> to don't allocate more registers than needed when doing scratch write
> of a partial DF write. Let me explain it:
>
> When doing spilling, as DF instructions are both split and scalarized,
> we read the existing contents in scratch memory, overwrite them with
> the destination of the instruction, then emit scratch write. Together
> with the fact that I am not shuffling DF data, we only need to allocate
> 1 GRF to do so, instead of 2 (if I had emitted
> SHADER_OPCODE_GEN4_SCRATCH_READ), when doing spilling on partial DF
> writes.
>

Why would you need to allocate more GRFs for
SHADER_OPCODE_GEN4_SCRATCH_READ?  It also only reads one register, which
should be sufficient for a single scalarized instruction as long as you
don't shuffle data around -- Have a look at how the FS back-end
addresses this problem.

>>  Is there any downside from using the
>> current opcode with force_writemask_all?  If anything it would give
>> you
>> better performance because you'd only have to set up one header
>> (which
>> stalls the EU pipeline twice), send down one message to the dataport,
>> and avoid stalling to shuffle the data around in the return payload
>> (which prevents your two 1OWORD messages from being pipelined at
>> all).
>> 
>
> Sorry, I am confused here. Do you mean using
> SHADER_OPCODE_GEN4_SCRATCH_READ as-is, which emits a "OWord Dual Block
> Read" message (so only one message)?
>
> If that's the case, then I should shuffle the destination data of the
> partial DF write, change the 1-Oword block write offsets and so on...

Why would you need to shuffle any spilled data?  I don't think there's
much of a benefit from shuffling since scratch overwrites need read the
original data for the most part anyway because of writemasking.  In fact
shuffling DF data is probably the reason things blow up right now
whenever you have mixed DF and single-precision reads or writes to the
same spilled variable, which I guess is the reason you need to look for
those cases and mark them as no_spill...

> in order to save it inside scratch memory in the proper place to make
> OWord Dual Block Read work. That would require to some extra
> instructions, but I don't know if this would give better performance
> against current implementation or not.
>

I expect the most serious performance issue with the approach of this
patch will be the sequence of non-pipelined single-oword reads, which
means you get to pay for the EU-dataport roundtrip latency twice instead
of once.

> Then, why do I need force_writemask=true when emitting
> SHADER_OPCODE_GEN4_SCRATCH_READ?
>
Because you probably don't want to shuffle data in your scratch buffer,
and you don't want the dataport to apply bogus 16B channel enables to
your reads and writes.

> I can try this alternative solution if this is what you meant. It has
> the advantage of simplifying the changes a lot, which is always great.
>
> Sam
>
>> >     FS_OPCODE_DDX_COARSE,
>> >     FS_OPCODE_DDX_FINE,
>> > diff --git a/src/intel/compiler/brw_shader.cpp
>> > b/src/intel/compiler/brw_shader.cpp
>> > index 53d0742d2e..248feacbd2 100644
>> > --- a/src/intel/compiler/brw_shader.cpp
>> > +++ b/src/intel/compiler/brw_shader.cpp
>> > @@ -296,6 +296,11 @@ brw_instruction_name(const struct
>> > gen_device_info *devinfo, enum opcode op)
>> >     case FS_OPCODE_PACK:
>> >        return "pack";
>> >  
>> > +
>> > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW:
>> > +      return "gen4_scratch_read_1word_low";
>> > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH:
>> > +      return "gen4_scratch_read_1word_high";
>> >     case SHADER_OPCODE_GEN4_SCRATCH_READ:
>> >        return "gen4_scratch_read";
>> >     case SHADER_OPCODE_GEN4_SCRATCH_WRITE:
>> > diff --git a/src/intel/compiler/brw_vec4.cpp
>> > b/src/intel/compiler/brw_vec4.cpp
>> > index b443effca9..b6d409eea2 100644
>> > --- a/src/intel/compiler/brw_vec4.cpp
>> > +++ b/src/intel/compiler/brw_vec4.cpp
>> > @@ -259,6 +259,8 @@ bool
>> >  vec4_instruction::can_do_writemask(const struct gen_device_info
>> > *devinfo)
>> >  {
>> >     switch (opcode) {
>> > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW:
>> > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH:
>> >     case SHADER_OPCODE_GEN4_SCRATCH_READ:
>> >     case VEC4_OPCODE_DOUBLE_TO_F32:
>> >     case VEC4_OPCODE_DOUBLE_TO_D32:
>> > @@ -335,6 +337,9 @@
>> > vec4_visitor::implied_mrf_writes(vec4_instruction *inst)
>> >        return 1;
>> >     case VS_OPCODE_PULL_CONSTANT_LOAD:
>> >        return 2;
>> > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW:
>> > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH:
>> > +      return 1;
>> >     case SHADER_OPCODE_GEN4_SCRATCH_READ:
>> >        return 2;
>> >     case SHADER_OPCODE_GEN4_SCRATCH_WRITE:
>> > @@ -2091,6 +2096,8 @@ get_lowered_simd_width(const struct
>> > gen_device_info *devinfo,
>> >  {
>> >     /* Do not split some instructions that require special handling
>> > */
>> >     switch (inst->opcode) {
>> > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW:
>> > +   case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH:
>> >     case SHADER_OPCODE_GEN4_SCRATCH_READ:
>> >     case SHADER_OPCODE_GEN4_SCRATCH_WRITE:
>> >        return inst->exec_size;
>> > diff --git a/src/intel/compiler/brw_vec4.h
>> > b/src/intel/compiler/brw_vec4.h
>> > index d828da02ea..a5b45aca21 100644
>> > --- a/src/intel/compiler/brw_vec4.h
>> > +++ b/src/intel/compiler/brw_vec4.h
>> > @@ -214,6 +214,9 @@ public:
>> >                          enum brw_conditional_mod condition);
>> >     vec4_instruction *IF(enum brw_predicate predicate);
>> >     EMIT1(SCRATCH_READ)
>> > +   vec4_instruction *DF_IVB_SCRATCH_READ(const dst_reg &dst, const
>> > src_reg &src0,
>> > +                                         bool low);
>> > +
>> >     EMIT2(SCRATCH_WRITE)
>> >     EMIT3(LRP)
>> >     EMIT1(BFREV)
>> > @@ -294,6 +297,11 @@ public:
>> >                      dst_reg dst,
>> >                      src_reg orig_src,
>> >                      int base_offset);
>> > +   void emit_1grf_df_ivb_scratch_read(bblock_t *block,
>> > +                                      vec4_instruction *inst,
>> > +                                      dst_reg temp, src_reg
>> > orig_src,
>> > +                                      int base_offset, bool
>> > first_grf);
>> > +
>> >     void emit_scratch_write(bblock_t *block, vec4_instruction
>> > *inst,
>> >                       int base_offset);
>> >     void emit_pull_constant_load(bblock_t *block, vec4_instruction
>> > *inst,
>> > diff --git a/src/intel/compiler/brw_vec4_generator.cpp
>> > b/src/intel/compiler/brw_vec4_generator.cpp
>> > index 334933d15a..3bb931385a 100644
>> > --- a/src/intel/compiler/brw_vec4_generator.cpp
>> > +++ b/src/intel/compiler/brw_vec4_generator.cpp
>> > @@ -1133,6 +1133,73 @@ generate_unpack_flags(struct brw_codegen *p,
>> >  }
>> >  
>> >  static void
>> > +generate_scratch_read_1oword(struct brw_codegen *p,
>> > +                             vec4_instruction *inst,
>> > +                             struct brw_reg dst,
>> > +                             struct brw_reg index,
>> > +                             bool low)
>> > +{
>> > +   const struct gen_device_info *devinfo = p->devinfo;
>> > +
>> > +   assert(devinfo->gen >= 7 && inst->exec_size == 4 &&
>> > +          type_sz(dst.type) == 8);
>> > +   brw_set_default_access_mode(p, BRW_ALIGN_1);
>> > +   brw_set_default_exec_size(p, BRW_EXECUTE_8);
>> > +
>> > +   if (!low) {
>> > +      /* Read second GRF (offset in OWORDs) */
>> > +      for (int i = 0; i < 2; i++) {
>> > +         brw_oword_block_read_scratch(p,
>> > +                                      dst,
>> > +                                      brw_message_reg(inst-
>> > >base_mrf),
>> > +                                      1, 32*inst->offset + 16*i +
>> > 32, false, true);
>> > +         if (i == 0) {
>> > +            /* The scratch read message writes the 128 MSB (OWORD1
>> > HIGH) of
>> > +             * the destination. We need to move them to dst.0 so
>> > we can
>> > +             * read the pending 128 bits without using a temporary
>> > register.
>> > +             */
>> > +            brw_set_default_exec_size(p, BRW_EXECUTE_4);
>> > +            struct brw_reg tmp =
>> > +               stride(suboffset(dst, 16 / type_sz(dst.type)),
>> > +                      4, 4, 1);
>> > +
>> > +            brw_set_default_mask_control(p, true);
>> > +            brw_MOV(p, dst, tmp);
>> > +            brw_set_default_mask_control(p, inst-
>> > >force_writemask_all);
>> > +            brw_set_default_exec_size(p, BRW_EXECUTE_8);
>> > +         }
>> > +      }
>> > +   } else {
>> > +      /* Read first GRF (offset in OWORDs) */
>> > +      for (int i = 1; i >= 0; i--) {
>> > +         brw_oword_block_read_scratch(p,
>> > +                                      dst,
>> > +                                      brw_message_reg(inst-
>> > >base_mrf),
>> > +                                      1, 32*inst->offset + 16*i,
>> > true, false);
>> > +
>> > +         if (i == 1) {
>> > +            /* The scratch read message writes the 128 LSB (OWORD1
>> > LOW) of
>> > +             * the destination. We need to move them to dst.4 so
>> > we can
>> > +             * read the pending 128 bits without using a temporary
>> > register.
>> > +             */
>> > +            struct brw_reg tmp = stride(dst, 4, 4, 1);
>> > +            brw_set_default_exec_size(p, BRW_EXECUTE_4);
>> > +            brw_set_default_mask_control(p, true);
>> > +            brw_MOV(p,
>> > +                    suboffset(dst, 16 / type_sz(dst.type)),
>> > +                    tmp);
>> > +            brw_set_default_mask_control(p, inst-
>> > >force_writemask_all);
>> > +            brw_set_default_exec_size(p, BRW_EXECUTE_8);
>> > +         }
>> > +      }
>> > +   }
>> > +
>> > +   brw_set_default_exec_size(p, cvt(inst->exec_size) - 1);
>> > +   brw_set_default_access_mode(p, BRW_ALIGN_16);
>> > +   return;
>> > +}
>> > +
>> > +static void
>> >  generate_scratch_read(struct brw_codegen *p,
>> >                        vec4_instruction *inst,
>> >                        struct brw_reg dst,
>> > @@ -1143,6 +1210,16 @@ generate_scratch_read(struct brw_codegen *p,
>> >  
>> >     gen6_resolve_implied_move(p, &header, inst->base_mrf);
>> >  
>> > +   if (devinfo->gen >= 7 && inst->exec_size == 4 &&
>> > +       type_sz(dst.type) == 8) {
>> > +      /* First read second GRF (offset in OWORDs) */
>> > +      struct brw_reg dst_high = suboffset(dst, 32 /
>> > type_sz(dst.type));
>> > +      generate_scratch_read_1oword(p, inst, dst_high, index,
>> > false);
>> > +      /* Now read first GRF (data from first vertex) */
>> > +      generate_scratch_read_1oword(p, inst, dst, index, true);
>> > +      return;
>> > +   }
>> > +
>> >     generate_oword_dual_block_offsets(p, brw_message_reg(inst-
>> > >base_mrf + 1),
>> >                                 index);
>> >  
>> > @@ -1192,6 +1269,57 @@ generate_scratch_write(struct brw_codegen
>> > *p,
>> >     struct brw_reg header = brw_vec8_grf(0, 0);
>> >     bool write_commit;
>> >  
>> > +   if (devinfo->gen >= 7 && inst->exec_size == 4 &&
>> > +       type_sz(src.type) == 8) {
>> > +      brw_set_default_access_mode(p, BRW_ALIGN_1);
>> > +
>> > +      /* The messages only works with group == 0, we use the group
>> > to know which
>> > +       * message emit (1-OWORD LOW or 1-OWORD HIGH).
>> > +       */
>> > +      brw_set_default_group(p, 0);
>> > +
>> > +      if (inst->group == 0) {
>> > +         for (int i = 0; i < 2; i++) {
>> > +            brw_set_default_exec_size(p, BRW_EXECUTE_4);
>> > +            brw_set_default_mask_control(p, true);
>> > +            struct brw_reg temp =
>> > +               retype(suboffset(src, i * 16 / type_sz(src.type)),
>> > BRW_REGISTER_TYPE_UD);
>> > +            temp = stride(temp, 4, 4, 1);
>> > +
>> > +            brw_MOV(p, brw_uvec_mrf(4, inst->base_mrf + 1, 0),
>> > +                    temp);
>> > +            brw_set_default_mask_control(p, inst-
>> > >force_writemask_all);
>> > +            brw_set_default_exec_size(p, BRW_EXECUTE_8);
>> > +
>> > +            /* Offset in OWORDs */
>> > +            brw_oword_block_write_scratch(p, brw_message_reg(inst-
>> > >base_mrf),
>> > +                                          1, 32*inst->offset +
>> > 16*i, true, false);
>> > +         }
>> > +      } else {
>> > +         for (int i = 0; i < 2; i++) {
>> > +            brw_set_default_exec_size(p, BRW_EXECUTE_4);
>> > +
>> > +            brw_set_default_mask_control(p, true);
>> > +            struct brw_reg temp =
>> > +               retype(suboffset(src, i * 16 / type_sz(src.type)),
>> > BRW_REGISTER_TYPE_UD);
>> > +            temp = stride(temp, 4, 4, 1);
>> > +
>> > +            brw_MOV(p, brw_uvec_mrf(4, inst->base_mrf + 1, 4),
>> > +                    temp);
>> > +
>> > +            brw_set_default_mask_control(p, inst-
>> > >force_writemask_all);
>> > +            brw_set_default_exec_size(p, BRW_EXECUTE_8);
>> > +
>> > +            /* Offset in OWORDs */
>> > +            brw_oword_block_write_scratch(p, brw_message_reg(inst-
>> > >base_mrf),
>> > +                                          1, 32*inst->offset +
>> > 16*i + 32, false, true);
>> > +         }
>> > +      }
>> > +      brw_set_default_exec_size(p, cvt(inst->exec_size) - 1);
>> > +      brw_set_default_access_mode(p, BRW_ALIGN_16);
>> > +      return;
>> > +   }
>> > +
>> >     /* If the instruction is predicated, we'll predicate the send,
>> > not
>> >      * the header setup.
>> >      */
>> > @@ -1780,6 +1908,14 @@ generate_code(struct brw_codegen *p,
>> >           generate_vs_urb_write(p, inst);
>> >           break;
>> >  
>> > +      case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW:
>> > +         generate_scratch_read_1oword(p, inst, dst, src[0], true);
>> > +         fill_count++;
>> > +         break;
>> > +      case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH:
>> > +         generate_scratch_read_1oword(p, inst, dst, src[0],
>> > false);
>> > +         fill_count++;
>> > +         break;
>> >        case SHADER_OPCODE_GEN4_SCRATCH_READ:
>> >           generate_scratch_read(p, inst, dst, src[0]);
>> >           fill_count++;
>> > diff --git a/src/intel/compiler/brw_vec4_reg_allocate.cpp
>> > b/src/intel/compiler/brw_vec4_reg_allocate.cpp
>> > index a0ba77b867..ec5ba10e86 100644
>> > --- a/src/intel/compiler/brw_vec4_reg_allocate.cpp
>> > +++ b/src/intel/compiler/brw_vec4_reg_allocate.cpp
>> > @@ -332,7 +332,9 @@ can_use_scratch_for_source(const
>> > vec4_instruction *inst, unsigned i,
>> >         * reusing scratch_reg for this instruction.
>> >         */
>> >        if (prev_inst->opcode == SHADER_OPCODE_GEN4_SCRATCH_WRITE ||
>> > -          prev_inst->opcode == SHADER_OPCODE_GEN4_SCRATCH_READ)
>> > +          prev_inst->opcode == SHADER_OPCODE_GEN4_SCRATCH_READ ||
>> > +          prev_inst->opcode ==
>> > VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW ||
>> > +          prev_inst->opcode ==
>> > VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH)
>> >           continue;
>> >  
>> >        /* If the previous instruction does not write to
>> > scratch_reg, then check
>> > @@ -467,6 +469,8 @@ vec4_visitor::evaluate_spill_costs(float
>> > *spill_costs, bool *no_spill)
>> >           loop_scale /= 10;
>> >           break;
>> >  
>> > +      case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW:
>> > +      case VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH:
>> >        case SHADER_OPCODE_GEN4_SCRATCH_READ:
>> >        case SHADER_OPCODE_GEN4_SCRATCH_WRITE:
>> >           for (int i = 0; i < 3; i++) {
>> > diff --git a/src/intel/compiler/brw_vec4_visitor.cpp
>> > b/src/intel/compiler/brw_vec4_visitor.cpp
>> > index 22ee4dd1c4..37ae31c0d5 100644
>> > --- a/src/intel/compiler/brw_vec4_visitor.cpp
>> > +++ b/src/intel/compiler/brw_vec4_visitor.cpp
>> > @@ -264,6 +264,24 @@ vec4_visitor::SCRATCH_READ(const dst_reg &dst,
>> > const src_reg &index)
>> >  }
>> >  
>> >  vec4_instruction *
>> > +vec4_visitor::DF_IVB_SCRATCH_READ(const dst_reg &dst,
>> > +                                  const src_reg &index,
>> > +                                  bool first_grf)
>> > +{
>> > +   vec4_instruction *inst;
>> > +   enum opcode op = first_grf ?
>> > +      VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_LOW :
>> > +      VEC4_OPCODE_GEN4_SCRATCH_READ_1OWORD_HIGH;
>> > +
>> > +   inst = new(mem_ctx) vec4_instruction(op,
>> > +                                        dst, index);
>> > +   inst->base_mrf = FIRST_SPILL_MRF(devinfo->gen) + 1;
>> > +   inst->mlen = 1;
>> > +
>> > +   return inst;
>> > +}
>> > +
>> > +vec4_instruction *
>> >  vec4_visitor::SCRATCH_WRITE(const dst_reg &dst, const src_reg
>> > &src,
>> >                              const src_reg &index)
>> >  {
>> > @@ -1472,6 +1490,37 @@ vec4_visitor::get_scratch_offset(bblock_t
>> > *block, vec4_instruction *inst,
>> >  
>> >  /**
>> >   * Emits an instruction before @inst to load the value named by
>> > @orig_src
>> > + * from scratch space at @base_offset to @temp. This instruction
>> > only reads
>> > + * DF value on IVB, one GRF each time.
>> > + *
>> > + * @base_offset is measured in 32-byte units (the size of a
>> > register).
>> > + * @first_grf indicates if we want to read first vertex data
>> > (true) or
>> > + * the second (false).
>> > + */
>> > +void
>> > +vec4_visitor::emit_1grf_df_ivb_scratch_read(bblock_t *block,
>> > +                                            vec4_instruction
>> > *inst,
>> > +                                            dst_reg temp, src_reg
>> > orig_src,
>> > +                                            int base_offset, bool
>> > first_grf)
>> > +{
>> > +   assert(orig_src.offset % REG_SIZE == 0);
>> > +   src_reg index = get_scratch_offset(block, inst, 0,
>> > base_offset);
>> > +
>> > +   assert(devinfo->gen == 7 && !devinfo->is_haswell &&
>> > type_sz(temp.type) == 8);
>> > +   temp.offset = 0;
>> > +   vec4_instruction *read = DF_IVB_SCRATCH_READ(temp, index,
>> > first_grf);
>> > +   read->exec_size = 4;
>> > +   /* The instruction will use group 0 but a different message
>> > depending of the
>> > +    * vertex data to load.
>> > +    */
>> > +   read->group = 0;
>> > +   read->offset = base_offset;
>> > +   read->size_written = 1;
>> > +   emit_before(block, inst, read);
>> > +}
>> > +
>> > +/**
>> > + * Emits an instruction before @inst to load the value named by
>> > @orig_src
>> >   * from scratch space at @base_offset to @temp.
>> >   *
>> >   * @base_offset is measured in 32-byte units (the size of a
>> > register).
>> > -- 
>> > 2.11.0

Attachment: signature.asc
Description: PGP signature

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev

Reply via email to