vs: Add instruction scheduling.

Kenneth Graunke Mon, 29 Apr 2013 20:31:52 -0700

On 04/29/2013 06:20 PM, Eric Anholt wrote:

Kenneth Graunke <kenn...@whitecape.org> writes:

On 04/23/2013 04:56 PM, Eric Anholt wrote:

While this doesn't have the detail that the FS scheduler does, and is
ignorant of dependency control, it's still good for a 0.60% +/- 0.15%
performance improvement on GLBenchmark 2.7 (n=45/47, outliers removed)
---
   src/mesa/drivers/dri/i965/Makefile.sources         |    1 +
   src/mesa/drivers/dri/i965/brw_vec4.cpp             |    9 +
   src/mesa/drivers/dri/i965/brw_vec4.h               |    1 +
   .../drivers/dri/i965/brw_vec4_reg_allocate.cpp     |    4 +
   .../dri/i965/brw_vec4_schedule_instructions.cpp    |  484 
++++++++++++++++++++
   5 files changed, 499 insertions(+)
   create mode 100644 
src/mesa/drivers/dri/i965/brw_vec4_schedule_instructions.cpp

diff --git a/src/mesa/drivers/dri/i965/Makefile.sources 
b/src/mesa/drivers/dri/i965/Makefile.sources
index be8d630..a5231ee 100644
--- a/src/mesa/drivers/dri/i965/Makefile.sources
+++ b/src/mesa/drivers/dri/i965/Makefile.sources
@@ -86,6 +86,7 @@ i965_FILES = \
        brw_vec4.cpp \
        brw_vec4_copy_propagation.cpp \
        brw_vec4_emit.cpp \
+       brw_vec4_schedule_instructions.cpp \
        brw_vec4_live_variables.cpp \
        brw_vec4_reg_allocate.cpp \
        brw_vec4_visitor.cpp \
diff --git a/src/mesa/drivers/dri/i965/brw_vec4.cpp 
b/src/mesa/drivers/dri/i965/brw_vec4.cpp
index 0afff6f..120a4e2 100644
--- a/src/mesa/drivers/dri/i965/brw_vec4.cpp
+++ b/src/mesa/drivers/dri/i965/brw_vec4.cpp
@@ -285,6 +285,13 @@ vec4_visitor::implied_mrf_writes(vec4_instruction *inst)
         return 3;
      case SHADER_OPCODE_SHADER_TIME_ADD:
         return 0;
+   case SHADER_OPCODE_TEX:
+   case SHADER_OPCODE_TXL:
+   case SHADER_OPCODE_TXD:
+   case SHADER_OPCODE_TXF:
+   case SHADER_OPCODE_TXF_MS:
+   case SHADER_OPCODE_TXS:
+      return (inst->texture_offset || inst->header_present) ? 1 : 0;


inst->header_present is actually true if inst->texture_offset != 0,
though I admit that's only obvious from the vec4_visitor side (and not
obvious when reading vec4_generator).

Either way is fine.


I think it'll make more sense to me if I just write inst->header_present
then.  Sound good?


Yep, that sounds best.

      default:
         assert(!"not reached");
         return inst->mlen;
@@ -1499,6 +1506,8 @@ vec4_visitor::run()

      opt_set_dependency_control();

+   opt_schedule_instructions();
+


Setting the "don't check dependencies" flags and then reordering all the
instructions makes me nervous.  Maybe it's safe, but I'd rather not have
to try and justify it.  Can we switch the order of these passes?


It's safe at the moment, because they're all considered to be WAW deps
of each other.

I had been thinking that we wanted the scheduler to (eventually) know
about depctrl so that it could know that you can do the series of
depctrled dp4s right next to each other with no depdendencies.  But in
order to build the tracking for that, you'd be tracking the writes on a
channel basis plus maintaining the depctrl tracking, and if you do that
you could more easily just schedule them as if there's no dependency and
know that the depctrl pass immediately afterwards will set the flags as
needed.  So I think you're right about the desired ordering.

Yeah, that makes sense to me - we just want per-channel scheduling.opt_set_dependency_control() basically makes that scheduling work out.

diff --git a/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp 
b/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp
index ac3d401..7149d46 100644
--- a/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp
+++ b/src/mesa/drivers/dri/i965/brw_vec4_reg_allocate.cpp
@@ -102,6 +102,8 @@ brw_alloc_reg_set_for_classes(struct brw_context *brw,
                              int class_count,
                              int base_reg_count)
   {
+   struct intel_context *intel = &brw->intel;
+
      /* Compute the total number of registers across all classes. */
      int ra_reg_count = 0;
      for (int i = 0; i < class_count; i++) {
@@ -112,6 +114,8 @@ brw_alloc_reg_set_for_classes(struct brw_context *brw,
      brw->vs.ra_reg_to_grf = ralloc_array(brw, uint8_t, ra_reg_count);
      ralloc_free(brw->vs.regs);
      brw->vs.regs = ra_alloc_reg_set(brw, ra_reg_count);
+   if (intel->gen >= 6)
+      ra_set_allocate_round_robin(brw->vs.regs);


Isn't this an unrelated change?  Both this and scheduling could impact
performance.


Yeah, I should separate them out.

diff --git a/src/mesa/drivers/dri/i965/brw_vec4_schedule_instructions.cpp 
b/src/mesa/drivers/dri/i965/brw_vec4_schedule_instructions.cpp
new file mode 100644
index 0000000..f1e997a
--- /dev/null
+++ b/src/mesa/drivers/dri/i965/brw_vec4_schedule_instructions.cpp
@@ -0,0 +1,484 @@


This new file has tabs.  Please replace them with spaces.


Thanks.

+    * dead code elimination anyway.
+    */
+   schedule_node *last = (schedule_node *)instructions.get_tail();
+   add_barrier_deps(last);
+
+   memset(last_grf_write, 0, sizeof(last_grf_write));
+   memset(last_mrf_write, 0, sizeof(last_mrf_write));
+
+   /* top-to-bottom dependencies: RAW and WAW. */
+   foreach_list(node, &instructions) {
+      schedule_node *n = (schedule_node *)node;
+      vec4_instruction *inst = n->inst;
+
+      /* read-after-write deps. */
+      for (int i = 0; i < 3; i++) {
+        if (inst->src[i].file == GRF) {
+           add_dep(last_grf_write[inst->src[i].reg], n);
+        } else if (inst->src[i].file == HW_REG &&
+                   (inst->src[i].fixed_hw_reg.file ==
+                    BRW_GENERAL_REGISTER_FILE)) {
+           add_dep(last_fixed_grf_write, n);
+        } else if (inst->src[i].file != BAD_FILE &&
+                   inst->src[i].file != IMM &&
+                   inst->src[i].file != UNIFORM) {
+           assert(inst->src[i].file != MRF);


I had to ask "What about ATTR?" here.  It turns out that
setup_attributes() executed earlier and lowered them all away, but it
might be nice to have an explicit assertion or comment so people don't
have to ask.


Asserted.

+      /* write-after-write deps. */
+      if (inst->dst.file == GRF) {
+        add_dep(last_grf_write[inst->dst.reg], n);
+        last_grf_write[inst->dst.reg] = n;
+      } else if (inst->dst.file == MRF) {
+        add_dep(last_mrf_write[inst->dst.reg], n);
+        last_mrf_write[inst->dst.reg] = n;
+     } else if (inst->dst.file == HW_REG &&
+                inst->dst.fixed_hw_reg.file == BRW_GENERAL_REGISTER_FILE) {
+        last_fixed_grf_write = n;
+      } else if (inst->dst.file != BAD_FILE) {
+        add_barrier_deps(n);
+      }
+
+      if (inst->mlen > 0) {
+        for (int i = 0; i < v->implied_mrf_writes(inst); i++) {


Eh?  Aren't you missing a line here?  Namely:

              add_dep(last_mrf_write[inst->base_mrf + i], n);

The FS has it, and I doubt it's superfluous.


Good catch.  Fixed.

+      /* Now that we've scheduled a new instruction, some of its
+       * children can be promoted to the list of instructions ready to
+       * be scheduled.  Update the children's unblocked time for this
+       * DAG edge as we do so.
+       */
+      for (int i = 0; i < chosen->child_count; i++) {
+        schedule_node *child = chosen->children[i];
+
+        child->unblocked_time = MAX2(child->unblocked_time,
+                                     time + chosen->child_latency[i]);
+
+        child->parent_count--;
+        if (child->parent_count == 0) {
+           instructions.push_tail(child);
+        }
+      }
+
+      /* Shared resource: the mathbox.  There's one per EU (on later
+       * generations, it's even more limited pre-gen6), so if we send


I actually don't think this is true, at least not on IVB...there's just
the ALU pipe and the EM pipe, and they're both available and can process
things...

But I suppose that's a separate change that we ought to do in the FS as
well (if it's true).


Cool, that's what I feel too, once we test it.


Excellent.

+      }
+      sched.calculate_deps();
+      sched.schedule_instructions(next_block_header);
+   }
+
+   this->live_intervals_valid = false;
+}


A final dumb observation: a lot of the scheduler boilerplate is
identical between the VS and FS.  This could probably be shared by:

1. Making schedule_node use backend_instruction pointers.
2. Creating an instruction_scheduler base class and two subclasses:

     class instruction_scheduler {
        virtual calculate_deps() = 0;
        ...everything else...
     }

     Since calculate_deps() actually looks at vec4/fs instructions, it
should probably be pure virtual and implemented by the subclasses.
Every other function is the same boilerplate and could be shared.


OK, it was a bunch of work, but there turned out to be more shared code
than I thought.  I'll send out a new series once I run some more tests.


Nice, thanks for doing that.  Sorry for the extra work.
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev

Re: [Mesa-dev] [PATCH 2/2] i965/vs: Add instruction scheduling.

Reply via email to