fs: Switch to the constant cache for uniform pull constants.

Francisco Jerez Sat, 17 Jan 2015 15:02:36 -0800

This reverts to using the oword block read messages for uniform pull constant
loads, as used to be the case until 4c1fdae0a01b3f92ec03b61aac1d3df5.  There
are two important differences though: Now the L3 cacheability bits are set up
correctly, and we target the constant cache instead of the data cache.  The
latter turns out to get no L3 way allocation on boot on most platforms, so
data cache messages are currently *not* cached on L3 regardless of the MOCS
bits, what probably explains the apparent slowness of oword fetches back then.


Constant cache loads seem to perform better than SIMD4x2 sampler loads
in a number of cases, they alleviate some of the cache thrashing
caused by the competition with textures for the L1/L2 sampler caches,
and they allow fetching up to 8 consecutive owords (128B) with just
one message.

FPS deltas relative to master are shown below for all generations since Gen6
and all oword block sizes from 1 to 8.

                        1oword  2oword  4oword  8oword

OglShMapPcf     SNB     3%      3%      5%      6%
                IVB     9%      11%     28%     41%
                BYT     5%      7%      25%     42%
                HSW     9%      9%      20%     30%
                BDW     3%      5%      19%     33%
                BSW     3%      5%      25%     44%

ubo-worst       SNB     2%      2%      2%      3%
                IVB     -85%    -71%    -71%    -71%
                BYT     14%     14%     14%     14%
                HSW     0%      -1%     -1%     -1%
                BDW     -40%    0%      0%      0%
                BSW     0%      0%      0%      0%

ubo-best        SNB     191%    190%    205%    205%
                IVB     152%    350%    474%    563%
                BYT     83%     208%    292%    353%
                HSW     292%    464%    615%    726%
                BDW     38%     267%    546%    581%
                BSW     124%    721%    1135%   580%

shader-db       HSW
  gained - lost         0       -3      -1      -1
  instruction delta in  3.44%   1.90%   -1.77%  -3.48%
  affected programs

OglShMapPcf is a PCF shadow mapping benchmark from SynMark that exercises pull
constants and texture sampling, other tests from the Finnish benchmarking
system show either a smaller improvement or no significant change.  ubo-worst
and ubo-best are the worst- and best-case scenarios of a simple microbenchmark
that reads n constants (with n between 1 and 128) from a UBO, accessing up to
2kB of memory per invocation with alignment taken into account.  Typically the
gap between master and my constant cache branch increases with the amount of
bandwidth used by the shader, with n=128 showing the greatest improvement.

IVB's apparent worst-case regression deserves an explanation.  After some
investigation it seems like it's caused by a hardware bug leading to
serialization of read requests to the L3 for the same cacheline as result of a
(on IVB buggy) mechanism of the L3 to preserve coherency.  As read requests
for matching cachelines from any L3 client are not pipelined throughput will
decrease in cases where there are no non-overlapping requests left in the
queue that can be processed in between.  I suspect that this situation is
relatively uncommon in real-world applications, as the regression disappears
completely from my microbenchmark as soon as each individual shader invocation
accesses more than two non-overlapping cachelines from L3.

To make this situation less likely we should make sure that we don't use the
1/2 oword messages at all if the shader intends to read from any other
location in the same cacheline at some other point.  This is generally a good
idea anyway on all generations because using the 1 and 2 oword messages is
expected to waste bandwidth since the minimum L3 request size for the DC is
exactly 4 owords (i.e. one cacheline.  This probably explains the negative
result in the first column for BDW).  A future commit will have this effect.
I haven't been able to find any real-world example where this would still
result in a regression, but if someone happens to find one it shouldn't be too
difficult to add an IVB-specific heuristic that falls back to using the
sampler for pull constant loads when a shader uses less than certain amount of
L3 bandwidth.
---
 src/mesa/drivers/dri/i965/brw_eu_emit.c        |  5 +-
 src/mesa/drivers/dri/i965/brw_fs.cpp           | 34 ++++-------
 src/mesa/drivers/dri/i965/brw_fs.h             |  2 +-
 src/mesa/drivers/dri/i965/brw_fs_generator.cpp | 84 ++++++++------------------
 4 files changed, 40 insertions(+), 85 deletions(-)

diff --git a/src/mesa/drivers/dri/i965/brw_eu_emit.c 
b/src/mesa/drivers/dri/i965/brw_eu_emit.c
index c2e490d..7829878 100644
--- a/src/mesa/drivers/dri/i965/brw_eu_emit.c
+++ b/src/mesa/drivers/dri/i965/brw_eu_emit.c
@@ -2194,7 +2194,7 @@ gen7_block_read_scratch(struct brw_compile *p,
 }
 
 /**
- * Read a float[4] vector from the data port Data Cache (const buffer).
+ * Read a float[4] vector from the data port constant cache.
  * Location (in buffer) should be a multiple of 16.
  * Used for fetching shader constants.
  */
@@ -2206,8 +2206,7 @@ void brw_oword_block_read(struct brw_compile *p,
 {
    struct brw_context *brw = p->brw;
    const unsigned target_cache =
-      (brw->gen >= 7 ? GEN7_SFID_DATAPORT_DATA_CACHE :
-       brw->gen >= 6 ? GEN6_SFID_DATAPORT_SAMPLER_CACHE :
+      (brw->gen >= 6 ? GEN6_SFID_DATAPORT_CONSTANT_CACHE :
        BRW_DATAPORT_READ_TARGET_DATA_CACHE);
 
    /* On newer hardware, offset is in units of owords. */
diff --git a/src/mesa/drivers/dri/i965/brw_fs.cpp 
b/src/mesa/drivers/dri/i965/brw_fs.cpp
index 35639de..4a61943 100644
--- a/src/mesa/drivers/dri/i965/brw_fs.cpp
+++ b/src/mesa/drivers/dri/i965/brw_fs.cpp
@@ -2990,32 +2990,22 @@ fs_visitor::lower_uniform_pull_constant_loads()
          continue;
 
       if (brw->gen >= 7) {
-         /* The offset arg before was a vec4-aligned byte offset.  We need to
-          * turn it into a dword offset.
-          */
-         fs_reg const_offset_reg = inst->src[1];
-         assert(const_offset_reg.file == IMM &&
-                const_offset_reg.type == BRW_REGISTER_TYPE_UD);
-         const_offset_reg.fixed_hw_reg.dw1.ud /= 4;
+         assert(inst->src[1].file == IMM &&
+                inst->src[1].type == BRW_REGISTER_TYPE_UD);
          fs_reg payload = fs_reg(this, glsl_type::uint_type);
+         fs_inst *setup = MOV(payload, retype(brw_vec8_grf(0, 0),
+                                              BRW_REGISTER_TYPE_UD));
+         setup->force_writemask_all = true;
+         setup->ir = inst->ir;
+         setup->annotation = inst->annotation;
+         inst->insert_before(block, setup);
 
-         /* We have to use a message header on Skylake to get SIMD4x2 mode.
-          * Reserve space for the register.
-          */
-         if (brw->gen >= 9) {
-            payload.reg_offset++;
-            virtual_grf_sizes[payload.reg] = 2;
-         }
-
-         /* This is actually going to be a MOV, but since only the first dword
-          * is accessed, we have a special opcode to do just that one.  Note
-          * that this needs to be an operation that will be considered a def
-          * by live variable analysis, or register allocation will explode.
+         /* The offset arg before was a vec4-aligned byte offset.  We need to
+          * turn it into an oword offset.
           */
-         fs_inst *setup = new(mem_ctx) fs_inst(FS_OPCODE_SET_SIMD4X2_OFFSET,
-                                               8, payload, const_offset_reg);
+         setup = MOV(component(payload, 2),
+                     fs_reg(inst->src[1].fixed_hw_reg.dw1.ud / 16));
          setup->force_writemask_all = true;
-
          setup->ir = inst->ir;
          setup->annotation = inst->annotation;
          inst->insert_before(block, setup);
diff --git a/src/mesa/drivers/dri/i965/brw_fs.h 
b/src/mesa/drivers/dri/i965/brw_fs.h
index 2aa58eb..8349ad2 100644
--- a/src/mesa/drivers/dri/i965/brw_fs.h
+++ b/src/mesa/drivers/dri/i965/brw_fs.h
@@ -805,7 +805,7 @@ private:
    void generate_uniform_pull_constant_load_gen7(fs_inst *inst,
                                                  struct brw_reg dst,
                                                  struct brw_reg surf_index,
-                                                 struct brw_reg offset);
+                                                 struct brw_reg payload);
    void generate_varying_pull_constant_load(fs_inst *inst, struct brw_reg dst,
                                             struct brw_reg index,
                                             struct brw_reg offset);
diff --git a/src/mesa/drivers/dri/i965/brw_fs_generator.cpp 
b/src/mesa/drivers/dri/i965/brw_fs_generator.cpp
index ab848f1..3916c1a 100644
--- a/src/mesa/drivers/dri/i965/brw_fs_generator.cpp
+++ b/src/mesa/drivers/dri/i965/brw_fs_generator.cpp
@@ -1003,44 +1003,14 @@ void
 fs_generator::generate_uniform_pull_constant_load_gen7(fs_inst *inst,
                                                        struct brw_reg dst,
                                                        struct brw_reg index,
-                                                       struct brw_reg offset)
+                                                       struct brw_reg payload)
 {
    assert(inst->mlen == 0);
    assert(index.type == BRW_REGISTER_TYPE_UD);
-
-   assert(offset.file == BRW_GENERAL_REGISTER_FILE);
-   /* Reference just the dword we need, to avoid angering validate_reg(). */
-   offset = brw_vec1_grf(offset.nr, 0);
-
-   /* We use the SIMD4x2 mode because we want to end up with 4 components in
-    * the destination loaded consecutively from the same offset (which appears
-    * in the first component, and the rest are ignored).
-    */
-   dst.width = BRW_WIDTH_4;
-
-   struct brw_reg src = offset;
-   bool header_present = false;
-   int mlen = 1;
-
-   if (brw->gen >= 9) {
-      /* Skylake requires a message header in order to use SIMD4x2 mode. */
-      src = retype(brw_vec8_grf(offset.nr - 1, 0), BRW_REGISTER_TYPE_UD);
-      mlen = 2;
-      header_present = true;
-
-      brw_push_insn_state(p);
-      brw_set_default_mask_control(p, BRW_MASK_DISABLE);
-      brw_MOV(p, src, retype(brw_vec8_grf(0, 0), BRW_REGISTER_TYPE_UD));
-      brw_set_default_access_mode(p, BRW_ALIGN_1);
-
-      brw_MOV(p, get_element_ud(src, 2),
-              brw_imm_ud(GEN9_SAMPLER_SIMD_MODE_EXTENSION_SIMD4X2));
-      brw_pop_insn_state(p);
-   }
+   assert(payload.file == BRW_GENERAL_REGISTER_FILE);
 
    if (index.file == BRW_IMMEDIATE_VALUE) {
-
-      uint32_t surf_index = index.dw1.ud;
+      const uint32_t surf_index = index.dw1.ud;
 
       brw_push_insn_state(p);
       brw_set_default_compression_control(p, BRW_COMPRESSION_NONE);
@@ -1048,22 +1018,20 @@ 
fs_generator::generate_uniform_pull_constant_load_gen7(fs_inst *inst,
       brw_inst *send = brw_next_insn(p, BRW_OPCODE_SEND);
       brw_pop_insn_state(p);
 
-      brw_set_dest(p, send, dst);
-      brw_set_src0(p, send, src);
-      brw_set_sampler_message(p, send,
+      brw_set_dest(p, send, vec8(dst));
+      brw_set_src0(p, send, payload);
+      brw_set_dp_read_message(p, send,
                               surf_index,
-                              0, /* LD message ignores sampler unit */
-                              GEN5_SAMPLER_MESSAGE_SAMPLE_LD,
-                              1, /* rlen */
-                              mlen,
-                              header_present,
-                              BRW_SAMPLER_SIMD_MODE_SIMD4X2,
-                              0);
+                              BRW_DATAPORT_OWORD_BLOCK_1_OWORDLOW,
+                              GEN7_DATAPORT_DC_OWORD_BLOCK_READ,
+                              GEN6_SFID_DATAPORT_CONSTANT_CACHE,
+                              1, /* mlen */
+                              true, /* header */
+                              1); /* rlen */
 
       brw_mark_surface_used(prog_data, surf_index);
 
    } else {
-
       struct brw_reg addr = vec1(retype(brw_address_reg(0), 
BRW_REGISTER_TYPE_UD));
 
       brw_push_insn_state(p);
@@ -1077,29 +1045,27 @@ 
fs_generator::generate_uniform_pull_constant_load_gen7(fs_inst *inst,
       brw_set_src0(p, insn_and, vec1(retype(index, BRW_REGISTER_TYPE_UD)));
       brw_set_src1(p, insn_and, brw_imm_ud(0x0ff));
 
-
       /* a0.0 |= <descriptor> */
       brw_inst *insn_or = brw_next_insn(p, BRW_OPCODE_OR);
-      brw_set_sampler_message(p, insn_or,
-                              0 /* surface */,
-                              0 /* sampler */,
-                              GEN5_SAMPLER_MESSAGE_SAMPLE_LD,
-                              1 /* rlen */,
-                              mlen,
-                              header_present,
-                              BRW_SAMPLER_SIMD_MODE_SIMD4X2,
-                              0);
+      brw_set_dp_read_message(p, insn_or,
+                              0, /* surface */
+                              BRW_DATAPORT_OWORD_BLOCK_1_OWORDLOW,
+                              GEN7_DATAPORT_DC_OWORD_BLOCK_READ,
+                              GEN6_SFID_DATAPORT_CONSTANT_CACHE,
+                              1, /* mlen */
+                              true, /* header */
+                              1); /* rlen */
       brw_inst_set_exec_size(p->brw, insn_or, BRW_EXECUTE_1);
       brw_inst_set_src1_reg_type(p->brw, insn_or, BRW_REGISTER_TYPE_UD);
       brw_set_src0(p, insn_or, addr);
       brw_set_dest(p, insn_or, addr);
 
-
-      /* dst = send(offset, a0.0) */
+      /* dst = send(payload, a0.0) */
       brw_inst *insn_send = brw_next_insn(p, BRW_OPCODE_SEND);
-      brw_set_dest(p, insn_send, dst);
-      brw_set_src0(p, insn_send, src);
-      brw_set_indirect_send_descriptor(p, insn_send, BRW_SFID_SAMPLER, addr);
+      brw_set_dest(p, insn_send, vec8(dst));
+      brw_set_src0(p, insn_send, payload);
+      brw_set_indirect_send_descriptor(p, insn_send,
+                                       GEN6_SFID_DATAPORT_CONSTANT_CACHE, 
addr);
 
       brw_pop_insn_state(p);
 
-- 
2.1.3

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev

[Mesa-dev] [PATCH 4/7] i965/fs: Switch to the constant cache for uniform pull constants.

Reply via email to