On Sat, Aug 10, 2019 at 2:22 PM Francisco Jerez <curroje...@riseup.net> wrote:
> Jason Ekstrand <ja...@jlekstrand.net> writes: > > > On Fri, Aug 9, 2019 at 7:22 PM Francisco Jerez <curroje...@riseup.net> > > wrote: > > > >> See "i965/gen9: Optimize slice and subslice load balancing behavior." > >> for the rationale. Marked optional because no performance evaluation > >> has been done on this commit, it is provided to match the hashing > >> settings of the Iris driver. Test reports welcome. > >> > > > > Improves Aztec Ruins by 2.7% (tested 5 times in each configuration). I > > also tested Batman: Arkham City but it only reports integer FPS numbers > and > > no change was noticable. > > > > > >> --- > >> src/intel/vulkan/anv_genX.h | 4 ++ > >> src/intel/vulkan/anv_private.h | 6 ++ > >> src/intel/vulkan/genX_blorp_exec.c | 6 ++ > >> src/intel/vulkan/genX_cmd_buffer.c | 96 ++++++++++++++++++++++++++++++ > >> 4 files changed, 112 insertions(+) > >> > >> diff --git a/src/intel/vulkan/anv_genX.h b/src/intel/vulkan/anv_genX.h > >> index a5435e566a3..06c6b467acf 100644 > >> --- a/src/intel/vulkan/anv_genX.h > >> +++ b/src/intel/vulkan/anv_genX.h > >> @@ -44,6 +44,10 @@ void genX(cmd_buffer_apply_pipe_flushes)(struct > >> anv_cmd_buffer *cmd_buffer); > >> > >> void genX(cmd_buffer_emit_gen7_depth_flush)(struct anv_cmd_buffer > >> *cmd_buffer); > >> > >> +void genX(cmd_buffer_emit_hashing_mode)(struct anv_cmd_buffer > *cmd_buffer, > >> + unsigned width, unsigned > height, > >> + unsigned scale); > >> + > >> void genX(flush_pipeline_select_3d)(struct anv_cmd_buffer *cmd_buffer); > >> void genX(flush_pipeline_select_gpgpu)(struct anv_cmd_buffer > *cmd_buffer); > >> > >> diff --git a/src/intel/vulkan/anv_private.h > >> b/src/intel/vulkan/anv_private.h > >> index 2465f264354..b381386a716 100644 > >> --- a/src/intel/vulkan/anv_private.h > >> +++ b/src/intel/vulkan/anv_private.h > >> @@ -2421,6 +2421,12 @@ struct anv_cmd_state { > >> > >> bool > >> conditional_render_enabled; > >> > >> + /** > >> + * Last rendering scale argument provided to > >> + * genX(cmd_buffer_emit_hashing_mode)(). > >> + */ > >> + unsigned current_hash_scale; > >> + > >> /** > >> * Array length is anv_cmd_state::pass::attachment_count. Array > >> content is > >> * valid only when recording a render pass instance. > >> diff --git a/src/intel/vulkan/genX_blorp_exec.c > >> b/src/intel/vulkan/genX_blorp_exec.c > >> index 1592e7f7e3d..e9eedc06696 100644 > >> --- a/src/intel/vulkan/genX_blorp_exec.c > >> +++ b/src/intel/vulkan/genX_blorp_exec.c > >> @@ -223,6 +223,12 @@ genX(blorp_exec)(struct blorp_batch *batch, > >> genX(cmd_buffer_config_l3)(cmd_buffer, cfg); > >> } > >> > >> + const unsigned scale = params->fast_clear_op ? UINT_MAX : 1; > >> + if (cmd_buffer->state.current_hash_scale != scale) { > >> + genX(cmd_buffer_emit_hashing_mode)(cmd_buffer, params->x1 - > >> params->x0, > >> + params->y1 - params->y0, > scale); > >> + } > >> + > >> #if GEN_GEN >= 11 > >> /* The PIPE_CONTROL command description says: > >> * > >> diff --git a/src/intel/vulkan/genX_cmd_buffer.c > >> b/src/intel/vulkan/genX_cmd_buffer.c > >> index 86ef1663ac4..e9e5570d49f 100644 > >> --- a/src/intel/vulkan/genX_cmd_buffer.c > >> +++ b/src/intel/vulkan/genX_cmd_buffer.c > >> @@ -1595,6 +1595,7 @@ genX(CmdExecuteCommands)( > >> */ > >> primary->state.current_pipeline = UINT32_MAX; > >> primary->state.current_l3_config = NULL; > >> + primary->state.current_hash_scale = 0; > >> > >> /* Each of the secondary command buffers will use its own state base > >> * address. We need to re-emit state base address for the primary > >> after > >> @@ -2663,6 +2664,9 @@ genX(cmd_buffer_flush_state)(struct anv_cmd_buffer > >> *cmd_buffer) > >> > >> genX(cmd_buffer_config_l3)(cmd_buffer, pipeline->urb.l3_config); > >> > >> + if (cmd_buffer->state.current_hash_scale != 1) > >> + genX(cmd_buffer_emit_hashing_mode)(cmd_buffer, UINT_MAX, > UINT_MAX, > >> 1); > >> > > > > Can we do the check against current_hash_scale as an early return inside > > the emit function since that's already where the "current" value is > > updated? Maybe call it cmd_buffer_config_hashing_mode if "emit" no > longer > > seems right with that change? I don't really care about the rename. > > > > That's what my first revision did. I moved the current_hash_scale check > one level up because the function call overhead was visible in > brw_upload_pipeline_state(), which is quite a hot path in i965. That's a bit surprising but believable if the two are in different files and there's a real function call going on there. > That > said I don't have any data justifying the change in the Vulkan driver -- > Because caller and callee are defined in the same compilation unit it > *might* be that the function is inlined into > genX(cmd_buffer_flush_state), though it's large enough that it may not > I highly doubt it would be a problem. It's a lot of lines of code but once you optimize things a bit, it's actually quite a small function. > be, either way it seemed like a good idea to do things consistently > across drivers. > The functions right next to it for l3$ configuration and PIPELINE_SELECT are both called unconditionally. IMO, consistency within the Vulkan driver is more important than consistency between iris and ANV given that they have completely different state tracking models. --Jason > > > >> + > >> genX(flush_pipeline_select_3d)(cmd_buffer); > >> > >> if (vb_emit) { > >> @@ -3925,6 +3929,98 @@ genX(cmd_buffer_emit_gen7_depth_flush)(struct > >> anv_cmd_buffer *cmd_buffer) > >> } > >> } > >> > >> +/** > >> + * Update the pixel hashing modes that determine the balancing of PS > >> threads > >> + * across subslices and slices. > >> + * > >> + * \param width Width bound of the rendering area (already scaled down > if > >> \p > >> + * scale is greater than 1). > >> + * \param height Height bound of the rendering area (already scaled > down > >> if \p > >> + * scale is greater than 1). > >> + * \param scale The number of framebuffer samples that could > potentially > >> be > >> + * affected by an individual channel of the PS thread. > This > >> is > >> + * typically one for single-sampled rendering, but for > >> operations > >> + * like CCS resolves and fast clears a single PS > invocation > >> may > >> + * update a huge number of pixels, in which case a finer > >> + * balancing is desirable in order to maximally utilize > the > >> + * bandwidth available. UINT_MAX can be used as shorthand > >> for > >> + * "finest hashing mode available". > >> + */ > >> +void > >> +genX(cmd_buffer_emit_hashing_mode)(struct anv_cmd_buffer *cmd_buffer, > >> + unsigned width, unsigned height, > >> + unsigned scale) > >> +{ > >> +#if GEN_GEN == 9 > >> + const struct gen_device_info *devinfo = &cmd_buffer->device->info; > >> + const unsigned slice_hashing[] = { > >> + /* Because all Gen9 platforms with more than one slice require > >> + * three-way subslice hashing, a single "normal" 16x16 slice > hashing > >> + * block is guaranteed to suffer from substantial imbalance, with > >> one > >> + * subslice receiving twice as much work as the other two in the > >> + * slice. > >> + * > >> + * The performance impact of that would be particularly severe > when > >> + * three-way hashing is also in use for slice balancing (which is > >> the > >> + * case for all Gen9 GT4 platforms), because one of the slices > >> + * receives one every three 16x16 blocks in either direction, > which > >> + * is roughly the periodicity of the underlying subslice > imbalance > >> + * pattern ("roughly" because in reality the hardware's > >> + * implementation of three-way hashing doesn't do exact modulo 3 > >> + * arithmetic, which somewhat decreases the magnitude of this > effect > >> + * in practice). This leads to a systematic subslice imbalance > >> + * within that slice regardless of the size of the primitive. > The > >> + * 32x32 hashing mode guarantees that the subslice imbalance > within > >> a > >> + * single slice hashing block is minimal, largely eliminating > this > >> + * effect. > >> + */ > >> + _32x32, > >> + /* Finest slice hashing mode available. */ > >> + NORMAL > >> + }; > >> + const unsigned subslice_hashing[] = { > >> + /* 16x16 would provide a slight cache locality benefit especially > >> + * visible in the sampler L1 cache efficiency of low-bandwidth > >> + * non-LLC platforms, but it comes at the cost of greater > subslice > >> + * imbalance for primitives of dimensions approximately > intermediate > >> + * between 16x4 and 16x16. > >> + */ > >> + _16x4, > >> + /* Finest subslice hashing mode available. */ > >> + _8x4 > >> + }; > >> + /* Dimensions of the smallest hashing block of a given hashing mode. > >> If > >> + * the rendering area is smaller than this there can't possibly be > any > >> + * benefit from switching to this mode, so we optimize out the > >> + * transition. > >> + */ > >> + const unsigned min_size[][2] = { > >> + { 16, 4 }, > >> + { 8, 4 } > >> + }; > >> + const unsigned idx = scale > 1; > >> + > >> + if (width > min_size[idx][0] || height > min_size[idx][1]) { > >> + uint32_t gt_mode; > >> + > >> + anv_pack_struct(>_mode, GENX(GT_MODE), > >> + .SliceHashing = (devinfo->num_slices > 1 ? > >> slice_hashing[idx] : 0), > >> + .SliceHashingMask = (devinfo->num_slices > 1 ? > -1 : > >> 0), > >> + .SubsliceHashing = subslice_hashing[idx], > >> + .SubsliceHashingMask = -1); > >> + > >> + anv_batch_emit(&cmd_buffer->batch, GENX(PIPE_CONTROL), pc) { > >> + pc.StallAtPixelScoreboard = true; > >> + pc.CommandStreamerStallEnable = true; > >> + } > >> > > > > This would be better as > > > > cmd_buffer->state.pending_pipe_bits |= ANV_PIPE_CS_STALL_BIT; > > genX(cmd_buffer_apply_pipe_flushes)(cmd_buffer); > > > > That way workarounds get applied more properly and we can avoid redundant > > PIPE_CONTROLs. apply_pipe_flushes will insert the StallAtPixelScoreboard > > for you if needed. That said, maybe it's best to also OR in > > ANV_PIPE_STALL_AT_SCOREBOARD_BIT just to be sure. > > > > OK, changed locally. > > > Other than that, the patch looks good to me. Happy to see the perf > bump. :) > > > > Thanks for testing it out! > > > I did consider maybe trying to move things around so that the change is > > done more as a subpass thing but there are some subtle interactions to > > think through there and I'm convinced what you have here is correct so > > let's go with it for now. > > > > --Jason > > > > > >> + > >> + emit_lri(&cmd_buffer->batch, GENX(GT_MODE_num), gt_mode); > >> + > >> + cmd_buffer->state.current_hash_scale = scale; > >> + } > >> +#endif > >> +} > >> + > >> static void > >> cmd_buffer_emit_depth_stencil(struct anv_cmd_buffer *cmd_buffer) > >> { > >> -- > >> 2.22.0 > >> > >> _______________________________________________ > >> mesa-dev mailing list > >> mesa-dev@lists.freedesktop.org > >> https://lists.freedesktop.org/mailman/listinfo/mesa-dev >
_______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev