Re: [Mesa-dev] [PATCH 13/14] radeonsi: Process multiple patches per threadgroup.

Nicolai Hähnle Tue, 10 May 2016 09:56:46 -0700

On 10.05.2016 05:53, Bas Nieuwenhuizen wrote:

Using more than 1 wave per threadgroup does increase performance
generally.  Not using too many patches per threadgroup also
increases performance. Both catalyst and amdgpu-pro seem to
use 40 patches as their maximum, but I haven't really seen
any performance increase from limiting the number of patches
to 40 instead of 64.


Note that the trick where we overlap the input and output LDS
does not work anymore as the insertion of the tess factors
changes the patch stride.

So this still doesn't take into account potential LDS use by the shadersthemselves, e.g. because of alloca lowering. Perhaps this should benoted in a comment somewhere?


Signed-off-by: Bas Nieuwenhuizen <b...@basnieuwenhuizen.nl>
---
  src/gallium/drivers/radeonsi/si_state_draw.c | 42 ++++++++++++++++++----------
  1 file changed, 27 insertions(+), 15 deletions(-)

diff --git a/src/gallium/drivers/radeonsi/si_state_draw.c 
b/src/gallium/drivers/radeonsi/si_state_draw.c
index 7aa886a..3150489 100644
--- a/src/gallium/drivers/radeonsi/si_state_draw.c
+++ b/src/gallium/drivers/radeonsi/si_state_draw.c
@@ -108,20 +108,7 @@ static void si_emit_derived_tess_state(struct si_context 
*sctx,
        unsigned input_patch_size, output_patch_size, output_patch0_offset;
        unsigned perpatch_output_offset, lds_size, ls_rsrc2;
        unsigned tcs_in_layout, tcs_out_layout, tcs_out_offsets;
-       unsigned offchip_layout;
-
-       *num_patches = 1; /* TODO: calculate this */
-
-       if (sctx->last_ls == ls->current &&
-           sctx->last_tcs == tcs &&
-           sctx->last_tes_sh_base == tes_sh_base &&
-           sctx->last_num_tcs_input_cp == num_tcs_input_cp)
-               return;
-
-       sctx->last_ls = ls->current;
-       sctx->last_tcs = tcs;
-       sctx->last_tes_sh_base = tes_sh_base;
-       sctx->last_num_tcs_input_cp = num_tcs_input_cp;
+       unsigned offchip_layout, hardware_lds_size;

        /* This calculates how shader inputs and outputs among VS, TCS, and TES
         * are laid out in LDS. */
@@ -146,9 +133,23 @@ static void si_emit_derived_tess_state(struct si_context 
*sctx,
        pervertex_output_patch_size = num_tcs_output_cp * output_vertex_size;
        output_patch_size = pervertex_output_patch_size + num_tcs_patch_outputs 
* 16;

-       output_patch0_offset = sctx->tcs_shader.cso ? input_patch_size * 
*num_patches : 0;
+       *num_patches = MIN2(256 / num_tcs_input_cp, 256 / num_tcs_output_cp);
+
+       /* Make sure that the data fits in LDS. */
+       hardware_lds_size = sctx->b.chip_class >= CIK ? 65536 : 32768;
+       *num_patches = MIN2(*num_patches, hardware_lds_size / (input_patch_size 
+
+                                                              
output_patch_size));
+
+       /* Make sure the output data fits in the offchip buffer */
+       *num_patches = MIN2(*num_patches, 32768 / output_patch_size);

That's a bit too much magic constants for my taste. Can you define anamed constant somewhere that both this calculation and the offchipbuffer allocation are based on?

+       /* Not necessary for correctness, but improves performance */
+       *num_patches = MIN2(*num_patches, 64);
+
+       output_patch0_offset = input_patch_size * *num_patches;
        perpatch_output_offset = output_patch0_offset + 
pervertex_output_patch_size;

+


Extraneous blank line.

Cheers,
Nicolai

        lds_size = output_patch0_offset + output_patch_size * *num_patches;
        ls_rsrc2 = ls->current->config.rsrc2;

@@ -160,6 +161,17 @@ static void si_emit_derived_tess_state(struct si_context 
*sctx,
                ls_rsrc2 |= S_00B52C_LDS_SIZE(align(lds_size, 256) / 256);
        }

+       if (sctx->last_ls == ls->current &&
+           sctx->last_tcs == tcs &&
+           sctx->last_tes_sh_base == tes_sh_base &&
+           sctx->last_num_tcs_input_cp == num_tcs_input_cp)
+               return;
+
+       sctx->last_ls = ls->current;
+       sctx->last_tcs = tcs;
+       sctx->last_tes_sh_base = tes_sh_base;
+       sctx->last_num_tcs_input_cp = num_tcs_input_cp;
+
        /* Due to a hw bug, RSRC2_LS must be written twice with another
         * LS register written in between. */
        if (sctx->b.chip_class == CIK && sctx->b.family != CHIP_HAWAII)

_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev

Re: [Mesa-dev] [PATCH 13/14] radeonsi: Process multiple patches per threadgroup.

Reply via email to