Re: [Mesa-dev] [PATCH] st/glsl_to_tgsi: fix the DCE pass in presence of loops
On 04/05/17 22:28, Samuel Pitoiset wrote: On 05/04/2017 02:13 PM, Samuel Pitoiset wrote: On 05/04/2017 12:26 PM, Timothy Arceri wrote: On 04/05/17 18:22, Samuel Pitoiset wrot On 05/04/2017 02:36 AM, Timothy Arceri wrote: This and the tgsi copy_propagation pass are very slow, I'd really like it if we both didn't require the pass for things to work and also didn't make it any slower. perf is showing these 2 passes at ~11% during Deus Ex start-up (cold cache). This fix shouldn't make the DCE pass really slower, shouldn't hurt. Well, those passes are useful because the visitor is lazy and it produces bunch of useless code which is going to be removed later. Sure I just mean it would be nice not to add extra reliance on this pass, in case we are able to make the visitor smarter, and not generate so much junk to begin with. I have a new approach which is much better (suggested by Nicolai). Can we not just add a DCE call to the glsl linker/compiler calls e.g. if (ctx->Const.GLSLOptimizeConservatively) { /* Run it just once. */ do_common_optimization(ir, true, false, >Const.ShaderCompilerOptions[stage], ctx->Const.NativeIntegers); ---> call DCE here to clean up since we don't call the opt passes again <--- This might help a bit but it won't remove useless TGSI code generated by the visitor. Sure but is the visitor generating code that is causing the issue you are seeing or is it the glsl IR code? The visitor. Seems you have an alternative solution which looks good, but maybe we should run the glsl dead code pass once at the end anyway? This doesn't help for this particular situation, but I guess this might help other shaders. Calling the DCE passes doesn't help. diff --git i/src/mesa/state_tracker/st_glsl_to_tgsi.cpp w/src/mesa/state_tracker/st_glsl_to_tgsi.cpp index 249584d75e..fafbfaea19 100644 --- i/src/mesa/state_tracker/st_glsl_to_tgsi.cpp +++ w/src/mesa/state_tracker/st_glsl_to_tgsi.cpp @@ -7008,6 +7008,10 @@ st_link_shader(struct gl_context *ctx, struct gl_shader_program *prog) lower_if_to_cond_assign((gl_shader_stage)i, ir, options->MaxIfDepth, if_threshold); } while (has_unsupported_control_flow(ir, options)); + + do_dead_code(ir, true); + do_dead_code_local(ir); + } else { /* Repeat it until it stops making changes. */ bool progress; shader-db results on RadeonSI: 47109 shaders in 29632 tests Totals: SGPRS: 1923308 -> 1923308 (0.00 %) VGPRS: 1133843 -> 1133839 (-0.00 %) Spilled SGPRs: 2516 -> 2516 (0.00 %) Spilled VGPRs: 65 -> 65 (0.00 %) Private memory VGPRs: 1184 -> 1184 (0.00 %) Scratch size: 1308 -> 1308 (0.00 %) dwords per thread Code Size: 60095968 -> 60095840 (-0.00 %) bytes LDS: 1077 -> 1077 (0.00 %) blocks Max Waves: 431889 -> 431889 (0.00 %) Wait states: 0 -> 0 (0.00 %) Totals from affected shaders: SGPRS: 4064 -> 4064 (0.00 %) VGPRS: 2196 -> 2192 (-0.18 %) Spilled SGPRs: 0 -> 0 (0.00 %) Spilled VGPRs: 0 -> 0 (0.00 %) Private memory VGPRs: 0 -> 0 (0.00 %) Scratch size: 0 -> 0 (0.00 %) dwords per thread Code Size: 230836 -> 230708 (-0.06 %) bytes LDS: 0 -> 0 (0.00 %) blocks Max Waves: 609 -> 609 (0.00 %) Wait states: 0 -> 0 (0.00 %) It only helps one or two Metro 2033 shaders. My thinking was it might be helpful for compilation speed. Anyway its not a bit deal. If we do manage to disable the tgsi DCE pass at some point it might be more important. For now it's probably fine as is. } else { /* Repeat it until it stops making changes. */ while (do_common_optimization(ir, true, false, >Const.ShaderCompilerOptions[stage], ctx->Const.NativeIntegers)) ; } On 03/05/17 00:43, Samuel Pitoiset wrote: The TGSI DCE pass doesn't eliminate dead assignments like MOV TEMP[0], TEMP[1] in presence of loops because it assumes that the visitor doesn't emit dead code. This assumption is actually wrong and this situation happens. However, it appears that the merge_registers() pass accidentally takes care of this for some weird reasons. But since this pass has been disabled for RadeonSI and Nouveau, the renumber_registers() pass which is called *after*, can't do its job correctly. This is because it assumes that no dead code is present. But if there is still a dead assignment, it might re-use the TEMP register id incorrectly and emits wrong code. This patches eliminates all dead assignments by tracking all temporary register reads like what the renumber_registers() actually does. The best solution would be to rewrite that DCE pass entirely but it's more work. This should fix Unigine Heaven on RadeonSI and Nouveau. shader-db results with RadeonSI: 47109 shaders in 29632 tests Totals: SGPRS: 1919548 -> 1919572 (0.00 %) VGPRS: 1139205 -> 1139209 (0.00 %) Spilled SGPRs: 1865 -> 1867 (0.11 %) Spilled VGPRs:
Re: [Mesa-dev] [PATCH] st/glsl_to_tgsi: fix the DCE pass in presence of loops
On 05/04/2017 02:13 PM, Samuel Pitoiset wrote: On 05/04/2017 12:26 PM, Timothy Arceri wrote: On 04/05/17 18:22, Samuel Pitoiset wrot On 05/04/2017 02:36 AM, Timothy Arceri wrote: This and the tgsi copy_propagation pass are very slow, I'd really like it if we both didn't require the pass for things to work and also didn't make it any slower. perf is showing these 2 passes at ~11% during Deus Ex start-up (cold cache). This fix shouldn't make the DCE pass really slower, shouldn't hurt. Well, those passes are useful because the visitor is lazy and it produces bunch of useless code which is going to be removed later. Sure I just mean it would be nice not to add extra reliance on this pass, in case we are able to make the visitor smarter, and not generate so much junk to begin with. I have a new approach which is much better (suggested by Nicolai). Can we not just add a DCE call to the glsl linker/compiler calls e.g. if (ctx->Const.GLSLOptimizeConservatively) { /* Run it just once. */ do_common_optimization(ir, true, false, >Const.ShaderCompilerOptions[stage], ctx->Const.NativeIntegers); ---> call DCE here to clean up since we don't call the opt passes again <--- This might help a bit but it won't remove useless TGSI code generated by the visitor. Sure but is the visitor generating code that is causing the issue you are seeing or is it the glsl IR code? The visitor. Seems you have an alternative solution which looks good, but maybe we should run the glsl dead code pass once at the end anyway? This doesn't help for this particular situation, but I guess this might help other shaders. Calling the DCE passes doesn't help. diff --git i/src/mesa/state_tracker/st_glsl_to_tgsi.cpp w/src/mesa/state_tracker/st_glsl_to_tgsi.cpp index 249584d75e..fafbfaea19 100644 --- i/src/mesa/state_tracker/st_glsl_to_tgsi.cpp +++ w/src/mesa/state_tracker/st_glsl_to_tgsi.cpp @@ -7008,6 +7008,10 @@ st_link_shader(struct gl_context *ctx, struct gl_shader_program *prog) lower_if_to_cond_assign((gl_shader_stage)i, ir, options->MaxIfDepth, if_threshold); } while (has_unsupported_control_flow(ir, options)); + + do_dead_code(ir, true); + do_dead_code_local(ir); + } else { /* Repeat it until it stops making changes. */ bool progress; shader-db results on RadeonSI: 47109 shaders in 29632 tests Totals: SGPRS: 1923308 -> 1923308 (0.00 %) VGPRS: 1133843 -> 1133839 (-0.00 %) Spilled SGPRs: 2516 -> 2516 (0.00 %) Spilled VGPRs: 65 -> 65 (0.00 %) Private memory VGPRs: 1184 -> 1184 (0.00 %) Scratch size: 1308 -> 1308 (0.00 %) dwords per thread Code Size: 60095968 -> 60095840 (-0.00 %) bytes LDS: 1077 -> 1077 (0.00 %) blocks Max Waves: 431889 -> 431889 (0.00 %) Wait states: 0 -> 0 (0.00 %) Totals from affected shaders: SGPRS: 4064 -> 4064 (0.00 %) VGPRS: 2196 -> 2192 (-0.18 %) Spilled SGPRs: 0 -> 0 (0.00 %) Spilled VGPRs: 0 -> 0 (0.00 %) Private memory VGPRs: 0 -> 0 (0.00 %) Scratch size: 0 -> 0 (0.00 %) dwords per thread Code Size: 230836 -> 230708 (-0.06 %) bytes LDS: 0 -> 0 (0.00 %) blocks Max Waves: 609 -> 609 (0.00 %) Wait states: 0 -> 0 (0.00 %) It only helps one or two Metro 2033 shaders. } else { /* Repeat it until it stops making changes. */ while (do_common_optimization(ir, true, false, >Const.ShaderCompilerOptions[stage], ctx->Const.NativeIntegers)) ; } On 03/05/17 00:43, Samuel Pitoiset wrote: The TGSI DCE pass doesn't eliminate dead assignments like MOV TEMP[0], TEMP[1] in presence of loops because it assumes that the visitor doesn't emit dead code. This assumption is actually wrong and this situation happens. However, it appears that the merge_registers() pass accidentally takes care of this for some weird reasons. But since this pass has been disabled for RadeonSI and Nouveau, the renumber_registers() pass which is called *after*, can't do its job correctly. This is because it assumes that no dead code is present. But if there is still a dead assignment, it might re-use the TEMP register id incorrectly and emits wrong code. This patches eliminates all dead assignments by tracking all temporary register reads like what the renumber_registers() actually does. The best solution would be to rewrite that DCE pass entirely but it's more work. This should fix Unigine Heaven on RadeonSI and Nouveau. shader-db results with RadeonSI: 47109 shaders in 29632 tests Totals: SGPRS: 1919548 -> 1919572 (0.00 %) VGPRS: 1139205 -> 1139209 (0.00 %) Spilled SGPRs: 1865 -> 1867 (0.11 %) Spilled VGPRs: 65 -> 65 (0.00 %) Private memory VGPRs: 1184 -> 1184 (0.00 %) Scratch size: 1308 -> 1308 (0.00 %) dwords per thread Code Size: 60083036 -> 60083244 (0.00 %) bytes LDS: 1077 -> 1077 (0.00 %) blocks Max Waves: 431200 -> 431200 (0.00 %) Wait states: 0 -> 0 (0.00 %)
Re: [Mesa-dev] [PATCH] st/glsl_to_tgsi: fix the DCE pass in presence of loops
On 05/04/2017 12:26 PM, Timothy Arceri wrote: On 04/05/17 18:22, Samuel Pitoiset wrot On 05/04/2017 02:36 AM, Timothy Arceri wrote: This and the tgsi copy_propagation pass are very slow, I'd really like it if we both didn't require the pass for things to work and also didn't make it any slower. perf is showing these 2 passes at ~11% during Deus Ex start-up (cold cache). This fix shouldn't make the DCE pass really slower, shouldn't hurt. Well, those passes are useful because the visitor is lazy and it produces bunch of useless code which is going to be removed later. Sure I just mean it would be nice not to add extra reliance on this pass, in case we are able to make the visitor smarter, and not generate so much junk to begin with. I have a new approach which is much better (suggested by Nicolai). Can we not just add a DCE call to the glsl linker/compiler calls e.g. if (ctx->Const.GLSLOptimizeConservatively) { /* Run it just once. */ do_common_optimization(ir, true, false, >Const.ShaderCompilerOptions[stage], ctx->Const.NativeIntegers); ---> call DCE here to clean up since we don't call the opt passes again <--- This might help a bit but it won't remove useless TGSI code generated by the visitor. Sure but is the visitor generating code that is causing the issue you are seeing or is it the glsl IR code? The visitor. Seems you have an alternative solution which looks good, but maybe we should run the glsl dead code pass once at the end anyway? This doesn't help for this particular situation, but I guess this might help other shaders. } else { /* Repeat it until it stops making changes. */ while (do_common_optimization(ir, true, false, >Const.ShaderCompilerOptions[stage], ctx->Const.NativeIntegers)) ; } On 03/05/17 00:43, Samuel Pitoiset wrote: The TGSI DCE pass doesn't eliminate dead assignments like MOV TEMP[0], TEMP[1] in presence of loops because it assumes that the visitor doesn't emit dead code. This assumption is actually wrong and this situation happens. However, it appears that the merge_registers() pass accidentally takes care of this for some weird reasons. But since this pass has been disabled for RadeonSI and Nouveau, the renumber_registers() pass which is called *after*, can't do its job correctly. This is because it assumes that no dead code is present. But if there is still a dead assignment, it might re-use the TEMP register id incorrectly and emits wrong code. This patches eliminates all dead assignments by tracking all temporary register reads like what the renumber_registers() actually does. The best solution would be to rewrite that DCE pass entirely but it's more work. This should fix Unigine Heaven on RadeonSI and Nouveau. shader-db results with RadeonSI: 47109 shaders in 29632 tests Totals: SGPRS: 1919548 -> 1919572 (0.00 %) VGPRS: 1139205 -> 1139209 (0.00 %) Spilled SGPRs: 1865 -> 1867 (0.11 %) Spilled VGPRs: 65 -> 65 (0.00 %) Private memory VGPRs: 1184 -> 1184 (0.00 %) Scratch size: 1308 -> 1308 (0.00 %) dwords per thread Code Size: 60083036 -> 60083244 (0.00 %) bytes LDS: 1077 -> 1077 (0.00 %) blocks Max Waves: 431200 -> 431200 (0.00 %) Wait states: 0 -> 0 (0.00 %) It's still interesting to disable the merge_registers() pass. Signed-off-by: Samuel Pitoiset--- src/mesa/state_tracker/st_glsl_to_tgsi.cpp | 36 +- 1 file changed, 31 insertions(+), 5 deletions(-) diff --git a/src/mesa/state_tracker/st_glsl_to_tgsi.cpp b/src/mesa/state_tracker/st_glsl_to_tgsi.cpp index 0f8688a41c..01b5a4dc98 100644 --- a/src/mesa/state_tracker/st_glsl_to_tgsi.cpp +++ b/src/mesa/state_tracker/st_glsl_to_tgsi.cpp @@ -5085,9 +5085,13 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) glsl_to_tgsi_instruction *, this->next_temp * 4); int *write_level = rzalloc_array(mem_ctx, int, this->next_temp * 4); + int *first_reads = rzalloc_array(mem_ctx, int, this->next_temp); int level = 0; int removed = 0; + for (int i = 0; i < this->next_temp; i++) + first_reads[i] = -1; + foreach_in_list(glsl_to_tgsi_instruction, inst, >instructions) { assert(inst->dst[0].file != PROGRAM_TEMPORARY || inst->dst[0].index < this->next_temp); @@ -5148,8 +5152,12 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) src_chans |= 1 << GET_SWZ(inst->src[i].swizzle, 3); for (int c = 0; c < 4; c++) { - if (src_chans & (1 << c)) + if (src_chans & (1 << c)) { writes[4 * inst->src[i].index + c] = NULL; + + if (first_reads[inst->src[i].index] == -1) +first_reads[inst->src[i].index] = level; + } } } } @@ -5167,8 +5175,12 @@
Re: [Mesa-dev] [PATCH] st/glsl_to_tgsi: fix the DCE pass in presence of loops
On 04/05/17 18:22, Samuel Pitoiset wrot On 05/04/2017 02:36 AM, Timothy Arceri wrote: This and the tgsi copy_propagation pass are very slow, I'd really like it if we both didn't require the pass for things to work and also didn't make it any slower. perf is showing these 2 passes at ~11% during Deus Ex start-up (cold cache). This fix shouldn't make the DCE pass really slower, shouldn't hurt. Well, those passes are useful because the visitor is lazy and it produces bunch of useless code which is going to be removed later. Sure I just mean it would be nice not to add extra reliance on this pass, in case we are able to make the visitor smarter, and not generate so much junk to begin with. Can we not just add a DCE call to the glsl linker/compiler calls e.g. if (ctx->Const.GLSLOptimizeConservatively) { /* Run it just once. */ do_common_optimization(ir, true, false, >Const.ShaderCompilerOptions[stage], ctx->Const.NativeIntegers); ---> call DCE here to clean up since we don't call the opt passes again <--- This might help a bit but it won't remove useless TGSI code generated by the visitor. Sure but is the visitor generating code that is causing the issue you are seeing or is it the glsl IR code? Seems you have an alternative solution which looks good, but maybe we should run the glsl dead code pass once at the end anyway? } else { /* Repeat it until it stops making changes. */ while (do_common_optimization(ir, true, false, >Const.ShaderCompilerOptions[stage], ctx->Const.NativeIntegers)) ; } On 03/05/17 00:43, Samuel Pitoiset wrote: The TGSI DCE pass doesn't eliminate dead assignments like MOV TEMP[0], TEMP[1] in presence of loops because it assumes that the visitor doesn't emit dead code. This assumption is actually wrong and this situation happens. However, it appears that the merge_registers() pass accidentally takes care of this for some weird reasons. But since this pass has been disabled for RadeonSI and Nouveau, the renumber_registers() pass which is called *after*, can't do its job correctly. This is because it assumes that no dead code is present. But if there is still a dead assignment, it might re-use the TEMP register id incorrectly and emits wrong code. This patches eliminates all dead assignments by tracking all temporary register reads like what the renumber_registers() actually does. The best solution would be to rewrite that DCE pass entirely but it's more work. This should fix Unigine Heaven on RadeonSI and Nouveau. shader-db results with RadeonSI: 47109 shaders in 29632 tests Totals: SGPRS: 1919548 -> 1919572 (0.00 %) VGPRS: 1139205 -> 1139209 (0.00 %) Spilled SGPRs: 1865 -> 1867 (0.11 %) Spilled VGPRs: 65 -> 65 (0.00 %) Private memory VGPRs: 1184 -> 1184 (0.00 %) Scratch size: 1308 -> 1308 (0.00 %) dwords per thread Code Size: 60083036 -> 60083244 (0.00 %) bytes LDS: 1077 -> 1077 (0.00 %) blocks Max Waves: 431200 -> 431200 (0.00 %) Wait states: 0 -> 0 (0.00 %) It's still interesting to disable the merge_registers() pass. Signed-off-by: Samuel Pitoiset--- src/mesa/state_tracker/st_glsl_to_tgsi.cpp | 36 +- 1 file changed, 31 insertions(+), 5 deletions(-) diff --git a/src/mesa/state_tracker/st_glsl_to_tgsi.cpp b/src/mesa/state_tracker/st_glsl_to_tgsi.cpp index 0f8688a41c..01b5a4dc98 100644 --- a/src/mesa/state_tracker/st_glsl_to_tgsi.cpp +++ b/src/mesa/state_tracker/st_glsl_to_tgsi.cpp @@ -5085,9 +5085,13 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) glsl_to_tgsi_instruction *, this->next_temp * 4); int *write_level = rzalloc_array(mem_ctx, int, this->next_temp * 4); + int *first_reads = rzalloc_array(mem_ctx, int, this->next_temp); int level = 0; int removed = 0; + for (int i = 0; i < this->next_temp; i++) + first_reads[i] = -1; + foreach_in_list(glsl_to_tgsi_instruction, inst, >instructions) { assert(inst->dst[0].file != PROGRAM_TEMPORARY || inst->dst[0].index < this->next_temp); @@ -5148,8 +5152,12 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) src_chans |= 1 << GET_SWZ(inst->src[i].swizzle, 3); for (int c = 0; c < 4; c++) { - if (src_chans & (1 << c)) + if (src_chans & (1 << c)) { writes[4 * inst->src[i].index + c] = NULL; + + if (first_reads[inst->src[i].index] == -1) +first_reads[inst->src[i].index] = level; + } } } } @@ -5167,8 +5175,12 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) src_chans |= 1 << GET_SWZ(inst->tex_offsets[i].swizzle, 3); for (int c = 0; c < 4; c++)
Re: [Mesa-dev] [PATCH] st/glsl_to_tgsi: fix the DCE pass in presence of loops
On 05/04/2017 02:36 AM, Timothy Arceri wrote: This and the tgsi copy_propagation pass are very slow, I'd really like it if we both didn't require the pass for things to work and also didn't make it any slower. perf is showing these 2 passes at ~11% during Deus Ex start-up (cold cache). This fix shouldn't make the DCE pass really slower, shouldn't hurt. Well, those passes are useful because the visitor is lazy and it produces bunch of useless code which is going to be removed later. Can we not just add a DCE call to the glsl linker/compiler calls e.g. if (ctx->Const.GLSLOptimizeConservatively) { /* Run it just once. */ do_common_optimization(ir, true, false, >Const.ShaderCompilerOptions[stage], ctx->Const.NativeIntegers); ---> call DCE here to clean up since we don't call the opt passes again <--- This might help a bit but it won't remove useless TGSI code generated by the visitor. } else { /* Repeat it until it stops making changes. */ while (do_common_optimization(ir, true, false, >Const.ShaderCompilerOptions[stage], ctx->Const.NativeIntegers)) ; } On 03/05/17 00:43, Samuel Pitoiset wrote: The TGSI DCE pass doesn't eliminate dead assignments like MOV TEMP[0], TEMP[1] in presence of loops because it assumes that the visitor doesn't emit dead code. This assumption is actually wrong and this situation happens. However, it appears that the merge_registers() pass accidentally takes care of this for some weird reasons. But since this pass has been disabled for RadeonSI and Nouveau, the renumber_registers() pass which is called *after*, can't do its job correctly. This is because it assumes that no dead code is present. But if there is still a dead assignment, it might re-use the TEMP register id incorrectly and emits wrong code. This patches eliminates all dead assignments by tracking all temporary register reads like what the renumber_registers() actually does. The best solution would be to rewrite that DCE pass entirely but it's more work. This should fix Unigine Heaven on RadeonSI and Nouveau. shader-db results with RadeonSI: 47109 shaders in 29632 tests Totals: SGPRS: 1919548 -> 1919572 (0.00 %) VGPRS: 1139205 -> 1139209 (0.00 %) Spilled SGPRs: 1865 -> 1867 (0.11 %) Spilled VGPRs: 65 -> 65 (0.00 %) Private memory VGPRs: 1184 -> 1184 (0.00 %) Scratch size: 1308 -> 1308 (0.00 %) dwords per thread Code Size: 60083036 -> 60083244 (0.00 %) bytes LDS: 1077 -> 1077 (0.00 %) blocks Max Waves: 431200 -> 431200 (0.00 %) Wait states: 0 -> 0 (0.00 %) It's still interesting to disable the merge_registers() pass. Signed-off-by: Samuel Pitoiset--- src/mesa/state_tracker/st_glsl_to_tgsi.cpp | 36 +- 1 file changed, 31 insertions(+), 5 deletions(-) diff --git a/src/mesa/state_tracker/st_glsl_to_tgsi.cpp b/src/mesa/state_tracker/st_glsl_to_tgsi.cpp index 0f8688a41c..01b5a4dc98 100644 --- a/src/mesa/state_tracker/st_glsl_to_tgsi.cpp +++ b/src/mesa/state_tracker/st_glsl_to_tgsi.cpp @@ -5085,9 +5085,13 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) glsl_to_tgsi_instruction *, this->next_temp * 4); int *write_level = rzalloc_array(mem_ctx, int, this->next_temp * 4); + int *first_reads = rzalloc_array(mem_ctx, int, this->next_temp); int level = 0; int removed = 0; + for (int i = 0; i < this->next_temp; i++) + first_reads[i] = -1; + foreach_in_list(glsl_to_tgsi_instruction, inst, >instructions) { assert(inst->dst[0].file != PROGRAM_TEMPORARY || inst->dst[0].index < this->next_temp); @@ -5148,8 +5152,12 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) src_chans |= 1 << GET_SWZ(inst->src[i].swizzle, 3); for (int c = 0; c < 4; c++) { - if (src_chans & (1 << c)) + if (src_chans & (1 << c)) { writes[4 * inst->src[i].index + c] = NULL; + + if (first_reads[inst->src[i].index] == -1) +first_reads[inst->src[i].index] = level; + } } } } @@ -5167,8 +5175,12 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) src_chans |= 1 << GET_SWZ(inst->tex_offsets[i].swizzle, 3); for (int c = 0; c < 4; c++) { - if (src_chans & (1 << c)) + if (src_chans & (1 << c)) { writes[4 * inst->tex_offsets[i].index + c] = NULL; + + if (first_reads[inst->tex_offsets[i].index] == -1) +first_reads[inst->tex_offsets[i].index] = level; + } } } } @@ -5211,17
Re: [Mesa-dev] [PATCH] st/glsl_to_tgsi: fix the DCE pass in presence of loops
On 04/05/17 10:43, Ilia Mirkin wrote: I looked into removing these passes a while back. A little detail here is that DCE has to get run at least a little bit for correctness reasons -- some (lazy) glsl -> tgsi Is there anyway to improve this? I was actually looking at this a little yesterday but to be honest I was getting a little lost in the code. It wasn't entirely clear to me how this conversion pass works and if we can stop doing this in the first place. glsl_to_tgsi_visitor::visit_expression(ir_expression* ir, st_src_reg *op) and glsl_to_tgsi_visitor::visit(ir_texture *ir) Have a comment: /* Storage for our result. Ideally for an assignment we'd be using * the actual storage for the result here, instead. */ The code all seems to be pretty much a copy of ir_to_mesa which has the same limitations. code uses the dead_mask with the assumption that DCE is run later. Another observation was that while, indeed, those passes take a while, they also remove a bunch of code. Not running those passes makes them disappear from the profile, but then the logic down the line has more work to do. It's not as obvious of a win as one might think. Yes I noticed this also :) Additionally, at least for nouveau, this logic does better than nouveau's codegen in some cases. (Largely due to the fact that system values get copy-propagated here but they don't by codegen, generally, which can lead to large register live intervals. At least that's what I recall from my analysis a while ago.) This was my hack-branch: https://github.com/imirkin/mesa/commits/st-pass-cleanup On Wed, May 3, 2017 at 8:36 PM, Timothy Arceriwrote: This and the tgsi copy_propagation pass are very slow, I'd really like it if we both didn't require the pass for things to work and also didn't make it any slower. perf is showing these 2 passes at ~11% during Deus Ex start-up (cold cache). Can we not just add a DCE call to the glsl linker/compiler calls e.g. if (ctx->Const.GLSLOptimizeConservatively) { /* Run it just once. */ do_common_optimization(ir, true, false, >Const.ShaderCompilerOptions[stage], ctx->Const.NativeIntegers); ---> call DCE here to clean up since we don't call the opt passes again <--- } else { /* Repeat it until it stops making changes. */ while (do_common_optimization(ir, true, false, >Const.ShaderCompilerOptions[stage], ctx->Const.NativeIntegers)) ; } On 03/05/17 00:43, Samuel Pitoiset wrote: The TGSI DCE pass doesn't eliminate dead assignments like MOV TEMP[0], TEMP[1] in presence of loops because it assumes that the visitor doesn't emit dead code. This assumption is actually wrong and this situation happens. However, it appears that the merge_registers() pass accidentally takes care of this for some weird reasons. But since this pass has been disabled for RadeonSI and Nouveau, the renumber_registers() pass which is called *after*, can't do its job correctly. This is because it assumes that no dead code is present. But if there is still a dead assignment, it might re-use the TEMP register id incorrectly and emits wrong code. This patches eliminates all dead assignments by tracking all temporary register reads like what the renumber_registers() actually does. The best solution would be to rewrite that DCE pass entirely but it's more work. This should fix Unigine Heaven on RadeonSI and Nouveau. shader-db results with RadeonSI: 47109 shaders in 29632 tests Totals: SGPRS: 1919548 -> 1919572 (0.00 %) VGPRS: 1139205 -> 1139209 (0.00 %) Spilled SGPRs: 1865 -> 1867 (0.11 %) Spilled VGPRs: 65 -> 65 (0.00 %) Private memory VGPRs: 1184 -> 1184 (0.00 %) Scratch size: 1308 -> 1308 (0.00 %) dwords per thread Code Size: 60083036 -> 60083244 (0.00 %) bytes LDS: 1077 -> 1077 (0.00 %) blocks Max Waves: 431200 -> 431200 (0.00 %) Wait states: 0 -> 0 (0.00 %) It's still interesting to disable the merge_registers() pass. Signed-off-by: Samuel Pitoiset --- src/mesa/state_tracker/st_glsl_to_tgsi.cpp | 36 +- 1 file changed, 31 insertions(+), 5 deletions(-) diff --git a/src/mesa/state_tracker/st_glsl_to_tgsi.cpp b/src/mesa/state_tracker/st_glsl_to_tgsi.cpp index 0f8688a41c..01b5a4dc98 100644 --- a/src/mesa/state_tracker/st_glsl_to_tgsi.cpp +++ b/src/mesa/state_tracker/st_glsl_to_tgsi.cpp @@ -5085,9 +5085,13 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) glsl_to_tgsi_instruction *, this->next_temp * 4); int *write_level = rzalloc_array(mem_ctx, int, this->next_temp * 4); + int *first_reads = rzalloc_array(mem_ctx, int, this->next_temp); int level = 0; int removed = 0; + for (int i = 0; i < this->next_temp; i++) + first_reads[i] = -1; +
Re: [Mesa-dev] [PATCH] st/glsl_to_tgsi: fix the DCE pass in presence of loops
I looked into removing these passes a while back. A little detail here is that DCE has to get run at least a little bit for correctness reasons -- some (lazy) glsl -> tgsi code uses the dead_mask with the assumption that DCE is run later. Another observation was that while, indeed, those passes take a while, they also remove a bunch of code. Not running those passes makes them disappear from the profile, but then the logic down the line has more work to do. It's not as obvious of a win as one might think. Additionally, at least for nouveau, this logic does better than nouveau's codegen in some cases. (Largely due to the fact that system values get copy-propagated here but they don't by codegen, generally, which can lead to large register live intervals. At least that's what I recall from my analysis a while ago.) This was my hack-branch: https://github.com/imirkin/mesa/commits/st-pass-cleanup On Wed, May 3, 2017 at 8:36 PM, Timothy Arceriwrote: > This and the tgsi copy_propagation pass are very slow, I'd really like it if > we both didn't require the pass for things to work and also didn't make it > any slower. perf is showing these 2 passes at ~11% during Deus Ex start-up > (cold cache). > > Can we not just add a DCE call to the glsl linker/compiler calls e.g. > >if (ctx->Const.GLSLOptimizeConservatively) { > /* Run it just once. */ > do_common_optimization(ir, true, false, > >Const.ShaderCompilerOptions[stage], > ctx->Const.NativeIntegers); > > ---> call DCE here to clean up since we don't call the opt passes again > <--- > >} else { > /* Repeat it until it stops making changes. */ > while (do_common_optimization(ir, true, false, > > >Const.ShaderCompilerOptions[stage], > ctx->Const.NativeIntegers)) > ; > >} > > On 03/05/17 00:43, Samuel Pitoiset wrote: >> >> The TGSI DCE pass doesn't eliminate dead assignments like >> MOV TEMP[0], TEMP[1] in presence of loops because it assumes >> that the visitor doesn't emit dead code. This assumption is >> actually wrong and this situation happens. >> >> However, it appears that the merge_registers() pass accidentally >> takes care of this for some weird reasons. But since this pass has >> been disabled for RadeonSI and Nouveau, the renumber_registers() >> pass which is called *after*, can't do its job correctly. >> >> This is because it assumes that no dead code is present. But if >> there is still a dead assignment, it might re-use the TEMP >> register id incorrectly and emits wrong code. >> >> This patches eliminates all dead assignments by tracking >> all temporary register reads like what the renumber_registers() >> actually does. The best solution would be to rewrite that DCE >> pass entirely but it's more work. >> >> This should fix Unigine Heaven on RadeonSI and Nouveau. >> >> shader-db results with RadeonSI: >> >> 47109 shaders in 29632 tests >> Totals: >> SGPRS: 1919548 -> 1919572 (0.00 %) >> VGPRS: 1139205 -> 1139209 (0.00 %) >> Spilled SGPRs: 1865 -> 1867 (0.11 %) >> Spilled VGPRs: 65 -> 65 (0.00 %) >> Private memory VGPRs: 1184 -> 1184 (0.00 %) >> Scratch size: 1308 -> 1308 (0.00 %) dwords per thread >> Code Size: 60083036 -> 60083244 (0.00 %) bytes >> LDS: 1077 -> 1077 (0.00 %) blocks >> Max Waves: 431200 -> 431200 (0.00 %) >> Wait states: 0 -> 0 (0.00 %) >> >> It's still interesting to disable the merge_registers() pass. >> >> Signed-off-by: Samuel Pitoiset >> --- >> src/mesa/state_tracker/st_glsl_to_tgsi.cpp | 36 >> +- >> 1 file changed, 31 insertions(+), 5 deletions(-) >> >> diff --git a/src/mesa/state_tracker/st_glsl_to_tgsi.cpp >> b/src/mesa/state_tracker/st_glsl_to_tgsi.cpp >> index 0f8688a41c..01b5a4dc98 100644 >> --- a/src/mesa/state_tracker/st_glsl_to_tgsi.cpp >> +++ b/src/mesa/state_tracker/st_glsl_to_tgsi.cpp >> @@ -5085,9 +5085,13 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) >> >> glsl_to_tgsi_instruction *, >>this->next_temp * >> 4); >> int *write_level = rzalloc_array(mem_ctx, int, this->next_temp * 4); >> + int *first_reads = rzalloc_array(mem_ctx, int, this->next_temp); >> int level = 0; >> int removed = 0; >> + for (int i = 0; i < this->next_temp; i++) >> + first_reads[i] = -1; >> + >> foreach_in_list(glsl_to_tgsi_instruction, inst, >instructions) >> { >> assert(inst->dst[0].file != PROGRAM_TEMPORARY >>|| inst->dst[0].index < this->next_temp); >> @@ -5148,8 +5152,12 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) >> src_chans |= 1 << GET_SWZ(inst->src[i].swizzle, 3); >>for (int c = 0; c < 4; c++) { >> - if (src_chans & (1 << c)) >> + if (src_chans & (1 << c)) { >>writes[4 *
Re: [Mesa-dev] [PATCH] st/glsl_to_tgsi: fix the DCE pass in presence of loops
This and the tgsi copy_propagation pass are very slow, I'd really like it if we both didn't require the pass for things to work and also didn't make it any slower. perf is showing these 2 passes at ~11% during Deus Ex start-up (cold cache). Can we not just add a DCE call to the glsl linker/compiler calls e.g. if (ctx->Const.GLSLOptimizeConservatively) { /* Run it just once. */ do_common_optimization(ir, true, false, >Const.ShaderCompilerOptions[stage], ctx->Const.NativeIntegers); ---> call DCE here to clean up since we don't call the opt passes again <--- } else { /* Repeat it until it stops making changes. */ while (do_common_optimization(ir, true, false, >Const.ShaderCompilerOptions[stage], ctx->Const.NativeIntegers)) ; } On 03/05/17 00:43, Samuel Pitoiset wrote: The TGSI DCE pass doesn't eliminate dead assignments like MOV TEMP[0], TEMP[1] in presence of loops because it assumes that the visitor doesn't emit dead code. This assumption is actually wrong and this situation happens. However, it appears that the merge_registers() pass accidentally takes care of this for some weird reasons. But since this pass has been disabled for RadeonSI and Nouveau, the renumber_registers() pass which is called *after*, can't do its job correctly. This is because it assumes that no dead code is present. But if there is still a dead assignment, it might re-use the TEMP register id incorrectly and emits wrong code. This patches eliminates all dead assignments by tracking all temporary register reads like what the renumber_registers() actually does. The best solution would be to rewrite that DCE pass entirely but it's more work. This should fix Unigine Heaven on RadeonSI and Nouveau. shader-db results with RadeonSI: 47109 shaders in 29632 tests Totals: SGPRS: 1919548 -> 1919572 (0.00 %) VGPRS: 1139205 -> 1139209 (0.00 %) Spilled SGPRs: 1865 -> 1867 (0.11 %) Spilled VGPRs: 65 -> 65 (0.00 %) Private memory VGPRs: 1184 -> 1184 (0.00 %) Scratch size: 1308 -> 1308 (0.00 %) dwords per thread Code Size: 60083036 -> 60083244 (0.00 %) bytes LDS: 1077 -> 1077 (0.00 %) blocks Max Waves: 431200 -> 431200 (0.00 %) Wait states: 0 -> 0 (0.00 %) It's still interesting to disable the merge_registers() pass. Signed-off-by: Samuel Pitoiset--- src/mesa/state_tracker/st_glsl_to_tgsi.cpp | 36 +- 1 file changed, 31 insertions(+), 5 deletions(-) diff --git a/src/mesa/state_tracker/st_glsl_to_tgsi.cpp b/src/mesa/state_tracker/st_glsl_to_tgsi.cpp index 0f8688a41c..01b5a4dc98 100644 --- a/src/mesa/state_tracker/st_glsl_to_tgsi.cpp +++ b/src/mesa/state_tracker/st_glsl_to_tgsi.cpp @@ -5085,9 +5085,13 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) glsl_to_tgsi_instruction *, this->next_temp * 4); int *write_level = rzalloc_array(mem_ctx, int, this->next_temp * 4); + int *first_reads = rzalloc_array(mem_ctx, int, this->next_temp); int level = 0; int removed = 0; + for (int i = 0; i < this->next_temp; i++) + first_reads[i] = -1; + foreach_in_list(glsl_to_tgsi_instruction, inst, >instructions) { assert(inst->dst[0].file != PROGRAM_TEMPORARY || inst->dst[0].index < this->next_temp); @@ -5148,8 +5152,12 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) src_chans |= 1 << GET_SWZ(inst->src[i].swizzle, 3); for (int c = 0; c < 4; c++) { - if (src_chans & (1 << c)) + if (src_chans & (1 << c)) { writes[4 * inst->src[i].index + c] = NULL; + + if (first_reads[inst->src[i].index] == -1) +first_reads[inst->src[i].index] = level; + } } } } @@ -5167,8 +5175,12 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) src_chans |= 1 << GET_SWZ(inst->tex_offsets[i].swizzle, 3); for (int c = 0; c < 4; c++) { - if (src_chans & (1 << c)) + if (src_chans & (1 << c)) { writes[4 * inst->tex_offsets[i].index + c] = NULL; + + if (first_reads[inst->tex_offsets[i].index] == -1) +first_reads[inst->tex_offsets[i].index] = level; + } } } } @@ -5211,17 +5223,30 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) * the writemask of other instructions with dead channels. */ foreach_in_list_safe(glsl_to_tgsi_instruction, inst, >instructions) { - if (!inst->dead_mask || !inst->dst[0].writemask) + bool dead_inst = false; + + if (!inst->dst[0].writemask) continue; +
Re: [Mesa-dev] [PATCH] st/glsl_to_tgsi: fix the DCE pass in presence of loops
On 05/03/2017 06:21 PM, Nicolai Hähnle wrote: On 02.05.2017 16:43, Samuel Pitoiset wrote: The TGSI DCE pass doesn't eliminate dead assignments like MOV TEMP[0], TEMP[1] in presence of loops because it assumes that the visitor doesn't emit dead code. This assumption is actually wrong and this situation happens. However, it appears that the merge_registers() pass accidentally takes care of this for some weird reasons. But since this pass has been disabled for RadeonSI and Nouveau, the renumber_registers() pass which is called *after*, can't do its job correctly. This is because it assumes that no dead code is present. But if there is still a dead assignment, it might re-use the TEMP register id incorrectly and emits wrong code. This patches eliminates all dead assignments by tracking all temporary register reads like what the renumber_registers() actually does. The best solution would be to rewrite that DCE pass entirely but it's more work. This should fix Unigine Heaven on RadeonSI and Nouveau. shader-db results with RadeonSI: 47109 shaders in 29632 tests Totals: SGPRS: 1919548 -> 1919572 (0.00 %) VGPRS: 1139205 -> 1139209 (0.00 %) Spilled SGPRs: 1865 -> 1867 (0.11 %) Spilled VGPRs: 65 -> 65 (0.00 %) Private memory VGPRs: 1184 -> 1184 (0.00 %) Scratch size: 1308 -> 1308 (0.00 %) dwords per thread Code Size: 60083036 -> 60083244 (0.00 %) bytes LDS: 1077 -> 1077 (0.00 %) blocks Max Waves: 431200 -> 431200 (0.00 %) Wait states: 0 -> 0 (0.00 %) It's still interesting to disable the merge_registers() pass. Signed-off-by: Samuel Pitoiset--- src/mesa/state_tracker/st_glsl_to_tgsi.cpp | 36 +- 1 file changed, 31 insertions(+), 5 deletions(-) diff --git a/src/mesa/state_tracker/st_glsl_to_tgsi.cpp b/src/mesa/state_tracker/st_glsl_to_tgsi.cpp index 0f8688a41c..01b5a4dc98 100644 --- a/src/mesa/state_tracker/st_glsl_to_tgsi.cpp +++ b/src/mesa/state_tracker/st_glsl_to_tgsi.cpp @@ -5085,9 +5085,13 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) glsl_to_tgsi_instruction *, this->next_temp * 4); int *write_level = rzalloc_array(mem_ctx, int, this->next_temp * 4); + int *first_reads = rzalloc_array(mem_ctx, int, this->next_temp); int level = 0; int removed = 0; + for (int i = 0; i < this->next_temp; i++) + first_reads[i] = -1; As discussed offline, it would be good to fix renumber_registers to work without this assumption. I will have a look at this approach. If you want to go ahead with this patch anyway / on top of this: the value of first_reads doesn't seem to matter, so why not make it an array of bool instead? An array of bool is better. Cheers, Nicolai + foreach_in_list(glsl_to_tgsi_instruction, inst, >instructions) { assert(inst->dst[0].file != PROGRAM_TEMPORARY || inst->dst[0].index < this->next_temp); @@ -5148,8 +5152,12 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) src_chans |= 1 << GET_SWZ(inst->src[i].swizzle, 3); for (int c = 0; c < 4; c++) { - if (src_chans & (1 << c)) + if (src_chans & (1 << c)) { writes[4 * inst->src[i].index + c] = NULL; + + if (first_reads[inst->src[i].index] == -1) +first_reads[inst->src[i].index] = level; + } } } } @@ -5167,8 +5175,12 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) src_chans |= 1 << GET_SWZ(inst->tex_offsets[i].swizzle, 3); for (int c = 0; c < 4; c++) { - if (src_chans & (1 << c)) + if (src_chans & (1 << c)) { writes[4 * inst->tex_offsets[i].index + c] = NULL; + + if (first_reads[inst->tex_offsets[i].index] == -1) +first_reads[inst->tex_offsets[i].index] = level; + } } } } @@ -5211,17 +5223,30 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) * the writemask of other instructions with dead channels. */ foreach_in_list_safe(glsl_to_tgsi_instruction, inst, >instructions) { - if (!inst->dead_mask || !inst->dst[0].writemask) + bool dead_inst = false; + + if (!inst->dst[0].writemask) continue; + /* No amount of dead masks should remove memory stores */ if (inst->info->is_store) continue; - if ((inst->dst[0].writemask & ~inst->dead_mask) == 0) { + if (!inst->dead_mask) { + if (inst->dst[0].file == PROGRAM_TEMPORARY && + first_reads[inst->dst[0].index] == -1) { +dead_inst = true; + } + } else { + if ((inst->dst[0].writemask & ~inst->dead_mask) == 0) +dead_inst = true; + } + + if (dead_inst) {
Re: [Mesa-dev] [PATCH] st/glsl_to_tgsi: fix the DCE pass in presence of loops
On 02.05.2017 16:43, Samuel Pitoiset wrote: The TGSI DCE pass doesn't eliminate dead assignments like MOV TEMP[0], TEMP[1] in presence of loops because it assumes that the visitor doesn't emit dead code. This assumption is actually wrong and this situation happens. However, it appears that the merge_registers() pass accidentally takes care of this for some weird reasons. But since this pass has been disabled for RadeonSI and Nouveau, the renumber_registers() pass which is called *after*, can't do its job correctly. This is because it assumes that no dead code is present. But if there is still a dead assignment, it might re-use the TEMP register id incorrectly and emits wrong code. This patches eliminates all dead assignments by tracking all temporary register reads like what the renumber_registers() actually does. The best solution would be to rewrite that DCE pass entirely but it's more work. This should fix Unigine Heaven on RadeonSI and Nouveau. shader-db results with RadeonSI: 47109 shaders in 29632 tests Totals: SGPRS: 1919548 -> 1919572 (0.00 %) VGPRS: 1139205 -> 1139209 (0.00 %) Spilled SGPRs: 1865 -> 1867 (0.11 %) Spilled VGPRs: 65 -> 65 (0.00 %) Private memory VGPRs: 1184 -> 1184 (0.00 %) Scratch size: 1308 -> 1308 (0.00 %) dwords per thread Code Size: 60083036 -> 60083244 (0.00 %) bytes LDS: 1077 -> 1077 (0.00 %) blocks Max Waves: 431200 -> 431200 (0.00 %) Wait states: 0 -> 0 (0.00 %) It's still interesting to disable the merge_registers() pass. Signed-off-by: Samuel Pitoiset--- src/mesa/state_tracker/st_glsl_to_tgsi.cpp | 36 +- 1 file changed, 31 insertions(+), 5 deletions(-) diff --git a/src/mesa/state_tracker/st_glsl_to_tgsi.cpp b/src/mesa/state_tracker/st_glsl_to_tgsi.cpp index 0f8688a41c..01b5a4dc98 100644 --- a/src/mesa/state_tracker/st_glsl_to_tgsi.cpp +++ b/src/mesa/state_tracker/st_glsl_to_tgsi.cpp @@ -5085,9 +5085,13 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) glsl_to_tgsi_instruction *, this->next_temp * 4); int *write_level = rzalloc_array(mem_ctx, int, this->next_temp * 4); + int *first_reads = rzalloc_array(mem_ctx, int, this->next_temp); int level = 0; int removed = 0; + for (int i = 0; i < this->next_temp; i++) + first_reads[i] = -1; As discussed offline, it would be good to fix renumber_registers to work without this assumption. If you want to go ahead with this patch anyway / on top of this: the value of first_reads doesn't seem to matter, so why not make it an array of bool instead? Cheers, Nicolai + foreach_in_list(glsl_to_tgsi_instruction, inst, >instructions) { assert(inst->dst[0].file != PROGRAM_TEMPORARY || inst->dst[0].index < this->next_temp); @@ -5148,8 +5152,12 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) src_chans |= 1 << GET_SWZ(inst->src[i].swizzle, 3); for (int c = 0; c < 4; c++) { - if (src_chans & (1 << c)) + if (src_chans & (1 << c)) { writes[4 * inst->src[i].index + c] = NULL; + + if (first_reads[inst->src[i].index] == -1) +first_reads[inst->src[i].index] = level; + } } } } @@ -5167,8 +5175,12 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) src_chans |= 1 << GET_SWZ(inst->tex_offsets[i].swizzle, 3); for (int c = 0; c < 4; c++) { - if (src_chans & (1 << c)) + if (src_chans & (1 << c)) { writes[4 * inst->tex_offsets[i].index + c] = NULL; + + if (first_reads[inst->tex_offsets[i].index] == -1) +first_reads[inst->tex_offsets[i].index] = level; + } } } } @@ -5211,17 +5223,30 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) * the writemask of other instructions with dead channels. */ foreach_in_list_safe(glsl_to_tgsi_instruction, inst, >instructions) { - if (!inst->dead_mask || !inst->dst[0].writemask) + bool dead_inst = false; + + if (!inst->dst[0].writemask) continue; + /* No amount of dead masks should remove memory stores */ if (inst->info->is_store) continue; - if ((inst->dst[0].writemask & ~inst->dead_mask) == 0) { + if (!inst->dead_mask) { + if (inst->dst[0].file == PROGRAM_TEMPORARY && + first_reads[inst->dst[0].index] == -1) { +dead_inst = true; + } + } else { + if ((inst->dst[0].writemask & ~inst->dead_mask) == 0) +dead_inst = true; + } + + if (dead_inst) { inst->remove(); delete inst; removed++; - } else { + } else if (inst->dead_mask) {
Re: [Mesa-dev] [PATCH] st/glsl_to_tgsi: fix the DCE pass in presence of loops
The issue in Uningine Heaven is fixed. Tested-by: Edmondo TommasinaThanks edmondo On Tue, May 2, 2017 at 4:43 PM, Samuel Pitoiset wrote: > The TGSI DCE pass doesn't eliminate dead assignments like > MOV TEMP[0], TEMP[1] in presence of loops because it assumes > that the visitor doesn't emit dead code. This assumption is > actually wrong and this situation happens. > > However, it appears that the merge_registers() pass accidentally > takes care of this for some weird reasons. But since this pass has > been disabled for RadeonSI and Nouveau, the renumber_registers() > pass which is called *after*, can't do its job correctly. > > This is because it assumes that no dead code is present. But if > there is still a dead assignment, it might re-use the TEMP > register id incorrectly and emits wrong code. > > This patches eliminates all dead assignments by tracking > all temporary register reads like what the renumber_registers() > actually does. The best solution would be to rewrite that DCE > pass entirely but it's more work. > > This should fix Unigine Heaven on RadeonSI and Nouveau. > > shader-db results with RadeonSI: > > 47109 shaders in 29632 tests > Totals: > SGPRS: 1919548 -> 1919572 (0.00 %) > VGPRS: 1139205 -> 1139209 (0.00 %) > Spilled SGPRs: 1865 -> 1867 (0.11 %) > Spilled VGPRs: 65 -> 65 (0.00 %) > Private memory VGPRs: 1184 -> 1184 (0.00 %) > Scratch size: 1308 -> 1308 (0.00 %) dwords per thread > Code Size: 60083036 -> 60083244 (0.00 %) bytes > LDS: 1077 -> 1077 (0.00 %) blocks > Max Waves: 431200 -> 431200 (0.00 %) > Wait states: 0 -> 0 (0.00 %) > > It's still interesting to disable the merge_registers() pass. > > Signed-off-by: Samuel Pitoiset > --- > src/mesa/state_tracker/st_glsl_to_tgsi.cpp | 36 > +- > 1 file changed, 31 insertions(+), 5 deletions(-) > > diff --git a/src/mesa/state_tracker/st_glsl_to_tgsi.cpp > b/src/mesa/state_tracker/st_glsl_to_tgsi.cpp > index 0f8688a41c..01b5a4dc98 100644 > --- a/src/mesa/state_tracker/st_glsl_to_tgsi.cpp > +++ b/src/mesa/state_tracker/st_glsl_to_tgsi.cpp > @@ -5085,9 +5085,13 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) > > glsl_to_tgsi_instruction *, > this->next_temp * 4); > int *write_level = rzalloc_array(mem_ctx, int, this->next_temp * 4); > + int *first_reads = rzalloc_array(mem_ctx, int, this->next_temp); > int level = 0; > int removed = 0; > > + for (int i = 0; i < this->next_temp; i++) > + first_reads[i] = -1; > + > foreach_in_list(glsl_to_tgsi_instruction, inst, >instructions) { >assert(inst->dst[0].file != PROGRAM_TEMPORARY > || inst->dst[0].index < this->next_temp); > @@ -5148,8 +5152,12 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) > src_chans |= 1 << GET_SWZ(inst->src[i].swizzle, 3); > > for (int c = 0; c < 4; c++) { > - if (src_chans & (1 << c)) > + if (src_chans & (1 << c)) { > writes[4 * inst->src[i].index + c] = NULL; > + > + if (first_reads[inst->src[i].index] == -1) > +first_reads[inst->src[i].index] = level; > + } > } > } > } > @@ -5167,8 +5175,12 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) > src_chans |= 1 << GET_SWZ(inst->tex_offsets[i].swizzle, 3); > > for (int c = 0; c < 4; c++) { > - if (src_chans & (1 << c)) > + if (src_chans & (1 << c)) { > writes[4 * inst->tex_offsets[i].index + c] = NULL; > + > + if (first_reads[inst->tex_offsets[i].index] == -1) > +first_reads[inst->tex_offsets[i].index] = level; > + } > } > } > } > @@ -5211,17 +5223,30 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) > * the writemask of other instructions with dead channels. > */ > foreach_in_list_safe(glsl_to_tgsi_instruction, inst, >instructions) > { > - if (!inst->dead_mask || !inst->dst[0].writemask) > + bool dead_inst = false; > + > + if (!inst->dst[0].writemask) > continue; > + >/* No amount of dead masks should remove memory stores */ >if (inst->info->is_store) > continue; > > - if ((inst->dst[0].writemask & ~inst->dead_mask) == 0) { > + if (!inst->dead_mask) { > + if (inst->dst[0].file == PROGRAM_TEMPORARY && > + first_reads[inst->dst[0].index] == -1) { > +dead_inst = true; > + } > + } else { > + if ((inst->dst[0].writemask & ~inst->dead_mask) == 0) > +dead_inst = true; > + } > + > + if (dead_inst) { > inst->remove(); >
[Mesa-dev] [PATCH] st/glsl_to_tgsi: fix the DCE pass in presence of loops
The TGSI DCE pass doesn't eliminate dead assignments like MOV TEMP[0], TEMP[1] in presence of loops because it assumes that the visitor doesn't emit dead code. This assumption is actually wrong and this situation happens. However, it appears that the merge_registers() pass accidentally takes care of this for some weird reasons. But since this pass has been disabled for RadeonSI and Nouveau, the renumber_registers() pass which is called *after*, can't do its job correctly. This is because it assumes that no dead code is present. But if there is still a dead assignment, it might re-use the TEMP register id incorrectly and emits wrong code. This patches eliminates all dead assignments by tracking all temporary register reads like what the renumber_registers() actually does. The best solution would be to rewrite that DCE pass entirely but it's more work. This should fix Unigine Heaven on RadeonSI and Nouveau. shader-db results with RadeonSI: 47109 shaders in 29632 tests Totals: SGPRS: 1919548 -> 1919572 (0.00 %) VGPRS: 1139205 -> 1139209 (0.00 %) Spilled SGPRs: 1865 -> 1867 (0.11 %) Spilled VGPRs: 65 -> 65 (0.00 %) Private memory VGPRs: 1184 -> 1184 (0.00 %) Scratch size: 1308 -> 1308 (0.00 %) dwords per thread Code Size: 60083036 -> 60083244 (0.00 %) bytes LDS: 1077 -> 1077 (0.00 %) blocks Max Waves: 431200 -> 431200 (0.00 %) Wait states: 0 -> 0 (0.00 %) It's still interesting to disable the merge_registers() pass. Signed-off-by: Samuel Pitoiset--- src/mesa/state_tracker/st_glsl_to_tgsi.cpp | 36 +- 1 file changed, 31 insertions(+), 5 deletions(-) diff --git a/src/mesa/state_tracker/st_glsl_to_tgsi.cpp b/src/mesa/state_tracker/st_glsl_to_tgsi.cpp index 0f8688a41c..01b5a4dc98 100644 --- a/src/mesa/state_tracker/st_glsl_to_tgsi.cpp +++ b/src/mesa/state_tracker/st_glsl_to_tgsi.cpp @@ -5085,9 +5085,13 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) glsl_to_tgsi_instruction *, this->next_temp * 4); int *write_level = rzalloc_array(mem_ctx, int, this->next_temp * 4); + int *first_reads = rzalloc_array(mem_ctx, int, this->next_temp); int level = 0; int removed = 0; + for (int i = 0; i < this->next_temp; i++) + first_reads[i] = -1; + foreach_in_list(glsl_to_tgsi_instruction, inst, >instructions) { assert(inst->dst[0].file != PROGRAM_TEMPORARY || inst->dst[0].index < this->next_temp); @@ -5148,8 +5152,12 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) src_chans |= 1 << GET_SWZ(inst->src[i].swizzle, 3); for (int c = 0; c < 4; c++) { - if (src_chans & (1 << c)) + if (src_chans & (1 << c)) { writes[4 * inst->src[i].index + c] = NULL; + + if (first_reads[inst->src[i].index] == -1) +first_reads[inst->src[i].index] = level; + } } } } @@ -5167,8 +5175,12 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) src_chans |= 1 << GET_SWZ(inst->tex_offsets[i].swizzle, 3); for (int c = 0; c < 4; c++) { - if (src_chans & (1 << c)) + if (src_chans & (1 << c)) { writes[4 * inst->tex_offsets[i].index + c] = NULL; + + if (first_reads[inst->tex_offsets[i].index] == -1) +first_reads[inst->tex_offsets[i].index] = level; + } } } } @@ -5211,17 +5223,30 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) * the writemask of other instructions with dead channels. */ foreach_in_list_safe(glsl_to_tgsi_instruction, inst, >instructions) { - if (!inst->dead_mask || !inst->dst[0].writemask) + bool dead_inst = false; + + if (!inst->dst[0].writemask) continue; + /* No amount of dead masks should remove memory stores */ if (inst->info->is_store) continue; - if ((inst->dst[0].writemask & ~inst->dead_mask) == 0) { + if (!inst->dead_mask) { + if (inst->dst[0].file == PROGRAM_TEMPORARY && + first_reads[inst->dst[0].index] == -1) { +dead_inst = true; + } + } else { + if ((inst->dst[0].writemask & ~inst->dead_mask) == 0) +dead_inst = true; + } + + if (dead_inst) { inst->remove(); delete inst; removed++; - } else { + } else if (inst->dead_mask) { if (glsl_base_type_is_64bit(inst->dst[0].type)) { if (inst->dead_mask == WRITEMASK_XY || inst->dead_mask == WRITEMASK_ZW) @@ -5232,6 +5257,7 @@ glsl_to_tgsi_visitor::eliminate_dead_code(void) } ralloc_free(write_level); + ralloc_free(first_reads); ralloc_free(writes); return