Hi!

On 2024-02-15T08:49:17+0100, Richard Biener <rguent...@suse.de> wrote:
> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
>> On 14/02/2024 13:43, Richard Biener wrote:
>> > On Wed, 14 Feb 2024, Andrew Stubbs wrote:
>> >> On 14/02/2024 13:27, Richard Biener wrote:
>> >>> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
>> >>>> On 13/02/2024 08:26, Richard Biener wrote:
>> >>>>> On Mon, 12 Feb 2024, Thomas Schwinge wrote:
>> >>>>>> On 2023-10-20T12:51:03+0100, Andrew Stubbs <a...@codesourcery.com>
>> >>>>>> wrote:
>> >>>>>>> I've committed this patch
>> >>>>>>
>> >>>>>> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
>> >>>>>> "amdgcn: add -march=gfx1030 EXPERIMENTAL".
>> >>>>>>
>> >>>>>> The RDNA2 ISA variant doesn't support certain instructions previously
>> >>>>>> implemented in GCC/GCN, so a number of patterns etc. had to be
>> >>>>>> disabled:
>> >>>>>>
>> >>>>>>> [...] Vector
>> >>>>>>> reductions will need to be reworked for RDNA2. [...]
>> >>>>>>
>> >>>>>>> * config/gcn/gcn-valu.md (@dpp_move<mode>): Disable for RDNA2.
>> >>>>>>> (addc<mode>3<exec_vcc>): Add RDNA2 syntax variant.
>> >>>>>>> (subc<mode>3<exec_vcc>): Likewise.
>> >>>>>>> (<convop><mode><vndi>2_exec): Add RDNA2 alternatives.
>> >>>>>>> (vec_cmp<mode>di): Likewise.
>> >>>>>>> (vec_cmp<u><mode>di): Likewise.
>> >>>>>>> (vec_cmp<mode>di_exec): Likewise.
>> >>>>>>> (vec_cmp<u><mode>di_exec): Likewise.
>> >>>>>>> (vec_cmp<mode>di_dup): Likewise.
>> >>>>>>> (vec_cmp<mode>di_dup_exec): Likewise.
>> >>>>>>> (reduc_<reduc_op>_scal_<mode>): Disable for RDNA2.
>> >>>>>>> (*<reduc_op>_dpp_shr_<mode>): Likewise.
>> >>>>>>> (*plus_carry_dpp_shr_<mode>): Likewise.
>> >>>>>>> (*plus_carry_in_dpp_shr_<mode>): Likewise.
>> >>>>>>
>> >>>>>> Etc. The expectation being that GCC middle end copes with this, and
>> >>>>>> synthesizes some less ideal yet still functional vector code, I
>> >>>>>> presume.
>> >>>>>>
>> >>>>>> The later RDNA3/gfx1100 support builds on top of this, and that's what
>> >>>>>> I'm currently working on getting proper GCC/GCN target (not offloading)
>> >>>>>> results for.
>> >>>>>>
>> >>>>>> I'm seeing a good number of execution test FAILs (regressions compared to
>> >>>>>> my earlier non-gfx1100 testing), and I've now tracked down where one
>> >>>>>> large class of those comes into existence -- [...]
>> >>>>>>
>> >>>>>> With the following hack applied to 'gcc/tree-vect-loop.cc':
>> >>>>>>
>> >>>>>> @@ -6687,8 +6687,9 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
>> >>>>>>        reduce_with_shift = have_whole_vector_shift (mode1);
>> >>>>>>        if (!VECTOR_MODE_P (mode1)
>> >>>>>>            || !directly_supported_p (code, vectype1))
>> >>>>>>          reduce_with_shift = false;
>> >>>>>> +      reduce_with_shift = false;
>> >>>>>>
>> >>>>>> ..., I'm able to work around those regressions: by means of forcing
>> >>>>>> "Reduce using scalar code" instead of "Reduce using vector shifts".

>> The attached not-well-tested patch should allow only valid permutations.
>> Hopefully we go back to working code, but there'll be things that won't
>> vectorize. That said, the new "dump" output code has fewer and probably
>> cheaper instructions, so hmmm.
>
> This fixes the reduced builtin-bitops-1.c on RDNA2.

I confirm that "amdgcn: Disallow unsupported permute on RDNA devices"
also obsoletes my 'reduce_with_shift = false;' hack -- and also cures a
good number of additional FAILs (regressions), where presumably we
permute via different code paths.  Thanks!

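
For reference, the difference between the two epilogue strategies named above ("Reduce using scalar code" vs. "Reduce using vector shifts") can be sketched in plain C. This is only an illustration on an ordinary array, not what the vectorizer actually emits; 'LANES', 'reduce_scalar' and 'reduce_with_shifts' are made-up names, and addition stands in for the reduction operation:

```c
#include <string.h>

#define LANES 8 /* hypothetical vector width; must be a power of two */

/* "Reduce using scalar code": extract every lane and accumulate.  */
static int reduce_scalar(const int v[LANES])
{
    int acc = 0;
    for (int i = 0; i < LANES; i++)
        acc += v[i];
    return acc;
}

/* "Reduce using vector shifts": log2(LANES) steps, each one adding the
   vector to a copy of itself shifted down by half the remaining width
   (zeros shifted in at the top).  The result ends up in lane 0.  */
static int reduce_with_shifts(const int v[LANES])
{
    int tmp[LANES];
    memcpy(tmp, v, sizeof tmp);

    for (int shift = LANES / 2; shift >= 1; shift /= 2) {
        int shifted[LANES];
        /* Emulated whole-vector shift: lane i receives lane i + shift.  */
        for (int i = 0; i < LANES; i++)
            shifted[i] = (i + shift < LANES) ? tmp[i + shift] : 0;
        /* Element-wise vector add.  */
        for (int i = 0; i < LANES; i++)
            tmp[i] += shifted[i];
    }
    return tmp[0];
}
```

Both functions return the same value for any input. The second one only pays off if the target really supports the whole-vector shift (or the equivalent permute) that each step relies on, which is the precondition at issue in this thread.
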
There also are a few regressions, but only minor:

    PASS: gcc.dg/vect/no-vfa-vect-depend-3.c (test for excess errors)
    PASS: gcc.dg/vect/no-vfa-vect-depend-3.c execution test
    PASS: gcc.dg/vect/no-vfa-vect-depend-3.c scan-tree-dump-times vect "vectorized 1 loops" 4
    [-PASS:-]{+FAIL:+} gcc.dg/vect/no-vfa-vect-depend-3.c scan-tree-dump-times vect "dependence distance negative" 4

..., because:

    gcc.dg/vect/no-vfa-vect-depend-3.c: pattern found 6 times
    FAIL: gcc.dg/vect/no-vfa-vect-depend-3.c scan-tree-dump-times vect "dependence distance negative" 4

    PASS: gcc.dg/vect/vect-119.c (test for excess errors)
    [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-119.c scan-tree-dump-times vect "Detected interleaving load of size 2" 1
    PASS: gcc.dg/vect/vect-119.c scan-tree-dump-not optimized "Invalid sum"

..., because:

    gcc.dg/vect/vect-119.c: pattern found 3 times
    FAIL: gcc.dg/vect/vect-119.c scan-tree-dump-times vect "Detected interleaving load of size 2" 1

    PASS: gcc.dg/vect/vect-reduc-mul_1.c (test for excess errors)
    PASS: gcc.dg/vect/vect-reduc-mul_1.c execution test
    [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-reduc-mul_1.c scan-tree-dump vect "Reduce using vector shifts"

    PASS: gcc.dg/vect/vect-reduc-mul_2.c (test for excess errors)
    PASS: gcc.dg/vect/vect-reduc-mul_2.c execution test
    [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-reduc-mul_2.c scan-tree-dump vect "Reduce using vector shifts"

..., plus the following, in combination with the earlier changes
disabling patterns:

    PASS: gcc.dg/vect/vect-reduc-or_1.c (test for excess errors)
    PASS: gcc.dg/vect/vect-reduc-or_1.c execution test
    [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-reduc-or_1.c scan-tree-dump vect "Reduce using direct vector reduction"

    PASS: gcc.dg/vect/vect-reduc-or_2.c (test for excess errors)
    PASS: gcc.dg/vect/vect-reduc-or_2.c execution test
    [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-reduc-or_2.c scan-tree-dump vect "Reduce using direct vector reduction"

Such test cases will need conditionalization on specific configurations.

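
As a rough sketch of what such conditionalization might look like: DejaGnu directives accept a target selector, so a dump scan can be restricted by target triplet. The selector below is an assumption for illustration only, and it is too coarse as written: 'amdgcn-*-*' matches every GCN device, not just RDNA2+, so the scan would also be skipped where it still passes. Narrowing it to RDNA would need some finer-grained selector, which I'm not assuming exists.

```c
/* Sketch only: run the "vector shifts" scan everywhere except amdgcn.
   Assumption: a plain target-triplet selector is good enough here;
   it does not distinguish RDNA2+ from other GCN devices.  */
/* { dg-final { scan-tree-dump "Reduce using vector shifts" "vect" { target { ! amdgcn-*-* } } } } */
```
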
I'm fine if we just let those FAIL (for RDNA2+) for the time being;
there are a good number of similar scanning FAILs pre-existing also for
non-gfx1100.


Grüße
 Thomas