Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"

Andrew Stubbs Thu, 15 Feb 2024 05:02:44 -0800

On 15/02/2024 10:23, Thomas Schwinge wrote:

Hi!


On 2024-02-15T08:49:17+0100, Richard Biener <[email protected]> wrote:

On Wed, 14 Feb 2024, Andrew Stubbs wrote:

On 14/02/2024 13:43, Richard Biener wrote:

On Wed, 14 Feb 2024, Andrew Stubbs wrote:

On 14/02/2024 13:27, Richard Biener wrote:

On Wed, 14 Feb 2024, Andrew Stubbs wrote:

On 13/02/2024 08:26, Richard Biener wrote:

On Mon, 12 Feb 2024, Thomas Schwinge wrote:

On 2023-10-20T12:51:03+0100, Andrew Stubbs <[email protected]>
wrote:

I've committed this patch


... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
"amdgcn: add -march=gfx1030 EXPERIMENTAL".

The RDNA2 ISA variant doesn't support certain instructions previously
implemented in GCC/GCN, so a number of patterns etc. had to be
disabled:

[...] Vector
reductions will need to be reworked for RDNA2.  [...]

    * config/gcn/gcn-valu.md (@dpp_move<mode>): Disable for RDNA2.
    (addc<mode>3<exec_vcc>): Add RDNA2 syntax variant.
    (subc<mode>3<exec_vcc>): Likewise.
    (<convop><mode><vndi>2_exec): Add RDNA2 alternatives.
    (vec_cmp<mode>di): Likewise.
    (vec_cmp<u><mode>di): Likewise.
    (vec_cmp<mode>di_exec): Likewise.
    (vec_cmp<u><mode>di_exec): Likewise.
    (vec_cmp<mode>di_dup): Likewise.
    (vec_cmp<mode>di_dup_exec): Likewise.
    (reduc_<reduc_op>_scal_<mode>): Disable for RDNA2.
    (*<reduc_op>_dpp_shr_<mode>): Likewise.
    (*plus_carry_dpp_shr_<mode>): Likewise.
    (*plus_carry_in_dpp_shr_<mode>): Likewise.


Etc.  The expectation being that GCC middle end copes with this, and
synthesizes some less ideal yet still functional vector code, I presume.

The later RDNA3/gfx1100 support builds on top of this, and that's what
I'm currently working on getting proper GCC/GCN target (not offloading)
results for.

I'm seeing a good number of execution test FAILs (regressions compared to
my earlier non-gfx1100 testing), and I've now tracked down where one
large class of those comes into existence -- [...]

With the following hack applied to 'gcc/tree-vect-loop.cc':

        @@ -6687,8 +6687,9 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
               reduce_with_shift = have_whole_vector_shift (mode1);
               if (!VECTOR_MODE_P (mode1)
                   || !directly_supported_p (code, vectype1))
                 reduce_with_shift = false;
        +      reduce_with_shift = false;

..., I'm able to work around those regressions: by means of forcing
"Reduce using scalar code" instead of "Reduce using vector shifts".
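
For readers unfamiliar with the two strategies named in the vectorizer
dump, here is a minimal C sketch of the idea. It is an illustration only,
not GCC's implementation: LANES, reduce_scalar and reduce_shifts are
hypothetical names, plain arrays stand in for vector registers, and
addition stands in for whatever reduction operation is in use.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 8  /* hypothetical vector width; must be a power of two */

/* "Reduce using scalar code": extract each lane and combine one by one.
   Needs LANES - 1 scalar operations, but no whole-vector shift.  */
static int reduce_scalar(const int v[LANES])
{
    int acc = v[0];
    for (size_t i = 1; i < LANES; i++)
        acc += v[i];
    return acc;
}

/* "Reduce using vector shifts": log2(LANES) steps.  Each step shifts the
   whole vector down by `shift` lanes (filling with zero) and adds the
   shifted copy lane-wise; the final result ends up in lane 0.  */
static int reduce_shifts(const int v[LANES])
{
    int cur[LANES], shifted[LANES];

    assert((LANES & (LANES - 1)) == 0);

    for (size_t i = 0; i < LANES; i++)
        cur[i] = v[i];

    for (size_t shift = LANES / 2; shift >= 1; shift /= 2) {
        /* Models the whole-vector shift (a lane permutation).  */
        for (size_t i = 0; i < LANES; i++)
            shifted[i] = (i + shift < LANES) ? cur[i + shift] : 0;
        /* Models the lane-wise vector operation.  */
        for (size_t i = 0; i < LANES; i++)
            cur[i] += shifted[i];
    }
    return cur[0];
}
```

Both functions compute the same value. The shift-based form is only
usable when the target can actually perform each whole-vector shift,
which is the capability the hack above overrides.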

The attached not-well-tested patch should allow only valid permutations.
Hopefully we go back to working code, but there'll be things that won't
vectorize. That said, the new "dump" output code has fewer and probably
cheaper instructions, so hmmm.


This fixes the reduced builtin-bitops-1.c on RDNA2.


I confirm that "amdgcn: Disallow unsupported permute on RDNA devices"
also obsoletes my 'reduce_with_shift = false;' hack -- and also cures a
good number of additional FAILs (regressions), where presumably we
permute via different code paths.  Thanks!

There also are a few regressions, but only minor:

     PASS: gcc.dg/vect/no-vfa-vect-depend-3.c (test for excess errors)
     PASS: gcc.dg/vect/no-vfa-vect-depend-3.c execution test
     PASS: gcc.dg/vect/no-vfa-vect-depend-3.c scan-tree-dump-times vect "vectorized 1 loops" 4
     [-PASS:-]{+FAIL:+} gcc.dg/vect/no-vfa-vect-depend-3.c scan-tree-dump-times vect "dependence distance negative" 4

..., because:

     gcc.dg/vect/no-vfa-vect-depend-3.c: pattern found 6 times
     FAIL: gcc.dg/vect/no-vfa-vect-depend-3.c scan-tree-dump-times vect "dependence distance negative" 4

     PASS: gcc.dg/vect/vect-119.c (test for excess errors)
     [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-119.c scan-tree-dump-times vect "Detected interleaving load of size 2" 1
     PASS: gcc.dg/vect/vect-119.c scan-tree-dump-not optimized "Invalid sum"

..., because:

     gcc.dg/vect/vect-119.c: pattern found 3 times
     FAIL: gcc.dg/vect/vect-119.c scan-tree-dump-times vect "Detected interleaving load of size 2" 1

     PASS: gcc.dg/vect/vect-reduc-mul_1.c (test for excess errors)
     PASS: gcc.dg/vect/vect-reduc-mul_1.c execution test
     [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-reduc-mul_1.c scan-tree-dump vect "Reduce using vector shifts"

     PASS: gcc.dg/vect/vect-reduc-mul_2.c (test for excess errors)
     PASS: gcc.dg/vect/vect-reduc-mul_2.c execution test
     [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-reduc-mul_2.c scan-tree-dump vect "Reduce using vector shifts"

..., plus the following, in combination with the earlier changes
disabling patterns:

     PASS: gcc.dg/vect/vect-reduc-or_1.c (test for excess errors)
     PASS: gcc.dg/vect/vect-reduc-or_1.c execution test
     [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-reduc-or_1.c scan-tree-dump vect "Reduce using direct vector reduction"

     PASS: gcc.dg/vect/vect-reduc-or_2.c (test for excess errors)
     PASS: gcc.dg/vect/vect-reduc-or_2.c execution test
     [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-reduc-or_2.c scan-tree-dump vect "Reduce using direct vector reduction"

Such test cases will need conditionalization on specific configurations.
I'm fine if we just let those FAIL (for RDNA2+) for the time being; there
are a good number of similar scanning FAILs pre-existing also for
non-gfx1100.


Thanks, Thomas.

The patch is now committed.

Andrew
