Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"

Thomas Schwinge Thu, 15 Feb 2024 02:23:39 -0800

Hi!

On 2024-02-15T08:49:17+0100, Richard Biener <rguent...@suse.de> wrote:
> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
>> On 14/02/2024 13:43, Richard Biener wrote:
>> > On Wed, 14 Feb 2024, Andrew Stubbs wrote:
>> >> On 14/02/2024 13:27, Richard Biener wrote:
>> >>> On Wed, 14 Feb 2024, Andrew Stubbs wrote:
>> >>>> On 13/02/2024 08:26, Richard Biener wrote:
>> >>>>> On Mon, 12 Feb 2024, Thomas Schwinge wrote:
>> >>>>>> On 2023-10-20T12:51:03+0100, Andrew Stubbs <a...@codesourcery.com>
>> >>>>>> wrote:
>> >>>>>>> I've committed this patch
>> >>>>>>
>> >>>>>> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
>> >>>>>> "amdgcn: add -march=gfx1030 EXPERIMENTAL".
>> >>>>>>
>> >>>>>> The RDNA2 ISA variant doesn't support certain instructions previously
>> >>>>>> implemented in GCC/GCN, so a number of patterns etc. had to be
>> >>>>>> disabled:
>> >>>>>>
>> >>>>>>> [...] Vector
>> >>>>>>> reductions will need to be reworked for RDNA2.  [...]
>> >>>>>>
>> >>>>>>>    * config/gcn/gcn-valu.md (@dpp_move<mode>): Disable for RDNA2.
>> >>>>>>>    (addc<mode>3<exec_vcc>): Add RDNA2 syntax variant.
>> >>>>>>>    (subc<mode>3<exec_vcc>): Likewise.
>> >>>>>>>    (<convop><mode><vndi>2_exec): Add RDNA2 alternatives.
>> >>>>>>>    (vec_cmp<mode>di): Likewise.
>> >>>>>>>    (vec_cmp<u><mode>di): Likewise.
>> >>>>>>>    (vec_cmp<mode>di_exec): Likewise.
>> >>>>>>>    (vec_cmp<u><mode>di_exec): Likewise.
>> >>>>>>>    (vec_cmp<mode>di_dup): Likewise.
>> >>>>>>>    (vec_cmp<mode>di_dup_exec): Likewise.
>> >>>>>>>    (reduc_<reduc_op>_scal_<mode>): Disable for RDNA2.
>> >>>>>>>    (*<reduc_op>_dpp_shr_<mode>): Likewise.
>> >>>>>>>    (*plus_carry_dpp_shr_<mode>): Likewise.
>> >>>>>>>    (*plus_carry_in_dpp_shr_<mode>): Likewise.
>> >>>>>>
>> >>>>>> Etc.  The expectation being that GCC middle end copes with this, and
>> >>>>>> synthesizes some less ideal yet still functional vector code, I 
>> >>>>>> presume.
>> >>>>>>
>> >>>>>> The later RDNA3/gfx1100 support builds on top of this, and that's what
>> >>>>>> I'm currently working on getting proper GCC/GCN target (not
>> >>>>>> offloading) results for.
>> >>>>>>
>> >>>>>> I'm seeing a good number of execution test FAILs (regressions
>> >>>>>> compared to my earlier non-gfx1100 testing), and I've now tracked
>> >>>>>> down where one large class of those comes into existence -- [...]


>> >>>>>> With the following hack applied to 'gcc/tree-vect-loop.cc':
>> >>>>>>
>> >>>>>>        @@ -6687,8 +6687,9 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
>> >>>>>>               reduce_with_shift = have_whole_vector_shift (mode1);
>> >>>>>>               if (!VECTOR_MODE_P (mode1)
>> >>>>>>                  || !directly_supported_p (code, vectype1))
>> >>>>>>                reduce_with_shift = false;
>> >>>>>>        +      reduce_with_shift = false;
>> >>>>>>
>> >>>>>> ..., I'm able to work around those regressions: by means of forcing
>> >>>>>> "Reduce using scalar code" instead of "Reduce using vector shifts".

>> The attached not-well-tested patch should allow only valid permutations.
>> Hopefully we go back to working code, but there'll be things that won't
>> vectorize. That said, the new "dump" output code has fewer and probably
>> cheaper instructions, so hmmm.
>
> This fixes the reduced builtin-bitops-1.c on RDNA2.

I confirm that "amdgcn: Disallow unsupported permute on RDNA devices"
also obsoletes my 'reduce_with_shift = false;' hack -- and also cures a
good number of additional FAILs (regressions), where presumably we
permute via different code paths.  Thanks!
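
For anyone following along, here is a rough plain-C sketch of what I
understand the two epilogue strategies to boil down to.  This is not the
vectorizer's actual code: 'NLANES', 'reduce_scalar', and
'reduce_with_shifts' are made-up names, the lane count is arbitrary, and
the whole-vector shift is modelled as a zero-filling lane move.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lane count, purely for illustration; not the vector width
   GCC actually uses for GCN.  */
#define NLANES 8

/* "Reduce using scalar code": extract every lane and combine them one by
   one.  */
static int
reduce_scalar (const int v[NLANES])
{
  int sum = v[0];
  for (size_t i = 1; i < NLANES; i++)
    sum += v[i];
  return sum;
}

/* "Reduce using vector shifts": repeatedly shift the whole vector down by
   half the remaining width and combine lane-wise, so that lane 0 holds the
   result after log2 (NLANES) steps.  The shift is modelled here as a lane
   move that fills vacated lanes with zero.  */
static int
reduce_with_shifts (const int v[NLANES])
{
  int acc[NLANES], shifted[NLANES];

  assert ((NLANES & (NLANES - 1)) == 0);  /* Power-of-two lane count.  */

  for (size_t i = 0; i < NLANES; i++)
    acc[i] = v[i];

  for (size_t off = NLANES / 2; off >= 1; off /= 2)
    {
      for (size_t i = 0; i < NLANES; i++)
        shifted[i] = (i + off < NLANES) ? acc[i + off] : 0;
      for (size_t i = 0; i < NLANES; i++)
        acc[i] += shifted[i];
    }
  return acc[0];
}
```

Both functions return the same value for the same input.  The shift
variant is only correct if the target really implements the lane moves it
claims to support, which, as far as I can tell, is what the patch restores
for RDNA.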

There also are a few regressions, but only minor:

    PASS: gcc.dg/vect/no-vfa-vect-depend-3.c (test for excess errors)
    PASS: gcc.dg/vect/no-vfa-vect-depend-3.c execution test
    PASS: gcc.dg/vect/no-vfa-vect-depend-3.c scan-tree-dump-times vect "vectorized 1 loops" 4
    [-PASS:-]{+FAIL:+} gcc.dg/vect/no-vfa-vect-depend-3.c scan-tree-dump-times vect "dependence distance negative" 4

..., because:

    gcc.dg/vect/no-vfa-vect-depend-3.c: pattern found 6 times
    FAIL: gcc.dg/vect/no-vfa-vect-depend-3.c scan-tree-dump-times vect "dependence distance negative" 4

    PASS: gcc.dg/vect/vect-119.c (test for excess errors)
    [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-119.c scan-tree-dump-times vect "Detected interleaving load of size 2" 1
    PASS: gcc.dg/vect/vect-119.c scan-tree-dump-not optimized "Invalid sum"

..., because:

    gcc.dg/vect/vect-119.c: pattern found 3 times
    FAIL: gcc.dg/vect/vect-119.c scan-tree-dump-times vect "Detected interleaving load of size 2" 1

    PASS: gcc.dg/vect/vect-reduc-mul_1.c (test for excess errors)
    PASS: gcc.dg/vect/vect-reduc-mul_1.c execution test
    [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-reduc-mul_1.c scan-tree-dump vect "Reduce using vector shifts"

    PASS: gcc.dg/vect/vect-reduc-mul_2.c (test for excess errors)
    PASS: gcc.dg/vect/vect-reduc-mul_2.c execution test
    [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-reduc-mul_2.c scan-tree-dump vect "Reduce using vector shifts"

..., plus the following, in combination with the earlier changes
disabling patterns:

    PASS: gcc.dg/vect/vect-reduc-or_1.c (test for excess errors)
    PASS: gcc.dg/vect/vect-reduc-or_1.c execution test
    [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-reduc-or_1.c scan-tree-dump vect "Reduce using direct vector reduction"

    PASS: gcc.dg/vect/vect-reduc-or_2.c (test for excess errors)
    PASS: gcc.dg/vect/vect-reduc-or_2.c execution test
    [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-reduc-or_2.c scan-tree-dump vect "Reduce using direct vector reduction"

Such test cases will need conditionalization on specific configurations.
I'm fine if we just let those FAIL (for RDNA2+) for the time being; there
are a good number of similar scanning FAILs pre-existing also for
non-gfx1100.
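
As a rough idea of what such conditionalization could look like, here is a
sketch of a dump-scanning test using a DejaGnu target selector.  The file
contents and the function name 'mul_reduce' are hypothetical (this is not
the actual text of vect-reduc-mul_1.c), and the blanket
'xfail amdgcn-*-*' selector is only an assumption for illustration: it
would produce XPASSes on configurations where the scan still succeeds, so
a finer-grained selector would be needed in practice.

```c
/* Hypothetical test file, for illustration only.  */

int
mul_reduce (const int *a, int n)
{
  int prod = 1;
  /* A multiplication reduction that the vectorizer may finish with an
     epilogue of the "Reduce using vector shifts" kind.  */
  for (int i = 0; i < n; i++)
    prod *= a[i];
  return prod;
}

/* Blunt variant: expect the dump scan to fail on all amdgcn
   configurations, not only RDNA2+.  */
/* { dg-final { scan-tree-dump "Reduce using vector shifts" "vect" { xfail amdgcn-*-* } } } */
```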


Grüße
 Thomas
