Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"

Richard Biener Wed, 14 Feb 2024 05:43:15 -0800

On Wed, 14 Feb 2024, Andrew Stubbs wrote:

> On 14/02/2024 13:27, Richard Biener wrote:
> > On Wed, 14 Feb 2024, Andrew Stubbs wrote:
> > 
> >> On 13/02/2024 08:26, Richard Biener wrote:
> >>> On Mon, 12 Feb 2024, Thomas Schwinge wrote:
> >>>
> >>>> Hi!
> >>>>
> >>>> On 2023-10-20T12:51:03+0100, Andrew Stubbs <[email protected]> wrote:
> >>>>> I've committed this patch
> >>>>
> >>>> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
> >>>> "amdgcn: add -march=gfx1030 EXPERIMENTAL".
> >>>>
> >>>> The RDNA2 ISA variant doesn't support certain instructions previous
> >>>> implemented in GCC/GCN, so a number of patterns etc. had to be disabled:
> >>>>
> >>>>> [...] Vector
> >>>>> reductions will need to be reworked for RDNA2.  [...]
> >>>>
> >>>>>   * config/gcn/gcn-valu.md (@dpp_move<mode>): Disable for RDNA2.
> >>>>>   (addc<mode>3<exec_vcc>): Add RDNA2 syntax variant.
> >>>>>   (subc<mode>3<exec_vcc>): Likewise.
> >>>>>   (<convop><mode><vndi>2_exec): Add RDNA2 alternatives.
> >>>>>   (vec_cmp<mode>di): Likewise.
> >>>>>   (vec_cmp<u><mode>di): Likewise.
> >>>>>   (vec_cmp<mode>di_exec): Likewise.
> >>>>>   (vec_cmp<u><mode>di_exec): Likewise.
> >>>>>   (vec_cmp<mode>di_dup): Likewise.
> >>>>>   (vec_cmp<mode>di_dup_exec): Likewise.
> >>>>>   (reduc_<reduc_op>_scal_<mode>): Disable for RDNA2.
> >>>>>   (*<reduc_op>_dpp_shr_<mode>): Likewise.
> >>>>>   (*plus_carry_dpp_shr_<mode>): Likewise.
> >>>>>   (*plus_carry_in_dpp_shr_<mode>): Likewise.
> >>>>
> >>>> Etc.  The expectation being that GCC middle end copes with this, and
> >>>> synthesizes some less ideal yet still functional vector code, I presume.
> >>>>
> >>>> The later RDNA3/gfx1100 support builds on top of this, and that's what
> >>>> I'm currently working on getting proper GCC/GCN target (not offloading)
> >>>> results for.
> >>>>
> >>>> I'm seeing a good number of execution test FAILs (regressions compared to
> >>>> my earlier non-gfx1100 testing), and I've now tracked down where one
> >>>> large class of those comes into existance -- not yet how to resolve,
> >>>> unfortunately.  But maybe, with you guys' combined vectorizer and back
> >>>> end experience, the latter will be done quickly?
> >>>>
> >>>> Richard, I don't know if you've ever run actual GCC/GCN target (not
> >>>> offloading) testing; let me know if you have any questions about that.
> >>>
> >>> I've only done offload testing - in the x86_64 build tree run
> >>> check-target-libgomp.  If you can tell me how to do GCN target testing
> >>> (maybe document it on the wiki even!) I can try do that as well.
> >>>
> >>>> Given that (at least largely?) the same patterns etc. are disabled as in
> >>>> my gfx1100 configuration, I suppose your gfx1030 one would exhibit the
> >>>> same issues.  You can build GCC/GCN target like you build the offloading
> >>>> one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you can
> >>>> even use a offloading GCC/GCN build to reproduce the issue below.
> >>>>
> >>>> One example is the attached 'builtin-bitops-1.c', reduced from
> >>>> 'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
> >>>> miscompiled as soon as '-ftree-vectorize' is effective:
> >>>>
> >>>>       $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c
> >>>>       -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
> >>>>       -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all
> >>>>       -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100
> >>>>       -O1
> >>>>       -ftree-vectorize
> >>>>
> >>>> In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
> >>>> '-march=gfx90a' vs. '-march=gfx1100', we see:
> >>>>
> >>>>       +builtin-bitops-1.c:7:17: missed:   reduc op not supported by
> >>>>       target.
> >>>>
> >>>> ..., and therefore:
> >>>>
> >>>>       -builtin-bitops-1.c:7:17: note:  Reduce using direct vector
> >>>>       reduction.
> >>>>       +builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
> >>>>       +builtin-bitops-1.c:7:17: note:  extract scalar result
> >>>>
> >>>> That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build a
> >>>> chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
> >>>> generated:
> >>>>
> >>>>       $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
> >>>>       i=1, ints[i]=0x1 a=1, b=2
> >>>>       i=2, ints[i]=0x80000000 a=1, b=2
> >>>>       i=3, ints[i]=0x2 a=1, b=2
> >>>>       i=4, ints[i]=0x40000000 a=1, b=2
> >>>>       i=5, ints[i]=0x10000 a=1, b=2
> >>>>       i=6, ints[i]=0x8000 a=1, b=2
> >>>>       i=7, ints[i]=0xa5a5a5a5 a=16, b=32
> >>>>       i=8, ints[i]=0x5a5a5a5a a=16, b=32
> >>>>       i=9, ints[i]=0xcafe0000 a=11, b=22
> >>>>       i=10, ints[i]=0xcafe00 a=11, b=22
> >>>>       i=11, ints[i]=0xcafe a=11, b=22
> >>>>       i=12, ints[i]=0xffffffff a=32, b=64
> >>>>
> >>>> (I can't tell if the 'b = 2 * a' pattern is purely coincidental?)
> >>>>
> >>>> I don't speak enough "vectorization" to fully understand the generic
> >>>> vectorized algorithm and its implementation.  It appears that the
> >>>> "Reduce using vector shifts" code has been around for a very long time,
> >>>> but also has gone through a number of changes.  I can't tell which GCC
> >>>> targets/configurations it's actually used for (in the same way as for
> >>>> GCN gfx1100), and thus whether there's an issue in that vectorizer code,
> >>>> or rather in the GCN back end, or GCN back end parameterizing the generic
> >>>> code?
> >>>
> >>> The "shift" reduction is basically doing reduction by repeatedly
> >>> adding the upper to the lower half of the vector (each time halving
> >>> the vector size).
> >>>
> >>>> Manually working through the 'a-builtin-bitops-1.c.265t.optimized' code:
> >>>>
> >>>>       int my_popcount (unsigned int x)
> >>>>       {
> >>>>         int stmp__12.12;
> >>>>         vector(64) int vect__12.11;
> >>>>         vector(64) unsigned int vect__1.8;
> >>>>         vector(64) unsigned int _13;
> >>>>         vector(64) unsigned int vect_cst__18;
> >>>>         vector(64) int [all others];
> >>>>       
> >>>>         <bb 2> [local count: 32534376]:
> >>>>         vect_cst__18 = { [all 'x_8(D)'] };
> >>>>         vect__1.8_19 = vect_cst__18 >> { 0, 1, 2, [...], 61, 62, 63 };
> >>>>         _13 = .COND_AND ({ [32 x '-1'], [32 x '0'] }, vect__1.8_19, {
> >>>>         [all
> >>>>         '1'] }, { [all '0'] });
> >>>>         vect__12.11_24 = VIEW_CONVERT_EXPR<vector(64) int>(_13);
> >>>>         _26 = VEC_PERM_EXPR <vect__12.11_24, { [all '0'] }, { 32, 33, 34,
> >>>>         [...], 93, 94, 95 }>;
> >>>>         _27 = vect__12.11_24 + _26;
> >>>>         _28 = VEC_PERM_EXPR <_27, { [all '0'] }, { 16, 17, 18, [...], 77,
> >>>>         78, 79 }>;
> >>>>         _29 = _27 + _28;
> >>>>         _30 = VEC_PERM_EXPR <_29, { [all '0'] }, { 8, 9, 10, [...], 69,
> >>>>         70,
> >>>>         71 }>;
> >>>>         _31 = _29 + _30;
> >>>>         _32 = VEC_PERM_EXPR <_31, { [all '0'] }, { 4, 5, 6, [...], 65,
> >>>>         66,
> >>>>         67 }>;
> >>>>         _33 = _31 + _32;
> >>>>         _34 = VEC_PERM_EXPR <_33, { [all '0'] }, { 2, 3, 4, [...], 63,
> >>>>         64,
> >>>>         65 }>;
> >>>>         _35 = _33 + _34;
> >>>>         _36 = VEC_PERM_EXPR <_35, { [all '0'] }, { 1, 2, 3, [...], 62,
> >>>>         63,
> >>>>         64 }>;
> >>>>         _37 = _35 + _36;
> >>>>         stmp__12.12_38 = BIT_FIELD_REF <_37, 32, 0>;
> >>>>         return stmp__12.12_38;
> >>>>
> >>>> ..., for example, for 'x = 7', we get:
> >>>>
> >>>>         vect_cst__18 = { [all '7'] };
> >>>>         vect__1.8_19 = { 7, 3, 1, 0, 0, 0, [...] };
> >>>>         _13 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>         vect__12.11_24 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>         _26 = { [all '0'] };
> >>>>         _27 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>         _28 = { [all '0'] };
> >>>>         _29 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>         _30 = { [all '0'] };
> >>>>         _31 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>         _32 = { [all '0'] };
> >>>>         _33 = { 1, 1, 1, 0, 0, 0, [...] };
> >>>>         _34 = { 1, 0, 0, 0, [...] };
> >>>>         _35 = { 2, 1, 1, 0, 0, 0, [...] };
> >>>>         _36 = { 1, 1, 0, 0, 0, [...] };
> >>>>         _37 = { 3, 2, 1, 0, 0, 0, [...] };
> >>>>         stmp__12.12_38 = 3;
> >>>>         return 3;
> >>>>
> >>>> ..., so the algorithm would appear to synthesize correct code for that
> >>>> case.  Adding '7' to 'builtin-bitops-1.c', we however again get:
> >>>>
> >>>>       i=13, ints[i]=0x7 a=3, b=6
> >>>>
> >>>>
> >>>> With the following hack applied to 'gcc/tree-vect-loop.cc':
> >>>>
> >>>>       @@ -6687,8 +6687,9 @@ vect_create_epilog_for_reduction
> >>>>       (loop_vec_info
> >>>>       loop_vinfo,
> >>>>              reduce_with_shift = have_whole_vector_shift (mode1);
> >>>>              if (!VECTOR_MODE_P (mode1)
> >>>>                 || !directly_supported_p (code, vectype1))
> >>>>               reduce_with_shift = false;
> >>>>       +      reduce_with_shift = false;
> >>>>
> >>>> ..., I'm able to work around those regressions: by means of forcing
> >>>> "Reduce using scalar code" instead of "Reduce using vector shifts".
> >>>
> >>> I would say it somewhere gets broken between the vectorizer and the GPU
> >>> which means likely in the target?  Can you point out an issue in the
> >>> actual generated GCN code?
> >>>
> >>> Iff this kind of reduction is the issue you'd see quite a lot of
> >>> vectorzer execute FAILs.  I'm seeing a .COND_AND above - could it
> >>> be that the "mask" is still set wrong when doing the reduction
> >>> steps?
> >>
> >> It looks like the ds_bpermute_b32 instruction works differently on RDNA3
> >> (vs.
> >> GCN/CDNA and even RDNA2).
> >>
> >>  From the pseudocode in the documentation:
> >>
> >>    for i in 0 : WAVE64 ? 63 : 31 do
> >>      // ADDR needs to be divided by 4.
> >>      // High-order bits are ignored.
> >>      // NOTE: destination lane is MOD 32 regardless of wave size.
> >>      src_lane = 32'I(VGPR[i][ADDR] + OFFSET.b) / 4 % 32;
> >>      // EXEC is applied to the source VGPR reads.
> >>      if EXEC[src_lane].u1 then
> >>        tmp[i] = VGPR[src_lane][DATA0]
> >>      endif
> >>    endfor;
> >>
> >> The key detail is the "mod 32"; the other architectures have "mod 64"
> >> there.
> >>
> >> So, the last 32 lanes are discarded, and the first 32 lanes are duplicated
> >> into the last, and this explains why my_popcount returns double the
> >> expected
> >> value for smaller inputs.
> >>
> >> Richi, can you confirm that this testcase works properly on your card,
> >> please?
> >>
> >> To test, assuming you only have the offload toolchain built, compile using
> >> x86_64-none-linux-gnu-accel-amdgcn-amdhsa-gcc, which should produce a raw
> >> AMD
> >> ELF file. Then you run it using "gcn-run a.out" (you can find gcn-run under
> >> libexec).
> > 
> > I'm getting
> > 
> > i=1, ints[i]=0x1 a=1, b=2
> > i=2, ints[i]=0x80000000 a=1, b=2
> > i=3, ints[i]=0x2 a=1, b=2
> > i=4, ints[i]=0x40000000 a=1, b=2
> > i=5, ints[i]=0x10000 a=1, b=2
> > i=6, ints[i]=0x8000 a=1, b=2
> > i=7, ints[i]=0xa5a5a5a5 a=16, b=32
> > i=8, ints[i]=0x5a5a5a5a a=16, b=32
> > i=9, ints[i]=0xcafe0000 a=11, b=22
> > i=10, ints[i]=0xcafe00 a=11, b=22
> > i=11, ints[i]=0xcafe a=11, b=22
> > i=12, ints[i]=0xffffffff a=32, b=64
> > 
> > which I think is the same as Thomas output and thus wrong?
> > 
> > When building with -O0 I get no output.
> > 
> > I'm of course building with -march=gfx1030
> 
> OK, please try this example, just to check my expectation that your permute
> works:
> 
> typedef int v64si __attribute__ ((vector_size (256)));
> 
> int main()
> {
>   v64si permute = {
>     40, 40, 40, 40, 40, 40, 40, 40,
>     40, 40, 40, 40, 40, 40, 40, 40,
>     40, 40, 40, 40, 40, 40, 40, 40,
>     40, 40, 40, 40, 40, 40, 40, 40,
>     40, 40, 40, 40, 40, 40, 40, 40,
>     40, 40, 40, 40, 40, 40, 40, 40,
>     40, 40, 40, 40, 40, 40, 40, 40,
>     40, 40, 40, 40, 40, 40, 40, 40
>   };
>   v64si result;
> 
>   asm ("ds_bpermute_b32 %0, %1, v1" : "=v"(result) : "v"(permute), "e"(-1L));
> 
>   for (int i=0; i<63; i++)
>     __builtin_printf ("%d ", result[i]);
>   __builtin_printf ("\n");
> 
>   return 0;
> }
> 
> On GCN/CDNA devices I expect this to print "10" 64 times. On RDNA3 it prints
> "10" 32 times, and "42" 32 times (which doesn't quite match what I'd expect
> from the pseudocode, but does match the written description). Which do you
> get?


10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 
10 10 10 10 10 10 10 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 42 
42 42 42 42 42 42 42 42 42 42 42 42 42 

so RDNA2 matches RDNA3 here.

Richard.

Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"

Reply via email to