On Fri, 16 Feb 2024, Thomas Schwinge wrote:
> Hi!
>
> On 2023-10-20T12:51:03+0100, Andrew Stubbs <[email protected]> wrote:
> > I've committed this patch
>
> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
> "amdgcn: add -march=gfx1030 EXPERIMENTAL", which the later RDNA3/gfx1100
> support builds on top of, and that's what I'm currently working on
> getting proper GCC/GCN target (not offloading) results for.
>
> Now looking at 'gcc.dg/vect/bb-slp-cond-1.c', which is reasonably simple,
> and hopefully representative for other SLP execution test FAILs
> (regressions compared to my earlier non-gfx1100 testing).
>
> $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/
> source-gcc/gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c
> --sysroot=install/amdgcn-amdhsa -ftree-vectorize
> -fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common -O2
> -fdump-tree-slp-details -fdump-tree-vect-details -isystem
> build-gcc/amdgcn-amdhsa/gfx1100/newlib/targ-include -isystem
> source-gcc/newlib/libc/include -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
> -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -wrapper setarch,--addr-no-randomize
> -fdump-tree-all-all -fdump-ipa-all-all -fdump-rtl-all-all -save-temps
> -march=gfx1100
>
> The '-march=gfx1030' 'a-bb-slp-cond-1.s' is identical (apart from
> 'TARGET_PACKED_WORK_ITEMS' in 'gcn_target_asm_function_prologue'), so I
> suppose will also exhibit the same failure mode, once again?
>
> Compared to '-march=gfx90a', the differences begin in
> 'a-bb-slp-cond-1.c.266r.expand' (only!), down to 'a-bb-slp-cond-1.s'.
>
> Changed like:
>
> @@ -38,10 +38,10 @@ int main ()
> #pragma GCC novector
> for (i = 1; i < N; i++)
> if (a[i] != i%4 + 1)
> - abort ();
> + __builtin_printf("%d %d != %d\n", i, a[i], i%4 + 1);
>
> if (a[0] != 5)
> - abort ();
> + __builtin_printf("%d %d != %d\n", 0, a[0], 5);
>
> ..., we see:
>
> $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
> 40 5 != 1
> 41 6 != 2
> 42 7 != 3
> 43 8 != 4
> 44 5 != 1
> 45 6 != 2
> 46 7 != 3
> 47 8 != 4
>
> '40..47' are the 'i = 10..11' in 'foo', and the expectation is
> 'a[i * stride + 0..3] != 0'. So, either some earlier iteration has
> scribbled zero values over these (vector lane masking issue, perhaps?),
> or some other code generation issue?

So we're indeed BB vectorizing this to

  _54 = MEM <vector(4) int> [(int *)_14];
  vect_iftmp.12_56 = .VCOND (_54, { 0, 0, 0, 0 }, { 1, 2, 3, 4 }, { 5, 6, 7, 8 }, 115);
  MEM <vector(4) int> [(int *)_14] = vect_iftmp.12_56;
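
As a rough scalar model of what that vector statement should compute per
lane, here is a minimal C sketch. It assumes the trailing 115 encodes a
"not equal" comparison (which would match the v_cmp_ne_u32 in the assembly);
the function name vcond_ne_zero_model is made up for illustration and is not
anything in GCC.

```c
#include <assert.h>

#define LANES 4

/* Hypothetical scalar model of the BB-vectorized statement: load four
   ints, then per lane select then_v where the loaded value is non-zero
   and else_v otherwise, and store the result back.
   Assumption: the .VCOND comparison code means "not equal".  */
static void
vcond_ne_zero_model (int *p)
{
  static const int then_v[LANES] = { 1, 2, 3, 4 };
  static const int else_v[LANES] = { 5, 6, 7, 8 };
  int loaded[LANES];

  assert (p != 0);

  /* Models: _54 = MEM <vector(4) int> [(int *)_14];  */
  for (int lane = 0; lane < LANES; lane++)
    loaded[lane] = p[lane];

  /* Models the .VCOND select followed by the vector store.  */
  for (int lane = 0; lane < LANES; lane++)
    p[lane] = loaded[lane] != 0 ? then_v[lane] : else_v[lane];
}
```

In this model, the 5..8 values reported at indices 40..47 are what the
select produces when the loaded lanes compare equal to zero, and 1..4 is
what it produces when they are non-zero.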

I don't understand the assembly very well, but it might be that
the mask computation for the .VCOND scribbles over the mask used
to constrain the operation to 4 lanes?

.L3:
s_mov_b64 exec, 15
v_add_co_u32 v4, s[22:23], s32, v3
v_mov_b32 v5, s33
v_add_co_ci_u32 v5, s[22:23], 0, v5, s[22:23]
flat_load_dword v7, v[4:5] offset:0
s_waitcnt 0
flat_load_dword v0, v[10:11] offset:0
s_waitcnt 0
flat_load_dword v6, v[8:9] offset:0
s_waitcnt 0
v_cmp_ne_u32 s[18:19], v7, 0
v_cndmask_b32 v0, v6, v0, s[18:19]
flat_store_dword v[4:5], v0 offset:0
s_add_i32 s12, s12, 1
s_add_u32 s32, s32, s28
s_addc_u32 s33, s33, s29
s_cmp_lg_u32 s12, s13
s_cbranch_scc1 .L3
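
To make the lane-mask question concrete, here is a small C sketch of a
per-lane store gated by an execution mask. It is only a model under
assumptions: lane L is taken to address element L, a 64-lane wave is assumed,
and masked_store_model is a hypothetical name. In the loop above, exec is set
to 15 (lanes 0..3) and the compare writes its result to s[18:19], so the
sketch shows what a store would touch with exec == 15 versus a wider mask.

```c
#include <assert.h>
#include <stdint.h>

#define WAVE_LANES 64

/* Hypothetical model of an EXEC-masked per-lane store: lane L writes
   dst[L] only when bit L of exec is set.
   Assumptions: 64 lanes per wave, lane L addresses element L.  */
static void
masked_store_model (int *dst, const int *src, uint64_t exec)
{
  assert (dst != 0 && src != 0);

  for (int lane = 0; lane < WAVE_LANES; lane++)
    if ((exec >> lane) & 1)
      dst[lane] = src[lane];
}
```

With exec == 15 only dst[0..3] change. If something widened the mask, for
example to 0xff, dst[4..7] would be overwritten as well, which is the kind of
neighbouring-element damage the question above is about.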
Richard.