Richard Biener <[email protected]> writes:
> The following tries to address that the vectorizer fails to have
> precise knowledge of argument and return calling conventions and
> treats some accesses as loads and stores when they are not.
> This is mainly important when doing basic-block vectorization as
> otherwise loop indexing would force such arguments to memory.
>
> On x86 the reduction in the number of apparent loads and stores
> often dominates cost analysis so the following tries to mitigate
> this aggressively by adjusting only the scalar load and store
> cost, reducing them to the cost of a simple scalar statement,
> but not touching the vector access cost which would be much
> harder to estimate. Thereby we err on the side of not performing
> basic-block vectorization.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>
> Richard - we can of course do this adjustment in the backend as well
> but it might be worthwhile in generic code. Do you see similar
> issues on arm?
Yeah, a pathological case is:
struct a { float f[4]; };
struct a test(struct a a) {
  a.f[0] += 1;
  a.f[1] += 2;
  a.f[2] += 3;
  a.f[3] += 4;
  return a;
}
which with -O2 generates:
test:
.LFB0:
        .cfi_startproc
        fmov    w1, s2
        fmov    w4, s0
        mov     x0, 0
        fmov    w3, s1
        sub     sp, sp, #16
        .cfi_def_cfa_offset 16
        mov     x2, 0
        bfi     x0, x1, 0, 32
        fmov    w1, s3
        bfi     x2, x4, 0, 32
        bfi     x2, x3, 32, 32
        bfi     x0, x1, 32, 32
        adrp    x1, .LC0
        stp     x2, x0, [sp]
        ldr     q30, [sp]
        ldr     q31, [x1, #:lo12:.LC0]
        add     sp, sp, 16
        .cfi_def_cfa_offset 0
        fadd    v31.4s, v30.4s, v31.4s
        umov    x0, v31.d[0]
        umov    x1, v31.d[1]
        mov     x3, x0
        lsr     x4, x0, 32
        lsr     x0, x1, 32
        fmov    s1, w4
        fmov    s3, w0
        fmov    s2, w1
        lsr     w0, w3, 0
        fmov    s0, w0
        ret
        .cfi_endproc
Admittedly most of the badness there would probably be fixed by
parameter and return fsra (Jiufu Guo's patch), but it still doesn't
make much sense to marshal 4 separate floats into one vector for
a single addition, only to tear it apart into 4 separate floats
afterwards. We should just do four scalar additions instead.
(The patch doesn't fix this case, although it does trigger.)
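
For reference, the code we would ideally generate here is just four
scalar fadds.  A hand-written sketch rather than actual compiler
output (it assumes the HFA argument and return value live in s0-s3
per the AAPCS64, and that 1.0-4.0 are all encodable fmov immediates):

test:
        fmov    s4, 1.0e+0
        fadd    s0, s0, s4
        fmov    s4, 2.0e+0
        fadd    s1, s1, s4
        fmov    s4, 3.0e+0
        fadd    s2, s2, s4
        fmov    s4, 4.0e+0
        fadd    s3, s3, s4
        ret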
> PR tree-optimization/116274
> * tree-vect-slp.cc (vect_bb_slp_scalar_cost): Cost scalar loads
> and stores as simple scalar stmts when they access a non-global,
> not address-taken variable that doesn't have BLKmode assigned.
>
> * gcc.target/i386/pr116274.c: New testcase.
> ---
> gcc/testsuite/gcc.target/i386/pr116274.c | 9 +++++++++
> gcc/tree-vect-slp.cc | 12 +++++++++++-
> 2 files changed, 20 insertions(+), 1 deletion(-)
> create mode 100644 gcc/testsuite/gcc.target/i386/pr116274.c
>
> diff --git a/gcc/testsuite/gcc.target/i386/pr116274.c b/gcc/testsuite/gcc.target/i386/pr116274.c
> new file mode 100644
> index 00000000000..d5811344b93
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr116274.c
> @@ -0,0 +1,9 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-slp2-optimized" } */
> +
> +struct a { long x,y; };
> +long test(struct a a) { return a.x+a.y; }
> +
> +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */
> +/* { dg-final { scan-assembler-times "addl|leaq" 1 } } */
> +/* { dg-final { scan-assembler-not "padd" } } */
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 3464d0c0e23..e43ff721100 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -7807,7 +7807,17 @@ next_lane:
>        vect_cost_for_stmt kind;
>        if (STMT_VINFO_DATA_REF (orig_stmt_info))
>          {
> -          if (DR_IS_READ (STMT_VINFO_DATA_REF (orig_stmt_info)))
> +          data_reference_p dr = STMT_VINFO_DATA_REF (orig_stmt_info);
> +          tree base = get_base_address (DR_REF (dr));
> +          /* When the scalar access is to a non-global, not address-taken
> +             decl that is not BLKmode, assume we can access it with a
> +             single non-load/store instruction.  */
> +          if (DECL_P (base)
> +              && !is_global_var (base)
> +              && !TREE_ADDRESSABLE (base)
> +              && DECL_MODE (base) != BLKmode)
> +            kind = scalar_stmt;
> +          else if (DR_IS_READ (STMT_VINFO_DATA_REF (orig_stmt_info)))
>              kind = scalar_load;
>            else
>              kind = scalar_store;
LGTM FWIW, but did you consider skipping the cost altogether?
I'm not sure what the scalar_stmt would correspond to in practice,
if we assume that the ABI (for parameters/returns) or RA (for locals)
puts the data in a sensible register class for the datatype.
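
E.g. something like the following instead of setting kind (an
untested sketch on my part; whether a bare "continue" is safe at
this point in vect_bb_slp_scalar_cost is an assumption):

      data_reference_p dr = STMT_VINFO_DATA_REF (orig_stmt_info);
      tree base = get_base_address (DR_REF (dr));
      /* Assume the access is subsumed by the ABI (for parameters and
         returns) or by the RA (for locals) and so has no scalar cost
         at all.  */
      if (DECL_P (base)
          && !is_global_var (base)
          && !TREE_ADDRESSABLE (base)
          && DECL_MODE (base) != BLKmode)
        continue;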
Thanks,
Richard