Richard Biener <[email protected]> writes:
> The following tries to address that the vectorizer fails to have
> precise knowledge of argument and return calling conventions and
> treats some accesses as loads and stores when they are not.
> This is mainly important when doing basic-block vectorization as
> otherwise loop indexing would force such arguments to memory.
>
> On x86 the reduction in the number of apparent loads and stores
> often dominates cost analysis so the following tries to mitigate
> this aggressively by adjusting only the scalar load and store
> cost, reducing them to the cost of a simple scalar statement,
> but not touching the vector access cost which would be much
> harder to estimate. Thereby we err on the side of not performing
> basic-block vectorization.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>
> Richard - we can of course do this adjustment in the backend as well
> but it might be worthwhile in generic code. Do you see similar
> issues on arm?
Yeah, a pathological case is:
struct a { float f[4]; };
struct a test(struct a a) {
  a.f[0] += 1;
  a.f[1] += 2;
  a.f[2] += 3;
  a.f[3] += 4;
  return a;
}
which with -O2 generates:
test:
.LFB0:
        .cfi_startproc
        fmov    w1, s2
        fmov    w4, s0
        mov     x0, 0
        fmov    w3, s1
        sub     sp, sp, #16
        .cfi_def_cfa_offset 16
        mov     x2, 0
        bfi     x0, x1, 0, 32
        fmov    w1, s3
        bfi     x2, x4, 0, 32
        bfi     x2, x3, 32, 32
        bfi     x0, x1, 32, 32
        adrp    x1, .LC0
        stp     x2, x0, [sp]
        ldr     q30, [sp]
        ldr     q31, [x1, #:lo12:.LC0]
        add     sp, sp, 16
        .cfi_def_cfa_offset 0
        fadd    v31.4s, v30.4s, v31.4s
        umov    x0, v31.d[0]
        umov    x1, v31.d[1]
        mov     x3, x0
        lsr     x4, x0, 32
        lsr     x0, x1, 32
        fmov    s1, w4
        fmov    s3, w0
        fmov    s2, w1
        lsr     w0, w3, 0
        fmov    s0, w0
        ret
        .cfi_endproc
Admittedly most of the badness there would probably be fixed by
parameter and return fsra (Jiufu Guo's patch), but it still doesn't
make much sense to marshal 4 separate floats into one vector for
a single addition, only to tear it apart into 4 separate floats
afterwards. We should just do four scalar additions instead.
(The patch doesn't fix this case, although it does trigger.)
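
For reference, the code we would ideally generate here is just four
scalar fadds.  A hand-written sketch rather than actual compiler
output (it assumes the HFA argument and return value live in s0-s3
per the AAPCS64, and that 1.0-4.0 are all encodable fmov immediates):

test:
        fmov    s4, 1.0e+0
        fadd    s0, s0, s4
        fmov    s4, 2.0e+0
        fadd    s1, s1, s4
        fmov    s4, 3.0e+0
        fadd    s2, s2, s4
        fmov    s4, 4.0e+0
        fadd    s3, s3, s4
        ret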
> PR tree-optimization/116274
> * tree-vect-slp.cc (vect_bb_slp_scalar_cost): Cost scalar loads
> and stores as simple scalar stmts when they access a non-global,
> not address-taken variable that doesn't have BLKmode assigned.
>
> * gcc.target/i386/pr116274.c: New testcase.
> ---
> gcc/testsuite/gcc.target/i386/pr116274.c | 9 +++++++++
> gcc/tree-vect-slp.cc | 12 +++++++++++-
> 2 files changed, 20 insertions(+), 1 deletion(-)
> create mode 100644 gcc/testsuite/gcc.target/i386/pr116274.c
>
> diff --git a/gcc/testsuite/gcc.target/i386/pr116274.c b/gcc/testsuite/gcc.target/i386/pr116274.c
> new file mode 100644
> index 00000000000..d5811344b93
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr116274.c
> @@ -0,0 +1,9 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-slp2-optimized" } */
> +
> +struct a { long x,y; };
> +long test(struct a a) { return a.x+a.y; }
> +
> +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */
> +/* { dg-final { scan-assembler-times "addl|leaq" 1 } } */
> +/* { dg-final { scan-assembler-not "padd" } } */
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 3464d0c0e23..e43ff721100 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -7807,7 +7807,17 @@ next_lane:
>        vect_cost_for_stmt kind;
>        if (STMT_VINFO_DATA_REF (orig_stmt_info))
>          {
> -          if (DR_IS_READ (STMT_VINFO_DATA_REF (orig_stmt_info)))
> +          data_reference_p dr = STMT_VINFO_DATA_REF (orig_stmt_info);
> +          tree base = get_base_address (DR_REF (dr));
> +          /* When the scalar access is to a non-global, not address-taken
> +             decl that is not BLKmode, assume we can access it with a
> +             single non-load/store instruction.  */
> +          if (DECL_P (base)
> +              && !is_global_var (base)
> +              && !TREE_ADDRESSABLE (base)
> +              && DECL_MODE (base) != BLKmode)
> +            kind = scalar_stmt;
> +          else if (DR_IS_READ (STMT_VINFO_DATA_REF (orig_stmt_info)))
>              kind = scalar_load;
>            else
>              kind = scalar_store;
LGTM FWIW, but did you consider skipping the cost altogether?
I'm not sure what the scalar_stmt would correspond to in practice,
if we assume that the ABI (for parameters/returns) or RA (for locals)
puts the data in a sensible register class for the datatype.
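
E.g. something like the following instead of setting kind (an
untested sketch on my part; whether a bare "continue" is safe at
this point in vect_bb_slp_scalar_cost is an assumption):

      data_reference_p dr = STMT_VINFO_DATA_REF (orig_stmt_info);
      tree base = get_base_address (DR_REF (dr));
      /* Assume the access is subsumed by the ABI (for parameters and
         returns) or by the RA (for locals) and so has no scalar cost
         at all.  */
      if (DECL_P (base)
          && !is_global_var (base)
          && !TREE_ADDRESSABLE (base)
          && DECL_MODE (base) != BLKmode)
        continue;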
Thanks,
Richard