https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113552

            Bug ID: 113552
           Summary: [11/12/13/14 Regression] vectorizer generates calls to
                    vector math routines with 1 simd lane.
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Keywords: link-failure
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64-*

In GCC 7 the Arm vector PCS was implemented to support libmvec, but the libmvec
component never made it into glibc until now.

GLIBC 2.39, which will be paired with GCC 14, now implements the vector math
routines.

However, consider this function:

> cat cosmo.fppized3.f
      SUBROUTINE a(b)
      DIMENSION b(3,0)
      COMMON c
      DO 4 m=1,c
         DO 4 d=1,3
             b(d,m)=b(d,m)+COS(5.0D00*m)
   4  CONTINUE
      END
      DIMENSION e(53)
      DIMENSION f(6,91),g(6,91),h(6,91),
     *          i(6,91),j(6,91),k(6,86)
      DIMENSION l(107)
      END

Compiling it with headers from glibc 2.39:

> aarch64-unknown-linux-gnu-gfortran -S -o - -Ofast 
> -L/data/repro/glibc/usr/lib64 -I/data/repro/glibc/include 
> --sysroot=/data/repro/glibc -w cosmo.fppized3.f

produces:

        fmul    v13.2d, v13.2d, v19.2d
        fmov    d0, d13
        bl      _ZGVnN1v_cos
        fmov    d12, d0
        dup     d0, v13.d[1]
        bl      _ZGVnN1v_cos
        fmov    d31, d0
        stp     d12, d31, [sp, 96]

which has deconstructed the vector into scalars and performs a vector call with
1 element each time.
This is not just inefficient: _ZGVnN1v_cos does not exist in glibc, so the
generated code cannot be linked.

It looks like the vectorizer starts with 4 floats and widens to 2 vectors of 2
doubles.  But then, during vectorizable simd, this is split again into multiple
vectors, even though the operation already fits in a vector:

cosmo.fppized3.f:4:13: note:   ------>vectorizing SLP node starting from: _49 = __builtin_cos (_48);
cosmo.fppized3.f:4:13: note:   vect_is_simple_use: operand _47 * 5.0e+0, type of def: internal
cosmo.fppized3.f:4:13: note:   transform call.
cosmo.fppized3.f:4:13: note:   add new stmt: _132 = BIT_FIELD_REF <vect__48.26_126, 64, 0>;
cosmo.fppized3.f:4:13: note:   add new stmt: _133 = cos.simdclone.0 (_132);
cosmo.fppized3.f:4:13: note:   add new stmt: _134 = BIT_FIELD_REF <vect__48.26_126, 64, 64>;
cosmo.fppized3.f:4:13: note:   add new stmt: _135 = cos.simdclone.0 (_134);
cosmo.fppized3.f:4:13: note:   add new stmt: vect__49.27_136 = {_133, _135};
cosmo.fppized3.f:4:13: note:   add new stmt: _137 = BIT_FIELD_REF <vect__48.26_127, 64, 0>;
cosmo.fppized3.f:4:13: note:   add new stmt: _138 = cos.simdclone.0 (_137);
cosmo.fppized3.f:4:13: note:   add new stmt: _139 = BIT_FIELD_REF <vect__48.26_127, 64, 64>;
cosmo.fppized3.f:4:13: note:   add new stmt: _140 = cos.simdclone.0 (_139);
...

Because we happen to have a V1DF mode, which is meant to be used only by some
intrinsics, the operation succeeds.

So there are several issues here:

1. We should stop the new libmvec headers in glibc from applying to GCC 10, 9,
8 and 7, since we can't fix those anymore.  So we need a GCC version check on
them; however, glibc is now frozen for release.
2. The vectorizer should not decompose a simd call if the input and result
don't require it.
3. We shouldn't generate a call with simdlen 1.  That said, in theory this
could still be beneficial, because it would allow the rest of the code to
vectorize and the vector PCS is cheaper to call.
