https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124162

--- Comment #4 from GCC Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Tamar Christina <[email protected]>:

https://gcc.gnu.org/g:cd13223e7760d55601f440f4605db33bafd1a6d2

commit r16-7753-gcd13223e7760d55601f440f4605db33bafd1a6d2
Author: Tamar Christina <[email protected]>
Date:   Fri Feb 27 07:41:20 2026 +0000

    AArch64: Don't enable ptest elimination for partial vectors [PR124162]

    The following loop

    char b = 41;
    int main() {
      signed char a[31];
    #pragma GCC novector
      for (int c = 0; c < 31; ++c)
        a[c] = c * c + c % 5;
      {
        signed char *d = a;
    #pragma GCC novector
        for (int c = 0; c < 31; ++c, b += -16)
          d[c] += b;
      }
      for (int c = 0; c < 31; ++c) {
        signed char e = c * c + c % 5 + 41 + c * -16;
        if (a[c] != e)
          __builtin_abort();
      }
    }

    compiled with -O2 -ftree-vectorize -msve-vector-bits=256 -march=armv8.2-a+sve

    generates

            ptrue   p6.b, vl32
            add     x2, x2, :lo12:.LC0
            add     w5, w5, 16
            ld1rw   z25.s, p6/z, [x2]
            strb    w5, [x6, #:lo12:.LANCHOR0]
            mov     w0, 0
            mov     p7.b, p6.b
            mov     w2, 31
            index   z30.s, #0, #1
            mov     z26.s, #5
            mov     z27.b, #41
    .L6:
            mov     z29.d, z30.d
            movprfx z28, z30
            add     z28.b, z28.b, #240
            mad     z29.b, p6/m, z28.b, z27.b
            mov     w3, w0
            movprfx z31, z30
            smulh   z31.s, p6/m, z31.s, z25.s
            add     w0, w0, 8
            asr     z31.s, z31.s, #1
            msb     z31.s, p6/m, z26.s, z30.s
            add     z31.b, z31.b, z29.b
            ld1b    z29.s, p7/z, [x1]
            cmpne   p7.b, p7/z, z31.b, z29.b
            b.any   .L15
            add     x1, x1, 8
            add     z30.s, z30.s, #8
            whilelo p7.s, w0, w2
            b.any   .L6

    Which uses a predicate for the first iteration where all bits are 1, i.e.
    all lanes active.  This causes the result of the cmpne to set the wrong CC
    flags.
    The second iteration uses

            whilelo p7.s, w0, w2

    which gives the correct mask layout going forward.

    This is due to the CSE'ing code that tries to share predicates as much as
    possible.  In aarch64_expand_mov_immediate we do the following during
    predicate generation:

              /* Only the low bit of each .H, .S and .D element is defined,
                 so we can set the upper bits to whatever we like.  If the
                 predicate is all-true in MODE, prefer to set all the undefined
                 bits as well, so that we can share a single .B predicate for
                 all modes.  */
              if (imm == CONSTM1_RTX (mode))
                imm = CONSTM1_RTX (VNx16BImode);

    which essentially maps all predicates to .b unless the predicate is created
    outside the immediate expansion code.

    It creates the sparse predicate for data lane VNx4QI from a VNx16QI and
    then has a "conversion" operation.  The conversion operation results in a
    simple copy:

            mov     p7.b, p6.b

    because in the data model for partial vectors the upper lanes are *don't
    care*, so computations using this vector are fine.  However for comparisons,
    or any operation that sets flags, the predicate value does matter, otherwise
    we get the wrong flags as shown above.

    Additionally we don't have a way to distinguish, based on the predicate
    alone, whether the operation is partial or not, i.e. we have no "partial
    predicate" modes.

    There are two ways to solve this:

    1. Restore the ptest for partial vector compares.  This would slow down the
       loop though and introduce a second ptrue .s, VL8 predicate.

    2. Disable the sharing of partial vector predicates.  This allows us to
       remove the ptest.  Since the ptest would introduce a second predicate
       here anyway, I was initially leaning towards disabling sharing between
       partial and full predicates.

    For the patch I ended up going with 1.  The reason is that unsharing the
    predicate ends up pessimizing loops that operate only on full vectors
    (which are the majority of cases).

    I also don't fully understand all the places we depend on this sharing (and
    about 3600 ACLE tests fail assembler scans).  I suspect one way to deal
    with this is to perform the sharing on GIMPLE using the new isel hook and
    make RTL constant expansion not manually force it, since in GIMPLE it's
    easy to follow compares from a back-edge to figure out whether it's safe to
    share the predicate.

    I also tried changing it so that during cond_vec_cbranch_any we perform an
    AND with the proper partial predicate.  But unfortunately folding doesn't
    realize that the AND on the latch edge is useless (e.g. whilelo p7.s, w0,
    w2 makes it a no-op) and that the AND should be moved outside the loop.
    This is again something that isel should be able to do.

    I also tried changing the

            mov     p7.b, p6.b

    into an AND.  While this worked, folding didn't quite get that the AND can
    be eliminated, and it also pessimizes actual register copies.

    So for now I just undo ptest elimination for partial vectors for GCC 16,
    and will revisit it for GCC 17 when we extend ptest elimination.

    gcc/ChangeLog:

            PR target/124162
            * config/aarch64/aarch64-sve.md (cond_vec_cbranch_any,
            cond_vec_cbranch_all): Drop partial vectors support.

    gcc/testsuite/ChangeLog:

            PR target/124162
            * gcc.target/aarch64/sve/vect-early-break-cbranch_16.c: New test.
