https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116667

            Bug ID: 116667
           Summary: superfluous zero-extends of SVE values
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64*

We've recently started vectorizing functions such as:

void
decode (unsigned char * restrict h, unsigned char * restrict p4,
        unsigned char * restrict p6, int f, int b, char * restrict e,
        char * restrict a, char * restrict i)
{
    int j = b % 8;
    for (int k = 0; k < 2; ++k)
        {
            p4[k] = i[a[k]] | e[k] << j;
            h[k] = p6[k] = a[k];
        }
}

due to the vectorizer now correctly eliding one of the loads, which makes
vectorization profitable.  Using -O3 -march=armv9-a this now vectorizes and
generates:

decode:
        ptrue   p7.s, vl2
        ptrue   p6.b, all
        ld1b    z31.s, p7/z, [x6]
        ld1b    z28.s, p7/z, [x5]
        and     w4, w4, 7
        movprfx z0, z31
        uxtb    z0.s, p6/m, z31.s
        mov     z30.s, w4
        ld1b    z29.s, p7/z, [x7, z0.s, uxtw]
        lslr    z30.s, p6/m, z30.s, z28.s
        orr     z30.d, z30.d, z29.d
        st1b    z30.s, p7, [x1]
        st1b    z31.s, p7, [x2]
        st1b    z31.s, p7, [x0]
        ret

whereas we used to generate:

decode:
        ptrue   p7.s, vl2
        and     w4, w4, 7
        ld1b    z0.s, p7/z, [x6]
        ld1b    z28.s, p7/z, [x5]
        ld1b    z29.s, p7/z, [x7, z0.s, uxtw]
        ld1b    z31.s, p7/z, [x6]
        mov     z30.s, w4
        ptrue   p6.b, all
        lslr    z30.s, p6/m, z30.s, z28.s
        orr     z30.d, z30.d, z29.d
        st1b    z30.s, p7, [x1]
        st1b    z31.s, p7, [x2]
        st1b    z31.s, p7, [x0]
        ret

This is great, but we're let down by the RTL optimizers.

There are a couple of weird things here.
Cleaning up the sequence a bit, the problematic parts are:

        ptrue   p7.s, vl2
        ptrue   p6.b, all
        ld1b    z31.s, p7/z, [x6]
        movprfx z0, z31
        uxtb    z0.s, p6/m, z31.s
        ld1b    z29.s, p7/z, [x7, z0.s, uxtw]

It zero-extends the same value in z31 three times.  In the old code we actually
loaded the same value twice, once zero-extended and once not.
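
For reference, a hand-written sketch of the sequence we would hope for (assuming
the ld1b widening load already zero-extends the bytes, so z31 can feed the
gather index directly; register names just reuse the ones from the output
above) is:

        ptrue   p7.s, vl2
        and     w4, w4, 7
        ld1b    z31.s, p7/z, [x6]               // a[k], zero-extended to .s
        ld1b    z28.s, p7/z, [x5]               // e[k], zero-extended to .s
        ld1b    z29.s, p7/z, [x7, z31.s, uxtw]  // i[a[k]], index taken from z31
        mov     z30.s, w4
        ptrue   p6.b, all
        lslr    z30.s, p6/m, z30.s, z28.s
        orr     z30.d, z30.d, z29.d
        st1b    z30.s, p7, [x1]
        st1b    z31.s, p7, [x2]
        st1b    z31.s, p7, [x0]
        ret

i.e. the current output minus the movprfx/uxtb pair, with the gather indexing
off z31 instead of z0.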

The RTL for the z31 + extend is

(insn 15 13 16 2 (set (reg:VNx4QI 110 [ vect__3.6 ])
        (unspec:VNx4QI [
                (subreg:VNx4BI (reg:VNx16BI 120) 0)
                (mem:VNx4QI (reg/v/f:DI 117 [ a ]) [0  S[4, 4] A8])
            ] UNSPEC_LD1_SVE)) "/app/example.c":9:24 5683
{maskloadvnx4qivnx4bi}
     (expr_list:REG_DEAD (reg/v/f:DI 117 [ a ])
        (expr_list:REG_EQUAL (unspec:VNx4QI [
                    (const_vector:VNx4BI [
                            (const_int 1 [0x1]) repeated x2
                            repeat [
                                (const_int 0 [0])
                                (const_int 0 [0])
                            ]
                        ])
                    (mem:VNx4QI (reg/v/f:DI 117 [ a ]) [0  S[4, 4] A8])
                ] UNSPEC_LD1_SVE)
            (nil))))
(insn 16 15 17 2 (set (reg:VNx16BI 122)
        (const_vector:VNx16BI repeat [
                (const_int 1 [0x1])
            ])) 5658 {*aarch64_sve_movvnx16bi}
     (nil))
(insn 17 16 20 2 (set (reg:VNx4SI 121 [ vect_patt_59.7_52 ])
        (unspec:VNx4SI [
                (subreg:VNx4BI (reg:VNx16BI 122) 0)
                (zero_extend:VNx4SI (reg:VNx4QI 110 [ vect__3.6 ]))
            ] UNSPEC_PRED_X)) 6943 {*zero_extendvnx4qivnx4si2}
     (expr_list:REG_EQUAL (zero_extend:VNx4SI (reg:VNx4QI 110 [ vect__3.6 ]))
        (nil)))

But combine refuses to merge the zero extend into the load:

deferring rescan insn with uid = 15.
allowing combination of insns 15 and 17
original costs 4 + 4 = 8
replacement costs 4 + 4 = 8
i2 didn't change, not doing this

and instead copies it into the gather load, but leaves insn 17 alone,
presumably because of the predicate.  So it looks like a bug in our backend
costing.  The widening load is definitely cheaper than a load + extend.
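
Concretely, the comparison the costing should be making here is roughly
(a sketch, reusing the registers from the current output):

        // currently: separate load and zero-extend, 3 instructions as emitted
        ld1b    z31.s, p7/z, [x6]
        movprfx z0, z31
        uxtb    z0.s, p6/m, z31.s

        // what we would want combine to form: a single extending load
        ld1b    z0.s, p7/z, [x6]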

However I'm not sure, as the line "i2 didn't change, not doing this" seems to
indicate that it wasn't rejected because of cost?

In the codegen there's a peculiarity in that while the two loads

        ld1b    z31.s, p7/z, [x6]
        ld1b    z28.s, p7/z, [x5]

are both widening loads, they aren't modelled the same:

        ld1b    z31.s, p7/z, [x6]       // 15 [c=4 l=4]  maskloadvnx4qivnx4bi
        ld1b    z28.s, p7/z, [x5]       // 50 [c=4 l=4]  aarch64_load_zero_extendvnx4sivnx4qi

This is because the RTL pattern seems to want to keep the same number of
elements as the input vector size.  So it ends up with a gather, and I think it
is relying on combine changing one form into the other to remove the unneeded
extends.
