https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99195

            Bug ID: 99195
           Summary: Optimise away vec_concat of 64-bit AdvancedSIMD
                    operations with zeroes in aarch64
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

Motivating testcases:
#include <arm_neon.h>

#define ONE(OT,IT,OP,S)                         \
OT                                              \
foo_##OP##_##S (IT a, IT b)                     \
{                                               \
  IT zeros = vcreate_##S (0);                   \
  return vcombine_##S (v##OP##_##S (a, b), zeros);      \
}


#define FUNC(T,IS,OS,OP,S) ONE (T##x##OS##_t, T##x##IS##_t, OP, S)

#define OPTWO(T,IS,OS,S,OP1,OP2)        \
FUNC (T, IS, OS, OP1, S)                \
FUNC (T, IS, OS, OP2, S)

#define OPTHREE(T, IS, OS, S, OP1, OP2, OP3)    \
FUNC (T, IS, OS, OP1, S)        \
OPTWO (T, IS, OS, S, OP2, OP3)

#define OPFOUR(T,IS,OS,S,OP1,OP2,OP3,OP4)       \
FUNC (T, IS, OS, OP1, S)                \
OPTHREE (T, IS, OS, S, OP2, OP3, OP4)

#define OPFIVE(T,IS,OS,S,OP1,OP2,OP3,OP4, OP5)  \
FUNC (T, IS, OS, OP1, S)                \
OPFOUR (T, IS, OS, S, OP2, OP3, OP4, OP5)

#define OPSIX(T,IS,OS,S,OP1,OP2,OP3,OP4,OP5,OP6)        \
FUNC (T, IS, OS, OP1, S)                \
OPFIVE (T, IS, OS, S, OP2, OP3, OP4, OP5, OP6)

OPSIX (int8, 8, 16, s8, add, sub, mul, and, orr, eor)
OPSIX (int16, 4, 8, s16, add, sub, mul, and, orr, eor)
OPSIX (int32, 2, 4, s32, add, sub, mul, and, orr, eor)
OPFIVE (int64, 1, 2, s64, add, sub, and, orr, eor)

OPSIX (uint8, 8, 16, u8, add, sub, mul, and, orr, eor)
OPSIX (uint16, 4, 8, u16, add, sub, mul, and, orr, eor)
OPSIX (uint32, 2, 4, u32, add, sub, mul, and, orr, eor)
OPFIVE (uint64, 1, 2, u64, add, sub, and, orr, eor)

For example, foo_add_s8 currently generates:
foo_add_s8:
        add     v0.8b, v0.8b, v1.8b
        mov     v0.8b, v0.8b
        ret

The 64-bit V8QI ADD instruction implicitly zeroes the top 64 bits of the
128-bit destination register, so the vec_concat with zeroes is easy to
represent in the instruction pattern. However, we don't have such a pattern
for every AdvancedSIMD operation that we support, and writing them all by hand
would bloat the MD files quite a bit. Can we come up with a define_subst
scheme to auto-generate the patterns to match things like:
(set (reg:V16QI 93 [ <retval> ])
    (vec_concat:V16QI (plus:V8QI (reg:V8QI 98)
            (reg:V8QI 99))
        (const_vector:V8QI [
                (const_int 0 [0]) repeated x8
            ])))
?
Then we should be able to just generate:
foo_add_s8:
        add     v0.8b, v0.8b, v1.8b
        ret
etc.
The testcase above shows the problem for some of the simple binary ops, but
there are many more instructions that can benefit from this.
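
To make the equivalence behind the request concrete, here is a minimal sketch in portable C. It does not use arm_neon.h: the types model_u8x8 and model_u8x16 and all function names are hypothetical stand-ins introduced only for illustration, and the unsigned 8-bit add variant is used so that lane wrap-around is well defined in C. It models the vcombine-with-zeros form from the testcase and the effect of a single 64-bit ADD on the 128-bit register, and checks that the two agree.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for uint8x8_t and uint8x16_t (not the real
   arm_neon.h types), so the sketch builds on any host.  */
typedef struct { uint8_t lane[8]; } model_u8x8;
typedef struct { uint8_t lane[16]; } model_u8x16;

/* Models the source-level shape from the testcase:
   vcombine_u8 (vadd_u8 (a, b), vcreate_u8 (0)).  */
static model_u8x16
model_combine_add_zero (model_u8x8 a, model_u8x8 b)
{
  model_u8x8 sum;
  for (int i = 0; i < 8; i++)
    sum.lane[i] = (uint8_t) (a.lane[i] + b.lane[i]);  /* wraps modulo 256 */

  model_u8x16 r;
  memcpy (r.lane, sum.lane, 8);   /* low half: the 64-bit result */
  memset (r.lane + 8, 0, 8);      /* high half: explicit zeros */
  return r;
}

/* Models a single 64-bit ADD as described in the report: the low 64 bits
   receive the sum and the top 64 bits of the register are zeroed.  */
static model_u8x16
model_add_8b (model_u8x8 a, model_u8x8 b)
{
  model_u8x16 r;
  memset (&r, 0, sizeof r);       /* implicit zeroing of the whole register */
  for (int i = 0; i < 8; i++)
    r.lane[i] = (uint8_t) (a.lane[i] + b.lane[i]);
  return r;
}

/* Returns 1 when both models produce the same 128-bit value.  */
static int
models_agree (model_u8x8 a, model_u8x8 b)
{
  model_u8x16 x = model_combine_add_zero (a, b);
  model_u8x16 y = model_add_8b (a, b);
  return memcmp (x.lane, y.lane, sizeof x.lane) == 0;
}

/* Small self-check with arbitrary inputs, including lanes that wrap.  */
static void
model_selfcheck (void)
{
  model_u8x8 a = {{1, 2, 3, 4, 250, 251, 252, 253}};
  model_u8x8 b = {{10, 10, 10, 10, 10, 10, 10, 10}};
  assert (models_agree (a, b));
}
```

Since the two models agree for any input, the extra "mov v0.8b, v0.8b" in the current output does no additional work. This is only a scalar model of the semantics described above, not a statement about how the GCC patterns should be written.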
