https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117048
Bug ID: 117048
Summary: Failure to combine into XAR instruction
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: ktkachov at gcc dot gnu.org
Target Milestone: ---
Target: aarch64
A testcase derived from a hashing algorithm:
#include <stdint.h>
#include <string.h>
#include <arm_neon.h>
static inline uint64x2_t
rotr64_vec(uint64x2_t x, const int b)
{
  int64x2_t neg_b = vdupq_n_s64(-b);
  int64x2_t left_shift = vsubq_s64(vdupq_n_s64(64), vdupq_n_s64(b));
  uint64x2_t right_shifted = vshlq_u64(x, neg_b);
  uint64x2_t left_shifted = vshlq_u64(x, left_shift);
  return vorrq_u64(right_shifted, left_shifted);
}
void G(
  int64_t* v,
  int64x2_t& m1_01,
  int64x2_t& m1_23,
  int64x2_t& m2_01,
  int64x2_t& m2_23
) {
  int64x2_t vd01 = {v[12],v[13]};
  vd01 = veorq_s64(vd01, m1_01);
  vd01 = vreinterpretq_s64_u64(rotr64_vec(vreinterpretq_u64_s64(vd01), 32));
  v[12] = vgetq_lane_s64(vd01, 0);
}
When compiling with, say, -march=armv9-a+sha3, GCC should generate the XAR
instruction, as LLVM does:
G(long*, __Int64x2_t&, __Int64x2_t&, __Int64x2_t&, __Int64x2_t&):
        ldr     q0, [x0, #96]
        ldr     q1, [x1]
        xar     v0.2d, v0.2d, v1.2d, #32
        str     d0, [x0, #96]
        ret
But GCC generates the less efficient sequence:
G(long*, __Int64x2_t&, __Int64x2_t&, __Int64x2_t&, __Int64x2_t&):
        ldr     q30, [x1]
        ldr     q0, [x0, 96]
        eor     v30.16b, v0.16b, v30.16b
        ushr    v31.2d, v30.2d, 32
        shl     v30.2d, v30.2d, 32
        orr     v30.16b, v31.16b, v30.16b
        str     d30, [x0, 96]
        ret
We do have an RTL pattern for XAR, expressed as a rotate of an XOR. I see combine
trying and failing to match:
(set (reg:V2DI 119 [ _14 ])
(ior:V2DI (ashift:V2DI (xor:V2DI (reg:V2DI 114 [ vect__1.12_16 ])
(reg:V2DI 116 [ *m1_01_8(D) ]))
(const_vector:V2DI [
(const_int 32 [0x20]) repeated x2
]))
(lshiftrt:V2DI (xor:V2DI (reg:V2DI 114 [ vect__1.12_16 ])
(reg:V2DI 116 [ *m1_01_8(D) ]))
(const_vector:V2DI [
(const_int 32 [0x20]) repeated x2
]))))
Should this have been simplified to a rotate or do we need more backend
patterns to match it?
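
For reference, here is a scalar sketch of the per-lane equivalence that either fix would rely on. It is not GCC code: the helper names xar_lane and eor_shift_orr_lane are made up for illustration, and the model assumes a rotate amount in the range 1..63. It shows that an IOR of a left shift by (64 - imm) and a logical right shift by imm of the same XOR result is a rotate right by imm, which is what XAR computes on each 64-bit lane. For imm == 32 both shift counts are 32, matching the RTL above.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of one 64-bit lane of XAR: rotate right of (a ^ b) by imm.
// Hypothetical helper for illustration; imm is reduced modulo 64.
static inline uint64_t xar_lane(uint64_t a, uint64_t b, unsigned imm)
{
    uint64_t x = a ^ b;
    imm &= 63;
    // Guard imm == 0 so we never shift a 64-bit value by 64.
    return imm == 0 ? x : (x >> imm) | (x << (64 - imm));
}

// Scalar model of one lane of the eor/ushr/shl/orr sequence.
// Assumption: 1 <= imm <= 63, so both shift counts are in range.
static inline uint64_t eor_shift_orr_lane(uint64_t a, uint64_t b, unsigned imm)
{
    uint64_t x = a ^ b;                       // eor
    uint64_t right_shifted = x >> imm;        // ushr #imm
    uint64_t left_shifted = x << (64 - imm);  // shl #(64 - imm)
    return right_shifted | left_shifted;      // orr
}
```

With imm == 32 the rotate simply swaps the two 32-bit halves of the lane, so the rotate direction does not matter in that particular case.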