https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117850
Bug ID: 117850
Summary: GCC emits DUP, UMULL instead of UMULL2
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: tnfchris at gcc dot gnu.org
CC: rguenth at gcc dot gnu.org
Target Milestone: ---
Target: aarch64*
The following example:
#include <arm_neon.h>

uint16x8_t foo(const uint8x16_t s) {
  const uint8x16_t f0 = vdupq_n_u8(4);
  return vmull_u8(vget_high_u8(s), vget_high_u8(f0));
}
compiled with -O3 generates:
foo(__Uint8x16_t):
        movi    v31.8b, 0x4
        dup     d0, v0.d[1]
        umull   v0.8h, v0.8b, v31.8b
        ret
instead of
foo(__Uint8x16_t):
        movi    v1.16b, #4
        umull2  v0.8h, v0.16b, v1.16b
        ret
I think we can fix this and other cases by lowering them in GIMPLE.
Concretely, the above could be lowered to VEC_WIDEN_MULT_{LO,HI}_EXPR, with the
BIT_FIELD_REFs generated by the vget_high calls folded into the proper _LO or
_HI variant.
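
As a rough illustration of why the fold is valid, here is a minimal scalar
sketch in portable C. It is only a model: the types u8x16, u8x8 and u16x8 and
the helpers get_high, widen_mul, widen_mul_hi, splat, forms_agree and
check_forms_agree are hypothetical names made up for this example, not GCC or
ACLE APIs, and lanes 8..15 are assumed to be the "high" half (little-endian
lane numbering). It shows that a widening multiply of the two extracted high
halves gives the same lanes as a single high-half widening multiply on the
full vectors.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar stand-ins for the NEON vector types. */
typedef struct { uint8_t lane[16]; } u8x16;
typedef struct { uint8_t lane[8]; } u8x8;
typedef struct { uint16_t lane[8]; } u16x8;

/* Models vget_high_u8: extract lanes 8..15 (assumed high half). */
static u8x8 get_high(u8x16 v) {
  u8x8 r;
  for (int i = 0; i < 8; i++)
    r.lane[i] = v.lane[i + 8];
  return r;
}

/* Models vmull_u8 (UMULL): widening multiply of two 8-lane vectors. */
static u16x8 widen_mul(u8x8 a, u8x8 b) {
  u16x8 r;
  for (int i = 0; i < 8; i++)
    r.lane[i] = (uint16_t)(a.lane[i] * b.lane[i]);
  return r;
}

/* Models UMULL2: widening multiply of the high halves of two
   16-lane vectors, with no separate extraction step. */
static u16x8 widen_mul_hi(u8x16 a, u8x16 b) {
  u16x8 r;
  for (int i = 0; i < 8; i++)
    r.lane[i] = (uint16_t)(a.lane[i + 8] * b.lane[i + 8]);
  return r;
}

/* Models vdupq_n_u8: broadcast one byte to all 16 lanes. */
static u8x16 splat(uint8_t x) {
  u8x16 r;
  for (int i = 0; i < 16; i++)
    r.lane[i] = x;
  return r;
}

/* Returns 1 if the "extract then multiply" form and the fused
   high-half form produce identical lanes for input s. */
int forms_agree(u8x16 s) {
  u8x16 f0 = splat(4);
  u16x8 before = widen_mul(get_high(s), get_high(f0));
  u16x8 after = widen_mul_hi(s, f0);
  for (int i = 0; i < 8; i++)
    if (before.lane[i] != after.lane[i])
      return 0;
  return 1;
}

/* Checks the equivalence on a few sample inputs. */
void check_forms_agree(void) {
  u8x16 ramp;
  for (int i = 0; i < 16; i++)
    ramp.lane[i] = (uint8_t)(i * 17);
  assert(forms_agree(ramp));
  assert(forms_agree(splat(0)));
  assert(forms_agree(splat(255)));
}
```

At the source level, the fused form is what the ACLE intrinsic vmull_high_u8
expresses directly; the model above only argues that rewriting the testcase
into that shape preserves the result.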
To do this, though, we might need to expose valueize to the API so that we can
look at the operands directly rather than having to chase the SSA_NAME_DEF_STMT.
Are you OK with this, Richi?