[Bug target/89101] [Aarch64] vfmaq_laneq_f32 generates unnecessary dup instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89101

Wilco changed:

           What    |Removed |Added
           CC      |        |wilco at gcc dot gnu.org

--- Comment #1 from Wilco ---
(In reply to Gael Guennebaud from comment #0)
> vfmaq_laneq_f32 is currently implemented as:
>
> __extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
> vfmaq_laneq_f32 (float32x4_t __a, float32x4_t __b,
>                  float32x4_t __c, const int __lane)
> {
>   return __builtin_aarch64_fmav4sf (__b,
>                                     __aarch64_vdupq_laneq_f32 (__c, __lane),
>                                     __a);
> }
>
> thus leading to unoptimized code as:
>
>   ldr  q1, [x2, 16]
>   dup  v28.4s, v1.s[0]
>   dup  v27.4s, v1.s[1]
>   dup  v26.4s, v1.s[2]
>   dup  v1.4s, v1.s[3]
>   fmla v22.4s, v25.4s, v28.4s
>   fmla v3.4s, v25.4s, v27.4s
>   fmla v6.4s, v25.4s, v26.4s
>   fmla v17.4s, v25.4s, v1.4s
>
> instead of:
>
>   ldr  q1, [x2, 16]
>   fmla v22.4s, v25.4s, v1.s[0]
>   fmla v3.4s, v25.4s, v1.s[1]
>   fmla v6.4s, v25.4s, v1.s[2]
>   fmla v17.4s, v25.4s, v1.s[3]
>
> I guess several other *lane* intrinsics exhibit the same shortcoming.

Which compiler version did you use? I tried this on GCC 6, 7, 8, and 9 with -O2:

  #include "arm_neon.h"

  float32x4_t f(float32x4_t a, float32x4_t b, float32x4_t c)
  {
    a = vfmaq_laneq_f32 (a, b, c, 0);
    a = vfmaq_laneq_f32 (a, b, c, 1);
    return a;
  }

  fmla v0.4s, v1.4s, v2.4s[0]
  fmla v0.4s, v1.4s, v2.4s[1]
  ret

In all cases the optimizer is able to merge the dups as expected. If it still
fails for you, could you provide a compilable example like the above that
shows the issue?

> For the record, I managed to partly work around this issue by writing my own
> version as:
>
>   if(LaneID==0)      asm("fmla %0.4s, %1.4s, %2.s[0]\n" : "+w" (c) : "w" (a), "w" (b) : );
>   else if(LaneID==1) asm("fmla %0.4s, %1.4s, %2.s[1]\n" : "+w" (c) : "w" (a), "w" (b) : );
>   else if(LaneID==2) asm("fmla %0.4s, %1.4s, %2.s[2]\n" : "+w" (c) : "w" (a), "w" (b) : );
>   else if(LaneID==3) asm("fmla %0.4s, %1.4s, %2.s[3]\n" : "+w" (c) : "w" (a), "w" (b) : );
>
> but that's of course not ideal. This change yields a 32% speed up in Eigen's
> matrix product: http://eigen.tuxfamily.org/bz/show_bug.cgi?id=1633

I'd strongly advise against using inline assembler since most people make
mistakes writing it, and GCC won't be able to optimize code using inline
assembler.
Richard Biener changed:

           What            |Removed     |Added
           Status          |UNCONFIRMED |WAITING
           Last reconfirmed|            |2019-01-29
           Ever confirmed  |0           |1
--- Comment #2 from Gael Guennebaud ---
Indeed, it fails to remove the dup only if the coefficient is used multiple
times, as in the following reduced example: (https://godbolt.org/z/hmSaE0)

  #include <arm_neon.h>

  void foo(const float* a, const float * b, float * c, int n) {
    float32x4_t c0, c1, c2, c3;
    c0 = vld1q_f32(c+0*4);
    c1 = vld1q_f32(c+1*4);
    for(int k=0; k<n; ++k)
    {
      float32x4_t a0 = vld1q_f32(a+0*4+k*4);
      float32x4_t b0 = vld1q_f32(b+k*4);
      c0 = vfmaq_laneq_f32(c0, a0, b0, 0);
      c1 = vfmaq_laneq_f32(c1, a0, b0, 0);
    }
    vst1q_f32(c+0*4, c0);
    vst1q_f32(c+1*4, c1);
  }

I tested with gcc 7 and 8.
Wilco changed:

           What            |Removed |Added
           Status          |WAITING |NEW
           Known to work   |        |9.0
           Version         |unknown |8.2.0
           Target Milestone|---     |9.0
           Known to fail   |        |8.2.0

--- Comment #3 from Wilco ---
(In reply to Gael Guennebaud from comment #2)
> Indeed, it fails to remove the dup only if the coefficient is used multiple
> times, as in the following reduced example: (https://godbolt.org/z/hmSaE0)
>
> #include <arm_neon.h>
>
> void foo(const float* a, const float * b, float * c, int n) {
>   float32x4_t c0, c1, c2, c3;
>   c0 = vld1q_f32(c+0*4);
>   c1 = vld1q_f32(c+1*4);
>   for(int k=0; k<n; ++k)
>   {
>     float32x4_t a0 = vld1q_f32(a+0*4+k*4);
>     float32x4_t b0 = vld1q_f32(b+k*4);
>     c0 = vfmaq_laneq_f32(c0, a0, b0, 0);
>     c1 = vfmaq_laneq_f32(c1, a0, b0, 0);
>   }
>   vst1q_f32(c+0*4, c0);
>   vst1q_f32(c+1*4, c1);
> }
>
> I tested with gcc 7 and 8.

Confirmed for GCC 8, fixed on trunk. I tried the above example with up to 4
uses and it always generates the expected code on trunk. So this is fixed for
GCC 9; however, it seems unlikely the fix (multi-use support in combine)
could be backported.
--- Comment #4 from Gael Guennebaud ---
Good to know this is fixed in trunk! Thank you, and sorry for the false alarm.
Richard Earnshaw changed:

           What            |Removed |Added
           Status          |NEW     |RESOLVED
           Resolution      |---     |FIXED

--- Comment #5 from Richard Earnshaw ---
Fixed on trunk (aka gcc-9). Not a regression, so no backport.