[Bug target/93005] Redundant NEON loads/stores from stack are not eliminated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93005 --- Comment #7 from Joel Holdsworth --- > Did you test it with big-endian? Good question. It seems to do the right thing in both cases: https://godbolt.org/z/7rDzAm
[Bug target/93005] Redundant NEON loads/stores from stack are not eliminated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93005 --- Comment #5 from Joel Holdsworth --- I found that if I make modified versions of the intrinsics in arm_neon.h that are designed more along the lines of the x86_64 SSE intrinsics defined with a simple pointer dereference, then gcc does the right thing [1]. #include __extension__ extern __inline void __attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) vst1q_s32_fixed (int32_t * __a, int32x4_t __b) { *(int32x4_t*)__a = __b; } __extension__ extern __inline int32x4_t __attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) vld1q_s32_fixed (const int32_t * __a) { return *(const int32x4_t*)__a; } int32x4_t foo(int32x4_t a) { int32_t temp[4]; vst1q_s32_fixed(temp, a); return vld1q_s32_fixed(temp); } ...compiles to: foo(long __vector(4)): bx lr Is there any reason not to simply redefine vst1q_s32, vld1q_s32 and friends to stop using builtins? [1]https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(fontScale:14,j:2,lang:c%2B%2B,selection:(endColumn:2,endLineNumber:22,positionColumn:1,positionLineNumber:1,selectionStartColumn:2,selectionStartLineNumber:22,startColumn:1,startLineNumber:1),source:'%23include+%3Carm_neon.h%3E%0A%0A__extension__+extern+__inline+void%0A__attribute__++((__always_inline__,+__gnu_inline__,+__artificial__))%0Avst1q_s32_fixed+(int32_t+*+__a,+int32x4_t+__b)%0A%7B%0A++*(int32x4_t*)__a+%3D+__b%3B%0A%7D%0A%0A__extension__+extern+__inline+int32x4_t%0A__attribute__++((__always_inline__,+__gnu_inline__,+__artificial__))%0Avld1q_s32_fixed+(const+int32_t+*+__a)%0A%7B%0A++return+*(const+int32x4_t*)__a%3B%0A%7D%0A%0Aint32x4_t+foo(int32x4_t+a)%0A%7B%0Aint32_t+temp%5B4%5D%3B%0Avst1q_s32_fixed(temp,+a)%3B%0Areturn+vld1q_s32_fixed(temp)%3B%0A%7D'),l:'5',n:'0',o:'C%2B%2B+source+%232',t:'0')),header:(),k:49.54010711093072,l:'4',m:50,n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:arm831,filters:(b:'0',binary:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'1',trim:'1'),fontScale:14,j:2,lang:c%2B%2B,libs:!(),options:'-O2+-march%3Darmv7-a+-mtune%3Dcortex-a8+-mfpu%3Dneon+-mfloat-abi%3Dhard',selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:2),l:'5',n:'0',o:'ARM+gcc+8.3.1+(none)+(Editor+%232,+Compiler+%232)+C%2B%2B',t:'0')),header:(),l:'4',m:50,n:'0',o:'',s:0,t:'0')),k:100,l:'3',n:'0',o:'',t:'0')),version:4
[Bug target/93005] Redundant NEON loads/stores from stack are not eliminated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93005 --- Comment #4 from Joel Holdsworth --- Results for clang and MSVC are similar: clang trunk: foo(__simd128_int32_t): push{r11, lr} mov r11, sp sub sp, sp, #24 bfc sp, #0, #4 mov r0, sp vst1.32 {d0, d1}, [r0] vld1.64 {d0, d1}, [r0:128] mov sp, r11 pop {r11, pc} ...but even though these other compilers don't do any better on ARM, I still think my original point stands.
[Bug target/93005] Redundant NEON loads/stores from stack are not eliminated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93005 --- Comment #3 from Joel Holdsworth --- Interesting. Comparing the implementation of _mm_store_si128 to vst1q_s32: emminitrin.h extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_store_si128 (__m128i *__P, __m128i __B) { *__P = __B; } arm_neon.h __extension__ extern __inline void __attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) vst1q_s32 (int32_t * __a, int32x4_t __b) { __builtin_neon_vst1v4si ((__builtin_neon_si *) __a, __b); } So why is one implemented with a built-in, and the other with a pointer dereference? Is there a way of making the optimizer see through __builtin_neon_vst1v4si with GIMPLE? Where would the code be implemented? Where is it implemented for other architectures?
[Bug target/93005] Redundant NEON loads/stores from stack are not eliminated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93005 --- Comment #2 from Joel Holdsworth --- Are you saying that if the GIMPLE were defined for the intrinsics, then the optimizer would eliminate them automatically? Or is there more to it?
[Bug c++/93005] New: Redundant NEON loads/stores from stack are not eliminated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93005 Bug ID: 93005 Summary: Redundant NEON loads/stores from stack are not eliminated Product: gcc Version: 8.3.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: joel at airwebreathe dot org.uk Target Milestone: --- On x86_64 SSE, gcc is able to eliminated redundant load/store operations to the stack, but on ARM, gcc seems unable to do the same optimization with NEON vector registers. This x86_64 code is optimized as expected: #include __m128i foo(__m128i a) { int32_t temp[4]; _mm_store_si128(reinterpret_cast<__m128i*>(temp), a); return _mm_load_si128(reinterpret_cast<__m128i*>(temp)); } ...when compiled with -O2: foo(long long __vector(2)): ret However, when compiling analogous code for ARM NEON: #include int32x4_t foo(int32x4_t a) { int32_t temp[4]; vst1q_s32(temp, a); return vld1q_s32(temp); } ...when compiled with -O2 -march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=hard : foo(__simd128_int32_t): sub sp, sp, #16 vst1.32 {d0-d1}, [sp:64] vld1.32 {d0-d1}, [sp:64] add sp, sp, #16 bx lr The load/store to the stack are redundant and should be eliminated, because temp should have been promoted to NEON registers. (see the attached godbolt link [1] to compare) This issue was discovered while trying to use gcc with the Eigen library on ARM NEON. Eigen does intermediate processing using compiler intrinsics, but intermediate values must be written back to POD arrays on the stack. In a complex algorithm this results in the machine code being peppered with redundant stores and loads. [1] https://godbolt.org/#z:OYLghAFBqd5TKALEBjA9gEwKYFFMCWALugE4A0BIEAZugHZEDKqAhgDbYgCMALOQCse5dq3qhUAUgBMAIRmzyAZ2ydURAg2rZ6mAMLp2AVwC29ENPI7MAGQL1sAOVMAjbKRAAOcgAd0S4k16A2MzC19/QIY7B2cTNw9vFTUNBiYiVlIiENNzS2TsdSD0zKIYp1d3L2UMrJyw/Nqy%2Bwr4qs8ASmV0I1JULgByGQBme1RjHABqSWG9AA9PADZ7IlJ7ADokGdxJAAYAQT39gH1jk25pTwJJunQIU/PL69YOo8kAdnkDyZ/JleHpMciJMiNgTD5JABWWS8KEAERmX32v0mZxMxyUJFI2AxBAungg2JW7h82KIxzYmJmege%2BIIACpthBQeCupMXoijiiyb16KiTOj2OhWJhcfjCdhiaRSdhyZSiNTaU9GcNcMywT4Oq9hkiPnCBl12CABpCBuRzANdmb0Ma9AoFJMlD0%2BthptJhtwzURjVateQANYgYbvdaeXi8ACc3Aji1jnl20cWImNvDNJhAkfWEd2kd40hjkPe7wjvHe5p91uNZqUIF25G9loN5DgsBQGHBBE4FCoEHbPk7VWAnmG5BondBpBrEBcFfILnsmQAnsbPeR2yYdEQAPL0djLgarnAmMTATizwjYooAN2wNcbVjmhSMoJXZuJRvv7AILlIS4MOFfes1nTA8mzoRgWA4Lg%2BEEYRRHENB7TkERvxrWB6FYDcQElYAdHIG8PFWIx6H9P10B8VJ6DvABaLdpGrVRCko7RdHqcxuCsXRyjiBJhD8AJKLYvjIko7jKg8DiCiKNImiEyTGOk%2BgSiyMS2gkmpSjkjSVJaHiqm4LonV6foeENY1TXLe8bQGBZFmoxZeEmYBUFQSZPHWYZJggfBiDIN0PXISYDA7Lt/IMoKkNkL0Kz9JBsBFKoIDMgZU3IdNIV2dZdgjDLC12YY82kXZIWkfgLStchrOrWt6xirpAxK9ZpEWZqIw9SFFneAFPAjZMBmGM1ysrAZosbLoWwQeAIDbLA8EILFKGocDmDYM8YKESx4IkSLlAU5iIGsITeusVTeKTfiomCQxchAXqLtE3TxJAJMpMo5TsmusJete4omlOqoXtkz7zG%2Bv7HrU57DOdEyhndMYJldalMnRBwGE2bY3gOf5pDmXggRudA7mx3H8Y5A4PiRFFsfxlkIWhWFIQRHUuV%2BK9MW4ABHDEAXVcFAo5Znvl%2BHlSD5K92EwTnuekXnNU5cn3n1ZKLKGyrbUix1ocR916NqsaujihKPCSgMMwjdZIV4BNdnebhuGy4Ybc2lM0xADLLIqqrlBqhtfWbRBprQdAQvcRbe2D/suxAZHhw4sd2AnKcZ3vecMNIfdV3XTcdz3c8wRPM97wvJiCBvO8KuwR9UGfQZV3fWcvx/P9ZsA1YCBAz0umWyC1v4DaRBPRC5AUFCXDQk3yMomi6Mmajjz6LZhjhZGr3eajWFnkwiGI7AZjhDAskr9fPE3mgfCMPfUb5OeaCFVgiHX%2Bc96QTJMAYlIghY/RgZ4Y6uPBs6EQBJBDkndESQR/oeBentX6mkf7Rl2h/GSpRIHPW0h9UI7FQYoIAQDKGxkuDSGVoNWc1l9gACUACyTkXJuQ8usbgXl6AMGwB0LyPksT%2BUsEFCOA5SBcLYXaYechRq%2BgNvFHAxtkqpXSnWVWXsax1l9gaeqGYQxdS6rsRYYYASdTtn1AaHthqiJUX1XW8iqx6zEfhdwAQtC8CAA%3D%3D%3D