[Bug target/93005] Redundant NEON loads/stores from stack are not eliminated

2020-01-06 Thread joel at airwebreathe dot org.uk
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93005

--- Comment #7 from Joel Holdsworth  ---
> Did you test it with big-endian?

Good question. It seems to do the right thing in both cases:
https://godbolt.org/z/7rDzAm

[Bug target/93005] Redundant NEON loads/stores from stack are not eliminated

2020-01-06 Thread joel at airwebreathe dot org.uk
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93005

--- Comment #5 from Joel Holdsworth  ---
I found that if I make modified versions of the intrinsics in arm_neon.h that
are designed more along the lines of the x86_64 SSE intrinsics defined with a
simple pointer dereference, then gcc does the right thing [1].


#include 

__extension__ extern __inline void
__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
vst1q_s32_fixed (int32_t * __a, int32x4_t __b)
{
  *(int32x4_t*)__a = __b;
}

__extension__ extern __inline int32x4_t
__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
vld1q_s32_fixed (const int32_t * __a)
{
  return *(const int32x4_t*)__a;
}

int32x4_t foo(int32x4_t a)
{
int32_t temp[4];
vst1q_s32_fixed(temp, a);
return vld1q_s32_fixed(temp);
}



...compiles to:

foo(long __vector(4)):
bx  lr


Is there any reason not to simply redefine vst1q_s32, vld1q_s32 and friends to
stop using builtins?


[1]https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(fontScale:14,j:2,lang:c%2B%2B,selection:(endColumn:2,endLineNumber:22,positionColumn:1,positionLineNumber:1,selectionStartColumn:2,selectionStartLineNumber:22,startColumn:1,startLineNumber:1),source:'%23include+%3Carm_neon.h%3E%0A%0A__extension__+extern+__inline+void%0A__attribute__++((__always_inline__,+__gnu_inline__,+__artificial__))%0Avst1q_s32_fixed+(int32_t+*+__a,+int32x4_t+__b)%0A%7B%0A++*(int32x4_t*)__a+%3D+__b%3B%0A%7D%0A%0A__extension__+extern+__inline+int32x4_t%0A__attribute__++((__always_inline__,+__gnu_inline__,+__artificial__))%0Avld1q_s32_fixed+(const+int32_t+*+__a)%0A%7B%0A++return+*(const+int32x4_t*)__a%3B%0A%7D%0A%0Aint32x4_t+foo(int32x4_t+a)%0A%7B%0Aint32_t+temp%5B4%5D%3B%0Avst1q_s32_fixed(temp,+a)%3B%0Areturn+vld1q_s32_fixed(temp)%3B%0A%7D'),l:'5',n:'0',o:'C%2B%2B+source+%232',t:'0')),header:(),k:49.54010711093072,l:'4',m:50,n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:arm831,filters:(b:'0',binary:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'1',trim:'1'),fontScale:14,j:2,lang:c%2B%2B,libs:!(),options:'-O2+-march%3Darmv7-a+-mtune%3Dcortex-a8+-mfpu%3Dneon+-mfloat-abi%3Dhard',selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:2),l:'5',n:'0',o:'ARM+gcc+8.3.1+(none)+(Editor+%232,+Compiler+%232)+C%2B%2B',t:'0')),header:(),l:'4',m:50,n:'0',o:'',s:0,t:'0')),k:100,l:'3',n:'0',o:'',t:'0')),version:4

[Bug target/93005] Redundant NEON loads/stores from stack are not eliminated

2020-01-06 Thread joel at airwebreathe dot org.uk
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93005

--- Comment #4 from Joel Holdsworth  ---
Results for clang and MSVC are similar:

clang trunk:

foo(__simd128_int32_t):
push{r11, lr}
mov r11, sp
sub sp, sp, #24
bfc sp, #0, #4
mov r0, sp
vst1.32 {d0, d1}, [r0]
vld1.64 {d0, d1}, [r0:128]
mov sp, r11
pop {r11, pc}


...but even though these other compilers don't do any better on ARM, I still
think my original point stands.

[Bug target/93005] Redundant NEON loads/stores from stack are not eliminated

2020-01-03 Thread joel at airwebreathe dot org.uk
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93005

--- Comment #3 from Joel Holdsworth  ---
Interesting. Comparing the implementation of _mm_store_si128 to vst1q_s32:

emminitrin.h

extern __inline void __attribute__((__gnu_inline__, __always_inline__,
__artificial__))
_mm_store_si128 (__m128i *__P, __m128i __B)
{
  *__P = __B;
}


arm_neon.h

__extension__ extern __inline void
__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
vst1q_s32 (int32_t * __a, int32x4_t __b)
{
  __builtin_neon_vst1v4si ((__builtin_neon_si *) __a, __b);
}


So why is one implemented with a built-in, and the other with a pointer
dereference?

Is there a way of making the optimizer see through __builtin_neon_vst1v4si with
GIMPLE? Where would the code be implemented? Where is it implemented for other
architectures?

[Bug target/93005] Redundant NEON loads/stores from stack are not eliminated

2020-01-02 Thread joel at airwebreathe dot org.uk
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93005

--- Comment #2 from Joel Holdsworth  ---
Are you saying that if the GIMPLE were defined for the intrinsics, then the
optimizer would eliminate them automatically? Or is there more to it?

[Bug c++/93005] New: Redundant NEON loads/stores from stack are not eliminated

2019-12-19 Thread joel at airwebreathe dot org.uk
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93005

Bug ID: 93005
   Summary: Redundant NEON loads/stores from stack are not
eliminated
   Product: gcc
   Version: 8.3.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: joel at airwebreathe dot org.uk
  Target Milestone: ---

On x86_64 SSE, gcc is able to eliminated redundant load/store operations to the
stack, but on ARM, gcc seems unable to do the same optimization with NEON
vector registers.

This x86_64 code is optimized as expected:


#include 

__m128i foo(__m128i a)
{
int32_t temp[4];
_mm_store_si128(reinterpret_cast<__m128i*>(temp), a);
return _mm_load_si128(reinterpret_cast<__m128i*>(temp));
}


...when compiled with -O2:


foo(long long __vector(2)):
ret


However, when compiling analogous code for ARM NEON:


#include 

int32x4_t foo(int32x4_t a)
{
int32_t temp[4];
vst1q_s32(temp, a);
return vld1q_s32(temp);
}


...when compiled with -O2 -march=armv7-a -mtune=cortex-a8 -mfpu=neon
-mfloat-abi=hard :


foo(__simd128_int32_t):
sub sp, sp, #16
vst1.32 {d0-d1}, [sp:64]
vld1.32 {d0-d1}, [sp:64]
add sp, sp, #16
bx  lr


The load/store to the stack are redundant and should be eliminated, because
temp should have been promoted to NEON registers.

(see the attached godbolt link [1] to compare)

This issue was discovered while trying to use gcc with the Eigen library on ARM
NEON. Eigen does intermediate processing using compiler intrinsics, but
intermediate values must be written back to POD arrays on the stack. In a
complex algorithm this results in the machine code being peppered with
redundant stores and loads.


[1]
https://godbolt.org/#z:OYLghAFBqd5TKALEBjA9gEwKYFFMCWALugE4A0BIEAZugHZEDKqAhgDbYgCMALOQCse5dq3qhUAUgBMAIRmzyAZ2ydURAg2rZ6mAMLp2AVwC29ENPI7MAGQL1sAOVMAjbKRAAOcgAd0S4k16A2MzC19/QIY7B2cTNw9vFTUNBiYiVlIiENNzS2TsdSD0zKIYp1d3L2UMrJyw/Nqy%2Bwr4qs8ASmV0I1JULgByGQBme1RjHABqSWG9AA9PADZ7IlJ7ADokGdxJAAYAQT39gH1jk25pTwJJunQIU/PL69YOo8kAdnkDyZ/JleHpMciJMiNgTD5JABWWS8KEAERmX32v0mZxMxyUJFI2AxBAungg2JW7h82KIxzYmJmege%2BIIACpthBQeCupMXoijiiyb16KiTOj2OhWJhcfjCdhiaRSdhyZSiNTaU9GcNcMywT4Oq9hkiPnCBl12CABpCBuRzANdmb0Ma9AoFJMlD0%2BthptJhtwzURjVateQANYgYbvdaeXi8ACc3Aji1jnl20cWImNvDNJhAkfWEd2kd40hjkPe7wjvHe5p91uNZqUIF25G9loN5DgsBQGHBBE4FCoEHbPk7VWAnmG5BondBpBrEBcFfILnsmQAnsbPeR2yYdEQAPL0djLgarnAmMTATizwjYooAN2wNcbVjmhSMoJXZuJRvv7AILlIS4MOFfes1nTA8mzoRgWA4Lg%2BEEYRRHENB7TkERvxrWB6FYDcQElYAdHIG8PFWIx6H9P10B8VJ6DvABaLdpGrVRCko7RdHqcxuCsXRyjiBJhD8AJKLYvjIko7jKg8DiCiKNImiEyTGOk%2BgSiyMS2gkmpSjkjSVJaHiqm4LonV6foeENY1TXLe8bQGBZFmoxZeEmYBUFQSZPHWYZJggfBiDIN0PXISYDA7Lt/IMoKkNkL0Kz9JBsBFKoIDMgZU3IdNIV2dZdgjDLC12YY82kXZIWkfgLStchrOrWt6xirpAxK9ZpEWZqIw9SFFneAFPAjZMBmGM1ysrAZosbLoWwQeAIDbLA8EILFKGocDmDYM8YKESx4IkSLlAU5iIGsITeusVTeKTfiomCQxchAXqLtE3TxJAJMpMo5TsmusJete4omlOqoXtkz7zG%2Bv7HrU57DOdEyhndMYJldalMnRBwGE2bY3gOf5pDmXggRudA7mx3H8Y5A4PiRFFsfxlkIWhWFIQRHUuV%2BK9MW4ABHDEAXVcFAo5Znvl%2BHlSD5K92EwTnuekXnNU5cn3n1ZKLKGyrbUix1ocR916NqsaujihKPCSgMMwjdZIV4BNdnebhuGy4Ybc2lM0xADLLIqqrlBqhtfWbRBprQdAQvcRbe2D/suxAZHhw4sd2AnKcZ3vecMNIfdV3XTcdz3c8wRPM97wvJiCBvO8KuwR9UGfQZV3fWcvx/P9ZsA1YCBAz0umWyC1v4DaRBPRC5AUFCXDQk3yMomi6Mmajjz6LZhjhZGr3eajWFnkwiGI7AZjhDAskr9fPE3mgfCMPfUb5OeaCFVgiHX%2Bc96QTJMAYlIghY/RgZ4Y6uPBs6EQBJBDkndESQR/oeBentX6mkf7Rl2h/GSpRIHPW0h9UI7FQYoIAQDKGxkuDSGVoNWc1l9gACUACyTkXJuQ8usbgXl6AMGwB0LyPksT%2BUsEFCOA5SBcLYXaYechRq%2BgNvFHAxtkqpXSnWVWXsax1l9gaeqGYQxdS6rsRYYYASdTtn1AaHthqiJUX1XW8iqx6zEfhdwAQtC8CAA%3D%3D%3D