[Bug target/47754] [missed optimization] AVX allows unaligned memory operands but GCC uses unaligned load and register operand
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

Uroš Bizjak changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #12 from Uroš Bizjak ---
(In reply to Allan Jensen from comment #11)
> I think this one could probably be closed though.

Fixed.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

--- Comment #11 from Allan Jensen ---
I think the issue I noted is completely separate from this one, so I opened
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78762 to deal with it. I think
this one could probably be closed though.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

--- Comment #10 from Allan Jensen ---
No, I mean it triggers when you compile with -mavx2; it is solved with
-march=haswell. The issue appears to be that the tune flag
X86_TUNE_AVX256_UNALIGNED_LOAD_OPTIMAL is set for all processors that support
AVX2, but if you use generic+avx2 GCC still optimizes pessimistically for
pre-AVX2 processors by setting MASK_AVX256_SPLIT_UNALIGNED_LOAD. Since there
are two controlling flags, though, and the second one,
X86_TUNE_AVX256_UNALIGNED_STORE_OPTIMAL, is still set for some AVX2 processors
(btver and znver) besides generic, it is harder to argue what generic+avx2
should do for stores.
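
For reference, the split behaviour discussed here should correspond to the
user-visible -mavx256-split-unaligned-load / -mavx256-split-unaligned-store
options, so it can be inspected and overridden independently of -march/-mtune.
A minimal sketch, assuming current option spellings (the file and function
names are placeholders):

/* Sketch for checking how the 256-bit unaligned-access tuning resolves.
   Inspect the effective settings with, e.g.:
       gcc -O2 -mavx2         -Q --help=target | grep avx256-split
       gcc -O2 -march=haswell -Q --help=target | grep avx256-split
   Keep plain -mavx2 but disable the splitting:
       gcc -O2 -mavx2 -mno-avx256-split-unaligned-load \
                      -mno-avx256-split-unaligned-store -S tune-check.c  */
#include <immintrin.h>

void add_ps(float *dst, const float *a, const float *b)
{
    /* Unaligned 256-bit accesses; with splitting disabled these should stay
       full-width rather than being split into 128-bit halves.  */
    _mm256_storeu_ps(dst, _mm256_add_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b)));
}
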
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

--- Comment #9 from Marc Glisse ---
(In reply to Allan Jensen from comment #7)
> This is significantly worse with integer operands.
>
> _mm256_storeu_si256((__m256i *)&data[3],
>     _mm256_add_epi32(_mm256_loadu_si256((const __m256i *)&data[0]),
>                      _mm256_loadu_si256((const __m256i *)&data[1]))
> );

Please don't post isolated lines of code, always complete examples ready to be
copy-pasted and compiled. The declaration of data is relevant to the generated
code.

> compiles to:
>
> vmovdqu 0x20(%rax),%xmm0
> vinserti128 $0x1,0x30(%rax),%ymm0,%ymm0
> vmovdqu (%rax),%xmm1
> vinserti128 $0x1,0x10(%rax),%ymm1,%ymm1
> vpaddd %ymm1,%ymm0,%ymm0
> vmovups %xmm0,0x60(%rax)
> vextracti128 $0x1,%ymm0,0x70(%rax)

With trunk and -march=skylake (or haswell), I can get

        vmovdqu data(%rip), %ymm0
        vpaddd  data+32(%rip), %ymm0, %ymm0
        vmovdqu %ymm0, data+96(%rip)

so this looks fixed?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

--- Comment #8 from Allan Jensen ---
Note this happens with -mavx2, but not with -march=haswell. It appears the
tuning is a bit too pessimistic when AVX2 is enabled on generic x64.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

Allan Jensen changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |linux at carewolf dot com

--- Comment #7 from Allan Jensen ---
This is significantly worse with integer operands.

_mm256_storeu_si256((__m256i *)&data[3],
    _mm256_add_epi32(_mm256_loadu_si256((const __m256i *)&data[0]),
                     _mm256_loadu_si256((const __m256i *)&data[1]))
);

compiles to:

vmovdqu 0x20(%rax),%xmm0
vinserti128 $0x1,0x30(%rax),%ymm0,%ymm0
vmovdqu (%rax),%xmm1
vinserti128 $0x1,0x10(%rax),%ymm1,%ymm1
vpaddd %ymm1,%ymm0,%ymm0
vmovups %xmm0,0x60(%rax)
vextracti128 $0x1,%ymm0,0x70(%rax)
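
Following up on the request in comment #9, a complete, copy-paste-ready version
of the snippet above; the declaration of data is an assumption (the original
report's array is not quoted in this thread):

/* Minimal self-contained sketch of the snippet above; data's type and size
   are placeholders.  Compare, e.g.:
       gcc -O2 -mavx2         -S testcase.c
       gcc -O2 -march=haswell -S testcase.c  */
#include <immintrin.h>

int data[64];

void add_offset(void)
{
    /* &data[1] and &data[3] are only guaranteed 4-byte aligned, so these are
       genuinely unaligned 256-bit accesses.  */
    _mm256_storeu_si256((__m256i *)&data[3],
        _mm256_add_epi32(_mm256_loadu_si256((const __m256i *)&data[0]),
                         _mm256_loadu_si256((const __m256i *)&data[1])));
}
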
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

--- Comment #6 from Jakub Jelinek ---
Author: jakub
Date: Wed Oct 30 17:59:44 2013
New Revision: 204219

URL: http://gcc.gnu.org/viewcvs?rev=204219&root=gcc&view=rev
Log:
        PR target/47754
        * config/i386/i386.c (ix86_avx256_split_vector_move_misalign): If
        op1 is misaligned_operand, just use *mov_internal insn rather
        than UNSPEC_LOADU load.
        (ix86_expand_vector_move_misalign): Likewise (for TARGET_AVX only).
        Avoid gen_lowpart on op0 if it isn't MEM.

        * gcc.target/i386/avx256-unaligned-load-1.c: Adjust scan-assembler
        and scan-assembler-not regexps.
        * gcc.target/i386/avx256-unaligned-load-2.c: Likewise.
        * gcc.target/i386/avx256-unaligned-load-3.c: Likewise.
        * gcc.target/i386/avx256-unaligned-load-4.c: Likewise.
        * gcc.target/i386/l_fma_float_1.c: Use pattern for
        scan-assembler-times instead of just one insn name.
        * gcc.target/i386/l_fma_float_2.c: Likewise.
        * gcc.target/i386/l_fma_float_3.c: Likewise.
        * gcc.target/i386/l_fma_float_4.c: Likewise.
        * gcc.target/i386/l_fma_float_5.c: Likewise.
        * gcc.target/i386/l_fma_float_6.c: Likewise.
        * gcc.target/i386/l_fma_double_1.c: Likewise.
        * gcc.target/i386/l_fma_double_2.c: Likewise.
        * gcc.target/i386/l_fma_double_3.c: Likewise.
        * gcc.target/i386/l_fma_double_4.c: Likewise.
        * gcc.target/i386/l_fma_double_5.c: Likewise.
        * gcc.target/i386/l_fma_double_6.c: Likewise.

Modified:
    trunk/gcc/config/i386/i386.c
    trunk/gcc/testsuite/gcc.target/i386/avx256-unaligned-load-1.c
    trunk/gcc/testsuite/gcc.target/i386/avx256-unaligned-load-2.c
    trunk/gcc/testsuite/gcc.target/i386/avx256-unaligned-load-3.c
    trunk/gcc/testsuite/gcc.target/i386/avx256-unaligned-load-4.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_double_1.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_double_2.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_double_3.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_double_4.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_double_5.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_double_6.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_float_1.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_float_2.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_float_3.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_float_4.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_float_5.c
    trunk/gcc/testsuite/gcc.target/i386/l_fma_float_6.c
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

xiaoyuanbo changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |xiaoyuanbo at yeah dot net

--- Comment #5 from xiaoyuanbo 2012-02-22 13:04:03 UTC ---
so you are boss
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

Richard Guenther changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rth at gcc dot gnu.org

--- Comment #4 from Richard Guenther 2011-02-16 10:49:30 UTC ---
Note that GCC doesn't use unaligned memory operands because it doesn't have the
knowledge implemented that this is ok for AVX; it simply treats the AVX case
the same as the SSE case, where the memory operands are required to be aligned.
That said, unaligned SSE and AVX moves are implemented using UNSPECs, so they
will never be combined with other instructions. I don't know if there is a way
to still distinguish unaligned and aligned loads/stores and let them appear as
regular RTL moves at the same time. Richard, is that even possible?
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

--- Comment #3 from Matthias Kretz 2011-02-15 16:40:38 UTC ---
ICC??? Whatever, I stopped trusting that compiler long ago:

:
        vmovups     0x2039b8(%rip),%xmm0
        vmovups     0x2039b4(%rip),%xmm1
        vinsertf128 $0x1,0x2039b6(%rip),%ymm0,%ymm2
        vinsertf128 $0x1,0x2039b0(%rip),%ymm1,%ymm3
        vaddps      %ymm3,%ymm2,%ymm4
        vmovups     %ymm4,0x20399c(%rip)
        vzeroupper
        retq

:
        vmovups 0x203978(%rip),%ymm0
        vaddps  0x203974(%rip),%ymm0,%ymm1
        vmovups %ymm1,0x203974(%rip)
        vzeroupper
        retq

Nice optimization of unaligned loads there... not.

Just a small side-note for your enjoyment: I wrote a C++ abstraction for SSE,
and with GCC this gives an almost four-fold speedup for Mandelbrot. ICC, on the
other hand, compiles such awful code that - even with SSE in use - it creates a
four-fold slowdown compared to the non-SSE code. GCC really is a nice compiler!
Keep on rocking!
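
The source behind the disassembly above is not quoted in this excerpt; purely
as an illustration, a hedged sketch of a testcase in that shape (global float
array, accesses offset by one element to force misalignment; the names and
sizes are assumptions):

/* Hypothetical reconstruction: rip-relative, unaligned 256-bit accesses to a
   global float array.  The missed optimization in the bug title is that the
   unaligned load could be folded into vaddps as a memory operand instead of
   being loaded separately into a register.  */
#include <immintrin.h>

float data[32];

void add_unaligned(void)
{
    /* &data[1] is only guaranteed 4-byte aligned.  */
    __m256 a = _mm256_loadu_ps(&data[0]);
    __m256 b = _mm256_loadu_ps(&data[1]);
    _mm256_storeu_ps(&data[16], _mm256_add_ps(a, b));
}
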
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

--- Comment #2 from Matthias Kretz 2011-02-15 16:31:39 UTC ---
True, the Optimization Reference Manual and the AVX docs are not very specific
about the performance impact of this. But as far as I understood the docs, a
memory operand will internally not be slower than an unaligned load + op, but
also not faster - except, of course, where memory fetch latency comes into
play. So it's just about having more registers available - again, AFAIU. If
you want, I can try the same testcase on ICC...
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47754

Richard Guenther changed:

           What            |Removed                    |Added
----------------------------------------------------------------------------
           Keywords        |                           |missed-optimization
           Target          |                           |x86_64-*-*
           Status          |UNCONFIRMED                |NEW
           Last reconfirmed|                           |2011.02.15 16:21:49
           Ever Confirmed  |0                          |1

--- Comment #1 from Richard Guenther 2011-02-15 16:21:49 UTC ---
Confirmed. I am not sure, though, that it really would not be slower for a
non-load/store instruction to need assistance for unaligned loads/stores.