[Bug rtl-optimization/43147] SSE shuffle merge
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43147

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
   Target Milestone|---                         |13.0
         Resolution|---                         |FIXED

--- Comment #21 from Andrew Pinski ---
The constant folding part was fixed in GCC 12, but combining the shuffles was
fixed in GCC 13. That is, for:

```
__m128 m;

int main()
{
    m = _mm_shuffle_ps(m, m, 0xC9); // Those two shuffles together swap pairs
    m = _mm_shuffle_ps(m, m, 0x2D); // And could be optimized to 0x4E
    printv(m);
    return 0;
}
```

GCC 13+ produces:

```
        movaps  m(%rip), %xmm0
        shufps  $78, %xmm0, %xmm0
        movaps  %xmm0, m(%rip)
        call    _Z6printvDv4_f
```

instead of what was there in GCC 12:

```
        movaps  m(%rip), %xmm0
        shufps  $201, %xmm0, %xmm0
        shufps  $45, %xmm0, %xmm0
        movaps  %xmm0, m(%rip)
```

So closing as fixed in GCC 13.
--- Comment #20 from Hongtao.liu ---
Fixed in GCC 12; GCC now generates optimal code:

```
main:
.LFB532:
        .cfi_startproc
        subq    $8, %rsp
        .cfi_def_cfa_offset 16
        movaps  .LC0(%rip), %xmm0
        call    printv
        xorl    %eax, %eax
        addq    $8, %rsp
        .cfi_def_cfa_offset 8
        ret
        .cfi_endproc
.LFE532:
        .size   main, .-main
        .section        .rodata.cst16,"aM",@progbits,16
        .align  16
.LC0:
        .long   1073741824
        .long   1065353216
        .long   1082130432
        .long   1077936128
        .ident  "GCC: (GNU) 12.0.0 20210825 (experimental)"
        .section        .note.GNU-stack,"",@progbits
```
--- Comment #19 from CVS Commits ---
The master branch has been updated by hongtao Liu :

https://gcc.gnu.org/g:0fa4787bf34b173ce6f198e99b6f6dd8a3f98014

commit r12-3177-g0fa4787bf34b173ce6f198e99b6f6dd8a3f98014
Author: liuhongt
Date:   Fri Dec 11 19:02:43 2020 +0800

    Fold more shuffle builtins to VEC_PERM_EXPR.

    A follow-up to
    https://gcc.gnu.org/pipermail/gcc-patches/2019-May/521983.html

    gcc/

            PR target/98167
            PR target/43147
            * config/i386/i386.c (ix86_gimple_fold_builtin): Fold
            IX86_BUILTIN_SHUFPD512, IX86_BUILTIN_SHUFPS512,
            IX86_BUILTIN_SHUFPD256, IX86_BUILTIN_SHUFPS,
            IX86_BUILTIN_SHUFPS256.
            (ix86_masked_all_ones): New function.

    gcc/testsuite/

            * gcc.target/i386/avx512f-vshufpd-1.c: Adjust testcase.
            * gcc.target/i386/avx512f-vshufps-1.c: Adjust testcase.
            * gcc.target/i386/pr43147.c: New test.
--- Comment #18 from Hongtao.liu ---
(In reply to Marc Glisse from comment #17)
> (In reply to Hongtao.liu from comment #15)
> > The issue can also be solved by folding __builtin_ia32_shufps to gimple
> > VEC_PERM_EXPR,
>
> Didn't you post a patch to do that last year? What happened to it?

I almost forgot it; let me retest my patch. It's in
https://gcc.gnu.org/pipermail/gcc-patches/2020-December/562029.html
--- Comment #17 from Marc Glisse ---
(In reply to Hongtao.liu from comment #15)
> The issue can also be solved by folding __builtin_ia32_shufps to gimple
> VEC_PERM_EXPR,

Didn't you post a patch to do that last year? What happened to it?
--- Comment #16 from Andrew Pinski ---
(In reply to Hongtao.liu from comment #15)
> > I think pass_combine should be extended to force illegitimate constant
> > to constant pool and recog load insn again, It looks like a general
> > optimization that better not do it in the backend.
>
> The issue can also be solved by folding __builtin_ia32_shufps to gimple
> VEC_PERM_EXPR, i.e. the below testcase doesn't have the problem
>
> typedef int v4si __attribute__((vector_size (16)));
>
> v4si
> foo ()
> {
>   v4si a = __extension__ (v4si) {4, 3, 2, 1};
>   v4si b = __extension__ (v4si) {5, 6, 7, 8};
>   v4si c = __builtin_shufflevector (a, b, 1, 4, 2, 7);
>   v4si d = __builtin_shuffle (c, __extension__ (v4si) { 3, 2, 0, 1 });
>   return d;
> }

But that is because we constant fold PERMs at the gimple level. Combining
VEC_PERM_EXPRs on gimple is PR 54346; note I found this while looking at other
issues too :).
--- Comment #15 from Hongtao.liu ---
> I think pass_combine should be extended to force illegitimate constant
> to constant pool and recog load insn again, It looks like a general
> optimization that better not do it in the backend.

The issue can also be solved by folding __builtin_ia32_shufps to gimple
VEC_PERM_EXPR, i.e. the below testcase doesn't have the problem:

```
typedef int v4si __attribute__((vector_size (16)));

v4si
foo ()
{
  v4si a = __extension__ (v4si) {4, 3, 2, 1};
  v4si b = __extension__ (v4si) {5, 6, 7, 8};
  v4si c = __builtin_shufflevector (a, b, 1, 4, 2, 7);
  v4si d = __builtin_shuffle (c, __extension__ (v4si) { 3, 2, 0, 1 });
  return d;
}
```

```
foo():
        movdqa  .LC0(%rip), %xmm0
        ret
.LC0:
        .long   8
        .long   2
        .long   3
        .long   5
```
H.J. Lu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|hjl.tools at gmail dot com  |unassigned at gcc dot gnu.org
          Component|target                      |rtl-optimization
            Version|4.4.1                       |12.0

--- Comment #14 from H.J. Lu ---
From https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577991.html

```
Trying 5 -> 7:
    5: r85:V4SF=[`*.LC0']
      REG_EQUAL const_vector
    7: r84:V4SF=vec_select(vec_concat(r85:V4SF,r85:V4SF),parallel)
      REG_DEAD r85:V4SF
      REG_EQUAL const_vector
Failed to match this instruction:
(set (reg:V4SF 84)
    (const_vector:V4SF [
            (const_double:SF 3.0e+0 [0x0.cp+2])
            (const_double:SF 2.0e+0 [0x0.8p+2])
            (const_double:SF 4.0e+0 [0x0.8p+3])
            (const_double:SF 1.0e+0 [0x0.8p+1])
        ]))

(insn 5 2 7 2 (set (reg:V4SF 85)
        (mem/u/c:V4SF (symbol_ref/u:DI ("*.LC0") [flags 0x2]) [0 S16 A128]))
     "/export/users/liuhongt/install/git_trunk_master_native/lib/gcc/x86_64-pc-linux-gnu/12.0.0/include/xmmintrin.h":746:19 1600 {movv4sf_internal}
     (expr_list:REG_EQUAL (const_vector:V4SF [
                (const_double:SF 4.0e+0 [0x0.8p+3])
                (const_double:SF 3.0e+0 [0x0.cp+2])
                (const_double:SF 2.0e+0 [0x0.8p+2])
                (const_double:SF 1.0e+0 [0x0.8p+1])
            ])
        (nil)))
(insn 7 5 11 2 (set (reg:V4SF 84)
        (vec_select:V4SF (vec_concat:V8SF (reg:V4SF 85)
                (reg:V4SF 85))
            (parallel [
                    (const_int 1 [0x1])
                    (const_int 2 [0x2])
                    (const_int 4 [0x4])
                    (const_int 7 [0x7])
                ])))
     "/export/users/liuhongt/install/git_trunk_master_native/lib/gcc/x86_64-pc-linux-gnu/12.0.0/include/xmmintrin.h":746:19 3015 {sse_shufps_v4sf}
     (expr_list:REG_DEAD (reg:V4SF 85)
        (expr_list:REG_EQUAL (const_vector:V4SF [
                    (const_double:SF 3.0e+0 [0x0.cp+2])
                    (const_double:SF 2.0e+0 [0x0.8p+2])
                    (const_double:SF 4.0e+0 [0x0.8p+3])
                    (const_double:SF 1.0e+0 [0x0.8p+1])
                ])
            (nil)))
```

I think pass_combine should be extended to force the illegitimate constant to
the constant pool and recog the load insn again. It looks like a general
optimization that is better not done in the backend.
--- Comment #10 from Marc Glisse ---
Author: glisse
Date: Mon May 20 14:53:29 2019
New Revision: 271422

URL: https://gcc.gnu.org/viewcvs?rev=271422&root=gcc&view=rev
Log:
[i386] Fold __builtin_ia32_shufpd to VEC_PERM_EXPR

2019-05-20  Marc Glisse

        PR rtl-optimization/43147
        * config/i386/i386.c (ix86_gimple_fold_builtin): Handle
        IX86_BUILTIN_SHUFPD.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
Allan Jensen changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |linux at carewolf dot com

--- Comment #9 from Allan Jensen ---
(In reply to Marc Glisse from comment #6)
> Created attachment 45303 [details]
> example patch (untested)
>
> Making the meaning of shuffles visible in GIMPLE could help a bit (although
> it wouldn't solve the problem completely because IIRC we don't dare combine
> shuffles, since it is hard to find an optimal expansion for a shuffle and we
> might pessimize some cases).

In some other cases there are checks to see whether the combined new tree can
be generated as a single instruction, and the combination is only done in that
case. And as soon as the compiler has SSSE3 available, we can shuffle anything
as a single instruction, so combining shuffles is always safe and fast.
--- Comment #8 from Marc Glisse ---
Created attachment 45306
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45306&action=edit
ix86_gimple_fold_builtin patch

Like this then? I realized (because of the testsuite) that we do not currently
validate that the 3rd argument is a 2-bit immediate (the error message on
non-constants is misleading), and Intel's documentation indeed seems to
indicate that any integer is valid; we only look at the 2 lowest bits.

If possible, I think it would be nice to reduce the number of builtins; that's
why I started with a patch to emmintrin.h (I now know that it should use
(mask&2)?3:2 for the second number). For validation, something like:

  if (!__builtin_constant_p (__mask))
    complain ();

with

  extern void complain (void) __attribute__((error("argument must be a constant")));

may be good enough. But we still need to handle -O0... Yeah, maybe
ix86_gimple_fold_builtin is easier :-(
Jakub Jelinek changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |jakub at gcc dot gnu.org

--- Comment #7 from Jakub Jelinek ---
(In reply to Marc Glisse from comment #6)
> Created attachment 45303 [details]
> example patch (untested)
>
> Making the meaning of shuffles visible in GIMPLE could help a bit (although
> it wouldn't solve the problem completely because IIRC we don't dare combine
> shuffles, since it is hard to find an optimal expansion for a shuffle and we
> might pessimize some cases).
> This patch is one simple way to make _mm_shuffle_pd less opaque.
> _mm_shuffle_ps would be a bit longer but still manageable. It has the
> drawback that it does not diagnose when the mask is not a constant, or not
> between 0 and 3, and I am not sure how to do that from the C code. An
> alternative would be to keep the current builtin but turn it into a
> vec_perm_expr in ix86_gimple_fold_builtin, which could include diagnostics.

I think doing it in ix86_gimple_fold_builtin is better.
--- Comment #6 from Marc Glisse ---
Created attachment 45303
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45303&action=edit
example patch (untested)

Making the meaning of shuffles visible in GIMPLE could help a bit (although it
wouldn't solve the problem completely because IIRC we don't dare combine
shuffles, since it is hard to find an optimal expansion for a shuffle and we
might pessimize some cases).

This patch is one simple way to make _mm_shuffle_pd less opaque.
_mm_shuffle_ps would be a bit longer but still manageable. It has the drawback
that it does not diagnose when the mask is not a constant, or not between 0
and 3, and I am not sure how to do that from the C code. An alternative would
be to keep the current builtin but turn it into a vec_perm_expr in
ix86_gimple_fold_builtin, which could include diagnostics.
Marc Glisse changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |glisse at gcc dot gnu.org

--- Comment #5 from Marc Glisse 2012-05-07 14:52:46 UTC ---
Actually, why isn't constant propagation happening for the example in comment
#1? simplify-rtx.c contains code to that effect; it might just need a little
tweaking... (Yes, I know the original report is probably interested in
non-constant operands.)
Marc Glisse changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                 CC|        |marc.glisse at normalesup dot org

--- Comment #4 from Marc Glisse 2011-10-23 08:40:04 UTC ---
Apart from combining the two shuffles, I would expect the set and the shuffle
to be combined in comment #1. I was going to report the following, but it
already appears in this bug:

```
__m128d f(double d){
    __m128d x=_mm_setr_pd(-d,d);
    return _mm_shuffle_pd(x,x,1);
}
```

```
        movsd   .LC0(%rip), %xmm1
        xorpd   %xmm0, %xmm1
        movapd  %xmm1, %xmm2
        unpcklpd        %xmm0, %xmm2
        movapd  %xmm2, %xmm0
        shufpd  $1, %xmm2, %xmm0
```

Some extra moves, as usual, and a shuffle that could be combined with the
unpack. (Obviously we don't write such code; it only looks that way after
inlining.)
--- Comment #3 from pinskia at gcc dot gnu dot org 2010-02-23 01:42 ---
Confirmed.

pinskia at gcc dot gnu dot org changed:

           What    |Removed          |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED      |NEW
     Ever Confirmed|0                |1
   Last reconfirmed|-00-00 00:00:00  |2010-02-23 01:42:16
--- Comment #2 from pinskia at gcc dot gnu dot org 2010-02-23 01:42 ---
I think that is because nothing simplifies:

```
(vec_select:V4SF
    (vec_concat:V8SF
        (vec_select:V4SF
            (vec_concat:V8SF (reg:V4SF 62) (reg:V4SF 62))
            (parallel [
                    (const_int 1 [0x1])
                    (const_int 2 [0x2])
                    (const_int 4 [0x4])
                    (const_int 7 [0x7])
                ]))
        (vec_select:V4SF
            (vec_concat:V8SF (reg:V4SF 62) (reg:V4SF 62))
            (parallel [
                    (const_int 1 [0x1])
                    (const_int 2 [0x2])
                    (const_int 4 [0x4])
                    (const_int 7 [0x7])
                ])))
    (parallel [
            (const_int 1 [0x1])
            (const_int 3 [0x3])
            (const_int 6 [0x6])
            (const_int 4 [0x4])
        ]))
```

pinskia at gcc dot gnu dot org changed:

           What    |Removed |Added
----------------------------------------------------------------------------
           Keywords|        |missed-optimization
--- Comment #1 from liranuna at gmail dot com 2010-02-23 01:37 ---
It appears I am missing a line in the code I posted:

```
#include <xmmintrin.h>

extern void printv(__m128 m);

int main()
{
    __m128 m = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
    m = _mm_shuffle_ps(m, m, 0xC9); // Those two shuffles together swap pairs
    m = _mm_shuffle_ps(m, m, 0x2D); // And could be optimized to 0x4E
    printv(m);
    return 0;
}
```