[Bug rtl-optimization/43147] SSE shuffle merge

2023-08-21 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43147

Andrew Pinski  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
   Target Milestone|--- |13.0
 Resolution|--- |FIXED

--- Comment #21 from Andrew Pinski  ---
The constant-folding part was fixed in GCC 12, but combining the shuffles was
only fixed in GCC 13.

That is for:
```
__m128 m;
int main()
{
m = _mm_shuffle_ps(m, m, 0xC9); // Those two shuffles together swap pairs
m = _mm_shuffle_ps(m, m, 0x2D); // And could be optimized to 0x4E
printv(m);

return 0;
}
```
GCC 13+ produces:
```
movaps  m(%rip), %xmm0
shufps  $78, %xmm0, %xmm0
movaps  %xmm0, m(%rip)
call    _Z6printvDv4_f
```
instead of what GCC 12 produced:
```
movaps  m(%rip), %xmm0
shufps  $201, %xmm0, %xmm0
shufps  $45, %xmm0, %xmm0
movaps  %xmm0, m(%rip)
```

So closing as fixed in GCC 13.

[Bug rtl-optimization/43147] SSE shuffle merge

2021-08-26 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43147

--- Comment #20 from Hongtao.liu  ---
Fixed in GCC 12; GCC now generates optimal code.

main:
.LFB532:
.cfi_startproc
subq    $8, %rsp
.cfi_def_cfa_offset 16
movaps  .LC0(%rip), %xmm0
call    printv
xorl    %eax, %eax
addq    $8, %rsp
.cfi_def_cfa_offset 8
ret
.cfi_endproc
.LFE532:
.size   main, .-main
.section        .rodata.cst16,"aM",@progbits,16
.align 16
.LC0:
.long   1073741824
.long   1065353216
.long   1082130432
.long   1077936128
.ident  "GCC: (GNU) 12.0.0 20210825 (experimental)"
.section        .note.GNU-stack,"",@progbits

[Bug rtl-optimization/43147] SSE shuffle merge

2021-08-26 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43147

--- Comment #19 from CVS Commits  ---
The master branch has been updated by hongtao Liu :

https://gcc.gnu.org/g:0fa4787bf34b173ce6f198e99b6f6dd8a3f98014

commit r12-3177-g0fa4787bf34b173ce6f198e99b6f6dd8a3f98014
Author: liuhongt 
Date:   Fri Dec 11 19:02:43 2020 +0800

Fold more shuffle builtins to VEC_PERM_EXPR.

A follow-up to
https://gcc.gnu.org/pipermail/gcc-patches/2019-May/521983.html

gcc/
PR target/98167
PR target/43147
* config/i386/i386.c (ix86_gimple_fold_builtin): Fold
IX86_BUILTIN_SHUFPD512, IX86_BUILTIN_SHUFPS512,
IX86_BUILTIN_SHUFPD256, IX86_BUILTIN_SHUFPS,
IX86_BUILTIN_SHUFPS256.
(ix86_masked_all_ones): New function.

gcc/testsuite/
* gcc.target/i386/avx512f-vshufpd-1.c: Adjust testcase.
* gcc.target/i386/avx512f-vshufps-1.c: Adjust testcase.
* gcc.target/i386/pr43147.c: New test.

[Bug rtl-optimization/43147] SSE shuffle merge

2021-08-25 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43147

--- Comment #18 from Hongtao.liu  ---
(In reply to Marc Glisse from comment #17)
> (In reply to Hongtao.liu from comment #15)
> > The issue can also be solved by folding __builtin_ia32_shufps to gimple
> > VEC_PERM_EXPR,
> 
> Didn't you post a patch to do that last year? What happened to it?

I almost forgot about it; let me retest my patch. It's at
https://gcc.gnu.org/pipermail/gcc-patches/2020-December/562029.html

[Bug rtl-optimization/43147] SSE shuffle merge

2021-08-25 Thread glisse at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43147

--- Comment #17 from Marc Glisse  ---
(In reply to Hongtao.liu from comment #15)
> The issue can also be solved by folding __builtin_ia32_shufps to gimple
> VEC_PERM_EXPR,

Didn't you post a patch to do that last year? What happened to it?

[Bug rtl-optimization/43147] SSE shuffle merge

2021-08-25 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43147

--- Comment #16 from Andrew Pinski  ---
(In reply to Hongtao.liu from comment #15)
> > I think pass_combine should be extended to force an illegitimate
> > constant into the constant pool and recog the load insn again. It looks
> > like a general optimization that is better not done in the backend.
> 
> The issue can also be solved by folding __builtin_ia32_shufps to gimple
> VEC_PERM_EXPR; i.e., the testcase below doesn't have the problem
> 
> typedef int v4si __attribute__((vector_size (16)));
> 
> v4si
> foo ()
> {
> v4si a = __extension__ (v4si) {4, 3, 2, 1};
> v4si b = __extension__ (v4si) {5, 6, 7, 8};
> v4si c = __builtin_shufflevector (a, b, 1, 4, 2, 7);
> v4si d = __builtin_shuffle (c, __extension__ (v4si) { 3, 2, 0, 1 });
> return d;
> }

But that is because we constant fold PERMs at the gimple level.
Combining VEC_PERM_EXPRs at the gimple level is PR 54346; note I found this
while looking at other issues too :).

[Bug rtl-optimization/43147] SSE shuffle merge

2021-08-25 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43147

--- Comment #15 from Hongtao.liu  ---

> I think pass_combine should be extended to force an illegitimate constant
> into the constant pool and recog the load insn again. It looks like a
> general optimization that is better not done in the backend.

The issue can also be solved by folding __builtin_ia32_shufps to gimple
VEC_PERM_EXPR; i.e., the testcase below doesn't have the problem

typedef int v4si __attribute__((vector_size (16)));

v4si
foo ()
{
v4si a = __extension__ (v4si) {4, 3, 2, 1};
v4si b = __extension__ (v4si) {5, 6, 7, 8};
v4si c = __builtin_shufflevector (a, b, 1, 4, 2, 7);
v4si d = __builtin_shuffle (c, __extension__ (v4si) { 3, 2, 0, 1 });
return d;
}

foo():
  movdqa .LC0(%rip), %xmm0
  ret
.LC0:
  .long 8
  .long 2
  .long 3
  .long 5

[Bug rtl-optimization/43147] SSE shuffle merge

2021-08-23 Thread hjl.tools at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43147

H.J. Lu  changed:

   What|Removed |Added

   Assignee|hjl.tools at gmail dot com |unassigned at gcc dot gnu.org
  Component|target  |rtl-optimization
Version|4.4.1   |12.0

--- Comment #14 from H.J. Lu  ---
From

https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577991.html

Trying 5 -> 7:
5: r85:V4SF=[`*.LC0']
  REG_EQUAL const_vector
7: r84:V4SF=vec_select(vec_concat(r85:V4SF,r85:V4SF),parallel)
  REG_DEAD r85:V4SF
  REG_EQUAL const_vector
Failed to match this instruction:
(set (reg:V4SF 84)
(const_vector:V4SF [
(const_double:SF 3.0e+0 [0x0.cp+2])
(const_double:SF 2.0e+0 [0x0.8p+2])
(const_double:SF 4.0e+0 [0x0.8p+3])
(const_double:SF 1.0e+0 [0x0.8p+1])
]))

(insn 5 2 7 2 (set (reg:V4SF 85)
(mem/u/c:V4SF (symbol_ref/u:DI ("*.LC0") [flags 0x2]) [0  S16 A128]))
"/export/users/liuhongt/install/git_trunk_master_native/lib/gcc/x86_64-pc-linux-gnu/12.0.0/include/xmmintrin.h":746:19
1600 {movv4sf_internal}
 (expr_list:REG_EQUAL (const_vector:V4SF [
(const_double:SF 4.0e+0 [0x0.8p+3])
(const_double:SF 3.0e+0 [0x0.cp+2])
(const_double:SF 2.0e+0 [0x0.8p+2])
(const_double:SF 1.0e+0 [0x0.8p+1])
])
(nil)))
(insn 7 5 11 2 (set (reg:V4SF 84)
(vec_select:V4SF (vec_concat:V8SF (reg:V4SF 85)
(reg:V4SF 85))
(parallel [
(const_int 1 [0x1])
(const_int 2 [0x2])
(const_int 4 [0x4])
(const_int 7 [0x7])
])))
"/export/users/liuhongt/install/git_trunk_master_native/lib/gcc/x86_64-pc-linux-gnu/12.0.0/include/xmmintrin.h":746:19
3015 {sse_shufps_v4sf}
 (expr_list:REG_DEAD (reg:V4SF 85)
(expr_list:REG_EQUAL (const_vector:V4SF [
(const_double:SF 3.0e+0 [0x0.cp+2])
(const_double:SF 2.0e+0 [0x0.8p+2])
(const_double:SF 4.0e+0 [0x0.8p+3])
(const_double:SF 1.0e+0 [0x0.8p+1])
])
(nil))))

I think pass_combine should be extended to force an illegitimate constant
into the constant pool and recog the load insn again. It looks like a
general optimization that is better not done in the backend.

[Bug rtl-optimization/43147] SSE shuffle merge

2019-05-20 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43147

--- Comment #10 from Marc Glisse  ---
Author: glisse
Date: Mon May 20 14:53:29 2019
New Revision: 271422

URL: https://gcc.gnu.org/viewcvs?rev=271422&root=gcc&view=rev
Log:
[i386] Fold __builtin_ia32_shufpd to VEC_PERM_EXPR

2019-05-20  Marc Glisse  

PR rtl-optimization/43147
* config/i386/i386.c (ix86_gimple_fold_builtin): Handle
IX86_BUILTIN_SHUFPD.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/i386/i386.c

[Bug rtl-optimization/43147] SSE shuffle merge

2019-05-19 Thread linux at carewolf dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43147

Allan Jensen  changed:

   What|Removed |Added

 CC||linux at carewolf dot com

--- Comment #9 from Allan Jensen  ---
(In reply to Marc Glisse from comment #6)
> Created attachment 45303 [details]
> example patch (untested)
> 
> Making the meaning of shuffles visible in GIMPLE could help a bit (although
> it wouldn't solve the problem completely because IIRC we don't dare combine
> shuffles, since it is hard to find an optimal expansion for a shuffle and we
> might pessimize some cases).

With some other cases there are checks to see whether the combined new tree
can be generated as a single instruction, and it is only combined in that
case. And as soon as the compiler has SSSE3 available, we can do any shuffle
as a single instruction, so combining them is always safe and fast.

[Bug rtl-optimization/43147] SSE shuffle merge

2018-12-29 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43147

--- Comment #8 from Marc Glisse  ---
Created attachment 45306
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45306&action=edit
ix86_gimple_fold_builtin patch

Like this then?

I realized (because of the testsuite) that we do not currently validate that
the 3rd argument is a 2-bit immediate (the error message on non-constants is
misleading), and Intel's documentation indeed seems to indicate that any
integer is valid; we only look at the two lowest bits.

If possible, I think it would be nice to reduce the number of builtins; that's
why I started with a patch to emmintrin.h (I now know that it should use
(mask&2)?3:2 for the second number). For validation, something like:

  if(!__builtin_constant_p(__mask)) complain();

with

extern void complain(void) __attribute__((error("argument must be a constant")));

may be good enough. But we still need to handle -O0... Yeah, maybe
ix86_gimple_fold_builtin is easier :-(

[Bug rtl-optimization/43147] SSE shuffle merge

2018-12-28 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43147

Jakub Jelinek  changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org

--- Comment #7 from Jakub Jelinek  ---
(In reply to Marc Glisse from comment #6)
> Created attachment 45303 [details]
> example patch (untested)
> 
> Making the meaning of shuffles visible in GIMPLE could help a bit (although
> it wouldn't solve the problem completely because IIRC we don't dare combine
> shuffles, since it is hard to find an optimal expansion for a shuffle and we
> might pessimize some cases).
> This patch is one simple way to make _mm_shuffle_pd less opaque.
> _mm_shuffle_ps would be a bit longer but still manageable. It has the
> drawback that it does not diagnose when the mask is not a constant, or not
> between 0 and 3, and I am not sure how to do that from the C code. An
> alternative would be to keep the current builtin but turn it into a
> vec_perm_expr in ix86_gimple_fold_builtin, which could include diagnostics.

I think doing it in ix86_gimple_fold_builtin is better.

[Bug rtl-optimization/43147] SSE shuffle merge

2018-12-28 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43147

--- Comment #6 from Marc Glisse  ---
Created attachment 45303
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45303&action=edit
example patch (untested)

Making the meaning of shuffles visible in GIMPLE could help a bit (although it
wouldn't solve the problem completely because IIRC we don't dare combine
shuffles, since it is hard to find an optimal expansion for a shuffle and we
might pessimize some cases).
This patch is one simple way to make _mm_shuffle_pd less opaque. _mm_shuffle_ps
would be a bit longer but still manageable. It has the drawback that it does
not diagnose when the mask is not a constant, or not between 0 and 3, and I am
not sure how to do that from the C code. An alternative would be to keep the
current builtin but turn it into a vec_perm_expr in ix86_gimple_fold_builtin,
which could include diagnostics.

[Bug rtl-optimization/43147] SSE shuffle merge

2012-05-07 Thread glisse at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43147

Marc Glisse  changed:

   What|Removed |Added

 CC||glisse at gcc dot gnu.org

--- Comment #5 from Marc Glisse  2012-05-07 14:52:46 UTC ---
Actually, why isn't constant propagation happening for the example in comment
#1? simplify-rtx.c contains code to that effect; it might just need a little
tweaking...

(yes, I know the original report is probably interested in non-constant
operands)


[Bug rtl-optimization/43147] SSE shuffle merge

2011-10-23 Thread marc.glisse at normalesup dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43147

Marc Glisse  changed:

   What|Removed |Added

 CC||marc.glisse at normalesup
   ||dot org

--- Comment #4 from Marc Glisse  2011-10-23 08:40:04 UTC ---
Apart from combining 2 shuffles, I would expect the set and the shuffle to be
combined in Comment 1. I was going to report the following, but it already
appears in this bug:
__m128d f(double d){
__m128d x=_mm_setr_pd(-d,d);
return _mm_shuffle_pd(x,x,1);
}

movsd   .LC0(%rip), %xmm1
xorpd   %xmm0, %xmm1
movapd  %xmm1, %xmm2
unpcklpd        %xmm0, %xmm2
movapd  %xmm2, %xmm0
shufpd  $1, %xmm2, %xmm0

some extra moves, as usual, and a shuffle that could be combined with the
unpack.
(Obviously we don't write such code; it only looks that way after inlining.)


[Bug rtl-optimization/43147] SSE shuffle merge

2010-02-22 Thread pinskia at gcc dot gnu dot org


--- Comment #3 from pinskia at gcc dot gnu dot org  2010-02-23 01:42 ---
Confirmed.


-- 

pinskia at gcc dot gnu dot org changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
 Ever Confirmed|0   |1
   Last reconfirmed|-00-00 00:00:00 |2010-02-23 01:42:16
   date||


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43147



[Bug rtl-optimization/43147] SSE shuffle merge

2010-02-22 Thread pinskia at gcc dot gnu dot org


--- Comment #2 from pinskia at gcc dot gnu dot org  2010-02-23 01:42 ---
I think that is because nothing simplifies:
(vec_select:V4SF (vec_concat:V8SF (vec_select:V4SF (vec_concat:V8SF
(reg:V4SF 62)
(reg:V4SF 62))
(parallel [
(const_int 1 [0x1])
(const_int 2 [0x2])
(const_int 4 [0x4])
(const_int 7 [0x7])
]))
(vec_select:V4SF (vec_concat:V8SF (reg:V4SF 62)
(reg:V4SF 62))
(parallel [
(const_int 1 [0x1])
(const_int 2 [0x2])
(const_int 4 [0x4])
(const_int 7 [0x7])
])))
(parallel [
(const_int 1 [0x1])
(const_int 3 [0x3])
(const_int 6 [0x6])
(const_int 4 [0x4])
]))


-- 

pinskia at gcc dot gnu dot org changed:

   What|Removed |Added

   Keywords||missed-optimization


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43147



[Bug rtl-optimization/43147] SSE shuffle merge

2010-02-22 Thread liranuna at gmail dot com


--- Comment #1 from liranuna at gmail dot com  2010-02-23 01:37 ---
It appears I am missing a line in the code I posted:

#include <xmmintrin.h>

extern void printv(__m128 m);

int main()
{
__m128 m = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
m = _mm_shuffle_ps(m, m, 0xC9); // Those two shuffles together swap pairs
m = _mm_shuffle_ps(m, m, 0x2D); // And could be optimized to 0x4E
printv(m);

return 0;
}


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43147