[Bug target/85048] [missed optimization] vector conversions

2018-03-23 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85048

--- Comment #1 from Matthias Kretz  ---
Godbolt link:
https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAKxAEZSAbAQwDtRkBSAJgCFufSAZ1QBXYskwgA5NwDMeFsgYisAag6yAwskEF8LAhuwcADAEFTZgpgC2AB2bX1WpU0GDVAFVKqFBVQByPn6qAMp4AF6YzgAigaoAVKqCkZioAGYQngCURpYiKWyqAGrqAKx8FcAsIiAgAG6YyETEAPopURDhUbllMRx9sV4afOaW1vaO0RrazO6qAGLEqDY%2BrvOeqEYlM5s%2BXNvI9QRcEMUzSyv72wAe2eoA7KNmqq%2BqxJgEYiw7WnuqB1kxiemwgNwGPBMAxi2R8oPBFVo0NyD36smeHFR41sDiYThm6w8l1WqkJXi2QN%2Bmn%2BABZDscaWcLssSXTKXdHs83u9Pt8qbS8iDUGCIVC%2BrDySLEci4cKETwDuLZVKeLJkZi0RiseYJrj8S45kSWWtDeTtuc/qgfAAOekEa1MrTEm23e6YrlvD5fYg/C3Uq2qW1A92Wbnc%2BGimWS%2BVIpXRiGKmHK%2BVqpWhsOvCMVOlxrM8MpRvMANkLcohD3V5m5GpGlhrOpxU2cszcRqupNNm3NuwDtBLlKOBD7js0ztUfddnPTry9fL9/wnwae07DebFSfj0tzZYqiYledTSZX4Z3PBzG7zBe3KpL1/lFbTVYzm54trvEIAnKWVbR1/vT7QsZHk%2BGZ5rQe7JhCtCHv%2BP7nrBMZXjCK41uidbalYjZ4tMBqtosxodnhXaUvOAayIC2Cksc5EjmO5GTiGIE8t6vo9j49FLh6oGnn%2BkFbhep4QS%2BMGkMebx5vBfH5t%2B8q3gJKoPsBLzPnmb7yfKX7voivEvoBMpiZmAFCWBIm6ZJulIbCBm6XJCFQYpdmImpjk8LQmlKc%2BL5cDpeZcEBLlcMZgkidZvnmb5llSVwtlRQ5olMSeKpcM5UXuS5sg%2Baesj%2BShWJoWMGG6k2BKmmOZLEZRpE%2BEWbKUYONW0QRNUMcuTGzj6/IBs1nGhaeqiZSq44yRCALDRUqghQl4l9eFfWRS%2BqgxQtcW9YNKULWlUm/mNrn%2BfFyncT%2BQU/qZYGzT%2BkWrTGS1gXFunrWBm1XQmA3yn5O2BR9k0HaugnnW982%2BTdgkrVNhlJQ9gmbcJr0Qtl%2Blg8Jx0pqdWX/XDl2IwewMqrId0HpDuNPVjp40rD2Z7S%2BNLIxCNLfZ5Eno9mgOkzj8o0qDP2JezhPs9Dl7k/me3PRUZQ06LqMqmUTP5pjXPTVLbMQmU%2BOnmUvPK8T8vg7JgtFpTxbizwRb028qFav0YxjFwshYOkCjRK0EAsEwNiYD46QEUQPgdJgbplJolhMCIRCqC7bsjp77a%2By1PDMXyg5yHIMdaFHJJEEYYK5OijyW2YAD0%2Be%2BAYmDAMQTAMKoAC02x%2BKX5cMCMqgEMQIiKHieCoCwlhO/Udg2Kg9QAI7oKoPgiH4NWtAQ49%2BOR08%2BACuTmL3/eDyPY%2BqBPBhTzPW9z1wC%2BvDSy9mKvA/D6Ps87zSR/bwQ8974Gp/n%2BvV%2BvJPt97/fj%2BL1wL8QD7hfDei9P53wPkfVQJ8e6ALXpfTexcCC72vg/Q%2BT9rQAKAW/VoKDkEfwML/V4/8YFYMvjg/eN9IEQKftAlesDgHoHIffPBiDCHPxIXAke5DEEsOoX/TBnDGGgMoU/Phx8BEMO4WA0RBC0GLwwXWOhpCh4AHdN7MK/igvskDiFKM4Wo3BmiKFDiLJA2hZ96HrwMcYvB99tHoIkVYhB0itGmKfroixyjrE8KMXYtxi9zGv2Ht4lxxj7HyMccEphoTEHhKIZE1R0SRHCJMWYhJKiklIKMbE/xrwFF6OARklJvCDBxKXhwwpUjkn4NSTQ9JVSslUNKbk9hVtPGcIAEbqJiffa0OiEldMM
ZA3paSKnr0GTY7JIyHFjOHhMnxwy/B9PcQM5x1TjHLICas4pUylmQPye04BHTMksMQZs%2BJsyh7HKGU/M5oyCnjJOdku5MyHlzIaacvZKzLnXJqZ8gw5yoEDI%2Bc8r5ETFGHMHugaxP85FhJaYEyx9RoXdNkeA5p%2BzIkopQYQvxR8%2BxYpCWi7%2BfgymItIdimpuLSUtIOUEylrC4V4r3gS2Z0LMlsJpfcyFyKinGM5Ri15PL2U4rhTk/FRZCUNIFbUrZbK%2BWMqabKvJUqUkyriaytp9KJmwsWQC7l2rUWoL1faTFbKdViI2RKrF8zLXTLlW89AtriWL3tSq81azjVPzdZq4VvzdW3LBeI81HKxUvPBY6/1lrw3jklSGtVYag1AvjVSxN%2BqhXaulWm01LK41aqRSoi1grXVJrpQWotyqrW5siYWo1ZS3XkRrc6ytbrWk8trSk%2BtQbfVBI7TUrt6b2IeN7VG4tbxS1NsyWUmNPby1TpaTGxtszC0NOnRO5dvzxWBsHbGydnaF1BqXW0wuiD64V2rrXEuZcK5NyiMsVQmAbjWBYCkLusyIg3A6TCpNXKM1Io/V%2Bo1gLf0%2BFnaQgD36d1bqHZEiDmTznMojTyuDKCEMgd3e%2Bz9fK3WIYmh4zDHT35VoTQa/9n6iOtstWW8D5GEGUZdRht5H6VHv1w4qupmGWN1paVR2DNwuOuKVYQsDcCANCNeDhy15LRPke4ZJhjbaglibk0GgNoG83If4%2BJ3Dang1Ma0/O9FXqkNKYM4JklDHfUEaHnRn91Th1kY6TZktUHQnSYvgB5zEm7ONL3tRmTKivNsdCQ5mjgXuNKrwe5wezGgvoYWX5vj6A4sKZC0llLxm/lGOi/UD9yWjUyrwf5jzn6h4qZ3Roo%2BoWZNOfKzmm5Pgcuebq2h6pxWYv8bK%2BZhr5T9OBcMxZ3zjW%2BP9e65Mo%2B7Xcs3GS6GozeDqslZm6Kub2X0uzcG0V0%2BlgT112vZXGulJdsNybikaoD6n2YBfZ3bubzBBYc3j5spk27uAeCK5sdImL4vesY93JR6eXffIap%2BLz2sNA4q%2Bhz7g9Advfq/Co%2B/3ZkvaIz5thOXkeeuAwp0HhHMdCbhVD%2Bod2BPQdTaR0hxP37xd40jrTD2x3sfU5E5HDTWuZeTbd2TKS2dsJx%2BJmNunCeU/B8q3THOAdmdJ2L0HLGRdxMFxp2nTm8cyKG71iXyvueRdW0rrzMaEs2mZ51%2Bnla0u0/C7DspbmjcW9JwbxTSK7v5dh4VoxC3ofTa85a63tPneM6yxNo3tWtcbbd0HrrNS2dRfD6zlbgfzcR/Q5Vve7uiedbl7k5Pw2E8Z7j4l33ieGNZ/V0Ep3hfMvF/R57kX1K2tbfMDtq9DcL2UnQKIDpDBMAkMHMlrgdgUcMbbyIDv7sS%2BAJ70PPvA/MtD5H9npRE%2Bp%2BesIbPzvhvu/HBUJP/vy296r9H6FwcW%2Bp%2B758Pv%2Bfnij%2B953/yuF5%2BHf1EHEPbf0%2B8H38P8cZ/S%2BdlH3v%2BSp/L%2Bnqb%2B7ea%2BD%2BR%2BX%2BN%2Bxe7%2BmC4BL%2BPWf%2BMBm%2BEB6A8BIBo%2BBy22RcR256B2lE6QDAqAeIG%2Beg2%2BHgJGe8%2BBhBHGC%2Bxw1%2BZBZOFBBBeIJm4%2BtBpBy%2BcKlBzBjGl%2Bm%2BdBp%2BqgXB1BvBXw/Bt%2BR8QhLBj%2BfBpBAhkhPBvcABfe9B9u8h4uihn%2B7BP%2BjBVBUhShdgKhoS8hYGsByhPWah/%2BSBsh42Oh3BYBVhZhNhHsTB1aEKJ6QhzYuBqg%2B%2BMCVEBABA/ejh9%2BahH%2B/hgRBhZ%2BaBzhuhemIhAR6AQRURghLhLBfhARggJ%2ByRMR9%2BY%2B0hYRGRN%2B8hORlh%2BRmRRRSR
bamBWReILelE2BjcdCaRBhXAyW0R3BUmiBpRrR1Re8NONBXRQ8bRvRlmGmTsTRGRz%2BQxsOLCJR6RXAkxPR0x2S1Gg4cxCx8hoSVmjRqxzRW%2BUx4hwhGhpRexixBxehxwcxJx8hCuAC4xXAIg6xKRThsRRxlxjxMRxe9hxx7x3BxeWxBcRc%2B%2BtRp6e2vhOxCR3R9%2BYioR8RLRgx3hSRYisxgRkJiJDGKxFxgRCxUJ9mnRsJ2JaJauyJCRBJw%2BoBoSGB/RsJJx9%2BumMJgRNJSRumxJ9xqJZJo%2BumGJYRCRDx8JtJZuVJDJpJc%2BzxLJvJkR7JPWlJAJYcqApICArAwA0QCgzcAAnnYF3jnCwKgHYDAtqXYIvPrjOqMc7DqS5nDsRgoaaQaeOlBoug5vqeaYCg2g6WabaRaTGjVAAo6d5lBm6l6XqW6XbiDt6W6TpiGYGTacGR9iaT6eGTGaGVGdTmOv9k7HGUnuhqmdaeQTmS8dmWcQWeofmf7iWWWumUXn0RYj6SWWIrOuWRXrWbGUGTEjEq6TacXp8W2dod2XmfWbYjEoitWS2esmWWGaEp8aOVGQiZKW8DkV2bORUdOSKUWdWUuaAa8DkZOYaacW8BYYmduWoa8GoVudyIeacQSlILCAwNIGUFIKQCwNICYHeagNIIHLwPwMkKIOINMDbLQHeQQI%2BZebCAANYgA0hFgAB0H4sgZQ0FJg1oDwDwNUZQJsjA0gNId5NgdAJgJg95gFpAL5Ugd5ggIAuFAFUgT5sIcAsASAaA9geAneZAFAEAdFdgDFmAxAIAwAggLsBhCAqAM89sDA1gxAJFEAHS%2BFHSCgTAxAqp0gf5pAdFbsBgAA8iwAwHJRRXeVgDY

[Bug target/85048] [missed optimization] vector conversions

2018-03-23 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85048

Richard Biener  changed:

   What|Removed |Added

   Keywords||missed-optimization
 Target||x86_64-*-*, i?86-*-*
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2018-03-23
 Ever confirmed|0   |1
   Severity|normal  |enhancement

--- Comment #2 from Richard Biener  ---
If there's a good specification of __builtin_convertvector it certainly makes
sense to support that in a compatible way for the generic vector extension.
It would need to be handled by tree-vect-generic.c lowering it to
VEC_PACK_* / VEC_UNPACK_* / VIEW_CONVERT (for noop) sequences.

I suppose this bug is about suboptimal code being generated currently.

If so, please open an enhancement request for __builtin_convertvector.
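
As a rough illustration of the two spellings under discussion, here is a minimal sketch using the GNU vector extension. It assumes a compiler that provides __builtin_convertvector; the type names v4si/v4sf and both function names are made up for this example and do not come from the bug report.

```cpp
#include <cassert>
#include <cstdint>

// 4 x int32_t and 4 x float, 16 bytes each (GNU vector extension).
typedef std::int32_t v4si __attribute__((vector_size(16)));
typedef float v4sf __attribute__((vector_size(16)));

// Element-wise spelling: the compiler has to recognize that the four
// scalar conversions together form one vector conversion.
inline v4sf cvt_elementwise(v4si x) {
    return v4sf{float(x[0]), float(x[1]), float(x[2]), float(x[3])};
}

// Builtin spelling: states the whole-vector conversion directly.
inline v4sf cvt_builtin(v4si x) {
    return __builtin_convertvector(x, v4sf);
}
```

Both functions should compute the same values; the interesting question in this bug is only whether they also compile to the same instructions.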

[Bug target/85048] [missed optimization] vector conversions

2018-03-23 Thread kretz at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85048

--- Comment #3 from Matthias Kretz  ---
Just opened PR85052 for tracking __builtin_convertvector support.

[Bug target/85048] [missed optimization] vector conversions

2018-03-30 Thread glisse at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85048

--- Comment #4 from Marc Glisse  ---
See PR77399.

[Bug target/85048] [missed optimization] vector conversions

2019-01-05 Thread husseydevin at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85048

Devin Hussey  changed:

   What|Removed |Added

 CC||husseydevin at gmail dot com

--- Comment #5 from Devin Hussey  ---
ARM/AArch64 NEON use these:

From         -> To           Intrinsic       ARMv7-a          AArch64
intXxY_t     -> int2XxY_t    vmovl_sX        vmovl.sX         sshll #0?
uintXxY_t    -> uint2XxY_t   vmovl_uX        vmovl.uX         ushll #0?
[u]int2XxY_t -> [u]intXxY_t  vmovn_[us]X     vmovn.iX         xtn
floatXxY_t   -> intXxY_t     vcvt[q]_sX_fX   vcvt.sX.fX       fcvtzs
floatXxY_t   -> uintXxY_t    vcvt[q]_uX_fX   vcvt.uX.fX       fcvtzu
intXxY_t     -> floatXxY_t   vcvt[q]_fX_sX   vcvt.fX.sX       scvtf
uintXxY_t    -> floatXxY_t   vcvt[q]_fX_uX   vcvt.fX.uX       ucvtf
float32x2_t  -> float64x2_t  vcvt_f64_f32    2x vcvt.f64.f32  fcvtl
float64x2_t  -> float32x2_t  vcvt_f32_f64    2x vcvt.f32.f64  fcvtn

Clang optimizes vmovl to vshll by zero for some reason. 

float32x2_t <-> float64x2_t requires 2 VFP instructions on ARMv7-a.
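
For reference, the widening and narrowing integer rows of the table can be written portably with the generic vector extension. This is only a sketch: the type and function names are hypothetical, and whether a given compiler actually selects vmovl/sshll for the widening or vmovn/xtn for the narrowing is not verified here.

```cpp
#include <cassert>
#include <cstdint>

// 4 x int16_t (64 bits) and 4 x int32_t (128 bits).
typedef std::int16_t v4hi __attribute__((vector_size(8)));
typedef std::int32_t v4si __attribute__((vector_size(16)));

// intXxY_t -> int2XxY_t: sign-extending widening conversion.
inline v4si widen(v4hi x) {
    return __builtin_convertvector(x, v4si);
}

// int2XxY_t -> intXxY_t: narrowing conversion, keeps the low 16 bits
// of each element (same result as a per-element C-style cast).
inline v4hi narrow(v4si x) {
    return __builtin_convertvector(x, v4hi);
}
```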

[Bug target/85048] [missed optimization] vector conversions

2023-03-21 Thread mkretz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85048

--- Comment #6 from Matthias Kretz (Vir)  ---
Most of the conversions are optimized perfectly now. Only the following
conversions are still missing for AVX-512:
https://godbolt.org/z/9afWbYod6

#include <cstdint>

template <class T, int N, int Size = N * sizeof(T)>
using V [[gnu::vector_size(Size)]] = T;

template <class From, class To> V<To, 4> cvt4(V<From, 4> x) {
    return V<To, 4>{To(x[0]), To(x[1]), To(x[2]), To(x[3])};
}
template <class From, class To> V<To, 8> cvt8(V<From, 8> x) {
    return V<To, 8>{
        To(x[0]), To(x[1]), To(x[2]), To(x[3]),
        To(x[4]), To(x[5]), To(x[6]), To(x[7])
    };
}
template <class From, class To> V<To, 16> cvt16(V<From, 16> x) {
    return V<To, 16>{
        To(x[0]), To(x[1]), To(x[2]), To(x[3]),
        To(x[4]), To(x[5]), To(x[6]), To(x[7]),
        To(x[8]), To(x[9]), To(x[10]), To(x[11]),
        To(x[12]), To(x[13]), To(x[14]), To(x[15])
    };
}

#define _(name, from, to, size) \
    auto name(V<from, size> x) { return cvt##size<from, to>(x); }
// integral -> double
_(vcvtudq2pd, uint32_t, double, 4)
_(vcvtudq2pd, uint32_t, double, 8)

// integral -> float
_(vcvtqq2ps ,  int64_t, float, 16)
_(vcvtuqq2ps, uint64_t, float, 16)

// float -> integral
_(vcvttps2qq, float, int64_t, 16)

_( cvttps2udq, float, uint32_t,  4)
_(vcvttps2udq, float, uint32_t,  8)
_(vcvttps2uqq, float, uint64_t, 16)

// double -> integral
_(vcvttpd2udq, double, uint32_t, 4)

[Bug target/85048] [missed optimization] vector conversions

2023-03-21 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85048

Hongtao.liu  changed:

   What|Removed |Added

 CC||crazylht at gmail dot com

--- Comment #7 from Hongtao.liu  ---
Yes, looks like the pattern name is misdefined.
It should be fixuns_trunc, but we have ufix_trunc.

[Bug target/85048] [missed optimization] vector conversions

2023-03-21 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85048

--- Comment #8 from Hongtao.liu  ---
(In reply to Hongtao.liu from comment #7)
> Yes, looks like the pattern name is misdefined.
> It should be fixuns_trunc, but we have ufix_trunc.

No, we have the right name, but we generate extra instructions for the
unsigned case.

(define_expand "fixuns_trunc<mode><sseintvecmodelower>2"
  [(match_operand:<sseintvecmode> 0 "register_operand")
   (match_operand:VF1 1 "register_operand")]
  "TARGET_SSE2"
{
  if (<MODE>mode == V16SFmode)
    emit_insn (gen_ufix_truncv16sfv16si2 (operands[0],
                                          operands[1]));
  else
    {
      rtx tmp[3];
      tmp[0] = ix86_expand_adjust_ufix_to_sfix_si (operands[1], &tmp[2]);
      tmp[1] = gen_reg_rtx (<sseintvecmode>mode);
      emit_insn (gen_fix_trunc<mode><sseintvecmodelower>2 (tmp[1], tmp[0]));
      emit_insn (gen_xor<sseintvecmodelower>3 (operands[0], tmp[1], tmp[2]));
    }
  DONE;
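
The extra instructions in the else branch come from emulating an unsigned conversion with a signed one. A scalar sketch of that idea follows. It assumes the usual approach of subtracting 2^31 from large inputs and XORing the sign bit back in afterwards; the function name is hypothetical and this is not GCC's code.

```cpp
#include <cassert>
#include <cstdint>

// Convert a float in [0, 2^32) to uint32_t using only a signed
// float -> int32_t conversion.
inline std::uint32_t float_to_u32_via_signed(float x) {
    const float two31 = 2147483648.0f;  // 2^31, exactly representable
    const bool large = x >= two31;      // does not fit in int32_t as is
    // Bring large inputs into the signed range; the subtraction is exact.
    const float adjusted = large ? x - two31 : x;
    // Value to XOR back in: restores the 2^31 that was subtracted.
    const std::uint32_t fixup = large ? 0x80000000u : 0u;
    const std::int32_t s = static_cast<std::int32_t>(adjusted);
    return static_cast<std::uint32_t>(s) ^ fixup;
}
```

With a native unsigned conversion instruction such as vcvttps2udq, the compare, subtract and XOR steps are unnecessary.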

[Bug target/85048] [missed optimization] vector conversions

2023-03-21 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85048

--- Comment #9 from Hongtao.liu  ---
With the patch, we can generate optimized code except for those 16 {u,}qq
cases, since the ABI doesn't support 1024-bit vectors.

1 file changed, 16 insertions(+), 2 deletions(-)
gcc/config/i386/sse.md | 18 --

modified   gcc/config/i386/sse.md
@@ -8014,8 +8014,9 @@ (define_expand "fixuns_trunc2"
(match_operand:VF1 1 "register_operand")]
   "TARGET_SSE2"
 {
-  if (mode == V16SFmode)
-emit_insn (gen_ufix_truncv16sfv16si2 (operands[0],
+  /* AVX512 support vcvttps2udq for all 128/256/512-bit vectors.  */
+  if (mode == V16SFmode || TARGET_AVX512VL)
+emit_insn (gen_ufix_trunc2 (operands[0],
  operands[1]));
   else
 {
@@ -8413,6 +8414,12 @@ (define_insn "*floatv2div2sf2_mask_1"
(set_attr "prefix" "evex")
(set_attr "mode" "V4SF")])

+(define_expand "floatuns2"
+  [(set (match_operand:VF2_512_256VL 0 "register_operand")
+   (unsigned_float:VF2_512_256VL
+ (match_operand: 1 "nonimmediate_operand")))]
+   "TARGET_AVX512F")
+
 (define_insn "ufloat2"
   [(set (match_operand:VF2_512_256VL 0 "register_operand" "=v")
(unsigned_float:VF2_512_256VL
@@ -8694,6 +8701,13 @@ (define_insn "fix_truncv4dfv4si2"
(set_attr "prefix" "maybe_evex")
(set_attr "mode" "OI")])

+
+/* The standard pattern name is fixuns_truncmn2.  */
+(define_expand "fixuns_truncv4dfv4si2"
+  [(set (match_operand:V4SI 0 "register_operand")
+   (unsigned_fix:V4SI (match_operand:V4DF 1 "nonimmediate_operand")))]
+  "TARGET_AVX512VL && TARGET_AVX512F")
+
 (define_insn "ufix_truncv4dfv4si2"
   [(set (match_operand:V4SI 0 "register_operand" "=v")
(unsigned_fix:V4SI (match_operand:V4DF 1 "nonimmediate_operand"
"vm")))]

[Bug target/85048] [missed optimization] vector conversions

2023-03-30 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85048

--- Comment #10 from CVS Commits  ---
The master branch has been updated by hongtao Liu :

https://gcc.gnu.org/g:fe42e7fe119159f7443dbe68189e52891dc0148e

commit r13-6951-gfe42e7fe119159f7443dbe68189e52891dc0148e
Author: liuhongt 
Date:   Thu Mar 30 15:43:25 2023 +0800

Rename ufix_trunc/ufloat* patterns to fixuns_trunc/floatuns* to align with
standard pattern name.

There's a typo in the standard pattern names for unsigned_{float,fix}:
they should be floatunsmn2/fixuns_truncmn2, not ufloatmn2/ufix_truncmn2
as in current trunk. The patch fixes the typo and also renames all those
ufix_trunc/ufloat patterns.

Also vcvttps2udq is available under AVX512VL, so it can be generated
directly instead of being emulated via vcvttps2dq.

gcc/ChangeLog:

PR target/85048
* config/i386/i386-builtin.def (BDESC): Adjust icode name from
ufloat/ufix to floatuns/fixuns.
* config/i386/i386-expand.cc
(ix86_expand_vector_convert_uns_vsivsf): Adjust comments.
* config/i386/sse.md
(ufloat2):
Renamed to ..
   
(floatuns2):..
this.
   
(_ufix_notrunc):
Renamed to ..
   
(_fixuns_notrunc):
.. this.
(fix_truncv16sfv16si2):
Renamed to ..
   
(fix_truncv16sfv16si2):.. this.
(ufloat2): Renamed to ..
(floatuns2): .. this.
(ufloatv2siv2df2): Renamed to ..
(floatunsv2siv2df2): .. this.
(ufix_notrunc2):
Renamed to ..
(fixuns_notrunc2):
.. this.
(ufix_notruncv2dfv2si2): Renamed to ..
(fixuns_notruncv2dfv2si2):.. this.
(ufix_notruncv2dfv2si2_mask): Renamed to ..
(fixuns_notruncv2dfv2si2_mask): .. this.
(*ufix_notruncv2dfv2si2_mask_1): Renamed to ..
(*fixuns_notruncv2dfv2si2_mask_1): .. this.
(ufix_truncv2dfv2si2): Renamed to ..
(*fixuns_truncv2dfv2si2): .. this.
(ufix_truncv2dfv2si2_mask): Renamed to ..
(fixuns_truncv2dfv2si2_mask): .. this.
(*ufix_truncv2dfv2si2_mask_1): Renamed to ..
(*fixuns_truncv2dfv2si2_mask_1): .. this.
(ufix_truncv4dfv4si2): Renamed to ..
(fixuns_truncv4dfv4si2): .. this.
(ufix_notrunc2):
Renamed to ..
(fixuns_notrunc2):
.. this.
(ufix_trunc2): Renamed to ..
(fixuns_trunc2):
.. this.

gcc/testsuite/ChangeLog:

* g++.target/i386/pr85048.C: New test.

[Bug target/85048] [missed optimization] vector conversions

2023-03-30 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85048

--- Comment #11 from Hongtao.liu  ---
Fixed in GCC 13.

[Bug target/85048] [missed optimization] vector conversions

2023-03-31 Thread ubizjak at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85048

--- Comment #12 from Uroš Bizjak  ---
(In reply to Hongtao.liu from comment #9)
> With the patch, we can generate optimized code except for those 16 {u,}qq
> cases, since the ABI doesn't support 1024-bit vectors.

Can't these be vectorized using partial vectors? GCC generates:

_Z9vcvtqq2psDv16_l:
vmovq   56(%rsp), %xmm0
vmovq   40(%rsp), %xmm1
vmovq   88(%rsp), %xmm2
vmovq   120(%rsp), %xmm3
vpinsrq $1, 64(%rsp), %xmm0, %xmm0
vpinsrq $1, 48(%rsp), %xmm1, %xmm1
vpinsrq $1, 96(%rsp), %xmm2, %xmm2
vpinsrq $1, 128(%rsp), %xmm3, %xmm3
vinserti128 $0x1, %xmm0, %ymm1, %ymm1
vcvtqq2psy  8(%rsp), %xmm0
vcvtqq2psy  %ymm1, %xmm1
vinsertf128 $0x1, %xmm1, %ymm0, %ymm0
vmovq   72(%rsp), %xmm1
vpinsrq $1, 80(%rsp), %xmm1, %xmm1
vinserti128 $0x1, %xmm2, %ymm1, %ymm1
vmovq   104(%rsp), %xmm2
vcvtqq2psy  %ymm1, %xmm1
vpinsrq $1, 112(%rsp), %xmm2, %xmm2
vinserti128 $0x1, %xmm3, %ymm2, %ymm2
vcvtqq2psy  %ymm2, %xmm2
vinsertf128 $0x1, %xmm2, %ymm1, %ymm1
vinsertf64x4 $0x1, %ymm1, %zmm0, %zmm0

where clang manages to vectorize the function to:

  vcvtqq2ps 16(%rbp), %ymm0
  vcvtqq2ps 80(%rbp), %ymm1
  vinsertf64x4 $1, %ymm1, %zmm0, %zmm0

[Bug target/85048] [missed optimization] vector conversions

2024-04-19 Thread mkretz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85048

--- Comment #13 from Matthias Kretz (Vir)  ---
Should I open a new PR for the remaining ((u)int64, 16) <-> (float, 16)
conversions?

https://godbolt.org/z/x3xPMYKj3

Note that __builtin_convertvector produces the code we want.

template <class T, int N, int Size = N * sizeof (T)>
using V [[gnu::vector_size (Size)]] = T;

template <class From, class To>
V<To, 16>
cvt16 (V<From, 16> x)
{
#if BUILTIN
  return __builtin_convertvector (x, V<To, 16>);
#else
  return V<To, 16>{ To (x[0]),  To (x[1]),  To (x[2]),  To (x[3]),
                    To (x[4]),  To (x[5]),  To (x[6]),  To (x[7]),
                    To (x[8]),  To (x[9]),  To (x[10]), To (x[11]),
                    To (x[12]), To (x[13]), To (x[14]), To (x[15]) };
#endif
}

#define _(name, from, to, size)   \
  auto name (V<from, size> x) { return cvt##size<from, to> (x); }
// integral -> float
_ (vcvtqq2ps, int64_t, float, 16)
_ (vcvtuqq2ps, uint64_t, float, 16)

// float -> integral
_ (vcvttps2qq, float, int64_t, 16)
_ (vcvttps2uqq, float, uint64_t, 16)

[Bug target/85048] [missed optimization] vector conversions

2024-04-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85048

Hongtao Liu  changed:

   What|Removed |Added

 CC||liuhongt at gcc dot gnu.org

--- Comment #14 from Hongtao Liu  ---
(In reply to Matthias Kretz (Vir) from comment #13)
> Should I open a new PR for the remaining ((u)int64, 16) <-> (float, 16)
> conversions?
> 
> https://godbolt.org/z/x3xPMYKj3
> 
> Note that __builtin_convertvector produces the code we want.
> 

With -mprefer-vector-width=512, GCC produces the same code.
The default tuning for -march=skylake-avx512 is -mprefer-vector-width=256.

[Bug target/85048] [missed optimization] vector conversions

2024-04-22 Thread mkretz at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85048

--- Comment #15 from Matthias Kretz (Vir)  ---
So it seems that if at least one of the vector builtins involved in the
expression is 512 bits GCC needs to locally increase prefer-vector-width to
512? Or, more generally:

prefer-vector-width = max(prefer-vector-width, 8 * sizeof(operands)..., 8 *
sizeof(return-value))

The reason to default to 256 bits is to avoid zmm register usage altogether
(clock-down). But if the surrounding code already uses zmm registers that
motivation is moot.

Also, I think this shouldn't be considered auto-vectorization but rather
pattern recognition (recognizing a __builtin_convertvector).
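
To spell out the proposed rule, here is a small compile-time sketch of the max() formula above. The helper name effective_vector_width and the typedefs are hypothetical; this only illustrates the arithmetic and is not how GCC computes or would compute the preferred width.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

typedef std::int32_t v4si  __attribute__((vector_size(16)));   // 128 bits
typedef float        v4sf  __attribute__((vector_size(16)));   // 128 bits
typedef float        v16sf __attribute__((vector_size(64)));   // 512 bits
typedef std::int64_t v16di __attribute__((vector_size(128)));  // 1024 bits

// Widest of: the configured preference, the return type, and every operand,
// all measured in bits.
template <class Ret, class... Operands>
constexpr std::size_t effective_vector_width(std::size_t preferred_bits) {
    return std::max({preferred_bits, 8 * sizeof(Operands)..., 8 * sizeof(Ret)});
}
```

Under this rule a function taking or returning a 512-bit vector would be compiled as if the preferred width were at least 512, while code that only touches 128-bit vectors would keep the configured default.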

[Bug target/85048] [missed optimization] vector conversions

2024-04-22 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85048

--- Comment #16 from Hongtao Liu  ---
(In reply to Matthias Kretz (Vir) from comment #15)
> So it seems that if at least one of the vector builtins involved in the
> expression is 512 bits GCC needs to locally increase prefer-vector-width to
> 512? Or, more generally:
> 
> prefer-vector-width = max(prefer-vector-width, 8 * sizeof(operands)..., 8 *
> sizeof(return-value))
> 
> The reason to default to 256 bits is to avoid zmm register usage altogether
> (clock-down). But if the surrounding code already uses zmm registers that
> motivation is moot.
> 
> Also, I think this shouldn't be considered auto-vectorization but rather
> pattern recognition (recognizing a __builtin_convertvector).

The related question is "should GCC set prefer-vector-width=512" when 512-bit
intrinsics are used. There may be situations where users don't want the
compiler to generate zmm code except for the 512-bit intrinsics in their
program, i.e., the hot loop is written with 512-bit intrinsics for performance,
but elsewhere zmm usage is better avoided.