[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2018-01-16 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 --- Comment #28 from Peter Cordes --- (In reply to Richard Biener from comment #27) > Note that this is deliberately left as-is because the target advertises > (cheap) support for horizontal reduction. The vectorizer simply generates > a single

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2018-01-15 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 Richard Biener changed: What|Removed |Added Status|REOPENED|RESOLVED Resolution|---

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2018-01-15 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 --- Comment #26 from Richard Biener --- (In reply to Peter Cordes from comment #25) > We're getting a spill/reload inside the loop with AVX512: > > .L2: > vmovdqa64 (%esp), %zmm3 > vpaddd (%eax), %zmm3, %zmm2 > addl

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2018-01-14 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 --- Comment #25 from Peter Cordes --- We're getting a spill/reload inside the loop with AVX512: .L2: vmovdqa64 (%esp), %zmm3 vpaddd (%eax), %zmm3, %zmm2 addl $64, %eax vmovdqa64 %zmm2, (%esp)

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2018-01-14 Thread ro at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 Rainer Orth changed: What|Removed |Added CC||ro at gcc dot gnu.org --- Comment #23

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2018-01-14 Thread ro at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 --- Comment #24 from Rainer Orth --- The new gcc.target/i386/pr80846-1.c testcase FAILs on Solaris/x86 (32 and 64-bit): +FAIL: gcc.target/i386/pr80846-1.c scan-assembler-times vextracti 2 (found 1 times) Assembler output attached. Rainer

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2018-01-14 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 --- Comment #22 from Peter Cordes --- Forgot the Godbolt link with updated cmdline options: https://godbolt.org/g/FCZAEj.

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2018-01-14 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 Peter Cordes changed: What|Removed |Added Status|RESOLVED|REOPENED Resolution|FIXED

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2018-01-12 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 Richard Biener changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|---

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2018-01-12 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 --- Comment #19 from Richard Biener --- Author: rguenth Date: Fri Jan 12 11:43:13 2018 New Revision: 256576 URL: https://gcc.gnu.org/viewcvs?rev=256576&root=gcc&view=rev Log: 2018-01-12 Richard Biener PR

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2017-09-13 Thread aldyh at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 --- Comment #18 from Aldy Hernandez --- Author: aldyh Date: Wed Sep 13 16:15:07 2017 New Revision: 252229 URL: https://gcc.gnu.org/viewcvs?rev=252229&root=gcc&view=rev Log: PR target/80846 * config/rs6000/vsx.md (vextract_fp_from_shorth,

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2017-09-13 Thread aldyh at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 --- Comment #17 from Aldy Hernandez --- Author: aldyh Date: Wed Sep 13 16:10:45 2017 New Revision: 252207 URL: https://gcc.gnu.org/viewcvs?rev=252207&root=gcc&view=rev Log: PR target/80846 * optabs.def (vec_extract_optab, vec_init_optab):

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2017-09-07 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 --- Comment #16 from Jakub Jelinek --- (In reply to rguent...@suse.de from comment #15) > Yeah, I have a patch that does this. The question is how to query the target > if the vector sizes share the same register set. Like we wouldn't want to go

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2017-09-07 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 --- Comment #15 from rguenther at suse dot de --- On September 7, 2017 1:53:47 PM GMT+02:00, "jakub at gcc dot gnu.org" wrote: >https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 > >--- Comment #14 from Jakub Jelinek

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2017-09-07 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 --- Comment #14 from Jakub Jelinek --- (In reply to Richard Biener from comment #11) > that's not using the unpacking strategy (sum adjacent elements) but still the > vector shift approach (add upper/lower halves). That's sth that can be >
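
To make the two reduction strategies named in this comment concrete, here is a minimal scalar sketch in portable C. It only models the data flow; it is not the code GCC emits and not the patch under discussion. The type `v8si_model` and all function names are hypothetical, introduced purely for illustration. The "narrow first" variant folds the upper 128-bit half onto the lower half and keeps halving, while the "adjacent pairs" variant sums neighbouring elements at each step.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar stand-in for a 256-bit vector of eight 32-bit ints. */
typedef struct { int32_t lane[8]; } v8si_model;

/* Upper/lower-half strategy: narrow to 128 bits at the first step. */
int32_t hsum_narrow_first(v8si_model v)
{
    int32_t x[4];
    /* Step 1: add the upper four lanes onto the lower four (256b -> 128b). */
    for (int i = 0; i < 4; i++)
        x[i] = v.lane[i] + v.lane[i + 4];
    /* Step 2: add the upper two remaining lanes onto the lower two. */
    x[0] += x[2];
    x[1] += x[3];
    /* Step 3: add the last pair. */
    return x[0] + x[1];
}

/* Adjacent-element strategy: sum neighbouring pairs, halving the count each pass. */
int32_t hsum_adjacent_pairs(v8si_model v)
{
    int32_t x[8];
    for (int i = 0; i < 8; i++)
        x[i] = v.lane[i];
    for (int width = 8; width > 1; width /= 2)
        for (int i = 0; i < width / 2; i++)
            x[i] = x[2 * i] + x[2 * i + 1]; /* reads stay ahead of writes */
    return x[0];
}

/* Both orders of addition must agree (inputs kept small to avoid overflow). */
void hsum_selfcheck(void)
{
    v8si_model v = {{10, 20, 30, 40, 50, 60, 70, 80}};
    assert(hsum_narrow_first(v) == hsum_adjacent_pairs(v));
}
```

Both functions perform seven additions in three dependent steps; the difference the bug title cares about is that after step 1 of the narrow-first order, every remaining operation only needs a 128-bit register.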

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2017-08-01 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 --- Comment #13 from Jakub Jelinek --- Author: jakub Date: Tue Aug 1 16:12:31 2017 New Revision: 250784 URL: https://gcc.gnu.org/viewcvs?rev=250784&root=gcc&view=rev Log: PR target/80846 * config/rs6000/vsx.md (vextract_fp_from_shorth,

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2017-08-01 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 --- Comment #12 from Jakub Jelinek --- Author: jakub Date: Tue Aug 1 08:26:14 2017 New Revision: 250759 URL: https://gcc.gnu.org/viewcvs?rev=250759&root=gcc&view=rev Log: PR target/80846 * optabs.def (vec_extract_optab, vec_init_optab):

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2017-07-26 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 --- Comment #11 from Richard Biener --- So after Jakub's update the vectorizer patch yields sumint: .LFB0: .cfi_startproc vpxor %xmm0, %xmm0, %xmm0 leaq 4096(%rdi), %rax .p2align 4,,10 .p2align 3 .L2:

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2017-07-20 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 --- Comment #10 from Jakub Jelinek --- Author: jakub Date: Thu Jul 20 16:36:18 2017 New Revision: 250397 URL: https://gcc.gnu.org/viewcvs?rev=250397&root=gcc&view=rev Log: PR target/80846 * config/i386/i386.c

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2017-07-19 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 Jakub Jelinek changed: What|Removed |Added CC||jakub at gcc dot gnu.org --- Comment #9

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2017-05-26 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 --- Comment #8 from Richard Biener --- Created attachment 41422 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41422&action=edit adjusted tree-vect-loop.c hunk

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2017-05-26 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 --- Comment #7 from Richard Biener --- Note that, similar to the vec_init optab not allowing constructing larger vectors from smaller ones, vec_extract doesn't allow extracting smaller vectors from larger ones. So I might be forced to go V8SI ->

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2017-05-26 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 --- Comment #6 from Richard Biener --- Similar with AVX512F I get .L2: vmovdqa64 -112(%rbp), %zmm3 addq $64, %rdi vpaddd -64(%rdi), %zmm3, %zmm2 cmpq %rdi, %rax vmovdqa64 %zmm2,

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2017-05-26 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 --- Comment #5 from Richard Biener --- Created attachment 41421 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41421&action=edit WIP patch

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2017-05-26 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 --- Comment #4 from Richard Biener --- (define_expand "<plusminus_insn><mode>3" [(set (match_operand:VI_AVX2 0 "register_operand") (plusminus:VI_AVX2 (match_operand:VI_AVX2 1 "vector_operand") (match_operand:VI_AVX2 2

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2017-05-26 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 Richard Biener changed: What|Removed |Added CC||vmakarov at gcc dot gnu.org ---

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2017-05-24 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 --- Comment #2 from Peter Cordes --- (In reply to Richard Biener from comment #1) > That is, it was supposed to end up using pslldq I think you mean PSRLDQ. Byte zero is the right-most when drawn in a way that makes bit/byte shift directions
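
The PSLLDQ/PSRLDQ distinction raised here can be illustrated with a small byte-level model in portable C. This is only a sketch of the instructions' documented semantics (whole-register byte shifts with zero fill), not compiler output; the type `v16qi_model` and the `model_*` function names are hypothetical. With byte 0 as the lowest element, PSRLDQ moves high elements down toward element 0, which is the direction a horizontal sum needs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 128-bit register; byte[0] is the lowest byte. */
typedef struct { uint8_t byte[16]; } v16qi_model;

/* Models PSRLDQ: result byte i comes from source byte i+n; zeros fill the top. */
v16qi_model model_psrldq(v16qi_model v, unsigned n)
{
    v16qi_model r;
    memset(&r, 0, sizeof r);
    for (unsigned i = 0; i + n < 16; i++)
        r.byte[i] = v.byte[i + n];
    return r;
}

/* Models PSLLDQ: result byte i comes from source byte i-n; zeros fill the bottom. */
v16qi_model model_pslldq(v16qi_model v, unsigned n)
{
    v16qi_model r;
    memset(&r, 0, sizeof r);
    for (unsigned i = n; i < 16; i++)
        r.byte[i] = v.byte[i - n];
    return r;
}

/* Shifting right by 8 bytes brings the upper 64 bits down to the low half. */
void byte_shift_selfcheck(void)
{
    v16qi_model v;
    for (int i = 0; i < 16; i++)
        v.byte[i] = (uint8_t)(i + 1);
    v16qi_model r = model_psrldq(v, 8);
    assert(r.byte[0] == v.byte[8]);
}
```

In this model, a left shift (PSLLDQ) by 8 bytes leaves zeros in the low half, so adding its result to the original vector would not fold the upper half into the element that the final scalar is read from.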

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2017-05-24 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 Richard Biener changed: What|Removed |Added Status|UNCONFIRMED |ASSIGNED Last reconfirmed|