Re: issue with behavior change of gcc -r between gcc-8 and gcc-9
On Wednesday, 1 April 2020 19:48:11 CEST Olivier Hainque wrote: > > -r 's business was to arrange for the linker not to > complain because the closure is incomplete, leaving us > with complete control of the closure. > > It doesn't seem to me there was a really strong motivation > to suddenly have -r influence the closure the way it now does. > > Would it be possible to revert to the previous behavior > and document it ? > > Or maybe allow it to be controllable by the target ports ? > > Or provide something to bring back the flexibility we had > if we really believe the default should change ? (I'm not > convinced) -r is used for relinking. The idea behind the change was to make it directly suitable for that. It takes object files and relinks them into a new object file. It gives the caller complete control. It sounds like you are missing some way to add startfiles? A reverse of -nostartfiles? But hopefully you can just use the linker directly? Unless you have LTO enabled object files you dont need the compiler to link. `Allan
Re: [PATCH] Come up with -flto=auto option.
On Mittwoch, 24. Juli 2019 08:45:21 CEST Martin Liška wrote: > On 7/24/19 12:11 AM, Allan Sandfeld Jensen wrote: > > On Dienstag, 23. Juli 2019 10:30:07 CEST Martin Liška wrote: > >> Hi. > >> > >> As we as openSUSE started using -flto, I see it very handy to have > >> an option value that will automatically detect number of cores > >> that can be used for parallel LTRANS phase. > >> > >> Thoughts? > > > > That's really nice. > > > > How much extra work would it be to make it support a posix make jobserver? > > We do support it via -flto=jobserver: > Good to know :) > > Problem is that nowadays you how much more common make systems like ninja, > meson and others that probably do not support that. > There are patches to enable it in ninja, and I know some Linux distros apply the patches by default. Though that is more listening, so it probably requires launching ninja using make, if you want to be able to pass it own to gcc. 'Allan
Re: [PATCH] Come up with -flto=auto option.
On Dienstag, 23. Juli 2019 10:30:07 CEST Martin Liška wrote: > Hi. > > As we as openSUSE started using -flto, I see it very handy to have > an option value that will automatically detect number of cores > that can be used for parallel LTRANS phase. > > Thoughts? > That's really nice. How much extra work would it be to make it support a posix make jobserver? As far as I understand, you would need to guess a partition size first (as your patch here does), but then only start each job when given a token from the jobserver FD. With that the integration to existing build infrastructure would be optimal. Cheers 'Allan
[Patch] Make x86-64 a generic architecture as documented
Hello Changing -march=x86-64 behaviour to match documentation and common usage. Question: Is this also good for -m32 -march=x86-64 or will generic then mean something else? Thank Allan diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 364466b6b6f..90f9cbc3c35 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,8 @@ +2019-07-13 Allan Sandfeld Jensen + + * gcc/common/config/i386/i386-common.c (processor_alias_table): Change + x86-64 architecture to use generic tuning and scheduling instead of K8. + 2019-07-11 Jakub Jelinek PR target/91124 diff --git a/gcc/common/config/i386/i386-common.c b/gcc/common/config/i386/ i386-common.c index a394f874fe4..19ace226190 100644 --- a/gcc/common/config/i386/i386-common.c +++ b/gcc/common/config/i386/i386-common.c @@ -1673,7 +1673,7 @@ const pta processor_alias_table[] = PTA_MMX | PTA_3DNOW | PTA_3DNOW_A | PTA_SSE | PTA_FXSR}, {"athlon-mp", PROCESSOR_ATHLON, CPU_ATHLON, PTA_MMX | PTA_3DNOW | PTA_3DNOW_A | PTA_SSE | PTA_FXSR}, - {"x86-64", PROCESSOR_K8, CPU_K8, + {"x86-64", PROCESSOR_GENERIC, CPU_GENERIC, PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_NO_SAHF | PTA_FXSR}, {"eden-x2", PROCESSOR_K8, CPU_K8, PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3 | PTA_FXSR},
Re: [Patch][GCC] Document and fix -r (partial linking)
On Montag, 27. August 2018 15:37:15 CEST Joseph Myers wrote: > On Sun, 26 Aug 2018, Allan Sandfeld Jensen wrote: > > Patch updated. I specifically edited a number of the existing tests that > > used both -r and -nostdlib and removed -nostdlib so the patch is > > exercised by existing tests. The patch bootstrapped, I didn't notice any > > relevant failures when running the test suite (though I could have missed > > something, I am never comfortable reading that output). > > Note that Iain's comments also included that the patch is incomplete > because of more specs in gcc.c (VTABLE_VERIFICATION_SPEC, > SANITIZER_EARLY_SPEC, SANITIZER_SPEC) that needs corresponding updates to > handle -r like -nostdlib. Updated (but tests not rerun)>From 1d164bced7979c94767c260174e3c486d4fc8c5d Mon Sep 17 00:00:00 2001 From: Allan Sandfeld Jensen Date: Sat, 1 Sep 2018 12:59:14 +0200 Subject: [PATCH] Fix and document -r option The option has existed and been working for years, make sure it implies the right extra options, and list it in the documentation. 2018-09-01 Allan Sandfeld Jensen gcc/doc/ * invoke.texi: Document -r. gcc/ * gcc.c (LINK_COMMAND_SPEC): Handle -r like -nostdlib. (VTABLE_VERIFICATION_SPEC): Ditto (SANITIZER_EARLY_SPEC): Ditto (SANITIZER_SPEC): Ditto * config/darwin.h (LINK_COMMAND_SPEC): Ditto * cp/g++spec.c (lang_specific_driver): Ditto * fortran/gfortranspec.c (lang_specific_driver): Ditto * go/gospec.c (lang_specific_driver): Ditto gcc/testsuite/ * g++.dg/ipa/pr64059.C: Removed now redundant -nostdlib. * g++.dg/lto/20081109-1_0.C: Ditto * g++.dg/lto/20090302_0.C: Ditto * g++.dg/lto/pr45621_0.C: Ditto * g++.dg/lto/pr60567_0.C: Ditto * g++.dg/lto/pr62026.C: Ditto * gcc.dg/lto/pr45736_0.c: Ditto * gcc.dg/lto/pr52634_0.c: Ditto * gfortran.dg/lto/20091016-1_0.f90: Ditto * gfortran.dg/lto/pr79108_0.f90: Ditto --- gcc/config/darwin.h| 8 gcc/cp/g++spec.c | 1 + gcc/doc/invoke.texi| 7 ++- gcc/fortran/gfortranspec.c | 1 + gcc/gcc.c | 18 +- gcc/go/gospec.c| 1 + gcc/testsuite/g++.dg/ipa/pr64059.C | 2 +- gcc/testsuite/g++.dg/lto/20081109-1_0.C| 2 +- gcc/testsuite/g++.dg/lto/20090302_0.C | 2 +- gcc/testsuite/g++.dg/lto/pr45621_0.C | 2 +- gcc/testsuite/g++.dg/lto/pr60567_0.C | 2 +- gcc/testsuite/g++.dg/lto/pr62026.C | 2 +- gcc/testsuite/gcc.dg/lto/pr45736_0.c | 2 +- gcc/testsuite/gcc.dg/lto/pr52634_0.c | 2 +- gcc/testsuite/gfortran.dg/lto/20091016-1_0.f90 | 2 +- gcc/testsuite/gfortran.dg/lto/pr79108_0.f90| 2 +- 16 files changed, 32 insertions(+), 24 deletions(-) diff --git a/gcc/config/darwin.h b/gcc/config/darwin.h index cd6d6521658..87f610259c0 100644 --- a/gcc/config/darwin.h +++ b/gcc/config/darwin.h @@ -180,20 +180,20 @@ extern GTY(()) int darwin_ms_struct; "%X %{s} %{t} %{Z} %{u*} \ %{e*} %{r} \ %{o*}%{!o:-o a.out} \ -%{!nostdlib:%{!nostartfiles:%S}} \ +%{!nostdlib:%{!r:%{!nostartfiles:%S}}} \ %{L*} %(link_libgcc) %o %{fprofile-arcs|fprofile-generate*|coverage:-lgcov} \ %{fopenacc|fopenmp|%:gt(%{ftree-parallelize-loops=*:%*} 1): \ %{static|static-libgcc|static-libstdc++|static-libgfortran: libgomp.a%s; : -lgomp } } \ %{fgnu-tm: \ %{static|static-libgcc|static-libstdc++|static-libgfortran: libitm.a%s; : -litm } } \ -%{!nostdlib:%{!nodefaultlibs:\ +%{!nostdlib:%{!r:%{!nodefaultlibs:\ %{%:sanitize(address): -lasan } \ %{%:sanitize(undefined): -lubsan } \ %(link_ssp) \ " DARWIN_EXPORT_DYNAMIC " %
Re: [Patch][GCC] Document and fix -r (partial linking)
On Montag, 27. August 2018 15:37:15 CEST Joseph Myers wrote: > On Sun, 26 Aug 2018, Allan Sandfeld Jensen wrote: > > Patch updated. I specifically edited a number of the existing tests that > > used both -r and -nostdlib and removed -nostdlib so the patch is > > exercised by existing tests. The patch bootstrapped, I didn't notice any > > relevant failures when running the test suite (though I could have missed > > something, I am never comfortable reading that output). > > Note that Iain's comments also included that the patch is incomplete > because of more specs in gcc.c (VTABLE_VERIFICATION_SPEC, > SANITIZER_EARLY_SPEC, SANITIZER_SPEC) that needs corresponding updates to > handle -r like -nostdlib. Okay, I can add that, or whoever commits the patch can add that. We can also improve the feature if we discover more places that needs updating. Do you want me to post an version updated with with these two places? 'Allan
Re: [Patch][GCC] Document and fix -r (partial linking)
On Donnerstag, 23. August 2018 23:24:02 CEST Joseph Myers wrote: > On Thu, 23 Aug 2018, Iain Sandoe wrote: > > Joseph: As a side-comment, is there a reason that we don’t exclude > > gomp/itm/fortran/gcov from the link for -nostdlib / -nodefaultlib? > > > > If we are relying on the lib self-specs for this, then we’re not > > succeeding since the one we build at the moment don’t include those > > clauses. > > Well, fortran/gfortranspec.c for example has > > case OPT_nostdlib: > case OPT_nodefaultlibs: > case OPT_c: > case OPT_S: > case OPT_fsyntax_only: > case OPT_E: > /* These options disable linking entirely or linking of the > standard libraries. */ > library = 0; > break; > > and only uses libgfortran.spec if (library). So it's certainly meant to > avoid linking with libgfortran or its dependencies if -nostdlib. Patch updated. I specifically edited a number of the existing tests that used both -r and -nostdlib and removed -nostdlib so the patch is exercised by existing tests. The patch bootstrapped, I didn't notice any relevant failures when running the test suite (though I could have missed something, I am never comfortable reading that output). 'Allan >From 07ed41a9afd107c5d45feb1ead7a74ca735a1bb2 Mon Sep 17 00:00:00 2001 From: Allan Sandfeld Jensen Date: Sun, 26 Aug 2018 20:02:54 +0200 Subject: [PATCH] Fix and document -r option The option has existed and been working for years, make sure it implies the right extra options, and list it in the documentation. 2018-08-26 Allan Sandfeld Jensen gcc/doc/ * invoke.texi: Document -r. gcc/ * gcc.c (LINK_COMMAND_SPEC): Handle -r like -nostdlib. * config/darwin.h (LINK_COMMAND_SPEC): Handle -r like -nostdlib. * cp/g++spec.c (lang_specific_driver): Handle -r like -nostdlib. * fortran/gfortranspec.c (lang_specific_driver): Handle -r like -nostdlib. * go/gospec.c (lang_specific_driver): Handle -r like -nostdlib. gcc/testsuite/ * g++.dg/ipa/pr64059.C: Removed now redundant -nostdlib. * g++.dg/lto/20081109-1_0.C: Removed now redundant -nostdlib. * g++.dg/lto/20090302_0.C: Removed now redundant -nostdlib. * g++.dg/lto/pr45621_0.C: Removed now redundant -nostdlib. * g++.dg/lto/pr60567_0.C: Removed now redundant -nostdlib. * g++.dg/lto/pr62026.C: Removed now redundant -nostdlib. * gcc.dg/lto/pr45736_0.c: Removed now redundant -nostdlib. * gcc.dg/lto/pr52634_0.c: Removed now redundant -nostdlib. * gfortran.dg/lto/20091016-1_0.f90: Removed now redundant -nostdlib. * gfortran.dg/lto/pr79108_0.f90: Removed now redundant -nostdlib. --- gcc/config/darwin.h| 8 gcc/cp/g++spec.c | 1 + gcc/doc/invoke.texi| 7 ++- gcc/fortran/gfortranspec.c | 1 + gcc/gcc.c | 6 +++--- gcc/go/gospec.c| 1 + gcc/testsuite/g++.dg/ipa/pr64059.C | 2 +- gcc/testsuite/g++.dg/lto/20081109-1_0.C| 2 +- gcc/testsuite/g++.dg/lto/20090302_0.C | 2 +- gcc/testsuite/g++.dg/lto/pr45621_0.C | 2 +- gcc/testsuite/g++.dg/lto/pr60567_0.C | 2 +- gcc/testsuite/g++.dg/lto/pr62026.C | 2 +- gcc/testsuite/gcc.dg/lto/pr45736_0.c | 2 +- gcc/testsuite/gcc.dg/lto/pr52634_0.c | 2 +- gcc/testsuite/gfortran.dg/lto/20091016-1_0.f90 | 2 +- gcc/testsuite/gfortran.dg/lto/pr79108_0.f90| 2 +- 16 files changed, 26 insertions(+), 18 deletions(-) diff --git a/gcc/config/darwin.h b/gcc/config/darwin.h index cd6d6521658..87f610259c0 100644 --- a/gcc/config/darwin.h +++ b/gcc/config/darwin.h @@ -180,20 +180,20 @@ extern GTY(()) int darwin_ms_struct; "%X %{s} %{t} %{Z} %{u*} \ %{e*} %{r} \ %{o*}%{!o:-o a.out} \ -%{!nostdlib:%{!nostartfiles:%S}} \ +%{!nostdlib:%{!r:%{!nostartfiles:%S}}} \ %{L*} %(link_libgcc) %o %{fprofile-arcs|fprofile-generate*|coverage:-lgcov} \ %{fopenacc|fopenmp|%:gt(%{ftree-parallelize-loops=*:%*} 1): \ %{static|static-libgcc|static-libstdc++|static-libgfortran: libgomp.a%s; : -lgomp } } \ %{fgnu-tm: \ %{static|static-libgcc|static-libstdc++|static-libgfortran: libitm.a%s; : -litm } } \ -%{!nostdlib:%{!nodefaultlibs:\ +%{!nostdlib:%{!r:%{!nodefaultlibs:\ %{%:sanitize(address): -lasan } \ %{%:sanitize(undefined): -lubsan } \ %(link_ssp) \ " DARWIN_EXPORT_DYNAMIC " %
Re: [Patch][GCC] Document and fix -r (partial linking)
On Dienstag, 21. August 2018 00:38:58 CEST Joseph Myers wrote: > On Fri, 3 Aug 2018, Allan Sandfeld Jensen wrote: > > > I think you're changing the wrong place for this. If you want -r to be > > > usable with GCC without using -nostdlib (which is an interesting > > > question), you actually need to change LINK_COMMAND_SPEC (also sometimes > > > overridden for targets) to handle -r more like -nostdlib -nostartfiles. > > > > Okay, so like this? > > Could you confirm if this has passed a bootstrap and testsuite run, with > no testsuite regressions compared to GCC without the patch applied? I > think it looks mostly OK (modulo ChangeLog rewrites and a missing second > space after '.' in the manual change) but I'd like to make sure it's > passed the usual testing before preparing it for commit. I didn't think of running the tests since it only affects command line options, but it did bootstrap, and behave as expected for my specific usecase. I will update and run the tests when I have time.
Re: [PATCH][x86] Match movss and movsd "blend" instructions
On Samstag, 11. August 2018 11:18:39 CEST Jakub Jelinek wrote: > On Sat, Aug 11, 2018 at 10:59:26AM +0200, Allan Sandfeld Jensen wrote: > > +/* A subroutine of ix86_expand_vec_perm_builtin_1. Try to implement D > > + using movss or movsd. */ > > +static bool > > +expand_vec_perm_movs (struct expand_vec_perm_d *d) > > +{ > > + machine_mode vmode = d->vmode; > > + unsigned i, nelt = d->nelt; > > + rtx x; > > + > > + if (d->one_operand_p) > > +return false; > > + > > + if (TARGET_SSE2 && (vmode == V2DFmode || vmode == V4SFmode)) > > +; > > + else > > +return false; > > + > > + /* Only the first element is changed. */ > > Two spaces after . > > > + if (d->perm[0] != nelt && d->perm[0] != 0) > > +return false; > > + for (i = 1; i < nelt; ++i) { > > +{ > > + if (d->perm[i] != i + nelt - d->perm[0]) > > +return false; > > +} > > + } > > Extraneous {}s (both pairs, the outer ones even badly indented). > > Otherwise LGTM. > Updated: Note as an infrequent contributor don't have commit access, so I need someone reviewing to also commit. 'Allan >From e33241e5ddc7fa57c4ba7893669af7f7e636125e Mon Sep 17 00:00:00 2001 From: Allan Sandfeld Jensen Date: Sat, 11 Aug 2018 11:52:21 +0200 Subject: [PATCH] Match movss and movsd "blend" instructions Adds the ability to match movss and movsd as blend patterns, implemented in a new method to be able to match these before shuffles, while keeping other blends after. 2018-08-11 Allan Sandfeld Jensen gcc/config/i386 * i386.cc (expand_vec_perm_movs): New method matching movs patterns. * i386.cc (expand_vec_perm_1): Try the new method. gcc/testsuite * gcc.target/i386/sse2-movs.c: New test. --- gcc/config/i386/emmintrin.h | 2 +- gcc/config/i386/i386.c | 41 + gcc/config/i386/xmmintrin.h | 5 - 3 files changed, 46 insertions(+), 2 deletions(-) diff --git a/gcc/config/i386/emmintrin.h b/gcc/config/i386/emmintrin.h index b940a39d27b..6501638f619 100644 --- a/gcc/config/i386/emmintrin.h +++ b/gcc/config/i386/emmintrin.h @@ -113,7 +113,7 @@ _mm_setzero_pd (void) extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_move_sd (__m128d __A, __m128d __B) { - return (__m128d) __builtin_ia32_movsd ((__v2df)__A, (__v2df)__B); + return __extension__ (__m128d) __builtin_shuffle((__v2df)__A, (__v2df)__B, (__v2di){2, 1}); } /* Load two DPFP values from P. The address must be 16-byte aligned. */ diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 7554fd1f659..15a3caa94c3 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -46145,6 +46145,43 @@ expand_vselect_vconcat (rtx target, rtx op0, rtx op1, return ok; } +/* A subroutine of ix86_expand_vec_perm_builtin_1. Try to implement D + using movss or movsd. */ +static bool +expand_vec_perm_movs (struct expand_vec_perm_d *d) +{ + machine_mode vmode = d->vmode; + unsigned i, nelt = d->nelt; + rtx x; + + if (d->one_operand_p) +return false; + + if (TARGET_SSE2 && (vmode == V2DFmode || vmode == V4SFmode)) +; + else +return false; + + /* Only the first element is changed. */ + if (d->perm[0] != nelt && d->perm[0] != 0) +return false; + for (i = 1; i < nelt; ++i) +if (d->perm[i] != i + nelt - d->perm[0]) + return false; + + if (d->testing_p) +return true; + + if (d->perm[0] == nelt) +x = gen_rtx_VEC_MERGE (vmode, d->op1, d->op0, GEN_INT (1)); + else +x = gen_rtx_VEC_MERGE (vmode, d->op0, d->op1, GEN_INT (1)); + + emit_insn (gen_rtx_SET (d->target, x)); + + return true; +} + /* A subroutine of ix86_expand_vec_perm_builtin_1. Try to implement D in terms of blendp[sd] / pblendw / pblendvb / vpblendd. */ @@ -46887,6 +46924,10 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d) } } + /* Try movss/movsd instructions. */ + if (expand_vec_perm_movs (d)) +return true; + /* Finally, try the fully general two operand permute. */ if (expand_vselect_vconcat (d->target, d->op0, d->op1, d->perm, nelt, d->testing_p)) diff --git a/gcc/config/i386/xmmintrin.h b/gcc/config/i386/xmmintrin.h index f64f3f74a0b..f770570295c 100644 --- a/gcc/config/i386/xmmintrin.h +++ b/gcc/config/i386/xmmintrin.h @@ -1011,7 +1011,10 @@ _mm_storer_ps (float *__P, __m128 __A) extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_move_ss (__m128 __A, __m128 __B) { - return (__m128) __builtin_ia32_movss ((__v4sf)__A, (__v4sf)__B); + return (__m128) __builtin_shuffle ((__v4sf)__A, (__v4sf)__B, + __extension__ + (__attribute__((__vector_size__ (16))) int) + {4,1,2,3}); } /* Extracts one of the four words of A. The selector N must be immediate. */ -- 2.17.1
Re: [PATCH][x86] Match movss and movsd "blend" instructions
Updated: Match movss and movsd "blend" instructions Adds the ability to match movss and movsd as blend patterns, implemented in a new method to be able to match these before shuffles, while keeping other blends after. 2018-08-11 Allan Sandfeld Jensen gcc/config/i386 * i386.cc (expand_vec_perm_movs): New method matching movs patterns. * i386.cc (expand_vec_perm_1): Try the new method. gcc/testsuite * gcc.target/i386/sse2-movs.c: New test. diff --git a/gcc/config/i386/emmintrin.h b/gcc/config/i386/emmintrin.h index b940a39d27b..6501638f619 100644 --- a/gcc/config/i386/emmintrin.h +++ b/gcc/config/i386/emmintrin.h @@ -113,7 +113,7 @@ _mm_setzero_pd (void) extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_move_sd (__m128d __A, __m128d __B) { - return (__m128d) __builtin_ia32_movsd ((__v2df)__A, (__v2df)__B); + return __extension__ (__m128d) __builtin_shuffle((__v2df)__A, (__v2df)__B, (__v2di){2, 1}); } /* Load two DPFP values from P. The address must be 16-byte aligned. */ diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 7554fd1f659..485850096e9 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -46145,6 +46145,46 @@ expand_vselect_vconcat (rtx target, rtx op0, rtx op1, return ok; } +/* A subroutine of ix86_expand_vec_perm_builtin_1. Try to implement D + using movss or movsd. */ +static bool +expand_vec_perm_movs (struct expand_vec_perm_d *d) +{ + machine_mode vmode = d->vmode; + unsigned i, nelt = d->nelt; + rtx x; + + if (d->one_operand_p) +return false; + + if (TARGET_SSE2 && (vmode == V2DFmode || vmode == V4SFmode)) +; + else +return false; + + /* Only the first element is changed. */ + if (d->perm[0] != nelt && d->perm[0] != 0) +return false; + for (i = 1; i < nelt; ++i) { +{ + if (d->perm[i] != i + nelt - d->perm[0]) +return false; +} + } + + if (d->testing_p) +return true; + + if (d->perm[0] == nelt) +x = gen_rtx_VEC_MERGE (vmode, d->op1, d->op0, GEN_INT (1)); + else +x = gen_rtx_VEC_MERGE (vmode, d->op0, d->op1, GEN_INT (1)); + + emit_insn (gen_rtx_SET (d->target, x)); + + return true; +} + /* A subroutine of ix86_expand_vec_perm_builtin_1. Try to implement D in terms of blendp[sd] / pblendw / pblendvb / vpblendd. */ @@ -46887,6 +46927,10 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d) } } + /* Try movss/movsd instructions. */ + if (expand_vec_perm_movs (d)) +return true; + /* Finally, try the fully general two operand permute. */ if (expand_vselect_vconcat (d->target, d->op0, d->op1, d->perm, nelt, d->testing_p)) diff --git a/gcc/config/i386/xmmintrin.h b/gcc/config/i386/xmmintrin.h index f64f3f74a0b..f770570295c 100644 --- a/gcc/config/i386/xmmintrin.h +++ b/gcc/config/i386/xmmintrin.h @@ -1011,7 +1011,10 @@ _mm_storer_ps (float *__P, __m128 __A) extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_move_ss (__m128 __A, __m128 __B) { - return (__m128) __builtin_ia32_movss ((__v4sf)__A, (__v4sf)__B); + return (__m128) __builtin_shuffle ((__v4sf)__A, (__v4sf)__B, + __extension__ + (__attribute__((__vector_size__ (16))) int) + {4,1,2,3}); } /* Extracts one of the four words of A. The selector N must be immediate. */
Re: [Patch][GCC] Document and fix -r (partial linking)
On Freitag, 3. August 2018 13:56:12 CEST Allan Sandfeld Jensen wrote: > On Mittwoch, 1. August 2018 18:32:30 CEST Joseph Myers wrote: > > On Wed, 1 Aug 2018, Allan Sandfeld Jensen wrote: > > > gcc/ > > > > > > * gcc.c: Correct default specs for -r > > > > I don't follow why your changes (which would need describing for each > > individual spec changed) are corrections. > > > > > /* config.h can define LIB_SPEC to override the default libraries. */ > > > #ifndef LIB_SPEC > > > > > > -#define LIB_SPEC "%{!shared:%{g*:-lg} > > > %{!p:%{!pg:-lc}}%{p:-lc_p}%{pg:-lc_p}}" +#define LIB_SPEC > > > "%{!shared|!r:%{g*:-lg} %{!p:%{!pg:-lc}}%{p:-lc_p}%{pg:-lc_p}}"> > > > > > > #endif > > > > '!' binds more closely than '|' in specs. That is, !shared|!r means the > > following specs are used unless both -shared and -r are specified, which > > seems nonsensical to me. I'd expect something more like "shared|r:;" to > > expand to nothing if either -shared or -r is passed and to what follows if > > neither is passed. > > > > And that ignores that this LIB_SPEC value in gcc.c is largely irrelevant, > > as it's generally overridden by targets - and normally for targets using > > ELF shared libraries, for example, -lc *does* have to be used when linking > > with -shared. > > > > I think you're changing the wrong place for this. If you want -r to be > > usable with GCC without using -nostdlib (which is an interesting > > question), you actually need to change LINK_COMMAND_SPEC (also sometimes > > overridden for targets) to handle -r more like -nostdlib -nostartfiles. > > Okay, so like this? Any further comments or corrections to updated patch in the parent message? 'Allan
Re: [Patch][GCC] Document and fix -r (partial linking)
On Mittwoch, 1. August 2018 18:32:30 CEST Joseph Myers wrote: > On Wed, 1 Aug 2018, Allan Sandfeld Jensen wrote: > > gcc/ > > > > * gcc.c: Correct default specs for -r > > I don't follow why your changes (which would need describing for each > individual spec changed) are corrections. > > > /* config.h can define LIB_SPEC to override the default libraries. */ > > #ifndef LIB_SPEC > > > > -#define LIB_SPEC "%{!shared:%{g*:-lg} > > %{!p:%{!pg:-lc}}%{p:-lc_p}%{pg:-lc_p}}" +#define LIB_SPEC > > "%{!shared|!r:%{g*:-lg} %{!p:%{!pg:-lc}}%{p:-lc_p}%{pg:-lc_p}}"> > > #endif > > '!' binds more closely than '|' in specs. That is, !shared|!r means the > following specs are used unless both -shared and -r are specified, which > seems nonsensical to me. I'd expect something more like "shared|r:;" to > expand to nothing if either -shared or -r is passed and to what follows if > neither is passed. > > And that ignores that this LIB_SPEC value in gcc.c is largely irrelevant, > as it's generally overridden by targets - and normally for targets using > ELF shared libraries, for example, -lc *does* have to be used when linking > with -shared. > > I think you're changing the wrong place for this. If you want -r to be > usable with GCC without using -nostdlib (which is an interesting > question), you actually need to change LINK_COMMAND_SPEC (also sometimes > overridden for targets) to handle -r more like -nostdlib -nostartfiles. > Okay, so like this? >From 9de68f2ef0b77a0c0bcf0d83232e7fc34b006406 Mon Sep 17 00:00:00 2001 From: Allan Sandfeld Jensen Date: Wed, 1 Aug 2018 18:07:05 +0200 Subject: [PATCH] Fix and document -r option The option has existed and been working for years, make sure it implies the right extra options, and list it in the documentation. 2018-07-29 Allan Sandfeld Jensen gcc/doc * invoke.texi: Document -r gcc/ * gcc.c (LINK_COMMAND_SPEC): Handle -r like -nostdlib * config/darwin.h (LINK_COMMAND_SPEC): Handle -r like -nostdlib --- gcc/config/darwin.h | 8 gcc/doc/invoke.texi | 7 ++- gcc/gcc.c | 6 +++--- 3 files changed, 13 insertions(+), 8 deletions(-) diff --git a/gcc/config/darwin.h b/gcc/config/darwin.h index 980ad9b4057..6ad657a4958 100644 --- a/gcc/config/darwin.h +++ b/gcc/config/darwin.h @@ -178,20 +178,20 @@ extern GTY(()) int darwin_ms_struct; "%X %{s} %{t} %{Z} %{u*} \ %{e*} %{r} \ %{o*}%{!o:-o a.out} \ -%{!nostdlib:%{!nostartfiles:%S}} \ +%{!nostdlib:%{!r:%{!nostartfiles:%S}}} \ %{L*} %(link_libgcc) %o %{fprofile-arcs|fprofile-generate*|coverage:-lgcov} \ %{fopenacc|fopenmp|%:gt(%{ftree-parallelize-loops=*:%*} 1): \ %{static|static-libgcc|static-libstdc++|static-libgfortran: libgomp.a%s; : -lgomp } } \ %{fgnu-tm: \ %{static|static-libgcc|static-libstdc++|static-libgfortran: libitm.a%s; : -litm } } \ -%{!nostdlib:%{!nodefaultlibs:\ +%{!nostdlib:%{!r:%{!nodefaultlibs:\ %{%:sanitize(address): -lasan } \ %{%:sanitize(undefined): -lubsan } \ %(link_ssp) \ " DARWIN_EXPORT_DYNAMIC " %
Re: [PATCH][x86] Match movss and movsd "blend" instructions
On Donnerstag, 2. August 2018 23:15:28 CEST Marc Glisse wrote: > On Thu, 2 Aug 2018, Allan Sandfeld Jensen wrote: > > I forgot. One of the things that makes using __builtin_shuffle ugly is > > that > > __v4si as the suffle argument needs to be in _mm_move_ss, is declared > > in emmintrin.h, but _mm_move_ss is in xmmintrin.h. > > __v4si is some internal detail, I don't see much issue with moving it to > xmmintrin.h if you want to use it there. > > > In general the gcc __builtin_shuffle syntax with the argument being a > > vector is kind of ackward. At least for the declaring intrinsics, the > > clang still where the permutator is extra argument is easier to deal > > with: > > __builtin_shuffle(a, b, (__v4si){4, 0, 1, 2}) > > vs > > __builtin_shuffle(a, b, 4, 0, 1, 2) > > __builtin_shufflevector IIRC > > >> The question is what users expect and get when they use -O0 with > >> intrinsics?> > > Here is the version with __builtin_shuffle. It might be more expectable > > -O0, but it is also uglier. > > I am not convinced -O0 is very important. > Me neither, and in any case I would argue the logic that recognizes the vector constructions patterns are not optimizations but instruction matching. > If you start extending your approach to _mm_add_sd and others, while one > instruction is easy enough to recognize, if we put several in a row, they > will be partially simplified and may become harder to recognize. > { x*(y+v[0]-z), v[1] } requires that you notice that the upper part of > this vector is v[1], i.e. the upper part of a vector whose lower part > appears somewhere in the arbitrarily complex expression for the lower > part of the result. And you then have to propagate the fact that you are > doing vector operations all the way back to v[0]. > > I don't have a strong opinion on what the best approach is. Yes, I am not sure all of those could be done exhaustively with the existing logic, and it might also be of dubious value as in almost all cases the ps instructions have the same latency and bandwidth as the ss instructions, so developers should probably use _ps versions as they are scheduled better by the compiler (or at least better by gcc). It was just an idea, and I haven't tried it at this point. 'Allan
Re: [PATCH][x86] Match movss and movsd "blend" instructions
On Donnerstag, 2. August 2018 23:46:37 CEST Jakub Jelinek wrote: > On Thu, Aug 02, 2018 at 10:50:58PM +0200, Allan Sandfeld Jensen wrote: > > Here is the version with __builtin_shuffle. It might be more expectable > > -O0, but it is also uglier. > > I don't find anything ugly on it, except the formatting glitches (missing > space before (, overlong line, and useless __extension__. > Improving code generated for __builtin_shuffle is desirable too. > __extension__ is needed when using the the {...} initialization otherwise - std=C89 will produce warnings about standards. The line is a bit long, but I thought it looked better like this rather than adding any emergency line breaks. Is there a hard limit? > > --- a/gcc/config/i386/xmmintrin.h > > +++ b/gcc/config/i386/xmmintrin.h > > @@ -1011,7 +1011,8 @@ _mm_storer_ps (float *__P, __m128 __A) > > > > extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, > > __artificial__)) _mm_move_ss (__m128 __A, __m128 __B) > > { > > > > - return (__m128) __builtin_ia32_movss ((__v4sf)__A, (__v4sf)__B); > > + return __extension__ (__m128) __builtin_shuffle((__v4sf)__A, > > (__v4sf)__B, + > > (__attribute__((__vector_size__ (16))) int){4, 1, 2, 3}); > And obviously use __v4si here instead of __attribute__((__vector_size__ > (16))) int. > __v4si is declared in emmintrin.h, so I couldn't use it here unless I moved the definition. I tried changing as little as possible to not trigger bike shedding. 'Allan
Re: [PATCH][x86] Match movss and movsd "blend" instructions
On Donnerstag, 2. August 2018 11:18:41 CEST Richard Biener wrote: > On Thu, Aug 2, 2018 at 11:12 AM Allan Sandfeld Jensen > > wrote: > > On Mittwoch, 1. August 2018 18:51:41 CEST Marc Glisse wrote: > > > On Wed, 1 Aug 2018, Allan Sandfeld Jensen wrote: > > > > extern __inline __m128d __attribute__((__gnu_inline__, > > > > __always_inline__, > > > > > > > > __artificial__)) > > > > > > > > _mm_move_sd (__m128d __A, __m128d __B) > > > > { > > > > > > > > - return (__m128d) __builtin_ia32_movsd ((__v2df)__A, (__v2df)__B); > > > > + return __extension__ (__m128d)(__v2df){__B[0],__A[1]}; > > > > > > > > } > > > > > > If the goal is to have it represented as a VEC_PERM_EXPR internally, I > > > wonder if we should be explicit and use __builtin_shuffle instead of > > > relying on some forwprop pass to transform it. Maybe not, just asking. > > > And > > > the answer need not even be the same for _mm_move_sd and _mm_move_ss. > > > > I wrote it this way because this pattern could later also be used for the > > other _ss intrinsics, such as _mm_add_ss, where a _builtin_shuffle could > > not. To match the other intrinsics the logic that tries to match vector > > construction just needs to be extended to try merge patterns even if one > > of the subexpressions is not simple. > > The question is what users expect and get when they use -O0 with intrinsics? > > Richard. > Here is the version with __builtin_shuffle. It might be more expectable -O0, but it is also uglier. diff --git a/gcc/config/i386/emmintrin.h b/gcc/config/i386/emmintrin.h index b940a39d27b..6501638f619 100644 --- a/gcc/config/i386/emmintrin.h +++ b/gcc/config/i386/emmintrin.h @@ -113,7 +113,7 @@ _mm_setzero_pd (void) extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_move_sd (__m128d __A, __m128d __B) { - return (__m128d) __builtin_ia32_movsd ((__v2df)__A, (__v2df)__B); + return __extension__ (__m128d) __builtin_shuffle((__v2df)__A, (__v2df)__B, (__v2di){2, 1}); } /* Load two DPFP values from P. The address must be 16-byte aligned. */ diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index ee409cfe7e4..2337ef5ea08 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -46143,6 +46143,46 @@ expand_vselect_vconcat (rtx target, rtx op0, rtx op1, return ok; } +/* A subroutine of ix86_expand_vec_perm_builtin_1. Try to implement D + using movss or movsd. */ +static bool +expand_vec_perm_movs (struct expand_vec_perm_d *d) +{ + machine_mode vmode = d->vmode; + unsigned i, nelt = d->nelt; + rtx x; + + if (d->one_operand_p) +return false; + + if (TARGET_SSE2 && (vmode == V2DFmode || vmode == V4SFmode)) +; + else +return false; + + /* Only the first element is changed. */ + if (d->perm[0] != nelt && d->perm[0] != 0) +return false; + for (i = 1; i < nelt; ++i) { +{ + if (d->perm[i] != i + nelt - d->perm[0]) +return false; +} + } + + if (d->testing_p) +return true; + + if (d->perm[0] == nelt) +x = gen_rtx_VEC_MERGE (vmode, d->op1, d->op0, GEN_INT (1)); + else +x = gen_rtx_VEC_MERGE (vmode, d->op0, d->op1, GEN_INT (1)); + + emit_insn (gen_rtx_SET (d->target, x)); + + return true; +} + /* A subroutine of ix86_expand_vec_perm_builtin_1. Try to implement D in terms of blendp[sd] / pblendw / pblendvb / vpblendd. */ @@ -46885,6 +46925,10 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d) } } + /* Try movss/movsd instructions. */ + if (expand_vec_perm_movs (d)) +return true; + /* Finally, try the fully general two operand permute. */ if (expand_vselect_vconcat (d->target, d->op0, d->op1, d->perm, nelt, d->testing_p)) diff --git a/gcc/config/i386/xmmintrin.h b/gcc/config/i386/xmmintrin.h index f64f3f74a0b..45b99ff87d5 100644 --- a/gcc/config/i386/xmmintrin.h +++ b/gcc/config/i386/xmmintrin.h @@ -1011,7 +1011,8 @@ _mm_storer_ps (float *__P, __m128 __A) extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_move_ss (__m128 __A, __m128 __B) { - return (__m128) __builtin_ia32_movss ((__v4sf)__A, (__v4sf)__B); + return __extension__ (__m128) __builtin_shuffle((__v4sf)__A, (__v4sf)__B, + (__attribute__((__vector_size__ (16))) int){4, 1, 2, 3}); } /* Extracts one of the four words of A. The selector N must be immediate. */
Re: [PATCH][x86] Match movss and movsd "blend" instructions
On Mittwoch, 1. August 2018 18:51:41 CEST Marc Glisse wrote: > On Wed, 1 Aug 2018, Allan Sandfeld Jensen wrote: > > extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, > > > > __artificial__)) > > > > _mm_move_sd (__m128d __A, __m128d __B) > > { > > > > - return (__m128d) __builtin_ia32_movsd ((__v2df)__A, (__v2df)__B); > > + return __extension__ (__m128d)(__v2df){__B[0],__A[1]}; > > > > } > > If the goal is to have it represented as a VEC_PERM_EXPR internally, I > wonder if we should be explicit and use __builtin_shuffle instead of > relying on some forwprop pass to transform it. Maybe not, just asking. And > the answer need not even be the same for _mm_move_sd and _mm_move_ss. I forgot. One of the things that makes using __builtin_shuffle ugly is that __v4si as the suffle argument needs to be in _mm_move_ss, is declared in emmintrin.h, but _mm_move_ss is in xmmintrin.h. In general the gcc __builtin_shuffle syntax with the argument being a vector is kind of ackward. At least for the declaring intrinsics, the clang still where the permutator is extra argument is easier to deal with: __builtin_shuffle(a, b, (__v4si){4, 0, 1, 2}) vs __builtin_shuffle(a, b, 4, 0, 1, 2)
Re: [PATCH][x86] Match movss and movsd "blend" instructions
On Mittwoch, 1. August 2018 18:51:41 CEST Marc Glisse wrote: > On Wed, 1 Aug 2018, Allan Sandfeld Jensen wrote: > > extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, > > > > __artificial__)) > > > > _mm_move_sd (__m128d __A, __m128d __B) > > { > > > > - return (__m128d) __builtin_ia32_movsd ((__v2df)__A, (__v2df)__B); > > + return __extension__ (__m128d)(__v2df){__B[0],__A[1]}; > > > > } > > If the goal is to have it represented as a VEC_PERM_EXPR internally, I > wonder if we should be explicit and use __builtin_shuffle instead of > relying on some forwprop pass to transform it. Maybe not, just asking. And > the answer need not even be the same for _mm_move_sd and _mm_move_ss. I wrote it this way because this pattern could later also be used for the other _ss intrinsics, such as _mm_add_ss, where a _builtin_shuffle could not. To match the other intrinsics the logic that tries to match vector construction just needs to be extended to try merge patterns even if one of the subexpressions is not simple. 'Allan
Re: [Patch][GCC] Document and fix -r (partial linking)
On Mittwoch, 1. August 2018 18:32:30 CEST Joseph Myers wrote: > On Wed, 1 Aug 2018, Allan Sandfeld Jensen wrote: > > gcc/ > > > > * gcc.c: Correct default specs for -r > > I don't follow why your changes (which would need describing for each > individual spec changed) are corrections. > > > /* config.h can define LIB_SPEC to override the default libraries. */ > > #ifndef LIB_SPEC > > > > -#define LIB_SPEC "%{!shared:%{g*:-lg} > > %{!p:%{!pg:-lc}}%{p:-lc_p}%{pg:-lc_p}}" +#define LIB_SPEC > > "%{!shared|!r:%{g*:-lg} %{!p:%{!pg:-lc}}%{p:-lc_p}%{pg:-lc_p}}"> > > #endif > > '!' binds more closely than '|' in specs. That is, !shared|!r means the > following specs are used unless both -shared and -r are specified, which > seems nonsensical to me. I'd expect something more like "shared|r:;" to > expand to nothing if either -shared or -r is passed and to what follows if > neither is passed. > > And that ignores that this LIB_SPEC value in gcc.c is largely irrelevant, > as it's generally overridden by targets - and normally for targets using > ELF shared libraries, for example, -lc *does* have to be used when linking > with -shared. > > I think you're changing the wrong place for this. If you want -r to be > usable with GCC without using -nostdlib (which is an interesting > question), you actually need to change LINK_COMMAND_SPEC (also sometimes > overridden for targets) to handle -r more like -nostdlib -nostartfiles. > Ok, thanks for the information, I will investigate that. > > -#define LINK_PIE_SPEC "%{static|shared|r:;" PIE_SPEC ":" LD_PIE_SPEC "} " > > +#define LINK_PIE_SPEC "%{static|shared|r|ar:;" PIE_SPEC ":" LD_PIE_SPEC > > "} " > What's this "-ar" option you're handling here? Dead code from a previous more ambitious version of the patch. I will remove. `Allan
[Patch][GCC] Document and fix -r (partial linking)
The option has existed and been working for years, make sure it implies the right extra options, and list it in the documentation. 2018-08-01 Allan Sandfeld Jensen gcc/doc * invoke.texi: Document -r gcc/ * gcc.c: Correct default specs for -r --- gcc/doc/invoke.texi | 7 ++- gcc/gcc.c | 6 +++--- 2 files changed, 9 insertions(+), 4 deletions(-)>From 638966e6c7e072ca46c6af0664fbd57bedbfff80 Mon Sep 17 00:00:00 2001 From: Allan Sandfeld Jensen Date: Wed, 1 Aug 2018 18:07:05 +0200 Subject: [PATCH] Fix and document -r option The option has existed and been working for years, make sure it implies the right extra options, and list it in the documentation. 2018-07-29 Allan Sandfeld Jensen gcc/doc * invoke.texi: Document -r gcc/ * gcc.c: Correct default specs for -r --- gcc/doc/invoke.texi | 7 ++- gcc/gcc.c | 6 +++--- 2 files changed, 9 insertions(+), 4 deletions(-) diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 6047d82065a..7da30bd9d99 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -518,7 +518,7 @@ Objective-C and Objective-C++ Dialects}. @xref{Link Options,,Options for Linking}. @gccoptlist{@var{object-file-name} -fuse-ld=@var{linker} -l@var{library} @gol -nostartfiles -nodefaultlibs -nolibc -nostdlib @gol --pie -pthread -rdynamic @gol +-pie -pthread -r -rdynamic @gol -s -static -static-pie -static-libgcc -static-libstdc++ @gol -static-libasan -static-libtsan -static-liblsan -static-libubsan @gol -shared -shared-libgcc -symbolic @gol @@ -12444,6 +12444,11 @@ x86 Cygwin and MinGW targets. On some targets this option also sets flags for the preprocessor, so it should be used consistently for both compilation and linking. +@item -r +@opindex r +Produce a relocatable object as output. This is also known as partial +linking. + @item -rdynamic @opindex rdynamic Pass the flag @option{-export-dynamic} to the ELF linker, on targets diff --git a/gcc/gcc.c b/gcc/gcc.c index 780d4859ef3..858a5600c14 100644 --- a/gcc/gcc.c +++ b/gcc/gcc.c @@ -675,7 +675,7 @@ proper position among the other output files. */ /* config.h can define LIB_SPEC to override the default libraries. */ #ifndef LIB_SPEC -#define LIB_SPEC "%{!shared:%{g*:-lg} %{!p:%{!pg:-lc}}%{p:-lc_p}%{pg:-lc_p}}" +#define LIB_SPEC "%{!shared|!r:%{g*:-lg} %{!p:%{!pg:-lc}}%{p:-lc_p}%{pg:-lc_p}}" #endif /* When using -fsplit-stack we need to wrap pthread_create, in order @@ -797,7 +797,7 @@ proper position among the other output files. */ /* config.h can define STARTFILE_SPEC to override the default crt0 files. */ #ifndef STARTFILE_SPEC #define STARTFILE_SPEC \ - "%{!shared:%{pg:gcrt0%O%s}%{!pg:%{p:mcrt0%O%s}%{!p:crt0%O%s}}}" + "%{!shared|!r:%{pg:gcrt0%O%s}%{!pg:%{p:mcrt0%O%s}%{!p:crt0%O%s}}}" #endif /* config.h can define ENDFILE_SPEC to override the default crtn files. */ @@ -936,7 +936,7 @@ proper position among the other output files. */ #else #define LD_PIE_SPEC "" #endif -#define LINK_PIE_SPEC "%{static|shared|r:;" PIE_SPEC ":" LD_PIE_SPEC "} " +#define LINK_PIE_SPEC "%{static|shared|r|ar:;" PIE_SPEC ":" LD_PIE_SPEC "} " #endif #ifndef LINK_BUILDID_SPEC -- 2.17.0
[PATCH][x86] Match movss and movsd "blend" instructions
Adds the ability to match movss and movsd as blend patterns, implemented in a new method to be able to match these before shuffles, while keeping other blends after. 2018-07-29 Allan Sandfeld Jensen gcc/config/i386 * i386.cc (expand_vec_perm_movs): New method matching movs patterns. * i386.cc (expand_vec_perm_1): Try the new method. gcc/testsuite * gcc.target/i386/sse2-movs.c: New test. --- gcc/config/i386/emmintrin.h | 2 +- gcc/config/i386/i386.c| 44 +++ gcc/config/i386/xmmintrin.h | 2 +- gcc/testsuite/gcc.target/i386/sse2-movs.c | 21 +++ 4 files changed, 67 insertions(+), 2 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/sse2-movs.c >From e96b3aa9017ad0d19238c923146196405cc4e5af Mon Sep 17 00:00:00 2001 From: Allan Sandfeld Jensen Date: Wed, 9 May 2018 12:35:14 +0200 Subject: [PATCH] Match movss and movsd blends Adds the ability to match movss and movsd as blend patterns, implemented in a new method to be able to match these before shuffles, while keeping other blends after. 2018-07-29 Allan Sandfeld Jensen gcc/config/i386 * i386.cc (expand_vec_perm_movs): New method matching movs patterns. * i386.cc (expand_vec_perm_1): Try the new method. gcc/testsuite * gcc.target/i386/sse2-movs.c: New test. --- gcc/config/i386/emmintrin.h | 2 +- gcc/config/i386/i386.c| 44 +++ gcc/config/i386/xmmintrin.h | 2 +- gcc/testsuite/gcc.target/i386/sse2-movs.c | 21 +++ 4 files changed, 67 insertions(+), 2 deletions(-) create mode 100644 gcc/testsuite/gcc.target/i386/sse2-movs.c diff --git a/gcc/config/i386/emmintrin.h b/gcc/config/i386/emmintrin.h index b940a39d27b..1efd943bac4 100644 --- a/gcc/config/i386/emmintrin.h +++ b/gcc/config/i386/emmintrin.h @@ -113,7 +113,7 @@ _mm_setzero_pd (void) extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_move_sd (__m128d __A, __m128d __B) { - return (__m128d) __builtin_ia32_movsd ((__v2df)__A, (__v2df)__B); + return __extension__ (__m128d)(__v2df){__B[0],__A[1]}; } /* Load two DPFP values from P. The address must be 16-byte aligned. */ diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index ee409cfe7e4..2337ef5ea08 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -46143,6 +46143,46 @@ expand_vselect_vconcat (rtx target, rtx op0, rtx op1, return ok; } +/* A subroutine of ix86_expand_vec_perm_builtin_1. Try to implement D + using movss or movsd. */ +static bool +expand_vec_perm_movs (struct expand_vec_perm_d *d) +{ + machine_mode vmode = d->vmode; + unsigned i, nelt = d->nelt; + rtx x; + + if (d->one_operand_p) +return false; + + if (TARGET_SSE2 && (vmode == V2DFmode || vmode == V4SFmode)) +; + else +return false; + + /* Only the first element is changed. */ + if (d->perm[0] != nelt && d->perm[0] != 0) +return false; + for (i = 1; i < nelt; ++i) { +{ + if (d->perm[i] != i + nelt - d->perm[0]) +return false; +} + } + + if (d->testing_p) +return true; + + if (d->perm[0] == nelt) +x = gen_rtx_VEC_MERGE (vmode, d->op1, d->op0, GEN_INT (1)); + else +x = gen_rtx_VEC_MERGE (vmode, d->op0, d->op1, GEN_INT (1)); + + emit_insn (gen_rtx_SET (d->target, x)); + + return true; +} + /* A subroutine of ix86_expand_vec_perm_builtin_1. Try to implement D in terms of blendp[sd] / pblendw / pblendvb / vpblendd. */ @@ -46885,6 +46925,10 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d) } } + /* Try movss/movsd instructions. */ + if (expand_vec_perm_movs (d)) +return true; + /* Finally, try the fully general two operand permute. */ if (expand_vselect_vconcat (d->target, d->op0, d->op1, d->perm, nelt, d->testing_p)) diff --git a/gcc/config/i386/xmmintrin.h b/gcc/config/i386/xmmintrin.h index f64f3f74a0b..699f681e054 100644 --- a/gcc/config/i386/xmmintrin.h +++ b/gcc/config/i386/xmmintrin.h @@ -1011,7 +1011,7 @@ _mm_storer_ps (float *__P, __m128 __A) extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_move_ss (__m128 __A, __m128 __B) { - return (__m128) __builtin_ia32_movss ((__v4sf)__A, (__v4sf)__B); + return __extension__ (__m128)(__v4sf){__B[0],__A[1],__A[2],__A[3]}; } /* Extracts one of the four words of A. The selector N must be immediate. */ diff --git a/gcc/testsuite/gcc.target/i386/sse2-movs.c b/gcc/testsuite/gcc.target/i386/sse2-movs.c new file mode 100644 index 000..79f486cfa82 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/sse2-movs.c @@ -0,0 +1,21 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -msse2" } */ +/* { dg-require-effective-target sse2 } */ +/* { dg-final { scan-assembler "movss" } } */ +/* { dg-final {
Re: [Patch] Request for comments: short int builtins
On Sonntag, 20. Mai 2018 15:07:59 CEST Richard Biener wrote: > On May 20, 2018 11:01:54 AM GMT+02:00, Allan Sandfeld Jensen <li...@carewolf.com> wrote: > >A little over a year back we had a regression in a point release of gcc > > > >because the builtin __builtin_clzs got removed from i386, in part > >because it > >is was wrongly named for a target specific builtin, but we were using > >it in Qt > >since it existed in multiple compilers. I got the patch removing it > >partially > >reverted and the problem solved, but in the meantime I had worked on a > >patch > >to make it a generic builtin instead. I have rebased it and added it > >below, > >should I clean it up futher, finish the other builtins add tests and > >propose > >it, or is this not something we want? > > Can't users simply do clz((unsigned short) s) - 16? GCC should be able to > handle this for instruction selection With The addition of some folding > patterns using the corresponding internal function. > Of course, but we already have the builtin for i386, and a version of the builtin for all integer types except short for all platforms. Note the patch also generally adds short versions for all the general integer builtins, not just clzs and they are not all that trivial to synthesize (without knowing the trick, which gcc does). 'Allan
[Patch] Request for comments: short int builtins
A little over a year back we had a regression in a point release of gcc because the builtin __builtin_clzs got removed from i386, in part because it is was wrongly named for a target specific builtin, but we were using it in Qt since it existed in multiple compilers. I got the patch removing it partially reverted and the problem solved, but in the meantime I had worked on a patch to make it a generic builtin instead. I have rebased it and added it below, should I clean it up futher, finish the other builtins add tests and propose it, or is this not something we want? diff --git a/gcc/builtin-types.def b/gcc/builtin-types.def index 5365befd351..74a84e653e4 100644 --- a/gcc/builtin-types.def +++ b/gcc/builtin-types.def @@ -232,6 +232,8 @@ DEF_FUNCTION_TYPE_1 (BT_FN_INT_LONG, BT_INT, BT_LONG) DEF_FUNCTION_TYPE_1 (BT_FN_INT_ULONG, BT_INT, BT_ULONG) DEF_FUNCTION_TYPE_1 (BT_FN_INT_LONGLONG, BT_INT, BT_LONGLONG) DEF_FUNCTION_TYPE_1 (BT_FN_INT_ULONGLONG, BT_INT, BT_ULONGLONG) +DEF_FUNCTION_TYPE_1 (BT_FN_INT_INT16, BT_INT, BT_INT16) +DEF_FUNCTION_TYPE_1 (BT_FN_INT_UINT16, BT_INT, BT_UINT16) DEF_FUNCTION_TYPE_1 (BT_FN_INT_INTMAX, BT_INT, BT_INTMAX) DEF_FUNCTION_TYPE_1 (BT_FN_INT_UINTMAX, BT_INT, BT_UINTMAX) DEF_FUNCTION_TYPE_1 (BT_FN_INT_PTR, BT_INT, BT_PTR) diff --git a/gcc/builtins.c b/gcc/builtins.c index 9a2bf8c7d38..a8ad00e8016 100644 --- a/gcc/builtins.c +++ b/gcc/builtins.c @@ -10689,14 +10689,17 @@ is_inexpensive_builtin (tree decl) case BUILT_IN_CLZIMAX: case BUILT_IN_CLZL: case BUILT_IN_CLZLL: + case BUILT_IN_CLZS: case BUILT_IN_CTZ: case BUILT_IN_CTZIMAX: case BUILT_IN_CTZL: case BUILT_IN_CTZLL: + case BUILT_IN_CTZS: case BUILT_IN_FFS: case BUILT_IN_FFSIMAX: case BUILT_IN_FFSL: case BUILT_IN_FFSLL: + case BUILT_IN_FFSS: case BUILT_IN_IMAXABS: case BUILT_IN_FINITE: case BUILT_IN_FINITEF: @@ -10734,10 +10737,12 @@ is_inexpensive_builtin (tree decl) case BUILT_IN_POPCOUNTL: case BUILT_IN_POPCOUNTLL: case BUILT_IN_POPCOUNTIMAX: + case BUILT_IN_POPCOUNTS: case BUILT_IN_POPCOUNT: case BUILT_IN_PARITYL: case BUILT_IN_PARITYLL: case BUILT_IN_PARITYIMAX: + case BUILT_IN_PARITYS: case BUILT_IN_PARITY: case BUILT_IN_LABS: case BUILT_IN_LLABS: diff --git a/gcc/builtins.def b/gcc/builtins.def index 449d08d682f..618ee798767 100644 --- a/gcc/builtins.def +++ b/gcc/builtins.def @@ -848,15 +848,18 @@ DEF_GCC_BUILTIN(BUILT_IN_CLZ, "clz", BT_FN_INT_UINT, ATTR_CONST_NOTHROW_ DEF_GCC_BUILTIN(BUILT_IN_CLZIMAX, "clzimax", BT_FN_INT_UINTMAX, ATTR_CONST_NOTHROW_LEAF_LIST) DEF_GCC_BUILTIN(BUILT_IN_CLZL, "clzl", BT_FN_INT_ULONG, ATTR_CONST_NOTHROW_LEAF_LIST) DEF_GCC_BUILTIN(BUILT_IN_CLZLL, "clzll", BT_FN_INT_ULONGLONG, ATTR_CONST_NOTHROW_LEAF_LIST) +DEF_GCC_BUILTIN(BUILT_IN_CLZS, "clzs", BT_FN_INT_UINT16, ATTR_CONST_NOTHROW_LEAF_LIST) DEF_GCC_BUILTIN(BUILT_IN_CONSTANT_P, "constant_p", BT_FN_INT_VAR, ATTR_CONST_NOTHROW_LEAF_LIST) DEF_GCC_BUILTIN(BUILT_IN_CTZ, "ctz", BT_FN_INT_UINT, ATTR_CONST_NOTHROW_LEAF_LIST) DEF_GCC_BUILTIN(BUILT_IN_CTZIMAX, "ctzimax", BT_FN_INT_UINTMAX, ATTR_CONST_NOTHROW_LEAF_LIST) DEF_GCC_BUILTIN(BUILT_IN_CTZL, "ctzl", BT_FN_INT_ULONG, ATTR_CONST_NOTHROW_LEAF_LIST) DEF_GCC_BUILTIN(BUILT_IN_CTZLL, "ctzll", BT_FN_INT_ULONGLONG, ATTR_CONST_NOTHROW_LEAF_LIST) +DEF_GCC_BUILTIN(BUILT_IN_CTZS, "ctzs", BT_FN_INT_UINT16, ATTR_CONST_NOTHROW_LEAF_LIST) DEF_GCC_BUILTIN(BUILT_IN_CLRSB, "clrsb", BT_FN_INT_INT, ATTR_CONST_NOTHROW_LEAF_LIST) DEF_GCC_BUILTIN(BUILT_IN_CLRSBIMAX, "clrsbimax", BT_FN_INT_INTMAX, ATTR_CONST_NOTHROW_LEAF_LIST) DEF_GCC_BUILTIN(BUILT_IN_CLRSBL, "clrsbl", BT_FN_INT_LONG, ATTR_CONST_NOTHROW_LEAF_LIST) DEF_GCC_BUILTIN(BUILT_IN_CLRSBLL, "clrsbll", BT_FN_INT_LONGLONG, ATTR_CONST_NOTHROW_LEAF_LIST) +DEF_GCC_BUILTIN(BUILT_IN_CLRSBS, "clrsbs", BT_FN_INT_INT16, ATTR_CONST_NOTHROW_LEAF_LIST) DEF_EXT_LIB_BUILTIN(BUILT_IN_DCGETTEXT, "dcgettext", BT_FN_STRING_CONST_STRING_CONST_STRING_INT, ATTR_FORMAT_ARG_2) DEF_EXT_LIB_BUILTIN(BUILT_IN_DGETTEXT, "dgettext", BT_FN_STRING_CONST_STRING_CONST_STRING, ATTR_FORMAT_ARG_2) DEF_GCC_BUILTIN(BUILT_IN_DWARF_CFA, "dwarf_cfa", BT_FN_PTR, ATTR_NULL) @@ -878,6 +881,7 @@ DEF_EXT_LIB_BUILTIN(BUILT_IN_FFS, "ffs", BT_FN_INT_INT, ATTR_CONST_NOTHROW_L DEF_EXT_LIB_BUILTIN(BUILT_IN_FFSIMAX, "ffsimax", BT_FN_INT_INTMAX, ATTR_CONST_NOTHROW_LEAF_LIST) DEF_EXT_LIB_BUILTIN(BUILT_IN_FFSL, "ffsl", BT_FN_INT_LONG, ATTR_CONST_NOTHROW_LEAF_LIST) DEF_EXT_LIB_BUILTIN(BUILT_IN_FFSLL, "ffsll", BT_FN_INT_LONGLONG, ATTR_CONST_NOTHROW_LEAF_LIST) +DEF_EXT_LIB_BUILTIN(BUILT_IN_FFSS, "ffss", BT_FN_INT_INT16, ATTR_CONST_NOTHROW_LEAF_LIST) DEF_EXT_LIB_BUILTIN(BUILT_IN_FORK, "fork", BT_FN_PID,
Re: [PATCH] Use two source permute for vector initialization (PR 85692, take 2)
On Donnerstag, 10. Mai 2018 09:57:22 CEST Jakub Jelinek wrote: > On Wed, May 09, 2018 at 04:53:19PM +0200, Allan Sandfeld Jensen wrote: > > > > @@ -2022,8 +2022,9 @@ simplify_vector_constructor > > > > (gimple_stmt_iterator > > > > *gsi)> > > > > > > > >elem_type = TREE_TYPE (type); > > > >elem_size = TREE_INT_CST_LOW (TYPE_SIZE (elem_type)); > > > > > > > > - vec_perm_builder sel (nelts, nelts, 1); > > > > - orig = NULL; > > > > + vec_perm_builder sel (nelts, 2, nelts); > > > > > > Why this change? I admit the vec_parm_builder arguments are confusing, > > > but > > > I think the second times third is the number of how many indices are > > > being > > > pushed into the vector, so I think (nelts, nelts, 1) is right. > > > > I had the impression it was what was selected from. In any case, I changed > > it because without I get crash when vec_perm_indices is created later > > with a possible nparms of 2. > > The documentation is apparently in vector-builder.h: >This class is a wrapper around auto_vec for building vectors of T. >It aims to encode each vector as npatterns interleaved patterns, >where each pattern represents a sequence: > > { BASE0, BASE1, BASE1 + STEP, BASE1 + STEP*2, BASE1 + STEP*3, ... } > >The first three elements in each pattern provide enough information >to derive the other elements. If all patterns have a STEP of zero, >we only need to encode the first two elements in each pattern. >If BASE1 is also equal to BASE0 for all patterns, we only need to >encode the first element in each pattern. The number of encoded >elements per pattern is given by nelts_per_pattern. > >The class can be used in two ways: > >1. It can be used to build a full image of the vector, which is then > canonicalized by finalize (). In this case npatterns is initially > the number of elements in the vector and nelts_per_pattern is > initially 1. > >2. It can be used to build a vector that already has a known encoding. > This is preferred since it is more efficient and copes with > variable-length vectors. finalize () then canonicalizes the encoding > to a simpler form if possible. > > As the vector is constant width and we are building the full image of the > vector, the right arguments are (nelts, nelts, 1) as per 1. above, and the > finalization can perhaps change it to something more compact. > > > > (and sorry for missing your patch first, the PR wasn't ASSIGNED and > > > there > > > was no link to gcc-patches for it). > > > > It is okay. You are welcome to take it over. I am not a regular gcc > > contributor and thus not well-versed in the details, only the basic logic > > of how things work. > > Ok, here is my version of the patch. Bootstrapped/regtested on x86_64-linux > and i686-linux, ok for trunk? > Looks good to me if that counts for anything. 'Allan
Re: [Patch] Use two source permute for vector initialization (PR 85692)
On Mittwoch, 9. Mai 2018 11:08:02 CEST Jakub Jelinek wrote: > On Tue, May 08, 2018 at 01:25:35PM +0200, Allan Sandfeld Jensen wrote: > > 2018-05-08 Allan Sandfeld Jensen <allan.jen...@qt.io> > > 2 spaces between date and name and two spaces between name and email > address. > > > gcc/ > > > > PR tree-optimization/85692 > > * tree-ssa-forwprop.c (simplify_vector_constructor): Try two > > source permute as well. > > > > gcc/testsuite > > > > * gcc.target/i386/pr85692.c: Test two simply constructions are > > detected as permute instructions. > > Just > * gcc.target/i386/pr85692.c: New test. > > > diff --git a/gcc/testsuite/gcc.target/i386/pr85692.c > > b/gcc/testsuite/gcc.target/i386/pr85692.c new file mode 100644 > > index 000..322c1050161 > > --- /dev/null > > +++ b/gcc/testsuite/gcc.target/i386/pr85692.c > > @@ -0,0 +1,18 @@ > > +/* { dg-do run } */ > > +/* { dg-options "-O2 -msse4.1" } */ > > +/* { dg-final { scan-assembler "unpcklps" } } */ > > +/* { dg-final { scan-assembler "blendps" } } */ > > +/* { dg-final { scan-assembler-not "shufps" } } */ > > +/* { dg-final { scan-assembler-not "unpckhps" } } */ > > + > > +typedef float v4sf __attribute__ ((vector_size (16))); > > + > > +v4sf unpcklps(v4sf a, v4sf b) > > +{ > > +return v4sf{a[0],b[0],a[1],b[1]}; > > Though, not really sure if this has been tested at all. > The above is valid only in C++ (and only C++11 and above), while the > test is compiled as C and thus has to fail. > Yes, I thought it had been tested, but it wasn't. It also needs to change the first line to be a compile and not run test. > > @@ -2022,8 +2022,9 @@ simplify_vector_constructor (gimple_stmt_iterator > > *gsi)> > >elem_type = TREE_TYPE (type); > >elem_size = TREE_INT_CST_LOW (TYPE_SIZE (elem_type)); > > > > - vec_perm_builder sel (nelts, nelts, 1); > > - orig = NULL; > > + vec_perm_builder sel (nelts, 2, nelts); > > Why this change? I admit the vec_parm_builder arguments are confusing, but > I think the second times third is the number of how many indices are being > pushed into the vector, so I think (nelts, nelts, 1) is right. > I had the impression it was what was selected from. In any case, I changed it because without I get crash when vec_perm_indices is created later with a possible nparms of 2. > > @@ -2063,10 +2064,26 @@ simplify_vector_constructor (gimple_stmt_iterator > > *gsi)> > > return false; > > > >op1 = gimple_assign_rhs1 (def_stmt); > >ref = TREE_OPERAND (op1, 0); > > > > - if (orig) > > + if (orig1) > > > > { > > > > - if (ref != orig) > > - return false; > > + if (ref == orig1 || orig2) > > + { > > + if (ref != orig1 && ref != orig2) > > + return false; > > + } > > + else > > + { > > + if (TREE_CODE (ref) != SSA_NAME) > > + return false; > > + if (! VECTOR_TYPE_P (TREE_TYPE (ref)) > > + || ! useless_type_conversion_p (TREE_TYPE (op1), > > + TREE_TYPE (TREE_TYPE (ref > > + return false; > > + if (TREE_TYPE (orig1) != TREE_TYPE (ref)) > > + return false; > > I think even different type is acceptable here, as long as its conversion to > orig1's type is useless. > > Furthermore, I think the way you wrote the patch with 2 variables rather > than an array of 2 elements means too much duplication, this else block > is a duplication of the else block below. See the patch I've added to the > PR It seemed to me it was clearer like this, but I can see your point. > (and sorry for missing your patch first, the PR wasn't ASSIGNED and there > was no link to gcc-patches for it). > It is okay. You are welcome to take it over. I am not a regular gcc contributor and thus not well-versed in the details, only the basic logic of how things work. 'Allan
Re: [Patch] Use two source permute for vector initialization (PR 85692)
On Dienstag, 8. Mai 2018 12:42:33 CEST Richard Biener wrote: > On Tue, May 8, 2018 at 12:37 PM, Allan Sandfeld Jensen > > <li...@carewolf.com> wrote: > > I have tried to fix PR85692 that I opened. > > Please add a testcase as well. It also helps if you shortly tell what > the patch does > in your mail. > Okay. I have updated the patch with a test-case based on my motivating examples. The patch just extends patching a vector construction to not just a single source permute instruction, but also a two source permute instruction. commit 15c0f6a933d60b085416a59221851b604b955958 Author: Allan Sandfeld Jensen <allan.jen...@qt.io> Date: Tue May 8 13:16:18 2018 +0200 Try two source permute for vector construction simplify_vector_constructor() was detecting when vector construction could be implemented as a single source permute, but was not detecting when it could be implemented as a double source permute. This patch adds the second case. 2018-05-08 Allan Sandfeld Jensen <allan.jen...@qt.io> gcc/ PR tree-optimization/85692 * tree-ssa-forwprop.c (simplify_vector_constructor): Try two source permute as well. gcc/testsuite * gcc.target/i386/pr85692.c: Test two simply constructions are detected as permute instructions. diff --git a/gcc/testsuite/gcc.target/i386/pr85692.c b/gcc/testsuite/gcc.target/i386/pr85692.c new file mode 100644 index 000..322c1050161 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr85692.c @@ -0,0 +1,18 @@ +/* { dg-do run } */ +/* { dg-options "-O2 -msse4.1" } */ +/* { dg-final { scan-assembler "unpcklps" } } */ +/* { dg-final { scan-assembler "blendps" } } */ +/* { dg-final { scan-assembler-not "shufps" } } */ +/* { dg-final { scan-assembler-not "unpckhps" } } */ + +typedef float v4sf __attribute__ ((vector_size (16))); + +v4sf unpcklps(v4sf a, v4sf b) +{ +return v4sf{a[0],b[0],a[1],b[1]}; +} + +v4sf blendps(v4sf a, v4sf b) +{ +return v4sf{a[0],b[1],a[2],b[3]}; +} diff --git a/gcc/tree-ssa-forwprop.c b/gcc/tree-ssa-forwprop.c index 58ec6b47a5b..fbee8064160 100644 --- a/gcc/tree-ssa-forwprop.c +++ b/gcc/tree-ssa-forwprop.c @@ -2004,7 +2004,7 @@ simplify_vector_constructor (gimple_stmt_iterator *gsi) { gimple *stmt = gsi_stmt (*gsi); gimple *def_stmt; - tree op, op2, orig, type, elem_type; + tree op, op2, orig1, orig2, type, elem_type; unsigned elem_size, i; unsigned HOST_WIDE_INT nelts; enum tree_code code, conv_code; @@ -2022,8 +2022,9 @@ simplify_vector_constructor (gimple_stmt_iterator *gsi) elem_type = TREE_TYPE (type); elem_size = TREE_INT_CST_LOW (TYPE_SIZE (elem_type)); - vec_perm_builder sel (nelts, nelts, 1); - orig = NULL; + vec_perm_builder sel (nelts, 2, nelts); + orig1 = NULL; + orig2 = NULL; conv_code = ERROR_MARK; maybe_ident = true; FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (op), i, elt) @@ -2063,10 +2064,26 @@ simplify_vector_constructor (gimple_stmt_iterator *gsi) return false; op1 = gimple_assign_rhs1 (def_stmt); ref = TREE_OPERAND (op1, 0); - if (orig) + if (orig1) { - if (ref != orig) - return false; + if (ref == orig1 || orig2) + { + if (ref != orig1 && ref != orig2) + return false; + } + else + { + if (TREE_CODE (ref) != SSA_NAME) + return false; + if (! VECTOR_TYPE_P (TREE_TYPE (ref)) + || ! useless_type_conversion_p (TREE_TYPE (op1), + TREE_TYPE (TREE_TYPE (ref + return false; + if (TREE_TYPE (orig1) != TREE_TYPE (ref)) + return false; + orig2 = ref; + maybe_ident = false; + } } else { @@ -2076,12 +2093,14 @@ simplify_vector_constructor (gimple_stmt_iterator *gsi) || ! useless_type_conversion_p (TREE_TYPE (op1), TREE_TYPE (TREE_TYPE (ref return false; - orig = ref; + orig1 = ref; } unsigned int elt; if (maybe_ne (bit_field_size (op1), elem_size) || !constant_multiple_p (bit_field_offset (op1), elem_size, )) return false; + if (orig2 && ref == orig2) + elt += nelts; if (elt != i) maybe_ident = false; sel.quick_push (elt); @@ -2089,14 +2108,17 @@ simplify_vector_constructor (gimple_stmt_iterator *gsi) if (i < nelts) return false; - if (! VECTOR_TYPE_P (TREE_TYPE (orig)) + if (! VECTOR_TYPE_P (TREE_TYPE (orig1)) || maybe_ne (TYPE_VECTOR_SUBPARTS (type), - TYPE_VECTOR_SUBPARTS (TREE_TYPE (orig + TYPE_VECTOR_SUBPARTS (TREE_TYPE (orig1 return false; + if (!orig2) +orig2 = orig1; + tree tem; if (conv_code != ERROR_MARK - && (! supportable_convert_operation (conv_code, type, TREE_TYPE (orig), + && (! supportable_convert_operation (conv_code, type, TREE_TYPE (orig1), , _code) || conv_code
[Patch] Use two source permute for vector initialization (PR 85692)
I have tried to fix PR85692 that I opened. 2018-05-08 Allan Sandfeld JensePR tree-optimization/85692 * tree-ssa-forwprop.c (simplify_vector_constructor): Detect two source permute operations as well. diff --git a/gcc/tree-ssa-forwprop.c b/gcc/tree-ssa-forwprop.c index 58ec6b47a5b..fbee8064160 100644 --- a/gcc/tree-ssa-forwprop.c +++ b/gcc/tree-ssa-forwprop.c @@ -2004,7 +2004,7 @@ simplify_vector_constructor (gimple_stmt_iterator *gsi) { gimple *stmt = gsi_stmt (*gsi); gimple *def_stmt; - tree op, op2, orig, type, elem_type; + tree op, op2, orig1, orig2, type, elem_type; unsigned elem_size, i; unsigned HOST_WIDE_INT nelts; enum tree_code code, conv_code; @@ -2022,8 +2022,9 @@ simplify_vector_constructor (gimple_stmt_iterator *gsi) elem_type = TREE_TYPE (type); elem_size = TREE_INT_CST_LOW (TYPE_SIZE (elem_type)); - vec_perm_builder sel (nelts, nelts, 1); - orig = NULL; + vec_perm_builder sel (nelts, 2, nelts); + orig1 = NULL; + orig2 = NULL; conv_code = ERROR_MARK; maybe_ident = true; FOR_EACH_VEC_SAFE_ELT (CONSTRUCTOR_ELTS (op), i, elt) @@ -2063,10 +2064,26 @@ simplify_vector_constructor (gimple_stmt_iterator *gsi) return false; op1 = gimple_assign_rhs1 (def_stmt); ref = TREE_OPERAND (op1, 0); - if (orig) + if (orig1) { - if (ref != orig) - return false; + if (ref == orig1 || orig2) + { + if (ref != orig1 && ref != orig2) + return false; + } + else + { + if (TREE_CODE (ref) != SSA_NAME) + return false; + if (! VECTOR_TYPE_P (TREE_TYPE (ref)) + || ! useless_type_conversion_p (TREE_TYPE (op1), + TREE_TYPE (TREE_TYPE (ref + return false; + if (TREE_TYPE (orig1) != TREE_TYPE (ref)) + return false; + orig2 = ref; + maybe_ident = false; + } } else { @@ -2076,12 +2093,14 @@ simplify_vector_constructor (gimple_stmt_iterator *gsi) || ! useless_type_conversion_p (TREE_TYPE (op1), TREE_TYPE (TREE_TYPE (ref return false; - orig = ref; + orig1 = ref; } unsigned int elt; if (maybe_ne (bit_field_size (op1), elem_size) || !constant_multiple_p (bit_field_offset (op1), elem_size, )) return false; + if (orig2 && ref == orig2) + elt += nelts; if (elt != i) maybe_ident = false; sel.quick_push (elt); @@ -2089,14 +2108,17 @@ simplify_vector_constructor (gimple_stmt_iterator *gsi) if (i < nelts) return false; - if (! VECTOR_TYPE_P (TREE_TYPE (orig)) + if (! VECTOR_TYPE_P (TREE_TYPE (orig1)) || maybe_ne (TYPE_VECTOR_SUBPARTS (type), - TYPE_VECTOR_SUBPARTS (TREE_TYPE (orig + TYPE_VECTOR_SUBPARTS (TREE_TYPE (orig1 return false; + if (!orig2) +orig2 = orig1; + tree tem; if (conv_code != ERROR_MARK - && (! supportable_convert_operation (conv_code, type, TREE_TYPE (orig), + && (! supportable_convert_operation (conv_code, type, TREE_TYPE (orig1), , _code) || conv_code == CALL_EXPR)) return false; @@ -2104,16 +2126,16 @@ simplify_vector_constructor (gimple_stmt_iterator *gsi) if (maybe_ident) { if (conv_code == ERROR_MARK) - gimple_assign_set_rhs_from_tree (gsi, orig); + gimple_assign_set_rhs_from_tree (gsi, orig1); else - gimple_assign_set_rhs_with_ops (gsi, conv_code, orig, + gimple_assign_set_rhs_with_ops (gsi, conv_code, orig1, NULL_TREE, NULL_TREE); } else { tree mask_type; - vec_perm_indices indices (sel, 1, nelts); + vec_perm_indices indices (sel, 2, nelts); if (!can_vec_perm_const_p (TYPE_MODE (type), indices)) return false; mask_type @@ -2125,15 +2147,14 @@ simplify_vector_constructor (gimple_stmt_iterator *gsi) return false; op2 = vec_perm_indices_to_tree (mask_type, indices); if (conv_code == ERROR_MARK) - gimple_assign_set_rhs_with_ops (gsi, VEC_PERM_EXPR, orig, orig, op2); + gimple_assign_set_rhs_with_ops (gsi, VEC_PERM_EXPR, orig1, orig2, op2); else { gimple *perm - = gimple_build_assign (make_ssa_name (TREE_TYPE (orig)), - VEC_PERM_EXPR, orig, orig, op2); - orig = gimple_assign_lhs (perm); + = gimple_build_assign (make_ssa_name (TREE_TYPE (orig1)), + VEC_PERM_EXPR, orig1, orig2, op2); gsi_insert_before (gsi, perm, GSI_SAME_STMT); - gimple_assign_set_rhs_with_ops (gsi, conv_code, orig, + gimple_assign_set_rhs_with_ops (gsi, conv_code, gimple_assign_lhs (perm), NULL_TREE, NULL_TREE); } }
Re: [PATCH] [x86] Avoid builtins for SSE/AVX2 immidiate logical shifts
On Tuesday 02 May 2017, Jakub Jelinek wrote: > On Mon, Apr 24, 2017 at 03:15:11PM +0200, Allan Sandfeld Jensen wrote: > > Okay, I have tried that, and I also made it more obvious how the > > intrinsics can become non-immediate shift. > > > > > > diff --git a/gcc/ChangeLog b/gcc/ChangeLog > > index b58f5050db0..b9406550fc5 100644 > > --- a/gcc/ChangeLog > > +++ b/gcc/ChangeLog > > @@ -1,3 +1,10 @@ > > +2017-04-22 Allan Sandfeld Jensen <sandf...@kde.org> > > + > > + * config/i386/emmintrin.h (_mm_slli_*, _mm_srli_*): > > + Use vector intrinstics instead of builtins. > > + * config/i386/avx2intrin.h (_mm256_slli_*, _mm256_srli_*): > > + Use vector intrinstics instead of builtins. > > + > > > > 2017-04-21 Uros Bizjak <ubiz...@gmail.com> > > > > * config/i386/i386.md (*extzvqi_mem_rex64): Move above *extzv. > > > > diff --git a/gcc/config/i386/avx2intrin.h b/gcc/config/i386/avx2intrin.h > > index 82f170a3d61..64ba52b244e 100644 > > --- a/gcc/config/i386/avx2intrin.h > > +++ b/gcc/config/i386/avx2intrin.h > > @@ -665,13 +665,6 @@ _mm256_slli_si256 (__m256i __A, const int __N) > > > > extern __inline __m256i > > __attribute__ ((__gnu_inline__, __always_inline__, __artificial__)) > > > > -_mm256_slli_epi16 (__m256i __A, int __B) > > -{ > > - return (__m256i)__builtin_ia32_psllwi256 ((__v16hi)__A, __B); > > -} > > - > > -extern __inline __m256i > > -__attribute__ ((__gnu_inline__, __always_inline__, __artificial__)) > > > > _mm256_sll_epi16 (__m256i __A, __m128i __B) > > { > > > >return (__m256i)__builtin_ia32_psllw256((__v16hi)__A, (__v8hi)__B); > > > > @@ -679,9 +672,11 @@ _mm256_sll_epi16 (__m256i __A, __m128i __B) > > > > extern __inline __m256i > > __attribute__ ((__gnu_inline__, __always_inline__, __artificial__)) > > > > -_mm256_slli_epi32 (__m256i __A, int __B) > > +_mm256_slli_epi16 (__m256i __A, int __B) > > > > { > > > > - return (__m256i)__builtin_ia32_pslldi256 ((__v8si)__A, __B); > > + if (__builtin_constant_p(__B)) > > +return ((unsigned int)__B < 16) ? (__m256i)((__v16hi)__A << __B) : > > _mm256_setzero_si256(); + return _mm256_sll_epi16(__A, > > _mm_cvtsi32_si128(__B)); > > > > } > > The formatting is wrong, missing spaces before function names and opening > (, too long lines. Also, you've removed some builtin uses like > __builtin_ia32_psllwi256 above, but haven't removed those builtins from the > compiler (unlike the intrinsics, the builtins are not supported and can be > removed). But I guess the primary question is on Uros, do we > want to handle this in the *intrin.h headers and thus increase the size > of those (and their parsing time etc.), or do we want to handle this > in the target folders (tree as well as gimple one), where we'd convert > e.g. __builtin_ia32_psllwi256 to the shift if the shift count is constant. > Ok. I will await what you decide. Btw. I thought of an alternative idea: Make a new set of built-ins, called for instance __builtin_lshift and __builtin_rshift, that translates simply to GIMPLE shifts, just like cpp_shifts currently does, the only difference being the new shifts (unlike C/C++ shifts) are defined for all shift sizes and on negative values. With this also variable shift intrinsics can be written without builtins. Though to do this would making a whole set of them for all integer types, it would need to be implemented in the c-parser like __buitin_shuffle, and not with the other generic builtins. Would that make sense? Best regards `Allan
Re: [PATCH] [x86] Avoid builtins for SSE/AVX2 immidiate logical shifts
On Monday 24 April 2017, Jakub Jelinek wrote: > On Mon, Apr 24, 2017 at 09:33:09AM +0200, Allan Sandfeld Jensen wrote: > > --- a/gcc/config/i386/avx2intrin.h > > +++ b/gcc/config/i386/avx2intrin.h > > @@ -667,7 +667,7 @@ extern __inline __m256i > > > > __attribute__ ((__gnu_inline__, __always_inline__, __artificial__)) > > _mm256_slli_epi16 (__m256i __A, int __B) > > { > > > > - return (__m256i)__builtin_ia32_psllwi256 ((__v16hi)__A, __B); > > + return ((__B & 0xff) < 16) ? (__m256i)((__v16hi)__A << (__B & 0xff)) : > > _mm256_setzero_si256(); > > > > } > > What is the advantage of doing that when you replace one operation with > several (&, <, ?:, <<)? > I'd say instead we should fold the builtins if in the gimple fold target > hook we see the shift count constant and can decide based on that. > Or we could use __builtin_constant_p (__B) to decide whether to use > the generic vector shifts or builtin, but that means larger IL. > Okay, I have tried that, and I also made it more obvious how the intrinsics can become non-immediate shift. diff --git a/gcc/ChangeLog b/gcc/ChangeLog index b58f5050db0..b9406550fc5 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,10 @@ +2017-04-22 Allan Sandfeld Jensen <sandf...@kde.org> + + * config/i386/emmintrin.h (_mm_slli_*, _mm_srli_*): + Use vector intrinstics instead of builtins. + * config/i386/avx2intrin.h (_mm256_slli_*, _mm256_srli_*): + Use vector intrinstics instead of builtins. + 2017-04-21 Uros Bizjak <ubiz...@gmail.com> * config/i386/i386.md (*extzvqi_mem_rex64): Move above *extzv. diff --git a/gcc/config/i386/avx2intrin.h b/gcc/config/i386/avx2intrin.h index 82f170a3d61..64ba52b244e 100644 --- a/gcc/config/i386/avx2intrin.h +++ b/gcc/config/i386/avx2intrin.h @@ -665,13 +665,6 @@ _mm256_slli_si256 (__m256i __A, const int __N) extern __inline __m256i __attribute__ ((__gnu_inline__, __always_inline__, __artificial__)) -_mm256_slli_epi16 (__m256i __A, int __B) -{ - return (__m256i)__builtin_ia32_psllwi256 ((__v16hi)__A, __B); -} - -extern __inline __m256i -__attribute__ ((__gnu_inline__, __always_inline__, __artificial__)) _mm256_sll_epi16 (__m256i __A, __m128i __B) { return (__m256i)__builtin_ia32_psllw256((__v16hi)__A, (__v8hi)__B); @@ -679,9 +672,11 @@ _mm256_sll_epi16 (__m256i __A, __m128i __B) extern __inline __m256i __attribute__ ((__gnu_inline__, __always_inline__, __artificial__)) -_mm256_slli_epi32 (__m256i __A, int __B) +_mm256_slli_epi16 (__m256i __A, int __B) { - return (__m256i)__builtin_ia32_pslldi256 ((__v8si)__A, __B); + if (__builtin_constant_p(__B)) +return ((unsigned int)__B < 16) ? (__m256i)((__v16hi)__A << __B) : _mm256_setzero_si256(); + return _mm256_sll_epi16(__A, _mm_cvtsi32_si128(__B)); } extern __inline __m256i @@ -693,9 +688,11 @@ _mm256_sll_epi32 (__m256i __A, __m128i __B) extern __inline __m256i __attribute__ ((__gnu_inline__, __always_inline__, __artificial__)) -_mm256_slli_epi64 (__m256i __A, int __B) +_mm256_slli_epi32 (__m256i __A, int __B) { - return (__m256i)__builtin_ia32_psllqi256 ((__v4di)__A, __B); + if (__builtin_constant_p(__B)) +return ((unsigned int)__B < 32) ? (__m256i)((__v8si)__A << __B) : _mm256_setzero_si256(); + return _mm256_sll_epi32(__A, _mm_cvtsi32_si128(__B)); } extern __inline __m256i @@ -707,6 +704,15 @@ _mm256_sll_epi64 (__m256i __A, __m128i __B) extern __inline __m256i __attribute__ ((__gnu_inline__, __always_inline__, __artificial__)) +_mm256_slli_epi64 (__m256i __A, int __B) +{ + if (__builtin_constant_p(__B)) +return ((unsigned int)__B < 64) ? (__m256i)((__v4di)__A << __B) : _mm256_setzero_si256(); + return _mm256_sll_epi64(__A, _mm_cvtsi32_si128(__B)); +} + +extern __inline __m256i +__attribute__ ((__gnu_inline__, __always_inline__, __artificial__)) _mm256_srai_epi16 (__m256i __A, int __B) { return (__m256i)__builtin_ia32_psrawi256 ((__v16hi)__A, __B); @@ -756,13 +762,6 @@ _mm256_srli_si256 (__m256i __A, const int __N) extern __inline __m256i __attribute__ ((__gnu_inline__, __always_inline__, __artificial__)) -_mm256_srli_epi16 (__m256i __A, int __B) -{ - return (__m256i)__builtin_ia32_psrlwi256 ((__v16hi)__A, __B); -} - -extern __inline __m256i -__attribute__ ((__gnu_inline__, __always_inline__, __artificial__)) _mm256_srl_epi16 (__m256i __A, __m128i __B) { return (__m256i)__builtin_ia32_psrlw256((__v16hi)__A, (__v8hi)__B); @@ -770,9 +769,11 @@ _mm256_srl_epi16 (__m256i __A, __m128i __B) extern __inline __m256i __attribute__ ((__gnu_inline__, __always_inline__, __artificial__)) -_mm256_srli_epi32 (__m256i __A, int __B) +_mm256_srli_epi16 (__m256i __A, int __B) { - return (__m256i)__builtin_ia32_psrldi256 ((__v8si)__A, __B); + if (__builtin_constant_p(__B)) +return ((unsigned int)__B < 16) ? (__m256i)((__v16hu)__A >>
Re: [PATCH] [x86] Avoid builtins for SSE/AVX2 immidiate logical shifts
On Monday 24 April 2017, Jakub Jelinek wrote: > On Mon, Apr 24, 2017 at 11:01:29AM +0200, Allan Sandfeld Jensen wrote: > > On Monday 24 April 2017, Jakub Jelinek wrote: > > > On Mon, Apr 24, 2017 at 10:34:58AM +0200, Allan Sandfeld Jensen wrote: > > > > That is a different instruction. That is the vpsllw not vpsllwi > > > > > > > > The intrinsics I changed is the immediate version, I didn't change > > > > the non- immediate version. It is probably a bug if you can give > > > > non-immediate values to the immediate only intrinsic. At least both > > > > versions handles it, if in different ways, but is is illegal > > > > arguments. > > > > > > The documentation is unclear on that and I've only recently fixed up > > > some cases where these intrinsics weren't able to handle non-constant > > > arguments in some cases, while both ICC and clang coped with that > > > fine. > > > So it is clearly allowed and handled by all the compilers and needs to > > > be supported, people use that in real-world code. > > > > Undoubtedly it happens. I just make a mistake myself that created that > > case. But it is rather unfortunate, and means we make wrong code > > currently for corner case values. > > The intrinsic documentation is poor, usually you have a good documentation > on what the instructions do, and then you just have to guess what the > intrinsics do. You can of course ask Intel for clarification. > > If you try: > #include > > __m128i > foo (__m128i a, int b) > { > return _mm_slli_epi16 (a, b); > } > and call it with 257 from somewhere else, you can see that all the > compilers will give you zero vector. And similarly if you use 257 > literally instead of b. So what the intrinsic (unlike the instruction) > actually does is that it compares all bits of the imm8 argument (supposedly > using unsigned comparison) and if it is bigger than 15 (or 7 or 31 or 63 > depending on the bitsize of element) it yields 0 vector. > Good point. I was using intel's documentation at https://software.intel.com/sites/landingpage/IntrinsicsGuide/, but if all compilers including us does something else, practicality wins. It did make me curious and test out what _mm_slli_epi16(v, -250); compiles to. For some reason that becomes an undefined shift using the non-immediate sll in gcc, but returns the zero-vector in clang. With my patch it was a 6 bit shift, but that is apparently not de-facto standard. `Allan
Re: [PATCH] [x86] Avoid builtins for SSE/AVX2 immidiate logical shifts
On Monday 24 April 2017, Jakub Jelinek wrote: > On Mon, Apr 24, 2017 at 10:34:58AM +0200, Allan Sandfeld Jensen wrote: > > That is a different instruction. That is the vpsllw not vpsllwi > > > > The intrinsics I changed is the immediate version, I didn't change the > > non- immediate version. It is probably a bug if you can give > > non-immediate values to the immediate only intrinsic. At least both > > versions handles it, if in different ways, but is is illegal arguments. > > The documentation is unclear on that and I've only recently fixed up some > cases where these intrinsics weren't able to handle non-constant arguments > in some cases, while both ICC and clang coped with that fine. > So it is clearly allowed and handled by all the compilers and needs to be > supported, people use that in real-world code. > Undoubtedly it happens. I just make a mistake myself that created that case. But it is rather unfortunate, and means we make wrong code currently for corner case values. Note the difference in definition between the two intrinsics: _mm_ssl_epi16: FOR j := 0 to 7 i := j*16 IF count[63:0] > 15 dst[i+15:i] := 0 ELSE dst[i+15:i] := ZeroExtend(a[i+15:i] << count[63:0]) FI ENDFOR _mm_ssli_epi16: FOR j := 0 to 7 i := j*16 IF imm8[7:0] > 15 dst[i+15:i] := 0 ELSE dst[i+15:i] := ZeroExtend(a[i+15:i] << imm8[7:0]) FI ENDFOR For a value such as 257, the immediate version does a 1 bit shift, while the non-immediate returns a zero vector. A simple function using the immediate intrinsic has to have an if-statement, if transformed to using the non- immediate instruction. `Allan
Re: [PATCH] [x86] Avoid builtins for SSE/AVX2 immidiate logical shifts
On Monday 24 April 2017, Allan Sandfeld Jensen wrote: > On Monday 24 April 2017, Jakub Jelinek wrote: > > On Mon, Apr 24, 2017 at 10:02:40AM +0200, Allan Sandfeld Jensen wrote: > > > > That said, both the options I've mentioned above provide the same > > > > advantages and don't have the disadvantages of pessimizing normal > > > > code. > > > > > > What pessimizing? This produce the same or better code for all legal > > > arguments. The only difference besides better generated code is that it > > > allows > > > > No. Have you really tried that? > > > > > the intrinsics to be used incorrectly with non-literal arguments > > > because we lack the C-extension for constexp to prevent that. > > > > Consider e.g. -O2 -mavx2 -mtune=intel: > > #include > > > > __m256i > > foo (__m256i x, int s) > > { > > > > return (__m256i)__builtin_ia32_psllwi256 ((__v16hi)x, s); > > > > } > > > > __m256i > > bar (__m256i x, int s) > > { > > > > return ((s & 0xff) < 16) ? (__m256i)((__v16hi)x << (s & 0xff)) : > > _mm256_setzero_si256 (); } > > > > The first one generates > > > > movl%edi, %edi > > vmovq %rdi, %xmm1 > > vpsllw %xmm1, %ymm0, %ymm0 > > ret > > > > (because that is actually what the instruction does), the second one > > That is a different instruction. That is the vpsllw not vpsllwi > > The intrinsics I changed is the immediate version, I didn't change the non- > immediate version. It is probably a bug if you can give non-immediate > values to the immediate only intrinsic. At least both versions handles it, > if in different ways, but is is illegal arguments. > Though I now that I think about it, this means my change of to the existing sse-psslw-1.c test and friends is wrong, because it uses variable input. `Allan
Re: [PATCH] [x86] Avoid builtins for SSE/AVX2 immidiate logical shifts
On Monday 24 April 2017, Jakub Jelinek wrote: > On Mon, Apr 24, 2017 at 10:02:40AM +0200, Allan Sandfeld Jensen wrote: > > > That said, both the options I've mentioned above provide the same > > > advantages and don't have the disadvantages of pessimizing normal code. > > > > What pessimizing? This produce the same or better code for all legal > > arguments. The only difference besides better generated code is that it > > allows > > No. Have you really tried that? > > > the intrinsics to be used incorrectly with non-literal arguments because > > we lack the C-extension for constexp to prevent that. > > Consider e.g. -O2 -mavx2 -mtune=intel: > #include > > __m256i > foo (__m256i x, int s) > { > return (__m256i)__builtin_ia32_psllwi256 ((__v16hi)x, s); > } > > __m256i > bar (__m256i x, int s) > { > return ((s & 0xff) < 16) ? (__m256i)((__v16hi)x << (s & 0xff)) : > _mm256_setzero_si256 (); } > > The first one generates > movl%edi, %edi > vmovq %rdi, %xmm1 > vpsllw %xmm1, %ymm0, %ymm0 > ret > (because that is actually what the instruction does), the second one That is a different instruction. That is the vpsllw not vpsllwi The intrinsics I changed is the immediate version, I didn't change the non- immediate version. It is probably a bug if you can give non-immediate values to the immediate only intrinsic. At least both versions handles it, if in different ways, but is is illegal arguments. `Allan
Re: [PATCH] [x86] Avoid builtins for SSE/AVX2 immidiate logical shifts
On Monday 24 April 2017, Jakub Jelinek wrote: > On Mon, Apr 24, 2017 at 09:51:29AM +0200, Allan Sandfeld Jensen wrote: > > On Monday 24 April 2017, Jakub Jelinek wrote: > > > On Mon, Apr 24, 2017 at 09:33:09AM +0200, Allan Sandfeld Jensen wrote: > > > > --- a/gcc/config/i386/avx2intrin.h > > > > +++ b/gcc/config/i386/avx2intrin.h > > > > @@ -667,7 +667,7 @@ extern __inline __m256i > > > > > > > > __attribute__ ((__gnu_inline__, __always_inline__, __artificial__)) > > > > _mm256_slli_epi16 (__m256i __A, int __B) > > > > { > > > > > > > > - return (__m256i)__builtin_ia32_psllwi256 ((__v16hi)__A, __B); > > > > + return ((__B & 0xff) < 16) ? (__m256i)((__v16hi)__A << (__B & > > > > 0xff)) : _mm256_setzero_si256(); > > > > > > > > } > > > > > > What is the advantage of doing that when you replace one operation with > > > several (&, <, ?:, <<)? > > > I'd say instead we should fold the builtins if in the gimple fold > > > target hook we see the shift count constant and can decide based on > > > that. Or we could use __builtin_constant_p (__B) to decide whether to > > > use the generic vector shifts or builtin, but that means larger IL. > > > > The advantage is that in this builtin, the __B is always a literal (or > > constexpr), so the if statement is resolved at compile time. > > Do we really want to support all the thousands _mm* intrinsics in constexpr > contexts? People can just use generic vectors instead. > I would love to support it, but first we need a C extension attribute matching constexpr, and I consider it a separate issue. > That said, both the options I've mentioned above provide the same > advantages and don't have the disadvantages of pessimizing normal code. > What pessimizing? This produce the same or better code for all legal arguments. The only difference besides better generated code is that it allows the intrinsics to be used incorrectly with non-literal arguments because we lack the C-extension for constexp to prevent that. `Allan
Re: [PATCH] [x86] Avoid builtins for SSE/AVX2 immidiate logical shifts
On Monday 24 April 2017, Jakub Jelinek wrote: > On Mon, Apr 24, 2017 at 09:33:09AM +0200, Allan Sandfeld Jensen wrote: > > --- a/gcc/config/i386/avx2intrin.h > > +++ b/gcc/config/i386/avx2intrin.h > > @@ -667,7 +667,7 @@ extern __inline __m256i > > > > __attribute__ ((__gnu_inline__, __always_inline__, __artificial__)) > > _mm256_slli_epi16 (__m256i __A, int __B) > > { > > > > - return (__m256i)__builtin_ia32_psllwi256 ((__v16hi)__A, __B); > > + return ((__B & 0xff) < 16) ? (__m256i)((__v16hi)__A << (__B & 0xff)) : > > _mm256_setzero_si256(); > > > > } > > What is the advantage of doing that when you replace one operation with > several (&, <, ?:, <<)? > I'd say instead we should fold the builtins if in the gimple fold target > hook we see the shift count constant and can decide based on that. > Or we could use __builtin_constant_p (__B) to decide whether to use > the generic vector shifts or builtin, but that means larger IL. The advantage is that in this builtin, the __B is always a literal (or constexpr), so the if statement is resolved at compile time. `Allan
Re: [PATCH] [x86] Avoid builtins for SSE/AVX2 immidiate logical shifts
On Saturday 22 April 2017, Allan Sandfeld Jensen wrote: > Replaces definitions of immediate logical shift intrinsics with GCC > extension syntax. Tests are added to ensure the intrinsics still produce > the right instructions and that a few basic optimizations now work. > > Compared to the earlier version of the patch, all potentially undefined > shifts are now avoided, which also means no variable shifts or arithmetic > right shifts. Fixed 2 errors in the tests. diff --git a/gcc/ChangeLog b/gcc/ChangeLog index b58f5050db0..b9406550fc5 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,10 @@ +2017-04-22 Allan Sandfeld Jensen <sandf...@kde.org> + + * config/i386/emmintrin.h (_mm_slli_*, _mm_srli_*): + Use vector intrinstics instead of builtins. + * config/i386/avx2intrin.h (_mm256_slli_*, _mm256_srli_*): + Use vector intrinstics instead of builtins. + 2017-04-21 Uros Bizjak <ubiz...@gmail.com> * config/i386/i386.md (*extzvqi_mem_rex64): Move above *extzv. diff --git a/gcc/config/i386/avx2intrin.h b/gcc/config/i386/avx2intrin.h index 82f170a3d61..acb49734131 100644 --- a/gcc/config/i386/avx2intrin.h +++ b/gcc/config/i386/avx2intrin.h @@ -667,7 +667,7 @@ extern __inline __m256i __attribute__ ((__gnu_inline__, __always_inline__, __artificial__)) _mm256_slli_epi16 (__m256i __A, int __B) { - return (__m256i)__builtin_ia32_psllwi256 ((__v16hi)__A, __B); + return ((__B & 0xff) < 16) ? (__m256i)((__v16hi)__A << (__B & 0xff)) : _mm256_setzero_si256(); } extern __inline __m256i @@ -681,7 +681,7 @@ extern __inline __m256i __attribute__ ((__gnu_inline__, __always_inline__, __artificial__)) _mm256_slli_epi32 (__m256i __A, int __B) { - return (__m256i)__builtin_ia32_pslldi256 ((__v8si)__A, __B); + return ((__B & 0xff) < 32) ? (__m256i)((__v8si)__A << (__B & 0xff)) : _mm256_setzero_si256(); } extern __inline __m256i @@ -695,7 +695,7 @@ extern __inline __m256i __attribute__ ((__gnu_inline__, __always_inline__, __artificial__)) _mm256_slli_epi64 (__m256i __A, int __B) { - return (__m256i)__builtin_ia32_psllqi256 ((__v4di)__A, __B); + return ((__B & 0xff) < 64) ? (__m256i)((__v4di)__A << (__B & 0xff)) : _mm256_setzero_si256(); } extern __inline __m256i @@ -758,7 +758,7 @@ extern __inline __m256i __attribute__ ((__gnu_inline__, __always_inline__, __artificial__)) _mm256_srli_epi16 (__m256i __A, int __B) { - return (__m256i)__builtin_ia32_psrlwi256 ((__v16hi)__A, __B); + return ((__B & 0xff) < 16) ? (__m256i) ((__v16hu)__A >> (__B & 0xff)) : _mm256_setzero_si256(); } extern __inline __m256i @@ -772,7 +772,7 @@ extern __inline __m256i __attribute__ ((__gnu_inline__, __always_inline__, __artificial__)) _mm256_srli_epi32 (__m256i __A, int __B) { - return (__m256i)__builtin_ia32_psrldi256 ((__v8si)__A, __B); + return ((__B & 0xff) < 32) ? (__m256i) ((__v8su)__A >> (__B & 0xff)) : _mm256_setzero_si256(); } extern __inline __m256i @@ -786,7 +786,7 @@ extern __inline __m256i __attribute__ ((__gnu_inline__, __always_inline__, __artificial__)) _mm256_srli_epi64 (__m256i __A, int __B) { - return (__m256i)__builtin_ia32_psrlqi256 ((__v4di)__A, __B); + return ((__B & 0xff) < 64) ? (__m256i) ((__v4du)__A >> (__B & 0xff)) : _mm256_setzero_si256(); } extern __inline __m256i diff --git a/gcc/config/i386/emmintrin.h b/gcc/config/i386/emmintrin.h index 828f4a07a9b..5c048d9fd0d 100644 --- a/gcc/config/i386/emmintrin.h +++ b/gcc/config/i386/emmintrin.h @@ -1140,19 +1140,19 @@ _mm_mul_epu32 (__m128i __A, __m128i __B) extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_slli_epi16 (__m128i __A, int __B) { - return (__m128i)__builtin_ia32_psllwi128 ((__v8hi)__A, __B); + return ((__B & 0xff) < 16) ? (__m128i)((__v8hi)__A << (__B & 0xff)) : _mm_setzero_si128(); } extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_slli_epi32 (__m128i __A, int __B) { - return (__m128i)__builtin_ia32_pslldi128 ((__v4si)__A, __B); + return ((__B & 0xff) < 32) ? (__m128i)((__v4si)__A << (__B & 0xff)) : _mm_setzero_si128(); } extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_slli_epi64 (__m128i __A, int __B) { - return (__m128i)__builtin_ia32_psllqi128 ((__v2di)__A, __B); + return ((__B & 0xff) < 64) ? (__m128i)((__v2di)__A << (__B & 0xff)) : _mm_setzero_si128(); } extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__)) @@ -1205,19 +1205,19 @@ _mm_slli_si128 (__m128i __A, const int __N) extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_srli_epi16 (__m128i __A, int __B) { - return (__m128i)__builtin_ia32_psrlwi128 ((__v8hi)__A, __B); + return ((__B & 0xff) < 16) ? (__
Re: [Patch][i386] PR 70118: Fix ubsan warning on SSE2 loadl_epi64 and storel_epi64
On Tuesday 06 December 2016, Uros Bizjak wrote: > Hello! > > > 2016-11-30 Allan Sandfeld Jensen <allan.jen...@qt.io> > > > >PR target/70118 > >* gcc/config/i386/mmintrin.h (__m64_u): New type > >* gcc/config/i386/emmintrin.h (_mm_loadl_epi64, _mm_storel_epi64): > >Make the allowed unaligned memory access explicit. > > OK. > Thanks Note I don't have write access. Best regards `Allan
Re: [Patch][i386] PR 70118: Fix ubsan warning on SSE2 loadl_epi64 and storel_epi64
Trying again, this time with changelog. gcc/ 2016-11-30 Allan Sandfeld Jensen <allan.jen...@qt.io> PR target/70118 * gcc/config/i386/mmintrin.h (__m64_u): New type * gcc/config/i386/emmintrin.h (_mm_loadl_epi64, _mm_storel_epi64): Make the allowed unaligned memory access explicit. Index: gcc/config/i386/emmintrin.h === --- gcc/config/i386/emmintrin.h (revision 242892) +++ gcc/config/i386/emmintrin.h (working copy) @@ -703,9 +703,9 @@ } extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__)) -_mm_loadl_epi64 (__m128i const *__P) +_mm_loadl_epi64 (__m128i_u const *__P) { - return _mm_set_epi64 ((__m64)0LL, *(__m64 *)__P); + return _mm_set_epi64 ((__m64)0LL, *(__m64_u *)__P); } extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__)) @@ -721,9 +721,9 @@ } extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__)) -_mm_storel_epi64 (__m128i *__P, __m128i __B) +_mm_storel_epi64 (__m128i_u *__P, __m128i __B) { - *(long long *)__P = ((__v2di)__B)[0]; + *(__m64_u *)__P = (__m64) ((__v2di)__B)[0]; } extern __inline __m64 __attribute__((__gnu_inline__, __always_inline__, __artificial__)) Index: gcc/config/i386/mmintrin.h === --- gcc/config/i386/mmintrin.h (revision 242892) +++ gcc/config/i386/mmintrin.h (working copy) @@ -37,6 +37,9 @@ vector types, and their scalar components. */ typedef int __m64 __attribute__ ((__vector_size__ (8), __may_alias__)); +/* Unaligned version of the same type */ +typedef int __m64_u __attribute__ ((__vector_size__ (8), __may_alias__, __aligned__ (1))); + /* Internal data types for implementing the intrinsics. */ typedef int __v2si __attribute__ ((__vector_size__ (8))); typedef short __v4hi __attribute__ ((__vector_size__ (8)));
Re: [Patch][i386] PR 70118: Fix ubsan warning on SSE2 loadl_epi64 and storel_epi64
On Sunday 27 November 2016, Marc Glisse wrote: > On Sat, 26 Nov 2016, Allan Sandfeld Jensen wrote: > > Use the recently introduced unaligned variant of __m128i and add a > > similar __m64 and use those to make it clear these two intrinsics > > require neither 128- bit nor 64-bit alignment. > > Thanks for doing this. You'll want Uros or Kirill to review your patch. > There are probably several more places that could do with an unaligned > fix, but we don't have to find them all at once. > First I found it strange to use __m64, but then it actually seems like a > good call to use a type that is not just aligned(1) but also may_alias. > > + *(__m64_u *)__P = __m64(((__v2di)__B)[0]); > > gcc complains about this syntax for me, it wants parentheses around > __m64... Did it pass the testsuite for you? Fixed, it now matches the move just below. Index: gcc/config/i386/emmintrin.h === --- gcc/config/i386/emmintrin.h (revision 242892) +++ gcc/config/i386/emmintrin.h (working copy) @@ -703,9 +703,9 @@ } extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__)) -_mm_loadl_epi64 (__m128i const *__P) +_mm_loadl_epi64 (__m128i_u const *__P) { - return _mm_set_epi64 ((__m64)0LL, *(__m64 *)__P); + return _mm_set_epi64 ((__m64)0LL, *(__m64_u *)__P); } extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__)) @@ -721,9 +721,9 @@ } extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__)) -_mm_storel_epi64 (__m128i *__P, __m128i __B) +_mm_storel_epi64 (__m128i_u *__P, __m128i __B) { - *(long long *)__P = ((__v2di)__B)[0]; + *(__m64_u *)__P = (__m64) ((__v2di)__B)[0]; } extern __inline __m64 __attribute__((__gnu_inline__, __always_inline__, __artificial__)) Index: gcc/config/i386/mmintrin.h === --- gcc/config/i386/mmintrin.h (revision 242892) +++ gcc/config/i386/mmintrin.h (working copy) @@ -37,6 +37,9 @@ vector types, and their scalar components. */ typedef int __m64 __attribute__ ((__vector_size__ (8), __may_alias__)); +/* Unaligned version of the same type */ +typedef int __m64_u __attribute__ ((__vector_size__ (8), __may_alias__, __aligned__ (1))); + /* Internal data types for implementing the intrinsics. */ typedef int __v2si __attribute__ ((__vector_size__ (8))); typedef short __v4hi __attribute__ ((__vector_size__ (8)));
[Patch][i386] PR 70118: Fix ubsan warning on SSE2 loadl_epi64 and storel_epi64
Use the recently introduced unaligned variant of __m128i and add a similar __m64 and use those to make it clear these two intrinsics require neither 128- bit nor 64-bit alignment. `Allan Index: gcc/config/i386/emmintrin.h === --- gcc/config/i386/emmintrin.h (revision 242753) +++ gcc/config/i386/emmintrin.h (working copy) @@ -703,9 +703,9 @@ } extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__)) -_mm_loadl_epi64 (__m128i const *__P) +_mm_loadl_epi64 (__m128i_u const *__P) { - return _mm_set_epi64 ((__m64)0LL, *(__m64 *)__P); + return _mm_set_epi64 ((__m64)0LL, *(__m64_u *)__P); } extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__)) @@ -721,9 +721,9 @@ } extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__)) -_mm_storel_epi64 (__m128i *__P, __m128i __B) +_mm_storel_epi64 (__m128i_u *__P, __m128i __B) { - *(long long *)__P = ((__v2di)__B)[0]; + *(__m64_u *)__P = __m64(((__v2di)__B)[0]); } extern __inline __m64 __attribute__((__gnu_inline__, __always_inline__, __artificial__)) Index: gcc/config/i386/mmintrin.h === --- gcc/config/i386/mmintrin.h (revision 242753) +++ gcc/config/i386/mmintrin.h (working copy) @@ -37,6 +37,9 @@ vector types, and their scalar components. */ typedef int __m64 __attribute__ ((__vector_size__ (8), __may_alias__)); +/* Unaligned version of the same type */ +typedef int __m64_u __attribute__ ((__vector_size__ (8), __may_alias__, __aligned__ (1))); + /* Internal data types for implementing the intrinsics. */ typedef int __v2si __attribute__ ((__vector_size__ (8))); typedef short __v4hi __attribute__ ((__vector_size__ (8)));
Re: [PATCH] libiberty: Fix -Wimplicit-fallthrough warnings.
On Wednesday 02 November 2016, Mark Wielaard wrote: > -case 11: c+=((hashval_t)k[10]<<24); > -case 10: c+=((hashval_t)k[9]<<16); > -case 9 : c+=((hashval_t)k[8]<<8); > +case 11: c+=((hashval_t)k[10]<<24); /* fall through */ > +case 10: c+=((hashval_t)k[9]<<16); /* fall through */ > +case 9 : c+=((hashval_t)k[8]<<8);/* fall through */ >/* the first byte of c is reserved for the length */ This really highlights another exception -Wimplicit-fallthough should tolerate at least on level 1. Single line of code. case X: my_statement(); case y: my_statement(); and case X: my_statement(); case y: my_statement(); In both cases, the lack of break is obvious at a glance and thus a comment highlighting it has never been needed and shouldn't be enforced. Well, at least in my opinion, and it would take away all the rest of false positives I run into. Best regards `Allan
Re: [PATCH] Introduce -Wimplicit-fallthrough={0,1,2,3,4,5}
On Tuesday 11 October 2016, Jakub Jelinek wrote: > Hi! > > The following patch introduces difference warning levels for > -Wimplicit-fallthrough warning, so projects can choose if they want to > honor only attributes (-Wimplicit-fallthrough=5), or what kind of comments. > =4 is very picky and accepts only very small amount of comments, =3 is what > we had before this patch, =2 looks case insensitively for falls?[ > \t-]*thr(u|ough) anywhere in the comment, =1 accepts any comment, =0 is > the same as -Wno-implicit-fallthrough - disables the warning. I would suggest also looking for comments with "no break" in them, as that is another common way to annotate the intentional lack of a 'break'. If you want another example besides the linux kernel, I unified all our fall through comments in qtbase in August: https://codereview.qt-project.org/#/c/163595/ Though Qt is far from -Wimpliciit-fallthough clean after that, another colleague is working on that since we have traditionally aimed for -Wall - Wextra -Werror, though it will seriously wreck hawock with readability in several places with unrolled loops and switches on integers. Best regards `Allan
Re: [Patch, i386] Support BMI and BMI2 targets in multiversioning
On Monday 26 January 2015, H.J. Lu wrote: On Mon, Jan 26, 2015 at 11:08 AM, Allan Sandfeld Jensen carew...@gmail.com wrote: On Monday 26 January 2015, H.J. Lu wrote: On Mon, Jan 26, 2015 at 10:53 AM, Allan Sandfeld Jensen carew...@gmail.com wrote: Committed with a bunch of fixes (e.g. missing fold_builtin_cpu part in gcc/config/i386/i386.c, and mv17.C test didn't compile at all due to missing parenthesis). ... and now with committed ChangeLog and patch. gcc/ChangeLog: * config/i386/i386.c (get_builtin_code_for_version): Add support for BMI and BMI2 multiversion functions. (fold_builtin_cpu): Add F_BMI and F_BMI2. libgcc/ChangeLog: * config/i386/cpuinfo.c (enum processor_features): Add FEATURE_BMI and FEATURE_BMI2. (get_available_features): Detect FEATURE_BMI and FEATURE_BMI2. testsuite/ChangeLog: * gcc.target/i386/funcspec-5.c: Test new multiversion targets. * g++.dg/ext/mv17.C: Test BMI/BMI2 multiversion dispatcher. diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 9ec40cb..441911d 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -34289,15 +34289,18 @@ get_builtin_code_for_version (tree decl, tree *predica te_list) P_PROC_SSE4_A, P_SSE4_1, P_SSE4_2, -P_PROC_SSE4_2, P_POPCNT, +P_PROC_SSE4_2, P_AVX, P_PROC_AVX, +P_BMI, +P_PROC_BMI, P_FMA4, P_XOP, P_PROC_XOP, P_FMA, P_PROC_FMA, +P_BMI2, P_AVX2, P_PROC_AVX2, P_AVX512F, This changed the priority of P_POPCNT and caused FAIL: g++.dg/ext/mv1.C -std=gnu++98 execution test FAIL: g++.dg/ext/mv1.C -std=gnu++11 execution test FAIL: g++.dg/ext/mv1.C -std=gnu++14 execution test on Nehalem and Westmere machines: mv1.exe: /export/gnu/import/git/sources/gcc/gcc/testsuite/g++.dg/ext/mv1.C: 51: int main(): Assertion `val == 5' failed. since val is 6 now. Right. I am not sure why popcnt was prioritized below arch=corei7. The logic is supposed to be that any target that includes an extension is prioritized I don't understand your question. popcnt feature is separate from -march. Its priority has nothing to do with -march=corei7. arch=corei7 implies popcnt. See PTA_NEHALEM in i386.c. The test would probably work with -march=core2. AFAIK The logic of the priorities in multiversioning is that architecture specific functions are chosen over feature specific, unless the feature is one that isn't required by the architecture. On SSE4.2 machines, we should choose int __attribute__ ((target(arch=corei7))) foo (); over int __attribute__ ((target(popcnt))) foo (); But we shouldn't choose int __attribute__ ((target(arch=corei7))) foo (); over int __attribute__ ((target(arch=corei7,popcnt))) foo (); I guess since they represent the exact same effective ISA, they would have equal priority, so that it would likely chose whatever comes last. I have no strong opinion on this. But this is a user visible compiler behavior change. We should issue a warning/note here. Yes, or revert that part of the patch. It should have been a separate patch anyway. `Allan
Re: [Patch, i386] Support BMI and BMI2 targets in multiversioning
On Monday 26 January 2015, you wrote: On Mon, Jan 26, 2015 at 10:38 AM, Allan Sandfeld Jensen al...@carewolf.com wrote: On Monday 26 January 2015, H.J. Lu wrote: On Sun, Jan 25, 2015 at 10:37 AM, Uros Bizjak ubiz...@gmail.com wrote: On Sun, Jan 25, 2015 at 7:23 PM, Uros Bizjak ubiz...@gmail.com wrote: On Sat, Jan 24, 2015 at 11:49 AM, Allan Sandfeld Jensen al...@carewolf.com wrote: On Saturday 24 January 2015, Uros Bizjak wrote: On Mon, Jan 12, 2015 at 6:02 PM, Uros Bizjak ubiz...@gmail.com wrote: Hello! On Wed, Dec 31, 2014 at 01:28:47PM +0100, Allan Sandfeld Jensen wrote: I recently wanted to use multiversioning for BMI2 specific extensions PDEP/PEXT, and noticed it wasn't there. So I wrote this patch to add it, and also added AES, F16C and BMI1 for completeness. AES nor F16C doesn't make any sense IMHO for multiversioning, you need special intrinsics for that anyway and when you use them, the function will fail to compile without those features. Multiversioning only makes sense for ISA features the compiler uses for normal C/C++ code without any intrinsics. Patch reduced to just adding BMI and BMI2 multiversioning: +2014-12-29 Allan Sandfeld Jensen sandf...@kde.org + + * config/i386/i386.c (get_builtin_code_for_version): Add + support for BMI and BMI2 multiversion functions. +2014-12-29 Allan Sandfeld Jensen sandf...@kde.org + + * gcc.target/i386/funcspec-5.c: Test new multiversion targets. + * g++.dg/ext/mv17.C: Test BMI/BMI2 multiversion dispatcher. +2014-12-29 Allan Sandfeld Jensen sandf...@kde.org + + * config/i386/cpuinfo.c (enum processor_features): Add FEATURE_BMI and + FEATURE_BMI2. + (get_available_features): Detect FEATURE_BMI and FEATURE_BMI2. OK for mainline Allan, did you commit the patch to mainline? I don't see it in SVN logs. (If you don't have SVN commit access, please mention it in the patch submission, so someone will commit the patch for you). Sorry. I don't have SVN commit access. Committed with a bunch of fixes (e.g. missing fold_builtin_cpu part in gcc/config/i386/i386.c, and mv17.C test didn't compile at all due to missing parenthesis). ... and now with committed ChangeLog and patch. gcc/ChangeLog: * config/i386/i386.c (get_builtin_code_for_version): Add support for BMI and BMI2 multiversion functions. (fold_builtin_cpu): Add F_BMI and F_BMI2. libgcc/ChangeLog: * config/i386/cpuinfo.c (enum processor_features): Add FEATURE_BMI and FEATURE_BMI2. (get_available_features): Detect FEATURE_BMI and FEATURE_BMI2. testsuite/ChangeLog: * gcc.target/i386/funcspec-5.c: Test new multiversion targets. * g++.dg/ext/mv17.C: Test BMI/BMI2 multiversion dispatcher. diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 9ec40cb..441911d 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -34289,15 +34289,18 @@ get_builtin_code_for_version (tree decl, tree *predica te_list) P_PROC_SSE4_A, P_SSE4_1, P_SSE4_2, -P_PROC_SSE4_2, P_POPCNT, +P_PROC_SSE4_2, P_AVX, P_PROC_AVX, +P_BMI, +P_PROC_BMI, P_FMA4, P_XOP, P_PROC_XOP, P_FMA, P_PROC_FMA, +P_BMI2, P_AVX2, P_PROC_AVX2, P_AVX512F, This changed the priority of P_POPCNT and caused FAIL: g++.dg/ext/mv1.C -std=gnu++98 execution test FAIL: g++.dg/ext/mv1.C -std=gnu++11 execution test FAIL: g++.dg/ext/mv1.C -std=gnu++14 execution test on Nehalem and Westmere machines: mv1.exe: /export/gnu/import/git/sources/gcc/gcc/testsuite/g++.dg/ext/mv1.C:51: int main(): Assertion `val == 5' failed. since val is 6 now. Right. I am not sure why popcnt was prioritized below arch=corei7. The logic is supposed to be that any target that includes an extension is prioritized I don't understand your question. popcnt feature is separate from -march. Its priority has nothing to do with -march=corei7. arch=corei7 implies popcnt. See PTA_NEHALEM in i386.c. The test would probably work with -march=core2. AFAIK The logic of the priorities in multiversioning is that architecture specific functions are chosen over feature specific, unless the feature is one that isn't required by the architecture. `Allan
Re: [Patch, i386] Support BMI and BMI2 targets in multiversioning
On Monday 26 January 2015, H.J. Lu wrote: On Mon, Jan 26, 2015 at 10:53 AM, Allan Sandfeld Jensen carew...@gmail.com wrote: Committed with a bunch of fixes (e.g. missing fold_builtin_cpu part in gcc/config/i386/i386.c, and mv17.C test didn't compile at all due to missing parenthesis). ... and now with committed ChangeLog and patch. gcc/ChangeLog: * config/i386/i386.c (get_builtin_code_for_version): Add support for BMI and BMI2 multiversion functions. (fold_builtin_cpu): Add F_BMI and F_BMI2. libgcc/ChangeLog: * config/i386/cpuinfo.c (enum processor_features): Add FEATURE_BMI and FEATURE_BMI2. (get_available_features): Detect FEATURE_BMI and FEATURE_BMI2. testsuite/ChangeLog: * gcc.target/i386/funcspec-5.c: Test new multiversion targets. * g++.dg/ext/mv17.C: Test BMI/BMI2 multiversion dispatcher. diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 9ec40cb..441911d 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -34289,15 +34289,18 @@ get_builtin_code_for_version (tree decl, tree *predica te_list) P_PROC_SSE4_A, P_SSE4_1, P_SSE4_2, -P_PROC_SSE4_2, P_POPCNT, +P_PROC_SSE4_2, P_AVX, P_PROC_AVX, +P_BMI, +P_PROC_BMI, P_FMA4, P_XOP, P_PROC_XOP, P_FMA, P_PROC_FMA, +P_BMI2, P_AVX2, P_PROC_AVX2, P_AVX512F, This changed the priority of P_POPCNT and caused FAIL: g++.dg/ext/mv1.C -std=gnu++98 execution test FAIL: g++.dg/ext/mv1.C -std=gnu++11 execution test FAIL: g++.dg/ext/mv1.C -std=gnu++14 execution test on Nehalem and Westmere machines: mv1.exe: /export/gnu/import/git/sources/gcc/gcc/testsuite/g++.dg/ext/mv1.C:51: int main(): Assertion `val == 5' failed. since val is 6 now. Right. I am not sure why popcnt was prioritized below arch=corei7. The logic is supposed to be that any target that includes an extension is prioritized I don't understand your question. popcnt feature is separate from -march. Its priority has nothing to do with -march=corei7. arch=corei7 implies popcnt. See PTA_NEHALEM in i386.c. The test would probably work with -march=core2. AFAIK The logic of the priorities in multiversioning is that architecture specific functions are chosen over feature specific, unless the feature is one that isn't required by the architecture. On SSE4.2 machines, we should choose int __attribute__ ((target(arch=corei7))) foo (); over int __attribute__ ((target(popcnt))) foo (); But we shouldn't choose int __attribute__ ((target(arch=corei7))) foo (); over int __attribute__ ((target(arch=corei7,popcnt))) foo (); I guess since they represent the exact same effective ISA, they would have equal priority, so that it would likely chose whatever comes last. `Allan
Re: [Patch, i386] Support BMI and BMI2 targets in multiversioning
On Monday 26 January 2015, H.J. Lu wrote: On Sun, Jan 25, 2015 at 10:37 AM, Uros Bizjak ubiz...@gmail.com wrote: On Sun, Jan 25, 2015 at 7:23 PM, Uros Bizjak ubiz...@gmail.com wrote: On Sat, Jan 24, 2015 at 11:49 AM, Allan Sandfeld Jensen al...@carewolf.com wrote: On Saturday 24 January 2015, Uros Bizjak wrote: On Mon, Jan 12, 2015 at 6:02 PM, Uros Bizjak ubiz...@gmail.com wrote: Hello! On Wed, Dec 31, 2014 at 01:28:47PM +0100, Allan Sandfeld Jensen wrote: I recently wanted to use multiversioning for BMI2 specific extensions PDEP/PEXT, and noticed it wasn't there. So I wrote this patch to add it, and also added AES, F16C and BMI1 for completeness. AES nor F16C doesn't make any sense IMHO for multiversioning, you need special intrinsics for that anyway and when you use them, the function will fail to compile without those features. Multiversioning only makes sense for ISA features the compiler uses for normal C/C++ code without any intrinsics. Patch reduced to just adding BMI and BMI2 multiversioning: +2014-12-29 Allan Sandfeld Jensen sandf...@kde.org + + * config/i386/i386.c (get_builtin_code_for_version): Add + support for BMI and BMI2 multiversion functions. +2014-12-29 Allan Sandfeld Jensen sandf...@kde.org + + * gcc.target/i386/funcspec-5.c: Test new multiversion targets. + * g++.dg/ext/mv17.C: Test BMI/BMI2 multiversion dispatcher. +2014-12-29 Allan Sandfeld Jensen sandf...@kde.org + + * config/i386/cpuinfo.c (enum processor_features): Add FEATURE_BMI and + FEATURE_BMI2. + (get_available_features): Detect FEATURE_BMI and FEATURE_BMI2. OK for mainline Allan, did you commit the patch to mainline? I don't see it in SVN logs. (If you don't have SVN commit access, please mention it in the patch submission, so someone will commit the patch for you). Sorry. I don't have SVN commit access. Committed with a bunch of fixes (e.g. missing fold_builtin_cpu part in gcc/config/i386/i386.c, and mv17.C test didn't compile at all due to missing parenthesis). ... and now with committed ChangeLog and patch. gcc/ChangeLog: * config/i386/i386.c (get_builtin_code_for_version): Add support for BMI and BMI2 multiversion functions. (fold_builtin_cpu): Add F_BMI and F_BMI2. libgcc/ChangeLog: * config/i386/cpuinfo.c (enum processor_features): Add FEATURE_BMI and FEATURE_BMI2. (get_available_features): Detect FEATURE_BMI and FEATURE_BMI2. testsuite/ChangeLog: * gcc.target/i386/funcspec-5.c: Test new multiversion targets. * g++.dg/ext/mv17.C: Test BMI/BMI2 multiversion dispatcher. diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 9ec40cb..441911d 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -34289,15 +34289,18 @@ get_builtin_code_for_version (tree decl, tree *predica te_list) P_PROC_SSE4_A, P_SSE4_1, P_SSE4_2, -P_PROC_SSE4_2, P_POPCNT, +P_PROC_SSE4_2, P_AVX, P_PROC_AVX, +P_BMI, +P_PROC_BMI, P_FMA4, P_XOP, P_PROC_XOP, P_FMA, P_PROC_FMA, +P_BMI2, P_AVX2, P_PROC_AVX2, P_AVX512F, This changed the priority of P_POPCNT and caused FAIL: g++.dg/ext/mv1.C -std=gnu++98 execution test FAIL: g++.dg/ext/mv1.C -std=gnu++11 execution test FAIL: g++.dg/ext/mv1.C -std=gnu++14 execution test on Nehalem and Westmere machines: mv1.exe: /export/gnu/import/git/sources/gcc/gcc/testsuite/g++.dg/ext/mv1.C:51: int main(): Assertion `val == 5' failed. since val is 6 now. Right. I am not sure why popcnt was prioritized below arch=corei7. The logic is supposed to be that any target that includes an extension is prioritized above that extension. I included the change in the order in i386.c, but not the updated test. It is probably better to not change the order in i386.c, or at least change that in a separate commit. `Allan
Re: [Patch, i386] Support BMI and BMI2 targets in multiversioning
On Saturday 24 January 2015, Uros Bizjak wrote: On Mon, Jan 12, 2015 at 6:02 PM, Uros Bizjak ubiz...@gmail.com wrote: Hello! On Wed, Dec 31, 2014 at 01:28:47PM +0100, Allan Sandfeld Jensen wrote: I recently wanted to use multiversioning for BMI2 specific extensions PDEP/PEXT, and noticed it wasn't there. So I wrote this patch to add it, and also added AES, F16C and BMI1 for completeness. AES nor F16C doesn't make any sense IMHO for multiversioning, you need special intrinsics for that anyway and when you use them, the function will fail to compile without those features. Multiversioning only makes sense for ISA features the compiler uses for normal C/C++ code without any intrinsics. Patch reduced to just adding BMI and BMI2 multiversioning: +2014-12-29 Allan Sandfeld Jensen sandf...@kde.org + + * config/i386/i386.c (get_builtin_code_for_version): Add + support for BMI and BMI2 multiversion functions. +2014-12-29 Allan Sandfeld Jensen sandf...@kde.org + + * gcc.target/i386/funcspec-5.c: Test new multiversion targets. + * g++.dg/ext/mv17.C: Test BMI/BMI2 multiversion dispatcher. +2014-12-29 Allan Sandfeld Jensen sandf...@kde.org + + * config/i386/cpuinfo.c (enum processor_features): Add FEATURE_BMI and + FEATURE_BMI2. + (get_available_features): Detect FEATURE_BMI and FEATURE_BMI2. OK for mainline Allan, did you commit the patch to mainline? I don't see it in SVN logs. (If you don't have SVN commit access, please mention it in the patch submission, so someone will commit the patch for you). Sorry. I don't have SVN commit access. `Allan
Re: [Patch, i386] Support BMI and BMI2 targets in multiversioning
On Wednesday 31 December 2014, Jakub Jelinek wrote: On Wed, Dec 31, 2014 at 01:28:47PM +0100, Allan Sandfeld Jensen wrote: I recently wanted to use multiversioning for BMI2 specific extensions PDEP/PEXT, and noticed it wasn't there. So I wrote this patch to add it, and also added AES, F16C and BMI1 for completeness. AES nor F16C doesn't make any sense IMHO for multiversioning, you need special intrinsics for that anyway and when you use them, the function will fail to compile without those features. Multiversioning only makes sense for ISA features the compiler uses for normal C/C++ code without any intrinsics. Patch reduced to just adding BMI and BMI2 multiversioning: Best regards `Allan commit aa5b98407bcc3ae2f1fef70455b89447112c3e2c Author: Allan Sandfeld Jensen allan.jen...@digia.com Date: Fri Dec 26 21:14:01 2014 +0100 BMI and BMI2 multiversion support diff --git a/gcc/ChangeLog b/gcc/ChangeLog index ff8a5e6..6afd58a 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,8 @@ +2014-12-29 Allan Sandfeld Jensen sandf...@kde.org + + * config/i386/i386.c (get_builtin_code_for_version): Add + support for BMI and BMI2 multiversion functions. + 2014-12-27 H.J. Lu hongjiu...@intel.com PR target/64409 diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index d693fdb..9ae9bca 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -34261,15 +34261,18 @@ get_builtin_code_for_version (tree decl, tree *predicate_list) P_PROC_SSE4_A, P_SSE4_1, P_SSE4_2, -P_PROC_SSE4_2, P_POPCNT, +P_PROC_SSE4_2, P_AVX, P_PROC_AVX, +P_BMI, +P_PROC_BMI, P_FMA4, P_XOP, P_PROC_XOP, P_FMA, P_PROC_FMA, +P_BMI2, P_AVX2, P_PROC_AVX2, P_AVX512F, @@ -34295,12 +34298,14 @@ get_builtin_code_for_version (tree decl, tree *predicate_list) {sse4a, P_SSE4_A}, {ssse3, P_SSSE3}, {sse4.1, P_SSE4_1}, - {sse4.2, P_SSE4_2}, {popcnt, P_POPCNT}, + {sse4.2, P_SSE4_2}, {avx, P_AVX}, + {bmi, P_BMI}, {fma4, P_FMA4}, {xop, P_XOP}, {fma, P_FMA}, + {bmi2, P_BMI2}, {avx2, P_AVX2}, {avx512f, P_AVX512F} }; @@ -34395,7 +34400,7 @@ get_builtin_code_for_version (tree decl, tree *predicate_list) break; case PROCESSOR_BTVER2: arg_str = btver2; - priority = P_PROC_AVX; + priority = P_PROC_BMI; break; case PROCESSOR_BDVER1: arg_str = bdver1; diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index ef6ddcc..5b11622 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,8 @@ +2014-12-29 Allan Sandfeld Jensen sandf...@kde.org + + * gcc.target/i386/funcspec-5.c: Test new multiversion targets. + * g++.dg/ext/mv17.C: Test BMI/BMI2 multiversion dispatcher. + 2014-12-28 H.J. Lu hongjiu...@intel.com * gcc.target/i386/pr57003.c: Skip on x32. diff --git a/gcc/testsuite/g++.dg/ext/mv17.C b/gcc/testsuite/g++.dg/ext/mv17.C new file mode 100644 index 000..311f217 --- /dev/null +++ b/gcc/testsuite/g++.dg/ext/mv17.C @@ -0,0 +1,91 @@ +/* Test case to check if Multiversioning works for BMI and BMI2. */ +/* { dg-do run { target i?86-*-* x86_64-*-* } } */ +/* { dg-options -O2 } */ + +#include assert.h + +// Check BMI feature selection works +int foo () __attribute__((target(default))); +int foo () __attribute__((target(bmi))); +int foo () __attribute__((target(bmi2))); + +// Check specialized versions for archs with BMI is chosen over generic BMI versions. +int bar () __attribute__((target(default))); +int bar () __attribute__((target(bmi))); +int bar () __attribute__((target(bmi2))); +int bar () __attribute__((target(arch=btver2))); +int bar () __attribute__((target(arch=haswell))); + +int main () +{ + int val = foo (); + + if (__builtin_cpu_supports (bmi2)) +assert (val == 2); + else if (__builtin_cpu_supports (bmi)) +assert (val == 1); + else +assert (val == 0); + + val = bar (); + + if (__builtin_cpu_is (btver2) +assert (val == 5); + else if (__builtin_cpu_is (haswell)) +assert (val == 6); + else if (__builtin_cpu_supports (bmi2)) +assert (val == 2); + else if (__builtin_cpu_supports (bmi)) +assert (val == 1); + else +assert (val == 0); + + return 0; +} + +int __attribute__ ((target(default))) +foo () +{ + return 0; +} + +int __attribute__ ((target(bmi))) +foo () +{ + return 1; +} +int __attribute__ ((target(bmi2))) +foo () +{ + return 2; +} + +int __attribute__ ((target(default))) +bar () +{ + return 0; +} + +int __attribute__ ((target(bmi))) +bar () +{ + return 1; +} +int __attribute__ ((target(bmi2))) +bar () +{ + return 2; +} + +int __attribute__ ((target(arch=btver2))) +bar () +{ + return 5; +} + +int __attribute__ ((target(arch=haswell))) +bar () +{ + return 6; +} + diff --git a/gcc/testsuite/gcc.target/i386/funcspec-5.c b/gcc/testsuite/gcc.target/i386/funcspec-5.c index 269e610..2e316a7
[Patch, i386] Support AES, F16C, BMI and BMI2 targets in multiversioning
I recently wanted to use multiversioning for BMI2 specific extensions PDEP/PEXT, and noticed it wasn't there. So I wrote this patch to add it, and also added AES, F16C and BMI1 for completeness. Happy new year `Allan commit 062c09d45d22302ffbd4f86d88e16a1a0d49cd80 Author: Allan Sandfeld Jensen allan.jen...@digia.com Date: Fri Dec 26 21:14:01 2014 +0100 AES, F16C BMI and BMI2 multiversion support diff --git a/gcc/ChangeLog b/gcc/ChangeLog index ff8a5e6..83f16a5 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,8 @@ +2014-12-29 Allan Sandfeld Jensen sandf...@kde.org + + * config/i386/i386.c (get_builtin_code_for_version): Add + support for AES, BMI, BMI2 and F16C multiversion functions. + 2014-12-27 H.J. Lu hongjiu...@intel.com PR target/64409 diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index d693fdb..a1b74dc 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -34261,15 +34261,22 @@ get_builtin_code_for_version (tree decl, tree *predicate_list) P_PROC_SSE4_A, P_SSE4_1, P_SSE4_2, -P_PROC_SSE4_2, P_POPCNT, +P_PROC_SSE4_2, +P_AES, +P_PROC_AES, P_AVX, P_PROC_AVX, +P_F16C, +P_PROC_F16C, +P_BMI, +P_PROC_BMI, P_FMA4, P_XOP, P_PROC_XOP, P_FMA, P_PROC_FMA, +P_BMI2, P_AVX2, P_PROC_AVX2, P_AVX512F, @@ -34295,12 +34302,16 @@ get_builtin_code_for_version (tree decl, tree *predicate_list) {sse4a, P_SSE4_A}, {ssse3, P_SSSE3}, {sse4.1, P_SSE4_1}, - {sse4.2, P_SSE4_2}, {popcnt, P_POPCNT}, + {sse4.2, P_SSE4_2}, + {aes, P_AES}, {avx, P_AVX}, + {f16c, P_F16C}, + {bmi, P_BMI}, {fma4, P_FMA4}, {xop, P_XOP}, {fma, P_FMA}, + {bmi2, P_BMI2}, {avx2, P_AVX2}, {avx512f, P_AVX512F} }; @@ -34350,21 +34361,25 @@ get_builtin_code_for_version (tree decl, tree *predicate_list) priority = P_PROC_SSSE3; break; case PROCESSOR_NEHALEM: - if (new_target-x_ix86_isa_flags OPTION_MASK_ISA_AES) + if (new_target-x_ix86_isa_flags OPTION_MASK_ISA_AES) { arg_str = westmere; - else +priority = P_PROC_AES; + } else { /* We translate arch=corei7 and arch=nehalem to corei7 so that it will be mapped to M_INTEL_COREI7 as cpu type to cover all M_INTEL_COREI7_XXXs. */ arg_str = corei7; - priority = P_PROC_SSE4_2; +priority = P_PROC_SSE4_2; + } break; case PROCESSOR_SANDYBRIDGE: - if (new_target-x_ix86_isa_flags OPTION_MASK_ISA_F16C) + if (new_target-x_ix86_isa_flags OPTION_MASK_ISA_F16C) { arg_str = ivybridge; - else + priority = P_PROC_F16C; + } else { arg_str = sandybridge; - priority = P_PROC_AVX; + priority = P_PROC_AVX; + } break; case PROCESSOR_HASWELL: if (new_target-x_ix86_isa_flags OPTION_MASK_ISA_ADX) @@ -34395,7 +34410,7 @@ get_builtin_code_for_version (tree decl, tree *predicate_list) break; case PROCESSOR_BTVER2: arg_str = btver2; - priority = P_PROC_AVX; + priority = P_PROC_BMI; break; case PROCESSOR_BDVER1: arg_str = bdver1; diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index ef6ddcc..5b11622 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,8 @@ +2014-12-29 Allan Sandfeld Jensen sandf...@kde.org + + * gcc.target/i386/funcspec-5.c: Test new multiversion targets. + * g++.dg/ext/mv17.C: Test BMI/BMI2 multiversion dispatcher. + 2014-12-28 H.J. Lu hongjiu...@intel.com * gcc.target/i386/pr57003.c: Skip on x32. diff --git a/gcc/testsuite/g++.dg/ext/mv17.C b/gcc/testsuite/g++.dg/ext/mv17.C new file mode 100644 index 000..311f217 --- /dev/null +++ b/gcc/testsuite/g++.dg/ext/mv17.C @@ -0,0 +1,91 @@ +/* Test case to check if Multiversioning works for BMI and BMI2. */ +/* { dg-do run { target i?86-*-* x86_64-*-* } } */ +/* { dg-options -O2 } */ + +#include assert.h + +// Check BMI feature selection works +int foo () __attribute__((target(default))); +int foo () __attribute__((target(bmi))); +int foo () __attribute__((target(bmi2))); + +// Check specialized versions for archs with BMI is chosen over generic BMI versions. +int bar () __attribute__((target(default))); +int bar () __attribute__((target(bmi))); +int bar () __attribute__((target(bmi2))); +int bar () __attribute__((target(arch=btver2))); +int bar () __attribute__((target(arch=haswell))); + +int main () +{ + int val = foo (); + + if (__builtin_cpu_supports (bmi2)) +assert (val == 2); + else
Re: [Patch, i386] Support AES, F16C, BMI and BMI2 targets in multiversioning
On Wednesday 31 December 2014, Jakub Jelinek wrote: On Wed, Dec 31, 2014 at 01:28:47PM +0100, Allan Sandfeld Jensen wrote: I recently wanted to use multiversioning for BMI2 specific extensions PDEP/PEXT, and noticed it wasn't there. So I wrote this patch to add it, and also added AES, F16C and BMI1 for completeness. AES nor F16C doesn't make any sense IMHO for multiversioning, you need special intrinsics for that anyway and when you use them, the function will fail to compile without those features. Multiversioning only makes sense for ISA features the compiler uses for normal C/C++ code without any intrinsics. I disagree. You just include the intrinsics and use them. Inside a function with the right target rule they will work even if they don't outside. This works even without multiversioning, e.g. just specific functions with specific targets. `Allan
Re: [Patch, i386] Separate Intel processor with expanded ISA
Updated patch with test. diff --git a/gcc/ChangeLog b/gcc/ChangeLog index ccbea0f..e80c30b 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,8 @@ +2013-12-29 Allan Sandfeld Jensen sandf...@kde.org + * config/i386/i386.c (get_builtin_code_for_version): Separate + Westmere from Nehalem, Ivy Bridge from Sandy Bridge and + Broadwell from Haswell. + 2014-01-25 Walter Lee w...@tilera.com * config/tilegx/sync.md (atomic_fetch_sub): Fix negation and diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index cf79486..50f3624 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -31298,18 +31298,27 @@ get_builtin_code_for_version (tree decl, tree *predicate_list) priority = P_PROC_SSSE3; break; case PROCESSOR_NEHALEM: - /* We translate arch=corei7 and arch=nehelam to - corei7 so that it will be mapped to M_INTEL_COREI7 - as cpu type to cover all M_INTEL_COREI7_XXXs. */ - arg_str = corei7; + if (new_target-x_ix86_isa_flags OPTION_MASK_ISA_AES) + arg_str = westmere; + else + /* We translate arch=corei7 and arch=nehelam to + corei7 so that it will be mapped to M_INTEL_COREI7 + as cpu type to cover all M_INTEL_COREI7_XXXs. */ + arg_str = corei7; priority = P_PROC_SSE4_2; break; case PROCESSOR_SANDYBRIDGE: - arg_str = sandybridge; + if (new_target-x_ix86_isa_flags OPTION_MASK_ISA_F16C) + arg_str = ivybridge; + else + arg_str = sandybridge; priority = P_PROC_AVX; break; case PROCESSOR_HASWELL: - arg_str = haswell; + if (new_target-x_ix86_isa_flags OPTION_MASK_ISA_ADX) + arg_str = broadwell; + else + arg_str = haswell; priority = P_PROC_AVX2; break; case PROCESSOR_BONNELL: diff --git a/gcc/testsuite/g++.dg/ext/mv16.C b/gcc/testsuite/g++.dg/ext/mv16.C new file mode 100644 index 000..4737c79 --- /dev/null +++ b/gcc/testsuite/g++.dg/ext/mv16.C @@ -0,0 +1,69 @@ +// Test that dispatching can choose the right multiversion +// for Intel CPUs with the same internal GCC processor id +// but slighly different sets of x86 extensions. + +// { dg-do run { target i?86-*-* x86_64-*-* } } +// { dg-require-ifunc } +// { dg-options -O2 } + +#include assert.h + +int __attribute__ ((target(default))) +foo () +{ +return 0; +} + +int __attribute__ ((target(arch=nehalem))) +foo () +{ +return 4; +} + +int __attribute__ ((target(arch=westmere))) +foo () +{ +return 5; +} + +int __attribute__ ((target(arch=sandybridge))) +foo () +{ +return 8; +} + +int __attribute__ ((target(arch=ivybridge))) +foo () +{ +return 9; +} + +int __attribute__ ((target(arch=haswell))) +foo () +{ +return 12; +} + +int main () +{ +int val = foo (); + +if (__builtin_cpu_is (nehalem)) + assert (val == 4); +else +if (__builtin_cpu_is (westmere)) + assert (val == 5); +else +if (__builtin_cpu_is (sandybridge)) + assert (val == 8); +else +if (__builtin_cpu_is (ivybridge)) + assert (val == 9); +else +if (__builtin_cpu_is (haswell)) + assert (val == 12); +else + assert (val == 0); + +return 0; +}
Re: [Patch, i386] Separate Intel processor with expanded ISA
No comments? On Sunday 29 December 2013, Allan Sandfeld Jensen wrote: The function dispatcher might currently choose functions declared with target(arch=ivybridge) on a Sandy Bridge CPU. This happens because the function is only detected as sandybridge when generated. The attached patch detects Westmere, Ivybridge and Broadwell processors based on new ISAs they support. Regards `Allan
[Patch, i386] Separate Intel processor with expanded ISA
The function dispatcher might currently choose functions declared with target(arch=ivybridge) on a Sandy Bridge CPU. This happens because the function is only detected as sandybridge when generated. The attached patch detects Westmere, Ivybridge and Broadwell processors based on new ISAs they support. Regards `Allan Index: gcc/ChangeLog === --- gcc/ChangeLog (revision 206233) +++ gcc/ChangeLog (working copy) @@ -1,3 +1,9 @@ +2013-12-29 Allan Sandfeld Jensen sandf...@kde.org + + * config/i386/i386.c (get_builtin_code_for_version): Separate + Westmere from Nehalem, Ivy Bridge from Sandy Bridge and + Broadwell from Haswell. + 2013-12-28 Eric Botcazou ebotca...@adacore.com * doc/invoke.texi (output file options): Document -fada-spec-parent. Index: gcc/config/i386/i386.c === --- gcc/config/i386/i386.c (revision 206233) +++ gcc/config/i386/i386.c (working copy) @@ -30030,18 +30005,27 @@ priority = P_PROC_SSSE3; break; case PROCESSOR_NEHALEM: - /* We translate arch=corei7 and arch=nehelam to - corei7 so that it will be mapped to M_INTEL_COREI7 - as cpu type to cover all M_INTEL_COREI7_XXXs. */ - arg_str = corei7; + if (new_target-x_ix86_isa_flags OPTION_MASK_ISA_AES) + arg_str = westmere; + else + /* We translate arch=corei7 and arch=nehelam to + corei7 so that it will be mapped to M_INTEL_COREI7 + as cpu type to cover all M_INTEL_COREI7_XXXs. */ + arg_str = corei7; priority = P_PROC_SSE4_2; break; case PROCESSOR_SANDYBRIDGE: - arg_str = sandybridge; + if (new_target-x_ix86_isa_flags OPTION_MASK_ISA_F16C) + arg_str = ivybridge; + else + arg_str = sandybridge; priority = P_PROC_AVX; break; case PROCESSOR_HASWELL: - arg_str = haswell; + if (new_target-x_ix86_isa_flags OPTION_MASK_ISA_ADX) + arg_str = broadwell; + else + arg_str = haswell; priority = P_PROC_AVX2; break; case PROCESSOR_BONNELL:
Re: [Patch, i386] PR 59422 - Support more targets for function multi versioning
On Thursday 26 December 2013, Gopalasubramanian, Ganesh wrote: Hi, (get_amd_cpu): Handle AMD_BOBCAT, AMD_JAGUAR, AMDFAM15H_BDVER2 and AMDFAM15H_BDVER3. As mentioned earlier, we would like to stick with BTVER1 and BTVER2 instead of using BOBCAT or JAGUAR. Attached patch does the changes. Sorry for missing your comment. Thanks for fixing it. Renaming the comments with the AMD family names might be overdoing it though. `Allan
Re: [Patch, i386] PR 59422 - Support more targets for function multi versioning
On Wednesday 25 December 2013, Uros Bizjak wrote: On Tue, Dec 24, 2013 at 4:17 PM, Allan Sandfeld Jensen carew...@gmail.com wrote: Will libgcc/config/i386/cpuinfo.c update be a separate patch? Should we use a single definition for both i386.c and libgcc? Currently they need to be in the same patch. But yes, moving the definition out to a common header would probably be a good idea to reduce potential mismatches in future. This could be a follow-up patch. How does the patch get commited after being accepted? It has been many years since I last contributed to gcc, and I can not remember the rest of the process, and doubt it is still the same. Please send me ChangeLog entries, and I will commit the patch for you. Patch is OK for mainline. Ah, sorry I moved the ChangeLog entries to ChangeLog.arch but it didn't get included by svn diff. I have attached it. Christmas greetings `Allan gcc/ 2013-12-23 Allan Sandfeld Jensen sandf...@kde.org PR gcc/59422 * config/i386/i386.c: Extend function multiversioning to better support recent Intel and AMD models. 2013-12-23 H.J. Lu hongjiu...@intel.com * config/i386/i386.c (get_builtin_code_for_version): Handle PROCESSOR_HASWELL and PROCESSOR_SILVERMONT. (processor_model): Add M_INTEL_COREI7_IVYBRIDGE and M_INTEL_COREI7_HASWELL. (arch_names_table): Add ivybridge, haswell, bonnell, silvermont. libgcc/ 2013-12-23 Allan Sandfeld Jensen sandf...@kde.org PR gcc/59422 * config/i386/cpuinfo.c: Detect sse4a, fma4, xop and fma ISAs and recent Intel and AMD models. 2013-12-23 H.J. Lu hongjiu...@intel.com * config/i386/cpuinfo.c (processor_subtypes): Add INTEL_COREI7_IVYBRIDGE and INTEL_COREI7_HASWELL. (get_intel_cpu): Check Ivy Bridge and Haswell processors.
Re: [Patch, i386] PR 59422 - Support more targets for function multi versioning
On Monday 23 December 2013, H.J. Lu wrote: If you use {corei7-avx, M_INTEL_COREI7_SANYBRIDGE}, {core-avx2, M_INTEL_COREI7_HASWELL}, will it cause any problems? When there are both Actually I seems I don't need these definitions any more after your clean-up of Intel architecture names. I have attached patch with them removed (and named haswell enums back to corei7_haswell). If both target(arch=corei7-avx) and target(arch=sandybridge) is present the dispatcher appears to choose sandybridge. If you want a warning for duplicates in this case, I suggest adding it in a later patch. `Allan Index: gcc/config/i386/i386.c === --- gcc/config/i386/i386.c (revision 206179) +++ gcc/config/i386/i386.c (working copy) @@ -29970,16 +29970,21 @@ P_SSE3, P_SSSE3, P_PROC_SSSE3, -P_SSE4_a, -P_PROC_SSE4_a, +P_SSE4_A, +P_PROC_SSE4_A, P_SSE4_1, P_SSE4_2, P_PROC_SSE4_2, P_POPCNT, P_AVX, +P_PROC_AVX, +P_FMA4, +P_XOP, +P_PROC_XOP, +P_FMA, +P_PROC_FMA, P_AVX2, -P_FMA, -P_PROC_FMA +P_PROC_AVX2 }; enum feature_priority priority = P_ZERO; @@ -29998,11 +30003,15 @@ {sse, P_SSE}, {sse2, P_SSE2}, {sse3, P_SSE3}, + {sse4a, P_SSE4_A}, {ssse3, P_SSSE3}, {sse4.1, P_SSE4_1}, {sse4.2, P_SSE4_2}, {popcnt, P_POPCNT}, {avx, P_AVX}, + {fma4, P_FMA4}, + {xop, P_XOP}, + {fma, P_FMA}, {avx2, P_AVX2} }; @@ -30054,26 +30063,50 @@ arg_str = nehalem; priority = P_PROC_SSE4_2; break; -case PROCESSOR_SANDYBRIDGE: - arg_str = sandybridge; - priority = P_PROC_SSE4_2; - break; + case PROCESSOR_SANDYBRIDGE: + arg_str = sandybridge; + priority = P_PROC_AVX; + break; + case PROCESSOR_HASWELL: + arg_str = haswell; + priority = P_PROC_AVX2; + break; case PROCESSOR_BONNELL: arg_str = bonnell; priority = P_PROC_SSSE3; break; + case PROCESSOR_SILVERMONT: + arg_str = silvermont; + priority = P_PROC_SSE4_2; + break; case PROCESSOR_AMDFAM10: arg_str = amdfam10h; - priority = P_PROC_SSE4_a; + priority = P_PROC_SSE4_A; break; + case PROCESSOR_BTVER1: + arg_str = bobcat; + priority = P_PROC_SSE4_A; + break; + case PROCESSOR_BTVER2: + arg_str = jaguar; + priority = P_PROC_AVX; + break; case PROCESSOR_BDVER1: arg_str = bdver1; - priority = P_PROC_FMA; + priority = P_PROC_XOP; break; case PROCESSOR_BDVER2: arg_str = bdver2; priority = P_PROC_FMA; break; + case PROCESSOR_BDVER3: + arg_str = bdver3; + priority = P_PROC_FMA; + break; + case PROCESSOR_BDVER4: + arg_str = bdver4; + priority = P_PROC_AVX2; + break; } } @@ -30938,6 +30971,10 @@ F_SSE4_2, F_AVX, F_AVX2, +F_SSE4_A, +F_FMA4, +F_XOP, +F_FMA, F_MAX }; @@ -30955,6 +30992,8 @@ M_AMDFAM10H, M_AMDFAM15H, M_INTEL_SILVERMONT, +M_AMD_BOBCAT, +M_AMD_JAGUAR, M_CPU_SUBTYPE_START, M_INTEL_COREI7_NEHALEM, M_INTEL_COREI7_WESTMERE, @@ -30965,7 +31004,9 @@ M_AMDFAM15H_BDVER1, M_AMDFAM15H_BDVER2, M_AMDFAM15H_BDVER3, -M_AMDFAM15H_BDVER4 +M_AMDFAM15H_BDVER4, +M_INTEL_COREI7_IVYBRIDGE, +M_INTEL_COREI7_HASWELL }; static struct _arch_names_table @@ -30984,15 +31025,21 @@ {nehalem, M_INTEL_COREI7_NEHALEM}, {westmere, M_INTEL_COREI7_WESTMERE}, {sandybridge, M_INTEL_COREI7_SANDYBRIDGE}, + {ivybridge, M_INTEL_COREI7_IVYBRIDGE}, + {haswell, M_INTEL_COREI7_HASWELL}, + {bonnell, M_INTEL_BONNELL}, + {silvermont, M_INTEL_SILVERMONT}, {amdfam10h, M_AMDFAM10H}, {barcelona, M_AMDFAM10H_BARCELONA}, {shanghai, M_AMDFAM10H_SHANGHAI}, {istanbul, M_AMDFAM10H_ISTANBUL}, + {bobcat, M_AMD_BOBCAT}, {amdfam15h, M_AMDFAM15H}, {bdver1, M_AMDFAM15H_BDVER1}, {bdver2, M_AMDFAM15H_BDVER2}, {bdver3, M_AMDFAM15H_BDVER3}, {bdver4, M_AMDFAM15H_BDVER4}, + {jaguar, M_AMD_JAGUAR}, }; static struct _isa_names_table @@ -31009,9 +31056,13 @@ {sse2, F_SSE2}, {sse3, F_SSE3}, {ssse3, F_SSSE3}, + {sse4a, F_SSE4_A}, {sse4.1, F_SSE4_1}, {sse4.2, F_SSE4_2}, {avx,F_AVX}, + {fma4, F_FMA4}, + {xop,F_XOP}, + {fma,F_FMA}, {avx2, F_AVX2} }; Index: gcc/testsuite/gcc.target/i386/funcspec-5.c === --- gcc/testsuite/gcc.target/i386/funcspec-5.c (revision 206179) +++ gcc/testsuite/gcc.target/i386/funcspec-5.c (working copy) @@ -17,7 +17,9 @@
Re: [Patch, i386] PR 59422 - Support more targets for function multi versioning
On Tuesday 24 December 2013, H.J. Lu wrote: Will libgcc/config/i386/cpuinfo.c update be a separate patch? Should we use a single definition for both i386.c and libgcc? Currently they need to be in the same patch. But yes, moving the definition out to a common header would probably be a good idea to reduce potential mismatches in future. How does the patch get commited after being accepted? It has been many years since I last contributed to gcc, and I can not remember the rest of the process, and doubt it is still the same. Regards `Allan
Re: [Patch, i386] PR 59422 - Support more targets for function multi versioning
On Monday 23 December 2013, H.J. Lu wrote: On Thu, Dec 19, 2013 at 11:20:39AM +0100, Allan Sandfeld Jensen wrote: On Thursday 19 December 2013, Gopalasubramanian, Ganesh wrote: Sorry, I must have been looking at an older version, but as I said I already did enable it in the latest patch. (see http://gcc.gnu.org/ml/gcc-patches/2013-12/msg01577.html ) Sorry for causing another revision but we would like to stick with btver1 and btver2 rather than BOBCAT or JAGUAR. Therefore the changes would be like I will need to make an updated patch to move the new ISAs to the end of the list anyway. I will send it in a few days to give AMD or Intel developers time to comment on the current version. I renamed Intel processor names. Please update your patch. Here is my patch to add more Intel processor support. You can add it to your patch. Updated patch attached. Rebased, fixed coding style, moved new ISA enums to the end and applied H.J.Lu's patch. `Allan Index: gcc/config/i386/i386.c === --- gcc/config/i386/i386.c (revision 206179) +++ gcc/config/i386/i386.c (working copy) @@ -29970,16 +29970,21 @@ P_SSE3, P_SSSE3, P_PROC_SSSE3, -P_SSE4_a, -P_PROC_SSE4_a, +P_SSE4_A, +P_PROC_SSE4_A, P_SSE4_1, P_SSE4_2, P_PROC_SSE4_2, P_POPCNT, P_AVX, +P_PROC_AVX, +P_FMA4, +P_XOP, +P_PROC_XOP, +P_FMA, +P_PROC_FMA, P_AVX2, -P_FMA, -P_PROC_FMA +P_PROC_AVX2 }; enum feature_priority priority = P_ZERO; @@ -29998,11 +30003,15 @@ {sse, P_SSE}, {sse2, P_SSE2}, {sse3, P_SSE3}, + {sse4a, P_SSE4_A}, {ssse3, P_SSSE3}, {sse4.1, P_SSE4_1}, {sse4.2, P_SSE4_2}, {popcnt, P_POPCNT}, {avx, P_AVX}, + {fma4, P_FMA4}, + {xop, P_XOP}, + {fma, P_FMA}, {avx2, P_AVX2} }; @@ -30054,26 +30063,50 @@ arg_str = nehalem; priority = P_PROC_SSE4_2; break; -case PROCESSOR_SANDYBRIDGE: - arg_str = sandybridge; - priority = P_PROC_SSE4_2; - break; + case PROCESSOR_SANDYBRIDGE: + arg_str = sandybridge; + priority = P_PROC_AVX; + break; + case PROCESSOR_HASWELL: + arg_str = haswell; + priority = P_PROC_SSE4_2; + break; case PROCESSOR_BONNELL: arg_str = bonnell; priority = P_PROC_SSSE3; break; + case PROCESSOR_SILVERMONT: + arg_str = silvermont; + priority = P_PROC_SSE4_2; + break; case PROCESSOR_AMDFAM10: arg_str = amdfam10h; - priority = P_PROC_SSE4_a; + priority = P_PROC_SSE4_A; break; + case PROCESSOR_BTVER1: + arg_str = bobcat; + priority = P_PROC_SSE4_A; + break; + case PROCESSOR_BTVER2: + arg_str = jaguar; + priority = P_PROC_AVX; + break; case PROCESSOR_BDVER1: arg_str = bdver1; - priority = P_PROC_FMA; + priority = P_PROC_XOP; break; case PROCESSOR_BDVER2: arg_str = bdver2; priority = P_PROC_FMA; break; + case PROCESSOR_BDVER3: + arg_str = bdver3; + priority = P_PROC_FMA; + break; + case PROCESSOR_BDVER4: + arg_str = bdver4; + priority = P_PROC_AVX2; + break; } } @@ -30938,6 +30971,10 @@ F_SSE4_2, F_AVX, F_AVX2, +F_SSE4_A, +F_FMA4, +F_XOP, +F_FMA, F_MAX }; @@ -30955,6 +30992,10 @@ M_AMDFAM10H, M_AMDFAM15H, M_INTEL_SILVERMONT, +M_INTEL_COREI7_AVX, +M_INTEL_CORE_AVX2, +M_AMD_BOBCAT, +M_AMD_JAGUAR, M_CPU_SUBTYPE_START, M_INTEL_COREI7_NEHALEM, M_INTEL_COREI7_WESTMERE, @@ -30965,7 +31006,9 @@ M_AMDFAM15H_BDVER1, M_AMDFAM15H_BDVER2, M_AMDFAM15H_BDVER3, -M_AMDFAM15H_BDVER4 +M_AMDFAM15H_BDVER4, +M_INTEL_COREI7_IVYBRIDGE, +M_INTEL_CORE_HASWELL }; static struct _arch_names_table @@ -30983,16 +31026,24 @@ {corei7, M_INTEL_COREI7}, {nehalem, M_INTEL_COREI7_NEHALEM}, {westmere, M_INTEL_COREI7_WESTMERE}, + {corei7-avx, M_INTEL_COREI7_AVX}, {sandybridge, M_INTEL_COREI7_SANDYBRIDGE}, + {ivybridge, M_INTEL_COREI7_IVYBRIDGE}, + {core-avx2, M_INTEL_CORE_AVX2}, + {haswell, M_INTEL_CORE_HASWELL}, + {bonnell, M_INTEL_BONNELL}, + {silvermont, M_INTEL_SILVERMONT}, {amdfam10h, M_AMDFAM10H}, {barcelona, M_AMDFAM10H_BARCELONA}, {shanghai, M_AMDFAM10H_SHANGHAI}, {istanbul, M_AMDFAM10H_ISTANBUL}, + {bobcat, M_AMD_BOBCAT}, {amdfam15h, M_AMDFAM15H}, {bdver1, M_AMDFAM15H_BDVER1}, {bdver2, M_AMDFAM15H_BDVER2}, {bdver3, M_AMDFAM15H_BDVER3}, {bdver4, M_AMDFAM15H_BDVER4}, + {jaguar, M_AMD_JAGUAR}, }; static struct _isa_names_table @@ -31009,9
Re: [Patch, i386] PR 59422 - Support more targets for function multi versioning
On Monday 23 December 2013, Allan Sandfeld Jensen wrote: On Monday 23 December 2013, H.J. Lu wrote: On Thu, Dec 19, 2013 at 11:20:39AM +0100, Allan Sandfeld Jensen wrote: On Thursday 19 December 2013, Gopalasubramanian, Ganesh wrote: Sorry, I must have been looking at an older version, but as I said I already did enable it in the latest patch. (see http://gcc.gnu.org/ml/gcc-patches/2013-12/msg01577.html ) Sorry for causing another revision but we would like to stick with btver1 and btver2 rather than BOBCAT or JAGUAR. Therefore the changes would be like I will need to make an updated patch to move the new ISAs to the end of the list anyway. I will send it in a few days to give AMD or Intel developers time to comment on the current version. I renamed Intel processor names. Please update your patch. Here is my patch to add more Intel processor support. You can add it to your patch. Updated patch attached. Rebased, fixed coding style, moved new ISA enums to the end and applied H.J.Lu's patch. Fixed merging mistake that left haswell with SSE4_2 priority. `Allan Index: gcc/config/i386/i386.c === --- gcc/config/i386/i386.c (revision 206179) +++ gcc/config/i386/i386.c (working copy) @@ -29970,16 +29970,21 @@ P_SSE3, P_SSSE3, P_PROC_SSSE3, -P_SSE4_a, -P_PROC_SSE4_a, +P_SSE4_A, +P_PROC_SSE4_A, P_SSE4_1, P_SSE4_2, P_PROC_SSE4_2, P_POPCNT, P_AVX, +P_PROC_AVX, +P_FMA4, +P_XOP, +P_PROC_XOP, +P_FMA, +P_PROC_FMA, P_AVX2, -P_FMA, -P_PROC_FMA +P_PROC_AVX2 }; enum feature_priority priority = P_ZERO; @@ -29998,11 +30003,15 @@ {sse, P_SSE}, {sse2, P_SSE2}, {sse3, P_SSE3}, + {sse4a, P_SSE4_A}, {ssse3, P_SSSE3}, {sse4.1, P_SSE4_1}, {sse4.2, P_SSE4_2}, {popcnt, P_POPCNT}, {avx, P_AVX}, + {fma4, P_FMA4}, + {xop, P_XOP}, + {fma, P_FMA}, {avx2, P_AVX2} }; @@ -30054,26 +30063,50 @@ arg_str = nehalem; priority = P_PROC_SSE4_2; break; -case PROCESSOR_SANDYBRIDGE: - arg_str = sandybridge; - priority = P_PROC_SSE4_2; - break; + case PROCESSOR_SANDYBRIDGE: + arg_str = sandybridge; + priority = P_PROC_AVX; + break; + case PROCESSOR_HASWELL: + arg_str = haswell; + priority = P_PROC_AVX2; + break; case PROCESSOR_BONNELL: arg_str = bonnell; priority = P_PROC_SSSE3; break; + case PROCESSOR_SILVERMONT: + arg_str = silvermont; + priority = P_PROC_SSE4_2; + break; case PROCESSOR_AMDFAM10: arg_str = amdfam10h; - priority = P_PROC_SSE4_a; + priority = P_PROC_SSE4_A; break; + case PROCESSOR_BTVER1: + arg_str = bobcat; + priority = P_PROC_SSE4_A; + break; + case PROCESSOR_BTVER2: + arg_str = jaguar; + priority = P_PROC_AVX; + break; case PROCESSOR_BDVER1: arg_str = bdver1; - priority = P_PROC_FMA; + priority = P_PROC_XOP; break; case PROCESSOR_BDVER2: arg_str = bdver2; priority = P_PROC_FMA; break; + case PROCESSOR_BDVER3: + arg_str = bdver3; + priority = P_PROC_FMA; + break; + case PROCESSOR_BDVER4: + arg_str = bdver4; + priority = P_PROC_AVX2; + break; } } @@ -30938,6 +30971,10 @@ F_SSE4_2, F_AVX, F_AVX2, +F_SSE4_A, +F_FMA4, +F_XOP, +F_FMA, F_MAX }; @@ -30955,6 +30992,10 @@ M_AMDFAM10H, M_AMDFAM15H, M_INTEL_SILVERMONT, +M_INTEL_COREI7_AVX, +M_INTEL_CORE_AVX2, +M_AMD_BOBCAT, +M_AMD_JAGUAR, M_CPU_SUBTYPE_START, M_INTEL_COREI7_NEHALEM, M_INTEL_COREI7_WESTMERE, @@ -30965,7 +31006,9 @@ M_AMDFAM15H_BDVER1, M_AMDFAM15H_BDVER2, M_AMDFAM15H_BDVER3, -M_AMDFAM15H_BDVER4 +M_AMDFAM15H_BDVER4, +M_INTEL_COREI7_IVYBRIDGE, +M_INTEL_CORE_HASWELL }; static struct _arch_names_table @@ -30983,16 +31026,24 @@ {corei7, M_INTEL_COREI7}, {nehalem, M_INTEL_COREI7_NEHALEM}, {westmere, M_INTEL_COREI7_WESTMERE}, + {corei7-avx, M_INTEL_COREI7_AVX}, {sandybridge, M_INTEL_COREI7_SANDYBRIDGE}, + {ivybridge, M_INTEL_COREI7_IVYBRIDGE}, + {core-avx2, M_INTEL_CORE_AVX2}, + {haswell, M_INTEL_CORE_HASWELL}, + {bonnell, M_INTEL_BONNELL}, + {silvermont, M_INTEL_SILVERMONT}, {amdfam10h, M_AMDFAM10H}, {barcelona, M_AMDFAM10H_BARCELONA}, {shanghai, M_AMDFAM10H_SHANGHAI}, {istanbul, M_AMDFAM10H_ISTANBUL}, + {bobcat, M_AMD_BOBCAT}, {amdfam15h, M_AMDFAM15H}, {bdver1, M_AMDFAM15H_BDVER1}, {bdver2, M_AMDFAM15H_BDVER2}, {bdver3
Re: [Patch, i386] PR 59422 - Support more targets for function multi versioning
On Monday 23 December 2013, H.J. Lu wrote: On Mon, Dec 23, 2013 at 8:57 AM, Allan Sandfeld Jensen carew...@gmail.com wrote: On Monday 23 December 2013, Allan Sandfeld Jensen wrote: On Monday 23 December 2013, H.J. Lu wrote: On Thu, Dec 19, 2013 at 11:20:39AM +0100, Allan Sandfeld Jensen wrote: On Thursday 19 December 2013, Gopalasubramanian, Ganesh wrote: Sorry, I must have been looking at an older version, but as I said I already did enable it in the latest patch. (see http://gcc.gnu.org/ml/gcc-patches/2013-12/msg01577.html ) Sorry for causing another revision but we would like to stick with btver1 and btver2 rather than BOBCAT or JAGUAR. Therefore the changes would be like I will need to make an updated patch to move the new ISAs to the end of the list anyway. I will send it in a few days to give AMD or Intel developers time to comment on the current version. I renamed Intel processor names. Please update your patch. Here is my patch to add more Intel processor support. You can add it to your patch. Updated patch attached. Rebased, fixed coding style, moved new ISA enums to the end and applied H.J.Lu's patch. Fixed merging mistake that left haswell with SSE4_2 priority. `Allan +M_INTEL_COREI7_AVX, +M_INTEL_CORE_AVX2, Do we need them? M_INTEL_COREI7_AVX is the same M_INTEL_COREI7_SANDYBRIDGE and M_INTEL_CORE_AVX2 is the same as M_INTEL_COREI7_HASWELL. M_INTEL_COREI7_AVX is the common model for both sandybridge and ivybridge. Matching PROCESSOR_SANDYBRIDGE, or march=corei7-avx. Similarly M_INTEL_CORE_AVX2 is the common model for haswell and broadwell, matching PROCESSOR_HASWELL or march=core-avx2. +M_INTEL_CORE_HASWELL Please change M_INTEL_CORE_HASWELL to M_INTEL_COREI7_HASWELL. I used the name core_haswell to make its prefix match that of its model core_avx2 (as opposed to corei7_avx for instance). + {corei7-avx, M_INTEL_COREI7_AVX}, + {core-avx2, M_INTEL_CORE_AVX2}, Why do we need them? Without the existence of these entries, __attribute__((target(corei7-avx))) or __attribute__((target(core-avx2)) failed to compile because of how parameters to attributes were verified. Regards `Allan
Re: [Patch, i386] PR 59422 - Support more targets for function multi versioning
On Thursday 19 December 2013, Gopalasubramanian, Ganesh wrote: Yes, I changed that in the last patch, though I consider it momentarily problematic because you do not yet enable AVX with march=btver2 (AVX versions would currently be better than btver2 versions for a btver2 arch), but expect march=btver2 will be fixed soon. The processor_alias_table entry for btver2 in i386.c enables AVX. snip {btver2, PROCESSOR_BTVER2, CPU_BTVER2, PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3 | PTA_SSSE3 | PTA_SSE4A |PTA_ABM | PTA_CX16 | PTA_SSE4_1 | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_BMI | PTA_F16C | PTA_MOVBE | PTA_PRFCHW | PTA_FXSR | PTA_XSAVE | PTA_XSAVEOPT}, /snip The assembly listing for a simple test (compiled with -march=btver2) also has -mavx enabled. So, can you please enable AVX for btver2? Sorry, I must have been looking at an older version, but as I said I already did enable it in the latest patch. (see http://gcc.gnu.org/ml/gcc-patches/2013-12/msg01577.html ) `Allan
Re: [Patch, i386] PR 59422 - Support more targets for function multi versioning
On Thursday 19 December 2013, Gopalasubramanian, Ganesh wrote: Sorry, I must have been looking at an older version, but as I said I already did enable it in the latest patch. (see http://gcc.gnu.org/ml/gcc-patches/2013-12/msg01577.html ) Sorry for causing another revision but we would like to stick with btver1 and btver2 rather than BOBCAT or JAGUAR. Therefore the changes would be like I will need to make an updated patch to move the new ISAs to the end of the list anyway. I will send it in a few days to give AMD or Intel developers time to comment on the current version. `Allan
Re: [Patch, i386] PR 59422 - Support more targets for function multi versioning
Update patch. Solved __attribute((target(arch=corei7-avx))) by defining proper architectures for the recent Intel families instead of renaming submodels. I am thinking the patch is starting to touch a bit many different details, perhaps it should be split up, or is it good as is? Regards `Allan Index: gcc/ChangeLog === --- gcc/ChangeLog (revision 206065) +++ gcc/ChangeLog (working copy) @@ -1,3 +1,9 @@ +2013-12-14 Allan Sandfeld Jensen sandf...@kde.org + +PR gcc/59422 +* config/i386/i386.c: Extend function multiversioning +to better support recent Intel and AMD models. + 2013-12-17 Jan Hubicka hubi...@ucw.cz * ipa-devirt.c (get_polymorphic_call_info): Fix offset calculatoin Index: gcc/config/i386/i386.c === --- gcc/config/i386/i386.c (revision 206065) +++ gcc/config/i386/i386.c (working copy) @@ -29965,9 +29965,14 @@ P_PROC_SSE4_2, P_POPCNT, P_AVX, +P_PROC_AVX, +P_FMA4, +P_XOP, +P_PROC_XOP, +P_FMA, +P_PROC_FMA, P_AVX2, -P_FMA, -P_PROC_FMA +P_PROC_AVX2 }; enum feature_priority priority = P_ZERO; @@ -29986,11 +29991,15 @@ {sse, P_SSE}, {sse2, P_SSE2}, {sse3, P_SSE3}, + {sse4a, P_SSE4_a}, {ssse3, P_SSSE3}, {sse4.1, P_SSE4_1}, {sse4.2, P_SSE4_2}, {popcnt, P_POPCNT}, {avx, P_AVX}, + {fma4, P_FMA4}, + {xop, P_XOP}, + {fma, P_FMA}, {avx2, P_AVX2} }; @@ -30044,25 +30053,49 @@ break; case PROCESSOR_COREI7_AVX: arg_str = corei7-avx; - priority = P_PROC_SSE4_2; + priority = P_PROC_AVX; break; +case PROCESSOR_HASWELL: + arg_str = core-avx2; + priority = P_PROC_AVX2; + break; case PROCESSOR_ATOM: arg_str = atom; priority = P_PROC_SSSE3; break; +case PROCESSOR_SLM: + arg_str = slm; + priority = P_PROC_SSE4_2; + break; case PROCESSOR_AMDFAM10: arg_str = amdfam10h; priority = P_PROC_SSE4_a; break; +case PROCESSOR_BTVER1: + arg_str = bobcat; + priority = P_PROC_SSE4_a; + break; +case PROCESSOR_BTVER2: + arg_str = jaguar; + priority = P_PROC_AVX; + break; case PROCESSOR_BDVER1: arg_str = bdver1; - priority = P_PROC_FMA; + priority = P_PROC_XOP; break; case PROCESSOR_BDVER2: arg_str = bdver2; priority = P_PROC_FMA; break; - } +case PROCESSOR_BDVER3: + arg_str = bdver3; + priority = P_PROC_FMA; + break; +case PROCESSOR_BDVER4: + arg_str = bdver4; + priority = P_PROC_AVX2; + break; +} } cl_target_option_restore (global_options, cur_target); @@ -30922,9 +30955,13 @@ F_SSE2, F_SSE3, F_SSSE3, +F_SSE4_a, F_SSE4_1, F_SSE4_2, F_AVX, +F_FMA4, +F_XOP, +F_FMA, F_AVX2, F_MAX }; @@ -30943,6 +30980,10 @@ M_AMDFAM10H, M_AMDFAM15H, M_INTEL_SLM, +M_INTEL_COREI7_AVX, +M_INTEL_CORE_AVX2, +M_AMD_BOBCAT, +M_AMD_JAGUAR, M_CPU_SUBTYPE_START, M_INTEL_COREI7_NEHALEM, M_INTEL_COREI7_WESTMERE, @@ -30953,7 +30994,9 @@ M_AMDFAM15H_BDVER1, M_AMDFAM15H_BDVER2, M_AMDFAM15H_BDVER3, -M_AMDFAM15H_BDVER4 +M_AMDFAM15H_BDVER4, +M_INTEL_COREI7_IVYBRIDGE, +M_INTEL_CORE_HASWELL }; static struct _arch_names_table @@ -30971,11 +31014,17 @@ {corei7, M_INTEL_COREI7}, {nehalem, M_INTEL_COREI7_NEHALEM}, {westmere, M_INTEL_COREI7_WESTMERE}, + {corei7-avx, M_INTEL_COREI7_AVX}, {sandybridge, M_INTEL_COREI7_SANDYBRIDGE}, + {ivybridge, M_INTEL_COREI7_IVYBRIDGE}, + {core-avx2, M_INTEL_CORE_AVX2}, + {haswell, M_INTEL_CORE_HASWELL}, {amdfam10h, M_AMDFAM10H}, {barcelona, M_AMDFAM10H_BARCELONA}, {shanghai, M_AMDFAM10H_SHANGHAI}, {istanbul, M_AMDFAM10H_ISTANBUL}, + {bobcat, M_AMD_BOBCAT}, + {jaguar, M_AMD_JAGUAR}, {amdfam15h, M_AMDFAM15H}, {bdver1, M_AMDFAM15H_BDVER1}, {bdver2, M_AMDFAM15H_BDVER2}, @@ -30997,9 +31046,13 @@ {sse2, F_SSE2}, {sse3, F_SSE3}, {ssse3, F_SSSE3}, + {sse4a, F_SSE4_a}, {sse4.1, F_SSE4_1}, {sse4.2, F_SSE4_2}, {avx,F_AVX}, + {fma4, F_FMA4}, + {xop,F_XOP}, + {fma,F_FMA}, {avx2, F_AVX2} }; Index: gcc/testsuite/gcc.target/i386/funcspec-5.c === --- gcc/testsuite/gcc.target/i386/funcspec-5.c
Re: [Patch, i386] PR 59422 - Support more targets for function multi versioning
On Wednesday 18 December 2013, Gopalasubramanian, Ganesh wrote: Ping! Gopalasubramanian, Ganesh ganesh.gopalasubraman...@amd.com wrote: Yes, I figured that was the original idea behind it, but the final family of the jaguar processors seems to have become 16h instead of 14h (bobcat) at some point. Yes. It is amdfam16h. I was supposed to pass on some comments on the patch. 1. Amdfam16h for Jaguar. 2. For Jaguar, the priority needs to be AVX (AVX got included into the Jaguar ISA). Yes, I changed that in the last patch, though I consider it momentarily problematic because you do not yet enable AVX with march=btver2 (AVX versions would currently be better than btver2 versions for a btver2 arch), but expect march=btver2 will be fixed soon. Regards 'Allan
Re: [Patch, i386] PR 59422 - Support more targets for function multi versioning
On Monday 16 December 2013, Uros Bizjak wrote: On Mon, Dec 16, 2013 at 10:34 AM, Uros Bizjak ubiz...@gmail.com wrote: On Sun, Dec 15, 2013 at 7:54 PM, Allan Sandfeld Jensen carew...@gmail.com wrote: Hi again On Wednesday 11 December 2013, Uros Bizjak wrote: Hello! PR gcc/59422 This patch extends the supported targets for function multi versiong to also include Haswell, Silvermont, and the most recent AMD models. It also prioritizes AVX2 versions over AMD specific pre-AVX2 versions. Please add a ChangeLog entry and attach the complete patch. Please also state how you tested the patch, as outlined in the instructions [1]. [1] http://gcc.gnu.org/contribute.html Updated patch for better CPU model detection and added ChangeLog. The patch has been tested with the attached test.cpp. Verified that it doesn't build before the patch, and that it builds after, and verified it selects correct versions at runtime based on either CPU model or supported ISA (tested on 3 machines: SandyBridge, IvyBridge and Phenom II). Btw, I couldn't find anything that corresponds to gcc's btver2 arch. Is that an old term for what has become the Jaguar architecture? Thanks for the patch! However, you should not change the existing order of enums in cpuinfo.c (enum processor_vendor, enum processor_types, enum processor_subtypes, enum processor_features), but new entries should be added at the end (before *_MAX entry, if exists) of the enum. The enums (enum processor_features and enum processor_model) in config/i386/i386.c should mirror these changes. Please see [1]. Probably, we should document this in the source... - {sandybridge, M_INTEL_COREI7_SANDYBRIDGE}, + {corei7-avx, M_INTEL_COREI7_SANDYBRIDGE}, Huh... Thanks for catching this. -march=sandybridge is not recognized... - {sandybridge, M_INTEL_COREI7_SANDYBRIDGE}, + {corei7-avx, M_INTEL_COREI7_SANDYBRIDGE}, + {core-avx-i, M_INTEL_COREI7_IVYBRIDGE}, + {core-avx2, M_INTEL_COREI7_HASWELL}, Ah, no. These names are not intended to be used in -march. We can follow the tradition and use sandybridge, ivybridge and haswell here. I had the problem that arch=corei7-avx was not recognized as a valid property argument until I made that change. I thought it was the intend to merge this list of models with the canonical names, but perhaps it is an error in the new parameter validation? Note that similarly arch=sandybridge is accepted as a valid property argument but then fails as an invalid argument for march. Regards `Allan
Re: [Patch, i386] PR 59422 - Support more targets for function multi versioning
On Tuesday 17 December 2013, Allan Sandfeld Jensen wrote: On Monday 16 December 2013, Uros Bizjak wrote: On Mon, Dec 16, 2013 at 10:34 AM, Uros Bizjak ubiz...@gmail.com wrote: On Sun, Dec 15, 2013 at 7:54 PM, Allan Sandfeld Jensen carew...@gmail.com wrote: Hi again On Wednesday 11 December 2013, Uros Bizjak wrote: Hello! PR gcc/59422 This patch extends the supported targets for function multi versiong to also include Haswell, Silvermont, and the most recent AMD models. It also prioritizes AVX2 versions over AMD specific pre-AVX2 versions. Please add a ChangeLog entry and attach the complete patch. Please also state how you tested the patch, as outlined in the instructions [1]. [1] http://gcc.gnu.org/contribute.html Updated patch for better CPU model detection and added ChangeLog. The patch has been tested with the attached test.cpp. Verified that it doesn't build before the patch, and that it builds after, and verified it selects correct versions at runtime based on either CPU model or supported ISA (tested on 3 machines: SandyBridge, IvyBridge and Phenom II). Btw, I couldn't find anything that corresponds to gcc's btver2 arch. Is that an old term for what has become the Jaguar architecture? Thanks for the patch! However, you should not change the existing order of enums in cpuinfo.c (enum processor_vendor, enum processor_types, enum processor_subtypes, enum processor_features), but new entries should be added at the end (before *_MAX entry, if exists) of the enum. The enums (enum processor_features and enum processor_model) in config/i386/i386.c should mirror these changes. Please see [1]. Probably, we should document this in the source... - {sandybridge, M_INTEL_COREI7_SANDYBRIDGE}, + {corei7-avx, M_INTEL_COREI7_SANDYBRIDGE}, Huh... Thanks for catching this. -march=sandybridge is not recognized... - {sandybridge, M_INTEL_COREI7_SANDYBRIDGE}, + {corei7-avx, M_INTEL_COREI7_SANDYBRIDGE}, + {core-avx-i, M_INTEL_COREI7_IVYBRIDGE}, + {core-avx2, M_INTEL_COREI7_HASWELL}, Ah, no. These names are not intended to be used in -march. We can follow the tradition and use sandybridge, ivybridge and haswell here. I had the problem that arch=corei7-avx was not recognized as a valid property argument until I made that change. I thought it was the intend to merge this list of models with the canonical names, but perhaps it is an error in the new parameter validation? Ah, sorry. I think I misremembered the problem. After reviewing the code again, I think the only problem is with target(arch=core-avx-i) because it is not in the list of architectures (because it is treated as the same architecture as corei7-avx presumably). I will revert the sandybridge name change in the next patch, and make the new names match. `Allan
Re: [Patch, i386] PR 59422 - Support more targets for function multi versioning
Patch updated to keep libgcc enums backwards compatible. `Allan Index: gcc/ChangeLog === --- gcc/ChangeLog (revision 205984) +++ gcc/ChangeLog (working copy) @@ -1,3 +1,9 @@ +2013-12-14 Allan Sandfeld Jensen sandf...@kde.org + +PR gcc/59422 +* config/i386/i386.c: Extend function multiversioning +to better support recent Intel and AMD models. + 2013-12-14 Marek Polacek pola...@redhat.com PR sanitizer/59503 Index: gcc/config/i386/i386.c === --- gcc/config/i386/i386.c (revision 205984) +++ gcc/config/i386/i386.c (working copy) @@ -29962,9 +29962,14 @@ P_PROC_SSE4_2, P_POPCNT, P_AVX, +P_PROC_AVX, +P_FMA4, +P_XOP, +P_PROC_XOP, +P_FMA, +P_PROC_FMA, P_AVX2, -P_FMA, -P_PROC_FMA +P_PROC_AVX2 }; enum feature_priority priority = P_ZERO; @@ -29983,11 +29988,15 @@ {sse, P_SSE}, {sse2, P_SSE2}, {sse3, P_SSE3}, + {sse4a, P_SSE4_a}, {ssse3, P_SSSE3}, {sse4.1, P_SSE4_1}, {sse4.2, P_SSE4_2}, {popcnt, P_POPCNT}, {avx, P_AVX}, + {fma4, P_FMA4}, + {xop, P_XOP}, + {fma, P_FMA}, {avx2, P_AVX2} }; @@ -30041,25 +30050,49 @@ break; case PROCESSOR_COREI7_AVX: arg_str = corei7-avx; - priority = P_PROC_SSE4_2; + priority = P_PROC_AVX; break; +case PROCESSOR_HASWELL: + arg_str = core-avx2; + priority = P_PROC_AVX2; + break; case PROCESSOR_ATOM: arg_str = atom; priority = P_PROC_SSSE3; break; +case PROCESSOR_SLM: + arg_str = slm; + priority = P_PROC_SSE4_2; + break; case PROCESSOR_AMDFAM10: arg_str = amdfam10h; priority = P_PROC_SSE4_a; break; +case PROCESSOR_BTVER1: + arg_str = btver1; + priority = P_PROC_SSE4_a; + break; +case PROCESSOR_BTVER2: + arg_str = btver2; + priority = P_PROC_SSE4_2; + break; case PROCESSOR_BDVER1: arg_str = bdver1; - priority = P_PROC_FMA; + priority = P_PROC_XOP; break; case PROCESSOR_BDVER2: arg_str = bdver2; priority = P_PROC_FMA; break; - } +case PROCESSOR_BDVER3: + arg_str = bdver3; + priority = P_PROC_FMA; + break; +case PROCESSOR_BDVER4: + arg_str = bdver4; + priority = P_PROC_AVX2; + break; +} } cl_target_option_restore (global_options, cur_target); @@ -30919,9 +30952,13 @@ F_SSE2, F_SSE3, F_SSSE3, +F_SSE4_a, F_SSE4_1, F_SSE4_2, F_AVX, +F_FMA4, +F_XOP, +F_FMA, F_AVX2, F_MAX }; @@ -30940,6 +30977,7 @@ M_AMDFAM10H, M_AMDFAM15H, M_INTEL_SLM, +M_AMDFAM14H, M_CPU_SUBTYPE_START, M_INTEL_COREI7_NEHALEM, M_INTEL_COREI7_WESTMERE, @@ -30950,7 +30988,10 @@ M_AMDFAM15H_BDVER1, M_AMDFAM15H_BDVER2, M_AMDFAM15H_BDVER3, -M_AMDFAM15H_BDVER4 +M_AMDFAM15H_BDVER4, +M_INTEL_COREI7_IVYBRIDGE, +M_INTEL_COREI7_HASWELL, +M_AMDFAM14H_BTVER1, }; static struct _arch_names_table @@ -30968,11 +31009,15 @@ {corei7, M_INTEL_COREI7}, {nehalem, M_INTEL_COREI7_NEHALEM}, {westmere, M_INTEL_COREI7_WESTMERE}, - {sandybridge, M_INTEL_COREI7_SANDYBRIDGE}, + {corei7-avx, M_INTEL_COREI7_SANDYBRIDGE}, + {core-avx-i, M_INTEL_COREI7_IVYBRIDGE}, + {core-avx2, M_INTEL_COREI7_HASWELL}, {amdfam10h, M_AMDFAM10H}, {barcelona, M_AMDFAM10H_BARCELONA}, {shanghai, M_AMDFAM10H_SHANGHAI}, {istanbul, M_AMDFAM10H_ISTANBUL}, + {amdfam14h, M_AMDFAM14H}, + {btver1, M_AMDFAM14H_BTVER1}, {amdfam15h, M_AMDFAM15H}, {bdver1, M_AMDFAM15H_BDVER1}, {bdver2, M_AMDFAM15H_BDVER2}, @@ -30994,9 +31039,13 @@ {sse2, F_SSE2}, {sse3, F_SSE3}, {ssse3, F_SSSE3}, + {sse4a, F_SSE4_a}, {sse4.1, F_SSE4_1}, {sse4.2, F_SSE4_2}, {avx,F_AVX}, + {fma4, F_FMA4}, + {xop,F_XOP}, + {fma,F_FMA}, {avx2, F_AVX2} }; Index: libgcc/ChangeLog === --- libgcc/ChangeLog (revision 205984) +++ libgcc/ChangeLog (working copy) @@ -1,3 +1,9 @@ +2013-12-14 Allan Sandfeld Jensen sandf...@kde.org + +PR gcc/59422 +* config/i386/cpuinfo.c: Detect sse4a, fma4, xop and fma +ISAs and recent Intel and AMD models. + 2013-12-12 Zhenqiang Chen zhenqiang.c...@arm.com * config.host (arm*-*-uclinux*): Move t-arm before t-bpabi. Index
Re: [Patch, i386] PR 59422 - Support more targets for function multi versioning
On Monday 16 December 2013, Gopalasubramanian, Ganesh wrote: Btw, I couldn't find anything that corresponds to gcc's btver2 arch. Is that an old term for what has become the Jaguar architecture? Yes, btver2 = jaguar. We have the name as per its family name (i.e, bobcat family) in GCC. Similarly we have the names bdver2 = piledriver, bdver3 = steamroller as per their family (bulldozer) name. Yes, I figured that was the original idea behind it, but the final family of the jaguar processors seems to have become 16h instead of 14h (bobcat) at some point. Regards `Allan
Re: [Patch, i386] PR 59422 - Support more targets for function multi versioning
Hi again On Wednesday 11 December 2013, Uros Bizjak wrote: Hello! PR gcc/59422 This patch extends the supported targets for function multi versiong to also include Haswell, Silvermont, and the most recent AMD models. It also prioritizes AVX2 versions over AMD specific pre-AVX2 versions. Please add a ChangeLog entry and attach the complete patch. Please also state how you tested the patch, as outlined in the instructions [1]. [1] http://gcc.gnu.org/contribute.html Updated patch for better CPU model detection and added ChangeLog. The patch has been tested with the attached test.cpp. Verified that it doesn't build before the patch, and that it builds after, and verified it selects correct versions at runtime based on either CPU model or supported ISA (tested on 3 machines: SandyBridge, IvyBridge and Phenom II). Btw, I couldn't find anything that corresponds to gcc's btver2 arch. Is that an old term for what has become the Jaguar architecture? `Allan Index: gcc/ChangeLog === --- gcc/ChangeLog (revision 205984) +++ gcc/ChangeLog (working copy) @@ -1,3 +1,9 @@ +2013-12-14 Allan Sandfeld Jensen sandf...@kde.org + +PR gcc/59422 +* config/i386/i386.c: Extend function multiversioning +to better support recent Intel and AMD models. + 2013-12-14 Marek Polacek pola...@redhat.com PR sanitizer/59503 Index: gcc/config/i386/i386.c === --- gcc/config/i386/i386.c (revision 205984) +++ gcc/config/i386/i386.c (working copy) @@ -29962,9 +29962,14 @@ P_PROC_SSE4_2, P_POPCNT, P_AVX, +P_PROC_AVX, +P_FMA4, +P_XOP, +P_PROC_XOP, +P_FMA, +P_PROC_FMA, P_AVX2, -P_FMA, -P_PROC_FMA +P_PROC_AVX2 }; enum feature_priority priority = P_ZERO; @@ -29983,11 +29988,15 @@ {sse, P_SSE}, {sse2, P_SSE2}, {sse3, P_SSE3}, + {sse4a, P_SSE4_a}, {ssse3, P_SSSE3}, {sse4.1, P_SSE4_1}, {sse4.2, P_SSE4_2}, {popcnt, P_POPCNT}, {avx, P_AVX}, + {fma4, P_FMA4}, + {xop, P_XOP}, + {fma, P_FMA}, {avx2, P_AVX2} }; @@ -30041,25 +30050,49 @@ break; case PROCESSOR_COREI7_AVX: arg_str = corei7-avx; - priority = P_PROC_SSE4_2; + priority = P_PROC_AVX; break; +case PROCESSOR_HASWELL: + arg_str = core-avx2; + priority = P_PROC_AVX2; + break; case PROCESSOR_ATOM: arg_str = atom; priority = P_PROC_SSSE3; break; +case PROCESSOR_SLM: + arg_str = slm; + priority = P_PROC_SSE4_2; + break; case PROCESSOR_AMDFAM10: arg_str = amdfam10h; priority = P_PROC_SSE4_a; break; +case PROCESSOR_BTVER1: + arg_str = btver1; + priority = P_PROC_SSE4_a; + break; +case PROCESSOR_BTVER2: + arg_str = btver2; + priority = P_PROC_SSE4_2; + break; case PROCESSOR_BDVER1: arg_str = bdver1; - priority = P_PROC_FMA; + priority = P_PROC_XOP; break; case PROCESSOR_BDVER2: arg_str = bdver2; priority = P_PROC_FMA; break; - } +case PROCESSOR_BDVER3: + arg_str = bdver3; + priority = P_PROC_FMA; + break; +case PROCESSOR_BDVER4: + arg_str = bdver4; + priority = P_PROC_AVX2; + break; +} } cl_target_option_restore (global_options, cur_target); @@ -30919,9 +30952,13 @@ F_SSE2, F_SSE3, F_SSSE3, +F_SSE4_a, F_SSE4_1, F_SSE4_2, F_AVX, +F_FMA4, +F_XOP, +F_FMA, F_AVX2, F_MAX }; @@ -30938,15 +30975,20 @@ M_INTEL_CORE2, M_INTEL_COREI7, M_AMDFAM10H, +M_AMDFAM14H, M_AMDFAM15H, M_INTEL_SLM, M_CPU_SUBTYPE_START, M_INTEL_COREI7_NEHALEM, M_INTEL_COREI7_WESTMERE, M_INTEL_COREI7_SANDYBRIDGE, +M_INTEL_COREI7_IVYBRIDGE, +M_INTEL_COREI7_HASWELL, M_AMDFAM10H_BARCELONA, M_AMDFAM10H_SHANGHAI, M_AMDFAM10H_ISTANBUL, +M_AMDFAM14H_BTVER1, +M_AMDFAM14H_BTVER2, M_AMDFAM15H_BDVER1, M_AMDFAM15H_BDVER2, M_AMDFAM15H_BDVER3, @@ -30968,11 +31010,16 @@ {corei7, M_INTEL_COREI7}, {nehalem, M_INTEL_COREI7_NEHALEM}, {westmere, M_INTEL_COREI7_WESTMERE}, - {sandybridge, M_INTEL_COREI7_SANDYBRIDGE}, + {corei7-avx, M_INTEL_COREI7_SANDYBRIDGE}, + {core-avx-i, M_INTEL_COREI7_IVYBRIDGE}, + {core-avx2, M_INTEL_COREI7_HASWELL}, {amdfam10h, M_AMDFAM10H}, {barcelona, M_AMDFAM10H_BARCELONA}, {shanghai, M_AMDFAM10H_SHANGHAI}, {istanbul, M_AMDFAM10H_ISTANBUL
[Patch, i386] PR 59422 - Support more targets for function multi versioning
PR gcc/59422 This patch extends the supported targets for function multi versiong to also include Haswell, Silvermont, and the most recent AMD models. It also prioritizes AVX2 versions over AMD specific pre-AVX2 versions.