RE: [PATCH, i386] Enable -mprefer-avx128 by default for Bulldozer
Hi,

I re-attached the patch here. Can someone review it? We would like to commit it to trunk as well as to the 4.6 branch.

Thanks,
Changpeng

From: Fang, Changpeng
Sent: Monday, June 27, 2011 5:42 PM
To: Fang, Changpeng; Jan Hubicka
Cc: Uros Bizjak; gcc-patches@gcc.gnu.org; rguent...@suse.de
Subject: RE: [PATCH, i386] Enable -mprefer-avx128 by default for Bulldozer

Is this patch OK to commit to trunk? Also, I would like to backport this patch to the gcc 4.6 branch. Do I have to send a separate request, or can I use this one?

Thanks,
Changpeng

From: Fang, Changpeng
Sent: Friday, June 24, 2011 7:12 PM
To: Jan Hubicka
Cc: Uros Bizjak; gcc-patches@gcc.gnu.org; rguent...@suse.de
Subject: RE: [PATCH, i386] Enable -mprefer-avx128 by default for Bulldozer

Hi,

I have no preference in the tune-feature coding, but I agree with you that it is better to put similar things together. I modified the code following your suggestion. Is it OK to commit this modified patch?

Thanks,
Changpeng

From: Jan Hubicka [hubi...@ucw.cz]
Sent: Thursday, June 23, 2011 6:20 PM
To: Fang, Changpeng
Cc: Uros Bizjak; gcc-patches@gcc.gnu.org; hubi...@ucw.cz; rguent...@suse.de
Subject: Re: [PATCH, i386] Enable -mprefer-avx128 by default for Bulldozer

Hi,

--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2128,6 +2128,9 @@ static const unsigned int x86_avx256_split_unaligned_load
 static const unsigned int x86_avx256_split_unaligned_store
   = m_COREI7 | m_BDVER1 | m_GENERIC;

+static const unsigned int x86_prefer_avx128
+  = m_BDVER1;

What is the reason for stuff like this not to go into initial_ix86_tune_features? I sort of liked them better when they were individual flags, but having the target tuning flags spread across multiple places seems unnecessary.
Honza

From a325395439a314f87b3c79a5b9ce79a6a976a710 Mon Sep 17 00:00:00 2001
From: Changpeng Fang <chfang@huainan.(none)>
Date: Wed, 22 Jun 2011 15:03:05 -0700
Subject: [PATCH] Auto-vectorizer generates 128-bit AVX insns by default for bdver1

	* config/i386/i386.opt (mprefer-avx128): Redefine the flag as a
	Mask option.
	* config/i386/i386.h (ix86_tune_indices): Add X86_TUNE_AVX128_OPTIMAL
	entry.
	(TARGET_AVX128_OPTIMAL): New definition.
	* config/i386/i386.c (initial_ix86_tune_features): Initialize
	X86_TUNE_AVX128_OPTIMAL entry.
	(ix86_option_override_internal): Enable the generation of the
	128-bit instructions when TARGET_AVX128_OPTIMAL is set.
	(ix86_preferred_simd_mode): Use TARGET_PREFER_AVX128.
	(ix86_autovectorize_vector_sizes): Use TARGET_PREFER_AVX128.
---
 gcc/config/i386/i386.c   | 16
 gcc/config/i386/i386.h   |  4 +++-
 gcc/config/i386/i386.opt |  2 +-
 3 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 014401b..b3434dd 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2089,7 +2089,11 @@ static unsigned int initial_ix86_tune_features[X86_TUNE_LAST] = {
   /* X86_SOFTARE_PREFETCHING_BENEFICIAL: Enable software prefetching at -O3.
      For the moment, the prefetching seems badly tuned for Intel chips.  */
-  m_K6_GEODE | m_AMD_MULTIPLE
+  m_K6_GEODE | m_AMD_MULTIPLE,
+
+  /* X86_TUNE_AVX128_OPTIMAL: Enable 128-bit AVX instruction generation for
+     the auto-vectorizer.  */
+  m_BDVER1
 };

 /* Feature tests against the various architecture variations.
 */

@@ -2623,6 +2627,7 @@ ix86_target_string (int isa, int flags, const char *arch, const char *tune,
     { "-mvzeroupper", MASK_VZEROUPPER },
     { "-mavx256-split-unaligned-load", MASK_AVX256_SPLIT_UNALIGNED_LOAD },
     { "-mavx256-split-unaligned-store", MASK_AVX256_SPLIT_UNALIGNED_STORE },
+    { "-mprefer-avx128", MASK_PREFER_AVX128 },
   };

   const char *opts[ARRAY_SIZE (isa_opts) + ARRAY_SIZE (flag_opts) + 6][2];

@@ -3672,6 +3677,9 @@ ix86_option_override_internal (bool main_args_p)
       if ((x86_avx256_split_unaligned_store & ix86_tune_mask)
	   && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_STORE))
	 target_flags |= MASK_AVX256_SPLIT_UNALIGNED_STORE;
+      /* Enable 128-bit AVX instruction generation for the auto-vectorizer.  */
+      if (TARGET_AVX128_OPTIMAL
+	  && !(target_flags_explicit & MASK_PREFER_AVX128))
+	target_flags |= MASK_PREFER_AVX128;
     }
   else

@@ -34614,7 +34622,7 @@ ix86_preferred_simd_mode (enum machine_mode mode)
       return V2DImode;

     case SFmode:
-      if (TARGET_AVX && !flag_prefer_avx128)
+      if (TARGET_AVX && !TARGET_PREFER_AVX128)
	return V8SFmode;
       else
	return V4SFmode;

@@ -34622,7 +34630,7 @@ ix86_preferred_simd_mode (enum machine_mode mode)
     case DFmode:
       if (!TARGET_VECTORIZE_DOUBLE)
	return word_mode;
-      else if (TARGET_AVX && !flag_prefer_avx128)
+      else if (TARGET_AVX && !TARGET_PREFER_AVX128)
	return V4DFmode;
       else if (TARGET_SSE2)
	return V2DFmode;

@@ -34639,7 +34647,7
Request to backport two -mvzeroupper related patches to 4.6 branch
Hi,

Attached are two patches from gcc 4.7 trunk that we request to backport to the 4.6 branch. They are both related to -mvzeroupper.

1) 0001-Save-the-initial-options-after-checking-vzeroupper.patch

This patch fixes bug 47315, "ICE: in extract_insn, at recog.c:2109 (unrecognizable insn) with -mvzeroupper and __attribute__((target("avx")))". The patch was committed to trunk: 2011-05-23 H.J. Lu <hongjiu...@intel.com>. The bug still exists in gcc 4.6.1; backporting this patch would fix it.

2) 0001--config-i386-i386.c-ix86_reorg-Run-move_or_dele.patch

This patch runs move_or_delete_vzeroupper first, and was committed to trunk: 2011-05-04 Uros Bizjak <ubiz...@gmail.com>.

Is it OK to commit these to the 4.6 branch?

Thanks,
Changpeng

From 0b70e1e33afa25536305f4a228409cf9b4e0eaad Mon Sep 17 00:00:00 2001
From: hjl <hjl@138bc75d-0d04-0410-961f-82ee72b054a4>
Date: Mon, 23 May 2011 16:51:42 +0000
Subject: [PATCH] Save the initial options after checking vzeroupper.

gcc/

2011-05-23  H.J. Lu  <hongjiu...@intel.com>

	PR target/47315
	* config/i386/i386.c (ix86_option_override_internal): Save the
	initial options after checking vzeroupper.

gcc/testsuite/

2011-05-23  H.J. Lu  <hongjiu...@intel.com>

	PR target/47315
	* gcc.target/i386/pr47315.c: New test.

git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@174078 138bc75d-0d04-0410-961f-82ee72b054a4
---
 gcc/ChangeLog                           |  6 ++
 gcc/config/i386/i386.c                  | 11 ++-
 gcc/testsuite/ChangeLog                 |  5 +
 gcc/testsuite/gcc.target/i386/pr47315.c | 10 ++
 4 files changed, 27 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr47315.c

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index a3cb0f1..1d46b04 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,9 @@
+2011-05-23  H.J. Lu  <hongjiu...@intel.com>
+
+	PR target/47315
+	* config/i386/i386.c (ix86_option_override_internal): Save the
+	initial options after checking vzeroupper.
+
 2011-05-23  David Li  <davi...@google.com>

	PR tree-optimization/48988

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 0709be8..854e376 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -4191,11 +4191,6 @@ ix86_option_override_internal (bool main_args_p)
 #endif
     }

-  /* Save the initial options in case the user does function specific
-     options */
-  if (main_args_p)
-    target_option_default_node = target_option_current_node
-      = build_target_option_node ();
-
   if (TARGET_AVX)
     {
       /* When not optimize for size, enable vzeroupper optimization for

@@ -4217,6 +4212,12 @@ ix86_option_override_internal (bool main_args_p)
       /* Disable vzeroupper pass if TARGET_AVX is disabled.  */
       target_flags &= ~MASK_VZEROUPPER;
     }
+
+  /* Save the initial options in case the user does function specific
+     options.  */
+  if (main_args_p)
+    target_option_default_node = target_option_current_node
+      = build_target_option_node ();
 }

 /* Return TRUE if VAL is passed in register with 256bit AVX modes.  */

diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 72aae61..85137d0 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@
+2011-05-23  H.J. Lu  <hongjiu...@intel.com>
+
+	PR target/47315
+	* gcc.target/i386/pr47315.c: New test.
+
 2011-05-23  Jason Merrill  <ja...@redhat.com>

	* g++.dg/cpp0x/lambda/lambda-eh2.C: New.

diff --git a/gcc/testsuite/gcc.target/i386/pr47315.c b/gcc/testsuite/gcc.target/i386/pr47315.c
new file mode 100644
index 000..871d3f1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr47315.c
@@ -0,0 +1,10 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -mvzeroupper" } */
+
+__attribute__ ((__target__ ("avx")))
+float bar (float f) {}
+
+void foo (float f)
+{
+  bar (f);
+}
--
1.6.0.2

From 343f07cbec2d66bebe71e4f48b0403f52ebfe8f9 Mon Sep 17 00:00:00 2001
From: uros <uros@138bc75d-0d04-0410-961f-82ee72b054a4>
Date: Wed, 4 May 2011 17:07:03 +0000
Subject: [PATCH] * config/i386/i386.c (ix86_reorg): Run
 move_or_delete_vzeroupper first.
git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@173383 138bc75d-0d04-0410-961f-82ee72b054a4
---
 gcc/ChangeLog          | 16 ++--
 gcc/config/i386/i386.c |  8
 2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 5412506..ca85616 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,7 @@
+2011-05-04  Uros Bizjak  <ubiz...@gmail.com>
+
+	* config/i386/i386.c (ix86_reorg): Run move_or_delete_vzeroupper first.
+
 2011-05-04  Eric Botcazou  <ebotca...@adacore.com>

	* stor-layout.c (variable_size): Do not issue errors.

@@ -263,9 +267,9 @@
 2011-05-03  Stuart Henderson  <shend...@gcc.gnu.org>

-From Mike Frysinger:
-* config/bfin/bfin.c (bfin_cpus[]): Add 0.4 for
-bf542/bf544/bf547/bf548/bf549.
+	From Mike Frysinger:
+	* config/bfin/bfin.c (bfin_cpus[]): Add 0.4 for
+	bf542/bf544/bf547/bf548/bf549.
RE: Backport AVX256 load/store split patches to gcc 4.6 for performance boost on latest AMD/Intel hardware.
Hi,

Attached are the patches we propose to backport to the gcc 4.6 branch; they are all related to avx256 unaligned load/store splitting. As we mentioned before, the combined effect of these patches is positive on both AMD and Intel CPUs on cpu2006 and polyhedron 2005.

0001-Split-32-byte-AVX-unaligned-load-store.patch
	Initial patch that implements unaligned load/store splitting.

0001-Don-t-assert-unaligned-256bit-load-store.patch
	Remove the assert.

0001-Fix-a-typo-in-mavx256-split-unaligned-store.patch
	Fix a typo.

0002-pr49089-enable-avx256-splitting-unaligned-load-store.patch
	Disable unaligned load splitting for bdver1.

All these patches are in 4.7 trunk. Bootstrap and tests are on-going on the gcc 4.6 branch. Is it OK to commit to the 4.6 branch as long as the tests pass?

Thanks,
Changpeng

From: Jagasia, Harsha
Sent: Monday, June 20, 2011 12:03 PM
To: 'H.J. Lu'
Cc: 'gcc-patches@gcc.gnu.org'; 'hubi...@ucw.cz'; 'ubiz...@gmail.com'; 'hongjiu...@intel.com'; Fang, Changpeng
Subject: RE: Backport AVX256 load/store split patches to gcc 4.6 for performance boost on latest AMD/Intel hardware.

On Mon, Jun 20, 2011 at 9:58 AM, harsha.jaga...@amd.com wrote:

Is it ok to backport patches, with ChangeLogs below, already in trunk to gcc 4.6? These patches are for AVX 256-bit load/store splitting. These patches make a significant performance difference (>=3%) to several CPU2006 and Polyhedron benchmarks on the latest AMD and Intel hardware. If ok, I will post backported patches for commit approval. AMD plans to submit additional patches on AVX-256 load/store splitting to trunk. We will send additional backport requests for those later once they are accepted/committed to trunk.

Since we will make some changes on trunk, I would prefer to do the backport after the trunk change is finished.

Ok, thanks. Adding Changpeng who is working on the trunk changes.
Harsha

From b8cb8d5224d650672add0fb6a74d759ef12e428f Mon Sep 17 00:00:00 2001
From: hjl <hjl@138bc75d-0d04-0410-961f-82ee72b054a4>
Date: Sun, 27 Mar 2011 18:56:00 +0000
Subject: [PATCH] Split 32-byte AVX unaligned load/store.

gcc/

2011-03-27  H.J. Lu  <hongjiu...@intel.com>

	* config/i386/i386.c (flag_opts): Add -mavx256-split-unaligned-load
	and -mavx256-split-unaligned-store.
	(ix86_option_override_internal): Split 32-byte AVX unaligned
	load/store by default.
	(ix86_avx256_split_vector_move_misalign): New.
	(ix86_expand_vector_move_misalign): Use it.
	* config/i386/i386.opt: Add -mavx256-split-unaligned-load and
	-mavx256-split-unaligned-store.
	* config/i386/sse.md (*avx_mov<mode>_internal): Verify unaligned
	256bit load/store.  Generate unaligned store on misaligned memory
	operand.
	(*avx_movu<ssemodesuffix><avxmodesuffix>): Verify unaligned
	256bit load/store.
	(*avx_movdqu<avxmodesuffix>): Likewise.
	* doc/invoke.texi: Document -mavx256-split-unaligned-load and
	-mavx256-split-unaligned-store.

gcc/testsuite/

2011-03-27  H.J. Lu  <hongjiu...@intel.com>

	* gcc.target/i386/avx256-unaligned-load-1.c: New.
	* gcc.target/i386/avx256-unaligned-load-2.c: Likewise.
	* gcc.target/i386/avx256-unaligned-load-3.c: Likewise.
	* gcc.target/i386/avx256-unaligned-load-4.c: Likewise.
	* gcc.target/i386/avx256-unaligned-load-5.c: Likewise.
	* gcc.target/i386/avx256-unaligned-load-6.c: Likewise.
	* gcc.target/i386/avx256-unaligned-load-7.c: Likewise.
	* gcc.target/i386/avx256-unaligned-store-1.c: Likewise.
	* gcc.target/i386/avx256-unaligned-store-2.c: Likewise.
	* gcc.target/i386/avx256-unaligned-store-3.c: Likewise.
	* gcc.target/i386/avx256-unaligned-store-4.c: Likewise.
	* gcc.target/i386/avx256-unaligned-store-5.c: Likewise.
	* gcc.target/i386/avx256-unaligned-store-6.c: Likewise.
	* gcc.target/i386/avx256-unaligned-store-7.c: Likewise.
git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@171578 138bc75d-0d04-0410-961f-82ee72b054a4
---
 gcc/ChangeLog                                    | 22 ++
 gcc/config/i386/i386.c                           | 76 +--
 gcc/config/i386/i386.opt                         |  8 ++
 gcc/config/i386/sse.md                           | 42 ++--
 gcc/doc/invoke.texi                              |  9 ++-
 gcc/testsuite/ChangeLog                          | 17 +
 .../gcc.target/i386/avx256-unaligned-load-1.c    | 19 +
 .../gcc.target/i386/avx256-unaligned-load-2.c    | 29
 .../gcc.target/i386/avx256-unaligned-load-3.c    | 19 +
 .../gcc.target/i386/avx256-unaligned-load-4.c    | 19 +
 .../gcc.target/i386/avx256-unaligned-load-5.c    | 43 +++
 .../gcc.target/i386/avx256-unaligned-load-6.c    | 42 +++
 .../gcc.target/i386/avx256-unaligned-load-7.c    | 60 +++
 .../gcc.target/i386/avx256-unaligned-store-1.c   | 22 ++
 .../gcc.target/i386/avx256-unaligned-store-2.c   | 29
 .../gcc.target/i386/avx256-unaligned-store-3.c   | 22 ++
 gcc/testsuite/gcc.target/i386
RE: [PATCH, i386] Enable -mprefer-avx128 by default for Bulldozer
Is this patch OK to commit to trunk? Also, I would like to backport this patch to the gcc 4.6 branch. Do I have to send a separate request, or can I use this one?

Thanks,
Changpeng

From: Fang, Changpeng
Sent: Friday, June 24, 2011 7:12 PM
To: Jan Hubicka
Cc: Uros Bizjak; gcc-patches@gcc.gnu.org; rguent...@suse.de
Subject: RE: [PATCH, i386] Enable -mprefer-avx128 by default for Bulldozer

Hi,

I have no preference in the tune-feature coding, but I agree with you that it is better to put similar things together. I modified the code following your suggestion. Is it OK to commit this modified patch?

Thanks,
Changpeng

From: Jan Hubicka [hubi...@ucw.cz]
Sent: Thursday, June 23, 2011 6:20 PM
To: Fang, Changpeng
Cc: Uros Bizjak; gcc-patches@gcc.gnu.org; hubi...@ucw.cz; rguent...@suse.de
Subject: Re: [PATCH, i386] Enable -mprefer-avx128 by default for Bulldozer

Hi,

--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2128,6 +2128,9 @@ static const unsigned int x86_avx256_split_unaligned_load
 static const unsigned int x86_avx256_split_unaligned_store
   = m_COREI7 | m_BDVER1 | m_GENERIC;

+static const unsigned int x86_prefer_avx128
+  = m_BDVER1;

What is the reason for stuff like this not to go into initial_ix86_tune_features? I sort of liked them better when they were individual flags, but having the target tuning flags spread across multiple places seems unnecessary.

Honza
RE: [PATCH, i386] Enable -mprefer-avx128 by default for Bulldozer
Hi,

I have no preference in the tune-feature coding, but I agree with you that it is better to put similar things together. I modified the code following your suggestion. Is it OK to commit this modified patch?

Thanks,
Changpeng

From: Jan Hubicka [hubi...@ucw.cz]
Sent: Thursday, June 23, 2011 6:20 PM
To: Fang, Changpeng
Cc: Uros Bizjak; gcc-patches@gcc.gnu.org; hubi...@ucw.cz; rguent...@suse.de
Subject: Re: [PATCH, i386] Enable -mprefer-avx128 by default for Bulldozer

Hi,

--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2128,6 +2128,9 @@ static const unsigned int x86_avx256_split_unaligned_load
 static const unsigned int x86_avx256_split_unaligned_store
   = m_COREI7 | m_BDVER1 | m_GENERIC;

+static const unsigned int x86_prefer_avx128
+  = m_BDVER1;

What is the reason for stuff like this not to go into initial_ix86_tune_features? I sort of liked them better when they were individual flags, but having the target tuning flags spread across multiple places seems unnecessary.

Honza

From a325395439a314f87b3c79a5b9ce79a6a976a710 Mon Sep 17 00:00:00 2001
From: Changpeng Fang <chfang@huainan.(none)>
Date: Wed, 22 Jun 2011 15:03:05 -0700
Subject: [PATCH] Auto-vectorizer generates 128-bit AVX insns by default for bdver1

	* config/i386/i386.opt (mprefer-avx128): Redefine the flag as a
	Mask option.
	* config/i386/i386.h (ix86_tune_indices): Add X86_TUNE_AVX128_OPTIMAL
	entry.
	(TARGET_AVX128_OPTIMAL): New definition.
	* config/i386/i386.c (initial_ix86_tune_features): Initialize
	X86_TUNE_AVX128_OPTIMAL entry.
	(ix86_option_override_internal): Enable the generation of the
	128-bit instructions when TARGET_AVX128_OPTIMAL is set.
	(ix86_preferred_simd_mode): Use TARGET_PREFER_AVX128.
	(ix86_autovectorize_vector_sizes): Use TARGET_PREFER_AVX128.
---
 gcc/config/i386/i386.c   | 16
 gcc/config/i386/i386.h   |  4 +++-
 gcc/config/i386/i386.opt |  2 +-
 3 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 014401b..b3434dd 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2089,7 +2089,11 @@ static unsigned int initial_ix86_tune_features[X86_TUNE_LAST] = {
   /* X86_SOFTARE_PREFETCHING_BENEFICIAL: Enable software prefetching at -O3.
      For the moment, the prefetching seems badly tuned for Intel chips.  */
-  m_K6_GEODE | m_AMD_MULTIPLE
+  m_K6_GEODE | m_AMD_MULTIPLE,
+
+  /* X86_TUNE_AVX128_OPTIMAL: Enable 128-bit AVX instruction generation for
+     the auto-vectorizer.  */
+  m_BDVER1
 };

 /* Feature tests against the various architecture variations.  */

@@ -2623,6 +2627,7 @@ ix86_target_string (int isa, int flags, const char *arch, const char *tune,
     { "-mvzeroupper", MASK_VZEROUPPER },
     { "-mavx256-split-unaligned-load", MASK_AVX256_SPLIT_UNALIGNED_LOAD },
     { "-mavx256-split-unaligned-store", MASK_AVX256_SPLIT_UNALIGNED_STORE },
+    { "-mprefer-avx128", MASK_PREFER_AVX128 },
   };

   const char *opts[ARRAY_SIZE (isa_opts) + ARRAY_SIZE (flag_opts) + 6][2];

@@ -3672,6 +3677,9 @@ ix86_option_override_internal (bool main_args_p)
       if ((x86_avx256_split_unaligned_store & ix86_tune_mask)
	   && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_STORE))
	 target_flags |= MASK_AVX256_SPLIT_UNALIGNED_STORE;
+      /* Enable 128-bit AVX instruction generation for the auto-vectorizer.
	 */
+      if (TARGET_AVX128_OPTIMAL
+	  && !(target_flags_explicit & MASK_PREFER_AVX128))
+	target_flags |= MASK_PREFER_AVX128;
     }
   else

@@ -34614,7 +34622,7 @@ ix86_preferred_simd_mode (enum machine_mode mode)
       return V2DImode;

     case SFmode:
-      if (TARGET_AVX && !flag_prefer_avx128)
+      if (TARGET_AVX && !TARGET_PREFER_AVX128)
	return V8SFmode;
       else
	return V4SFmode;

@@ -34622,7 +34630,7 @@ ix86_preferred_simd_mode (enum machine_mode mode)
     case DFmode:
       if (!TARGET_VECTORIZE_DOUBLE)
	return word_mode;
-      else if (TARGET_AVX && !flag_prefer_avx128)
+      else if (TARGET_AVX && !TARGET_PREFER_AVX128)
	return V4DFmode;
       else if (TARGET_SSE2)
	return V2DFmode;

@@ -34639,7 +34647,7 @@ ix86_preferred_simd_mode (enum machine_mode mode)

 static unsigned int
 ix86_autovectorize_vector_sizes (void)
 {
-  return (TARGET_AVX && !flag_prefer_avx128) ? 32 | 16 : 0;
+  return (TARGET_AVX && !TARGET_PREFER_AVX128) ? 32 | 16 : 0;
 }

 /* Initialize the GCC target structure.  */

diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 8badcbb..d9317ed 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -312,6 +312,7 @@ enum ix86_tune_indices {
   X86_TUNE_OPT_AGU,
   X86_TUNE_VECTORIZE_DOUBLE,
   X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL,
+  X86_TUNE_AVX128_OPTIMAL,
   X86_TUNE_LAST
 };

@@ -410,7 +411,8 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
   ix86_tune_features[X86_TUNE_VECTORIZE_DOUBLE]
 #define TARGET_SOFTWARE_PREFETCHING_BENEFICIAL \
   ix86_tune_features
[PATCH, i386] Enable -mprefer-avx128 by default for Bulldozer
Hi,

This patch enables 128-bit AVX instruction generation for the auto-vectorizer on AMD Bulldozer machines. This enablement gives an additional ~3% improvement on polyhedron 2005 and cpu2006 floating point programs. The patch passed bootstrapping on an x86_64-unknown-linux-gnu system with Bulldozer cores.

Is it OK to commit to trunk and backport to the 4.6 branch?

Thanks,
Changpeng

From b5015593b0b30b14783866ac68c2c5f2e014d206 Mon Sep 17 00:00:00 2001
From: Changpeng Fang <chfang@huainan.(none)>
Date: Wed, 22 Jun 2011 15:03:05 -0700
Subject: [PATCH] Auto-vectorizer generates 128-bit AVX insns by default for bdver1

	* config/i386/i386.opt (mprefer-avx128): Redefine the flag as a
	Mask option.
	* config/i386/i386.c (x86_prefer_avx128): New tune option definition.
	(ix86_option_override_internal): Enable the generation of the
	128-bit instructions when x86_prefer_avx128 is set.
	(ix86_preferred_simd_mode): Use TARGET_PREFER_AVX128.
	(ix86_autovectorize_vector_sizes): Use TARGET_PREFER_AVX128.
---
 gcc/config/i386/i386.c   | 13 ++---
 gcc/config/i386/i386.opt |  2 +-
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 014401b..1f5113f 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2128,6 +2128,9 @@ static const unsigned int x86_avx256_split_unaligned_load
 static const unsigned int x86_avx256_split_unaligned_store
   = m_COREI7 | m_BDVER1 | m_GENERIC;

+static const unsigned int x86_prefer_avx128
+  = m_BDVER1;
+
 /* In case the average insn count for single function invocation is
    lower than this constant, emit fast (but longer) prologue and
    epilogue code.
 */

@@ -2623,6 +2626,7 @@ ix86_target_string (int isa, int flags, const char *arch, const char *tune,
     { "-mvzeroupper", MASK_VZEROUPPER },
     { "-mavx256-split-unaligned-load", MASK_AVX256_SPLIT_UNALIGNED_LOAD },
     { "-mavx256-split-unaligned-store", MASK_AVX256_SPLIT_UNALIGNED_STORE },
+    { "-mprefer-avx128", MASK_PREFER_AVX128 },
   };

   const char *opts[ARRAY_SIZE (isa_opts) + ARRAY_SIZE (flag_opts) + 6][2];

@@ -3672,6 +3676,9 @@ ix86_option_override_internal (bool main_args_p)
       if ((x86_avx256_split_unaligned_store & ix86_tune_mask)
	   && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_STORE))
	 target_flags |= MASK_AVX256_SPLIT_UNALIGNED_STORE;
+      if ((x86_prefer_avx128 & ix86_tune_mask)
+	  && !(target_flags_explicit & MASK_PREFER_AVX128))
+	target_flags |= MASK_PREFER_AVX128;
     }
   else

@@ -34614,7 +34621,7 @@ ix86_preferred_simd_mode (enum machine_mode mode)
       return V2DImode;

     case SFmode:
-      if (TARGET_AVX && !flag_prefer_avx128)
+      if (TARGET_AVX && !TARGET_PREFER_AVX128)
	return V8SFmode;
       else
	return V4SFmode;

@@ -34622,7 +34629,7 @@ ix86_preferred_simd_mode (enum machine_mode mode)
     case DFmode:
       if (!TARGET_VECTORIZE_DOUBLE)
	return word_mode;
-      else if (TARGET_AVX && !flag_prefer_avx128)
+      else if (TARGET_AVX && !TARGET_PREFER_AVX128)
	return V4DFmode;
       else if (TARGET_SSE2)
	return V2DFmode;

@@ -34639,7 +34646,7 @@ ix86_preferred_simd_mode (enum machine_mode mode)

 static unsigned int
 ix86_autovectorize_vector_sizes (void)
 {
-  return (TARGET_AVX && !flag_prefer_avx128) ? 32 | 16 : 0;
+  return (TARGET_AVX && !TARGET_PREFER_AVX128) ? 32 | 16 : 0;
 }

 /* Initialize the GCC target structure.  */

diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index 21e0def..9886b7b 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -388,7 +388,7 @@ Do dispatch scheduling if processor is bdver1 and Haifa scheduling is selected.
 mprefer-avx128
-Target Report Var(flag_prefer_avx128) Init(0)
+Target Report Mask(PREFER_AVX128) SAVE
 Use 128-bit AVX instructions instead of 256-bit AVX instructions in the auto-vectorizer.

 ;; ISA support
--
1.7.0.4
RE: [PATCH, PR 49089] Don't split AVX256 unaligned loads by default on bdver1 and generic
Hi,

I modified the patch as H.J. suggested (patch attached). Is it OK to commit to trunk now?

Thanks,
Changpeng

From: H.J. Lu [hjl.to...@gmail.com]
Sent: Friday, June 17, 2011 5:44 PM
To: Fang, Changpeng
Cc: Richard Guenther; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH, PR 49089] Don't split AVX256 unaligned loads by default on bdver1 and generic

On Fri, Jun 17, 2011 at 3:18 PM, Fang, Changpeng changpeng.f...@amd.com wrote:

Hi, I added AVX256_SPLIT_UNALIGNED_STORE to ix86_tune_indices and put m_COREI7, m_BDVER1 and m_GENERIC as the targets that enable it. Is this OK?

Can you do something similar to how MASK_ACCUMULATE_OUTGOING_ARGS is handled?

Thanks.

H.J.

From 50310fc367348b406fc88d54c3ab54d1a304ad52 Mon Sep 17 00:00:00 2001
From: Changpeng Fang <chfang@huainan.(none)>
Date: Mon, 13 Jun 2011 13:13:32 -0700
Subject: [PATCH 2/2] pr49089: enable avx256 splitting unaligned load/store only when beneficial

	* config/i386/i386.c (avx256_split_unaligned_load): New definition.
	(avx256_split_unaligned_store): New definition.
	(ix86_option_override_internal): Enable avx256 unaligned load(store)
	splitting only when avx256_split_unaligned_load(store) is set.
---
 gcc/config/i386/i386.c | 12 ++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 7b266b9..3bc0b53 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2121,6 +2121,12 @@ static const unsigned int x86_arch_always_fancy_math_387
   = m_PENT | m_ATOM | m_PPRO | m_AMD_MULTIPLE | m_PENT4 | m_NOCONA
     | m_CORE2I7 | m_GENERIC;

+static const unsigned int x86_avx256_split_unaligned_load
+  = m_COREI7 | m_GENERIC;
+
+static const unsigned int x86_avx256_split_unaligned_store
+  = m_COREI7 | m_BDVER1 | m_GENERIC;
+
 /* In case the average insn count for single function invocation is
    lower than this constant, emit fast (but longer) prologue and
    epilogue code.
 */

@@ -4194,9 +4200,11 @@ ix86_option_override_internal (bool main_args_p)
       if (flag_expensive_optimizations
	   && !(target_flags_explicit & MASK_VZEROUPPER))
	 target_flags |= MASK_VZEROUPPER;
-      if (!(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_LOAD))
+      if ((x86_avx256_split_unaligned_load & ix86_tune_mask)
+	  && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_LOAD))
	target_flags |= MASK_AVX256_SPLIT_UNALIGNED_LOAD;
-      if (!(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_STORE))
+      if ((x86_avx256_split_unaligned_store & ix86_tune_mask)
+	  && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_STORE))
	target_flags |= MASK_AVX256_SPLIT_UNALIGNED_STORE;
     }
 }
--
1.7.0.4
RE: Backport AVX256 load/store split patches to gcc 4.6 for performance boost on latest AMD/Intel hardware.
The patch that disables the default setting of unaligned load splitting for bdver1 has been committed to trunk as revision 175230. Here is the patch: http://gcc.gnu.org/ml/gcc-patches/2011-06/msg01518.html.

H.J., is there anything else pending to fix at this moment regarding avx256 load/store splitting? If not, can we backport the set of patches to the 4.6 branch now?

Thanks,
Changpeng

From: Jagasia, Harsha
Sent: Monday, June 20, 2011 12:03 PM
To: 'H.J. Lu'
Cc: 'gcc-patches@gcc.gnu.org'; 'hubi...@ucw.cz'; 'ubiz...@gmail.com'; 'hongjiu...@intel.com'; Fang, Changpeng
Subject: RE: Backport AVX256 load/store split patches to gcc 4.6 for performance boost on latest AMD/Intel hardware.

On Mon, Jun 20, 2011 at 9:58 AM, harsha.jaga...@amd.com wrote:

Is it ok to backport patches, with ChangeLogs below, already in trunk to gcc 4.6? These patches are for AVX 256-bit load/store splitting. These patches make a significant performance difference (>=3%) to several CPU2006 and Polyhedron benchmarks on the latest AMD and Intel hardware. If ok, I will post backported patches for commit approval. AMD plans to submit additional patches on AVX-256 load/store splitting to trunk. We will send additional backport requests for those later once they are accepted/committed to trunk.

Since we will make some changes on trunk, I would prefer to do the backport after the trunk change is finished.

Ok, thanks. Adding Changpeng who is working on the trunk changes.

Harsha
RE: [PATCH, PR 49089] Don't split AVX256 unaligned loads by default on bdver1 and generic
Hi,

I modified the patch to disable unaligned load splitting only for bdver1 at this moment. Unaligned load splitting degrades CFP2006 by 1.3% in geomean for both -mtune=bdver1 and -mtune=generic on Bulldozer. However, we agree with H.J.'s suggestion to determine the optimal optimization sets for modern cpus.

Is it OK to commit the attached patch?

Thanks,
Changpeng

So, is it OK to commit this patch to trunk, and H.J.'s original patch + this one to the 4.6 branch?

I have no problems on -mtune=Bulldozer. But I object to the -mtune=generic change and did suggest a different approach for -mtune=generic.

From 913a31b425759ac3427a365646de866161a7908a Mon Sep 17 00:00:00 2001
From: Changpeng Fang <chfang@huainan.(none)>
Date: Mon, 13 Jun 2011 13:13:32 -0700
Subject: [PATCH 2/2] pr49089: enable avx256 splitting unaligned load only when beneficial

	* config/i386/i386.h (ix86_tune_indices): Introduce
	X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL.
	(TARGET_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL): New definition.
	* config/i386/i386.c (ix86_tune_features): Add entry for
	X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL.
	(ix86_option_override_internal): Enable avx256 unaligned load
	splitting only when TARGET_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL
	is set.
---
 gcc/config/i386/i386.c | 10 --
 gcc/config/i386/i386.h |  3 +++
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 7b266b9..82e6d3e 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2088,7 +2088,12 @@ static unsigned int initial_ix86_tune_features[X86_TUNE_LAST] = {
   /* X86_SOFTARE_PREFETCHING_BENEFICIAL: Enable software prefetching at -O3.
      For the moment, the prefetching seems badly tuned for Intel chips.  */
-  m_K6_GEODE | m_AMD_MULTIPLE
+  m_K6_GEODE | m_AMD_MULTIPLE,
+
+  /* X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL: Enable splitting 256-bit
+     unaligned load.  It hurts the performance on Bulldozer.  We need to
+     re-tune the generic options for current cpus!
	 */
+  m_COREI7 | m_GENERIC
 };

 /* Feature tests against the various architecture variations.  */

@@ -4194,7 +4199,8 @@ ix86_option_override_internal (bool main_args_p)
       if (flag_expensive_optimizations
	   && !(target_flags_explicit & MASK_VZEROUPPER))
	 target_flags |= MASK_VZEROUPPER;
-      if (!(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_LOAD))
+      if (TARGET_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL
+	  && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_LOAD))
	target_flags |= MASK_AVX256_SPLIT_UNALIGNED_LOAD;
       if (!(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_STORE))
	target_flags |= MASK_AVX256_SPLIT_UNALIGNED_STORE;

diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 8badcbb..b2a1bc8 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -312,6 +312,7 @@ enum ix86_tune_indices {
   X86_TUNE_OPT_AGU,
   X86_TUNE_VECTORIZE_DOUBLE,
   X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL,
+  X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL,
   X86_TUNE_LAST
 };

@@ -410,6 +411,8 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
   ix86_tune_features[X86_TUNE_VECTORIZE_DOUBLE]
 #define TARGET_SOFTWARE_PREFETCHING_BENEFICIAL \
   ix86_tune_features[X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL]
+#define TARGET_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL \
+  ix86_tune_features[X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL]

 /* Feature tests against the various architecture variations.  */
 enum ix86_arch_indices {
--
1.7.0.4
RE: [PATCH, PR 49089] Don't split AVX256 unaligned loads by default on bdver1 and generic
> I have no problems on -mtune=Bulldozer. But I object to the -mtune=generic change and did suggest a different approach for -mtune=generic.

Something must be broken in the unaligned load splitting in generic mode. While we lose 1.3% on CFP2006 in geomean by splitting unaligned loads for -mtune=bdver1, splitting unaligned loads in generic mode is KILLING us.

For 459.GemsFDTD (ref) on Bulldozer:
  -Ofast -mavx -mno-avx256-split-unaligned-load:  480s
  -Ofast -mavx:                                  2527s

So, splitting unaligned loads makes the program run 5~6 times slower!

For 434.zeusmp (train run):
  -Ofast -mavx -mno-avx256-split-unaligned-load:  32.5s
  -Ofast -mavx:                                   106s

Other tests are on-going!

Changpeng