RE: [PATCH, i386] Enable -mprefer-avx128 by default for Bulldozer

2011-06-28 Thread Fang, Changpeng
Hi, 

 I re-attached the patch here. Can someone review it?

We would like to commit it to trunk as well as to the 4.6 branch.

Thanks,

Changpeng




From: Fang, Changpeng
Sent: Monday, June 27, 2011 5:42 PM
To: Fang, Changpeng; Jan Hubicka
Cc: Uros Bizjak; gcc-patches@gcc.gnu.org; rguent...@suse.de
Subject: RE: [PATCH, i386] Enable -mprefer-avx128 by default for Bulldozer

Is this patch OK to commit to trunk?

Also, I would like to backport this patch to the gcc 4.6 branch. Do I have to send a
separate request, or can I use this one?

Thanks,

Changpeng





From: Fang, Changpeng
Sent: Friday, June 24, 2011 7:12 PM
To: Jan Hubicka
Cc: Uros Bizjak; gcc-patches@gcc.gnu.org; rguent...@suse.de
Subject: RE: [PATCH, i386] Enable -mprefer-avx128 by default for Bulldozer

Hi,

 I have no preference regarding how the tune features are coded, but I agree with
you that it's better to put similar things together. I modified the code following
your suggestion.

Is it OK to commit this modified patch?

Thanks,

Changpeng




From: Jan Hubicka [hubi...@ucw.cz]
Sent: Thursday, June 23, 2011 6:20 PM
To: Fang, Changpeng
Cc: Uros Bizjak; gcc-patches@gcc.gnu.org; hubi...@ucw.cz; rguent...@suse.de
Subject: Re: [PATCH, i386] Enable -mprefer-avx128 by default for Bulldozer

Hi,
 --- a/gcc/config/i386/i386.c
 +++ b/gcc/config/i386/i386.c
 @@ -2128,6 +2128,9 @@ static const unsigned int x86_avx256_split_unaligned_load
  static const unsigned int x86_avx256_split_unaligned_store
= m_COREI7 | m_BDVER1 | m_GENERIC;

 +static const unsigned int x86_prefer_avx128
 +  = m_BDVER1;

What is the reason for stuff like this not to go into initial_ix86_tune_features?
I sort of liked them better when they were individual flags, but having the
target tuning flags spread across multiple places seems unnecessary.

Honza

From a325395439a314f87b3c79a5b9ce79a6a976a710 Mon Sep 17 00:00:00 2001
From: Changpeng Fang chfang@huainan.(none)
Date: Wed, 22 Jun 2011 15:03:05 -0700
Subject: [PATCH] Auto-vectorizer generates 128-bit AVX insns by default for bdver1

	* config/i386/i386.opt (mprefer-avx128): Redefine the flag as a Mask option.

	* config/i386/i386.h (ix86_tune_indices): Add X86_TUNE_AVX128_OPTIMAL entry.
	(TARGET_AVX128_OPTIMAL): New definition.

	* config/i386/i386.c (initial_ix86_tune_features): Initialize
	X86_TUNE_AVX128_OPTIMAL entry.
	(ix86_option_override_internal): Enable the generation
	of the 128-bit instructions when TARGET_AVX128_OPTIMAL is set.
	(ix86_preferred_simd_mode): Use TARGET_PREFER_AVX128.
	(ix86_autovectorize_vector_sizes): Use TARGET_PREFER_AVX128.
---
 gcc/config/i386/i386.c   |   16 
 gcc/config/i386/i386.h   |4 +++-
 gcc/config/i386/i386.opt |2 +-
 3 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 014401b..b3434dd 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2089,7 +2089,11 @@ static unsigned int initial_ix86_tune_features[X86_TUNE_LAST] = {
   /* X86_SOFTARE_PREFETCHING_BENEFICIAL: Enable software prefetching
  at -O3.  For the moment, the prefetching seems badly tuned for Intel
  chips.  */
-  m_K6_GEODE | m_AMD_MULTIPLE
+  m_K6_GEODE | m_AMD_MULTIPLE,
+
+  /* X86_TUNE_AVX128_OPTIMAL: Enable 128-bit AVX instruction generation for
+ the auto-vectorizer.  */
+  m_BDVER1
 };
 
 /* Feature tests against the various architecture variations.  */
@@ -2623,6 +2627,7 @@ ix86_target_string (int isa, int flags, const char *arch, const char *tune,
 { "-mvzeroupper",			MASK_VZEROUPPER },
 { "-mavx256-split-unaligned-load",	MASK_AVX256_SPLIT_UNALIGNED_LOAD},
 { "-mavx256-split-unaligned-store",	MASK_AVX256_SPLIT_UNALIGNED_STORE},
+{ "-mprefer-avx128",		MASK_PREFER_AVX128},
   };
 
   const char *opts[ARRAY_SIZE (isa_opts) + ARRAY_SIZE (flag_opts) + 6][2];
@@ -3672,6 +3677,9 @@ ix86_option_override_internal (bool main_args_p)
 	  if ((x86_avx256_split_unaligned_store & ix86_tune_mask)
 	      && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_STORE))
 	    target_flags |= MASK_AVX256_SPLIT_UNALIGNED_STORE;
+	  /* Enable 128-bit AVX instruction generation for the auto-vectorizer.  */
+	  if (TARGET_AVX128_OPTIMAL && !(target_flags_explicit & MASK_PREFER_AVX128))
+	    target_flags |= MASK_PREFER_AVX128;
 	}
 }
   else 
@@ -34614,7 +34622,7 @@ ix86_preferred_simd_mode (enum machine_mode mode)
   return V2DImode;
 
 case SFmode:
-  if (TARGET_AVX && !flag_prefer_avx128)
+  if (TARGET_AVX && !TARGET_PREFER_AVX128)
 	return V8SFmode;
   else
 	return V4SFmode;
@@ -34622,7 +34630,7 @@ ix86_preferred_simd_mode (enum machine_mode mode)
 case DFmode:
   if (!TARGET_VECTORIZE_DOUBLE)
 	return word_mode;
-  else if (TARGET_AVX && !flag_prefer_avx128)
+  else if (TARGET_AVX && !TARGET_PREFER_AVX128)
 	return V4DFmode;
   else if (TARGET_SSE2)
 	return V2DFmode;
@@ -34639,7 +34647,7

Request to backport two -mvzeroupper related patches to 4.6 branch

2011-06-28 Thread Fang, Changpeng
Hi, 

Attached are two patches from gcc 4.7 trunk that we request to backport to the
4.6 branch. They are both related to -mvzeroupper.
1)
0001-Save-the-initial-options-after-checking-vzeroupper.patch
This patch fixes bug 47315, ICE: in extract_insn, at recog.c:2109
(unrecognizable insn) with -mvzeroupper and __attribute__((target("avx"))).

The patch was committed to trunk: 2011-05-23  H.J. Lu  hongjiu...@intel.com

The bug still exists in gcc 4.6.1. Backporting this patch would fix it.

2)
0001--config-i386-i386.c-ix86_reorg-Run-move_or_dele.patch
This patch runs move_or_delete_vzeroupper first; it was committed to trunk:
2011-05-04  Uros Bizjak  ubiz...@gmail.com


Is it OK to commit these to the 4.6 branch?

Thanks,

Changpeng

From 0b70e1e33afa25536305f4a228409cf9b4e0eaad Mon Sep 17 00:00:00 2001
From: hjl hjl@138bc75d-0d04-0410-961f-82ee72b054a4
Date: Mon, 23 May 2011 16:51:42 +
Subject: [PATCH] Save the initial options after checking vzeroupper.

gcc/

2011-05-23  H.J. Lu  hongjiu...@intel.com

	PR target/47315
	* config/i386/i386.c (ix86_option_override_internal): Save the
	initial options after checking vzeroupper.

gcc/testsuite/

2011-05-23  H.J. Lu  hongjiu...@intel.com

	PR target/47315
	* gcc.target/i386/pr47315.c: New test.


git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@174078 138bc75d-0d04-0410-961f-82ee72b054a4
---
 gcc/ChangeLog   |6 ++
 gcc/config/i386/i386.c  |   11 ++-
 gcc/testsuite/ChangeLog |5 +
 gcc/testsuite/gcc.target/i386/pr47315.c |   10 ++
 4 files changed, 27 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr47315.c

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index a3cb0f1..1d46b04 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,9 @@
+2011-05-23  H.J. Lu  hongjiu...@intel.com
+
+	PR target/47315
+	* config/i386/i386.c (ix86_option_override_internal): Save the
+	initial options after checking vzeroupper.
+
 2011-05-23  David Li  davi...@google.com
 
 	PR tree-optimization/48988
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 0709be8..854e376 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -4191,11 +4191,6 @@ ix86_option_override_internal (bool main_args_p)
 #endif
}
 
-  /* Save the initial options in case the user does function specific options */
-  if (main_args_p)
-    target_option_default_node = target_option_current_node
-      = build_target_option_node ();
-
   if (TARGET_AVX)
 {
   /* When not optimize for size, enable vzeroupper optimization for
@@ -4217,6 +4212,12 @@ ix86_option_override_internal (bool main_args_p)
   /* Disable vzeroupper pass if TARGET_AVX is disabled.  */
   target_flags &= ~MASK_VZEROUPPER;
 }
+
+  /* Save the initial options in case the user does function specific
+     options.  */
+  if (main_args_p)
+    target_option_default_node = target_option_current_node
+      = build_target_option_node ();
 }
 
 /* Return TRUE if VAL is passed in register with 256bit AVX modes.  */
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 72aae61..85137d0 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@
+2011-05-23  H.J. Lu  hongjiu...@intel.com
+
+	PR target/47315
+	* gcc.target/i386/pr47315.c: New test.
+
 2011-05-23  Jason Merrill  ja...@redhat.com
 
 	* g++.dg/cpp0x/lambda/lambda-eh2.C: New.
diff --git a/gcc/testsuite/gcc.target/i386/pr47315.c b/gcc/testsuite/gcc.target/i386/pr47315.c
new file mode 100644
index 000..871d3f1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr47315.c
@@ -0,0 +1,10 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -mvzeroupper" } */
+
+__attribute__ ((__target__ ("avx")))
+float bar (float f) {}
+
+void foo (float f)
+{
+  bar (f);
+}
-- 
1.6.0.2

From 343f07cbec2d66bebe71e4f48b0403f52ebfe8f9 Mon Sep 17 00:00:00 2001
From: uros uros@138bc75d-0d04-0410-961f-82ee72b054a4
Date: Wed, 4 May 2011 17:07:03 +
Subject: [PATCH] 	* config/i386/i386.c (ix86_reorg): Run move_or_delete_vzeroupper first.

git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@173383 138bc75d-0d04-0410-961f-82ee72b054a4
---
 gcc/ChangeLog  |   16 ++--
 gcc/config/i386/i386.c |8 
 2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 5412506..ca85616 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,7 @@
+2011-05-04  Uros Bizjak  ubiz...@gmail.com
+
+	* config/i386/i386.c (ix86_reorg): Run move_or_delete_vzeroupper first.
+
 2011-05-04  Eric Botcazou  ebotca...@adacore.com
 
 	* stor-layout.c (variable_size): Do not issue errors.
@@ -263,9 +267,9 @@
 
 2011-05-03  Stuart Henderson  shend...@gcc.gnu.org
 
-From Mike Frysinger:
-* config/bfin/bfin.c (bfin_cpus[]): Add 0.4 for
-bf542/bf544/bf547/bf548/bf549.
+	From Mike Frysinger:
+	* config/bfin/bfin.c (bfin_cpus[]): Add 0.4 for
+	bf542/bf544/bf547/bf548/bf549.
 
 

RE: Backport AVX256 load/store split patches to gcc 4.6 for performance boost on latest AMD/Intel hardware.

2011-06-27 Thread Fang, Changpeng
Hi,

Attached are the patches we propose to backport to gcc 4.6 branch which are 
related to avx256 unaligned load/store splitting.
As we mentioned before,  The combined effect of these patches are positive on 
both AMD and Intel CPUs on cpu2006 and
polyhedron 2005.

0001-Split-32-byte-AVX-unaligned-load-store.patch
Initial patch that implements unaligned load/store splitting

0001-Don-t-assert-unaligned-256bit-load-store.patch
Remove the assert.

0001-Fix-a-typo-in-mavx256-split-unaligned-store.patch
Fix a typo.

0002-pr49089-enable-avx256-splitting-unaligned-load-store.patch
Disable unaligned load splitting for bdver1.

All these patches are in 4.7 trunk.
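For reference, here is a rough sketch (not taken from the patches; the exact
instruction selection depends on compiler version and flags) of the kind of
loop the splitting affects:

/* Illustrative example only.  dst and src are not known to be 32-byte
   aligned, so with -O3 -mavx the vectorizer emits unaligned 256-bit
   accesses for this loop.  With -mavx256-split-unaligned-load and
   -mavx256-split-unaligned-store, each unaligned 256-bit access is
   expected to be split into two 128-bit halves (roughly a 128-bit
   vmovups plus vinsertf128/vextractf128) instead of a single 256-bit
   vmovups.  */
void
scale (float *dst, const float *src, int n)
{
  int i;
  for (i = 0; i < n; i++)
    dst[i] = src[i] * 2.0f;
}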

Bootstrap and tests are on-going in gcc 4.6 branch.

Is it OK to commit these to the 4.6 branch as long as the tests pass?

Thanks,

Changpeng 




From: Jagasia, Harsha
Sent: Monday, June 20, 2011 12:03 PM
To: 'H.J. Lu'
Cc: 'gcc-patches@gcc.gnu.org'; 'hubi...@ucw.cz'; 'ubiz...@gmail.com'; 
'hongjiu...@intel.com'; Fang, Changpeng
Subject: RE: Backport AVX256 load/store split patches to gcc 4.6 for 
performance boost on latest AMD/Intel hardware.

 On Mon, Jun 20, 2011 at 9:58 AM, harsha.jaga...@amd.com wrote:
  Is it ok to backport patches, with Changelogs below, already in trunk to gcc
  4.6? These patches are for AVX-256bit load store splitting. These patches
  make a significant performance difference (>= 3%) to several CPU2006 and
  Polyhedron benchmarks on latest AMD and Intel hardware. If ok, I will post
  backported patches for commit approval.

  AMD plans to submit additional patches on AVX-256 load/store splitting to
  trunk. We will send additional backport requests for those later once they
  are accepted/committed to trunk.

 Since we will make some changes on trunk, I would prefer to do
 the backport after the trunk change is finished.

Ok, thanks. Adding Changpeng who is working on the trunk changes.

Harsha

From b8cb8d5224d650672add0fb6a74d759ef12e428f Mon Sep 17 00:00:00 2001
From: hjl hjl@138bc75d-0d04-0410-961f-82ee72b054a4
Date: Sun, 27 Mar 2011 18:56:00 +
Subject: [PATCH] Split 32-byte AVX unaligned load/store.

gcc/

2011-03-27  H.J. Lu  hongjiu...@intel.com

	* config/i386/i386.c (flag_opts): Add -mavx256-split-unaligned-load
	and -mavx256-split-unaligned-store.
	(ix86_option_override_internal): Split 32-byte AVX unaligned
	load/store by default.
	(ix86_avx256_split_vector_move_misalign): New.
	(ix86_expand_vector_move_misalign): Use it.

	* config/i386/i386.opt: Add -mavx256-split-unaligned-load and
	-mavx256-split-unaligned-store.

	* config/i386/sse.md (*avx_mov<mode>_internal): Verify unaligned
	256bit load/store.  Generate unaligned store on misaligned memory
	operand.
	(*avx_movu<ssemodesuffix><avxmodesuffix>): Verify unaligned
	256bit load/store.
	(*avx_movdqu<avxmodesuffix>): Likewise.

	* doc/invoke.texi: Document -mavx256-split-unaligned-load and
	-mavx256-split-unaligned-store.

gcc/testsuite/

2011-03-27  H.J. Lu  hongjiu...@intel.com

	* gcc.target/i386/avx256-unaligned-load-1.c: New.
	* gcc.target/i386/avx256-unaligned-load-2.c: Likewise.
	* gcc.target/i386/avx256-unaligned-load-3.c: Likewise.
	* gcc.target/i386/avx256-unaligned-load-4.c: Likewise.
	* gcc.target/i386/avx256-unaligned-load-5.c: Likewise.
	* gcc.target/i386/avx256-unaligned-load-6.c: Likewise.
	* gcc.target/i386/avx256-unaligned-load-7.c: Likewise.
	* gcc.target/i386/avx256-unaligned-store-1.c: Likewise.
	* gcc.target/i386/avx256-unaligned-store-2.c: Likewise.
	* gcc.target/i386/avx256-unaligned-store-3.c: Likewise.
	* gcc.target/i386/avx256-unaligned-store-4.c: Likewise.
	* gcc.target/i386/avx256-unaligned-store-5.c: Likewise.
	* gcc.target/i386/avx256-unaligned-store-6.c: Likewise.
	* gcc.target/i386/avx256-unaligned-store-7.c: Likewise.


git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@171578 138bc75d-0d04-0410-961f-82ee72b054a4
---
 gcc/ChangeLog  |   22 ++
 gcc/config/i386/i386.c |   76 +--
 gcc/config/i386/i386.opt   |8 ++
 gcc/config/i386/sse.md |   42 ++--
 gcc/doc/invoke.texi|9 ++-
 gcc/testsuite/ChangeLog|   17 +
 .../gcc.target/i386/avx256-unaligned-load-1.c  |   19 +
 .../gcc.target/i386/avx256-unaligned-load-2.c  |   29 
 .../gcc.target/i386/avx256-unaligned-load-3.c  |   19 +
 .../gcc.target/i386/avx256-unaligned-load-4.c  |   19 +
 .../gcc.target/i386/avx256-unaligned-load-5.c  |   43 +++
 .../gcc.target/i386/avx256-unaligned-load-6.c  |   42 +++
 .../gcc.target/i386/avx256-unaligned-load-7.c  |   60 +++
 .../gcc.target/i386/avx256-unaligned-store-1.c |   22 ++
 .../gcc.target/i386/avx256-unaligned-store-2.c |   29 
 .../gcc.target/i386/avx256-unaligned-store-3.c |   22 ++
 .../gcc.target/i386

RE: [PATCH, i386] Enable -mprefer-avx128 by default for Bulldozer

2011-06-27 Thread Fang, Changpeng
Is this patch OK to commit to trunk?

Also, I would like to backport this patch to the gcc 4.6 branch. Do I have to send a
separate request, or can I use this one?

Thanks,

Changpeng





From: Fang, Changpeng
Sent: Friday, June 24, 2011 7:12 PM
To: Jan Hubicka
Cc: Uros Bizjak; gcc-patches@gcc.gnu.org; rguent...@suse.de
Subject: RE: [PATCH, i386] Enable -mprefer-avx128 by default for Bulldozer

Hi,

 I have no preference regarding how the tune features are coded, but I agree with
you that it's better to put similar things together. I modified the code following
your suggestion.

Is it OK to commit this modified patch?

Thanks,

Changpeng




From: Jan Hubicka [hubi...@ucw.cz]
Sent: Thursday, June 23, 2011 6:20 PM
To: Fang, Changpeng
Cc: Uros Bizjak; gcc-patches@gcc.gnu.org; hubi...@ucw.cz; rguent...@suse.de
Subject: Re: [PATCH, i386] Enable -mprefer-avx128 by default for Bulldozer

Hi,
 --- a/gcc/config/i386/i386.c
 +++ b/gcc/config/i386/i386.c
 @@ -2128,6 +2128,9 @@ static const unsigned int x86_avx256_split_unaligned_load
  static const unsigned int x86_avx256_split_unaligned_store
= m_COREI7 | m_BDVER1 | m_GENERIC;

 +static const unsigned int x86_prefer_avx128
 +  = m_BDVER1;

What is the reason for stuff like this not to go into initial_ix86_tune_features?
I sort of liked them better when they were individual flags, but having the
target tuning flags spread across multiple places seems unnecessary.

Honza




RE: [PATCH, i386] Enable -mprefer-avx128 by default for Bulldozer

2011-06-24 Thread Fang, Changpeng
Hi,

 I have no preference regarding how the tune features are coded, but I agree with
you that it's better to put similar things together. I modified the code following
your suggestion.

Is it OK to commit this modified patch?

Thanks,

Changpeng




From: Jan Hubicka [hubi...@ucw.cz]
Sent: Thursday, June 23, 2011 6:20 PM
To: Fang, Changpeng
Cc: Uros Bizjak; gcc-patches@gcc.gnu.org; hubi...@ucw.cz; rguent...@suse.de
Subject: Re: [PATCH, i386] Enable -mprefer-avx128 by default for Bulldozer

Hi,
 --- a/gcc/config/i386/i386.c
 +++ b/gcc/config/i386/i386.c
 @@ -2128,6 +2128,9 @@ static const unsigned int x86_avx256_split_unaligned_load
  static const unsigned int x86_avx256_split_unaligned_store
= m_COREI7 | m_BDVER1 | m_GENERIC;

 +static const unsigned int x86_prefer_avx128
 +  = m_BDVER1;

What is the reason for stuff like this not to go into initial_ix86_tune_features?
I sort of liked them better when they were individual flags, but having the
target tuning flags spread across multiple places seems unnecessary.

Honza

From a325395439a314f87b3c79a5b9ce79a6a976a710 Mon Sep 17 00:00:00 2001
From: Changpeng Fang chfang@huainan.(none)
Date: Wed, 22 Jun 2011 15:03:05 -0700
Subject: [PATCH] Auto-vectorizer generates 128-bit AVX insns by default for bdver1

	* config/i386/i386.opt (mprefer-avx128): Redefine the flag as a Mask option.

	* config/i386/i386.h (ix86_tune_indices): Add X86_TUNE_AVX128_OPTIMAL entry.
	(TARGET_AVX128_OPTIMAL): New definition.

	* config/i386/i386.c (initial_ix86_tune_features): Initialize
	X86_TUNE_AVX128_OPTIMAL entry.
	(ix86_option_override_internal): Enable the generation
	of the 128-bit instructions when TARGET_AVX128_OPTIMAL is set.
	(ix86_preferred_simd_mode): Use TARGET_PREFER_AVX128.
	(ix86_autovectorize_vector_sizes): Use TARGET_PREFER_AVX128.
---
 gcc/config/i386/i386.c   |   16 
 gcc/config/i386/i386.h   |4 +++-
 gcc/config/i386/i386.opt |2 +-
 3 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 014401b..b3434dd 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2089,7 +2089,11 @@ static unsigned int initial_ix86_tune_features[X86_TUNE_LAST] = {
   /* X86_SOFTARE_PREFETCHING_BENEFICIAL: Enable software prefetching
  at -O3.  For the moment, the prefetching seems badly tuned for Intel
  chips.  */
-  m_K6_GEODE | m_AMD_MULTIPLE
+  m_K6_GEODE | m_AMD_MULTIPLE,
+
+  /* X86_TUNE_AVX128_OPTIMAL: Enable 128-bit AVX instruction generation for
+ the auto-vectorizer.  */
+  m_BDVER1
 };
 
 /* Feature tests against the various architecture variations.  */
@@ -2623,6 +2627,7 @@ ix86_target_string (int isa, int flags, const char *arch, const char *tune,
 { "-mvzeroupper",			MASK_VZEROUPPER },
 { "-mavx256-split-unaligned-load",	MASK_AVX256_SPLIT_UNALIGNED_LOAD},
 { "-mavx256-split-unaligned-store",	MASK_AVX256_SPLIT_UNALIGNED_STORE},
+{ "-mprefer-avx128",		MASK_PREFER_AVX128},
   };
 
   const char *opts[ARRAY_SIZE (isa_opts) + ARRAY_SIZE (flag_opts) + 6][2];
@@ -3672,6 +3677,9 @@ ix86_option_override_internal (bool main_args_p)
 	  if ((x86_avx256_split_unaligned_store & ix86_tune_mask)
 	      && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_STORE))
 	    target_flags |= MASK_AVX256_SPLIT_UNALIGNED_STORE;
+	  /* Enable 128-bit AVX instruction generation for the auto-vectorizer.  */
+	  if (TARGET_AVX128_OPTIMAL && !(target_flags_explicit & MASK_PREFER_AVX128))
+	    target_flags |= MASK_PREFER_AVX128;
 	}
 }
   else 
@@ -34614,7 +34622,7 @@ ix86_preferred_simd_mode (enum machine_mode mode)
   return V2DImode;
 
 case SFmode:
-  if (TARGET_AVX && !flag_prefer_avx128)
+  if (TARGET_AVX && !TARGET_PREFER_AVX128)
 	return V8SFmode;
   else
 	return V4SFmode;
@@ -34622,7 +34630,7 @@ ix86_preferred_simd_mode (enum machine_mode mode)
 case DFmode:
   if (!TARGET_VECTORIZE_DOUBLE)
 	return word_mode;
-  else if (TARGET_AVX && !flag_prefer_avx128)
+  else if (TARGET_AVX && !TARGET_PREFER_AVX128)
 	return V4DFmode;
   else if (TARGET_SSE2)
 	return V2DFmode;
@@ -34639,7 +34647,7 @@ ix86_preferred_simd_mode (enum machine_mode mode)
 static unsigned int
 ix86_autovectorize_vector_sizes (void)
 {
-  return (TARGET_AVX && !flag_prefer_avx128) ? 32 | 16 : 0;
+  return (TARGET_AVX && !TARGET_PREFER_AVX128) ? 32 | 16 : 0;
 }
 
 /* Initialize the GCC target structure.  */
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 8badcbb..d9317ed 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -312,6 +312,7 @@ enum ix86_tune_indices {
   X86_TUNE_OPT_AGU,
   X86_TUNE_VECTORIZE_DOUBLE,
   X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL,
+  X86_TUNE_AVX128_OPTIMAL,
 
   X86_TUNE_LAST
 };
@@ -410,7 +411,8 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
 	ix86_tune_features[X86_TUNE_VECTORIZE_DOUBLE]
 #define TARGET_SOFTWARE_PREFETCHING_BENEFICIAL \
 	ix86_tune_features[X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL]

[PATCH, i386] Enable -mprefer-avx128 by default for Bulldozer

2011-06-23 Thread Fang, Changpeng
Hi,

This patch enables 128-bit AVX instruction generation by the auto-vectorizer for
AMD Bulldozer machines. This enablement gives an additional ~3% improvement on
Polyhedron 2005 and CPU2006 floating-point programs.
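
As an illustration (a sketch, not part of the patch), the change only affects
which vector width the auto-vectorizer picks by default when tuning for
Bulldozer; 256-bit vectors can still be requested with -mno-prefer-avx128:

/* Hypothetical example.  With this patch applied,
     gcc -O3 -mavx -mtune=bdver1 foo.c
   is expected to vectorize the loop with 128-bit (xmm) AVX instructions
   (V4SFmode), while
     gcc -O3 -mavx -mtune=bdver1 -mno-prefer-avx128 foo.c
   restores 256-bit (ymm, V8SFmode) vectorization.  */
void
foo (float *restrict a, const float *restrict b, int n)
{
  int i;
  for (i = 0; i < n; i++)
    a[i] = b[i] + 1.0f;
}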

The patch passed bootstrapping on an x86_64-unknown-linux-gnu system with
Bulldozer cores.

Is it OK to commit to trunk and backport to 4.6 branch?

Thanks,

Changpeng

From b5015593b0b30b14783866ac68c2c5f2e014d206 Mon Sep 17 00:00:00 2001
From: Changpeng Fang chfang@huainan.(none)
Date: Wed, 22 Jun 2011 15:03:05 -0700
Subject: [PATCH] Auto-vectorizer generates 128-bit AVX insns by default for bdver1

	* config/i386/i386.opt (mprefer-avx128): Redefine the flag as a Mask option.

	* config/i386/i386.c (x86_prefer_avx128): New tune option definition.
	(ix86_option_override_internal): Enable the generation of the 128-bit
	instructions when x86_prefer_avx128 is set.
	(ix86_preferred_simd_mode): Use TARGET_PREFER_AVX128.
	(ix86_autovectorize_vector_sizes): Use TARGET_PREFER_AVX128.
---
 gcc/config/i386/i386.c   |   13 ++---
 gcc/config/i386/i386.opt |2 +-
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 014401b..1f5113f 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2128,6 +2128,9 @@ static const unsigned int x86_avx256_split_unaligned_load
 static const unsigned int x86_avx256_split_unaligned_store
   = m_COREI7 | m_BDVER1 | m_GENERIC;
 
+static const unsigned int x86_prefer_avx128
+  = m_BDVER1;
+
 /* In case the average insn count for single function invocation is
lower than this constant, emit fast (but longer) prologue and
epilogue code.  */
@@ -2623,6 +2626,7 @@ ix86_target_string (int isa, int flags, const char *arch, const char *tune,
 { "-mvzeroupper",			MASK_VZEROUPPER },
 { "-mavx256-split-unaligned-load",	MASK_AVX256_SPLIT_UNALIGNED_LOAD},
 { "-mavx256-split-unaligned-store",	MASK_AVX256_SPLIT_UNALIGNED_STORE},
+{ "-mprefer-avx128",		MASK_PREFER_AVX128},
   };
 
   const char *opts[ARRAY_SIZE (isa_opts) + ARRAY_SIZE (flag_opts) + 6][2];
@@ -3672,6 +3676,9 @@ ix86_option_override_internal (bool main_args_p)
 	  if ((x86_avx256_split_unaligned_store & ix86_tune_mask)
 	      && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_STORE))
 	    target_flags |= MASK_AVX256_SPLIT_UNALIGNED_STORE;
+	  if ((x86_prefer_avx128 & ix86_tune_mask)
+	      && !(target_flags_explicit & MASK_PREFER_AVX128))
+	    target_flags |= MASK_PREFER_AVX128;
 	}
 }
   else 
@@ -34614,7 +34621,7 @@ ix86_preferred_simd_mode (enum machine_mode mode)
   return V2DImode;
 
 case SFmode:
-  if (TARGET_AVX && !flag_prefer_avx128)
+  if (TARGET_AVX && !TARGET_PREFER_AVX128)
 	return V8SFmode;
   else
 	return V4SFmode;
@@ -34622,7 +34629,7 @@ ix86_preferred_simd_mode (enum machine_mode mode)
 case DFmode:
   if (!TARGET_VECTORIZE_DOUBLE)
 	return word_mode;
-  else if (TARGET_AVX && !flag_prefer_avx128)
+  else if (TARGET_AVX && !TARGET_PREFER_AVX128)
 	return V4DFmode;
   else if (TARGET_SSE2)
 	return V2DFmode;
@@ -34639,7 +34646,7 @@ ix86_preferred_simd_mode (enum machine_mode mode)
 static unsigned int
 ix86_autovectorize_vector_sizes (void)
 {
-  return (TARGET_AVX && !flag_prefer_avx128) ? 32 | 16 : 0;
+  return (TARGET_AVX && !TARGET_PREFER_AVX128) ? 32 | 16 : 0;
 }
 
 /* Initialize the GCC target structure.  */
diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index 21e0def..9886b7b 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -388,7 +388,7 @@ Do dispatch scheduling if processor is bdver1 and Haifa scheduling
 is selected.
 
 mprefer-avx128
-Target Report Var(flag_prefer_avx128) Init(0)
+Target Report Mask(PREFER_AVX128) SAVE
 Use 128-bit AVX instructions instead of 256-bit AVX instructions in the auto-vectorizer.
 
 ;; ISA support
-- 
1.7.0.4



RE: [PATCH, PR 49089] Don't split AVX256 unaligned loads by default on bdver1 and generic

2011-06-20 Thread Fang, Changpeng
Hi,

  I modified the patch as H.J. suggested (patch attached).

Is it OK to commit to trunk now?

Thanks,

Changpeng



From: H.J. Lu [hjl.to...@gmail.com]
Sent: Friday, June 17, 2011 5:44 PM
To: Fang, Changpeng
Cc: Richard Guenther; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH, PR 49089] Don't split AVX256 unaligned loads by default on 
bdver1 and generic

On Fri, Jun 17, 2011 at 3:18 PM, Fang, Changpeng changpeng.f...@amd.com wrote:
 Hi,

  I added AVX256_SPLIT_UNALIGNED_STORE to ix86_tune_indices
 and put m_COREI7, m_BDVER1 and m_GENERIC as the targets that
 enable it.

 Is this OK?

Can you do something similar to how MASK_ACCUMULATE_OUTGOING_ARGS
is handled?

Thanks.

H.J.

From 50310fc367348b406fc88d54c3ab54d1a304ad52 Mon Sep 17 00:00:00 2001
From: Changpeng Fang chfang@huainan.(none)
Date: Mon, 13 Jun 2011 13:13:32 -0700
Subject: [PATCH 2/2] pr49089: enable avx256 splitting unaligned load/store only when beneficial

	* config/i386/i386.c (avx256_split_unaligned_load): New definition.
	  (avx256_split_unaligned_store): New definition.
	  (ix86_option_override_internal): Enable avx256 unaligned load(store)
	  splitting only when avx256_split_unaligned_load(store) is set.
---
 gcc/config/i386/i386.c |   12 ++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 7b266b9..3bc0b53 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2121,6 +2121,12 @@ static const unsigned int x86_arch_always_fancy_math_387
   = m_PENT | m_ATOM | m_PPRO | m_AMD_MULTIPLE | m_PENT4
 | m_NOCONA | m_CORE2I7 | m_GENERIC;
 
+static const unsigned int x86_avx256_split_unaligned_load
+  = m_COREI7 | m_GENERIC;
+
+static const unsigned int x86_avx256_split_unaligned_store
+  = m_COREI7 | m_BDVER1 | m_GENERIC;
+
 /* In case the average insn count for single function invocation is
lower than this constant, emit fast (but longer) prologue and
epilogue code.  */
@@ -4194,9 +4200,11 @@ ix86_option_override_internal (bool main_args_p)
 	  if (flag_expensive_optimizations
 	      && !(target_flags_explicit & MASK_VZEROUPPER))
 	    target_flags |= MASK_VZEROUPPER;
-	  if (!(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_LOAD))
+	  if ((x86_avx256_split_unaligned_load & ix86_tune_mask)
+	      && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_LOAD))
 	    target_flags |= MASK_AVX256_SPLIT_UNALIGNED_LOAD;
-	  if (!(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_STORE))
+	  if ((x86_avx256_split_unaligned_store & ix86_tune_mask)
+	      && !(target_flags_explicit & MASK_AVX256_SPLIT_UNALIGNED_STORE))
 	    target_flags |= MASK_AVX256_SPLIT_UNALIGNED_STORE;
 	}
 }
-- 
1.7.0.4



RE: Backport AVX256 load/store split patches to gcc 4.6 for performance boost on latest AMD/Intel hardware.

2011-06-20 Thread Fang, Changpeng
The patch that disables the default setting of unaligned load splitting for
bdver1 has been committed to trunk as revision 175230.

Here is the patch: http://gcc.gnu.org/ml/gcc-patches/2011-06/msg01518.html.

H.J., is there anything else pending to fix at this moment regarding avx256
load/store splitting?

If not, can we backport the set of patches to the 4.6 branch now?

Thanks,

Changpeng 






From: Jagasia, Harsha
Sent: Monday, June 20, 2011 12:03 PM
To: 'H.J. Lu'
Cc: 'gcc-patches@gcc.gnu.org'; 'hubi...@ucw.cz'; 'ubiz...@gmail.com'; 
'hongjiu...@intel.com'; Fang, Changpeng
Subject: RE: Backport AVX256 load/store split patches to gcc 4.6 for 
performance boost on latest AMD/Intel hardware.

 On Mon, Jun 20, 2011 at 9:58 AM, harsha.jaga...@amd.com wrote:
  Is it ok to backport patches, with Changelogs below, already in trunk to gcc
  4.6? These patches are for AVX-256bit load store splitting. These patches
  make a significant performance difference (>= 3%) to several CPU2006 and
  Polyhedron benchmarks on latest AMD and Intel hardware. If ok, I will post
  backported patches for commit approval.

  AMD plans to submit additional patches on AVX-256 load/store splitting to
  trunk. We will send additional backport requests for those later once they
  are accepted/committed to trunk.

 Since we will make some changes on trunk, I would prefer to do
 the backport after the trunk change is finished.

Ok, thanks. Adding Changpeng who is working on the trunk changes.

Harsha




RE: [PATCH, PR 49089] Don't split AVX256 unaligned loads by default on bdver1 and generic

2011-06-16 Thread Fang, Changpeng
Hi, 

 I modified the patch to disable unaligned load splitting only for bdver1 at this
moment. Unaligned load splitting degrades CFP2006 by 1.3% in geomean for both
-mtune=bdver1 and -mtune=generic on Bulldozer. However, we agree with H.J.'s
suggestion to determine the optimal optimization sets for modern CPUs.

Is it OK to commit the attached patch?

Thanks,

Changpeng



 So, is it OK to commit this patch to trunk, and H.J's original patch + this
 to 4.6 branch?

I have no problems on -mtune=Bulldozer.  But I object to the -mtune=generic
change and did suggest a different approach for -mtune=generic.


.

From 913a31b425759ac3427a365646de866161a7908a Mon Sep 17 00:00:00 2001
From: Changpeng Fang chfang@huainan.(none)
Date: Mon, 13 Jun 2011 13:13:32 -0700
Subject: [PATCH 2/2] pr49089: enable avx256 splitting unaligned load only when beneficial

	* config/i386/i386.h (ix86_tune_indices): Introduce
	  X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL.
	  (TARGET_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL): New definition.

	* config/i386/i386.c (ix86_tune_features): Add entry for
	  X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL.
	  (ix86_option_override_internal): Enable avx256 unaligned load splitting
	  only when TARGET_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL is set.
---
 gcc/config/i386/i386.c |   10 --
 gcc/config/i386/i386.h |3 +++
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 7b266b9..82e6d3e 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2088,7 +2088,12 @@ static unsigned int initial_ix86_tune_features[X86_TUNE_LAST] = {
   /* X86_SOFTARE_PREFETCHING_BENEFICIAL: Enable software prefetching
  at -O3.  For the moment, the prefetching seems badly tuned for Intel
  chips.  */
-  m_K6_GEODE | m_AMD_MULTIPLE
+  m_K6_GEODE | m_AMD_MULTIPLE,
+
+  /* X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL: Enable splitting 256-bit
+ unaligned load.  It hurts the performance on Bulldozer. We need to
+ re-tune the generic options for current cpus!  */
+  m_COREI7 | m_GENERIC
 };
 
 /* Feature tests against the various architecture variations.  */
@@ -4194,7 +4199,8 @@ ix86_option_override_internal (bool main_args_p)
 	  if (flag_expensive_optimizations
 	   !(target_flags_explicit  MASK_VZEROUPPER))
 	target_flags |= MASK_VZEROUPPER;
-	  if (!(target_flags_explicit  MASK_AVX256_SPLIT_UNALIGNED_LOAD))
+	  if (TARGET_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL
+	   !(target_flags_explicit  MASK_AVX256_SPLIT_UNALIGNED_LOAD))
 	target_flags |= MASK_AVX256_SPLIT_UNALIGNED_LOAD;
 	  if (!(target_flags_explicit  MASK_AVX256_SPLIT_UNALIGNED_STORE))
 	target_flags |= MASK_AVX256_SPLIT_UNALIGNED_STORE;
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 8badcbb..b2a1bc8 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -312,6 +312,7 @@ enum ix86_tune_indices {
   X86_TUNE_OPT_AGU,
   X86_TUNE_VECTORIZE_DOUBLE,
   X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL,
+  X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL,
 
   X86_TUNE_LAST
 };
@@ -410,6 +411,8 @@ extern unsigned char ix86_tune_features[X86_TUNE_LAST];
 	ix86_tune_features[X86_TUNE_VECTORIZE_DOUBLE]
 #define TARGET_SOFTWARE_PREFETCHING_BENEFICIAL \
 	ix86_tune_features[X86_TUNE_SOFTWARE_PREFETCHING_BENEFICIAL]
+#define TARGET_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL \
+	ix86_tune_features[X86_TUNE_AVX256_SPLIT_UNALIGNED_LOAD_OPTIMAL]
 
 /* Feature tests against the various architecture variations.  */
 enum ix86_arch_indices {
-- 
1.7.0.4



RE: [PATCH, PR 49089] Don't split AVX256 unaligned loads by default on bdver1 and generic

2011-06-15 Thread Fang, Changpeng
I have no problems on -mtune=Bulldozer.  But I object to the -mtune=generic
change and did suggest a different approach for -mtune=generic.

Something must have been broken for the unaligned load splitting in generic mode.

While we lose 1.3% on CFP2006 in geomean by splitting unaligned loads for
-mtune=bdver1, splitting unaligned loads in generic mode is KILLING us:

For 459.GemsFDTD (ref) on Bulldozer,
 -Ofast -mavx -mno-avx256-split-unaligned-load:  480s
 -Ofast -mavx                                 : 2527s

So, splitting unaligned loads causes the program to run 5~6 times slower!

For 434.zeusmp train run,
 -Ofast -mavx -mno-avx256-split-unaligned-load:  32.5s
 -Ofast -mavx                                 :   106s

Other tests are on-going!


Changpeng.