Re: FW: [PATCH] DATA_ALIGNMENT vs. DATA_ABI_ALIGNMENT (PR target/56564)

2013-06-19 Thread Igor Zamyatin
The change also affects the vectorizer in the AVX case, which can be seen in
the gcc.dg/tree-ssa/loop-19.c test.

After the change the report says:

loop-19_bad.c:16: note: === vect_analyze_data_refs_alignment ===
loop-19_bad.c:16: note: vect_compute_data_ref_alignment:
loop-19_bad.c:16: note: can't force alignment of ref: a[j_9]
loop-19_bad.c:16: note: vect_compute_data_ref_alignment:
loop-19_bad.c:16: note: can't force alignment of ref: c[j_9]

AFAICS the first condition in ix86_data_alignment was true before the
change, so 256 was the return value.

Do we need to tweak this test also?

Thanks,
Igor

 Hi!

 This PR is about DATA_ALIGNMENT macro increasing alignment of some decls for 
 optimization purposes beyond ABI mandated levels.  It is fine to emit the 
 vars aligned as much as we want for optimization purposes, but if we can't be 
 sure that references to that decl bind to the definition we increased the 
 alignment on (e.g. common variables, or -fpic code without hidden visibility, 
 weak vars etc.), we can't assume that alignment.
 As DECL_ALIGN is used both for the alignment emitted for definitions and for 
 the alignment assumed by code referring to them, this patch increases DECL_ALIGN 
 only on decls where decl_binds_to_current_def_p is true; otherwise the extra 
 optimization alignment is applied only when emitting the definition.
 On x86_64, DATA_ALIGNMENT macro was partly an optimization, partly ABI 
 mandated alignment increase, so I've introduced a new macro, 
 DATA_ABI_ALIGNMENT, which is the ABI mandated increase only (on x86-64 I 
 think the only one is that arrays with size 16 bytes or more (and VLAs, but 
 that is not handled by DATA*ALIGNMENT) are at least 16 byte aligned).

 Bootstrapped/regtested on x86_64-linux and i686-linux.  No idea about other 
 targets; I've kept them all using DATA_ALIGNMENT, which is now considered an 
 optimization increase only.  If there is some ABI mandated alignment increase 
 on other targets, that should be done in DATA_ABI_ALIGNMENT as well as 
 DATA_ALIGNMENT.  The patch causes some vectorization regressions (tweaked 
 in the testsuite), especially for common vars where we used to align, say, 
 common arrays to 256 bits rather than the ABI mandated 128 bits, or for -fpic 
 code, but I'm afraid we need to live with that: if you compile another file 
 with, say, icc or some other compiler which doesn't increase alignment beyond 
 the ABI mandated level, and that other file defines the var as non-common, we 
 have wrong-code.
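
 To make the wrong-code scenario concrete, here is a minimal sketch (not part
 of the patch; file names and sizes are made up): GCC used to raise the
 alignment of the common array beyond the psABI minimum and could emit aligned
 vector accesses based on that assumption, while another compiler that only
 honours the psABI may provide the definition the linker actually keeps.

 /* a.c -- compiled by GCC; references here used to assume the raised
    (e.g. 256-bit) alignment and may be vectorized with aligned loads.  */
 double a[1024];                 /* common (tentative) definition  */

 double
 sum (void)
 {
   double s = 0.0;
   for (int i = 0; i < 1024; i++)
     s += a[i];
   return s;
 }

 /* b.c -- compiled by a different compiler; this non-common definition wins
    at link time but is only guaranteed the ABI-mandated 16-byte alignment.  */
 double a[1024] = { 1.0 };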

 2013-06-07  Jakub Jelinek  ja...@redhat.com

 PR target/56564
 * varasm.c (align_variable): Don't use DATA_ALIGNMENT or
 CONSTANT_ALIGNMENT if !decl_binds_to_current_def_p (decl).
 Use DATA_ABI_ALIGNMENT for that case instead if defined.
 (get_variable_align): New function.
 (get_variable_section, emit_bss, emit_common,
 assemble_variable_contents, place_block_symbol): Use
 get_variable_align instead of DECL_ALIGN.
 (assemble_noswitch_variable): Add align argument, use it
 instead of DECL_ALIGN.
 (assemble_variable): Adjust caller.  Use get_variable_align
 instead of DECL_ALIGN.
 * config/i386/i386.h (DATA_ALIGNMENT): Adjust x86_data_alignment
 caller.
 (DATA_ABI_ALIGNMENT): Define.
 * config/i386/i386-protos.h (x86_data_alignment): Adjust prototype.
 * config/i386/i386.c (x86_data_alignment): Add opt argument.  If
 opt is false, only return the psABI mandated alignment increase.
 * doc/tm.texi.in (DATA_ABI_ALIGNMENT): Document.
 * doc/tm.texi: Regenerated.

 * gcc.target/i386/pr56564-1.c: New test.
 * gcc.target/i386/pr56564-2.c: New test.
 * gcc.target/i386/pr56564-3.c: New test.
 * gcc.target/i386/pr56564-4.c: New test.
 * gcc.target/i386/avx256-unaligned-load-4.c: Add -fno-common.
 * gcc.target/i386/avx256-unaligned-store-1.c: Likewise.
 * gcc.target/i386/avx256-unaligned-store-3.c: Likewise.
 * gcc.target/i386/avx256-unaligned-store-4.c: Likewise.
 * gcc.target/i386/vect-sizes-1.c: Likewise.
 * gcc.target/i386/memcpy-1.c: Likewise.
 * gcc.dg/vect/costmodel/i386/costmodel-vect-31.c (tmp): Initialize.
 * gcc.dg/vect/costmodel/x86_64/costmodel-vect-31.c (tmp): Likewise.



Re: [PATCH] DATA_ALIGNMENT vs. DATA_ABI_ALIGNMENT (PR target/56564)

2013-06-19 Thread Igor Zamyatin
 Right, as you did for other cases. It works here as well.

Thanks,
Igor

On Wed, Jun 19, 2013 at 11:05 AM, Jakub Jelinek ja...@redhat.com wrote:
 On Wed, Jun 19, 2013 at 11:01:59AM +0400, Igor Zamyatin wrote:
 The change also affects the vectorizer in the AVX case, which can be seen in
 the gcc.dg/tree-ssa/loop-19.c test.

 After the change the report says:

 loop-19_bad.c:16: note: === vect_analyze_data_refs_alignment ===
 loop-19_bad.c:16: note: vect_compute_data_ref_alignment:
 loop-19_bad.c:16: note: can't force alignment of ref: a[j_9]
 loop-19_bad.c:16: note: vect_compute_data_ref_alignment:
 loop-19_bad.c:16: note: can't force alignment of ref: c[j_9]

 AFAICS the first condition in ix86_data_alignment was true before the
 change, so 256 was the return value.

 Do we need to tweak this test also?

 I'd add -fno-common to the test.

 Jakub
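
For reference, a sketch of how that suggestion could look in the test (the
existing flags in gcc.dg/tree-ssa/loop-19.c may differ; the options shown
here are placeholders, only the added -fno-common is the point):

/* { dg-do compile } */
/* { dg-options "-O3 -fno-common -fdump-tree-optimized" } */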


Re: [Patch wwwdocs] gcc-4.9 changes: mention support of the Intel Silvermont microarchitecture

2013-06-18 Thread Igor Zamyatin
Ping?

On Mon, Jun 10, 2013 at 2:13 PM, Igor Zamyatin izamya...@gmail.com wrote:
 Hi!

 This patch mentions support for the Silvermont architecture in the
 gcc-4.9/changes.html page.

 OK to install?

 Thanks,
 Igor

  Index: htdocs/gcc-4.9/changes.html
 ===
 RCS file: /cvs/gcc/wwwdocs/htdocs/gcc-4.9/changes.html,v
 retrieving revision 1.17
 diff -c -1 -r1.17 changes.html
 *** htdocs/gcc-4.9/changes.html 6 Jun 2013 11:15:24 -   1.17
 --- htdocs/gcc-4.9/changes.html 10 Jun 2013 10:11:24 -
 ***
 *** 177,178 
 --- 177,185 

 + <h3>IA-32/x86-64</h3>
 +   <ul>
 + <li> GCC now supports new Intel microarchitecture named Silvermont
 +   through <code>-march=slm</code>.
 + </li>
 +   </ul>
 +
   <h3 id="rx">RX</h3>


[Patch wwwdocs] gcc-4.9 changes: mention support of the Intel Silvermont microarchitecture

2013-06-10 Thread Igor Zamyatin
Hi!

This patch mentions support for the Silvermont architecture in the
gcc-4.9/changes.html page.

OK to install?

Thanks,
Igor

 Index: htdocs/gcc-4.9/changes.html
===
RCS file: /cvs/gcc/wwwdocs/htdocs/gcc-4.9/changes.html,v
retrieving revision 1.17
diff -c -1 -r1.17 changes.html
*** htdocs/gcc-4.9/changes.html 6 Jun 2013 11:15:24 -   1.17
--- htdocs/gcc-4.9/changes.html 10 Jun 2013 10:11:24 -
***
*** 177,178 
--- 177,185 

+ <h3>IA-32/x86-64</h3>
+   <ul>
+ <li> GCC now supports new Intel microarchitecture named Silvermont
+   through <code>-march=slm</code>.
+ </li>
+   </ul>
+
  <h3 id="rx">RX</h3>


Document Intel Silvermont support in invoke.texi

2013-06-10 Thread Igor Zamyatin
Hi!

Following patch documents Intel Silvermont support.

OK to install?

Thanks,
Igor


Changelog:


2013-06-10  Igor Zamyatin  igor.zamya...@intel.com

* doc/invoke.texi: Document slm.




diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index b7b32f7..e4f1d45 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -13837,6 +13837,10 @@ set support.
 Intel Atom CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3 and SSSE3
 instruction set support.

+@item slm
+Intel Silvermont CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3 and SSSE3,
+SSE4.1, SSE4.2 instruction set support.
+
 @item k6
 AMD K6 CPU with MMX instruction set support.


Re: Document Intel Silvermont support in invoke.texi

2013-06-10 Thread Igor Zamyatin
Please see updated patch.

Is it ok to install?

Thanks,
Igor

Changelog:

2013-06-11  Igor Zamyatin  igor.zamya...@intel.com

* doc/invoke.texi (core-avx2): Document.
(slm): Likewise.
(atom): Updated with MOVBE.


diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index b7b32f7..dd82880 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -13833,10 +13833,19 @@ Intel Core CPU with 64-bit extensions, MMX,
SSE, SSE2, SSE3, SSSE3,
 SSE4.1, SSE4.2, AVX, AES, PCLMUL, FSGSBASE, RDRND and F16C instruction
 set support.

+@item core-avx2
+Intel Core CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, SSSE3,
+SSE4.1, SSE4.2, AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA, BMI, BMI2
+and F16C instruction set support.
+
 @item atom
-Intel Atom CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3 and SSSE3
+Intel Atom CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3 and SSSE3
 instruction set support.

+@item slm
+Intel Silvermont CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, SSSE3,
+SSE4.1 and SSE4.2 instruction set support.
+
 @item k6
 AMD K6 CPU with MMX instruction set support.


On Mon, Jun 10, 2013 at 5:29 PM, Jakub Jelinek ja...@redhat.com wrote:
 On Mon, Jun 10, 2013 at 05:25:36PM +0400, Igor Zamyatin wrote:
 Following patch documents Intel Silvermont support.

 OK to install?

 2013-06-10  Igor Zamyatin  igor.zamya...@intel.com

 * doc/invoke.texi: Document slm.

 diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
 index b7b32f7..e4f1d45 100644
 --- a/gcc/doc/invoke.texi
 +++ b/gcc/doc/invoke.texi
 @@ -13837,6 +13837,10 @@ set support.
  Intel Atom CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3 and SSSE3
  instruction set support.

 +@item slm
 +Intel Silvermont CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3 and SSSE3,
 +SSE4.1, SSE4.2 instruction set support.

 Shouldn't it also list MOVBE (similarly for atom and core-avx2, btver2
 already lists it).

 core-avx2 isn't even documented at all in invoke.texi.

 Jakub


Re: [x86, PATCH 2/2] Enabling of the new Intel microarchitecture Silvermont

2013-05-31 Thread Igor Zamyatin
We do want to use the same register for float_extend.

On Thu, May 30, 2013 at 9:22 PM, Uros Bizjak ubiz...@gmail.com wrote:
 On Thu, May 30, 2013 at 4:25 PM, Yuri Rumyantsev ysrum...@gmail.com wrote:
 Hi All

 Second patch enables several Silvermont uarch features which improve
 performance of the new processor (based on experiments on real SLM
 hardware):
 1. If a 2-source or 3-source LEA is used for non-destructive destination
 purposes, or because the ability to use SCALE is wanted, the use of LEA is
 preferable.
 2. Transformation of FP conversion from memory operands into conversion
 from a register.
 3. A couple of improvements for post-reload scheduling:
 - increase the latency of integer loads and load/store with an exact dependence;
 - simple re-ordering of the top of the ready list: if two instructions
 at the top of the list have the same priority, we consider the instruction
 whose producer(s) were scheduled earlier as the best candidate.

 Bootstrapped and tested on x86_64-unknown-linux-gnu, ok for trunk?

 2013-05-30  Yuri Rumyantsev  yuri.s.rumyant...@intel.com
  Igor Zamyatin  igor.zamya...@intel.com

 Silvermont (SLM) architecture performance tuning.
 * config/i386/i386.h (enum ix86_tune_indices): Add
 X86_TUNE_SPLIT_MEM_OPND_FOR_FP_CONVERTS.
 (TARGET_SPLIT_MEM_OPND_FOR_FP_CONVERTS): New define.

 * config/i386/i386.c (initial_ix86_tune_features)
 X86_TUNE_SPLIT_MEM_OPND_FOR_FP_CONVERTS: Initialize.
 (ix86_lea_outperforms): Handle Silvermont tuning.
 (ix86_avoid_lea_for_add): Add new argument to ix86_lea_outperforms
 call.
 (ix86_use_lea_for_mov): Likewise.
 (ix86_avoid_lea_for_addr): Likewise.
 (ix86_lea_for_add_ok): Likewise.
 (exact_dependency_1): New function.
 (exact_store_load_dependency): Likewise.
 (ix86_adjust_cost): Handle Silvermont tuning.
 (do_reoder_for_imul): Likewise.
 (swap_top_of_ready_list): New function.
 (ix86_sched_reorder): Changed to handle Silvermont tuning.

 * config/i386/i386.md (peepholes that split memory operand in fp
 converts): New

 @@ -24625,9 +24730,9 @@ ix86_sched_reorder(FILE *dump, int
 sched_verbose, rtx *ready, int *pn_ready,

 -  con = DEP_CON (dep);
 -  if (!NONDEBUG_INSN_P (con))
 -continue;
 +  con = DEP_CON (dep);
 +  if (!NONDEBUG_INSN_P (con))
 +continue;

 There are some unnecessary whitespace changes (tabs-spaces) in a
 couple of places throughout the patch, such as in the above lines.

 +(define_peephole2
 +  [(set (match_operand:DF 0 "register_operand")
 +(float_extend:DF
 +  (match_operand:SF 1 "memory_operand")))]
 +  "TARGET_SPLIT_MEM_OPND_FOR_FP_CONVERTS
 +&& optimize_insn_for_speed_p ()
 +&& SSE_REG_P (operands[0])"
 +  [(set (match_dup 2) (match_dup 1))
 +   (set (match_dup 0) (float_extend:DF (match_dup 2)))]
 +{
 +  operands[2] = gen_rtx_REG (SFmode, REGNO (operands[0]));
 +})

 You should use

 (match_scratch:SF 2 "x")

 at the top of the peephole2 pattern, and you will get a free scratch
 register (assuming that it is not necessary to use the same register
 for input and output operand of the float_extend insn).

 Otherwise, the patch looks OK to me.

 Uros.
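
As an illustration of item 2 in the description above (this example is not
from the patch): the peephole targets float-to-double extensions whose source
is a memory operand, as in the hypothetical function below.

/* Hypothetical example.  With the memory operand split, the conversion
   becomes roughly
       movss    (%rdi), %xmm0      ; plain load
       cvtss2sd %xmm0, %xmm0       ; register-to-register conversion
   instead of a single cvtss2sd with a memory operand, which is what the
   Silvermont tuning prefers.  */
double
extend_from_memory (const float *p)
{
  return (double) *p;
}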


Re: [x86, PATCH 2/2] Enabling of the new Intel microarchitecture Silvermont

2013-05-31 Thread Igor Zamyatin
Like this?

On Fri, May 31, 2013 at 3:45 PM, Uros Bizjak ubiz...@gmail.com wrote:
 On Fri, May 31, 2013 at 1:38 PM, Igor Zamyatin izamya...@gmail.com wrote:
 We do want to use the same register for float_extend.

 OK then. Please add a comment for this fact and also, please put
 single-line preparation statements inside double-quotes instead of
 curved braces.

 Uros.

 On Thu, May 30, 2013 at 9:22 PM, Uros Bizjak ubiz...@gmail.com wrote:
 On Thu, May 30, 2013 at 4:25 PM, Yuri Rumyantsev ysrum...@gmail.com wrote:
 Hi All

 Second patch enables several Silvermont uarch features which improve
 performance of the new processor (based on experiments on real SLM
 hardware):
 1. If a 2-source or 3-source LEA is used for non-destructive destination
 purposes, or because the ability to use SCALE is wanted, the use of LEA is
 preferable.
 2. Transformation of FP conversion from memory operands into conversion
 from a register.
 3. A couple of improvements for post-reload scheduling:
 - increase the latency of integer loads and load/store with an exact
 dependence;
 - simple re-ordering of the top of the ready list: if two instructions
 at the top of the list have the same priority, we consider the instruction
 whose producer(s) were scheduled earlier as the best candidate.

 Bootstrapped and tested on x86_64-unknown-linux-gnu, ok for trunk?

 2013-05-30  Yuri Rumyantsev  yuri.s.rumyant...@intel.com
  Igor Zamyatin  igor.zamya...@intel.com

 Silvermont (SLM) architecture performance tuning.
 * config/i386/i386.h (enum ix86_tune_indices): Add
 X86_TUNE_SPLIT_MEM_OPND_FOR_FP_CONVERTS.
 (TARGET_SPLIT_MEM_OPND_FOR_FP_CONVERTS): New define.

 * config/i386/i386.c (initial_ix86_tune_features)
 X86_TUNE_SPLIT_MEM_OPND_FOR_FP_CONVERTS: Initialize.
 (ix86_lea_outperforms): Handle Silvermont tuning.
 (ix86_avoid_lea_for_add): Add new argument to ix86_lea_outperforms
 call.
 (ix86_use_lea_for_mov): Likewise.
 (ix86_avoid_lea_for_addr): Likewise.
 (ix86_lea_for_add_ok): Likewise.
 (exact_dependency_1): New function.
 (exact_store_load_dependency): Likewise.
 (ix86_adjust_cost): Handle Silvermont tuning.
 (do_reoder_for_imul): Likewise.
 (swap_top_of_ready_list): New function.
 (ix86_sched_reorder): Changed to handle Silvermont tuning.

 * config/i386/i386.md (peepholes that split memory operand in fp
 converts): New

 @@ -24625,9 +24730,9 @@ ix86_sched_reorder(FILE *dump, int
 sched_verbose, rtx *ready, int *pn_ready,

 -  con = DEP_CON (dep);
 -  if (!NONDEBUG_INSN_P (con))
 -continue;
 +  con = DEP_CON (dep);
 +  if (!NONDEBUG_INSN_P (con))
 +continue;

 There are some unnecessary whitespace changes (tabs-spaces) in a
 couple of places throughout the patch, such as in the above lines.

 +(define_peephole2
 +  [(set (match_operand:DF 0 "register_operand")
 +(float_extend:DF
 +  (match_operand:SF 1 "memory_operand")))]
 +  "TARGET_SPLIT_MEM_OPND_FOR_FP_CONVERTS
 +&& optimize_insn_for_speed_p ()
 +&& SSE_REG_P (operands[0])"
 +  [(set (match_dup 2) (match_dup 1))
 +   (set (match_dup 0) (float_extend:DF (match_dup 2)))]
 +{
 +  operands[2] = gen_rtx_REG (SFmode, REGNO (operands[0]));
 +})

 You should use

 (match_scratch:SF 2 "x")

 at the top of the peephole2 pattern, and you will get a free scratch
 register (assuming that it is not necessary to use the same register
 for input and output operand of the float_extend insn).

 Otherwise, the patch looks OK to me.

 Uros.


gcc_slm_patch2_3_review.patch
Description: Binary data


Re: [patch] RFC: ix86 / x86_64 register pressure aware scheduling

2013-04-17 Thread Igor Zamyatin
These changes are what we tried here at Intel after a bunch of
changes which made the pre-alloc scheduler more stable. We benchmarked
both register pressure algorithms and the overall result was not that
promising.

We saw a number of regressions, e.g. for the option set -mavx -O3 -funroll-loops
-ffast-math -march=corei7 (for SPEC2000 not only lucas but also applu
regressed). The overall gain is negative even for x86_64. For 32 bits the
picture was worse, if I remember correctly.

In general we have doubts that this feature is good for an OOO machine.

Thanks,
Igor

-Original Message-
From: gcc-patches-ow...@gcc.gnu.org
[mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Steven Bosscher
Sent: Monday, April 15, 2013 11:34 PM
To: GCC Patches
Cc: H.J. Lu; Uros Bizjak; Jan Hubicha
Subject: [patch] RFC: ix86 / x86_64 register pressure aware scheduling

Hello,

The attached patch enables register pressure aware scheduling for the
ix86 and x86_64 targets. It uses the optimistic algorithm to avoid
being overly conservative.

This is the same as what other CISCy targets, like s390, also do.

The motivation for this patch is the excessive spilling I've observed
in a few test cases with relatively large basic blocks, e.g.
encryption algorithms and codecs. The patch passes bootstrap+testing
on x86_64-unknown-linux-gnu and i686-unknown-linux-gnu, with a few new
failures due to PR56950.

Off-list, Uros, Honza and others have already looked at the patch and
benchmarked it. For x86_64 there is an overall improvement for SPEC2k
except that lucas regresses, but such a preliminary result is IMHO
very promising.

Comments/suggestions welcome :-)

Ciao!
Steven
* common/config/i386/i386-common.c (ix86_option_optimization_table):
Do not disable insns scheduling.  Enable register pressure aware
scheduling.
* config/i386/i386.c (ix86_option_override): Use the alternative,
optimistic scheduling-pressure algorithm by default.

Index: common/config/i386/i386-common.c
===
--- common/config/i386/i386-common.c(revision 197941)
+++ common/config/i386/i386-common.c(working copy)
@@ -707,9 +707,15 @@ static const struct default_options ix86
   {
 /* Enable redundant extension instructions removal at -O2 and higher.  */
 { OPT_LEVELS_2_PLUS, OPT_free, NULL, 1 },
-/* Turn off -fschedule-insns by default.  It tends to make the
-   problem with not enough registers even worse.  */
-{ OPT_LEVELS_ALL, OPT_fschedule_insns, NULL, 0 },
+/* Enable -fsched-pressure by default for all optimization levels.
+   Before SCHED_PRESSURE_MODEL register-pressure aware schedule was
+   available, -fschedule-insns was turned off completely by default for
+   this port, because scheduling before register allocation tends to
+   make the problem with not enough registers even worse.  However,
+   for very long basic blocks the scheduler can help bring register
+   pressure down significantly, and SCHED_PRESSURE_MODEL is still
+   conservative enough to avoid creating excessive register pressure.  */
+{ OPT_LEVELS_ALL, OPT_fsched_pressure, NULL, 1 },
 
 #ifdef SUBTARGET_OPTIMIZATION_OPTIONS
 SUBTARGET_OPTIMIZATION_OPTIONS,
Index: config/i386/i386.c
===
--- config/i386/i386.c  (revision 197941)
+++ config/i386/i386.c  (working copy)
@@ -3936,6 +3936,10 @@ ix86_option_override (void)
 
   ix86_option_override_internal (true);
 
+  /* Use the alternative scheduling-pressure algorithm by default.  */
+  maybe_set_param_value (PARAM_SCHED_PRESSURE_ALGORITHM, 2,
+global_options.x_param_values,
+global_options_set.x_param_values);
 
   /* This needs to be done at start up.  It's convenient to do it here.  */
   register_pass (insert_vzeroupper_info);


[PATCH, x86, AVX2] FP reassociation enabling for AVX2 targets

2013-02-18 Thread Igor Zamyatin
Hi!

This small change enables FP reassociation for AVX2 processors. This
gives ~+1.5% in the performance geomean for the SPEC2006 FP tests.

Bootstrapped/regtested on x86_64-linux, ok for trunk?


2013-02-14  Igor Zamyatin  igor.zamya...@intel.com

* config/i386/i386.c (initial_ix86_tune_features): Turn on fp
reassociation for avx2 targets.



diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index caf4894..cb84866 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2021,7 +2021,7 @@ static unsigned int
initial_ix86_tune_features[X86_TUNE_LAST] = {

   /* X86_TUNE_REASSOC_FP_TO_PARALLEL: Try to produce parallel computations
  during reassociation of fp computation.  */
-  m_ATOM,
+  m_ATOM | m_HASWELL,

   /* X86_TUNE_GENERAL_REGS_SSE_SPILL: Try to spill general regs to SSE
  regs instead of memory.  */
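
As an illustration (not taken from the patch), the tuning flag above allows
the reassociation pass to break a long dependent chain of FP additions into
independent partial sums that can execute in parallel; FP reassociation still
requires -ffast-math (or -fassociative-math) to be legal.

/* Hypothetical example.  */
double
dot (const double *a, const double *b, int n)
{
  double s = 0.0;
  for (int i = 0; i < n; i++)
    s += a[i] * b[i];     /* serial chain: each add depends on the previous  */
  return s;
}

/* With reassociation to a parallel form the effect is roughly:
   s0 += a[i] * b[i]; s1 += a[i+1] * b[i+1]; ...; s = s0 + s1 + ...;  */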


Re: [GCC 4.8 wwwdocs] PATCH: Mention several user-visible changes for x86

2013-02-17 Thread Igor Zamyatin
Gerald,

Thanks a lot for your remarks!

Below is updated patch which will be checked in.


Thanks,
Igor

On Mon, Feb 18, 2013 at 3:07 AM, Gerald Pfeifer ger...@pfeifer.com wrote:
 On Fri, 15 Feb 2013, Igor Zamyatin wrote:
 Is it ok for wwwdocs?

 Index: htdocs/gcc-4.8/changes.html
 ===
 + <li>Support for the new Intel processor codename Broadwell with RDSEED,
 + ADCX, ADOX, PREFETCHW is available through <code>-madx</code>,
 + <code>-mprfchw</code>, <code>-mrdseed</code>.

 Can you make this <code>RDSEED</code>, <code>... and so forth?
 (This is a bit borderline, in that one could also see this as
 more general references, but usually we mark those up.)

 And ...through the ... command-line options.?

 + <li> Support for Intel RTM and HLE intrinsics, built-in

 the ... intrinsics (and same below for instruction sets)

 + <li> x86 backend was improved to allow option
 <code>-fschedule-insns</code> to work reliably.

 The x86 backend has been improved...

 + This option can be used to schedule instructions better and can
 lead to improved performance in certain cases.

 This line is quite long, can you break lines around 76 columns?

 And, let's be a bit more brave and omit either can or in certain
 cases.  Otherwise this may sound too unlikely. :-)

 The patch is fine with changes along the lines described above.

 Thanks,
 Gerald

Index: htdocs/gcc-4.8/changes.html
  ===
  RCS file: /cvs/gcc/wwwdocs/htdocs/gcc-4.8/changes.html,v
  retrieving revision 1.95
  diff -c -r1.95 changes.html
  *** htdocs/gcc-4.8/changes.html 11 Feb 2013 15:12:58 -  1.95
  --- htdocs/gcc-4.8/changes.html 12 Feb 2013 15:10:41 -
  ***
  *** 460,465 
  --- 460,471 
wrong results.  You must build all
    modules with <code>-mpreferred-stack-boundary=3</code>, including any
    libraries.  This includes the system libraries and startup modules.</li>
  + <li>Support for the new Intel processor codename Broadwell with
  + <code>RDSEED</code>, <code>ADCX</code>, <code>ADOX</code>,
  + <code>PREFETCHW</code> is available through <code>-madx</code>,
  + <code>-mprfchw</code>, <code>-mrdseed</code> command-line options.
  + </li>
  + <li> Support for the Intel RTM and HLE intrinsics, built-in
  + functions and code generation is available via <code>-mrtm</code> and
  + <code>-mhle</code>.
  + </li>
  + <li> Support for the Intel FXSR, XSAVE and XSAVEOPT instruction
  + sets. Intrinsics and built-in functions are available
  + via <code>-mfxsr</code>, <code>-mxsave</code> and
  + <code>-mxsaveopt</code> respectively.
  + </li>
    <li> New built-in functions to detect run-time CPU type and ISA:
    <ul>
  <li>A built-in function <code>__builtin_cpu_is</code> has been added to
  ***
  *** 524,529 
  --- 530,538 
    <a href="http://gcc.gnu.org/wiki/FunctionMultiVersioning">wiki</a> for more
    information.
    </li>
  + <li> The x86 backend has been improved to allow option
  + <code>-fschedule-insns</code> to work reliably.
  + This option can be used to schedule instructions better and leads to
  + improved performance in certain cases.
  + </li>
    <li> Windows MinGW-w64 targets (<code>*-w64-mingw*</code>)
  require at least r5437 from the Mingw-w64 trunk. </li>
    </ul>


Re: [GCC 4.8 wwwdocs] PATCH: Mention several user-visible changes for x86

2013-02-15 Thread Igor Zamyatin
Gerald,

Is it ok for wwwdocs?

Thanks,
Igor

On Wed, Feb 13, 2013 at 3:21 PM, Uros Bizjak ubiz...@gmail.com wrote:
 On Wed, Feb 13, 2013 at 12:17 PM, Igor Zamyatin izamya...@gmail.com wrote:

 Please also mention new -mfxsr, -mxsave and -mxsaveopt options.

   li New built-in functions to detect run-time CPU type and ISA:
   ul
 liA built-in function code__builtin_cpu_is/code has been 
 added to
 ***
 *** 524,529 
 --- 530,538 
   a href=http://gcc.gnu.org/wiki/FunctionMultiVersioning;wiki/a
 for more
   information.
   /li
 + li Problem with instability of pre-reload scheduler on x86
 targets was fixed. Now option -fschedule-insn
 + can be used loosely to reach better performance.

 used loosely in what sense?

 Thanks,
 Uros.

 Updated version. Does it look ok?

 Looks OK to me. I have CC'd Gerald for a grammar and style check.

 Thanks,
 Uros.



 Thanks,
 Igor

 Index: htdocs/gcc-4.8/changes.html
 ===
 RCS file: /cvs/gcc/wwwdocs/htdocs/gcc-4.8/changes.html,v
 retrieving revision 1.95
 diff -c -r1.95 changes.html
 *** htdocs/gcc-4.8/changes.html 11 Feb 2013 15:12:58 -  1.95
 --- htdocs/gcc-4.8/changes.html 12 Feb 2013 15:10:41 -
 ***
 *** 460,465 
 --- 460,471 
   wrong results.  You must build all
    modules with <code>-mpreferred-stack-boundary=3</code>, including any
    libraries.  This includes the system libraries and startup modules.</li>
  + <li>Support for the new Intel processor codename Broadwell with RDSEED,
  + ADCX, ADOX, PREFETCHW is available through <code>-madx</code>,
  + <code>-mprfchw</code>, <code>-mrdseed</code>.
  + </li>
  + <li> Support for Intel RTM and HLE intrinsics, built-in
  + functions and code generation is available via <code>-mrtm</code> and
  + <code>-mhle</code>.
  + </li>
  + <li> Support for Intel FXSR, XSAVE and XSAVEOPT instruction
  + sets. Intrinsics and built-in functions are available
  + via <code>-mfxsr</code>, <code>-mxsave</code> and
  + <code>-mxsaveopt</code> respectively.
  + </li>
    <li> New built-in functions to detect run-time CPU type and ISA:
    <ul>
  <li>A built-in function <code>__builtin_cpu_is</code> has been added to
  ***
  *** 524,529 
  --- 530,538 
    <a href="http://gcc.gnu.org/wiki/FunctionMultiVersioning">wiki</a> for more
    information.
    </li>
  + <li> x86 backend was improved to allow option
  + <code>-fschedule-insns</code> to work reliably.
  + This option can be used to schedule instructions better and can
  + lead to improved performance in certain cases.
  + </li>
    <li> Windows MinGW-w64 targets (<code>*-w64-mingw*</code>)
  require at least r5437 from the Mingw-w64 trunk. </li>
  </ul>


Re: [GCC 4.8 changes] PATCH: Mention several user-visible changes for x86

2013-02-13 Thread Igor Zamyatin

 Please also mention new -mfxsr, -mxsave and -mxsaveopt options.

   li New built-in functions to detect run-time CPU type and ISA:
   ul
 liA built-in function code__builtin_cpu_is/code has been added 
 to
 ***
 *** 524,529 
 --- 530,538 
   a href=http://gcc.gnu.org/wiki/FunctionMultiVersioning;wiki/a
 for more
   information.
   /li
 + li Problem with instability of pre-reload scheduler on x86
 targets was fixed. Now option -fschedule-insn
 + can be used loosely to reach better performance.

 used loosely in what sense?

 Thanks,
 Uros.

Updated version. Does it look ok?


Thanks,
Igor

Index: htdocs/gcc-4.8/changes.html
===
RCS file: /cvs/gcc/wwwdocs/htdocs/gcc-4.8/changes.html,v
retrieving revision 1.95
diff -c -r1.95 changes.html
*** htdocs/gcc-4.8/changes.html 11 Feb 2013 15:12:58 -  1.95
--- htdocs/gcc-4.8/changes.html 12 Feb 2013 15:10:41 -
***
*** 460,465 
--- 460,471 
  wrong results.  You must build all
  modules with <code>-mpreferred-stack-boundary=3</code>, including any
  libraries.  This includes the system libraries and startup modules.</li>
+ <li>Support for the new Intel processor codename Broadwell with RDSEED,
+ ADCX, ADOX, PREFETCHW is available through <code>-madx</code>,
+ <code>-mprfchw</code>, <code>-mrdseed</code>.
+ </li>
+ <li> Support for Intel RTM and HLE intrinsics, built-in
+ functions and code generation is available via <code>-mrtm</code> and
+ <code>-mhle</code>.
+ </li>
+ <li> Support for Intel FXSR, XSAVE and XSAVEOPT instruction
+ sets. Intrinsics and built-in functions are available
+ via <code>-mfxsr</code>, <code>-mxsave</code> and
+ <code>-mxsaveopt</code> respectively.
+ </li>
  <li> New built-in functions to detect run-time CPU type and ISA:
  <ul>
<li>A built-in function <code>__builtin_cpu_is</code> has been added to
***
*** 524,529 
--- 530,538 
  <a href="http://gcc.gnu.org/wiki/FunctionMultiVersioning">wiki</a> for more
  information.
  </li>
+ <li> x86 backend was improved to allow option
+ <code>-fschedule-insns</code> to work reliably.
+ This option can be used to schedule instructions better and can
+ lead to improved performance in certain cases.
+ </li>
  <li> Windows MinGW-w64 targets (<code>*-w64-mingw*</code>)
require at least r5437 from the Mingw-w64 trunk. </li>
</ul>


[GCC 4.8 changes] PATCH: Mention several user-visible changes for x86

2013-02-12 Thread Igor Zamyatin
Hi,

This patch updates GCC 4.8 changes.html to mention Broadwell's
features, RTM and HLE support and fixed pre-reload scheduler.

OK to commit?


Thanks,
Igor

Index: htdocs/gcc-4.8/changes.html
===
RCS file: /cvs/gcc/wwwdocs/htdocs/gcc-4.8/changes.html,v
retrieving revision 1.95
diff -c -r1.95 changes.html
*** htdocs/gcc-4.8/changes.html 11 Feb 2013 15:12:58 -  1.95
--- htdocs/gcc-4.8/changes.html 12 Feb 2013 15:10:41 -
***
*** 460,465 
--- 460,471 
  wrong results.  You must build all
  modules with <code>-mpreferred-stack-boundary=3</code>, including any
  libraries.  This includes the system libraries and startup modules.</li>
+ <li>Support for the new Intel processor codename Broadwell with RDSEED,
+ ADCX, ADOX, PREFETCHW is available through <code>-madx</code>,
+ <code>-mprfchw</code>, <code>-mrdseed</code>.
+ </li>
+ <li> Support for Intel RTM and HLE intrinsics, built-in
+ functions and code generation is available via -mrtm and -mhle.
+ </li>
  <li> New built-in functions to detect run-time CPU type and ISA:
  <ul>
<li>A built-in function <code>__builtin_cpu_is</code> has been added to
***
*** 524,529 
--- 530,538 
  <a href="http://gcc.gnu.org/wiki/FunctionMultiVersioning">wiki</a> for more
  information.
  </li>
+ <li> Problem with instability of pre-reload scheduler on x86
+ targets was fixed. Now option -fschedule-insn
+ can be used loosely to reach better performance.
+ </li>
  <li> Windows MinGW-w64 targets (<code>*-w64-mingw*</code>)
require at least r5437 from the Mingw-w64 trunk. </li>
</ul>


Re: [PATCH] Fix instability of -fschedule-insn for x86

2012-10-01 Thread Igor Zamyatin
We also plan to test these changes along with LRA

On Sun, Sep 30, 2012 at 4:33 PM, Uros Bizjak ubiz...@gmail.com wrote:
 On Tue, Sep 18, 2012 at 1:31 PM, Uros Bizjak ubiz...@gmail.com wrote:

 This patch aims to fix all stability issues related to using the first
 scheduler in gcc for the x86 target (there are several reported issues
 related to this problem).

 The main idea of this activity is mostly to give the user a possibility to
 safely turn on the first scheduler for their code. In some cases this can
 positively affect performance, especially for the in-order Atom.

 Below is short description of proposed changes.

 2012-09-18  Yuri Rumyantsev  ysrum...@gmail.com

 * config/i386/i386.c (ix86_dep_by_shift_count_body) : Add
 check on reload_completed since it can be invoked before
 register allocation phase in 1st scheduler.
 (ia32_multipass_dfa_lookahead) : Do not use dfa_lookahead for 1st
 Scheduler to save compile time.
 (ix86_sched_reorder) : Do not perform ready list reordering for 1st
 Scheduler to save compile time.
 (insn_is_function_arg) : New function. Returns true if lhs of insn 
 is
 HW function argument register.
 (add_parameter_dependencies) : New function. Add output dependencies
 for chain of function adjacent arguments if only there is a move to
 likely spilled HW registers. Return first argument if at least one
 dependence was added or NULL otherwise.
 (avoid_func_arg_motion) : New function. Add output or anti 
 dependency
 from insn to first_arg to restrict code motion.
 (add_dependee_for_func_arg) : New function. Avoid cross block 
 motion of
 function argument through adding dependency from the first non-jump
 insn in bb.
 (ix86_dependencies_evaluation_hook) : New function. Hook for 
 schedule1:
 avoid motion of function arguments passed in passed in likely 
 spilled
 HW registers.
 (ix86_adjust_priority) : New function. Hook for schedule1: set 
 priority
 of moves from likely spilled HW registers to maximum to schedule 
 them
 as soon as possible.
 (ix86_sched_init_global): Do not perform multipass scheduling for 
 1st
 Scheduler to save compile time.

 I would kindly ask scheduler expert to review the patch from the
 scheduler functionality POV.

 I have received opinion from Vladimir from off-line discussion, quoted below:

 --quote--
 I think, it is ok.

   Switching off first cycle multipass scheduling is ok.  It is mostly
 useful when the order of insns issued on the same cycle is important
  (mostly VLIW or quasi-VLIW processors).

   Other solutions are necessary to decrease spills and avoid reload
 crash (can not find a register in a class) when the 1st insn
 scheduling is on.  I don't think it fully avoids possibility of the
 reload crashes but it takes into account most of cases resulting in
 the crashes and makes the crash possibility really negligible.
 Register pressure sensitive insn scheduling decreased the possibility.
  This patch will make it negligible.  And LRA will solve all the rest
 cases of the crashes.

    I don't like a bit the absence of freedom in moving argument insns with
  likely spilled hard-regs relative to each other, as they are chained in the
  original order, but it is debatable because it still decreases the
  possibility of spills.

   In overall, the patch is ok for me.
 --/quote--

 Based on this opinion, the patch is OK for mainline, if there are no
 objections from other x86 maintainers in the next couple of days
 (48h). However, please watch for possible fallout from the patch,
 compile-time ICEs and performance problems. x86 and scheduler didn't
 play well together in the past, but your patch and (in the near
 future) LRA seems to fix all these problems.

 Thanks,
 Uros.


[Patch, testsuite] Changes in tests for the case when long double size is equal double size

2012-09-24 Thread Igor Zamyatin
Hi All!

After a recent commit
(http://gcc.gnu.org/ml/gcc-cvs/2012-08/msg00576.html, patch -
http://gcc.gnu.org/ml/gcc-patches/2012-08/msg01512.html), which sets
the size of long double equal to the size of double for Bionic, several
tests required some adjustment to pass on Android.

Tested both for Linux and Android. Ok to commit?


Thanks,
Igor

Changelog:

gcc/testsuite

2012-09-24  Igor Zamyatin  igor.zamya...@intel.com

Test changes for the case when long double size is equal to double size
* gcc.target/i386/20030217-1.c: Added check for
large_long_double effective target.
* gcc.target/i386/387-3.c: Likewise.
* gcc.target/i386/387-4.c: Likewise.
* gcc.target/i386/pr36578-1.c: Likewise.
* gcc.target/i386/excess-precision-1.c: Added new code for the case
when long double size is equal double size.
* gcc.target/i386/excess-precision-1.c: Likewise.
* gcc.target/i386/pr36578-2.c: Likewise.
* gcc.target/i386/20030217-2.c: New testcase.
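
For reference, a sketch of the kind of effective-target guard mentioned above
(the test body here is only a placeholder, not taken from the patch):

/* { dg-do compile } */
/* { dg-require-effective-target large_long_double } */

long double
widen (double x)
{
  return (long double) x * 2.0L;
}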


long_double_for_android.patch
Description: Binary data


Re: [PATCH, x86-Atom] Enabling look-ahead scheduling feature for Atom processors

2012-08-31 Thread Igor Zamyatin
On Tue, Aug 28, 2012 at 1:34 PM, Uros Bizjak ubiz...@gmail.com wrote:
 On Tue, Aug 28, 2012 at 11:24 AM, Igor Zamyatin izamya...@gmail.com wrote:

 I'd like this patch (original version at the bottom) also to be
 backported into 4.7.

 Is it safe to backport?

 Looks safe to me.

 OK if there are no objections from RMs in the next 24h.

All testing passed, so the following will be checked into 4.7.

Index: gcc/config/i386/i386.c
===
--- gcc/config/i386/i386.c  (revision 190833)
+++ gcc/config/i386/i386.c  (working copy)
@@ -23842,6 +23842,7 @@
 case PROCESSOR_CORE2_64:
 case PROCESSOR_COREI7_32:
 case PROCESSOR_COREI7_64:
+case PROCESSOR_ATOM:
   /* Generally, we want haifa-sched:max_issue() to look ahead as far
 as many instructions can be executed on a cycle, i.e.,
 issue_rate.  I wonder why tuning for many CPUs does not do this.  */


Original Changelog:

2012-08-23  Yuri Rumyantsev  ysrum...@gmail.com

 * config/i386/i386.c (ia32_multipass_dfa_lookahead) : Add
 case for Atom processor.


Thanks,
Igor


 Thanks,
 Uros.


Re: [PATCH, x86-Atom] Enabling look-ahead scheduling feature for Atom processors

2012-08-28 Thread Igor Zamyatin
Hi!

I'd like this patch (original version at the bottom) also to be
backported into 4.7.

Is it safe to backport?

On Fri, Aug 24, 2012 at 11:35 AM, Uros Bizjak ubiz...@gmail.com wrote:
 On Fri, Aug 24, 2012 at 9:22 AM, Igor Zamyatin izamya...@gmail.com wrote:

 The following change enables the look-ahead feature in the code scheduler for
 Atom processors. This gives quite a reasonable gain for some benchmarks
 for the mobile market.

 Overall compile time increase for SPEC2000 is about 1%.

 Regtested for x86_64 and also bootstrapped with --with-arch=core2
 --with-cpu=atom

 2012-08-23  Yuri Rumyantsev  ysrum...@gmail.com

 * config/i386/i386.c (ia32_multipass_dfa_lookahead) : Add
 case for Atom processor.

 OK.

 Thanks,
 Uros.

Original patch:

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 976bbb4..331e29a 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -24103,6 +24103,7 @@ ia32_multipass_dfa_lookahead (void)
 case PROCESSOR_CORE2_64:
 case PROCESSOR_COREI7_32:
 case PROCESSOR_COREI7_64:
+case PROCESSOR_ATOM:
   /* Generally, we want haifa-sched:max_issue() to look ahead as far
 as many instructions can be executed on a cycle, i.e.,
 issue_rate.  I wonder why tuning for many CPUs does not do this.  */


[PATCH, x86-Atom] Enabling look-ahead scheduling feature for Atom processors

2012-08-24 Thread Igor Zamyatin
Hi!

The following change enables the look-ahead feature in the code scheduler for
Atom processors. This gives quite a reasonable gain for some benchmarks
for the mobile market.

Overall compile time increase for SPEC2000 is about 1%.

Regtested for x86_64 and also bootstrapped with --with-arch=core2
--with-cpu=atom
Ok for trunk?

Thanks,
Igor

Changelog:

2012-08-23  Yuri Rumyantsev  ysrum...@gmail.com

* config/i386/i386.c (ia32_multipass_dfa_lookahead) : Add
case for Atom processor.



diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 976bbb4..331e29a 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -24103,6 +24103,7 @@ ia32_multipass_dfa_lookahead (void)
 case PROCESSOR_CORE2_64:
 case PROCESSOR_COREI7_32:
 case PROCESSOR_COREI7_64:
+case PROCESSOR_ATOM:
   /* Generally, we want haifa-sched:max_issue() to look ahead as far
 as many instructions can be executed on a cycle, i.e.,
 issue_rate.  I wonder why tuning for many CPUs does not do this.  */
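
For context, a simplified sketch of the hook being changed (paraphrased, not
the exact i386.c source): the listed processors get a lookahead equal to
their issue rate, while other CPUs keep the default.

/* Simplified sketch of ia32_multipass_dfa_lookahead; details differ from
   the real implementation.  */
static int
ia32_multipass_dfa_lookahead (void)
{
  switch (ix86_tune)
    {
    /* ... other CPU cases omitted ...  */
    case PROCESSOR_CORE2_32:
    case PROCESSOR_CORE2_64:
    case PROCESSOR_COREI7_32:
    case PROCESSOR_COREI7_64:
    case PROCESSOR_ATOM:
      /* Look ahead as far as many instructions as can be executed on a
         cycle, i.e. issue_rate.  */
      return ix86_issue_rate ();

    default:
      return 0;
    }
}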


[PATCH][RFC] Fixing instability of -fschedule-insns for x86

2012-08-13 Thread Igor Zamyatin
Hi all!

The patch aims to fix instability introduced by the first scheduler on x86. In
particular it targets the following list:

[1] http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46843
[2] http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46829
[3] http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36680
[4] http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42295

The main idea of this activity is mostly to give the user a possibility to
safely turn on the first scheduler for their code. In some cases this can
positively affect performance, especially for the in-order Atom.

It would be great to hear some feedback from the community about the change.

Thanks in advance,
Igor


first_scheduler_fix.patch
Description: Binary data


Re: [PATCH, Android] Runtime stack protector enabling for Android target

2012-08-10 Thread Igor Zamyatin
Hi!

There were only trivial merge conflicts like


  [(match_operand 0 "memory_operand" "")
   (match_operand 1 "memory_operand" "")]
  ""
---
  [(match_operand 0 "memory_operand")
   (match_operand 1 "memory_operand")]
  "!TARGET_HAS_BIONIC"


Patch attached.

All necessary testing passed. Ok for 4.7?

Changelog:


2012-08-08 Pavel Chupin pavel.v.chu...@intel.com

Backport from mainline r189840 and r187586:

2012-07-25 Sergey Melnikov sergey.melni...@intel.com

* config/i386/i386.md (stack_protect_set): Disable the pattern
for Android since Android libc (bionic) does not provide random
value for stack protection guard at gs:0x14. Guard value
will be provided from external symbol (default implementation).
(stack_protect_set_mode): Likewise.
(stack_protect_test): Likewise.
(stack_protect_test_mode): Likewise.
* gcc/defaults.h: Define macro TARGET_HAS_BIONIC to 0 - target does
not have Bionic by default
* config/linux.h: Redefine macro TARGET_HAS_BIONIC to (OPTION_BIONIC)
Macro OPTION_BIONIC is defined in this file and provides Bionic
accessibility status

2012-05-16  Igor Zamyatin  igor.zamya...@intel.com

* configure.ac: Stack protector enabling for Android targets.
* configure: Regenerate.


On Wed, Aug 8, 2012 at 2:25 PM, Maxim Kuvyrkov ma...@codesourcery.com wrote:
 On 8/08/2012, at 9:46 PM, Uros Bizjak wrote:

 On Wed, Aug 8, 2012 at 9:54 AM, Igor Zamyatin izamya...@gmail.com wrote:

 I'd like to ask whether stack-protector changes for Android could go to 4.7?

 Patches are:

 http://gcc.gnu.org/ml/gcc-patches/2012-05/msg01089.html
 http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01157.html

 OK, as far as x86 is concerned.

 Strictly speaking, these fixes are not for a regression, so do not readily 
 qualify for a release branch.  However, since the patches are no-ops on 
 targets other than x86 and Uros OK'd them for x86 ...

 OK, provided that the patches in the above threads apply without conflicts.  
 If there are conflicts, please repost for review.

 Thanks,

 --
 Maxim Kuvyrkov
 CodeSourcery / Mentor Graphics



stack-protector-4_7.patch
Description: Binary data


Re: [PATCH, Android] Runtime stack protector enabling for Android target

2012-08-08 Thread Igor Zamyatin
Hi all!

I'd like to ask whether stack-protector changes for Android could go to 4.7?

Patches are:

http://gcc.gnu.org/ml/gcc-patches/2012-05/msg01089.html
http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01157.html

Thanks,
Igor

On Wed, Jul 25, 2012 at 2:08 AM, Maxim Kuvyrkov ma...@codesourcery.com wrote:
 On 24/07/2012, at 10:08 PM, Uros Bizjak wrote:

 On Mon, Jul 23, 2012 at 10:00 PM, Igor Zamyatin izamya...@gmail.com wrote:

 2012-07-23 Sergey Melnikov sergey.melni...@intel.com

* config/i386/i386.md (stack_protect_set): Disable the pattern
for Android since Android libc (bionic) does not provide random
value for stack protection guard at gs:0x14. Guard value
will be provided from external symbol (default implementation).
(stack_protect_set_mode): Likewise.
(stack_protect_test): Likewise.
(stack_protect_test_mode): Likewise.
* gcc/defaults.h: Define macro TARGET_HAS_BIONIC to 0 - target does
not have Bionic by default
* config/linux.h: Redefine macro TARGET_HAS_BIONIC to (OPTION_BIONIC)
Macro OPTION_BIONIC is defined in this file and provides Bionic
accessibility status

 Looks OK to me, patch needs approval from Maxim.

 OK.  Thanks for fixing this.

 --
 Maxim Kuvyrkov
 CodeSourcery / Mentor Graphics



[PATCH, Android] Runtime stack protector enabling for Android target

2012-07-23 Thread Igor Zamyatin
Hi!

Since this change for stack-protector enabling hurts the mingw compiler
(see http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53980 and
http://gcc.gnu.org/ml/gcc-patches/2012-07/msg00638.html) we'd like to
make the change more general.
Please see the new patch in the attachment.
Tested in the Android environment (x86_64-*-linux-android), also
bootstrapped and regtested on x86_64-unknown-linux-gnu and i686-linux.
Also mingw bootstrap was checked.
Ok for trunk?

Thanks,
Igor


Changelog:

2012-07-23 Sergey Melnikov sergey.melni...@intel.com

* config/i386/i386.md (stack_protect_set): Disable the pattern
for Android since Android libc (bionic) does not provide random
value for stack protection guard at gs:0x14. Guard value
will be provided from external symbol (default implementation).
(stack_protect_set_mode): Likewise.
(stack_protect_test): Likewise.
(stack_protect_test_mode): Likewise.
* gcc/defaults.h: Define macro TARGET_HAS_BIONIC to 0 - target does
not have Bionic by default
* config/linux.h: Redefine macro TARGET_HAS_BIONIC to (OPTION_BIONIC)
Macro OPTION_BIONIC is defined in this file and provides Bionic
accessibility status


patch-stack-protector-3_target_bionic.diff
Description: Binary data


Re: [Patch, Fortran] Add parsing support for assumed-rank array

2012-07-20 Thread Igor Zamyatin

 Tobias Burnus wrote:
 I will now regtest everything, read through the whole patch - your
 part and mine, update the ChangeLog and commit it tomorrow.

 I have now committed the attached version as Rev. 189700!

 Thanks agai for the review!

 Tobias


This seems to cause following fails at least on i686:

FAIL: gfortran.dg/assumed_rank_12.f90  -O0   scan-tree-dump original 
= f \\(\\);.*desc.0.dtype = 600;.*desc.0.data = .void .. D.*;.*sub
\\(desc.0\\);.*D.*= .integer.kind=4. .. desc.0.data;
FAIL: gfortran.dg/assumed_rank_12.f90  -O1   scan-tree-dump original 
= f \\(\\);.*desc.0.dtype = 600;.*desc.0.data = .void .. D.*;.*sub
\\(desc.0\\);.*D.*= .integer.kind=4. .. desc.0.data;
FAIL: gfortran.dg/assumed_rank_12.f90  -O2   scan-tree-dump original 
= f \\(\\);.*desc.0.dtype = 600;.*desc.0.data = .void .. D.*;.*sub
\\(desc.0\\);.*D.*= .integer.kind=4. .. desc.0.data;
FAIL: gfortran.dg/assumed_rank_12.f90  -O3 -fomit-frame-pointer
scan-tree-dump original  = f \\(\\);.*desc.0.dtype =
600;.*desc.0.data = .void .. D.*;.*sub \\(desc.0\\);.*D.*=
.integer.kind=4. .. desc.0.data;
FAIL: gfortran.dg/assumed_rank_12.f90  -O3 -fomit-frame-pointer
-funroll-all-loops -finline-functions   scan-tree-dump original  = f
\\(\\);.*desc.0.dtype = 600;.*desc.0.data = .void .. D.*;.*sub
\\(desc.0\\);.*D.*= .integer.kind=4. .. desc.0.data;
FAIL: gfortran.dg/assumed_rank_12.f90  -O3 -fomit-frame-pointer
-funroll-loops   scan-tree-dump original  = f \\(\\);.*desc.0.dtype =
600;.*desc.0.data = .void .. D.*;.*sub \\(desc.0\\);.*D.*=
.integer.kind=4. .. desc.0.data;
FAIL: gfortran.dg/assumed_rank_12.f90  -O3 -g   scan-tree-dump
original  = f \\(\\);.*desc.0.dtype = 600;.*desc.0.data = .void ..
D.*;.*sub \\(desc.0\\);.*D.*= .integer.kind=4. .. desc.0.data;
FAIL: gfortran.dg/assumed_rank_12.f90  -Os   scan-tree-dump original 
= f \\(\\);.*desc.0.dtype = 600;.*desc.0.data = .void .. D.*;.*sub
\\(desc.0\\);.*D.*= .integer.kind=4. .. desc.0.data;
FAIL: gfortran.dg/assumed_rank_6.f90  -O   (test for errors, line 19)
FAIL: gfortran.dg/assumed_rank_6.f90  -O   (test for errors, line 20)
FAIL: gfortran.dg/assumed_rank_6.f90  -O   (test for errors, line 21)
FAIL: gfortran.dg/assumed_rank_6.f90  -O   (test for errors, line 26)
FAIL: gfortran.dg/assumed_rank_6.f90  -O   (test for errors, line 33)
FAIL: gfortran.dg/assumed_rank_6.f90  -O   (test for errors, line 37)
FAIL: gfortran.dg/assumed_rank_6.f90  -O   (test for errors, line 9)
FAIL: gfortran.dg/assumed_rank_6.f90  -O  (internal compiler error)
FAIL: gfortran.dg/assumed_rank_6.f90  -O  (test for excess errors)
FAIL: gfortran.dg/lto/pr45586-2
f_lto_pr45586-2_0.o-f_lto_pr45586-2_0.o link, -O0 -flto
-fuse-linker-plugin -fno-fat-lto-objects  (internal compiler error)


Re: [Patch, Fortran] Add parsing support for assumed-rank array

2012-07-20 Thread Igor Zamyatin
On x86_64 the same happens. Also I modified list of failing tests -
now it is correct

On Fri, Jul 20, 2012 at 11:43 AM, Igor Zamyatin izamya...@gmail.com wrote:

 Tobias Burnus wrote:
 I will now regtest everything, read through the whole patch - your
 part and mine, update the ChangeLog and commit it tomorrow.

 I have now committed the attached version as Rev. 189700!

 Thanks agai for the review!

 Tobias


 This seems to cause following fails at least on i686:

 FAIL: gfortran.dg/assumed_rank_12.f90  -O0   scan-tree-dump original 
 = f \\(\\);.*desc.0.dtype = 600;.*desc.0.data = .void .. D.*;.*sub
 \\(desc.0\\);.*D.*= .integer.kind=4. .. desc.0.data;
 FAIL: gfortran.dg/assumed_rank_12.f90  -O1   scan-tree-dump original 
 = f \\(\\);.*desc.0.dtype = 600;.*desc.0.data = .void .. D.*;.*sub
 \\(desc.0\\);.*D.*= .integer.kind=4. .. desc.0.data;
 FAIL: gfortran.dg/assumed_rank_12.f90  -O2   scan-tree-dump original 
 = f \\(\\);.*desc.0.dtype = 600;.*desc.0.data = .void .. D.*;.*sub
 \\(desc.0\\);.*D.*= .integer.kind=4. .. desc.0.data;
 FAIL: gfortran.dg/assumed_rank_12.f90  -O3 -fomit-frame-pointer
 scan-tree-dump original  = f \\(\\);.*desc.0.dtype =
 600;.*desc.0.data = .void .. D.*;.*sub \\(desc.0\\);.*D.*=
 .integer.kind=4. .. desc.0.data;
 FAIL: gfortran.dg/assumed_rank_12.f90  -O3 -fomit-frame-pointer
 -funroll-all-loops -finline-functions   scan-tree-dump original  = f
 \\(\\);.*desc.0.dtype = 600;.*desc.0.data = .void .. D.*;.*sub
 \\(desc.0\\);.*D.*= .integer.kind=4. .. desc.0.data;
 FAIL: gfortran.dg/assumed_rank_12.f90  -O3 -fomit-frame-pointer
 -funroll-loops   scan-tree-dump original  = f \\(\\);.*desc.0.dtype =
 600;.*desc.0.data = .void .. D.*;.*sub \\(desc.0\\);.*D.*=
 .integer.kind=4. .. desc.0.data;
 FAIL: gfortran.dg/assumed_rank_12.f90  -O3 -g   scan-tree-dump
 original  = f \\(\\);.*desc.0.dtype = 600;.*desc.0.data = .void ..
 D.*;.*sub \\(desc.0\\);.*D.*= .integer.kind=4. .. desc.0.data;
 FAIL: gfortran.dg/assumed_rank_12.f90  -Os   scan-tree-dump original 
 = f \\(\\);.*desc.0.dtype = 600;.*desc.0.data = .void .. D.*;.*sub
 \\(desc.0\\);.*D.*= .integer.kind=4. .. desc.0.data;
 FAIL: gfortran.dg/assumed_rank_6.f90  -O   (test for errors, line 19)
 FAIL: gfortran.dg/assumed_rank_6.f90  -O   (test for errors, line 20)
 FAIL: gfortran.dg/assumed_rank_6.f90  -O   (test for errors, line 21)
 FAIL: gfortran.dg/assumed_rank_6.f90  -O   (test for errors, line 26)
 FAIL: gfortran.dg/assumed_rank_6.f90  -O   (test for errors, line 33)
 FAIL: gfortran.dg/assumed_rank_6.f90  -O   (test for errors, line 37)
 FAIL: gfortran.dg/assumed_rank_6.f90  -O   (test for errors, line 9)
 FAIL: gfortran.dg/assumed_rank_6.f90  -O  (internal compiler error)
 FAIL: gfortran.dg/assumed_rank_6.f90  -O  (test for excess errors)


[PATCH, Android] Runtime stack protector enabling for Android target

2012-07-06 Thread Igor Zamyatin
Hi!

For runtime stack protector enabling on x86 for Android we have to
disable the current glibc-based implementation and turn on the default one,
since bionic doesn't use the value from gs:0x14.

Tested in the Android environment (x86_64-*-linux-android), also
bootstrapped and regtested on x86_64-unknown-linux-gnu and i686-linux.
Ok for trunk?

Thanks,
Igor


ChangeLog:

2012-07-05 Sergey Melnikov sergey.melni...@intel.com

* config/i386/i386.md (stack_protect_set): Disable the pattern
for Android since Android libc (bionic) does not provide random
value for stack protection guard at gs:0x14. Guard value
will be provided from external symbol (default implementation).
(stack_protect_set_mode): Likewise.
(stack_protect_test): Likewise.
(stack_protect_test_mode): Likewise.

diff --git a/gcc-4.6/gcc/config/i386/i386.md b/gcc-4.6/gcc/config/i386/i386.md
index b1d7e5e..fe0009d 100644
--- a/gcc-4.6/gcc/config/i386/i386.md
+++ b/gcc-4.6/gcc/config/i386/i386.md
@@ -17955,7 +17955,7 @@
 (define_expand "stack_protect_set"
   [(match_operand 0 "memory_operand" "")
    (match_operand 1 "memory_operand" "")]
-  ""
+  "!ANDROID_DEFAULT"
 {
   rtx (*insn)(rtx, rtx);

@@ -17979,7 +17979,7 @@
 	(unspec:P [(match_operand:P 1 "memory_operand" "m")] UNSPEC_SP_SET))
    (set (match_scratch:P 2 "=&r") (const_int 0))
    (clobber (reg:CC FLAGS_REG))]
-  ""
+  "!ANDROID_DEFAULT"
   "mov{<imodesuffix>}\t{%1, %2|%2, %1}\;mov{<imodesuffix>}\t{%2, %0|%0, %2}\;xor{l}\t%k2, %k2"
   [(set_attr "type" "multi")])

@@ -17997,7 +17997,7 @@
   [(match_operand 0 "memory_operand" "")
    (match_operand 1 "memory_operand" "")
    (match_operand 2 "" "")]
-  ""
+  "!ANDROID_DEFAULT"
 {
   rtx flags = gen_rtx_REG (CCZmode, FLAGS_REG);

@@ -18027,7 +18027,7 @@
 	     (match_operand:P 2 "memory_operand" "m")]
 	    UNSPEC_SP_TEST))
    (clobber (match_scratch:P 3 "=&r"))]
-  ""
+  "!ANDROID_DEFAULT"
   "mov{<imodesuffix>}\t{%1, %3|%3, %1}\;xor{<imodesuffix>}\t{%2, %3|%3, %2}"
   [(set_attr "type" "multi")])
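
As background (an illustration, not part of the patch): with the i386
patterns disabled, GCC falls back to the generic stack-protector scheme,
which reads the guard from a global symbol provided by the C library instead
of the glibc-specific gs:0x14 slot. Conceptually the emitted checks behave
like the sketch below; in reality the compiler generates this code and bionic
supplies __stack_chk_guard and __stack_chk_fail.

/* Conceptual model of the generic (symbol-based) stack protector.  */
extern unsigned long __stack_chk_guard;
extern void __stack_chk_fail (void);

void
protected_function (void)
{
  unsigned long canary = __stack_chk_guard;   /* prologue: load guard  */
  char buf[64];
  buf[0] = '\0';
  /* ... code that might overflow buf ...  */
  if (canary != __stack_chk_guard)            /* epilogue: check guard  */
    __stack_chk_fail ();
}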


Re: [PATCH, Android] Runtime stack protector enabling for Android target

2012-07-06 Thread Igor Zamyatin
Right, flag_android looks better, thanks.
Probably it's even better to check also bionic (OPTION_BIONIC)

On Fri, Jul 6, 2012 at 12:13 PM, Andrew Pinski pins...@gmail.com wrote:
 On Fri, Jul 6, 2012 at 12:49 AM, Igor Zamyatin izamya...@gmail.com wrote:
 Hi!

 For runtime stack protector enabling on x86 for Android we have to
 disable the current glibc-based implementation and turn on the default one,
 since bionic doesn't use the value from gs:0x14.

 Tested in android environment(x86_64-*-linux-android), also
 bootstrapped and regtested on x86_64-unknown-linux-gnu and i686-linux.
 Ok for trunk?



 I think you want flag_android and not ANDROID_DEFAULT since you could
 use -mno-android .

 Thanks,
 Andrew




 Thanks,
 Igor


 ChangeLog:

 2012-07-05 Sergey Melnikov sergey.melni...@intel.com

 * config/i386/i386.md (stack_protect_set): Disable the pattern
 for Android since Android libc (bionic) does not provide random
 value for stack protection guard at gs:0x14. Guard value
 will be provided from external symbol (default implementation).
 (stack_protect_set_mode): Likewise.
 (stack_protect_test): Likewise.
 (stack_protect_test_mode): Likewise.

 diff --git a/gcc-4.6/gcc/config/i386/i386.md 
 b/gcc-4.6/gcc/config/i386/i386.md
 index b1d7e5e..fe0009d 100644
 --- a/gcc-4.6/gcc/config/i386/i386.md
 +++ b/gcc-4.6/gcc/config/i386/i386.md
 @@ -17955,7 +17955,7 @@
  (define_expand stack_protect_set
[(match_operand 0 memory_operand )
 (match_operand 1 memory_operand )]
 -  
 +  !ANDROID_DEFAULT
  {
rtx (*insn)(rtx, rtx);

 @@ -17979,7 +17979,7 @@
 (unspec:P [(match_operand:P 1 memory_operand m)] UNSPEC_SP_SET))
 (set (match_scratch:P 2 =r) (const_int 0))
 (clobber (reg:CC FLAGS_REG))]
 -  
 +  !ANDROID_DEFAULT
mov{imodesuffix}\t{%1, %2|%2, %1}\;mov{imodesuffix}\t{%2,
 %0|%0, %2}\;xor{l}\t%k2, %k2
[(set_attr type multi)])

 @@ -17997,7 +17997,7 @@
[(match_operand 0 memory_operand )
 (match_operand 1 memory_operand )
 (match_operand 2  )]
 -  
 +  !ANDROID_DEFAULT
  {
rtx flags = gen_rtx_REG (CCZmode, FLAGS_REG);

 @@ -18027,7 +18027,7 @@
  (match_operand:P 2 memory_operand m)]
 UNSPEC_SP_TEST))
 (clobber (match_scratch:P 3 =r))]
 -  
 +  !ANDROID_DEFAULT
mov{imodesuffix}\t{%1, %3|%3, %1}\;xor{imodesuffix}\t{%2, %3|%3, %2}
[(set_attr type multi)])


Re: [PATCH 1/3] Add rtx costs for sse integer ops

2012-06-27 Thread Igor Zamyatin
May I ask about the purpose of the following piece of change? Doesn't
it affect non-SSE cases as well?

@@ -32038,7 +32042,15 @@ ix86_rtx_costs (rtx x, int code, int
outer_code_i, int opno, int *total,
case ASHIFTRT:
case LSHIFTRT:
case ROTATERT:
-  if (!TARGET_64BIT && GET_MODE (XEXP (x, 0)) == DImode)
+  if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT)
+   {
+ /* ??? Should be SSE vector operation cost.  */
+ /* At least for published AMD latencies, this really is the same
+as the latency for a simple fpu operation like fabs.  */
+ *total = cost->fabs;
+ return false;
+   }
+  if (GET_MODE_SIZE (mode) > UNITS_PER_WORD)
   {
 if (CONST_INT_P (XEXP (x, 1)))
   {

It also seems that we reversed the condition for the code that is now
under if (GET_MODE_SIZE (mode) > UNITS_PER_WORD). Why do we need this?


Thanks,
Igor

-Original Message-
From: gcc-patches-ow...@gcc.gnu.org
[mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Richard Henderson
Sent: Saturday, June 16, 2012 12:57 AM
To: gcc-patches@gcc.gnu.org
Cc: rguent...@suse.de; ubiz...@gmail.com; hjl.to...@gmail.com
Subject: [PATCH 1/3] Add rtx costs for sse integer ops

---
 gcc/config/i386/i386.c |   50 ++-
 1 files changed, 40 insertions(+), 10 deletions(-)

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index
e2f5740..578a756 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -31990,13 +31990,16 @@ ix86_rtx_costs (rtx x, int code, int
outer_code_i, int opno, int *total,
           break;
         case 0:
         case -1:
-           /* Start with (MEM (SYMBOL_REF)), since that's where
-              it'll probably end up.  Add a penalty for size.  */
-           *total = (COSTS_N_INSNS (1)
-                     + (flag_pic != 0 && !TARGET_64BIT)
-                     + (mode == SFmode ? 0 : mode == DFmode ? 1 : 2));
           break;
         }
+      /* FALLTHRU */
+
+    case CONST_VECTOR:
+      /* Start with (MEM (SYMBOL_REF)), since that's where
+        it'll probably end up.  Add a penalty for size.  */
+      *total = (COSTS_N_INSNS (1)
+               + (flag_pic != 0 && !TARGET_64BIT)
+               + (mode == SFmode ? 0 : mode == DFmode ? 1 : 2));
      return true;

    case ZERO_EXTEND:
@@ -32016,8 +32019,9 @@ ix86_rtx_costs (rtx x, int code, int
outer_code_i, int opno, int *total,
      return false;

    case ASHIFT:
-      if (CONST_INT_P (XEXP (x, 1))
-          && (GET_MODE (XEXP (x, 0)) != DImode || TARGET_64BIT))
+      if (SCALAR_INT_MODE_P (mode)
+          && GET_MODE_SIZE (mode) < UNITS_PER_WORD
+          && CONST_INT_P (XEXP (x, 1)))
       {
         HOST_WIDE_INT value = INTVAL (XEXP (x, 1));
         if (value == 1)
@@ -32038,7 +32042,15 @@ ix86_rtx_costs (rtx x, int code, int
outer_code_i, int opno, int *total,
    case ASHIFTRT:
    case LSHIFTRT:
    case ROTATERT:
-      if (!TARGET_64BIT && GET_MODE (XEXP (x, 0)) == DImode)
+      if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT)
+       {
+         /* ??? Should be SSE vector operation cost.  */
+         /* At least for published AMD latencies, this really is the same
+            as the latency for a simple fpu operation like fabs.  */
+         *total = cost->fabs;
+         return false;
+       }
+      if (GET_MODE_SIZE (mode) > UNITS_PER_WORD)
       {
         if (CONST_INT_P (XEXP (x, 1)))
           {
@@ -32107,6 +32119,16 @@ ix86_rtx_costs (rtx x, int code, int
outer_code_i, int opno, int *total,
         *total = cost->fmul;
         return false;
       }
+      else if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT)
+       {
+         /* Without sse4.1, we don't have PMULLD; it's emulated with 7
+            insns, including two PMULUDQ.  */
+         if (mode == V4SImode && !(TARGET_SSE4_1 || TARGET_AVX))
+           *total = cost->fmul * 2 + cost->fabs * 5;
+         else
+           *total = cost->fmul;
+         return false;
+       }
      else
       {
         rtx op0 = XEXP (x, 0);
@@ -32171,7 +32193,7 @@ ix86_rtx_costs (rtx x, int code, int
outer_code_i, int opno, int *total,

    case PLUS:
      if (GET_MODE_CLASS (mode) == MODE_INT
-               && GET_MODE_BITSIZE (mode) <= GET_MODE_BITSIZE (Pmode))
+          && GET_MODE_SIZE (mode) <= UNITS_PER_WORD)
       {
         if (GET_CODE (XEXP (x, 0)) == PLUS
              && GET_CODE (XEXP (XEXP (x, 0), 0)) == MULT
@@ -32271,6 +32293,14 @@ ix86_rtx_costs (rtx x, int code, int outer_code_i, int
opno, int *total,
      /* FALLTHRU */

    case NOT:
+      if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT)
+       {
+         /* ??? Should be SSE vector operation cost.  */
+         /* At least for published AMD latencies, this really is the same
+            as the latency for a simple fpu operation like fabs.  */
+         *total = cost->fabs;
+         return false;
+       }
      if (!TARGET_64BIT && mode == DImode)
       *total = cost->add * 2;
      else
@@ -32331,7 
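
As a rough illustration of the V4SImode multiply case costed above (this
example is not part of the patch; the function name and the use of GCC's
generic vector extension are assumptions made purely for illustration): with
-msse4.1 or -mavx a multiply like this is a single PMULLD, while without
SSE4.1 it has to be emulated with roughly two PMULUDQ plus shuffles, which is
what the new cost->fmul * 2 + cost->fabs * 5 estimate approximates.

  /* Illustration only, not part of the patch.  */
  typedef int v4si __attribute__ ((vector_size (16)));

  v4si
  mul_v4si (v4si a, v4si b)
  {
    /* Single pmulld with -msse4.1/-mavx; otherwise expanded via two
       pmuludq plus shuffle/merge instructions.  */
    return a * b;
  }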

Re: [PATCH 1/3] Add rtx costs for sse integer ops

2012-06-27 Thread Igor Zamyatin
On Wed, Jun 27, 2012 at 9:00 PM, Richard Henderson r...@redhat.com wrote:
 On 06/27/2012 09:07 AM, Igor Zamyatin wrote:
 May I ask about the purpose of the following piece of change? Doesn't
 it affect non-sse cases either?

 Err, no, it doesn't affect non-sse cases.  All MODE_VECTOR_INT
 cases will be implemented in the xmm registers (ignoring the
 deprecated and largely ignored mmx case).

Probably I misunderstand something... This condition - GET_MODE_SIZE
(mode) > UNITS_PER_WORD - is outside the check for MODE_VECTOR_INT.




 @@ -32038,7 +32042,15 @@ ix86_rtx_costs (rtx x, int code, int
 outer_code_i, int opno, int *total,
     case ASHIFTRT:
     case LSHIFTRT:
     case ROTATERT:
 -      if (!TARGET_64BIT && GET_MODE (XEXP (x, 0)) == DImode)
 +      if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT)
 +       {
 +         /* ??? Should be SSE vector operation cost.  */
 +         /* At least for published AMD latencies, this really is the same
 +            as the latency for a simple fpu operation like fabs.  */
 +         *total = cost->fabs;
 +         return false;
 +       }
 +      if (GET_MODE_SIZE (mode) > UNITS_PER_WORD)
        {
          if (CONST_INT_P (XEXP (x, 1)))
            {

 It also seems that we reversed the condition for the code that is now
 under if (GET_MODE_SIZE (mode) > UNITS_PER_WORD). Why do we need this?

 I'm not sure what you're suggesting.  But we certainly don't use
 the xmm registers to implement DImode operations in 32-bit, so...


 r~


Re: [PATCH 1/3] Add rtx costs for sse integer ops

2012-06-27 Thread Igor Zamyatin
On Wed, Jun 27, 2012 at 9:14 PM, Richard Henderson r...@redhat.com wrote:
 On 06/27/2012 10:08 AM, Igor Zamyatin wrote:
 On Wed, Jun 27, 2012 at 9:00 PM, Richard Henderson r...@redhat.com wrote:
  On 06/27/2012 09:07 AM, Igor Zamyatin wrote:
  May I ask about the purpose of the following piece of change? Doesn't
  it affect non-sse cases either?
 
  Err, no, it doesn't affect non-sse cases.  All MODE_VECTOR_INT
  cases will be implemented in the xmm registers (ignoring the
  deprecated and largely ignored mmx case).
 Probably I misunderstand something... This condition - GET_MODE_SIZE
 (mode) > UNITS_PER_WORD - is outside the check for MODE_VECTOR_INT.

 Of course.

 We currently have

        if (vector mode)
        else if (mode >= word size)
        else /* scalar mode < word size */

Sure, it's clear. So am I correct that in this case now for second
possibility we run the code which was executed for third possibility
before the patch (it was under TARGET_64BIT && GET_MODE (XEXP (x, 0))
== DImode)? If yes, why?


 We could no doubt legitimately rearrange this to

        if (mode >= word size)
        else if (vector mode)
        else /* scalar mode < word size */

 but I don't see how that's any clearer.  We certainly
 can't eliminate any tests, since there are in fact
 three different possibilities.



 r~


Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-05-29 Thread Igor Zamyatin
Hi, Uros!

Sorry, I didn't realize that the patch was missed. I have attached a new version.

Changelog:

2012-05-29  Yuri Rumyantsev  yuri.s.rumyant...@intel.com

   * config/i386/i386.c (x86_sched_reorder): New function.
   Added new function x86_sched_reorder.

As for multiply modes, we currently handle the most frequent case for
Atom; this could be enhanced in the future.

Thanks,
Igor


 Hello!

 Ping?

 Please at least add a URL to the patch; it took me some time to find the
 latest version [1], and I'm not even sure it is the latest version...

 I assume that you cleared all issues with middle-end and scheduler 
 maintainers, it is not clear from the message.

 +   (1) IMUL instrction is on the top of list;

 Typo above.

 +  static int issue_rate = -1;
 +  int n_ready = *pn_ready;
 +  rtx insn;
 +  rtx insn1;
 +  rtx insn2;

 Please put three definitions above on the same line.

 +  int i;
 +  sd_iterator_def sd_it;
 +  dep_t dep;
 +  int index = -1;
 +
 +  /* set up issue rate */
 +  if (issue_rate < 0)
 +    issue_rate = ix86_issue_rate();

 Please set issue_rate unconditionally here.  Also, please follow the GNU 
 style of comments (Full sentence with two spaces after the dot) everywhere, 
 e.g:

 /* Set up issue rate.  */

 +  if (!(GET_CODE (SET_SRC (insn)) == MULT
 +       && GET_MODE (SET_SRC (insn)) == SImode))
 +    return issue_rate;

 Is it correct that only SImode multiplies are checked against SImode 
 multiplies? Can't we use DImode or HImode multiply (or other long-latency 
 insns) to put them into the shadow of the first multiply insn?

 As proposed in [2], there are many other fine-tuning approaches proposed by 
 the scheduler maintainer. OTOH, even the big hammer
 approach in the proposed patch makes things better, so it is the step in the 
 right direction - and it is existing practice anyway.

 Under this rationale, I think that the patch should be committed to mainline. 
 But please also consider proposed fine-tunings to refine the scheduling 
 accuracy.

 So, OK for mainline, if there are no objections from other maintainers in 
 next two days.

 [1] http://gcc.gnu.org/ml/gcc-patches/2012-04/msg00964.html
 [2] http://gcc.gnu.org/ml/gcc-patches/2012-04/msg00806.html

 Thanks,
 Uros.


imul_reordering.patch
Description: Binary data


Re: [PATCH, tree-optimization] Fix for PR 52868

2012-05-28 Thread Igor Zamyatin
Ping?

On Sat, May 12, 2012 at 1:26 AM, Igor Zamyatin izamya...@gmail.com wrote:
 Ping?

 On Fri, Apr 27, 2012 at 4:42 PM, Igor Zamyatin izamya...@gmail.com wrote:
 On Wed, Apr 25, 2012 at 6:41 PM, Richard Guenther
 richard.guent...@gmail.com wrote:
 On Wed, Apr 25, 2012 at 4:32 PM, Igor Zamyatin izamya...@gmail.com wrote:
 On Wed, Apr 25, 2012 at 1:14 PM, Richard Guenther
 richard.guent...@gmail.com wrote:
 On Wed, Apr 25, 2012 at 10:56 AM, Igor Zamyatin izamya...@gmail.com 
 wrote:
 Hi all!

 I'd like to post for review the patch which makes some costs adjusting
 in get_computation_cost_at routine in ivopts part.
 As mentioned in the PR changes also fix the bwaves regression from PR 
 52272.
 Also changes introduce no degradations on spec2000/2006 and
 EEMBC1.1/2.0(this was measured on Atom) on x86


 Bootstrapped and regtested on x86. Ok to commit?

 I can't make sense of the patch and the comment does not help.

 +      diff_cost = cost.cost;
        cost.cost /= avg_loop_niter (data->current_loop);
 +      add_cost_val = add_cost (TYPE_MODE (ctype), data->speed);
 +      /* Do cost correction if address cost is small enough
 +         and difference cost is high enough.  */
 +      if (address_p && diff_cost > add_cost_val
 +          && get_address_cost (symbol_present, var_present,
 +                               offset, ratio, cstepi,
 +                               TYPE_MODE (TREE_TYPE (utype)),
 +                               TYPE_ADDR_SPACE (TREE_TYPE (utype)),
 +                               speed, stmt_is_after_inc,
 +                               can_autoinc).cost <= add_cost_val)
 +        cost.cost += add_cost_val;

 Please explain more thoroughly.  It also would seem to be better to add
 an extra case, as later code does

 For example for such code

   for (j=0; j<M;j++) {
       for (i=0; i<N; i++)
           sum += ptr->a[j][i] * ptr->c[k][i];
   }
  we currently have following gimple on x86 target (I provided a piece
 of all phase output):

           # ivtmp.13_30 = PHI ivtmp.13_31(3), ivtmp.13_33(7)
           D.1748_34 = (void *) ivtmp.13_30;
           D.1722_7 = MEM[base: D.1748_34, offset: 0B];
           D.1750_36 = ivtmp.27_28;
           D.1751_37 = D.1750_36 + ivtmp.13_30; -- we got
 non-invariant add which is not taken into account currently in cost
 model
           D.1752_38 = (void *) D.1751_37;
           D.1753_39 = (sizetype) k_8(D);
           D.1754_40 = D.1753_39 * 800;
           D.1723_9 = MEM[base: D.1752_38, index: D.1754_40, offset: 
 16000B];
           ...

  With proposed fix we produce:

           # ivtmp.14_30 = PHI ivtmp.14_31(3), 0(7)
           D.1749_34 = (struct S *) ivtmp.25_28;
           D.1722_7 = MEM[base: D.1749_34, index: ivtmp.14_30, offset: 0B];
           D.1750_35 = (sizetype) k_8(D);
           D.1751_36 = D.1750_35 * 800;
           D.1752_37 = ptr_6(D) + D.1751_36;
           D.1723_9 = MEM[base: D.1752_37, index: ivtmp.14_30, offset: 
 16000B];

 which is more effective on platforms where address cost is cheaper
 than cost of addition operation. That's basically what this adjustment
 is for.
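
For reference, a self-contained variant of the loop above (the struct layout
is only an assumption so that the example compiles on its own; 100 doubles
per row matches the "* 800" scaling in the gimple, and only the
ptr->a[j][i] / ptr->c[k][i] access pattern matters for the discussion):

  /* Assumed struct layout, for illustration only.  */
  struct S
  {
    double a[100][100];
    double c[100][100];
  };

  double
  foo (struct S *ptr, int k, int M, int N)
  {
    double sum = 0.0;
    int i, j;

    for (j = 0; j < M; j++)
      for (i = 0; i < N; i++)
        sum += ptr->a[j][i] * ptr->c[k][i];
    return sum;
  }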

 If we generally miss to account for the add then why is the adjustment
 conditional on diff_cost > add_cost and address_cost <= add_cost?

 Is this a new heuristic or a fix for not accurately computing the cost for 
 the
 stmts we generate?

 I'd say this is closer to heuristic since diff_cost > add_cost is an
 attempt to catch the case with non-invariant add produced by pointer
 difference and address_cost <= add_cost leaves the cases with cheap
 address operations


 Richard.

 So comment in the source code now looks as follows

 /* Do cost correction when address difference produces
   additional non-invariant add operation which is less
   profitable if address cost is cheaper than cost of add.  */


  /* Now the computation is in shape symbol + var1 + const + ratio * var2.
     (symbol/var1/const parts may be omitted).  If we are looking for an
     address, find the cost of addressing this.  */
  if (address_p)
    return add_costs (cost,
                      get_address_cost (symbol_present, var_present,
                                        offset, ratio, cstepi,
                                        TYPE_MODE (TREE_TYPE (utype)),
                                        TYPE_ADDR_SPACE (TREE_TYPE 
 (utype)),
                                        speed, stmt_is_after_inc,
                                        can_autoinc));

 thus refactoring the code a bit would make it possible to CSE the
 get_address_cost
 call and eventually make it clearer what the code does.

 'offset' could be changed between two calls of get_address_cost so
 such refactoring looks useless.

 New patch (only the comment was changed) attached. Changelog was
 changed as well.


 Richard.


 Changelog:

  2012-04-26  Yuri Rumyantsev  yuri.rumyant...@intel.com

         * tree-ssa-loop-ivopts.c (get_computation_cost_at): Adjust
        cost model when address difference

Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-05-28 Thread Igor Zamyatin
Ping?

On Sun, May 6, 2012 at 11:27 AM, Igor Zamyatin izamya...@gmail.com wrote:
 Ping. Could x86 maintainer(s) look at these changes?

 Thanks,
 Igor

 On Fri, Apr 20, 2012 at 4:04 PM, Igor Zamyatin izamya...@gmail.com wrote:
 On Tue, Apr 17, 2012 at 12:27 AM, Igor Zamyatin izamya...@gmail.com wrote:
 On Fri, Apr 13, 2012 at 4:20 PM, Andrey Belevantsev a...@ispras.ru wrote:
 On 13.04.2012 14:18, Igor Zamyatin wrote:

 On Thu, Apr 12, 2012 at 5:01 PM, Andrey Belevantseva...@ispras.ru
  wrote:

 On 12.04.2012 16:38, Richard Guenther wrote:


 On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatinizamya...@gmail.com
  wrote:


 On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther
 richard.guent...@gmail.com    wrote:


 On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakovamona...@ispras.ru
  wrote:



 Can atom execute two IMUL in parallel?  Or what exactly is the
 pipeline
 behavior?



 As I understand from Intel's optimization reference manual, the
 behavior is as
 follows: if the instruction immediately following IMUL has shorter
 latency,
 execution is stalled for 4 cycles (which is IMUL's latency);
 otherwise,
 a
 4-or-more cycles latency instruction can be issued after IMUL without
 a
 stall.
 In other words, IMUL is pipelined with respect to other long-latency
 instructions, but not to short-latency instructions.



 It seems to be modeled in the pipeline description though:

 ;;; imul insn has 5 cycles latency
 (define_reservation "atom-imul-32"
                     "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4,
                      atom-port-0")

 ;;; imul instruction excludes other non-FP instructions.
 (exclusion_set "atom-eu-0, atom-eu-1"
                "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4")


 The main idea is quite simple:

 If we are going to schedule IMUL instruction (it is on the top of
 ready list) we try to find out producer of other (independent) IMUL
 instruction that is in ready list too. The goal is try to schedule
 such a producer to get another IMUL in ready list and get scheduling
 of 2 successive IMUL instructions.
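
As a minimal illustration of the situation this is aimed at (the function
below is only a sketch, not the attached test case): the two multiplies are
independent, and scheduling the producer of the second multiply's operand
early enough lets both IMULs sit in the ready list and issue back to back,
which Atom's pipelined IMUL handles well.

  /* Illustration only.  */
  int
  two_muls (int a, int b, int c, int d)
  {
    int p = a + b;            /* producer feeding the first imul */
    int q = c + d;            /* producer feeding the second, independent imul */
    return (p * a) + (q * c); /* two independent imuls */
  }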



 Why does that not happen without your patch?  Does it never happen
 without
 your patch or does it merely not happen for one EEMBC benchmark (can
 you provide a testcase?)?



 It does not happen because the scheduler by itself does not do such
 specific
 reordering.  That said, it is easy to imagine the cases where this patch
 will make things worse rather than better.

 Igor, why not try different subtler mechanisms like adjust_priority,
 which
 is get called when an insn is added to the ready list?  E.g. increase the
 producer's priority.

 The patch as is misses checks for NONDEBUG_INSN_P.  Also, why bail out
 when
 you have more than one imul in the ready list?  Don't you want to bump
 the
 priority of the other imul found?


 Could you provide some examples when this patch would harm the
 performance?


 I thought of the cases when the other ready insns can fill up the hole and
 that would be more beneficial because e.g. they would be on more critical
 paths than the producer of your second imul.  I don't know enough of Atom 
 to
 give an example -- maybe some long divisions?



 Sched_reorder was chosen since it is used in other ports and looks
 most suitable for such case, e.g. it provides access to the whole
 ready list.
 BTW, just increasing producer's priority seems to be more risky in
 performance sense - we can incorrectly start delaying some
 instructions.


 Yes, but exactly because of the above example you can start incorrectly
 delaying other insns, too, as you force the insn to be the first in the
 list.  While bumping priority still leaves the scheduler sorting heuristics
 in place and actually lowers that risk.


 Thought ready list doesn't contain DEBUG_INSN... Is it so? If it
 contains them - this could be added easily


 It does, but I'm not sure the sched_reorder hook gets them or they are
 immediately removed -- I saw similar checks in one of the targets' hooks.

 Done with DEBUG_INSN, also 1-imul limit was removed. Patch attached


 Anyways, my main thought was that it is better to test on more benchmarks 
 to
 alleviate the above concerns, so as long as the i386 maintainers are happy,
 I don't see major problems here.  A good idea could be to generalize the
 patch to handle other long latency insns as second consumers, not only
 imuls, if this is relevant for Atom.

 Yes, generalization of this approach is in plans. According to Atom
 Software optimization guide there are several headrooms left here.
 As for trying on more benchmarks - the particular case is indeed quite
 rare. I attached the example
 where patch helps to group imuls in pairs which is profitable for
 Atom. Such and similar codes are not very common.
 But hopefully this approach could help avoid this and other glassjaws.

  BTW, this patch also helps some EEMBC tests when -funroll-loops is specified.
 So, any feedback from i386 maintainers about this? :)

 Changelog slightly

Re: [PATCH, Android] Stack protector enabling for Android target

2012-05-11 Thread Igor Zamyatin
Hi!

Please look at the modified patch in the attachment. ChangeLog remains the same.

Tested in android environment(x86_64-*-linux-android), also
bootstrapped on x86_64-unknown-linux-gnu.
I also started regtesting on Linux. Is it ok after successful regtesting?

Thanks,
Igor

On Mon, May 7, 2012 at 10:08 PM, H.J. Lu hjl.to...@gmail.com wrote:
 On Mon, May 7, 2012 at 11:04 AM, Maxim Kuvyrkov ma...@codesourcery.com 
 wrote:
 On 6/05/2012, at 7:02 PM, Igor Zamyatin wrote:

 Hi!

 The patch enables stack protector for Android.
 Android targets don't contain necessary information in features.h so
 we explicitly enable stack protector for Android.

 Bootstrapped and regtested on x86_64. Ok to commit?

 Thanks,
 Igor

 2012-05-06  Igor Zamyatin  igor.zamya...@intel.com

        * configure.ac: Stack protector enabling for Android targets.
        * configure: Regenerate.


 diff --git a/gcc/configure.ac b/gcc/configure.ac
 index 86b4bea..c1012d6 100644
 --- a/gcc/configure.ac
 +++ b/gcc/configure.ac
 @@ -4545,6 +4545,8 @@ AC_CACHE_CHECK(__stack_chk_fail in target C library,
       gcc_cv_libc_provides_ssp,
       [gcc_cv_libc_provides_ssp=no
     case $target in
 +       *-android*)
 +         gcc_cv_libc_provides_ssp=yes;;
        *-*-linux* | *-*-kfreebsd*-gnu | *-*-knetbsd*-gnu)
       [# glibc 2.4 and later provides __stack_chk_fail and
       # either __stack_chk_guard, or TLS access to stack guard canary.


 What you really want is to enable stack protector for Bionic libc, which is 
 often synonymous with -android* target, but not always.  Let's enable ssp 
 based on whether __BIONIC__ is defined in the libc headers (i.e., add a grep 
 test for __BIONIC__ in sys/cdefs.h)
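
A minimal sketch of that idea (illustration only, not the actual configure
test; the macro name below is made up): Bionic identifies itself by defining
__BIONIC__ in its headers, so keying the decision on that macro, rather than
on the *-android* triplet, is what the suggested grep of sys/cdefs.h
approximates.

  /* Illustration only, assuming a Bionic-style <sys/cdefs.h> is on the
     include path.  */
  #include <sys/cdefs.h>

  #ifdef __BIONIC__
  # define HAVE_BIONIC_SSP 1   /* all versions of Bionic support SSP */
  #else
  # define HAVE_BIONIC_SSP 0
  #endif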

 Also please add a comment along the lines of all versions of Bionic support 
 stack protector.

 Which exact target did you test this on?  X86_64-*-* is a pretty broad 
 definition.


 We are working on x86_64-*-linux-android target,
 which uses x32.

 --
 H.J.


stack_protector.patch
Description: Binary data


Re: [PATCH, tree-optimization] Fix for PR 52868

2012-05-11 Thread Igor Zamyatin
Ping?

On Fri, Apr 27, 2012 at 4:42 PM, Igor Zamyatin izamya...@gmail.com wrote:
 On Wed, Apr 25, 2012 at 6:41 PM, Richard Guenther
 richard.guent...@gmail.com wrote:
 On Wed, Apr 25, 2012 at 4:32 PM, Igor Zamyatin izamya...@gmail.com wrote:
 On Wed, Apr 25, 2012 at 1:14 PM, Richard Guenther
 richard.guent...@gmail.com wrote:
 On Wed, Apr 25, 2012 at 10:56 AM, Igor Zamyatin izamya...@gmail.com 
 wrote:
 Hi all!

 I'd like to post for review the patch which makes some costs adjusting
 in get_computation_cost_at routine in ivopts part.
 As mentioned in the PR changes also fix the bwaves regression from PR 
 52272.
 Also changes introduce no degradations on spec2000/2006 and
 EEMBC1.1/2.0(this was measured on Atom) on x86


 Bootstrapped and regtested on x86. Ok to commit?

 I can't make sense of the patch and the comment does not help.

 +      diff_cost = cost.cost;
        cost.cost /= avg_loop_niter (data->current_loop);
 +      add_cost_val = add_cost (TYPE_MODE (ctype), data->speed);
 +      /* Do cost correction if address cost is small enough
 +         and difference cost is high enough.  */
 +      if (address_p && diff_cost > add_cost_val
 +          && get_address_cost (symbol_present, var_present,
 +                               offset, ratio, cstepi,
 +                               TYPE_MODE (TREE_TYPE (utype)),
 +                               TYPE_ADDR_SPACE (TREE_TYPE (utype)),
 +                               speed, stmt_is_after_inc,
 +                               can_autoinc).cost <= add_cost_val)
 +        cost.cost += add_cost_val;

 Please explain more thoroughly.  It also would seem to be better to add
 an extra case, as later code does

 For example for such code

   for (j=0; j<M;j++) {
       for (i=0; i<N; i++)
           sum += ptr->a[j][i] * ptr->c[k][i];
   }
  we currently have following gimple on x86 target (I provided a piece
 of all phase output):

           # ivtmp.13_30 = PHI ivtmp.13_31(3), ivtmp.13_33(7)
           D.1748_34 = (void *) ivtmp.13_30;
           D.1722_7 = MEM[base: D.1748_34, offset: 0B];
           D.1750_36 = ivtmp.27_28;
           D.1751_37 = D.1750_36 + ivtmp.13_30; -- we got
 non-invariant add which is not taken into account currently in cost
 model
           D.1752_38 = (void *) D.1751_37;
           D.1753_39 = (sizetype) k_8(D);
           D.1754_40 = D.1753_39 * 800;
           D.1723_9 = MEM[base: D.1752_38, index: D.1754_40, offset: 16000B];
           ...

  With proposed fix we produce:

           # ivtmp.14_30 = PHI ivtmp.14_31(3), 0(7)
           D.1749_34 = (struct S *) ivtmp.25_28;
           D.1722_7 = MEM[base: D.1749_34, index: ivtmp.14_30, offset: 0B];
           D.1750_35 = (sizetype) k_8(D);
           D.1751_36 = D.1750_35 * 800;
           D.1752_37 = ptr_6(D) + D.1751_36;
           D.1723_9 = MEM[base: D.1752_37, index: ivtmp.14_30, offset: 
 16000B];

 which is more effective on platforms where address cost is cheaper
 than cost of addition operation. That's basically what this adjustment
 is for.

 If we generally miss to account for the add then why is the adjustment
 conditional on diff_cost > add_cost and address_cost <= add_cost?

 Is this a new heuristic or a fix for not accurately computing the cost for 
 the
 stmts we generate?

 I'd say this is closer to heuristic since diff_cost > add_cost is an
 attempt to catch the case with non-invariant add produced by pointer
 difference and address_cost <= add_cost leaves the cases with cheap
 address operations


 Richard.

 So comment in the source code now looks as follows

 /* Do cost correction when address difference produces
   additional non-invariant add operation which is less
   profitable if address cost is cheaper than cost of add.  */


  /* Now the computation is in shape symbol + var1 + const + ratio * var2.
     (symbol/var1/const parts may be omitted).  If we are looking for an
     address, find the cost of addressing this.  */
  if (address_p)
    return add_costs (cost,
                      get_address_cost (symbol_present, var_present,
                                        offset, ratio, cstepi,
                                        TYPE_MODE (TREE_TYPE (utype)),
                                        TYPE_ADDR_SPACE (TREE_TYPE (utype)),
                                        speed, stmt_is_after_inc,
                                        can_autoinc));

 thus refactoring the code a bit would make it possible to CSE the
 get_address_cost
 call and eventually make it clearer what the code does.

 'offset' could be changed between two calls of get_address_cost so
 such refactoring looks useless.

 New patch (only the comment was changed) attached. Changelog was
 changed as well.


 Richard.


 Changelog:

  2012-04-26  Yuri Rumyantsev  yuri.rumyant...@intel.com

         * tree-ssa-loop-ivopts.c (get_computation_cost_at): Adjust
        cost model when address difference produces additional
        non-invariant add operation which is less profitable

[PATCH, Android] Stack protector enabling for Android target

2012-05-06 Thread Igor Zamyatin
Hi!

The patch enables stack protector for Android.
Android targets don't contain necessary information in features.h so
we explicitly enable stack protector for Android.

Bootstrapped and regtested on x86_64. Ok to commit?

Thanks,
Igor

2012-05-06  Igor Zamyatin  igor.zamya...@intel.com

* configure.ac: Stack protector enabling for Android targets.
* configure: Regenerate.


diff --git a/gcc/configure.ac b/gcc/configure.ac
index 86b4bea..c1012d6 100644
--- a/gcc/configure.ac
+++ b/gcc/configure.ac
@@ -4545,6 +4545,8 @@ AC_CACHE_CHECK(__stack_chk_fail in target C library,
   gcc_cv_libc_provides_ssp,
   [gcc_cv_libc_provides_ssp=no
 case $target in
+   *-android*)
+ gcc_cv_libc_provides_ssp=yes;;
*-*-linux* | *-*-kfreebsd*-gnu | *-*-knetbsd*-gnu)
   [# glibc 2.4 and later provides __stack_chk_fail and
   # either __stack_chk_guard, or TLS access to stack guard canary.


Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-05-06 Thread Igor Zamyatin
Ping. Could x86 maintainer(s) look at these changes?

Thanks,
Igor

On Fri, Apr 20, 2012 at 4:04 PM, Igor Zamyatin izamya...@gmail.com wrote:
 On Tue, Apr 17, 2012 at 12:27 AM, Igor Zamyatin izamya...@gmail.com wrote:
 On Fri, Apr 13, 2012 at 4:20 PM, Andrey Belevantsev a...@ispras.ru wrote:
 On 13.04.2012 14:18, Igor Zamyatin wrote:

 On Thu, Apr 12, 2012 at 5:01 PM, Andrey Belevantseva...@ispras.ru
  wrote:

 On 12.04.2012 16:38, Richard Guenther wrote:


 On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatinizamya...@gmail.com
  wrote:


 On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther
 richard.guent...@gmail.com    wrote:


 On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakovamona...@ispras.ru
  wrote:



 Can atom execute two IMUL in parallel?  Or what exactly is the
 pipeline
 behavior?



 As I understand from Intel's optimization reference manual, the
 behavior is as
 follows: if the instruction immediately following IMUL has shorter
 latency,
 execution is stalled for 4 cycles (which is IMUL's latency);
 otherwise,
 a
 4-or-more cycles latency instruction can be issued after IMUL without
 a
 stall.
 In other words, IMUL is pipelined with respect to other long-latency
 instructions, but not to short-latency instructions.



 It seems to be modeled in the pipeline description though:

 ;;; imul insn has 5 cycles latency
 (define_reservation atom-imul-32
                    atom-imul-1, atom-imul-2, atom-imul-3,
 atom-imul-4,
                     atom-port-0)

 ;;; imul instruction excludes other non-FP instructions.
 (exclusion_set atom-eu-0, atom-eu-1
               atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4)


 The main idea is quite simple:

 If we are going to schedule IMUL instruction (it is on the top of
 ready list) we try to find out producer of other (independent) IMUL
 instruction that is in ready list too. The goal is try to schedule
 such a producer to get another IMUL in ready list and get scheduling
 of 2 successive IMUL instructions.



 Why does that not happen without your patch?  Does it never happen
 without
 your patch or does it merely not happen for one EEMBC benchmark (can
 you provide a testcase?)?



 It does not happen because the scheduler by itself does not do such
 specific
 reordering.  That said, it is easy to imagine the cases where this patch
 will make things worse rather than better.

 Igor, why not try different subtler mechanisms like adjust_priority,
 which
 is get called when an insn is added to the ready list?  E.g. increase the
 producer's priority.

 The patch as is misses checks for NONDEBUG_INSN_P.  Also, why bail out
 when
 you have more than one imul in the ready list?  Don't you want to bump
 the
 priority of the other imul found?


 Could you provide some examples when this patch would harm the
 performance?


 I thought of the cases when the other ready insns can fill up the hole and
 that would be more beneficial because e.g. they would be on more critical
 paths than the producer of your second imul.  I don't know enough of Atom to
 give an example -- maybe some long divisions?



 Sched_reorder was chosen since it is used in other ports and looks
 most suitable for such case, e.g. it provides access to the whole
 ready list.
 BTW, just increasing producer's priority seems to be more risky in
 performance sense - we can incorrectly start delaying some
 instructions.


 Yes, but exactly because of the above example you can start incorrectly
 delaying other insns, too, as you force the insn to be the first in the
 list.  While bumping priority still leaves the scheduler sorting heuristics
 in place and actually lowers that risk.


 Thought ready list doesn't contain DEBUG_INSN... Is it so? If it
 contains them - this could be added easily


 It does, but I'm not sure the sched_reorder hook gets them or they are
 immediately removed -- I saw similar checks in one of the targets' hooks.

 Done with DEBUG_INSN, also 1-imul limit was removed. Patch attached


 Anyways, my main thought was that it is better to test on more benchmarks to
 alleviate the above concerns, so as long as the i386 maintainers are happy,
 I don't see major problems here.  A good idea could be to generalize the
 patch to handle other long latency insns as second consumers, not only
 imuls, if this is relevant for Atom.

 Yes, generalization of this approach is in plans. According to Atom
 Software optimization guide there are several headrooms left here.
 As for trying on more benchmarks - the particular case is indeed quite
 rare. I attached the example
 where patch helps to group imuls in pairs which is profitable for
 Atom. Such and similar codes are not very common.
 But hopefully this approach could help avoid this and other glassjaws.

 BTW, this patch also helps some EEMBC tests when -funroll-loops is specified.
 So, any feedback from i386 maintainers about this? :)

 Changelog slightly changed


 2012-04-10  Yuri Rumyantsev  yuri.s.rumyant...@intel.com

        * config/i386

Re: [PATCH] Take branch misprediction effects into account when RTL loop unrolling (issue6099055)

2012-04-27 Thread Igor Zamyatin
Are you sure that tree-level unrollers are turned on at O2? My
impression was that they work only at O3 or with f[unroll,peel]-loops
flags.

On Tue, Apr 24, 2012 at 6:13 PM, Andi Kleen a...@firstfloor.org wrote:
 tejohn...@google.com (Teresa Johnson) writes:

 This patch adds heuristics to limit unrolling in loops with branches
 that may increase branch mispredictions. It affects loops that are
 not frequently iterated, and that are nested within a hot region of code 
 that already contains many branch instructions.

 Performance tested with both internal benchmarks and with SPEC
 2000/2006 on a variety of Intel systems (Core2, Corei7, SandyBridge) and a 
 couple of different AMD Opteron systems.
 This improves performance of an internal search indexing benchmark by
 close to 2% on all the tested Intel platforms.  It also consistently
 improves 445.gobmk (with FDO feedback where unrolling kicks in) by
 close to 1% on AMD Opteron. Other performance effects are neutral.

 Bootstrapped and tested on x86_64-unknown-linux-gnu.  Is this ok for trunk?

 One problem with any unrolling heuristics is currently that gcc has
 both the tree level and the rtl level unroller. The tree one is even
 on at -O3.  So if you tweak anything for one you have to affect both,
 otherwise the other may still do the wrong thing(tm).

Tree level unrollers (cunrolli and cunroll) only do complete unrolling. At O2,
both of them are turned on, but gcc does not allow any code growth --
which makes them pretty useless at O2 (very few loops qualify). The
default max complete peel iteration is also too low compared with both
icc and llvm.  This needs to be tuned.
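
A small illustration of that point (the example function is an assumption,
not from the thread): at -O2 the tree-level complete unrollers can still
unroll a loop like this, because the trip count is a tiny constant and
removing the loop does not grow the code, while larger loops stay rolled
unless -O3 or -funroll-loops relaxes the growth limits.

  /* Illustration only: fully unrolled by cunrolli/cunroll even at -O2.  */
  int
  sum4 (const int *a)
  {
    int s, i;

    s = 0;
    for (i = 0; i < 4; i++)
      s += a[i];
    return s;
  }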

David


 For some other tweaks I looked into a shared cost model some time ago.
 May be still needed.

 -Andi

 --
 a...@linux.intel.com -- Speaking for myself only


Re: [PATCH, tree-optimization] Fix for PR 52868

2012-04-27 Thread Igor Zamyatin
On Wed, Apr 25, 2012 at 6:41 PM, Richard Guenther
richard.guent...@gmail.com wrote:
 On Wed, Apr 25, 2012 at 4:32 PM, Igor Zamyatin izamya...@gmail.com wrote:
 On Wed, Apr 25, 2012 at 1:14 PM, Richard Guenther
 richard.guent...@gmail.com wrote:
 On Wed, Apr 25, 2012 at 10:56 AM, Igor Zamyatin izamya...@gmail.com wrote:
 Hi all!

 I'd like to post for review the patch which makes some costs adjusting
 in get_computation_cost_at routine in ivopts part.
 As mentioned in the PR changes also fix the bwaves regression from PR 
 52272.
 Also changes introduce no degradations on spec2000/2006 and
 EEMBC1.1/2.0(this was measured on Atom) on x86


 Bootstrapped and regtested on x86. Ok to commit?

 I can't make sense of the patch and the comment does not help.

 +      diff_cost = cost.cost;
        cost.cost /= avg_loop_niter (data->current_loop);
 +      add_cost_val = add_cost (TYPE_MODE (ctype), data->speed);
 +      /* Do cost correction if address cost is small enough
 +         and difference cost is high enough.  */
 +      if (address_p && diff_cost > add_cost_val
 +          && get_address_cost (symbol_present, var_present,
 +                               offset, ratio, cstepi,
 +                               TYPE_MODE (TREE_TYPE (utype)),
 +                               TYPE_ADDR_SPACE (TREE_TYPE (utype)),
 +                               speed, stmt_is_after_inc,
 +                               can_autoinc).cost <= add_cost_val)
 +        cost.cost += add_cost_val;

 Please explain more thoroughly.  It also would seem to be better to add
 an extra case, as later code does

 For example for such code

   for (j=0; j<M;j++) {
       for (i=0; i<N; i++)
           sum += ptr->a[j][i] * ptr->c[k][i];
   }
  we currently have following gimple on x86 target (I provided a piece
 of all phase output):

           # ivtmp.13_30 = PHI ivtmp.13_31(3), ivtmp.13_33(7)
           D.1748_34 = (void *) ivtmp.13_30;
           D.1722_7 = MEM[base: D.1748_34, offset: 0B];
           D.1750_36 = ivtmp.27_28;
           D.1751_37 = D.1750_36 + ivtmp.13_30; -- we got
 non-invariant add which is not taken into account currently in cost
 model
           D.1752_38 = (void *) D.1751_37;
           D.1753_39 = (sizetype) k_8(D);
           D.1754_40 = D.1753_39 * 800;
           D.1723_9 = MEM[base: D.1752_38, index: D.1754_40, offset: 16000B];
           ...

  With proposed fix we produce:

           # ivtmp.14_30 = PHI ivtmp.14_31(3), 0(7)
           D.1749_34 = (struct S *) ivtmp.25_28;
           D.1722_7 = MEM[base: D.1749_34, index: ivtmp.14_30, offset: 0B];
           D.1750_35 = (sizetype) k_8(D);
           D.1751_36 = D.1750_35 * 800;
           D.1752_37 = ptr_6(D) + D.1751_36;
           D.1723_9 = MEM[base: D.1752_37, index: ivtmp.14_30, offset: 
 16000B];

 which is more effective on platforms where address cost is cheaper
 than cost of addition operation. That's basically what this adjustment
 is for.

 If we generally miss to account for the add then why is the adjustment
 conditional on diff_cost > add_cost and address_cost <= add_cost?

 Is this a new heuristic or a fix for not accurately computing the cost for the
 stmts we generate?

I'd say this is closer to heuristic since diff_cost > add_cost is an
attempt to catch the case with non-invariant add produced by pointer
difference and address_cost <= add_cost leaves the cases with cheap
address operations


 Richard.

 So comment in the source code now looks as follows

 /* Do cost correction when address difference produces
   additional non-invariant add operation which is less
   profitable if address cost is cheaper than cost of add.  */


  /* Now the computation is in shape symbol + var1 + const + ratio * var2.
     (symbol/var1/const parts may be omitted).  If we are looking for an
     address, find the cost of addressing this.  */
  if (address_p)
    return add_costs (cost,
                      get_address_cost (symbol_present, var_present,
                                        offset, ratio, cstepi,
                                        TYPE_MODE (TREE_TYPE (utype)),
                                        TYPE_ADDR_SPACE (TREE_TYPE (utype)),
                                        speed, stmt_is_after_inc,
                                        can_autoinc));

 thus refactoring the code a bit would make it possible to CSE the
 get_address_cost
 call and eventually make it clearer what the code does.

  'offset' could be changed between two calls of get_address_cost so
 such refactoring looks useless.

 New patch (only the comment was changed) attached. Changelog was
 changed as well.


 Richard.


 Changelog:

  2012-04-26  Yuri Rumyantsev  yuri.rumyant...@intel.com

         * tree-ssa-loop-ivopts.c (get_computation_cost_at): Adjust
        cost model when address difference produces additional
        non-invariant add operation which is less profitable if
        address cost is cheaper than cost of add.

 Thanks,
 Igor


Re: FW: [v3] libstdc++/52689 testcase

2012-04-25 Thread Igor Zamyatin
This testcase is reported as failing on x86


 Noticed that this testcase wasn't put in as part of the patch. Fixed as 
 follows.

 tested x86/linux

 -benjamin



[PATCH, tree-optimization] Fix for PR 52868

2012-04-25 Thread Igor Zamyatin
Hi all!

I'd like to post for review the patch which makes some costs adjusting
in get_computation_cost_at routine in ivopts part.
As mentioned in the PR changes also fix the bwaves regression from PR 52272.
Also changes introduce no degradations on spec2000/2006 and
EEMBC1.1/2.0(this was measured on Atom) on x86


Bootstrapped and regtested on x86. Ok to commit?


Changelog:

2012-04-26  Yuri Rumyantsev  yuri.rumyant...@intel.com

* tree-ssa-loop-ivopts.c (get_computation_cost_at): Add cost
of extra addition if cost of address difference is high enough.

Thanks,
Igor


ivopts.patch
Description: Binary data


Re: [PATCH, tree-optimization] Fix for PR 52868

2012-04-25 Thread Igor Zamyatin
On Wed, Apr 25, 2012 at 1:14 PM, Richard Guenther
richard.guent...@gmail.com wrote:
 On Wed, Apr 25, 2012 at 10:56 AM, Igor Zamyatin izamya...@gmail.com wrote:
 Hi all!

 I'd like to post for review the patch which makes some costs adjusting
 in get_computation_cost_at routine in ivopts part.
 As mentioned in the PR changes also fix the bwaves regression from PR 52272.
 Also changes introduce no degradations on spec2000/2006 and
 EEMBC1.1/2.0(this was measured on Atom) on x86


 Bootstrapped and regtested on x86. Ok to commit?

 I can't make sense of the patch and the comment does not help.

 +      diff_cost = cost.cost;
       cost.cost /= avg_loop_niter (data->current_loop);
 +      add_cost_val = add_cost (TYPE_MODE (ctype), data->speed);
 +      /* Do cost correction if address cost is small enough
 +         and difference cost is high enough.  */
 +      if (address_p && diff_cost > add_cost_val
 +          && get_address_cost (symbol_present, var_present,
 +                               offset, ratio, cstepi,
 +                               TYPE_MODE (TREE_TYPE (utype)),
 +                               TYPE_ADDR_SPACE (TREE_TYPE (utype)),
 +                               speed, stmt_is_after_inc,
 +                               can_autoinc).cost <= add_cost_val)
 +        cost.cost += add_cost_val;

 Please explain more thoroughly.  It also would seem to be better to add
 an extra case, as later code does

For example for such code

   for (j=0; j<M;j++) {
   for (i=0; i<N; i++)
   sum += ptr->a[j][i] * ptr->c[k][i];
   }
 we currently have following gimple on x86 target (I provided a piece
of all phase output):

   # ivtmp.13_30 = PHI ivtmp.13_31(3), ivtmp.13_33(7)
   D.1748_34 = (void *) ivtmp.13_30;
   D.1722_7 = MEM[base: D.1748_34, offset: 0B];
   D.1750_36 = ivtmp.27_28;
   D.1751_37 = D.1750_36 + ivtmp.13_30; -- we got
non-invariant add which is not taken into account currently in cost
model
   D.1752_38 = (void *) D.1751_37;
   D.1753_39 = (sizetype) k_8(D);
   D.1754_40 = D.1753_39 * 800;
   D.1723_9 = MEM[base: D.1752_38, index: D.1754_40, offset: 16000B];
   ...

 With proposed fix we produce:

   # ivtmp.14_30 = PHI ivtmp.14_31(3), 0(7)
   D.1749_34 = (struct S *) ivtmp.25_28;
   D.1722_7 = MEM[base: D.1749_34, index: ivtmp.14_30, offset: 0B];
   D.1750_35 = (sizetype) k_8(D);
   D.1751_36 = D.1750_35 * 800;
   D.1752_37 = ptr_6(D) + D.1751_36;
   D.1723_9 = MEM[base: D.1752_37, index: ivtmp.14_30, offset: 16000B];

which is more effective on platforms where address cost is cheaper
than cost of addition operation. That's basically what this adjustment
is for.

So comment in the source code now looks as follows

/* Do cost correction when address difference produces
   additional non-invariant add operation which is less
   profitable if address cost is cheaper than cost of add.  */


  /* Now the computation is in shape symbol + var1 + const + ratio * var2.
     (symbol/var1/const parts may be omitted).  If we are looking for an
     address, find the cost of addressing this.  */
  if (address_p)
    return add_costs (cost,
                      get_address_cost (symbol_present, var_present,
                                        offset, ratio, cstepi,
                                        TYPE_MODE (TREE_TYPE (utype)),
                                        TYPE_ADDR_SPACE (TREE_TYPE (utype)),
                                        speed, stmt_is_after_inc,
                                        can_autoinc));

 thus refactoring the code a bit would make it possible to CSE the
 get_address_cost
 call and eventually make it clearer what the code does.

'offset' could be changed between two calls of get_address_cost so
such refactoring looks useless.

New patch (only the comment was changed) attached. Changelog was
changed as well.


 Richard.


Changelog:

 2012-04-26  Yuri Rumyantsev  yuri.rumyant...@intel.com

        * tree-ssa-loop-ivopts.c (get_computation_cost_at): Adjust
        cost model when address difference produces additional
non-invariant add operation which is less profitable if
address cost is cheaper than cost of add.

Thanks,
Igor


ivopts1.patch
Description: Binary data


Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-04-20 Thread Igor Zamyatin
On Tue, Apr 17, 2012 at 12:27 AM, Igor Zamyatin izamya...@gmail.com wrote:
 On Fri, Apr 13, 2012 at 4:20 PM, Andrey Belevantsev a...@ispras.ru wrote:
 On 13.04.2012 14:18, Igor Zamyatin wrote:

 On Thu, Apr 12, 2012 at 5:01 PM, Andrey Belevantseva...@ispras.ru
  wrote:

 On 12.04.2012 16:38, Richard Guenther wrote:


 On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatinizamya...@gmail.com
  wrote:


 On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther
 richard.guent...@gmail.com    wrote:


 On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakovamona...@ispras.ru
  wrote:



 Can atom execute two IMUL in parallel?  Or what exactly is the
 pipeline
 behavior?



 As I understand from Intel's optimization reference manual, the
 behavior is as
 follows: if the instruction immediately following IMUL has shorter
 latency,
 execution is stalled for 4 cycles (which is IMUL's latency);
 otherwise,
 a
 4-or-more cycles latency instruction can be issued after IMUL without
 a
 stall.
 In other words, IMUL is pipelined with respect to other long-latency
 instructions, but not to short-latency instructions.



 It seems to be modeled in the pipeline description though:

 ;;; imul insn has 5 cycles latency
 (define_reservation atom-imul-32
                    atom-imul-1, atom-imul-2, atom-imul-3,
 atom-imul-4,
                     atom-port-0)

 ;;; imul instruction excludes other non-FP instructions.
 (exclusion_set atom-eu-0, atom-eu-1
               atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4)


 The main idea is quite simple:

 If we are going to schedule IMUL instruction (it is on the top of
 ready list) we try to find out producer of other (independent) IMUL
 instruction that is in ready list too. The goal is try to schedule
 such a producer to get another IMUL in ready list and get scheduling
 of 2 successive IMUL instructions.



 Why does that not happen without your patch?  Does it never happen
 without
 your patch or does it merely not happen for one EEMBC benchmark (can
 you provide a testcase?)?



 It does not happen because the scheduler by itself does not do such
 specific
 reordering.  That said, it is easy to imagine the cases where this patch
 will make things worse rather than better.

 Igor, why not try different subtler mechanisms like adjust_priority,
 which
 is get called when an insn is added to the ready list?  E.g. increase the
 producer's priority.

 The patch as is misses checks for NONDEBUG_INSN_P.  Also, why bail out
 when
 you have more than one imul in the ready list?  Don't you want to bump
 the
 priority of the other imul found?


 Could you provide some examples when this patch would harm the
 performance?


 I thought of the cases when the other ready insns can fill up the hole and
 that would be more beneficial because e.g. they would be on more critical
 paths than the producer of your second imul.  I don't know enough of Atom to
 give an example -- maybe some long divisions?



 Sched_reorder was chosen since it is used in other ports and looks
 most suitable for such case, e.g. it provides access to the whole
 ready list.
 BTW, just increasing producer's priority seems to be more risky in
 performance sense - we can incorrectly start delaying some
 instructions.


 Yes, but exactly because of the above example you can start incorrectly
 delaying other insns, too, as you force the insn to be the first in the
 list.  While bumping priority still leaves the scheduler sorting heuristics
 in place and actually lowers that risk.


 Thought ready list doesn't contain DEBUG_INSN... Is it so? If it
 contains them - this could be added easily


 It does, but I'm not sure the sched_reorder hook gets them or they are
 immediately removed -- I saw similar checks in one of the targets' hooks.

 Done with DEBUG_INSN, also 1-imul limit was removed. Patch attached


 Anyways, my main thought was that it is better to test on more benchmarks to
 alleviate the above concerns, so as long as the i386 maintainers are happy,
 I don't see major problems here.  A good idea could be to generalize the
 patch to handle other long latency insns as second consumers, not only
 imuls, if this is relevant for Atom.

 Yes, generalization of this approach is in plans. According to Atom
 Software optimization guide there are several headrooms left here.
 As for trying on more benchmarks - the particular case is indeed quite
 rare. I attached the example
 where patch helps to group imuls in pairs which is profitable for
 Atom. Such and similar codes are not very common.
 But hopefully this approach could help avoid this and other glassjaws.

BTW, this patch also helps some EEMBC tests when -funroll-loops is specified.
So, any feedback from i386 maintainers about this? :)

Changelog slightly changed


2012-04-10  Yuri Rumyantsev  yuri.s.rumyant...@intel.com

* config/i386/i386.c (x86_sched_reorder): New function.
Added new function x86_sched_reorder
to take advantage of Atom pipelened IMUL execution

Re: [PATCH] Atom: Enabling unroll at O2 optimization level

2012-04-17 Thread Igor Zamyatin
On Thu, Apr 12, 2012 at 3:16 PM, Richard Guenther
richard.guent...@gmail.com wrote:
 On Thu, Apr 12, 2012 at 1:05 PM, Igor Zamyatin izamya...@gmail.com wrote:
 On Wed, Apr 11, 2012 at 12:39 PM, Richard Guenther
 richard.guent...@gmail.com wrote:
 On Tue, Apr 10, 2012 at 8:43 PM, Igor Zamyatin izamya...@gmail.com wrote:
 Hi All!

 Here is a patch that enables unroll at O2 for Atom.

 This gives good performance boost on EEMBC 2.0 (~+8% in Geomean for 32
 bits) with quite moderate code size increase (~5% for EEMBC2.0, 32
 bits).

 5% is not moderate.  Your patch does enable unrolling at -O2 but not -O3,
 why? Why do you disable register renaming?  check_imull requires a function
 comment.

 Sure, enabling unroll for O3 could be the next step.
 We can't avoid code size increase with unroll - what number do you
 think will be appropriate?
 Register renaming was the reason for several degradations during the tuning
 process.
 A comment for check_imull was added.


 This completely looks like a hack for EEMBC2.0, so it's definitely not ok.

 Why? EEMBC was measured and result provided here just because this
 benchmark considers to be very relevant for Atom

 I'd say that SPEC INT (2000 / 2006) is more relevant for Atom (SPEC FP
 would be irrelevant OTOH).  Similar code size for, say, Mozilla Firefox
 or GCC itself would be important.

 -O2 is not supposed to give best benchmark results.

 O2 is wide-used so performance improvement could be important for users.

 But not at a 5% size cost.  Please also always check the compile-time effect
 which is important for -O2 as well.

What would be an acceptable number of size cost/compile-time increase
for O2 and O3 on EEMBC, SPEC INT 2000 and Mozilla?

Is it possible in common to put Atom-specific unroll heuristics under
some option which could be mentioned in GCC docs?


 Richard.


 Thanks,
 Richard.


 Tested for i386 and x86-64, ok for trunk?

 Updated patch attached


 Thanks,
 Igor

 ChangeLog:

 2012-04-10  Yakovlev Vladimir  vladimir.b.yakov...@intel.com

        * gcc/config/i386/i386.c (check_imul): New routine.
        (ix86_loop_unroll_adjust): New target hook.
        (ix86_option_override_internal): Enable unrolling on Atom at -O2.
        (TARGET_LOOP_UNROLL_ADJUST): New define.

Thanks,
Igor


Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-04-16 Thread Igor Zamyatin
On Fri, Apr 13, 2012 at 4:20 PM, Andrey Belevantsev a...@ispras.ru wrote:
 On 13.04.2012 14:18, Igor Zamyatin wrote:

 On Thu, Apr 12, 2012 at 5:01 PM, Andrey Belevantseva...@ispras.ru
  wrote:

 On 12.04.2012 16:38, Richard Guenther wrote:


 On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatinizamya...@gmail.com
  wrote:


 On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther
 richard.guent...@gmail.com    wrote:


 On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakovamona...@ispras.ru
  wrote:



 Can atom execute two IMUL in parallel?  Or what exactly is the
 pipeline
 behavior?



 As I understand from Intel's optimization reference manual, the
 behavior is as
 follows: if the instruction immediately following IMUL has shorter
 latency,
 execution is stalled for 4 cycles (which is IMUL's latency);
 otherwise,
 a
 4-or-more cycles latency instruction can be issued after IMUL without
 a
 stall.
 In other words, IMUL is pipelined with respect to other long-latency
 instructions, but not to short-latency instructions.



 It seems to be modeled in the pipeline description though:

 ;;; imul insn has 5 cycles latency
 (define_reservation atom-imul-32
                    atom-imul-1, atom-imul-2, atom-imul-3,
 atom-imul-4,
                     atom-port-0)

 ;;; imul instruction excludes other non-FP instructions.
 (exclusion_set atom-eu-0, atom-eu-1
               atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4)


 The main idea is quite simple:

 If we are going to schedule IMUL instruction (it is on the top of
 ready list) we try to find out producer of other (independent) IMUL
 instruction that is in ready list too. The goal is try to schedule
 such a producer to get another IMUL in ready list and get scheduling
 of 2 successive IMUL instructions.



 Why does that not happen without your patch?  Does it never happen
 without
 your patch or does it merely not happen for one EEMBC benchmark (can
 you provide a testcase?)?



 It does not happen because the scheduler by itself does not do such
 specific
 reordering.  That said, it is easy to imagine the cases where this patch
 will make things worse rather than better.

 Igor, why not try different subtler mechanisms like adjust_priority,
 which
 is get called when an insn is added to the ready list?  E.g. increase the
 producer's priority.

 The patch as is misses checks for NONDEBUG_INSN_P.  Also, why bail out
 when
 you have more than one imul in the ready list?  Don't you want to bump
 the
 priority of the other imul found?


 Could you provide some examples when this patch would harm the
 performance?


 I thought of the cases when the other ready insns can fill up the hole and
 that would be more beneficial because e.g. they would be on more critical
 paths than the producer of your second imul.  I don't know enough of Atom to
 give an example -- maybe some long divisions?



 Sched_reorder was chosen since it is used in other ports and looks
 most suitable for such case, e.g. it provides access to the whole
 ready list.
 BTW, just increasing producer's priority seems to be more risky in
 performance sense - we can incorrectly start delaying some
 instructions.


 Yes, but exactly because of the above example you can start incorrectly
 delaying other insns, too, as you force the insn to be the first in the
 list.  While bumping priority still leaves the scheduler sorting heuristics
 in place and actually lowers that risk.


 Thought ready list doesn't contain DEBUG_INSN... Is it so? If it
 contains them - this could be added easily


 It does, but I'm not sure the sched_reorder hook gets them or they are
 immediately removed -- I saw similar checks in one of the targets' hooks.

Done with DEBUG_INSN, also 1-imul limit was removed. Patch attached


 Anyways, my main thought was that it is better to test on more benchmarks to
 alleviate the above concerns, so as long as the i386 maintainers are happy,
 I don't see major problems here.  A good idea could be to generalize the
 patch to handle other long latency insns as second consumers, not only
 imuls, if this is relevant for Atom.

Yes, generalization of this approach is in plans. According to Atom
Software optimization guide there are several headrooms left here.
As for trying on more benchmarks - the particular case is indeed quite
rare. I attached the example
where patch helps to group imuls in pairs which is profitable for
Atom. Such and similar codes are not very common.
But hopefully this approach could help avoid this and other glassjaws.


 Andrey



 The case with more than one imul in the ready list wasn't considered
 just because the goal was to catch the particular case when there is a
 risk to get the following picture: imul-producer-imul which is less
 effective than producer-imul-imul for Atom


 Andrey



 And the MD only allows us to prefer scheduling of successive IMUL
 instructions,
 i.e.
 If IMUL was just scheduled and ready list contains another IMUL
 instruction then it will be chosen

Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-04-13 Thread Igor Zamyatin
On Thu, Apr 12, 2012 at 5:01 PM, Andrey Belevantsev a...@ispras.ru wrote:
 On 12.04.2012 16:38, Richard Guenther wrote:

 On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatinizamya...@gmail.com
  wrote:

 On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther
 richard.guent...@gmail.com  wrote:

 On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakovamona...@ispras.ru
  wrote:


 Can atom execute two IMUL in parallel?  Or what exactly is the pipeline
 behavior?


 As I understand from Intel's optimization reference manual, the behavior is
 as follows: if the instruction immediately following IMUL has shorter
 latency, execution is stalled for 4 cycles (which is IMUL's latency);
 otherwise, a 4-or-more cycles latency instruction can be issued after IMUL
 without a stall.
 In other words, IMUL is pipelined with respect to other long-latency
 instructions, but not to short-latency instructions.


 It seems to be modeled in the pipeline description though:

 ;;; imul insn has 5 cycles latency
 (define_reservation "atom-imul-32"
                     "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4, atom-port-0")

 ;;; imul instruction excludes other non-FP instructions.
 (exclusion_set "atom-eu-0, atom-eu-1"
                "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4")


 The main idea is quite simple:

 If we are going to schedule an IMUL instruction (it is at the top of the
 ready list), we try to find the producer of another (independent) IMUL
 instruction that is in the ready list too.  The goal is to schedule such a
 producer so that another IMUL gets into the ready list and we get scheduling
 of 2 successive IMUL instructions.


 Why does that not happen without your patch?  Does it never happen without
 your patch or does it merely not happen for one EEMBC benchmark (can
 you provide a testcase?)?


 It does not happen because the scheduler by itself does not do such specific
 reordering.  That said, it is easy to imagine the cases where this patch
 will make things worse rather than better.

 Igor, why not try different, subtler mechanisms like adjust_priority, which
 gets called when an insn is added to the ready list?  E.g. increase the
 producer's priority.
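
 For concreteness, a rough, untested sketch of that direction (this is not
 part of the posted patch; insn_feeds_imul_p is only a placeholder for a walk
 over the insn's forward dependencies looking for an imul consumer, not an
 existing function):

 static int
 atom_adjust_priority_sketch (rtx insn, int priority)
 {
   /* If INSN feeds an independent imul that is not ready yet, bump INSN so
      it is scheduled earlier and that imul becomes ready sooner.  */
   if (NONDEBUG_INSN_P (insn) && insn_feeds_imul_p (insn))
     return priority + 1;
   return priority;
 }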

 The patch as is misses checks for NONDEBUG_INSN_P.  Also, why bail out when
 you have more than one imul in the ready list?  Don't you want to bump the
 priority of the other imul found?

Could you provide some examples where this patch would harm performance?

Sched_reorder was chosen since it is used in other ports and looks
most suitable for such a case, e.g. it provides access to the whole
ready list.
BTW, just increasing the producer's priority seems to be riskier
performance-wise - we can incorrectly start delaying some
instructions.

I thought the ready list doesn't contain DEBUG_INSNs... Is that so?  If it
contains them, this could be added easily.

The case with more than one imul in the ready list wasn't considered
just because the goal was to catch the particular case when there is a
risk of getting the following picture: imul-producer-imul, which is less
effective than producer-imul-imul on Atom.


 Andrey



 And the MD only allows us to prefer scheduling of successive IMUL
 instructions, i.e. if an IMUL was just scheduled and the ready list contains
 another IMUL instruction, then it will be chosen as the best candidate for
 scheduling.


 at least from my very limited guessing of what the above does.  So, did you
 analyze why the scheduler does not properly schedule things for you?

 Richard.


  From reading the patch, I could not understand the link between pipeline
 behavior and what the patch appears to do.

 Hope that helps.

 Alexander



Thanks,
Igor


Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-04-13 Thread Igor Zamyatin
On Fri, Apr 13, 2012 at 2:18 PM, Igor Zamyatin izamya...@gmail.com wrote:
 On Thu, Apr 12, 2012 at 5:01 PM, Andrey Belevantsev a...@ispras.ru wrote:
 On 12.04.2012 16:38, Richard Guenther wrote:

 On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatin izamya...@gmail.com
  wrote:

 On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther
 richard.guent...@gmail.com  wrote:

 On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov amona...@ispras.ru
  wrote:


 Can atom execute two IMUL in parallel?  Or what exactly is the pipeline
 behavior?


 As I understand from Intel's optimization reference manual, the behavior is
 as follows: if the instruction immediately following IMUL has shorter
 latency, execution is stalled for 4 cycles (which is IMUL's latency);
 otherwise, a 4-or-more cycles latency instruction can be issued after IMUL
 without a stall.
 In other words, IMUL is pipelined with respect to other long-latency
 instructions, but not to short-latency instructions.


 It seems to be modeled in the pipeline description though:

 ;;; imul insn has 5 cycles latency
 (define_reservation "atom-imul-32"
                     "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4, atom-port-0")

 ;;; imul instruction excludes other non-FP instructions.
 (exclusion_set "atom-eu-0, atom-eu-1"
                "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4")


 The main idea is quite simple:

 If we are going to schedule an IMUL instruction (it is at the top of the
 ready list), we try to find the producer of another (independent) IMUL
 instruction that is in the ready list too.  The goal is to schedule such a
 producer so that another IMUL gets into the ready list and we get scheduling
 of 2 successive IMUL instructions.


 Why does that not happen without your patch?  Does it never happen without
 your patch or does it merely not happen for one EEMBC benchmark (can
 you provide a testcase?)?


 It does not happen because the scheduler by itself does not do such specific
 reordering.  That said, it is easy to imagine the cases where this patch
 will make things worse rather than better.

 Igor, why not try different, subtler mechanisms like adjust_priority, which
 gets called when an insn is added to the ready list?  E.g. increase the
 producer's priority.

 The patch as is misses checks for NONDEBUG_INSN_P.  Also, why bail out when
 you have more than one imul in the ready list?  Don't you want to bump the
 priority of the other imul found?

 Could you provide some examples where this patch would harm performance?

 Sched_reorder was chosen since it is used in other ports and looks
 most suitable for such a case, e.g. it provides access to the whole
 ready list.
 BTW, just increasing the producer's priority seems to be riskier
 performance-wise - we can incorrectly start delaying some
 instructions.

 I thought the ready list doesn't contain DEBUG_INSNs... Is that so?  If it
 contains them, this could be added easily.

 The case with more than one imul in the ready list wasn't considered
 just because the goal was to catch the particular case when there is a
 risk of getting the following picture: imul-producer-imul, which is less
 effective than producer-imul-imul on Atom.


Actually, this constraint of one imul could be omitted.  I'm going to
try this change.



 Andrey



 And the MD only allows us to prefer scheduling of successive IMUL
 instructions, i.e. if an IMUL was just scheduled and the ready list contains
 another IMUL instruction, then it will be chosen as the best candidate for
 scheduling.


 at least from my very limited guessing of what the above does.  So, did you
 analyze why the scheduler does not properly schedule things for you?

 Richard.


  From reading the patch, I could not understand the link between pipeline
 behavior and what the patch appears to do.

 Hope that helps.

 Alexander



 Thanks,
 Igor


Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-04-12 Thread Igor Zamyatin
On Wed, Apr 11, 2012 at 5:38 PM, Andi Kleen a...@firstfloor.org wrote:
 Igor Zamyatin izamya...@gmail.com writes:

 Hi All!

 It is known that imul placement is rather critical for Atom processors,
 and these changes try to improve imul scheduling for Atom.

 This gives +5% performance on several tests from the new OA 2.0 testsuite
 from EEMBC.

 Tested for i386 and x86-64, ok for trunk?

 Did you measure how much this slows down the compiler when compiling
 for Atom? The new pass looks rather slow.

No, since this pass is applied rarely.  The conditions are mentioned in
the comments to ix86_sched_reorder.


 -Andi

 --
 a...@linux.intel.com -- Speaking for myself only


Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-04-12 Thread Igor Zamyatin
On Wed, Apr 11, 2012 at 6:10 PM, Richard Guenther
richard.guent...@gmail.com wrote:
 On Wed, Apr 11, 2012 at 3:38 PM, Andi Kleen a...@firstfloor.org wrote:
 Igor Zamyatin izamya...@gmail.com writes:

 Hi All!

 It is known that imul placement is rather critical for Atom processors,
 and these changes try to improve imul scheduling for Atom.

 This gives +5% performance on several tests from the new OA 2.0 testsuite
 from EEMBC.

 Tested for i386 and x86-64, ok for trunk?

 Did you measure how much this slows down the compiler when compiling
 for Atom? The new pass looks rather slow.

 Also please explain why adjusting the automaton for Atom is not a way to
 attack this issue.

If I understand the question correctly - it's a dynamic check, and it's
not clear how to describe this adjustment statically in the machine
description.


 Richard.

 -Andi

 --
 a...@linux.intel.com -- Speaking for myself only


Re: [PATCH] Atom: Enabling unroll at O2 optimization level

2012-04-12 Thread Igor Zamyatin
On Wed, Apr 11, 2012 at 5:34 PM, Andi Kleen a...@firstfloor.org wrote:
 Richard Guenther richard.guent...@gmail.com writes:

 5% is not moderate.  Your patch does enable unrolling at -O2 but not -O3,
 why? Why do you disable register renaming?  check_imull requires a function
 comment.

 This completely looks like a hack for EEMBC2.0, so it's definitely not ok.

 -O2 is not supposed to give best benchmark results.

 Besides it is against the Intel Optimization Manual recommendation
 to prefer small code on Atom to avoid falling out of the predecode hints
 in the cache.

Yes, this is a well-known concern for Atom.  But at the same time, unrolling
could help a lot on in-order machines because it can provide more
opportunities to the compiler scheduler.  And experiments showed that
unrolling can really help.
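
As a hypothetical illustration (not one of the EEMBC tests), unrolling a
reduction by two gives an in-order scheduler two independent multiply chains
to interleave while each imul's latency is hidden:

int dot2 (const int *a, const int *b, int n)
{
  int s0 = 0, s1 = 0;
  int i;
  /* Assume n is even, for brevity.  */
  for (i = 0; i < n; i += 2)
    {
      s0 += a[i] * b[i];          /* chain 0 */
      s1 += a[i + 1] * b[i + 1];  /* chain 1 */
    }
  return s0 + s1;
}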


 So would need much more benchmarking on macro workloads first at least.

Like what, for example?  I believe in this case everything also
strongly depends on the test usage model (e.g. it is usually compiled with
-Os, not -O2) and, let's say, on the internal test structure - whether there
are hot loops that are suitable for unrolling.


 -Andi

 --
 a...@linux.intel.com -- Speaking for myself only


Re: [PATCH] Atom: Enabling unroll at O2 optimization level

2012-04-12 Thread Igor Zamyatin
On Wed, Apr 11, 2012 at 12:39 PM, Richard Guenther
richard.guent...@gmail.com wrote:
 On Tue, Apr 10, 2012 at 8:43 PM, Igor Zamyatin izamya...@gmail.com wrote:
 Hi All!

 Here is a patch that enables unroll at O2 for Atom.

 This gives a good performance boost on EEMBC 2.0 (~+8% in geomean for 32
 bits) with a quite moderate code size increase (~5% for EEMBC 2.0, 32
 bits).

 5% is not moderate.  Your patch does enable unrolling at -O2 but not -O3,
 why? Why do you disable register renaming?  check_imull requires a function
 comment.

Sure, enabling unroll for O3 could be the next step.
We can't avoid a code size increase with unrolling - what number do you
think would be appropriate?
Register renaming was the reason for several degradations during the tuning
process.  A comment for check_imull was added.


 This completely looks like a hack for EEMBC2.0, so it's definitely not ok.

Why?  EEMBC was measured and the result provided here just because this
benchmark is considered to be very relevant for Atom.


 -O2 is not supposed to give best benchmark results.

O2 is widely used, so a performance improvement could be important for users.


 Thanks,
 Richard.


 Tested for i386 and x86-64, ok for trunk?

Updated patch attached


 Thanks,
 Igor

 ChangeLog:

 2012-04-10  Yakovlev Vladimir  vladimir.b.yakov...@intel.com

        * gcc/config/i386/i386.c (check_imul): New routine.
        (ix86_loop_unroll_adjust): New target hook.
        (ix86_option_override_internal): Enable unrolling on Atom at -O2.
        (TARGET_LOOP_UNROLL_ADJUST): New define.


unroll1.patch
Description: Binary data


Re: [PATCH] Atom: Scheduler improvements for better imul placement

2012-04-12 Thread Igor Zamyatin
On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther
richard.guent...@gmail.com wrote:
 On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov amona...@ispras.ru wrote:

 Can atom execute two IMUL in parallel?  Or what exactly is the pipeline
 behavior?

 As I understand from Intel's optimization reference manual, the behavior is
 as follows: if the instruction immediately following IMUL has shorter
 latency, execution is stalled for 4 cycles (which is IMUL's latency);
 otherwise, a 4-or-more cycles latency instruction can be issued after IMUL
 without a stall.
 In other words, IMUL is pipelined with respect to other long-latency
 instructions, but not to short-latency instructions.

 It seems to be modeled in the pipeline description though:

 ;;; imul insn has 5 cycles latency
 (define_reservation "atom-imul-32"
                     "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4, atom-port-0")

 ;;; imul instruction excludes other non-FP instructions.
 (exclusion_set "atom-eu-0, atom-eu-1"
                "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4")


The main idea is quite simple:

If we are going to schedule an IMUL instruction (it is at the top of the
ready list), we try to find the producer of another (independent) IMUL
instruction that is in the ready list too.  The goal is to schedule such a
producer so that another IMUL gets into the ready list and we get scheduling
of 2 successive IMUL instructions.
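
A rough, untested sketch of that reordering (this is not the attached patch;
it assumes the usual sched_reorder interface where the best candidate sits at
ready[*pn_ready - 1], and imul_insn_p / insn_feeds_independent_imul_p are
only placeholders, not existing functions):

static int
ix86_sched_reorder_sketch (FILE *dump ATTRIBUTE_UNUSED,
                           int sched_verbose ATTRIBUTE_UNUSED,
                           rtx *ready, int *pn_ready,
                           int clock_var ATTRIBUTE_UNUSED)
{
  int n = *pn_ready;
  int i;

  /* Only act when the insn about to be issued is an imul.  */
  if (n < 2 || !imul_insn_p (ready[n - 1]))
    return ix86_issue_rate ();

  /* Find a ready insn that feeds another, independent imul and move it to
     the top, so that the second imul becomes ready right after the first
     one is issued (producer, imul, imul instead of imul, producer, imul).
     A plain swap is enough for the sketch.  */
  for (i = n - 2; i >= 0; i--)
    if (NONDEBUG_INSN_P (ready[i])
        && insn_feeds_independent_imul_p (ready[i]))
      {
        rtx producer = ready[i];
        ready[i] = ready[n - 1];
        ready[n - 1] = producer;
        break;
      }

  return ix86_issue_rate ();
}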

And the MD only allows us to prefer scheduling of successive IMUL
instructions, i.e. if an IMUL was just scheduled and the ready list contains
another IMUL instruction, then it will be chosen as the best candidate for
scheduling.


 at least from my very limited guessing of what the above does.  So, did you
 analyze why the scheduler does not properly schedule things for you?

 Richard.


 From reading the patch, I could not understand the link between pipeline
 behavior and what the patch appears to do.

 Hope that helps.

 Alexander


[PATCH] Atom: Scheduler improvements for better imul placement

2012-04-11 Thread Igor Zamyatin
Hi All!

It is known that imul placement is rather critical for Atom processors,
and these changes try to improve imul scheduling for Atom.

This gives +5% performance on several tests from the new OA 2.0 testsuite
from EEMBC.

Tested for i386 and x86-64, ok for trunk?


ChangeLog:

2012-04-10  Yuri Rumyantsev  ysrum...@gmail.com

* config/i386/i386.c (x86_sched_reorder): New function.
Add new function x86_sched_reorder
to take advantage of Atom pipelined IMUL execution.


imul_sched.patch
Description: Binary data


[PATCH] Atom: Enabling unroll at O2 optimization level

2012-04-10 Thread Igor Zamyatin
Hi All!

Here is a patch that enables unroll at O2 for Atom.

This gives a good performance boost on EEMBC 2.0 (~+8% in geomean for 32
bits) with a quite moderate code size increase (~5% for EEMBC 2.0, 32
bits).

Tested for i386 and x86-64, ok for trunk?

Thanks,
Igor

ChangeLog:

2012-04-10  Yakovlev Vladimir  vladimir.b.yakov...@intel.com

       * gcc/config/i386/i386.c (check_imull): New routine.
       (ix86_loop_unroll_adjust): New target hook.
       (ix86_option_override_internal): Enable unrolling on Atom at -O2.
       (TARGET_LOOP_UNROLL_ADJUST): New define.
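
For reference, the general shape of such a hook (a hypothetical sketch, not
the attached patch; presumably the real ix86_loop_unroll_adjust uses
check_imull and may return a different factor, which is exactly the tuning
policy being discussed):

static unsigned
ix86_loop_unroll_adjust_sketch (unsigned nunroll,
                                struct loop *loop ATTRIBUTE_UNUSED)
{
  /* The hook returns the number of times the middle end should unroll
     LOOP; returning NUNROLL unchanged keeps the default choice.  */
  return nunroll;
}

#undef TARGET_LOOP_UNROLL_ADJUST
#define TARGET_LOOP_UNROLL_ADJUST ix86_loop_unroll_adjust_sketch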


unroll.patch
Description: Binary data


Re: FW: patch to fix PR21617

2012-01-23 Thread Igor Zamyatin
Unfortunately the patch helps neither for separate EEMBC_2_0 tests
nor for the whole benchmark.

Do you want me to do some debugging here?

On Fri, Jan 20, 2012 at 10:13 PM, Vladimir Makarov vmaka...@redhat.com wrote:
 On 01/19/2012 03:52 PM, Vladimir Makarov wrote:

 On 01/18/2012 02:30 PM, Zamyatin, Igor wrote:

 Yes, we use Atom for EEMBC measurements.

 We'll be glad to help you with your findings.


 Thanks.

 Unfortunately I tried several alternative patches but I did not find a
 better solution (it is mostly code size degradation on CoreI7).  Now I am
 even thinking that the best action would have been ignoring the original PR.

 Could you try the attached patch.  It might work.

 I've tried several approaches to prevent small hole creation in
 ira-color.c::assign_hard_reg but it does not work well.

 Thanks.




Re: [PATCH] PR testsuite/51097 fix: a lot of FAIL: gcc.dg/vect on i686 avx build 181167 to 181177

2011-12-29 Thread Igor Zamyatin
Because it includes AVX and AVX2, which deal with int, and for AVX2
there are no problems with doubled diagnostics.

I understand that all this looks quite bulky, but it's hard to create
something which looks better without losing generality.


On Thu, Dec 29, 2011 at 4:15 PM, Ira Rosen i...@il.ibm.com wrote:


 Igor Zamyatin izamya...@gmail.com wrote on 29/12/2011 02:04:45 PM:

 When the compiler is configured with, say, corei7-avx, it outputs twice as
 many diagnostics on integer tests, since AVX deals mostly with floats.  I.e.
 the compiler first tries to vectorize for the AVX vector size, then fails and
 vectorizes for the smaller vector size.  This double work leads to double
 diagnostic output.

 OK, so why not use vect_sizes_32B_16B?

 Ira




Re: [PATCH] PR testsuite/51097 fix: a lot of FAIL: gcc.dg/vect on i686 avx build 181167 to 181177

2011-12-29 Thread Igor Zamyatin
Thanks, I'll look into your remarks

On Thu, Dec 29, 2011 at 5:33 PM, Ira Rosen i...@il.ibm.com wrote:


 Igor Zamyatin izamya...@gmail.com wrote on 29/12/2011 02:29:46 PM:


 Because it includes AVX and AVX2, which deal with int, and for AVX2
 there are no problems with doubled diagnostics.

 And you can't just update vect_int because AVX does support it but with
 128-bit vectors, right?
 So, your vect_float_no_int looks incorrect as well.
 You need to describe the case when two vector sizes are analyzed, but the
 first one always fails. Maybe vect_sizes_32B_16B_noint? Probably ugly, but
 correct at least.

 I also suggest simplifying the checks and not checking the number of times
 a pattern was detected, like this:

 Index: vect-widen-mult-half.c
 ===
 --- vect-widen-mult-half.c      (revision 182703)
 +++ vect-widen-mult-half.c      (working copy)
 @@ -43,7 +43,7 @@ int main (void)
  }

  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_widen_mult_hi_to_si } } } */
 -/* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern: detected" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
 +/* { dg-final { scan-tree-dump "vect_recog_widen_mult_pattern: detected" "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
  /* { dg-final { scan-tree-dump-times "pattern recognized" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */
  /* { dg-final { cleanup-tree-dump "vect" } } */


 And split the tests with several loops into several files, like, for
 example, vect-widen-mult-const-s16.c.

 The simplification is ok for slp-reduc-6.c, vect-119.c, vect-35-big-array.c
 as well.

 For the rest of the tests, I don't understand why there are alignment
 messages printed twice for two vector sizes. Why doesn't the vectorizer
 fail during vect_determine_vectorization_factor?

 Ira


 I understand that all this looks quite bulky, but it's hard to create
 something which looks better without losing generality.


 On Thu, Dec 29, 2011 at 4:15 PM, Ira Rosen i...@il.ibm.com wrote:
 
 
  Igor Zamyatin izamya...@gmail.com wrote on 29/12/2011 02:04:45 PM:
 
  When the compiler is configured with, say, corei7-avx, it outputs twice as
  many diagnostics on integer tests, since AVX deals mostly with floats.  I.e.
  the compiler first tries to vectorize for the AVX vector size, then fails
  and vectorizes for the smaller vector size.  This double work leads to
  double diagnostic output.
 
  OK, so why not use vect_sizes_32B_16B?
 
  Ira
 
 




Re: FW: patch to fix PR21617

2011-12-29 Thread Igor Zamyatin
Ilya is on vacation, so I'll answer.

The overall score became worse by 0.3%.

 -Original Message-
 From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On 
 Behalf Of Vladimir Makarov
 Sent: Thursday, December 22, 2011 8:15 PM
 To: Ilya Enkovich
 Cc: gcc-patches
 Subject: Re: patch to fix PR21617

 On 12/22/2011 06:19 AM, Ilya Enkovich wrote:
 2011/12/13 Vladimir Makarov vmaka...@redhat.com:
 The following patch solves PR 21617 which is described on
 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21617.

 Just adding the number of necessary hard registers solves the problem but
 creates a 2% SPEC2000 perlbmk degradation on x86.  Fortunately, removing
 the allocno class comparison removes the degradation.  The allocno class
 comparison was originally added for debugging purposes to put coloring of
 allocnos of the same class in one place.  It was ok when we had cover
 classes which did not intersect.  Now allocno classes intersect and this
 comparison should not be used (at least not as the highest-priority
 heuristic).

 Besides solving the problem, the patch slightly improves SPEC2000
 performance and code size on x86.

 The patch was successfully bootstrapped on x86/x86-64.  I expect it should
 be pretty safe for other targets because it changes heuristics only.

 Committed as rev. 182263.


 2011-12-12  Vladimir Makarovvmaka...@redhat.com

         PR rtl-optimization/21617
         * ira-color.c (bucket_allocno_compare_func): Don't compare
         allocno classes.  Compare number of hard registers needed.

 Hello, Vladimir,

 I think there may be a typo in this patch.  I saw a few degradations in
 EEMBC 2.0 benchmarks caused by this patch.  The following fix removes these
 degradations:

 -  if ((diff = (ira_reg_class_max_nregs[cl1][ALLOCNO_MODE (a1)]
 -              - ira_reg_class_max_nregs[cl2][ALLOCNO_MODE (a2)])) != 0)
 +  if ((diff = (ira_reg_class_max_nregs[cl2][ALLOCNO_MODE (a2)]
 +              - ira_reg_class_max_nregs[cl1][ALLOCNO_MODE (a1)])) != 0)

 I suspect a typo because the previous calculation of the 'diff' value used
 the f(a2) - f(a1) formula and now it is f(a1) - f(a2).  Could you please look
 at it?

 I don't think it was a typo.  The patch puts multi-register pseudos at
 the end of the coloring bucket, so they will be pushed onto the coloring
 stack last and popped from the stack first, and thus get a better chance to
 be assigned to hard registers, because there is a smaller chance of small
 holes of free hard registers being created.
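
 As a self-contained toy illustration of that ordering (not the ira-color.c
 code; the struct and the qsort driver are made up):

 #include <stdlib.h>

 struct toy_allocno
 {
   int id;
   int nregs;  /* hard registers this pseudo needs */
 };

 /* Single-register pseudos first, so multi-register pseudos end up at the
    end of the bucket, are pushed onto the coloring stack last, popped
    first, and pick their hard registers before small holes can form.  */
 static int
 bucket_order (const void *p1, const void *p2)
 {
   const struct toy_allocno *a1 = p1, *a2 = p2;
   if (a1->nregs != a2->nregs)
     return a1->nregs - a2->nregs;
   return a1->id - a2->id;  /* stable tie-break */
 }

 /* Usage: qsort (bucket, n, sizeof (struct toy_allocno), bucket_order);  */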

 Generally speaking, RA is all about heuristics and it is always possible
 to find a test where one heuristic will work better than another one and
 vice versa.  So for me, there is little sense in working on such PRs, of
 course, unless it results in a discovery of better heuristics which
 improves the overall score on a credible test set like SPEC CPU, for example.

 I work on these PRs mostly to help next gcc release to come out.

 Returning to your complaint, could you give me info on how the overall
 EEMBC score changed after committing the patch, rather than just saying that
 there are a few degradations?


Re: [PATCH] PR testsuite/51097 fix: a lot of FAIL: gcc.dg/vect on i686 avx build 181167 to 181177

2011-12-29 Thread Igor Zamyatin
When the compiler is configured with, say, corei7-avx, it outputs twice as
many diagnostics on integer tests, since AVX deals mostly with floats.  I.e.
the compiler first tries to vectorize for the AVX vector size, then fails and
vectorizes for the smaller vector size.  This double work leads to double
diagnostic output.

On Thu, Dec 29, 2011 at 12:07 PM, Ira Rosen i...@il.ibm.com wrote:


 gcc-patches-ow...@gcc.gnu.org wrote on 28/12/2011 11:05:19 PM:

 Hi,

 Hi Igor,


 Here is another patch about failures in the gcc.dg/vect tests.  These
 changes fix FAILs that could be seen with AVX-built compilers.  The patch
 also introduces no FAILs/XFAILs/XPASSes/ERRORs on regular i686, x86_64,
 avx2_32, avx2_64.
 Is it ok for the trunk?


 diff --git a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-sum.c
 b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-sum.c
 index 2898918..1d190fc 100644
 --- a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-sum.c
 +++ b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-sum.c
 @@ -43,5 +43,6 @@ int main (void)


  /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_widen_mult_hi_to_si } } } */
 -/* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern: detected" 1 "vect" } } */
 +/* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern: detected" 1 "vect" { target {! vect_float_no_int } } } } */
 +/* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern: detected" 2 "vect" { target vect_float_no_int } } } */
  /* { dg-final { cleanup-tree-dump "vect" } } */

 Could you please explain what you are trying to do?  How is
 +# Return 1 if the target supports hardware vectors of float and doesn't support
 +# vectors of int, 0 otherwise.
 related to the number of times that pattern is detected?

 Thanks,
 Ira



 Thanks,
 Igor

 2011-12-28  Igor Zamyatin  igor.zamya...@intel.com

        PR testsuite/51097
        * lib/target-supports.exp
 (check_effective_target_vect_float_no_int):
        New function.
        (check_avx2_available): Ditto.
        * gcc.dg/vect/no-scevccp-outer-7.c: Adjust dg-scans for AVX-built
        compiler.
        * gcc.dg/vect/no-scevccp-vect-iv-3.c: Likewise.
        * gcc.dg/vect/no-vfa-vect-depend-1.c: Likewise.
        * gcc.dg/vect/no-vfa-vect-dv-2.c: Likewise.
        * gcc.dg/vect/slp-perm-9.c: Likewise.
        * gcc.dg/vect/slp-reduc-6.c: Likewise.
        * gcc.dg/vect/slp-widen-mult-half.c: Likewise.
        * gcc.dg/vect/vect-109.c: Likewise.
        * gcc.dg/vect/vect-119.c: Likewise.
        * gcc.dg/vect/vect-35-big-array.c: Likewise.
        * gcc.dg/vect/vect-91.c: Likewise.
        * gcc.dg/vect/vect-multitypes-4.c: Likewise.
        * gcc.dg/vect/vect-multitypes-6.c: Likewise.
        * gcc.dg/vect/vect-outer-4c-big-array.c: Likewise.
        * gcc.dg/vect/vect-over-widen-1.c: Likewise.
        * gcc.dg/vect/vect-over-widen-4.c: Likewise.
        * gcc.dg/vect/vect-peel-1.c: Likewise.
        * gcc.dg/vect/vect-peel-3.c: Likewise.
        * gcc.dg/vect/vect-peel-4.c: Likewise.
        * gcc.dg/vect/vect-reduc-dot-s16a.c: Likewise.
        * gcc.dg/vect/vect-reduc-dot-s8a.c: Likewise.
        * gcc.dg/vect/vect-reduc-dot-u8a.c: Likewise.
        * gcc.dg/vect/vect-reduc-dot-u8b.c: Likewise.
        * gcc.dg/vect/vect-reduc-pattern-1a.c: Likewise.
        * gcc.dg/vect/vect-reduc-pattern-1b-big-array.c: Likewise.
        * gcc.dg/vect/vect-reduc-pattern-1c-big-array.c: Likewise.
        * gcc.dg/vect/vect-reduc-pattern-2a.c: Likewise.
        * gcc.dg/vect/vect-reduc-pattern-2b-big-array.c: Likewise.
        * gcc.dg/vect/vect-widen-mult-const-s16.c: Likewise.
        * gcc.dg/vect/vect-widen-mult-const-u16.c: Likewise.
        * gcc.dg/vect/vect-widen-mult-half-u8.c: Likewise.
        * gcc.dg/vect/vect-widen-mult-half.c: Likewise.
        * gcc.dg/vect/vect-widen-mult-sum.c: Likewise.
        * gcc.dg/vect/vect-widen-mult-u16.c: Likewise.
        * gcc.dg/vect/wrapv-vect-reduc-dot-s8b.c: Likewise.
 [attachment 51097.patch deleted by Ira Rosen/Haifa/IBM]



[PATCH] PR testsuite/51097 fix: a lot of FAIL: gcc.dg/vect on i686 avx build 181167 to 181177

2011-12-28 Thread Igor Zamyatin
Hi,

Here is another patch about failures in the gcc.dg/vect tests.  These
changes fix FAILs that could be seen with AVX-built compilers.  The patch
also introduces no FAILs/XFAILs/XPASSes/ERRORs on regular i686, x86_64,
avx2_32, avx2_64.
Is it ok for the trunk?

Thanks,
Igor

2011-12-28  Igor Zamyatin  igor.zamya...@intel.com

       PR testsuite/51097
       * lib/target-supports.exp (check_effective_target_vect_float_no_int):
       New function.
       (check_avx2_available): Ditto.
       * gcc.dg/vect/no-scevccp-outer-7.c: Adjust dg-scans for AVX-built
       compiler.
       * gcc.dg/vect/no-scevccp-vect-iv-3.c: Likewise.
       * gcc.dg/vect/no-vfa-vect-depend-1.c: Likewise.
       * gcc.dg/vect/no-vfa-vect-dv-2.c: Likewise.
       * gcc.dg/vect/slp-perm-9.c: Likewise.
       * gcc.dg/vect/slp-reduc-6.c: Likewise.
       * gcc.dg/vect/slp-widen-mult-half.c: Likewise.
       * gcc.dg/vect/vect-109.c: Likewise.
       * gcc.dg/vect/vect-119.c: Likewise.
       * gcc.dg/vect/vect-35-big-array.c: Likewise.
       * gcc.dg/vect/vect-91.c: Likewise.
       * gcc.dg/vect/vect-multitypes-4.c: Likewise.
       * gcc.dg/vect/vect-multitypes-6.c: Likewise.
       * gcc.dg/vect/vect-outer-4c-big-array.c: Likewise.
       * gcc.dg/vect/vect-over-widen-1.c: Likewise.
       * gcc.dg/vect/vect-over-widen-4.c: Likewise.
       * gcc.dg/vect/vect-peel-1.c: Likewise.
       * gcc.dg/vect/vect-peel-3.c: Likewise.
       * gcc.dg/vect/vect-peel-4.c: Likewise.
       * gcc.dg/vect/vect-reduc-dot-s16a.c: Likewise.
       * gcc.dg/vect/vect-reduc-dot-s8a.c: Likewise.
       * gcc.dg/vect/vect-reduc-dot-u8a.c: Likewise.
       * gcc.dg/vect/vect-reduc-dot-u8b.c: Likewise.
       * gcc.dg/vect/vect-reduc-pattern-1a.c: Likewise.
       * gcc.dg/vect/vect-reduc-pattern-1b-big-array.c: Likewise.
       * gcc.dg/vect/vect-reduc-pattern-1c-big-array.c: Likewise.
       * gcc.dg/vect/vect-reduc-pattern-2a.c: Likewise.
       * gcc.dg/vect/vect-reduc-pattern-2b-big-array.c: Likewise.
       * gcc.dg/vect/vect-widen-mult-const-s16.c: Likewise.
       * gcc.dg/vect/vect-widen-mult-const-u16.c: Likewise.
       * gcc.dg/vect/vect-widen-mult-half-u8.c: Likewise.
       * gcc.dg/vect/vect-widen-mult-half.c: Likewise.
       * gcc.dg/vect/vect-widen-mult-sum.c: Likewise.
       * gcc.dg/vect/vect-widen-mult-u16.c: Likewise.
       * gcc.dg/vect/wrapv-vect-reduc-dot-s8b.c: Likewise.


51097.patch
Description: Binary data