Re: [PATCH] DATA_ALIGNMENT vs. DATA_ABI_ALIGNMENT (PR target/56564)
Right, as you did for other cases. It works here as well. Thanks, Igor On Wed, Jun 19, 2013 at 11:05 AM, Jakub Jelinek wrote: > On Wed, Jun 19, 2013 at 11:01:59AM +0400, Igor Zamyatin wrote: >> The change also affects vectorizer in avx case which could be seen for >> gcc.dg/tree-ssa/loop-19.c test. >> >> After the change report says >> >> loop-19_bad.c:16: note: === vect_analyze_data_refs_alignment === >> loop-19_bad.c:16: note: vect_compute_data_ref_alignment: >> loop-19_bad.c:16: note: can't force alignment of ref: a[j_9] >> loop-19_bad.c:16: note: vect_compute_data_ref_alignment: >> loop-19_bad.c:16: note: can't force alignment of ref: c[j_9] >> >> AFAICS first condition in ix86_data_alignment was true before the >> change so 256 was a return value. >> >> Do we need to tweak this test also? > > I'd add -fno-common to the test. > > Jakub
Re: FW: [PATCH] DATA_ALIGNMENT vs. DATA_ABI_ALIGNMENT (PR target/56564)
The change also affects vectorizer in avx case which could be seen for gcc.dg/tree-ssa/loop-19.c test. After the change report says loop-19_bad.c:16: note: === vect_analyze_data_refs_alignment === loop-19_bad.c:16: note: vect_compute_data_ref_alignment: loop-19_bad.c:16: note: can't force alignment of ref: a[j_9] loop-19_bad.c:16: note: vect_compute_data_ref_alignment: loop-19_bad.c:16: note: can't force alignment of ref: c[j_9] AFAICS first condition in ix86_data_alignment was true before the change so 256 was a return value. Do we need to tweak this test also? Thanks, Igor > Hi! > > This PR is about DATA_ALIGNMENT macro increasing alignment of some decls for > optimization purposes beyond ABI mandated levels. It is fine to emit the > vars aligned as much as we want for optimization purposes, but if we can't be > sure that references to that decl bind to the definition we increased the > alignment on (e.g. common variables, or -fpic code without hidden visibility, > weak vars etc.), we can't assume that alignment. > As DECL_ALIGN is used for both the alignment emitted for the definitions and > alignment assumed on code referring to it, this patch increases DECL_ALIGN > only on decls where decl_binds_to_current_def_p is true, and otherwise the > optimization part on top of that emits only when aligning definition. > On x86_64, DATA_ALIGNMENT macro was partly an optimization, partly ABI > mandated alignment increase, so I've introduced a new macro, > DATA_ABI_ALIGNMENT, which is the ABI mandated increase only (on x86-64 I > think the only one is that arrays with size 16 bytes or more (and VLAs, but > that is not handled by DATA*ALIGNMENT) are at least 16 byte aligned). > > Bootstrapped/regtested on x86_64-linux and i686-linux. No idea about other > targets, I've kept them all using DATA_ALIGNMENT, which is considered > optimization increase only now, if there is some ABI mandated alignment > increase on other targets, that should be done in DATA_ABI_ALIGNMENT as well > as DATA_ALIGNMENT. The patch causes some vectorization regressions (tweaked > in the testsuite), especially for common vars where we used to align say > common arrays to 256 bits rather than the ABI mandated 128 bits, or for -fpic > code, but I'm afraid we need to live with that, if you compile another file > with say icc or some other compiler which doesn't increase alignment beyond > ABI mandated level and that other file defines the var say as non-common, we > have wrong-code. > > 2013-06-07 Jakub Jelinek > > PR target/56564 > * varasm.c (align_variable): Don't use DATA_ALIGNMENT or > CONSTANT_ALIGNMENT if !decl_binds_to_current_def_p (decl). > Use DATA_ABI_ALIGNMENT for that case instead if defined. > (get_variable_align): New function. > (get_variable_section, emit_bss, emit_common, > assemble_variable_contents, place_block_symbol): Use > get_variable_align instead of DECL_ALIGN. > (assemble_noswitch_variable): Add align argument, use it > instead of DECL_ALIGN. > (assemble_variable): Adjust caller. Use get_variable_align > instead of DECL_ALIGN. > * config/i386/i386.h (DATA_ALIGNMENT): Adjust x86_data_alignment > caller. > (DATA_ABI_ALIGNMENT): Define. > * config/i386/i386-protos.h (x86_data_alignment): Adjust prototype. > * config/i386/i386.c (x86_data_alignment): Add opt argument. If > opt is false, only return the psABI mandated alignment increase. > * doc/tm.texi.in (DATA_ABI_ALIGNMENT): Document. > * doc/tm.texi: Regenerated. > > * gcc.target/i386/pr56564-1.c: New test. > * gcc.target/i386/pr56564-2.c: New test. > * gcc.target/i386/pr56564-3.c: New test. > * gcc.target/i386/pr56564-4.c: New test. > * gcc.target/i386/avx256-unaligned-load-4.c: Add -fno-common. > * gcc.target/i386/avx256-unaligned-store-1.c: Likewise. > * gcc.target/i386/avx256-unaligned-store-3.c: Likewise. > * gcc.target/i386/avx256-unaligned-store-4.c: Likewise. > * gcc.target/i386/vect-sizes-1.c: Likewise. > * gcc.target/i386/memcpy-1.c: Likewise. > * gcc.dg/vect/costmodel/i386/costmodel-vect-31.c (tmp): Initialize. > * gcc.dg/vect/costmodel/x86_64/costmodel-vect-31.c (tmp): Likewise. >
Re: [Patch wwwdocs] gcc-4.9 changes: mention support of the Intel Silvermont microarchitecture
Ping? On Mon, Jun 10, 2013 at 2:13 PM, Igor Zamyatin wrote: > Hi! > > This patch mentions support of Silvermont architecture in the > gcc-4.9/changes.html page. > > OK to install? > > Thanks, > Igor > > Index: htdocs/gcc-4.9/changes.html > === > RCS file: /cvs/gcc/wwwdocs/htdocs/gcc-4.9/changes.html,v > retrieving revision 1.17 > diff -c -1 -r1.17 changes.html > *** htdocs/gcc-4.9/changes.html 6 Jun 2013 11:15:24 - 1.17 > --- htdocs/gcc-4.9/changes.html 10 Jun 2013 10:11:24 - > *** > *** 177,178 > --- 177,185 > > + IA-32/x86-64 > + > + GCC now supports new Intel microarchitecture named Silvermont > + through -march=slm. > + > + > + > RX
Re: Document Intel Silvermont support in invoke.texi
Please see updated patch. Is it ok to install? Thanks, Igor Changelog: 2013-06-11 Igor Zamyatin * doc/invoke.texi (core-avx2): Document. (slm): Likewise. (atom): Updated with MOVBE. diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index b7b32f7..dd82880 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -13833,10 +13833,19 @@ Intel Core CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AES, PCLMUL, FSGSBASE, RDRND and F16C instruction set support. +@item core-avx2 +Intel Core CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, SSSE3, +SSE4.1, SSE4.2, AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA, BMI, BMI2 +and F16C instruction set support. + @item atom -Intel Atom CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3 and SSSE3 +Intel Atom CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3 and SSSE3 instruction set support. +@item slm +Intel Silvermont CPU with 64-bit extensions, MOVBE, MMX, SSE, SSE2, SSE3, SSSE3, +SSE4.1 and SSE4.2 instruction set support. + @item k6 AMD K6 CPU with MMX instruction set support. On Mon, Jun 10, 2013 at 5:29 PM, Jakub Jelinek wrote: > On Mon, Jun 10, 2013 at 05:25:36PM +0400, Igor Zamyatin wrote: >> Following patch documents Intel Silvermont support. >> >> OK to install? >> >> 2013-06-10 Igor Zamyatin >> >> * doc/invoke.texi: Document slm. >> >> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi >> index b7b32f7..e4f1d45 100644 >> --- a/gcc/doc/invoke.texi >> +++ b/gcc/doc/invoke.texi >> @@ -13837,6 +13837,10 @@ set support. >> Intel Atom CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3 and SSSE3 >> instruction set support. >> >> +@item slm >> +Intel Silvermont CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3 and SSSE3, >> +SSE4.1, SSE4.2 instruction set support. > > Shouldn't it also list MOVBE (similarly for atom and core-avx2, btver2 > already lists it). > > core-avx2 isn't even documented at all in invoke.texi. > > Jakub
Document Intel Silvermont support in invoke.texi
Hi! Following patch documents Intel Silvermont support. OK to install? Thanks, Igor Changelog: 2013-06-10 Igor Zamyatin * doc/invoke.texi: Document slm. diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index b7b32f7..e4f1d45 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -13837,6 +13837,10 @@ set support. Intel Atom CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3 and SSSE3 instruction set support. +@item slm +Intel Silvermont CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3 and SSSE3, +SSE4.1, SSE4.2 instruction set support. + @item k6 AMD K6 CPU with MMX instruction set support.
[Patch wwwdocs] gcc-4.9 changes: mention support of the Intel Silvermont microarchitecture
Hi! This patch mentions support of Silvermont architecture in the gcc-4.9/changes.html page. OK to install? Thanks, Igor Index: htdocs/gcc-4.9/changes.html === RCS file: /cvs/gcc/wwwdocs/htdocs/gcc-4.9/changes.html,v retrieving revision 1.17 diff -c -1 -r1.17 changes.html *** htdocs/gcc-4.9/changes.html 6 Jun 2013 11:15:24 - 1.17 --- htdocs/gcc-4.9/changes.html 10 Jun 2013 10:11:24 - *** *** 177,178 --- 177,185 + IA-32/x86-64 + + GCC now supports new Intel microarchitecture named Silvermont + through -march=slm. + + + RX
Re: [x86, PATCH 2/2] Enabling of the new Intel microarchitecture Silvermont
Like this? On Fri, May 31, 2013 at 3:45 PM, Uros Bizjak wrote: > On Fri, May 31, 2013 at 1:38 PM, Igor Zamyatin wrote: >> We do want to use the same register for float_extend. > > OK then. Please add a comment for this fact and also, please put > single-line preparation statements inside double-quotes instead of > curved braces. > > Uros. > >> On Thu, May 30, 2013 at 9:22 PM, Uros Bizjak wrote: >>> On Thu, May 30, 2013 at 4:25 PM, Yuri Rumyantsev wrote: >>>> Hi All >>>> >>>> Second patch enables several Silvermont uarch features which improve >>>> performance of the new processor (based on experiments on real SLM >>>> hardware): >>>> 1. If using a 2-source or 3-source LEA for non-destructive destination >>>> purposes, or due to wanting ability to use SCALE, the use of LEA is >>>> preferable. >>>> 2. Transformation of FP conversion for memory operands into conversion >>>> from register. >>>> 3. Couple of improvements for post-reload scheduling: >>>> - increase latency of integer loads and load/store with exact >>>> dependence; >>>> - simple re-ordering of the top of ready list - if 2 instructions >>>> at the top of the list have the same priority we consider instruction >>>> which producer(s) were scheduled earlier as the best candidate. >>>> >>>> Bootstrapped and tested on x86_64-unknown-linux-gnu, ok for trunk? >>>> >>>> 2013-05-30 Yuri Rumyantsev >>>> Igor Zamyatin >>>> >>>> Silvermont (SLM) architecture performance tuning. >>>> * config/i386/i386.h (enum ix86_tune_indices): Add >>>> X86_TUNE_SPLIT_MEM_OPND_FOR_FP_CONVERTS. >>>> (TARGET_SPLIT_MEM_OPND_FOR_FP_CONVERTS): New define. >>>> >>>> * config/i386/i386.c (initial_ix86_tune_features) >>>> : Initialize. >>>> (ix86_lea_outperforms): Handle Silvermont tuning. >>>> (ix86_avoid_lea_for_add): Add new argument to ix86_lea_outperforms >>>> call. >>>> (ix86_use_lea_for_mov): Likewise. >>>> (ix86_avoid_lea_for_addr): Likewise. >>>> (ix86_lea_for_add_ok): Likewise. >>>> (exact_dependency_1): New function. >>>> (exact_store_load_dependency): Likewise. >>>> (ix86_adjust_cost): Handle Silvermont tuning. >>>> (do_reoder_for_imul): Likewise. >>>> (swap_top_of_ready_list): New function. >>>> (ix86_sched_reorder): Changed to handle Silvermont tuning. >>>> >>>> * config/i386/i386.md (peepholes that split memory operand in fp >>>> converts): New >>> >>> @@ -24625,9 +24730,9 @@ ix86_sched_reorder(FILE *dump, int >>> sched_verbose, rtx *ready, int *pn_ready, >>> >>> - con = DEP_CON (dep); >>> - if (!NONDEBUG_INSN_P (con)) >>> -continue; >>> + con = DEP_CON (dep); >>> + if (!NONDEBUG_INSN_P (con)) >>> +continue; >>> >>> There are some unnecessary whitespace changes (tabs->spaces) in a >>> couple of places throughout the patch, such as in the above lines. >>> >>> +(define_peephole2 >>> + [(set (match_operand:DF 0 "register_operand") >>> +(float_extend:DF >>> + (match_operand:SF 1 "memory_operand")))] >>> + "TARGET_SPLIT_MEM_OPND_FOR_FP_CONVERTS >>> + && optimize_insn_for_speed_p () >>> + && SSE_REG_P (operands[0])" >>> + [(set (match_dup 2) (match_dup 1)) >>> + (set (match_dup 0) (float_extend:DF (match_dup 2)))] >>> +{ >>> + operands[2] = gen_rtx_REG (SFmode, REGNO (operands[0])); >>> +}) >>> >>> You should use >>> >>> (match_scratch:SF 2 "x") >>> >>> at the top of the peephole2 pattern, and you will get a free scratch >>> register (assuming that it is not necessary to use the same register >>> for input and output operand of the float_extend insn). >>> >>> Otherwise, the patch looks OK to me. >>> >>> Uros. gcc_slm_patch2_3_review.patch Description: Binary data
Re: [x86, PATCH 2/2] Enabling of the new Intel microarchitecture Silvermont
We do want to use the same register for float_extend. On Thu, May 30, 2013 at 9:22 PM, Uros Bizjak wrote: > On Thu, May 30, 2013 at 4:25 PM, Yuri Rumyantsev wrote: >> Hi All >> >> Second patch enables several Silvermont uarch features which improve >> performance of the new processor (based on experiments on real SLM >> hardware): >> 1. If using a 2-source or 3-source LEA for non-destructive destination >> purposes, or due to wanting ability to use SCALE, the use of LEA is >> preferable. >> 2. Transformation of FP conversion for memory operands into conversion >> from register. >> 3. Couple of improvements for post-reload scheduling: >> - increase latency of integer loads and load/store with exact dependence; >> - simple re-ordering of the top of ready list - if 2 instructions >> at the top of the list have the same priority we consider instruction >> which producer(s) were scheduled earlier as the best candidate. >> >> Bootstrapped and tested on x86_64-unknown-linux-gnu, ok for trunk? >> >> 2013-05-30 Yuri Rumyantsev >> Igor Zamyatin >> >> Silvermont (SLM) architecture performance tuning. >> * config/i386/i386.h (enum ix86_tune_indices): Add >> X86_TUNE_SPLIT_MEM_OPND_FOR_FP_CONVERTS. >> (TARGET_SPLIT_MEM_OPND_FOR_FP_CONVERTS): New define. >> >> * config/i386/i386.c (initial_ix86_tune_features) >> : Initialize. >> (ix86_lea_outperforms): Handle Silvermont tuning. >> (ix86_avoid_lea_for_add): Add new argument to ix86_lea_outperforms >> call. >> (ix86_use_lea_for_mov): Likewise. >> (ix86_avoid_lea_for_addr): Likewise. >> (ix86_lea_for_add_ok): Likewise. >> (exact_dependency_1): New function. >> (exact_store_load_dependency): Likewise. >> (ix86_adjust_cost): Handle Silvermont tuning. >> (do_reoder_for_imul): Likewise. >> (swap_top_of_ready_list): New function. >> (ix86_sched_reorder): Changed to handle Silvermont tuning. >> >> * config/i386/i386.md (peepholes that split memory operand in fp >> converts): New > > @@ -24625,9 +24730,9 @@ ix86_sched_reorder(FILE *dump, int > sched_verbose, rtx *ready, int *pn_ready, > > - con = DEP_CON (dep); > - if (!NONDEBUG_INSN_P (con)) > -continue; > + con = DEP_CON (dep); > + if (!NONDEBUG_INSN_P (con)) > +continue; > > There are some unnecessary whitespace changes (tabs->spaces) in a > couple of places throughout the patch, such as in the above lines. > > +(define_peephole2 > + [(set (match_operand:DF 0 "register_operand") > +(float_extend:DF > + (match_operand:SF 1 "memory_operand")))] > + "TARGET_SPLIT_MEM_OPND_FOR_FP_CONVERTS > + && optimize_insn_for_speed_p () > + && SSE_REG_P (operands[0])" > + [(set (match_dup 2) (match_dup 1)) > + (set (match_dup 0) (float_extend:DF (match_dup 2)))] > +{ > + operands[2] = gen_rtx_REG (SFmode, REGNO (operands[0])); > +}) > > You should use > > (match_scratch:SF 2 "x") > > at the top of the peephole2 pattern, and you will get a free scratch > register (assuming that it is not necessary to use the same register > for input and output operand of the float_extend insn). > > Otherwise, the patch looks OK to me. > > Uros.
Re: [patch] RFC: ix86 / x86_64 register pressure aware scheduling
These changes are what we used to try here at Intel after bunch of changes which made pre-alloc scheduler more stable. We benchmarked both register pressure algorithms and overall result was not that promising. We saw number of regressions e.g. for optset "-mavx -O3 -funroll-loops -ffast-math -march=corei7" (for spec2000 not only lucas but also applu regressed). And overall gain is negative even for x86_64. For 32 bits picture was worse if I remember correctly. In common we have doubts that this feature is good for OOO machine Thanks, Igor -Original Message- From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Steven Bosscher Sent: Monday, April 15, 2013 11:34 PM To: GCC Patches Cc: H.J. Lu; Uros Bizjak; Jan Hubicha Subject: [patch] RFC: ix86 / x86_64 register pressure aware scheduling Hello, The attached patch enables register pressure aware scheduling for the ix86 and x86_64 targets. It uses the optimistic algorithm to avoid being overly conservative. This is the same as what other CISCy targets, like s390, also do. The motivation for this patch is the excessive spilling I've observed in a few test cases with relatively large basic blocks, e.g. encryption algorithms and codecs. The patch passes bootstrap+testing on x86_64-unknown-linux-gnu and i686-unknown-linux-gnu, with a few new failures due to PR56950. Off-list, Uros, Honza and others have already looked at the patch and benchmarked it. For x86_64 there is an overall improvement for SPEC2k except that lucas regresses, but such a preliminary result is IMHO very promising. Comments/suggestions welcome :-) Ciao! Steven * common/config/i386/i386-common.c (ix86_option_optimization_table): Do not disable insns scheduling. Enable register pressure aware scheduling. * config/i386/i386.c (ix86_option_override): Use the alternative, optimistic scheduling-pressure algorithm by default. Index: common/config/i386/i386-common.c === --- common/config/i386/i386-common.c(revision 197941) +++ common/config/i386/i386-common.c(working copy) @@ -707,9 +707,15 @@ static const struct default_options ix86 { /* Enable redundant extension instructions removal at -O2 and higher. */ { OPT_LEVELS_2_PLUS, OPT_free, NULL, 1 }, -/* Turn off -fschedule-insns by default. It tends to make the - problem with not enough registers even worse. */ -{ OPT_LEVELS_ALL, OPT_fschedule_insns, NULL, 0 }, +/* Enable -fsched-pressure by default for all optimization levels. + Before SCHED_PRESSURE_MODEL register-pressure aware schedule was + available, -fschedule-insns was turned off completely by default for + this port, because scheduling before register allocation tends to + make the problem with not enough registers even worse. However, + for very long basic blocks the scheduler can help bring register + pressure down significantly, and SCHED_PRESSURE_MODEL is still + conservative enough to avoid creating excessive register pressure. */ +{ OPT_LEVELS_ALL, OPT_fsched_pressure, NULL, 1 }, #ifdef SUBTARGET_OPTIMIZATION_OPTIONS SUBTARGET_OPTIMIZATION_OPTIONS, Index: config/i386/i386.c === --- config/i386/i386.c (revision 197941) +++ config/i386/i386.c (working copy) @@ -3936,6 +3936,10 @@ ix86_option_override (void) ix86_option_override_internal (true); + /* Use the alternative scheduling-pressure algorithm by default. */ + maybe_set_param_value (PARAM_SCHED_PRESSURE_ALGORITHM, 2, +global_options.x_param_values, +global_options_set.x_param_values); /* This needs to be done at start up. It's convenient to do it here. */ register_pass (&insert_vzeroupper_info);
[PATCH, x86, AVX2] FP reassociation enabling for AVX2 targets
Hi! This small change enables FP reassociation for AVX2 processors. This gives ~+1.5% in performance geomean for spec2006FP tests. Bootstrapped/regtested on x86_64-linux, ok for trunk? 2013-02-14 Igor Zamyatin * config/i386/i386.c (initial_ix86_tune_features): Turn on fp reassociation for avx2 targets. diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index caf4894..cb84866 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -2021,7 +2021,7 @@ static unsigned int initial_ix86_tune_features[X86_TUNE_LAST] = { /* X86_TUNE_REASSOC_FP_TO_PARALLEL: Try to produce parallel computations during reassociation of fp computation. */ - m_ATOM, + m_ATOM | m_HASWELL, /* X86_TUNE_GENERAL_REGS_SSE_SPILL: Try to spill general regs to SSE regs instead of memory. */
Re: [GCC 4.8 wwwdocs] PATCH: Mention several user-visible changes for x86
Gerald, Thanks a lot for your remarks! Below is updated patch which will be checked in. Thanks, Igor On Mon, Feb 18, 2013 at 3:07 AM, Gerald Pfeifer wrote: > On Fri, 15 Feb 2013, Igor Zamyatin wrote: >> Is it ok for wwwdocs? > > Index: htdocs/gcc-4.8/changes.html > === > + Support for the new Intel processor codename Broadwell with RDSEED, > + ADCX, ADOX, PREFETCHW is available through -madx, > + -mprfchw, -mrdseed. > > Can you make this RDSEED, ... and so forth? > (This is a bit borderline, in that one could also see this as > more general references, but usually we mark those up.) > > And "...through the ... command-line options."? > > + Support for Intel RTM and HLE intrinsics, built-in > > "the ... intrinsics" (and same below for "instruction sets") > > + x86 backend was improved to allow option > -fscedule-insns to work reliably. > > "The x86 backend has been improved..." > > + This option can be used to schedule instructions better and can > lead to improved performace in certain cases. > > This line is quite long, can you break lines around 76 columns? > > And, let's be a bit more brave and omit either "can" or "in certain > cases". Otherwise this may sounds too unlikely. :-) > > The patch is fine with changes along the lines described above. > > Thanks, > Gerald Index: htdocs/gcc-4.8/changes.html === RCS file: /cvs/gcc/wwwdocs/htdocs/gcc-4.8/changes.html,v retrieving revision 1.95 diff -c -r1.95 changes.html *** htdocs/gcc-4.8/changes.html 11 Feb 2013 15:12:58 - 1.95 --- htdocs/gcc-4.8/changes.html 12 Feb 2013 15:10:41 - *** *** 460,465 --- 460,471 wrong results. You must build all modules with -mpreferred-stack-boundary=3, including any libraries. This includes the system libraries and startup modules. + Support for the new Intel processor codename Broadwell with + RDSEED, ADCX, ADOX, + PREFETCHW is available through -madx, + -mprfchw, -mrdseed command-line options. + + Support for the Intel RTM and HLE intrinsics, built-in functions and code generation is available via -mrtm and -mhle. + + Support for the Intel FXSR, XSAVE and XSAVEOPT instruction sets. Intrinsics and built-in functions are available + via -mfxsr, -mxsave and -mxsaveopt respectively. + New built-in functions to detect run-time CPU type and ISA: A built-in function __builtin_cpu_is has been added to *** *** 524,529 --- 530,538 http://gcc.gnu.org/wiki/FunctionMultiVersioning";>wiki for more information. + The x86 backend has been improved to allow option -fscedule-insns to work reliably. + This option can be used to schedule instructions better and leads to + improved performace in certain cases. + Windows MinGW-w64 targets (*-w64-mingw*) require at least r5437 from the Mingw-w64 trunk.
Re: [GCC 4.8 wwwdocs] PATCH: Mention several user-visible changes for x86
Gerald, Is it ok for wwwdocs? Thanks, Igor On Wed, Feb 13, 2013 at 3:21 PM, Uros Bizjak wrote: > On Wed, Feb 13, 2013 at 12:17 PM, Igor Zamyatin wrote: >>> >>> Please also mention new -mfxsr, -mxsave and -mxsaveopt options. >>> >>>>New built-in functions to detect run-time CPU type and ISA: >>>> >>>> A built-in function __builtin_cpu_is has been >>>> added to >>>> *** >>>> *** 524,529 >>>> --- 530,538 >>>> http://gcc.gnu.org/wiki/FunctionMultiVersioning";>wiki >>>> for more >>>> information. >>>> >>>> + Problem with instability of pre-reload scheduler on x86 >>>> targets was fixed. Now option -fschedule-insn >>>> + can be used loosely to reach better performance. >>> >>> "used loosely" in what sense? >>> >>> Thanks, >>> Uros. >> >> Updated version. Does it look ok? > > Looks OK to me. I have CC'd Gerald for a grammar and style check. > > Thanks, > Uros. > >> >> >> Thanks, >> Igor >> Index: htdocs/gcc-4.8/changes.html === RCS file: /cvs/gcc/wwwdocs/htdocs/gcc-4.8/changes.html,v retrieving revision 1.95 diff -c -r1.95 changes.html *** htdocs/gcc-4.8/changes.html 11 Feb 2013 15:12:58 - 1.95 --- htdocs/gcc-4.8/changes.html 12 Feb 2013 15:10:41 - *** *** 460,465 --- 460,471 wrong results. You must build all modules with -mpreferred-stack-boundary=3, including any libraries. This includes the system libraries and startup modules. + Support for the new Intel processor codename Broadwell with RDSEED, + ADCX, ADOX, PREFETCHW is available through -madx, + -mprfchw, -mrdseed. + + Support for Intel RTM and HLE intrinsics, built-in functions and code generation is available via -mrtm and -mhle. + + Support for Intel FXSR, XSAVE and XSAVEOPT instruction sets. Intrinsics and built-in functions are available + via -mfxsr, -mxsave and -mxsaveopt respectively. + New built-in functions to detect run-time CPU type and ISA: A built-in function __builtin_cpu_is has been added to *** *** 524,529 --- 530,538 http://gcc.gnu.org/wiki/FunctionMultiVersioning";>wiki for more information. + x86 backend was improved to allow option -fscedule-insns to work reliably. + This option can be used to schedule instructions better and can lead to improved performace in certain cases. + Windows MinGW-w64 targets (*-w64-mingw*) require at least r5437 from the Mingw-w64 trunk.
Re: [GCC 4.8 changes] PATCH: Mention several user-visible changes for x86
> > Please also mention new -mfxsr, -mxsave and -mxsaveopt options. > >>New built-in functions to detect run-time CPU type and ISA: >> >> A built-in function __builtin_cpu_is has been added >> to >> *** >> *** 524,529 >> --- 530,538 >> http://gcc.gnu.org/wiki/FunctionMultiVersioning";>wiki >> for more >> information. >> >> + Problem with instability of pre-reload scheduler on x86 >> targets was fixed. Now option -fschedule-insn >> + can be used loosely to reach better performance. > > "used loosely" in what sense? > > Thanks, > Uros. Updated version. Does it look ok? Thanks, Igor Index: htdocs/gcc-4.8/changes.html === RCS file: /cvs/gcc/wwwdocs/htdocs/gcc-4.8/changes.html,v retrieving revision 1.95 diff -c -r1.95 changes.html *** htdocs/gcc-4.8/changes.html 11 Feb 2013 15:12:58 - 1.95 --- htdocs/gcc-4.8/changes.html 12 Feb 2013 15:10:41 - *** *** 460,465 --- 460,471 wrong results. You must build all modules with -mpreferred-stack-boundary=3, including any libraries. This includes the system libraries and startup modules. + Support for the new Intel processor codename Broadwell with RDSEED, + ADCX, ADOX, PREFETCHW is available through -madx, + -mprfchw, -mrdseed. + + Support for Intel RTM and HLE intrinsics, built-in functions and code generation is available via -mrtm and -mhle. + + Support for Intel FXSR, XSAVE and XSAVEOPT instruction sets. Intrinsics and built-in functions are available + via -mfxsr, -mxsave and -mxsaveopt respectively. + New built-in functions to detect run-time CPU type and ISA: A built-in function __builtin_cpu_is has been added to *** *** 524,529 --- 530,538 http://gcc.gnu.org/wiki/FunctionMultiVersioning";>wiki for more information. + x86 backend was improved to allow option -fscedule-insns to work reliably. + This option can be used to schedule instructions better and can lead to improved performace in certain cases. + Windows MinGW-w64 targets (*-w64-mingw*) require at least r5437 from the Mingw-w64 trunk.
[GCC 4.8 changes] PATCH: Mention several user-visible changes for x86
Hi, This patch updates GCC 4.8 changes.html to mention Broadwell's features, RTM and HLE support and fixed pre-reload scheduler. OK to commit? Thanks, Igor Index: htdocs/gcc-4.8/changes.html === RCS file: /cvs/gcc/wwwdocs/htdocs/gcc-4.8/changes.html,v retrieving revision 1.95 diff -c -r1.95 changes.html *** htdocs/gcc-4.8/changes.html 11 Feb 2013 15:12:58 - 1.95 --- htdocs/gcc-4.8/changes.html 12 Feb 2013 15:10:41 - *** *** 460,465 --- 460,471 wrong results. You must build all modules with -mpreferred-stack-boundary=3, including any libraries. This includes the system libraries and startup modules. + Support for the new Intel processor codename Broadwell with RDSEED, + ADCX, ADOX, PREFETCHW is available through -madx, + -mprfchw, -mrdseed. + + Support for Intel RTM and HLE intrinsics, built-in functions and code generation is available via -mrtm and -mhle. + New built-in functions to detect run-time CPU type and ISA: A built-in function __builtin_cpu_is has been added to *** *** 524,529 --- 530,538 http://gcc.gnu.org/wiki/FunctionMultiVersioning";>wiki for more information. + Problem with instability of pre-reload scheduler on x86 targets was fixed. Now option -fschedule-insn + can be used loosely to reach better performance. + Windows MinGW-w64 targets (*-w64-mingw*) require at least r5437 from the Mingw-w64 trunk.
Re: [PATCH] Fix instability of -fschedule-insn for x86
We also plan to test these changes along with LRA On Sun, Sep 30, 2012 at 4:33 PM, Uros Bizjak wrote: > On Tue, Sep 18, 2012 at 1:31 PM, Uros Bizjak wrote: > >>> This patch aims to fix all stability issues related to using the first >>> scheduler in gcc >>> for x86 target (there several reported issues related to this problem). >>> >>> Main idea of this activity is mostly to provide user a possibility to >>> safely turn on first scheduler for his codes. In some cases this could >>> positively affect performance, especially for in-order Atom. >>> >>> Below is short description of proposed changes. >> >>> 2012-09-18 Yuri Rumyantsev >>> >>> * config/i386/i386.c (ix86_dep_by_shift_count_body) : Add >>> check on reload_completed since it can be invoked before >>> register allocation phase in 1st scheduler. >>> (ia32_multipass_dfa_lookahead) : Do not use dfa_lookahead for 1st >>> Scheduler to save compile time. >>> (ix86_sched_reorder) : Do not perform ready list reordering for 1st >>> Scheduler to save compile time. >>> (insn_is_function_arg) : New function. Returns true if lhs of insn >>> is >>> HW function argument register. >>> (add_parameter_dependencies) : New function. Add output dependencies >>> for chain of function adjacent arguments if only there is a move to >>> likely spilled HW registers. Return first argument if at least one >>> dependence was added or NULL otherwise. >>> (avoid_func_arg_motion) : New function. Add output or anti >>> dependency >>> from insn to first_arg to restrict code motion. >>> (add_dependee_for_func_arg) : New function. Avoid cross block >>> motion of >>> function argument through adding dependency from the first non-jump >>> insn in bb. >>> (ix86_dependencies_evaluation_hook) : New function. Hook for >>> schedule1: >>> avoid motion of function arguments passed in passed in likely >>> spilled >>> HW registers. >>> (ix86_adjust_priority) : New function. Hook for schedule1: set >>> priority >>> of moves from likely spilled HW registers to maximum to schedule >>> them >>> as soon as possible. >>> (ix86_sched_init_global): Do not perform multipass scheduling for >>> 1st >>> Scheduler to save compile time. >> >> I would kindly ask scheduler expert to review the patch from the >> scheduler functionality POV. > > I have received opinion from Vladimir from off-line discussion, quoted below: > > --quote-- > I think, it is ok. > > Switching off first cycle multipass scheduling is ok. It is mostly > useful when the order of insns issued on the same cycle is important > (mostly VLIW or quasy-VLIW processors). > > Other solutions are necessary to decrease spills and avoid reload > crash (can not find a register in a class) when the 1st insn > scheduling is on. I don't think it fully avoids possibility of the > reload crashes but it takes into account most of cases resulting in > the crashes and makes the crash possibility really negligible. > Register pressure sensitive insn scheduling decreased the possibility. > This patch will make it negligible. And LRA will solve all the rest > cases of the crashes. > > I don't like a bit absence in freedom of moving argument insns with > likely spilled hard-regs between each other as they are chained in the > original order but it is debatable because it still decreases the > possibility of spills. > > In overall, the patch is ok for me. > --/quote-- > > Based on this opinion, the patch is OK for mainline, if there are no > objections from other x86 maintainers in the next couple of days > (48h). However, please watch for possible fallout from the patch, > compile-time ICEs and performance problems. x86 and scheduler didn't > play well together in the past, but your patch and (in the near > future) LRA seems to fix all these problems. > > Thanks, > Uros.
[Patch, testsuite] Changes in tests for the case when long double size is equal double size
Hi All! After recent commit (http://gcc.gnu.org/ml/gcc-cvs/2012-08/msg00576.html, patch - http://gcc.gnu.org/ml/gcc-patches/2012-08/msg01512.html), which sets size of long double equal to double for Bionic, several tests required some adjustment to pass on Android. Tested both for Linux and Andorid. Ok to commit? Thanks, Igor Changelog: gcc/testsuite 2012-09-24 Igor Zamyatin Test changes for case when long double size is equal double size * gcc.target/i386/20030217-1.c: Added check for large_long_double effective target. * gcc.target/i386/387-3.c: Likewise. * gcc.target/i386/387-4.c: Likewise. * gcc.target/i386/pr36578-1.c: Likewise. * gcc.target/i386/excess-precision-1.c: Added new code for the case when long double size is equal double size. * gcc.target/i386/excess-precision-1.c: Likewise. * gcc.target/i386/pr36578-2.c: Likewise. * gcc.target/i386/20030217-2.c: New testcase. long_double_for_android.patch Description: Binary data
Re: [PATCH, x86-Atom] Enabling look-ahead scheduling feature for Atom processors
On Tue, Aug 28, 2012 at 1:34 PM, Uros Bizjak wrote: > On Tue, Aug 28, 2012 at 11:24 AM, Igor Zamyatin wrote: > >> I'd like this patch (original version at the bottom) also to be >> backported into 4.7. >> >> Is it safe to backport? > > Looks safe to me. > > OK if there are no objections from RMs in the next 24h. All testing passed so following will be checked in 4.7. Index: gcc/config/i386/i386.c === --- gcc/config/i386/i386.c (revision 190833) +++ gcc/config/i386/i386.c (working copy) @@ -23842,6 +23842,7 @@ case PROCESSOR_CORE2_64: case PROCESSOR_COREI7_32: case PROCESSOR_COREI7_64: +case PROCESSOR_ATOM: /* Generally, we want haifa-sched:max_issue() to look ahead as far as many instructions can be executed on a cycle, i.e., issue_rate. I wonder why tuning for many CPUs does not do this. */ Original Changelog: 2012-08-23 Yuri Rumyantsev * config/i386/i386.c (ia32_multipass_dfa_lookahead) : Add case for Atom processor. Thanks, Igor > > Thanks, > Uros.
Re: [PATCH, x86-Atom] Enabling look-ahead scheduling feature for Atom processors
Hi! I'd like this patch (original version at the bottom) also to be backported into 4.7. Is it safe to backport? On Fri, Aug 24, 2012 at 11:35 AM, Uros Bizjak wrote: > On Fri, Aug 24, 2012 at 9:22 AM, Igor Zamyatin wrote: > >> Following change enables look ahead feature in the code scheduler for >> Atom processors. This gives quite reasonable gain for some benchmarks >> for mobile market. >> >> Overall compile time increase for SPEC2000 is about 1%. >> >> Regtested for x86_64 and also bootstrapped with "--with-arch=core2 >> --with-cpu=atom" >> >> 2012-08-23 Yuri Rumyantsev >> >> * config/i386/i386.c (ia32_multipass_dfa_lookahead) : Add >> case for Atom processor. > > OK. > > Thanks, > Uros. Original patch: diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 976bbb4..331e29a 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -24103,6 +24103,7 @@ ia32_multipass_dfa_lookahead (void) case PROCESSOR_CORE2_64: case PROCESSOR_COREI7_32: case PROCESSOR_COREI7_64: +case PROCESSOR_ATOM: /* Generally, we want haifa-sched:max_issue() to look ahead as far as many instructions can be executed on a cycle, i.e., issue_rate. I wonder why tuning for many CPUs does not do this. */
[PATCH, x86-Atom] Enabling look-ahead scheduling feature for Atom processors
Hi! Following change enables look ahead feature in the code scheduler for Atom processors. This gives quite reasonable gain for some benchmarks for mobile market. Overall compile time increase for SPEC2000 is about 1%. Regtested for x86_64 and also bootstrapped with "--with-arch=core2 --with-cpu=atom" Ok for trunk? Thanks, Igor Changelog: 2012-08-23 Yuri Rumyantsev * config/i386/i386.c (ia32_multipass_dfa_lookahead) : Add case for Atom processor. diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 976bbb4..331e29a 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -24103,6 +24103,7 @@ ia32_multipass_dfa_lookahead (void) case PROCESSOR_CORE2_64: case PROCESSOR_COREI7_32: case PROCESSOR_COREI7_64: +case PROCESSOR_ATOM: /* Generally, we want haifa-sched:max_issue() to look ahead as far as many instructions can be executed on a cycle, i.e., issue_rate. I wonder why tuning for many CPUs does not do this. */
[PATCH][RFC] Fixing instability of -fschedule-insns for x86
Hi all! Patch aims to fix instability introduced by first scheduler on x86. In particular it targets following list: [1] http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46843 [2] http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46829 [3] http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36680 [4] http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42295 Main idea of this activity is mostly to provide user a possibility to safely turn on first scheduler for his codes. In some cases this could positively affect performance, especially for in-order Atom. It would be great to hear some feedback from the community about the change. Thanks in advance, Igor first_scheduler_fix.patch Description: Binary data
Re: [PATCH, Android] Runtime stack protector enabling for Android target
Hi! There were only trivial merge conflicts like <<< [(match_operand 0 "memory_operand" "") (match_operand 1 "memory_operand" "")] "" --- [(match_operand 0 "memory_operand") (match_operand 1 "memory_operand")] "!TARGET_HAS_BIONIC" >>> Patch attached. All necessary testing passed. Ok for 4.7? Changelog: 2012-08-08 Pavel Chupin Backport from mainline r189840 and r187586: 2012-07-25 Sergey Melnikov * config/i386/i386.md (stack_protect_set): Disable the pattern for Android since Android libc (bionic) does not provide random value for stack protection guard at gs:0x14. Guard value will be provided from external symbol (default implementation). (stack_protect_set_): Likewise. (stack_protect_test): Likewise. (stack_protect_test_): Likewise. * gcc/defaults.h: Define macro TARGET_HAS_BIONIC to 0 - target does not have Bionic by default * config/linux.h: Redefine macro TARGET_HAS_BIONIC to (OPTION_BIONIC) Macro OPTION_BIONIC is defined in this file and provides Bionic accessibility status 2012-05-16 Igor Zamyatin * configure.ac: Stack protector enabling for Android targets. * configure: Regenerate. On Wed, Aug 8, 2012 at 2:25 PM, Maxim Kuvyrkov wrote: > On 8/08/2012, at 9:46 PM, Uros Bizjak wrote: > >> On Wed, Aug 8, 2012 at 9:54 AM, Igor Zamyatin wrote: >> >>> I'd like to ask whether stack-protector changes for Android could go to 4.7? >>> >>> Pathes are: >>> >>> http://gcc.gnu.org/ml/gcc-patches/2012-05/msg01089.html >>> http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01157.html >> >> OK, as far as x86 is concerned. > > Strictly speaking, these fixes are not for a regression, so do not readily > qualify for a release branch. However, since the patches are no-ops on > targets other than x86 and Uros OK'd them for x86 ... > > OK, provided that the patches in the above threads apply without conflicts. > If there are conflicts, please repost for review. > > Thanks, > > -- > Maxim Kuvyrkov > CodeSourcery / Mentor Graphics > stack-protector-4_7.patch Description: Binary data
Re: [PATCH, Android] Runtime stack protector enabling for Android target
Hi all! I'd like to ask whether stack-protector changes for Android could go to 4.7? Pathes are: http://gcc.gnu.org/ml/gcc-patches/2012-05/msg01089.html http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01157.html Thanks, Igor On Wed, Jul 25, 2012 at 2:08 AM, Maxim Kuvyrkov wrote: > On 24/07/2012, at 10:08 PM, Uros Bizjak wrote: > >> On Mon, Jul 23, 2012 at 10:00 PM, Igor Zamyatin wrote: >> >>> 2012-07-23 Sergey Melnikov >>> >>>* config/i386/i386.md (stack_protect_set): Disable the pattern >>>for Android since Android libc (bionic) does not provide random >>>value for stack protection guard at gs:0x14. Guard value >>>will be provided from external symbol (default implementation). >>>(stack_protect_set_): Likewise. >>>(stack_protect_test): Likewise. >>>(stack_protect_test_): Likewise. >>>* gcc/defaults.h: Define macro TARGET_HAS_BIONIC to 0 - target does >>>not have Bionic by default >>>* config/linux.h: Redefine macro TARGET_HAS_BIONIC to (OPTION_BIONIC) >>>Macro OPTION_BIONIC is defined in this file and provides Bionic >>>accessibility status >> >> Looks OK to me, patch needs approval from Maxim. > > OK. Thanks for fixing this. > > -- > Maxim Kuvyrkov > CodeSourcery / Mentor Graphics >
[PATCH, Android] Runtime stack protector enabling for Android target
Hi! Since this change for stack-protector enabling hurts mingw compiler (see http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53980 and http://gcc.gnu.org/ml/gcc-patches/2012-07/msg00638.html) we'd like to make change more general. Please see new patch in attachment. Tested in android environment(x86_64-*-linux-android), also bootstrapped and regtested on x86_64-unknown-linux-gnu and i686-linux. Also mingw bootstrap was checked. Ok for trunk? Thanks, Igor Changelog: 2012-07-23 Sergey Melnikov * config/i386/i386.md (stack_protect_set): Disable the pattern for Android since Android libc (bionic) does not provide random value for stack protection guard at gs:0x14. Guard value will be provided from external symbol (default implementation). (stack_protect_set_): Likewise. (stack_protect_test): Likewise. (stack_protect_test_): Likewise. * gcc/defaults.h: Define macro TARGET_HAS_BIONIC to 0 - target does not have Bionic by default * config/linux.h: Redefine macro TARGET_HAS_BIONIC to (OPTION_BIONIC) Macro OPTION_BIONIC is defined in this file and provides Bionic accessibility status patch-stack-protector-3_target_bionic.diff Description: Binary data
Re: [Patch, Fortran] Add parsing support for assumed-rank array
On x86_64 the same happens. Also I modified list of failing tests - now it is correct On Fri, Jul 20, 2012 at 11:43 AM, Igor Zamyatin wrote: >> >> Tobias Burnus wrote: >>> I will now regtest everything, read through the whole patch - your >>> part and mine, update the ChangeLog and commit it tomorrow. >> >> I have now committed the attached version as Rev. 189700! >> >> Thanks agai for the review! >> >> Tobias >> > > This seems to cause following fails at least on i686: > > FAIL: gfortran.dg/assumed_rank_12.f90 -O0 scan-tree-dump original " > = f \\(\\);.*desc.0.dtype = 600;.*desc.0.data = .void .. D.*;.*sub > \\(&desc.0\\);.*D.*= .integer.kind=4. .. desc.0.data;" > FAIL: gfortran.dg/assumed_rank_12.f90 -O1 scan-tree-dump original " > = f \\(\\);.*desc.0.dtype = 600;.*desc.0.data = .void .. D.*;.*sub > \\(&desc.0\\);.*D.*= .integer.kind=4. .. desc.0.data;" > FAIL: gfortran.dg/assumed_rank_12.f90 -O2 scan-tree-dump original " > = f \\(\\);.*desc.0.dtype = 600;.*desc.0.data = .void .. D.*;.*sub > \\(&desc.0\\);.*D.*= .integer.kind=4. .. desc.0.data;" > FAIL: gfortran.dg/assumed_rank_12.f90 -O3 -fomit-frame-pointer > scan-tree-dump original " = f \\(\\);.*desc.0.dtype = > 600;.*desc.0.data = .void .. D.*;.*sub \\(&desc.0\\);.*D.*= > .integer.kind=4. .. desc.0.data;" > FAIL: gfortran.dg/assumed_rank_12.f90 -O3 -fomit-frame-pointer > -funroll-all-loops -finline-functions scan-tree-dump original " = f > \\(\\);.*desc.0.dtype = 600;.*desc.0.data = .void .. D.*;.*sub > \\(&desc.0\\);.*D.*= .integer.kind=4. .. desc.0.data;" > FAIL: gfortran.dg/assumed_rank_12.f90 -O3 -fomit-frame-pointer > -funroll-loops scan-tree-dump original " = f \\(\\);.*desc.0.dtype = > 600;.*desc.0.data = .void .. D.*;.*sub \\(&desc.0\\);.*D.*= > .integer.kind=4. .. desc.0.data;" > FAIL: gfortran.dg/assumed_rank_12.f90 -O3 -g scan-tree-dump > original " = f \\(\\);.*desc.0.dtype = 600;.*desc.0.data = .void .. > D.*;.*sub \\(&desc.0\\);.*D.*= .integer.kind=4. .. desc.0.data;" > FAIL: gfortran.dg/assumed_rank_12.f90 -Os scan-tree-dump original " > = f \\(\\);.*desc.0.dtype = 600;.*desc.0.data = .void .. D.*;.*sub > \\(&desc.0\\);.*D.*= .integer.kind=4. .. desc.0.data;" > FAIL: gfortran.dg/assumed_rank_6.f90 -O (test for errors, line 19) > FAIL: gfortran.dg/assumed_rank_6.f90 -O (test for errors, line 20) > FAIL: gfortran.dg/assumed_rank_6.f90 -O (test for errors, line 21) > FAIL: gfortran.dg/assumed_rank_6.f90 -O (test for errors, line 26) > FAIL: gfortran.dg/assumed_rank_6.f90 -O (test for errors, line 33) > FAIL: gfortran.dg/assumed_rank_6.f90 -O (test for errors, line 37) > FAIL: gfortran.dg/assumed_rank_6.f90 -O (test for errors, line 9) > FAIL: gfortran.dg/assumed_rank_6.f90 -O (internal compiler error) > FAIL: gfortran.dg/assumed_rank_6.f90 -O (test for excess errors)
Re: [Patch, Fortran] Add parsing support for assumed-rank array
> > Tobias Burnus wrote: >> I will now regtest everything, read through the whole patch - your >> part and mine, update the ChangeLog and commit it tomorrow. > > I have now committed the attached version as Rev. 189700! > > Thanks agai for the review! > > Tobias > This seems to cause following fails at least on i686: FAIL: gfortran.dg/assumed_rank_12.f90 -O0 scan-tree-dump original " = f \\(\\);.*desc.0.dtype = 600;.*desc.0.data = .void .. D.*;.*sub \\(&desc.0\\);.*D.*= .integer.kind=4. .. desc.0.data;" FAIL: gfortran.dg/assumed_rank_12.f90 -O1 scan-tree-dump original " = f \\(\\);.*desc.0.dtype = 600;.*desc.0.data = .void .. D.*;.*sub \\(&desc.0\\);.*D.*= .integer.kind=4. .. desc.0.data;" FAIL: gfortran.dg/assumed_rank_12.f90 -O2 scan-tree-dump original " = f \\(\\);.*desc.0.dtype = 600;.*desc.0.data = .void .. D.*;.*sub \\(&desc.0\\);.*D.*= .integer.kind=4. .. desc.0.data;" FAIL: gfortran.dg/assumed_rank_12.f90 -O3 -fomit-frame-pointer scan-tree-dump original " = f \\(\\);.*desc.0.dtype = 600;.*desc.0.data = .void .. D.*;.*sub \\(&desc.0\\);.*D.*= .integer.kind=4. .. desc.0.data;" FAIL: gfortran.dg/assumed_rank_12.f90 -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions scan-tree-dump original " = f \\(\\);.*desc.0.dtype = 600;.*desc.0.data = .void .. D.*;.*sub \\(&desc.0\\);.*D.*= .integer.kind=4. .. desc.0.data;" FAIL: gfortran.dg/assumed_rank_12.f90 -O3 -fomit-frame-pointer -funroll-loops scan-tree-dump original " = f \\(\\);.*desc.0.dtype = 600;.*desc.0.data = .void .. D.*;.*sub \\(&desc.0\\);.*D.*= .integer.kind=4. .. desc.0.data;" FAIL: gfortran.dg/assumed_rank_12.f90 -O3 -g scan-tree-dump original " = f \\(\\);.*desc.0.dtype = 600;.*desc.0.data = .void .. D.*;.*sub \\(&desc.0\\);.*D.*= .integer.kind=4. .. desc.0.data;" FAIL: gfortran.dg/assumed_rank_12.f90 -Os scan-tree-dump original " = f \\(\\);.*desc.0.dtype = 600;.*desc.0.data = .void .. D.*;.*sub \\(&desc.0\\);.*D.*= .integer.kind=4. .. desc.0.data;" FAIL: gfortran.dg/assumed_rank_6.f90 -O (test for errors, line 19) FAIL: gfortran.dg/assumed_rank_6.f90 -O (test for errors, line 20) FAIL: gfortran.dg/assumed_rank_6.f90 -O (test for errors, line 21) FAIL: gfortran.dg/assumed_rank_6.f90 -O (test for errors, line 26) FAIL: gfortran.dg/assumed_rank_6.f90 -O (test for errors, line 33) FAIL: gfortran.dg/assumed_rank_6.f90 -O (test for errors, line 37) FAIL: gfortran.dg/assumed_rank_6.f90 -O (test for errors, line 9) FAIL: gfortran.dg/assumed_rank_6.f90 -O (internal compiler error) FAIL: gfortran.dg/assumed_rank_6.f90 -O (test for excess errors) FAIL: gfortran.dg/lto/pr45586-2 f_lto_pr45586-2_0.o-f_lto_pr45586-2_0.o link, -O0 -flto -fuse-linker-plugin -fno-fat-lto-objects (internal compiler error)
Re: [PATCH, Android] Runtime stack protector enabling for Android target
Let's try with this patch then. Is it ok for trunk after all appropriate testing is done? ChangeLog: 2012-07-09 Sergey Melnikov * config/i386/i386.md (stack_protect_set): Disable the pattern for Android since Android libc (bionic) does not provide random value for stack protection guard at gs:0x14. Guard value will be provided from external symbol (default implementation). (stack_protect_set_): Likewise. (stack_protect_test): Likewise. (stack_protect_test_): Likewise. diff --git a/gcc-4.6/gcc/config/i386/i386.md b/gcc-4.6/gcc/config/i386/i386.md index b1d7e5e..c4dd6e3 100644 --- a/gcc-4.6/gcc/config/i386/i386.md +++ b/gcc-4.6/gcc/config/i386/i386.md @@ -17955,7 +17955,7 @@ (define_expand "stack_protect_set" [(match_operand 0 "memory_operand" "") (match_operand 1 "memory_operand" "")] - "" + "!(flag_android && OPTION_BIONIC)" { rtx (*insn)(rtx, rtx); @@ -17979,7 +17979,7 @@ (unspec:P [(match_operand:P 1 "memory_operand" "m")] UNSPEC_SP_SET)) (set (match_scratch:P 2 "=&r") (const_int 0)) (clobber (reg:CC FLAGS_REG))] - "" + "!(flag_android && OPTION_BIONIC)" "mov{}\t{%1, %2|%2, %1}\;mov{}\t{%2, %0|%0, %2}\;xor{l}\t%k2, %k2" [(set_attr "type" "multi")]) @@ -17997,7 +17997,7 @@ [(match_operand 0 "memory_operand" "") (match_operand 1 "memory_operand" "") (match_operand 2 "" "")] - "" + "!(flag_android && OPTION_BIONIC)" { rtx flags = gen_rtx_REG (CCZmode, FLAGS_REG); @@ -18027,7 +18027,7 @@ (match_operand:P 2 "memory_operand" "m")] UNSPEC_SP_TEST)) (clobber (match_scratch:P 3 "=&r"))] - "" + "!(flag_android && OPTION_BIONIC)" "mov{}\t{%1, %3|%3, %1}\;xor{}\t{%2, %3|%3, %2}" [(set_attr "type" "multi")]) Thanks, Igor On Fri, Jul 6, 2012 at 4:54 PM, Igor Zamyatin wrote: > Right, flag_android looks better, thanks. > Probably it's even better to check also bionic (OPTION_BIONIC) > > On Fri, Jul 6, 2012 at 12:13 PM, Andrew Pinski wrote: >> On Fri, Jul 6, 2012 at 12:49 AM, Igor Zamyatin wrote: >>> Hi! >>> >>> For runtime stack protector enabling on x86 for Android we have to >>> disable current glibc-based implementation and turn on default one >>> since bionic doesn't use value from gs:0x14. >>> >>> Tested in android environment(x86_64-*-linux-android), also >>> bootstrapped and regtested on x86_64-unknown-linux-gnu and i686-linux. >>> Ok for trunk? >> >> >> >> I think you want flag_android and not ANDROID_DEFAULT since you could >> use -mno-android . >> >> Thanks, >> Andrew >> >> >> >>> >>> Thanks, >>> Igor >>> >>>
Re: [PATCH, Android] Runtime stack protector enabling for Android target
Right, flag_android looks better, thanks. Probably it's even better to check also bionic (OPTION_BIONIC) On Fri, Jul 6, 2012 at 12:13 PM, Andrew Pinski wrote: > On Fri, Jul 6, 2012 at 12:49 AM, Igor Zamyatin wrote: >> Hi! >> >> For runtime stack protector enabling on x86 for Android we have to >> disable current glibc-based implementation and turn on default one >> since bionic doesn't use value from gs:0x14. >> >> Tested in android environment(x86_64-*-linux-android), also >> bootstrapped and regtested on x86_64-unknown-linux-gnu and i686-linux. >> Ok for trunk? > > > > I think you want flag_android and not ANDROID_DEFAULT since you could > use -mno-android . > > Thanks, > Andrew > > > >> >> Thanks, >> Igor >> >> >> ChangeLog: >> >> 2012-07-05 Sergey Melnikov >> >> * config/i386/i386.md (stack_protect_set): Disable the pattern >> for Android since Android libc (bionic) does not provide random >> value for stack protection guard at gs:0x14. Guard value >> will be provided from external symbol (default implementation). >> (stack_protect_set_): Likewise. >> (stack_protect_test): Likewise. >> (stack_protect_test_): Likewise. >> >> diff --git a/gcc-4.6/gcc/config/i386/i386.md >> b/gcc-4.6/gcc/config/i386/i386.md >> index b1d7e5e..fe0009d 100644 >> --- a/gcc-4.6/gcc/config/i386/i386.md >> +++ b/gcc-4.6/gcc/config/i386/i386.md >> @@ -17955,7 +17955,7 @@ >> (define_expand "stack_protect_set" >>[(match_operand 0 "memory_operand" "") >> (match_operand 1 "memory_operand" "")] >> - "" >> + "!ANDROID_DEFAULT" >> { >>rtx (*insn)(rtx, rtx); >> >> @@ -17979,7 +17979,7 @@ >> (unspec:P [(match_operand:P 1 "memory_operand" "m")] UNSPEC_SP_SET)) >> (set (match_scratch:P 2 "=&r") (const_int 0)) >> (clobber (reg:CC FLAGS_REG))] >> - "" >> + "!ANDROID_DEFAULT" >>"mov{}\t{%1, %2|%2, %1}\;mov{}\t{%2, >> %0|%0, %2}\;xor{l}\t%k2, %k2" >>[(set_attr "type" "multi")]) >> >> @@ -17997,7 +17997,7 @@ >>[(match_operand 0 "memory_operand" "") >> (match_operand 1 "memory_operand" "") >> (match_operand 2 "" "")] >> - "" >> + "!ANDROID_DEFAULT" >> { >>rtx flags = gen_rtx_REG (CCZmode, FLAGS_REG); >> >> @@ -18027,7 +18027,7 @@ >> (match_operand:P 2 "memory_operand" "m")] >> UNSPEC_SP_TEST)) >> (clobber (match_scratch:P 3 "=&r"))] >> - "" >> + "!ANDROID_DEFAULT" >>"mov{}\t{%1, %3|%3, %1}\;xor{}\t{%2, %3|%3, %2}" >>[(set_attr "type" "multi")])
[PATCH, Android] Runtime stack protector enabling for Android target
Hi! For runtime stack protector enabling on x86 for Android we have to disable current glibc-based implementation and turn on default one since bionic doesn't use value from gs:0x14. Tested in android environment(x86_64-*-linux-android), also bootstrapped and regtested on x86_64-unknown-linux-gnu and i686-linux. Ok for trunk? Thanks, Igor ChangeLog: 2012-07-05 Sergey Melnikov * config/i386/i386.md (stack_protect_set): Disable the pattern for Android since Android libc (bionic) does not provide random value for stack protection guard at gs:0x14. Guard value will be provided from external symbol (default implementation). (stack_protect_set_): Likewise. (stack_protect_test): Likewise. (stack_protect_test_): Likewise. diff --git a/gcc-4.6/gcc/config/i386/i386.md b/gcc-4.6/gcc/config/i386/i386.md index b1d7e5e..fe0009d 100644 --- a/gcc-4.6/gcc/config/i386/i386.md +++ b/gcc-4.6/gcc/config/i386/i386.md @@ -17955,7 +17955,7 @@ (define_expand "stack_protect_set" [(match_operand 0 "memory_operand" "") (match_operand 1 "memory_operand" "")] - "" + "!ANDROID_DEFAULT" { rtx (*insn)(rtx, rtx); @@ -17979,7 +17979,7 @@ (unspec:P [(match_operand:P 1 "memory_operand" "m")] UNSPEC_SP_SET)) (set (match_scratch:P 2 "=&r") (const_int 0)) (clobber (reg:CC FLAGS_REG))] - "" + "!ANDROID_DEFAULT" "mov{}\t{%1, %2|%2, %1}\;mov{}\t{%2, %0|%0, %2}\;xor{l}\t%k2, %k2" [(set_attr "type" "multi")]) @@ -17997,7 +17997,7 @@ [(match_operand 0 "memory_operand" "") (match_operand 1 "memory_operand" "") (match_operand 2 "" "")] - "" + "!ANDROID_DEFAULT" { rtx flags = gen_rtx_REG (CCZmode, FLAGS_REG); @@ -18027,7 +18027,7 @@ (match_operand:P 2 "memory_operand" "m")] UNSPEC_SP_TEST)) (clobber (match_scratch:P 3 "=&r"))] - "" + "!ANDROID_DEFAULT" "mov{}\t{%1, %3|%3, %1}\;xor{}\t{%2, %3|%3, %2}" [(set_attr "type" "multi")])
Re: [PATCH 1/3] Add rtx costs for sse integer ops
On Wed, Jun 27, 2012 at 9:14 PM, Richard Henderson wrote: > On 06/27/2012 10:08 AM, Igor Zamyatin wrote: >> On Wed, Jun 27, 2012 at 9:00 PM, Richard Henderson wrote: >>> > On 06/27/2012 09:07 AM, Igor Zamyatin wrote: >>>> >> May I ask about the purpose of the following piece of change? Doesn't >>>> >> it affect non-sse cases either? >>> > >>> > Err, no, it doesn't affect non-sse cases. All MODE_VECTOR_INT >>> > cases will be implemented in the xmm registers (ignoring the >>> > deprecated and largely ignored mmx case). >> Probably I misunderstand something... This condition - GET_MODE_SIZE >> (mode) < UNITS_PER_WORD - is outside the check for MODE_VECTOR_INT. > > Of course. > > We currently have > > if (vector mode) > else if (mode <= word size) > else /* scalar mode > word size */ Sure, it's clear. So am I correct that in this case now for second possibility we run the code which was executed for third possibility before the patch (it was under TARGET_64BIT && GET_MODE (XEXP (x, 0)) == DImode)? If yes, why? > > We could no doubt legitimately rearrange this to > > if (mode < word size) > else if (vector mode) > else /* scalar mode > word size */ > > but I don't see how that's any clearer. We certainly > can't eliminate any tests, since there are in fact > three different possibilities. > > > > r~
Re: [PATCH 1/3] Add rtx costs for sse integer ops
On Wed, Jun 27, 2012 at 9:00 PM, Richard Henderson wrote: > On 06/27/2012 09:07 AM, Igor Zamyatin wrote: >> May I ask about the purpose of the following piece of change? Doesn't >> it affect non-sse cases either? > > Err, no, it doesn't affect non-sse cases. All MODE_VECTOR_INT > cases will be implemented in the xmm registers (ignoring the > deprecated and largely ignored mmx case). Probably I misunderstand something... This condition - GET_MODE_SIZE (mode) < UNITS_PER_WORD - is outside the check for MODE_VECTOR_INT. > > >> >> @@ -32038,7 +32042,15 @@ ix86_rtx_costs (rtx x, int code, int >> outer_code_i, int opno, int *total, >> case ASHIFTRT: >> case LSHIFTRT: >> case ROTATERT: >> - if (!TARGET_64BIT && GET_MODE (XEXP (x, 0)) == DImode) >> + if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT) >> + { >> + /* ??? Should be SSE vector operation cost. */ >> + /* At least for published AMD latencies, this really is the same >> + as the latency for a simple fpu operation like fabs. */ >> + *total = cost->fabs; >> + return false; >> + } >> + if (GET_MODE_SIZE (mode) < UNITS_PER_WORD) >> { >> if (CONST_INT_P (XEXP (x, 1))) >> { >> >> It also seems that we reversed the condition for the code that is now >> under if (GET_MODE_SIZE (mode) < UNITS_PER_WORD). Why do we need this? > > I'm not sure what you're suggesting. But we certainly don't use > the xmm registers to implement DImode operations in 32-bit, so... > > > r~
Re: [PATCH 1/3] Add rtx costs for sse integer ops
May I ask about the purpose of the following piece of change? Doesn't it affect non-sse cases either? @@ -32038,7 +32042,15 @@ ix86_rtx_costs (rtx x, int code, int outer_code_i, int opno, int *total, case ASHIFTRT: case LSHIFTRT: case ROTATERT: - if (!TARGET_64BIT && GET_MODE (XEXP (x, 0)) == DImode) + if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT) + { + /* ??? Should be SSE vector operation cost. */ + /* At least for published AMD latencies, this really is the same +as the latency for a simple fpu operation like fabs. */ + *total = cost->fabs; + return false; + } + if (GET_MODE_SIZE (mode) < UNITS_PER_WORD) { if (CONST_INT_P (XEXP (x, 1))) { It also seems that we reversed the condition for the code that is now under if (GET_MODE_SIZE (mode) < UNITS_PER_WORD). Why do we need this? Thanks, Igor -Original Message- From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On Behalf Of Richard Henderson Sent: Saturday, June 16, 2012 12:57 AM To: gcc-patches@gcc.gnu.org Cc: rguent...@suse.de; ubiz...@gmail.com; hjl.to...@gmail.com Subject: [PATCH 1/3] Add rtx costs for sse integer ops --- gcc/config/i386/i386.c | 50 ++- 1 files changed, 40 insertions(+), 10 deletions(-) diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index e2f5740..578a756 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -31990,13 +31990,16 @@ ix86_rtx_costs (rtx x, int code, int outer_code_i, int opno, int *total, break; case 0: case -1: - /* Start with (MEM (SYMBOL_REF)), since that's where - it'll probably end up. Add a penalty for size. */ - *total = (COSTS_N_INSNS (1) - + (flag_pic != 0 && !TARGET_64BIT) - + (mode == SFmode ? 0 : mode == DFmode ? 1 : 2)); break; } + /* FALLTHRU */ + + case CONST_VECTOR: + /* Start with (MEM (SYMBOL_REF)), since that's where + it'll probably end up. Add a penalty for size. */ + *total = (COSTS_N_INSNS (1) + + (flag_pic != 0 && !TARGET_64BIT) + + (mode == SFmode ? 0 : mode == DFmode ? 1 : 2)); return true; case ZERO_EXTEND: @@ -32016,8 +32019,9 @@ ix86_rtx_costs (rtx x, int code, int outer_code_i, int opno, int *total, return false; case ASHIFT: - if (CONST_INT_P (XEXP (x, 1)) - && (GET_MODE (XEXP (x, 0)) != DImode || TARGET_64BIT)) + if (SCALAR_INT_MODE_P (mode) + && GET_MODE_SIZE (mode) < UNITS_PER_WORD + && CONST_INT_P (XEXP (x, 1))) { HOST_WIDE_INT value = INTVAL (XEXP (x, 1)); if (value == 1) @@ -32038,7 +32042,15 @@ ix86_rtx_costs (rtx x, int code, int outer_code_i, int opno, int *total, case ASHIFTRT: case LSHIFTRT: case ROTATERT: - if (!TARGET_64BIT && GET_MODE (XEXP (x, 0)) == DImode) + if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT) + { + /* ??? Should be SSE vector operation cost. */ + /* At least for published AMD latencies, this really is the same + as the latency for a simple fpu operation like fabs. */ + *total = cost->fabs; + return false; + } + if (GET_MODE_SIZE (mode) < UNITS_PER_WORD) { if (CONST_INT_P (XEXP (x, 1))) { @@ -32107,6 +32119,16 @@ ix86_rtx_costs (rtx x, int code, int outer_code_i, int opno, int *total, *total = cost->fmul; return false; } + else if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT) + { + /* Without sse4.1, we don't have PMULLD; it's emulated with 7 + insns, including two PMULUDQ. */ + if (mode == V4SImode && !(TARGET_SSE4_1 || TARGET_AVX)) + *total = cost->fmul * 2 + cost->fabs * 5; + else + *total = cost->fmul; + return false; + } else { rtx op0 = XEXP (x, 0); @@ -32171,7 +32193,7 @@ ix86_rtx_costs (rtx x, int code, int outer_code_i, int opno, int *total, case PLUS: if (GET_MODE_CLASS (mode) == MODE_INT - && GET_MODE_BITSIZE (mode) <= GET_MODE_BITSIZE (Pmode)) + && GET_MODE_SIZE (mode) <= UNITS_PER_WORD) { if (GET_CODE (XEXP (x, 0)) == PLUS && GET_CODE (XEXP (XEXP (x, 0), 0)) == MULT @@ -32271,6 +32293,14 @@ ix86_rtx_costs (rtx x, int code, int outer_code_i, int opno, int *total, /* FALLTHRU */ case NOT: + if (GET_MODE_CLASS (mode) == MODE_VECTOR_INT) + { + /* ??? Should be SSE vector operation cost. */ + /* At least for published AMD latencies, this really is the same + as the latency for a simple fpu operation like fabs. */ + *total = cost->fabs; + return false; + } if (!TARGET_64BIT && mode == DImode) *total = cos
Re: [PATCH] Atom: Scheduler improvements for better imul placement
Hi, Uros! Sorry, I didn't realize that patch was missed. I attached new version. Changelog: 2012-05-29 Yuri Rumyantsev * config/i386/i386.c (x86_sched_reorder): New function. Added new function x86_sched_reorder. As for multiply modes, currently we handled most frequent case for Atom and in the future this could be enhanced. Thanks, Igor > > Hello! > >> Ping? > > Please at least add and URL to the patch, it took me some time to found the > latest version [1], I'm not even sure if it is the latest version... > > I assume that you cleared all issues with middle-end and scheduler > maintainers, it is not clear from the message. > > + (1) IMUL instrction is on the top of list; > > Typo above. > > + static int issue_rate = -1; > + int n_ready = *pn_ready; > + rtx insn; > + rtx insn1; > + rtx insn2; > > Please put three definitions above on the same line. > > + int i; > + sd_iterator_def sd_it; > + dep_t dep; > + int index = -1; > + > + /* set up issue rate */ > + if (issue_rate < 0) > + issue_rate = ix86_issue_rate(); > > Please set issue_rate unconditionally here. Also, please follow the GNU > style of comments (Full sentence with two spaces after the dot) everywhere, > e.g: > > /* Set up issue rate. */ > > + if (!(GET_CODE (SET_SRC (insn)) == MULT > + && GET_MODE (SET_SRC (insn)) == SImode)) > + return issue_rate; > > Is it correct that only SImode multiplies are checked against SImode > multiplies? Can't we use DImode or HImode multiply (or other long-latency > insns) to put them into the shadow of the first multiply insn? > > As proposed in [2], there are many other fine-tuning approaches proposed by > the scheduler maintainer. OTOH, even the "big hammer" > approach in the proposed patch makes things better, so it is the step in the > right direction - and it is existing practice anyway. > > Under this rationale, I think that the patch should be committed to mainline. > But please also consider proposed fine-tunings to refine the scheduling > accuracy. > > So, OK for mainline, if there are no objections from other maintainers in > next two days. > > [1] http://gcc.gnu.org/ml/gcc-patches/2012-04/msg00964.html > [2] http://gcc.gnu.org/ml/gcc-patches/2012-04/msg00806.html > > Thanks, > Uros. imul_reordering.patch Description: Binary data
Re: [PATCH] Atom: Scheduler improvements for better imul placement
Ping? On Sun, May 6, 2012 at 11:27 AM, Igor Zamyatin wrote: > Ping. Could x86 maintainer(s) look at these changes? > > Thanks, > Igor > > On Fri, Apr 20, 2012 at 4:04 PM, Igor Zamyatin wrote: >> On Tue, Apr 17, 2012 at 12:27 AM, Igor Zamyatin wrote: >>> On Fri, Apr 13, 2012 at 4:20 PM, Andrey Belevantsev wrote: >>>> On 13.04.2012 14:18, Igor Zamyatin wrote: >>>>> >>>>> On Thu, Apr 12, 2012 at 5:01 PM, Andrey Belevantsev >>>>> wrote: >>>>>> >>>>>> On 12.04.2012 16:38, Richard Guenther wrote: >>>>>>> >>>>>>> >>>>>>> On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatin >>>>>>> wrote: >>>>>>>> >>>>>>>> >>>>>>>> On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov >>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> Can atom execute two IMUL in parallel? Or what exactly is the >>>>>>>>>>> pipeline >>>>>>>>>>> behavior? >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> As I understand from Intel's optimization reference manual, the >>>>>>>>>> behavior is as >>>>>>>>>> follows: if the instruction immediately following IMUL has shorter >>>>>>>>>> latency, >>>>>>>>>> execution is stalled for 4 cycles (which is IMUL's latency); >>>>>>>>>> otherwise, >>>>>>>>>> a >>>>>>>>>> 4-or-more cycles latency instruction can be issued after IMUL without >>>>>>>>>> a >>>>>>>>>> stall. >>>>>>>>>> In other words, IMUL is pipelined with respect to other long-latency >>>>>>>>>> instructions, but not to short-latency instructions. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> It seems to be modeled in the pipeline description though: >>>>>>>>> >>>>>>>>> ;;; imul insn has 5 cycles latency >>>>>>>>> (define_reservation "atom-imul-32" >>>>>>>>> "atom-imul-1, atom-imul-2, atom-imul-3, >>>>>>>>> atom-imul-4, >>>>>>>>> atom-port-0") >>>>>>>>> >>>>>>>>> ;;; imul instruction excludes other non-FP instructions. >>>>>>>>> (exclusion_set "atom-eu-0, atom-eu-1" >>>>>>>>> "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4") >>>>>>>>> >>>>>>>> >>>>>>>> The main idea is quite simple: >>>>>>>> >>>>>>>> If we are going to schedule IMUL instruction (it is on the top of >>>>>>>> ready list) we try to find out producer of other (independent) IMUL >>>>>>>> instruction that is in ready list too. The goal is try to schedule >>>>>>>> such a producer to get another IMUL in ready list and get scheduling >>>>>>>> of 2 successive IMUL instructions. >>>>>>> >>>>>>> >>>>>>> >>>>>>> Why does that not happen without your patch? Does it never happen >>>>>>> without >>>>>>> your patch or does it merely not happen for one EEMBC benchmark (can >>>>>>> you provide a testcase?)? >>>>>> >>>>>> >>>>>> >>>>>> It does not happen because the scheduler by itself does not do such >>>>>> specific >>>>>> reordering. That said, it is easy to imagine the cases where this patch >>>>>> will make things worse rather than better. >>>>>> >>>>>> Igor, why not tr
Re: [PATCH, tree-optimization] Fix for PR 52868
Ping? On Sat, May 12, 2012 at 1:26 AM, Igor Zamyatin wrote: > Ping? > > On Fri, Apr 27, 2012 at 4:42 PM, Igor Zamyatin wrote: >> On Wed, Apr 25, 2012 at 6:41 PM, Richard Guenther >> wrote: >>> On Wed, Apr 25, 2012 at 4:32 PM, Igor Zamyatin wrote: >>>> On Wed, Apr 25, 2012 at 1:14 PM, Richard Guenther >>>> wrote: >>>>> On Wed, Apr 25, 2012 at 10:56 AM, Igor Zamyatin >>>>> wrote: >>>>>> Hi all! >>>>>> >>>>>> I'd like to post for review the patch which makes some costs adjusting >>>>>> in get_computation_cost_at routine in ivopts part. >>>>>> As mentioned in the PR changes also fix the bwaves regression from PR >>>>>> 52272. >>>>>> Also changes introduce no degradations on spec2000/2006 and >>>>>> EEMBC1.1/2.0(this was measured on Atom) on x86 >>>>>> >>>>>> >>>>>> Bootstrapped and regtested on x86. Ok to commit? >>>>> >>>>> I can't make sense of the patch and the comment does not help. >>>>> >>>>> + diff_cost = cost.cost; >>>>> cost.cost /= avg_loop_niter (data->current_loop); >>>>> + add_cost_val = add_cost (TYPE_MODE (ctype), data->speed); >>>>> + /* Do cost correction if address cost is small enough >>>>> + and difference cost is high enough. */ >>>>> + if (address_p && diff_cost > add_cost_val >>>>> + && get_address_cost (symbol_present, var_present, >>>>> + offset, ratio, cstepi, >>>>> + TYPE_MODE (TREE_TYPE (utype)), >>>>> + TYPE_ADDR_SPACE (TREE_TYPE (utype)), >>>>> + speed, stmt_is_after_inc, >>>>> + can_autoinc).cost <= add_cost_val) >>>>> + cost.cost += add_cost_val; >>>>> >>>>> Please explain more thoroughly. It also would seem to be better to add >>>>> an extra case, as later code does >>>> >>>> For example for such code >>>> >>>> for (j=0; j>>> for (i=0; i>>> sum += ptr->a[j][i] * ptr->c[k][i]; >>>> } >>>> we currently have following gimple on x86 target (I provided a piece >>>> of all phase output): >>>> >>>> # ivtmp.13_30 = PHI >>>> D.1748_34 = (void *) ivtmp.13_30; >>>> D.1722_7 = MEM[base: D.1748_34, offset: 0B]; >>>> D.1750_36 = ivtmp.27_28; >>>> D.1751_37 = D.1750_36 + ivtmp.13_30; <-- we got >>>> non-invariant add which is not taken into account currently in cost >>>> model >>>> D.1752_38 = (void *) D.1751_37; >>>> D.1753_39 = (sizetype) k_8(D); >>>> D.1754_40 = D.1753_39 * 800; >>>> D.1723_9 = MEM[base: D.1752_38, index: D.1754_40, offset: >>>> 16000B]; >>>> ... >>>> >>>> With proposed fix we produce: >>>> >>>> # ivtmp.14_30 = PHI >>>> D.1749_34 = (struct S *) ivtmp.25_28; >>>> D.1722_7 = MEM[base: D.1749_34, index: ivtmp.14_30, offset: 0B]; >>>> D.1750_35 = (sizetype) k_8(D); >>>> D.1751_36 = D.1750_35 * 800; >>>> D.1752_37 = ptr_6(D) + D.1751_36; >>>> D.1723_9 = MEM[base: D.1752_37, index: ivtmp.14_30, offset: >>>> 16000B]; >>>> >>>> which is more effective on platforms where address cost is cheaper >>>> than cost of addition operation. That's basically what this adjustment >>>> is for. >>> >>> If we generally miss to account for the add then why is the adjustment >>> conditional on diff_cost > add_cost and address_cost <= add_cost? >>> >>> Is this a new heuristic or a fix for not accurately computing the cost for >>> the >>> stmts we generate? >> >> I'd say this is closer to heuristic since diff_cost > add_cost is an >> attempt to catch the case with non-invariant add produced by pointer >> difference and address_cost <=add_cost leaves the cases with cheap >> address operations >> >>> >>> Richard. >&
Re: [PATCH, tree-optimization] Fix for PR 52868
Ping? On Fri, Apr 27, 2012 at 4:42 PM, Igor Zamyatin wrote: > On Wed, Apr 25, 2012 at 6:41 PM, Richard Guenther > wrote: >> On Wed, Apr 25, 2012 at 4:32 PM, Igor Zamyatin wrote: >>> On Wed, Apr 25, 2012 at 1:14 PM, Richard Guenther >>> wrote: >>>> On Wed, Apr 25, 2012 at 10:56 AM, Igor Zamyatin >>>> wrote: >>>>> Hi all! >>>>> >>>>> I'd like to post for review the patch which makes some costs adjusting >>>>> in get_computation_cost_at routine in ivopts part. >>>>> As mentioned in the PR changes also fix the bwaves regression from PR >>>>> 52272. >>>>> Also changes introduce no degradations on spec2000/2006 and >>>>> EEMBC1.1/2.0(this was measured on Atom) on x86 >>>>> >>>>> >>>>> Bootstrapped and regtested on x86. Ok to commit? >>>> >>>> I can't make sense of the patch and the comment does not help. >>>> >>>> + diff_cost = cost.cost; >>>> cost.cost /= avg_loop_niter (data->current_loop); >>>> + add_cost_val = add_cost (TYPE_MODE (ctype), data->speed); >>>> + /* Do cost correction if address cost is small enough >>>> + and difference cost is high enough. */ >>>> + if (address_p && diff_cost > add_cost_val >>>> + && get_address_cost (symbol_present, var_present, >>>> + offset, ratio, cstepi, >>>> + TYPE_MODE (TREE_TYPE (utype)), >>>> + TYPE_ADDR_SPACE (TREE_TYPE (utype)), >>>> + speed, stmt_is_after_inc, >>>> + can_autoinc).cost <= add_cost_val) >>>> + cost.cost += add_cost_val; >>>> >>>> Please explain more thoroughly. It also would seem to be better to add >>>> an extra case, as later code does >>> >>> For example for such code >>> >>> for (j=0; j>> for (i=0; i>> sum += ptr->a[j][i] * ptr->c[k][i]; >>> } >>> we currently have following gimple on x86 target (I provided a piece >>> of all phase output): >>> >>> # ivtmp.13_30 = PHI >>> D.1748_34 = (void *) ivtmp.13_30; >>> D.1722_7 = MEM[base: D.1748_34, offset: 0B]; >>> D.1750_36 = ivtmp.27_28; >>> D.1751_37 = D.1750_36 + ivtmp.13_30; <-- we got >>> non-invariant add which is not taken into account currently in cost >>> model >>> D.1752_38 = (void *) D.1751_37; >>> D.1753_39 = (sizetype) k_8(D); >>> D.1754_40 = D.1753_39 * 800; >>> D.1723_9 = MEM[base: D.1752_38, index: D.1754_40, offset: 16000B]; >>> ... >>> >>> With proposed fix we produce: >>> >>> # ivtmp.14_30 = PHI >>> D.1749_34 = (struct S *) ivtmp.25_28; >>> D.1722_7 = MEM[base: D.1749_34, index: ivtmp.14_30, offset: 0B]; >>> D.1750_35 = (sizetype) k_8(D); >>> D.1751_36 = D.1750_35 * 800; >>> D.1752_37 = ptr_6(D) + D.1751_36; >>> D.1723_9 = MEM[base: D.1752_37, index: ivtmp.14_30, offset: >>> 16000B]; >>> >>> which is more effective on platforms where address cost is cheaper >>> than cost of addition operation. That's basically what this adjustment >>> is for. >> >> If we generally miss to account for the add then why is the adjustment >> conditional on diff_cost > add_cost and address_cost <= add_cost? >> >> Is this a new heuristic or a fix for not accurately computing the cost for >> the >> stmts we generate? > > I'd say this is closer to heuristic since diff_cost > add_cost is an > attempt to catch the case with non-invariant add produced by pointer > difference and address_cost <=add_cost leaves the cases with cheap > address operations > >> >> Richard. >> >>> So comment in the source code now looks as follows >>> >>> /* Do cost correction when address difference produces >>> additional non-invariant add operation which is less >>> profitable if address cost is cheaper than cost of add. */ >>> >>>> >>>> /* Now the computation is in shape symbol + var1 + const + ratio * var2. >>>> (symbol/var1/con
Re: [PATCH, Android] Stack protector enabling for Android target
Hi! Please look at the modified patch in the attachment. ChangeLog remains the same. Tested in android environment(x86_64-*-linux-android), also bootstrapped on x86_64-unknown-linux-gnu. I also started regtesting on linux. Is it ok after successfull regtesting? Thanks, Igor On Mon, May 7, 2012 at 10:08 PM, H.J. Lu wrote: > On Mon, May 7, 2012 at 11:04 AM, Maxim Kuvyrkov > wrote: >> On 6/05/2012, at 7:02 PM, Igor Zamyatin wrote: >> >>> Hi! >>> >>> The patch enables stack protector for Android. >>> Android targets don't contain necessary information in features.h so >>> we explicitly enable stack protector for Android. >>> >>> Bootstrapped and regtested on x86_64. Ok to commit? >>> >>> Thanks, >>> Igor >>> >>> 2012-05-06 Igor Zamyatin >>> >>> * configure.ac: Stack protector enabling for Android targets. >>> * configure: Regenerate. >>> >>> >>> diff --git a/gcc/configure.ac b/gcc/configure.ac >>> index 86b4bea..c1012d6 100644 >>> --- a/gcc/configure.ac >>> +++ b/gcc/configure.ac >>> @@ -4545,6 +4545,8 @@ AC_CACHE_CHECK(__stack_chk_fail in target C library, >>> gcc_cv_libc_provides_ssp, >>> [gcc_cv_libc_provides_ssp=no >>> case "$target" in >>> + *-android*) >>> + gcc_cv_libc_provides_ssp=yes;; >>> *-*-linux* | *-*-kfreebsd*-gnu | *-*-knetbsd*-gnu) >>> [# glibc 2.4 and later provides __stack_chk_fail and >>> # either __stack_chk_guard, or TLS access to stack guard canary. >> >> >> What you really want is to enable stack protector for Bionic libc, which is >> often synonymous with -android* target, but not always. Let's enable ssp >> based on whether __BIONIC__ is defined in the libc headers (i.e., add a grep >> test for __BIONIC__ in ) >> >> Also please add a comment along the lines of "all versions of Bionic support >> stack protector". >> >> Which exact target did you test this on? X86_64-*-* is a pretty broad >> definition. >> > > We are working on x86_64-*-linux-android target, > which uses x32. > > -- > H.J. stack_protector.patch Description: Binary data
Re: [PATCH] Atom: Scheduler improvements for better imul placement
Ping. Could x86 maintainer(s) look at these changes? Thanks, Igor On Fri, Apr 20, 2012 at 4:04 PM, Igor Zamyatin wrote: > On Tue, Apr 17, 2012 at 12:27 AM, Igor Zamyatin wrote: >> On Fri, Apr 13, 2012 at 4:20 PM, Andrey Belevantsev wrote: >>> On 13.04.2012 14:18, Igor Zamyatin wrote: >>>> >>>> On Thu, Apr 12, 2012 at 5:01 PM, Andrey Belevantsev >>>> wrote: >>>>> >>>>> On 12.04.2012 16:38, Richard Guenther wrote: >>>>>> >>>>>> >>>>>> On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatin >>>>>> wrote: >>>>>>> >>>>>>> >>>>>>> On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther >>>>>>> wrote: >>>>>>>> >>>>>>>> >>>>>>>> On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>> Can atom execute two IMUL in parallel? Or what exactly is the >>>>>>>>>> pipeline >>>>>>>>>> behavior? >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> As I understand from Intel's optimization reference manual, the >>>>>>>>> behavior is as >>>>>>>>> follows: if the instruction immediately following IMUL has shorter >>>>>>>>> latency, >>>>>>>>> execution is stalled for 4 cycles (which is IMUL's latency); >>>>>>>>> otherwise, >>>>>>>>> a >>>>>>>>> 4-or-more cycles latency instruction can be issued after IMUL without >>>>>>>>> a >>>>>>>>> stall. >>>>>>>>> In other words, IMUL is pipelined with respect to other long-latency >>>>>>>>> instructions, but not to short-latency instructions. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> It seems to be modeled in the pipeline description though: >>>>>>>> >>>>>>>> ;;; imul insn has 5 cycles latency >>>>>>>> (define_reservation "atom-imul-32" >>>>>>>> "atom-imul-1, atom-imul-2, atom-imul-3, >>>>>>>> atom-imul-4, >>>>>>>> atom-port-0") >>>>>>>> >>>>>>>> ;;; imul instruction excludes other non-FP instructions. >>>>>>>> (exclusion_set "atom-eu-0, atom-eu-1" >>>>>>>> "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4") >>>>>>>> >>>>>>> >>>>>>> The main idea is quite simple: >>>>>>> >>>>>>> If we are going to schedule IMUL instruction (it is on the top of >>>>>>> ready list) we try to find out producer of other (independent) IMUL >>>>>>> instruction that is in ready list too. The goal is try to schedule >>>>>>> such a producer to get another IMUL in ready list and get scheduling >>>>>>> of 2 successive IMUL instructions. >>>>>> >>>>>> >>>>>> >>>>>> Why does that not happen without your patch? Does it never happen >>>>>> without >>>>>> your patch or does it merely not happen for one EEMBC benchmark (can >>>>>> you provide a testcase?)? >>>>> >>>>> >>>>> >>>>> It does not happen because the scheduler by itself does not do such >>>>> specific >>>>> reordering. That said, it is easy to imagine the cases where this patch >>>>> will make things worse rather than better. >>>>> >>>>> Igor, why not try different subtler mechanisms like adjust_priority, >>>>> which >>>>> is get called when an insn is added to the ready list? E.g. increase the >>>>> producer's priority. >>>>> >>>>> The patch as is misses checks for NONDEBUG_INSN_P. Also, why bail out >>>>> when >>>>> you have more than one imul in t
[PATCH, Android] Stack protector enabling for Android target
Hi! The patch enables stack protector for Android. Android targets don't contain necessary information in features.h so we explicitly enable stack protector for Android. Bootstrapped and regtested on x86_64. Ok to commit? Thanks, Igor 2012-05-06 Igor Zamyatin * configure.ac: Stack protector enabling for Android targets. * configure: Regenerate. diff --git a/gcc/configure.ac b/gcc/configure.ac index 86b4bea..c1012d6 100644 --- a/gcc/configure.ac +++ b/gcc/configure.ac @@ -4545,6 +4545,8 @@ AC_CACHE_CHECK(__stack_chk_fail in target C library, gcc_cv_libc_provides_ssp, [gcc_cv_libc_provides_ssp=no case "$target" in + *-android*) + gcc_cv_libc_provides_ssp=yes;; *-*-linux* | *-*-kfreebsd*-gnu | *-*-knetbsd*-gnu) [# glibc 2.4 and later provides __stack_chk_fail and # either __stack_chk_guard, or TLS access to stack guard canary.
Re: [PATCH, tree-optimization] Fix for PR 52868
On Wed, Apr 25, 2012 at 6:41 PM, Richard Guenther wrote: > On Wed, Apr 25, 2012 at 4:32 PM, Igor Zamyatin wrote: >> On Wed, Apr 25, 2012 at 1:14 PM, Richard Guenther >> wrote: >>> On Wed, Apr 25, 2012 at 10:56 AM, Igor Zamyatin wrote: >>>> Hi all! >>>> >>>> I'd like to post for review the patch which makes some costs adjusting >>>> in get_computation_cost_at routine in ivopts part. >>>> As mentioned in the PR changes also fix the bwaves regression from PR >>>> 52272. >>>> Also changes introduce no degradations on spec2000/2006 and >>>> EEMBC1.1/2.0(this was measured on Atom) on x86 >>>> >>>> >>>> Bootstrapped and regtested on x86. Ok to commit? >>> >>> I can't make sense of the patch and the comment does not help. >>> >>> + diff_cost = cost.cost; >>> cost.cost /= avg_loop_niter (data->current_loop); >>> + add_cost_val = add_cost (TYPE_MODE (ctype), data->speed); >>> + /* Do cost correction if address cost is small enough >>> + and difference cost is high enough. */ >>> + if (address_p && diff_cost > add_cost_val >>> + && get_address_cost (symbol_present, var_present, >>> + offset, ratio, cstepi, >>> + TYPE_MODE (TREE_TYPE (utype)), >>> + TYPE_ADDR_SPACE (TREE_TYPE (utype)), >>> + speed, stmt_is_after_inc, >>> + can_autoinc).cost <= add_cost_val) >>> + cost.cost += add_cost_val; >>> >>> Please explain more thoroughly. It also would seem to be better to add >>> an extra case, as later code does >> >> For example for such code >> >> for (j=0; j> for (i=0; i> sum += ptr->a[j][i] * ptr->c[k][i]; >> } >> we currently have following gimple on x86 target (I provided a piece >> of all phase output): >> >> # ivtmp.13_30 = PHI >> D.1748_34 = (void *) ivtmp.13_30; >> D.1722_7 = MEM[base: D.1748_34, offset: 0B]; >> D.1750_36 = ivtmp.27_28; >> D.1751_37 = D.1750_36 + ivtmp.13_30; <-- we got >> non-invariant add which is not taken into account currently in cost >> model >> D.1752_38 = (void *) D.1751_37; >> D.1753_39 = (sizetype) k_8(D); >> D.1754_40 = D.1753_39 * 800; >> D.1723_9 = MEM[base: D.1752_38, index: D.1754_40, offset: 16000B]; >> ... >> >> With proposed fix we produce: >> >> # ivtmp.14_30 = PHI >> D.1749_34 = (struct S *) ivtmp.25_28; >> D.1722_7 = MEM[base: D.1749_34, index: ivtmp.14_30, offset: 0B]; >> D.1750_35 = (sizetype) k_8(D); >> D.1751_36 = D.1750_35 * 800; >> D.1752_37 = ptr_6(D) + D.1751_36; >> D.1723_9 = MEM[base: D.1752_37, index: ivtmp.14_30, offset: >> 16000B]; >> >> which is more effective on platforms where address cost is cheaper >> than cost of addition operation. That's basically what this adjustment >> is for. > > If we generally miss to account for the add then why is the adjustment > conditional on diff_cost > add_cost and address_cost <= add_cost? > > Is this a new heuristic or a fix for not accurately computing the cost for the > stmts we generate? I'd say this is closer to heuristic since diff_cost > add_cost is an attempt to catch the case with non-invariant add produced by pointer difference and address_cost <=add_cost leaves the cases with cheap address operations > > Richard. > >> So comment in the source code now looks as follows >> >> /* Do cost correction when address difference produces >> additional non-invariant add operation which is less >> profitable if address cost is cheaper than cost of add. */ >> >>> >>> /* Now the computation is in shape symbol + var1 + const + ratio * var2. >>> (symbol/var1/const parts may be omitted). If we are looking for an >>> address, find the cost of addressing this. */ >>> if (address_p) >>> return add_costs (cost, >>> get_address_cost (symbol_present, var_present, >>> offset, ratio, cstepi, >>> TYPE_MODE (TREE_TYPE (utype)), >>> TYPE_AD
Re: [PATCH] Take branch misprediction effects into account when RTL loop unrolling (issue6099055)
Are you sure that tree-level unrollers are turned on at O2? My impression was that they work only at O3 or with f[unroll,peel]-loops flags. On Tue, Apr 24, 2012 at 6:13 PM, Andi Kleen wrote: > tejohn...@google.com (Teresa Johnson) writes: > >> This patch adds heuristics to limit unrolling in loops with branches >> that may increase branch mispredictions. It affects loops that are >> not frequently iterated, and that are nested within a hot region of code >> that already contains many branch instructions. >> >> Performance tested with both internal benchmarks and with SPEC >> 2000/2006 on a variety of Intel systems (Core2, Corei7, SandyBridge) and a >> couple of different AMD Opteron systems. >> This improves performance of an internal search indexing benchmark by >> close to 2% on all the tested Intel platforms. It also consistently >> improves 445.gobmk (with FDO feedback where unrolling kicks in) by >> close to 1% on AMD Opteron. Other performance effects are neutral. >> >> Bootstrapped and tested on x86_64-unknown-linux-gnu. Is this ok for trunk? > > One problem with any unrolling heuristics is currently that gcc has > both the tree level and the rtl level unroller. The tree one is even > on at -O3. So if you tweak anything for one you have to affect both, > otherwise the other may still do the wrong thing(tm). Tree level unrollers (cunrolli and cunroll) do complete unroll. At O2, both of them are turned on, but gcc does not allow any code growth -- which makes them pretty useless at O2 (very few loops qualify). The default max complete peel iteration is also too low compared with both icc and llvm. This needs to be tuned. David > > For some other tweaks I looked into a shared cost model some time ago. > May be still needed. > > -Andi > > -- > a...@linux.intel.com -- Speaking for myself only
Re: [PATCH, tree-optimization] Fix for PR 52868
On Wed, Apr 25, 2012 at 1:14 PM, Richard Guenther wrote: > On Wed, Apr 25, 2012 at 10:56 AM, Igor Zamyatin wrote: >> Hi all! >> >> I'd like to post for review the patch which makes some costs adjusting >> in get_computation_cost_at routine in ivopts part. >> As mentioned in the PR changes also fix the bwaves regression from PR 52272. >> Also changes introduce no degradations on spec2000/2006 and >> EEMBC1.1/2.0(this was measured on Atom) on x86 >> >> >> Bootstrapped and regtested on x86. Ok to commit? > > I can't make sense of the patch and the comment does not help. > > + diff_cost = cost.cost; > cost.cost /= avg_loop_niter (data->current_loop); > + add_cost_val = add_cost (TYPE_MODE (ctype), data->speed); > + /* Do cost correction if address cost is small enough > + and difference cost is high enough. */ > + if (address_p && diff_cost > add_cost_val > + && get_address_cost (symbol_present, var_present, > + offset, ratio, cstepi, > + TYPE_MODE (TREE_TYPE (utype)), > + TYPE_ADDR_SPACE (TREE_TYPE (utype)), > + speed, stmt_is_after_inc, > + can_autoinc).cost <= add_cost_val) > + cost.cost += add_cost_val; > > Please explain more thoroughly. It also would seem to be better to add > an extra case, as later code does For example for such code for (j=0; ja[j][i] * ptr->c[k][i]; } we currently have following gimple on x86 target (I provided a piece of all phase output): # ivtmp.13_30 = PHI D.1748_34 = (void *) ivtmp.13_30; D.1722_7 = MEM[base: D.1748_34, offset: 0B]; D.1750_36 = ivtmp.27_28; D.1751_37 = D.1750_36 + ivtmp.13_30; <-- we got non-invariant add which is not taken into account currently in cost model D.1752_38 = (void *) D.1751_37; D.1753_39 = (sizetype) k_8(D); D.1754_40 = D.1753_39 * 800; D.1723_9 = MEM[base: D.1752_38, index: D.1754_40, offset: 16000B]; ... With proposed fix we produce: # ivtmp.14_30 = PHI D.1749_34 = (struct S *) ivtmp.25_28; D.1722_7 = MEM[base: D.1749_34, index: ivtmp.14_30, offset: 0B]; D.1750_35 = (sizetype) k_8(D); D.1751_36 = D.1750_35 * 800; D.1752_37 = ptr_6(D) + D.1751_36; D.1723_9 = MEM[base: D.1752_37, index: ivtmp.14_30, offset: 16000B]; which is more effective on platforms where address cost is cheaper than cost of addition operation. That's basically what this adjustment is for. So comment in the source code now looks as follows /* Do cost correction when address difference produces additional non-invariant add operation which is less profitable if address cost is cheaper than cost of add. */ > > /* Now the computation is in shape symbol + var1 + const + ratio * var2. > (symbol/var1/const parts may be omitted). If we are looking for an > address, find the cost of addressing this. */ > if (address_p) > return add_costs (cost, > get_address_cost (symbol_present, var_present, > offset, ratio, cstepi, > TYPE_MODE (TREE_TYPE (utype)), > TYPE_ADDR_SPACE (TREE_TYPE (utype)), > speed, stmt_is_after_inc, > can_autoinc)); > > thus refactoring the code a bit would make it possible to CSE the > get_address_cost > call and eventually make it clearer what the code does. 'offset' could be changed beetween two calls of get_address_cost so such refactoring looks useless. New patch (only the comment was changed) attached. Changelog was changed as well. > > Richard. > Changelog: 2012-04-26 Yuri Rumyantsev * tree-ssa-loop-ivopts.c (get_computation_cost_at): Adjust cost model when address difference produces additional non-invariant add operation which is less profitable if address cost is cheaper than cost of add. Thanks, Igor ivopts1.patch Description: Binary data
[PATCH, tree-optimization] Fix for PR 52868
Hi all! I'd like to post for review the patch which makes some costs adjusting in get_computation_cost_at routine in ivopts part. As mentioned in the PR changes also fix the bwaves regression from PR 52272. Also changes introduce no degradations on spec2000/2006 and EEMBC1.1/2.0(this was measured on Atom) on x86 Bootstrapped and regtested on x86. Ok to commit? Changelog: 2012-04-26 Yuri Rumyantsev * tree-ssa-loop-ivopts.c (get_computation_cost_at): Add cost of extra addition if cost of address difference is high enough. Thanks, Igor ivopts.patch Description: Binary data
Re: FW: [v3] libstdc++/52689 testcase
This testcase is reported as failed on x86 > > Noticed that this testcase wasn't put in as part of the patch. Fixed as > follows. > > tested x86/linux > > -benjamin >
Re: [PATCH] Atom: Scheduler improvements for better imul placement
On Tue, Apr 17, 2012 at 12:27 AM, Igor Zamyatin wrote: > On Fri, Apr 13, 2012 at 4:20 PM, Andrey Belevantsev wrote: >> On 13.04.2012 14:18, Igor Zamyatin wrote: >>> >>> On Thu, Apr 12, 2012 at 5:01 PM, Andrey Belevantsev >>> wrote: >>>> >>>> On 12.04.2012 16:38, Richard Guenther wrote: >>>>> >>>>> >>>>> On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatin >>>>> wrote: >>>>>> >>>>>> >>>>>> On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther >>>>>> wrote: >>>>>>> >>>>>>> >>>>>>> On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov >>>>>>> wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>>> Can atom execute two IMUL in parallel? Or what exactly is the >>>>>>>>> pipeline >>>>>>>>> behavior? >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> As I understand from Intel's optimization reference manual, the >>>>>>>> behavior is as >>>>>>>> follows: if the instruction immediately following IMUL has shorter >>>>>>>> latency, >>>>>>>> execution is stalled for 4 cycles (which is IMUL's latency); >>>>>>>> otherwise, >>>>>>>> a >>>>>>>> 4-or-more cycles latency instruction can be issued after IMUL without >>>>>>>> a >>>>>>>> stall. >>>>>>>> In other words, IMUL is pipelined with respect to other long-latency >>>>>>>> instructions, but not to short-latency instructions. >>>>>>> >>>>>>> >>>>>>> >>>>>>> It seems to be modeled in the pipeline description though: >>>>>>> >>>>>>> ;;; imul insn has 5 cycles latency >>>>>>> (define_reservation "atom-imul-32" >>>>>>> "atom-imul-1, atom-imul-2, atom-imul-3, >>>>>>> atom-imul-4, >>>>>>> atom-port-0") >>>>>>> >>>>>>> ;;; imul instruction excludes other non-FP instructions. >>>>>>> (exclusion_set "atom-eu-0, atom-eu-1" >>>>>>> "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4") >>>>>>> >>>>>> >>>>>> The main idea is quite simple: >>>>>> >>>>>> If we are going to schedule IMUL instruction (it is on the top of >>>>>> ready list) we try to find out producer of other (independent) IMUL >>>>>> instruction that is in ready list too. The goal is try to schedule >>>>>> such a producer to get another IMUL in ready list and get scheduling >>>>>> of 2 successive IMUL instructions. >>>>> >>>>> >>>>> >>>>> Why does that not happen without your patch? Does it never happen >>>>> without >>>>> your patch or does it merely not happen for one EEMBC benchmark (can >>>>> you provide a testcase?)? >>>> >>>> >>>> >>>> It does not happen because the scheduler by itself does not do such >>>> specific >>>> reordering. That said, it is easy to imagine the cases where this patch >>>> will make things worse rather than better. >>>> >>>> Igor, why not try different subtler mechanisms like adjust_priority, >>>> which >>>> is get called when an insn is added to the ready list? E.g. increase the >>>> producer's priority. >>>> >>>> The patch as is misses checks for NONDEBUG_INSN_P. Also, why bail out >>>> when >>>> you have more than one imul in the ready list? Don't you want to bump >>>> the >>>> priority of the other imul found? >>> >>> >>> Could you provide some examples when this patch would harm the >>> performance? >> >> >> I thought of the cases when the other ready insns can fill up the hole and >> that would be more beneficial because e.g. they would be on more critical >> paths than the producer of your
Re: [PATCH] Atom: Enabling unroll at O2 optimization level
On Thu, Apr 12, 2012 at 3:16 PM, Richard Guenther wrote: > On Thu, Apr 12, 2012 at 1:05 PM, Igor Zamyatin wrote: >> On Wed, Apr 11, 2012 at 12:39 PM, Richard Guenther >> wrote: >>> On Tue, Apr 10, 2012 at 8:43 PM, Igor Zamyatin wrote: >>>> Hi All! >>>> >>>> Here is a patch that enables unroll at O2 for Atom. >>>> >>>> This gives good performance boost on EEMBC 2.0 (~+8% in Geomean for 32 >>>> bits) with quite moderate code size increase (~5% for EEMBC2.0, 32 >>>> bits). >>> >>> 5% is not moderate. Your patch does enable unrolling at -O2 but not -O3, >>> why? Why do you disable register renaming? check_imull requires a function >>> comment. >> >> Sure, enabling unroll for O3 could be the next step. >> We can't avoid code size increase with unroll - what number do you >> think will be appropriate? >> Register renaming was the reason of several degradations during tuning >> process >> Comment for check_imull was added >> >>> >>> This completely looks like a hack for EEMBC2.0, so it's definitely not ok. >> >> Why? EEMBC was measured and result provided here just because this >> benchmark considers to be very relevant for Atom > > I'd say that SPEC INT (2000 / 2006) is more relevant for Atom (SPEC FP > would be irrelevant OTOH). Similar code size for, say, Mozilla Firefox > or GCC itself would be important. > >>> -O2 is not supposed to give best benchmark results. >> >> O2 is wide-used so performance improvement could be important for users. > > But not at a 5% size cost. Please also always check the compile-time effect > which is important for -O2 as well. What would be an acceptable number of size cost/compile-time increase for O2 and O3 on EEMBC, SPEC INT 2000 and Mozilla? Is it possible in common to put Atom-specific unroll heuristics under some option which could be mentioned in GCC docs? > > Richard. > >>> >>> Thanks, >>> Richard. >>> >>>> >>>> Tested for i386 and x86-64, ok for trunk? >> >> Updated patch attached >> >>>> >>>> Thanks, >>>> Igor >>>> >>>> ChangeLog: >>>> >>>> 2012-04-10 Yakovlev Vladimir >>>> >>>> * gcc/config/i386/i386.c (check_imul): New routine. >>>> (ix86_loop_unroll_adjust): New target hook. >>>> (ix86_option_override_internal): Enable unrolling on Atom at -O2. >>>> (TARGET_LOOP_UNROLL_ADJUST): New define. Thanks, Igor
Re: [PATCH] Atom: Scheduler improvements for better imul placement
On Fri, Apr 13, 2012 at 4:20 PM, Andrey Belevantsev wrote: > On 13.04.2012 14:18, Igor Zamyatin wrote: >> >> On Thu, Apr 12, 2012 at 5:01 PM, Andrey Belevantsev >> wrote: >>> >>> On 12.04.2012 16:38, Richard Guenther wrote: >>>> >>>> >>>> On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatin >>>> wrote: >>>>> >>>>> >>>>> On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther >>>>> wrote: >>>>>> >>>>>> >>>>>> On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov >>>>>> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>>> Can atom execute two IMUL in parallel? Or what exactly is the >>>>>>>> pipeline >>>>>>>> behavior? >>>>>>> >>>>>>> >>>>>>> >>>>>>> As I understand from Intel's optimization reference manual, the >>>>>>> behavior is as >>>>>>> follows: if the instruction immediately following IMUL has shorter >>>>>>> latency, >>>>>>> execution is stalled for 4 cycles (which is IMUL's latency); >>>>>>> otherwise, >>>>>>> a >>>>>>> 4-or-more cycles latency instruction can be issued after IMUL without >>>>>>> a >>>>>>> stall. >>>>>>> In other words, IMUL is pipelined with respect to other long-latency >>>>>>> instructions, but not to short-latency instructions. >>>>>> >>>>>> >>>>>> >>>>>> It seems to be modeled in the pipeline description though: >>>>>> >>>>>> ;;; imul insn has 5 cycles latency >>>>>> (define_reservation "atom-imul-32" >>>>>> "atom-imul-1, atom-imul-2, atom-imul-3, >>>>>> atom-imul-4, >>>>>> atom-port-0") >>>>>> >>>>>> ;;; imul instruction excludes other non-FP instructions. >>>>>> (exclusion_set "atom-eu-0, atom-eu-1" >>>>>> "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4") >>>>>> >>>>> >>>>> The main idea is quite simple: >>>>> >>>>> If we are going to schedule IMUL instruction (it is on the top of >>>>> ready list) we try to find out producer of other (independent) IMUL >>>>> instruction that is in ready list too. The goal is try to schedule >>>>> such a producer to get another IMUL in ready list and get scheduling >>>>> of 2 successive IMUL instructions. >>>> >>>> >>>> >>>> Why does that not happen without your patch? Does it never happen >>>> without >>>> your patch or does it merely not happen for one EEMBC benchmark (can >>>> you provide a testcase?)? >>> >>> >>> >>> It does not happen because the scheduler by itself does not do such >>> specific >>> reordering. That said, it is easy to imagine the cases where this patch >>> will make things worse rather than better. >>> >>> Igor, why not try different subtler mechanisms like adjust_priority, >>> which >>> is get called when an insn is added to the ready list? E.g. increase the >>> producer's priority. >>> >>> The patch as is misses checks for NONDEBUG_INSN_P. Also, why bail out >>> when >>> you have more than one imul in the ready list? Don't you want to bump >>> the >>> priority of the other imul found? >> >> >> Could you provide some examples when this patch would harm the >> performance? > > > I thought of the cases when the other ready insns can fill up the hole and > that would be more beneficial because e.g. they would be on more critical > paths than the producer of your second imul. I don't know enough of Atom to > give an example -- maybe some long divisions? > > >> >> Sched_reorder was chosen since it is used in other ports and looks >> most suitable for such case, e.g. it provides access to the whole >> ready list. >> BTW, just increasing producer's priority seems to be more risky in >> performance sense - we can incorrectly start delaying some &g
Re: [PATCH] Atom: Scheduler improvements for better imul placement
On Fri, Apr 13, 2012 at 2:18 PM, Igor Zamyatin wrote: > On Thu, Apr 12, 2012 at 5:01 PM, Andrey Belevantsev wrote: >> On 12.04.2012 16:38, Richard Guenther wrote: >>> >>> On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatin >>> wrote: >>>> >>>> On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther >>>> wrote: >>>>> >>>>> On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov >>>>> wrote: >>>>>> >>>>>> >>>>>>> Can atom execute two IMUL in parallel? Or what exactly is the >>>>>>> pipeline >>>>>>> behavior? >>>>>> >>>>>> >>>>>> As I understand from Intel's optimization reference manual, the >>>>>> behavior is as >>>>>> follows: if the instruction immediately following IMUL has shorter >>>>>> latency, >>>>>> execution is stalled for 4 cycles (which is IMUL's latency); otherwise, >>>>>> a >>>>>> 4-or-more cycles latency instruction can be issued after IMUL without a >>>>>> stall. >>>>>> In other words, IMUL is pipelined with respect to other long-latency >>>>>> instructions, but not to short-latency instructions. >>>>> >>>>> >>>>> It seems to be modeled in the pipeline description though: >>>>> >>>>> ;;; imul insn has 5 cycles latency >>>>> (define_reservation "atom-imul-32" >>>>> "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4, >>>>> atom-port-0") >>>>> >>>>> ;;; imul instruction excludes other non-FP instructions. >>>>> (exclusion_set "atom-eu-0, atom-eu-1" >>>>> "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4") >>>>> >>>> >>>> The main idea is quite simple: >>>> >>>> If we are going to schedule IMUL instruction (it is on the top of >>>> ready list) we try to find out producer of other (independent) IMUL >>>> instruction that is in ready list too. The goal is try to schedule >>>> such a producer to get another IMUL in ready list and get scheduling >>>> of 2 successive IMUL instructions. >>> >>> >>> Why does that not happen without your patch? Does it never happen without >>> your patch or does it merely not happen for one EEMBC benchmark (can >>> you provide a testcase?)? >> >> >> It does not happen because the scheduler by itself does not do such specific >> reordering. That said, it is easy to imagine the cases where this patch >> will make things worse rather than better. >> >> Igor, why not try different subtler mechanisms like adjust_priority, which >> is get called when an insn is added to the ready list? E.g. increase the >> producer's priority. >> >> The patch as is misses checks for NONDEBUG_INSN_P. Also, why bail out when >> you have more than one imul in the ready list? Don't you want to bump the >> priority of the other imul found? > > Could you provide some examples when this patch would harm the performance? > > Sched_reorder was chosen since it is used in other ports and looks > most suitable for such case, e.g. it provides access to the whole > ready list. > BTW, just increasing producer's priority seems to be more risky in > performance sense - we can incorrectly start delaying some > instructions. > > Thought ready list doesn't contain DEBUG_INSN... Is it so? If it > contains them - this could be added easily > > The case with more than one imul in the ready list wasn't considered > just because the goal was to catch the particular case when there is a > risk to get the following picture: "imul-producer-imul" which is less > effective than "producer-imul-imul" for Atom Actually, this constraint of one imul could be omitted. I'm going to try this change > >> >> Andrey >> >> >>> >>>> And MD allows us only prefer scheduling of successive IMUL instructions, >>>> i.e. >>>> If IMUL was just scheduled and ready list contains another IMUL >>>> instruction then it will be chosen as the best candidate for >>>> scheduling. >>>> >>>> >>>>> at least from my very limited guessing of what the above does. So, did >>>>> you >>>>> analyze why the scheduler does not properly schedule things for you? >>>>> >>>>> Richard. >>>>> >>>>>> >>>>>> From reading the patch, I could not understand the link between >>>>>> pipeline >>>>>> behavior and what the patch appears to do. >>>>>> >>>>>> Hope that helps. >>>>>> >>>>>> Alexander >> >> > > Thanks, > Igor
Re: [PATCH] Atom: Scheduler improvements for better imul placement
On Thu, Apr 12, 2012 at 5:01 PM, Andrey Belevantsev wrote: > On 12.04.2012 16:38, Richard Guenther wrote: >> >> On Thu, Apr 12, 2012 at 2:36 PM, Igor Zamyatin >> wrote: >>> >>> On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther >>> wrote: >>>> >>>> On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov >>>> wrote: >>>>> >>>>> >>>>>> Can atom execute two IMUL in parallel? Or what exactly is the >>>>>> pipeline >>>>>> behavior? >>>>> >>>>> >>>>> As I understand from Intel's optimization reference manual, the >>>>> behavior is as >>>>> follows: if the instruction immediately following IMUL has shorter >>>>> latency, >>>>> execution is stalled for 4 cycles (which is IMUL's latency); otherwise, >>>>> a >>>>> 4-or-more cycles latency instruction can be issued after IMUL without a >>>>> stall. >>>>> In other words, IMUL is pipelined with respect to other long-latency >>>>> instructions, but not to short-latency instructions. >>>> >>>> >>>> It seems to be modeled in the pipeline description though: >>>> >>>> ;;; imul insn has 5 cycles latency >>>> (define_reservation "atom-imul-32" >>>> "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4, >>>> atom-port-0") >>>> >>>> ;;; imul instruction excludes other non-FP instructions. >>>> (exclusion_set "atom-eu-0, atom-eu-1" >>>> "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4") >>>> >>> >>> The main idea is quite simple: >>> >>> If we are going to schedule IMUL instruction (it is on the top of >>> ready list) we try to find out producer of other (independent) IMUL >>> instruction that is in ready list too. The goal is try to schedule >>> such a producer to get another IMUL in ready list and get scheduling >>> of 2 successive IMUL instructions. >> >> >> Why does that not happen without your patch? Does it never happen without >> your patch or does it merely not happen for one EEMBC benchmark (can >> you provide a testcase?)? > > > It does not happen because the scheduler by itself does not do such specific > reordering. That said, it is easy to imagine the cases where this patch > will make things worse rather than better. > > Igor, why not try different subtler mechanisms like adjust_priority, which > is get called when an insn is added to the ready list? E.g. increase the > producer's priority. > > The patch as is misses checks for NONDEBUG_INSN_P. Also, why bail out when > you have more than one imul in the ready list? Don't you want to bump the > priority of the other imul found? Could you provide some examples when this patch would harm the performance? Sched_reorder was chosen since it is used in other ports and looks most suitable for such case, e.g. it provides access to the whole ready list. BTW, just increasing producer's priority seems to be more risky in performance sense - we can incorrectly start delaying some instructions. Thought ready list doesn't contain DEBUG_INSN... Is it so? If it contains them - this could be added easily The case with more than one imul in the ready list wasn't considered just because the goal was to catch the particular case when there is a risk to get the following picture: "imul-producer-imul" which is less effective than "producer-imul-imul" for Atom > > Andrey > > >> >>> And MD allows us only prefer scheduling of successive IMUL instructions, >>> i.e. >>> If IMUL was just scheduled and ready list contains another IMUL >>> instruction then it will be chosen as the best candidate for >>> scheduling. >>> >>> >>>> at least from my very limited guessing of what the above does. So, did >>>> you >>>> analyze why the scheduler does not properly schedule things for you? >>>> >>>> Richard. >>>> >>>>> >>>>> From reading the patch, I could not understand the link between >>>>> pipeline >>>>> behavior and what the patch appears to do. >>>>> >>>>> Hope that helps. >>>>> >>>>> Alexander > > Thanks, Igor
Re: [PATCH] Atom: Scheduler improvements for better imul placement
On Thu, Apr 12, 2012 at 4:24 PM, Richard Guenther wrote: > On Thu, Apr 12, 2012 at 2:00 PM, Alexander Monakov wrote: >> >>> Can atom execute two IMUL in parallel? Or what exactly is the pipeline >>> behavior? >> >> As I understand from Intel's optimization reference manual, the behavior is >> as >> follows: if the instruction immediately following IMUL has shorter latency, >> execution is stalled for 4 cycles (which is IMUL's latency); otherwise, a >> 4-or-more cycles latency instruction can be issued after IMUL without a >> stall. >> In other words, IMUL is pipelined with respect to other long-latency >> instructions, but not to short-latency instructions. > > It seems to be modeled in the pipeline description though: > > ;;; imul insn has 5 cycles latency > (define_reservation "atom-imul-32" > "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4, > atom-port-0") > > ;;; imul instruction excludes other non-FP instructions. > (exclusion_set "atom-eu-0, atom-eu-1" > "atom-imul-1, atom-imul-2, atom-imul-3, atom-imul-4") > The main idea is quite simple: If we are going to schedule IMUL instruction (it is on the top of ready list) we try to find out producer of other (independent) IMUL instruction that is in ready list too. The goal is try to schedule such a producer to get another IMUL in ready list and get scheduling of 2 successive IMUL instructions. And MD allows us only prefer scheduling of successive IMUL instructions, i.e. If IMUL was just scheduled and ready list contains another IMUL instruction then it will be chosen as the best candidate for scheduling. > at least from my very limited guessing of what the above does. So, did you > analyze why the scheduler does not properly schedule things for you? > > Richard. > >> >> From reading the patch, I could not understand the link between pipeline >> behavior and what the patch appears to do. >> >> Hope that helps. >> >> Alexander
Re: [PATCH] Atom: Enabling unroll at O2 optimization level
On Wed, Apr 11, 2012 at 12:39 PM, Richard Guenther wrote: > On Tue, Apr 10, 2012 at 8:43 PM, Igor Zamyatin wrote: >> Hi All! >> >> Here is a patch that enables unroll at O2 for Atom. >> >> This gives good performance boost on EEMBC 2.0 (~+8% in Geomean for 32 >> bits) with quite moderate code size increase (~5% for EEMBC2.0, 32 >> bits). > > 5% is not moderate. Your patch does enable unrolling at -O2 but not -O3, > why? Why do you disable register renaming? check_imull requires a function > comment. Sure, enabling unroll for O3 could be the next step. We can't avoid code size increase with unroll - what number do you think will be appropriate? Register renaming was the reason of several degradations during tuning process Comment for check_imull was added > > This completely looks like a hack for EEMBC2.0, so it's definitely not ok. Why? EEMBC was measured and result provided here just because this benchmark considers to be very relevant for Atom > > -O2 is not supposed to give best benchmark results. O2 is wide-used so performance improvement could be important for users. > > Thanks, > Richard. > >> >> Tested for i386 and x86-64, ok for trunk? Updated patch attached >> >> Thanks, >> Igor >> >> ChangeLog: >> >> 2012-04-10 Yakovlev Vladimir >> >> * gcc/config/i386/i386.c (check_imul): New routine. >> (ix86_loop_unroll_adjust): New target hook. >> (ix86_option_override_internal): Enable unrolling on Atom at -O2. >> (TARGET_LOOP_UNROLL_ADJUST): New define. unroll1.patch Description: Binary data
Re: [PATCH] Atom: Enabling unroll at O2 optimization level
On Wed, Apr 11, 2012 at 5:34 PM, Andi Kleen wrote: > Richard Guenther writes: >> >> 5% is not moderate. Your patch does enable unrolling at -O2 but not -O3, >> why? Why do you disable register renaming? check_imull requires a function >> comment. >> >> This completely looks like a hack for EEMBC2.0, so it's definitely not ok. >> >> -O2 is not supposed to give best benchmark results. > > Besides it is against the Intel Optimization Manual recommendation > to prefer small code on Atom to avoid falling out of the predecode hints > in the cache. Yes, this is well-known concern for Atom. But in the same time unroll could help a lot for inorder machines because it could provide more opportunities to a compiler scheduler. And experiments showed that unroll could really help. > > So would need much more benchmarking on macro workloads first at least. Like what, for example? I believe in this case everything also strongly depends on test usage model (e.g. it usually compiled with Os not O2) and, let's say, internal test structure - whether there are hot loops that suitable for unroll. > > -Andi > > -- > a...@linux.intel.com -- Speaking for myself only
Re: [PATCH] Atom: Scheduler improvements for better imul placement
On Wed, Apr 11, 2012 at 6:10 PM, Richard Guenther wrote: > On Wed, Apr 11, 2012 at 3:38 PM, Andi Kleen wrote: >> Igor Zamyatin writes: >> >>> Hi All! >>> >>> It is known that imul placement is rather critical for Atom processors >>> and changes try to improve imul scheduling for Atom. >>> >>> This gives +5% performance on several tests from new OA 2.0 testsuite >>> from EEMBC. >>> >>> Tested for i386 and x86-64, ok for trunk? >> >> Did you measure how much this slows down the compiler when compiling >> for Atom? The new pass looks rather slow. > > Also please explain why adjusting the automaton for Atom is not a way to > attack this issue. If I understand the question correctly - it's a dynamic check and it's not clear how to describe this adjusting statically in machine description > > Richard. > >> -Andi >> >> -- >> a...@linux.intel.com -- Speaking for myself only
Re: [PATCH] Atom: Scheduler improvements for better imul placement
On Wed, Apr 11, 2012 at 5:38 PM, Andi Kleen wrote: > Igor Zamyatin writes: > >> Hi All! >> >> It is known that imul placement is rather critical for Atom processors >> and changes try to improve imul scheduling for Atom. >> >> This gives +5% performance on several tests from new OA 2.0 testsuite >> from EEMBC. >> >> Tested for i386 and x86-64, ok for trunk? > > Did you measure how much this slows down the compiler when compiling > for Atom? The new pass looks rather slow. No, since this "pass" is applied rarely. Conditions are mentioned in the comments to ix86_sched_reorder > > -Andi > > -- > a...@linux.intel.com -- Speaking for myself only
[PATCH] Atom: Scheduler improvements for better imul placement
Hi All! It is known that imul placement is rather critical for Atom processors and changes try to improve imul scheduling for Atom. This gives +5% performance on several tests from new OA 2.0 testsuite from EEMBC. Tested for i386 and x86-64, ok for trunk? ChangeLog: 2012-04-10 Yuri Rumyantsev * config/i386/i386.c (x86_sched_reorder): New function. Add new function x86_sched_reorder to take advantage of Atom pipelened IMUL execution. imul_sched.patch Description: Binary data
[PATCH] Atom: Enabling unroll at O2 optimization level
Hi All! Here is a patch that enables unroll at O2 for Atom. This gives good performance boost on EEMBC 2.0 (~+8% in Geomean for 32 bits) with quite moderate code size increase (~5% for EEMBC2.0, 32 bits). Tested for i386 and x86-64, ok for trunk? Thanks, Igor ChangeLog: 2012-04-10 Yakovlev Vladimir * gcc/config/i386/i386.c (check_imull): New routine. (ix86_loop_unroll_adjust): New target hook. (ix86_option_override_internal): Enable unrolling on Atom at -O2. (TARGET_LOOP_UNROLL_ADJUST): New define. unroll.patch Description: Binary data
Re: FW: patch to fix PR21617
Unfortunately patch doesn't help neither for separate EEMBC_2_0 tests nor for the whole benchmark. Do you want me to do some debugging here? On Fri, Jan 20, 2012 at 10:13 PM, Vladimir Makarov wrote: > On 01/19/2012 03:52 PM, Vladimir Makarov wrote: >> >> On 01/18/2012 02:30 PM, Zamyatin, Igor wrote: >>> >>> Yes, we use Atom for EEMBC measurements. >>> >>> We'll be glad to help you with your findings. >>> >>> >> Thanks. >> >> Unfortunately I tried several alternative patches but I did not find a >> better solution (it is mostly code size degradation on CoreI7). Now I am >> even thinking that the best action would have been ignoring the original PR. >> > Could you try the attached patch. It might work. > > I've tried several approach to prevent small hole creation in > ira-color.c::assign_hard_reg but it does not work well. > > Thanks. > >
Re: [PATCH] PR testsuite/51097 fix: a lot of "FAIL: gcc.dg/vect" on i686 avx build 181167 to 181177
Thanks, I'll look into your remarks On Thu, Dec 29, 2011 at 5:33 PM, Ira Rosen wrote: > > > Igor Zamyatin wrote on 29/12/2011 02:29:46 PM: > >> >> Because it includes AVX and AVX2 which deals with int and for AVX2 >> there are no problems with doubled diagnostics. > > And you can't just update vect_int because AVX does support it but with > 128-bit vectors, right? > So, your vect_float_no_int looks incorrect as well. > You need to describe the case when two vector sizes are analyzed, but the > first one always fails. Maybe vect_sizes_32B_16B_noint? Probably ugly, but > correct at least. > > I also suggest to simplify the checks and not to check the number of times > a pattern was detected , like this: > > Index: vect-widen-mult-half.c > === > --- vect-widen-mult-half.c (revision 182703) > +++ vect-widen-mult-half.c (working copy) > @@ -43,7 +43,7 @@ int main (void) > } > > /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 > "vect" { target vect_widen_mult_hi_to_si } } } */ > -/* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern: > detected" 1 "vect" { target vect_widen_mult_hi_to_si_pattern } } } */ > +/* { dg-final { scan-tree-dump "vect_recog_widen_mult_pattern: detected" > "vect" { target vect_widen_mult_hi_to_si_pattern } } } */ > /* { dg-final { scan-tree-dump-times "pattern recognized" 1 > "vect" { target vect_widen_mult_hi_to_si_pattern } } } */ > /* { dg-final { cleanup-tree-dump "vect" } } */ > > > And split the tests with several loops into several files, like, for > example, vect-widen-mult-const-s16.c. > > The simplification is ok for slp-reduc-6.c, vect-119.c, vect-35-big-array.c > as well. > > For the rest of the tests, I don't understand why there are alignment > messages printed twice for two vector sizes. Why doesn't the vectorizer > fail during vect_determine_vectorization_factor? > > Ira > >> >> I understand that all this looks quite bulky but it's hard to create >> something which looks better without loosing generality >> >> >> On Thu, Dec 29, 2011 at 4:15 PM, Ira Rosen wrote: >> > >> > >> > Igor Zamyatin wrote on 29/12/2011 02:04:45 PM: >> > >> >> When compiler configured with, say corei7-avx, it outputs twice more >> >> diagnostics on integer tests since AVX deals mostly with floats. I.e. >> >> compiler tries to vectorize on AVX vector size, than fails and then >> >> vectorizes on smaller vector size. This double work leads to double >> >> diagnostic output. >> > >> > OK, so you why not use vect_sizes_32B_16B? >> > >> > Ira >> > >> > >> >
Re: [PATCH] PR testsuite/51097 fix: a lot of "FAIL: gcc.dg/vect" on i686 avx build 181167 to 181177
Because it includes AVX and AVX2 which deals with int and for AVX2 there are no problems with doubled diagnostics. I understand that all this looks quite bulky but it's hard to create something which looks better without loosing generality On Thu, Dec 29, 2011 at 4:15 PM, Ira Rosen wrote: > > > Igor Zamyatin wrote on 29/12/2011 02:04:45 PM: > >> When compiler configured with, say corei7-avx, it outputs twice more >> diagnostics on integer tests since AVX deals mostly with floats. I.e. >> compiler tries to vectorize on AVX vector size, than fails and then >> vectorizes on smaller vector size. This double work leads to double >> diagnostic output. > > OK, so you why not use vect_sizes_32B_16B? > > Ira > >
Re: [PATCH] PR testsuite/51097 fix: a lot of "FAIL: gcc.dg/vect" on i686 avx build 181167 to 181177
When compiler configured with, say corei7-avx, it outputs twice more diagnostics on integer tests since AVX deals mostly with floats. I.e. compiler tries to vectorize on AVX vector size, than fails and then vectorizes on smaller vector size. This double work leads to double diagnostic output. On Thu, Dec 29, 2011 at 12:07 PM, Ira Rosen wrote: > > > gcc-patches-ow...@gcc.gnu.org wrote on 28/12/2011 11:05:19 PM: > >> Hi, > > Hi Igor, > >> >> Here is another patch about failures in gcc.dg/vect tests. These >> changes fix fails that could be seen on avx-built compilers. It also >> introduces no FAILs/XFAILs/XPASSes/ERRORs on regular i686, x86_64, >> avx2_32, avx2_64. >> Is it ok for the trunk? > > >> diff --git a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-sum.c > b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-sum.c >> index 2898918..1d190fc 100644 >> --- a/gcc/testsuite/gcc.dg/vect/vect-widen-mult-sum.c >> +++ b/gcc/testsuite/gcc.dg/vect/vect-widen-mult-sum.c >> @@ -43,5 +43,6 @@ int main (void) >> >> >> /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 > "vect" { target vect_widen_mult_hi_to_si } } } */ >> -/* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern: > detected" 1 "vect" } } */ >> +/* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern: > detected" 1 "vect" { target {! vect_float_no_int } } } } */ >> +/* { dg-final { scan-tree-dump-times "vect_recog_widen_mult_pattern: > detected" 2 "vect" { target vect_float_no_int } } } */ >> /* { dg-final { cleanup-tree-dump "vect" } } */ > > Could you please explain what are you trying to do? How is >> +# Return 1 if the target supports hardware vectors of float and doesn't > support >> +# vectors of int, 0 otherwise. > related to the number of times that pattern is detected? > > Thanks, > Ira > > >> >> Thanks, >> Igor >> >> 2011-12-28 Igor Zamyatin >> >> PR testsuite/51097 >> * lib/target-supports.exp > (check_effective_target_vect_float_no_int): >> New function. >> (check_avx2_available): Ditto. >> * gcc.dg/vect/no-scevccp-outer-7.c: Adjust dg-scans for AVX-built >> compiler. >> * gcc.dg/vect/no-scevccp-vect-iv-3.c: Likewise. >> * gcc.dg/vect/no-vfa-vect-depend-1.c: Likewise. >> * gcc.dg/vect/no-vfa-vect-dv-2.c: Likewise. >> * gcc.dg/vect/slp-perm-9.c: Likewise. >> * gcc.dg/vect/slp-reduc-6.c: Likewise. >> * gcc.dg/vect/slp-widen-mult-half.c: Likewise. >> * gcc.dg/vect/vect-109.c: Likewise. >> * gcc.dg/vect/vect-119.c: Likewise. >> * gcc.dg/vect/vect-35-big-array.c: Likewise. >> * gcc.dg/vect/vect-91.c: Likewise. >> * gcc.dg/vect/vect-multitypes-4.c: Likewise. >> * gcc.dg/vect/vect-multitypes-6.c: Likewise. >> * gcc.dg/vect/vect-outer-4c-big-array.c: Likewise. >> * gcc.dg/vect/vect-over-widen-1.c: Likewise. >> * gcc.dg/vect/vect-over-widen-4.c: Likewise. >> * gcc.dg/vect/vect-peel-1.c: Likewise. >> * gcc.dg/vect/vect-peel-3.c: Likewise. >> * gcc.dg/vect/vect-peel-4.c: Likewise. >> * gcc.dg/vect/vect-reduc-dot-s16a.c: Likewise. >> * gcc.dg/vect/vect-reduc-dot-s8a.c: Likewise. >> * gcc.dg/vect/vect-reduc-dot-u8a.c: Likewise. >> * gcc.dg/vect/vect-reduc-dot-u8b.c: Likewise. >> * gcc.dg/vect/vect-reduc-pattern-1a.c: Likewise. >> * gcc.dg/vect/vect-reduc-pattern-1b-big-array.c: Likewise. >> * gcc.dg/vect/vect-reduc-pattern-1c-big-array.c: Likewise. >> * gcc.dg/vect/vect-reduc-pattern-2a.c: Likewise. >> * gcc.dg/vect/vect-reduc-pattern-2b-big-array.c: Likewise. >> * gcc.dg/vect/vect-widen-mult-const-s16.c: Likewise. >> * gcc.dg/vect/vect-widen-mult-const-u16.c: Likewise. >> * gcc.dg/vect/vect-widen-mult-half-u8.c: Likewise. >> * gcc.dg/vect/vect-widen-mult-half.c: Likewise. >> * gcc.dg/vect/vect-widen-mult-sum.c: Likewise. >> * gcc.dg/vect/vect-widen-mult-u16.c: Likewise. >> * gcc.dg/vect/wrapv-vect-reduc-dot-s8b.c: Likewise. >> [attachment "51097.patch" deleted by Ira Rosen/Haifa/IBM] >
Re: FW: patch to fix PR21617
Ilya is on vacation so I'll make the answer. Overall score became worse on 0.3%. > -Original Message- > From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-ow...@gcc.gnu.org] On > Behalf Of Vladimir Makarov > Sent: Thursday, December 22, 2011 8:15 PM > To: Ilya Enkovich > Cc: gcc-patches > Subject: Re: patch to fix PR21617 > > On 12/22/2011 06:19 AM, Ilya Enkovich wrote: >> 2011/12/13 Vladimir Makarov: >>> The following patch solves PR 21617 which is described on >>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21617. >>> >>> Just adding number of necessary hard registers solves the problem but >>> creates 2% SPEC2000 perlbmk degradation on x86. Fortunately, removing >>> allocno class comparison removes the degradation. The allocno class >>> comparison was originally added for debugging purposes to put coloring of >>> allocnos of the same class in one place. It was ok when we had cover >>> classes which did not intersect. Now allocno classes intersect and this >>> comparison should be not used (at least as highest priority heuristic). >>> >>> Beside solving the problem, the patch improves a bit SPEC2000 performance >>> and code size on x86. >>> >>> The patch was successfully bootstrapped on x86/x86-64. I expect it should >>> be pretty safe for other targets because it changes heuristics only. >>> >>> Committed as rev. 182263. >>> >>> >>> 2011-12-12 Vladimir Makarov >>> >>> PR rtl-optimization/21617 >>> * ira-color.c (bucket_allocno_compare_func): Don't compare >>> allocno classes. Compare number of hard registers needed. >>> >> Hello, Vladimir, >> >> I think there may be a typo in this patch. I saw few degradations in >> EEMBC2.0 benchmarks caused by this patch. Following fix removes these >> degradations: >> >> - if ((diff = (ira_reg_class_max_nregs[cl1][ALLOCNO_MODE (a1)] >> - - ira_reg_class_max_nregs[cl2][ALLOCNO_MODE (a2)])) != 0) >> + if ((diff = (ira_reg_class_max_nregs[cl2][ALLOCNO_MODE (a2)] >> + - ira_reg_class_max_nregs[cl1][ALLOCNO_MODE (a1)])) != 0) >> >> I suspect typo because previous calculation of 'diff' value used f(a2) >> - f(a1) formula and now it is f(a1) - f(a2). Could you please look at >> it? >> > I don't think it was a typo. The patch puts multi-registers pseudos at > the end of the coloring bucket, so they will push to the coloring stacke > last and popped from the stack first and get better chance to be > assigned to hard registers because smaller chance of small holes > creation of free hard registers. > > Generally speaking, RA is all about heuristics and it is always possible > to find a test when one heuristic will work better than another one and > vise versa. So for me, there is a little sense to work on such PRs of > course unless it results in a discovery of better heuristics which > improves overall score on a credible test set like SPECCPU, for example. > > I work on these PRs mostly to help next gcc release to come out. > > Returning to your complaint, could you give me info how the overall > score of EEMBC changed after committing the patch, not just saying that > there are few degradations.
[PATCH] PR testsuite/51097 fix: a lot of "FAIL: gcc.dg/vect" on i686 avx build 181167 to 181177
Hi, Here is another patch about failures in gcc.dg/vect tests. These changes fix fails that could be seen on avx-built compilers. It also introduces no FAILs/XFAILs/XPASSes/ERRORs on regular i686, x86_64, avx2_32, avx2_64. Is it ok for the trunk? Thanks, Igor 2011-12-28 Igor Zamyatin PR testsuite/51097 * lib/target-supports.exp (check_effective_target_vect_float_no_int): New function. (check_avx2_available): Ditto. * gcc.dg/vect/no-scevccp-outer-7.c: Adjust dg-scans for AVX-built compiler. * gcc.dg/vect/no-scevccp-vect-iv-3.c: Likewise. * gcc.dg/vect/no-vfa-vect-depend-1.c: Likewise. * gcc.dg/vect/no-vfa-vect-dv-2.c: Likewise. * gcc.dg/vect/slp-perm-9.c: Likewise. * gcc.dg/vect/slp-reduc-6.c: Likewise. * gcc.dg/vect/slp-widen-mult-half.c: Likewise. * gcc.dg/vect/vect-109.c: Likewise. * gcc.dg/vect/vect-119.c: Likewise. * gcc.dg/vect/vect-35-big-array.c: Likewise. * gcc.dg/vect/vect-91.c: Likewise. * gcc.dg/vect/vect-multitypes-4.c: Likewise. * gcc.dg/vect/vect-multitypes-6.c: Likewise. * gcc.dg/vect/vect-outer-4c-big-array.c: Likewise. * gcc.dg/vect/vect-over-widen-1.c: Likewise. * gcc.dg/vect/vect-over-widen-4.c: Likewise. * gcc.dg/vect/vect-peel-1.c: Likewise. * gcc.dg/vect/vect-peel-3.c: Likewise. * gcc.dg/vect/vect-peel-4.c: Likewise. * gcc.dg/vect/vect-reduc-dot-s16a.c: Likewise. * gcc.dg/vect/vect-reduc-dot-s8a.c: Likewise. * gcc.dg/vect/vect-reduc-dot-u8a.c: Likewise. * gcc.dg/vect/vect-reduc-dot-u8b.c: Likewise. * gcc.dg/vect/vect-reduc-pattern-1a.c: Likewise. * gcc.dg/vect/vect-reduc-pattern-1b-big-array.c: Likewise. * gcc.dg/vect/vect-reduc-pattern-1c-big-array.c: Likewise. * gcc.dg/vect/vect-reduc-pattern-2a.c: Likewise. * gcc.dg/vect/vect-reduc-pattern-2b-big-array.c: Likewise. * gcc.dg/vect/vect-widen-mult-const-s16.c: Likewise. * gcc.dg/vect/vect-widen-mult-const-u16.c: Likewise. * gcc.dg/vect/vect-widen-mult-half-u8.c: Likewise. * gcc.dg/vect/vect-widen-mult-half.c: Likewise. * gcc.dg/vect/vect-widen-mult-sum.c: Likewise. * gcc.dg/vect/vect-widen-mult-u16.c: Likewise. * gcc.dg/vect/wrapv-vect-reduc-dot-s8b.c: Likewise. 51097.patch Description: Binary data