RE: [PATCH] Fix bdverN vector cost of cond_[not_]taken_branch_cost
>Note that before the fixes for PR64909 the epilogue/prologue loops had very >large costs associated due to a bug in the cost model implementation. After >the fix their cost is reasonable but the cost of the extra jumps is way >under-accounted for due to the numbers for >cond_taken_branch_cost and >cond_not_taken_branch_cost. > The proposes match mitigates that somewhat. Richard! The patch is good. We are done with our benchmarking and found no regressions. > How did you arrive at the original cost model? The original cost model as you suspect is not based on architecture alone. Those are the numbers arrived at by analyzing benchmarks and the cost model bugs then. These initial numbers were copied for subsequent architectures too. Cost assignments saying "scalar_stmt_cost = 6" and "scalar load_cost = 4" doesn't make sense at all. We will have a look into it. Regards Ganesh -Original Message- From: Richard Biener [mailto:rguent...@suse.de] Sent: Wednesday, April 08, 2015 1:08 PM To: Gopalasubramanian, Ganesh Cc: Uros Bizjak; gcc-patches@gcc.gnu.org Subject: RE: [PATCH] Fix bdverN vector cost of cond_[not_]taken_branch_cost On Wed, 8 Apr 2015, Gopalasubramanian, Ganesh wrote: > > I have added a person from AMD to comment on the decision. > > Otherwise, the patch looks OK, but please wait a couple of days for > > possible comments. > > Thank you Uros! > I am checking the changes with few tests and benchmarking them. > Please wait for a couple of days. Note that before the fixes for PR64909 the epilogue/prologue loops had very large costs associated due to a bug in the cost model implementation. After the fix their cost is reasonable but the cost of the extra jumps is way under-accounted for due to the numbers for cond_taken_branch_cost and cond_not_taken_branch_cost. The proposes match mitigates that somewhat. How did you arrive at the original cost model? Thanks, Richard. -- Richard Biener SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Jennifer Guild, Dilip Upmanyu, Graham Norton HRB 21284 (AG Nuernberg)
RE: [PATCH] Fix bdverN vector cost of cond_[not_]taken_branch_cost
> I have added a person from AMD to comment on the decision. > Otherwise, the patch looks OK, but please wait a couple of days for possible > comments. Thank you Uros! I am checking the changes with few tests and benchmarking them. Please wait for a couple of days. -Ganesh
RE: [PATCH] Rename gimple_build_assign_with_ops to gimple_build_assign and swap the first two arguments of it
The following patch implements that. Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk? Our aarch64 build also breaks as mentioned in https://gcc.gnu.org/ml/gcc-patches/2014-12/msg00119.html Regards Ganesh
RE: [PATCH, aarch64] Add prefetch support
Please ignore the previous patch sent. The attachment was wrong. > There's no point in the buffer or the sprintf. > The text is short enough to repeat whole pattern in the array: Updated the patch for the above suggestions. make -k check RUNTESTFLAGS="execute.exp compile.exp dg.exp" passes. Is it ok for upstream? Regards Ganesh prefetch.diff Description: prefetch.diff
RE: [PATCH, aarch64] Add prefetch support
> There's no point in the buffer or the sprintf. > The text is short enough to repeat whole pattern in the array: Updated the patch for the above suggestions. Is it ok for upstream? Regards Ganesh prefetch.diff Description: prefetch.diff
RE: [PATCH, aarch64] Add prefetch support
> For this prefetch patch I suggest we go with the existing "load1". I have removed the changes done in types.md. > The inline patch has been munged by your mailer, I tried applying the patch > to my tree but it is full of escape sequences. Can you either fix your > mailer or submit patches as attachments? I am attaching the revised patch. > Check the whitespace in your ChangeLog line. Changelog entry is also embedded in the attachment. Regards Ganesh prefetch.diff Description: prefetch.diff
FW: [PATCH, aarch64] Add prefetch support
PING! I am worried if it goes in stage-1. -Original Message- From: Gopalasubramanian, Ganesh Sent: Thursday, October 30, 2014 2:24 PM To: gcc-patches@gcc.gnu.org Subject: [PATCH, aarch64] Add prefetch support Hi, Below is the patch that implements prefetching support. This patch has been already discussed on a) https://gcc.gnu.org/ml/gcc-patches/2014-02/msg01644.html b) https://gcc.gnu.org/ml/gcc-patches/2014-07/msg00612.html I have not added a test as there are ample tests in compile and execute suites. "make -k check" passes. Ok for trunk? Changelog: 2014-10-30 Ganesh Gopalasubramanian * config/aarch64/aarch64.md (define_insn "prefetch"): New. * config/arm/types.md (define_attr "type"): Add prefetch. Regards Ganesh diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md index 74b554e..12a3f170 100644 --- a/gcc/config/aarch64/aarch64.md +++ b/gcc/config/aarch64/aarch64.md @@ -320,6 +320,38 @@ [(set_attr "type" "no_insn")] ) + +(define_insn "prefetch" + [(prefetch (match_operand:DI 0 "address_operand" "r") + (match_operand:QI 1 "const_int_operand" "") + (match_operand:QI 2 "const_int_operand" ""))] + "" + "* +{ + const char * pftype[2][10] + = { {\"PLDL1STRM\", \"PLDL3KEEP\", \"PLDL2KEEP\", \"PLDL1KEEP\"}, + {\"PSTL1STRM\", \"PSTL3KEEP\", \"PSTL2KEEP\", \"PSTL1KEEP\"}, + }; + + int locality = INTVAL (operands[2]); + char pattern[100]; + + gcc_assert (IN_RANGE (locality, 0, 3)); + + strcpy (pattern, \"prfm\\t\"); + strcat (pattern, (const char*)pftype[INTVAL(operands[1])][locality]); + strcat (pattern, \", %a0\"); + + output_asm_insn (pattern, + operands); + + return \"\"; + +}" + [(set_attr "type" "prefetch")] +) + (define_insn "trap" [(trap_if (const_int 1) (const_int 8))] "" diff --git a/gcc/config/arm/types.md b/gcc/config/arm/types.md index c1151f5..8b4b7a1 100644 --- a/gcc/config/arm/types.md +++ b/gcc/config/arm/types.md @@ -118,6 +118,7 @@ ; mvn_shift_reg inverting move instruction, shifted operand by a register. ; no_insn an insn which does not represent an instruction in the ; final output, thus having no impact on scheduling. +; prefetch a prefetch instruction ; rbit reverse bits. ; rev reverse bytes. ; sdiv signed division. @@ -556,6 +557,7 @@ call,\ clz,\ no_insn,\ + prefetch,\ csel,\ crc,\
[PATCH, aarch64] Add prefetch support
Hi, Below is the patch that implements prefetching support. This patch has been already discussed on a) https://gcc.gnu.org/ml/gcc-patches/2014-02/msg01644.html b) https://gcc.gnu.org/ml/gcc-patches/2014-07/msg00612.html I have not added a test as there are ample tests in compile and execute suites. "make -k check" passes. Ok for trunk? Changelog: 2014-10-30 Ganesh Gopalasubramanian * config/aarch64/aarch64.md (define_insn "prefetch"): New. * config/arm/types.md (define_attr "type"): Add prefetch. Regards Ganesh diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md index 74b554e..12a3f170 100644 --- a/gcc/config/aarch64/aarch64.md +++ b/gcc/config/aarch64/aarch64.md @@ -320,6 +320,38 @@ [(set_attr "type" "no_insn")] ) + +(define_insn "prefetch" + [(prefetch (match_operand:DI 0 "address_operand" "r") + (match_operand:QI 1 "const_int_operand" "") + (match_operand:QI 2 "const_int_operand" ""))] + "" + "* +{ + const char * pftype[2][10] + = { {\"PLDL1STRM\", \"PLDL3KEEP\", \"PLDL2KEEP\", \"PLDL1KEEP\"}, + {\"PSTL1STRM\", \"PSTL3KEEP\", \"PSTL2KEEP\", \"PSTL1KEEP\"}, + }; + + int locality = INTVAL (operands[2]); + char pattern[100]; + + gcc_assert (IN_RANGE (locality, 0, 3)); + + strcpy (pattern, \"prfm\\t\"); + strcat (pattern, (const char*)pftype[INTVAL(operands[1])][locality]); + strcat (pattern, \", %a0\"); + + output_asm_insn (pattern, + operands); + + return \"\"; + +}" + [(set_attr "type" "prefetch")] +) + (define_insn "trap" [(trap_if (const_int 1) (const_int 8))] "" diff --git a/gcc/config/arm/types.md b/gcc/config/arm/types.md index c1151f5..8b4b7a1 100644 --- a/gcc/config/arm/types.md +++ b/gcc/config/arm/types.md @@ -118,6 +118,7 @@ ; mvn_shift_reg inverting move instruction, shifted operand by a register. ; no_insn an insn which does not represent an instruction in the ; final output, thus having no impact on scheduling. +; prefetch a prefetch instruction ; rbit reverse bits. ; rev reverse bytes. ; sdiv signed division. @@ -556,6 +557,7 @@ call,\ clz,\ no_insn,\ + prefetch,\ csel,\ crc,\
RE: RFA: another patch to fix PR61360
>The "r->x" alternative results in "vector" decoding on amdfam10. This is >AMD-speak for microcoded instructions, and AMD optimization manual strongly >recommends avoiding them. I have CC'd Ganesh, maybe he >can provide more >relevant data on the performance impact. Thanks Uros! Yes, the AMD SWOG recommends precisely what Uros mentions. When moving data from a GPR to an XMM register, use separate store and load instructions to move the data first from the source register to a temporary location in memory and then from memory into the destination register This is listed as an optimization too. This holds good for all amdfam10 and BD family processors. I have to dig through the performance numbers will try to get them. Regards Ganesh
[PATCH, i386] PR61360: Do not update "enabled" attribute during lra and reload passes
This patch fixes PR 61360. The attribute "enabled" should actually be used enable/disable alternative based on sub-targets. In this pattern, it gets used across passes too. However, modifying this attribute in LRA pass is not something it is meant for. This patch allows enabling/disabling the attribute when optimizing for size, but not during lra pass or reload pass. Bootstrap passes. OK for upstream? Regards Ganesh diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 6d91da0..3775f6e 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,4 +1,10 @@ -2014-08-22 David Malcolm +2014-08-22 Ganesh Gopalasubramanian + + PR 61360 + * config/i386/i386.md (*float2_sse): + Do not modify "enabled" attribute during LRA pass. + +014-08-22 David Malcolm * cprop.c (struct occr): Strengthen field "insn" from rtx to rtx_insn *. diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index 8e74eab..de2ecf0 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -4795,10 +4795,10 @@ /* ??? For sched1 we need constrain_operands to be able to select an alternative. Leave this enabled before RA. */ (symbol_ref "TARGET_INTER_UNIT_CONVERSIONS - || optimize_function_for_size_p (cfun) - || !(reload_completed -|| reload_in_progress -|| lra_in_progress)") + || (optimize_function_for_size_p (cfun) + && !(reload_completed +|| reload_in_progress +|| lra_in_progress))") ] (symbol_ref "true"))) ])
RE: [PATCH, i386] Remove use of vpmacsdql instruction from multiplication.
Hi Uros! > > +2014-06-10 Ganesh Gopalasubramanian > > + > > + > > + * config/i386/i386.c (ix86_expand_sse2_mulvxdi3): Issue > > +instructions "vpmuludq" and "vpaddq" instead of "vpmacsdql" for > > +handling 32-bit multiplication. > > > OK for mainline and release branches. I would like to backport the above patch for 4.9. Is it OK? Regards Ganesh
RE: [PATCH, i386] Add RDRND and MOVBE for AMD bdver4
> OK for mainline. Thanks Uros. Committed to revision 213572 I would like to backport to 4.9 branch too. Is it OK? - Ganesh
[PATCH, i386] Add RDRND and MOVBE for AMD bdver4
Below patch adds PTA_RDRND and PTA_MOVBE for bdver4. Bootstrap passes. Ok for upstream? Regards Ganesh Index: gcc/ChangeLog === --- gcc/ChangeLog (revision 213568) +++ gcc/ChangeLog (working copy) @@ -24,6 +24,11 @@ 2014-08-04 Ganesh Gopalasubramanian +* gcc/config/i386/i386.c (ix86_option_override_internal): Add + PTA_RDRND and PTA_MOVBE for bdver4. + +2014-08-04 Ganesh Gopalasubramanian + * config/i386/driver-i386.c (host_detect_local_cpu): Handle AMD's extended family information. Handle BTVER2 cpu with cpuid family value. Index: gcc/config/i386/i386.c === --- gcc/config/i386/i386.c (revision 213568) +++ gcc/config/i386/i386.c (working copy) @@ -3267,12 +3267,13 @@ ix86_option_override_internal (bool main | PTA_FMA | PTA_PRFCHW | PTA_FXSR | PTA_XSAVE | PTA_XSAVEOPT | PTA_FSGSBASE}, {"bdver4", PROCESSOR_BDVER4, CPU_BDVER4, -PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3 -| PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1 -| PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_AVX2 + PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3 + | PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1 + | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_AVX2 | PTA_FMA4 | PTA_XOP | PTA_LWP | PTA_BMI | PTA_BMI2 | PTA_TBM | PTA_F16C | PTA_FMA | PTA_PRFCHW | PTA_FXSR - | PTA_XSAVE | PTA_XSAVEOPT | PTA_FSGSBASE}, + | PTA_XSAVE | PTA_XSAVEOPT | PTA_FSGSBASE | PTA_RDRND + | PTA_MOVBE}, {"btver1", PROCESSOR_BTVER1, CPU_GENERIC, PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3 | PTA_SSSE3 | PTA_SSE4A |PTA_ABM | PTA_CX16 | PTA_PRFCHW
RE: [PATCH, i386] Handle extended family cpuid info for AMD
> In this case, having only check for family ID should be enough. If >BTVER1 and BTVER2 can be uniquely determined by their family IDs , >IMO, this would be the most future-proof approach. Signature checks will >override family id checks which will override cpuid checks. Thank you Uros! I have modified source only for BTVER2. The way BTVER1 is currently assigned to processor includes more than one family. So, I am leaving that unmoved. Bootstrap passes. Is it OK for trunk and backport to open branches. Regards -Ganesh diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 706fedc..202bd99 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,8 @@ +2014-08-01 Ganesh Gopalasubramanian + + * config/i386/driver-i386.c (host_detect_local_cpu): Handle AMD's extended family + information. Handle BTVER2 cpu with cpuid family value. + 2014-07-31 James Greenhalgh * config/aarch64/arm_neon.h (vpadd_<8,16,32,64>): Move to diff --git a/gcc/config/i386/driver-i386.c b/gcc/config/i386/driver-i386.c index 1c6385f..0402c90 100644 --- a/gcc/config/i386/driver-i386.c +++ b/gcc/config/i386/driver-i386.c @@ -432,7 +432,8 @@ const char *host_detect_local_cpu (int argc, const char **argv) model = (eax >> 4) & 0x0f; family = (eax >> 8) & 0x0f; - if (vendor == signature_INTEL_ebx) + if ((vendor == signature_INTEL_ebx) || + (vendor == signature_AMD_ebx)) { unsigned int extended_model, extended_family; @@ -576,7 +577,7 @@ const char *host_detect_local_cpu (int argc, const char **argv) if (name == signature_NSC_ebx) processor = PROCESSOR_GEODE; - else if (has_movbe) + else if (family == 22) processor = PROCESSOR_BTVER2; else if (has_avx2) processor = PROCESSOR_BDVER4;
RE: [PATCH, i386] Handle extended family cpuid info for AMD
Uros! > I would like to have a check for a family at the beginning, something like: > if (name == signature_NSC_ebx) >processor = PROCESSOR_GEODE; > else if (family == 22) >{ > if (has_movbe) I get your idea of having the family checked first and then differentiating with cpuid info later. But, this case is getting interesting because, BTVER1 and BTVER2 are two variants but don't really have same family numbers. BTVER1 is family 14h and BTVER2 is family 16h. I don't see near term plans for any additional cpus to either 14h or 16h. Given that fact, this particular check is applicable only for BTVER2. In that case, having else if (family == 22) if (has_movbe) processor = PROCESSOR_BTVER2; looks odd. Regards Ganesh
RE: [PATCH, i386] Handle extended family cpuid info for AMD
> Then just use: > + else if (has_avx2) > +processor = PROCESSOR_BDVER4; > else if (has_movbe) >processor = PROCESSOR_BTVER2; >- else if (has_avx2) >-processor = PROCESSOR_BDVER4; > else if (has_xsaveopt) In that case, with earlier GCC versions where we don’t have bdver4 added, the fall back would be BTVER2, whereas a BD variant is more desirable. Ganesh
RE: [PATCH, i386] Handle extended family cpuid info for AMD
> But, looking to processor_alias_table in config/i386/i386.c, only > PROCESSOR_BTVER2 defines PTA_MOVBE. According to this, the logic is already > correct, so the patch is not needed. We are evaluating bdver4 cpu. Bdver4 also supports MOVBE. I will submit patch for bdver4 PTA after our evaluation. Ganesh.
[PATCH, i386] Handle extended family cpuid info for AMD
Hi, The below patch handles the AMD's cpuid family information. With the information from cpuid, BTVER2 cpu for -march=native flag is handled. Bootstrap passes. Is it OK for trunk and branches? Regards Ganesh diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 6223bd6..3f8bb2c 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,8 @@ +2014-07-31 Ganesh Gopalasubramanian + + * tree-sra.c (host_detect_local_cpu): Handle AMD's extended family + information. Handle BTVER2 cpu with cpuid family value. + 2014-07-30 Martin Jambor * tree-sra.c (sra_ipa_modify_assign): Change type of the first diff --git a/gcc/config/i386/driver-i386.c b/gcc/config/i386/driver-i386.c index 1c6385f..21ae1f3 100644 --- a/gcc/config/i386/driver-i386.c +++ b/gcc/config/i386/driver-i386.c @@ -432,7 +432,8 @@ const char *host_detect_local_cpu (int argc, const char **argv) model = (eax >> 4) & 0x0f; family = (eax >> 8) & 0x0f; - if (vendor == signature_INTEL_ebx) + if ((vendor == signature_INTEL_ebx) || + (vendor == signature_AMD_ebx)) { unsigned int extended_model, extended_family; @@ -576,7 +577,7 @@ const char *host_detect_local_cpu (int argc, const char **argv) if (name == signature_NSC_ebx) processor = PROCESSOR_GEODE; - else if (has_movbe) + else if (has_movbe && family == 22) processor = PROCESSOR_BTVER2; else if (has_avx2) processor = PROCESSOR_BDVER4;
FW: [PATCH, aarch64] Add prefetch support
PING! -Original Message- From: Gopalasubramanian, Ganesh Sent: Sunday, July 06, 2014 2:12 AM To: gcc-patches@gcc.gnu.org Cc: marcus.shawcr...@arm.com; richard.earns...@arm.com Subject: RE: [PATCH, aarch64] Add prefetch support PING! From: Gopalasubramanian, Ganesh Sent: Friday, July 04, 2014 5:57 AM To: gcc-patches@gcc.gnu.org Cc: marcus.shawcr...@arm.com; richard.earns...@arm.com Subject: [PATCH, aarch64] Add prefetch support Hi, Attached is a patch that implements * Prefetch with immediate offset in the range 0 to 32760 (multiple of 8). Added a predicate for this. * Prefetch with immediate offset - in the range -256 to 255 (Gets generated only when we have a negative offset. Generates prfum instruction). Added a predicate for this. * Prefetch with register offset. (modified for printing the locality) This patch has been already discussed on https://gcc.gnu.org/ml/gcc-patches/2014-02/msg01644.html "make -k check" passes. Ok for trunk? Changelog 2014-07-04 Ganesh Gopalasubramanian * config/aarch64/aarch64.md (define_insn "*prefetch") (define_insn "prefetch"): New * config/aarch64/predicates.md (aarch64_prefetch_pimm) (aarch64_prefetch_unscaled): New. * config/arm/types.md (define_attr "type"): Add prefetch. Regards Ganesh
RE: [PATCH, aarch64] Add prefetch support
PING! From: Gopalasubramanian, Ganesh Sent: Friday, July 04, 2014 5:57 AM To: gcc-patches@gcc.gnu.org Cc: marcus.shawcr...@arm.com; richard.earns...@arm.com Subject: [PATCH, aarch64] Add prefetch support Hi, Attached is a patch that implements * Prefetch with immediate offset in the range 0 to 32760 (multiple of 8). Added a predicate for this. * Prefetch with immediate offset - in the range -256 to 255 (Gets generated only when we have a negative offset. Generates prfum instruction). Added a predicate for this. * Prefetch with register offset. (modified for printing the locality) This patch has been already discussed on https://gcc.gnu.org/ml/gcc-patches/2014-02/msg01644.html "make -k check" passes. Ok for trunk? Changelog 2014-07-04 Ganesh Gopalasubramanian * config/aarch64/aarch64.md (define_insn "*prefetch") (define_insn "prefetch"): New * config/aarch64/predicates.md (aarch64_prefetch_pimm) (aarch64_prefetch_unscaled): New. * config/arm/types.md (define_attr "type"): Add prefetch. Regards Ganesh
[PATCH, aarch64] Add prefetch support
Hi, Attached is a patch that implements * Prefetch with immediate offset in the range 0 to 32760 (multiple of 8). Added a predicate for this. * Prefetch with immediate offset - in the range -256 to 255 (Gets generated only when we have a negative offset. Generates prfum instruction). Added a predicate for this. * Prefetch with register offset. (modified for printing the locality) This patch has been already discussed on https://gcc.gnu.org/ml/gcc-patches/2014-02/msg01644.html "make -k check" passes. Ok for trunk? Changelog 2014-07-04 Ganesh Gopalasubramanian * config/aarch64/aarch64.md (define_insn "*prefetch") (define_insn "prefetch"): New * config/aarch64/predicates.md (aarch64_prefetch_pimm) (aarch64_prefetch_unscaled): New. * config/arm/types.md (define_attr "type"): Add prefetch. Regards Ganesh prefetch.diff Description: prefetch.diff
[PATCH, i386] Remove use of vpmacsdql instruction from multiplication.
Hi, The below patch fixes the issue with 64-bit multiplication. The instruction "vpmacsdql" does signed 32-bit multiplication. For V2DImode, we require widened unsigned multiplication. So, replacing the "vpmacsdql" instruction with "vpmuludq" and "vpaddq". This patch had been already discussed in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52908 With required change in the test xop-imul64-vector.c, make check passes. Is it OK for upstream? Regards Ganesh diff --git a/gcc/ChangeLog b/gcc/ChangeLog index d0a1253..c158612 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,9 @@ +2014-06-10 Ganesh Gopalasubramanian + + * config/i386/i386.c (ix86_expand_sse2_mulvxdi3): Issue instructions +"vpmuludq" and "vpaddq" instead of "vpmacsdql" for handling 32-bit +multiplication. + 2014-06-07 Jan Hubicka * cgraphunit.c (assemble_thunks_and_aliases): Expand thunks before diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 9105132..184d82d 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -45205,8 +45205,10 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx op2) /* t4: ((B*E)+(A*F))<<32, ((D*G)+(C*H))<<32 */ emit_insn (gen_ashlv2di3 (t4, t3, GEN_INT (32))); - /* op0: (((B*E)+(A*F))<<32)+(B*F), (((D*G)+(C*H))<<32)+(D*H) */ - emit_insn (gen_xop_pmacsdql (op0, op1, op2, t4)); + /* Multiply lower parts and add all */ + t5 = gen_reg_rtx (V2DImode); + emit_insn (gen_vec_widen_umult_even_v4si (t5, gen_lowpart (V4SImode, op1), gen_lowpart (V4SImode, op2))); + op0 = expand_binop (mode, add_optab, t5, t4, op0, 1, OPTAB_DIRECT); } else { diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog index a6913af..757d3e3 100644 --- a/gcc/testsuite/ChangeLog +++ b/gcc/testsuite/ChangeLog @@ -1,3 +1,8 @@ +2014-06-10 Ganesh Gopalasubramanian + + * gcc.target/i386/xop-imul64-vector.c: Remove the check for + vpmacsdql instruction. + 2014-06-07 Eric Botcazou * gnat.dg/opt38.adb: New test. diff --git a/gcc/testsuite/gcc.target/i386/xop-imul64-vector.c b/gcc/testsuite/gcc.target/i386/xop-imul64-vector.c index fbf605f..fc8c880 100644 --- a/gcc/testsuite/gcc.target/i386/xop-imul64-vector.c +++ b/gcc/testsuite/gcc.target/i386/xop-imul64-vector.c @@ -33,4 +33,3 @@ int main () /* { dg-final { scan-assembler "vpmulld" } } */ /* { dg-final { scan-assembler "vphadddq" } } */ -/* { dg-final { scan-assembler "vpmacsdql" } } */
RE: [AArch64 05/14] Add AArch64 'prefetch'-pattern.
Hi Philipp, > These changes look good to me. > We'll try them out on the benchmarks that caused us to add prefetching in the > first place. If you are OK, I would like to get these changes upstreamed. -Ganesh -Original Message- From: Dr. Philipp Tomsich [mailto:philipp.toms...@theobroma-systems.com] Sent: Friday, February 28, 2014 2:58 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org; pins...@gmail.com Subject: Re: [AArch64 05/14] Add AArch64 'prefetch'-pattern. Ganesh, On 28 Feb 2014, at 10:13 , Gopalasubramanian, Ganesh wrote: > I also have attached a patch that implements the following. > * Prefetch with immediate offset in the range 0 to 32760 (multiple of 8). > Added a predicate for this. > * Prefetch with immediate offset - in the range -256 to 255 (Gets > generated only when we have a negative offset. Generates prfum instruction). > Added a predicate for this. > * Prefetch with register offset. (modified for printing the locality) These changes look good to me. We'll try them out on the benchmarks that caused us to add prefetching in the first place. Best, Philipp.
RE: [AArch64 05/14] Add AArch64 'prefetch'-pattern.
Avoided top-posting and resending. + /* temporal locality */ + return (INTVAL(operands[1])) ? \"prfm\\tPSTL1KEEP, [%0, #0]\" : +\"prfm\\tPLDL1KEEP, [%0, #0]\"; }" + [(set_attr "type" "prefetch")] +) + With the locality value received in the instruction pattern, I think it would be safe to handle them in prefetch instruction. This helps especially AArch64 has prefetch instructions that can handle this locality. +(define_insn "prefetch" + [(prefetch (match_operand:DI 0 "address_operand" "r") +(match_operand:QI 1 "const_int_operand" "n") +(match_operand:QI 2 "const_int_operand" "n"))] + "" + "* +{ + int locality = INTVAL (operands[2]); + + gcc_assert (IN_RANGE (locality, 0, 3)); + + if (locality == 0) + /* non temporal locality */ + return (INTVAL(operands[1])) ? \"prfm\\tPSTL1STRM, [%0, #0]\" : \"prfm\\tPLDL1STRM, [%0, #0]\"; + + /* temporal locality */ + return (INTVAL(operands[1])) ? \"prfm\\tPSTL%2KEEP, [%0, #0]\" : \"prfm\\tPLDL%2KEEP, [%0, #0]\"; +}" + [(set_attr "type" "prefetch")] +) + I also have attached a patch that implements the following. * Prefetch with immediate offset in the range 0 to 32760 (multiple of 8). Added a predicate for this. * Prefetch with immediate offset - in the range -256 to 255 (Gets generated only when we have a negative offset. Generates prfum instruction). Added a predicate for this. * Prefetch with register offset. (modified for printing the locality) Regards Ganesh prefetchdiff.log Description: prefetchdiff.log
RE: [AArch64 05/14] Add AArch64 'prefetch'-pattern.
With the locality value received in the instruction pattern, I think it would be safe to handle them in prefetch instruction. This helps especially AArch64 has prefetch instructions that can handle this locality. +(define_insn "prefetch" + [(prefetch (match_operand:DI 0 "address_operand" "r") +(match_operand:QI 1 "const_int_operand" "n") +(match_operand:QI 2 "const_int_operand" "n"))] + "" + "* +{ + int locality = INTVAL (operands[2]); + + gcc_assert (IN_RANGE (locality, 0, 3)); + + if (locality == 0) + /* non temporal locality */ + return (INTVAL(operands[1])) ? \"prfm\\tPSTL1STRM, [%0, #0]\" : \"prfm\\tPLDL1STRM, [%0, #0]\"; + + /* temporal locality */ + return (INTVAL(operands[1])) ? \"prfm\\tPSTL%2KEEP, [%0, #0]\" : \"prfm\\tPLDL%2KEEP, [%0, #0]\"; +}" + [(set_attr "type" "prefetch")] +) + I also have attached a patch that implements * Prefetch with immediate offset in the range 0 to 32760 (multiple of 8). Added a predicate for this. * Prefetch with immediate offset - in the range -256 to 255 (Gets generated only when we have a negative offset. Generates prfum instruction). Added a predicate for this. * Prefetch with register offset. (modified for printing the locality) Regards Ganesh -Original Message- From: Philipp Tomsich [mailto:philipp.toms...@theobroma-systems.com] Sent: Wednesday, February 19, 2014 2:40 AM To: gcc-patches@gcc.gnu.org Cc: philipp.toms...@theobroma-systems.com Subject: [AArch64 05/14] Add AArch64 'prefetch'-pattern. --- gcc/config/aarch64/aarch64.md | 17 + gcc/config/arm/types.md | 2 ++ 2 files changed, 19 insertions(+) diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md index 99a6ac8..b972a1b 100644 --- a/gcc/config/aarch64/aarch64.md +++ b/gcc/config/aarch64/aarch64.md @@ -293,6 +293,23 @@ [(set_attr "type" "no_insn")] ) +(define_insn "prefetch" + [(prefetch (match_operand:DI 0 "register_operand" "r") +(match_operand:QI 1 "const_int_operand" "n") +(match_operand:QI 2 "const_int_operand" "n"))] + "" + "* +{ + if (INTVAL(operands[2]) == 0) + /* no temporal locality */ + return (INTVAL(operands[1])) ? \"prfm\\tPSTL1STRM, [%0, #0]\" : +\"prfm\\tPLDL1STRM, [%0, #0]\"; + + /* temporal locality */ + return (INTVAL(operands[1])) ? \"prfm\\tPSTL1KEEP, [%0, #0]\" : +\"prfm\\tPLDL1KEEP, [%0, #0]\"; }" + [(set_attr "type" "prefetch")] +) + (define_insn "trap" [(trap_if (const_int 1) (const_int 8))] "" diff --git a/gcc/config/arm/types.md b/gcc/config/arm/types.md index cc39cd1..1d1280d 100644 --- a/gcc/config/arm/types.md +++ b/gcc/config/arm/types.md @@ -117,6 +117,7 @@ ; mvn_shift_reg inverting move instruction, shifted operand by a register. ; no_insnan insn which does not represent an instruction in the ;final output, thus having no impact on scheduling. +; prefetch a prefetch instruction ; rbit reverse bits. ; revreverse bytes. ; sdiv signed division. @@ -553,6 +554,7 @@ call,\ clz,\ no_insn,\ + prefetch,\ csel,\ crc,\ extend,\ -- 1.9.0 prefetchdiff.log Description: prefetchdiff.log
FW: Non-temporal move
I could see "storent" pattern in x86 machine descriptions (in sse.md)., but internals doc don't mention it. Should we add a description about this in the internals doc? Regards Ganesh
RE: [Patch, i386] PR 59422 - Support more targets for function multi versioning
> I'm sorry I didn't notice previous conversation. Please install ASAP. Thanks Uros! Committed to revision 206210. - Ganesh
RE: [Patch, i386] PR 59422 - Support more targets for function multi versioning
Hi, >> (get_amd_cpu): Handle AMD_BOBCAT, AMD_JAGUAR, AMDFAM15H_BDVER2 and >> AMDFAM15H_BDVER3. As mentioned earlier, we would like to stick with BTVER1 and BTVER2 instead of using BOBCAT or JAGUAR. Attached patch does the changes. Regards Ganesh NameChange.patch Description: NameChange.patch
RE: [Patch, i386] PR 59422 - Support more targets for function multi versioning
> Sorry, I must have been looking at an older version, but as I said I already > did enable it in the latest patch. (see > http://gcc.gnu.org/ml/gcc-patches/2013-12/msg01577.html ) Sorry for causing another revision but we would like to stick with "btver1" and "btver2" rather than "BOBCAT" or "JAGUAR". Therefore the changes would be like Index: gcc/config/i386/i386.c === --- gcc/config/i386/i386.c (revision 206065) +++ gcc/config/i386/i386.c (working copy) @@ -29965,9 +29965,14 @@ P_PROC_SSE4_2, P_POPCNT, P_AVX, +P_PROC_AVX, +P_FMA4, +P_XOP, +P_PROC_XOP, +P_FMA, +P_PROC_FMA, P_AVX2, -P_FMA, -P_PROC_FMA +P_PROC_AVX2 }; enum feature_priority priority = P_ZERO; @@ -29986,11 +29991,15 @@ {"sse", P_SSE}, {"sse2", P_SSE2}, {"sse3", P_SSE3}, + {"sse4a", P_SSE4_a}, {"ssse3", P_SSSE3}, {"sse4.1", P_SSE4_1}, {"sse4.2", P_SSE4_2}, {"popcnt", P_POPCNT}, {"avx", P_AVX}, + {"fma4", P_FMA4}, + {"xop", P_XOP}, + {"fma", P_FMA}, {"avx2", P_AVX2} }; @@ -30044,25 +30053,49 @@ break; case PROCESSOR_COREI7_AVX: arg_str = "corei7-avx"; - priority = P_PROC_SSE4_2; + priority = P_PROC_AVX; break; +case PROCESSOR_HASWELL: + arg_str = "core-avx2"; + priority = P_PROC_AVX2; + break; case PROCESSOR_ATOM: arg_str = "atom"; priority = P_PROC_SSSE3; break; +case PROCESSOR_SLM: + arg_str = "slm"; + priority = P_PROC_SSE4_2; + break; case PROCESSOR_AMDFAM10: arg_str = "amdfam10h"; priority = P_PROC_SSE4_a; break; +case PROCESSOR_BTVER1: + arg_str = "btver1"; + priority = P_PROC_SSE4_a; + break; +case PROCESSOR_BTVER2: + arg_str = "btver2"; + priority = P_PROC_AVX; + break; case PROCESSOR_BDVER1: arg_str = "bdver1"; - priority = P_PROC_FMA; + priority = P_PROC_XOP; break; case PROCESSOR_BDVER2: arg_str = "bdver2"; priority = P_PROC_FMA; break; - } +case PROCESSOR_BDVER3: + arg_str = "bdver3"; + priority = P_PROC_FMA; + break; +case PROCESSOR_BDVER4: + arg_str = "bdver4"; + priority = P_PROC_AVX2; + break; +} } cl_target_option_restore (&global_options, &cur_target); @@ -30922,9 +30955,13 @@ F_SSE2, F_SSE3, F_SSSE3, +F_SSE4_a, F_SSE4_1, F_SSE4_2, F_AVX, +F_FMA4, +F_XOP, +F_FMA, F_AVX2, F_MAX }; @@ -30943,6 +30980,10 @@ M_AMDFAM10H, M_AMDFAM15H, M_INTEL_SLM, +M_INTEL_COREI7_AVX, +M_INTEL_CORE_AVX2, +M_AMD_BTVER1, +M_AMD_BTVER2, M_CPU_SUBTYPE_START, M_INTEL_COREI7_NEHALEM, M_INTEL_COREI7_WESTMERE, @@ -30953,7 +30994,9 @@ M_AMDFAM15H_BDVER1, M_AMDFAM15H_BDVER2, M_AMDFAM15H_BDVER3, -M_AMDFAM15H_BDVER4 +M_AMDFAM15H_BDVER4, +M_INTEL_COREI7_IVYBRIDGE, +M_INTEL_CORE_HASWELL }; static struct _arch_names_table @@ -30971,11 +31014,17 @@ {"corei7", M_INTEL_COREI7}, {"nehalem", M_INTEL_COREI7_NEHALEM}, {"westmere", M_INTEL_COREI7_WESTMERE}, + {"corei7-avx", M_INTEL_COREI7_AVX}, {"sandybridge", M_INTEL_COREI7_SANDYBRIDGE}, + {"ivybridge", M_INTEL_COREI7_IVYBRIDGE}, + {"core-avx2", M_INTEL_CORE_AVX2}, + {"haswell", M_INTEL_CORE_HASWELL}, {"amdfam10h", M_AMDFAM10H}, {"barcelona", M_AMDFAM10H_BARCELONA}, {"shanghai", M_AMDFAM10H_SHANGHAI}, {"istanbul", M_AMDFAM10H_ISTANBUL}, + {"btver1", M_AMD_BTVER1}, + {"btver2", M_AMD_BTVER2}, {"amdfam15h", M_AMDFAM15H}, {"bdver1", M_AMDFAM15H_BDVER1}, {"bdver2", M_AMDFAM15H_BDVER2}, @@ -30997,9 +31046,13 @@ {"sse2", F_SSE2}, {"sse3", F_SSE3}, {"ssse3", F_SSSE3}, + {"sse4a", F_SSE4_a}, {"sse4.1", F_SSE4_1}, {"sse4.2", F_SSE4_2}, {"avx",F_AVX}, + {"fma4", F_FMA4}, + {"xop",F_XOP}, + {"fma",F_FMA}, {"avx2", F_AVX2} }; Index: libgcc/config/i386/cpuinfo.c === --- libgcc/config/i386/cpuinfo.c(revision 206065) +++ libgcc/config/i386/cpuinfo.c(working copy) @@ -62,6 +62,10 @@ AMDFAM10H, AMDFAM15H, INTEL_SLM, + INTEL_COREI7_AVX, + INTEL_CORE_AVX2, + AMD_BTVER1, + AMD_BTVER2, CPU_TYPE_MAX }; @@ -75,6 +79,10
RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4
> Please provide updated ChangeLog. --- gcc/ChangeLog (revision 206106) +++ gcc/ChangeLog (working copy) @@ -1,3 +1,14 @@ +2013-12-19 Ganesh Gopalasubramanian + + * config/i386/i386.c: Include cfgloop.h. + (ix86_loop_memcount): New function. + (ix86_loop_unroll_adjust): New function. + (TARGET_LOOP_UNROLL_ADJUST): Define. + * config/i386/i386.h + (TARGET_ADJUST_UNROLL): Define. + * config/i386/x86-tune.def + (X86_TUNE_ADJUST_UNROLL): Define. + > The function comment is missing. Maybe you should also describe magic number > 32 here? Added the function comment. > Otherwise, the patch looks OK. Thanks. Bootstrapping passes. Is it OK for upstream? > BTW: Please avoid top-posting, see e.g. [1] for reasons... Sorry for the lapse. Will comply. Regards Ganesh unroll-adjust.patch Description: unroll-adjust.patch
RE: [Patch, i386] PR 59422 - Support more targets for function multi versioning
> Yes, I changed that in the last patch, though I consider it momentarily > problematic because you do not yet enable AVX with march=btver2 (AVX versions > would currently be better than btver2 versions for a btver2 arch), but expect march=btver2 will be fixed soon. The " processor_alias_table" entry for "btver2" in i386.c enables AVX. {"btver2", PROCESSOR_BTVER2, CPU_BTVER2, PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3 | PTA_SSSE3 | PTA_SSE4A |PTA_ABM | PTA_CX16 | PTA_SSE4_1 | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_BMI | PTA_F16C | PTA_MOVBE | PTA_PRFCHW | PTA_FXSR | PTA_XSAVE | PTA_XSAVEOPT}, The assembly listing for a simple test (compiled with -march=btver2) also has -mavx enabled. So, can you please enable AVX for btver2? Regards Ganesh
RE: [Patch, i386] PR 59422 - Support more targets for function multi versioning
Ping! "Gopalasubramanian, Ganesh" wrote: > Yes, I figured that was the original idea behind it, but the final family of > the jaguar processors seems to have become 16h instead of 14h (bobcat) at > some point. Yes. It is amdfam16h. I was supposed to pass on some comments on the patch. 1. Amdfam16h for Jaguar. 2. For Jaguar, the priority needs to be AVX (AVX got included into the Jaguar ISA). I have a doubt! What would be done if priority is set to "F_FMA4" instead of "F_XOP" for bdver1? Regards Ganesh
RE: [Patch, i386] PR 59422 - Support more targets for function multi versioning
> Yes, I figured that was the original idea behind it, but the final family of > the jaguar processors seems to have become 16h instead of 14h (bobcat) at > some point. Yes. It is amdfam16h. I was supposed to pass on some comments on the patch. 1. Amdfam16h for Jaguar. 2. For Jaguar, the priority needs to be AVX (AVX got included into the Jaguar ISA). I have a doubt! What would be done if priority is set to "F_FMA4" instead of "F_XOP" for bdver1? Regards Ganesh
RE: [Patch, i386] PR 59422 - Support more targets for function multi versioning
> Btw, I couldn't find anything that corresponds to gcc's btver2 arch. Is that > an old term for what has become the Jaguar architecture? Yes, "btver2" = "jaguar". We have the name as per its family name (i.e, bobcat family) in GCC. Similarly we have the names "bdver2" = "piledriver", "bdver3" = "steamroller" as per their family (bulldozer) name. Regards Ganesh -Original Message- From: Allan Sandfeld Jensen [mailto:carew...@gmail.com] Sent: Monday, December 16, 2013 12:25 AM To: Uros Bizjak Cc: gcc-patches@gcc.gnu.org Subject: Re: [Patch, i386] PR 59422 - Support more targets for function multi versioning Hi again On Wednesday 11 December 2013, Uros Bizjak wrote: > Hello! > > > PR gcc/59422 > > > > This patch extends the supported targets for function multi versiong > > to also include Haswell, Silvermont, and the most recent AMD models. > > It also prioritizes AVX2 versions over AMD specific pre-AVX2 versions. > > Please add a ChangeLog entry and attach the complete patch. Please > also state how you tested the patch, as outlined in the instructions > [1]. > > [1] http://gcc.gnu.org/contribute.html > Updated patch for better CPU model detection and added ChangeLog. The patch has been tested with the attached test.cpp. Verified that it doesn't build before the patch, and that it builds after, and verified it selects correct versions at runtime based on either CPU model or supported ISA (tested on 3 machines: SandyBridge, IvyBridge and Phenom II). Btw, I couldn't find anything that corresponds to gcc's btver2 arch. Is that an old term for what has become the Jaguar architecture? `Allan
RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4
Hi Uros! Accommodated the changes that you mentioned. Completed the bootstrap testing too. Regards Ganesh -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Wednesday, December 04, 2013 3:17 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org; Richard Guenther (richard.guent...@gmail.com) Subject: Re: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4 On Wed, Dec 4, 2013 at 9:39 AM, Gopalasubramanian, Ganesh wrote: > Attached is the revised patch. > The target independent part has been already approved and added. > > This revision of the patch adds a x86 tune definition and checks it while > deciding the unroll factor. > > Accommodated the comments given by you except one. > >> *x will never be null for active insns. > Since every rtx in the insn is checked for memory references, the NULL_RTX > check is required. Yes you are correct. for_each_rtx also passes NULL_RTX, I was distracted by "There are no sub-expressions." comment. +if (NONDEBUG_INSN_P (insn) && INSN_CODE (insn) != -1) Do you need to check for INSN_CODE here? IIRC, checking for NONDEBUG_INSN_P is enough. +for_each_rtx (&insn, (rtx_function) ix86_loop_memcount, &mem_count); +} + free (bbs); + + if (mem_count <=32) +return 32/mem_count; Ouch... mem_count can be zero. Is there a reason to change this part from previous patch? Uros. unroll-adjust.patch Description: unroll-adjust.patch
[patch][wwwdocs] gcc 4.9 changes - AMD new cores
Hello, This patch adds details about new AMD cores that got enabled in GCC-4.9. OK for the wwwdocs? Regards Ganesh cvs diff: Diffing . Index: changes.html === RCS file: /cvs/gcc/wwwdocs/htdocs/gcc-4.9/changes.html,v retrieving revision 1.44 diff -r1.44 changes.html 404a405,407 > Support for new AMD family 15h processors (Excavator core) > is now available through the -march=bdver4 and > -mtune=bdver4 options.
RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4
> Ouch... mem_count can be zero. Is there a reason to change this part from > previous patch? Oops! You're right. I will correct this. The idea is to count the memory references and decide on the unrolling factor. Previous patch does that in two steps I thought of doing that in a single step. (I think I missed my step here ;) ) Regards Ganesh -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Wednesday, December 04, 2013 3:17 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org; Richard Guenther (richard.guent...@gmail.com) Subject: Re: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4 On Wed, Dec 4, 2013 at 9:39 AM, Gopalasubramanian, Ganesh wrote: > Attached is the revised patch. > The target independent part has been already approved and added. > > This revision of the patch adds a x86 tune definition and checks it while > deciding the unroll factor. > > Accommodated the comments given by you except one. > >> *x will never be null for active insns. > Since every rtx in the insn is checked for memory references, the NULL_RTX > check is required. Yes you are correct. for_each_rtx also passes NULL_RTX, I was distracted by "There are no sub-expressions." comment. +if (NONDEBUG_INSN_P (insn) && INSN_CODE (insn) != -1) Do you need to check for INSN_CODE here? IIRC, checking for NONDEBUG_INSN_P is enough. +for_each_rtx (&insn, (rtx_function) ix86_loop_memcount, &mem_count); +} + free (bbs); + + if (mem_count <=32) +return 32/mem_count; Ouch... mem_count can be zero. Is there a reason to change this part from previous patch? Uros.
RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4
Hi Uros! Attached is the revised patch. The target independent part has been already approved and added. This revision of the patch adds a x86 tune definition and checks it while deciding the unroll factor. Accommodated the comments given by you except one. > *x will never be null for active insns. Since every rtx in the insn is checked for memory references, the NULL_RTX check is required. Regards Ganesh -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Friday, November 22, 2013 1:46 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org; Richard Guenther (richard.guent...@gmail.com); borntrae...@de.ibm.com; H.J. Lu (hjl.to...@gmail.com); Jakub Jelinek (ja...@redhat.com) Subject: Re: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4 On Wed, Nov 20, 2013 at 7:26 PM, Gopalasubramanian, Ganesh wrote: > Steamroller processors contain a loop predictor and a loop buffer, which may > make unrolling small loops less important. > When unrolling small loops for steamroller, making the unrolled loop fit in > the loop buffer should be a priority. > > This patch uses a heuristic approach (number of memory references) to decide > the unrolling factor for small loops. > This patch has some noise in SPEC 2006 results. > > Bootstrapping passes. > > I would like to know your comments before committing. Please split the patch to target-dependant and target-independant part, and get target-idependant part reviewed first. This part: + if (ix86_tune != PROCESSOR_BDVER3 && ix86_tune != PROCESSOR_BDVER4) + { +return nunroll; + } is wrong. You should introduce tune variable (as H.J. suggested) and check that variable here. Target dependant tuning options should be in x86-tune.def, so everything regarding tuning can be found in one place. +if (INSN_P (insn) && INSN_CODE (insn) != -1) +for_each_rtx (&insn, (rtx_function) ix86_loop_memcount, &mem_count); if (NONDEBUG_INSN_P (insn)) for_each_rtx (&PATTERN(insn), ...); otherwise your heuristics will depend on -g compile option. + if ( (mem_count*nunroll) <= 32) Extra parenthesis. +static int +ix86_loop_memcount (rtx *x, unsigned *mem_count) { + if (*x != NULL_RTX && MEM_P (*x)) *x will never be null for active insns. Uros. unroll-adjust.patch Description: unroll-adjust.patch
RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4
This patch adds influence of macro TARGET_LOOP_UNROLL_ADJUST during constant iterations (decide_unroll_constant_iterations). The macro has been already checked for runtime iterations (decide_unroll_runtime_iterations), and for unroll stupid (decide_unroll_stupid). Bootstrapping and test passes. Would like to know your comments before committing. Regards Ganesh 2013-11-28 Ganesh Gopalasubramanian * loop-unroll.c (decide_unroll_constant_iterations): Check macro TARGET_LOOP_UNROLL_ADJUST while deciding unroll factor. diff --git a/gcc/loop-unroll.c b/gcc/loop-unroll.c index 9c87167..557915f 100644 --- a/gcc/loop-unroll.c +++ b/gcc/loop-unroll.c @@ -664,6 +664,9 @@ decide_unroll_constant_iterations (struct loop *loop, int flags) if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)) nunroll = PARAM_VALUE (PARAM_MAX_UNROLL_TIMES); + if (targetm.loop_unroll_adjust) +nunroll = targetm.loop_unroll_adjust (nunroll, loop); + /* Skip big loops. */ if (nunroll <= 1) { -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Friday, November 22, 2013 1:46 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org; Richard Guenther (richard.guent...@gmail.com); borntrae...@de.ibm.com; H.J. Lu (hjl.to...@gmail.com); Jakub Jelinek (ja...@redhat.com) Subject: Re: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4 On Wed, Nov 20, 2013 at 7:26 PM, Gopalasubramanian, Ganesh wrote: > Steamroller processors contain a loop predictor and a loop buffer, which may > make unrolling small loops less important. > When unrolling small loops for steamroller, making the unrolled loop fit in > the loop buffer should be a priority. > > This patch uses a heuristic approach (number of memory references) to decide > the unrolling factor for small loops. > This patch has some noise in SPEC 2006 results. > > Bootstrapping passes. > > I would like to know your comments before committing. Please split the patch to target-dependant and target-independant part, and get target-idependant part reviewed first.
RE: [PATCH, i386]: Fix PR56788, _mm_frcz_sd and _mm_frcz_ss ignore their second argument
> Hopefully someone from AMD will provide tests that are mysteriously missing > from XOP testsuite. As pointed out by Marc, I added myself to the bug later. I was bit confused about the "internal insn representation" with "user-visible function". So, couldn't add test then and there. I could have solved that earlier. Sorry for that. Attached is the test that checks the (controversial) "frcz" functions. Uros could you please add this to your patch while committing. Regards Ganesh -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Saturday, November 23, 2013 6:49 PM To: gcc-patches@gcc.gnu.org Cc: Cong Hou; Marc Glisse; Gopalasubramanian, Ganesh Subject: [PATCH, i386]: Fix PR56788, _mm_frcz_sd and _mm_frcz_ss ignore their second argument Hello! Attached patch fixes PR56788, where _mm_frcz_{ss,sd} intrinsics ignored their second argument. As explained in the PR [1], gcc implements two-operand "vector-merge" form as documented in Microsoft's definition [2]. However, in contrast to other SSE scalar insns, the instruction itself clears upper bits to zero. There were a couple of problems: the builtin was declared as builtin with two input operands, but the number of input operands didn't correspond to referred insn pattern, leaving its second operand uninitialized. The intrinsic was also implemented without necessary movss/movsd fixup that would merge both its operands in a correct way. Please also note that the definition in clang is wrong. I didn't include any testcase in the patch, since I don't have access to XOP target. Hopefully someone from AMD will provide tests that are mysteriously missing from XOP testsuite. 2013-11-23 Uros Bizjak PR target/56788 * config/i386/i386.c (bdesc_multi_arg) : Declare as MULTI_ARG_1_SF instruction. : Decleare as MULTI_ARG_1_DF instruction. * config/i386/sse.md (*xop_vmfrcz2): Rename from *xop_vmfrcz_. * config/i386/xopintrin.h (_mm_frcz_ss): Use __builtin_ia32_movss to merge scalar result with __A. (_mm_frcz_sd): Use __builtin_ia32_movsd to merge scalar result with __A. Patch was bootstrapped and regression tested on x86_64-pc-linux-gnu {,-m32}. The patch was committed to mainline SVN and will be committed to other release branches in a couple of days (hopefully with additional tests). [1] http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56788 [2] http://msdn.microsoft.com/en-us/library/vstudio/gg445126%28v=vs.100%29.aspx Uros. #include #include "m128-check.h" void check_mm_vmfrcz_sd (__m128d __A, __m128d __B) { union128d a, b, c; double d[2]; a.x = __A; b.x = __B; c.x = _mm_frcz_sd (__A, __B); d[0] = b.a[0] - (int)b.a[0] ; d[1] = a.a[1]; if (check_union128d (c, d)) abort (); } void check_mm_vmfrcz_ss (__m128 __A, __m128 __B) { union128 a, b, c; float f[4]; a.x = __A; b.x = __B; c.x = _mm_frcz_ss (__A, __B); f[0] = b.a[0] - (int)b.a[0] ; f[1] = a.a[1]; f[2] = a.a[2]; f[3] = a.a[3]; if (check_union128 (c, f)) abort (); } void main (void) { union128 a, b; union128d c,d; int i; for (i = 0; i < 4; i++) { a.a[i] = i + 3.5; b.a[i] = i + 7.9; } for (i = 0; i < 2; i++) { c.a[i] = i + 3.5; d.a[i] = i + 7.987654321; } check_mm_vmfrcz_ss (a.x, b.x); check_mm_vmfrcz_sd (c.x, d.x); }
RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4
Ping! -Original Message- From: Gopalasubramanian, Ganesh Sent: Thursday, November 21, 2013 10:35 AM To: 'H.J. Lu' Cc: gcc-patches@gcc.gnu.org; Uros Bizjak (ubiz...@gmail.com); Richard Guenther (richard.guent...@gmail.com); borntrae...@de.ibm.com; Jakub Jelinek (ja...@redhat.com) Subject: RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4 > I suggest you add this to x86-tune.def and enable it for > bdver3 and bdver4. The macro TARGET_LOOP_UNROLL_ADJUST is not new. It is already available and is used by target s390. Since it is not an "x86 only" feature I didn't add that in x86-tune.def. Regards Ganesh -Original Message- From: H.J. Lu [mailto:hjl.to...@gmail.com] Sent: Thursday, November 21, 2013 12:02 AM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org; Uros Bizjak (ubiz...@gmail.com); Richard Guenther (richard.guent...@gmail.com); borntrae...@de.ibm.com; Jakub Jelinek (ja...@redhat.com) Subject: Re: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4 On Wed, Nov 20, 2013 at 10:26 AM, Gopalasubramanian, Ganesh wrote: > Hi, > > Steamroller processors contain a loop predictor and a loop buffer, which may > make unrolling small loops less important. > When unrolling small loops for steamroller, making the unrolled loop fit in > the loop buffer should be a priority. > > This patch uses a heuristic approach (number of memory references) to decide > the unrolling factor for small loops. > This patch has some noise in SPEC 2006 results. > > Bootstrapping passes. > > I would like to know your comments before committing. > I suggest you add this to x86-tune.def and enable it for bdver3 and bdver4. -- H.J.
RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4
> I suggest you add this to x86-tune.def and enable it for > bdver3 and bdver4. The macro TARGET_LOOP_UNROLL_ADJUST is not new. It is already available and is used by target s390. Since it is not an "x86 only" feature I didn't add that in x86-tune.def. Regards Ganesh -Original Message- From: H.J. Lu [mailto:hjl.to...@gmail.com] Sent: Thursday, November 21, 2013 12:02 AM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org; Uros Bizjak (ubiz...@gmail.com); Richard Guenther (richard.guent...@gmail.com); borntrae...@de.ibm.com; Jakub Jelinek (ja...@redhat.com) Subject: Re: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4 On Wed, Nov 20, 2013 at 10:26 AM, Gopalasubramanian, Ganesh wrote: > Hi, > > Steamroller processors contain a loop predictor and a loop buffer, which may > make unrolling small loops less important. > When unrolling small loops for steamroller, making the unrolled loop fit in > the loop buffer should be a priority. > > This patch uses a heuristic approach (number of memory references) to decide > the unrolling factor for small loops. > This patch has some noise in SPEC 2006 results. > > Bootstrapping passes. > > I would like to know your comments before committing. > I suggest you add this to x86-tune.def and enable it for bdver3 and bdver4. -- H.J.
[RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4
Hi, Steamroller processors contain a loop predictor and a loop buffer, which may make unrolling small loops less important. When unrolling small loops for steamroller, making the unrolled loop fit in the loop buffer should be a priority. This patch uses a heuristic approach (number of memory references) to decide the unrolling factor for small loops. This patch has some noise in SPEC 2006 results. Bootstrapping passes. I would like to know your comments before committing. Regards Ganesh loop_unroll_bdver3.patch Description: loop_unroll_bdver3.patch
RE: Honnor ix86_accumulate_outgoing_args again
> we are going to have some AMD CPU with AVX2 support soon, the question is > if it will prefer 256-bit vmovups/vmovupd/vmovdqu or split, but even > if it will prefer split, the question is if like bdver{1,2,3} it will > be X86_TUNE_AVX128_OPTIMAL, because if yes, then how 256-bit unaligned > loads/stores are handled is much less important there. Ganesh? 256-bit is friendly on bdver4. But, 256 bit unaligned stores are micro-coded which we would like to avoid. So we require 128-bit MOVUPS. -Original Message- From: Jakub Jelinek [mailto:ja...@redhat.com] Sent: Tuesday, November 12, 2013 3:57 PM To: Jan Hubicka Cc: H.J. Lu; Vladimir Makarov; GCC Patches; Uros Bizjak; Richard Henderson; Gopalasubramanian, Ganesh Subject: Re: Honnor ix86_accumulate_outgoing_args again On Tue, Nov 12, 2013 at 11:05:45AM +0100, Jan Hubicka wrote: > > @@ -16576,7 +16576,7 @@ ix86_avx256_split_vector_move_misalign (rtx > > op0, rtx op1) > > > >if (MEM_P (op1)) > > { > > - if (TARGET_AVX256_SPLIT_UNALIGNED_LOAD) > > + if (!TARGET_AVX2 && TARGET_AVX256_SPLIT_UNALIGNED_LOAD) > > { > >rtx r = gen_reg_rtx (mode); > >m = adjust_address (op1, mode, 0); @@ -16596,7 +16596,7 @@ > > ix86_avx256_split_vector_move_misalign (rtx op0, rtx op1) > > } > >else if (MEM_P (op0)) > > { > > - if (TARGET_AVX256_SPLIT_UNALIGNED_STORE) > > + if (!TARGET_AVX2 && TARGET_AVX256_SPLIT_UNALIGNED_STORE) > > I would add explanation comment on those two. Looking at http://gcc.gnu.org/ml/gcc-patches/2013-11/msg01235.html we are going to have some AMD CPU with AVX2 support soon, the question is if it will prefer 256-bit vmovups/vmovupd/vmovdqu or split, but even if it will prefer split, the question is if like bdver{1,2,3} it will be X86_TUNE_AVX128_OPTIMAL, because if yes, then how 256-bit unaligned loads/stores are handled is much less important there. Ganesh? > Shall we also disable argument accumulation for cores? It seems we won't > solve the IRA issues, right? You mean LRA issues here, right? If you are starting to use no-accumulate-outgoing-args much more often than in the past, I think the problem that LRA forces a frame pointer in that case is much more important now (or has that been fixed in the mean time?). Vlad? Jakub
RE: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips
Thanks Honza! I have committed changes ( for default ). http://gcc.gnu.org/viewcvs/gcc?view=revision&revision=204442 I will add lookahead value 8 for O3 after experimenting with it. Regards Ganesh -Original Message- From: Jan Hubicka [mailto:hubi...@ucw.cz] Sent: Wednesday, October 30, 2013 1:54 AM To: Richard Biener Cc: Jan Hubicka; Gopalasubramanian, Ganesh; gcc-patches@gcc.gnu.org; Uros Bizjak (ubiz...@gmail.com); H.J. Lu (hjl.to...@gmail.com) Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips > On Fri, 25 Oct 2013, Jan Hubicka wrote: > > > > > OK, so it is about 2%. Did you try if you need lookahead even in the > > > > early pass (before reload)? My guess would be so, but if not, it could > > > > cut the cost to half. For -Ofast/-O3 it looks resonable to me, but we > > > > will need to announce it on the ML. For other settings I think we > > > > need to work on more improvements or cut the expenses. > > > > > > Yes, it is required before reload. > > > > > > I have another idea which can be pondered upon. Currently, can we enable > > > lookahead with the value 4 (pre reload) for default? This will > > > exponentially cut the cost of build time. > > > I have done some measurements on the build time of some benchmarks > > > (mentioned below) with lookahead value 4. The 2% increase in build time > > > with value 8 is now almost gone. > > > > > >dfa4 no_lookahead > > > > > > perlbench - 191s 193s > > > bzip2 - 19s 19s > > > gcc - 429s 429s > > > mcf - 3s3s > > > gobmk - 116s 115s > > > hmmer - 60s 60s > > > sjeng - 18s 17s > > > libquantum - 6s6s > > > h264ref - 107s 107s > > > omnetpp - 128s 128s > > > astar - 7s7s > > > bwaves - 5s5s > > > gamess - 1964s 1957s > > > milc- 18s 18s > > > GemsFDTD- 273s 272s > > > > > > Lookahead value 4 also helps because, the modified decoder model in > > > bdver3.md is only two cycles deep (though in hardware it is actually 4 > > > cycles deep). This means that we can look another two levels deep for > > > better schedule. > > > GemsFDTD still retains the performance boost of around 6-7% with value 4. > > > > > > Let me know your thoughts. > > > > This seems resonable. I would go for lookahead of 4 for now and 8 > > for -Ofast and we can tune things based on the experience with this setting > > incrementally. > > Uros, Richard, what do you think? > > Well, certainly -O3 not -Ofast. Yes, enabling 4 by default and 8 at -O3 seems fine to me. Honza > > Richard.
RE: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips
> OK, so it is about 2%. Did you try if you need lookahead even in the early > pass (before reload)? My guess would be so, but if not, it could cut the > cost to half. For -Ofast/-O3 it looks resonable to me, but we will need to > announce it on the ML. For other settings I think we need to work on more > improvements or cut the expenses. Yes, it is required before reload. I have another idea which can be pondered upon. Currently, can we enable lookahead with the value 4 (pre reload) for default? This will exponentially cut the cost of build time. I have done some measurements on the build time of some benchmarks (mentioned below) with lookahead value 4. The 2% increase in build time with value 8 is now almost gone. dfa4 no_lookahead perlbench - 191s 193s bzip2 - 19s 19s gcc - 429s 429s mcf - 3s3s gobmk - 116s 115s hmmer - 60s 60s sjeng - 18s 17s libquantum - 6s6s h264ref - 107s 107s omnetpp - 128s 128s astar - 7s7s bwaves - 5s5s gamess - 1964s 1957s milc- 18s 18s GemsFDTD- 273s 272s Lookahead value 4 also helps because, the modified decoder model in bdver3.md is only two cycles deep (though in hardware it is actually 4 cycles deep). This means that we can look another two levels deep for better schedule. GemsFDTD still retains the performance boost of around 6-7% with value 4. Let me know your thoughts. Regards Ganesh -Original Message- From: Jan Hubicka [mailto:hubi...@ucw.cz] Sent: Thursday, October 24, 2013 6:48 PM To: Gopalasubramanian, Ganesh Cc: Jan Hubicka; gcc-patches@gcc.gnu.org; Uros Bizjak (ubiz...@gmail.com); H.J. Lu (hjl.to...@gmail.com) Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips > Hi, > > > Is this with -fschedule-insns? Or only with default settings? Did you test > > the compile time implications of increasing the lookahead? (value of 8 is > > very large, we may consider enbling it only for -Ofast, limiting for > > postreload only or something similar). > > The improvement is seen with the options "-fschedule-insns -fschedule-insns2 > -fsched-pressure" > > Below are the build times of some of the SPEC benchmarks > > dfa8 no_lookahead > > perlbench - 196s 193s > bzip2 - 19s 19s > gcc - 439s 429s > mcf - 3s3s > gobmk - 119s 115s > hmmer - 62s 60s > sjeng - 18s 17s > libquantum - 6s6s > h264ref - 110s 107s > omnetpp - 132s 128s > astar - 7s7s > bwaves - 4s5s > gamess - 1996s 1957s > milc- 18s 18s > GemsFDTD- 276s 272s > > I think we can enable it by default rather than for -Ofast. > Please let me know your inputs. OK, so it is about 2%. Did you try if you need lookahead even in the early pass (before reload)? My guess would be so, but if not, it could cut the cost to half. For -Ofast/-O3 it looks resonable to me, but we will need to announce it on the ML. For other settings I think we need to work on more improvmeents or cut the expenses. Honza > > Regards > Ganesh > > -Original Message- > From: Jan Hubicka [mailto:hubi...@ucw.cz] > Sent: Thursday, October 24, 2013 2:54 PM > To: Gopalasubramanian, Ganesh > Cc: gcc-patches@gcc.gnu.org; Uros Bizjak (ubiz...@gmail.com); > hubi...@ucw.cz; H.J. Lu (hjl.to...@gmail.com) > Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for > modern x86 chips > > > Attached is the patch which does the following scheduler related changes. > > * re-models bdver3 decoder. > > * It enables lookahead with value 8 for all BD architectures. The patch > > doesn't consider if reloading is completed or not (an area that needs to be > > worked on). > > * The issue rate for BD architectures are set to 4. > > > > I see the following performance improvements on bdver3 machine. > > * GemsFDTD improves by 6-7% with lookahead value changed to 8. > > * Hmmer improves by 9% when issue rate when set to 4 . > > Is this with -fschedule-insns? Or only with default settings? Did you test > the compile time implications of increasing the lookahead? (value of 8 is > very large, we may consider enbling it only for -Ofast, limiting for > postreload only or something sim
RE: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips
Hi, > Is this with -fschedule-insns? Or only with default settings? Did you test > the compile time implications of increasing the lookahead? (value of 8 is > very large, we may consider enbling it only for -Ofast, limiting for > postreload only or something similar). The improvement is seen with the options "-fschedule-insns -fschedule-insns2 -fsched-pressure" Below are the build times of some of the SPEC benchmarks dfa8 no_lookahead perlbench - 196s 193s bzip2 - 19s 19s gcc - 439s 429s mcf - 3s3s gobmk - 119s 115s hmmer - 62s 60s sjeng - 18s 17s libquantum - 6s6s h264ref - 110s 107s omnetpp - 132s 128s astar - 7s7s bwaves - 4s5s gamess - 1996s 1957s milc- 18s 18s GemsFDTD- 276s 272s I think we can enable it by default rather than for -Ofast. Please let me know your inputs. Regards Ganesh -Original Message- From: Jan Hubicka [mailto:hubi...@ucw.cz] Sent: Thursday, October 24, 2013 2:54 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org; Uros Bizjak (ubiz...@gmail.com); hubi...@ucw.cz; H.J. Lu (hjl.to...@gmail.com) Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips > Attached is the patch which does the following scheduler related changes. > * re-models bdver3 decoder. > * It enables lookahead with value 8 for all BD architectures. The patch > doesn't consider if reloading is completed or not (an area that needs to be > worked on). > * The issue rate for BD architectures are set to 4. > > I see the following performance improvements on bdver3 machine. > * GemsFDTD improves by 6-7% with lookahead value changed to 8. > * Hmmer improves by 9% when issue rate when set to 4 . Is this with -fschedule-insns? Or only with default settings? Did you test the compile time implications of increasing the lookahead? (value of 8 is very large, we may consider enbling it only for -Ofast, limiting for postreload only or something similar). > > I have considered the following hardware details for the model. > * There are four decoders inside a hardware decoder block. > * These four independent decoders can execute in parallel. (They can take 8B > from four different instructions and decode). > * These four decoders are pipelined 4 cycles deep and are non-stalling. > * Each decoder takes 8B of instruction data every cycle and tries decoding > it. > * Issue rate is 4. What is the overall limitation on number of bytes the instructions can occupy? I think they need to fit into 2 16 byte windows, right? In that case we may want to tweak the existing corei7 scheduling code to take care of this. Making scheduler not overly optimistic about the parallelism is good since it will make less register pressure during the first pass. > > Is it OK for upstream? Otherwise the patch seems OK, but I would like to know the compile time effect first. Honza > > Changelog > > 2013-10-24 Ganesh Gopalasubramanian > > > * config/i386/bdver3.md : Added two additional decoder units > to support issue rate of 4 and remodeled vector unit. > > * config/i386/i386.c (ix86_issue_rate): Issue rate for BD > architectures is set to 4. > > * config/i386/i386.c (ia32_multipass_dfa_lookahead): DFA > lookahead is set to 8 for BD architectures. > > Regards > Ganesh >
RE: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips
Attached is the patch which does the following scheduler related changes. * re-models bdver3 decoder. * It enables lookahead with value 8 for all BD architectures. The patch doesn't consider if reloading is completed or not (an area that needs to be worked on). * The issue rate for BD architectures are set to 4. I see the following performance improvements on bdver3 machine. * GemsFDTD improves by 6-7% with lookahead value changed to 8. * Hmmer improves by 9% when issue rate when set to 4 . I have considered the following hardware details for the model. * There are four decoders inside a hardware decoder block. * These four independent decoders can execute in parallel. (They can take 8B from four different instructions and decode). * These four decoders are pipelined 4 cycles deep and are non-stalling. * Each decoder takes 8B of instruction data every cycle and tries decoding it. * Issue rate is 4. Is it OK for upstream? Changelog 2013-10-24 Ganesh Gopalasubramanian * config/i386/bdver3.md : Added two additional decoder units to support issue rate of 4 and remodeled vector unit. * config/i386/i386.c (ix86_issue_rate): Issue rate for BD architectures is set to 4. * config/i386/i386.c (ia32_multipass_dfa_lookahead): DFA lookahead is set to 8 for BD architectures. Regards Ganesh bdver3_issue_rate_lookahead.patch Description: bdver3_issue_rate_lookahead.patch
RE: [PATCH,i386] Enable FMA4 for AMD bdver3
> 4.8.2 is already rolling, so too late for that. Is 4.8 branch (gcc/branches/gcc-4_8-branch) open? If yes, shall I commit these changes? Regards Ganesh -Original Message- From: Jakub Jelinek [mailto:ja...@redhat.com] Sent: Wednesday, October 16, 2013 12:41 PM To: Uros Bizjak Cc: Gopalasubramanian, Ganesh; gcc-patches@gcc.gnu.org Subject: Re: [PATCH,i386] Enable FMA4 for AMD bdver3 On Wed, Oct 16, 2013 at 09:00:58AM +0200, Uros Bizjak wrote: > On Wed, Oct 16, 2013 at 8:28 AM, Gopalasubramanian, Ganesh > wrote: > > > The below patch enables FMA4 for AMD bdver3 architectures. > > > > "make -k check" passes. > > > > +2013-10-16 Ganesh Gopalasubramanian > > + > > + > > + * config/i386/i386.c (ix86_option_override_internal): Enable FMA4 > > + for AMD bdver3. > > + > > OK for mainline and 4.8 branch (for 4.8.2 if approved by Jakub, > otherwise please wait for branch to open). 4.8.2 is already rolling, so too late for that. Jakub
[PATCH,i386] Enable FMA4 for AMD bdver3
Hi The below patch enables FMA4 for AMD bdver3 architectures. "make -k check" passes. Is it OK for upstream? Regards Ganesh diff --git a/gcc/ChangeLog b/gcc/ChangeLog index fb5b267..cbb5311 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,8 @@ +2013-10-16 Ganesh Gopalasubramanian + + * config/i386/i386.c (ix86_option_override_internal): Enable FMA4 + for AMD bdver3. + 2013-10-16 Hans-Peter Nilsson * config/cris/t-elfmulti (MULTILIB_OPTIONS, MULTILIB_DIRNAMES) diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index b5796db..c24ce36 100644 --- a/gcc/config/i386/i386.c +++ b/gcc/config/i386/i386.c @@ -3104,7 +3104,7 @@ ix86_option_override_internal (bool main_args_p, {"bdver3", PROCESSOR_BDVER3, CPU_BDVER3, PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3 | PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1 - | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX + | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_FMA4 | PTA_XOP | PTA_LWP | PTA_BMI | PTA_TBM | PTA_F16C | PTA_FMA | PTA_PRFCHW | PTA_FXSR | PTA_XSAVE | PTA_XSAVEOPT | PTA_FSGSBASE},
RE: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips
Hi Honza! > I will give it a try. I understand how the proposed patch would be. That is OK to me. > OK, my undertanding always was, that the decoders works sort of sequentially. > > One decoder of bdver1 can do > vector, > double, single, > single, single, signle > >where decoder 1 is somehow special but hardware is able to swap first and >second single. Now if two decoders run, i would expect > > vector, vector > single, single, signle > double, single, double, single > > to be decoded in once cycle My understanding on the decode unit is mentioned below. Please correct me if I am wrong. The sequential allotment of decoders is not there for bdver1. Intel Sandybridge\core2 have four decoders. The first decoder is special for intel processors. For ex, In Sandybridge, the instructions that have only one µop can be decoded by any of the four decoders. Instructions that have up to four µops will be decoded by the first decoder (of the four decoders available) only. Bdver1 has four decoders. None of them is special unlike intel processors. For bdver1, microcoded instructions are single issue. All four decoders are engaged for decoding microcoded instructions. Decode unit of bdver3 has following specifications * Four independent decoders which can execute in parallel * Microcoded instructions are single issue. (All four decoders are engaged). This means that only one vectorpath instruction get issued in one cycle. * The additional hardware instruction decoder increases the instruction decode capacity to eight instructions per clock cycle. * The decoders are pipelined 4 cycles deep and are non-stalling. So modeling vectorpath instructions is straightforward. No instructions are issued along with vector instructions. We need to model only fastpath single and fastpath double instructions. There are four decoders and they can execute in parallel. So they can take either two double or four single instructions. We also don't need to model them in two stage as there is no sequence involved. So, the modeling can be done such that in one cycle we schedule 2 singles + 1 double (or) 4 singles (or) 2 doubles. I have tried to model this for bdver3 (code changes are mentioned below). Please let me know your opinion. Regards Ganesh Patch - diff --git a/gcc/config/i386/bdver3.md b/gcc/config/i386/bdver3.md index 52418b5..9e59395 100644 --- a/gcc/config/i386/bdver3.md +++ b/gcc/config/i386/bdver3.md @@ -34,6 +34,8 @@ (define_cpu_unit "bdver3-decode0" "bdver3") (define_cpu_unit "bdver3-decode1" "bdver3") +(define_cpu_unit "bdver3-decode2" "bdver3") +(define_cpu_unit "bdver3-decode3" "bdver3") (define_cpu_unit "bdver3-decodev" "bdver3") ;; Double decoded instructions take two cycles whereas @@ -42,12 +44,15 @@ ;; two decoders in two cycles. ;; Vectorpath instructions are single issue instructions. ;; So, we have separate unit for vector instructions. -(exclusion_set "bdver3-decodev" "bdver3-decode0,bdver3-decode1") +(exclusion_set "bdver3-decodev" "bdver3-decode0,bdver3-decode1,bdver3-decode2,bdver3-decode3") (define_reservation "bdver3-vector" "bdver3-decodev") -(define_reservation "bdver3-direct" "(bdver3-decode0|bdver3-decode1)") +(define_reservation "bdver3-direct" "(bdver3-decode0|bdver3-decode1|bdver3-decoder2|bdver3-decoder3)") -(define_reservation "bdver3-double" "(bdver3-decode0|bdver3-decode1)*2") +(define_reservation "bdver3-double" "(bdver3-decode0+bdver3-decode1)| + (bdver3-decode1+bdver3-decode2)|(bdver3-decode2+bdver3-decode3)| + (bdver3-decode0+bdver3-decode2)|(bdver3-decode1+bdver3-decode3)| + (bdver3-decode0+bdver3-decode3)") -Original Message- From: Jan Hubicka [mailto:hubi...@ucw.cz] Sent: Wednesday, October 09, 2013 7:18 PM To: Gopalasubramanian, Ganesh Cc: Jan Hubicka; gcc-patches@gcc.gnu.org; hjl.to...@gmail.com Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips > Before merging the insn reservations, I need to compare the latency values > for bdver1 and bdver3. I know that they are different for some of the > instructions. > In that case, the merging should prop up another subset of latency > differences. I would like to keep these insn reservations in two .md files > (one for bdver1 and one for bdver3) even after the merger. I am not really insisting on merging (define_insn_reservation "bdver3*") with (define_insn_reservation "bdver1*). What I have in mind is merging actual atuomatons in cases it makes sense. Latencies are not really encoded in those. Bdver 12 has: (define_automaton "bdver1,bdver1_ieu,bdver1_load,bdver1_fp,bdver1_agu
RE: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips
Hi Honza, > Yep, I think we need to merge only those autmatas tha are same for both: > (define_automaton "bdver3,bdver3_ieu,bdver3_load,bdver3_fp,bdver3_agu") > probably can become > (define_automaton "bdver3,bdver3_fp") > with the corresponding reservations using bdver3_ieu,bdver3_load,bdver3_agu > changed to bdver1 > automaton. I think it should result in smaller binary - the fact that all > conditionals are > physically duplicated in bdver1/bdev3.md should be optimized away by > genautomata. Before merging the insn reservations, I need to compare the latency values for bdver1 and bdver3. I know that they are different for some of the instructions. In that case, the merging should prop up another subset of latency differences. I would like to keep these insn reservations in two .md files (one for bdver1 and one for bdver3) even after the merger. > Your version has problem that it does not model the thing that the two > decoders works sequentially. The two stage modeling is required so that the decode unit reservations are screened from other unit reservations. But this sort of goes away in bdver3 because of the decode cycle. In bdver3, the decode units scan two of these windows every "two" cycles decoding a maximum of eight instructions. The hardware scan is done every two cycles in bdver3 whereas it is done every single cycle in bdver1/bdver2. (But we have two separate hardware decoders which guarantees higher throughput) This means that the two stage modeling is not required in the scheduler descriptions since the hardware sort of guarantees that with its scanning mechanism. Our job is to make sure that 8 direct instructions get scheduled in two cycles or 4 double instructions get scheduled in two cycles. So, I have modeled the bdver3 decoders such that with in a cycle they guarantee to issue 4 direct instructions or 2 double instructions. This eliminates the sequencing problem in modeling decoders and also ensures that the issue rate can be numbered for a single cycle rather than two cycles. This is one of the reasons why I remodeled only bdver3. Let me know your comments on this. > We can also experiment with defining TARGET_SCHED_VARIABLE_ISSUE to get more > realistic estimates on what still can be issued - the value of 6 is > unrealistically high. This would get more complicated if we go by decoder capacity in bdver3. As we have two hardware decoders in steamroller (bdver3), they have a capacity to decode eight instructions per clock cycle, providing up to twice the decode and dispatch bandwidth compared to bdver1. If we model this in GCC we need to change the issue rate to 8. If 6 is high, then 8 would add more joy and excitement. TARGET_SCHED_VARIABLE_ISSUE is a nice suggestion to schedule instructions in different way. > We also should enable ia32_multipass_dfa_lookahead - with that scheduler > should be able to put double decoded and vector decoded insns on the proper > places. Yes. Whenever we have this scheduler analysis in place we discuss about this but unfortunately is left as it is. I will look into this after I do the enablement for bdver4. > I will work on replacing most of the CPU cases into tuning flags + costs. I am planning to get bdver4 enablement in place once scheduler descriptions for bdver3 is done with. I will have cycles to look into the cost models. Please delegate some tasks if you can and I am willing to take them up. Regards Ganesh -Original Message- From: Jan Hubicka [mailto:hubi...@ucw.cz] Sent: Tuesday, October 08, 2013 3:20 PM To: Gopalasubramanian, Ganesh Cc: Jan Hubicka; gcc-patches@gcc.gnu.org; hjl.to...@gmail.com Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips > Hi Honza, > > I am planning to update the scheduler descriptions for bdver3 first. > Attached is the patch. Please let me know your comments if any. > > Though I agree on merging bdver1/2 and bdver3 on most parts, the FP lines and > decoding schemes are different. So, let me know how can I approach merging > these. Yep, I think we need to merge only those autmatas tha are same for both: (define_automaton "bdver3,bdver3_ieu,bdver3_load,bdver3_fp,bdver3_agu") probably can become (define_automaton "bdver3,bdver3_fp") with the corresponding reservations using bdver3_ieu,bdver3_load,bdver3_agu changed to bdver1 automaton. I think it should result in smaller binary - the fact that all conditionals are physically duplicated in bdver1/bdev3.md should be optimized away by genautomata. I also played a bit with the decoders and I am attaching my version - that seems SPEC neutral though. Your version has problem that it does not model the thing that the two decoders works sequentially. I removed the bdver1-decodev unit and instead i simply reserve all thre decoders + I added pr
RE: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips
Hi Honza, I am planning to update the scheduler descriptions for bdver3 first. Attached is the patch. Please let me know your comments if any. Though I agree on merging bdver1/2 and bdver3 on most parts, the FP lines and decoding schemes are different. So, let me know how can I approach merging these. Regards Ganesh -Original Message- From: Jan Hubicka [mailto:hubi...@ucw.cz] Sent: Monday, September 30, 2013 4:47 PM To: gcc-patches@gcc.gnu.org; Gopalasubramanian, Ganesh; hjl.to...@gmail.com Subject: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips Hi, while looking into schedules produced for Buldozer and Core I noticed that they do not seem to match reality. This is because ix86_issue_rate limits those CPUs into 3 instructions per cycle, while they are designed to do 4 and somewhat confused ix86_adjust_cost. I also added stack engine into modern chips even though scheduler doesn't really understand that multiple push operations can happen in one cycle. At least it gets the stack updates in sequences of push/pop operations. I did not updated buldozer issue rates yet. The current scheduler model won't allow it to execute more than 3 instructions per cycle (and 2 for version 3). I think bdver1.md/bdver3.md needs to be updated first. I am testing x86_64-linux and will commit if there are no complains. Honza * i386.c (ix86_issue_rate): Pentium4/Nocona issue 2 instructions per cycle, Core/CoreI7/Haswell 4 instructions per cycle. (ix86_adjust_cost): Add stack engine to modern AMD chips; fix for core; remove Atom that mistakely shared code with AMD. Index: config/i386/i386.c === --- config/i386/i386.c (revision 203011) +++ config/i386/i386.c (working copy) @@ -24435,17 +24435,14 @@ ix86_issue_rate (void) case PROCESSOR_SLM: case PROCESSOR_K6: case PROCESSOR_BTVER2: +case PROCESSOR_PENTIUM4: +case PROCESSOR_NOCONA: return 2; case PROCESSOR_PENTIUMPRO: -case PROCESSOR_PENTIUM4: -case PROCESSOR_CORE2: -case PROCESSOR_COREI7: -case PROCESSOR_HASWELL: case PROCESSOR_ATHLON: case PROCESSOR_K8: case PROCESSOR_AMDFAM10: -case PROCESSOR_NOCONA: case PROCESSOR_GENERIC: case PROCESSOR_BDVER1: case PROCESSOR_BDVER2: @@ -24453,6 +24450,11 @@ ix86_issue_rate (void) case PROCESSOR_BTVER1: return 3; +case PROCESSOR_CORE2: +case PROCESSOR_COREI7: +case PROCESSOR_HASWELL: + return 4; + default: return 1; } @@ -24709,10 +24711,15 @@ ix86_adjust_cost (rtx insn, rtx link, rt case PROCESSOR_BDVER3: case PROCESSOR_BTVER1: case PROCESSOR_BTVER2: -case PROCESSOR_ATOM: case PROCESSOR_GENERIC: memory = get_attr_memory (insn); + /* Stack engine allows to execute push&pop instructions in parall. */ + if (((insn_type == TYPE_PUSH || insn_type == TYPE_POP) + && (dep_insn_type == TYPE_PUSH || dep_insn_type == TYPE_POP)) + && (ix86_tune != PROCESSOR_ATHLON && ix86_tune != PROCESSOR_K8)) + return 0; + /* Show ability of reorder buffer to hide latency of load by executing in parallel with previous instruction in case previous instruction is not needed to compute the address. */ @@ -24737,6 +24744,29 @@ ix86_adjust_cost (rtx insn, rtx link, rt else cost = 0; } + break; + +case PROCESSOR_CORE2: +case PROCESSOR_COREI7: +case PROCESSOR_HASWELL: + memory = get_attr_memory (insn); + + /* Stack engine allows to execute push&pop instructions in parall. */ + if ((insn_type == TYPE_PUSH || insn_type == TYPE_POP) + && (dep_insn_type == TYPE_PUSH || dep_insn_type == TYPE_POP)) + return 0; + + /* Show ability of reorder buffer to hide latency of load by executing +in parallel with previous instruction in case +previous instruction is not needed to compute the address. */ + if ((memory == MEMORY_LOAD || memory == MEMORY_BOTH) + && !ix86_agi_dependent (dep_insn, insn)) + { + if (cost >= 4) + cost -= 4; + else + cost = 0; + } break; case PROCESSOR_SLM: issue_rate_bdver3.patch Description: issue_rate_bdver3.patch
RE: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion
> 1. For cmp/test with rip-relative addressing mem operand, don't group > insns. Bulldozer also doesn't support fusion for cmp/test with both > displacement MEM and immediate operand, while m_CORE_ALL doesn't > support fusion for cmp/test with MEM and immediate operand. I simplify > choose to use the more stringent constraint here (m_CORE_ALL's > constraint). This suits Bulldozer's specification. We don't see an issue with the proposed patch. Regards Ganesh -Original Message- From: H.J. Lu [mailto:hjl.to...@gmail.com] Sent: Wednesday, September 25, 2013 2:12 AM To: Wei Mi Cc: Jan Hubicka; Alexander Monakov; Steven Bosscher; GCC Patches; David Li; Kirill Yukhin Subject: Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion On Tue, Sep 24, 2013 at 12:06 PM, Wei Mi wrote: > This is the updated patch2. > Changed: > 1. For cmp/test with rip-relative addressing mem operand, don't group > insns. Bulldozer also doesn't support fusion for cmp/test with both > displacement MEM and immediate operand, while m_CORE_ALL doesn't > support fusion for cmp/test with MEM and immediate operand. I simplify > choose to use the more stringent constraint here (m_CORE_ALL's > constraint). > 2. Add Budozer back and merge TARGET_FUSE_CMP_AND_BRANCH_64 and > TARGET_FUSE_CMP_AND_BRANCH_32. > > bootstrap and regression pass. ok for trunk? > > 2013-09-24 Wei Mi > > * gcc/config/i386/i386.c (rip_relative_addr_p): New Function. > (ix86_macro_fusion_p): Ditto. > (ix86_macro_fusion_pair_p): Ditto. > * gcc/config/i386/i386.h: Add new tune features about macro-fusion. > * gcc/config/i386/x86-tune.def (DEF_TUNE): Ditto. > * gcc/doc/tm.texi: Generated. > * gcc/doc/tm.texi.in: Ditto. > * gcc/haifa-sched.c (try_group_insn): New Function. > (group_insns_for_macro_fusion): Ditto. > (sched_init): Call group_insns_for_macro_fusion. > * gcc/sched-rgn.c (add_branch_dependences): Keep insns in > a SCHED_GROUP at the end of BB to remain their location. > * gcc/target.def: Add two hooks: macro_fusion_p and > macro_fusion_pair_p. > > diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index > 1fd3f60..4a04778 100644 > --- a/gcc/config/i386/i386.c > +++ b/gcc/config/i386/i386.c > @@ -24856,6 +24856,167 @@ ia32_multipass_dfa_lookahead (void) > } > } > > +/* Extracted from ix86_print_operand_address. Check whether ADDR is a > + rip-relative address. */ > + > +static bool > +rip_relative_addr_p (rtx addr) > +{ > + struct ix86_address parts; > + rtx base, index, disp; > + int ok; > + > + if (GET_CODE (addr) == UNSPEC && XINT (addr, 1) == UNSPEC_VSIBADDR) > +{ > + ok = ix86_decompose_address (XVECEXP (addr, 0, 0), &parts); > + parts.index = XVECEXP (addr, 0, 1); > +} > + else if (GET_CODE (addr) == UNSPEC && XINT (addr, 1) == UNSPEC_LEA_ADDR) > +ok = ix86_decompose_address (XVECEXP (addr, 0, 0), &parts); else > +ok = ix86_decompose_address (addr, &parts); > + > + gcc_assert (ok); > + base = parts.base; > + index = parts.index; > + disp = parts.disp; > + > + if (TARGET_64BIT && !base && !index) > +{ > + rtx symbol = disp; > + > + if (GET_CODE (disp) == CONST > + && GET_CODE (XEXP (disp, 0)) == PLUS > + && CONST_INT_P (XEXP (XEXP (disp, 0), 1))) > + symbol = XEXP (XEXP (disp, 0), 0); > + > + if (GET_CODE (symbol) == LABEL_REF > + || (GET_CODE (symbol) == SYMBOL_REF > + && SYMBOL_REF_TLS_MODEL (symbol) == 0)) > + return true; > +} > + if (flag_pic && !base && !index) > +{ > + if (GET_CODE (disp) == CONST > + && GET_CODE (XEXP (disp, 0)) == UNSPEC > + && (XINT (XEXP (disp, 0), 1) == UNSPEC_PCREL > + || XINT (XEXP (disp, 0), 1) == UNSPEC_GOTPCREL > + || (TARGET_64BIT > + && XINT (XEXP (disp, 0), 1) == UNSPEC_GOTNTPOFF))) > + return true; > +} > + return false; > +} > + It doesn't look right. IP relative address is only possible with TARGET_64BIT and 1. base == pc. Or 2. UUNSPEC_PCREL, UNSPEC_GOTPCREL, and NSPEC_GOTNTPOFF. -- H.J.
RE: [PATCH,i386] Default alignment for AMD BD and BT
Thanks Jakub! Committed revision 201402. -Original Message- From: Jakub Jelinek [mailto:ja...@redhat.com] Sent: Thursday, July 04, 2013 4:46 PM To: Gopalasubramanian, Ganesh Cc: Uros Bizjak (ubiz...@gmail.com); gcc-patches@gcc.gnu.org Subject: Re: [PATCH,i386] Default alignment for AMD BD and BT On Thu, Jul 04, 2013 at 11:14:24AM +, Gopalasubramanian, Ganesh wrote: > Can this be backported now! Yes. Jakub
RE: [PATCH,i386] Default alignment for AMD BD and BT
Hi Uros, Can this be backported now! Regards Ganesh -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Thursday, May 30, 2013 1:40 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH,i386] Default alignment for AMD BD and BT On Wed, May 29, 2013 at 1:28 PM, Gopalasubramanian, Ganesh wrote: > We want this to be backported to GCC48 branch. > Please approve. > > -Original Message- > From: Uros Bizjak [mailto:ubiz...@gmail.com] > Sent: Tuesday, May 07, 2013 6:22 PM > To: Gopalasubramanian, Ganesh > Cc: gcc-patches@gcc.gnu.org > Subject: Re: [PATCH,i386] Default alignment for AMD BD and BT > > On Tue, May 7, 2013 at 9:16 AM, Gopalasubramanian, Ganesh > wrote: > >> The patch updates the alignment values for AMD BD and BT architectures. >> >> "make -k check" passes. >> >> Is it OK for upstream? >> >> 2013-05-07 Ganesh Gopalasubramanian >> >> >> * config/i386/i386.c (processor_target_table): Modified default >> alignment values for AMD BD and BT architectures. This is OK, but please wait until 4.8 branch is open again. Thanks, Uros.
RE: [PATCH,i386] Default alignment for AMD BD and BT
Hi We want this to be backported to GCC48 branch. Please approve. Regards Ganesh -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Tuesday, May 07, 2013 6:22 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH,i386] Default alignment for AMD BD and BT On Tue, May 7, 2013 at 9:16 AM, Gopalasubramanian, Ganesh wrote: > The patch updates the alignment values for AMD BD and BT architectures. > > "make -k check" passes. > > Is it OK for upstream? > > 2013-05-07 Ganesh Gopalasubramanian > > > * config/i386/i386.c (processor_target_table): Modified default > alignment values for AMD BD and BT architectures. The value 11 indeed looks a bit weird, but it means: align to 16 byte boundary only if this can be done by skipping 10 bytes or less. Tha patch is OK for mainline. Thanks, Uros.
RE: [PATCH,i386] FP Reassociation for AMD bdver1 and bdver2
Thanks Uros! Committed at r199405. -Ganesh -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Thursday, May 23, 2013 4:47 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH,i386] FP Reassociation for AMD bdver1 and bdver2 On Thu, May 23, 2013 at 1:11 PM, Gopalasubramanian, Ganesh wrote: > The patch enables FP Reassociation pass AMD bdver1 and bdver2 architectures. > We note a performance uplift of around ~8% on calculix. > > "make -k check" passes. > > Is it OK for upstream? OK. Thanks, Uros.
[PATCH,i386] FP Reassociation for AMD bdver1 and bdver2
Hi The patch enables FP Reassociation pass AMD bdver1 and bdver2 architectures. We note a performance uplift of around ~8% on calculix. "make -k check" passes. Is it OK for upstream? Regards Ganesh Index: gcc/ChangeLog === --- gcc/ChangeLog (revision 199133) +++ gcc/ChangeLog (working copy) @@ -1,3 +1,8 @@ +2013-05-23 Ganesh Gopalasubramanian + +* config/i386/i386.c (initial_ix86_tune_features): Enable +FP Reassociation for AMD bdver1 and bdver2. + 2013-05-21 Christian Bruel * dwarf2out.c (multiple_reg_loc_descriptor): Use dbx_reg_number for Index: gcc/config/i386/i386.c === --- gcc/config/i386/i386.c (revision 199133) +++ gcc/config/i386/i386.c (working copy) @@ -2026,7 +2026,7 @@ /* X86_TUNE_REASSOC_FP_TO_PARALLEL: Try to produce parallel computations during reassociation of fp computation. */ - m_ATOM | m_HASWELL, + m_ATOM | m_HASWELL | m_BDVER1 | m_BDVER2, /* X86_TUNE_GENERAL_REGS_SSE_SPILL: Try to spill general regs to SSE regs instead of memory. */ -Original Message----- From: Gopalasubramanian, Ganesh Sent: Monday, May 13, 2013 5:24 PM To: gcc-patches@gcc.gnu.org Cc: Uros Bizjak (ubiz...@gmail.com) Subject: [PATCH,i386] FSGSBASE for AMD bdver3 Hi The patch enables FSGSBASE instruction generation for AMD bdver3 architectures. "make -k check" passes. Is it OK for upstream? Regards Ganesh Index: gcc/ChangeLog === --- gcc/ChangeLog (revision 198821) +++ gcc/ChangeLog (working copy) @@ -1,3 +1,8 @@ +2013-05-13 Ganesh Gopalasubramanian + +* config/i386/i386.c (processor_alias_table): Add instruction +FSGSBASE for AMD bdver3 architecture. + 2013-05-13 Martin Jambor PR middle-end/42371 Index: gcc/config/i386/i386.c === --- gcc/config/i386/i386.c (revision 198821) +++ gcc/config/i386/i386.c (working copy) @@ -3000,7 +3000,7 @@ | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_XOP | PTA_LWP | PTA_BMI | PTA_TBM | PTA_F16C | PTA_FMA | PTA_PRFCHW | PTA_FXSR | PTA_XSAVE - | PTA_XSAVEOPT}, + | PTA_XSAVEOPT | PTA_FSGSBASE}, {"btver1", PROCESSOR_BTVER1, CPU_GENERIC64, PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3 | PTA_SSSE3 | PTA_SSE4A |PTA_ABM | PTA_CX16 | PTA_PRFCHW
RE: [PATCH, i386]: Update processor_alias_table for missing PTA_PRFCHW and PTA_FXSR flags
Thank you Uros for the patch. Could you backport this to the 4.8.0? -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Wednesday, May 15, 2013 11:16 PM To: gcc-patches@gcc.gnu.org Cc: Gopalasubramanian, Ganesh Subject: [PATCH, i386]: Update processor_alias_table for missing PTA_PRFCHW and PTA_FXSR flags Hello! Attached patch adds missing PTA_PRFCHW and PTA_FXSR flags to x86 processor alias table. PRFCHW CPUID flag is shared with 3dnow prefetch flag, so some additional logic is needed to avoid generating SSE prefetches for non-SSE 3dNow! targets, while still generating full set of 3dnow prefetches on 3dNow! targets. 2013-05-15 Uros Bizjak * config/i386/i386.c (iy86_option_override_internal): Update processor_alias_table for missing PTA_PRFCHW and PTA_FXSR flags. Add PTA_POPCNT to corei7 entry and remove PTA_SSE from athlon-4 entry. Do not enable SSE prefetch on non-SSE 3dNow! targets. Enable TARGET_PRFCHW for TARGET_3DNOW targets. * config/i386/i386.md (prefetch): Enable for TARGET_PRFCHW instead of TARGET_3DNOW. (*prefetch_3dnow): Enable for TARGET_PRFCHW only. Patch was bootstrapped and regression tested on x86_64-pc-linux-gnu {,-m32} and was committed to mainline SVN. The patch will be backported to 4.8 branch in a couple of days. Uros.
RE: [PATCH,i386] FSGSBASE for AMD bdver3
Thank you Uros! Patch for FSGSBASE instruction generation for AMD bdver3 committed to trunk (rr198916). Regards Ganesh -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Monday, May 13, 2013 5:50 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH,i386] FSGSBASE for AMD bdver3 On Mon, May 13, 2013 at 1:54 PM, Gopalasubramanian, Ganesh wrote: > The patch enables FSGSBASE instruction generation for AMD bdver3 > architectures. > > "make -k check" passes. > > Is it OK for upstream? OK. Please also check for missing PTA_PRFCHW and PTA_FXSR for AMD processors in processor_alias_table. Thanks, Uros.
RE: [PATCH,i386] FSGSBASE for AMD bdver3
Thanks Uros! I think you mean the amdfam10 ISA mismatch between march=native and march=amdfam10. The below patch fills the gap. "make -k check" passes. Regards Ganesh 2013-05-07 Ganesh Gopalasubramanian * config/i386/i386.c (processor_alias_table): Mismatch in ISAs Between march=native and march=amdfam10 is fixed. --- ./wkcpy/gcc-4.9.0/gcc/config/i386/i386.c 2013-02-21 16:27:10.0 +0530 +++ ./source/gcc-4.9.0/gcc/config/i386/i386.c 2013-10-21 22:20:28.0 +0530 @@ -2964,7 +2964,8 @@ | PTA_SSE2 | PTA_NO_SAHF}, {"amdfam10", PROCESSOR_AMDFAM10, CPU_AMDFAM10, PTA_64BIT | PTA_MMX | PTA_3DNOW | PTA_3DNOW_A | PTA_SSE - | PTA_SSE2 | PTA_SSE3 | PTA_SSE4A | PTA_CX16 | PTA_ABM}, + | PTA_SSE2 | PTA_SSE3 | PTA_SSE4A | PTA_CX16 | PTA_ABM + | PTA_FXSR | PTA_PRFCHW}, {"barcelona", PROCESSOR_AMDFAM10, CPU_AMDFAM10, PTA_64BIT | PTA_MMX | PTA_3DNOW | PTA_3DNOW_A | PTA_SSE | PTA_SSE2 | PTA_SSE3 | PTA_SSE4A | PTA_CX16 | PTA_ABM}, -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Monday, May 13, 2013 5:50 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH,i386] FSGSBASE for AMD bdver3 On Mon, May 13, 2013 at 1:54 PM, Gopalasubramanian, Ganesh wrote: > The patch enables FSGSBASE instruction generation for AMD bdver3 > architectures. > > "make -k check" passes. > > Is it OK for upstream? OK. Please also check for missing PTA_PRFCHW and PTA_FXSR for AMD processors in processor_alias_table. Thanks, Uros.
[PATCH,i386] FSGSBASE for AMD bdver3
Hi The patch enables FSGSBASE instruction generation for AMD bdver3 architectures. "make -k check" passes. Is it OK for upstream? Regards Ganesh Index: gcc/ChangeLog === --- gcc/ChangeLog (revision 198821) +++ gcc/ChangeLog (working copy) @@ -1,3 +1,8 @@ +2013-05-13 Ganesh Gopalasubramanian + +* config/i386/i386.c (processor_alias_table): Add instruction +FSGSBASE for AMD bdver3 architecture. + 2013-05-13 Martin Jambor PR middle-end/42371 Index: gcc/config/i386/i386.c === --- gcc/config/i386/i386.c (revision 198821) +++ gcc/config/i386/i386.c (working copy) @@ -3000,7 +3000,7 @@ | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_XOP | PTA_LWP | PTA_BMI | PTA_TBM | PTA_F16C | PTA_FMA | PTA_PRFCHW | PTA_FXSR | PTA_XSAVE - | PTA_XSAVEOPT}, + | PTA_XSAVEOPT | PTA_FSGSBASE}, {"btver1", PROCESSOR_BTVER1, CPU_GENERIC64, PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3 | PTA_SSSE3 | PTA_SSE4A |PTA_ABM | PTA_CX16 | PTA_PRFCHW
RE: [PATCH,i386] Default alignment for AMD BD and BT
Thank you Uros! Committed r198820. Regards Ganesh -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Tuesday, May 07, 2013 6:22 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH,i386] Default alignment for AMD BD and BT On Tue, May 7, 2013 at 9:16 AM, Gopalasubramanian, Ganesh wrote: > The patch updates the alignment values for AMD BD and BT architectures. > > "make -k check" passes. > > Is it OK for upstream? > > 2013-05-07 Ganesh Gopalasubramanian > > > * config/i386/i386.c (processor_target_table): Modified default > alignment values for AMD BD and BT architectures. The value 11 indeed looks a bit weird, but it means: align to 16 byte boundary only if this can be done by skipping 10 bytes or less. Tha patch is OK for mainline. Thanks, Uros.
[PATCH,i386] Default alignment for AMD BD and BT
Hi The patch updates the alignment values for AMD BD and BT architectures. "make -k check" passes. Is it OK for upstream? Regards Ganesh 2013-05-07 Ganesh Gopalasubramanian * config/i386/i386.c (processor_target_table): Modified default alignment values for AMD BD and BT architectures. Index: gcc/config/i386/i386.c === --- gcc/config/i386/i386.c (revision 198386) +++ gcc/config/i386/i386.c (working copy) @@ -2450,11 +2450,11 @@ {&generic32_cost, 16, 7, 16, 7, 16}, {&generic64_cost, 16, 10, 16, 10, 16}, {&amdfam10_cost, 32, 24, 32, 7, 32}, - {&bdver1_cost, 32, 24, 32, 7, 32}, - {&bdver2_cost, 32, 24, 32, 7, 32}, - {&bdver3_cost, 32, 24, 32, 7, 32}, - {&btver1_cost, 32, 24, 32, 7, 32}, - {&btver2_cost, 32, 24, 32, 7, 32}, + {&bdver1_cost, 16, 10, 16, 7, 11}, + {&bdver2_cost, 16, 10, 16, 7, 11}, + {&bdver3_cost, 16, 10, 16, 7, 11}, + {&btver1_cost, 16, 10, 16, 7, 11}, + {&btver2_cost, 16, 10, 16, 7, 11}, {&atom_cost, 16, 15, 16, 7, 16} };
RE: [patch][wwwdocs] gcc 4.8 changes - AMD new cores
Thank you Gerald! Committed with the changes. Regards Ganesh -Original Message- From: Gerald Pfeifer [mailto:ger...@pfeifer.com] Sent: Thursday, February 14, 2013 2:40 PM To: Gopalasubramanian, Ganesh Cc: gcc-patchesUros Bizjak Subject: RE: [patch][wwwdocs] gcc 4.8 changes - AMD new cores On Thu, 14 Feb 2013, Gopalasubramanian, Ganesh wrote: > Is it OK for wwdocs? Looks good to me if you say "...through the... options" (adding "the" in two cases) and breaking the lines to not exceed 76 columns. Thanks, Gerald
RE: [patch][wwwdocs] gcc 4.8 changes - AMD new cores
Is it OK for wwdocs? Index: gcc-4.8/changes.html === RCS file: /cvs/gcc/wwwdocs/htdocs/gcc-4.8/changes.html,v retrieving revision 1.96 diff -u -r1.96 changes.html --- gcc-4.8/changes.html12 Feb 2013 16:33:58 - 1.96 +++ gcc-4.8/changes.html13 Feb 2013 08:24:53 - @@ -529,6 +529,10 @@ information. Windows MinGW-w64 targets (*-w64-mingw*) require at least r5437 from the Mingw-w64 trunk. +Support for new AMD family 15h processors (Steamroller core) is now available + through -march=bdver3 and -mtune=bdver3 options. +Support for new AMD family 16h processors (Jaguar core) is now available + through -march=btver2 and -mtune=btver2 options. FRV -Original Message- From: Mikael Morin [mailto:mikael.mo...@sfr.fr] Sent: Wednesday, February 13, 2013 6:38 PM To: Richard Biener Cc: Gopalasubramanian, Ganesh; gcc-patches@gcc.gnu.org; ubizjak at gmail dot com (gcc-bugzi...@gcc.gnu.org); ger...@pfeifer.com Subject: Re: [patch][wwwdocs] gcc 4.8 changes - AMD new cores Le 13/02/2013 14:00, Richard Biener a écrit : > Of course not. Next they'll add blver ... Sorry
[patch][wwwdocs] gcc 4.8 changes - AMD new cores
Hello, This patch adds short words about the new AMD cores that got enabled in GCC-4.8. OK for the wwwdocs? Regards Ganesh Index: gcc-4.8/changes.html === RCS file: /cvs/gcc/wwwdocs/htdocs/gcc-4.8/changes.html,v retrieving revision 1.96 diff -u -r1.96 changes.html --- gcc-4.8/changes.html12 Feb 2013 16:33:58 - 1.96 +++ gcc-4.8/changes.html13 Feb 2013 08:24:53 - @@ -529,6 +529,10 @@ information. Windows MinGW-w64 targets (*-w64-mingw*) require at least r5437 from the Mingw-w64 trunk. +Support for new AMD family 15h processors (Steamroller core) is now available + through -march=bdver3 and -mtune=bdver3 options. +Support for new AMD family 16h processors (Jaguar core) is now available + through -march=btver2 and -mtune=btver2 options. FRV
RE: [PATCH, i386]: AMD bdver3 enablement
Thank Uros for the comments. The changes are committed to trunk http://gcc.gnu.org/viewcvs?view=revision&revision=193548 http://gcc.gnu.org/viewcvs?view=revision&revision=193549 Regards Ganesh -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Wednesday, November 14, 2012 4:15 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH, i386]: AMD bdver3 enablement On Wed, Nov 14, 2012 at 10:22 AM, Gopalasubramanian, Ganesh wrote: >> sseshuf replaces sselog in some insn patterns, but should be handled in the >> same way in *existing* .md files. > > Modifications done as per the comments. > 1. Sseshuf is added along with sselog in existing md files. > 2. sseshuf is handled in a separate pattern in bdver3.md > > Bootstrapping and "make -k check" passes. > Ok for trunk? > > 2012-11-14 Ganesh Gopalasubramanian > > > bdver3 Enablement > * gcc/doc/extend.texi: Add details about bdver3. > * gcc/doc/invoke.texi: Add details about bdver3. > * config.gcc (i[34567]86-*-linux* | ...): Add bdver3. > (case ${target}): Add bdver3. > * config/i386/i386.h (TARGET_BDVER3): New definition. > * config/i386/i386.md (define_attr "cpu"): Add bdver3. > * config/i386/sse.md (sseshuf): New type attribute. > * config/i386/athlon.md (sseshuf):Likewise. > * config/i386/atom.md (sseshuf):Likewise. > * config/i386/ppro.md (sseshuf):Likewise. > * config/i386/bdver1.md (sseshuf):Likewise. > * config/i386/i386.opt (flag_dispatch_scheduler): Add bdver3. > * config/i386/i386-c.c (ix86_target_macros_internal): Add > bdver3 def_and_undef > * config/i386/driver-i386.c (host_detect_local_cpu): Let > -march=native recognize bdver3 processors. > * config/i386/i386.c (struct processor_costs bdver3_cost): New. > (m_BDVER3): New definition. > (m_AMD_MULTIPLE): Includes m_BDVER3. > (initial_ix86_tune_features): Add bdver3 tune. > (processor_target_table): Add bdver3 entry. > (static const char *const cpu_names): Add bdver3 entry. > (software_prefetching_beneficial_p): Add bdver3. > (ix86_option_override_internal): Add bdver3 instruction sets. > (ix86_option_override_internal): Remove XSAVEOPT for bdver1 > and bdver2. > (ix86_issue_rate): Add bdver3. > (ix86_adjust_cost): Add bdver3. > (enum target_cpu_default): Add TARGET_CPU_DEFAULT_bdver3. > (enum processor_type): Add PROCESSOR_BDVER3. > * config/i386/bdver3.md: New file describing bdver3 pipelines. OK for mainline. Thanks, Uros.
RE: [PATCH, i386]: AMD bdver3 enablement
> You can see from the changes of sse.md that this is functionally a no-op > change. Sseshuf replaces sselog. So, do you mean it should be added with sselog instead of sseadd? Adding it with sseadd (instead of sselog) influences the latency information. Regards Ganesh -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Monday, November 12, 2012 2:30 AM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH, i386]: AMD bdver3 enablement On Fri, Nov 9, 2012 at 4:39 AM, Gopalasubramanian, Ganesh wrote: > Changes done with respect to the review comments. > Conditionally setting "sseshuf" type attribute has been removed. > Instead new attribute is added and is included for other attribute > calculations. > > The patch is attached as (difflog.txt). > The new file (bdver3.md) describing the pipelines is also attached. > > Bootstrapping and "make -k check" passes. > > OK for upstream? > > 2012-11-09 Ganesh Gopalasubramanian > > bdver3 Enablement > * gcc/doc/extend.texi: Add details about bdver3. > * gcc/doc/invoke.texi: Add details about bdver3. > * config.gcc (i[34567]86-*-linux* | ...): Add bdver3. > (case ${target}): Add bdver3. > * config/i386/i386.h (TARGET_BDVER3): New definition. > * config/i386/i386.md (define_attr "cpu"): Add bdver3. > * config/i386/sse.md (sseshuf): New type attribute. > * config/i386/athlon.md (sseshuf):Likewise. > * config/i386/atom.md (sseshuf):Likewise. > * config/i386/ppro.md (sseshuf):Likewise. Index: gcc/config/i386/atom.md === --- gcc/config/i386/atom.md (revision 193132) +++ gcc/config/i386/atom.md (working copy) @@ -455,6 +455,30 @@ (eq_attr "memory" "!none"))) "atom-simple-0") +(define_insn_reservation "atom_sseshuf" 1 + (and (eq_attr "cpu" "atom") + (and (eq_attr "type" "sseshuf") +(eq_attr "memory" "none"))) + "atom-simple-either") + +(define_insn_reservation "atom_sseshuf_mem" 1 + (and (eq_attr "cpu" "atom") + (and (eq_attr "type" "sseshuf") +(eq_attr "memory" "!none"))) + "atom-simple-either") + +(define_insn_reservation "atom_sseshuf1" 1 + (and (eq_attr "cpu" "atom") + (and (eq_attr "type" "sseshuf1") +(eq_attr "memory" "none"))) + "atom-simple-0") + +(define_insn_reservation "atom_sseshuf1_mem" 1 + (and (eq_attr "cpu" "atom") + (and (eq_attr "type" "sseshuf1") +(eq_attr "memory" "!none"))) + "atom-simple-0") + ;; not pmad, not psad (define_insn_reservation "atom_sseiadd" 1 (and (eq_attr "cpu" "atom") This was not what I had in mind for changes in existing .md files. Just change them in this way: Index: atom.md === --- atom.md (revision 193407) +++ atom.md (working copy) @@ -594,7 +594,7 @@ ;; no memory simple (define_insn_reservation "atom_sseadd" 5 (and (eq_attr "cpu" "atom") - (and (eq_attr "type" "sseadd,sseadd1") + (and (eq_attr "type" "sseadd,sseshuf,sseadd1,sseshuf1") (and (eq_attr "memory" "none") (and (eq_attr "mode" "!V2DF") (eq_attr "atom_unit" "!complex") @@ -603,7 +603,7 @@ ;; memory simple (define_insn_reservation "atom_sseadd_mem" 5 (and (eq_attr "cpu" "atom") - (and (eq_attr "type" "sseadd,sseadd1") + (and (eq_attr "type" "sseadd,sseshuf,sseadd1,sseshuf1") (and (eq_attr "memory" "!none") (and (eq_attr "mode" "!V2DF") (eq_attr "atom_unit" "!complex") @@ -612,7 +612,7 @@ ;; maxps, minps, *pd, hadd, hsub (define_insn_reservation "atom_sseadd_3" 8 (and (eq_attr "cpu" "atom") - (and (eq_attr "type" "sseadd,sseadd1") + (and (eq_attr "type" "sseadd,sseshuf,sseadd1,sseshuf1") (ior (eq_attr "mode" "V2DF") (eq_attr "atom_unit" "complex" "atom-complex, atom-all-eu*7") You can see from the changes of sse.md that this is functionally a no-op change. Uros.
Add myself to MAINTAINERS
Adding myself to the list of members in "write after approval". Index: ChangeLog === --- ChangeLog (revision 192977) +++ ChangeLog (working copy) @@ -1,3 +1,7 @@ +2012-10-30 Ganesh Gopalasubramanian + + * MAINTAINERS (Write After Approval): Add myself. + 2012-10-26 James Greenhalgh * MAINTAINERS (Write After Approval): Add myself. Index: MAINTAINERS === --- MAINTAINERS (revision 192977) +++ MAINTAINERS (working copy) @@ -372,6 +372,7 @@ Chao-ying Fu f...@mips.com Gary Funck g...@intrepid.com Pompapathi V Gadad pompapathi.v.ga...@nsc.com +Gopalasubramanian Ganesh ganesh.gopalasubraman...@amd.com Kaveh Ghazigh...@gcc.gnu.org Matthew Gingellging...@gnat.com Tristan Gingoldging...@adacore.com Regards Ganesh
RE: GCC 4.8.0 Status Report (2012-10-29), Stage 1 to end soon
Hi Jakub, We are working on the following. 1. bdver3 enablement. Review completed. Changes to be incorporated and checked-in. http://gcc.gnu.org/ml/gcc-patches/2012-10/msg01131.html 2. btver2 basic enablement is done (http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01018.html)/ Scheduler descriptions are being updated. This is architecture specific and we consider it not to be a stage-1 material. Regards Ganesh -Original Message- From: Jakub Jelinek [mailto:ja...@redhat.com] Sent: Monday, October 29, 2012 11:27 PM To: g...@gcc.gnu.org Cc: gcc-patches@gcc.gnu.org Subject: GCC 4.8.0 Status Report (2012-10-29), Stage 1 to end soon Status == I'd like to close the stage 1 phase of GCC 4.8 development on Monday, November 5th. If you have still patches for new features you'd like to see in GCC 4.8, please post them for review soon. Patches posted before the freeze, but reviewed shortly after the freeze, may still go in, further changes should be just bugfixes and documentation fixes. Quality Data Priority # Change from Last Report --- --- P1 23 + 23 P2 77 + 8 P3 85 + 84 --- --- Total 185 +115 Previous Report === http://gcc.gnu.org/ml/gcc/2012-03/msg00011.html The next report will be sent by me again, announcing end of stage 1.
RE: [PATCH, i386]: Fix PR51109, symbol size in scheduler state machine is reduced
That was obvious. Sorry for the wrong commit. Thanks Jakub. -Ganesh -Original Message- From: Paolo Carlini [mailto:paolo.carl...@oracle.com] Sent: Wednesday, October 10, 2012 4:33 PM To: Jakub Jelinek Cc: Gopalasubramanian, Ganesh; Uros Bizjak; gcc-patches@gcc.gnu.org; veku...@gcc.gnu.org Subject: Re: [PATCH, i386]: Fix PR51109, symbol size in scheduler state machine is reduced On 10/10/2012 01:00 PM, Jakub Jelinek wrote: > I have removed the extra line as obvious in SVN, to allow my > bootstraps to continue. Thanks! Paolo.
RE: [PATCH, i386]: Fix PR51109, symbol size in scheduler state machine is reduced
Testing was done before posting the patch. It was successful. Regards Ganesh -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Thursday, September 27, 2012 5:57 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH, i386]: Fix PR51109, symbol size in scheduler state machine is reduced On Thu, Sep 27, 2012 at 10:30 AM, Gopalasubramanian, Ganesh wrote: > This is a fix for PR 51109. > > There are three changes > > 1. Microcoded instructions are considered as single issue instructions > and are therefore issued to a separate execution unit. > 2. The multiplier unit is attached to execution unit 1 (ieu1). Since ieu > is handled as a separate automaton in the patch, separate "mult" automaton is > not required. > 3. The integer execution units (2AGUs and 2EXs) are now decoupled. Now, > they are described as separate automatons. > > Is it OK for upstream? > > Regards > Ganesh > > 2012-09-27 Ganesh Gopalasubramanian > > > PR 51109 > * gcc/config/i386/bdver1.md (bdver1_int): Automaton has been > split to reduce state transitions. OK for mainline, if tested according to [1]. [1] http://gcc.gnu.org/contribute.html#testing Thanks, Uros.
[PATCH, i386]: Fix PR51109, symbol size in scheduler state machine is reduced
Hi All, This is a fix for PR 51109. There are three changes 1. Microcoded instructions are considered as single issue instructions and are therefore issued to a separate execution unit. 2. The multiplier unit is attached to execution unit 1 (ieu1). Since ieu is handled as a separate automaton in the patch, separate "mult" automaton is not required. 3. The integer execution units (2AGUs and 2EXs) are now decoupled. Now, they are described as separate automatons. Is it OK for upstream? Regards Ganesh 2012-09-27 Ganesh Gopalasubramanian PR 51109 * gcc/config/i386/bdver1.md (bdver1_int): Automaton has been split to reduce state transitions. Index: gcc/config/i386/bdver1.md === --- gcc/config/i386/bdver1.md (revision 191658) +++ gcc/config/i386/bdver1.md (working copy) @@ -36,7 +36,7 @@ (define_attr "bdver1_decode" "direct,vector,double" (const_string "direct")) -(define_automaton "bdver1,bdver1_int,bdver1_load,bdver1_mult,bdver1_fp") +(define_automaton "bdver1,bdver1_ieu,bdver1_load,bdver1_fp,bdver1_agu") (define_cpu_unit "bdver1-decode0" "bdver1") (define_cpu_unit "bdver1-decode1" "bdver1") @@ -71,16 +71,14 @@ | (nothing,(bdver1-decode1 + bdver1-decode2)))") -(define_cpu_unit "bdver1-ieu0" "bdver1_int") -(define_cpu_unit "bdver1-ieu1" "bdver1_int") +(define_cpu_unit "bdver1-ieu0" "bdver1_ieu") +(define_cpu_unit "bdver1-ieu1" "bdver1_ieu") (define_reservation "bdver1-ieu" "(bdver1-ieu0 | bdver1-ieu1)") -(define_cpu_unit "bdver1-agu0" "bdver1_int") -(define_cpu_unit "bdver1-agu1" "bdver1_int") +(define_cpu_unit "bdver1-agu0" "bdver1_agu") +(define_cpu_unit "bdver1-agu1" "bdver1_agu") (define_reservation "bdver1-agu" "(bdver1-agu0 | bdver1-agu1)") -(define_cpu_unit "bdver1-mult" "bdver1_mult") - (define_cpu_unit "bdver1-load0" "bdver1_load") (define_cpu_unit "bdver1-load1" "bdver1_load") (define_reservation "bdver1-load" "bdver1-agu, @@ -93,6 +91,12 @@ ;; 128bit SSE instructions issue two stores at once. (define_reservation "bdver1-store2" "(bdver1-load0 + bdver1-load1)") +;; vectorpath (microcoded) instructions are single issue instructions. +;; So, they occupy all the integer units. +(define_reservation "bdver1-ivector" "bdver1-ieu0+bdver1-ieu1+ + bdver1-agu0+bdver1-agu1+ + bdver1-load0+bdver1-load1") + ;; The FP operations start to execute at stage 12 in the pipeline, while ;; integer operations start to execute at stage 9 for athlon and 11 for K8 ;; Compensate the difference for athlon because it results in significantly @@ -125,7 +129,7 @@ (define_insn_reservation "bdver1_call" 0 (and (eq_attr "cpu" "bdver1,bdver2") (eq_attr "type" "call,callv")) -"bdver1-double,bdver1-agu,bdver1-ieu") +"bdver1-double,bdver1-agu") ;; PUSH mem is double path. (define_insn_reservation "bdver1_push" 1 (and (eq_attr "cpu" "bdver1,bdver2") @@ -135,17 +139,17 @@ (define_insn_reservation "bdver1_pop" 1 (and (eq_attr "cpu" "bdver1,bdver2") (eq_attr "type" "pop")) -"bdver1-direct,(bdver1-ieu+bdver1-load)") +"bdver1-direct,bdver1-ivector") ;; LEAVE no latency info so far, assume same with amdfam10. (define_insn_reservation "bdver1_leave" 3 (and (eq_attr "cpu" "bdver1,bdver2") (eq_attr "type" "leave")) -"bdver1-vector,(bdver1-ieu+bdver1-load)") +"bdver1-vector,bdver1-ivector") ;; LEA executes in AGU unit with 1 cycle latency on BDVER1. (define_insn_reservation "bdver1_lea" 1 (and (eq_attr "cpu" "bdver1,bdver2") (eq_attr "type" "lea")) -"bdver1-direct,bdver1-agu,nothing") +"bdver1-direct,bdver1-agu") ;; MUL executes in special multiplier unit attached to IEU1. (define_insn_reservation "bdver1_imul_DI" 6 @@ -153,23 +157,23 @@ (and (eq_attr "type" "imul") (and (eq_attr "mode" "DI") (eq_attr "memory" "none,unknown" - "bdver1-direct1,bdver1-ieu1,bdver1-mult,nothing,bdver1-ieu1") +"bdver1-direct1,bdver1-ieu1") (define_insn_reservation "bdver1_imul" 4 (and (eq_attr "cpu" "bdver1,bdver2") (and (eq_attr "type" "imul") (eq_attr "memory" "none,unknown"))) -"bdver1-direct1,bdver1-ieu1,bdver1-mult,bdver1-ieu1") +"bdver1-direct1,bdver1-ieu1") (define_insn_reserv
RE: [PATCH,i386] fma4 addition for bdver2
Hi, The second change (done in config/i386/driver-i386.c (host_detect_local_cpu)) is not reflected in svn revision 191109. Since we are handling the fma instruction selection in i386.c\i386.md, we need not disable the flag in driver. Let me know your opinion. Regards Ganesh -Original Message- From: Gopalasubramanian, Ganesh Sent: Wednesday, September 05, 2012 3:41 PM To: gcc-patches@gcc.gnu.org Cc: Uros Bizjak (ubiz...@gmail.com) Subject: [PATCH,i386] fma4 addition for bdver2 Hello, FMA4 and FMA3 ISA are implemented in bdver2 target. FMA3 is selected by default. This patch supports the use of FMA4 intrinsics for bdver2 targets. Is it OK for trunk? Regards Ganesh 2012-09-05 Ganesh Gopalasubramanian * config/i386/i386.md : Comments on fma4 instruction selection reflect requirement on register pressure based cost model. * config/i386/driver-i386.c (host_detect_local_cpu): fma4 flag is set-reset as informed by the cpuid flag. * config/i386/i386.c (processor_alias_table): fma4 flag is enabled for bdver2. Index: gcc/config/i386/i386.md === --- gcc/config/i386/i386.md (revision 190830) +++ gcc/config/i386/i386.md (working copy) @@ -659,9 +659,11 @@ (eq_attr "isa" "noavx2") (symbol_ref "!TARGET_AVX2") (eq_attr "isa" "bmi2") (symbol_ref "TARGET_BMI2") (eq_attr "isa" "fma") (symbol_ref "TARGET_FMA") -;; Disable generation of FMA4 instructions for generic code -;; since FMA3 is preferred for targets that implement both -;; instruction sets. +;; Fma instruction selection has to be done based on +;; register pressure. For generating fma4, a cost model +;; based on register pressure is required. Till then, +;; fma4 instruction is disabled for targets that implement +;; both fma and fma4 instruction sets. (eq_attr "isa" "fma4") (symbol_ref "TARGET_FMA4 && !TARGET_FMA") ] Index: gcc/config/i386/driver-i386.c === --- gcc/config/i386/driver-i386.c (revision 190830) +++ gcc/config/i386/driver-i386.c (working copy) @@ -483,8 +483,6 @@ has_abm = ecx & bit_ABM; has_lwp = ecx & bit_LWP; has_fma4 = ecx & bit_FMA4; - if (vendor == SIG_AMD && has_fma4 && has_fma) - has_fma4 = 0; has_xop = ecx & bit_XOP; has_tbm = ecx & bit_TBM; has_lzcnt = ecx & bit_LZCNT; Index: gcc/config/i386/i386.c === --- gcc/config/i386/i386.c (revision 190830) +++ gcc/config/i386/i386.c (working copy) @@ -3164,7 +3164,7 @@ {"bdver2", PROCESSOR_BDVER2, CPU_BDVER2, PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3 | PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1 - | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX + | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_FMA4 | PTA_XOP | PTA_LWP | PTA_BMI | PTA_TBM | PTA_F16C | PTA_FMA}, {"btver1", PROCESSOR_BTVER1, CPU_GENERIC64, Regards Ganesh
[PATCH,i386] fma4 addition for bdver2
Hello, FMA4 and FMA3 ISA are implemented in bdver2 target. FMA3 is selected by default. This patch supports the use of FMA4 intrinsics for bdver2 targets. Is it OK for trunk? Regards Ganesh 2012-09-05 Ganesh Gopalasubramanian * config/i386/i386.md : Comments on fma4 instruction selection reflect requirement on register pressure based cost model. * config/i386/driver-i386.c (host_detect_local_cpu): fma4 flag is set-reset as informed by the cpuid flag. * config/i386/i386.c (processor_alias_table): fma4 flag is enabled for bdver2. Index: gcc/config/i386/i386.md === --- gcc/config/i386/i386.md (revision 190830) +++ gcc/config/i386/i386.md (working copy) @@ -659,9 +659,11 @@ (eq_attr "isa" "noavx2") (symbol_ref "!TARGET_AVX2") (eq_attr "isa" "bmi2") (symbol_ref "TARGET_BMI2") (eq_attr "isa" "fma") (symbol_ref "TARGET_FMA") -;; Disable generation of FMA4 instructions for generic code -;; since FMA3 is preferred for targets that implement both -;; instruction sets. +;; Fma instruction selection has to be done based on +;; register pressure. For generating fma4, a cost model +;; based on register pressure is required. Till then, +;; fma4 instruction is disabled for targets that implement +;; both fma and fma4 instruction sets. (eq_attr "isa" "fma4") (symbol_ref "TARGET_FMA4 && !TARGET_FMA") ] Index: gcc/config/i386/driver-i386.c === --- gcc/config/i386/driver-i386.c (revision 190830) +++ gcc/config/i386/driver-i386.c (working copy) @@ -483,8 +483,6 @@ has_abm = ecx & bit_ABM; has_lwp = ecx & bit_LWP; has_fma4 = ecx & bit_FMA4; - if (vendor == SIG_AMD && has_fma4 && has_fma) - has_fma4 = 0; has_xop = ecx & bit_XOP; has_tbm = ecx & bit_TBM; has_lzcnt = ecx & bit_LZCNT; Index: gcc/config/i386/i386.c === --- gcc/config/i386/i386.c (revision 190830) +++ gcc/config/i386/i386.c (working copy) @@ -3164,7 +3164,7 @@ {"bdver2", PROCESSOR_BDVER2, CPU_BDVER2, PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3 | PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1 - | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX + | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_FMA4 | PTA_XOP | PTA_LWP | PTA_BMI | PTA_TBM | PTA_F16C | PTA_FMA}, {"btver1", PROCESSOR_BTVER1, CPU_GENERIC64, Regards Ganesh
RE: [PATCH,i386] fma,fma4 and xop flags
> This won't work, since we have to prefer FMA3 also in case when only "-mfma > -mfma4" without -mtune=XX is used. > We can add TARGET_FMA_BOTH though, but I doubt there will ever be target that > implements both insn sets without preferences. Preferring FMA3 over FMA4 might not do good always. For instance, with increased register pressure FMA3 can be used. But, when we have more registers at our disposal, fma4 if used might do good by avoiding extra reload. IMO, when preference of FMA instructions is adjudged by register pressure, we may need some functionality to support that. So, ideally for bdver2, we like to have both fma and fma4 getting generated with options "-mfma -mfma4". Regards Ganesh -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Tuesday, August 14, 2012 9:12 PM To: Richard Henderson Cc: Gopalasubramanian, Ganesh; gcc-patches@gcc.gnu.org Subject: Re: [PATCH,i386] fma,fma4 and xop flags On Mon, Aug 13, 2012 at 9:50 PM, Richard Henderson wrote: > On 08/13/2012 12:33 PM, Uros Bizjak wrote: >> AFAIU fma3 is better than fma4 for bdver2 (the only CPU that >> implements both FMA sets). Current description of bdver2 doesn't even >> enable fma4 in processor_alias_table due to this fact. >> >> The change you are referring to adds preference for fma3 insn set for >> generic code (not FMA4 builtins!), even when fma4 is enabled. So, no >> matter which combination and sequence of -mfmfa -mfma4 or -mxop user >> passes to the compiler, only fma3 instructions will be generated. > > This rationale needs to appear as a comment above > >> + (eq_attr "isa" "fma4") >> +(symbol_ref "TARGET_FMA4 && !TARGET_FMA") I plan to commit following patch: --cut here-- Index: i386.md === --- i386.md (revision 190362) +++ i386.md (working copy) @@ -659,6 +659,9 @@ (eq_attr "isa" "noavx2") (symbol_ref "!TARGET_AVX2") (eq_attr "isa" "bmi2") (symbol_ref "TARGET_BMI2") (eq_attr "isa" "fma") (symbol_ref "TARGET_FMA") +;; Disable generation of FMA4 instructions for generic code +;; since FMA3 is preferred for targets that implement both +;; instruction sets. (eq_attr "isa" "fma4") (symbol_ref "TARGET_FMA4 && !TARGET_FMA") ] --cut here-- > Longer term we may well require some sort of > > (TARGET_FMA4 && !(TARGET_FMA && TARGET_PREFER_FMA3)) > > with an appropriate entry in ix86_tune_features to match. This won't work, since we have to prefer FMA3 also in case when only "-mfma -mfma4" without -mtune=XX is used. We can add TARGET_FMA_BOTH though, but I doubt there will ever be target that implements both insn sets without preferences. Uros.
RE: [PATCH,i386] cpuid function for prefetchw
Yes! Thanks Jakub. -Original Message- From: Jakub Jelinek [mailto:ja...@redhat.com] Sent: Monday, August 13, 2012 3:16 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH,i386] cpuid function for prefetchw On Mon, Aug 13, 2012 at 09:29:45AM +, Gopalasubramanian, Ganesh wrote: > To get the prefetchw cpuid flag, cpuid function 0x8001 needs to be > called. > Previous to patch, function 0x7 is called. > > Bootstrapping and "make -k check" passes without failures. > Ok for trunk? IMHO you move it to a wrong spot, ecx bits of CPUID 0x8001 are tested earlier. So I think you want this instead (bootstrap/regtest in progress): 2012-08-13 Ganesh Gopalasubramanian Jakub Jelinek PR driver/54210 * config/i386/driver-i386.c (host_detect_local_cpu): Test bit_PRFCHW bit of CPUID 0x8001 %ecx instead of CPUID 7 %ecx. * config/i386/cpuid.h (bits_PRFCHW): Move definition to CPUID 0x8001 %ecx flags. --- gcc/config/i386/driver-i386.c.jj2012-08-10 15:49:25.0 +0200 +++ gcc/config/i386/driver-i386.c 2012-08-13 11:30:14.570494736 +0200 @@ -467,7 +467,6 @@ const char *host_detect_local_cpu (int a has_bmi2 = ebx & bit_BMI2; has_fsgsbase = ebx & bit_FSGSBASE; has_rdseed = ebx & bit_RDSEED; - has_prfchw = ecx & bit_PRFCHW; has_adx = ebx & bit_ADX; } @@ -488,6 +487,7 @@ const char *host_detect_local_cpu (int a has_xop = ecx & bit_XOP; has_tbm = ecx & bit_TBM; has_lzcnt = ecx & bit_LZCNT; + has_prfchw = ecx & bit_PRFCHW; has_longmode = edx & bit_LM; has_3dnowp = edx & bit_3DNOWP; --- gcc/config/i386/cpuid.h.jj 2012-08-10 15:49:25.0 +0200 +++ gcc/config/i386/cpuid.h 2012-08-13 11:31:30.346494092 +0200 @@ -52,6 +52,7 @@ #define bit_LAHF_LM(1 << 0) #define bit_ABM(1 << 5) #define bit_SSE4a (1 << 6) +#define bit_PRFCHW (1 << 8) #define bit_XOP (1 << 11) #define bit_LWP(1 << 15) #define bit_FMA4(1 << 16) @@ -69,7 +70,6 @@ #define bit_HLE(1 << 4) #define bit_AVX2 (1 << 5) #define bit_BMI2 (1 << 8) -#define bit_PRFCHW (1 << 8) #define bit_RTM(1 << 11) #define bit_RDSEED (1 << 18) #define bit_ADX(1 << 19) Jakub
[PATCH,i386] cpuid function for prefetchw
Hello, To get the prefetchw cpuid flag, cpuid function 0x8001 needs to be called. Previous to patch, function 0x7 is called. Bootstrapping and "make -k check" passes without failures. Ok for trunk? Regards Ganesh 2012-08-13 Ganesh Gopalasubramanian PR driver/54210 * config/i386/driver-i386.c (host_detect_local_cpu): Call cpuid function 0x8001 to get the prfchw cpuid flag. Index: gcc/config/i386/driver-i386.c === --- gcc/config/i386/driver-i386.c (revision 189996) +++ gcc/config/i386/driver-i386.c (working copy) @@ -467,7 +467,6 @@ has_bmi2 = ebx & bit_BMI2; has_fsgsbase = ebx & bit_FSGSBASE; has_rdseed = ebx & bit_RDSEED; - has_prfchw = ecx & bit_PRFCHW; } /* Check cpuid level of extended features. */ @@ -491,6 +490,7 @@ has_longmode = edx & bit_LM; has_3dnowp = edx & bit_3DNOWP; has_3dnow = edx & bit_3DNOW; + has_prfchw = ecx & bit_PRFCHW; } if (!arch
RE: [PATCH,i386] fma,fma4 and xop flags
Thank you Uros, Richard! I will confirm the test results in couple off days. Regards Ganesh -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Saturday, August 11, 2012 3:54 AM To: Richard Henderson Cc: Gopalasubramanian, Ganesh; gcc-patches@gcc.gnu.org Subject: Re: [PATCH,i386] fma,fma4 and xop flags On Fri, Aug 10, 2012 at 10:02 PM, Richard Henderson wrote: > On 2012-08-10 12:59, Uros Bizjak wrote: >> Actually, this is the problem you are trying to solve. The fma4 >> patterns are defined before fma3, so gcc prefers these. > > The Real Problem is that they should not be separate patterns. > They should be a single pattern that selects alternatives via the > enabled isa. 2012-08-11 Uros Bizjak * config/i386/i386.md (isa): Add fma and fma4. (enabled): Handle fma and fma4. * config/i386/sse.md (*fma_fmadd_): Merge *fma4_fmadd_. (*fma_fmsub_): Merge *fma4_fmsub_. (*fma_fnmadd_): Merge *fma4_fnmadd_. (*fma_fnmsub_): Merge *fma4_fnmsub_. (*fma_fmaddsub_): Merge *fma4_fmaddsub_. (*fma_fmsubadd_): Merge *fma4_fmsubadd_. Tested on x86_64-pc-linux-gnu {,-m32}, committed to mainline SVN. I will wait a couple of days before backporting patches to 4.7, so please Ganesh, test mainline if everything is OK. BTW: With this patch, we can enable PTA_FMA4 for bdver2 target. Uros.
RE: [PATCH,i386] fma,fma4 and xop flags
> -mxop implies -mfma4, but reverse is not true. I think this handling went in for bdver1. But, with bdver2, we have both fma and fma4. So for bdver2, -mxop should not be enabling one of them. > if someone set -mfma4 together > with -mfma on the command line, we should NOT disable selected ISA > behind user's back If both -mfma4 and -mfma are enabled, GCC outputs fma4 instructions. This, I think is because fma4 instruction patterns are read before fma instruction patterns from the ".md" files. So, enabling both -mfma4 and -mfma is not good for bdver2. Moreover, if user tries to use, -mfma -mno-fma4 -mxop, the order in which these options are used becomes crucial. -mxop enables -mfma4 and by instruction patterns fma4 instructions gets listed in the assembly file. For the below test, double a,b,c,d; int fn(){ a = b + c * d ; return a; } #1) Using options "-O2 -mno-fma4 -mfma -mxop" outputs fma4. (vfmaddsdb(%rip), %xmm2, %xmm1, %xmm0) #2) Using options "-O2 -mfma -mno-fma4 -mxop" outputs fma4. (vfmaddsdb(%rip), %xmm2, %xmm1, %xmm0) #3) Using options "-mxop -mno-fma4 -mfma" outpts fma. (vfmadd132sd d(%rip), %xmm1, %xmm0) As we see the order in which the options are used becomes crucial. This is confusing. I haven't really tested other implied options. But, I suspect similar phenomenon in those cases too. IMO, we can directly go by the CPUID flags and enable the flags. This will be a one to one mapping and leave the user with lot more liberty. Please let me know your opinion. Regards Ganesh -Original Message- From: Uros Bizjak [mailto:ubiz...@gmail.com] Sent: Friday, August 10, 2012 1:21 AM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH,i386] fma,fma4 and xop flags On Wed, Aug 8, 2012 at 1:31 PM, wrote: > Bdver2 cpu supports both fma and fma4 instructions. > Previous to patch, option "-mno-xop" removes "-mfma4". > Similarly, option "-mno-fma4" removes "-mxop". It looks to me that there is some misunderstanding. AFAICS: -mxop implies -mfma4, but reverse is not true. Please see #define OPTION_MASK_ISA_FMA4_SET \ (OPTION_MASK_ISA_FMA4 | OPTION_MASK_ISA_SSE4A_SET \ | OPTION_MASK_ISA_AVX_SET) #define OPTION_MASK_ISA_XOP_SET \ (OPTION_MASK_ISA_XOP | OPTION_MASK_ISA_FMA4_SET) So, -mxop sets -mfma4, etc ..., but -mfma4 does NOT enable -mxop. OTOH, #define OPTION_MASK_ISA_FMA4_UNSET \ (OPTION_MASK_ISA_FMA4 | OPTION_MASK_ISA_XOP_UNSET) #define OPTION_MASK_ISA_XOP_UNSET OPTION_MASK_ISA_XOP -mno-fma4 implies -mno-xop, but again reverse is not true. Thus, -mno-xop does NOT imply -mno-fma4. > So, the patch conditionally disables "-mfma" or "-mfma4". > Enabling "-mxop" is done by also checking "-mfma". Please note that conditional handling of ISA flags belongs to ix86_option_override_internal. However, if someone set -mfma4 together with -mfma on the command line, we should NOT disable selected ISA behind user's back, in the same way as we don't disable anything with "-march=i386 -msse4". With -march=bdver2, we already marked that only fma is supported, and if user selected "-march=bdver2 -mfma4" on the command line, we shouldn't disable anything. Uros.
RE: [PATCH,i386] fma,fma4 and xop flags
> Otherwise, what does -mno-fma4 -mxop do? > (it should enable both xop and fma4!) what should -mfma4 -mno-xop do > (it should disable both xop and fma4!). Yes! that's what GCC does now. Some flags are coupled (atleast for now). For ex, -mno-sse4.2 -mavx enables both sse4.2 and avx whereas -mavx -mno-sse4.2 disables both. Setting of the following are clubbed. 1) 3DNow sets MMX 2) SSE2 sets SSE 3) SSE3 sets SSE2 4) SSE4_1 sets SSE3 5) SSE4_2 sets SSE4_1 6) FMA sets AVX 7) AVX2 sets AVX 8) SSE4_A sets SSE3 9) FMA4 set SSE4_A and AVX 10) XOP sets FMA4 11) AES sets SSE2 12) PCLMUL sets SSE2 13) ABM sets POPCNT Resetting is done in reversely (MMX resets 3DNOW). IMO, if we have different cpuid flags, enabling\disabling the compiler flags depends on these cpuid flags directly. Adding subsets to them or tangling them together may give wrong results. Please let me know your opinion. Regards Ganesh -Original Message- From: Richard Guenther [mailto:richard.guent...@gmail.com] Sent: Wednesday, August 08, 2012 5:12 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org; ubiz...@gmail.com Subject: Re: [PATCH,i386] fma,fma4 and xop flags On Wed, Aug 8, 2012 at 1:31 PM, wrote: > Hello, > > Bdver2 cpu supports both fma and fma4 instructions. > Previous to patch, option "-mno-xop" removes "-mfma4". > Similarly, option "-mno-fma4" removes "-mxop". Eh? Why's that? I think we should disentangle -mxop and -mfma4 instead. Otherwise, what does -mno-fma4 -mxop do? (it should enable both xop and fma4!) what should -mfma4 -mno-xop do (it should disable both xop and fma4!). All this is just confusing to the user, even if in AMD documents XOP includes FMA4. Richard.
Backport: fma3 instruction generation for 'march=native' in AMD processors
Hello, Below is the patch that has been committed in trunk (Revision: 187075). We like to backport it to GCC 4.7 branch as couple of AMD processors require this change for fma3 instruction generation. Bootstrapping and testing are successful. Is it OK to commit in GCC 4.7 branch? Regards Ganesh PATCH = * config/i386/driver-i386.c (host_detect_local_cpu): Reset has_fma4 for AMD processors with both fma3 and fma4 support. Index: config/i386/driver-i386.c === --- config/i386/driver-i386.c (revision 186897) +++ config/i386/driver-i386.c (working copy) @@ -472,6 +472,8 @@ has_abm = ecx & bit_ABM; has_lwp = ecx & bit_LWP; has_fma4 = ecx & bit_FMA4; + if (vendor == SIG_AMD && has_fma4 && has_fma) + has_fma4 = 0; has_xop = ecx & bit_XOP; has_tbm = ecx & bit_TBM; has_lzcnt = ecx & bit_LZCNT;
Re: [PATCH] [i386] fma3 instruction generation for 'march=native' in AMD processors
I have added the ChangeLog and modified the patch. Is it OK to commit to trunk? Regards Ganesh 2012-05-03 Ganesh Gopalasubramanian * config/i386/driver-i386.c (host_detect_local_cpu): Reset has_fma4 for AMD processors with both fma3 and fma4 support. Index: config/i386/driver-i386.c === --- config/i386/driver-i386.c (revision 186897) +++ config/i386/driver-i386.c (working copy) @@ -472,6 +472,8 @@ has_abm = ecx & bit_ABM; has_lwp = ecx & bit_LWP; has_fma4 = ecx & bit_FMA4; + if (vendor == SIG_AMD && has_fma4 && has_fma) + has_fma4 = 0; has_xop = ecx & bit_XOP; has_tbm = ecx & bit_TBM; has_lzcnt = ecx & bit_LZCNT; -Original Message- From: Jakub Jelinek [mailto:ja...@redhat.com] Sent: Wednesday, May 02, 2012 5:11 PM To: Gopalasubramanian, Ganesh Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH] [i386] fma3 instruction generation for 'march=native' in AMD processors On Wed, May 02, 2012 at 11:12:33AM +, Gopalasubramanian, Ganesh wrote: > For AMD architectures with both fma3 and fma4 instructions' support, GCC > generates fma4 by default. Instead, we like to generate fma3 instruction. > Below patch enables the fma3 instruction generation for "-march=native". > > Ok for trunk? You haven't provided ChangeLog entry. > Index: gcc/config/i386/driver-i386.c > === > --- gcc/config/i386/driver-i386.c (revision 186897) > +++ gcc/config/i386/driver-i386.c (working copy) > @@ -472,6 +472,10 @@ >has_abm = ecx & bit_ABM; >has_lwp = ecx & bit_LWP; >has_fma4 = ecx & bit_FMA4; > + if (((vendor == SIG_AMD)) && (has_fma4) && (has_fma)) > +{ > +has_fma4 = 0; > +} And the formatting of this is wrong, 4 unnecessary pairs of (), one unnecessary {} pair, bad indentation of the has_fma4 = 0; assignment (should use a tab). Jakub
[PATCH] [i386] fma3 instruction generation for 'march=native' in AMD processors
For AMD architectures with both fma3 and fma4 instructions' support, GCC generates fma4 by default. Instead, we like to generate fma3 instruction. Below patch enables the fma3 instruction generation for "-march=native". Ok for trunk? Index: gcc/config/i386/driver-i386.c === --- gcc/config/i386/driver-i386.c (revision 186897) +++ gcc/config/i386/driver-i386.c (working copy) @@ -472,6 +472,10 @@ has_abm = ecx & bit_ABM; has_lwp = ecx & bit_LWP; has_fma4 = ecx & bit_FMA4; + if (((vendor == SIG_AMD)) && (has_fma4) && (has_fma)) +{ +has_fma4 = 0; +} has_xop = ecx & bit_XOP; has_tbm = ecx & bit_TBM; has_lzcnt = ecx & bit_LZCNT; Regards Ganesh