from:"Gopalasubramanian, Ganesh"

RE: [PATCH] Fix bdverN vector cost of cond_[not_]taken_branch_cost

2015-04-12 Thread Gopalasubramanian, Ganesh

>Note that before the fixes for PR64909 the epilogue/prologue loops had very 
>large costs associated due to a bug in the cost model implementation.  After 
>the fix their cost is reasonable but the cost of the extra jumps is way 
>under-accounted for due to the numbers for >cond_taken_branch_cost and 
>cond_not_taken_branch_cost.
> The proposes match mitigates that somewhat.

Richard! The patch is good. We are done with our benchmarking and found no 
regressions.

> How did you arrive at the original cost model?

The original cost model as you suspect is not based on architecture alone. 
Those are the numbers arrived at by analyzing benchmarks and the cost model 
bugs then. These initial numbers were copied for subsequent architectures too.
Cost assignments saying "scalar_stmt_cost = 6" and "scalar load_cost = 4" 
doesn't make sense at all. We will have a look into it.

Regards
Ganesh

-Original Message-
From: Richard Biener [mailto:rguent...@suse.de] 
Sent: Wednesday, April 08, 2015 1:08 PM
To: Gopalasubramanian, Ganesh
Cc: Uros Bizjak; gcc-patches@gcc.gnu.org
Subject: RE: [PATCH] Fix bdverN vector cost of cond_[not_]taken_branch_cost

On Wed, 8 Apr 2015, Gopalasubramanian, Ganesh wrote:

> > I have added a person from AMD to comment on the decision.
> > Otherwise, the patch looks OK, but please wait a couple of days for 
> > possible comments.
> 
> Thank you Uros!
> I am checking the changes with few tests and benchmarking them.
> Please wait for a couple of days.

Note that before the fixes for PR64909 the epilogue/prologue loops had very 
large costs associated due to a bug in the cost model implementation.  After 
the fix their cost is reasonable but the cost of the extra jumps is way 
under-accounted for due to the numbers for cond_taken_branch_cost and 
cond_not_taken_branch_cost.
The proposes match mitigates that somewhat.

How did you arrive at the original cost model?

Thanks,
Richard.

--
Richard Biener 
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Jennifer Guild, Dilip 
Upmanyu, Graham Norton HRB 21284 (AG Nuernberg)

RE: [PATCH] Fix bdverN vector cost of cond_[not_]taken_branch_cost

2015-04-07 Thread Gopalasubramanian, Ganesh

> I have added a person from AMD to comment on the decision.
> Otherwise, the patch looks OK, but please wait a couple of days for possible 
> comments.

Thank you Uros!
I am checking the changes with few tests and benchmarking them.
Please wait for a couple of days.

-Ganesh

RE: [PATCH] Rename gimple_build_assign_with_ops to gimple_build_assign and swap the first two arguments of it

2014-12-01 Thread Gopalasubramanian, Ganesh

The following patch implements that.  Bootstrapped/regtested on x86_64-linux
and i686-linux, ok for trunk?

Our aarch64 build also breaks as mentioned in 
https://gcc.gnu.org/ml/gcc-patches/2014-12/msg00119.html

Regards
Ganesh

RE: [PATCH, aarch64] Add prefetch support

2014-11-30 Thread Gopalasubramanian, Ganesh

Please ignore the previous patch sent. The attachment was wrong.

> There's no point in the buffer or the sprintf.
> The text is short enough to repeat whole pattern in the array:

Updated the patch for the above suggestions.
make -k check RUNTESTFLAGS="execute.exp compile.exp dg.exp" passes.

Is it ok for upstream?

Regards
Ganesh


prefetch.diff
Description: prefetch.diff

RE: [PATCH, aarch64] Add prefetch support

2014-11-30 Thread Gopalasubramanian, Ganesh

> There's no point in the buffer or the sprintf.
> The text is short enough to repeat whole pattern in the array:

Updated the patch for the above suggestions.
Is it ok for upstream?

Regards
Ganesh


prefetch.diff
Description: prefetch.diff

RE: [PATCH, aarch64] Add prefetch support

2014-11-14 Thread Gopalasubramanian, Ganesh

> For this prefetch patch I suggest we go with the existing "load1".
I have removed the changes done in types.md.

> The inline patch has been munged by your mailer, I tried applying the patch 
> to my tree but it is full of  escape sequences.  Can you either fix your 
> mailer or submit patches as attachments?
I am attaching the revised patch.

> Check the whitespace in your ChangeLog line.
Changelog entry is also embedded in the attachment.

Regards
Ganesh


prefetch.diff
Description: prefetch.diff

FW: [PATCH, aarch64] Add prefetch support

2014-11-10 Thread Gopalasubramanian, Ganesh

PING! I am worried if it goes in stage-1.

-Original Message-
From: Gopalasubramanian, Ganesh 
Sent: Thursday, October 30, 2014 2:24 PM
To: gcc-patches@gcc.gnu.org
Subject: [PATCH, aarch64] Add prefetch support

Hi,

Below is the patch that implements prefetching support.

This patch has been already discussed on
a) https://gcc.gnu.org/ml/gcc-patches/2014-02/msg01644.html
b) https://gcc.gnu.org/ml/gcc-patches/2014-07/msg00612.html

I have not added a test as there are ample tests in compile and execute suites.
"make -k check" passes. Ok for trunk?

Changelog:
2014-10-30  Ganesh Gopalasubramanian 

    * config/aarch64/aarch64.md (define_insn "prefetch"): New.
    * config/arm/types.md (define_attr "type"): Add prefetch.

Regards
Ganesh

diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md 
index 74b554e..12a3f170 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -320,6 +320,38 @@
   [(set_attr "type" "no_insn")]
)

+
+(define_insn "prefetch"
+  [(prefetch (match_operand:DI 0 "address_operand" "r")
+    (match_operand:QI 1 "const_int_operand" "")
+    (match_operand:QI 2 "const_int_operand" ""))]
+  ""
+  "*
+{
+  const char * pftype[2][10]
+    = { {\"PLDL1STRM\", \"PLDL3KEEP\", \"PLDL2KEEP\", \"PLDL1KEEP\"},
+   {\"PSTL1STRM\", \"PSTL3KEEP\", \"PSTL2KEEP\", \"PSTL1KEEP\"},
+  };
+
+  int locality = INTVAL (operands[2]);
+  char pattern[100];
+
+  gcc_assert (IN_RANGE (locality, 0, 3));
+
+  strcpy (pattern, \"prfm\\t\");
+  strcat (pattern, (const char*)pftype[INTVAL(operands[1])][locality]);
+  strcat (pattern, \", %a0\");
+
+  output_asm_insn (pattern,
+   operands);
+
+  return \"\";
+
+}"
+  [(set_attr "type" "prefetch")]
+)
+
(define_insn "trap"
   [(trap_if (const_int 1) (const_int 8))]
   ""
diff --git a/gcc/config/arm/types.md b/gcc/config/arm/types.md index 
c1151f5..8b4b7a1 100644
--- a/gcc/config/arm/types.md
+++ b/gcc/config/arm/types.md
@@ -118,6 +118,7 @@
; mvn_shift_reg  inverting move instruction, shifted operand by a register.
; no_insn    an insn which does not represent an instruction in the ;   
 final output, thus having no impact on scheduling.
+; prefetch   a prefetch instruction
; rbit   reverse bits.
; rev    reverse bytes.
; sdiv   signed division.
@@ -556,6 +557,7 @@
   call,\
   clz,\
   no_insn,\
+  prefetch,\
   csel,\
   crc,\

[PATCH, aarch64] Add prefetch support

2014-10-30 Thread Gopalasubramanian, Ganesh

Hi,

Below is the patch that implements prefetching support.

This patch has been already discussed on 
a) https://gcc.gnu.org/ml/gcc-patches/2014-02/msg01644.html
b) https://gcc.gnu.org/ml/gcc-patches/2014-07/msg00612.html

I have not added a test as there are ample tests in compile and execute suites.
"make -k check" passes. Ok for trunk?

Changelog:
2014-10-30  Ganesh Gopalasubramanian 

    * config/aarch64/aarch64.md (define_insn "prefetch"): New.
    * config/arm/types.md (define_attr "type"): Add prefetch.

Regards
Ganesh

diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 74b554e..12a3f170 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -320,6 +320,38 @@
   [(set_attr "type" "no_insn")]
)

+
+(define_insn "prefetch"
+  [(prefetch (match_operand:DI 0 "address_operand" "r")
+    (match_operand:QI 1 "const_int_operand" "")
+    (match_operand:QI 2 "const_int_operand" ""))]
+  ""
+  "*
+{
+  const char * pftype[2][10]
+    = { {\"PLDL1STRM\", \"PLDL3KEEP\", \"PLDL2KEEP\", \"PLDL1KEEP\"},
+   {\"PSTL1STRM\", \"PSTL3KEEP\", \"PSTL2KEEP\", \"PSTL1KEEP\"},
+  };
+
+  int locality = INTVAL (operands[2]);
+  char pattern[100];
+
+  gcc_assert (IN_RANGE (locality, 0, 3));
+
+  strcpy (pattern, \"prfm\\t\");
+  strcat (pattern, (const char*)pftype[INTVAL(operands[1])][locality]);
+  strcat (pattern, \", %a0\");
+
+  output_asm_insn (pattern,
+   operands);
+
+  return \"\";
+
+}"
+  [(set_attr "type" "prefetch")]
+)
+
(define_insn "trap"
   [(trap_if (const_int 1) (const_int 8))]
   ""
diff --git a/gcc/config/arm/types.md b/gcc/config/arm/types.md
index c1151f5..8b4b7a1 100644
--- a/gcc/config/arm/types.md
+++ b/gcc/config/arm/types.md
@@ -118,6 +118,7 @@
; mvn_shift_reg  inverting move instruction, shifted operand by a register.
; no_insn    an insn which does not represent an instruction in the
;    final output, thus having no impact on scheduling.
+; prefetch   a prefetch instruction
; rbit   reverse bits.
; rev    reverse bytes.
; sdiv   signed division.
@@ -556,6 +557,7 @@
   call,\
   clz,\
   no_insn,\
+  prefetch,\
   csel,\
   crc,\

RE: RFA: another patch to fix PR61360

2014-09-24 Thread Gopalasubramanian, Ganesh

>The "r->x" alternative results in "vector" decoding on amdfam10. This is 
>AMD-speak for microcoded instructions, and AMD optimization manual strongly 
>recommends avoiding them. I have CC'd Ganesh, maybe he >can provide more 
>relevant data on the performance impact.

Thanks Uros!

Yes, the AMD SWOG recommends precisely what Uros mentions.

When moving data from a GPR to an XMM register, use separate store and load 
instructions to move
the data first from the source register to a temporary location in memory and 
then from memory into
the destination register


This is listed as an optimization too. This holds good for all amdfam10 and BD  
family processors. 
I have to dig through the performance numbers will try to get them.

Regards
Ganesh

[PATCH, i386] PR61360: Do not update "enabled" attribute during lra and reload passes

2014-08-22 Thread Gopalasubramanian, Ganesh

This patch fixes PR 61360.

The attribute "enabled" should actually be used enable/disable alternative 
based on sub-targets.
In this pattern, it gets used across passes too. 
However, modifying this attribute in LRA pass is not something it is meant for.
This patch allows enabling/disabling the attribute when optimizing for size, 
but not during lra pass or reload pass.

Bootstrap passes.
OK for upstream?

Regards
Ganesh

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 6d91da0..3775f6e 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,4 +1,10 @@
-2014-08-22  David Malcolm  
+2014-08-22 Ganesh Gopalasubramanian  
+
+   PR 61360
+   * config/i386/i386.md (*float2_sse):
+   Do not modify "enabled" attribute during LRA pass.
+
+014-08-22  David Malcolm  

* cprop.c (struct occr): Strengthen field "insn" from rtx to
rtx_insn *.
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md
index 8e74eab..de2ecf0 100644
--- a/gcc/config/i386/i386.md
+++ b/gcc/config/i386/i386.md
@@ -4795,10 +4795,10 @@
   /* ??? For sched1 we need constrain_operands to be able to
  select an alternative.  Leave this enabled before RA.  */
   (symbol_ref "TARGET_INTER_UNIT_CONVERSIONS
-   || optimize_function_for_size_p (cfun)
-   || !(reload_completed
-|| reload_in_progress
-|| lra_in_progress)")
+   || (optimize_function_for_size_p (cfun)
+   && !(reload_completed
+|| reload_in_progress
+|| lra_in_progress))")
]
(symbol_ref "true")))
])

RE: [PATCH, i386] Remove use of vpmacsdql instruction from multiplication.

2014-08-11 Thread Gopalasubramanian, Ganesh

Hi Uros!

> > +2014-06-10  Ganesh Gopalasubramanian 
> > +
> > +
> > +   * config/i386/i386.c (ix86_expand_sse2_mulvxdi3): Issue 
> > +instructions "vpmuludq" and "vpaddq" instead of "vpmacsdql" for 
> > +handling 32-bit multiplication.
> >

> OK for mainline and release branches.

I would like to backport the above patch for 4.9. Is it OK?

Regards
Ganesh

RE: [PATCH, i386] Add RDRND and MOVBE for AMD bdver4

2014-08-08 Thread Gopalasubramanian, Ganesh

> OK for mainline.

Thanks Uros. Committed to revision 213572
I would like to backport to 4.9 branch too.
Is it OK?

- Ganesh

[PATCH, i386] Add RDRND and MOVBE for AMD bdver4

2014-08-04 Thread Gopalasubramanian, Ganesh

Below patch adds PTA_RDRND and PTA_MOVBE for bdver4.

Bootstrap passes.

Ok for upstream?

Regards
Ganesh

Index: gcc/ChangeLog
===
--- gcc/ChangeLog   (revision 213568)
+++ gcc/ChangeLog   (working copy)
@@ -24,6 +24,11 @@

2014-08-04 Ganesh Gopalasubramanian  

+* gcc/config/i386/i386.c (ix86_option_override_internal): Add
+   PTA_RDRND and PTA_MOVBE for bdver4.
+
+2014-08-04 Ganesh Gopalasubramanian  
+
* config/i386/driver-i386.c (host_detect_local_cpu): Handle AMD's 
extended
family information. Handle BTVER2 cpu with cpuid family value.

Index: gcc/config/i386/i386.c
===
--- gcc/config/i386/i386.c  (revision 213568)
+++ gcc/config/i386/i386.c  (working copy)
@@ -3267,12 +3267,13 @@ ix86_option_override_internal (bool main
| PTA_FMA | PTA_PRFCHW | PTA_FXSR | PTA_XSAVE
| PTA_XSAVEOPT | PTA_FSGSBASE},
  {"bdver4", PROCESSOR_BDVER4, CPU_BDVER4,
-PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3
-| PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1
-| PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_AVX2
+   PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3
+   | PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1
+   | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_AVX2
| PTA_FMA4 | PTA_XOP | PTA_LWP | PTA_BMI | PTA_BMI2
| PTA_TBM | PTA_F16C | PTA_FMA | PTA_PRFCHW | PTA_FXSR
-   | PTA_XSAVE | PTA_XSAVEOPT | PTA_FSGSBASE},
+   | PTA_XSAVE | PTA_XSAVEOPT | PTA_FSGSBASE | PTA_RDRND
+   | PTA_MOVBE},
   {"btver1", PROCESSOR_BTVER1, CPU_GENERIC,
PTA_64BIT | PTA_MMX |  PTA_SSE  | PTA_SSE2 | PTA_SSE3
| PTA_SSSE3 | PTA_SSE4A |PTA_ABM | PTA_CX16 | PTA_PRFCHW

RE: [PATCH, i386] Handle extended family cpuid info for AMD

2014-08-01 Thread Gopalasubramanian, Ganesh

> In this case, having only check for family ID should be enough. If
>BTVER1 and BTVER2 can be uniquely determined by their family IDs ,

>IMO, this would be the most future-proof approach. Signature checks will 
>override family id checks which will override cpuid checks.

Thank you Uros!

I have modified source only for BTVER2. 
The way BTVER1 is currently assigned to processor includes more than one 
family. So, I am leaving that unmoved.

Bootstrap passes.
Is it OK for trunk and backport to open branches.

Regards
-Ganesh


diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 706fedc..202bd99 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,8 @@
+2014-08-01  Ganesh Gopalasubramanian  
+
+   * config/i386/driver-i386.c (host_detect_local_cpu): Handle AMD's 
extended family
+   information. Handle BTVER2 cpu with cpuid family value.  
+
 2014-07-31  James Greenhalgh  
 
* config/aarch64/arm_neon.h (vpadd_<8,16,32,64>): Move to
diff --git a/gcc/config/i386/driver-i386.c b/gcc/config/i386/driver-i386.c
index 1c6385f..0402c90 100644
--- a/gcc/config/i386/driver-i386.c
+++ b/gcc/config/i386/driver-i386.c
@@ -432,7 +432,8 @@ const char *host_detect_local_cpu (int argc, const char 
**argv)
 
   model = (eax >> 4) & 0x0f;
   family = (eax >> 8) & 0x0f;
-  if (vendor == signature_INTEL_ebx)
+  if ((vendor == signature_INTEL_ebx) ||
+  (vendor == signature_AMD_ebx))
 {
   unsigned int extended_model, extended_family;
 
@@ -576,7 +577,7 @@ const char *host_detect_local_cpu (int argc, const char 
**argv)
 
   if (name == signature_NSC_ebx)
processor = PROCESSOR_GEODE;
-  else if (has_movbe)
+  else if (family == 22)
processor = PROCESSOR_BTVER2;
   else if (has_avx2)
 processor = PROCESSOR_BDVER4;

RE: [PATCH, i386] Handle extended family cpuid info for AMD

2014-07-31 Thread Gopalasubramanian, Ganesh

Uros!

> I would like to have a check for a family at the beginning, something like:

>  if (name == signature_NSC_ebx)
>processor = PROCESSOR_GEODE;
>  else if (family == 22)
>{
>  if (has_movbe)

I get your idea of having the family checked first and then differentiating 
with cpuid info later.
But, this case is getting interesting because, BTVER1 and BTVER2 are two 
variants but don't really have same family numbers.
BTVER1 is family 14h and BTVER2 is family 16h. I don't see near term plans for 
any additional cpus to either 14h or 16h.
Given that fact, this particular check is applicable only for BTVER2.
In that case, having 

  else if (family == 22)
if (has_movbe)
   processor = PROCESSOR_BTVER2;

looks odd. 

Regards
Ganesh

RE: [PATCH, i386] Handle extended family cpuid info for AMD

2014-07-31 Thread Gopalasubramanian, Ganesh

> Then just use:

> +  else if (has_avx2)
> +processor = PROCESSOR_BDVER4;
>   else if (has_movbe)
>processor = PROCESSOR_BTVER2;
>-  else if (has_avx2)
>-processor = PROCESSOR_BDVER4;
>  else if (has_xsaveopt)

In that case, with earlier GCC versions where we don’t have bdver4 added, the 
fall back would be BTVER2, 
whereas a BD variant is more desirable.

Ganesh

RE: [PATCH, i386] Handle extended family cpuid info for AMD

2014-07-31 Thread Gopalasubramanian, Ganesh

> But, looking to processor_alias_table in config/i386/i386.c, only
> PROCESSOR_BTVER2 defines PTA_MOVBE. According to this, the logic is already 
> correct, so the patch is not needed.

We are evaluating bdver4 cpu. Bdver4 also supports MOVBE. I will submit patch 
for bdver4 PTA after our evaluation.

Ganesh.

[PATCH, i386] Handle extended family cpuid info for AMD

2014-07-31 Thread Gopalasubramanian, Ganesh

Hi,

The below patch handles the AMD's cpuid family information.
With the information from cpuid, BTVER2 cpu for -march=native flag is handled.

Bootstrap passes.
Is it OK for trunk and branches?

Regards
Ganesh 

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 6223bd6..3f8bb2c 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,8 @@
+2014-07-31  Ganesh Gopalasubramanian  
+
+   * tree-sra.c (host_detect_local_cpu): Handle AMD's extended family
+   information. Handle BTVER2 cpu with cpuid family value.  
+
 2014-07-30  Martin Jambor  
 
* tree-sra.c (sra_ipa_modify_assign): Change type of the first
diff --git a/gcc/config/i386/driver-i386.c b/gcc/config/i386/driver-i386.c
index 1c6385f..21ae1f3 100644
--- a/gcc/config/i386/driver-i386.c
+++ b/gcc/config/i386/driver-i386.c
@@ -432,7 +432,8 @@ const char *host_detect_local_cpu (int argc, const char 
**argv)
 
   model = (eax >> 4) & 0x0f;
   family = (eax >> 8) & 0x0f;
-  if (vendor == signature_INTEL_ebx)
+  if ((vendor == signature_INTEL_ebx) ||
+  (vendor == signature_AMD_ebx))
 {
   unsigned int extended_model, extended_family;
 
@@ -576,7 +577,7 @@ const char *host_detect_local_cpu (int argc, const char 
**argv)
 
   if (name == signature_NSC_ebx)
processor = PROCESSOR_GEODE;
-  else if (has_movbe)
+  else if (has_movbe && family == 22)
processor = PROCESSOR_BTVER2;
   else if (has_avx2)
 processor = PROCESSOR_BDVER4;

FW: [PATCH, aarch64] Add prefetch support

2014-07-08 Thread Gopalasubramanian, Ganesh

PING!

-Original Message-
From: Gopalasubramanian, Ganesh 
Sent: Sunday, July 06, 2014 2:12 AM
To: gcc-patches@gcc.gnu.org
Cc: marcus.shawcr...@arm.com; richard.earns...@arm.com
Subject: RE: [PATCH, aarch64] Add prefetch support

PING!

From: Gopalasubramanian, Ganesh
Sent: Friday, July 04, 2014 5:57 AM
To: gcc-patches@gcc.gnu.org
Cc: marcus.shawcr...@arm.com; richard.earns...@arm.com
Subject: [PATCH, aarch64] Add prefetch support

Hi,

Attached is a patch that implements
*   Prefetch with immediate offset in the range 0 to 32760 (multiple of 8). 
Added a predicate for this.
*   Prefetch with immediate offset - in the range -256 to 255 (Gets 
generated only when we have a negative offset. Generates prfum instruction). 
Added a predicate for this.
*   Prefetch with register offset. (modified for printing the locality)

This patch has been already discussed on 
https://gcc.gnu.org/ml/gcc-patches/2014-02/msg01644.html

"make -k check" passes. Ok for trunk?

Changelog

2014-07-04  Ganesh Gopalasubramanian 

* config/aarch64/aarch64.md (define_insn "*prefetch")
(define_insn "prefetch"): New
* config/aarch64/predicates.md (aarch64_prefetch_pimm)
(aarch64_prefetch_unscaled): New.
* config/arm/types.md (define_attr "type"): Add prefetch.

Regards
Ganesh

RE: [PATCH, aarch64] Add prefetch support

2014-07-05 Thread Gopalasubramanian, Ganesh

PING!

From: Gopalasubramanian, Ganesh
Sent: Friday, July 04, 2014 5:57 AM
To: gcc-patches@gcc.gnu.org
Cc: marcus.shawcr...@arm.com; richard.earns...@arm.com
Subject: [PATCH, aarch64] Add prefetch support

Hi,

Attached is a patch that implements
*   Prefetch with immediate offset in the range 0 to 32760 (multiple of 8). 
Added a predicate for this.
*   Prefetch with immediate offset - in the range -256 to 255 (Gets 
generated only when we have a negative offset. Generates prfum instruction). 
Added a predicate for this.
*   Prefetch with register offset. (modified for printing the locality)

This patch has been already discussed on 
https://gcc.gnu.org/ml/gcc-patches/2014-02/msg01644.html

"make -k check" passes. Ok for trunk?

Changelog

2014-07-04  Ganesh Gopalasubramanian 

* config/aarch64/aarch64.md (define_insn "*prefetch")
(define_insn "prefetch"): New
* config/aarch64/predicates.md (aarch64_prefetch_pimm)
(aarch64_prefetch_unscaled): New.
* config/arm/types.md (define_attr "type"): Add prefetch.

Regards
Ganesh

[PATCH, aarch64] Add prefetch support

2014-07-04 Thread Gopalasubramanian, Ganesh

Hi,

Attached is a patch that implements
*   Prefetch with immediate offset in the range 0 to 32760 (multiple of 8). 
Added a predicate for this.
*   Prefetch with immediate offset - in the range -256 to 255 (Gets 
generated only when we have a negative offset. Generates prfum instruction). 
Added a predicate for this.
*   Prefetch with register offset. (modified for printing the locality)

This patch has been already discussed on 
https://gcc.gnu.org/ml/gcc-patches/2014-02/msg01644.html

"make -k check" passes. Ok for trunk?

Changelog

2014-07-04  Ganesh Gopalasubramanian 

* config/aarch64/aarch64.md (define_insn "*prefetch")
(define_insn "prefetch"): New
* config/aarch64/predicates.md (aarch64_prefetch_pimm) 
(aarch64_prefetch_unscaled): New.
* config/arm/types.md (define_attr "type"): Add prefetch.

Regards
Ganesh


prefetch.diff
Description: prefetch.diff

[PATCH, i386] Remove use of vpmacsdql instruction from multiplication.

2014-06-10 Thread Gopalasubramanian, Ganesh

Hi,

The below patch fixes the issue with 64-bit multiplication.
The instruction "vpmacsdql" does signed 32-bit multiplication.
For V2DImode, we require widened unsigned multiplication.
So, replacing the "vpmacsdql" instruction with "vpmuludq" and "vpaddq".

This patch had been already discussed in 
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52908

With required change in the test xop-imul64-vector.c,  make check passes. Is it 
OK for upstream?

Regards
Ganesh

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index d0a1253..c158612 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,9 @@
+2014-06-10  Ganesh Gopalasubramanian 
+
+   * config/i386/i386.c (ix86_expand_sse2_mulvxdi3): Issue instructions
+"vpmuludq" and "vpaddq" instead of "vpmacsdql" for handling 32-bit
+multiplication.
+
 2014-06-07  Jan Hubicka  

* cgraphunit.c (assemble_thunks_and_aliases): Expand thunks before
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 9105132..184d82d 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -45205,8 +45205,10 @@ ix86_expand_sse2_mulvxdi3 (rtx op0, rtx op1, rtx op2)
   /* t4: ((B*E)+(A*F))<<32, ((D*G)+(C*H))<<32 */
   emit_insn (gen_ashlv2di3 (t4, t3, GEN_INT (32)));

-  /* op0: (((B*E)+(A*F))<<32)+(B*F), (((D*G)+(C*H))<<32)+(D*H) */
-  emit_insn (gen_xop_pmacsdql (op0, op1, op2, t4));
+  /* Multiply lower parts and add all */
+  t5 = gen_reg_rtx (V2DImode);
+  emit_insn (gen_vec_widen_umult_even_v4si (t5, gen_lowpart (V4SImode, 
op1), gen_lowpart (V4SImode, op2)));
+  op0 = expand_binop (mode, add_optab, t5, t4, op0, 1, OPTAB_DIRECT);
 }
   else
 {
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index a6913af..757d3e3 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,8 @@
+2014-06-10 Ganesh Gopalasubramanian  
+
+   * gcc.target/i386/xop-imul64-vector.c: Remove the check for
+   vpmacsdql instruction.
+
 2014-06-07  Eric Botcazou  

* gnat.dg/opt38.adb: New test.
diff --git a/gcc/testsuite/gcc.target/i386/xop-imul64-vector.c 
b/gcc/testsuite/gcc.target/i386/xop-imul64-vector.c
index fbf605f..fc8c880 100644
--- a/gcc/testsuite/gcc.target/i386/xop-imul64-vector.c
+++ b/gcc/testsuite/gcc.target/i386/xop-imul64-vector.c
@@ -33,4 +33,3 @@ int main ()

 /* { dg-final { scan-assembler "vpmulld" } } */
 /* { dg-final { scan-assembler "vphadddq" } } */
-/* { dg-final { scan-assembler "vpmacsdql" } } */

RE: [AArch64 05/14] Add AArch64 'prefetch'-pattern.

2014-05-28 Thread Gopalasubramanian, Ganesh

Hi Philipp,

> These changes look good to me.
> We'll try them out on the benchmarks that caused us to add prefetching in the 
> first place.

If you are OK, I would like to get these changes upstreamed.

-Ganesh

-Original Message-
From: Dr. Philipp Tomsich [mailto:philipp.toms...@theobroma-systems.com] 
Sent: Friday, February 28, 2014 2:58 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org; pins...@gmail.com
Subject: Re: [AArch64 05/14] Add AArch64 'prefetch'-pattern.

Ganesh,

On 28 Feb 2014, at 10:13 , Gopalasubramanian, Ganesh 
 wrote:

> I also have attached a patch that implements the following. 
> * Prefetch with immediate offset in the range 0 to 32760 (multiple of 8). 
> Added a predicate for this.
> * Prefetch with immediate offset - in the range -256 to 255 (Gets 
> generated only when we have a negative offset. Generates prfum instruction). 
> Added a predicate for this.
> * Prefetch with register offset. (modified for printing the locality)

These changes look good to me.
We'll try them out on the benchmarks that caused us to add prefetching in the 
first place.

Best,
Philipp.

RE: [AArch64 05/14] Add AArch64 'prefetch'-pattern.

2014-02-28 Thread Gopalasubramanian, Ganesh

Avoided top-posting and resending.

+  /* temporal locality */
+  return (INTVAL(operands[1])) ? \"prfm\\tPSTL1KEEP, [%0, #0]\" : 
+\"prfm\\tPLDL1KEEP, [%0, #0]\"; }"
+  [(set_attr "type" "prefetch")]
+)
+

With the locality value received in the instruction pattern, I think it would 
be safe to handle them in prefetch instruction.
This helps especially AArch64 has prefetch instructions that can handle this 
locality.

+(define_insn "prefetch"
+  [(prefetch (match_operand:DI 0 "address_operand" "r")
+(match_operand:QI 1 "const_int_operand" "n")
+(match_operand:QI 2 "const_int_operand" "n"))]
+  ""
+  "*
+{
+  int locality = INTVAL (operands[2]);
+
+  gcc_assert (IN_RANGE (locality, 0, 3));
+
+  if (locality == 0)
+ /* non temporal locality */
+ return (INTVAL(operands[1])) ? \"prfm\\tPSTL1STRM, [%0, #0]\" : 
\"prfm\\tPLDL1STRM, [%0, #0]\";
+
+  /* temporal locality */
+  return (INTVAL(operands[1])) ? \"prfm\\tPSTL%2KEEP, [%0, #0]\" : 
\"prfm\\tPLDL%2KEEP, [%0, #0]\";
+}"
+  [(set_attr "type" "prefetch")]
+)
+

I also have attached a patch that implements the following. 
*   Prefetch with immediate offset in the range 0 to 32760 (multiple of 8). 
Added a predicate for this.
*   Prefetch with immediate offset - in the range -256 to 255 (Gets 
generated only when we have a negative offset. Generates prfum instruction). 
Added a predicate for this.
*   Prefetch with register offset. (modified for printing the locality)

Regards
Ganesh


prefetchdiff.log
Description: prefetchdiff.log

RE: [AArch64 05/14] Add AArch64 'prefetch'-pattern.

2014-02-28 Thread Gopalasubramanian, Ganesh

With the locality value received in the instruction pattern, I think it would 
be safe to handle them in prefetch instruction.
This helps especially AArch64 has prefetch instructions that can handle this 
locality.

+(define_insn "prefetch"
+  [(prefetch (match_operand:DI 0 "address_operand" "r")
+(match_operand:QI 1 "const_int_operand" "n")
+(match_operand:QI 2 "const_int_operand" "n"))]
+  ""
+  "*
+{
+  int locality = INTVAL (operands[2]);
+
+  gcc_assert (IN_RANGE (locality, 0, 3));
+
+  if (locality == 0)
+ /* non temporal locality */
+ return (INTVAL(operands[1])) ? \"prfm\\tPSTL1STRM, [%0, #0]\" : 
\"prfm\\tPLDL1STRM, [%0, #0]\";
+
+  /* temporal locality */
+  return (INTVAL(operands[1])) ? \"prfm\\tPSTL%2KEEP, [%0, #0]\" : 
\"prfm\\tPLDL%2KEEP, [%0, #0]\";
+}"
+  [(set_attr "type" "prefetch")]
+)
+

I also have attached a patch that implements
*   Prefetch with immediate offset in the range 0 to 32760 (multiple of 8). 
Added a predicate for this.
*   Prefetch with immediate offset - in the range -256 to 255 (Gets 
generated only when we have a negative offset. Generates prfum instruction). 
Added a predicate for this.
*   Prefetch with register offset. (modified for printing the locality)

Regards
Ganesh

-Original Message-
From: Philipp Tomsich [mailto:philipp.toms...@theobroma-systems.com] 
Sent: Wednesday, February 19, 2014 2:40 AM
To: gcc-patches@gcc.gnu.org
Cc: philipp.toms...@theobroma-systems.com
Subject: [AArch64 05/14] Add AArch64 'prefetch'-pattern.

---
 gcc/config/aarch64/aarch64.md | 17 +
 gcc/config/arm/types.md   |  2 ++
 2 files changed, 19 insertions(+)

diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md 
index 99a6ac8..b972a1b 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -293,6 +293,23 @@
   [(set_attr "type" "no_insn")]
 )
 
+(define_insn "prefetch"
+  [(prefetch (match_operand:DI 0 "register_operand" "r")
+(match_operand:QI 1 "const_int_operand" "n")
+(match_operand:QI 2 "const_int_operand" "n"))]
+  ""
+  "*
+{
+  if (INTVAL(operands[2]) == 0)
+ /* no temporal locality */
+ return (INTVAL(operands[1])) ? \"prfm\\tPSTL1STRM, [%0, #0]\" : 
+\"prfm\\tPLDL1STRM, [%0, #0]\";
+
+  /* temporal locality */
+  return (INTVAL(operands[1])) ? \"prfm\\tPSTL1KEEP, [%0, #0]\" : 
+\"prfm\\tPLDL1KEEP, [%0, #0]\"; }"
+  [(set_attr "type" "prefetch")]
+)
+
 (define_insn "trap"
   [(trap_if (const_int 1) (const_int 8))]
   ""
diff --git a/gcc/config/arm/types.md b/gcc/config/arm/types.md index 
cc39cd1..1d1280d 100644
--- a/gcc/config/arm/types.md
+++ b/gcc/config/arm/types.md
@@ -117,6 +117,7 @@
 ; mvn_shift_reg  inverting move instruction, shifted operand by a register.
 ; no_insnan insn which does not represent an instruction in the
 ;final output, thus having no impact on scheduling.
+; prefetch  a prefetch instruction
 ; rbit   reverse bits.
 ; revreverse bytes.
 ; sdiv   signed division.
@@ -553,6 +554,7 @@
   call,\
   clz,\
   no_insn,\
+  prefetch,\
   csel,\
   crc,\
   extend,\
--
1.9.0



prefetchdiff.log
Description: prefetchdiff.log

FW: Non-temporal move

2014-02-24 Thread Gopalasubramanian, Ganesh

I could see "storent" pattern in x86 machine descriptions (in sse.md)., but 
internals doc don't mention it. Should we add a description about this in the 
internals doc?

Regards
Ganesh

RE: [Patch, i386] PR 59422 - Support more targets for function multi versioning

2013-12-26 Thread Gopalasubramanian, Ganesh

> I'm sorry I didn't notice previous conversation. Please install ASAP.

Thanks Uros! Committed to revision 206210.
- Ganesh

RE: [Patch, i386] PR 59422 - Support more targets for function multi versioning

2013-12-25 Thread Gopalasubramanian, Ganesh

Hi,

>> (get_amd_cpu): Handle AMD_BOBCAT, AMD_JAGUAR, AMDFAM15H_BDVER2 and
>> AMDFAM15H_BDVER3.

As mentioned earlier, we would like to stick with BTVER1 and BTVER2 instead of 
using BOBCAT or JAGUAR.
Attached patch does the changes.

Regards
Ganesh


NameChange.patch
Description: NameChange.patch

RE: [Patch, i386] PR 59422 - Support more targets for function multi versioning

2013-12-19 Thread Gopalasubramanian, Ganesh

> Sorry, I must have been looking at an older version, but as I said I already 
> did enable it in the latest patch. (see 
> http://gcc.gnu.org/ml/gcc-patches/2013-12/msg01577.html )

Sorry for causing another revision but we would like to stick with "btver1" and 
"btver2" rather than "BOBCAT" or "JAGUAR".
Therefore the changes would be like

Index: gcc/config/i386/i386.c
===
--- gcc/config/i386/i386.c  (revision 206065)
+++ gcc/config/i386/i386.c  (working copy)
@@ -29965,9 +29965,14 @@
 P_PROC_SSE4_2,
 P_POPCNT,
 P_AVX,
+P_PROC_AVX,
+P_FMA4,
+P_XOP,
+P_PROC_XOP,
+P_FMA,
+P_PROC_FMA,
 P_AVX2,
-P_FMA,
-P_PROC_FMA
+P_PROC_AVX2
   };
 
  enum feature_priority priority = P_ZERO;
@@ -29986,11 +29991,15 @@
   {"sse", P_SSE},
   {"sse2", P_SSE2},
   {"sse3", P_SSE3},
+  {"sse4a", P_SSE4_a},
   {"ssse3", P_SSSE3},
   {"sse4.1", P_SSE4_1},
   {"sse4.2", P_SSE4_2},
   {"popcnt", P_POPCNT},
   {"avx", P_AVX},
+  {"fma4", P_FMA4},
+  {"xop", P_XOP},
+  {"fma", P_FMA},
   {"avx2", P_AVX2}
 };
 
@@ -30044,25 +30053,49 @@
  break;
 case PROCESSOR_COREI7_AVX:
   arg_str = "corei7-avx";
-  priority = P_PROC_SSE4_2;
+  priority = P_PROC_AVX;
   break;
+case PROCESSOR_HASWELL:
+  arg_str = "core-avx2";
+  priority = P_PROC_AVX2;
+  break;
case PROCESSOR_ATOM:
  arg_str = "atom";
  priority = P_PROC_SSSE3;
  break;
+case PROCESSOR_SLM:
+  arg_str = "slm";
+  priority = P_PROC_SSE4_2;
+  break;
case PROCESSOR_AMDFAM10:
  arg_str = "amdfam10h";
  priority = P_PROC_SSE4_a;
  break;
+case PROCESSOR_BTVER1:
+  arg_str = "btver1";
+  priority = P_PROC_SSE4_a;
+  break;
+case PROCESSOR_BTVER2:
+  arg_str = "btver2";
+  priority = P_PROC_AVX;
+  break;
case PROCESSOR_BDVER1:
  arg_str = "bdver1";
- priority = P_PROC_FMA;
+ priority = P_PROC_XOP;
  break;
case PROCESSOR_BDVER2:
  arg_str = "bdver2";
  priority = P_PROC_FMA;
  break;
-   }  
+case PROCESSOR_BDVER3:
+  arg_str = "bdver3";
+  priority = P_PROC_FMA;
+  break;
+case PROCESSOR_BDVER4:
+  arg_str = "bdver4";
+  priority = P_PROC_AVX2;
+  break;
+}  
}
 
   cl_target_option_restore (&global_options, &cur_target);
@@ -30922,9 +30955,13 @@
 F_SSE2,
 F_SSE3,
 F_SSSE3,
+F_SSE4_a,
 F_SSE4_1,
 F_SSE4_2,
 F_AVX,
+F_FMA4,
+F_XOP,
+F_FMA,
 F_AVX2,
 F_MAX
   };
@@ -30943,6 +30980,10 @@
 M_AMDFAM10H,
 M_AMDFAM15H,
 M_INTEL_SLM,
+M_INTEL_COREI7_AVX,
+M_INTEL_CORE_AVX2,
+M_AMD_BTVER1,
+M_AMD_BTVER2,
 M_CPU_SUBTYPE_START,
 M_INTEL_COREI7_NEHALEM,
 M_INTEL_COREI7_WESTMERE,
@@ -30953,7 +30994,9 @@
 M_AMDFAM15H_BDVER1,
 M_AMDFAM15H_BDVER2,
 M_AMDFAM15H_BDVER3,
-M_AMDFAM15H_BDVER4
+M_AMDFAM15H_BDVER4,
+M_INTEL_COREI7_IVYBRIDGE,
+M_INTEL_CORE_HASWELL
   };
 
   static struct _arch_names_table
@@ -30971,11 +31014,17 @@
   {"corei7", M_INTEL_COREI7},
   {"nehalem", M_INTEL_COREI7_NEHALEM},
   {"westmere", M_INTEL_COREI7_WESTMERE},
+  {"corei7-avx", M_INTEL_COREI7_AVX},
   {"sandybridge", M_INTEL_COREI7_SANDYBRIDGE},
+  {"ivybridge", M_INTEL_COREI7_IVYBRIDGE},
+  {"core-avx2", M_INTEL_CORE_AVX2},
+  {"haswell", M_INTEL_CORE_HASWELL},
   {"amdfam10h", M_AMDFAM10H},
   {"barcelona", M_AMDFAM10H_BARCELONA},
   {"shanghai", M_AMDFAM10H_SHANGHAI},
   {"istanbul", M_AMDFAM10H_ISTANBUL},
+  {"btver1", M_AMD_BTVER1},
+  {"btver2", M_AMD_BTVER2},
   {"amdfam15h", M_AMDFAM15H},
   {"bdver1", M_AMDFAM15H_BDVER1},
   {"bdver2", M_AMDFAM15H_BDVER2},
@@ -30997,9 +31046,13 @@
   {"sse2",   F_SSE2},
   {"sse3",   F_SSE3},
   {"ssse3",  F_SSSE3},
+  {"sse4a",  F_SSE4_a},
   {"sse4.1", F_SSE4_1},
   {"sse4.2", F_SSE4_2},
   {"avx",F_AVX},
+  {"fma4",   F_FMA4},
+  {"xop",F_XOP},
+  {"fma",F_FMA},
   {"avx2",   F_AVX2}
 };

Index: libgcc/config/i386/cpuinfo.c
===
--- libgcc/config/i386/cpuinfo.c(revision 206065)
+++ libgcc/config/i386/cpuinfo.c(working copy)
@@ -62,6 +62,10 @@
   AMDFAM10H,
   AMDFAM15H,
   INTEL_SLM,
+  INTEL_COREI7_AVX,
+  INTEL_CORE_AVX2,
+  AMD_BTVER1,
+  AMD_BTVER2,
   CPU_TYPE_MAX
 };
 
@@ -75,6 +79,10

RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4

2013-12-19 Thread Gopalasubramanian, Ganesh

> Please provide updated ChangeLog.

--- gcc/ChangeLog   (revision 206106)
+++ gcc/ChangeLog   (working copy)
@@ -1,3 +1,14 @@
+2013-12-19  Ganesh Gopalasubramanian  
+
+   * config/i386/i386.c: Include cfgloop.h. 
+   (ix86_loop_memcount): New function.
+   (ix86_loop_unroll_adjust): New function.
+   (TARGET_LOOP_UNROLL_ADJUST): Define.
+   * config/i386/i386.h
+   (TARGET_ADJUST_UNROLL): Define.
+   * config/i386/x86-tune.def
+   (X86_TUNE_ADJUST_UNROLL): Define.
+

> The function comment is missing. Maybe you should also describe magic number 
> 32 here?
Added the function comment.

> Otherwise, the patch looks OK.
Thanks.  Bootstrapping passes. Is it OK for upstream?

> BTW: Please avoid top-posting, see e.g. [1] for reasons...
Sorry for the lapse. Will comply.

Regards
Ganesh



unroll-adjust.patch
Description: unroll-adjust.patch

RE: [Patch, i386] PR 59422 - Support more targets for function multi versioning

2013-12-18 Thread Gopalasubramanian, Ganesh

> Yes, I changed that in the last patch, though I consider it momentarily 
> problematic because you do not yet enable AVX with march=btver2 (AVX versions 
> would currently be better than btver2 versions for a btver2 arch), but expect
march=btver2 will be fixed soon.

The " processor_alias_table" entry for "btver2" in i386.c enables AVX.


  {"btver2", PROCESSOR_BTVER2, CPU_BTVER2,
PTA_64BIT | PTA_MMX |  PTA_SSE  | PTA_SSE2 | PTA_SSE3
| PTA_SSSE3 | PTA_SSE4A |PTA_ABM | PTA_CX16 | PTA_SSE4_1
| PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX
| PTA_BMI | PTA_F16C | PTA_MOVBE | PTA_PRFCHW
| PTA_FXSR | PTA_XSAVE | PTA_XSAVEOPT},


The assembly listing for a simple test (compiled with -march=btver2) also has 
-mavx enabled. So, can you please enable AVX for btver2?

Regards
Ganesh

RE: [Patch, i386] PR 59422 - Support more targets for function multi versioning

2013-12-18 Thread Gopalasubramanian, Ganesh

Ping!

"Gopalasubramanian, Ganesh"  wrote:

> Yes, I figured that was the original idea behind it, but the final family of 
> the jaguar processors seems to have become 16h instead of 14h (bobcat) at 
> some point.

Yes. It is amdfam16h. I was supposed to pass on some comments on the patch.
1. Amdfam16h for Jaguar.
2. For Jaguar, the priority needs to be AVX (AVX got included into the Jaguar 
ISA).

I have a doubt! What would be done if priority is set to "F_FMA4" instead of 
"F_XOP" for bdver1?

Regards
Ganesh

RE: [Patch, i386] PR 59422 - Support more targets for function multi versioning

2013-12-16 Thread Gopalasubramanian, Ganesh

> Yes, I figured that was the original idea behind it, but the final family of 
> the jaguar processors seems to have become 16h instead of 14h (bobcat) at 
> some point.

Yes. It is amdfam16h. I was supposed to pass on some comments on the patch.
1. Amdfam16h for Jaguar.
2. For Jaguar, the priority needs to be AVX (AVX got included into the Jaguar 
ISA).

I have a doubt! What would be done if priority is set to "F_FMA4" instead of 
"F_XOP" for bdver1?

Regards
Ganesh

RE: [Patch, i386] PR 59422 - Support more targets for function multi versioning

2013-12-16 Thread Gopalasubramanian, Ganesh

> Btw, I couldn't find anything that corresponds to gcc's btver2 arch. Is that 
> an old term for what has become the Jaguar architecture?

Yes, "btver2" = "jaguar". We have the name as per its family name (i.e, bobcat 
family) in GCC. 
Similarly we have the names "bdver2" = "piledriver", "bdver3" = "steamroller" 
as per their family (bulldozer) name.

Regards
Ganesh

-Original Message-
From: Allan Sandfeld Jensen [mailto:carew...@gmail.com] 
Sent: Monday, December 16, 2013 12:25 AM
To: Uros Bizjak
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [Patch, i386] PR 59422 - Support more targets for function multi 
versioning

Hi again
On Wednesday 11 December 2013, Uros Bizjak wrote:
> Hello!
> 
> > PR gcc/59422
> > 
> > This patch extends the supported targets for function multi versiong 
> > to also include Haswell, Silvermont, and the most recent AMD models. 
> > It also prioritizes AVX2 versions over AMD specific pre-AVX2 versions.
> 
> Please add a ChangeLog entry and attach the complete patch. Please 
> also state how you tested the patch, as outlined in the instructions 
> [1].
> 
> [1] http://gcc.gnu.org/contribute.html
> 
Updated patch for better CPU model detection and added ChangeLog.

The patch has been tested with the attached test.cpp. Verified that it doesn't 
build before the patch, and that it builds after, and verified it selects 
correct versions at runtime based on either CPU model or supported ISA (tested 
on 3 machines: SandyBridge, IvyBridge and Phenom II).

Btw, I couldn't find anything that corresponds to gcc's btver2 arch. Is that an 
old term for what has become the Jaguar architecture?

`Allan

RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4

2013-12-11 Thread Gopalasubramanian, Ganesh

Hi Uros!

Accommodated the changes that you mentioned.
Completed the bootstrap testing too.

Regards
Ganesh

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Wednesday, December 04, 2013 3:17 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org; Richard Guenther  
(richard.guent...@gmail.com)
Subject: Re: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4

On Wed, Dec 4, 2013 at 9:39 AM, Gopalasubramanian, Ganesh 
 wrote:

> Attached is the revised patch.
> The target independent part has been already approved and added.
>
> This revision of the patch adds a x86 tune definition and checks it while 
> deciding the unroll factor.
>
> Accommodated the comments given by you except one.
>
>> *x will never be null for active insns.
> Since every rtx in the insn is checked for memory references, the NULL_RTX 
> check is required.

Yes you are correct. for_each_rtx also passes NULL_RTX, I was distracted by 
"There are no sub-expressions." comment.

+if (NONDEBUG_INSN_P (insn) && INSN_CODE (insn) != -1)

Do you need to check for INSN_CODE here? IIRC, checking for NONDEBUG_INSN_P is 
enough.

+for_each_rtx (&insn, (rtx_function) ix86_loop_memcount,
&mem_count);
+}
+  free (bbs);
+
+  if (mem_count <=32)
+return 32/mem_count;

Ouch... mem_count can be zero. Is there a reason to change this part from 
previous patch?

Uros.

unroll-adjust.patch
Description: unroll-adjust.patch

[patch][wwwdocs] gcc 4.9 changes - AMD new cores

2013-12-05 Thread Gopalasubramanian, Ganesh

Hello, 

This patch adds details about new AMD cores that got enabled in GCC-4.9.
OK for the wwwdocs? 

Regards
Ganesh

cvs diff: Diffing .
Index: changes.html
===
RCS file: /cvs/gcc/wwwdocs/htdocs/gcc-4.9/changes.html,v
retrieving revision 1.44
diff -r1.44 changes.html
404a405,407
> Support for new AMD family 15h processors (Excavator core)
> is now available through the -march=bdver4 and
> -mtune=bdver4 options.

RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4

2013-12-04 Thread Gopalasubramanian, Ganesh

> Ouch... mem_count can be zero. Is there a reason to change this part from 
> previous patch?

Oops!  You're right. I will correct this.

The idea is to count the memory references and decide on the unrolling factor.
Previous patch does that in two steps I thought of doing that in a single step. 
(I think I missed my step here ;) )

Regards
Ganesh

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Wednesday, December 04, 2013 3:17 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org; Richard Guenther  
(richard.guent...@gmail.com)
Subject: Re: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4

On Wed, Dec 4, 2013 at 9:39 AM, Gopalasubramanian, Ganesh 
 wrote:

> Attached is the revised patch.
> The target independent part has been already approved and added.
>
> This revision of the patch adds a x86 tune definition and checks it while 
> deciding the unroll factor.
>
> Accommodated the comments given by you except one.
>
>> *x will never be null for active insns.
> Since every rtx in the insn is checked for memory references, the NULL_RTX 
> check is required.

Yes you are correct. for_each_rtx also passes NULL_RTX, I was distracted by 
"There are no sub-expressions." comment.

+if (NONDEBUG_INSN_P (insn) && INSN_CODE (insn) != -1)

Do you need to check for INSN_CODE here? IIRC, checking for NONDEBUG_INSN_P is 
enough.

+for_each_rtx (&insn, (rtx_function) ix86_loop_memcount,
&mem_count);
+}
+  free (bbs);
+
+  if (mem_count <=32)
+return 32/mem_count;

Ouch... mem_count can be zero. Is there a reason to change this part from 
previous patch?

Uros.

RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4

2013-12-04 Thread Gopalasubramanian, Ganesh

Hi Uros!

Attached is the revised patch.
The target independent part has been already approved and added.

This revision of the patch adds a x86 tune definition and checks it while 
deciding the unroll factor.

Accommodated the comments given by you except one.

> *x will never be null for active insns.
Since every rtx in the insn is checked for memory references, the NULL_RTX 
check is required.

Regards
Ganesh

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Friday, November 22, 2013 1:46 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org; Richard Guenther  
(richard.guent...@gmail.com); borntrae...@de.ibm.com; H.J. Lu 
(hjl.to...@gmail.com); Jakub Jelinek (ja...@redhat.com)
Subject: Re: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4

On Wed, Nov 20, 2013 at 7:26 PM, Gopalasubramanian, Ganesh 
 wrote:

> Steamroller processors contain a loop predictor and a loop buffer, which may 
> make unrolling small loops less important.
> When unrolling small loops for steamroller, making the unrolled loop fit in 
> the loop buffer should be a priority.
>
> This patch uses a heuristic approach (number of memory references) to decide 
> the unrolling factor for small loops.
> This patch has some noise in SPEC 2006 results.
>
> Bootstrapping passes.
>
> I would like to know your comments before committing.

Please split the patch to target-dependant and target-independant part, and get 
target-idependant part reviewed first.

This part:

+  if (ix86_tune != PROCESSOR_BDVER3 && ix86_tune != PROCESSOR_BDVER4)  
+ {
+return nunroll;
+  }

is wrong. You should introduce tune variable (as H.J. suggested) and check that 
variable here. Target dependant tuning options should be in x86-tune.def, so 
everything regarding tuning can be found in one place.

+if (INSN_P (insn) && INSN_CODE (insn) != -1)
+for_each_rtx (&insn, (rtx_function) ix86_loop_memcount,
&mem_count);

if (NONDEBUG_INSN_P (insn))
  for_each_rtx (&PATTERN(insn), ...);

otherwise your heuristics will depend on -g compile option.

+  if ( (mem_count*nunroll) <= 32)

Extra parenthesis.

+static int
+ix86_loop_memcount (rtx *x, unsigned *mem_count) {
+  if (*x != NULL_RTX && MEM_P (*x))

*x will never be null for active insns.

Uros.

unroll-adjust.patch
Description: unroll-adjust.patch

RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4

2013-11-28 Thread Gopalasubramanian, Ganesh

This patch adds influence of macro TARGET_LOOP_UNROLL_ADJUST during constant 
iterations (decide_unroll_constant_iterations).
The macro has been already checked for runtime iterations 
(decide_unroll_runtime_iterations),  and for unroll stupid 
(decide_unroll_stupid).

Bootstrapping and test passes.

Would like to know your comments before committing.

Regards
Ganesh

2013-11-28  Ganesh Gopalasubramanian  

* loop-unroll.c (decide_unroll_constant_iterations): Check macro 
TARGET_LOOP_UNROLL_ADJUST while deciding unroll factor.

diff --git a/gcc/loop-unroll.c b/gcc/loop-unroll.c
index 9c87167..557915f 100644
--- a/gcc/loop-unroll.c
+++ b/gcc/loop-unroll.c
@@ -664,6 +664,9 @@ decide_unroll_constant_iterations (struct loop *loop, int 
flags)
   if (nunroll > (unsigned) PARAM_VALUE (PARAM_MAX_UNROLL_TIMES))
 nunroll = PARAM_VALUE (PARAM_MAX_UNROLL_TIMES);

+  if (targetm.loop_unroll_adjust)
+nunroll = targetm.loop_unroll_adjust (nunroll, loop);
+
   /* Skip big loops.  */
   if (nunroll <= 1)
 {

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Friday, November 22, 2013 1:46 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org; Richard Guenther  
(richard.guent...@gmail.com); borntrae...@de.ibm.com; H.J. Lu 
(hjl.to...@gmail.com); Jakub Jelinek (ja...@redhat.com)
Subject: Re: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4

On Wed, Nov 20, 2013 at 7:26 PM, Gopalasubramanian, Ganesh 
 wrote:

> Steamroller processors contain a loop predictor and a loop buffer, which may 
> make unrolling small loops less important.
> When unrolling small loops for steamroller, making the unrolled loop fit in 
> the loop buffer should be a priority.
>
> This patch uses a heuristic approach (number of memory references) to decide 
> the unrolling factor for small loops.
> This patch has some noise in SPEC 2006 results.
>
> Bootstrapping passes.
>
> I would like to know your comments before committing.

Please split the patch to target-dependant and target-independant part, and get 
target-idependant part reviewed first.

RE: [PATCH, i386]: Fix PR56788, _mm_frcz_sd and _mm_frcz_ss ignore their second argument

2013-11-26 Thread Gopalasubramanian, Ganesh

> Hopefully someone from AMD will provide tests that are mysteriously missing 
> from XOP testsuite.

As pointed out by Marc, I added myself to the bug later. 
I was bit confused about the "internal insn representation" with "user-visible 
function".
So, couldn't add test then and there. I could have solved that earlier. Sorry 
for that.

Attached is the test that checks the (controversial) "frcz" functions.

Uros could you please add this to your patch while committing.

Regards
Ganesh

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Saturday, November 23, 2013 6:49 PM
To: gcc-patches@gcc.gnu.org
Cc: Cong Hou; Marc Glisse; Gopalasubramanian, Ganesh
Subject: [PATCH, i386]: Fix PR56788, _mm_frcz_sd and _mm_frcz_ss ignore their 
second argument

Hello!

Attached patch fixes PR56788, where _mm_frcz_{ss,sd} intrinsics ignored their 
second argument.

As explained in the PR [1], gcc implements two-operand "vector-merge"
form as documented in Microsoft's definition [2]. However, in contrast to other 
SSE scalar insns, the instruction itself clears upper bits to zero.

There were a couple of problems: the builtin was declared as builtin with two 
input operands, but the number of input operands didn't correspond to referred 
insn pattern, leaving its second operand uninitialized. The intrinsic was also 
implemented without necessary movss/movsd fixup that would merge both its 
operands in a correct way.

Please also note that the definition in clang is wrong.

I didn't include any testcase in the patch, since I don't have access to XOP 
target. Hopefully someone from AMD will provide tests that are mysteriously 
missing from XOP testsuite.

2013-11-23  Uros Bizjak  

PR target/56788
* config/i386/i386.c (bdesc_multi_arg) :
Declare as MULTI_ARG_1_SF instruction.
: Decleare as MULTI_ARG_1_DF instruction.
* config/i386/sse.md (*xop_vmfrcz2): Rename
from *xop_vmfrcz_.
* config/i386/xopintrin.h (_mm_frcz_ss): Use __builtin_ia32_movss
to merge scalar result with __A.
(_mm_frcz_sd): Use __builtin_ia32_movsd to merge scalar
result with __A.

Patch was bootstrapped and regression tested on x86_64-pc-linux-gnu {,-m32}.

The patch was committed to mainline SVN and will be committed to other release 
branches in a couple of days (hopefully with additional tests).

[1] http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56788
[2] http://msdn.microsoft.com/en-us/library/vstudio/gg445126%28v=vs.100%29.aspx

Uros.
#include 
#include "m128-check.h"

void
check_mm_vmfrcz_sd (__m128d __A, __m128d __B)
{
  union128d a, b, c;
  double d[2];

  a.x = __A;
  b.x = __B;
  c.x = _mm_frcz_sd (__A, __B);
  d[0] = b.a[0] - (int)b.a[0] ;
  d[1] = a.a[1];
  if (check_union128d (c, d))
abort ();
}

void
check_mm_vmfrcz_ss (__m128 __A, __m128 __B)
{
  union128 a, b, c;
  float f[4];

  a.x = __A;
  b.x = __B;
  c.x = _mm_frcz_ss (__A, __B);
  f[0] = b.a[0] - (int)b.a[0] ;
  f[1] = a.a[1];
  f[2] = a.a[2];
  f[3] = a.a[3];
  if (check_union128 (c, f))
abort ();
}

void
main (void)
{
  union128 a, b;
  union128d c,d;
  int i;

  for (i = 0; i < 4; i++)
{
   a.a[i] = i + 3.5;
   b.a[i] = i + 7.9;
}
  for (i = 0; i < 2; i++)
{
   c.a[i] = i + 3.5;
   d.a[i] = i + 7.987654321;
}
  check_mm_vmfrcz_ss (a.x, b.x);
  check_mm_vmfrcz_sd (c.x, d.x);
}

RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4

2013-11-21 Thread Gopalasubramanian, Ganesh

Ping!

-Original Message-
From: Gopalasubramanian, Ganesh 
Sent: Thursday, November 21, 2013 10:35 AM
To: 'H.J. Lu'
Cc: gcc-patches@gcc.gnu.org; Uros Bizjak (ubiz...@gmail.com); Richard Guenther 
 (richard.guent...@gmail.com); 
borntrae...@de.ibm.com; Jakub Jelinek (ja...@redhat.com)
Subject: RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4

> I suggest you add this to x86-tune.def and enable it for
> bdver3 and bdver4.

The macro TARGET_LOOP_UNROLL_ADJUST is not new. It is already available and is 
used by target s390.
Since it is not an "x86 only" feature I didn't add that in x86-tune.def.

Regards
Ganesh

-Original Message-
From: H.J. Lu [mailto:hjl.to...@gmail.com] 
Sent: Thursday, November 21, 2013 12:02 AM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org; Uros Bizjak (ubiz...@gmail.com); Richard Guenther 
 (richard.guent...@gmail.com); 
borntrae...@de.ibm.com; Jakub Jelinek (ja...@redhat.com)
Subject: Re: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4

On Wed, Nov 20, 2013 at 10:26 AM, Gopalasubramanian, Ganesh 
 wrote:
> Hi,
>
> Steamroller processors contain a loop predictor and a loop buffer, which may 
> make unrolling small loops less important.
> When unrolling small loops for steamroller, making the unrolled loop fit in 
> the loop buffer should be a priority.
>
> This patch uses a heuristic approach (number of memory references) to decide 
> the unrolling factor for small loops.
> This patch has some noise in SPEC 2006 results.
>
> Bootstrapping passes.
>
> I would like to know your comments before committing.
>

I suggest you add this to x86-tune.def and enable it for
bdver3 and bdver4.



--
H.J.

RE: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4

2013-11-20 Thread Gopalasubramanian, Ganesh

> I suggest you add this to x86-tune.def and enable it for
> bdver3 and bdver4.

The macro TARGET_LOOP_UNROLL_ADJUST is not new. It is already available and is 
used by target s390.
Since it is not an "x86 only" feature I didn't add that in x86-tune.def.

Regards
Ganesh

-Original Message-
From: H.J. Lu [mailto:hjl.to...@gmail.com] 
Sent: Thursday, November 21, 2013 12:02 AM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org; Uros Bizjak (ubiz...@gmail.com); Richard Guenther 
 (richard.guent...@gmail.com); 
borntrae...@de.ibm.com; Jakub Jelinek (ja...@redhat.com)
Subject: Re: [RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4

On Wed, Nov 20, 2013 at 10:26 AM, Gopalasubramanian, Ganesh 
 wrote:
> Hi,
>
> Steamroller processors contain a loop predictor and a loop buffer, which may 
> make unrolling small loops less important.
> When unrolling small loops for steamroller, making the unrolled loop fit in 
> the loop buffer should be a priority.
>
> This patch uses a heuristic approach (number of memory references) to decide 
> the unrolling factor for small loops.
> This patch has some noise in SPEC 2006 results.
>
> Bootstrapping passes.
>
> I would like to know your comments before committing.
>

I suggest you add this to x86-tune.def and enable it for
bdver3 and bdver4.



--
H.J.

[RFC] [PATCH, i386] Adjust unroll factor for bdver3 and bdver4

2013-11-20 Thread Gopalasubramanian, Ganesh

Hi,

Steamroller processors contain a loop predictor and a loop buffer, which may 
make unrolling small loops less important.
When unrolling small loops for steamroller, making the unrolled loop fit in the 
loop buffer should be a priority.

This patch uses a heuristic approach (number of memory references) to decide 
the unrolling factor for small loops.
This patch has some noise in SPEC 2006 results. 

Bootstrapping passes.

I would like to know your comments before committing.

Regards
Ganesh


loop_unroll_bdver3.patch
Description: loop_unroll_bdver3.patch

RE: Honnor ix86_accumulate_outgoing_args again

2013-11-12 Thread Gopalasubramanian, Ganesh

> we are going to have some AMD CPU with AVX2 support soon, the question is
> if it will prefer 256-bit vmovups/vmovupd/vmovdqu or split, but even
> if it will prefer split, the question is if like bdver{1,2,3} it will
> be X86_TUNE_AVX128_OPTIMAL, because if yes, then how 256-bit unaligned
> loads/stores are handled is much less important there.  Ganesh?

256-bit is friendly on bdver4. 
But, 256 bit unaligned stores are micro-coded which we would like to avoid. So 
we require 128-bit MOVUPS.

-Original Message-
From: Jakub Jelinek [mailto:ja...@redhat.com] 
Sent: Tuesday, November 12, 2013 3:57 PM
To: Jan Hubicka
Cc: H.J. Lu; Vladimir Makarov; GCC Patches; Uros Bizjak; Richard Henderson; 
Gopalasubramanian, Ganesh
Subject: Re: Honnor ix86_accumulate_outgoing_args again

On Tue, Nov 12, 2013 at 11:05:45AM +0100, Jan Hubicka wrote:
> > @@ -16576,7 +16576,7 @@ ix86_avx256_split_vector_move_misalign (rtx 
> > op0, rtx op1)
> > 
> >if (MEM_P (op1))
> >  {
> > -  if (TARGET_AVX256_SPLIT_UNALIGNED_LOAD)
> > +  if (!TARGET_AVX2 && TARGET_AVX256_SPLIT_UNALIGNED_LOAD)
> >  {
> >rtx r = gen_reg_rtx (mode);
> >m = adjust_address (op1, mode, 0); @@ -16596,7 +16596,7 @@ 
> > ix86_avx256_split_vector_move_misalign (rtx op0, rtx op1)
> >  }
> >else if (MEM_P (op0))
> >  {
> > -  if (TARGET_AVX256_SPLIT_UNALIGNED_STORE)
> > +  if (!TARGET_AVX2 && TARGET_AVX256_SPLIT_UNALIGNED_STORE)
> 
> I would add explanation comment on those two.

Looking at http://gcc.gnu.org/ml/gcc-patches/2013-11/msg01235.html
we are going to have some AMD CPU with AVX2 support soon, the question is
if it will prefer 256-bit vmovups/vmovupd/vmovdqu or split, but even
if it will prefer split, the question is if like bdver{1,2,3} it will
be X86_TUNE_AVX128_OPTIMAL, because if yes, then how 256-bit unaligned
loads/stores are handled is much less important there.  Ganesh?

> Shall we also disable argument accumulation for cores? It seems we won't
> solve the IRA issues, right?

You mean LRA issues here, right?  If you are starting to use
no-accumulate-outgoing-args much more often than in the past, I think
the problem that LRA forces a frame pointer in that case is much more
important now (or has that been fixed in the mean time?).  Vlad?

Jakub

RE: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips

2013-11-05 Thread Gopalasubramanian, Ganesh

Thanks Honza!

I have committed changes ( for default ).
http://gcc.gnu.org/viewcvs/gcc?view=revision&revision=204442

I will add lookahead value 8 for O3 after experimenting with it.

Regards
Ganesh

-Original Message-
From: Jan Hubicka [mailto:hubi...@ucw.cz] 
Sent: Wednesday, October 30, 2013 1:54 AM
To: Richard Biener
Cc: Jan Hubicka; Gopalasubramanian, Ganesh; gcc-patches@gcc.gnu.org; Uros 
Bizjak (ubiz...@gmail.com); H.J. Lu (hjl.to...@gmail.com)
Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 
chips

> On Fri, 25 Oct 2013, Jan Hubicka wrote:
> 
> > > > OK, so it is about 2%.  Did you try if you need lookahead even in the 
> > > > early pass (before reload)?  My guess would be so, but if not, it could 
> > > > cut the cost to half.  For -Ofast/-O3 it looks resonable to me, but we 
> > > > will  need to announce it on the ML.  For other settings I think we 
> > > > need to work on more improvements or cut the expenses.
> > > 
> > > Yes, it is required before reload.  
> > > 
> > > I have another idea which can be pondered upon. Currently, can we enable 
> > > lookahead with the value 4 (pre reload) for default? This will 
> > > exponentially cut the cost of build time. 
> > > I have done some measurements on the build time of some benchmarks 
> > > (mentioned below) with lookahead value 4. The 2% increase in build time 
> > > with value 8 is now almost gone.
> > > 
> > >dfa4   no_lookahead
> > >  
> > >  perlbench   - 191s  193s
> > >  bzip2   - 19s   19s
> > >  gcc - 429s  429s
> > >  mcf - 3s3s
> > >  gobmk   - 116s  115s
> > >  hmmer   - 60s   60s
> > >  sjeng   - 18s   17s
> > >  libquantum  - 6s6s
> > >  h264ref - 107s  107s
> > >  omnetpp - 128s  128s
> > >  astar   - 7s7s
> > >  bwaves  - 5s5s
> > >  gamess  - 1964s 1957s
> > >  milc- 18s   18s
> > >  GemsFDTD- 273s  272s
> > > 
> > > Lookahead value 4 also helps because, the modified decoder model in 
> > > bdver3.md is only two cycles deep (though in hardware it is actually 4 
> > > cycles deep). This means that we can look another two levels deep for 
> > > better schedule.
> > > GemsFDTD still retains the performance boost of around 6-7% with value 4.
> > > 
> > > Let me know your thoughts.
> > 
> > This seems resonable.  I would go for lookahead of 4 for now and 8 
> > for -Ofast and we can tune things based on the experience with this setting 
> > incrementally.
> > Uros, Richard, what do you think?
> 
> Well, certainly -O3 not -Ofast.

Yes, enabling 4 by default and 8 at -O3 seems fine to me.

Honza
> 
> Richard.

RE: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips

2013-10-25 Thread Gopalasubramanian, Ganesh

> OK, so it is about 2%.  Did you try if you need lookahead even in the early 
> pass (before reload)?  My guess would be so, but if not, it could cut the 
> cost to half.  For -Ofast/-O3 it looks resonable to me, but we will  need to 
> announce it on the ML.  For other settings I think we need to work on more 
> improvements or cut the expenses.

Yes, it is required before reload.  

I have another idea which can be pondered upon. Currently, can we enable 
lookahead with the value 4 (pre reload) for default? This will exponentially 
cut the cost of build time. 
I have done some measurements on the build time of some benchmarks (mentioned 
below) with lookahead value 4. The 2% increase in build time with value 8 is 
now almost gone.

   dfa4   no_lookahead
 
 perlbench   - 191s  193s
 bzip2   - 19s   19s
 gcc - 429s  429s
 mcf - 3s3s
 gobmk   - 116s  115s
 hmmer   - 60s   60s
 sjeng   - 18s   17s
 libquantum  - 6s6s
 h264ref - 107s  107s
 omnetpp - 128s  128s
 astar   - 7s7s
 bwaves  - 5s5s
 gamess  - 1964s 1957s
 milc- 18s   18s
 GemsFDTD- 273s  272s

Lookahead value 4 also helps because, the modified decoder model in bdver3.md 
is only two cycles deep (though in hardware it is actually 4 cycles deep). This 
means that we can look another two levels deep for better schedule.
GemsFDTD still retains the performance boost of around 6-7% with value 4.

Let me know your thoughts.

Regards
Ganesh

-Original Message-
From: Jan Hubicka [mailto:hubi...@ucw.cz] 
Sent: Thursday, October 24, 2013 6:48 PM
To: Gopalasubramanian, Ganesh
Cc: Jan Hubicka; gcc-patches@gcc.gnu.org; Uros Bizjak (ubiz...@gmail.com); H.J. 
Lu (hjl.to...@gmail.com)
Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 
chips

> Hi,
> 
> > Is this with -fschedule-insns? Or only with default settings?  Did you test 
> > the compile time implications of increasing the lookahead? (value of 8 is 
> > very large, we may consider enbling it only for -Ofast, limiting for 
> > postreload only or something similar).
> 
> The improvement is seen with the options "-fschedule-insns  -fschedule-insns2 
> -fsched-pressure"
> 
> Below are the build times of some of the SPEC benchmarks
> 
>   dfa8   no_lookahead
> 
> perlbench   - 196s  193s
> bzip2   - 19s   19s
> gcc - 439s  429s
> mcf - 3s3s
> gobmk   - 119s  115s
> hmmer   - 62s   60s
> sjeng   - 18s   17s
> libquantum  - 6s6s
> h264ref - 110s  107s
> omnetpp - 132s  128s
> astar   - 7s7s
> bwaves  - 4s5s
> gamess  - 1996s 1957s
> milc- 18s   18s
> GemsFDTD- 276s  272s
> 
> I think we can enable it by default rather than for -Ofast.
> Please let me know your inputs.

OK, so it is about 2%.  Did you try if you need lookahead even in the early 
pass (before reload)?  My guess would be so, but if not, it could cut the cost 
to half.  For -Ofast/-O3 it looks resonable to me, but we will need to announce 
it on the ML.  For other settings I think we need to work on more improvmeents 
or cut the expenses.

Honza
> 
> Regards
> Ganesh
> 
> -Original Message-
> From: Jan Hubicka [mailto:hubi...@ucw.cz]
> Sent: Thursday, October 24, 2013 2:54 PM
> To: Gopalasubramanian, Ganesh
> Cc: gcc-patches@gcc.gnu.org; Uros Bizjak (ubiz...@gmail.com); 
> hubi...@ucw.cz; H.J. Lu (hjl.to...@gmail.com)
> Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for 
> modern x86 chips
> 
> > Attached is the patch which does the following scheduler related changes.
> > * re-models bdver3 decoder.
> > * It enables lookahead with value 8 for all BD architectures. The patch 
> > doesn't consider if reloading is completed or not (an area that needs to be 
> > worked on).
> > * The issue rate for BD architectures are set to 4.
> > 
> > I see the following performance improvements on bdver3 machine.
> > * GemsFDTD improves by 6-7% with lookahead value changed to 8.
> > * Hmmer improves by 9% when issue rate when set to 4 .
> 
> Is this with -fschedule-insns? Or only with default settings?  Did you test 
> the compile time implications of increasing the lookahead? (value of 8 is 
> very large, we may consider enbling it only for -Ofast, limiting for 
> postreload only or something sim

RE: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips

2013-10-24 Thread Gopalasubramanian, Ganesh

Hi, 

> Is this with -fschedule-insns? Or only with default settings?  Did you test 
> the compile time implications of increasing the lookahead? (value of 8 is 
> very large, we may consider enbling it only for -Ofast, limiting for 
> postreload only or something similar).

The improvement is seen with the options "-fschedule-insns  -fschedule-insns2 
-fsched-pressure"

Below are the build times of some of the SPEC benchmarks

  dfa8   no_lookahead

perlbench   - 196s  193s
bzip2   - 19s   19s
gcc - 439s  429s
mcf - 3s3s
gobmk   - 119s  115s
hmmer   - 62s   60s
sjeng   - 18s   17s
libquantum  - 6s6s
h264ref - 110s  107s
omnetpp - 132s  128s
astar   - 7s7s
bwaves  - 4s5s
gamess  - 1996s 1957s
milc- 18s   18s
GemsFDTD- 276s  272s

I think we can enable it by default rather than for -Ofast.
Please let me know your inputs.

Regards
Ganesh

-Original Message-
From: Jan Hubicka [mailto:hubi...@ucw.cz] 
Sent: Thursday, October 24, 2013 2:54 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org; Uros Bizjak (ubiz...@gmail.com); hubi...@ucw.cz; 
H.J. Lu (hjl.to...@gmail.com)
Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 
chips

> Attached is the patch which does the following scheduler related changes.
> * re-models bdver3 decoder.
> * It enables lookahead with value 8 for all BD architectures. The patch 
> doesn't consider if reloading is completed or not (an area that needs to be 
> worked on).
> * The issue rate for BD architectures are set to 4.
> 
> I see the following performance improvements on bdver3 machine.
> * GemsFDTD improves by 6-7% with lookahead value changed to 8.
> * Hmmer improves by 9% when issue rate when set to 4 .

Is this with -fschedule-insns? Or only with default settings?  Did you test the 
compile time implications of increasing the lookahead? (value of 8 is very 
large, we may consider enbling it only for -Ofast, limiting for postreload only 
or something similar).

> 
> I have considered the following hardware details for the model.
> * There are four decoders inside a hardware decoder block.
> * These four independent decoders can execute in parallel.  (They can take 8B 
> from four different instructions and decode).
> * These four decoders are pipelined 4 cycles deep and are non-stalling.
> * Each decoder takes 8B of instruction data every cycle and tries decoding 
> it. 
> * Issue rate is 4.
What is the overall limitation on number of bytes the instructions can occupy?
I think they need to fit into 2 16 byte windows, right?
In that case we may want to tweak the existing corei7 scheduling code to take 
care of this.  Making scheduler not overly optimistic about the parallelism is 
good since it will make less register pressure during the first pass.
> 
> Is it OK for upstream?

Otherwise the patch seems OK, but I would like to know the compile time effect 
first.

Honza
> 
> Changelog
> 
> 2013-10-24  Ganesh Gopalasubramanian  
> 
> 
>   * config/i386/bdver3.md : Added two additional decoder units 
>   to support issue rate of 4 and remodeled vector unit.
> 
>   * config/i386/i386.c (ix86_issue_rate): Issue rate for BD
>   architectures is set to 4.
> 
>   * config/i386/i386.c (ia32_multipass_dfa_lookahead): DFA
>   lookahead is set to 8 for BD architectures.
> 
> Regards
> Ganesh
>

RE: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips

2013-10-24 Thread Gopalasubramanian, Ganesh

Attached is the patch which does the following scheduler related changes.
* re-models bdver3 decoder.
* It enables lookahead with value 8 for all BD architectures. The patch doesn't 
consider if reloading is completed or not (an area that needs to be worked on).
* The issue rate for BD architectures are set to 4.

I see the following performance improvements on bdver3 machine.
* GemsFDTD improves by 6-7% with lookahead value changed to 8.
* Hmmer improves by 9% when issue rate when set to 4 .

I have considered the following hardware details for the model.
* There are four decoders inside a hardware decoder block.
* These four independent decoders can execute in parallel.  (They can take 8B 
from four different instructions and decode).
* These four decoders are pipelined 4 cycles deep and are non-stalling.
* Each decoder takes 8B of instruction data every cycle and tries decoding it. 
* Issue rate is 4.

Is it OK for upstream?

Changelog

2013-10-24  Ganesh Gopalasubramanian  

* config/i386/bdver3.md : Added two additional decoder units 
to support issue rate of 4 and remodeled vector unit.

* config/i386/i386.c (ix86_issue_rate): Issue rate for BD
architectures is set to 4.

* config/i386/i386.c (ia32_multipass_dfa_lookahead): DFA
lookahead is set to 8 for BD architectures.

Regards
Ganesh



bdver3_issue_rate_lookahead.patch
Description: bdver3_issue_rate_lookahead.patch

RE: [PATCH,i386] Enable FMA4 for AMD bdver3

2013-10-16 Thread Gopalasubramanian, Ganesh

> 4.8.2 is already rolling, so too late for that.

Is 4.8 branch (gcc/branches/gcc-4_8-branch) open?
If yes, shall I commit these changes?

Regards
Ganesh

-Original Message-
From: Jakub Jelinek [mailto:ja...@redhat.com] 
Sent: Wednesday, October 16, 2013 12:41 PM
To: Uros Bizjak
Cc: Gopalasubramanian, Ganesh; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH,i386] Enable FMA4 for AMD bdver3

On Wed, Oct 16, 2013 at 09:00:58AM +0200, Uros Bizjak wrote:
> On Wed, Oct 16, 2013 at 8:28 AM, Gopalasubramanian, Ganesh 
>  wrote:
> 
> > The below patch enables FMA4 for AMD bdver3 architectures.
> >
> > "make -k check" passes.
> >
> > +2013-10-16 Ganesh Gopalasubramanian  
> > +
> > +
> > +   * config/i386/i386.c (ix86_option_override_internal): Enable FMA4
> > +   for AMD bdver3.
> > +
> 
> OK for mainline and 4.8 branch (for 4.8.2 if approved by Jakub, 
> otherwise please wait for branch to open).

4.8.2 is already rolling, so too late for that.

Jakub

[PATCH,i386] Enable FMA4 for AMD bdver3

2013-10-15 Thread Gopalasubramanian, Ganesh

Hi 

The below patch enables FMA4 for AMD bdver3 architectures.

"make -k check" passes.

Is it OK for upstream?

Regards
Ganesh


diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index fb5b267..cbb5311 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,8 @@
+2013-10-16 Ganesh Gopalasubramanian  
+
+   * config/i386/i386.c (ix86_option_override_internal): Enable FMA4
+   for AMD bdver3.
+
 2013-10-16  Hans-Peter Nilsson  

* config/cris/t-elfmulti (MULTILIB_OPTIONS, MULTILIB_DIRNAMES)
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index b5796db..c24ce36 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -3104,7 +3104,7 @@ ix86_option_override_internal (bool main_args_p,
   {"bdver3", PROCESSOR_BDVER3, CPU_BDVER3,
PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3
| PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1
-   | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX
+   | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_FMA4
| PTA_XOP | PTA_LWP | PTA_BMI | PTA_TBM | PTA_F16C
| PTA_FMA | PTA_PRFCHW | PTA_FXSR | PTA_XSAVE
| PTA_XSAVEOPT | PTA_FSGSBASE},

RE: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips

2013-10-11 Thread Gopalasubramanian, Ganesh

Hi Honza!

> I will give it a try.
I understand how the proposed patch would be. That is OK to me.

> OK, my undertanding always was, that the decoders works sort of sequentially.
>
> One decoder of bdver1 can do
> vector,
> double, single,
> single, single, signle
>
>where decoder 1 is somehow special but hardware is able to swap first and 
>second single.  Now if two decoders run, i would expect
>
> vector, vector
> single, single, signle
> double, single, double, single
> 
> to be decoded in once cycle

My understanding on the decode unit is mentioned below. Please correct me if I 
am wrong.

The sequential allotment of decoders is not there for bdver1.
Intel Sandybridge\core2 have four decoders. The first decoder is special for 
intel processors. 
For ex, In Sandybridge, the instructions that have only one µop can be decoded 
by any of the four decoders.
Instructions that have up to four µops will be decoded by the first decoder (of 
the four decoders available) only.

Bdver1 has four decoders. None of them is special unlike intel processors.
For bdver1, microcoded instructions are single issue. All four decoders are 
engaged for decoding microcoded instructions.

Decode unit of bdver3 has following specifications
* Four independent decoders which can execute in parallel
* Microcoded instructions are single issue. (All four decoders are engaged). 
This means that only one vectorpath instruction get issued in one cycle.
* The additional hardware instruction decoder increases the instruction decode 
capacity to eight instructions per clock cycle.
* The decoders are pipelined 4 cycles deep and are non-stalling.

So modeling vectorpath instructions is straightforward. No instructions are 
issued along with vector instructions.
We need to model only fastpath single and fastpath double instructions.
There are four decoders and they can execute in parallel. So they can take 
either two double or four single instructions.
We also don't need to model them in two stage as there is no sequence involved.
So, the modeling can be done such that in one cycle we schedule  2 singles + 1 
double (or) 4 singles (or) 2 doubles.

I have tried to model this for bdver3 (code changes are mentioned below). 
Please let me know your opinion.

Regards
Ganesh

Patch
-
diff --git a/gcc/config/i386/bdver3.md b/gcc/config/i386/bdver3.md
index 52418b5..9e59395 100644
--- a/gcc/config/i386/bdver3.md
+++ b/gcc/config/i386/bdver3.md
@@ -34,6 +34,8 @@

 (define_cpu_unit "bdver3-decode0" "bdver3")
 (define_cpu_unit "bdver3-decode1" "bdver3")
+(define_cpu_unit "bdver3-decode2" "bdver3")
+(define_cpu_unit "bdver3-decode3" "bdver3")
 (define_cpu_unit "bdver3-decodev" "bdver3")

 ;; Double decoded instructions take two cycles whereas
@@ -42,12 +44,15 @@
 ;; two decoders in two cycles.
 ;; Vectorpath instructions are single issue instructions.
 ;; So, we have separate unit for vector instructions.
-(exclusion_set "bdver3-decodev" "bdver3-decode0,bdver3-decode1")
+(exclusion_set "bdver3-decodev" 
"bdver3-decode0,bdver3-decode1,bdver3-decode2,bdver3-decode3")

 (define_reservation "bdver3-vector" "bdver3-decodev")
-(define_reservation "bdver3-direct" "(bdver3-decode0|bdver3-decode1)")
+(define_reservation "bdver3-direct" 
"(bdver3-decode0|bdver3-decode1|bdver3-decoder2|bdver3-decoder3)")

-(define_reservation "bdver3-double" "(bdver3-decode0|bdver3-decode1)*2")
+(define_reservation "bdver3-double" "(bdver3-decode0+bdver3-decode1)|
+   (bdver3-decode1+bdver3-decode2)|(bdver3-decode2+bdver3-decode3)|
+   (bdver3-decode0+bdver3-decode2)|(bdver3-decode1+bdver3-decode3)|
+   (bdver3-decode0+bdver3-decode3)") 


-Original Message-
From: Jan Hubicka [mailto:hubi...@ucw.cz] 
Sent: Wednesday, October 09, 2013 7:18 PM
To: Gopalasubramanian, Ganesh
Cc: Jan Hubicka; gcc-patches@gcc.gnu.org; hjl.to...@gmail.com
Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 
chips

> Before merging the insn reservations, I need to compare the latency values 
> for bdver1 and bdver3. I know that they are different for some of the 
> instructions. 
> In that case, the merging should prop up another subset of latency 
> differences. I would like to keep these insn reservations in two .md files 
> (one for bdver1 and one for bdver3) even after the merger.

I am not really insisting on merging (define_insn_reservation "bdver3*") with 
(define_insn_reservation "bdver1*).  What I have in mind is merging actual 
atuomatons in cases it makes sense.  Latencies are not really encoded in those.

Bdver 12 has:
(define_automaton "bdver1,bdver1_ieu,bdver1_load,bdver1_fp,bdver1_agu

RE: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips

2013-10-09 Thread Gopalasubramanian, Ganesh

Hi Honza, 

> Yep, I think we need to merge only those autmatas tha are same for both:
> (define_automaton "bdver3,bdver3_ieu,bdver3_load,bdver3_fp,bdver3_agu")
> probably can become
> (define_automaton "bdver3,bdver3_fp")
> with the corresponding reservations using bdver3_ieu,bdver3_load,bdver3_agu 
> changed to bdver1
> automaton.  I think it should result in smaller binary - the fact that all 
> conditionals are
> physically duplicated in bdver1/bdev3.md should be optimized away by 
> genautomata.

Before merging the insn reservations, I need to compare the latency values for 
bdver1 and bdver3. I know that they are different for some of the instructions. 
In that case, the merging should prop up another subset of latency differences. 
I would like to keep these insn reservations in two .md files (one for bdver1 
and one for bdver3) even after the merger.

> Your version has problem that it does not model the thing that the two 
> decoders works sequentially.

The two stage modeling is required so that the decode unit reservations are 
screened from other unit reservations.
But this sort of goes away in bdver3 because of the decode cycle.
In bdver3, the decode units scan two of these windows every "two" cycles 
decoding a maximum of eight instructions.
The hardware scan is done every two cycles in bdver3 whereas it is done every 
single cycle in bdver1/bdver2. (But we have two separate hardware decoders 
which guarantees higher throughput)
This means that the two stage modeling is not required in the scheduler 
descriptions since the hardware sort of guarantees that with its scanning 
mechanism. 
Our job is to make sure that 8 direct instructions get scheduled in two cycles 
or 4 double instructions get scheduled in two cycles.
So, I have modeled the bdver3 decoders such that with in a cycle they guarantee 
to issue 4 direct instructions or 2 double instructions. 
This eliminates the sequencing problem in modeling decoders and also ensures 
that the issue rate can be numbered for a single cycle rather than two cycles.
This is one of the reasons why I remodeled only bdver3. Let me know your 
comments on this.

> We can also experiment with defining TARGET_SCHED_VARIABLE_ISSUE to get more 
> realistic estimates on what still can be issued - the value of 6 is 
> unrealistically high.
This would get more complicated if we go by decoder capacity in bdver3. As we 
have two hardware decoders in steamroller (bdver3), they have a capacity to 
decode eight instructions per clock cycle, providing up to twice the decode and 
dispatch bandwidth compared to bdver1.
If we model this in GCC we need to change the issue rate to 8. If 6 is high, 
then 8 would add more joy and excitement.

TARGET_SCHED_VARIABLE_ISSUE is a nice suggestion to schedule instructions in 
different way. 

> We also should enable ia32_multipass_dfa_lookahead - with that scheduler 
> should be able to put double decoded and vector decoded insns on the proper 
> places.
Yes. Whenever we have this scheduler analysis in place we discuss about this 
but unfortunately is left as it is.
I will look into this after I do the enablement for bdver4.

> I will work on replacing most of the CPU cases into tuning flags + costs.
I am planning to get bdver4 enablement in place once scheduler descriptions for 
bdver3 is done with.
I will have cycles to look into the cost models. Please delegate some tasks if 
you can and I am willing to take them up.

Regards
Ganesh

-Original Message-
From: Jan Hubicka [mailto:hubi...@ucw.cz] 
Sent: Tuesday, October 08, 2013 3:20 PM
To: Gopalasubramanian, Ganesh
Cc: Jan Hubicka; gcc-patches@gcc.gnu.org; hjl.to...@gmail.com
Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 
chips

> Hi Honza,
> 
> I am planning to update the scheduler descriptions for bdver3 first.
> Attached is the patch. Please let me know your comments if any.
> 
> Though I agree on merging bdver1/2 and bdver3 on most parts, the FP lines and 
> decoding schemes are different. So, let me know how can I approach merging 
> these.

Yep, I think we need to merge only those autmatas tha are same for both:
(define_automaton "bdver3,bdver3_ieu,bdver3_load,bdver3_fp,bdver3_agu")
probably can become
(define_automaton "bdver3,bdver3_fp")
with the corresponding reservations using bdver3_ieu,bdver3_load,bdver3_agu 
changed to bdver1
automaton.  I think it should result in smaller binary - the fact that all 
conditionals are
physically duplicated in bdver1/bdev3.md should be optimized away by 
genautomata.

I also played a bit with the decoders and I am attaching my version - that 
seems SPEC neutral though.
Your version has problem that it does not model the thing that the two decoders 
works sequentially.

I removed the bdver1-decodev unit and instead i simply reserve all thre 
decoders + I added 
pr

RE: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips

2013-10-08 Thread Gopalasubramanian, Ganesh

Hi Honza,

I am planning to update the scheduler descriptions for bdver3 first.
Attached is the patch. Please let me know your comments if any.

Though I agree on merging bdver1/2 and bdver3 on most parts, the FP lines and 
decoding schemes are different. So, let me know how can I approach merging 
these.

Regards
Ganesh

-Original Message-
From: Jan Hubicka [mailto:hubi...@ucw.cz] 
Sent: Monday, September 30, 2013 4:47 PM
To: gcc-patches@gcc.gnu.org; Gopalasubramanian, Ganesh; hjl.to...@gmail.com
Subject: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips


Hi,
while looking into schedules produced for Buldozer and Core I noticed that they 
do not seem to match reality.  This is because ix86_issue_rate limits those 
CPUs into 3 instructions per cycle, while they are designed to do 4 and 
somewhat confused ix86_adjust_cost.

I also added stack engine into modern chips even though scheduler doesn't 
really understand that multiple push operations can happen in one cycle. At 
least it gets the stack updates in sequences of push/pop operations.

I did not updated buldozer issue rates yet.  The current scheduler model won't 
allow it to execute more than 3 instructions per cycle (and 2 for version 3).  
I think bdver1.md/bdver3.md needs to be updated first.

I am testing x86_64-linux and will commit if there are no complains.

Honza

* i386.c (ix86_issue_rate): Pentium4/Nocona issue 2 instructions
per cycle, Core/CoreI7/Haswell 4 instructions per cycle.
(ix86_adjust_cost): Add stack engine to modern AMD chips;
fix for core; remove Atom that mistakely shared code with AMD.
Index: config/i386/i386.c
===
--- config/i386/i386.c  (revision 203011)
+++ config/i386/i386.c  (working copy)
@@ -24435,17 +24435,14 @@ ix86_issue_rate (void)
 case PROCESSOR_SLM:
 case PROCESSOR_K6:
 case PROCESSOR_BTVER2:
+case PROCESSOR_PENTIUM4:
+case PROCESSOR_NOCONA:
   return 2;
 
 case PROCESSOR_PENTIUMPRO:
-case PROCESSOR_PENTIUM4:
-case PROCESSOR_CORE2:
-case PROCESSOR_COREI7:
-case PROCESSOR_HASWELL:
 case PROCESSOR_ATHLON:
 case PROCESSOR_K8:
 case PROCESSOR_AMDFAM10:
-case PROCESSOR_NOCONA:
 case PROCESSOR_GENERIC:
 case PROCESSOR_BDVER1:
 case PROCESSOR_BDVER2:
@@ -24453,6 +24450,11 @@ ix86_issue_rate (void)
 case PROCESSOR_BTVER1:
   return 3;
 
+case PROCESSOR_CORE2:
+case PROCESSOR_COREI7:
+case PROCESSOR_HASWELL:
+  return 4;
+
 default:
   return 1;
 }
@@ -24709,10 +24711,15 @@ ix86_adjust_cost (rtx insn, rtx link, rt
 case PROCESSOR_BDVER3:
 case PROCESSOR_BTVER1:
 case PROCESSOR_BTVER2:
-case PROCESSOR_ATOM:
 case PROCESSOR_GENERIC:
   memory = get_attr_memory (insn);
 
+  /* Stack engine allows to execute push&pop instructions in parall.  */
+  if (((insn_type == TYPE_PUSH || insn_type == TYPE_POP)
+  && (dep_insn_type == TYPE_PUSH || dep_insn_type == TYPE_POP))
+ && (ix86_tune != PROCESSOR_ATHLON && ix86_tune != PROCESSOR_K8))
+   return 0;
+
   /* Show ability of reorder buffer to hide latency of load by executing
 in parallel with previous instruction in case
 previous instruction is not needed to compute the address.  */ @@ 
-24737,6 +24744,29 @@ ix86_adjust_cost (rtx insn, rtx link, rt
  else
cost = 0;
}
+  break;
+
+case PROCESSOR_CORE2:
+case PROCESSOR_COREI7:
+case PROCESSOR_HASWELL:
+  memory = get_attr_memory (insn);
+
+  /* Stack engine allows to execute push&pop instructions in parall.  */
+  if ((insn_type == TYPE_PUSH || insn_type == TYPE_POP)
+ && (dep_insn_type == TYPE_PUSH || dep_insn_type == TYPE_POP))
+   return 0;
+
+  /* Show ability of reorder buffer to hide latency of load by executing
+in parallel with previous instruction in case
+previous instruction is not needed to compute the address.  */
+  if ((memory == MEMORY_LOAD || memory == MEMORY_BOTH)
+ && !ix86_agi_dependent (dep_insn, insn))
+   {
+ if (cost >= 4)
+   cost -= 4;
+ else
+   cost = 0;
+   }
   break;
 
 case PROCESSOR_SLM:


issue_rate_bdver3.patch
Description: issue_rate_bdver3.patch

RE: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion

2013-09-30 Thread Gopalasubramanian, Ganesh

> 1. For cmp/test with rip-relative addressing mem operand, don't group 
> insns. Bulldozer also doesn't support fusion for cmp/test with both 
> displacement MEM and immediate operand, while m_CORE_ALL doesn't 
> support fusion for cmp/test with MEM and immediate operand. I simplify 
> choose to use the more stringent constraint here (m_CORE_ALL's 
> constraint).

This suits Bulldozer's specification. We don't see an issue with the proposed 
patch.

Regards
Ganesh

-Original Message-
From: H.J. Lu [mailto:hjl.to...@gmail.com] 
Sent: Wednesday, September 25, 2013 2:12 AM
To: Wei Mi
Cc: Jan Hubicka; Alexander Monakov; Steven Bosscher; GCC Patches; David Li; 
Kirill Yukhin
Subject: Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion

On Tue, Sep 24, 2013 at 12:06 PM, Wei Mi  wrote:
> This is the updated patch2.
> Changed:
> 1. For cmp/test with rip-relative addressing mem operand, don't group 
> insns. Bulldozer also doesn't support fusion for cmp/test with both 
> displacement MEM and immediate operand, while m_CORE_ALL doesn't 
> support fusion for cmp/test with MEM and immediate operand. I simplify 
> choose to use the more stringent constraint here (m_CORE_ALL's 
> constraint).
> 2. Add Budozer back and merge TARGET_FUSE_CMP_AND_BRANCH_64 and 
> TARGET_FUSE_CMP_AND_BRANCH_32.
>
> bootstrap and regression pass. ok for trunk?
>
> 2013-09-24  Wei Mi  
>
> * gcc/config/i386/i386.c (rip_relative_addr_p): New Function.
> (ix86_macro_fusion_p): Ditto.
> (ix86_macro_fusion_pair_p): Ditto.
> * gcc/config/i386/i386.h: Add new tune features about macro-fusion.
> * gcc/config/i386/x86-tune.def (DEF_TUNE): Ditto.
> * gcc/doc/tm.texi: Generated.
> * gcc/doc/tm.texi.in: Ditto.
> * gcc/haifa-sched.c (try_group_insn): New Function.
> (group_insns_for_macro_fusion): Ditto.
> (sched_init): Call group_insns_for_macro_fusion.
> * gcc/sched-rgn.c (add_branch_dependences): Keep insns in
> a SCHED_GROUP at the end of BB to remain their location.
> * gcc/target.def: Add two hooks: macro_fusion_p and
> macro_fusion_pair_p.
>
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c index 
> 1fd3f60..4a04778 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -24856,6 +24856,167 @@ ia32_multipass_dfa_lookahead (void)
>  }
>  }
>
> +/* Extracted from ix86_print_operand_address. Check whether ADDR is a
> +   rip-relative address.  */
> +
> +static bool
> +rip_relative_addr_p (rtx addr)
> +{
> +  struct ix86_address parts;
> +  rtx base, index, disp;
> +  int ok;
> +
> +  if (GET_CODE (addr) == UNSPEC && XINT (addr, 1) == UNSPEC_VSIBADDR)
> +{
> +  ok = ix86_decompose_address (XVECEXP (addr, 0, 0), &parts);
> +  parts.index = XVECEXP (addr, 0, 1);
> +}
> +  else if (GET_CODE (addr) == UNSPEC && XINT (addr, 1) == UNSPEC_LEA_ADDR)
> +ok = ix86_decompose_address (XVECEXP (addr, 0, 0), &parts);  else
> +ok = ix86_decompose_address (addr, &parts);
> +
> +  gcc_assert (ok);
> +  base = parts.base;
> +  index = parts.index;
> +  disp = parts.disp;
> +
> +  if (TARGET_64BIT && !base && !index)
> +{
> +  rtx symbol = disp;
> +
> +  if (GET_CODE (disp) == CONST
> + && GET_CODE (XEXP (disp, 0)) == PLUS
> + && CONST_INT_P (XEXP (XEXP (disp, 0), 1)))
> +   symbol = XEXP (XEXP (disp, 0), 0);
> +
> +  if (GET_CODE (symbol) == LABEL_REF
> + || (GET_CODE (symbol) == SYMBOL_REF
> + && SYMBOL_REF_TLS_MODEL (symbol) == 0))
> +   return true;
> +}
> +  if (flag_pic && !base && !index)
> +{
> +  if (GET_CODE (disp) == CONST
> + && GET_CODE (XEXP (disp, 0)) == UNSPEC
> + && (XINT (XEXP (disp, 0), 1) == UNSPEC_PCREL
> + || XINT (XEXP (disp, 0), 1) == UNSPEC_GOTPCREL
> + || (TARGET_64BIT
> + && XINT (XEXP (disp, 0), 1) == UNSPEC_GOTNTPOFF)))
> +   return true;
> +}
> +  return false;
> +}
> +

It doesn't look right.  IP relative address is only possible with TARGET_64BIT 
and

1. base == pc. Or
2. UUNSPEC_PCREL,  UNSPEC_GOTPCREL, and
NSPEC_GOTNTPOFF.


--
H.J.

RE: [PATCH,i386] Default alignment for AMD BD and BT

2013-08-01 Thread Gopalasubramanian, Ganesh

Thanks Jakub!

Committed revision 201402.

-Original Message-
From: Jakub Jelinek [mailto:ja...@redhat.com] 
Sent: Thursday, July 04, 2013 4:46 PM
To: Gopalasubramanian, Ganesh
Cc: Uros Bizjak (ubiz...@gmail.com); gcc-patches@gcc.gnu.org
Subject: Re: [PATCH,i386] Default alignment for AMD BD and BT

On Thu, Jul 04, 2013 at 11:14:24AM +, Gopalasubramanian, Ganesh wrote:
> Can this be backported now!

Yes.

Jakub

RE: [PATCH,i386] Default alignment for AMD BD and BT

2013-07-04 Thread Gopalasubramanian, Ganesh

Hi Uros,

Can this be backported now!

Regards
Ganesh

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Thursday, May 30, 2013 1:40 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH,i386] Default alignment for AMD BD and BT

On Wed, May 29, 2013 at 1:28 PM, Gopalasubramanian, Ganesh 
 wrote:

> We want this to be backported to GCC48 branch.
> Please approve.
>
> -Original Message-
> From: Uros Bizjak [mailto:ubiz...@gmail.com]
> Sent: Tuesday, May 07, 2013 6:22 PM
> To: Gopalasubramanian, Ganesh
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH,i386] Default alignment for AMD BD and BT
>
> On Tue, May 7, 2013 at 9:16 AM, Gopalasubramanian, Ganesh 
>  wrote:
>
>> The patch updates the alignment values for AMD BD and BT architectures.
>>
>> "make -k check" passes.
>>
>> Is it OK for upstream?
>>
>> 2013-05-07  Ganesh Gopalasubramanian
>> 
>>
>> * config/i386/i386.c (processor_target_table): Modified default
>> alignment values for AMD BD and BT architectures.

This is OK, but please wait until 4.8 branch is open again.

Thanks,
Uros.

RE: [PATCH,i386] Default alignment for AMD BD and BT

2013-05-29 Thread Gopalasubramanian, Ganesh

Hi

We want this to be backported to GCC48 branch.
Please approve.

Regards
Ganesh

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Tuesday, May 07, 2013 6:22 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH,i386] Default alignment for AMD BD and BT

On Tue, May 7, 2013 at 9:16 AM, Gopalasubramanian, Ganesh 
 wrote:

> The patch updates the alignment values for AMD BD and BT architectures.
>
> "make -k check" passes.
>
> Is it OK for upstream?
>
> 2013-05-07  Ganesh Gopalasubramanian  
> 
>
> * config/i386/i386.c (processor_target_table): Modified default
> alignment values for AMD BD and BT architectures.

The value 11 indeed looks a bit weird, but it means: align to 16 byte boundary 
only if this can be done by skipping 10 bytes or less.

Tha patch is OK for mainline.

Thanks,
Uros.

RE: [PATCH,i386] FP Reassociation for AMD bdver1 and bdver2

2013-05-29 Thread Gopalasubramanian, Ganesh

Thanks Uros! Committed at r199405.

-Ganesh

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Thursday, May 23, 2013 4:47 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH,i386] FP Reassociation for AMD bdver1 and bdver2

On Thu, May 23, 2013 at 1:11 PM, Gopalasubramanian, Ganesh 
 wrote:

> The patch enables FP Reassociation pass AMD bdver1 and bdver2 architectures.
> We note a performance uplift of around ~8% on calculix.
>
> "make -k check" passes.
>
> Is it OK for upstream?

OK.

Thanks,
Uros.

[PATCH,i386] FP Reassociation for AMD bdver1 and bdver2

2013-05-23 Thread Gopalasubramanian, Ganesh

Hi 

The patch enables FP Reassociation pass AMD bdver1 and bdver2 architectures.
We note a performance uplift of around ~8% on calculix.

"make -k check" passes.

Is it OK for upstream?

Regards
Ganesh

Index: gcc/ChangeLog
===
--- gcc/ChangeLog   (revision 199133)
+++ gcc/ChangeLog   (working copy)
@@ -1,3 +1,8 @@
+2013-05-23  Ganesh Gopalasubramanian  
+
+* config/i386/i386.c (initial_ix86_tune_features): Enable
+FP Reassociation for AMD bdver1 and bdver2.
+
 2013-05-21  Christian Bruel  

 * dwarf2out.c (multiple_reg_loc_descriptor): Use dbx_reg_number for
Index: gcc/config/i386/i386.c
===
--- gcc/config/i386/i386.c  (revision 199133)
+++ gcc/config/i386/i386.c  (working copy)
@@ -2026,7 +2026,7 @@

   /* X86_TUNE_REASSOC_FP_TO_PARALLEL: Try to produce parallel computations
  during reassociation of fp computation.  */
-  m_ATOM | m_HASWELL,
+  m_ATOM | m_HASWELL | m_BDVER1 | m_BDVER2,

   /* X86_TUNE_GENERAL_REGS_SSE_SPILL: Try to spill general regs to SSE
  regs instead of memory.  */



-Original Message-----
From: Gopalasubramanian, Ganesh 
Sent: Monday, May 13, 2013 5:24 PM
To: gcc-patches@gcc.gnu.org
Cc: Uros Bizjak (ubiz...@gmail.com)
Subject: [PATCH,i386] FSGSBASE for AMD bdver3 

Hi 

The patch enables FSGSBASE instruction generation for AMD bdver3 architectures.

"make -k check" passes.

Is it OK for upstream?

Regards
Ganesh

Index: gcc/ChangeLog
===
--- gcc/ChangeLog   (revision 198821)
+++ gcc/ChangeLog   (working copy)
@@ -1,3 +1,8 @@
+2013-05-13  Ganesh Gopalasubramanian  
+
+* config/i386/i386.c (processor_alias_table): Add instruction
+FSGSBASE for AMD bdver3 architecture.
+
 2013-05-13  Martin Jambor  

PR middle-end/42371
Index: gcc/config/i386/i386.c
===
--- gcc/config/i386/i386.c  (revision 198821)
+++ gcc/config/i386/i386.c  (working copy)
@@ -3000,7 +3000,7 @@
| PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX
| PTA_XOP | PTA_LWP | PTA_BMI | PTA_TBM | PTA_F16C
| PTA_FMA | PTA_PRFCHW | PTA_FXSR | PTA_XSAVE
-   | PTA_XSAVEOPT},
+   | PTA_XSAVEOPT | PTA_FSGSBASE},
   {"btver1", PROCESSOR_BTVER1, CPU_GENERIC64,
PTA_64BIT | PTA_MMX |  PTA_SSE  | PTA_SSE2 | PTA_SSE3
| PTA_SSSE3 | PTA_SSE4A |PTA_ABM | PTA_CX16 | PTA_PRFCHW

RE: [PATCH, i386]: Update processor_alias_table for missing PTA_PRFCHW and PTA_FXSR flags

2013-05-16 Thread Gopalasubramanian, Ganesh

Thank you Uros for the patch. Could you backport this to the 4.8.0?

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Wednesday, May 15, 2013 11:16 PM
To: gcc-patches@gcc.gnu.org
Cc: Gopalasubramanian, Ganesh
Subject: [PATCH, i386]: Update processor_alias_table for missing PTA_PRFCHW and 
PTA_FXSR flags

Hello!

Attached patch adds missing PTA_PRFCHW and PTA_FXSR flags to x86 processor 
alias table. PRFCHW CPUID flag is shared with 3dnow prefetch flag, so some 
additional logic is needed to avoid generating SSE prefetches for non-SSE 
3dNow! targets, while still generating full set of 3dnow prefetches on 3dNow! 
targets.

2013-05-15  Uros Bizjak  

* config/i386/i386.c (iy86_option_override_internal): Update
processor_alias_table for missing PTA_PRFCHW and PTA_FXSR flags.  Add
PTA_POPCNT to corei7 entry and remove PTA_SSE from athlon-4 entry.
Do not enable SSE prefetch on non-SSE 3dNow! targets.  Enable
TARGET_PRFCHW for TARGET_3DNOW targets.
* config/i386/i386.md (prefetch): Enable for TARGET_PRFCHW instead
of TARGET_3DNOW.
(*prefetch_3dnow): Enable for TARGET_PRFCHW only.

Patch was bootstrapped and regression tested on x86_64-pc-linux-gnu {,-m32}  
and was committed to mainline SVN. The patch will be backported to 4.8 branch 
in a couple of days.

Uros.

RE: [PATCH,i386] FSGSBASE for AMD bdver3

2013-05-15 Thread Gopalasubramanian, Ganesh

Thank you Uros!

Patch for FSGSBASE instruction generation for AMD bdver3 committed to trunk 
(rr198916).

Regards
Ganesh

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Monday, May 13, 2013 5:50 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH,i386] FSGSBASE for AMD bdver3

On Mon, May 13, 2013 at 1:54 PM, Gopalasubramanian, Ganesh 
 wrote:

> The patch enables FSGSBASE instruction generation for AMD bdver3 
> architectures.
>
> "make -k check" passes.
>
> Is it OK for upstream?

OK. Please also check for missing PTA_PRFCHW and PTA_FXSR for AMD processors in 
processor_alias_table.

Thanks,
Uros.

RE: [PATCH,i386] FSGSBASE for AMD bdver3

2013-05-14 Thread Gopalasubramanian, Ganesh

Thanks Uros!

I think you mean the amdfam10 ISA mismatch between march=native and 
march=amdfam10.
The below patch fills the gap.

"make -k check" passes.

Regards
Ganesh

2013-05-07  Ganesh Gopalasubramanian  

* config/i386/i386.c (processor_alias_table): Mismatch in ISAs
Between march=native and march=amdfam10 is fixed.

--- ./wkcpy/gcc-4.9.0/gcc/config/i386/i386.c  2013-02-21 16:27:10.0 
+0530
+++ ./source/gcc-4.9.0/gcc/config/i386/i386.c   2013-10-21 22:20:28.0 
+0530
@@ -2964,7 +2964,8 @@
| PTA_SSE2 | PTA_NO_SAHF},
   {"amdfam10", PROCESSOR_AMDFAM10, CPU_AMDFAM10,
PTA_64BIT | PTA_MMX | PTA_3DNOW | PTA_3DNOW_A | PTA_SSE
-   | PTA_SSE2 | PTA_SSE3 | PTA_SSE4A | PTA_CX16 | PTA_ABM},
+   | PTA_SSE2 | PTA_SSE3 | PTA_SSE4A | PTA_CX16 | PTA_ABM
+   | PTA_FXSR | PTA_PRFCHW},
   {"barcelona", PROCESSOR_AMDFAM10, CPU_AMDFAM10,
PTA_64BIT | PTA_MMX | PTA_3DNOW | PTA_3DNOW_A | PTA_SSE
| PTA_SSE2 | PTA_SSE3 | PTA_SSE4A | PTA_CX16 | PTA_ABM},


-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Monday, May 13, 2013 5:50 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH,i386] FSGSBASE for AMD bdver3

On Mon, May 13, 2013 at 1:54 PM, Gopalasubramanian, Ganesh 
 wrote:

> The patch enables FSGSBASE instruction generation for AMD bdver3 
> architectures.
>
> "make -k check" passes.
>
> Is it OK for upstream?

OK. Please also check for missing PTA_PRFCHW and PTA_FXSR for AMD processors in 
processor_alias_table.

Thanks,
Uros.

[PATCH,i386] FSGSBASE for AMD bdver3

2013-05-13 Thread Gopalasubramanian, Ganesh

Hi 

The patch enables FSGSBASE instruction generation for AMD bdver3 architectures.

"make -k check" passes.

Is it OK for upstream?

Regards
Ganesh

Index: gcc/ChangeLog
===
--- gcc/ChangeLog   (revision 198821)
+++ gcc/ChangeLog   (working copy)
@@ -1,3 +1,8 @@
+2013-05-13  Ganesh Gopalasubramanian  
+
+* config/i386/i386.c (processor_alias_table): Add instruction
+FSGSBASE for AMD bdver3 architecture.
+
 2013-05-13  Martin Jambor  

PR middle-end/42371
Index: gcc/config/i386/i386.c
===
--- gcc/config/i386/i386.c  (revision 198821)
+++ gcc/config/i386/i386.c  (working copy)
@@ -3000,7 +3000,7 @@
| PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX
| PTA_XOP | PTA_LWP | PTA_BMI | PTA_TBM | PTA_F16C
| PTA_FMA | PTA_PRFCHW | PTA_FXSR | PTA_XSAVE
-   | PTA_XSAVEOPT},
+   | PTA_XSAVEOPT | PTA_FSGSBASE},
   {"btver1", PROCESSOR_BTVER1, CPU_GENERIC64,
PTA_64BIT | PTA_MMX |  PTA_SSE  | PTA_SSE2 | PTA_SSE3
| PTA_SSSE3 | PTA_SSE4A |PTA_ABM | PTA_CX16 | PTA_PRFCHW

RE: [PATCH,i386] Default alignment for AMD BD and BT

2013-05-13 Thread Gopalasubramanian, Ganesh

Thank you Uros!

Committed r198820. 

Regards
Ganesh

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Tuesday, May 07, 2013 6:22 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH,i386] Default alignment for AMD BD and BT

On Tue, May 7, 2013 at 9:16 AM, Gopalasubramanian, Ganesh 
 wrote:

> The patch updates the alignment values for AMD BD and BT architectures.
>
> "make -k check" passes.
>
> Is it OK for upstream?
>
> 2013-05-07  Ganesh Gopalasubramanian  
> 
>
> * config/i386/i386.c (processor_target_table): Modified default
> alignment values for AMD BD and BT architectures.

The value 11 indeed looks a bit weird, but it means: align to 16 byte boundary 
only if this can be done by skipping 10 bytes or less.

Tha patch is OK for mainline.

Thanks,
Uros.

[PATCH,i386] Default alignment for AMD BD and BT

2013-05-07 Thread Gopalasubramanian, Ganesh

Hi 

The patch updates the alignment values for AMD BD and BT architectures.

"make -k check" passes.

Is it OK for upstream?

Regards
Ganesh

2013-05-07  Ganesh Gopalasubramanian  

* config/i386/i386.c (processor_target_table): Modified default 
alignment values for AMD BD and BT architectures.

Index: gcc/config/i386/i386.c
===
--- gcc/config/i386/i386.c  (revision 198386)
+++ gcc/config/i386/i386.c  (working copy)
@@ -2450,11 +2450,11 @@
   {&generic32_cost, 16, 7, 16, 7, 16},
   {&generic64_cost, 16, 10, 16, 10, 16},
   {&amdfam10_cost, 32, 24, 32, 7, 32},
-  {&bdver1_cost, 32, 24, 32, 7, 32},
-  {&bdver2_cost, 32, 24, 32, 7, 32},
-  {&bdver3_cost, 32, 24, 32, 7, 32},
-  {&btver1_cost, 32, 24, 32, 7, 32},
-  {&btver2_cost, 32, 24, 32, 7, 32},
+  {&bdver1_cost, 16, 10, 16, 7, 11},
+  {&bdver2_cost, 16, 10, 16, 7, 11},
+  {&bdver3_cost, 16, 10, 16, 7, 11},
+  {&btver1_cost, 16, 10, 16, 7, 11},
+  {&btver2_cost, 16, 10, 16, 7, 11},
   {&atom_cost, 16, 15, 16, 7, 16}
 };

RE: [patch][wwwdocs] gcc 4.8 changes - AMD new cores

2013-02-14 Thread Gopalasubramanian, Ganesh

Thank you Gerald!

Committed with the changes. 

Regards
Ganesh

-Original Message-
From: Gerald Pfeifer [mailto:ger...@pfeifer.com] 
Sent: Thursday, February 14, 2013 2:40 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patchesUros Bizjak
Subject: RE: [patch][wwwdocs] gcc 4.8 changes - AMD new cores

On Thu, 14 Feb 2013, Gopalasubramanian, Ganesh wrote:
> Is it OK for wwdocs?

Looks good to me if you say "...through the... options" (adding "the" in two 
cases) and breaking the lines to not exceed 76 columns.

Thanks,
Gerald

RE: [patch][wwwdocs] gcc 4.8 changes - AMD new cores

2013-02-13 Thread Gopalasubramanian, Ganesh

Is it OK for wwdocs?

Index: gcc-4.8/changes.html
===
RCS file: /cvs/gcc/wwwdocs/htdocs/gcc-4.8/changes.html,v
retrieving revision 1.96
diff -u -r1.96 changes.html
--- gcc-4.8/changes.html12 Feb 2013 16:33:58 -  1.96
+++ gcc-4.8/changes.html13 Feb 2013 08:24:53 -
@@ -529,6 +529,10 @@
 information.
 
  Windows MinGW-w64 targets (*-w64-mingw*) require at 
least r5437 from the Mingw-w64 trunk. 
+Support for new AMD family 15h processors (Steamroller core) is now 
available
+  through -march=bdver3 and -mtune=bdver3 
options.
+Support for new AMD family 16h processors (Jaguar core) is now 
available
+  through -march=btver2 and -mtune=btver2 
options.
   

 FRV

-Original Message-
From: Mikael Morin [mailto:mikael.mo...@sfr.fr] 
Sent: Wednesday, February 13, 2013 6:38 PM
To: Richard Biener
Cc: Gopalasubramanian, Ganesh; gcc-patches@gcc.gnu.org; ubizjak at gmail dot 
com (gcc-bugzi...@gcc.gnu.org); ger...@pfeifer.com
Subject: Re: [patch][wwwdocs] gcc 4.8 changes - AMD new cores

Le 13/02/2013 14:00, Richard Biener a écrit :
> Of course not.  Next they'll add blver ...
Sorry

[patch][wwwdocs] gcc 4.8 changes - AMD new cores

2013-02-13 Thread Gopalasubramanian, Ganesh

Hello, 

This patch adds short words about the new AMD cores that got enabled in GCC-4.8.
OK for the wwwdocs? 

Regards
Ganesh

Index: gcc-4.8/changes.html
===
RCS file: /cvs/gcc/wwwdocs/htdocs/gcc-4.8/changes.html,v
retrieving revision 1.96
diff -u -r1.96 changes.html
--- gcc-4.8/changes.html12 Feb 2013 16:33:58 -  1.96
+++ gcc-4.8/changes.html13 Feb 2013 08:24:53 -
@@ -529,6 +529,10 @@
 information.
 
  Windows MinGW-w64 targets (*-w64-mingw*) require at 
least r5437 from the Mingw-w64 trunk. 
+Support for new AMD family 15h processors (Steamroller core) is now 
available
+  through -march=bdver3 and -mtune=bdver3 
options.
+Support for new AMD family 16h processors (Jaguar core) is now 
available
+  through -march=btver2 and -mtune=btver2 
options.
   

 FRV

RE: [PATCH, i386]: AMD bdver3 enablement

2012-11-15 Thread Gopalasubramanian, Ganesh

Thank Uros for the comments.

The changes are committed to trunk
http://gcc.gnu.org/viewcvs?view=revision&revision=193548
http://gcc.gnu.org/viewcvs?view=revision&revision=193549

Regards
Ganesh

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Wednesday, November 14, 2012 4:15 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH, i386]: AMD bdver3 enablement

On Wed, Nov 14, 2012 at 10:22 AM, Gopalasubramanian, Ganesh 
 wrote:

>> sseshuf replaces sselog in some insn patterns, but should be handled in the 
>> same way in *existing* .md files.
>
> Modifications done as per the comments.
> 1. Sseshuf is added along with sselog in existing md files.
> 2. sseshuf is handled in a separate pattern in bdver3.md
>
> Bootstrapping and "make -k check" passes.
> Ok for trunk?
>
> 2012-11-14  Ganesh Gopalasubramanian  
> 
>
> bdver3 Enablement
> * gcc/doc/extend.texi: Add details about bdver3.
> * gcc/doc/invoke.texi: Add details about bdver3.
> * config.gcc (i[34567]86-*-linux* | ...): Add bdver3.
> (case ${target}): Add bdver3.
> * config/i386/i386.h (TARGET_BDVER3): New definition.
> * config/i386/i386.md (define_attr "cpu"): Add bdver3.
> * config/i386/sse.md (sseshuf): New type attribute.
> * config/i386/athlon.md (sseshuf):Likewise.
> * config/i386/atom.md (sseshuf):Likewise.
> * config/i386/ppro.md (sseshuf):Likewise.
> * config/i386/bdver1.md (sseshuf):Likewise.
> * config/i386/i386.opt (flag_dispatch_scheduler): Add bdver3.
> * config/i386/i386-c.c (ix86_target_macros_internal): Add
> bdver3 def_and_undef
> * config/i386/driver-i386.c (host_detect_local_cpu): Let
> -march=native recognize bdver3 processors.
> * config/i386/i386.c (struct processor_costs bdver3_cost): New.
> (m_BDVER3): New definition.
> (m_AMD_MULTIPLE): Includes m_BDVER3.
> (initial_ix86_tune_features): Add bdver3 tune.
> (processor_target_table): Add bdver3 entry.
> (static const char *const cpu_names): Add bdver3 entry.
> (software_prefetching_beneficial_p): Add bdver3.
> (ix86_option_override_internal): Add bdver3 instruction sets.
> (ix86_option_override_internal): Remove XSAVEOPT for bdver1
> and bdver2.
> (ix86_issue_rate): Add bdver3.
> (ix86_adjust_cost): Add bdver3.
> (enum target_cpu_default): Add TARGET_CPU_DEFAULT_bdver3.
> (enum processor_type): Add PROCESSOR_BDVER3.
> * config/i386/bdver3.md: New file describing bdver3 pipelines.

OK for mainline.

Thanks,
Uros.

RE: [PATCH, i386]: AMD bdver3 enablement

2012-11-11 Thread Gopalasubramanian, Ganesh

> You can see from the changes of sse.md that this is functionally a no-op 
> change.
Sseshuf replaces sselog.
So, do you mean it should be added with sselog instead of sseadd?
Adding it with sseadd (instead of sselog) influences the latency information.

Regards
Ganesh

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Monday, November 12, 2012 2:30 AM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH, i386]: AMD bdver3 enablement

On Fri, Nov 9, 2012 at 4:39 AM, Gopalasubramanian, Ganesh
 wrote:

> Changes done with respect to the review comments.
> Conditionally setting "sseshuf" type attribute has been removed.
> Instead new attribute is added and is included for other attribute 
> calculations.
>
> The patch is attached as (difflog.txt).
> The new file (bdver3.md) describing the pipelines is also attached.
>
> Bootstrapping and "make -k check" passes.
>
> OK for upstream?
>
> 2012-11-09  Ganesh Gopalasubramanian  
>
> bdver3 Enablement
> * gcc/doc/extend.texi: Add details about bdver3.
> * gcc/doc/invoke.texi: Add details about bdver3.
> * config.gcc (i[34567]86-*-linux* | ...): Add bdver3.
> (case ${target}): Add bdver3.
> * config/i386/i386.h (TARGET_BDVER3): New definition.
> * config/i386/i386.md (define_attr "cpu"): Add bdver3.
> * config/i386/sse.md (sseshuf): New type attribute.
> * config/i386/athlon.md (sseshuf):Likewise.
> * config/i386/atom.md (sseshuf):Likewise.
> * config/i386/ppro.md (sseshuf):Likewise.

Index: gcc/config/i386/atom.md
===
--- gcc/config/i386/atom.md (revision 193132)
+++ gcc/config/i386/atom.md (working copy)
@@ -455,6 +455,30 @@
 (eq_attr "memory" "!none")))
   "atom-simple-0")

+(define_insn_reservation  "atom_sseshuf" 1
+  (and (eq_attr "cpu" "atom")
+   (and (eq_attr "type" "sseshuf")
+(eq_attr "memory" "none")))
+  "atom-simple-either")
+
+(define_insn_reservation  "atom_sseshuf_mem" 1
+  (and (eq_attr "cpu" "atom")
+   (and (eq_attr "type" "sseshuf")
+(eq_attr "memory" "!none")))
+  "atom-simple-either")
+
+(define_insn_reservation  "atom_sseshuf1" 1
+  (and (eq_attr "cpu" "atom")
+   (and (eq_attr "type" "sseshuf1")
+(eq_attr "memory" "none")))
+  "atom-simple-0")
+
+(define_insn_reservation  "atom_sseshuf1_mem" 1
+  (and (eq_attr "cpu" "atom")
+   (and (eq_attr "type" "sseshuf1")
+(eq_attr "memory" "!none")))
+  "atom-simple-0")
+
 ;; not pmad, not psad
 (define_insn_reservation  "atom_sseiadd" 1
   (and (eq_attr "cpu" "atom")

This was not what I had in mind for changes in existing .md files.
Just change them in this way:

Index: atom.md
===
--- atom.md (revision 193407)
+++ atom.md (working copy)
@@ -594,7 +594,7 @@
 ;; no memory simple
 (define_insn_reservation  "atom_sseadd" 5
   (and (eq_attr "cpu" "atom")
-   (and (eq_attr "type" "sseadd,sseadd1")
+   (and (eq_attr "type" "sseadd,sseshuf,sseadd1,sseshuf1")
 (and (eq_attr "memory" "none")
  (and (eq_attr "mode" "!V2DF")
   (eq_attr "atom_unit" "!complex")
@@ -603,7 +603,7 @@
 ;; memory simple
 (define_insn_reservation  "atom_sseadd_mem" 5
   (and (eq_attr "cpu" "atom")
-   (and (eq_attr "type" "sseadd,sseadd1")
+   (and (eq_attr "type" "sseadd,sseshuf,sseadd1,sseshuf1")
 (and (eq_attr "memory" "!none")
  (and (eq_attr "mode" "!V2DF")
   (eq_attr "atom_unit" "!complex")
@@ -612,7 +612,7 @@
 ;; maxps, minps, *pd, hadd, hsub
 (define_insn_reservation  "atom_sseadd_3" 8
   (and (eq_attr "cpu" "atom")
-   (and (eq_attr "type" "sseadd,sseadd1")
+   (and (eq_attr "type" "sseadd,sseshuf,sseadd1,sseshuf1")
 (ior (eq_attr "mode" "V2DF") (eq_attr "atom_unit" "complex"
   "atom-complex, atom-all-eu*7")

You can see from the changes of sse.md that this is functionally a no-op change.

Uros.

Add myself to MAINTAINERS

2012-10-30 Thread Gopalasubramanian, Ganesh

Adding myself to the list of members in "write after approval".

Index: ChangeLog
===
--- ChangeLog   (revision 192977)
+++ ChangeLog   (working copy)
@@ -1,3 +1,7 @@
+2012-10-30 Ganesh Gopalasubramanian  
+
+   * MAINTAINERS (Write After Approval): Add myself.
+
 2012-10-26  James Greenhalgh  

* MAINTAINERS (Write After Approval): Add myself.
Index: MAINTAINERS
===
--- MAINTAINERS (revision 192977)
+++ MAINTAINERS (working copy)
@@ -372,6 +372,7 @@
 Chao-ying Fu   f...@mips.com
 Gary Funck g...@intrepid.com
 Pompapathi V Gadad pompapathi.v.ga...@nsc.com
+Gopalasubramanian Ganesh   ganesh.gopalasubraman...@amd.com
 Kaveh Ghazigh...@gcc.gnu.org
 Matthew Gingellging...@gnat.com
 Tristan Gingoldging...@adacore.com

Regards
Ganesh

RE: GCC 4.8.0 Status Report (2012-10-29), Stage 1 to end soon

2012-10-29 Thread Gopalasubramanian, Ganesh

Hi Jakub,

We are working on the following. 
1. bdver3 enablement. Review completed. Changes to be incorporated and 
checked-in.
http://gcc.gnu.org/ml/gcc-patches/2012-10/msg01131.html

2. btver2 basic enablement is done 
(http://gcc.gnu.org/ml/gcc-patches/2012-07/msg01018.html)/
Scheduler descriptions are being updated. This is architecture specific and we 
consider it not to be a stage-1 material.

Regards
Ganesh

-Original Message-
From: Jakub Jelinek [mailto:ja...@redhat.com] 
Sent: Monday, October 29, 2012 11:27 PM
To: g...@gcc.gnu.org
Cc: gcc-patches@gcc.gnu.org
Subject: GCC 4.8.0 Status Report (2012-10-29), Stage 1 to end soon

Status
==

I'd like to close the stage 1 phase of GCC 4.8 development on Monday, November 
5th.  If you have still patches for new features you'd like to see in GCC 4.8, 
please post them for review soon.  Patches posted before the freeze, but 
reviewed shortly after the freeze, may still go in, further changes should be 
just bugfixes and documentation fixes.


Quality Data


Priority  #   Change from Last Report
---   ---
P1   23   + 23
P2   77   +  8
P3   85   + 84
---   ---
Total   185   +115


Previous Report
===

http://gcc.gnu.org/ml/gcc/2012-03/msg00011.html

The next report will be sent by me again, announcing end of stage 1.

RE: [PATCH, i386]: Fix PR51109, symbol size in scheduler state machine is reduced

2012-10-10 Thread Gopalasubramanian, Ganesh

That was obvious. Sorry for the wrong commit.

Thanks Jakub.

-Ganesh

-Original Message-
From: Paolo Carlini [mailto:paolo.carl...@oracle.com] 
Sent: Wednesday, October 10, 2012 4:33 PM
To: Jakub Jelinek
Cc: Gopalasubramanian, Ganesh; Uros Bizjak; gcc-patches@gcc.gnu.org; 
veku...@gcc.gnu.org
Subject: Re: [PATCH, i386]: Fix PR51109, symbol size in scheduler state machine 
is reduced

On 10/10/2012 01:00 PM, Jakub Jelinek wrote:
> I have removed the extra line as obvious in SVN, to allow my 
> bootstraps to continue.
Thanks!

Paolo.

RE: [PATCH, i386]: Fix PR51109, symbol size in scheduler state machine is reduced

2012-10-03 Thread Gopalasubramanian, Ganesh

Testing was done before posting the patch. It was successful.

Regards
Ganesh

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Thursday, September 27, 2012 5:57 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH, i386]: Fix PR51109, symbol size in scheduler state machine 
is reduced

On Thu, Sep 27, 2012 at 10:30 AM, Gopalasubramanian, Ganesh 
 wrote:

> This is a fix for PR 51109.
>
> There are three changes
>
> 1.  Microcoded instructions are considered as single issue instructions 
> and are therefore issued to a separate execution unit.
> 2.  The multiplier unit is attached to execution unit 1 (ieu1). Since ieu 
> is handled as a separate automaton in the patch, separate "mult" automaton is 
> not required.
> 3.  The integer execution units (2AGUs and 2EXs) are now decoupled. Now, 
> they are described as separate automatons.
>
> Is it OK for upstream?
>
> Regards
> Ganesh
>
> 2012-09-27  Ganesh Gopalasubramanian  
> 
>
> PR 51109
> * gcc/config/i386/bdver1.md (bdver1_int): Automaton has been
> split to reduce state transitions.

OK for mainline, if tested according to [1].

[1] http://gcc.gnu.org/contribute.html#testing

Thanks,
Uros.

[PATCH, i386]: Fix PR51109, symbol size in scheduler state machine is reduced

2012-09-27 Thread Gopalasubramanian, Ganesh

Hi All,

This is a fix for PR 51109.

There are three changes

1.  Microcoded instructions are considered as single issue instructions and 
are therefore issued to a separate execution unit.
2.  The multiplier unit is attached to execution unit 1 (ieu1). Since ieu 
is handled as a separate automaton in the patch, separate "mult" automaton is 
not required.
3.  The integer execution units (2AGUs and 2EXs) are now decoupled. Now, 
they are described as separate automatons.

Is it OK for upstream?

Regards
Ganesh

2012-09-27  Ganesh Gopalasubramanian  

PR 51109
* gcc/config/i386/bdver1.md (bdver1_int): Automaton has been 
split to reduce state transitions.
   
Index: gcc/config/i386/bdver1.md
===
--- gcc/config/i386/bdver1.md   (revision 191658)
+++ gcc/config/i386/bdver1.md   (working copy)
@@ -36,7 +36,7 @@
 (define_attr "bdver1_decode" "direct,vector,double"
   (const_string "direct"))

-(define_automaton "bdver1,bdver1_int,bdver1_load,bdver1_mult,bdver1_fp")
+(define_automaton "bdver1,bdver1_ieu,bdver1_load,bdver1_fp,bdver1_agu")

 (define_cpu_unit "bdver1-decode0" "bdver1")
 (define_cpu_unit "bdver1-decode1" "bdver1")
@@ -71,16 +71,14 @@
 | (nothing,(bdver1-decode1 + 
bdver1-decode2)))")


-(define_cpu_unit "bdver1-ieu0" "bdver1_int")
-(define_cpu_unit "bdver1-ieu1" "bdver1_int")
+(define_cpu_unit "bdver1-ieu0" "bdver1_ieu")
+(define_cpu_unit "bdver1-ieu1" "bdver1_ieu")
 (define_reservation "bdver1-ieu" "(bdver1-ieu0 | bdver1-ieu1)")

-(define_cpu_unit "bdver1-agu0" "bdver1_int")
-(define_cpu_unit "bdver1-agu1" "bdver1_int")
+(define_cpu_unit "bdver1-agu0" "bdver1_agu")
+(define_cpu_unit "bdver1-agu1" "bdver1_agu")
 (define_reservation "bdver1-agu" "(bdver1-agu0 | bdver1-agu1)")

-(define_cpu_unit "bdver1-mult" "bdver1_mult")
-
 (define_cpu_unit "bdver1-load0" "bdver1_load")
 (define_cpu_unit "bdver1-load1" "bdver1_load")
 (define_reservation "bdver1-load" "bdver1-agu,
@@ -93,6 +91,12 @@
 ;; 128bit SSE instructions issue two stores at once.
 (define_reservation "bdver1-store2" "(bdver1-load0 + bdver1-load1)")

+;; vectorpath (microcoded) instructions are single issue instructions.
+;; So, they occupy all the integer units.
+(define_reservation "bdver1-ivector" "bdver1-ieu0+bdver1-ieu1+
+  bdver1-agu0+bdver1-agu1+
+  bdver1-load0+bdver1-load1")
+
 ;; The FP operations start to execute at stage 12 in the pipeline, while
 ;; integer operations start to execute at stage 9 for athlon and 11 for K8
 ;; Compensate the difference for athlon because it results in significantly
@@ -125,7 +129,7 @@
 (define_insn_reservation "bdver1_call" 0
 (and (eq_attr "cpu" "bdver1,bdver2")
  (eq_attr "type" "call,callv"))
-"bdver1-double,bdver1-agu,bdver1-ieu")
+"bdver1-double,bdver1-agu")
 ;; PUSH mem is double path.
 (define_insn_reservation "bdver1_push" 1
 (and (eq_attr "cpu" "bdver1,bdver2")
@@ -135,17 +139,17 @@
 (define_insn_reservation "bdver1_pop" 1
 (and (eq_attr "cpu" "bdver1,bdver2")
  (eq_attr "type" "pop"))
-"bdver1-direct,(bdver1-ieu+bdver1-load)")
+"bdver1-direct,bdver1-ivector")
 ;; LEAVE no latency info so far, assume same with amdfam10.
 (define_insn_reservation "bdver1_leave" 3
 (and (eq_attr "cpu" "bdver1,bdver2")
  (eq_attr "type" "leave"))
-"bdver1-vector,(bdver1-ieu+bdver1-load)")
+"bdver1-vector,bdver1-ivector")
 ;; LEA executes in AGU unit with 1 cycle latency on BDVER1.
 (define_insn_reservation "bdver1_lea" 1
 (and (eq_attr "cpu" "bdver1,bdver2")
  (eq_attr "type" "lea"))
-"bdver1-direct,bdver1-agu,nothing")
+"bdver1-direct,bdver1-agu")

 ;; MUL executes in special multiplier unit attached to IEU1.
 (define_insn_reservation "bdver1_imul_DI" 6
@@ -153,23 +157,23 @@
  (and (eq_attr "type" "imul")
   (and (eq_attr "mode" "DI")
(eq_attr "memory" "none,unknown"
-
"bdver1-direct1,bdver1-ieu1,bdver1-mult,nothing,bdver1-ieu1")
+"bdver1-direct1,bdver1-ieu1")
 (define_insn_reservation "bdver1_imul" 4
 (and (eq_attr "cpu" "bdver1,bdver2")
  (and (eq_attr "type" "imul")
   (eq_attr "memory" "none,unknown")))
-"bdver1-direct1,bdver1-ieu1,bdver1-mult,bdver1-ieu1")
+"bdver1-direct1,bdver1-ieu1")
 (define_insn_reserv

RE: [PATCH,i386] fma4 addition for bdver2

2012-09-09 Thread Gopalasubramanian, Ganesh

Hi,

The second change (done in config/i386/driver-i386.c (host_detect_local_cpu)) 
is not reflected in svn revision 191109.
Since we are handling the fma instruction selection in i386.c\i386.md, we need 
not disable the flag in driver.

Let me know your opinion.

Regards
Ganesh

-Original Message-
From: Gopalasubramanian, Ganesh 
Sent: Wednesday, September 05, 2012 3:41 PM
To: gcc-patches@gcc.gnu.org
Cc: Uros Bizjak (ubiz...@gmail.com)
Subject: [PATCH,i386] fma4 addition for bdver2 

Hello,

FMA4 and FMA3 ISA are implemented in bdver2 target.
FMA3 is selected by default. 
This patch supports the use of FMA4 intrinsics for bdver2 targets.

Is it OK for trunk?

Regards
Ganesh

2012-09-05  Ganesh Gopalasubramanian  

* config/i386/i386.md : Comments on fma4 instruction 
selection reflect requirement on register pressure based
cost model.
   
* config/i386/driver-i386.c (host_detect_local_cpu): fma4
flag is set-reset as informed by the cpuid flag.

* config/i386/i386.c (processor_alias_table): fma4
flag is enabled for bdver2.

Index: gcc/config/i386/i386.md
===
--- gcc/config/i386/i386.md (revision 190830)
+++ gcc/config/i386/i386.md (working copy)
@@ -659,9 +659,11 @@
 (eq_attr "isa" "noavx2") (symbol_ref "!TARGET_AVX2")
 (eq_attr "isa" "bmi2") (symbol_ref "TARGET_BMI2")
 (eq_attr "isa" "fma") (symbol_ref "TARGET_FMA")
-;; Disable generation of FMA4 instructions for generic code
-;; since FMA3 is preferred for targets that implement both
-;; instruction sets.
+;; Fma instruction selection has to be done based on
+;; register pressure. For generating fma4, a cost model
+;; based on register pressure is required. Till then,
+;; fma4 instruction is disabled for targets that implement
+;; both fma and fma4 instruction sets.
 (eq_attr "isa" "fma4")
   (symbol_ref "TARGET_FMA4 && !TARGET_FMA")
]
Index: gcc/config/i386/driver-i386.c
===
--- gcc/config/i386/driver-i386.c   (revision 190830)
+++ gcc/config/i386/driver-i386.c   (working copy)
@@ -483,8 +483,6 @@
   has_abm = ecx & bit_ABM;
   has_lwp = ecx & bit_LWP;
   has_fma4 = ecx & bit_FMA4;
-  if (vendor == SIG_AMD && has_fma4 && has_fma)
-   has_fma4 = 0;
   has_xop = ecx & bit_XOP;
   has_tbm = ecx & bit_TBM;
   has_lzcnt = ecx & bit_LZCNT;
Index: gcc/config/i386/i386.c
===
--- gcc/config/i386/i386.c  (revision 190830)
+++ gcc/config/i386/i386.c  (working copy)
@@ -3164,7 +3164,7 @@
   {"bdver2", PROCESSOR_BDVER2, CPU_BDVER2,
PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3
| PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1
-   | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX
+   | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_FMA4
| PTA_XOP | PTA_LWP | PTA_BMI | PTA_TBM | PTA_F16C
| PTA_FMA},
   {"btver1", PROCESSOR_BTVER1, CPU_GENERIC64,

Regards
Ganesh

[PATCH,i386] fma4 addition for bdver2

2012-09-05 Thread Gopalasubramanian, Ganesh

Hello,

FMA4 and FMA3 ISA are implemented in bdver2 target.
FMA3 is selected by default. 
This patch supports the use of FMA4 intrinsics for bdver2 targets.

Is it OK for trunk?

Regards
Ganesh

2012-09-05  Ganesh Gopalasubramanian  

* config/i386/i386.md : Comments on fma4 instruction 
selection reflect requirement on register pressure based
cost model.
   
* config/i386/driver-i386.c (host_detect_local_cpu): fma4
flag is set-reset as informed by the cpuid flag.

* config/i386/i386.c (processor_alias_table): fma4
flag is enabled for bdver2.

Index: gcc/config/i386/i386.md
===
--- gcc/config/i386/i386.md (revision 190830)
+++ gcc/config/i386/i386.md (working copy)
@@ -659,9 +659,11 @@
 (eq_attr "isa" "noavx2") (symbol_ref "!TARGET_AVX2")
 (eq_attr "isa" "bmi2") (symbol_ref "TARGET_BMI2")
 (eq_attr "isa" "fma") (symbol_ref "TARGET_FMA")
-;; Disable generation of FMA4 instructions for generic code
-;; since FMA3 is preferred for targets that implement both
-;; instruction sets.
+;; Fma instruction selection has to be done based on
+;; register pressure. For generating fma4, a cost model
+;; based on register pressure is required. Till then,
+;; fma4 instruction is disabled for targets that implement
+;; both fma and fma4 instruction sets.
 (eq_attr "isa" "fma4")
   (symbol_ref "TARGET_FMA4 && !TARGET_FMA")
]
Index: gcc/config/i386/driver-i386.c
===
--- gcc/config/i386/driver-i386.c   (revision 190830)
+++ gcc/config/i386/driver-i386.c   (working copy)
@@ -483,8 +483,6 @@
   has_abm = ecx & bit_ABM;
   has_lwp = ecx & bit_LWP;
   has_fma4 = ecx & bit_FMA4;
-  if (vendor == SIG_AMD && has_fma4 && has_fma)
-   has_fma4 = 0;
   has_xop = ecx & bit_XOP;
   has_tbm = ecx & bit_TBM;
   has_lzcnt = ecx & bit_LZCNT;
Index: gcc/config/i386/i386.c
===
--- gcc/config/i386/i386.c  (revision 190830)
+++ gcc/config/i386/i386.c  (working copy)
@@ -3164,7 +3164,7 @@
   {"bdver2", PROCESSOR_BDVER2, CPU_BDVER2,
PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3
| PTA_SSE4A | PTA_CX16 | PTA_ABM | PTA_SSSE3 | PTA_SSE4_1
-   | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX
+   | PTA_SSE4_2 | PTA_AES | PTA_PCLMUL | PTA_AVX | PTA_FMA4
| PTA_XOP | PTA_LWP | PTA_BMI | PTA_TBM | PTA_F16C
| PTA_FMA},
   {"btver1", PROCESSOR_BTVER1, CPU_GENERIC64,

Regards
Ganesh

RE: [PATCH,i386] fma,fma4 and xop flags

2012-08-16 Thread Gopalasubramanian, Ganesh

> This won't work, since we have to prefer FMA3 also in case when only "-mfma 
> -mfma4" without -mtune=XX is used. 
> We can add TARGET_FMA_BOTH though, but I doubt there will ever be target that 
> implements both insn sets without preferences.

Preferring FMA3 over FMA4 might not do good always. For instance, with 
increased register pressure FMA3 can be used. 
But, when we have more registers at our disposal, fma4 if used might do good by 
avoiding extra reload.
IMO, when preference of FMA instructions is adjudged by register pressure,  we 
may need some functionality to support that.

So, ideally for bdver2, we like to have both fma and fma4 getting generated 
with options "-mfma -mfma4".

Regards
Ganesh

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Tuesday, August 14, 2012 9:12 PM
To: Richard Henderson
Cc: Gopalasubramanian, Ganesh; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH,i386] fma,fma4 and xop flags

On Mon, Aug 13, 2012 at 9:50 PM, Richard Henderson  wrote:
> On 08/13/2012 12:33 PM, Uros Bizjak wrote:
>> AFAIU fma3 is better than fma4 for bdver2 (the only CPU that 
>> implements both FMA sets). Current description of bdver2 doesn't even 
>> enable fma4 in processor_alias_table due to this fact.
>>
>> The change you are referring to adds preference for fma3 insn set for 
>> generic code (not FMA4 builtins!), even when fma4 is enabled. So, no 
>> matter which combination and sequence of -mfmfa -mfma4 or -mxop user 
>> passes to the compiler, only fma3 instructions will be generated.
>
> This rationale needs to appear as a comment above
>
>> +  (eq_attr "isa" "fma4")
>> +(symbol_ref "TARGET_FMA4 && !TARGET_FMA")

I plan to commit following patch:

--cut here--
Index: i386.md
===
--- i386.md (revision 190362)
+++ i386.md (working copy)
@@ -659,6 +659,9 @@
 (eq_attr "isa" "noavx2") (symbol_ref "!TARGET_AVX2")
 (eq_attr "isa" "bmi2") (symbol_ref "TARGET_BMI2")
 (eq_attr "isa" "fma") (symbol_ref "TARGET_FMA")
+;; Disable generation of FMA4 instructions for generic code
+;; since FMA3 is preferred for targets that implement both
+;; instruction sets.
 (eq_attr "isa" "fma4")
   (symbol_ref "TARGET_FMA4 && !TARGET_FMA")
]
--cut here--

> Longer term we may well require some sort of
>
>   (TARGET_FMA4 && !(TARGET_FMA && TARGET_PREFER_FMA3))
>
> with an appropriate entry in ix86_tune_features to match.

This won't work, since we have to prefer FMA3 also in case when only "-mfma 
-mfma4" without -mtune=XX is used. We can add TARGET_FMA_BOTH though, but I 
doubt there will ever be target that implements both insn sets without 
preferences.

Uros.

RE: [PATCH,i386] cpuid function for prefetchw

2012-08-13 Thread Gopalasubramanian, Ganesh

Yes! Thanks Jakub.

-Original Message-
From: Jakub Jelinek [mailto:ja...@redhat.com] 
Sent: Monday, August 13, 2012 3:16 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH,i386] cpuid function for prefetchw

On Mon, Aug 13, 2012 at 09:29:45AM +, Gopalasubramanian, Ganesh wrote:
> To get the prefetchw cpuid flag, cpuid function 0x8001 needs to be 
> called.
> Previous to patch, function 0x7 is called.
> 
> Bootstrapping and "make -k check" passes without failures.
> Ok for trunk?

IMHO you move it to a wrong spot, ecx bits of CPUID 0x8001 are tested 
earlier.

So I think you want this instead (bootstrap/regtest in progress):

2012-08-13  Ganesh Gopalasubramanian  
Jakub Jelinek  

PR driver/54210
* config/i386/driver-i386.c (host_detect_local_cpu): Test bit_PRFCHW
bit of CPUID 0x8001 %ecx instead of CPUID 7 %ecx.
* config/i386/cpuid.h (bits_PRFCHW): Move definition to CPUID
0x8001 %ecx flags.

--- gcc/config/i386/driver-i386.c.jj2012-08-10 15:49:25.0 +0200
+++ gcc/config/i386/driver-i386.c   2012-08-13 11:30:14.570494736 +0200
@@ -467,7 +467,6 @@ const char *host_detect_local_cpu (int a
   has_bmi2 = ebx & bit_BMI2;
   has_fsgsbase = ebx & bit_FSGSBASE;
   has_rdseed = ebx & bit_RDSEED;
-  has_prfchw = ecx & bit_PRFCHW;
   has_adx = ebx & bit_ADX;
 }
 
@@ -488,6 +487,7 @@ const char *host_detect_local_cpu (int a
   has_xop = ecx & bit_XOP;
   has_tbm = ecx & bit_TBM;
   has_lzcnt = ecx & bit_LZCNT;
+  has_prfchw = ecx & bit_PRFCHW;
 
   has_longmode = edx & bit_LM;
   has_3dnowp = edx & bit_3DNOWP;
--- gcc/config/i386/cpuid.h.jj  2012-08-10 15:49:25.0 +0200
+++ gcc/config/i386/cpuid.h 2012-08-13 11:31:30.346494092 +0200
@@ -52,6 +52,7 @@
 #define bit_LAHF_LM(1 << 0)
 #define bit_ABM(1 << 5)
 #define bit_SSE4a  (1 << 6)
+#define bit_PRFCHW (1 << 8)
 #define bit_XOP (1 << 11)
 #define bit_LWP(1 << 15)
 #define bit_FMA4(1 << 16)
@@ -69,7 +70,6 @@
 #define bit_HLE(1 << 4)
 #define bit_AVX2   (1 << 5)
 #define bit_BMI2   (1 << 8)
-#define bit_PRFCHW (1 << 8)
 #define bit_RTM(1 << 11)
 #define bit_RDSEED (1 << 18)
 #define bit_ADX(1 << 19)


Jakub

[PATCH,i386] cpuid function for prefetchw

2012-08-13 Thread Gopalasubramanian, Ganesh

Hello,

To get the prefetchw cpuid flag, cpuid
function 0x8001 needs to be called.
Previous to patch, function 0x7 is called.

Bootstrapping and "make -k check" passes without failures.
Ok for trunk?

Regards
Ganesh

2012-08-13 Ganesh Gopalasubramanian  

PR driver/54210
* config/i386/driver-i386.c (host_detect_local_cpu): Call
cpuid function 0x8001 to get the prfchw cpuid flag. 

Index: gcc/config/i386/driver-i386.c
===
--- gcc/config/i386/driver-i386.c   (revision 189996)
+++ gcc/config/i386/driver-i386.c   (working copy)
@@ -467,7 +467,6 @@
   has_bmi2 = ebx & bit_BMI2;
   has_fsgsbase = ebx & bit_FSGSBASE;
   has_rdseed = ebx & bit_RDSEED;
-  has_prfchw = ecx & bit_PRFCHW;
 }

   /* Check cpuid level of extended features.  */ @@ -491,6 +490,7 @@
   has_longmode = edx & bit_LM;
   has_3dnowp = edx & bit_3DNOWP;
   has_3dnow = edx & bit_3DNOW;
+  has_prfchw = ecx & bit_PRFCHW;
 }

   if (!arch

RE: [PATCH,i386] fma,fma4 and xop flags

2012-08-12 Thread Gopalasubramanian, Ganesh

Thank you Uros, Richard!

I will confirm the test results in couple off days.

Regards
Ganesh

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Saturday, August 11, 2012 3:54 AM
To: Richard Henderson
Cc: Gopalasubramanian, Ganesh; gcc-patches@gcc.gnu.org
Subject: Re: [PATCH,i386] fma,fma4 and xop flags

On Fri, Aug 10, 2012 at 10:02 PM, Richard Henderson  wrote:
> On 2012-08-10 12:59, Uros Bizjak wrote:
>> Actually, this is the problem you are trying to solve. The fma4 
>> patterns are defined before fma3, so gcc prefers these.
>
> The Real Problem is that they should not be separate patterns.
> They should be a single pattern that selects alternatives via the 
> enabled isa.

2012-08-11  Uros Bizjak  

* config/i386/i386.md (isa): Add fma and fma4.
(enabled): Handle fma and fma4.
* config/i386/sse.md (*fma_fmadd_): Merge *fma4_fmadd_.
(*fma_fmsub_): Merge *fma4_fmsub_.
(*fma_fnmadd_): Merge *fma4_fnmadd_.
(*fma_fnmsub_): Merge *fma4_fnmsub_.
(*fma_fmaddsub_): Merge *fma4_fmaddsub_.
(*fma_fmsubadd_): Merge *fma4_fmsubadd_.

Tested on x86_64-pc-linux-gnu {,-m32}, committed to mainline SVN.

I will wait a couple of days before backporting patches to 4.7, so please 
Ganesh, test mainline if everything is OK.

BTW: With this patch, we can enable PTA_FMA4 for bdver2 target.

Uros.

RE: [PATCH,i386] fma,fma4 and xop flags

2012-08-09 Thread Gopalasubramanian, Ganesh

> -mxop implies -mfma4, but reverse is not true.

I think this handling went in for bdver1.
But, with bdver2, we have both fma and fma4.
So for bdver2, -mxop should not be enabling one of them.

> if someone set -mfma4 together
> with -mfma on the command line, we should NOT disable selected ISA
> behind user's back

If both -mfma4 and -mfma are enabled, GCC outputs fma4 instructions.
This, I think is because fma4 instruction patterns are read before fma 
instruction patterns from the ".md" files.
So, enabling both -mfma4 and -mfma is not good for bdver2.

Moreover, if user tries to use, -mfma -mno-fma4 -mxop, the order in which these 
options are used becomes crucial. -mxop enables -mfma4 and by instruction 
patterns fma4 instructions gets listed in the assembly file.

For the below test,

double a,b,c,d;
int fn(){
a = b + c * d ;
return a;
}

#1) Using options "-O2 -mno-fma4 -mfma -mxop" outputs fma4. 
(vfmaddsdb(%rip), %xmm2, %xmm1, %xmm0)
#2) Using options "-O2 -mfma -mno-fma4 -mxop" outputs fma4.
(vfmaddsdb(%rip), %xmm2, %xmm1, %xmm0)
#3) Using options "-mxop -mno-fma4 -mfma" outpts fma.
(vfmadd132sd d(%rip), %xmm1, %xmm0)

As we see the order in which the options are used becomes crucial.
This is confusing.

I haven't really tested other implied options. But, I suspect
similar phenomenon in those cases too. 

IMO, we can directly go by the CPUID flags and enable the flags.
This will be a one to one mapping and leave the user with lot more liberty.
Please let me know your opinion.

Regards
Ganesh

-Original Message-
From: Uros Bizjak [mailto:ubiz...@gmail.com] 
Sent: Friday, August 10, 2012 1:21 AM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH,i386] fma,fma4 and xop flags

On Wed, Aug 8, 2012 at 1:31 PM,   wrote:

> Bdver2 cpu supports both fma and fma4 instructions.
> Previous to patch, option "-mno-xop" removes "-mfma4".
> Similarly, option "-mno-fma4" removes "-mxop".

It looks to me that there is some misunderstanding. AFAICS:

-mxop implies -mfma4, but reverse is not true. Please see

#define OPTION_MASK_ISA_FMA4_SET \
  (OPTION_MASK_ISA_FMA4 | OPTION_MASK_ISA_SSE4A_SET \
   | OPTION_MASK_ISA_AVX_SET)
#define OPTION_MASK_ISA_XOP_SET \
  (OPTION_MASK_ISA_XOP | OPTION_MASK_ISA_FMA4_SET)

So, -mxop sets -mfma4, etc ..., but -mfma4 does NOT enable -mxop.

OTOH,

#define OPTION_MASK_ISA_FMA4_UNSET \
  (OPTION_MASK_ISA_FMA4 | OPTION_MASK_ISA_XOP_UNSET)
#define OPTION_MASK_ISA_XOP_UNSET OPTION_MASK_ISA_XOP

-mno-fma4 implies -mno-xop, but again reverse is not true. Thus,
-mno-xop does NOT imply -mno-fma4.

> So, the patch conditionally disables "-mfma" or "-mfma4".
> Enabling "-mxop" is done by also checking "-mfma".

Please note that conditional handling of ISA flags belongs to
ix86_option_override_internal. However, if someone set -mfma4 together
with -mfma on the command line, we should NOT disable selected ISA
behind user's back, in the same way as we don't disable anything with
"-march=i386 -msse4". With -march=bdver2, we already marked that only
fma is supported, and if user selected "-march=bdver2 -mfma4" on the
command line, we shouldn't disable anything.

Uros.

RE: [PATCH,i386] fma,fma4 and xop flags

2012-08-08 Thread Gopalasubramanian, Ganesh

> Otherwise, what does -mno-fma4 -mxop do?
> (it should enable both xop and fma4!)  what should -mfma4 -mno-xop do
> (it should disable both xop and fma4!).

Yes! that's what GCC does now.
Some flags are coupled (atleast for now).
For ex, -mno-sse4.2 -mavx enables both sse4.2 and avx 
whereas -mavx -mno-sse4.2 disables both.

Setting of the following are clubbed.
1) 3DNow sets MMX
2) SSE2 sets SSE
3) SSE3 sets SSE2
4) SSE4_1 sets SSE3
5) SSE4_2 sets SSE4_1
6) FMA sets AVX
7) AVX2 sets AVX
8) SSE4_A sets SSE3
9) FMA4 set SSE4_A and AVX
10) XOP sets FMA4
11) AES sets SSE2
12) PCLMUL sets SSE2
13) ABM sets POPCNT

Resetting is done in reversely (MMX resets 3DNOW).

IMO, if we have different cpuid flags, enabling\disabling 
the compiler flags depends on these cpuid flags directly. 
Adding subsets to them or tangling them together may give 
wrong results.

Please let me know your opinion.

Regards
Ganesh

-Original Message-
From: Richard Guenther [mailto:richard.guent...@gmail.com] 
Sent: Wednesday, August 08, 2012 5:12 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org; ubiz...@gmail.com
Subject: Re: [PATCH,i386] fma,fma4 and xop flags

On Wed, Aug 8, 2012 at 1:31 PM,   wrote:
> Hello,
>
> Bdver2 cpu supports both fma and fma4 instructions.
> Previous to patch, option "-mno-xop" removes "-mfma4".
> Similarly, option "-mno-fma4" removes "-mxop".

Eh?  Why's that?  I think we should disentangle -mxop and -mfma4
instead.  Otherwise, what does -mno-fma4 -mxop do?
(it should enable both xop and fma4!)  what should -mfma4 -mno-xop do
(it should disable both xop and fma4!).  All this is just confusing to
the user, even if in AMD documents XOP includes FMA4.

Richard.

Backport: fma3 instruction generation for 'march=native' in AMD processors

2012-05-09 Thread Gopalasubramanian, Ganesh

Hello,

Below is the patch that has been committed in trunk (Revision: 187075).
We like to backport it to GCC 4.7 branch as couple of AMD processors require 
this change for fma3 instruction generation.

Bootstrapping and testing are successful. 

Is it OK to commit in GCC 4.7 branch?

Regards
Ganesh

PATCH
=

  * config/i386/driver-i386.c (host_detect_local_cpu): Reset
  has_fma4 for AMD processors with both fma3 and fma4 support.

Index: config/i386/driver-i386.c
===
--- config/i386/driver-i386.c   (revision 186897)
+++ config/i386/driver-i386.c   (working copy)
@@ -472,6 +472,8 @@
   has_abm = ecx & bit_ABM;
   has_lwp = ecx & bit_LWP;
   has_fma4 = ecx & bit_FMA4;
+  if (vendor == SIG_AMD && has_fma4 && has_fma)
+   has_fma4 = 0;
   has_xop = ecx & bit_XOP;
   has_tbm = ecx & bit_TBM;
   has_lzcnt = ecx & bit_LZCNT;

Re: [PATCH] [i386] fma3 instruction generation for 'march=native' in AMD processors

2012-05-02 Thread Gopalasubramanian, Ganesh

I have added the ChangeLog and modified the patch.

Is it OK to commit to trunk?

Regards
Ganesh

2012-05-03  Ganesh Gopalasubramanian  

* config/i386/driver-i386.c (host_detect_local_cpu): Reset
has_fma4 for AMD processors with both fma3 and fma4 support.

Index: config/i386/driver-i386.c
===
--- config/i386/driver-i386.c   (revision 186897)
+++ config/i386/driver-i386.c   (working copy)
@@ -472,6 +472,8 @@
   has_abm = ecx & bit_ABM;
   has_lwp = ecx & bit_LWP;
   has_fma4 = ecx & bit_FMA4;
+  if (vendor == SIG_AMD && has_fma4 && has_fma)
+   has_fma4 = 0;
   has_xop = ecx & bit_XOP;
   has_tbm = ecx & bit_TBM;
   has_lzcnt = ecx & bit_LZCNT;

-Original Message-
From: Jakub Jelinek [mailto:ja...@redhat.com] 
Sent: Wednesday, May 02, 2012 5:11 PM
To: Gopalasubramanian, Ganesh
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] [i386] fma3 instruction generation for 'march=native' in 
AMD processors

On Wed, May 02, 2012 at 11:12:33AM +, Gopalasubramanian, Ganesh wrote:
> For AMD architectures with both fma3 and fma4 instructions' support, GCC 
> generates fma4 by default. Instead, we like to generate fma3 instruction. 
> Below patch enables the fma3 instruction generation for "-march=native".
> 
> Ok for trunk?

You haven't provided ChangeLog entry.

> Index: gcc/config/i386/driver-i386.c
> ===
> --- gcc/config/i386/driver-i386.c   (revision 186897)
> +++ gcc/config/i386/driver-i386.c   (working copy)
> @@ -472,6 +472,10 @@
>has_abm = ecx & bit_ABM;
>has_lwp = ecx & bit_LWP;
>has_fma4 = ecx & bit_FMA4;
> +  if (((vendor == SIG_AMD)) && (has_fma4) && (has_fma))
> +{
> +has_fma4 = 0;
> +}

And the formatting of this is wrong, 4 unnecessary pairs of (),
one unnecessary {} pair, bad indentation of the has_fma4 = 0;
assignment (should use a tab).

Jakub

[PATCH] [i386] fma3 instruction generation for 'march=native' in AMD processors

2012-05-02 Thread Gopalasubramanian, Ganesh

For AMD architectures with both fma3 and fma4 instructions' support, GCC 
generates fma4 by default. Instead, we like to generate fma3 instruction. Below 
patch enables the fma3 instruction generation for "-march=native".

Ok for trunk?

Index: gcc/config/i386/driver-i386.c
===
--- gcc/config/i386/driver-i386.c   (revision 186897)
+++ gcc/config/i386/driver-i386.c   (working copy)
@@ -472,6 +472,10 @@
   has_abm = ecx & bit_ABM;
   has_lwp = ecx & bit_LWP;
   has_fma4 = ecx & bit_FMA4;
+  if (((vendor == SIG_AMD)) && (has_fma4) && (has_fma))
+{
+has_fma4 = 0;
+}
   has_xop = ecx & bit_XOP;
   has_tbm = ecx & bit_TBM;
   has_lzcnt = ecx & bit_LZCNT;

Regards
Ganesh

86 matches

Mail list logo