[google][gcc-4_9] update hardreg costs only when conflict_costs[] < 0

2015-12-10 Thread Wei Mi
Hi,

The patch is for google branch only.

In r216697, when a hardreg is assigned to an allocno, a positive cost
is added to the conflicting allocnos' costs for that hardreg to
reflect their disfavor of it.

However, the fact that a conflict allocno disfavors a hard_regno does
not necessarily mean the current allocno should prefer that
hard_regno, so it is incorrect to update an allocno's costs directly
from its conflict allocnos. The patch changes the code to update
costs[i] of an allocno only when conflict_costs[i] < 0, i.e., when the
conflict allocno prefers hardreg i.

Another issue is that the costs of an allocno are updated only when
the conflict allocno is not marked as may_be_spilled_p. However, even
if a conflict allocno is marked as may_be_spilled_p right now, it
still has a high probability of getting colored later. It is not right
to ignore the preferences of conflict allocnos marked as
may_be_spilled_p. The patch changes that as well.
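
The core of the change is the guarded cost update in the
assign_hard_reg hunk below; annotated, the idea is (comments added
here for explanation only):

  /* conflict_costs[k] < 0 means conflict_a prefers hard reg k, so
     make that reg more expensive for the current allocno (subtracting
     a negative raises the cost).  A mere disfavor
     (conflict_costs[k] > 0) no longer cheapens the reg for the
     current allocno.  */
  if (conflict_costs[k] < 0)
    full_costs[j] -= conflict_costs[k];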

Test:
GCC unit tests are ok.
Minor improvement on google internal benchmarks.

Thanks,
Wei.
gcc/ChangeLog:

2015-12-10  Wei Mi  

* ira-color.c (restore_costs_from_conflicts): Don't record the
cost change.
(update_conflict_hard_regno_costs): Update costs[i] only when
conflict_costs[i] < 0.
(assign_hard_reg): Ditto.


Index: gcc/ira-color.c
===
--- gcc/ira-color.c (revision 231143)
+++ gcc/ira-color.c (working copy)
@@ -1588,9 +1588,11 @@ restore_costs_from_conflicts (ira_allocn
prev = curr;
}
  /* Propagate the disfavor of hardreg from conflict_a to the
-allocnos connecting with conflict_a via copies.  */
+allocnos connecting with conflict_a via copies.
+Note: once the hardreg is assigned to a, it will not be
+changed, so we don't need to record this change. */
  update_costs_from_allocno (conflict_a, hardreg,
-1, false, true, true);
+1, false, false, true);
}
 }
 }
@@ -1601,7 +1603,7 @@ restore_costs_from_conflicts (ira_allocn
update increases chances to remove some copies.  */
 static void
 update_conflict_hard_regno_costs (int *costs, enum reg_class aclass,
- bool decr_p)
+ bool decr_p, bool conflict)
 {
   int i, cost, class_size, freq, mult, div, divisor;
   int index, hard_regno;
@@ -1682,7 +1684,16 @@ update_conflict_hard_regno_costs (int *c
cont_p = true;
if (decr_p)
  cost = -cost;
-   costs[index] += cost;
+   /* conflict being true indicates this is updating costs[]
+  according to preferences of allocnos connected by copies
+  to the conflict allocnos.
+  The fact that a conflict allocno disfavors hard_regno doesn't
+  necessarily mean the current allocno should prefer hard_regno
+  (actually only a little), so we update costs[] only
+  when the conflict allocno prefers hard_regno, i.e., when
+  conflict_costs[i] < 0.  */
+   if (conflict && conflict_costs[i] < 0)
+ costs[index] += cost;
  }
  }
/* Probably 5 hops will be enough.  */
@@ -1934,7 +1945,6 @@ assign_hard_reg (ira_allocno_t a, bool r
}
}
  else if (! retry_p
-  && ! ALLOCNO_COLOR_DATA (conflict_a)->may_be_spilled_p
   /* Don't process the conflict allocno twice.  */
   && (ALLOCNO_COLOR_DATA (conflict_a)->last_process
   != curr_allocno_process))
@@ -1967,7 +1977,13 @@ assign_hard_reg (ira_allocno_t a, bool r
  ->new_conflict_hard_regs,
  hard_regno))
  continue;
-   full_costs[j] -= conflict_costs[k];
+   /* The fact that conflict_a disfavors hard_regno doesn't
+  necessarily mean the current allocno should prefer
+  hard_regno so much (only a little), so we only
+  update full_costs[] when conflict_a prefers
+  hard_regno, i.e., when conflict_costs[k] < 0.  */
+   if (conflict_costs[k] < 0)
+ full_costs[j] -= conflict_costs[k];
  }
  queue_update_cost (conflict_a, NULL, COST_HOP_DIVISOR);
 
@@ -1977,7 +1993,7 @@ assign_hard_reg (ira_allocno_t a, bool r
   if (! retry_p)
 /* Take into account preferences of allocnos connected by copies to
the conflict allocnos.  */
-update_conflict_hard_regno_costs (full_costs, aclass, true);
+update_conflict_hard_regno_costs (full_costs, aclass, true, true);

[PATCH PR64557] get_addr in true_dependence_1 cannot handle VALUE inside an expr

2015-01-21 Thread Wei Mi
Hi,

The patch is to address the bug here:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64557

The fix is to call get_addr on a VALUE before forming a mem_addr
expression from the VALUE and an offset. This avoids the problem that
get_addr can handle a bare VALUE but cannot handle an expression like
(VALUE + offset). With the fix, find_base_term can always get the base
of the original addr.
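
The shape of the fix, mirroring the two dse.c hunks below:

  /* Resolve the VALUE to a canonical address first; get_addr cannot
     look through (plus (value ...) (const_int ...)).  */
  mem_addr = get_addr (mem_addr);
  if (offset)
    mem_addr = plus_constant (get_address_mode (mem), mem_addr, offset);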

Bootstrap and regression tests on x86_64-linux-gnu are ok; regression
tests on aarch64-linux-gnu and powerpc64-linux-gnu are also ok. Is it
ok for trunk?

Thanks,
Wei.

gcc/ChangeLog:

2015-01-21  Wei Mi  

* dse.c (record_store): Call get_addr for mem_addr.
(check_mem_read_rtx): Likewise.

Index: gcc/dse.c
===
--- gcc/dse.c   (revision 219975)
+++ gcc/dse.c   (working copy)
@@ -1575,6 +1575,7 @@ record_store (rtx body, bb_info_t bb_inf
= rtx_group_vec[group_id];
  mem_addr = group->canon_base_addr;
}
+  mem_addr = get_addr (mem_addr);
   if (offset)
mem_addr = plus_constant (get_address_mode (mem), mem_addr, offset);
 }
@@ -2188,6 +2189,7 @@ check_mem_read_rtx (rtx *loc, bb_info_t
= rtx_group_vec[group_id];
  mem_addr = group->canon_base_addr;
}
+  mem_addr = get_addr (mem_addr);
   if (offset)
mem_addr = plus_constant (get_address_mode (mem), mem_addr, offset);
 }


Re: [PATCH PR64557] get_addr in true_dependence_1 cannot handle VALUE inside an expr

2015-01-22 Thread Wei Mi
Thanks for the review. Comments addressed and patch committed. The
problem exists on gcc-4_9 too. Is it ok for gcc-4_9-branch? Will wait
another day to commit it to gcc-4_9 if it is ok.

Thanks,
Wei.

On Thu, Jan 22, 2015 at 9:39 AM, Jeff Law  wrote:
> On 01/21/15 15:32, Wei Mi wrote:
>>
>> Hi,
>>
>> The patch is to address the bug here:
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64557
>>
>> It is to call get_addr for VALUE before forming a mem_addr expr with
>> the VALUE and an offset. This is to avoid the problem that get_addr
>> can only handle VALUE but cannot handle an expr like: (VALUE +
>> offset). With the fix, find_base_term can always get the base of the
>> original addr.
>>
>> bootstrap and regression test on x86_64-linux-gnu are ok. regression
>> tests on aarch64-linux-gnu and powerpc64-linux-gnu are also ok. Is it
>> ok for trunk?
>>
>> Thanks,
>> Wei.
>>
>> gcc/ChangeLog:
>>
>> 2015-01-21  Wei Mi  
>>
>>  * dse.c (record_store): Call get_addr for mem_addr.
>>  (check_mem_read_rtx): Likewise.
>
> Please add a PR marker to the ChangeLog entry.  A testcase would be great,
> but from reading the PR that doesn't seem possible without some heroic
> efforts.
>
> OK with the PR marker and a comment before the two calls indicating why
> those two calls are necessary.
>
> jeff
>


[PATCH, PR61776] verify_flow_info failed: control flow in the middle of basic block with -fprofile-generate

2014-07-21 Thread Wei Mi
Hi,

This patch is to fix:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61776

It records the function decls whose const/pure flags are reset during
instrumentation. After the loop that resets the const/pure flags, it
finds the stmts calling the recorded funcs and performs cfg fixup on
them, as sketched below.
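
Condensed from the patch below, the flow is:

  /* 1) While dropping the flags, remember which decls were affected.  */
  if (DECL_PURE_P (node->decl) || TREE_READONLY (node->decl))
    pointer_set_insert (modified_constpure_decls, node->decl);
  ...
  /* 2) After updating the call stmts, split any block where a call to
     a recorded decl now ends the block but is not its last stmt.  */
  if (stmt != gsi_stmt (gsi_last_bb (bb)) && stmt_ends_bb_p (stmt))
    split_block (bb, stmt);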

bootstrap and regression test pass on x86_64-linux-gnu. ok for trunk
and gcc-4_9?

Thanks,
Wei.

ChangeLog:

2014-07-21  Wei Mi  

PR middle-end/61776
* tree-profile.c (tree_profiling): Fix cfg after the const/pure
flags of some funcs are reset after instrumentation.

2014-07-21  Wei Mi  

PR middle-end/61776
* testsuite/gcc.dg/pr61776.c: New test.

Index: tree-profile.c
===
--- tree-profile.c  (revision 212442)
+++ tree-profile.c  (working copy)
@@ -56,6 +56,7 @@ along with GCC; see the file COPYING3.
 #include "target.h"
 #include "tree-cfgcleanup.h"
 #include "tree-nested.h"
+#include "pointer-set.h"

 static GTY(()) tree gcov_type_node;
 static GTY(()) tree tree_interval_profiler_fn;
@@ -562,6 +563,9 @@ static unsigned int
 tree_profiling (void)
 {
   struct cgraph_node *node;
+  int i;
+  struct pointer_set_t *modified_constpure_decls;
+  vec<gimple> modified_constpure_stmts;

   /* This is a small-ipa pass that gets called only once, from
  cgraphunit.c:ipa_passes().  */
@@ -603,6 +607,9 @@ tree_profiling (void)
   pop_cfun ();
 }

+  modified_constpure_decls = pointer_set_create ();
+  modified_constpure_stmts.create (0);
+
   /* Drop pure/const flags from instrumented functions.  */
   FOR_EACH_DEFINED_FUNCTION (node)
 {
@@ -615,6 +622,11 @@ tree_profiling (void)
   if (DECL_SOURCE_LOCATION (node->decl) == BUILTINS_LOCATION)
continue;

+  /* If the const/pure flag of node is about to change, record
+node->decl in modified_constpure_decls.  */
+  if (DECL_PURE_P (node->decl) || TREE_READONLY (node->decl))
+   pointer_set_insert (modified_constpure_decls, node->decl);
+
   cgraph_set_const_flag (node, false, false);
   cgraph_set_pure_flag (node, false, false);
 }
@@ -623,6 +635,7 @@ tree_profiling (void)
   FOR_EACH_DEFINED_FUNCTION (node)
 {
   basic_block bb;
+  gimple stmt;

   if (!gimple_has_body_p (node->decl)
  || !(!node->clone_of
@@ -642,10 +655,29 @@ tree_profiling (void)
{
  gimple stmt = gsi_stmt (gsi);
  if (is_gimple_call (stmt))
-   update_stmt (stmt);
+   {
+ tree decl = gimple_call_fndecl(stmt);
+ if (decl && pointer_set_contains (modified_constpure_decls,
+   decl))
+   modified_constpure_stmts.safe_push (stmt);
+ update_stmt (stmt);
+   }
}
}

+  /* The const/pure flag of the decl of a call stmt in
+     modified_constpure_stmts is changed because of instrumentation.
+     Split the block if the call stmt ends bb but is not the last
+     stmt of bb.  */
+  FOR_EACH_VEC_ELT (modified_constpure_stmts, i, stmt)
+   {
+ basic_block bb = gimple_bb (stmt);
+
+ if (stmt != gsi_stmt (gsi_last_bb (bb))
+ && stmt_ends_bb_p (stmt))
+   split_block (bb, stmt);
+   }
+  modified_constpure_stmts.release ();
+
   /* re-merge split blocks.  */
   cleanup_tree_cfg ();
   update_ssa (TODO_update_ssa);
@@ -657,6 +689,7 @@ tree_profiling (void)

   handle_missing_profiles ();

+  pointer_set_destroy (modified_constpure_decls);
   del_node_map ();
   return 0;
 }
Index: testsuite/gcc.dg/pr61776.c
===
--- testsuite/gcc.dg/pr61776.c  (revision 0)
+++ testsuite/gcc.dg/pr61776.c  (revision 0)
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fprofile-generate" } */
+
+#include <setjmp.h>
+
+int cond1, cond2;
+
+int goo() __attribute__((noinline));
+
+int goo() {
+ if (cond1)
+   return 1;
+ else
+   return 2;
+}
+
+jmp_buf env;
+int foo() {
+ int a;
+
+ setjmp(env);
+ if (cond2)
+   a = goo();
+ else
+   a = 3;
+ return a;
+}


Re: [PATCH, PR61776] verify_flow_info failed: control flow in the middle of basic block with -fprofile-generate

2014-07-21 Thread Wei Mi
By the way, the loop resetting the const/pure flags is also executed
during profile-use, but if there is no instrumentation the reset is
unnecessary. The flags are kept until pass_ipa_pure_const fixes them,
and because of the non-instantaneous ssa update, the fixes are
reflected in ssa only after the ipa passes finish.

If it is agreed that this is a problem, I will address the
conservativeness in a separate patch.

Regards,
Wei.


Re: [PATCH, PR61776] verify_flow_info failed: control flow in the middle of basic block with -fprofile-generate

2014-07-27 Thread Wei Mi
> But fact is that it is _not_ necessary to split the block because there
> are no outgoing abnormal edges from it.
>
> The verifier failure is an artifact from using the same predicates during
> CFG building and CFG verifying (usually ok, but for this particular
> case it leads to this issue).
>
> So I don't think your patch is the proper way to address this issue
> (while it certainly works).
>
> Instead whether a call can make abnormal gotos should be recorded
> per-call and stored on the gimple-call.  Martin - this is exactly
> one of the cases your patch would address?
>

Thanks for the comment, and thanks for Martin's patch. I tried the
patch. After some extension it works well to address both pr60449 and
pr61776. One extension is to replace the GF_CALL_LEAF attribute with
GF_CALL_NO_ABNORMAL_GOTO. That is because dropping the "leaf"
attribute during lto symbol merging is not the only way to introduce
the control flow verification problem in pr60449; dropping the
"const/pure" attributes can introduce the same problem too. It is
unnecessary to introduce per-call attributes for all three of
ECF_LEAF/ECF_CONST/ECF_PURE, so GF_CALL_NO_ABNORMAL_GOTO is introduced
to indicate that a call stmt has no abnormal goto.

GF_CALL_NO_ABNORMAL_GOTO is set according to gimple_call_flags () as
soon as a gimple call stmt is created, then updated in
execute_fixup_cfg and cleanup_tree_cfg.
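
A hedged sketch of how the flag is consumed (the tree-cfg.c hunk is
not included in this excerpt; see the ChangeLog entry for
call_can_make_abnormal_goto — the actual pre-existing checks are
omitted here):

  static bool
  call_can_make_abnormal_goto (gimple t)
  {
    /* New early exit: trust the per-call flag.  */
    if (gimple_call_no_abnormal_goto_p (t))
      return false;
    /* Otherwise keep the conservative pre-existing answer.  */
    return true;
  }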

I posted the extended patch here. I didn't include the noreturn part
because it has no direct impact on pr60449 and pr61776. I can help
Martin test and post that part as an independent patch later.

bootstrap and regression pass on x86_64-linux-gnu. Is it ok?

Thanks,
Wei.
ChangeLog:

2014-07-27  Martin Jambor  
Wei Mi  

PR ipa/60449
PR middle-end/61776
* tree-cfgcleanup.c (update_no_abnormal_goto_attr): New function.
(cleanup_tree_cfg_1): Use update_no_abnormal_goto_attr.
* gimple.c (gimple_call_initialize_no_abnormal_goto): New function.
(gimple_build_call_1): Use gimple_call_initialize_no_abnormal_goto.
(gimple_build_call_internal_1): Ditto.
* gimple.h (enum gf_mask): Added GF_NO_ABNORMAL_GOTO.
(gimple_call_set_no_abnormal_goto): New function.
(gimple_call_no_abnormal_goto_p): Ditto.
* tree-cfg.c (call_can_make_abnormal_goto):
Use gimple_call_no_abnormal_goto_p.
(execute_fixup_cfg): Use gimple_call_set_no_abnormal_goto.

2014-07-27  Martin Jambor  
Wei Mi  

PR ipa/60449
PR middle-end/61776
* testsuite/gcc.dg/pr61776.c: New test.
* testsuite/gcc.dg/lto/pr60449_1.c: New test.
* testsuite/gcc.dg/lto/pr60449_0.c: New test.

Index: tree-cfgcleanup.c
===
--- tree-cfgcleanup.c   (revision 212442)
+++ tree-cfgcleanup.c   (working copy)
@@ -621,6 +621,28 @@ split_bbs_on_noreturn_calls (void)
   return changed;
 }
 
+/* Update GF_NO_ABNORMAL_GOTO attribute for call stmts in BB according
+   to gimple_call_flags.  */
+
+static void
+update_no_abnormal_goto_attr (basic_block bb)
+{
+  gimple_stmt_iterator gsi;
+  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+{
+  gimple stmt = gsi_stmt (gsi);
+
+  if (!is_gimple_call (stmt))
+   continue;
+
+  int flags = gimple_call_flags (stmt);
+  if ((flags & (ECF_CONST | ECF_PURE)
+   && !(flags & ECF_LOOPING_CONST_OR_PURE))
+ || (flags & ECF_LEAF))
+   gimple_call_set_no_abnormal_goto (stmt, true);
+}
+}
+
 /* Tries to cleanup cfg in basic block BB.  Returns true if anything
changes.  */
 
@@ -672,7 +694,10 @@ cleanup_tree_cfg_1 (void)
 {
   bb = BASIC_BLOCK_FOR_FN (cfun, i);
   if (bb)
-   retval |= cleanup_tree_cfg_bb (bb);
+   {
+ update_no_abnormal_goto_attr (bb);
+ retval |= cleanup_tree_cfg_bb (bb);
+   }
 }
 
   /* Now process the altered blocks, as long as any are available.  */
@@ -687,6 +712,7 @@ cleanup_tree_cfg_1 (void)
   if (!bb)
continue;
 
+  update_no_abnormal_goto_attr (bb);
   retval |= cleanup_tree_cfg_bb (bb);
 
   /* Rerun split_bbs_on_noreturn_calls, in case we have altered any 
noreturn
Index: gimple.c
===
--- gimple.c(revision 212442)
+++ gimple.c(working copy)
@@ -186,6 +186,19 @@ gimple_build_return (tree retval)
   return s;
 }
 
+/* Set GF_NO_ABNORMAL_GOTO attribute according to gimple_call_flags(STMT).  */
+
+void
+gimple_call_initialize_no_abnormal_goto (gimple stmt)
+{
+  int flags = gimple_call_flags (stmt);
+
+  if ((flags & (ECF_CONST | ECF_PURE)
+   && !(flags & ECF_LOOPING_CONST_OR_PURE))
+  || (flags & ECF_LEAF))
+gimple_call_set_no_abnormal_goto (stmt, true);
+}
+
 /* Reset alias information on call S.  */
 
 vo

[PATCH, x86] merge movsd/movhpd pair in peephole

2014-04-09 Thread Wei Mi
Hi,

For the testcase 1.c

#include <emmintrin.h>

double a[1000];

__m128d foo1() {
  __m128d res;
  res = _mm_load_sd(&a[1]);
  res = _mm_loadh_pd(res, &a[2]);
  return res;
}

LLVM merges the movsd/movhpd pair into a movupd while GCC does not.
The merge is beneficial on x86 machines starting from Nehalem.

The patch adds the merging as a peephole; the adjacency test it relies
on is sketched below. Bootstrap and regression tests pass. Is it ok
for stage1?
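
For the testcase above, the peephole condition boils down to simple
offset arithmetic; a toy illustration (adjacent is a hypothetical
helper, not the patch itself, which does this in
get_memref_parts/adjacent_mem_locations below):

  /* &a[1] and &a[2] share the base symbol a, with offset/size pairs
     8/8 and 16/8: 8 + 8 == 16, so the loads are adjacent and the pair
     can become one unaligned 16-byte movupd.  */
  static bool adjacent (long off1, long size1, long off2)
  {
    return off1 + size1 == off2;
  }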

Thanks,
Wei.

gcc/ChangeLog:

2014-04-09  Wei Mi  

* config/i386/i386.c (get_memref_parts): New function.
(adjacent_mem_locations): Ditto.
* config/i386/i386-protos.h: Add decl for adjacent_mem_locations.
* config/i386/sse.md: Add define_peephole rule.

gcc/testsuite/ChangeLog:

2014-04-09  Wei Mi  

* gcc.target/i386/sse2-unaligned-mov.c: New test.

diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index 6e32978..3ae0d6d 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -312,6 +312,7 @@ extern enum attr_cpu ix86_schedule;
 #endif

 extern const char * ix86_output_call_insn (rtx insn, rtx call_op);
+extern bool adjacent_mem_locations (rtx mem1, rtx mem2);

 #ifdef RTX_CODE
 /* Target data for multipass lookahead scheduling.
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 3eefe4a..a330e84 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -46737,6 +46737,70 @@ ix86_atomic_assign_expand_fenv (tree *hold,
tree *clear, tree *update)
atomic_feraiseexcept_call);
 }

+/* Try to determine BASE/OFFSET/SIZE parts of the given MEM.
+   Return true if successful, false if all the values couldn't
+   be determined.
+
+   This function only looks for REG/SYMBOL or REG/SYMBOL+CONST
+   address forms. */
+
+static bool
+get_memref_parts (rtx mem, rtx *base, HOST_WIDE_INT *offset,
+ HOST_WIDE_INT *size)
+{
+  rtx addr_rtx;
+  if (MEM_SIZE_KNOWN_P (mem))
+*size = MEM_SIZE (mem);
+  else
+return false;
+
+  if (GET_CODE (XEXP (mem, 0)) == CONST)
+addr_rtx = XEXP (XEXP (mem, 0), 0);
+  else
+addr_rtx = (XEXP (mem, 0));
+
+  if (GET_CODE (addr_rtx) == REG
+  || GET_CODE (addr_rtx) == SYMBOL_REF)
+{
+  *base = addr_rtx;
+  *offset = 0;
+}
+  else if (GET_CODE (addr_rtx) == PLUS
+  && CONST_INT_P (XEXP (addr_rtx, 1)))
+{
+  *base = XEXP (addr_rtx, 0);
+  *offset = INTVAL (XEXP (addr_rtx, 1));
+}
+  else
+return false;
+
+  return true;
+}
+
+/* If MEM1 is adjacent to MEM2 and MEM1 has lower address,
+   return true.  */
+
+extern bool
+adjacent_mem_locations (rtx mem1, rtx mem2)
+{
+  rtx base1, base2;
+  HOST_WIDE_INT off1, size1, off2, size2;
+
+  if (get_memref_parts (mem1, &base1, &off1, &size1)
+  && get_memref_parts (mem2, &base2, &off2, &size2))
+{
+  if (GET_CODE (base1) == SYMBOL_REF
+ && GET_CODE (base2) == SYMBOL_REF
+ && SYMBOL_REF_DECL (base1) == SYMBOL_REF_DECL (base2))
+return (off1 + size1 == off2);
+  else if (REG_P (base1)
+  && REG_P (base2)
+  && REGNO (base1) == REGNO (base2))
+return (off1 + size1 == off2);
+}
+  return false;
+}
+
 /* Initialize the GCC target structure.  */
 #undef TARGET_RETURN_IN_MEMORY
 #define TARGET_RETURN_IN_MEMORY ix86_return_in_memory
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 72a4d6d..4bf8461 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -15606,3 +15606,37 @@
   [(set_attr "type" "sselog1")
(set_attr "length_immediate" "1")
(set_attr "mode" "TI")])
+
+;; merge movsd/movhpd to movupd when TARGET_SSE_UNALIGNED_LOAD_OPTIMAL
+;; is true.
+(define_peephole2
+  [(set (match_operand:DF 0 "register_operand")
+   (match_operand:DF 1 "memory_operand"))
+   (set (match_operand:V2DF 2 "register_operand")
+   (vec_concat:V2DF (match_dup 0)
+(match_operand:DF 3 "memory_operand")))]
+  "TARGET_SSE_UNALIGNED_LOAD_OPTIMAL
+   && REGNO (operands[0]) == REGNO (operands[2])
+   && adjacent_mem_locations (operands[1], operands[3])"
+  [(set (match_dup 2)
+   (unspec:V2DF [(match_dup 4)] UNSPEC_LOADU))]
+{
+  operands[4] = gen_rtx_MEM (V2DFmode, XEXP(operands[1], 0));
+})
+
+;; merge movsd/movhpd to movupd when TARGET_SSE_UNALIGNED_STORE_OPTIMAL
+;; is true.
+(define_peephole2
+  [(set (match_operand:DF 0 "memory_operand")
+(vec_select:DF (match_operand:V2DF 1 "register_operand")
+  (parallel [(const_int 0)])))
+   (set (match_operand:DF 2 "memory_operand")
+(vec_select:DF (match_dup 1)
+   (parallel [(const_int 1)])))]
+  "TARGET_SSE_UNALIGNED_STORE_OPTIMAL
+   && adjacent_mem_locations (operands[0], operands[2])"
+  [(set (match_dup

Re: [PATCH, x86] merge movsd/movhpd pair in peephole

2014-04-09 Thread Wei Mi
Hi Bin,

Yes, we have the same problem: if the movsd and movhpd are separated,
the peephole cannot merge them. The patch solves the motivating
performance issue we saw to a good extent, but there may still be room
to improve where the peephole misses pairs. Glad to know you are
working on this part; it is the same thing we want. Looking forward to
your patch.

Thanks,
Wei.

On Wed, Apr 9, 2014 at 7:27 PM, Bin.Cheng  wrote:
> On Thu, Apr 10, 2014 at 8:18 AM, Wei Mi  wrote:
>> Hi,
>>
>> For the testcase 1.c
>>
>> #include <emmintrin.h>
>>
>> double a[1000];
>>
>> __m128d foo1() {
>>   __m128d res;
>>   res = _mm_load_sd(&a[1]);
>>   res = _mm_loadh_pd(res, &a[2]);
>>   return res;
>> }
>>
>> llvm will merge movsd/movhpd to movupd while gcc will not. The merge
>> is beneficial on x86 machines starting from Nehalem.
>>
>> The patch is to add the merging in peephole.
>> bootstrap and regression pass. Is it ok for stage1?
>>
>> Thanks,
>> Wei.
>>
>> gcc/ChangeLog:
>>
>> 2014-04-09  Wei Mi  
>>
>> * config/i386/i386.c (get_memref_parts): New function.
>> (adjacent_mem_locations): Ditto.
>> * config/i386/i386-protos.h: Add decl for adjacent_mem_locations.
>> * config/i386/sse.md: Add define_peephole rule.
>>
>> gcc/testsuite/ChangeLog:
>>
>> 2014-04-09  Wei Mi  
>>
>> * gcc.target/i386/sse2-unaligned-mov.c: New test.
>>
>> diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
>> index 6e32978..3ae0d6d 100644
>> --- a/gcc/config/i386/i386-protos.h
>> +++ b/gcc/config/i386/i386-protos.h
>> @@ -312,6 +312,7 @@ extern enum attr_cpu ix86_schedule;
>>  #endif
>>
>>  extern const char * ix86_output_call_insn (rtx insn, rtx call_op);
>> +extern bool adjacent_mem_locations (rtx mem1, rtx mem2);
>>
>>  #ifdef RTX_CODE
>>  /* Target data for multipass lookahead scheduling.
>> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
>> index 3eefe4a..a330e84 100644
>> --- a/gcc/config/i386/i386.c
>> +++ b/gcc/config/i386/i386.c
>> @@ -46737,6 +46737,70 @@ ix86_atomic_assign_expand_fenv (tree *hold,
>> tree *clear, tree *update)
>> atomic_feraiseexcept_call);
>>  }
>>
>> +/* Try to determine BASE/OFFSET/SIZE parts of the given MEM.
>> +   Return true if successful, false if all the values couldn't
>> +   be determined.
>> +
>> +   This function only looks for REG/SYMBOL or REG/SYMBOL+CONST
>> +   address forms. */
>> +
>> +static bool
>> +get_memref_parts (rtx mem, rtx *base, HOST_WIDE_INT *offset,
>> + HOST_WIDE_INT *size)
>> +{
>> +  rtx addr_rtx;
>> +  if MEM_SIZE_KNOWN_P (mem)
>> +*size = MEM_SIZE (mem);
>> +  else
>> +return false;
>> +
>> +  if (GET_CODE (XEXP (mem, 0)) == CONST)
>> +addr_rtx = XEXP (XEXP (mem, 0), 0);
>> +  else
>> +addr_rtx = (XEXP (mem, 0));
>> +
>> +  if (GET_CODE (addr_rtx) == REG
>> +  || GET_CODE (addr_rtx) == SYMBOL_REF)
>> +{
>> +  *base = addr_rtx;
>> +  *offset = 0;
>> +}
>> +  else if (GET_CODE (addr_rtx) == PLUS
>> +  && CONST_INT_P (XEXP (addr_rtx, 1)))
>> +{
>> +  *base = XEXP (addr_rtx, 0);
>> +  *offset = INTVAL (XEXP (addr_rtx, 1));
>> +}
>> +  else
>> +return false;
>> +
>> +  return true;
>> +}
>> +
>> +/* If MEM1 is adjacent to MEM2 and MEM1 has lower address,
>> +   return true.  */
>> +
>> +extern bool
>> +adjacent_mem_locations (rtx mem1, rtx mem2)
>> +{
>> +  rtx base1, base2;
>> +  HOST_WIDE_INT off1, size1, off2, size2;
>> +
>> +  if (get_memref_parts (mem1, &base1, &off1, &size1)
>> +  && get_memref_parts (mem2, &base2, &off2, &size2))
>> +{
>> +  if (GET_CODE (base1) == SYMBOL_REF
>> + && GET_CODE (base2) == SYMBOL_REF
>> + && SYMBOL_REF_DECL (base1) == SYMBOL_REF_DECL (base2))
>> +return (off1 + size1 == off2);
>> +  else if (REG_P (base1)
>> +  && REG_P (base2)
>> +  && REGNO (base1) == REGNO (base2))
>> +return (off1 + size1 == off2);
>> +}
>> +  return false;
>> +}
>> +
>>  /* Initialize the GCC target structure.  */
>>  #undef TARGET_RETURN_IN_MEMORY
>>  #define TARGET_RETURN_IN_MEMORY ix86_return_in_me

Re: [PATCH, x86] merge movsd/movhpd pair in peephole

2014-04-21 Thread Wei Mi
Ping.

Thanks,
Wei.

On Wed, Apr 9, 2014 at 5:18 PM, Wei Mi  wrote:
> Hi,
>
> For the testcase 1.c
>
> #include <emmintrin.h>
>
> double a[1000];
>
> __m128d foo1() {
>   __m128d res;
>   res = _mm_load_sd(&a[1]);
>   res = _mm_loadh_pd(res, &a[2]);
>   return res;
> }
>
> llvm will merge movsd/movhpd to movupd while gcc will not. The merge
> is beneficial on x86 machines starting from Nehalem.
>
> The patch is to add the merging in peephole.
> bootstrap and regression pass. Is it ok for stage1?
>
> Thanks,
> Wei.
>
> gcc/ChangeLog:
>
> 2014-04-09  Wei Mi  
>
> * config/i386/i386.c (get_memref_parts): New function.
> (adjacent_mem_locations): Ditto.
> * config/i386/i386-protos.h: Add decl for adjacent_mem_locations.
>     * config/i386/sse.md: Add define_peephole rule.
>
> gcc/testsuite/ChangeLog:
>
> 2014-04-09  Wei Mi  
>
> * gcc.target/i386/sse2-unaligned-mov.c: New test.
>
> diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> index 6e32978..3ae0d6d 100644
> --- a/gcc/config/i386/i386-protos.h
> +++ b/gcc/config/i386/i386-protos.h
> @@ -312,6 +312,7 @@ extern enum attr_cpu ix86_schedule;
>  #endif
>
>  extern const char * ix86_output_call_insn (rtx insn, rtx call_op);
> +extern bool adjacent_mem_locations (rtx mem1, rtx mem2);
>
>  #ifdef RTX_CODE
>  /* Target data for multipass lookahead scheduling.
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 3eefe4a..a330e84 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -46737,6 +46737,70 @@ ix86_atomic_assign_expand_fenv (tree *hold,
> tree *clear, tree *update)
> atomic_feraiseexcept_call);
>  }
>
> +/* Try to determine BASE/OFFSET/SIZE parts of the given MEM.
> +   Return true if successful, false if all the values couldn't
> +   be determined.
> +
> +   This function only looks for REG/SYMBOL or REG/SYMBOL+CONST
> +   address forms. */
> +
> +static bool
> +get_memref_parts (rtx mem, rtx *base, HOST_WIDE_INT *offset,
> + HOST_WIDE_INT *size)
> +{
> +  rtx addr_rtx;
> +  if MEM_SIZE_KNOWN_P (mem)
> +*size = MEM_SIZE (mem);
> +  else
> +return false;
> +
> +  if (GET_CODE (XEXP (mem, 0)) == CONST)
> +addr_rtx = XEXP (XEXP (mem, 0), 0);
> +  else
> +addr_rtx = (XEXP (mem, 0));
> +
> +  if (GET_CODE (addr_rtx) == REG
> +  || GET_CODE (addr_rtx) == SYMBOL_REF)
> +{
> +  *base = addr_rtx;
> +  *offset = 0;
> +}
> +  else if (GET_CODE (addr_rtx) == PLUS
> +  && CONST_INT_P (XEXP (addr_rtx, 1)))
> +{
> +  *base = XEXP (addr_rtx, 0);
> +  *offset = INTVAL (XEXP (addr_rtx, 1));
> +}
> +  else
> +return false;
> +
> +  return true;
> +}
> +
> +/* If MEM1 is adjacent to MEM2 and MEM1 has lower address,
> +   return true.  */
> +
> +extern bool
> +adjacent_mem_locations (rtx mem1, rtx mem2)
> +{
> +  rtx base1, base2;
> +  HOST_WIDE_INT off1, size1, off2, size2;
> +
> +  if (get_memref_parts (mem1, &base1, &off1, &size1)
> +  && get_memref_parts (mem2, &base2, &off2, &size2))
> +{
> +  if (GET_CODE (base1) == SYMBOL_REF
> + && GET_CODE (base2) == SYMBOL_REF
> + && SYMBOL_REF_DECL (base1) == SYMBOL_REF_DECL (base2))
> +return (off1 + size1 == off2);
> +  else if (REG_P (base1)
> +  && REG_P (base2)
> +  && REGNO (base1) == REGNO (base2))
> +return (off1 + size1 == off2);
> +}
> +  return false;
> +}
> +
>  /* Initialize the GCC target structure.  */
>  #undef TARGET_RETURN_IN_MEMORY
>  #define TARGET_RETURN_IN_MEMORY ix86_return_in_memory
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index 72a4d6d..4bf8461 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -15606,3 +15606,37 @@
>[(set_attr "type" "sselog1")
> (set_attr "length_immediate" "1")
> (set_attr "mode" "TI")])
> +
> +;; merge movsd/movhpd to movupd when TARGET_SSE_UNALIGNED_LOAD_OPTIMAL
> +;; is true.
> +(define_peephole2
> +  [(set (match_operand:DF 0 "register_operand")
> +   (match_operand:DF 1 "memory_operand"))
> +   (set (match_operand:V2DF 2 "register_operand")
> +   (vec_concat:V2DF (match_dup 0)
> +(match_operand:DF 3 "memory_operand")))]
> +  "TARGET_SSE_UNALIGNED_LOAD_OPTIMAL
> +   && REGNO (operands[0]) == REGNO (operands[2])
>

[PATCH, PR60738] More LRA split for regno conflicting with single reg class operand

2014-04-25 Thread Wei Mi
Hi,

This patch is to address the missing optimization reported in
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60738

Now in process_single_reg_class_operands, any allocno A conflicting
with a single-reg-class operand B is marked to never use that reg
class in IRA. This is suboptimal when A is in a hot region while B is
in a cold region. The patch allows A to use a register in the single
reg class when the hotness difference between A and B is large enough.
The patch also extends lra split to make sure A is split around B's
code region instead of being spilled.
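
For illustration, with the default lra-split-freq-ratio of 9 from the
params.def hunk below (the concrete frequencies here are made up): if
B has freq 10 in a cold region and A has freq 1000 in a hot loop, then
9 * 10 < 1000, so the new check

  if (a != operand_a
      && LRA_SPLIT_FREQ_RATIO * freq >= a->freq)

is false for A; A keeps the single reg class available, and LRA later
splits A around B's region instead of spilling it.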

bootstrap and regression test are ok for x86_64-linux-gnu. Is it ok for trunk?

Thanks,
Wei.

ChangeLog:

2014-04-25  Wei Mi  

PR rtl-optimization/60738
* params.def (PARAM_LRA_SPLIT_FREQ_RATIO): New param.
* params.h (LRA_SPLIT_FREQ_RATIO): New macro.
* lra-constraints.c (need_for_split_p): Let more
cases to do lra-split.
* ira-lives.c (process_single_reg_class_operands):
Avoid to add single reg class into conflict hardreg set in
some cases.


ChangeLog:

2014-04-25  Wei Mi  

PR rtl-optimization/60738
* testsuite/gcc.target/i386/pr60738-2.c: New test.
* testsuite/gcc.target/i386/pr60738-1.c: New test.

Index: params.def
===
--- params.def  (revision 209253)
+++ params.def  (working copy)
@@ -826,6 +826,11 @@ DEFPARAM (PARAM_LRA_MAX_CONSIDERED_RELOA
  "The max number of reload pseudos which are considered
during spilling a non-reload pseudo",
  500, 0, 0)

+DEFPARAM (PARAM_LRA_SPLIT_FREQ_RATIO,
+ "lra-split-freq-ratio",
+ "The ratio used to check when lra split is preferred than spilled",
+ 9, 0, 0)
+
 /* Switch initialization conversion will refuse to create arrays that are
bigger than this parameter times the number of switch branches.  */

Index: ira-lives.c
===
--- ira-lives.c (revision 209253)
+++ ira-lives.c (working copy)
@@ -1025,7 +1025,11 @@ process_single_reg_class_operands (bool
 {
  ira_object_t obj = ira_object_id_map[px];
  a = OBJECT_ALLOCNO (obj);
- if (a != operand_a)
+ /* If a is much hotter in some other region, don't add reg class
+cl into its conflict hardreg set.  Let lra split do the
+splitting for operand_a here.  */
+ if (a != operand_a
+ && (LRA_SPLIT_FREQ_RATIO * freq >= a->freq))
{
  /* We could increase costs of A instead of making it
 conflicting with the hard register.  But it works worse
Index: params.h
===
--- params.h(revision 209253)
+++ params.h(working copy)
@@ -198,6 +198,8 @@ extern void init_param_values (int *para
   PARAM_VALUE (PARAM_IRA_LOOP_RESERVED_REGS)
 #define LRA_MAX_CONSIDERED_RELOAD_PSEUDOS \
   PARAM_VALUE (PARAM_LRA_MAX_CONSIDERED_RELOAD_PSEUDOS)
+#define LRA_SPLIT_FREQ_RATIO \
+  PARAM_VALUE (PARAM_LRA_SPLIT_FREQ_RATIO)
 #define SWITCH_CONVERSION_BRANCH_RATIO \
   PARAM_VALUE (PARAM_SWITCH_CONVERSION_BRANCH_RATIO)
 #define LOOP_INVARIANT_MAX_BBS_IN_LOOP \
Index: lra-constraints.c
===
--- lra-constraints.c   (revision 209253)
+++ lra-constraints.c   (working copy)
@@ -129,6 +129,7 @@
 #include "ira.h"
 #include "rtl-error.h"
 #include "lra-int.h"
+#include "params.h"

 /* Value of LRA_CURR_RELOAD_NUM at the beginning of BB of the current
insn.  Remember that LRA_CURR_RELOAD_NUM is the number of emitted
@@ -4632,8 +4633,13 @@ static bitmap_head ebb_global_regs;
 static inline bool
 need_for_split_p (HARD_REG_SET potential_reload_hard_regs, int regno)
 {
+  int freq;
+  rtx last_use_insn;
   int hard_regno = regno < FIRST_PSEUDO_REGISTER ? regno : reg_renumber[regno];

+  last_use_insn = skip_usage_debug_insns (usage_insns[regno].insns);
+  freq = REG_FREQ_FROM_BB (BLOCK_FOR_INSN (last_use_insn));
+
   lra_assert (hard_regno >= 0);
   return ((TEST_HARD_REG_BIT (potential_reload_hard_regs, hard_regno)
   /* Don't split eliminable hard registers, otherwise we can
@@ -4653,25 +4659,27 @@ need_for_split_p (HARD_REG_SET potential
   && (regno >= FIRST_PSEUDO_REGISTER
   || ! TEST_HARD_REG_BIT (call_used_reg_set, regno)
   || usage_insns[regno].calls_num == calls_num)
-  /* We need at least 2 reloads to make pseudo splitting
- profitable.  We should provide hard regno splitting in
- any case to solve 1st insn scheduling problem when
- moving hard register definition up might result in
- impossibility to find hard register for reload pseudo of
- small register class.  */
-  && (usag

Re: [PATCH, PR60738] More LRA split for regno conflicting with single reg class operand

2014-04-28 Thread Wei Mi
Thanks. I will change

> + if (a != operand_a
> + && (LRA_SPLIT_FREQ_RATIO * freq >= a->freq))

to

> + if (a != operand_a
> + && (!ira_use_lra_p || LRA_SPLIT_FREQ_RATIO * freq >= a->freq))

Regards,
Wei.

On Mon, Apr 28, 2014 at 12:57 AM, Steven Bosscher  wrote:
> On Sat, Apr 26, 2014 at 5:35 AM, Wei Mi wrote:
>> Index: ira-lives.c
>> ===
>> --- ira-lives.c (revision 209253)
>> +++ ira-lives.c (working copy)
>> @@ -1025,7 +1025,11 @@ process_single_reg_class_operands (bool
>>  {
>>   ira_object_t obj = ira_object_id_map[px];
>>   a = OBJECT_ALLOCNO (obj);
>> - if (a != operand_a)
>> + /* If a is much hotter in some other region, don't add reg class
>> +cl into its conflict hardreg set. Let lra_split to do splitting
>> +here for operand_a.  */
>> + if (a != operand_a
>> + && (LRA_SPLIT_FREQ_RATIO * freq >= a->freq))
>> {
>>   /* We could increase costs of A instead of making it
>>  conflicting with the hard register.  But it works worse
>
> AFAICT this path is not LRA specific, so your patch may break ports
> still relying on reload.
>
> Ciao!
> Steven


Re: [PATCH, PR58066] preferred_stack_boundary update for tls expanded call

2014-04-30 Thread Wei Mi
Ping. Is pr58066-3.patch or pr58066-4.patch ok for trunk?

Thanks,
Wei.

>> I attached the patch which combined your two patches and the fix in
>> legitimize_tls_address. I tried pr58066.c and c.i in ia32/x32/x86_64,
>> the code looked fine. Do you think it is ok?
>>
>> Thanks,
>> Wei.
>
> Either pr58066-3.patch or pr58066-4.patch looks good to me.
>
> Thanks.
>
> --
> H.J.


Re: [PATCH] Builtins handling in IVOPT

2014-04-30 Thread Wei Mi
Ping.

Thanks,
Wei.

On Tue, Dec 17, 2013 at 11:34 AM, Wei Mi  wrote:
> Ping.
>
> Thanks,
> Wei.
>
> On Mon, Dec 9, 2013 at 9:54 PM, Wei Mi  wrote:
>> Ping.
>>
>> Thanks,
>> wei.
>>
>> On Sat, Nov 23, 2013 at 10:46 AM, Wei Mi  wrote:
>>> bootstrap and regression of the updated patch pass.
>>>
>>> On Sat, Nov 23, 2013 at 12:05 AM, Wei Mi  wrote:
>>>> On Thu, Nov 21, 2013 at 12:19 AM, Zdenek Dvorak
>>>>  wrote:
>>>>> Hi,
>>>>>
>>>>>> This patch works on the intrinsic calls handling issue in IVOPT 
>>>>>> mentioned here:
>>>>>> http://gcc.gnu.org/ml/gcc-patches/2010-10/msg01295.html
>>>>>>
>>>>>> In find_interesting_uses_stmt, it changes
>>>>>>
>>>>>> arg = expr
>>>>>> __builtin_xxx (arg)
>>>>>>
>>>>>> to
>>>>>>
>>>>>> arg = expr;
>>>>>> tmp = addr_expr (mem_ref(arg));
>>>>>> __builtin_xxx (tmp, ...)
>>>>>
>>>>> this looks a bit confusing (and wasteful) to me. It would make more sense 
>>>>> to
>>>>> just record the argument as USE_ADDRESS and do the rewriting in 
>>>>> rewrite_use_address.
>>>>>
>>>>> Zdenek
>>>>
>>>> I updated the patch. The gimple changing part is now moved to
>>>> rewrite_use_address. Add support for plain address expr in addition to
>>>> reference expr in find_interesting_uses_address.
>>>>
>>>> bootstrap and testing is going on.
>>>>
>>>> 2013-11-22  Wei Mi  
>>>>
>>>> * expr.c (expand_expr_addr_expr_1): Not to split TMR.
>>>> (expand_expr_real_1): Ditto.
>>>> * targhooks.c (default_builtin_has_mem_ref_p): Default
>>>> builtin.
>>>> * tree-ssa-loop-ivopts.c (builtin_has_mem_ref_p): New function.
>>>> (rewrite_use_address): Add TMR for builtin.
>>>> (find_interesting_uses_stmt): Special handling of builtins.
>>>> * gimple-expr.c (is_gimple_address): Add handling of TMR.
>>>> * gimple-expr.h (is_gimple_addressable): Ditto.
>>>> * config/i386/i386.c (ix86_builtin_has_mem_ref_p): New target hook.
>>>> (ix86_atomic_assign_expand_fenv): Ditto.
>>>> (ix86_expand_special_args_builtin): Special handling of TMR for
>>>> builtin.
>>>> * target.def (builtin_has_mem_ref_p): New hook.
>>>> * doc/tm.texi.in: Ditto.
>>>> * doc/tm.texi: Generated.
>>>>
>>>> 2013-11-22  Wei Mi  
>>>>
>>>> * gcc.dg/tree-ssa/ivopt_5.c: New test.
>>>>
>>>> Index: testsuite/gcc.dg/tree-ssa/ivopt_5.c
>>>> ===
>>>> --- testsuite/gcc.dg/tree-ssa/ivopt_5.c (revision 0)
>>>> +++ testsuite/gcc.dg/tree-ssa/ivopt_5.c (revision 0)
>>>> @@ -0,0 +1,21 @@
>>>> +/* { dg-do compile { target {{ i?86-*-* x86_64-*-* } && lp64 } } } */
>>>> +/* { dg-options "-O2 -m64 -fdump-tree-ivopts-details" } */
>>>> +
>>>> +/* Make sure only one iv is selected after IVOPT.  */
>>>> +
>>>> +#include <emmintrin.h>
>>>> +extern __m128i arr[], d[];
>>>> +void test (void)
>>>> +{
>>>> +unsigned int b;
>>>> +for (b = 0; b < 1000; b += 2) {
>>>> +  __m128i *p = (__m128i *)(&d[b]);
>>>> +  __m128i a = _mm_load_si128(&arr[4*b+3]);
>>>> +  __m128i v = _mm_loadu_si128(p);
>>>> +  v = _mm_xor_si128(v, a);
>>>> +  _mm_storeu_si128(p, v);
>>>> +}
>>>> +}
>>>> +
>>>> +/* { dg-final { scan-tree-dump-times "PHI" 1 "ivopts" } } */
>>>> +/* { dg-final { cleanup-tree-dump "ivopts" } } */
>>>> Index: targhooks.c
>>>> ===
>>>> --- targhooks.c (revision 204792)
>>>> +++ targhooks.c (working copy)
>>>> @@ -566,6 +566,13 @@ default_builtin_reciprocal (unsigned int
>>>>  }
>>>>
>>>>  bool
>>>> +default_builtin_has_mem_ref_p (int built_in_function ATTRIBUTE_UNUSED,
>>>> +  int i AT

Re: [PATCH, PR58066] preferred_stack_boundary update for tls expanded call

2014-05-01 Thread Wei Mi
On Wed, Apr 30, 2014 at 11:44 PM, Uros Bizjak  wrote:
> On Thu, May 1, 2014 at 6:42 AM, Wei Mi  wrote:
>> Ping. Is pr58066-3.patch or pr58066-4.patch ok for trunk?
>
> None of these patches have correct ChangeLog entries. Please follow
> the rules, outlined in http://gcc.gnu.org/contribute.html (Submitting
> Patches section), otherwise your patches will be simply ignored.
>

Will add Changelog in the final patch.

>
> pr58066-4 patch is definitely not OK. I wonder, how it works at all,
> since you can't split the insn to the same pattern. The generic code
> detects this condition and forces ICE (IIRC: this is the reason for
> UNSPEC_DIV_ALREADY_SPLIT tag in divmod).
>

If the generic code detects the same pattern, it will not ICE but will
return the original insn (see the code in try_split under the comment
/* Avoid infinite loop if any insn of the result matches the original
pattern. */). ix86_tls_descriptor_calls_expanded_in_cfun will be set
to true even if the original insn is returned by try_split. That is
how pr58066-4 works.

> From pr58066-3 patch:
>
> -;; Local dynamic of a single variable is a lose.  Show combine how
> -;; to convert that back to global dynamic.
> -
> -(define_insn_and_split "*tls_local_dynamic_32_once"
> -  [(set (match_operand:SI 0 "register_operand" "=a")
> -(plus:SI
> - (unspec:SI [(match_operand:SI 1 "register_operand" "b")
> - (match_operand 2 "constant_call_address_operand" "z")]
> -UNSPEC_TLS_LD_BASE)
> - (const:SI (unspec:SI
> -[(match_operand 3 "tls_symbolic_operand")]
> -UNSPEC_DTPOFF
> -   (clobber (match_scratch:SI 4 "=d"))
> -   (clobber (match_scratch:SI 5 "=c"))
> -   (clobber (reg:CC FLAGS_REG))]
> -  ""
> -  "#"
> -  ""
> -  [(parallel
> - [(set (match_dup 0)
> -   (unspec:SI [(match_dup 1) (match_dup 3) (match_dup 2)]
> -  UNSPEC_TLS_GD))
> -  (clobber (match_dup 4))
> -  (clobber (match_dup 5))
> -  (clobber (reg:CC FLAGS_REG))])])
>
> Why did you remove this splitter?
>

After we add a call into the pattern of tls_local_dynamic_base_32, it
is difficult for the pattern tls_local_dynamic_32_once to be matched.
I don't have a good way to rewrite tls_local_dynamic_32_once. Do you
have any idea?

(define_expand "tls_local_dynamic_base_32"
  [(parallel
 [(set (match_operand:SI 0 "register_operand")
   (call:SI
(mem:QI (match_operand 2 "constant_call_address_operand"))
(const_int 0)))
  (unspec:SI [(match_operand:SI 1 "register_operand")]
 UNSPEC_TLS_LD_BASE)
  (clobber (match_scratch:SI 3))
  (clobber (match_scratch:SI 4))
  (clobber (reg:CC FLAGS_REG))])]

> Please do not write:
>
> +{
> +  ix86_tls_descriptor_calls_expanded_in_cfun = true;
> +})
>
> but use a short form:
>
> +  "ix86_tls_descriptor_calls_expanded_in_cfun = true;")
>
> Please also add a testcase (from one of the previous mails):
>
> --- testsuite/gcc.dg/pr58066.c (revision 0)
> +++ testsuite/gcc.dg/pr58066.c (revision 0)
>
> Put this test to gcc.target/i386 directory ...
> @@ -0,0 +1,18 @@
> +/* { dg-do compile { target {{ i?86-*-* x86_64-*-* } && { ! ia32 } } } } */
>
> ... to avoid target selector.
>
> +/* { dg-options "-fPIC -O2" } */
> +
> +/* Check whether the stack frame starting addresses of tls expanded calls
> +   in foo and goo are 16bytes aligned.  */
> +static __thread char ccc1;
> +void* foo()
> +{
> + return &ccc1;
> +}
> +
> +__thread char ccc2;
> +void* goo()
> +{
> + return &ccc2;
> +}
> +
> +/* { dg-final { scan-assembler-times ".cfi_def_cfa_offset 16" 2 } } */
>
> Please repost the complete patch with a proper ChangeLog.
>
> Uros.

will do that.

Thanks,
Wei.


Re: [PATCH, PR58066] preferred_stack_boundary update for tls expanded call

2014-05-07 Thread Wei Mi
This is the updated patch of pr58066-3.patch.

The calls added to the templates of tls_local_dynamic_base_32 and
tls_global_dynamic_32 in pr58066-3.patch were there to prevent sched2
from moving the sp setting across implicit tls calls, but those calls
make the combine of UNSPEC_TLS_LD_BASE and UNSPEC_DTPOFF difficult, so
the optimization in tls_local_dynamic_32_once that converts
local_dynamic to global_dynamic mode for a single tls reference cannot
take effect. In the updated patch, I remove those calls from the insn
templates and add "reg:SI SP_REG" explicitly to the templates of
UNSPEC_TLS_GD and UNSPEC_TLS_LD_BASE. This solves both the sched2 and
combine problems above, and now the optimization in
tls_local_dynamic_32_once works.

Bootstrapped ok on x86_64-linux-gnu; regression testing is going on.
Is it OK if regression passes?

Thanks.
Wei.

ChangeLog:

gcc/
2014-05-07  Wei Mi  

* config/i386/i386.c (ix86_compute_frame_layout): Update
preferred_stack_boundary for calls expanded from tls descriptors.
* config/i386/i386.md: Set ix86_tls_descriptor_calls_expanded_in_cfun.

gcc/testsuite/
2014-05-07  Wei Mi  

* gcc.target/i386/pr58066.c: New test.

Index: testsuite/gcc.target/i386/pr58066.c
===
--- testsuite/gcc.target/i386/pr58066.c (revision 0)
+++ testsuite/gcc.target/i386/pr58066.c (revision 0)
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-fPIC -O2" } */
+
+/* Check whether the stack frame starting addresses of tls expanded calls
+   in foo and goo are 16bytes aligned.  */
+static __thread char ccc1;
+void* foo()
+{
+ return &ccc1;
+}
+
+__thread char ccc2;
+void* goo()
+{
+ return &ccc2;
+}
+
+/* { dg-final { scan-assembler-times ".cfi_def_cfa_offset 16" 2 } } */
Index: config/i386/i386.c
===
--- config/i386/i386.c  (revision 209979)
+++ config/i386/i386.c  (working copy)
@@ -9485,20 +9485,30 @@ ix86_compute_frame_layout (struct ix86_f
   frame->nregs = ix86_nsaved_regs ();
   frame->nsseregs = ix86_nsaved_sseregs ();

-  stack_alignment_needed = crtl->stack_alignment_needed / BITS_PER_UNIT;
-  preferred_alignment = crtl->preferred_stack_boundary / BITS_PER_UNIT;
-
   /* 64-bit MS ABI seem to require stack alignment to be always 16 except for
  function prologues and leaf.  */
-  if ((TARGET_64BIT_MS_ABI && preferred_alignment < 16)
+  if ((TARGET_64BIT_MS_ABI && crtl->preferred_stack_boundary < 128)
   && (!crtl->is_leaf || cfun->calls_alloca != 0
   || ix86_current_function_calls_tls_descriptor))
 {
-  preferred_alignment = 16;
-  stack_alignment_needed = 16;
   crtl->preferred_stack_boundary = 128;
   crtl->stack_alignment_needed = 128;
 }
+  /* preferred_stack_boundary is never updated for calls
+ expanded from tls descriptors.  Update it here.  We don't update it
+ at expand stage because, according to the comments before
+ ix86_current_function_calls_tls_descriptor, tls calls may be optimized
+ away.  */
+  else if (ix86_current_function_calls_tls_descriptor
+  && crtl->preferred_stack_boundary < PREFERRED_STACK_BOUNDARY)
+{
+  crtl->preferred_stack_boundary = PREFERRED_STACK_BOUNDARY;
+  if (crtl->stack_alignment_needed < PREFERRED_STACK_BOUNDARY)
+   crtl->stack_alignment_needed = PREFERRED_STACK_BOUNDARY;
+}
+
+  stack_alignment_needed = crtl->stack_alignment_needed / BITS_PER_UNIT;
+  preferred_alignment = crtl->preferred_stack_boundary / BITS_PER_UNIT;

   gcc_assert (!size || stack_alignment_needed);
   gcc_assert (preferred_alignment >= STACK_BOUNDARY / BITS_PER_UNIT);
Index: config/i386/i386.md
===
--- config/i386/i386.md (revision 209979)
+++ config/i386/i386.md (working copy)
@@ -12530,7 +12530,8 @@
(unspec:SI
 [(match_operand:SI 1 "register_operand" "b")
  (match_operand 2 "tls_symbolic_operand")
- (match_operand 3 "constant_call_address_operand" "z")]
+ (match_operand 3 "constant_call_address_operand" "z")
+ (reg:SI SP_REG)]
 UNSPEC_TLS_GD))
(clobber (match_scratch:SI 4 "=d"))
(clobber (match_scratch:SI 5 "=c"))
@@ -12555,11 +12556,14 @@
 [(set (match_operand:SI 0 "register_operand")
  (unspec:SI [(match_operand:SI 2 "register_operand")
  (match_operand 1 "tls_symbolic_operand")
- (match_operand 3 "constant_call_address_operand")]
+ (match_operand 3 "constant_call_address_operand")
+ (reg:SI SP_REG)]
 UNSPEC_TLS_GD))
  (clobber (match_scratch:SI 4))
  (c

Re: [PATCH, PR58066] preferred_stack_boundary update for tls expanded call

2014-05-10 Thread Wei Mi
Here is a patch for the test. It contains two changes:
1. For emutls, an explicit call is generated at the expand pass and no
stack adjustment is needed, so add
/* { dg-require-effective-target tls_native } */ to the test.
2. Replace the cfi_def_cfa_offset check with an insn sequence check.

Is it ok?

Thanks,
Wei.

Index: testsuite/gcc.target/i386/pr58066.c
===
--- testsuite/gcc.target/i386/pr58066.c (revision 210301)
+++ testsuite/gcc.target/i386/pr58066.c (working copy)
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
-/* { dg-options "-fPIC -O2" } */
+/* { dg-require-effective-target tls_native } */
+/* { dg-options "-fPIC -fomit-frame-pointer -O2" } */

 /* Check whether the stack frame starting addresses of tls expanded calls
in foo and goo are 16bytes aligned.  */
@@ -15,4 +16,4 @@ void* goo()
  return &ccc2;
 }

-/* { dg-final { scan-assembler-times ".cfi_def_cfa_offset 16" 2 } } */
+/* { dg-final { scan-assembler "sub\[^\r\n\]*8\[^\r\n\]*sp.*call\[^\r\n\]*__tls_get_addr.*sub\[^\r\n\]*8\[^\r\n\]*sp.*call\[^\r\n\]*__tls_get_addr" } } */

On Sat, May 10, 2014 at 6:47 AM, Rainer Orth
 wrote:
> domi...@lps.ens.fr (Dominique Dhumieres) writes:
>
>>> This is the updated patch of pr58066-3.patch. ...
>>
>> On x86_64-apple-darwin13 I get
>>
>> FAIL: gcc.target/i386/pr58066.c scan-assembler-times .cfi_def_cfa_offset 16 2
>
> Same on i386-pc-solaris2.* with Sun as (which doesn't support cfi
> directives).
>
> Rainer
>
> --
> -
> Rainer Orth, Center for Biotechnology, Bielefeld University


Re: [PATCH, PR58066] preferred_stack_boundary update for tls expanded call

2014-05-12 Thread Wei Mi
>> Here is a patch for the test. It contains two changes:
>> 1. For emutls, there will be an explicit call generated at expand
>> pass, and no stack adjustment is needed. So add /* {
>> dg-require-effective-target tls_native } */ in the test.
>> 2. Replace cfi_def_cfa_offset with insn sequence check.
>>
>> Is it ok?
>
> No, the test FAILs for 32-bit i386-pc-solaris2.11 with Sun as/ld:
>
> FAIL: gcc.target/i386/pr58066.c scan-assembler 
> sub[^\r\n]*8[^\r\n]*sp.*call[^\r\n]*__tls_get_addr.*sub[^\r\n]*8[^\r\n]*sp.*call[^\r\n]*__tls_get_addr
>
> The TLS code sequence is different here:
>
> subl$8, %esp
> lealccc1@tlsgd(,%ebx,1), %eax
> callccc1@tlsgdplt
>
> I fear this insn scanning is going to be extremely fragile.
>
> Rainer

Thanks for trying the testcase. RTL scanning will be slightly more
robust than assembly scanning. So how about this one?

Thanks,
Wei.

Index: testsuite/gcc.target/i386/pr58066.c
===
--- testsuite/gcc.target/i386/pr58066.c (revision 210222)
+++ testsuite/gcc.target/i386/pr58066.c (working copy)
@@ -1,5 +1,6 @@
 /* { dg-do compile } */
-/* { dg-options "-fPIC -O2" } */
+/* { dg-require-effective-target tls_native } */
+/* { dg-options "-fPIC -fomit-frame-pointer -O2 -fdump-rtl-final" } */

 /* Check whether the stack frame starting addresses of tls expanded calls
in foo and goo are 16bytes aligned.  */
@@ -15,4 +16,6 @@ void* goo()
  return &ccc2;
 }

-/* { dg-final { scan-assembler-times ".cfi_def_cfa_offset 16" 2 } } */
+/* { dg-final { scan-rtl-dump "Function foo.*set\[^\r\n\]*sp\\)\[\r\n\]\[^\r\n\]*plus\[^\r\n\]*sp\\)\[\r\n\]\[^\r\n\]*const_int -8.*UNSPEC_TLS.*Function goo" "final" } } */
+/* { dg-final { scan-rtl-dump "Function goo.*set\[^\r\n\]*sp\\)\[\r\n\]\[^\r\n\]*plus\[^\r\n\]*sp\\)\[\r\n\]\[^\r\n\]*const_int -8.*UNSPEC_TLS" "final" } } */
+/* { dg-final { cleanup-rtl-dump "final" } } */


Re: [PATCH, PR58066] preferred_stack_boundary update for tls expanded call

2014-05-14 Thread Wei Mi
Can I check in this testcase fix?

Thanks,
Wei.


On Tue, May 13, 2014 at 1:39 AM, Rainer Orth
 wrote:
> Wei Mi  writes:
>
>> Thanks for trying the testcase. rtl scanning will be slightly better
>> than assembly scanning. So how about this one?
>
> This one works fine for me.
>
> Thanks.
> Rainer
>
> --
> -
> Rainer Orth, Center for Biotechnology, Bielefeld University


Re: [GCC RFC]A new and simple pass merging paired load store instructions

2014-05-20 Thread Wei Mi
On Tue, May 20, 2014 at 12:13 AM, Bin.Cheng  wrote:
> On Tue, May 20, 2014 at 1:30 AM, Jeff Law  wrote:
>> On 05/19/14 00:38, Bin.Cheng wrote:
>>>
>>> On Sat, May 17, 2014 at 12:32 AM, Jeff Law  wrote:

 On 05/16/14 04:07, Bin.Cheng wrote:



 But can't you go through movXX to generate either the simple insn on the
 ARM
 or the PARALLEL on the thumb?

>>> Yes, I think it's more than upsizing the mode.  There is another
>>> example from one of x86's candidate peephole patch at
>>> https://gcc.gnu.org/ml/gcc-patches/2014-04/msg00467.html
>>>
>>> The patch wants to do below transformation, which I think is very
>>> target dependent.
>>
>> Presumably there's no way to go through an expander here?
> I don't know very much with respect to this case, maybe the patch
> author can help here.
>

I just checked the expand result for my case. TER can make expand see
the two intrinsics at the same time, so in theory it is possible to
make the merge happen at expand. I haven't looked into the details.

Thanks,
Wei.


[google gcc-4_8] Don't use gcov counter related ssa name as induction variables

2014-02-10 Thread Wei Mi
Hi,

I saw a bug happen in the fdo-gen phase when a gcov-counter-related
ssa name was used as an induction variable and used to calculate the
loop boundary after the loop condition was replaced by an expression
containing that ssa name. We know that there are data races on gcov
counters in multithreaded programs, so the values in gcov counters may
change unexpectedly. Normally, compiler optimizations assume global
variables have no data races, so it is the compiler's responsibility
to prevent gcov-counter-related variables from affecting the program's
correctness. However, using a gcov-counter-related ssa tmp as an
induction variable and doing iv replacements breaks this rule.

The following testcase shows the problem in concept.

void *ptr;
int N;

void foo(void *t) {
  int i;
  for (i = 0; i < N; i++) {
t = *(void **)t;
  }
  ptr = t;
}

The compile command:
gcc-r206603/build/install/bin/gcc -O2 -fprofile-generate=./fdoprof
-fno-tree-loop-im -fdump-tree-ivopts-details-blocks -c 1.c

IR for kernel loop before IVOPT:

;;   basic block 3, loop depth 0, count 0, freq 819, maybe hot
;;prev block 2, next block 4, flags: (NEW)
;;pred:   2 [91.0%]  (TRUE_VALUE,EXECUTABLE)
  pretmp_1 = __gcov0.foo[0];
;;succ:   4 [100.0%]  (FALLTHRU,EXECUTABLE)

;;   basic block 4, loop depth 1, count 0, freq 9100, maybe hot
;;prev block 3, next block 5, flags: (NEW, REACHABLE)
;;pred:   5 [100.0%]  (FALLTHRU,EXECUTABLE)
;;3 [100.0%]  (FALLTHRU,EXECUTABLE)
  # t_25 = PHI 
  # i_26 = PHI 
  # prephitmp_23 = PHI 
  PROF_edge_counter_10 = prephitmp_23 + 1;
  __gcov0.foo[0] = PROF_edge_counter_10;
  t_6 = MEM[(void * *)t_25];
  i_7 = i_26 + 1;
  if (i_7 < N.0_24)
    goto <bb 5>;
  else
    goto <bb 6>;
;;succ:   5 [91.0%]  (TRUE_VALUE,EXECUTABLE)
;;6 [9.0%]  (FALSE_VALUE,EXECUTABLE)

;;   basic block 5, loop depth 1, count 0, freq 8281, maybe hot
;;prev block 4, next block 6, flags: (NEW)
;;pred:   4 [91.0%]  (TRUE_VALUE,EXECUTABLE)
  goto <bb 4>;

Induction variable dumps:

In the IVOPT dump, I can see that some gcov-counter-related ssa names
are used as induction variables.
Induction variables:
ssa name PROF_edge_counter_10
  type long int
  base pretmp_1 + 1
  step 1
  is a biv
ssa name prephitmp_23
  type long int
  base pretmp_1
  step 1
  is a biv

After IVOPT:
pretmp_1 is a gcov-counter-related tmp var, and it is used to
calculate _33, which is the loop boundary. Sometimes register
allocation may replace the use of pretmp_1 with __gcov0.foo[0], so
there may be multiple __gcov0.foo[0] accesses in the loop boundary
calculation. But the values of those __gcov0.foo[0] accesses may or
may not be the same because of the data race. That caused the bug.

;;   basic block 3, loop depth 0, count 0, freq 819, maybe hot
;;prev block 2, next block 4, flags: (NEW)
;;pred:   2 [91.0%]  (TRUE_VALUE,EXECUTABLE)
  pretmp_1 = __gcov0.foo[0];
  _22 = pretmp_1 + 1;
  ivtmp.8_2 = (unsigned long) _22;
  _21 = (unsigned int) N.0_24;
  _29 = _21 + 4294967295;
  _30 = (unsigned long) _29;
  _31 = (unsigned long) pretmp_1;  // may be replaced by __gcov0.foo[0] in register allocation
  _32 = _30 + _31;
  _33 = _32 + 2;
;;succ:   4 [100.0%]  (FALLTHRU,EXECUTABLE)

;;   basic block 4, loop depth 1, count 0, freq 9100, maybe hot
;;prev block 3, next block 5, flags: (NEW, REACHABLE)
;;pred:   5 [100.0%]  (FALLTHRU,EXECUTABLE)
;;3 [100.0%]  (FALLTHRU,EXECUTABLE)
  # t_25 = PHI 
  # ivtmp.8_9 = PHI 
  PROF_edge_counter_10 = (long int) ivtmp.8_9;
  __gcov0.foo[0] = PROF_edge_counter_10;
  t_6 = MEM[(void * *)t_25];
  ivtmp.8_5 = ivtmp.8_9 + 1;
  if (ivtmp.8_5 != _33)
    goto <bb 5>;
  else
    goto <bb 6>;
;;succ:   5 [91.0%]  (TRUE_VALUE,EXECUTABLE)
;;6 [9.0%]  (FALSE_VALUE,EXECUTABLE)

;;   basic block 5, loop depth 1, count 0, freq 8281, maybe hot
;;prev block 4, next block 6, flags: (NEW)
;;pred:   4 [91.0%]  (TRUE_VALUE,EXECUTABLE)
  goto <bb 4>;
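
To see why the extra counter load is dangerous, here is a hedged,
self-contained illustration (plain C, not compiler code; __gcov0_foo
stands in for __gcov0.foo[0], and bad_bound is a made-up name):

  long __gcov0_foo;              /* racy counter, bumped by all threads */

  long bad_bound (long n)
  {
    long first  = __gcov0_foo;   /* pretmp_1 = __gcov0.foo[0];  */
    /* ... another thread may increment __gcov0_foo here ...  */
    long second = __gcov0_foo;   /* RA-introduced reload of the counter */
    /* first == second is not guaranteed, so a bound derived from both
       loads (like _31/_33 above) can be self-inconsistent.  */
    return n + (second - first); /* may not equal n */
  }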


The patch marks ssa names generated in profile-gen as
PROFILE_GENERATED. In IVOPT, an ssa name marked PROFILE_GENERATED, or
defined by a PHI that has another PROFILE_GENERATED ssa name as an
operand, will not be identified as an induction variable.

Testing is going on. Is it ok if tests pass?

2014-02-10  Wei Mi  

* tree-flow-inline.h (make_prof_ssa_name): New.
(make_temp_prof_ssa_name): Ditto.
* tree.h (struct tree_base): Add PROFILE_GENERATED flag for ssa name.
* tree-inline.c (remap_ssa_name): Set PROFILE_GENERATED flag for
ssa name.
* tree-profile.c (add_sampling_wrapper): Ditto.
(add_execonce_wrapper): Ditto.
(gimple_gen_edge_profiler): Ditto.
(gimple_gen_ic_profiler): Ditto.
(gimple_gen_dc_profiler): Ditto.
* value-prof.c (gimple_divmod_fixed_value): Ditto.
(gimple_mod_pow2): Ditto.
(gimple_mod_subtract): Ditto.
(gimple_ic): Ditto.
(gimple_stringop_fix

Re: [google gcc-4_8] Don't use gcov counter related ssa name as induction variables

2014-02-10 Thread Wei Mi
Here is the updated patch, which follows the UD chain to determine
whether iv.base is defined by a __gcovx.xxx[] var. It is a lot simpler
than adding a tree bit.

Regression tests and the previously failing benchmark in piii mode are
ok. Other testing is going on.

2014-02-10  Wei Mi  

* tree-ssa-loop-ivopts.c (defined_by_gcov_counter): New.
(contains_abnormal_ssa_name_p): Add defined_by_gcov_counter
check for ssa name.

* testsuite/gcc.dg/profile-generate-4.c: New.

Index: tree-ssa-loop-ivopts.c
===
--- tree-ssa-loop-ivopts.c  (revision 207019)
+++ tree-ssa-loop-ivopts.c  (working copy)
@@ -705,6 +705,68 @@ idx_contains_abnormal_ssa_name_p (tree b
   return !abnormal_ssa_name_p (*index);
 }

+/* Return true if the use is defined by a gcov counter var.
+   It is used to check if an iv candidate is generated for
+   profiling.  For a profile-generated ssa name, we should not
+   use it as an IV because gcov counters may have data races in
+   multithreaded programs; it could introduce tricky bugs to use
+   such ssa vars in IVOPT.
+
+   To limit patterns to be checked, we list the possible cases
+   here:
+   Before PRE, the ssa name used to set __gcov counter is as
+   follows:
+   for () {
+ PROF_edge_counter_1 = __gcov.foo[i];
+ PROF_edge_counter_2 = PROF_edge_counter_1 + 1;
+ __gcov.foo[i] = PROF_edge_counter_2;
+   }
+   If PRE works, the loop may be transformed to:
+   pretmp_1 = __gcov.foo[i];
+   for () {
+ prephitmp_1 = PHI (PROF_edge_counter_2, pretmp_1);
+ PROF_edge_counter_1 = prephitmp_1;
+ PROF_edge_counter_2 = PROF_edge_counter_1 + 1;
+ __gcov.foo[i] = PROF_edge_counter_2;
+   }
+   So there are two cases:
+   case1: If PRE doesn't work, PROF_edge_counter_1 and PROF_edge_counter_2
+   are not induction variable candidates. We don't have to worry
+   about this case.
+   case2: If PRE works, the iv candidate bases of PROF_edge_counter_1 and
+   PROF_edge_counter_2 are pretmp_1 and pretmp_1 + 1. pretmp_1 is defined
+   by the __gcov var.
+
+   So this func only has to check case2. For an ssa name which is an iv
+   candidate, check its base USE and see if it is defined by a __gcov var.
+   Returning true means the ssa name is generated for profiling.  */
+
+bool
+defined_by_gcov_counter (tree use)
+{
+  gimple stmt;
+  tree rhs, decl;
+  const char *name;
+
+  stmt = SSA_NAME_DEF_STMT (use);
+  if (!is_gimple_assign (stmt))
+return false;
+
+  rhs = gimple_assign_rhs1 (stmt);
+  if (TREE_CODE (rhs) != ARRAY_REF)
+return false;
+
+  decl = TREE_OPERAND (rhs, 0);
+  if (TREE_CODE (decl) != VAR_DECL)
+return false;
+
+  name = IDENTIFIER_POINTER (DECL_NAME (decl));
+  if (strncmp (name, "__gcov", 6))
+return false;
+
+  return true;
+}
+
 /* Returns true if EXPR contains a ssa name that occurs in an
abnormal phi node.  */

@@ -721,7 +783,8 @@ contains_abnormal_ssa_name_p (tree expr)
   codeclass = TREE_CODE_CLASS (code);

   if (code == SSA_NAME)
-return SSA_NAME_OCCURS_IN_ABNORMAL_PHI (expr) != 0;
+return SSA_NAME_OCCURS_IN_ABNORMAL_PHI (expr) != 0
+  || defined_by_gcov_counter (expr);

   if (code == INTEGER_CST
   || is_gimple_min_invariant (expr))
Index: testsuite/gcc.dg/profile-generate-4.c
===
--- testsuite/gcc.dg/profile-generate-4.c   (revision 0)
+++ testsuite/gcc.dg/profile-generate-4.c   (revision 0)
@@ -0,0 +1,21 @@
+/* { dg-do compile { target { i?86-*-* x86_64-*-* } } } */
+/* { dg-options "-O2 -fprofile-generate -fno-tree-loop-im -fdump-tree-ivopts-details-blocks" } */
+
+/* Because gcov counter related var has data race for multithread program,
+   compiler should prevent them from affecting program correctness. So
+   PROF_edge_counter variable should not be used as induction variable, or
+   else IVOPT may use such variable to compute loop boundary.  */
+
+void *ptr;
+int N;
+
+void foo(void *t) {
+  int i;
+  for (i = 0; i < N; i++) {
+t = *(void **)t;
+  }
+  ptr = t;
+}
+
+/* { dg-final { scan-tree-dump-times "ssa name PROF_edge_counter" 0 "ivopts"} } */
+/* { dg-final { cleanup-tree-dump "ivopts" } } */


Re: [google gcc-4_8] Don't use gcov counter related ssa name as induction variables

2014-02-11 Thread Wei Mi
>> +/* Return true if the use is defined by a gcov counter var.
>> +   It is used to check if an iv candidate is generated for
>> +   profiling. For profile generated ssa name, we should not
>> +   use it as IV because gcov counter may have data-race for
>> +   multithread program, it could involve tricky bug to use
>> +   such ssa var in IVOPT.
>> +
>
> Add the snippets describing how ralloc introduces a second gcov
> counter load and asynchronous update of it from another thread leading
> to bogus trip count.
>
> Also mention that setting volatile flag on gcov counter accesses may
> greatly degrade profile-gen performance.
>

Comments added.

>> +  if (!is_gimple_assign (stmt))
>> +return false;
>> +
>> +  rhs = gimple_assign_rhs1 (stmt);
>> +  if (TREE_CODE (rhs) != ARRAY_REF)
>> +return false;
>> +
>> +  decl = TREE_OPERAND (rhs, 0);
>> +  if (TREE_CODE (decl) != VAR_DECL)
>> +return false;
>
>
>
> Also check TREE_STATIC and DECL_ARTIFICIAL flag.
>
>
> David
>

Checks added. Also added the DECL_ARTIFICIAL setting in build_var() in
coverage.c.

The updated patch is attached.
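
A rough sketch of the predicate with the requested checks folded in
(details may differ from the attached patch):

static bool
defined_by_gcov_counter (tree use)
{
  gimple stmt = SSA_NAME_DEF_STMT (use);
  if (!is_gimple_assign (stmt))
    return false;

  tree rhs = gimple_assign_rhs1 (stmt);
  if (TREE_CODE (rhs) != ARRAY_REF)
    return false;

  tree decl = TREE_OPERAND (rhs, 0);
  if (TREE_CODE (decl) != VAR_DECL)
    return false;

  /* Requested checks: gcov counters are static and artificial.  */
  if (!TREE_STATIC (decl) || !DECL_ARTIFICIAL (decl))
    return false;

  return strncmp (IDENTIFIER_POINTER (DECL_NAME (decl)),
                  "__gcov", 6) == 0;
}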

Thanks,
Wei.
2014-02-11  Wei Mi  

* tree-ssa-loop-ivopts.c (defined_by_gcov_counter): New.
(contains_abnormal_ssa_name_p): Add defined_by_gcov_counter
check for ssa name.
* coverage.c (build_var): Set DECL_ARTIFICIAL(gcov var decl)
to be 1.

* testsuite/gcc.dg/profile-generate-4.c: New.

Index: testsuite/gcc.dg/profile-generate-4.c
===
--- testsuite/gcc.dg/profile-generate-4.c   (revision 0)
+++ testsuite/gcc.dg/profile-generate-4.c   (revision 0)
@@ -0,0 +1,21 @@
+/* { dg-do compile { target { i?86-*-* x86_64-*-* } } } */
+/* { dg-options "-O2 -fprofile-generate -fno-tree-loop-im -fdump-tree-ivopts-details-blocks" } */
+
+/* Because gcov counter related var has data race for multithread program,
+   compiler should prevent them from affecting program correctness. So
+   PROF_edge_counter variable should not be used as induction variable, or
+   else IVOPT may use such variable to compute loop boundary.  */
+
+void *ptr;
+int N;
+
+void foo(void *t) {
+  int i;
+  for (i = 0; i < N; i++) {
+t = *(void **)t;
+  }
+  ptr = t;
+}
+
+/* { dg-final { scan-tree-dump-times "ssa name PROF_edge_counter" 0 "ivopts"} } */
+/* { dg-final { cleanup-tree-dump "ivopts" } } */
Index: coverage.c
===
--- coverage.c  (revision 207019)
+++ coverage.c  (working copy)
@@ -1485,6 +1485,7 @@ build_var (tree fn_decl, tree type, int
   TREE_STATIC (var) = 1;
   TREE_ADDRESSABLE (var) = 1;
   DECL_ALIGN (var) = TYPE_ALIGN (type);
+  DECL_ARTIFICIAL (var) = 1;
 
   return var;
 }
Index: tree-ssa-loop-ivopts.c
===
--- tree-ssa-loop-ivopts.c  (revision 207019)
+++ tree-ssa-loop-ivopts.c  (working copy)
@@ -705,6 +705,115 @@ idx_contains_abnormal_ssa_name_p (tree b
   return !abnormal_ssa_name_p (*index);
 }
 
+/* Return true if the use is defined by a gcov counter var.
+   It is used to check if an iv candidate is generated for
+   profiling. A profile-generated ssa name should not be used as
+   an IV because gcov counters may have data races in multithreaded
+   programs; it is the compiler's responsibility to keep profile
+   counter related vars from affecting program correctness.
+
+   Without the check, the following bug could happen:
+   * original loop
+
+ int i;
+ for (i = 0; i < N; i++) {
+   t = *(void **)t;
+ }
+
+   * after profile-gen and IVOPT, loop condition is replaced and
+ pretmp_1 is involved in loop boundary computation.
+
+ pretmp_1 = __gcov0.foo[0];
+ _22 = pretmp_1 + 1;
+ ...
+ _31 = (unsigned long) pretmp_1;
+ _32 = _30 + _31;
+ _33 = _32 + 2;
+  label:
+ ivtmp.8_9 = PHI 
+ PROF_edge_counter_10 = (long int) ivtmp.8_9;
+ __gcov0.foo[0] = PROF_edge_counter_10;
+   ...
+ ivtmp.8_5 = ivtmp.8_9 + 1;
+ if (ivtmp.8_5 != _33)
+   goto label
+
+   * after register allocation, pretmp_1 may be marked as REG_EQUIV with
+     __gcov0.foo[0] in IRA, and some references are replaced by
+     __gcov0.foo[0] in LRA.
+
+ _22 = __gcov0.foo[0] + 1;
+ ...
+ _31 = (unsigned long) __gcov0.foo[0];
+ _32 = _30 + _31;
+ _33 = _32 + 2;
+  label:
+ 
+
+   * The bug happens when __gcov0.foo[0] is updated asynchronously by
+     another thread between the above __gcov0.foo[0] reference statements.
+
+   We don't choose to mark gcov counter as volatile because it may greatly
+   degrade profile-gen performance.
+
+   To limit patterns to be checked, we list the possible cases
+   he

[PATCH, PR58066] preferred_stack_boundary update for tls expanded call

2014-03-07 Thread Wei Mi
Hi,

This patch is to fix the problem described here:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58066

I followed Ian's suggestion and set
ix86_tls_descriptor_calls_expanded_in_cfun in
tls_global_dynamic_64_<mode> and tls_local_dynamic_base_64_<mode>.
Although 32-bit doesn't have the problem,
ix86_tls_descriptor_calls_expanded_in_cfun is also set for
tls_global_dynamic_32 and tls_local_dynamic_base_32 to make the
ix86_tls_descriptor_calls_expanded_in_cfun setting consistent across
32-bit and 64-bit.

If ix86_current_function_calls_tls_descriptor is set, we know that
there is a tls expanded call in the current function. Update
crtl->preferred_stack_boundary and crtl->stack_alignment_needed to be
no less than PREFERRED_STACK_BOUNDARY at the start of
ix86_compute_frame_layout. We don't do the update in
legitimize_tls_address in the cfgexpand stage, which is too early
because, according to the comments before
ix86_current_function_calls_tls_descriptor, the tls call may be
optimized away. ix86_compute_frame_layout is the latest place to do
the update.

Bootstrap on x86_64-linux-gnu is ok. Regression test is going on. Ok
for trunk if tests pass?

Thanks,
Wei.

gcc/ChangeLog:

2014-03-07  Wei Mi  

* config/i386/i386.c (ix86_compute_frame_layout): Update
preferred_stack_boundary when there is tls expanded call.
* config/i386/i386.md: Set
ix86_tls_descriptor_calls_expanded_in_cfun.

gcc/testsuite/ChangeLog:

2014-03-07  Wei Mi  

* g++.dg/pr58066.C: New test.


Index: gcc/config/i386/i386.c
===
--- gcc/config/i386/i386.c  (revision 208410)
+++ gcc/config/i386/i386.c  (working copy)
@@ -9504,6 +9504,19 @@ ix86_compute_frame_layout (struct ix86_f
   crtl->preferred_stack_boundary = 128;
   crtl->stack_alignment_needed = 128;
 }
+  /* For 64-bit target, preferred_stack_boundary is never updated for call
+ expanded from tls descriptor. Update it here. We don't update it in
+ expand stage because according to the comments before
+ ix86_current_function_calls_tls_descriptor, tls calls may be optimized
+ away.  */
+  else if (TARGET_64BIT
+  && ix86_current_function_calls_tls_descriptor
+  && crtl->preferred_stack_boundary < PREFERRED_STACK_BOUNDARY)
+{
+  crtl->preferred_stack_boundary = PREFERRED_STACK_BOUNDARY;
+  if (crtl->stack_alignment_needed < PREFERRED_STACK_BOUNDARY)
+   crtl->stack_alignment_needed = PREFERRED_STACK_BOUNDARY;
+}

   gcc_assert (!size || stack_alignment_needed);
   gcc_assert (preferred_alignment >= STACK_BOUNDARY / BITS_PER_UNIT);
Index: gcc/config/i386/i386.md
===
--- gcc/config/i386/i386.md (revision 208410)
+++ gcc/config/i386/i386.md (working copy)
@@ -12891,7 +12891,11 @@
 UNSPEC_TLS_GD))
  (clobber (match_scratch:SI 4))
  (clobber (match_scratch:SI 5))
- (clobber (reg:CC FLAGS_REG))])])
+ (clobber (reg:CC FLAGS_REG))])]
+  ""
+{
+  ix86_tls_descriptor_calls_expanded_in_cfun = true;
+})

 (define_insn "*tls_global_dynamic_64_"
   [(set (match_operand:P 0 "register_operand" "=a")
@@ -12946,7 +12950,10 @@
   (const_int 0)))
  (unspec:P [(match_operand 1 "tls_symbolic_operand")]
   UNSPEC_TLS_GD)])]
-  "TARGET_64BIT")
+  "TARGET_64BIT"
+{
+  ix86_tls_descriptor_calls_expanded_in_cfun = true;
+})

 (define_insn "*tls_local_dynamic_base_32_gnu"
   [(set (match_operand:SI 0 "register_operand" "=a")
@@ -12982,7 +12989,11 @@
UNSPEC_TLS_LD_BASE))
   (clobber (match_scratch:SI 3))
   (clobber (match_scratch:SI 4))
-  (clobber (reg:CC FLAGS_REG))])])
+  (clobber (reg:CC FLAGS_REG))])]
+  ""
+{
+  ix86_tls_descriptor_calls_expanded_in_cfun = true;
+})

 (define_insn "*tls_local_dynamic_base_64_"
   [(set (match_operand:P 0 "register_operand" "=a")
@@ -13029,7 +13040,10 @@
(mem:QI (match_operand 1))
(const_int 0)))
   (unspec:P [(const_int 0)] UNSPEC_TLS_LD_BASE)])]
-  "TARGET_64BIT")
+  "TARGET_64BIT"
+{
+  ix86_tls_descriptor_calls_expanded_in_cfun = true;
+})

 ;; Local dynamic of a single variable is a lose.  Show combine how
 ;; to convert that back to global dynamic.
Index: gcc/testsuite/g++.dg/pr58066.C
===
--- gcc/testsuite/g++.dg/pr58066.C  (revision 0)
+++ gcc/testsuite/g++.dg/pr58066.C  (revision 0)
@@ -0,0 +1,12 @@
+/* { dg-do compile { target {{ i?86-*-* x86_64-*-* } && lp64 } } } */
+/* { dg-options "-fPIC -O2 -m64" } */
+
+/* Check whether the stack frame starting address of the tls expanded call
+   in __cxa_get_globals() is 16-byte aligned.  */
+static __thread char ccc;
+extern "C" void* __cxa_get_globals() throw()
+{
+ return &ccc;
+}
+
+/* { dg-final { scan-assembler ".cfi_def_cfa_offset 16" } } */


Re: [PATCH, PR58066] preferred_stack_boundary update for tls expanded call

2014-03-07 Thread Wei Mi
Regression test is ok.

Thanks,
Wei.


On Fri, Mar 7, 2014 at 1:26 PM, Wei Mi  wrote:
> Hi,
>
> This patch is to fix the problem described here:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58066
>
> I follow Ian's suggestion and set
> ix86_tls_descriptor_calls_expanded_in_cfun in
> tls_global_dynamic_64_ and tls_local_dynamic_base_64_.
> Although 32bit doesn't have the problem,
> ix86_tls_descriptor_calls_expanded_in_cfun is also set for
> tls_global_dynamic_32 and tls_local_dynamic_base_32 to make
> ix86_tls_descriptor_calls_expanded_in_cfun setting consistent across
> 32bits and 64bits.
>
> If ix86_current_function_calls_tls_descriptor is set, we know that
> there is tls expanded call in current function. Update
> crtl->preferred_stack_boundary and crtl->stack_alignment_needed to be
> no less than PREFERED_STACK_ALIGNMENT at the start of
> ix86_compute_frame_layout. We don't do the update in
> legitimize_tls_address in cfgexpand stage, which is too early because
> according to the comments before
> ix86_current_function_calls_tls_descriptor, tls call may be optimized
> away. ix86_compute_frame_layout is the latest place to do the update.
>
> bootstrap on x86_64-linux-gnu is ok. regression test is going on. Ok
> for trunk if tests pass?
>
> Thanks,
> Wei.
>
> gcc/ChangeLog:
>
> 2014-03-07  Wei Mi  
>
> * config/i386/i386.c (ix86_compute_frame_layout): Update
> preferred_stack_boundary when there is tls expanded call.
>         * config/i386/i386.md: Set
> ix86_tls_descriptor_calls_expanded_in_cfun.
>
> gcc/testsuite/ChangeLog:
>
> 2014-03-07  Wei Mi  
>
> * g++.dg/pr58066.C: New test.
>
>
> Index: gcc/config/i386/i386.c
> ===
> --- gcc/config/i386/i386.c  (revision 208410)
> +++ gcc/config/i386/i386.c  (working copy)
> @@ -9504,6 +9504,19 @@ ix86_compute_frame_layout (struct ix86_f
>crtl->preferred_stack_boundary = 128;
>crtl->stack_alignment_needed = 128;
>  }
> +  /* For 64-bit target, preferred_stack_boundary is never updated for call
> + expanded from tls descriptor. Update it here. We don't update it in
> + expand stage because according to the comments before
> + ix86_current_function_calls_tls_descriptor, tls calls may be optimized
> + away.  */
> +  else if (TARGET_64BIT
> +  && ix86_current_function_calls_tls_descriptor
> +  && crtl->preferred_stack_boundary < PREFERRED_STACK_BOUNDARY)
> +{
> +  crtl->preferred_stack_boundary = PREFERRED_STACK_BOUNDARY;
> +  if (crtl->stack_alignment_needed < PREFERRED_STACK_BOUNDARY)
> +   crtl->stack_alignment_needed = PREFERRED_STACK_BOUNDARY;
> +}
>
>gcc_assert (!size || stack_alignment_needed);
>gcc_assert (preferred_alignment >= STACK_BOUNDARY / BITS_PER_UNIT);
> Index: gcc/config/i386/i386.md
> ===
> --- gcc/config/i386/i386.md (revision 208410)
> +++ gcc/config/i386/i386.md (working copy)
> @@ -12891,7 +12891,11 @@
>  UNSPEC_TLS_GD))
>   (clobber (match_scratch:SI 4))
>   (clobber (match_scratch:SI 5))
> - (clobber (reg:CC FLAGS_REG))])])
> + (clobber (reg:CC FLAGS_REG))])]
> +  ""
> +{
> +  ix86_tls_descriptor_calls_expanded_in_cfun = true;
> +})
>
>  (define_insn "*tls_global_dynamic_64_"
>[(set (match_operand:P 0 "register_operand" "=a")
> @@ -12946,7 +12950,10 @@
>(const_int 0)))
>   (unspec:P [(match_operand 1 "tls_symbolic_operand")]
>UNSPEC_TLS_GD)])]
> -  "TARGET_64BIT")
> +  "TARGET_64BIT"
> +{
> +  ix86_tls_descriptor_calls_expanded_in_cfun = true;
> +})
>
>  (define_insn "*tls_local_dynamic_base_32_gnu"
>[(set (match_operand:SI 0 "register_operand" "=a")
> @@ -12982,7 +12989,11 @@
> UNSPEC_TLS_LD_BASE))
>(clobber (match_scratch:SI 3))
>(clobber (match_scratch:SI 4))
> -  (clobber (reg:CC FLAGS_REG))])])
> +  (clobber (reg:CC FLAGS_REG))])]
> +  ""
> +{
> +  ix86_tls_descriptor_calls_expanded_in_cfun = true;
> +})
>
>  (define_insn "*tls_local_dynamic_base_64_"
>[(set (match_operand:P 0 "register_operand" "=a")
> @@ -13029,7 +13040,10 @@
> (mem:QI (match_operand 1))
> (const_int 0)))
>(unspec:P [(const_int 0)] UNSPEC_TLS_LD_BASE)])]
> -  "TARGET_64BIT")
> +  "TARGET_64BIT&

Re: [PATCH, PR58066] preferred_stack_boundary update for tls expanded call

2014-03-07 Thread Wei Mi
Yes, x32 has the same problem. It should be tested. Fixed.

Thanks,
Wei.


On Fri, Mar 7, 2014 at 2:06 PM, H.J. Lu  wrote:
> On Fri, Mar 7, 2014 at 1:26 PM, Wei Mi  wrote:
>> Hi,
>>
>> This patch is to fix the problem described here:
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58066
>>
>> I follow Ian's suggestion and set
>> ix86_tls_descriptor_calls_expanded_in_cfun in
>> tls_global_dynamic_64_ and tls_local_dynamic_base_64_.
>> Although 32bit doesn't have the problem,
>> ix86_tls_descriptor_calls_expanded_in_cfun is also set for
>> tls_global_dynamic_32 and tls_local_dynamic_base_32 to make
>> ix86_tls_descriptor_calls_expanded_in_cfun setting consistent across
>> 32bits and 64bits.
>>
>> If ix86_current_function_calls_tls_descriptor is set, we know that
>> there is tls expanded call in current function. Update
>> crtl->preferred_stack_boundary and crtl->stack_alignment_needed to be
>> no less than PREFERED_STACK_ALIGNMENT at the start of
>> ix86_compute_frame_layout. We don't do the update in
>> legitimize_tls_address in cfgexpand stage, which is too early because
>> according to the comments before
>> ix86_current_function_calls_tls_descriptor, tls call may be optimized
>> away. ix86_compute_frame_layout is the latest place to do the update.
>>
>> bootstrap on x86_64-linux-gnu is ok. regression test is going on. Ok
>> for trunk if tests pass?
>>
>> Thanks,
>> Wei.
>>
>> gcc/ChangeLog:
>>
>> 2014-03-07  Wei Mi  
>>
>> * config/i386/i386.c (ix86_compute_frame_layout): Update
>> preferred_stack_boundary when there is tls expanded call.
>> * config/i386/i386.md: Set
>> ix86_tls_descriptor_calls_expanded_in_cfun.
>>
>> gcc/testsuite/ChangeLog:
>>
>> 2014-03-07  Wei Mi  
>>
>> * g++.dg/pr58066.C: New test.
>>
>>
>> Index: gcc/config/i386/i386.c
>> ===
>> --- gcc/config/i386/i386.c  (revision 208410)
>> +++ gcc/config/i386/i386.c  (working copy)
>> @@ -9504,6 +9504,19 @@ ix86_compute_frame_layout (struct ix86_f
>>crtl->preferred_stack_boundary = 128;
>>crtl->stack_alignment_needed = 128;
>>  }
>> +  /* For 64-bit target, preferred_stack_boundary is never updated for call
>> + expanded from tls descriptor. Update it here. We don't update it in
>> + expand stage because according to the comments before
>> + ix86_current_function_calls_tls_descriptor, tls calls may be optimized
>> + away.  */
>> +  else if (TARGET_64BIT
>> +  && ix86_current_function_calls_tls_descriptor
>> +  && crtl->preferred_stack_boundary < PREFERRED_STACK_BOUNDARY)
>> +{
>> +  crtl->preferred_stack_boundary = PREFERRED_STACK_BOUNDARY;
>> +  if (crtl->stack_alignment_needed < PREFERRED_STACK_BOUNDARY)
>> +   crtl->stack_alignment_needed = PREFERRED_STACK_BOUNDARY;
>> +}
>>
>>gcc_assert (!size || stack_alignment_needed);
>>gcc_assert (preferred_alignment >= STACK_BOUNDARY / BITS_PER_UNIT);
>> Index: gcc/config/i386/i386.md
>> ===
>> --- gcc/config/i386/i386.md (revision 208410)
>> +++ gcc/config/i386/i386.md (working copy)
>> @@ -12891,7 +12891,11 @@
>>  UNSPEC_TLS_GD))
>>   (clobber (match_scratch:SI 4))
>>   (clobber (match_scratch:SI 5))
>> - (clobber (reg:CC FLAGS_REG))])])
>> + (clobber (reg:CC FLAGS_REG))])]
>> +  ""
>> +{
>> +  ix86_tls_descriptor_calls_expanded_in_cfun = true;
>> +})
>>
>>  (define_insn "*tls_global_dynamic_64_"
>>[(set (match_operand:P 0 "register_operand" "=a")
>> @@ -12946,7 +12950,10 @@
>>(const_int 0)))
>>   (unspec:P [(match_operand 1 "tls_symbolic_operand")]
>>UNSPEC_TLS_GD)])]
>> -  "TARGET_64BIT")
>> +  "TARGET_64BIT"
>> +{
>> +  ix86_tls_descriptor_calls_expanded_in_cfun = true;
>> +})
>>
>>  (define_insn "*tls_local_dynamic_base_32_gnu"
>>[(set (match_operand:SI 0 "register_operand" "=a")
>> @@ -12982,7 +12989,11 @@
>> UNSPEC_TLS_LD_BASE))
>>(clobber (match_scratch:SI 3))
>>(clobber (match_scratch:SI 4))
>> -  (clobber (reg:CC FLAGS_REG))])])
>> +   

[GOOGLE, PR58066] preferred_stack_boundary update for tls expanded call

2014-03-12 Thread Wei Mi
This patch is to fix the problem described here:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58066

The original patch is here:
http://gcc.gnu.org/ml/gcc-patches/2014-03/msg00369.html
The attached patch addresses HJ's comment.

Bootstrap and regression test are ok. Perf test in plain mode is ok.
Ok for google-4_8 branch?

Thanks,
Wei.


gcc/ChangeLog:

2014-03-07  Wei Mi  

* config/i386/i386.c (ix86_compute_frame_layout): Update
preferred_stack_boundary when there is tls expanded call.
* config/i386/i386.md: Set
ix86_tls_descriptor_calls_expanded_in_cfun.

gcc/testsuite/ChangeLog:

2014-03-07  Wei Mi  

* g++.dg/pr58066.C: New test.

Index: config/i386/i386.c
===
--- config/i386/i386.c  (revision 208464)
+++ config/i386/i386.c  (working copy)
@@ -9211,6 +9211,19 @@ ix86_compute_frame_layout (struct ix86_f
   crtl->preferred_stack_boundary = 128;
   crtl->stack_alignment_needed = 128;
 }
+  /* For 64-bit target, preferred_stack_boundary is never updated for call
+ expanded from tls descriptor. Update it here. We don't update it in
+ expand stage because according to the comments before
+ ix86_current_function_calls_tls_descriptor, tls calls may be optimized
+ away.  */
+  else if (TARGET_64BIT
+  && ix86_current_function_calls_tls_descriptor
+  && crtl->preferred_stack_boundary < PREFERRED_STACK_BOUNDARY)
+{
+  crtl->preferred_stack_boundary = PREFERRED_STACK_BOUNDARY;
+  if (crtl->stack_alignment_needed < PREFERRED_STACK_BOUNDARY)
+   crtl->stack_alignment_needed = PREFERRED_STACK_BOUNDARY;
+}

   gcc_assert (!size || stack_alignment_needed);
   gcc_assert (preferred_alignment >= STACK_BOUNDARY / BITS_PER_UNIT);
Index: config/i386/i386.md
===
--- config/i386/i386.md (revision 208464)
+++ config/i386/i386.md (working copy)
@@ -12776,7 +12776,11 @@
 UNSPEC_TLS_GD))
  (clobber (match_scratch:SI 4))
  (clobber (match_scratch:SI 5))
- (clobber (reg:CC FLAGS_REG))])])
+ (clobber (reg:CC FLAGS_REG))])]
+  ""
+{
+  ix86_tls_descriptor_calls_expanded_in_cfun = true;
+})

 (define_insn "*tls_global_dynamic_64_"
   [(set (match_operand:P 0 "register_operand" "=a")
@@ -12809,7 +12813,10 @@
   (const_int 0)))
  (unspec:P [(match_operand 1 "tls_symbolic_operand")]
   UNSPEC_TLS_GD)])]
-  "TARGET_64BIT")
+  "TARGET_64BIT"
+{
+  ix86_tls_descriptor_calls_expanded_in_cfun = true;
+})

 (define_insn "*tls_local_dynamic_base_32_gnu"
   [(set (match_operand:SI 0 "register_operand" "=a")
@@ -12844,7 +12851,11 @@
UNSPEC_TLS_LD_BASE))
   (clobber (match_scratch:SI 3))
   (clobber (match_scratch:SI 4))
-  (clobber (reg:CC FLAGS_REG))])])
+  (clobber (reg:CC FLAGS_REG))])]
+  ""
+{
+  ix86_tls_descriptor_calls_expanded_in_cfun = true;
+})

 (define_insn "*tls_local_dynamic_base_64_"
   [(set (match_operand:P 0 "register_operand" "=a")
@@ -12870,7 +12881,10 @@
(mem:QI (match_operand 1 "constant_call_address_operand"))
(const_int 0)))
   (unspec:P [(const_int 0)] UNSPEC_TLS_LD_BASE)])]
-  "TARGET_64BIT")
+  "TARGET_64BIT"
+{
+  ix86_tls_descriptor_calls_expanded_in_cfun = true;
+})

 ;; Local dynamic of a single variable is a lose.  Show combine how
 ;; to convert that back to global dynamic.
Index: testsuite/g++.dg/pr58066.C
===
--- testsuite/g++.dg/pr58066.C  (revision 0)
+++ testsuite/g++.dg/pr58066.C  (revision 0)
@@ -0,0 +1,12 @@
+/* { dg-do compile { target {{ i?86-*-* x86_64-*-* } && { ! ia32 } } } } */
+/* { dg-options "-fPIC -O2" } */
+
+/* Check whether the stack frame starting address of the tls expanded call
+   in __cxa_get_globals() is 16-byte aligned.  */
+static __thread char ccc;
+extern "C" void* __cxa_get_globals() throw()
+{
+ return &ccc;
+}
+
+/* { dg-final { scan-assembler ".cfi_def_cfa_offset 16" } } */


Re: [PATCH, PR58066] preferred_stack_boundary update for tls expanded call

2014-03-12 Thread Wei Mi
> There are several problems with this:
>
> 1.  It doesn't work with C.

Ok, I will change the testcase to use C.

> 2.  IA32 has the same issue and isn't fixed.

I thought IA32 didn't have the same issue because the ABI only
requires 32-bit alignment for the stack starting address.

Oh, I found the old patch
http://gcc.gnu.org/ml/gcc-patches/2006-09/msg00298.html which changed
the default alignment to 128 bits. Ok, I will remove the TARGET_64BIT
constraint.

> 3.  There is no testcase for global dynamic model.
>
> --
> H.J.

Will add the testcase.

Thanks,
Wei.


Re: [PATCH, PR58066] preferred_stack_boundary update for tls expanded call

2014-03-12 Thread Wei Mi
Hi H.J.,

Could you show me why you postpone setting
ix86_tls_descriptor_calls_expanded_in_cfun until after reload
completes, and use ix86_tls_descriptor_calls_expanded_in_cfun instead
of ix86_current_function_calls_tls_descriptor? Isn't
ix86_current_function_calls_tls_descriptor useful to handle the case
where the tls call is optimized away?

Thanks,
Wei.

On Wed, Mar 12, 2014 at 2:07 PM, H.J. Lu  wrote:
> On Wed, Mar 12, 2014 at 2:03 PM, Wei Mi  wrote:
>>> There are several problems with this:
>>>
>>> 1.  It doesn't work with C.
>>
>> Ok, I will change the testcase using C.
>>
>>> 2.  IA32 has the same issue and isn't fixed.
>>
>> I thought IA32 didn't have the same issue because abi only requires 32
>> bit alignment for stack starting address.
>>
>> oh, I found the old patch
>> http://gcc.gnu.org/ml/gcc-patches/2006-09/msg00298.html which changed
>> the default alignment to 128bit. Ok, will remove the TARGET_64BIT
>> constraint.
>>
>>> 3.  There is no testcase for global dynamic model.
>>>
>>> --
>>> H.J.
>>
>> Will add the testcase.
>>
>
> I posted a different patch in
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58066
>
> --
> H.J.


Re: [PATCH, PR58066] preferred_stack_boundary update for tls expanded call

2014-03-12 Thread Wei Mi
Oh, I see. Thanks!

Wei.

On Wed, Mar 12, 2014 at 2:42 PM, H.J. Lu  wrote:
> On Wed, Mar 12, 2014 at 2:36 PM, Wei Mi  wrote:
>> Hi H.J.,
>>
>> Could you show me why you postpone the setting
>> ix86_tls_descriptor_calls_expanded_in_cfun until reload_complete and
>> use ix86_tls_descriptor_calls_expanded_in_cfun instead of
>> ix86_current_function_calls_tls_descriptor? Isn't
>> ix86_current_function_calls_tls_descriptor useful to consider the case
>> that tls call is optimized away?
>>
>
> When a tls call is optimized away, it won't survive reload.
> If it does survive reload, it isn't optimized away.  Also
> checking df_regs_ever_live_p (SP_REG) isn't reliable
> when called from ix86_compute_frame_layout.
>
> --
> H.J.


Re: [PATCH, PR58066] preferred_stack_boundary update for tls expanded call

2014-03-12 Thread Wei Mi
This is the updated testcase.

Thanks,
Wei.

===
--- testsuite/gcc.dg/pr58066.c (revision 0)
+++ testsuite/gcc.dg/pr58066.c (revision 0)
@@ -0,0 +1,18 @@
+/* { dg-do compile { target {{ i?86-*-* x86_64-*-* } && { ! ia32 } } } } */
+/* { dg-options "-fPIC -O2" } */
+
+/* Check whether the stack frame starting addresses of the tls expanded calls
+   in foo and goo are 16-byte aligned.  */
+static __thread char ccc1;
+void* foo()
+{
+ return &ccc1;
+}
+
+__thread char ccc2;
+void* goo()
+{
+ return &ccc2;
+}
+
+/* { dg-final { scan-assembler-times ".cfi_def_cfa_offset 16" 2 } } */

On Wed, Mar 12, 2014 at 2:51 PM, Wei Mi  wrote:
> Oh, I see. Thanks!
>
> Wei.
>
> On Wed, Mar 12, 2014 at 2:42 PM, H.J. Lu  wrote:
>> On Wed, Mar 12, 2014 at 2:36 PM, Wei Mi  wrote:
>>> Hi H.J.,
>>>
>>> Could you show me why you postpone the setting
>>> ix86_tls_descriptor_calls_expanded_in_cfun until reload_complete and
>>> use ix86_tls_descriptor_calls_expanded_in_cfun instead of
>>> ix86_current_function_calls_tls_descriptor? Isn't
>>> ix86_current_function_calls_tls_descriptor useful to consider the case
>>> that tls call is optimized away?
>>>
>>
>> When a tls call is optimized away, it won't survive reload.
>> If it does survive reload, it isn't optimized away.  Also
>> checking df_regs_ever_live_p (SP_REG) isn't reliable
>> when called from ix86_compute_frame_layout.
>>
>> --
>> H.J.


Re: [PATCH, PR58066] preferred_stack_boundary update for tls expanded call

2014-03-12 Thread Wei Mi
On Wed, Mar 12, 2014 at 3:07 PM, H.J. Lu  wrote:
> On Wed, Mar 12, 2014 at 2:58 PM, Wei Mi  wrote:
>> This is the updated testcase.
>
> Does my patch fix the original problem?

Yes, it works. I am doing bootstrap and regression test for your patch. Thanks!

>
>> Thanks,
>> Wei.
>>
>> ===
>> --- testsuite/gcc.dg/pr58066.c (revision 0)
>> +++ testsuite/gcc.dg/pr58066.c (revision 0)
>> @@ -0,0 +1,18 @@
>> +/* { dg-do compile { target {{ i?86-*-* x86_64-*-* } && { ! ia32 } } } } */
>
> Since it is a C testcase and we should test it under ia32, it
> should be moved to gcc.target/i386 and remove target.
>

Fixed.

Thanks,
Wei.

Index: testsuite/gcc.target/i386/pr58066.c
===
--- testsuite/gcc.target/i386/pr58066.c (revision 0)
+++ testsuite/gcc.target/i386/pr58066.c (revision 0)
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-fPIC -O2" } */
+
+/* Check whether the stack frame starting addresses of the tls expanded calls
+   in foo and goo are 16-byte aligned.  */
+static __thread char ccc1;
+void* foo()
+{
+ return &ccc1;
+}
+
+__thread char ccc2;
+void* goo()
+{
+ return &ccc2;
+}
+
+/* { dg-final { scan-assembler-times ".cfi_def_cfa_offset 16" 2 } } */


Re: [PATCH, PR58066] preferred_stack_boundary update for tls expanded call

2014-03-12 Thread Wei Mi
>> Does my patch fix the original problem?
>
> Yes, it works. I am doing bootstrap and regression test for your patch. 
> Thanks!
>

The patch passes bootstrap and regression test on x86_64-linux-gnu.

Thanks,
Wei.


Re: [PATCH, PR58066] preferred_stack_boundary update for tls expanded call

2014-03-12 Thread Wei Mi
I saw the problem the last patch had on ia32. Without an explicit
call in the rtl template, the scheduler may schedule the sp-adjusting
insn across the tls descriptor call and break the alignment
assumption. I am testing the updated patch on x86_64.

Can we combine the last two patches, i.e., both add the call
explicitly in the rtl templates for
tls_local_dynamic_base_32/tls_global_dynamic_32 and set
ix86_tls_descriptor_calls_expanded_in_cfun to true only after reload
completes?

Regards,
Wei.

On Wed, Mar 12, 2014 at 5:33 PM, H.J. Lu  wrote:
> On Wed, Mar 12, 2014 at 5:28 PM, Wei Mi  wrote:
>>>> Does my patch fix the original problem?
>>>
>>> Yes, it works. I am doing bootstrap and regression test for your patch. 
>>> Thanks!
>>>
>>
>> The patch passes bootstrap and regression test on x86_64-linux-gnu.
>>
>
> My patch fails to handle ia32.  Here is the updated one.
>
> --
> H.J.


Re: [PATCH, PR58066] preferred_stack_boundary update for tls expanded call

2014-03-13 Thread Wei Mi
pr58066-2.patch worked for pr58066.c on ia32/x32/x86_64, but it
failed to bootstrap.

/usr/local/google/home/wmi/workarea/gcc-r208410-2/build/./gcc/xgcc
-B/usr/local/google/home/wmi/workarea/gcc-r208410-2/build/./gcc/
-B/usr/local/google/home/wmi/workarea/gcc-r208410-2/build/install/x86_64-unknown-linux-gnu/bin/
-B/usr/local/google/home/wmi/workarea/gcc-r208410-2/build/install/x86_64-unknown-linux-gnu/lib/
-isystem 
/usr/local/google/home/wmi/workarea/gcc-r208410-2/build/install/x86_64-unknown-linux-gnu/include
-isystem 
/usr/local/google/home/wmi/workarea/gcc-r208410-2/build/install/x86_64-unknown-linux-gnu/sys-include
   -g -O2 -m64 -O2  -g -O2 -DIN_GCC -W -Wall -Wwrite-strings
-Wcast-qual -Wno-format -Wstrict-prototypes -Wmissing-prototypes
-Wold-style-definition  -isystem ./include   -fpic -mlong-double-80 -g
-DIN_LIBGCC2 -fbuilding-libgcc -fno-stack-protector   -fpic
-mlong-double-80 -I. -I. -I../../.././gcc -I../../../../src/libgcc
-I../../../../src/libgcc/. -I../../../../src/libgcc/../gcc
-I../../../../src/libgcc/../include
-I../../../../src/libgcc/config/libbid -DENABLE_DECIMAL_BID_FORMAT
-DHAVE_CC_TLS  -DUSE_TLS -o bid_decimal_globals.o -MT
bid_decimal_globals.o -MD -MP -MF bid_decimal_globals.dep -c
../../../../src/libgcc/config/libbid/bid_decimal_globals.c

(call_insn 5 2 6 2 (parallel [
(set (reg/f:SI 85)
(call:SI (mem:QI (symbol_ref:SI ("___tls_get_addr")) [0  S1 A8])
(const_int 0 [0])))
(unspec:SI [
(reg:SI 3 bx)
(symbol_ref:SI ("__bid_IDEC_glbflags") [flags
0x10]  )
] UNSPEC_TLS_GD)
(clobber (reg:SI 91))
(clobber (reg:SI 92))
(clobber (reg:CC 17 flags))
]) ../../../../src/libgcc/config/libbid/bid_decimal_globals.c:51
772 {*tls_global_dynamic_32_gnu}
 (expr_list:REG_UNUSED (reg:SI 92)
(expr_list:REG_UNUSED (reg:SI 91)
(nil)))
(nil))
../../../../src/libgcc/config/libbid/bid_decimal_globals.c:52:1:
internal compiler error: in curr_insn_transform, at
lra-constraints.c:3262
0xad8453 _fatal_insn(char const*, rtx_def const*, char const*, int, char const*)
../../src/gcc/rtl-error.c:109
0x9d1221 curr_insn_transform
../../src/gcc/lra-constraints.c:3262
0x9d40e4 lra_constraints(bool)
../../src/gcc/lra-constraints.c:4157
0x9c0ad8 lra(_IO_FILE*)
../../src/gcc/lra.c:2340
0x96e310 do_reload
../../src/gcc/ira.c:5457
0x96e622 rest_of_handle_reload
../../src/gcc/ira.c:5598
0x96e66c execute
../../src/gcc/ira.c:5627

The problem is that the return value of the call may be assigned to a
hardreg other than AX_REG, but LRA cannot reload the output operand of
a call. The fix is to change the above pattern to the following one in
legitimize_tls_address() in config/i386/i386.c.

(call_insn/u 5 4 6 (parallel [
(set (reg:SI 0 ax)
(call:SI (mem:QI (symbol_ref:SI ("___tls_get_addr")) [0  S1 A8])
(const_int 0 [0])))
(unspec:SI [
(reg:SI 3 bx)
(symbol_ref:SI ("__bid_IDEC_glbflags") [flags
0x10]  )
] UNSPEC_TLS_GD)
(clobber (scratch:SI))
(clobber (scratch:SI))
(clobber (reg:CC 17 flags))
]) ../../../../src/libgcc/config/libbid/bid_decimal_globals.c:51 -1
 (expr_list:REG_EH_REGION (const_int -2147483648 [0x8000])
(nil))
(nil))

(insn 6 5 7 (set (reg/f:SI 85)
(reg:SI 0 ax))
../../../../src/libgcc/config/libbid/bid_decimal_globals.c:51 -1
 (expr_list:REG_EQUAL (symbol_ref:SI ("__bid_IDEC_glbflags")
[flags 0x10]  )
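
A rough sketch of the corresponding legitimize_tls_address () shape,
modeled on the existing 64-bit path (the function name and the exact
gen_tls_global_dynamic_32 signature here are illustrative; the
attached patch is the authoritative version):

/* "x" is the TLS symbol, "pic" the GOT register (%ebx).  The call
   sets hard reg %eax directly, and emit_libcall_block copies it into
   a fresh pseudo, so LRA never has to reload the output operand of
   the call.  */
static rtx
expand_tls_gd_32 (rtx x, rtx pic)
{
  rtx caddr = ix86_tls_get_addr ();
  rtx ax = gen_rtx_REG (Pmode, AX_REG);
  rtx dest = gen_reg_rtx (Pmode);
  rtx insns;

  start_sequence ();
  emit_call_insn (gen_tls_global_dynamic_32 (ax, x, pic, caddr));
  insns = get_insns ();
  end_sequence ();

  emit_libcall_block (insns, dest, ax, x);
  return dest;
}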

After the problem is fixed, bootstrap and regression test on x86-64 are ok.

Thanks,
Wei.
Index: config/i386/i386.md
===
--- config/i386/i386.md	(revision 208410)
+++ config/i386/i386.md	(working copy)
@@ -12859,13 +12859,14 @@
 
 (define_insn "*tls_global_dynamic_32_gnu"
   [(set (match_operand:SI 0 "register_operand" "=a")
-	(unspec:SI
-	 [(match_operand:SI 1 "register_operand" "b")
-	  (match_operand 2 "tls_symbolic_operand")
-	  (match_operand 3 "constant_call_address_operand" "z")]
-	 UNSPEC_TLS_GD))
-   (clobber (match_scratch:SI 4 "=d"))
-   (clobber (match_scratch:SI 5 "=c"))
+	(call:SI
+	 (mem:QI (match_operand 3 "constant_call_address_operand" "z"))
+	 (match_operand 4)))
+   (unspec:SI [(match_operand:SI 1 "register_operand" "b")
+	   (match_operand 2 "tls_symbolic_operand")]
+	  UNSPEC_TLS_GD)
+   (clobber (match_scratch:SI 5 "=d"))
+   (clobber (match_scratch:SI 6 "=c"))
(clobber (reg:CC FLAGS_REG))]
   "!TARGET_64BIT && TARGET_GNU_TLS"
 {
@@ -12885,13 +12886,19 @@
 (define_expand "tls_global_dynamic_32"
   [(parallel
 [(set (match_operand:SI 0 "register_operand")
-	  (unspec:SI [(match_operand:SI 2 "register_operand")
-		  (match_operand 1 "tls_symbolic_operand")
-		  (match_operand 3 "constant_call_address_operand")]
-		 UN

Re: [PATCH, PR58066] preferred_stack_boundary update for tls expanded call

2014-03-13 Thread Wei Mi
>
> My ia32 change generates much worse code:
>
> [hjl@gnu-6 gcc]$ cat /tmp/c.i
> static __thread char ccc, bbb;
>
> int __cxa_get_globals()
> {
>  return &ccc - &bbb;
> }
> [hjl@gnu-6 gcc]$ ./xgcc -B./ -S -O2 -fPIC /tmp/c.i
> [hjl@gnu-6 gcc]$ cat c.s
> .file "c.i"
> .section .text.unlikely,"ax",@progbits
> .LCOLDB0:
> .text
> .LHOTB0:
> .p2align 4,,15
> .globl __cxa_get_globals
> .type __cxa_get_globals, @function
> __cxa_get_globals:
> .LFB0:
> .cfi_startproc
> subq $8, %rsp
> .cfi_def_cfa_offset 16
> leaq ccc@tlsld(%rip), %rdi
> call __tls_get_addr@PLT
> addq $8, %rsp
> .cfi_def_cfa_offset 8
> leaq ccc@dtpoff(%rax), %rcx
> leaq bbb@dtpoff(%rax), %rdx
> movq %rcx, %rax
> subq %rdx, %rax
> ret
> .cfi_endproc
> .LFE0:
> .size __cxa_get_globals, .-__cxa_get_globals
> .section .text.unlikely
> .LCOLDE0:
> .text
> .LHOTE0:
> .section .tbss,"awT",@nobits
> .type bbb, @object
> .size bbb, 1
> bbb:
> .zero 1
> .type ccc, @object
> .size ccc, 1
> ccc:
> .zero 1
> .ident "GCC: (GNU) 4.9.0 20140312 (experimental)"
> .section .note.GNU-stack,"",@progbits
> [hjl@gnu-6 gcc]$ cat /tmp/c.i
> static __thread char ccc, bbb;
>
> int __cxa_get_globals()
> {
>  return &ccc - &bbb;
> }
> [hjl@gnu-6 gcc]$ ./xgcc -B./ -S -O2 -fPIC /tmp/c.i -m32
> [hjl@gnu-6 gcc]$ cat c.s
> .file "c.i"
> .section .text.unlikely,"ax",@progbits
> .LCOLDB0:
> .text
> .LHOTB0:
> .p2align 4,,15
> .globl __cxa_get_globals
> .type __cxa_get_globals, @function
> __cxa_get_globals:
> .LFB0:
> .cfi_startproc
> pushl %esi
> .cfi_def_cfa_offset 8
> .cfi_offset 6, -8
> pushl %ebx
> .cfi_def_cfa_offset 12
> .cfi_offset 3, -12
> call __x86.get_pc_thunk.bx
> addl $_GLOBAL_OFFSET_TABLE_, %ebx
> subl $4, %esp
> .cfi_def_cfa_offset 16
> leal ccc@tlsldm(%ebx), %eax
> call ___tls_get_addr@PLT
> leal ccc@dtpoff(%eax), %esi
> leal ccc@tlsldm(%ebx), %eax
> call ___tls_get_addr@PLT
> addl $4, %esp
> .cfi_def_cfa_offset 12
> leal bbb@dtpoff(%eax), %eax
> popl %ebx
> .cfi_restore 3
> .cfi_def_cfa_offset 8
> subl %eax, %esi
> movl %esi, %eax
> popl %esi
> .cfi_restore 6
> .cfi_def_cfa_offset 4
> ret
> .cfi_endproc
>
> Maybe we should keep the original patterns and
> split them to add CALL.
>
> --
> H.J.

I tried pr58066-3.patch on the above testcase, and the code it
generated seems ok. I think after we change the 32-bit patterns in
i386.md to be similar to the 64-bit ones, we should change the 32-bit
expand in legitimize_tls_address to be similar to the 64-bit expand
too?

Thanks,
Wei.

~/workarea/gcc-r208410-2/build/install/bin/gcc -m32 -S -fPIC 1.c

.file   "1.c"
        .section        .tbss,"awT",@nobits
.type   ccc, @object
.size   ccc, 1
ccc:
.zero   1
.type   bbb, @object
.size   bbb, 1
bbb:
.zero   1
.text
.globl  __cxa_get_globals
.type   __cxa_get_globals, @function
__cxa_get_globals:
.LFB0:
.cfi_startproc
pushl   %ebp
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
        movl    %esp, %ebp
.cfi_def_cfa_register 5
pushl   %esi
pushl   %ebx
.cfi_offset 6, -12
.cfi_offset 3, -16
        call    __x86.get_pc_thunk.bx
        addl    $_GLOBAL_OFFSET_TABLE_, %ebx
        leal    ccc@tlsgd(,%ebx,1), %eax
        call    ___tls_get_addr@PLT
        movl    %eax, %esi
        leal    bbb@tlsgd(,%ebx,1), %eax
        call    ___tls_get_addr@PLT
        subl    %eax, %esi
        movl    %esi, %eax
        popl    %ebx
        .cfi_restore 3
        popl    %esi
        .cfi_restore 6
        popl    %ebp
        .cfi_restore 5
.cfi_def_cfa 4, 4
ret
.cfi_endproc
.LFE0:
.size   __cxa_get_globals, .-__cxa_get_globals
        .section        .text.__x86.get_pc_thunk.bx,"axG",@progbits,__x86.get_pc_thunk.bx,comdat
.globl  __x86.get_pc_thunk.bx
.hidden __x86.get_pc_thunk.bx
.type   __x86.get_pc_thunk.bx, @function
__x86.get_pc_thunk.bx:
.LFB1:
.cfi_startproc
        movl    (%esp), %ebx
ret
.cfi_endproc
.LFE1:
.ident  "GCC: (GNU) 4.9.0 20140307 (experimental)"
.section.note.GNU-stack,"",@progbits


Re: [PATCH, PR58066] preferred_stack_boundary update for tls expanded call

2014-03-13 Thread Wei Mi
> I tried pr58066-3.patch on the above testcase, the code it generated
> seems ok. I think after we change the 32bits pattern in i386.md to be
> similar as 64bits pattern, we should change 32bit expand to be similar
> as 64bit expand in legitimize_tls_address too?
>
> Thanks,
> Wei.
>

Sorry, I pasted the wrong code. This is the code generated by pr58066-3.patch.
wmi@miwei:/tmp$ cat 1.c
static __thread char ccc, bbb;

int __cxa_get_globals()
{
 return &ccc - &bbb;
}

wmi@miwei:/tmp$ ~/workarea/gcc-r208410-2/build/install/bin/gcc -O2
-fPIC -m32 -S 1.c
wmi@miwei:/tmp$ cat 1.s
.file   "1.c"
        .section        .text.unlikely,"ax",@progbits
.LCOLDB0:
.text
.LHOTB0:
.p2align 4,,15
.globl  __cxa_get_globals
.type   __cxa_get_globals, @function
__cxa_get_globals:
.LFB0:
.cfi_startproc
pushl   %ebx
.cfi_def_cfa_offset 8
.cfi_offset 3, -8
        call    __x86.get_pc_thunk.bx
        addl    $_GLOBAL_OFFSET_TABLE_, %ebx
        subl    $8, %esp
        .cfi_def_cfa_offset 16
        leal    ccc@tlsldm(%ebx), %eax
        call    ___tls_get_addr@PLT
        addl    $8, %esp
        .cfi_def_cfa_offset 8
        leal    ccc@dtpoff(%eax), %edx
        leal    bbb@dtpoff(%eax), %eax
        popl    %ebx
        .cfi_restore 3
        .cfi_def_cfa_offset 4
        subl    %eax, %edx
        movl    %edx, %eax
ret
.cfi_endproc
.LFE0:
.size   __cxa_get_globals, .-__cxa_get_globals
        .section        .text.unlikely
.LCOLDE0:
.text
.LHOTE0:
        .section        .tbss,"awT",@nobits
.type   bbb, @object
.size   bbb, 1
bbb:
.zero   1
.type   ccc, @object
.size   ccc, 1
ccc:
.zero   1
        .section        .text.__x86.get_pc_thunk.bx,"axG",@progbits,__x86.get_pc_thunk.bx,comdat
.globl  __x86.get_pc_thunk.bx
.hidden __x86.get_pc_thunk.bx
.type   __x86.get_pc_thunk.bx, @function
__x86.get_pc_thunk.bx:
.LFB1:
.cfi_startproc
        movl    (%esp), %ebx
ret
.cfi_endproc
.LFE1:
.ident  "GCC: (GNU) 4.9.0 20140307 (experimental)"
        .section        .note.GNU-stack,"",@progbits


Re: [PATCH, PR58066] preferred_stack_boundary update for tls expanded call

2014-03-13 Thread Wei Mi
> Can we combine the last two patches, both adding call explicitly in
> rtl template for tls_local_dynamic_base_32/tls_global_dynamic_32, and
> set ix86_tls_descriptor_calls_expanded_in_cfun to true only after
> reload complete?
>

Hi H.J.

I attached the patch which combined your two patches and the fix in
legitimize_tls_address. I tried pr58066.c and c.i in ia32/x32/x86_64,
the code looked fine. Do you think it is ok?

Thanks,
Wei.
Index: config/i386/i386.c
===
--- config/i386/i386.c	(revision 208410)
+++ config/i386/i386.c	(working copy)
@@ -9082,7 +9082,7 @@ ix86_frame_pointer_required (void)
  we've not got a leaf function.  */
   if (TARGET_OMIT_LEAF_FRAME_POINTER
   && (!crtl->is_leaf
-	  || ix86_current_function_calls_tls_descriptor))
+	  || ix86_tls_descriptor_calls_expanded_in_cfun))
 return true;
 
   if (crtl->profile && !flag_fentry)
@@ -9331,7 +9331,7 @@ ix86_select_alt_pic_regnum (void)
 {
   if (crtl->is_leaf
   && !crtl->profile
-  && !ix86_current_function_calls_tls_descriptor)
+  && !ix86_tls_descriptor_calls_expanded_in_cfun)
 {
   int i, drap;
   /* Can't use the same register for both PIC and DRAP.  */
@@ -9490,20 +9490,28 @@ ix86_compute_frame_layout (struct ix86_f
   frame->nregs = ix86_nsaved_regs ();
   frame->nsseregs = ix86_nsaved_sseregs ();
 
-  stack_alignment_needed = crtl->stack_alignment_needed / BITS_PER_UNIT;
-  preferred_alignment = crtl->preferred_stack_boundary / BITS_PER_UNIT;
-
   /* 64-bit MS ABI seem to require stack alignment to be always 16 except for
  function prologues and leaf.  */
-  if ((TARGET_64BIT_MS_ABI && preferred_alignment < 16)
+  if ((TARGET_64BIT_MS_ABI && crtl->preferred_stack_boundary < 128)
   && (!crtl->is_leaf || cfun->calls_alloca != 0
-  || ix86_current_function_calls_tls_descriptor))
+  || ix86_tls_descriptor_calls_expanded_in_cfun))
 {
-  preferred_alignment = 16;
-  stack_alignment_needed = 16;
   crtl->preferred_stack_boundary = 128;
   crtl->stack_alignment_needed = 128;
 }
+  /* preferred_stack_boundary is never updated for call expanded from
+ tls descriptor. Update it here. We don't update it in expand stage
+ because tls calls may be optimized away.  */
+  else if (ix86_tls_descriptor_calls_expanded_in_cfun
+  && crtl->preferred_stack_boundary < PREFERRED_STACK_BOUNDARY)
+{
+  crtl->preferred_stack_boundary = PREFERRED_STACK_BOUNDARY;
+  if (crtl->stack_alignment_needed < PREFERRED_STACK_BOUNDARY)
+   crtl->stack_alignment_needed = PREFERRED_STACK_BOUNDARY;
+}
+
+  stack_alignment_needed = crtl->stack_alignment_needed / BITS_PER_UNIT;
+  preferred_alignment = crtl->preferred_stack_boundary / BITS_PER_UNIT;
 
   gcc_assert (!size || stack_alignment_needed);
   gcc_assert (preferred_alignment >= STACK_BOUNDARY / BITS_PER_UNIT);
@@ -9608,7 +9616,7 @@ ix86_compute_frame_layout (struct ix86_f
   || size != 0
   || !crtl->is_leaf
   || cfun->calls_alloca
-  || ix86_current_function_calls_tls_descriptor)
+  || ix86_tls_descriptor_calls_expanded_in_cfun)
 offset = (offset + stack_alignment_needed - 1) & -stack_alignment_needed;
 
   /* Frame pointer points here.  */
@@ -9623,7 +9631,7 @@ ix86_compute_frame_layout (struct ix86_f
  of stack frame are unused.  */
   if (ACCUMULATE_OUTGOING_ARGS
   && (!crtl->is_leaf || cfun->calls_alloca
-	  || ix86_current_function_calls_tls_descriptor))
+	  || ix86_tls_descriptor_calls_expanded_in_cfun))
 {
   offset += crtl->outgoing_args_size;
   frame->outgoing_arguments_size = crtl->outgoing_args_size;
@@ -9634,7 +9642,7 @@ ix86_compute_frame_layout (struct ix86_f
   /* Align stack boundary.  Only needed if we're calling another function
  or using alloca.  */
   if (!crtl->is_leaf || cfun->calls_alloca
-  || ix86_current_function_calls_tls_descriptor)
+  || ix86_tls_descriptor_calls_expanded_in_cfun)
 offset = (offset + preferred_alignment - 1) & -preferred_alignment;
 
   /* We've reached end of stack frame.  */
@@ -9650,7 +9658,7 @@ ix86_compute_frame_layout (struct ix86_f
   if (ix86_using_red_zone ()
   && crtl->sp_is_unchanging
   && crtl->is_leaf
-  && !ix86_current_function_calls_tls_descriptor)
+  && !ix86_tls_descriptor_calls_expanded_in_cfun)
 {
   frame->red_zone_size = to_allocate;
   if (frame->save_regs_using_mov)
@@ -10623,7 +10631,7 @@ ix86_finalize_stack_realign_flags (void)
   && crtl->is_leaf
   && flag_omit_frame_pointer
   && crtl->sp_is_unchanging
-  && !ix86_current_function_calls_tls_descriptor
+  && !ix86_tls_descriptor_calls_expanded_in_cfun
   && !crtl->accesses_prior_frames
   && !cfun->calls_alloca
   && !crtl->calls_eh_return
@@ -13437,26 +13445,25 @@ legitimize_tls_address (rtx x, enum tls_
   else
 	{
 	  rtx caddr = ix86_tls_get_addr ();
+	  rtx ax = gen_rtx_REG (Pmo

Fwd: [GOOGLE, AUTOFDO] Assign different discriminators to calls with the same lineno

2014-08-06 Thread Wei Mi
(Sorry if you received the mail twice because it was not plain text at
first and was rejected by @sourceware.org)

We saw bb like this in the IR dump after pass_build_cfg:

  <bb ...>:
  [1.cc : 205:45] D.332088 = table->_vptr.Table;
  [1.cc : 205:45] D.332134 = D.332088 + 104;
  [1.cc : 205:45] D.332135 = [1.cc : 205] *D.332134;
  [1.cc : 205:45] D.332092 = [1.cc : 205] &this->cp_stream_;
  [1.cc : 205:46] OBJ_TYPE_REF(D.332135;(struct Table)table->13)
(table, cp_arg, D.332092);  // indirect call
  [1.cc : 179:64] Reader::~Reader (&version);
  [1.cc : 205:46] Switcher::~Switcher (&tcswr);

The indirect call above has the same source lineno with
"Switcher::~Switcher (&tcswr);", but they have no discriminator so
they cannot be discriminated in autofdo. This causes the problem that
autofdo mistakenly regards "Switcher::~Switcher (&tcswr);" as a target
of the indirect call above, and makes a wrong promotion.

The existing code has logic to assign different discriminators to
calls with the same lineno, but it only works when the linenos in a bb
are monotonically increasing. In this case, there is another stmt with
lineno 179 between the indirect call and "Switcher::~Switcher
(&tcswr);" (both with lineno 205), so the existing code will not
assign different discriminators to them.

The patch assigns different discriminators to calls with the same
lineno regardless of ordering.
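
A minimal illustration of the kind of source that trips this
(hypothetical names, not from the failing benchmark):

/* Both calls below share one source line.  Without distinct
   discriminators, autofdo cannot tell profile samples (or
   indirect-call targets) attributed to that line apart, which is how
   the destructor above got mistaken for the indirect call's target.  */
void (*handler) (void);
extern void cleanup (void);

void
f (void)
{
  handler (); cleanup ();  /* same lineno, two call sites */
}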

Regression test is running. Internal perf test for autofdo shows a
little improvement. Ok for google-4_9 if regression passes?

Thanks,
Wei.

ChangeLog:

2014-08-06  Wei Mi  

* tree-cfg.c (increase_discriminator_for_locus): Renamed from
next_discriminator_for_locus. Add a param "return_next".
(assign_discriminator): Use the renamed func.
(assign_discriminators): Assign different discriminators
for calls with the same lineno.


Index: tree-cfg.c
===
--- tree-cfg.c  (revision 213402)
+++ tree-cfg.c  (working copy)
@@ -914,10 +914,12 @@ make_edges (void)
 /* Find the next available discriminator value for LOCUS.  The
discriminator distinguishes among several basic blocks that
share a common locus, allowing for more accurate sample-based
-   profiling.  */
+   profiling. If RETURN_NEXT is true, return the discriminator
+   value after the increase, else return the discriminator value
+   before the increase.  */

 static int
-next_discriminator_for_locus (location_t locus)
+increase_discriminator_for_locus (location_t locus, bool return_next)
 {
   struct locus_discrim_map item;
   struct locus_discrim_map **slot;
@@ -934,8 +936,10 @@ next_discriminator_for_locus (location_t
   (*slot)->locus = locus;
   (*slot)->discriminator = 0;
 }
+
   (*slot)->discriminator++;
-  return (*slot)->discriminator;
+  return return_next ? (*slot)->discriminator
+: (*slot)->discriminator - 1;
 }

 /* Return TRUE if LOCUS1 and LOCUS2 refer to the same source line.  */
@@ -974,7 +978,7 @@ assign_discriminator (location_t locus,
   if (locus == UNKNOWN_LOCATION)
 return;

-  discriminator = next_discriminator_for_locus (locus);
+  discriminator = increase_discriminator_for_locus (locus, true);

   for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
 {
@@ -1009,23 +1013,16 @@ assign_discriminators (void)
   for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
{
  gimple stmt = gsi_stmt (gsi);
- if (curr_locus == UNKNOWN_LOCATION)
-   {
- curr_locus = gimple_location (stmt);
-   }
- else if (!same_line_p (curr_locus, gimple_location (stmt)))
+ if (gimple_code (stmt) == GIMPLE_CALL)
{
  curr_locus = gimple_location (stmt);
- curr_discr = 0;
-   }
- else if (curr_discr != 0)
-   {
- gimple_set_location (stmt, location_with_discriminator (
- gimple_location (stmt), curr_discr));
+ /* return the current discriminator first, then increase the
+discriminator for next call.  */
+ curr_discr = increase_discriminator_for_locus (curr_locus, false);
+ if (curr_discr != 0)
+   gimple_set_location (stmt, location_with_discriminator (
+   gimple_location (stmt), curr_discr));
}
- /* Allocate a new discriminator for CALL stmt.  */
- if (gimple_code (stmt) == GIMPLE_CALL)
-   curr_discr = next_discriminator_for_locus (curr_locus);
}

   if (locus == UNKNOWN_LOCATION)


Re: [GOOGLE, AUTOFDO] Assign different discriminators to calls with the same lineno

2014-08-07 Thread Wei Mi
Yes, that is intentional. It is to avoid assigning a discriminator to
the first call in a group of calls with the same source lineno.
Starting from the second call in the group, each call gets a different
discriminator from the previous call in the same group.
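
Concretely, a sketch of the scheme under the patch (assuming the map
entry starts at 0 and is post-incremented):

/* Three calls sharing one locus observe 0, 1, 2 in turn; only the
   second and third get their location rewritten, so the first call
   keeps the plain locus with discriminator 0.  */
int d1 = increase_discriminator_for_locus (locus, false); /* 0 */
int d2 = increase_discriminator_for_locus (locus, false); /* 1 */
int d3 = increase_discriminator_for_locus (locus, false); /* 2 */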

Thanks,
Wei.

On Thu, Aug 7, 2014 at 12:17 PM, Cary Coutant  wrote:
>>  static int
>> -next_discriminator_for_locus (location_t locus)
>> +increase_discriminator_for_locus (location_t locus, bool return_next)
>>  {
>>struct locus_discrim_map item;
>>struct locus_discrim_map **slot;
>> @@ -934,8 +936,10 @@ next_discriminator_for_locus (location_t
>>(*slot)->locus = locus;
>>(*slot)->discriminator = 0;
>>  }
>> +
>>(*slot)->discriminator++;
>> -  return (*slot)->discriminator;
>> +  return return_next ? (*slot)->discriminator
>> +: (*slot)->discriminator - 1;
>>  }
>
> Won't this have the effect of sometimes incrementing the next
> available discriminator without actually using the new value? That is,
> if you call it once with return_next == false, and then with
> return_next == true.
>
> -cary


Re: [GOOGLE, AUTOFDO] Assign different discriminators to calls with the same lineno

2014-08-07 Thread Wei Mi
No, it is not. This IR is dumped before early inline -- just after
pass_build_cfg. The line number of the deconstructor is marked
according to where its constructor is located, instead of where it is
inserted. This is also problematic.

Wei.

On Thu, Aug 7, 2014 at 12:11 PM, Xinliang David Li  wrote:
> Is this
>
> [1.cc : 179:64] Reader::~Reader (&version);
>
> from an inline instance?
>
> David
>
> On Wed, Aug 6, 2014 at 10:18 AM, Wei Mi  wrote:
>> We saw bb like this in the IR dump after pass_build_cfg:
>>
>>   :
>>   [1.cc : 205:45] D.332088 = table->_vptr.Table;
>>   [1.cc : 205:45] D.332134 = D.332088 + 104;
>>   [1.cc : 205:45] D.332135 = [1.cc : 205] *D.332134;
>>   [1.cc : 205:45] D.332092 = [1.cc : 205] &this->cp_stream_;
>>   [1.cc : 205:46] OBJ_TYPE_REF(D.332135;(struct Table)table->13) (table,
>> cp_arg, D.332092);  // indirect call
>>   [1.cc : 179:64] Reader::~Reader (&version);
>>   [1.cc : 205:46] Switcher::~Switcher (&tcswr);
>>
>> The indirect call above has the same source lineno with "Switcher::~Switcher
>> (&tcswr);", but they have no discriminator so they cannot be discriminated
>> in autofdo. This causes the problem that autofdo mistakenly regards
>> "Switcher::~Switcher (&tcswr);" as a target of the indirect call above, and
>> makes a wrong promotion.
>>
>> The existing code has the logic to assign different discriminators to calls
>> with the same lineno, but it only works when the lineno in a bb is
>> monotonical. In this case, there is another stmt with lineno 179 between the
>> indirect call and "Switcher::~Switcher (&tcswr);" (both with lineno 205), so
>> existing code will not assign different discriminators for them.
>>
>> The patch is to assign discriminators for calls with the same lineno anyway.
>>
>> regression test is going. internal perf test for autofdo shows a little
>> improvement. Ok for google-4_9 if regression pass?
>>
>> Thanks,
>> Wei.
>>
>> ChangeLog:
>>
>> 2014-08-06  Wei Mi  
>>
>> * tree-cfg.c (increase_discriminator_for_locus): It was
>> next_discriminator_for_locus. Add a param "return_next".
>> (next_discriminator_for_locus): Renamed.
>> (assign_discriminator): Use the renamed func.
>> (assign_discriminators): Assign different discriminators
>> for calls with the same lineno.
>>
>>
>> Index: tree-cfg.c
>> ===
>> --- tree-cfg.c  (revision 213402)
>> +++ tree-cfg.c  (working copy)
>> @@ -914,10 +914,12 @@ make_edges (void)
>>  /* Find the next available discriminator value for LOCUS.  The
>> discriminator distinguishes among several basic blocks that
>> share a common locus, allowing for more accurate sample-based
>> -   profiling.  */
>> +   profiling. If RETURN_NEXT is true, return the discriminator
>> +   value after the increase, else return the discriminator value
>> +   before the increase.  */
>>
>>  static int
>> -next_discriminator_for_locus (location_t locus)
>> +increase_discriminator_for_locus (location_t locus, bool return_next)
>>  {
>>struct locus_discrim_map item;
>>struct locus_discrim_map **slot;
>> @@ -934,8 +936,10 @@ next_discriminator_for_locus (location_t
>>(*slot)->locus = locus;
>>(*slot)->discriminator = 0;
>>  }
>> +
>>(*slot)->discriminator++;
>> -  return (*slot)->discriminator;
>> +  return return_next ? (*slot)->discriminator
>> +: (*slot)->discriminator - 1;
>>  }
>>
>>  /* Return TRUE if LOCUS1 and LOCUS2 refer to the same source line.  */
>> @@ -974,7 +978,7 @@ assign_discriminator (location_t locus,
>>if (locus == UNKNOWN_LOCATION)
>>  return;
>>
>> -  discriminator = next_discriminator_for_locus (locus);
>> +  discriminator = increase_discriminator_for_locus (locus, true);
>>
>>for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
>>  {
>> @@ -1009,23 +1013,16 @@ assign_discriminators (void)
>>for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
>> {
>>   gimple stmt = gsi_stmt (gsi);
>> - if (curr_locus == UNKNOWN_LOCATION)
>> -   {
>> - curr_locus = gimple_location (stmt);
>> -   }
>> - else if (!same_line_p (curr_locus, gimple_location (stmt)))
>

Re: [GOOGLE, AUTOFDO] Assign different discriminators to calls with the same lineno

2014-08-07 Thread Wei Mi
On Thu, Aug 7, 2014 at 2:40 PM, Xinliang David Li  wrote:
> On Thu, Aug 7, 2014 at 2:20 PM, Wei Mi  wrote:
>> No, it is not. This IR is dumped before early inline -- just after
>> pass_build_cfg. The line number of the deconstructor is marked
>> according to where its constructor is located,
>
> The definition location or the invocation location?
>
> David
>

The definition location and the invocation location are on the same
source line in this case.

Wei.


Re: [PATCH, PR61776] verify_flow_info failed: control flow in the middle of basic block with -fprofile-generate

2014-08-12 Thread Wei Mi
Ping.

On Sun, Jul 27, 2014 at 11:08 PM, Wei Mi  wrote:
>> But fact is that it is _not_ necessary to split the block because there
>> are no outgoing abnormal edges from it.
>>
>> The verifier failure is an artifact from using the same predicates during
>> CFG building and CFG verifying (usually ok, but for this particular
>> case it leads to this issue).
>>
>> So I don't think your patch is the proper way to address this issue
>> (while it certainly works).
>>
>> Instead whether a call can make abnormal gotos should be recorded
>> per-call and stored on the gimple-call.  Martin - this is exactly
>> one of the cases your patch would address?
>>
>
> Thanks for the comment and thanks to Martin's patch. I tried the patch.
> It works well to address both pr60449 and pr61776 after some
> extension. One extension is to replace the GF_CALL_LEAF attribute with
> GF_CALL_NO_ABNORMAL_GOTO. That is because dropping the "leaf"
> attribute during lto symbol merging is not the only thing that can
> introduce the control flow verification problem in pr60449; dropping
> the "const/pure" attributes can introduce the same problem too. It is
> unnecessary to introduce per-call attributes for all three of
> ECF_LEAF/ECF_CONST/ECF_PURE, so GF_CALL_NO_ABNORMAL_GOTO is introduced
> to indicate that a call stmt has no abnormal goto.
>
> GF_CALL_NO_ABNORMAL_GOTO will be set according to gimple_call_flags()
> once gimple call stmt is created, then updated in execute_fixup_cfg
> and cleanup_tree_cfg.
>
> I posted the extended patch here. I didn't add the noreturn part in
> because it has no direct impact on pr60449 and pr61776. I can help
> Martin to test and post that part as an independent patch later.
>
> bootstrap and regression pass on x86_64-linux-gnu. Is it ok?
>
> Thanks,
> Wei.


Re: [PATCH, PR61776] verify_flow_info failed: control flow in the middle of basic block with -fprofile-generate

2014-08-19 Thread Wei Mi
Sorry for the late reply. I took some time to make myself more
familiar with NORETURN and related code, and finally I understood what
you meant and saw why GF_CALL_CTRL_ALTERING alone is enough and
GF_CALL_NORETURN is unneeded. With your suggestion, the change is much
simpler! Please check whether the new patch attached is ok.

bootstrap and regression tests pass on x86_64-linux-gnu.

Thanks,
Wei.

> +static void
> +update_no_abnormal_goto_attr (basic_block bb)
> +{
> +  gimple_stmt_iterator gsi;
> +  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> +{
>
> it should be enough to check these on last stmts of a basic block, no?

Yes, that is better.

>
> That you call update_no_abnormal_goto_attr from two places
> before cleanup_tree_cfg_bb suggests you may want to perform
> this change in cleanup_control_flow_bb which already looks
> at the last stmt only?

Changed.

>
> Btw, I originally had this whole idea of moving flags to the gimple
> stmt level because of the "interesting" way we handle the noreturn
> attribute (calls to noreturn functions also end basic-blocks).
>
> Thus would it be possible to turn all these into a single flag,
> GF_CALL_CTRL_ALTERING?  That is, cover everything
> that is_ctrl_altering_stmt covers?  I suggest we initialize it at
> CFG build time and only ever clear it later.

Good idea!
ChangeLog:
2014-08-19  Martin Jambor  
Wei Mi  

PR ipa/60449
PR middle-end/61776
* tree-ssa-operands.c (update_stmt_operands): Remove
MODIFIED_NORETURN_CALLS.
* tree-cfgcleanup.c (cleanup_call_ctrl_altering_flag): New func.
(cleanup_control_flow_bb): Use cleanup_call_ctrl_altering_flag.
(split_bb_on_noreturn_calls): Renamed from split_bbs_on_noreturn_calls.
(cleanup_tree_cfg_1): Use split_bb_on_noreturn_calls.
* tree-ssanames.h: Remove MODIFIED_NORETURN_CALLS.
* gimple.h (enum gf_mask): Add GF_CALL_CTRL_ALTERING.
(gimple_call_set_ctrl_altering): New func.
(gimple_call_ctrl_altering_p): Ditto.
* tree-cfg.c (gimple_call_initialize_ctrl_altering): Ditto.
(make_blocks): Use gimple_call_initialize_ctrl_altering.
(is_ctrl_altering_stmt): Use gimple_call_ctrl_altering_p.
(execute_fixup_cfg): Use gimple_call_ctrl_altering_p and
        remove MODIFIED_NORETURN_CALLS.

2014-08-19  Martin Jambor  
Wei Mi  

PR ipa/60449
PR middle-end/61776
* testsuite/gcc.dg/lto/pr60449_1.c: New test.
* testsuite/gcc.dg/lto/pr60449_0.c: New test.
* testsuite/gcc.dg/pr61776.c: New test.
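
The gimple.h hunk does not survive in this excerpt. Below is a sketch
of the two new accessors named in the ChangeLog above, following the
usual GF_* subcode flag idiom; this is a hypothetical reconstruction,
not the committed hunk.

/* Set the GF_CALL_CTRL_ALTERING flag on call statement S
   (hypothetical sketch; the committed version may differ).  */

static inline void
gimple_call_set_ctrl_altering (gimple s, bool ctrl_altering_p)
{
  GIMPLE_CHECK (s, GIMPLE_CALL);
  if (ctrl_altering_p)
    s->gsbase.subcode |= GF_CALL_CTRL_ALTERING;
  else
    s->gsbase.subcode &= ~GF_CALL_CTRL_ALTERING;
}

/* Return true if call S is known to alter control flow, e.g. a call
   to a noreturn function or one that can make an abnormal goto.  */

static inline bool
gimple_call_ctrl_altering_p (gimple s)
{
  GIMPLE_CHECK (s, GIMPLE_CALL);
  return (s->gsbase.subcode & GF_CALL_CTRL_ALTERING) != 0;
}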

Index: tree-ssa-operands.c
===
--- tree-ssa-operands.c (revision 212442)
+++ tree-ssa-operands.c (working copy)
@@ -1087,12 +1087,6 @@ update_stmt_operands (struct function *f
 
   timevar_push (TV_TREE_OPS);
 
-  /* If the stmt is a noreturn call queue it to be processed by
- split_bbs_on_noreturn_calls during cfg cleanup.  */
-  if (is_gimple_call (stmt)
-  && gimple_call_noreturn_p (stmt))
-vec_safe_push (MODIFIED_NORETURN_CALLS (fn), stmt);
-
   gcc_assert (gimple_modified_p (stmt));
   build_ssa_operands (fn, stmt);
   gimple_set_modified (stmt, false);
Index: tree-cfgcleanup.c
===
--- tree-cfgcleanup.c   (revision 212442)
+++ tree-cfgcleanup.c   (working copy)
@@ -162,6 +162,23 @@ cleanup_control_expr_graph (basic_block
   return retval;
 }
 
+/* Clean up the GF_CALL_CTRL_ALTERING flag according to
+   the updated gimple_call_flags.  */
+
+static void
+cleanup_call_ctrl_altering_flag (gimple bb_end)
+{
+  if (!is_gimple_call (bb_end)
+  || !gimple_call_ctrl_altering_p (bb_end))
+return;
+
+  int flags = gimple_call_flags (bb_end);
+  if (((flags & (ECF_CONST | ECF_PURE))
+   && !(flags & ECF_LOOPING_CONST_OR_PURE))
+  || (flags & ECF_LEAF))
+gimple_call_set_ctrl_altering (bb_end, false);
+}
+
 /* Try to remove superfluous control structures in basic block BB.  Returns
true if anything changes.  */
 
@@ -182,6 +199,9 @@ cleanup_control_flow_bb (basic_block bb)
 
   stmt = gsi_stmt (gsi);
 
+  /* Try to clean up the ctrl altering flag for a call which ends bb.  */
+  cleanup_call_ctrl_altering_flag (stmt);
+
   if (gimple_code (stmt) == GIMPLE_COND
   || gimple_code (stmt) == GIMPLE_SWITCH)
 retval |= cleanup_control_expr_graph (bb, gsi);
@@ -594,30 +614,24 @@ fixup_noreturn_call (gimple stmt)
known not to return, and remove the unreachable code.  */
 
 static bool
-split_bbs_on_noreturn_calls (void)
+split_bb_on_noreturn_calls (basic_block bb)
 {
   bool changed = false;
-  gimple stmt;
-  basic_block bb;
+  gimple_stmt_iterator gsi;
 
-  /* Detect cases where a mid-block call is now known not to return.  */
-  if (cfun->gimple_df)
-while (vec_safe_leng

Re: [GOOGLE, AUTOFDO] Assign different discriminators to calls with the same lineno

2014-08-24 Thread Wei Mi
To avoid the unused new discriminator value, I added a map
"found_call_this_line" to track whether a call is the first call seen
in a source line when assigning discriminators. For the first call in
a source line, the discriminator is 0. For the following calls in the
same source line, a new discriminator is used every time. The new
patch is attached. Internal perf test and regression test are ok. Is
it ok for google-4_9?

Thanks,
Wei.
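
For illustration, consider a hypothetical line containing several calls
(an example made up here, not from the patch's testsuite):

extern int foo (void), bar (void), baz (void);

int
combine (void)
{
  /* Under the scheme above, foo keeps discriminator 0 while bar and
     baz each get a fresh discriminator, so AutoFDO can attribute
     samples to the individual calls on this line.  */
  return foo () + bar () + baz ();
}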



On Thu, Aug 7, 2014 at 2:10 PM, Wei Mi  wrote:
> Yes, that is intentional. It is to avoid assigning a discriminator for
> the first call in the group of calls with the same source lineno.
> Starting from the second call in the group, each call will get a
> different discriminator from the previous call in the same group.
>
> Thanks,
> Wei.
>
> On Thu, Aug 7, 2014 at 12:17 PM, Cary Coutant  wrote:
>>>  static int
>>> -next_discriminator_for_locus (location_t locus)
>>> +increase_discriminator_for_locus (location_t locus, bool return_next)
>>>  {
>>>struct locus_discrim_map item;
>>>struct locus_discrim_map **slot;
>>> @@ -934,8 +936,10 @@ next_discriminator_for_locus (location_t
>>>(*slot)->locus = locus;
>>>(*slot)->discriminator = 0;
>>>  }
>>> +
>>>(*slot)->discriminator++;
>>> -  return (*slot)->discriminator;
>>> +  return return_next ? (*slot)->discriminator
>>> +: (*slot)->discriminator - 1;
>>>  }
>>
>> Won't this have the effect of sometimes incrementing the next
>> available discriminator without actually using the new value? That is,
>> if you call it once with return_next == false, and then with
>> return_next == true.
>>
>> -cary
ChangeLog:

2014-08-24  Wei Mi  

* tree-cfg.c (assign_discriminators): Assign different discriminators
for calls belonging to the same source line.


Index: tree-cfg.c
===
--- tree-cfg.c  (revision 213402)
+++ tree-cfg.c  (working copy)
@@ -992,7 +992,13 @@ static void
 assign_discriminators (void)
 {
   basic_block bb;
+  /* If there is a location saved in the hash_table, it means that we
+ already found a call in this source line before. For the calls which
+ are not the first call found in the same source line, we don't assign
+ a new discriminator, so that the .debug_line section will be smaller.  */
+  hash_table  found_call_this_line;
 
+  found_call_this_line.create (13);
   FOR_EACH_BB_FN (bb, cfun)
 {
   edge e;
@@ -1009,23 +1015,31 @@ assign_discriminators (void)
   for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
{
  gimple stmt = gsi_stmt (gsi);
- if (curr_locus == UNKNOWN_LOCATION)
-   {
- curr_locus = gimple_location (stmt);
-   }
- else if (!same_line_p (curr_locus, gimple_location (stmt)))
+ if (gimple_code (stmt) == GIMPLE_CALL)
{
+ struct locus_discrim_map item;
+ struct locus_discrim_map **slot;
+
  curr_locus = gimple_location (stmt);
- curr_discr = 0;
-   }
- else if (curr_discr != 0)
-   {
- gimple_set_location (stmt, location_with_discriminator (
- gimple_location (stmt), curr_discr));
+ item.locus = curr_locus;
+ item.discriminator = 0;
+ slot = found_call_this_line.find_slot (&item, INSERT);
+ /* If the current call is not the first call seen in curr_locus,
+assign the next discriminator to it, else keep its discriminator
+unchanged.  */
+ if (*slot != HTAB_EMPTY_ENTRY)
+   {
+ curr_discr = next_discriminator_for_locus (curr_locus);
+ gimple_set_location (stmt, location_with_discriminator (
+   gimple_location (stmt), curr_discr));
+   }
+ else
+   {
+ *slot = XNEW (struct locus_discrim_map);
+ (*slot)->locus = curr_locus;
+ (*slot)->discriminator = 0;
+   }
}
- /* Allocate a new discriminator for CALL stmt.  */
- if (gimple_code (stmt) == GIMPLE_CALL)
-   curr_discr = next_discriminator_for_locus (curr_locus);
}
 
   if (locus == UNKNOWN_LOCATION)
@@ -1047,6 +1061,7 @@ assign_discriminators (void)
}
}
 }
+  found_call_this_line.dispose ();
 }
 
 /* Create the edges for a GIMPLE_COND starting at block BB.  */


Re: [GOOGLE, AUTOFDO] Assign different discriminators to calls with the same lineno

2014-08-28 Thread Wei Mi
Hi Cary,

Is the new patch ok for google-4_9?

Thanks,
Wei.


On Sun, Aug 24, 2014 at 8:53 PM, Wei Mi  wrote:
> To avoid the unused new discriminator value, I added a map
> "found_call_this_line" to track whether a call is the first call seen
> in a source line when assigning discriminators. For the first call in
> a source line, the discriminator is 0. For the following calls in the
> same source line, a new discriminator is used every time. The new
> patch is attached. Internal perf test and regression test are ok. Is
> it ok for google-4_9?
>
> Thanks,
> Wei.
>
>
>
> On Thu, Aug 7, 2014 at 2:10 PM, Wei Mi  wrote:
>> Yes, that is intentional. It is to avoid assigning a discriminator for
>> the first call in the group of calls with the same source lineno.
>> Starting from the second call in the group, each call will get a
>> different discriminator from the previous call in the same group.
>>
>> Thanks,
>> Wei.
>>
>> On Thu, Aug 7, 2014 at 12:17 PM, Cary Coutant  wrote:
>>>>  static int
>>>> -next_discriminator_for_locus (location_t locus)
>>>> +increase_discriminator_for_locus (location_t locus, bool return_next)
>>>>  {
>>>>struct locus_discrim_map item;
>>>>struct locus_discrim_map **slot;
>>>> @@ -934,8 +936,10 @@ next_discriminator_for_locus (location_t
>>>>(*slot)->locus = locus;
>>>>(*slot)->discriminator = 0;
>>>>  }
>>>> +
>>>>(*slot)->discriminator++;
>>>> -  return (*slot)->discriminator;
>>>> +  return return_next ? (*slot)->discriminator
>>>> +: (*slot)->discriminator - 1;
>>>>  }
>>>
>>> Won't this have the effect of sometimes incrementing the next
>>> available discriminator without actually using the new value? That is,
>>> if you call it once with return_next == false, and then with
>>> return_next == true.
>>>
>>> -cary


Re: [GOOGLE, AUTOFDO] Assign different discriminators to calls with the same lineno

2014-08-29 Thread Wei Mi
Thanks, that is elegant. I will post a new patch in this way soon.

Wei.

On Fri, Aug 29, 2014 at 10:11 AM, Cary Coutant  wrote:
>> To avoid the unused new discriminator value, I added a map
>> "found_call_this_line" to track whether a call is the first call seen
>> in a source line when assigning discriminators. For the first call in
>> a source line, the discriminator is 0. For the following calls in the
>> same source line, a new discriminator is used every time. The new
>> patch is attached. Internal perf test and regression test are ok. Is
>> it ok for google-4_9?
>
> This seems overly complex to me. I'd think all you need to do is add a
> bit to locus_discrim_map (stealing a bit from discriminator ought to
> be fine) that indicates whether the next call should increment the
> discriminator or not. Something like this:
>
> increase_discriminator_for_locus (location_t locus, bool return_next)
> {
>   ...
>   if (return_next || (*slot)->needs_increment)
> {
>   (*slot)->discriminator++;
>   (*slot)->needs_increment = false;
> }
>   else
> (*slot)->needs_increment = true;
>   return (*slot)->discriminator;
> }
>
> -cary


Re: [GOOGLE, AUTOFDO] Assign different discriminators to calls with the same lineno

2014-08-29 Thread Wei Mi
> On Fri, Aug 29, 2014 at 10:11 AM, Cary Coutant  wrote:
>>> To avoid the unused new discriminator value, I added a map
>>> "found_call_this_line" to track whether a call is the first call seen
>>> in a source line when assigning discriminators. For the first call in
>>> a source line, the discriminator is 0. For the following calls in the
>>> same source line, a new discriminator is used every time. The new
>>> patch is attached. Internal perf test and regression test are ok. Is
>>> it ok for google-4_9?
>>
>> This seems overly complex to me. I'd think all you need to do is add a
>> bit to locus_discrim_map (stealing a bit from discriminator ought to
>> be fine) that indicates whether the next call should increment the
>> discriminator or not. Something like this:
>>
>> increase_discriminator_for_locus (location_t locus, bool return_next)
>> {
>>   ...
>>   if (return_next || (*slot)->needs_increment)
>> {
>>   (*slot)->discriminator++;
>>   (*slot)->needs_increment = false;
>> }
>>   else
>> (*slot)->needs_increment = true;
>>   return (*slot)->discriminator;
>> }
>>
>> -cary

Here is the new patch (attached). Regression test passes. Cary, is it ok?

Thanks,
Wei.
ChangeLog:

2014-08-29  Wei Mi  

* tree-cfg.c (struct locus_discrim_map): New field needs_increment.
(next_discriminator_for_locus): Increase discriminator only when
return_next or needs_increment are true.
(assign_discriminator): Add an actual for next_discriminator_for_locus.
(assign_discriminators): Assign different discriminators for calls
belonging to the same source line.

Index: tree-cfg.c
===
--- tree-cfg.c  (revision 213402)
+++ tree-cfg.c  (working copy)
@@ -112,7 +112,14 @@ static struct cfg_stats_d cfg_stats;
 struct locus_discrim_map
 {
   location_t locus;
-  int discriminator;
+  /* Different calls belonging to the same source line will be assigned
+ different discriminators. But we want the discriminator of the
+ first call in a source line to be 0, in order to reduce the
+ .debug_line section size. needs_increment serves this purpose:
+ it is initialized to false and will be set to true after the
+ first call is seen.  */
+  bool needs_increment:1;
+  int discriminator:31;
 };
 
 /* Hashtable helpers.  */
@@ -914,10 +921,15 @@ make_edges (void)
 /* Find the next available discriminator value for LOCUS.  The
discriminator distinguishes among several basic blocks that
share a common locus, allowing for more accurate sample-based
-   profiling.  */
+   profiling.  If RETURN_NEXT is true, always return the next
+   discriminator.  If RETURN_NEXT is false, do not increase the
+   discriminator when locus_discrim_map::needs_increment is false,
+   which is the case when the stmt is the first call stmt seen in
+   the current source line.  locus_discrim_map::needs_increment is
+   set to true after the first call is seen.  */
 
 static int
-next_discriminator_for_locus (location_t locus)
+next_discriminator_for_locus (location_t locus, bool return_next)
 {
   struct locus_discrim_map item;
   struct locus_discrim_map **slot;
@@ -932,9 +944,13 @@ next_discriminator_for_locus (location_t
   *slot = XNEW (struct locus_discrim_map);
   gcc_assert (*slot);
   (*slot)->locus = locus;
+  (*slot)->needs_increment = false;
   (*slot)->discriminator = 0;
 }
-  (*slot)->discriminator++;
+  if (return_next || (*slot)->needs_increment)
+(*slot)->discriminator++;
+  else
+(*slot)->needs_increment = true;
   return (*slot)->discriminator;
 }
 
@@ -974,7 +990,7 @@ assign_discriminator (location_t locus,
   if (locus == UNKNOWN_LOCATION)
 return;
 
-  discriminator = next_discriminator_for_locus (locus);
+  discriminator = next_discriminator_for_locus (locus, true);
 
   for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
 {
@@ -1009,23 +1025,13 @@ assign_discriminators (void)
   for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
{
  gimple stmt = gsi_stmt (gsi);
- if (curr_locus == UNKNOWN_LOCATION)
-   {
+  if (gimple_code (stmt) == GIMPLE_CALL)
+{
  curr_locus = gimple_location (stmt);
-   }
- else if (!same_line_p (curr_locus, gimple_location (stmt)))
-   {
- curr_locus = gimple_location (stmt);
- curr_discr = 0;
-   }
- else if (curr_discr != 0)
-   {
+ curr_discr = next_discriminator_for_locus (curr_locus, false);
  gimple_set_location (stmt, location_with_discriminator 

Re: [asan] Emit GIMPLE directly, small cleanups

2012-10-11 Thread Wei Mi
Hi Diego,

>> /* Build
>> - (base_addr >> ASAN_SHADOW_SHIFT) | targetm.asan_shadow_offset ().
>> */
>> + (base_addr >> ASAN_SHADOW_SHIFT) + targetm.asan_shadow_offset ().
>> */
>
>
> Hm, I wonder if this is a documentation problem or we're generating bad
> runtime code.  Wei, you tested the runtime and it was working with the GCC
> generated code, right?
>
> In any case, we can adjust the expression later.

I only tested my small testcase and it worked. Because the redzones
are usually not very small areas (more than 4K), I think it is
possible that the testcase works even if the shadow addr calculation
is incorrect by a small deviation.

Thanks,
Wei.


Re: [asan] Emit GIMPLE directly, small cleanups

2012-10-11 Thread Wei Mi
Hi,

Here are the initial test results of the gcc asan patch; they show
some features missing in gcc but existing in llvm.
[1]. gcc regression tests for gcc-asan pass.
[2]. llvm regression tests for gcc-asan: 18 failures out of 123 for
tests written with google test, and 24 failures out of 28 for tests
written as lit tests.

gcc missing features:
1. gcc implementation doesn't support stack/global overflow checks
2. gcc implementation doesn't support some attributes, such as
__attribute__((no_address_safety_analysis)), which llvm does
3. gcc doesn't detect out-of-bound bitfield accesses of heap objects,
which llvm does
4. gcc doesn't detect out-of-bound memset, memcpy, strcat and strcpy
for heap allocated memory or strings, which llvm does
5. gcc doesn't contain similar options: -mllvm -asan-blacklist, -mllvm
-asan-initialization-order
6. gcc -fasan doesn't take effect at -O0, but llvm's does. Most lit
tests contain checks from -O0 to -O3, which makes gcc fail.
7. $HOME/llvm/trunk/projects/compiler-rt/lib/asan/scripts/asan_symbolize.py
can generate valid source locations from virtual addresses for an llvm
binary, but fails to do that for a gcc binary. For example, llvm result: #1
0x402694 in main heap-overflow.cc:23 vs. gcc result: #1 0x402694 in
main ??:0. Some FileCheck tests in the llvm lit tests expect the valid
source locations.

Thanks,
Wei.


On Thu, Oct 11, 2012 at 10:31 AM, Jakub Jelinek  wrote:
> On Thu, Oct 11, 2012 at 01:14:31PM -0400, Diego Novillo wrote:
>> On 2012-10-11 12:38 , Jakub Jelinek wrote:
>>
>> >-  gimple_seq seq, stmts;
>> >-  tree shadow_type = size_in_bytes == 16 ?
>> >-  short_integer_type_node : char_type_node;
>> >-  tree shadow_ptr_type = build_pointer_type (shadow_type);
>> >-  tree uintptr_type = lang_hooks.types.type_for_mode (ptr_mode,
>> >-  /*unsignedp=*/true);
>> >+  tree shadow_ptr_type = shadow_ptr_types[size_in_bytes == 16];
>>
>> Add '? 1 : 0' in the array index expression.
>
> Ok.
>
>> >/* Build
>> >- (base_addr >> ASAN_SHADOW_SHIFT) | targetm.asan_shadow_offset ().  */
>> >+ (base_addr >> ASAN_SHADOW_SHIFT) + targetm.asan_shadow_offset ().  */
>>
>> Hm, I wonder if this is a documentation problem or we're generating
>> bad runtime code.  Wei, you tested the runtime and it was working
>> with the GCC generated code, right?
>
> The asan web pages document |, the old tree-asan.c emitted +, I've changed
> it to BIT_IOR_EXPR, but that resulted in worse assembly, and I believe at
> least for the current x86_64 and i686 address ranges and shadow offset
> values it actually doesn't matter.
> On x86_64 stack is like 0x76e0, shifted down by 3 is still smaller
> than 1L << 44 that is ored or added to it.  And the negative half of the
> address space is used by the kernel, nothing is mapped into it (besides
> vsyscall page) and neither | nor + of 1L << 44 to it would work well.
> On i386, | and + works the same for all addresses, as 0xU >> 3
> is still smaller than 1 << 29.
> The reason why + generates better code on x86_64/i686 is that one can use
> e.g. movzbl (%r1, %r2), %r3 instead of orq %r2, %r1; movzb (%r1), %r3.
>
>> >+  if (shadow_ptr_types[0] == NULL_TREE)
>> >+{
>> >+  alias_set_type set = new_alias_set ();
>> >+  shadow_ptr_types[0]
>> >+= build_distinct_type_copy (unsigned_char_type_node);
>> >+  TYPE_ALIAS_SET (shadow_ptr_types[0]) = set;
>> >+  shadow_ptr_types[0] = build_pointer_type (shadow_ptr_types[0]);
>> >+  shadow_ptr_types[1]
>> >+= build_distinct_type_copy (short_unsigned_type_node);
>> >+  TYPE_ALIAS_SET (shadow_ptr_types[1]) = set;
>> >+  shadow_ptr_types[1] = build_pointer_type (shadow_ptr_types[1]);
>> >+}
>>
>> Move this to an initialization function, please.
>
> Okay.
>
> Jakub
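
To make the arithmetic above concrete, here is a small self-contained
sketch using the x86_64 values Jakub quotes (shift 3, offset 1UL << 44);
the constants are illustrative only, not authoritative:

#include <stdint.h>
#include <stdio.h>

#define ASAN_SHADOW_SHIFT  3
#define ASAN_SHADOW_OFFSET (1UL << 44)

/* Shadow address of an application address.  With '+' the whole
   computation folds into one addressing mode, e.g.
   movzbl (%r1,%r2), %r3, which is why '+' beats '|' here.  */
static uint8_t *
shadow_addr (uintptr_t app_addr)
{
  return (uint8_t *) ((app_addr >> ASAN_SHADOW_SHIFT) + ASAN_SHADOW_OFFSET);
}

int
main (void)
{
  uintptr_t p = 0x76e000001000UL;  /* a typical x86_64 stack address */
  printf ("shadow of %#lx is %p\n", (unsigned long) p,
          (void *) shadow_addr (p));
  return 0;
}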


Re: [asan] migrate runtime from llvm

2012-10-15 Thread Wei Mi
>
> This is a good start.  Can you send a patch out without including
> libasan so at least the toplevel parts can be reviewed easier?
> Also the changelog entry for gcc.c go under the gcc/ChangeLog rather
> than the toplevel one.
>
> Thanks,
> Andrew Pinski

Sure, I attached it. Thanks for pointing out the changelog error.

Thanks,
Wei.
Index: Makefile.in
===
--- Makefile.in (revision 192487)
+++ Makefile.in (working copy)
@@ -575,7 +575,7 @@ all:
 
 # This is the list of directories that may be needed in RPATH_ENVVAR
 # so that programs built for the target machine work.
-TARGET_LIB_PATH = 
$(TARGET_LIB_PATH_libstdc++-v3)$(TARGET_LIB_PATH_libmudflap)$(TARGET_LIB_PATH_libssp)$(TARGET_LIB_PATH_libgomp)$(TARGET_LIB_PATH_libitm)$(TARGET_LIB_PATH_libatomic)$(HOST_LIB_PATH_gcc)
+TARGET_LIB_PATH = 
$(TARGET_LIB_PATH_libstdc++-v3)$(TARGET_LIB_PATH_libmudflap)$(TARGET_LIB_PATH_libasan)$(TARGET_LIB_PATH_libssp)$(TARGET_LIB_PATH_libgomp)$(TARGET_LIB_PATH_libitm)$(TARGET_LIB_PATH_libatomic)$(HOST_LIB_PATH_gcc)
 
 @if target-libstdc++-v3
 TARGET_LIB_PATH_libstdc++-v3 = $$r/$(TARGET_SUBDIR)/libstdc++-v3/src/.libs:
@@ -585,6 +585,10 @@ TARGET_LIB_PATH_libstdc++-v3 = $$r/$(TAR
 TARGET_LIB_PATH_libmudflap = $$r/$(TARGET_SUBDIR)/libmudflap/.libs:
 @endif target-libmudflap
 
+@if target-libasan
+TARGET_LIB_PATH_libasan = $$r/$(TARGET_SUBDIR)/libasan/.libs:
+@endif target-libasan
+
 @if target-libssp
 TARGET_LIB_PATH_libssp = $$r/$(TARGET_SUBDIR)/libssp/.libs:
 @endif target-libssp
@@ -914,6 +918,7 @@ configure-host:  \
 configure-target:  \
 maybe-configure-target-libstdc++-v3 \
 maybe-configure-target-libmudflap \
+maybe-configure-target-libasan \
 maybe-configure-target-libssp \
 maybe-configure-target-newlib \
 maybe-configure-target-libgcc \
@@ -1062,6 +1067,7 @@ all-host: maybe-all-lto-plugin
 all-target: maybe-all-target-libstdc++-v3
 @endif target-libstdc++-v3-no-bootstrap
 all-target: maybe-all-target-libmudflap
+all-target: maybe-all-target-libasan
 all-target: maybe-all-target-libssp
 all-target: maybe-all-target-newlib
 @if target-libgcc-no-bootstrap
@@ -1152,6 +1158,7 @@ info-host: maybe-info-lto-plugin
 
 info-target: maybe-info-target-libstdc++-v3
 info-target: maybe-info-target-libmudflap
+info-target: maybe-info-target-libasan
 info-target: maybe-info-target-libssp
 info-target: maybe-info-target-newlib
 info-target: maybe-info-target-libgcc
@@ -1233,6 +1240,7 @@ dvi-host: maybe-dvi-lto-plugin
 
 dvi-target: maybe-dvi-target-libstdc++-v3
 dvi-target: maybe-dvi-target-libmudflap
+dvi-target: maybe-dvi-target-libasan
 dvi-target: maybe-dvi-target-libssp
 dvi-target: maybe-dvi-target-newlib
 dvi-target: maybe-dvi-target-libgcc
@@ -1314,6 +1322,7 @@ pdf-host: maybe-pdf-lto-plugin
 
 pdf-target: maybe-pdf-target-libstdc++-v3
 pdf-target: maybe-pdf-target-libmudflap
+pdf-target: maybe-pdf-target-libasan
 pdf-target: maybe-pdf-target-libssp
 pdf-target: maybe-pdf-target-newlib
 pdf-target: maybe-pdf-target-libgcc
@@ -1395,6 +1404,7 @@ html-host: maybe-html-lto-plugin
 
 html-target: maybe-html-target-libstdc++-v3
 html-target: maybe-html-target-libmudflap
+html-target: maybe-html-target-libasan
 html-target: maybe-html-target-libssp
 html-target: maybe-html-target-newlib
 html-target: maybe-html-target-libgcc
@@ -1476,6 +1486,7 @@ TAGS-host: maybe-TAGS-lto-plugin
 
 TAGS-target: maybe-TAGS-target-libstdc++-v3
 TAGS-target: maybe-TAGS-target-libmudflap
+TAGS-target: maybe-TAGS-target-libasan
 TAGS-target: maybe-TAGS-target-libssp
 TAGS-target: maybe-TAGS-target-newlib
 TAGS-target: maybe-TAGS-target-libgcc
@@ -1557,6 +1568,7 @@ install-info-host: maybe-install-info-lt
 
 install-info-target: maybe-install-info-target-libstdc++-v3
 install-info-target: maybe-install-info-target-libmudflap
+install-info-target: maybe-install-info-target-libasan
 install-info-target: maybe-install-info-target-libssp
 install-info-target: maybe-install-info-target-newlib
 install-info-target: maybe-install-info-target-libgcc
@@ -1638,6 +1650,7 @@ install-pdf-host: maybe-install-pdf-lto-
 
 install-pdf-target: maybe-install-pdf-target-libstdc++-v3
 install-pdf-target: maybe-install-pdf-target-libmudflap
+install-pdf-target: maybe-install-pdf-target-libasan
 install-pdf-target: maybe-install-pdf-target-libssp
 install-pdf-target: maybe-install-pdf-target-newlib
 install-pdf-target: maybe-install-pdf-target-libgcc
@@ -1719,6 +1732,7 @@ install-html-host: maybe-install-html-lt
 
 install-html-target: maybe-install-html-target-libstdc++-v3
 install-html-target: maybe-install-html-target-libmudflap
+install-html-target: maybe-install-html-target-libasan
 install-html-target: maybe-install-html-target-libssp
 install-html-target: maybe-install-html-target-newlib
 install-html-target: maybe-install-html-target-libgcc
@@ -1800,6 +1814,7 @@ installcheck-host: maybe-installcheck-lt
 
 installcheck-target: maybe-installcheck-target-libstdc++-v3
 installcheck-targ

Re: [asan] migrate runtime from llvm

2012-10-18 Thread Wei Mi
On Thu, Oct 18, 2012 at 11:16 AM, Jakub Jelinek  wrote:
> On Thu, Oct 18, 2012 at 09:46:36AM -0700, Wei Mi wrote:
>> --- gcc/gcc.c (revision 192567)
>> +++ gcc/gcc.c (working copy)
>> @@ -679,6 +679,7 @@ proper position among the other output f
>>  %{fgnu-tm:%:include(libitm.spec)%(link_itm)}\
>>  %(mflib) " STACK_SPLIT_SPEC "\
>>  %{fprofile-arcs|fprofile-generate*|coverage:-lgcov}\
>> +%{fasan:-lasan -lpthread -ldl}\
>>  %{!nostdlib:%{!nodefaultlibs:%(link_ssp) %(link_gcc_c_sequence)}}\
>>  %{!nostdlib:%{!nostartfiles:%E}} %{T*} }}"
>>  #endif
>
> Sorry for not mentioning it earlier at once, but -lpthread -ldl shouldn't
> be there either, the -fasan compiled code makes no direct calls to
> -lpthread nor -ldl, just libasan.
>
> Jakub

Thanks, I moved those options to the libasan LDFLAGS.

Wei.


Re: [asan] migrate runtime from llvm

2012-10-19 Thread Wei Mi
David, I put the m4 subdir under libasan because once I used the .m4
files (libtool.m4, lt~obsolete.m4, ltoptions.m4, ltsugar.m4,
ltversion.m4) and ltmain.sh under $topsrcdir, the problem you met
yesterday (a bad libtool being generated under
$topbuilddir/x86_64-unknown-linux-gnu/libasan) appeared again. That is
why I had to generate the new libtool m4 files and ltmain.sh using
libtoolize.

Thanks,
Wei.

On Fri, Oct 19, 2012 at 10:16 AM, Xinliang David Li  wrote:
> I tried it, and this version works for me.
>
> Your probably do not need to add the m4 subdir under libasan.  The
> required m4 files are either in .. or ../config dir. See how
> libmudflap does it.
>
> Other than that, if there are no other comments, the change is good to
> check into the branch. Remaining bugs can always be found and fixed
> later.
>
> thanks,
>
> David
>
>
>
> On Thu, Oct 18, 2012 at 8:04 PM, Wei Mi  wrote:
>> Hi,
>>
>> David caught a problem in the last patch when he tried to build
>> libasan. The problem was that an incomplete libtool was generated
>> under the libasan build directory. The cause was that the old patch
>> used an old ltmain.sh to generate libtool. I fixed it and attached a
>> new patch. The new patch also moves -lpthread and -ldl to the libasan
>> LDFLAGS.
>>
>> Thanks,
>> Wei.


[asan] a small patch to fix bogus error about global buffer overflow

2012-10-25 Thread Wei Mi
Hi,

A small patch to remove the bogus error reports exposed in the
spec2000 testing. In varasm.c, asan_protected should be equivalent
to asan_protect_global (decl) all the time, or else the compiler will
not insert redzones for some globals planned to be protected.

gcc/ChangeLog:
2012-10-25   Wei Mi  

A small fix to remove bogus error report of global buffer overflow.
* varasm.c: correct the condition of asan_protected being true.

Index: varasm.c
===
--- varasm.c(revision 192822)
+++ varasm.c(working copy)
@@ -1991,11 +1991,10 @@ assemble_variable (tree decl, int top_le
   align_variable (decl, dont_output_data);

   if (flag_asan
-  && asan_protect_global (decl)
-  && DECL_ALIGN (decl) < ASAN_RED_ZONE_SIZE * BITS_PER_UNIT)
+  && asan_protect_global (decl))
 {
   asan_protected = true;
-  DECL_ALIGN (decl) = ASAN_RED_ZONE_SIZE * BITS_PER_UNIT;
+  DECL_ALIGN (decl) = MAX (DECL_ALIGN (decl), ASAN_RED_ZONE_SIZE
* BITS_PER_UNIT);
 }

   set_mem_align (decl_rtl, DECL_ALIGN (decl));

Thanks,
Wei.


Re: [asan] a small patch to fix bogus error about global buffer overflow

2012-10-25 Thread Wei Mi
Hi,

Thanks for all the comments. Fixed. Ok to checkin?

2012-10-25  Wei Mi  

* varasm.c (assemble_variable): Set asan_protected even
for decls that are already ASAN_RED_ZONE_SIZE or more
bytes aligned.

Index: varasm.c
===
--- varasm.c(revision 192822)
+++ varasm.c(working copy)
@@ -1991,11 +1991,11 @@ assemble_variable (tree decl, int top_le
   align_variable (decl, dont_output_data);

   if (flag_asan
-  && asan_protect_global (decl)
-  && DECL_ALIGN (decl) < ASAN_RED_ZONE_SIZE * BITS_PER_UNIT)
+  && asan_protect_global (decl))
 {
   asan_protected = true;
-  DECL_ALIGN (decl) = ASAN_RED_ZONE_SIZE * BITS_PER_UNIT;
+  DECL_ALIGN (decl) = MAX (DECL_ALIGN (decl),
+   ASAN_RED_ZONE_SIZE * BITS_PER_UNIT);
 }

   set_mem_align (decl_rtl, DECL_ALIGN (decl));

On Thu, Oct 25, 2012 at 2:46 PM, Jakub Jelinek  wrote:
> On Thu, Oct 25, 2012 at 02:32:33PM -0700, Wei Mi wrote:
>> A small patch to remove the bogus error reports exposed in the
>> spec2000 testing. In varasm.c, asan_protected should be equivalent
>> to asan_protect_global (decl) all the time, or else the compiler will
>> not insert redzones for some globals planned to be protected.
>>
>> gcc/ChangeLog:
>> 2012-10-25   Wei Mi  
>>
>> A small fix to remove bogus error report of global buffer overflow.
>> * varasm.c: correct the condition of asan_protected being true.
>
> The patch is almost ok, the ChangeLog entry is not.
> Two instead of 3 spaces between date and name, no need for introductory
> comment above * varasm.c line, name of modified function and capital
> letter after :.
> Perhaps
> * varasm.c (assemble_variable): Set asan_protected even for decls
> that are already ASAN_RED_ZONE_SIZE or more bytes aligned.
>
> Ok with that Change and:
>
>> --- varasm.c(revision 192822)
>> +++ varasm.c(working copy)
>> @@ -1991,11 +1991,10 @@ assemble_variable (tree decl, int top_le
>>align_variable (decl, dont_output_data);
>>
>>if (flag_asan
>> -  && asan_protect_global (decl)
>> -  && DECL_ALIGN (decl) < ASAN_RED_ZONE_SIZE * BITS_PER_UNIT)
>> +  && asan_protect_global (decl))
>>  {
>>asan_protected = true;
>> -  DECL_ALIGN (decl) = ASAN_RED_ZONE_SIZE * BITS_PER_UNIT;
>> +  DECL_ALIGN (decl) = MAX (DECL_ALIGN (decl), ASAN_RED_ZONE_SIZE
>  * BITS_PER_UNIT);
>
> Too long line, put ASAN_RED_ZONE_SIZE * BITS_PER_UNIT on next
> line below the second DECL_ALIGN (decl).
>
> Jakub


[tsan] ThreadSanitizer instrumentation part

2012-10-31 Thread Wei Mi
Hi,

The patch is about ThreadSanitizer. ThreadSanitizer is a data race
detector for C/C++ programs. It contains two parts: instrumentation
and a runtime library. This patch is the first part, and the runtime
will be included in the second part. Dmitry (dvyu...@google.com) is the
author of this part, and I am migrating it to trunk. Ok for trunk?

gcc/ChangeLog:
2012-10-31  Wei Mi  

* Makefile.in (tsan.o): New
* passes.c (init_optimization_passes): Add tsan passes
* tree-pass.h (register_pass_info): Ditto
* cfghooks.h (GCC_CFGHOOKS_H): Avoid including duplicate headers
* doc/invoke.texi: Document tsan related options
* toplev.c (compile_file): Add tsan pass in driver
* gcc.c (LINK_COMMAND_SPEC): Add -lasan to the link command if
-ftsan is on.
* tsan.c: New file about tsan
* tsan.h: Ditto

Please check the following links for background:
http://code.google.com/p/data-race-test
http://gcc.gnu.org/wiki/cauldron2012?action=AttachFile&do=get&target=kcc.pdf
(the second half is about ThreadSanitizer).

A small testcase race_on_heap.cc is attached to show its
functionality. Running the small testcase with -ftsan produces the
following warning:

WARNING: ThreadSanitizer: data race (pid=5978)
  Write of size 4 at 0x7d0600039040 by thread 3:
#0 Thread2(void*) ??:0 (exe+0x52c0)

  Previous write of size 4 at 0x7d0600039040 by thread 2:
#0 Thread1(void*) ??:0 (exe+0x527d)

  Location is heap block of size 99 at 0x7d0600039030 allocated by thread 1:
#0 malloc 
/usr/local/google/home/wmi/Work/llvm-main/trunk/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:293
(exe+0xe9ce)
#1 alloc() ??:0 (exe+0x52fc)
#2 AllocThread(void*) ??:0 (exe+0x532c)

  Thread 3 (tid=5981, running) created at:
#0 pthread_create
/usr/local/google/home/wmi/Work/llvm-main/trunk/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:645
(exe+0xbf1d)
#1 main ??:0 (exe+0x5433)

  Thread 2 (tid=5980, finished) created at:
#0 pthread_create
/usr/local/google/home/wmi/Work/llvm-main/trunk/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:645
(exe+0xbf1d)
#1 main ??:0 (exe+0x5400)

  Thread 1 (tid=5979, finished) created at:
#0 pthread_create
/usr/local/google/home/wmi/Work/llvm-main/trunk/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:645
(exe+0xbf1d)
#1 main ??:0 (exe+0x5384)
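
The attached race_on_heap.cc is not preserved in this archive; a
minimal C program provoking this kind of report might look as follows
(a hypothetical reconstruction, not the actual attachment):

#include <pthread.h>
#include <stdlib.h>

static int *p;

/* Two threads write the same heap word with no synchronization.  */
static void *
thread1 (void *arg)
{
  (void) arg;
  p[4] = 1;
  return NULL;
}

static void *
thread2 (void *arg)
{
  (void) arg;
  p[4] = 2;
  return NULL;
}

int
main (void)
{
  pthread_t t1, t2;
  p = (int *) malloc (99);      /* cf. "heap block of size 99" above */
  pthread_create (&t1, NULL, thread1, NULL);
  pthread_create (&t2, NULL, thread2, NULL);
  pthread_join (t1, NULL);
  pthread_join (t2, NULL);
  free (p);
  return 0;
}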

Thanks,
Wei.
Index: gcc/Makefile.in
===
--- gcc/Makefile.in (revision 193016)
+++ gcc/Makefile.in (working copy)
@@ -1351,6 +1351,7 @@ OBJS = \
trans-mem.o \
tree-affine.o \
tree-call-cdce.o \
+   tsan.o \
tree-cfg.o \
tree-cfgcleanup.o \
tree-chrec.o \
@@ -2618,6 +2619,12 @@ tree-nomudflap.o : $(CONFIG_H) $(SYSTEM_
$(C_TREE_H) $(C_COMMON_H) $(GIMPLE_H) $(DIAGNOSTIC_H) $(HASHTAB_H) \
output.h langhooks.h tree-mudflap.h $(TM_H) coretypes.h \
$(GGC_H) gt-tree-mudflap.h $(TREE_PASS_H) $(DIAGNOSTIC_CORE_H)
+tsan.o : $(CONFIG_H) $(SYSTEM_H) $(TREE_H) $(TREE_INLINE_H) \
+   $(GIMPLE_H) $(DIAGNOSTIC_H) langhooks.h \
+   $(TM_H) coretypes.h $(TREE_DUMP_H) $(TREE_PASS_H) $(CGRAPH_H) $(GGC_H) \
+   $(BASIC_BLOCK_H) $(FLAGS_H) $(FUNCTION_H) \
+   $(TM_P_H) $(TREE_FLOW_H) $(DIAGNOSTIC_CORE_H) $(GIMPLE_H) tree-iterator.h \
+   intl.h cfghooks.h output.h options.h c-family/c-common.h tsan.h
 tree-pretty-print.o : tree-pretty-print.c $(CONFIG_H) $(SYSTEM_H) \
$(TREE_H) $(DIAGNOSTIC_H) $(HASHTAB_H) $(TREE_FLOW_H) \
$(TM_H) coretypes.h dumpfile.h tree-iterator.h $(SCEV_H) langhooks.h \
@@ -2671,7 +2678,8 @@ toplev.o : toplev.c $(CONFIG_H) $(SYSTEM
$(CGRAPH_H) $(COVERAGE_H) alloc-pool.h $(GGC_H) \
$(OPTS_H) params.def tree-mudflap.h $(TREE_PASS_H) $(GIMPLE_H) \
tree-ssa-alias.h $(PLUGIN_H) realmpfr.h tree-diagnostic.h \
-   $(TREE_PRETTY_PRINT_H) opts-diagnostic.h $(COMMON_TARGET_H)
+   $(TREE_PRETTY_PRINT_H) opts-diagnostic.h $(COMMON_TARGET_H) \
+   tsan.h
 
 hwint.o : hwint.c $(CONFIG_H) $(SYSTEM_H) $(DIAGNOSTIC_CORE_H)
 
Index: gcc/passes.c
===
--- gcc/passes.c(revision 193016)
+++ gcc/passes.c(working copy)
@@ -1439,6 +1439,7 @@ init_optimization_passes (void)
   NEXT_PASS (pass_split_crit_edges);
   NEXT_PASS (pass_pre);
   NEXT_PASS (pass_sink_code);
+  NEXT_PASS (pass_tsan);
   NEXT_PASS (pass_tree_loop);
{
  struct opt_pass **p = &pass_tree_loop.pass.sub;
@@ -1544,6 +1545,7 @@ init_optimization_passes (void)
   NEXT_PASS (pass_tm_edges);
 }
   NEXT_PASS (pass_lower_complex_O0);
+  NEXT_PASS (pass_tsan_O0);
   NEXT_PASS (pass_cleanup_eh);
   NEXT_PASS (pass_lower_resx);
   NEXT_PASS (pass_nrv);
Index: gcc/tree-pass.h
===

[asan] change libasan to libsanitizer

2012-11-01 Thread Wei Mi
Hi,

Here is the patch to change libasan to libsanitizer and reorganize the
directory. I divided the patch into three parts for review.

patch.part1.txt: Contains the changes in the outermost level.
patch.part2.txt.bz2: Remove libasan
patch.part3.txt.bz2: Add libsanitizer

Is it ok for asan branch?

2012-11-1  Wei Mi  

* configure.ac: Change target-libasan to target-libsanitizer.
* configure.in: Regenerate.
* Makefile.def: Change libasan module to libsanitizer.
* Makefile.in: Regenerate.
* libsanitizer: Change libasan to libsanitizer and add
an empty tsan directory under libsanitizer.

Thanks,
Wei.
Index: configure
===
--- configure   (revision 193063)
+++ configure   (working copy)
@@ -2771,7 +2771,7 @@ target_libraries="target-libgcc \
target-libitm \
target-libstdc++-v3 \
target-libmudflap \
-   target-libasan \
+   target-libsanitizer \
target-libssp \
target-libquadmath \
target-libgfortran \
Index: Makefile.in
===
--- Makefile.in (revision 193063)
+++ Makefile.in (working copy)
@@ -575,7 +575,7 @@ all:
 
 # This is the list of directories that may be needed in RPATH_ENVVAR
 # so that programs built for the target machine work.
-TARGET_LIB_PATH = 
$(TARGET_LIB_PATH_libstdc++-v3)$(TARGET_LIB_PATH_libmudflap)$(TARGET_LIB_PATH_libasan)$(TARGET_LIB_PATH_libssp)$(TARGET_LIB_PATH_libgomp)$(TARGET_LIB_PATH_libitm)$(TARGET_LIB_PATH_libatomic)$(HOST_LIB_PATH_gcc)
+TARGET_LIB_PATH = 
$(TARGET_LIB_PATH_libstdc++-v3)$(TARGET_LIB_PATH_libmudflap)$(TARGET_LIB_PATH_libsanitizer)$(TARGET_LIB_PATH_libssp)$(TARGET_LIB_PATH_libgomp)$(TARGET_LIB_PATH_libitm)$(TARGET_LIB_PATH_libatomic)$(HOST_LIB_PATH_gcc)
 
 @if target-libstdc++-v3
 TARGET_LIB_PATH_libstdc++-v3 = $$r/$(TARGET_SUBDIR)/libstdc++-v3/src/.libs:
@@ -585,9 +585,9 @@ TARGET_LIB_PATH_libstdc++-v3 = $$r/$(TAR
 TARGET_LIB_PATH_libmudflap = $$r/$(TARGET_SUBDIR)/libmudflap/.libs:
 @endif target-libmudflap
 
-@if target-libasan
-TARGET_LIB_PATH_libasan = $$r/$(TARGET_SUBDIR)/libasan/.libs:
-@endif target-libasan
+@if target-libsanitizer
+TARGET_LIB_PATH_libsanitizer = $$r/$(TARGET_SUBDIR)/libsanitizer/.libs:
+@endif target-libsanitizer
 
 @if target-libssp
 TARGET_LIB_PATH_libssp = $$r/$(TARGET_SUBDIR)/libssp/.libs:
@@ -924,7 +924,7 @@ configure-host:  \
 configure-target:  \
 maybe-configure-target-libstdc++-v3 \
 maybe-configure-target-libmudflap \
-maybe-configure-target-libasan \
+maybe-configure-target-libsanitizer \
 maybe-configure-target-libssp \
 maybe-configure-target-newlib \
 maybe-configure-target-libgcc \
@@ -1073,7 +1073,7 @@ all-host: maybe-all-lto-plugin
 all-target: maybe-all-target-libstdc++-v3
 @endif target-libstdc++-v3-no-bootstrap
 all-target: maybe-all-target-libmudflap
-all-target: maybe-all-target-libasan
+all-target: maybe-all-target-libsanitizer
 all-target: maybe-all-target-libssp
 all-target: maybe-all-target-newlib
 @if target-libgcc-no-bootstrap
@@ -1164,7 +1164,7 @@ info-host: maybe-info-lto-plugin
 
 info-target: maybe-info-target-libstdc++-v3
 info-target: maybe-info-target-libmudflap
-info-target: maybe-info-target-libasan
+info-target: maybe-info-target-libsanitizer
 info-target: maybe-info-target-libssp
 info-target: maybe-info-target-newlib
 info-target: maybe-info-target-libgcc
@@ -1246,7 +1246,7 @@ dvi-host: maybe-dvi-lto-plugin
 
 dvi-target: maybe-dvi-target-libstdc++-v3
 dvi-target: maybe-dvi-target-libmudflap
-dvi-target: maybe-dvi-target-libasan
+dvi-target: maybe-dvi-target-libsanitizer
 dvi-target: maybe-dvi-target-libssp
 dvi-target: maybe-dvi-target-newlib
 dvi-target: maybe-dvi-target-libgcc
@@ -1328,7 +1328,7 @@ pdf-host: maybe-pdf-lto-plugin
 
 pdf-target: maybe-pdf-target-libstdc++-v3
 pdf-target: maybe-pdf-target-libmudflap
-pdf-target: maybe-pdf-target-libasan
+pdf-target: maybe-pdf-target-libsanitizer
 pdf-target: maybe-pdf-target-libssp
 pdf-target: maybe-pdf-target-newlib
 pdf-target: maybe-pdf-target-libgcc
@@ -1410,7 +1410,7 @@ html-host: maybe-html-lto-plugin
 
 html-target: maybe-html-target-libstdc++-v3
 html-target: maybe-html-target-libmudflap
-html-target: maybe-html-target-libasan
+html-target: maybe-html-target-libsanitizer
 html-target: maybe-html-target-libssp
 html-target: maybe-html-target-newlib
 html-target: maybe-html-target-libgcc
@@ -1492,7 +1492,7 @@ TAGS-host: maybe-TAGS-lto-plugin
 
 TAGS-target: maybe-TAGS-target-libstdc++-v3
 TAGS-target: maybe-TAGS-target-libmudflap
-TAGS-target: maybe-TAGS-target-libasan
+TAGS-target: maybe-TAGS-target-libsanitizer
 TAGS-target: maybe-TAGS-target-libssp
 TAGS-target: maybe-TAGS-target-newlib
 TAGS-target: maybe-TAGS-target-libgcc
@@ -1574,7 +1574,7 @@ install-info-host: maybe-install-info-lt
 
 install-info-target: maybe-in

Re: [asan] change libasan to libsanitizer

2012-11-01 Thread Wei Mi
Yes. That will be easier and clearer. The patch is too big.

Thanks,
Wei.

On Thu, Nov 1, 2012 at 1:19 PM, Xinliang David Li  wrote:
> Will it be easier if you just rolled back your previous libasan
> library changes, and resubmit it with the restructured directory?
>
> David
>
> On Thu, Nov 1, 2012 at 1:17 PM, Wei Mi  wrote:
>> patch.part2.txt.bz2 and patch.part3.txt.bz2 are still too big.
>>
>> Divide patch.part2.txt.bz2 into two parts:
>> patch.part2-1.txt.bz2 + patch.part2-2.txt.bz2
>>
>> Divide patch.part3.txt.bz2 into two parts:
>> patch.part3-1.txt.bz2 + patch.part3-2.txt.bz2
>>
>> This is patch.part2-1.txt.bz2.
>>
>> Thanks,
>> Wei.
>>
>> On Thu, Nov 1, 2012 at 1:02 PM, Wei Mi  wrote:
>>> Hi,
>>>
>>> Here is the patch to change libasan to libsanitizer and reorganize the
>>> directory. I divided the patch into three parts for review.
>>>
>>> patch.part1.txt: Contains the changes in the outermost level.
>>> patch.part2.txt.bz2: Remove libasan
>>> patch.part3.txt.bz2: Add libsanitizer
>>>
>>> Is it ok for asan branch?
>>>
>>> 2012-11-1  Wei Mi  
>>>
>>> * configure.ac: Change target-libasan to target-libsanitizer.
>>> * configure.in: Regenerate.
>>> * Makefile.def: Change libasan module to libsanitizer.
>>> * Makefile.in: Regenerate.
>>> * libsanitizer: Change libasan to libsanitizer and add
>>> an empty tsan directory under libsanitizer.
>>>
>>> Thanks,
>>> Wei.


Re: [asan] change libasan to libsanitizer

2012-11-01 Thread Wei Mi
Ok, I will check in the patch.

Thanks,
Wei.

On Thu, Nov 1, 2012 at 3:03 PM, Xinliang David Li  wrote:
> On Thu, Nov 1, 2012 at 2:23 PM, Xinliang David Li  wrote:
>> On Thu, Nov 1, 2012 at 2:17 PM, Wei Mi  wrote:
>>> Thanks for the suggestion!
>>>
>>> The planned svn commands will be:
>>>
>>> svn mv libasan libsanitizer
>>> svn add libsanitizer/asan
>>> svn add libsanitizer/tsan
>>
>> Probably keep the tsan creation out of this patch.
>
> If there is no other objections, this patch is ok for asan branch with
> the above. There might be errors spotted under trunk review, but that
> should be fine ..
>
> thanks,
>
> David
>
>>
>> David
>>
>>> cd libsanitizer
>>> for i in `ls asan_*`; do
>>>   svn mv $i asan/$i
>>> done
>>>
>>> Then apply the two patches attached on top of that. patch.1.txt is to
>>> handle the toplevel configure and Makefile changes. patch.2.txt is to
>>> handle the configure and Makefile changes in libsanitizer.
>>>
>>> Thanks,
>>> Wei.
>>>
>>> On Thu, Nov 1, 2012 at 1:34 PM, Xinliang David Li  
>>> wrote:
>>>> that sounds good to me.
>>>>
>>>> David
>>>>
>>>> On Thu, Nov 1, 2012 at 1:31 PM, Jakub Jelinek  wrote:
>>>>> On Thu, Nov 01, 2012 at 01:19:42PM -0700, Xinliang David Li wrote:
>>>>>> Will it be easier if you just rolled back your previous libasan
>>>>>> library changes, and resubmit it with the restructured directory?
>>>>>
>>>>> I think better would be if you didn't apply it as a patch with lots of svn
>>>>> add/svn rm commands, but instead just svn mv the directory or files.
>>>>> So it would be better if you could post the planned svn commands
>>>>> and the patch that would be applied on top of that.
>>>>>
>>>>> Jakub


Re: [asan] change libasan to libsanitizer

2012-11-01 Thread Wei Mi
The patch has been checked in.
Committed revision 193074.

Thanks,
Wei.

On Thu, Nov 1, 2012 at 3:16 PM, Wei Mi  wrote:
> Ok, I will check in the patch.
>
> Thanks,
> Wei.
>
> On Thu, Nov 1, 2012 at 3:03 PM, Xinliang David Li  wrote:
>> On Thu, Nov 1, 2012 at 2:23 PM, Xinliang David Li  wrote:
>>> On Thu, Nov 1, 2012 at 2:17 PM, Wei Mi  wrote:
>>>> Thanks for the suggestion!
>>>>
>>>> The planned svn commands will be:
>>>>
>>>> svn mv libasan libsanitizer
>>>> svn add libsanitizer/asan
>>>> svn add libsanitizer/tsan
>>>
>>> Probably keep the tsan creation out of this patch.
>>
>> If there is no other objections, this patch is ok for asan branch with
>> the above. There might be errors spotted under trunk review, but that
>> should be fine ..
>>
>> thanks,
>>
>> David
>>
>>>
>>> David
>>>
>>>> cd libsanitizer
>>>> for i in `ls asan_*`; do
>>>>   svn mv $i asan/$i
>>>> done
>>>>
>>>> Then apply the two patches attached on top of that. patch.1.txt is to
>>>> handle the toplevel configure and Makefile changes. patch.2.txt is to
>>>> handle the configure and Makefile changes in libsanitizer.
>>>>
>>>> Thanks,
>>>> Wei.
>>>>
>>>> On Thu, Nov 1, 2012 at 1:34 PM, Xinliang David Li  
>>>> wrote:
>>>>> that sounds good to me.
>>>>>
>>>>> David
>>>>>
>>>>> On Thu, Nov 1, 2012 at 1:31 PM, Jakub Jelinek  wrote:
>>>>>> On Thu, Nov 01, 2012 at 01:19:42PM -0700, Xinliang David Li wrote:
>>>>>>> Will it be easier if you just rolled back your previous libasan
>>>>>>> library changes, and resubmit it with the restructured directory?
>>>>>>
>>>>>> I think better would be if you didn't apply it as a patch with lots of 
>>>>>> svn
>>>>>> add/svn rm commands, but instead just svn mv the directory or files.
>>>>>> So it would be better if you could post the planned svn commands
>>>>>> and the patch that would be applied on top of that.
>>>>>>
>>>>>> Jakub


Re: [tsan] ThreadSanitizer instrumentation part

2012-11-02 Thread Wei Mi
Hi,

Thanks for so many useful comments! I updated the file according to
the comments. The major changes include adding sanitizer.def and
generating gimple directly. The new patch file is attached.

> On Wed, Oct 31, 2012 at 11:34:10AM -0700, Wei Mi wrote:
>> gcc/ChangeLog:
>> 2012-10-31  Wei Mi  
>
> If Dmitry wrote parts of the patch, it would be nice to mention
> him in the ChangeLog too.

> All ChangeLog entries should end with a dot.

Changed.

2012-10-31  Dmitry Vyukov  
 Wei Mi  

* Makefile.in (tsan.o): New.
(BUILTINS_DEF): Add sanitizer.def.
* sanitizer.def: New.
* passes.c (init_optimization_passes): Add tsan passes.
* tree-pass.h (register_pass_info): Ditto.
* cfghooks.h (GCC_CFGHOOKS_H): Avoid including duplicate headers.
* doc/invoke.texi: Document tsan related options.
* toplev.c (compile_file): Add tsan pass in driver.
* gcc.c (LINK_COMMAND_SPEC): Add -lasan to the link command if
-fthread_sanitizer is on.
* tsan.c: New file about tsan.
* tsan.h: Ditto.


>>  struct cfg_hooks
>> @@ -219,3 +222,4 @@ extern void gimple_register_cfg_hooks (v
>>  extern struct cfg_hooks get_cfg_hooks (void);
>>  extern void set_cfg_hooks (struct cfg_hooks);
>>
>> +#endif  /* GCC_CFGHOOKS_H */
>
> Why this?  Simply don't include that header in tsan.c, it is already
> included by basic-block.h.

Remove cfghooks.h from tsan.c. Remove the #ifdef GCC_CFGHOOKS_H from cfghooks.h

> Can't google just assign the code to FSF, and use a standard boilerplate
> as everything else in gcc/ ?

Copy from asan header and make some change.

>> +static tree
>> +get_vptr_update_decl (void)
>> +{
>> +  tree typ;
>> +  static tree decl;
>> +
>> +  if (decl != NULL)
>> +return decl;
>> +  typ = build_function_type_list (void_type_node,
>> +  ptr_type_node, ptr_type_node, NULL_TREE);
>> +  decl = build_func_decl (typ, "__tsan_vptr_update");
>> +  return decl;
>> +}
> ...
>
> Instead of this (but same applies to asan), I think we should just consider
> putting it into builtins.def (or have sanitizer.def like there is sync.def
> or omp-builtins.def).  The problem might be non-C/C++ family frontends
> though.

Create sanitizer.def and use builtin_decl_implicit to create builtin decls.

>> +  while (TREE_CODE (expr_type) == ARRAY_TYPE)
>> +expr_type = TREE_TYPE (expr_type);
>> +  size = (TREE_INT_CST_LOW (TYPE_SIZE (expr_type))) / BITS_PER_UNIT;
>
> int_size_in_bytes.

Changed.

> preferrably without building everything as trees, then gimplifying it.

Generate gimple directly. Remove funcs: instr_memory_access,
instr_vptr_update, instr_func_entry, instr_func_exit.

> For func_calls and func_mops, I don't see why you need two variables instead
> of just one, and why the function can't just return a bool whether
> entry/exit needs to be instrumented or not.

instrument_memory_accesses now returns a bool indicating whether or not
entry/exit needs to be instrumented. func_calls and func_mops are removed.
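
The resulting shape is roughly the following (a sketch based on the
description above, assuming instrument_gimple is changed to return a
bool; this is not the posted patch text):

/* Instrument all memory accesses in the function.  Return true if
   anything was instrumented, so the caller knows whether to emit
   __tsan_func_entry/__tsan_func_exit instrumentation.  */

static bool
instrument_memory_accesses (void)
{
  basic_block bb;
  gimple_stmt_iterator gsi;
  bool fentry_exit_instrument = false;

  FOR_EACH_BB (bb)
    for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
      fentry_exit_instrument |= instrument_gimple (gsi);
  return fentry_exit_instrument;
}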

>> +set_location (gimple_seq seq, location_t loc)
>> +{
>> +  gimple_seq_node n;
>> +
>> +  for (n = gimple_seq_first (seq); n != NULL; n = n->gsbase.next)
>
> This really should use a stmt iterator.

set_location removed. The gimple location is now set using
gimple_set_location every time a new gimple statement is inserted.

>> +  FOR_EACH_BB (bb)
>> +{
>> +  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
>> +{
>> +  instrument_gimple (gsi);
>> +}
>> +}
>
> Extraneous two pairs of {}s.

Fixed.

>> +struct gimple_opt_pass pass_tsan = {{
>
> Please watch formatting of other gimple_opt_pass structures.
> {{ isn't used anywhere.

Fixed.

> Is that the option that LLVM uses (I'm talking about -faddress-sanitizer
> in LLVM vs. -fasan right now in GCC, isn't that similar?).

Fixed.

> +static tree
> +get_init_decl (void)
> +{
> +  tree typ;
> +  static tree decl;
> +
> +  if (decl != NULL)
> +return decl;
> +  typ = build_function_type_list (void_type_node, NULL_TREE);
> +  decl = build_func_decl (typ, "__tsan_init");
> +  return decl;
> +}
>
> The above can crash the compiler btw, as that static tree decl
> (in many other functions) is not GTY(()) marked (must be file scope for
> that), thus ggc_collect might free it.  Also, please use type
> instead of typ for variable names.

Func get_init_decl removed after generating gimple directly.

>> +  /* Instrumentation for assignment of a function result
>> + must be inserted after the c

Re: [tsan] ThreadSanitizer instrumentation part

2012-11-03 Thread Wei Mi
Sorry, I attached an incorrect patch.txt yesterday. This is the correct one.

Thanks,
Wei.

On Fri, Nov 2, 2012 at 6:31 PM, Wei Mi  wrote:
> Hi,
>
> Thanks for so many useful comments! I updated the file according to
> the comments. The major changes include adding sanitizer.def and
> generating gimple directly. The new patch file is attached.
>
>> On Wed, Oct 31, 2012 at 11:34:10AM -0700, Wei Mi wrote:
>>> gcc/ChangeLog:
>>> 2012-10-31  Wei Mi  
>>
>> If Dmitry wrote parts of the patch, it would be nice to mention
>> him in the ChangeLog too.
>
>> All ChangeLog entries should end with a dot.
>
> Changed.
>
> 2012-10-31  Dmitry Vyukov  
>  Wei Mi  
>
> * Makefile.in (tsan.o): New.
> (BUILTINS_DEF): Add sanitizer.def.
> * sanitizer.def: New.
> * passes.c (init_optimization_passes): Add tsan passes.
> * tree-pass.h (register_pass_info): Ditto.
> * cfghooks.h (GCC_CFGHOOKS_H): Avoid including duplicate headers.
> * doc/invoke.texi: Document tsan related options.
> * toplev.c (compile_file): Add tsan pass in driver.
> * gcc.c (LINK_COMMAND_SPEC): Add -lasan to the link command if
> -fthread_sanitizer is on.
> * tsan.c: New file about tsan.
> * tsan.h: Ditto.
>
>
>>>  struct cfg_hooks
>>> @@ -219,3 +222,4 @@ extern void gimple_register_cfg_hooks (v
>>>  extern struct cfg_hooks get_cfg_hooks (void);
>>>  extern void set_cfg_hooks (struct cfg_hooks);
>>>
>>> +#endif  /* GCC_CFGHOOKS_H */
>>
>> Why this?  Simply don't include that header in tsan.c, it is already
>> included by basic-block.h.
>
> Remove cfghooks.h from tsan.c. Remove the #ifdef GCC_CFGHOOKS_H from 
> cfghooks.h
>
>> Can't google just assign the code to FSF, and use a standard boilerplate
>> as everything else in gcc/ ?
>
> Copy from asan header and make some change.
>
>>> +static tree
>>> +get_vptr_update_decl (void)
>>> +{
>>> +  tree typ;
>>> +  static tree decl;
>>> +
>>> +  if (decl != NULL)
>>> +return decl;
>>> +  typ = build_function_type_list (void_type_node,
>>> +  ptr_type_node, ptr_type_node, NULL_TREE);
>>> +  decl = build_func_decl (typ, "__tsan_vptr_update");
>>> +  return decl;
>>> +}
>> ...
>>
>> Instead of this (but same applies to asan), I think we should just consider
>> putting it into builtins.def (or have sanitizer.def like there is sync.def
>> or omp-builtins.def).  The problem might be non-C/C++ family frontends
>> though.
>
> Create sanitizer.def and use builtin_decl_implicit to create builtin decls.
>
>>> +  while (TREE_CODE (expr_type) == ARRAY_TYPE)
>>> +expr_type = TREE_TYPE (expr_type);
>>> +  size = (TREE_INT_CST_LOW (TYPE_SIZE (expr_type))) / BITS_PER_UNIT;
>>
>> int_size_in_bytes.
>
> Changed.
>
>> preferrably without building everything as trees, then gimplifying it.
>
> Generate gimple directly. Remove funcs: instr_memory_access,
> instr_vptr_update, instr_func_entry, instr_func_exit.
>
>> For func_calls and func_mops, I don't see why you need two variables instead
>> of just one, and why the function can't just return a bool whether
>> entry/exit needs to be instrumented or not.
>
> instrument_memory_accesses now returns a bool indicating whether or not
> entry/exit needs to be instrumented. func_calls and func_mops are removed.
>
>>> +set_location (gimple_seq seq, location_t loc)
>>> +{
>>> +  gimple_seq_node n;
>>> +
>>> +  for (n = gimple_seq_first (seq); n != NULL; n = n->gsbase.next)
>>
>> This really should use a stmt iterator.
>
> set_location removed. The gimple location is now set using
> gimple_set_location every time a new gimple statement is inserted.
>
>>> +  FOR_EACH_BB (bb)
>>> +{
>>> +  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
>>> +{
>>> +  instrument_gimple (gsi);
>>> +}
>>> +}
>>
>> Extraneous two pairs of {}s.
>
> Fixed.
>
>>> +struct gimple_opt_pass pass_tsan = {{
>>
>> Please watch formatting of other gimple_opt_pass structures.
>> {{ isn't used anywhere.
>
> Fixed.
>
>> Is that the option that LLVM uses (I'm talking about -faddress-sanitizer
>> in LLVM vs. -fasan right now in GCC, isn't that similar?).
>
> Fixed.
>
>> +static tree
>> +get_init

Re: [tsan] ThreadSanitizer instrumentation part

2012-11-05 Thread Wei Mi
Hi Jakub,

Thanks for the comments. I fixed most of them except the setting of
the TODO_ flags. The new patch.txt is attached.

Thanks,
Wei.

>> +  TODO_verify_all | TODO_update_ssa
>
> Ideally you shouldn't need TODO_update_ssa.
>

I got an error when I removed TODO_update_ssa, so I kept it.

>> +| TODO_update_address_taken /* todo_flags_finish  */
>
> And why this?
>

If we generate tsan_read (&a) for a non-address-taken static variable
a, we need to change a to be address taken, right?
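
A hypothetical illustration of the point (an example made up here, not
from the patch):

/* Before instrumentation 'a' is never address-taken, so the SSA
   machinery may treat it as a register.  The inserted instrumentation
   call takes its address, hence TODO_update_address_taken.  */
static int a;

int
read_a (void)
{
  return a;   /* the tsan pass inserts __tsan_read4 (&a) before this load */
}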

On Sat, Nov 3, 2012 at 11:39 AM, Jakub Jelinek  wrote:
> On Sat, Nov 03, 2012 at 10:05:35AM -0700, Wei Mi wrote:
>> --- gcc/sanitizer.def (revision 0)
>> +++ gcc/sanitizer.def (revision 0)
>> @@ -0,0 +1,31 @@
>> +DEF_SANITIZER_BUILTIN(BUILT_IN_TSAN_WRITE_16, "__tsan_write16",
>> +  BT_FN_VOID_PTR, ATTR_NOTHROW_LEAF_LIST)
>> +
>> +
>> +
>
> Please remove the trailing whitespace.

Done

>
>> +/* Builtin used by the implementation of libsanitizer. These
>> +   functions are mapped to the actual implementation of the
>> +   libasan and libtsan library. */
>> +#undef DEF_SANITIZER_BUILTIN
>> +#define DEF_SANITIZER_BUILTIN(ENUM, NAME, TYPE, ATTRS) \
>> +  DEF_BUILTIN (ENUM, "__builtin_" NAME, BUILT_IN_NORMAL, TYPE, TYPE,\
>> +   true, true, true, ATTRS, true, flag_tsan)
>
> That should be eventually flag_asan || flag_tsan, as sanitizer.def
> should be also for asan builtins, or it must be DEF_TSAN_BUILTIN/tsan.def.
>

Postponed; will fix it after the asan checkin to trunk.
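
The eventual shape Jakub suggests would be something like this (a
sketch, not a committed version):

#undef DEF_SANITIZER_BUILTIN
#define DEF_SANITIZER_BUILTIN(ENUM, NAME, TYPE, ATTRS) \
  DEF_BUILTIN (ENUM, "__builtin_" NAME, BUILT_IN_NORMAL, TYPE, TYPE, \
               true, true, true, ATTRS, true, (flag_asan || flag_tsan))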

>> +static tree
>> +get_memory_access_decl (bool is_write, unsigned size)
>> +{
>> +  enum built_in_function fcode;
>> +
>> +  if (size <= 1)
>> +fcode = is_write ? BUILT_IN_TSAN_WRITE_1 :
>> +   BUILT_IN_TSAN_READ_1;
>
> Formatting, : should be below ?.

Fixed.

>> +
>> +  return builtin_decl_implicit(fcode);
>
> Space before (. Several times in the code.
>

Fixed.

> Also, as is the tsan builtins will be defined only for
> C/C++ family FEs, so either something needs to be done
> for other FEs, or perhaps the pass should just error out
> if say the BUILT_IN_TSAN_INIT isn't defined.
>

Wrapped builtin_decl_implicit in get_tsan_builtin_decl. If
builtin_decl_implicit returns an invalid decl, output an error message
and then exit.
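
Roughly (a sketch of the wrapper, not the exact patch text):

  /* Fetch a tsan builtin decl; error out for frontends that do not
     define the sanitizer builtins.  */
  static tree
  get_tsan_builtin_decl (enum built_in_function fcode)
  {
    tree decl = builtin_decl_implicit (fcode);
    if (decl == NULL_TREE)
      {
        error ("tsan builtins are not supported by this frontend");
        exit (FATAL_EXIT_CODE);
      }
    return decl;
  }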

>> +static tree
>> +is_vptr_store (gimple stmt, tree expr, int is_write)
>
> is_write should be bool,
>
>> +{
>> +  if (is_write == 1
>
> and this just is_write
>
>> +static bool
>> +is_load_of_const_p (tree expr, int is_write)
>> +{
>> +  if (is_write)
>> +return false;
>
> Again.
>

Fixed

>> +  /* The var does not live in memory -> no possibility of races.  */
>> +  || (tcode == VAR_DECL
>> +  && !TREE_ADDRESSABLE (expr)
>> +  && TREE_STATIC (expr) == 0)
>
> Please use && !is_global_var (expr) here instead.
>

Changed.

>> +  /* TODO: handle other cases
>> + (FIELD_DECL, MEM_REF, ARRAY_RANGE_REF, TARGET_MEM_REF, ADDR_EXPR).  */
>
> The comment is obsolete, MEM_REF is handled.
>

Fixed.

>> +  if (tcode != ARRAY_REF
>> +  && tcode != VAR_DECL
>> +  && tcode != COMPONENT_REF
>> +  && tcode != INDIRECT_REF
>> +  && tcode != MEM_REF)
>> +return false;
>> +
>> +  stmt = gsi_stmt (gsi);
>> +  loc = gimple_location (stmt);
>> +  rhs = is_vptr_store (stmt, expr, is_write);
>> +#ifdef DEBUG
>> +  if (rhs == NULL)
>> +gcc_assert (is_gimple_addressable (expr));
>> +#endif
>
> That should be
>   gcc_checking_assert (rhs != NULL || is_gimple_addressable (expr));
> if you want to check it in checking versions only.
>

Fixed.

>> +  size = int_size_in_bytes(expr_type);
>
> Missing space.
>

Fixed.

>> +  g = gimple_build_call(
>> +get_memory_access_decl(is_write, size),
>> +1, expr_ptr);
>
> And the formatting here is completely wrong.
>

Fixed.

>> +}
>> +  else
>> +g = gimple_build_call(
>> +  builtin_decl_implicit(BUILT_IN_TSAN_VPTR_UPDATE),
>> +  1, expr_ptr);
>> +  gimple_set_location (g, loc);
>> +  /* Instrumentation for assignment of a function result
>> + must be inserted after the call.  Instrumentation for
>> + reads of function arguments must be inserted before the call.
>> + That's because the call can contain synchronization.  */
>> +  if (is_gimple_call (stmt) && is_write)
>

Re: [PATCH 01/10] Initial import of asan from the Google branch into trunk

2012-11-09 Thread Wei Mi
> Other issues:
>
> * libasan does not seem to be a multilib, at least I only find the 64bit
> version on x86-64-gnu-linux such that "-m32" compilation fails.
>

That is because the configure file was originally shared between asan
and tsan (tsan doesn't support 32 bit). Diego has suggested that I
split the configure, so we will send a patch to support a 32-bit
version of asan after Dodji's patches are checked in to trunk.

Thanks,
Wei.


Re: Asan/Tsan Unit/Regression testing (was [asan] Emit GIMPLE direclty, small cleanups)

2012-11-09 Thread Wei Mi
>
>>
>> > 2. Large Gtest-based unittest. This is a set of c++ files that should
>> > be built with the asan switch, depends on gtest
>> > (http://code.google.com/p/googletest/).
>> >
>> > http://llvm.org/viewvc/llvm-project/compiler-rt/trunk/lib/asan/tests/asan_test.cc?revision=166104&view=markup
>> > This should be easy to port to GCC, but it requires gtest.
>>
>> I don't think we want to depend on gtest, if the tests only use a small
>> subset of that, it would be better to just use some header with macros
>> compatible with that for assertions and the like.
>
>
> We use a large and heavy subset of gtest, namely: death tests.
> It can't be easily replaced with "some header with macros".
>

gtest integrates multiple tests into the same file, with each test
being a single-line check.  I cannot think of a way to migrate it to
dejagnu without using gtest, except splitting a single gtest file into
multiple files, one file per test. asan has about 130 tests, so we
would have to write about 130 files, which is doable but painful.

>>
>>
>> > 3. Full output tests (a .cc file should be build with asan switch,
>> > executable should be run and the stderr is compared with the expected
>> > output)
>> > Example:
>> > http://llvm.org/viewvc/llvm-project/compiler-rt/trunk/lib/asan/lit_tests/stack-overflow.cc?revision=165391&view=markup
>> > The can be ported to GCC, but the uses of FileCheck
>> > (http://llvm.org/docs/CommandGuide/FileCheck.html) will need to be
>> > replaced with GCC's analog.
>> > We should probably start with these tests.
>>
>> Dejagnu in GCC has
>>
>> ! { dg-do run }
>> ! { dg-options "-fbounds-check" }
>> ! { dg-shouldfail "Duplicate value 2 in ORDER argument to RESHAPE
>> intrinsic" }
>> program main
>>   implicit none
>>   integer(kind=1), dimension(6) :: source1 = (/ 1, 2, 3, 4, 5, 6 /)
>>   integer, dimension(2) :: shape1 = (/ 2, 3/)
>>   integer(kind=1), dimension(2) :: pad1 = (/ 0, 0/)
>>   character(len=200) :: l1, l2
>>   integer :: i1, i2
>>
>>   l1 = "2 2"
>>   read(unit=l1,fmt=*) i1, i2
>>   write (unit=l2,fmt=*) reshape(source1, shape1, pad1, (/i1, i2/)) !
>> Invalid
>> end program main
>> ! { dg-output "Fortran runtime error: Duplicate value 2 in ORDER argument
>> to RESHAPE intrinsic" }
>>
>> style markings, dg-shouldfail says that the program is expected to fail
>> rather than pass (if it aborts), and dg-output (perhaps multiple) can
>> contain regexps to match against stderr + stdout joined.  Haven't looked
>> at the asan tests yet, do you expect just one ASAN abort per test,
>
>
> These tests do just one abort (actually, _exit(1)) per test.
> Let's start with these.
>
> --kcc
>

I will start with this.

Thanks,
Wei.


Re: Asan/Tsan Unit/Regression testing (was [asan] Emit GIMPLE direclty, small cleanups)

2012-11-12 Thread Wei Mi
On Fri, Nov 9, 2012 at 11:13 AM, Jakub Jelinek  wrote:
> On Fri, Nov 09, 2012 at 11:05:37AM -0800, Wei Mi wrote:
>> gtest integrate multiple tests into the same file with each test being
>> a single line check.  I cannot think out a method to migrate it to
>> dejagnu without using gtest, except splitting a single gtest file to
>> multiple files with each file per test. asan has about 130 tests so
>> have to write 130 files which will be a doable but painful task.
>
> See the glibc _FORTIFY_SOURCE check I've referenced, there it is 3 lines
> per test expected to crash, but could be done in a single macro too.
> If the failure can be intercepted, it can be done in a single process (e.g.
> SIGABRT can, _exit can't that easily), otherwise perhaps say way to skip
> previous tests and communicate with dejagnu how many times it should run the
> executable.
>
> Jakub

If we use setjmp/longjmp to do multiple tests in a single testfile,
test statements in the front could affect the tests in the back. gtest
forks a new process for every test statement; the forked process does
only one test and skips all the other test statements. That is to say,
multiple test statements in the same testfile are guaranteed to be
independent of each other in gtest. If we use the setjmp/longjmp
pattern to do the tests, the existing testsuite may need to be
rewritten wherever its test statements could affect each other.
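
A minimal sketch of that isolation pattern (my illustration, not gtest
source): run each death test in a forked child so earlier tests cannot
affect later ones.

  #include <stdio.h>
  #include <sys/wait.h>
  #include <unistd.h>

  static void
  expect_death (void (*test) (void), const char *name)
  {
    pid_t pid = fork ();
    if (pid == 0)
      {
        test ();     /* expected to crash or _exit (1)  */
        _exit (0);   /* reaching here means the test failed  */
      }
    int status;
    waitpid (pid, &status, 0);
    int died = WIFSIGNALED (status)
               || (WIFEXITED (status) && WEXITSTATUS (status) != 0);
    printf ("%s: %s\n", name, died ? "PASS" : "FAIL");
  }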

Thanks,
Wei.


[Google] X86_TUNE_USE_VECTOR_CONVERTS adjustment

2013-08-15 Thread Wei Mi
Turning off X86_TUNE_USE_VECTOR_FP_CONVERTS makes GCC use cvtss2sd
instead of unpcklps+cvtps2pd, which is better for some recent Intel
microarchitectures such as westmere and sandybridge. So turn it off
for m_GENERIC and m_CORE_ALL.

Regression and bootstrap ok. Ok for the 4.8 branch?

Index: config/i386/i386.c
===
--- config/i386/i386.c (revision 201675)
+++ config/i386/i386.c (working copy)
@@ -1995,7 +1995,7 @@ static unsigned int initial_ix86_tune_fe

   /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
  from FP to FP. */
-  m_CORE_ALL | m_AMDFAM10 | m_GENERIC,
+  m_AMDFAM10,

   /* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
  from integer to FP. */


Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion

2013-09-03 Thread Wei Mi
This is a patch to prevent the scheduler from scheduling compare and
branch away from each other, in order to increase the macro-fusion
opportunity on recent x86 platforms. It is motivated by the following
small testcase.

double __attribute__ ((noinline)) bar (double sum);

int a[100];

double bar (double sum)
{
  int i;
  for (i = 0; i < 1000000; i++)
   sum += (0.5 + (a[i%100] - 128));
  return sum;
}

int main() {
  double total;
  int i;

  for (i = 0; i < 1000; i++)
total += bar (i);

  return total != 0.333;
}

~/workarea/gcc-r201963/build/install/bin/gcc -O2 -mtune=corei7-avx 1.c -o 1.out
The binary of the kernel loop in func bar () is:

  401180:   89 c8                   mov    %ecx,%eax
  401182:   66 0f 57 c9             xorpd  %xmm1,%xmm1
  401186:   f7 ee                   imul   %esi
  401188:   89 c8                   mov    %ecx,%eax
  40118a:   c1 f8 1f                sar    $0x1f,%eax
  40118d:   c1 fa 05                sar    $0x5,%edx
  401190:   29 c2                   sub    %eax,%edx
  401192:   b8 64 00 00 00          mov    $0x64,%eax
  401197:   0f af d0                imul   %eax,%edx
  40119a:   89 c8                   mov    %ecx,%eax
  40119c:   83 c1 01                add    $0x1,%ecx
  40119f:   29 d0                   sub    %edx,%eax
  4011a1:   48 98                   cltq
  4011a3:   8b 04 85 60 51 6c 00    mov    0x6c5160(,%rax,4),%eax
  4011aa:   83 c0 80                add    $0xffffff80,%eax
  4011ad:   81 f9 40 42 0f 00       cmp    $0xf4240,%ecx
  4011b3:   f2 0f 2a c8             cvtsi2sd %eax,%xmm1
  4011b7:   f2 0f 58 ca             addsd  %xmm2,%xmm1
  4011bb:   f2 0f 58 c1             addsd  %xmm1,%xmm0
  4011bf:   75 bf                   jne    401180

Here cmp (addr: 4011ad) and jne (addr: 4011bf) are not consecutive in
the object code, but they are consecutive before the sched2 pass. If we
manually keep the cmp and jne together, the performance of 1.out
improves from 2.40s to 2.31s on a sandybridge machine. Perf stat result
shows that the UOPS_RETIRED.MACRO_FUSED event increases from 131,075 to
1,000,130,308, and the UOPS_RETIRED.ANY event decreases from
23,002,543,637 to 22,002,511,525.

The patch is to reschedule cmp and jmp to make them consecutive. It is
done at the end of scheduling each block, before the schedule result is
committed. Bootstrapped and regression ok on x86_64-linux-gnu. Ok for
trunk?

2013-09-03  Wei Mi  

* haifa-sched.c (move_insns): New function.
(adjust_for_macro_fusion): Ditto.
(schedule_block): Call adjust_for_macro_fusion before commit schedule.
* doc/tm.texi.in: Generated.
* doc/tm.texi: Ditto.
* config/i386/x86-tune.def (DEF_TUNE): Add m_COREI7 for
X86_TUNE_FUSE_CMP_AND_BRANCH.
* config/i386/i386.c (ix86_macro_fusion_p): New function.
(ix86_macro_fusion_pair_p): Ditto.
* target.def: Add macro_fusion_p and macro_fusion_pair_p in sched
group.

Index: haifa-sched.c
===
--- haifa-sched.c   (revision 201963)
+++ haifa-sched.c   (working copy)
@@ -5605,6 +5605,56 @@ choose_ready (struct ready_list *ready,
 }
 }

+/* Move insn scheduled_insns[I] to the position J in scheduled_insns.  */
+
+static void
+move_insns (int i, int j)
+{
+  rtx insn = scheduled_insns[i];
+  scheduled_insns.ordered_remove (i);
+  scheduled_insns.safe_insert (j, insn);
+}
+
+/* If the last cond jump and the cond register setting insn are consecutive
+   before scheduling, and are scheduled away from each other, this func
+   tries to rearrange insns in scheduled_insns and keep those two insns
+   together. This is good for performance on microarchitectures supporting
+   macro-fusion.  */
+
+static void
+adjust_for_macro_fusion ()
+{
+  int i = -1, length;
+  unsigned int condreg1, condreg2;
+  rtx cc_reg_1;
+  rtx insn;
+  rtx last = scheduled_insns.last();
+
+  targetm.fixed_condition_code_regs (&condreg1, &condreg2);
+  cc_reg_1 = gen_rtx_REG (CCmode, condreg1);
+  length = scheduled_insns.length ();
+  if (any_condjump_p (last) && reg_referenced_p (cc_reg_1, PATTERN (last)))
+{
+  for (i = length - 2; i >= 0; i--)
+   {
+ insn = scheduled_insns[i];
+ if (modified_in_p (cc_reg_1, insn))
+break;
+   }
+}
+  if (i < 0 || i == length - 2)
+return;
+
+  if (NEXT_INSN (insn) != last)
+return;
+
+  if (!targetm.sched.macro_fusion_pair_p
+  || !targetm.sched.macro_fusion_pair_p (insn, last))
+return;
+
+  move_insns (i, length - 2);
+}
+
 /* This function is called when we have successfully scheduled a
block.  It uses the schedule stored in the scheduled_insns vector
to rearrange the RTL.  PREV_HEAD is used as the anchor to which we
@@ -6421,6 +6471,9 @@ schedule_block (basic_block *target_bb,

   if (success)
 {
+  if (

Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion

2013-09-04 Thread Wei Mi
Thanks for the suggestions! I took a look at adjust_priority, and found
it may not guarantee that cmp and jmp are scheduled together. The
priority is used to choose a candidate from the ready list. If cmp is
the only insn in the ready list and there is another insn-A in the
queued set (insn-A's dependences have been resolved, but it is not
ready because of a data delay or resource delay), then cmp will be
scheduled before insn-A no matter what their priorities are.

I will take a look at whether SCHED_GROUP is going to work.

On Wed, Sep 4, 2013 at 12:33 PM, Alexander Monakov  wrote:
> On Wed, Sep 4, 2013 at 9:53 PM, Steven Bosscher  wrote:
>>
>> On Wed, Sep 4, 2013 at 10:58 AM, Alexander Monakov wrote:
>> > Hello,
>> >
>> > Could you use the existing facilities instead, such as adjust_priority 
>> > hook,
>> > or making the compare-branch insn sequence a SCHED_GROUP?
>>
>>
>> Or a define_bypass?
>
> Hm, I don't think define_bypass would work: it still leaves the
> scheduler freedom to move the compare up.
>
> IMO adjust_priority would be preferable if it allows to achieve the goal.
>
> Alexander


Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion

2013-09-06 Thread Wei Mi
SCHED_GROUP works after I added a chain_to_prev_insn call after
add_branch_dependences, in order to chain control dependences to the
prev insn for a sched group. Here is the new patch. Testing is going on.

Thanks,
Wei Mi.

2013-09-06  Wei Mi  

* config/i386/i386.c (ix86_macro_fusion_p): New function.
(ix86_macro_fusion_pair_p): Ditto.
* config/i386/x86-tune.def (DEF_TUNE): Add m_COREI7 for
X86_TUNE_FUSE_CMP_AND_BRANCH.
* sched-deps.c (group_insns_for_macro_fusion): New function.
(sched_analyze_insn): Call group_insns_for_macro_fusion.
(chain_to_prev_insn): Change it from static to extern.
(chain_to_prev_insn_p): Ditto.
* doc/tm.texi: Generated.
* doc/tm.texi.in: Ditto.
* sched-int.h: New declarations.
* sched-rgn.c (add_branch_dependences): Chain control
dependences to prev insn for sched group.
* target.def: Add macro_fusion_p and macro_fusion_pair_p.

Index: config/i386/i386.c
===
--- config/i386/i386.c  (revision 201963)
+++ config/i386/i386.c  (working copy)
@@ -24850,6 +24850,99 @@ ia32_multipass_dfa_lookahead (void)
 }
 }

+/* Return true if target platform supports macro-fusion.  */
+
+static bool
+ix86_macro_fusion_p ()
+{
+  if (TARGET_FUSE_CMP_AND_BRANCH)
+return true;
+  else
+return false;
+}
+
+/* Check whether current microarchitecture support macro fusion
+   for insn pair "CONDGEN + CONDJMP". Refer to
+   "Intel Architectures Optimization Reference Manual". */
+
+static bool
+ix86_macro_fusion_pair_p (rtx condgen, rtx condjmp)
+{
+  rtx src;
+  if (!strcmp (ix86_tune_string, "corei7"))
+{
+  /* For Nehalem.  */
+  rtx single_set = single_set (condgen);
+  /* Nehalem doesn't support macro-fusion for add/sub+jmp.  */
+  if (single_set == NULL_RTX)
+return false;
+
+  src = SET_SRC (single_set);
+  if (GET_CODE (src) != COMPARE)
+   return false;
+
+  /* Nehalem doesn't support macro-fusion for cmp/test MEM-IMM
+insn pattern.  */
+  if ((MEM_P (XEXP (src, 0))
+  && CONST_INT_P (XEXP (src, 1)))
+ || (MEM_P (XEXP (src, 1))
+ && CONST_INT_P (XEXP (src, 0))))
+   return false;
+
+  /* Nehalem doesn't support macro-fusion for add/sub/dec/inc + jmp.  */
+  if (get_attr_type (condgen) != TYPE_TEST
+ && get_attr_type (condgen) != TYPE_ICMP)
+   return false;
+  return true;
+}
+  else if (!strcmp (ix86_tune_string, "corei7-avx"))
+{
+  /* For Sandybridge.  */
+  enum rtx_code ccode;
+  rtx compare_set = NULL_RTX, test_if, cond;
+  rtx single_set = single_set (condgen);
+  if (single_set != NULL_RTX)
+compare_set = single_set;
+  else
+   {
+ int i;
+ rtx pat = PATTERN (condgen);
+ for (i = 0; i < XVECLEN (pat, 0); i++)
+   if (GET_CODE (XVECEXP (pat, 0, i)) == SET
+   && GET_CODE (SET_SRC (XVECEXP (pat, 0, i))) == COMPARE)
+ compare_set = XVECEXP (pat, 0, i);
+   }
+
+  if (compare_set == NULL_RTX)
+   return false;
+  src = SET_SRC (compare_set);
+  if (GET_CODE (src) != COMPARE)
+   return false;
+
+  /* Sandybridge doesn't support macro-fusion for cmp/test MEM-IMM
+insn pattern.  */
+  if ((MEM_P (XEXP (src, 0))
+   && CONST_INT_P (XEXP (src, 1)))
+  || (MEM_P (XEXP (src, 1))
+  && CONST_INT_P (XEXP (src, 0))))
+return false;
+
+  /* Sandybridge doesn't support macro-fusion for inc/dec +
+unsigned comparison jmp.  */
+  test_if = SET_SRC (pc_set (condjmp));
+  cond = XEXP (test_if, 0);
+  ccode = GET_CODE (cond);
+  if (get_attr_type (condgen) == TYPE_INCDEC
+ && (ccode == GEU
+ || ccode == GTU
+ || ccode == LEU
+ || ccode == LTU))
+   return false;
+  return true;
+}
+  return false;
+}
+
 /* Try to reorder ready list to take advantage of Atom pipelined IMUL
execution. It is applied if
(1) IMUL instruction is on the top of list;
@@ -42982,6 +43075,10 @@ ix86_memmodel_check (unsigned HOST_WIDE_
 #undef TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD
 #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD \
   ia32_multipass_dfa_lookahead
+#undef TARGET_SCHED_MACRO_FUSION_P
+#define TARGET_SCHED_MACRO_FUSION_P ix86_macro_fusion_p
+#undef TARGET_SCHED_MACRO_FUSION_PAIR_P
+#define TARGET_SCHED_MACRO_FUSION_PAIR_P ix86_macro_fusion_pair_p

 #undef TARGET_FUNCTION_OK_FOR_SIBCALL
 #define TARGET_FUNCTION_OK_FOR_SIBCALL ix86_function_ok_for_sibcall
Index: config/i386/x86-tune.def
===
--- config/i386/x86-tune.def(revision 201963)
+++ config/i386/x86-tune.def(workin

Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion

2013-09-09 Thread Wei Mi
Added a testcase. Bootstrap and regression are ok for the patch in the last mail.

2013-09-09  Wei Mi  

* gcc/testsuite/gcc.dg/macro-fusion-1.c: New.

Index: gcc/testsuite/gcc.dg/macro-fusion-1.c
===
--- gcc/testsuite/gcc.dg/macro-fusion-1.c   (revision 0)
+++ gcc/testsuite/gcc.dg/macro-fusion-1.c   (revision 0)
@@ -0,0 +1,14 @@
+/* { dg-do compile { target i?86-*-* x86_64-*-* } } */
+/* { dg-options "-O2 -mtune=corei7 -fdump-rtl-sched2" } */
+/* { dg-final { scan-rtl-dump-not "compare.*insn.*jump_insn.*jump_insn" "sched2" } } */
+
+int a[100];
+
+double bar (double sum)
+{
+  int i;
+  for (i = 0; i < 100; i++)
+   sum += (0.5 + (a[i%100] - 128));
+  return sum;
+}
+

On Fri, Sep 6, 2013 at 10:39 AM, Wei Mi  wrote:
> SCHED_GROUP works after I add chain_to_prev_insn after
> add_branch_dependences, in order to chain control dependences to prev
> insn for sched group. Here is the new patch. Testing is going on.
>
> Thanks,
> Wei Mi.
>
> 2013-09-06  Wei Mi  
>
> * config/i386/i386.c (ix86_macro_fusion_p): New function.
> (ix86_macro_fusion_pair_p): Ditto.
> * config/i386/x86-tune.def (DEF_TUNE): Add m_COREI7 for
> X86_TUNE_FUSE_CMP_AND_BRANCH.
> * sched-deps.c (group_insns_for_macro_fusion): New function.
> (sched_analyze_insn): Call group_insns_for_macro_fusion.
> (chain_to_prev_insn): Change it from static to extern.
> (chain_to_prev_insn_p): Ditto.
> * doc/tm.texi: Generated.
> * doc/tm.texi.in: Ditto.
> * sched-int.h: New declarations.
> * sched-rgn.c (add_branch_dependences): Chain control
> dependences to prev insn for sched group.
> * target.def: Add macro_fusion_p and macro_fusion_pair_p.
>
> Index: config/i386/i386.c
> ===
> --- config/i386/i386.c  (revision 201963)
> +++ config/i386/i386.c  (working copy)
> @@ -24850,6 +24850,99 @@ ia32_multipass_dfa_lookahead (void)
>  }
>  }
>
> +/* Return true if target platform supports macro-fusion.  */
> +
> +static bool
> +ix86_macro_fusion_p ()
> +{
> +  if (TARGET_FUSE_CMP_AND_BRANCH)
> +return true;
> +  else
> +return false;
> +}
> +
> +/* Check whether current microarchitecture support macro fusion
> +   for insn pair "CONDGEN + CONDJMP". Refer to
> +   "Intel Architectures Optimization Reference Manual". */
> +
> +static bool
> +ix86_macro_fusion_pair_p (rtx condgen, rtx condjmp)
> +{
> +  rtx src;
> +  if (!strcmp (ix86_tune_string, "corei7"))
> +{
> +  /* For Nehalem.  */
> +  rtx single_set = single_set (condgen);
> +  /* Nehalem doesn't support macro-fusion for add/sub+jmp.  */
> +  if (single_set == NULL_RTX)
> +return false;
> +
> +  src = SET_SRC (single_set);
> +  if (GET_CODE (src) != COMPARE)
> +   return false;
> +
> +  /* Nehalem doesn't support macro-fusion for cmp/test MEM-IMM
> +insn pattern.  */
> +  if ((MEM_P (XEXP (src, 0))
> +  && CONST_INT_P (XEXP (src, 1)))
> + || (MEM_P (XEXP (src, 1))
> + && CONST_INT_P (XEXP (src, 0))))
> +   return false;
> +
> +  /* Nehalem doesn't support macro-fusion for add/sub/dec/inc + jmp.  */
> +  if (get_attr_type (condgen) != TYPE_TEST
> + && get_attr_type (condgen) != TYPE_ICMP)
> +   return false;
> +  return true;
> +}
> +  else if (!strcmp (ix86_tune_string, "corei7-avx"))
> +{
> +  /* For Sandybridge.  */
> +  enum rtx_code ccode;
> +  rtx compare_set = NULL_RTX, test_if, cond;
> +  rtx single_set = single_set (condgen);
> +  if (single_set != NULL_RTX)
> +compare_set = single_set;
> +  else
> +   {
> + int i;
> + rtx pat = PATTERN (condgen);
> + for (i = 0; i < XVECLEN (pat, 0); i++)
> +   if (GET_CODE (XVECEXP (pat, 0, i)) == SET
> +   && GET_CODE (SET_SRC (XVECEXP (pat, 0, i))) == COMPARE)
> + compare_set = XVECEXP (pat, 0, i);
> +   }
> +
> +  if (compare_set == NULL_RTX)
> +   return false;
> +  src = SET_SRC (compare_set);
> +  if (GET_CODE (src) != COMPARE)
> +   return false;
> +
> +  /* Sandybridge doesn't support macro-fusion for cmp/test MEM-IMM
> +insn pattern.  */
> +  if ((MEM_P (XEXP (src, 0))
> +   && CONST_INT_P (XEXP (src, 1)))
> +  || (MEM_P (XEXP (src, 1))
> +  && CONST_INT_P (XEXP 

Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion

2013-09-10 Thread Wei Mi
Because deps_analyze_insn only analyzes data deps, not control deps.
Control deps are added by add_branch_dependences. Without the
chain_to_prev_insn at the end of add_branch_dependences, the jmp will be
control dependent on every previous insn in the same bb, and the cmp
and jmp group could still be scheduled apart since they will not be
put into the ready list at the same time.


On Tue, Sep 10, 2013 at 4:44 AM, Alexander Monakov  wrote:
>
>
> On Fri, 6 Sep 2013, Wei Mi wrote:
>
>> SCHED_GROUP works after I add chain_to_prev_insn after
>> add_branch_dependences, in order to chain control dependences to prev
>> insn for sched group.
>
> chain_to_prev_insn is done in the end of deps_analyze_insn, why is that not
> sufficient?
>
> Alexander


Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion

2013-09-11 Thread Wei Mi
I tried that and it caused some regressions, so I chose to call
chain_to_prev_insn another time in add_branch_dependences. There could
be some dependence between those two functions.

On Wed, Sep 11, 2013 at 2:58 AM, Alexander Monakov  wrote:
>
>
> On Tue, 10 Sep 2013, Wei Mi wrote:
>
>> Because deps_analyze_insn only analyzes data deps but no control deps.
>> Control deps are included by add_branch_dependences. Without the
>> chain_to_prev_insn in the end of add_branch_dependences, jmp will be
>> control dependent on every previous insn in the same bb, and the cmp
>> and jmp group could still be scheduled apart since they will not be
>> put in ready list at the same time.
>
> Would calling add_branch_dependences before sched_analyze solve that, then?
>
> Alexander


Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion

2013-09-11 Thread Wei Mi
Taking the same issue slot is not enough for x86. The compare and
branch need to be consecutive in the binary to be macro-fused on x86.

Thanks,
Wei Mi.

On Wed, Sep 11, 2013 at 10:45 AM, Andrew Pinski  wrote:
> On Wed, Sep 4, 2013 at 12:33 PM, Alexander Monakov  wrote:
>> On Wed, Sep 4, 2013 at 9:53 PM, Steven Bosscher  
>> wrote:
>>>
>>> On Wed, Sep 4, 2013 at 10:58 AM, Alexander Monakov wrote:
>>> > Hello,
>>> >
>>> > Could you use the existing facilities instead, such as adjust_priority 
>>> > hook,
>>> > or making the compare-branch insn sequence a SCHED_GROUP?
>>>
>>>
>>> Or a define_bypass?
>>
>> Hm, I don't think define_bypass would work: it still leaves the
>> scheduler freedom to move the compare up.
>
> Even though it allows the scheduler freedom to move the compare up,
> the scheduler does so due to the schedule model not being correct for the
> processor.  I have done the same for Octeon2, where it is able to
> combine the compare and the branch, and found the resulting schedule is
> much better than even what this hack could do, due to the instructions
> still taking an issue slot.  Is it true that for these two processors it
> takes an issue slot or is it being done before issue?
>
> Thanks,
> Andrew Pinski
>
>>
>> IMO adjust_priority would be preferable if it allows to achieve the goal.
>>
>> Alexander


[PATCH] disable use_vector_fp_converts for m_CORE_ALL

2013-09-11 Thread Wei Mi
For the following testcase 1.c, on westmere and sandybridge,
performance with the option -mtune-ctrl=^use_vector_fp_converts is
better (improves from 3.46s to 2.83s). It means cvtss2sd is often
better than unpcklps+cvtps2pd on recent x86 platforms.

1.c:
float total = 0.2;
int k = 5;

int main() {
 int i;

 for (i = 0; i < 10; i++) {
   total += (0.5 + k);
 }

 return total == 0.3;
}

assembly generated by gcc-r201963 without -mtune-ctrl=^use_vector_fp_converts
.L2:
        unpcklps %xmm0, %xmm0
        subl    $1, %eax
        cvtps2pd %xmm0, %xmm0
        addsd   %xmm1, %xmm0
        unpcklpd %xmm0, %xmm0
        cvtpd2ps %xmm0, %xmm0
        jne     .L2

assembly generated by gcc-r201963 with -mtune-ctrl=^use_vector_fp_converts
.L2:
        cvtss2sd %xmm0, %xmm0
        subl    $1, %eax
        addsd   %xmm1, %xmm0
        cvtsd2ss %xmm0, %xmm0
        jne     .L2

But for testcase 2.c (thanks to Igor Zamyatin for the testcase),
performance with the option -mtune-ctrl=^use_vector_fp_converts is
worse. Analysis of the assembly shows the performance degradation
comes from a partial reg stall caused by cvtsd2ss. Adding pxor %xmm0,
%xmm0 before cvtsd2ss b(,%rdx,8), %xmm0 gets the performance back.

2.c:
double b[1024];

float a[1024];

int main()
{
int i;
for(i = 0 ; i < 1024 * 1024 * 256; i++)
  a[i & 1023] = a[i & 1023] * (float)b[i & 1023];
return (int)a[512];
}

without -mtune-ctrl=^use_vector_fp_converts
.L2:
        movl    %eax, %edx
        addl    $1, %eax
        andl    $1023, %edx
        cmpl    $268435456, %eax
        movsd   b(,%rdx,8), %xmm0
        cvtpd2ps %xmm0, %xmm0    ==> no partial reg stall because of movsd
        mulss   a(,%rdx,4), %xmm0
        movss   %xmm0, a(,%rdx,4)
        jne     .L2

with -mtune-ctrl=^use_vector_fp_converts
.L2:
        movl    %eax, %edx
        addl    $1, %eax
        andl    $1023, %edx
        cmpl    $268435456, %eax
        cvtsd2ss b(,%rdx,8), %xmm0   ==> with partial reg stall; needs
                                         "pxor %xmm0, %xmm0" inserted first
        mulss   a(,%rdx,4), %xmm0
        movss   %xmm0, a(,%rdx,4)
        jne     .L2

So the patch is to turn off use_vector_fp_converts for m_CORE_ALL to
use cvtss2sd/cvtsd2ss directly, and to add "pxor %xmmreg, %xmmreg"
before cvtss2sd/cvtsd2ss to break the partial reg stall (similar to
what r201308 does for cvtsi2ss/cvtsi2sd). Bootstrap and regression
pass. Ok for trunk?

Thanks,
Wei Mi.

2013-09-11  Wei Mi  

* config/i386/x86-tune.def (DEF_TUNE): Remove
m_CORE_ALL.
* config/i386/i386.md: Add define_peephole2 to
break partial reg stall for cvtss2sd/cvtsd2ss.

Index: config/i386/x86-tune.def
===
--- config/i386/x86-tune.def(revision 201963)
+++ config/i386/x86-tune.def(working copy)
@@ -189,7 +189,7 @@ DEF_TUNE (X86_TUNE_NOT_VECTORMODE, "not_
 /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
from FP to FP. */
 DEF_TUNE (X86_TUNE_USE_VECTOR_FP_CONVERTS, "use_vector_fp_converts",
-  m_CORE_ALL | m_AMDFAM10 | m_GENERIC)
+  m_AMDFAM10 | m_GENERIC)
 /* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
from integer to FP. */
 DEF_TUNE (X86_TUNE_USE_VECTOR_CONVERTS, "use_vector_converts", m_AMDFAM10)
Index: config/i386/i386.md
===
--- config/i386/i386.md (revision 201963)
+++ config/i386/i386.md (working copy)
@@ -5075,6 +5075,63 @@
   emit_move_insn (operands[0], CONST0_RTX (mode));
 })

+;; Break partial reg stall for cvtsd2ss.
+
+(define_peephole2
+  [(set (match_operand:SF 0 "register_operand")
+(float_truncate:SF
+ (match_operand:DF 1 "nonimmediate_operand")))]
+  "TARGET_SSE2 && TARGET_SSE_MATH
+   && TARGET_SSE_PARTIAL_REG_DEPENDENCY
+   && optimize_function_for_speed_p (cfun)
+   && reload_completed && SSE_REG_P (operands[0])
+   && peep2_reg_dead_p (0, operands[0])
+   && (!SSE_REG_P (operands[1])
+   || REGNO (operands[0]) != REGNO (operands[1]))"
+  [(set (match_dup 0)
+   (vec_merge:V4SF
+ (vec_duplicate:V4SF
+   (float_truncate:V2SF
+ (match_dup 1)))
+ (match_dup 0)
+ (const_int 1)))]
+{
+  operands[0] = simplify_gen_subreg (V4SFmode, operands[0],
+SFmode, 0);
+  operands[1] = simplify_gen_subreg (V2DFmode, operands[1],
+DFmode, 0);
+  emit_move_insn (operands[0], CONST0_RTX (V4SFmode));
+})
+
+;; Break partial reg stall for cvtss2sd.
+
+(define_peephole2
+  [(set (match_operand:DF 0 "register_operand")
+(float_extend:DF
+  (match_operand:SF 1 "nonimmediate_operand")))]
+  "TARGET_SSE2 &

Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion

2013-09-11 Thread Wei Mi
Thanks! Your method to adjust 'last' is more concise. I tried it and it
works for small testcases. Bootstrap and regression are ok. More
performance testing is going on.

I agree with you that explicit handling in sched-deps.c for this
feature doesn't look good. So I moved it to sched_init (instead of
ix86_sched_init_global, because ix86_sched_init_global is used to
install scheduling hooks), and then it is possible for other
architectures to use it.
I also need the two hooks because one is used as the gate for
macro-fusion controlled by -mtune-ctrl=fuse_cmp_and_branch on x86, and
the other is used to check which kind of cmp and branch pair
macro-fusion is supported on the target platform. But I am not sure if
it is proper to put those two hooks under the TARGET_SCHED hook vector.

Thanks,
Wei Mi.

updated patch:

Index: doc/tm.texi.in
===
--- doc/tm.texi.in  (revision 201771)
+++ doc/tm.texi.in  (working copy)
@@ -6455,6 +6455,10 @@ scheduling one insn causes other insns t
 cycle.  These other insns can then be taken into account properly.
 @end deftypefn

+@hook TARGET_SCHED_MACRO_FUSION_P
+
+@hook TARGET_SCHED_MACRO_FUSION_PAIR_P
+
 @hook TARGET_SCHED_DEPENDENCIES_EVALUATION_HOOK
 This hook is called after evaluation forward dependencies of insns in
 chain given by two parameter values (@var{head} and @var{tail}
Index: doc/tm.texi
===
--- doc/tm.texi (revision 201771)
+++ doc/tm.texi (working copy)
@@ -6551,6 +6551,17 @@ scheduling one insn causes other insns t
 cycle.  These other insns can then be taken into account properly.
 @end deftypefn

+@deftypefn {Target Hook} bool TARGET_SCHED_MACRO_FUSION_P (void)
+This hook is used to check whether target platform supports macro fusion.
+@end deftypefn
+
+@deftypefn {Target Hook} bool TARGET_SCHED_MACRO_FUSION_PAIR_P (rtx
@var{condgen}, rtx @var{condjmp})
+This hook is used to check whether two insns could be macro fused for
+target microarchitecture. If this hook returns true for the given insn pair
+(@var{condgen} and @var{condjmp}), scheduler will put them into a sched
+group, and they will not be scheduled apart.
+@end deftypefn
+
 @deftypefn {Target Hook} void
TARGET_SCHED_DEPENDENCIES_EVALUATION_HOOK (rtx @var{head}, rtx
@var{tail})
 This hook is called after evaluation forward dependencies of insns in
 chain given by two parameter values (@var{head} and @var{tail}
Index: config/i386/i386.c
===
--- config/i386/i386.c  (revision 201771)
+++ config/i386/i386.c  (working copy)
@@ -2004,7 +2004,7 @@ static unsigned int initial_ix86_tune_fe
   /* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
  with a subsequent conditional jump instruction into a single
  compare-and-branch uop.  */
-  m_BDVER,
+  m_COREI7 | m_BDVER,

   /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
  will impact LEA instruction selection. */
@@ -24845,6 +24845,99 @@ ia32_multipass_dfa_lookahead (void)
 }
 }

+/* Return true if target platform supports macro-fusion.  */
+
+static bool
+ix86_macro_fusion_p ()
+{
+  if (TARGET_FUSE_CMP_AND_BRANCH)
+return true;
+  else
+return false;
+}
+
+/* Check whether current microarchitecture support macro fusion
+   for insn pair "CONDGEN + CONDJMP". Refer to
+   "Intel Architectures Optimization Reference Manual". */
+
+static bool
+ix86_macro_fusion_pair_p (rtx condgen, rtx condjmp)
+{
+  rtx src;
+  if (!strcmp (ix86_tune_string, "corei7"))
+{
+  /* For Nehalem.  */
+  rtx single_set = single_set (condgen);
+  /* Nehalem doesn't support macro-fusion for add/sub+jmp.  */
+  if (single_set == NULL_RTX)
+return false;
+
+  src = SET_SRC (single_set);
+  if (GET_CODE (src) != COMPARE)
+   return false;
+
+  /* Nehalem doesn't support macro-fusion for cmp/test MEM-IMM
+insn pattern.  */
+  if ((MEM_P (XEXP (src, 0))
+  && CONST_INT_P (XEXP (src, 1)))
+ || (MEM_P (XEXP (src, 1))
+ && CONST_INT_P (XEXP (src, 0))))
+   return false;
+
+  /* Nehalem doesn't support macro-fusion for add/sub/dec/inc + jmp.  */
+  if (get_attr_type (condgen) != TYPE_TEST
+ && get_attr_type (condgen) != TYPE_ICMP)
+   return false;
+  return true;
+}
+  else if (!strcmp (ix86_tune_string, "corei7-avx"))
+{
+  /* For Sandybridge.  */
+  enum rtx_code ccode;
+  rtx compare_set = NULL_RTX, test_if, cond;
+  rtx single_set = single_set (condgen);
+  if (single_set != NULL_RTX)
+compare_set = single_set;
+  else
+   {
+ int i;
+ rtx pat = PATTERN (condgen);
+ for (i = 0; i < XVECLEN (pat, 0); i++)
+   if (GET_CODE (XVECEXP (pat, 0, i)) == SET
+  

Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion

2013-09-12 Thread Wei Mi
> Your new implementation is not efficient: when looping over BBs, you need to
> look only at the last insn of each basic block.
>

Thanks, fixed. New patch attached.




Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion

2013-09-13 Thread Wei Mi
> Thanks.  At this point you need feedback from x86 and scheduler maintainers.
> I would recommend you to resubmit the patch with a Changelog text, and with
> the text of the patch inline in the email (your last mail has the patch as a
> binary attachment, which makes it harder to review and respond to).  Please
> mention if the updated patch passes bootstrap and regtest.

Thanks! Here is the new patch. bootstrap and regression pass. ok for trunk?

2013-09-13  Wei Mi  

* sched-rgn.c (add_branch_dependences): Keep insns in
a SCHED_GROUP at the end of bb to remain their locations.
* config/i386/x86-tune.def (DEF_TUNE): Add m_COREI7 for
X86_TUNE_FUSE_CMP_AND_BRANCH.
* config/i386/i386.c (ix86_macro_fusion_p): New Function.
(ix86_macro_fusion_pair_p): Ditto.
* doc/tm.texi.in: Generated.
* doc/tm.texi: Ditto.
* target.def: Add two hooks: macro_fusion_p and
macro_fusion_pair_p.
* haifa-sched.c (try_group_insn): New function.
(group_insns_for_macro_fusion): New function.
(sched_init): Call group_insns_for_macro_fusion.

Index: sched-rgn.c
===
--- sched-rgn.c (revision 201963)
+++ sched-rgn.c (working copy)
@@ -2443,6 +2443,8 @@ add_branch_dependences (rtx head, rtx ta
  cc0 setters remain at the end because they can't be moved away from
  their cc0 user.

+ Predecessors of SCHED_GROUP_P instructions at the end remain at the end.
+
  COND_EXEC insns cannot be moved past a branch (see e.g. PR17808).

  Insns setting TARGET_CLASS_LIKELY_SPILLED_P registers (usually return
@@ -2465,7 +2467,8 @@ add_branch_dependences (rtx head, rtx ta
 #endif
  || (!reload_completed
  && sets_likely_spilled (PATTERN (insn)
- || NOTE_P (insn))
+ || NOTE_P (insn)
+ || (last != 0 && SCHED_GROUP_P (last)))
 {
   if (!NOTE_P (insn))
  {
Index: config/i386/x86-tune.def
===
--- config/i386/x86-tune.def(revision 201963)
+++ config/i386/x86-tune.def(working copy)
@@ -196,7 +196,8 @@ DEF_TUNE (X86_TUNE_USE_VECTOR_CONVERTS,
 /* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
with a subsequent conditional jump instruction into a single
compare-and-branch uop.  */
-DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH, "fuse_cmp_and_branch", m_BDVER)
+DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH, "fuse_cmp_and_branch",
+  m_COREI7 | m_BDVER)
 /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
will impact LEA instruction selection. */
 DEF_TUNE (X86_TUNE_OPT_AGU, "opt_agu", m_ATOM | m_SLM)
Index: config/i386/i386.c
===
--- config/i386/i386.c  (revision 201963)
+++ config/i386/i386.c  (working copy)
@@ -24850,6 +24850,99 @@ ia32_multipass_dfa_lookahead (void)
 }
 }

+/* Return true if target platform supports macro-fusion.  */
+
+static bool
+ix86_macro_fusion_p ()
+{
+  if (TARGET_FUSE_CMP_AND_BRANCH)
+return true;
+  else
+return false;
+}
+
+/* Check whether current microarchitecture support macro fusion
+   for insn pair "CONDGEN + CONDJMP". Refer to
+   "Intel Architectures Optimization Reference Manual". */
+
+static bool
+ix86_macro_fusion_pair_p (rtx condgen, rtx condjmp)
+{
+  rtx src;
+  if (!strcmp (ix86_tune_string, "corei7"))
+{
+  /* For Nehalem.  */
+  rtx single_set = single_set (condgen);
+  /* Nehalem doesn't support macro-fusion for add/sub+jmp.  */
+  if (single_set == NULL_RTX)
+return false;
+
+  src = SET_SRC (single_set);
+  if (GET_CODE (src) != COMPARE)
+   return false;
+
+  /* Nehalem doesn't support macro-fusion for cmp/test MEM-IMM
+insn pattern.  */
+  if ((MEM_P (XEXP (src, 0))
+  && CONST_INT_P (XEXP (src, 1)))
+ || (MEM_P (XEXP (src, 1))
+ && CONST_INT_P (XEXP (src, 0))))
+   return false;
+
+  /* Nehalem doesn't support macro-fusion for add/sub/dec/inc + jmp.  */
+  if (get_attr_type (condgen) != TYPE_TEST
+ && get_attr_type (condgen) != TYPE_ICMP)
+   return false;
+  return true;
+}
+  else if (!strcmp (ix86_tune_string, "corei7-avx"))
+{
+  /* For Sandybridge.  */
+  enum rtx_code ccode;
+  rtx compare_set = NULL_RTX, test_if, cond;
+  rtx single_set = single_set (condgen);
+  if (single_set != NULL_RTX)
+compare_set = single_set;
+  else
+   {
+ int i;
+ rtx pat = PATTERN (condgen);
+ for (i = 0; i < XVECLEN (pat, 0); i++)
+   if (GET_CODE (XVECEXP (pat, 0, i)) == SET
+   && GET_CODE (SET_SRC (XVECEXP (pat, 0, i))) == COMPARE)
+ compare_set = XVECEXP (pat, 0, i);
+ 

Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion

2013-09-13 Thread Wei Mi
> Checking corei7/corei7-avx explicitly isn't a good idea.
> It is also useful for Ivy Bridge and Haswell.  I think you
> should use a variable to control it, similar to
> TARGET_FUSE_CMP_AND_BRANCH.
>
>
> --
> H.J.

Different x86 microarchitectures support macro-fusion for different
compare and branch combinations, so I need to differentiate various x86
microarchitectures. Using TARGET_FUSE_CMP_AND_BRANCH-like vars to
control it would require a bunch of them. That is why I chose to check
corei7/corei7-avx in that function. I don't add core-avx-i/core-avx2
for now because I don't have those machines for testing.

Thanks,
Wei Mi.


Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion

2013-09-13 Thread Wei Mi
On Fri, Sep 13, 2013 at 12:09 PM, H.J. Lu  wrote:
> On Fri, Sep 13, 2013 at 11:28 AM, Wei Mi  wrote:
>>> Checking corei7/corei7-avx explicitly isn't a good idea.
>>> It is also useful for Ivy Bridge and Haswell.  I think you
>>> should use a variable to control it, similar to
>>> TARGET_FUSE_CMP_AND_BRANCH.
>>>
>>>
>>> --
>>> H.J.
>>
>> Different x86 microarchitectures support macro-fusion for different
>> compare and branch combinations. I need to differentiate various x86
>> microarchitectures. If use TARGET_FUSE_CMP_AND_BRANCH like vars to
>> control it, it requires a bunch of them. That is why I choose to check
>
> Can you use TARGET_FUSE_CMP_AND_BRANCH covers cmp/test
> and branch,  TARGET_FUSE_ALU_AND_BRANCH covers and/add/sub/inc/dec
> and branch?
>

Yes, I can. Thanks for the suggestion. Will fix it, and with Ivy
Bridge and Haswell included.

Thanks,
Wei.


Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion

2013-09-13 Thread Wei Mi
On Fri, Sep 13, 2013 at 1:45 PM, Wei Mi  wrote:
> On Fri, Sep 13, 2013 at 12:09 PM, H.J. Lu  wrote:
>> On Fri, Sep 13, 2013 at 11:28 AM, Wei Mi  wrote:
>>>> Checking corei7/corei7-avx explicitly isn't a good idea.
>>>> It is also useful for Ivy Bridge and Haswell.  I think you
>>>> should use a variable to control it, similar to
>>>> TARGET_FUSE_CMP_AND_BRANCH.
>>>>
>>>>
>>>> --
>>>> H.J.
>>>
>>> Different x86 microarchitectures support macro-fusion for different
>>> compare and branch combinations. I need to differentiate various x86
>>> microarchitectures. If use TARGET_FUSE_CMP_AND_BRANCH like vars to
>>> control it, it requires a bunch of them. That is why I choose to check
>>
>> Can you use TARGET_FUSE_CMP_AND_BRANCH covers cmp/test
>> and branch,  TARGET_FUSE_ALU_AND_BRANCH covers and/add/sub/inc/dec
>> and branch?
>>
>
> Yes, I can. Thanks for the suggestion. Will fix it, and with Ivy
> Bridge and Haswell included.
>

Just noticed another problem here:
processor_type only contains PROCESSOR_COREI7, so I cannot
differentiate Westmere and Sandybridge in x86-tune.def, which are
different for TARGET_FUSE_ALU_AND_BRANCH. So do I have to separate
m_SANDYBRIDGE out from m_COREI7?


Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion

2013-09-16 Thread Wei Mi
>> Just notice another problem here:
>> processor_type only contains PROCESSOR_COREI7, so I cannot
>> differentiate Westmere and Sandybridge in x86-tune.def, which are
>> different for TARGET_FUSE_ALU_AND_BRANCH. So do I have to separate
>> m_SANDYBRIDGE out from m_COREI7?
>
> Yes, please.
>
> Thanks.
>
> --
> H.J.

I separate the change into two patches here:

Patch1 is to separate PROCESSOR_COREI7_AVX from PROCESSOR_COREI7.
PROCESSOR_COREI7_AVX includes Sandybridge and Ivybridge.

Patch2 is about the change for macro-fusion. It adds three tune
features:
X86_TUNE_FUSE_CMP_AND_BRANCH_64: CORE2 only supports macro-fusion in
32-bit mode; COREI7, COREI7_AVX and Haswell also support it in 64-bit
mode.
X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS: CORE2 only supports macro-fusion
for branches checking the Zero and Carry flags; COREI7, COREI7_AVX and
Haswell also support branches checking the Sign and Overflow flags.
X86_TUNE_FUSE_ALU_AND_BRANCH: COREI7 doesn't support macro-fusion for
alu + branch; COREI7_AVX and Haswell support it.
A sketch of the corresponding x86-tune.def entries follows below.
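
Roughly, in x86-tune.def terms (a sketch only -- the authoritative
DEF_TUNE definitions are in patch2):

  DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH_64, "fuse_cmp_and_branch_64",
            m_COREI7 | m_COREI7_AVX | m_HASWELL)
  DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS,
            "fuse_cmp_and_branch_soflags",
            m_COREI7 | m_COREI7_AVX | m_HASWELL)
  DEF_TUNE (X86_TUNE_FUSE_ALU_AND_BRANCH, "fuse_alu_and_branch",
            m_COREI7_AVX | m_HASWELL)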

bootstrap and regression ok for the two patches.

Thanks,
Wei Mi.


Patch1:

2013-09-16  Wei Mi  

* gcc/config/i386/i386-c.c (ix86_target_macros_internal): Separate
PROCESSOR_COREI7_AVX out from PROCESSOR_COREI7.
* gcc/config/i386/i386.c (ix86_option_override_internal): Ditto.
(ix86_issue_rate): Ditto.
(ia32_multipass_dfa_lookahead): Ditto.
(ix86_sched_init_global): Ditto.
(get_builtin_code_for_version): Ditto.
* gcc/config/i386/i386.h (enum target_cpu_default): Ditto.
(enum processor_type): Ditto.
* gcc/config/i386/x86-tune.def (DEF_TUNE): Ditto.

diff --git a/gcc/config/i386/i386-c.c b/gcc/config/i386/i386-c.c
index 14349be..7ea68cc 100644
--- a/gcc/config/i386/i386-c.c
+++ b/gcc/config/i386/i386-c.c
@@ -141,6 +141,10 @@ ix86_target_macros_internal (HOST_WIDE_INT isa_flag,
   def_or_undef (parse_in, "__corei7");
   def_or_undef (parse_in, "__corei7__");
   break;
+case PROCESSOR_COREI7_AVX:
+  def_or_undef (parse_in, "__corei7_avx");
+  def_or_undef (parse_in, "__corei7_avx__");
+  break;
 case PROCESSOR_HASWELL:
   def_or_undef (parse_in, "__core_avx2");
   def_or_undef (parse_in, "__core_avx2__");
@@ -239,6 +243,9 @@ ix86_target_macros_internal (HOST_WIDE_INT isa_flag,
 case PROCESSOR_COREI7:
   def_or_undef (parse_in, "__tune_corei7__");
   break;
+case PROCESSOR_COREI7_AVX:
+  def_or_undef (parse_in, "__tune_corei7_avx__");
+  break;
 case PROCESSOR_HASWELL:
   def_or_undef (parse_in, "__tune_core_avx2__");
   break;
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 536c357..1fd3f60 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -1908,8 +1908,9 @@ const struct processor_costs *ix86_cost = &pentium_cost;
 #define m_P4_NOCONA (m_PENT4 | m_NOCONA)
 #define m_CORE2 (1<

Patch2:
* gcc/config/i386/i386.c (ix86_macro_fusion_p): New Function.
(ix86_macro_fusion_pair_p): Ditto.
* gcc/config/i386/i386.h: Add new tune features about macro-fusion.
* gcc/config/i386/x86-tune.def (DEF_TUNE): Ditto.
* gcc/doc/tm.texi: Generated.
* gcc/doc/tm.texi.in: Ditto.
* gcc/haifa-sched.c (try_group_insn): New function.
(group_insns_for_macro_fusion): Ditto.
(sched_init): Call group_insns_for_macro_fusion.
* gcc/sched-rgn.c (add_branch_dependences): Keep insns in
a SCHED_GROUP at the end of BB to remain their location.
* gcc/target.def: Add two hooks: macro_fusion_p and
macro_fusion_pair_p.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 1fd3f60..85b7aa0 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -24856,6 +24856,90 @@ ia32_multipass_dfa_lookahead (void)
 }
 }

+/* Return true if target platform supports macro-fusion.  */
+
+static bool
+ix86_macro_fusion_p ()
+{
+  if (TARGET_FUSE_CMP_AND_BRANCH
+  && (!TARGET_64BIT || TARGET_FUSE_CMP_AND_BRANCH_64))
+return true;
+  else
+return false;
+}
+
+/* Check whether current microarchitecture support macro fusion
+   for insn pair "CONDGEN + CONDJMP". Refer to
+   "Intel Architectures Optimization Reference Manual". */
+
+static bool
+ix86_macro_fusion_pair_p (rtx condgen, rtx condjmp)
+{
+  rtx src;
+  rtx single_set = single_set (condgen);
+  enum rtx_code ccode;
+  rtx compare_set = NULL_RTX, test_if, cond;
+
+  if (single_set == NULL_RTX
+  && !TARGET_FUSE_ALU_AND_BRANCH)
+return false;
+
+  if (single_set != NULL_RTX)
+compare_set = single_set;
+  else
+{
+  int i;
+  rtx pat = PATTERN (condgen);
+  for (i = 0; i < XVECLEN (pat, 0); i++)
+   if (GET_CODE (XVECEXP (pat, 0, i)) == SET
+   && GET_CODE (SET_SRC (X

Re: [PATCH] disable use_vector_fp_converts for m_CORE_ALL

2013-09-20 Thread Wei Mi
Ping.

> -Original Message-
> From: Wei Mi [mailto:w...@google.com]
> Sent: Thursday, September 12, 2013 2:51 AM
> To: GCC Patches
> Cc: David Li; Zamyatin, Igor
> Subject: [PATCH] disable use_vector_fp_converts for m_CORE_ALL
>
> For the following testcase 1.c, on westmere and sandybridge, performance with 
> the option -mtune=^use_vector_fp_converts is better (improves from 3.46s to 
> 2.83s). It means cvtss2sd is often better than
> unpcklps+cvtps2pd on recent x86 platforms.
>
> 1.c:
> float total = 0.2;
> int k = 5;
>
> int main() {
>  int i;
>
>  for (i = 0; i < 10; i++) {
>total += (0.5 + k);
>  }
>
>  return total == 0.3;
> }
>
> assembly generated by gcc-r201963 without -mtune=^use_vector_fp_converts
> .L2:
>         unpcklps %xmm0, %xmm0
>         subl    $1, %eax
>         cvtps2pd %xmm0, %xmm0
>         addsd   %xmm1, %xmm0
>         unpcklpd %xmm0, %xmm0
>         cvtpd2ps %xmm0, %xmm0
>         jne     .L2
>
> assembly generated by gcc-r201963 with -mtune=^use_vector_fp_converts
> .L2:
>         cvtss2sd %xmm0, %xmm0
>         subl    $1, %eax
>         addsd   %xmm1, %xmm0
>         cvtsd2ss %xmm0, %xmm0
>         jne     .L2
>
> But for testcase 2.c (Thanks to Igor Zamyatin for the testcase), performance 
> with the option -mtune=^use_vector_fp_converts is worse.
> Analysis to the assembly shows the performance degradation comes from partial 
> reg stall caused by cvtsd2ss. Adding pxor %xmm0, %xmm0 before cvtsd2ss 
> b(,%rdx,8), %xmm0 gets the performance back.
>
> 2.c:
> double b[1024];
>
> float a[1024];
>
> int main()
> {
> int i;
> for(i = 0 ; i < 1024 * 1024 * 256; i++)
>   a[i & 1023] = a[i & 1023] * (float)b[i & 1023];
> return (int)a[512];
> }
>
> without -mtune-ctrl=^use_vector_fp_converts
> .L2:
>         movl    %eax, %edx
>         addl    $1, %eax
>         andl    $1023, %edx
>         cmpl    $268435456, %eax
>         movsd   b(,%rdx,8), %xmm0
>         cvtpd2ps %xmm0, %xmm0    ==> no partial reg stall because of movsd
>         mulss   a(,%rdx,4), %xmm0
>         movss   %xmm0, a(,%rdx,4)
>         jne     .L2
>
> with -mtune-ctrl=^use_vector_fp_converts
> .L2:
>         movl    %eax, %edx
>         addl    $1, %eax
>         andl    $1023, %edx
>         cmpl    $268435456, %eax
>         cvtsd2ss b(,%rdx,8), %xmm0   ==> with partial reg stall; needs
>                                          "pxor %xmm0, %xmm0" inserted first
>         mulss   a(,%rdx,4), %xmm0
>         movss   %xmm0, a(,%rdx,4)
>         jne     .L2
>
> So the patch is to turn off use_vector_fp_converts for m_CORE_ALL to use 
> cvtss2sd/cvtsd2ss directly,  and add "pxor %xmmreg %xmmreg" before 
> cvtss2sd/cvtsd2ss to break partial reg stall (similar as what r201308 does 
> for cvtsi2ss/cvtsi2sd). bootstrap and regression pass. ok for trunk?
>
> Thanks,
> Wei Mi.
>
> 2013-09-11  Wei Mi  
>
> * config/i386/x86-tune.def (DEF_TUNE): Remove
> m_CORE_ALL.
> * config/i386/i386.md: Add define_peephole2 to
> break partial reg stall for cvtss2sd/cvtsd2ss.
>
> Index: config/i386/x86-tune.def
> ===
> --- config/i386/x86-tune.def(revision 201963)
> +++ config/i386/x86-tune.def(working copy)
> @@ -189,7 +189,7 @@ DEF_TUNE (X86_TUNE_NOT_VECTORMODE, "not_
>  /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
> from FP to FP. */
>  DEF_TUNE (X86_TUNE_USE_VECTOR_FP_CONVERTS, "use_vector_fp_converts",
> -  m_CORE_ALL | m_AMDFAM10 | m_GENERIC)
> +  m_AMDFAM10 | m_GENERIC)
>  /* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
> from integer to FP. */
>  DEF_TUNE (X86_TUNE_USE_VECTOR_CONVERTS, "use_vector_converts", m_AMDFAM10)
> Index: config/i386/i386.md
> ===
> --- config/i386/i386.md (revision 201963)
> +++ config/i386/i386.md (working copy)
> @@ -5075,6 +5075,63 @@
>emit_move_insn (operands[0], CONST0_RTX (mode));
>  })
>
> +;; Break partial reg stall for cvtsd2ss.
> +
> +(define_peephole2
> +  [(set (match_operand:SF 0 "register_operand")
> +(float_truncate:SF
> + (match_operand:DF 1 "nonimmediate_operand")))]
> +  "TARGET_SSE2 && TARGET_SSE_MATH
> +   && TARGET_SSE_PARTIAL_REG_DEPENDENCY
> +   && optimize_function_for_speed_p (cfun)
> +   && reload_completed && SSE_REG_

Re: Revisit Core tunning flags

2013-09-22 Thread Wei Mi
>> > http://gcc.gnu.org/ml/gcc-patches/2013-09/msg00884.html
>
> This patch seems reasonable. (in fact I have pretty much the same in my tree)
> use_vector_fp_converts is actually trying to solve the same problem in AMD
> hardware - you need to type the whole register when converting.
> So it may work well for AMD chips too, or maybe the difference is that
> Intel chips somehow handle "cvtpd2ps %xmm0, %xmm0" well even though
> the upper half of xmm0 is ill defined, while AMD chips don't.
>
> The patch seems OK. I do not see a reason for the
>   && peep2_reg_dead_p (0, operands[0])
> test.  Reg has to be dead since it is the full destination of the operation.

Ok, I see. I will delete it.

>
> Lets wait few days before commit so we know effect of
> individual changes.  I will test it on AMD hardware and we can decide on
> generic tuning then.
>
> Honza

Ok, thanks.

Wei Mi.


Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion

2013-09-22 Thread Wei Mi
> You disable fusion for Bulldozer here since you did not add it into
> TARGET_FUSE_CMP_AND_BRANCH_64.

Ok, will add it.

>
> Perhaps we can have TARGET_FUSE_CMP_AND_BRANCH_64 and 
> TARGET_FUSE_CMP_AND_BRANCH_32
> plus an macro TARGET_FUSE_CMP_AND_BRANCH that chose corresponding variant 
> based
> on TARGET_64BIT rather than having to wind down the test in every use.

Ok, will fix it.

>> +/* X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS: Fuse compare with a
>> +   subsequent conditional jump instruction when the condition jump
>> +   check sign flag (SF) or overflow flag (OF).  */
>> +DEF_TUNE (X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS, 
>> "fuse_cmp_and_branch_soflags",
>> +  m_COREI7 | m_COREI7_AVX | m_HASWELL)
>
> This flag is affecting only fusing of ALU and BRANCH, or should it also affect
> X86_TUNE_FUSE_CMP_AND_BRANCH?  In the current implementation it seems to be
> the first, and in that case it ought to be documented that way and probably
> called ALU_AND_BRANCH_SOFLAGS to avoid confusion.
>

X86_TUNE_FUSE_CMP_AND_BRANCH_SOFLAGS does not affect fusing of ALU and
BRANCH. It is added because m_CORE2 doesn't support fusing cmp with
JL/JG/JLE/JGE.

> This is what Agner Fog says:
>
> A CMP or TEST instruction immediately followed by a conditional jump can be
> fused into a single macro-op. This applies to all versions of the CMP and TEST
> instructions and all conditional jumps except if the CMP or TEST has a
> rip-relative address or both a displacement and an immediate operand.
>
> So it is a bit more weird.  Perhaps you can extend your predicate to look
> for IP relative addresses & displacements of CMP and TEST, too.
>
> Honza

Thanks for checking it. Agner's guide also mentions this constraint
for sandybridge and ivybridge; I missed it because the Intel
optimization reference manual doesn't mention it. I did some
experiments just now and verified that the constraint exists on
sandybridge. Will add the predicate; an example of the non-fusable
pattern is below.
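
For illustration (not from the patch), a compare like

        cmpl    $1, a(%rip)
        jne     .L3

uses a rip-relative address together with an immediate operand, so it
cannot be macro-fused on these chips.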

Thanks,
Wei Mi.


Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion

2013-09-24 Thread Wei Mi
This is the updated patch2.
Changed:
1. For cmp/test with a rip-relative addressing mem operand, don't group
insns. Bulldozer also doesn't support fusion for cmp/test with both
displacement MEM and immediate operand, while m_CORE_ALL doesn't
support fusion for cmp/test with MEM and immediate operand. I simply
chose to use the more stringent constraint here (m_CORE_ALL's
constraint).
2. Add Bulldozer back and merge TARGET_FUSE_CMP_AND_BRANCH_64 and
TARGET_FUSE_CMP_AND_BRANCH_32.

Bootstrap and regression pass. Ok for trunk?

2013-09-24  Wei Mi  

* gcc/config/i386/i386.c (rip_relative_addr_p): New Function.
(ix86_macro_fusion_p): Ditto.
(ix86_macro_fusion_pair_p): Ditto.
* gcc/config/i386/i386.h: Add new tune features about macro-fusion.
* gcc/config/i386/x86-tune.def (DEF_TUNE): Ditto.
* gcc/doc/tm.texi: Generated.
* gcc/doc/tm.texi.in: Ditto.
* gcc/haifa-sched.c (try_group_insn): New Function.
(group_insns_for_macro_fusion): Ditto.
(sched_init): Call group_insns_for_macro_fusion.
* gcc/sched-rgn.c (add_branch_dependences): Keep insns in
a SCHED_GROUP at the end of BB to remain their location.
* gcc/target.def: Add two hooks: macro_fusion_p and
macro_fusion_pair_p.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 1fd3f60..4a04778 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -24856,6 +24856,167 @@ ia32_multipass_dfa_lookahead (void)
 }
 }

+/* Extracted from ix86_print_operand_address. Check whether ADDR is a
+   rip-relative address.  */
+
+static bool
+rip_relative_addr_p (rtx addr)
+{
+  struct ix86_address parts;
+  rtx base, index, disp;
+  int ok;
+
+  if (GET_CODE (addr) == UNSPEC && XINT (addr, 1) == UNSPEC_VSIBADDR)
+{
+  ok = ix86_decompose_address (XVECEXP (addr, 0, 0), &parts);
+  parts.index = XVECEXP (addr, 0, 1);
+}
+  else if (GET_CODE (addr) == UNSPEC && XINT (addr, 1) == UNSPEC_LEA_ADDR)
+ok = ix86_decompose_address (XVECEXP (addr, 0, 0), &parts);
+  else
+ok = ix86_decompose_address (addr, &parts);
+
+  gcc_assert (ok);
+  base = parts.base;
+  index = parts.index;
+  disp = parts.disp;
+
+  if (TARGET_64BIT && !base && !index)
+{
+  rtx symbol = disp;
+
+  if (GET_CODE (disp) == CONST
+ && GET_CODE (XEXP (disp, 0)) == PLUS
+ && CONST_INT_P (XEXP (XEXP (disp, 0), 1)))
+   symbol = XEXP (XEXP (disp, 0), 0);
+
+  if (GET_CODE (symbol) == LABEL_REF
+ || (GET_CODE (symbol) == SYMBOL_REF
+ && SYMBOL_REF_TLS_MODEL (symbol) == 0))
+   return true;
+}
+  if (flag_pic && !base && !index)
+{
+  if (GET_CODE (disp) == CONST
+ && GET_CODE (XEXP (disp, 0)) == UNSPEC
+ && (XINT (XEXP (disp, 0), 1) == UNSPEC_PCREL
+ || XINT (XEXP (disp, 0), 1) == UNSPEC_GOTPCREL
+ || (TARGET_64BIT
+ && XINT (XEXP (disp, 0), 1) == UNSPEC_GOTNTPOFF)))
+   return true;
+}
+  return false;
+}
+
+/* Return true if target platform supports macro-fusion.  */
+
+static bool
+ix86_macro_fusion_p ()
+{
+  if (TARGET_FUSE_CMP_AND_BRANCH)
+return true;
+  else
+return false;
+}
+
+/* Check whether current microarchitecture support macro fusion
+   for insn pair "CONDGEN + CONDJMP". Refer to
+   "Intel Architectures Optimization Reference Manual". */
+
+static bool
+ix86_macro_fusion_pair_p (rtx condgen, rtx condjmp)
+{
+  rtx src, dest;
+  rtx single_set = single_set (condgen);
+  enum rtx_code ccode;
+  rtx compare_set = NULL_RTX, test_if, cond;
+  rtx alu_set = NULL_RTX, addr = NULL_RTX;
+
+  if (get_attr_type (condgen) != TYPE_TEST
+  && get_attr_type (condgen) != TYPE_ICMP
+  && get_attr_type (condgen) != TYPE_INCDEC
+  && get_attr_type (condgen) != TYPE_ALU)
+return false;
+
+  if (single_set == NULL_RTX
+  && !TARGET_FUSE_ALU_AND_BRANCH)
+return false;
+
+  if (single_set != NULL_RTX)
+compare_set = single_set;
+  else
+{
+  int i;
+  rtx pat = PATTERN (condgen);
+  for (i = 0; i < XVECLEN (pat, 0); i++)
+   if (GET_CODE (XVECEXP (pat, 0, i)) == SET)
+ {
+   rtx set_src = SET_SRC (XVECEXP (pat, 0, i));
+   if (GET_CODE (set_src) == COMPARE)
+ compare_set = XVECEXP (pat, 0, i);
+   else
+ alu_set = XVECEXP (pat, 0, i);
+ }
+}
+  if (compare_set == NULL_RTX)
+return false;
+  src = SET_SRC (compare_set);
+  if (GET_CODE (src) != COMPARE)
+return false;
+
+  /* Macro-fusion for cmp/test MEM-IMM + conditional jmp is not
+ supported.  */
+  if ((MEM_P (XEXP (src, 0))
+   && CONST_INT_P (XEXP (src, 1)))
+  || (MEM_P (XEXP (src, 1))
+ && CONST_INT_P (XEXP (src, 0))))
+return false;

Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion

2013-09-24 Thread Wei Mi
>> It doesn't look right.  IP relative address is only possible
>> with TARGET_64BIT and
>>
>> 1. base == pc. Or
>> 2. UNSPEC_PCREL, UNSPEC_GOTPCREL, and
>> UNSPEC_GOTNTPOFF.
>
> Target 64bit should be tested above.  We however output RIP addresses
> also for basic symbol references.  I.e. when base is a symbol address.
> such as in:
> int a;
> int t()
> {
>   return a;
> }
>
> memory_address_length already contains logic to figure out if there is IP
> relative addressing going on (I am not sure it is completely accurate either).
> Better to break it out to a common predicate and perhaps unify with what
> ix86_print_operand_address is doing.
>
> Honza
>>
>>
>> --
>> H.J.

Thanks. How about this one? Bootstrap and regression are going on.

2013-09-24  Wei Mi  

* gcc/config/i386/i386.c (memory_address_length): Extract a part
of code to rip_relative_addr_p.
(rip_relative_addr_p): New function.
(ix86_macro_fusion_p): Ditto.
(ix86_macro_fusion_pair_p): Ditto.
* gcc/config/i386/i386.h: Add new tune features about macro-fusion.
* gcc/config/i386/x86-tune.def (DEF_TUNE): Ditto.
* gcc/doc/tm.texi: Generated.
* gcc/doc/tm.texi.in: Ditto.
* gcc/haifa-sched.c (try_group_insn): New function.
(group_insns_for_macro_fusion): Ditto.
(sched_init): Call group_insns_for_macro_fusion.
* gcc/sched-rgn.c (add_branch_dependences): Keep insns in
a SCHED_GROUP at the end of BB to retain their location.
* gcc/target.def: Add two hooks: macro_fusion_p and
macro_fusion_pair_p.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 1fd3f60..808e0c6 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -24275,25 +24275,8 @@ memory_address_length (rtx addr, bool lea)
   else if (disp && !base && !index)
 {
   len += 4;
-  if (TARGET_64BIT)
-   {
- rtx symbol = disp;
-
- if (GET_CODE (disp) == CONST)
-   symbol = XEXP (disp, 0);
- if (GET_CODE (symbol) == PLUS
- && CONST_INT_P (XEXP (symbol, 1)))
-   symbol = XEXP (symbol, 0);
-
- if (GET_CODE (symbol) != LABEL_REF
- && (GET_CODE (symbol) != SYMBOL_REF
- || SYMBOL_REF_TLS_MODEL (symbol) != 0)
- && (GET_CODE (symbol) != UNSPEC
- || (XINT (symbol, 1) != UNSPEC_GOTPCREL
- && XINT (symbol, 1) != UNSPEC_PCREL
- && XINT (symbol, 1) != UNSPEC_GOTNTPOFF)))
-   len++;
-   }
+  if (rip_relative_addr_p (&parts))
+   len++;
 }
   else
 {
@@ -24856,6 +24839,159 @@ ia32_multipass_dfa_lookahead (void)
 }
 }

+/* Check whether x86 address PARTS is a pc-relative address.  */
+
+static bool
+rip_relative_addr_p (struct ix86_address *parts)
+{
+  rtx base, index, disp;
+
+  base = parts->base;
+  index = parts->index;
+  disp = parts->disp;
+
+  if (disp && !base && !index)
+{
+  if (TARGET_64BIT)
+   {
+ rtx symbol = disp;
+
+ if (GET_CODE (disp) == CONST)
+   symbol = XEXP (disp, 0);
+ if (GET_CODE (symbol) == PLUS
+ && CONST_INT_P (XEXP (symbol, 1)))
+   symbol = XEXP (symbol, 0);
+
+ if (GET_CODE (symbol) == LABEL_REF
+ || (GET_CODE (symbol) == SYMBOL_REF
+ && SYMBOL_REF_TLS_MODEL (symbol) == 0)
+ || (GET_CODE (symbol) == UNSPEC
+ && (XINT (symbol, 1) == UNSPEC_GOTPCREL
+ || XINT (symbol, 1) == UNSPEC_PCREL
+ || XINT (symbol, 1) == UNSPEC_GOTNTPOFF)))
+   return true;
+   }
+}
+  return false;
+}
+
+/* Return true if target platform supports macro-fusion.  */
+
+static bool
+ix86_macro_fusion_p ()
+{
+  if (TARGET_FUSE_CMP_AND_BRANCH)
+return true;
+  else
+return false;
+}
+
+/* Check whether the current microarchitecture supports macro fusion
+   for insn pair "CONDGEN + CONDJMP". Refer to
+   "Intel Architectures Optimization Reference Manual". */
+
+static bool
+ix86_macro_fusion_pair_p (rtx condgen, rtx condjmp)
+{
+  rtx src, dest;
+  rtx single_set = single_set (condgen);
+  enum rtx_code ccode;
+  rtx compare_set = NULL_RTX, test_if, cond;
+  rtx alu_set = NULL_RTX, addr = NULL_RTX;
+
+  if (get_attr_type (condgen) != TYPE_TEST
+  && get_attr_type (condgen) != TYPE_ICMP
+  && get_attr_type (condgen) != TYPE_INCDEC
+  && get_attr_type (condgen) != TYPE_ALU)
+return false;
+
+  if (single_set == NULL_RTX
+  && !TARGET_FUSE_ALU_AND_BRANCH)
+return false;
+
+  if (single_set != NULL_RTX)
+compare_set = single_set;
+ 

[PATCH, IRA] Fix ALLOCNO_MODE in the case of paradoxical subreg.

2013-09-24 Thread Wei Mi
Hi,

This patch is to address the problem described here:
http://gcc.gnu.org/ml/gcc/2013-09/msg00187.html

The patch changes ALLOCNO_MODE of a pseudo reg to be outermode if the
pseudo reg is used in a paradoxical subreg, so IRA will not mistakenly
assign an operand with a bigger mode to a smaller hardreg which
couldn't find a pair register.

No test is added because I cannot create a small testcase to reproduce
the problem on trunk, the difficulty of which was described in the
above post.
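
To show the shape of the problem (schematic only; a concrete instance
appears later in this thread):

 (insn ... (set (reg:TI t)
(subreg:TI (reg:DI s) 0)) ...)

The TImode reference needs two consecutive hard regs. If ALLOCNO_MODE
of pseudo s stays DImode, IRA may assign s a hard reg such as R15, for
which no pair register exists, and the paradoxical subreg use cannot
be satisfied.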

Bootstrap and regression pass. Ok for trunk?

Thanks,
Wei Mi.

2013-09-24  Wei Mi  

* ira-build.c (create_insn_allocnos): Fix ALLOCNO_MODE
in the case of paradoxical subreg.

Index: ira-build.c
===
--- ira-build.c (revision 201963)
+++ ira-build.c (working copy)
@@ -1688,6 +1688,30 @@ create_insn_allocnos (rtx x, bool output
}
   return;
 }
+  else if (code == SUBREG)
+{
+  int regno;
+  rtx subreg_reg = SUBREG_REG (x);
+  enum machine_mode outermode, innermode;
+
+  create_insn_allocnos (subreg_reg, output_p);
+  /* For paradoxical subreg, set allocno's mode to be
+the outermode.  */
+  outermode = GET_MODE (x);
+  innermode = GET_MODE (subreg_reg);
+  if (REG_P (subreg_reg)
+ && (GET_MODE_SIZE (outermode) > GET_MODE_SIZE (innermode)))
+   {
+ regno = REGNO (subreg_reg);
+ if (regno >= FIRST_PSEUDO_REGISTER)
+   {
+ ira_allocno_t a = ira_curr_regno_allocno_map[regno];
+ ira_assert (a != NULL);
+ ALLOCNO_MODE (a) = outermode;
+   }
+   }
+  return;
+}
   else if (code == SET)
 {
   create_insn_allocnos (SET_DEST (x), true);


Re: [PATCH, IRA] Fix ALLOCNO_MODE in the case of paradoxical subreg.

2013-09-25 Thread Wei Mi
> performance. For example, we have code
>
> ... (reg:DI) ...
> ...
> ... (subreg:TI (reg:DI))
> ...
> ...(reg:DI)
>
> We need two hard regs only for the second place by transforming
>
> p = (reg:DI)
>
> ...(subreg:TI p)
>
> With this patch we require two hard regs for the whole live range of the
> original pseudo (which can be quite long).  It might considerably worsen
> code performance.
>
> So the problem could be fixed in LRA which can make this transformation
> or even in IRA (that would be even better as we put pseudo P into the
> global picture vs. probably spilling some pseudos for P in LRA).
>

Thanks for your detailed explanation. Now I understand what you concern here.

> I need some time to think what is better (e.g. I don't know how to
> implement it in IRA without big compilation slow down).  In any case,
> the solution for the problem will be not that easy as in the patch.

To fix it in IRA, it looks like we want a live range splitting pass
for pseudos used in paradoxical subregs here. Is the potential
compilation slow down you mention here caused by more allocnos
introduced by the live range splitting, or something else?

Thanks,
Wei Mi.


Re: [PATCH, IRA] Fix ALLOCNO_MODE in the case of paradoxical subreg.

2013-09-25 Thread Wei Mi
>  To define for what occurrence of the pseudo we should do the
> transformation, we need to create allocnos and calculate reg classes to
> know what paradoxical subreg needs more hard regs (the transformations
> can not be done for all paradoxical subregs as my experience shows many
> RTL changes result in worse RA even if we have heuristics to remove the
> generated changes as in this case would be trying to assign the same
> hard reg for the original and the new pseudo).
>   After doing the transformations, we need to recalculate reg classes
> and rebuild allocnos (both are expensive).  To speed up the process it
> could be implemented as some kind of update of already existing data but
> it will complicate code much.
>

I see, thanks!

> So right now I think implementing this in LRA would be easier.  Still LRA
> has a pretty good register (re-)allocation (although it is worse than in
> IRA).
>

When you get an idea of how to fix it in LRA, if you are still busy, I
would be happy to do the implementation if you could outline your idea.

Thanks,
Wei Mi.


Re: [PATCH, IRA] Fix ALLOCNO_MODE in the case of paradoxical subreg.

2013-09-30 Thread Wei Mi
> Probably the best place to add a code for this is in
> lra-constraints.c::simplify_operand_subreg by permitting subreg reload
> for paradoxical subregs whose hard regs are not fully in allocno class
> of the inner pseudo.
>
> It needs a good testing (i'd check that the generated code is not
> changed on variety benchmarks to see that the change has no impact on
> the most programs performance) and you need to add a good comment
> describing why this change is needed.
>

Vlad, thanks! I made another patch here by following your guidance.
Please check whether it is ok. Bootstrap and regression ok. I am also
verifying its performance effect on google applications (but most of
them are 64-bit, so I cannot verify its performance effect on 32-bit
apps).

The idea of the patch is here:

For the following two types of paradoxical subreg, we insert reload in
simplify_operand_subreg:
1. If the op_type is OP_IN and the hardreg cannot be paired with
another hardreg to contain the outermode operand, for example R15 on
x86-64 (checked by in_hard_reg_set_p), we need to insert a reload. If
the hardreg allocated in IRA is R12, we don't need to insert a reload
here, because the upper half of an rvalue paradoxical subreg is
undefined, so it is ok for R13 to contain undefined data.

2. If the op_type is OP_OUT or OP_INOUT.
(It is possible that we don't need to insert a reload for this case
either, because the upper half of an lvalue paradoxical subreg is
useless. If the assignment to the upper half of the subreg register is
not generated by the rtl split4 stage, we don't need to insert a reload
here. But I haven't got a testcase to verify that, so I keep the
reload.)

Here is a paradoxical subreg example showing how the reload is generated:

 (insn 5 4 7 2 (set (reg:TI 106 [ __comp ])
(subreg:TI (reg:DI 107 [ __comp ]) 0)) {*movti_internal_rex64}

In IRA, reg107 is allocated to a DImode hardreg. If reg107 is assigned
to hardreg R15, compiler cannot find another hardreg to pair with R15
to contain TImode data. So we insert a TImode reload pseudo reg180 for
it.

After reload is inserted:

 (insn 283 0 0 (set (subreg:DI (reg:TI 180 [orig:107 __comp ] [107]) 0)
(reg:DI 107 [ __comp ])) -1
 (insn 5 4 7 2 (set (reg:TI 106 [ __comp ])
(subreg:TI (reg:TI 180 [orig:107 __comp ] [107]) 0))
{*movti_internal_rex64}

Two reload hard registers will be allocated to reg180 to save TImode
operand in LRA_assign.

Thanks,
Wei Mi.

2013-09-30  Wei Mi  

* lra-constraints.c (insert_move_for_subreg): New function.
(simplify_operand_subreg): Add reload for paradoxical subreg.


Index: lra-constraints.c
===
--- lra-constraints.c   (revision 201963)
+++ lra-constraints.c   (working copy)
@@ -1158,6 +1158,30 @@ process_addr_reg (rtx *loc, rtx *before,
   return true;
 }

+/* Insert move insn in simplify_operand_subreg. BEFORE returns
+   the insn to be inserted before curr insn. AFTER returns the
+   insn to be inserted after curr insn.  ORIGREG and NEWREG
+   are the original reg and new reg for reload.  */
+static void
+insert_move_for_subreg (rtx *before, rtx *after, rtx origreg, rtx newreg)
+{
+  if (before)
+{
+  push_to_sequence (*before);
+  lra_emit_move (newreg, origreg);
+  *before = get_insns ();
+  end_sequence ();
+}
+  if (after)
+{
+  start_sequence ();
+  lra_emit_move (origreg, newreg);
+  emit_insn (*after);
+  *after = get_insns ();
+  end_sequence ();
+}
+}
+
 /* Make reloads for subreg in operand NOP with internal subreg mode
REG_MODE, add new reloads for further processing.  Return true if
any reload was generated.  */
@@ -1169,6 +1193,8 @@ simplify_operand_subreg (int nop, enum m
   enum machine_mode mode;
   rtx reg, new_reg;
   rtx operand = *curr_id->operand_loc[nop];
+  enum reg_class regclass;
+  enum op_type type;

   before = after = NULL_RTX;

@@ -1177,6 +1203,7 @@ simplify_operand_subreg (int nop, enum m

   mode = GET_MODE (operand);
   reg = SUBREG_REG (operand);
+  type = curr_static_id->operand[nop].type;
   /* If we change address for paradoxical subreg of memory, the
  address might violate the necessary alignment or the access might
  be slow.  So take this into consideration.  We should not worry
@@ -1215,13 +1242,9 @@ simplify_operand_subreg (int nop, enum m
&& (hard_regno_nregs[hard_regno][GET_MODE (reg)]
   >= hard_regno_nregs[hard_regno][mode])
&& simplify_subreg_regno (hard_regno, GET_MODE (reg),
-SUBREG_BYTE (operand), mode) < 0
-   /* Don't reload subreg for matching reload.  It is actually
- valid subreg in LRA.  */
-   && ! LRA_SUBREG_P (operand))
+SUBREG_BYTE (operand), mode) < 0)
   || CONSTANT_P (reg) || GET_CODE (reg) == PLUS || MEM_P (reg

Re: [PATCH, IRA] Fix ALLOCNO_MODE in the case of paradoxical subreg.

2013-10-01 Thread Wei Mi
> Please check whether it is ok. Boostrap and regression ok. I am also
> verifying its performance effect on google applications (But most of
> them are 64 bits, so I cannot verify its performance effect on 32 bits
> apps).

I have verified that it has no performance impact on google applications.

Thanks,
Wei Mi.


Re: [PATCH] disable use_vector_fp_converts for m_CORE_ALL

2013-10-01 Thread Wei Mi
> Hi Wei Mi,
>
> Have you checked in your patch?
>
> --
> H.J.

No, I haven't. Honza wants me to wait for his testing on AMD hardware.
http://gcc.gnu.org/ml/gcc-patches/2013-09/msg01603.html


Re: [PATCH] disable use_vector_fp_converts for m_CORE_ALL

2013-10-01 Thread Wei Mi
On Tue, Oct 1, 2013 at 3:50 PM, Jan Hubicka  wrote:
>> > Hi Wei Mi,
>> >
>> > Have you checked in your patch?
>> >
>> > --
>> > H.J.
>>
>> No, I haven't. Honza wants me to wait for his testing on AMD hardware.
>> http://gcc.gnu.org/ml/gcc-patches/2013-09/msg01603.html
> I only wanted to separate it from the changes in generic so the regular 
> testers
> can pick it up separately.  So just go ahead and check it in.
>
> Honza

Thanks, checked in as r203095.

Wei Mi.


Re: [PATCH, IRA] Fix ALLOCNO_MODE in the case of paradoxical subreg.

2013-10-03 Thread Wei Mi
> You removed the condition with LRA_SUBREG for the non-paradoxical subreg
> generated for matched operands.  I think that is an important condition and
> the comment says why.  There are some 32-bit insns constraints requiring
> different modes (int and fp ones) for matching operands in FP regs.  The
> condition prevents LRA cycling as such subreg can look invalid (in
> simplify_subreg_regno) but it is used internally in LRA for matching
> constraint expressions and should not be reloaded.
>
> With this change the patch is ok for me.
>

Thank you very much for the review! Patch fixed according to your
comments and committed as r203169.

Regards,
Wei Mi.


Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion

2013-10-03 Thread Wei Mi
On Tue, Sep 24, 2013 at 4:32 PM, Wei Mi  wrote:
>>> It doesn't look right.  IP relative address is only possible
>>> with TARGET_64BIT and
>>>
>>> 1. base == pc. Or
>>> 2. UNSPEC_PCREL, UNSPEC_GOTPCREL, and
>>> UNSPEC_GOTNTPOFF.
>>
>> Target 64bit should be tested above.  We however output RIP addresses
>> also for basic symbol references.  I.e. when base is a symbol address.
>> such as in:
>> int a;
>> int t()
>> {
>>   return a;
>> }
>>
>> memory_address_length already contains logic to figure out if there is IP
>> relative addressing going on (I am not sure it is completely accurate 
>> either).
>> Better to break it out to a common predicate and perhaps unify with what
>> ix86_print_operand_address is doing.
>>
>> Honza
>>>
>>>
>>> --
>>> H.J.
>
> Thanks. How about this one? Bootstrap and regression are going on.
>

Ccing scheduler maintainers.

Ping. Repasting the patch with some minor errors fixed. Bootstrap and
regression ok. Ok for trunk?

Thanks,
Wei Mi.

2013-10-03  Wei Mi  

* gcc/config/i386/i386.c (memory_address_length): Extract a part
of code to rip_relative_addr_p.
(rip_relative_addr_p): New function.
(ix86_macro_fusion_p): Ditto.
(ix86_macro_fusion_pair_p): Ditto.
* gcc/config/i386/i386.h: Add new tune features about macro-fusion.
* gcc/config/i386/x86-tune.def (DEF_TUNE): Ditto.
* gcc/doc/tm.texi: Generated.
* gcc/doc/tm.texi.in: Ditto.
* gcc/haifa-sched.c (try_group_insn): New function.
(group_insns_for_macro_fusion): Ditto.
(sched_init): Call group_insns_for_macro_fusion.
* gcc/sched-rgn.c (add_branch_dependences): Keep insns in
a SCHED_GROUP at the end of BB to retain their location.
* gcc/target.def: Add two hooks: macro_fusion_p and
macro_fusion_pair_p.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 1fd3f60..59b0bcf 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -24204,6 +24204,42 @@ ix86_instantiate_decls (void)
   instantiate_decl_rtl (s->rtl);
 }

+/* Check whether x86 address PARTS is a pc-relative address.  */
+
+static bool
+rip_relative_addr_p (struct ix86_address *parts)
+{
+  rtx base, index, disp;
+
+  base = parts->base;
+  index = parts->index;
+  disp = parts->disp;
+
+  if (disp && !base && !index)
+{
+  if (TARGET_64BIT)
+   {
+ rtx symbol = disp;
+
+ if (GET_CODE (disp) == CONST)
+   symbol = XEXP (disp, 0);
+ if (GET_CODE (symbol) == PLUS
+ && CONST_INT_P (XEXP (symbol, 1)))
+   symbol = XEXP (symbol, 0);
+
+ if (GET_CODE (symbol) == LABEL_REF
+ || (GET_CODE (symbol) == SYMBOL_REF
+ && SYMBOL_REF_TLS_MODEL (symbol) == 0)
+ || (GET_CODE (symbol) == UNSPEC
+ && (XINT (symbol, 1) == UNSPEC_GOTPCREL
+ || XINT (symbol, 1) == UNSPEC_PCREL
+ || XINT (symbol, 1) == UNSPEC_GOTNTPOFF)))
+   return true;
+   }
+}
+  return false;
+}
+
 /* Calculate the length of the memory address in the instruction encoding.
Includes addr32 prefix, does not include the one-byte modrm, opcode,
or other prefixes.  We never generate addr32 prefix for LEA insn.  */
@@ -24275,25 +24311,8 @@ memory_address_length (rtx addr, bool lea)
   else if (disp && !base && !index)
 {
   len += 4;
-  if (TARGET_64BIT)
-   {
- rtx symbol = disp;
-
- if (GET_CODE (disp) == CONST)
-   symbol = XEXP (disp, 0);
- if (GET_CODE (symbol) == PLUS
- && CONST_INT_P (XEXP (symbol, 1)))
-   symbol = XEXP (symbol, 0);
-
- if (GET_CODE (symbol) != LABEL_REF
- && (GET_CODE (symbol) != SYMBOL_REF
- || SYMBOL_REF_TLS_MODEL (symbol) != 0)
- && (GET_CODE (symbol) != UNSPEC
- || (XINT (symbol, 1) != UNSPEC_GOTPCREL
- && XINT (symbol, 1) != UNSPEC_PCREL
- && XINT (symbol, 1) != UNSPEC_GOTNTPOFF)))
-   len++;
-   }
+  if (rip_relative_addr_p (&parts))
+   len++;
 }
   else
 {
@@ -24856,6 +24875,122 @@ ia32_multipass_dfa_lookahead (void)
 }
 }

+/* Return true if target platform supports macro-fusion.  */
+
+static bool
+ix86_macro_fusion_p ()
+{
+  if (TARGET_FUSE_CMP_AND_BRANCH)
+return true;
+  else
+return false;
+}
+
+/* Check whether the current microarchitecture supports macro fusion
+   for insn pair "CONDGEN + CONDJMP". Refer to
+   "Intel Architectures Optimization Reference Manual". */
+
+static bool
+ix86_macro_fusio

Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion

2013-10-15 Thread Wei Mi
Thanks for the comments. One question inlined. Preparing another patch
addressing the comments.

Regards,
Wei Mi.

On Tue, Oct 15, 2013 at 1:35 PM, Jeff Law  wrote:
> On 10/03/13 12:24, Wei Mi wrote:
>>
>> Thanks,
>> Wei Mi.
>>
>> 2013-10-03  Wei Mi  
>>
>>  * gcc/config/i386/i386.c (memory_address_length): Extract a part
>>  of code to rip_relative_addr_p.
>>  (rip_relative_addr_p): New function.
>>  (ix86_macro_fusion_p): Ditto.
>>  (ix86_macro_fusion_pair_p): Ditto.
>>  * gcc/config/i386/i386.h: Add new tune features about
>> macro-fusion.
>>  * gcc/config/i386/x86-tune.def (DEF_TUNE): Ditto.
>>  * gcc/doc/tm.texi: Generated.
>>  * gcc/doc/tm.texi.in: Ditto.
>>  * gcc/haifa-sched.c (try_group_insn): New function.
>>  (group_insns_for_macro_fusion): Ditto.
>>  (sched_init): Call group_insns_for_macro_fusion.
>>  * gcc/sched-rgn.c (add_branch_dependences): Keep insns in
>>  a SCHED_GROUP at the end of BB to retain their location.
>>  * gcc/target.def: Add two hooks: macro_fusion_p and
>>  macro_fusion_pair_p.
>
> I'm not going to comment on the x86 specific stuff -- I'll defer to the port
> maintainers for that.
>
>
>
>> index 61eaaef..d6726a9 100644
>> --- a/gcc/haifa-sched.c
>> +++ b/gcc/haifa-sched.c
>> @@ -6519,6 +6519,44 @@ setup_sched_dump (void)
>>  ? stderr : dump_file);
>>   }
>>
>> +static void
>> +try_group_insn (rtx insn)
>
> You need a comment for this function.
>

Ok, will add comment for it.

>
>
>> +{
>> +  unsigned int condreg1, condreg2;
>> +  rtx cc_reg_1;
>> +  rtx prev;
>> +
>> +  targetm.fixed_condition_code_regs (&condreg1, &condreg2);
>> +  cc_reg_1 = gen_rtx_REG (CCmode, condreg1);
>> +  prev = prev_nonnote_nondebug_insn (insn);
>> +  if (!any_condjump_p (insn)
>> +  || !reg_referenced_p (cc_reg_1, PATTERN (insn))
>> +  || !prev
>> +  || !modified_in_p (cc_reg_1, prev))
>> +return;
>
> I'd test !any_condjump_p at the start of this function before calling the
> target hook.  If insn isn't a conditional jump, then all the other work is
> totally useless.

Ok. will fix it.

>
> Aren't you just trying to see if we have a comparison feeding the
> conditional jump and if they're already adjacent?  Do you actually need to
> get the condition code regs to do that test?
>

Yes, I am trying to see if we have a comparison feeding the
conditional jump and if they're already adjacent. Do you have an
easier way to do that test?

>
>> +
>> +  /* Different microarchitectures support macro fusions for different
>> + combinations of insn pairs.  */
>> +  if (!targetm.sched.macro_fusion_pair_p
>> +  || !targetm.sched.macro_fusion_pair_p (prev, insn))
>> +return;
>> +
>> +  SCHED_GROUP_P (insn) = 1;
>
> I'm surprised that SCHED_GROUP_P worked -- I've tried to do similar stuff in
> the past and ran into numerous problems trying to hijack SCHED_GROUP_P for
> this kind of purpose.
>
>
>
>>
>>   static void haifa_init_only_bb (basic_block, basic_block);
>> diff --git a/gcc/sched-rgn.c b/gcc/sched-rgn.c
>> index e1a2dce..156359e 100644
>> --- a/gcc/sched-rgn.c
>> +++ b/gcc/sched-rgn.c
>> @@ -2443,6 +2443,8 @@ add_branch_dependences (rtx head, rtx tail)
>>cc0 setters remain at the end because they can't be moved away from
>>their cc0 user.
>>
>> + Predecessors of SCHED_GROUP_P instructions at the end remain at the
>> end.
>> +
>>COND_EXEC insns cannot be moved past a branch (see e.g. PR17808).
>>
>>Insns setting TARGET_CLASS_LIKELY_SPILLED_P registers (usually
>> return
>> @@ -2465,7 +2467,8 @@ add_branch_dependences (rtx head, rtx tail)
>>   #endif
>>   || (!reload_completed
>>   && sets_likely_spilled (PATTERN (insn)
>> -|| NOTE_P (insn))
>> +|| NOTE_P (insn)
>> +|| (last != 0 && SCHED_GROUP_P (last)))
>>   {
>> if (!NOTE_P (insn))
>>  {
>
> This looks like a straightforward bugfix and probably should go forward
> independent of this enhancement.

Ok, I will separate it into another patch.

>
> Jeff


Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion

2013-10-16 Thread Wei Mi
> Go ahead and consider that pre-approved.  Just send it to the list with a
> note that I approved it in this thread.
>
> Jeff

Thanks! The new patch addressed Jeff's comments.

Is it ok for x86 maintainer?

Thanks,
Wei Mi.

2013-10-16  Wei Mi  

* gcc/config/i386/i386.c (memory_address_length): Extract a part
of code to rip_relative_addr_p.
(rip_relative_addr_p): New function.
(ix86_macro_fusion_p): Ditto.
(ix86_macro_fusion_pair_p): Ditto.
* gcc/config/i386/i386.h: Add new tune features about macro-fusion.
* gcc/config/i386/x86-tune.def (DEF_TUNE): Ditto.
* gcc/doc/tm.texi: Generated.
* gcc/doc/tm.texi.in: Ditto.
* gcc/haifa-sched.c (try_group_insn): New function.
(group_insns_for_macro_fusion): Ditto.
(sched_init): Call group_insns_for_macro_fusion.
* gcc/target.def: Add two hooks: macro_fusion_p and
macro_fusion_pair_p.

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 1fd3f60..59b0bcf 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -24204,6 +24204,42 @@ ix86_instantiate_decls (void)
   instantiate_decl_rtl (s->rtl);
 }

+/* Check whether x86 address PARTS is a pc-relative address.  */
+
+static bool
+rip_relative_addr_p (struct ix86_address *parts)
+{
+  rtx base, index, disp;
+
+  base = parts->base;
+  index = parts->index;
+  disp = parts->disp;
+
+  if (disp && !base && !index)
+{
+  if (TARGET_64BIT)
+   {
+ rtx symbol = disp;
+
+ if (GET_CODE (disp) == CONST)
+   symbol = XEXP (disp, 0);
+ if (GET_CODE (symbol) == PLUS
+ && CONST_INT_P (XEXP (symbol, 1)))
+   symbol = XEXP (symbol, 0);
+
+ if (GET_CODE (symbol) == LABEL_REF
+ || (GET_CODE (symbol) == SYMBOL_REF
+ && SYMBOL_REF_TLS_MODEL (symbol) == 0)
+ || (GET_CODE (symbol) == UNSPEC
+ && (XINT (symbol, 1) == UNSPEC_GOTPCREL
+ || XINT (symbol, 1) == UNSPEC_PCREL
+ || XINT (symbol, 1) == UNSPEC_GOTNTPOFF)))
+   return true;
+   }
+}
+  return false;
+}
+
 /* Calculate the length of the memory address in the instruction encoding.
Includes addr32 prefix, does not include the one-byte modrm, opcode,
or other prefixes.  We never generate addr32 prefix for LEA insn.  */
@@ -24275,25 +24311,8 @@ memory_address_length (rtx addr, bool lea)
   else if (disp && !base && !index)
 {
   len += 4;
-  if (TARGET_64BIT)
-   {
- rtx symbol = disp;
-
- if (GET_CODE (disp) == CONST)
-   symbol = XEXP (disp, 0);
- if (GET_CODE (symbol) == PLUS
- && CONST_INT_P (XEXP (symbol, 1)))
-   symbol = XEXP (symbol, 0);
-
- if (GET_CODE (symbol) != LABEL_REF
- && (GET_CODE (symbol) != SYMBOL_REF
- || SYMBOL_REF_TLS_MODEL (symbol) != 0)
- && (GET_CODE (symbol) != UNSPEC
- || (XINT (symbol, 1) != UNSPEC_GOTPCREL
- && XINT (symbol, 1) != UNSPEC_PCREL
- && XINT (symbol, 1) != UNSPEC_GOTNTPOFF)))
-   len++;
-   }
+  if (rip_relative_addr_p (&parts))
+   len++;
 }
   else
 {
@@ -24856,6 +24875,122 @@ ia32_multipass_dfa_lookahead (void)
 }
 }

+/* Return true if target platform supports macro-fusion.  */
+
+static bool
+ix86_macro_fusion_p ()
+{
+  if (TARGET_FUSE_CMP_AND_BRANCH)
+return true;
+  else
+return false;
+}
+
+/* Check whether the current microarchitecture supports macro fusion
+   for insn pair "CONDGEN + CONDJMP". Refer to
+   "Intel Architectures Optimization Reference Manual". */
+
+static bool
+ix86_macro_fusion_pair_p (rtx condgen, rtx condjmp)
+{
+  rtx src, dest;
+  rtx single_set = single_set (condgen);
+  enum rtx_code ccode;
+  rtx compare_set = NULL_RTX, test_if, cond;
+  rtx alu_set = NULL_RTX, addr = NULL_RTX;
+
+  if (get_attr_type (condgen) != TYPE_TEST
+  && get_attr_type (condgen) != TYPE_ICMP
+  && get_attr_type (condgen) != TYPE_INCDEC
+  && get_attr_type (condgen) != TYPE_ALU)
+return false;
+
+  if (single_set == NULL_RTX
+  && !TARGET_FUSE_ALU_AND_BRANCH)
+return false;
+
+  if (single_set != NULL_RTX)
+compare_set = single_set;
+  else
+{
+  int i;
+  rtx pat = PATTERN (condgen);
+  for (i = 0; i < XVECLEN (pat, 0); i++)
+   if (GET_CODE (XVECEXP (pat, 0, i)) == SET)
+ {
+   rtx set_src = SET_SRC (XVECEXP (pat, 0, i));
+   if (GET_CODE (set_src) == COMPARE)
+ compare_set = XVECEXP (pat, 0, i);
+   else
+ alu_set = XVECEXP (pat, 0, i);
+ }
+}
+  if (compare_set == NULL_RTX)
+return false;
+  src = SET_SRC

Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion

2013-10-17 Thread Wei Mi
On Thu, Oct 17, 2013 at 12:35 AM, Marek Polacek  wrote:
> On Wed, Oct 16, 2013 at 04:25:58PM -0700, Wei Mi wrote:
>> +/* Return true if target platform supports macro-fusion.  */
>> +
>> +static bool
>> +ix86_macro_fusion_p ()
>> +{
>> +  if (TARGET_FUSE_CMP_AND_BRANCH)
>> +return true;
>> +  else
>> +return false;
>> +}
>
> That looks weird, why not just
>
> static bool
> ix86_macro_fusion_p (void)
> {
>   return TARGET_FUSE_CMP_AND_BRANCH;
> }
>
> ?
>
> Marek

Thanks, fixed.

Wei Mi.


Re: Fwd: [PATCH] Scheduling result adjustment to enable macro-fusion

2013-11-01 Thread Wei Mi
Ping.  Is it ok for x86 maintainer?

Thanks,
Wei Mi.

On Wed, Oct 16, 2013 at 4:25 PM, Wei Mi  wrote:
>> Go ahead and consider that pre-approved.  Just send it to the list with a
>> note that I approved it in this thread.
>>
>> Jeff
>
> Thanks! The new patch addressed Jeff's comments.
>
> Is it ok for x86 maintainer?
>
> Thanks,
> Wei Mi.
>
> 2013-10-16  Wei Mi  
>
> * gcc/config/i386/i386.c (memory_address_length): Extract a part
> of code to rip_relative_addr_p.
> (rip_relative_addr_p): New function.
> (ix86_macro_fusion_p): Ditto.
> (ix86_macro_fusion_pair_p): Ditto.
> * gcc/config/i386/i386.h: Add new tune features about macro-fusion.
> * gcc/config/i386/x86-tune.def (DEF_TUNE): Ditto.
> * gcc/doc/tm.texi: Generated.
> * gcc/doc/tm.texi.in: Ditto.
> * gcc/haifa-sched.c (try_group_insn): New function.
> (group_insns_for_macro_fusion): Ditto.
> (sched_init): Call group_insns_for_macro_fusion.
> * gcc/target.def: Add two hooks: macro_fusion_p and
> macro_fusion_pair_p.
>
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 1fd3f60..59b0bcf 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -24204,6 +24204,42 @@ ix86_instantiate_decls (void)
>instantiate_decl_rtl (s->rtl);
>  }
>
> +/* Check whether x86 address PARTS is a pc-relative address.  */
> +
> +static bool
> +rip_relative_addr_p (struct ix86_address *parts)
> +{
> +  rtx base, index, disp;
> +
> +  base = parts->base;
> +  index = parts->index;
> +  disp = parts->disp;
> +
> +  if (disp && !base && !index)
> +{
> +  if (TARGET_64BIT)
> +   {
> + rtx symbol = disp;
> +
> + if (GET_CODE (disp) == CONST)
> +   symbol = XEXP (disp, 0);
> + if (GET_CODE (symbol) == PLUS
> + && CONST_INT_P (XEXP (symbol, 1)))
> +   symbol = XEXP (symbol, 0);
> +
> + if (GET_CODE (symbol) == LABEL_REF
> + || (GET_CODE (symbol) == SYMBOL_REF
> + && SYMBOL_REF_TLS_MODEL (symbol) == 0)
> + || (GET_CODE (symbol) == UNSPEC
> + && (XINT (symbol, 1) == UNSPEC_GOTPCREL
> + || XINT (symbol, 1) == UNSPEC_PCREL
> + || XINT (symbol, 1) == UNSPEC_GOTNTPOFF)))
> +   return true;
> +   }
> +}
> +  return false;
> +}
> +
>  /* Calculate the length of the memory address in the instruction encoding.
> Includes addr32 prefix, does not include the one-byte modrm, opcode,
> or other prefixes.  We never generate addr32 prefix for LEA insn.  */
> @@ -24275,25 +24311,8 @@ memory_address_length (rtx addr, bool lea)
>else if (disp && !base && !index)
>  {
>len += 4;
> -  if (TARGET_64BIT)
> -   {
> - rtx symbol = disp;
> -
> - if (GET_CODE (disp) == CONST)
> -   symbol = XEXP (disp, 0);
> - if (GET_CODE (symbol) == PLUS
> - && CONST_INT_P (XEXP (symbol, 1)))
> -   symbol = XEXP (symbol, 0);
> -
> - if (GET_CODE (symbol) != LABEL_REF
> - && (GET_CODE (symbol) != SYMBOL_REF
> - || SYMBOL_REF_TLS_MODEL (symbol) != 0)
> - && (GET_CODE (symbol) != UNSPEC
> - || (XINT (symbol, 1) != UNSPEC_GOTPCREL
> - && XINT (symbol, 1) != UNSPEC_PCREL
> - && XINT (symbol, 1) != UNSPEC_GOTNTPOFF)))
> -   len++;
> -   }
> +  if (rip_relative_addr_p (&parts))
> +   len++;
>  }
>else
>  {
> @@ -24856,6 +24875,122 @@ ia32_multipass_dfa_lookahead (void)
>  }
>  }
>
> +/* Return true if target platform supports macro-fusion.  */
> +
> +static bool
> +ix86_macro_fusion_p ()
> +{
> +  if (TARGET_FUSE_CMP_AND_BRANCH)
> +return true;
> +  else
> +return false;
> +}
> +
> +/* Check whether the current microarchitecture supports macro fusion
> +   for insn pair "CONDGEN + CONDJMP". Refer to
> +   "Intel Architectures Optimization Reference Manual". */
> +
> +static bool
> +ix86_macro_fusion_pair_p (rtx condgen, rtx condjmp)
> +{
> +  rtx src, dest;
> +  rtx single_set = single_set (condgen);
> +  enum rtx_code ccode;
> +  rtx compare_set = NULL_RTX, test_if, cond;
> +  rtx alu_set = NULL_RTX, addr = NULL_RTX;
> +
> +  if (get_attr_type (condgen) != TYPE_TEST
> +  && get_attr_type (condgen) != TYPE_ICMP
> +  

[PATCH] PR58985: testcase error.

2013-11-04 Thread Wei Mi
Hi,

This is to fix the testcase error reported in PR58985.

The intention of the testcase was to ensure that no REG_EQUIV note is
generated for a reg which is used in a paradoxical subreg. When the
target was x86, a subreg was generated, so I omitted the subreg from
the regexp pattern. However, no subreg is generated for target
cris-axis-elf, so a REG_EQUIV note should be allowed there.
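
For reference, the note the regexp matches looks roughly like the
following in the -fdump-rtl-ira output (a sketch from memory; the exact
modes and flags vary by target):

 (expr_list:REG_EQUIV (mem/c ... (symbol_ref ("ip")) ...)
(nil))

On x86 the failing dump also contains a subreg after the note, which is
what the added ".*subreg" in the pattern keys on; on cris-axis-elf no
subreg is generated, so a bare REG_EQUIV note no longer fails the test.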

Is it ok for trunk and gcc-4.8 branch?

Thanks,
Wei Mi.

2013-11-04  Wei Mi  

PR regression/58985
* testsuite/gcc.dg/pr57518.c: Add subreg in regexp pattern.

Index: testsuite/gcc.dg/pr57518.c
===
--- testsuite/gcc.dg/pr57518.c  (revision 204353)
+++ testsuite/gcc.dg/pr57518.c  (working copy)
@@ -2,7 +2,7 @@

 /* { dg-do compile } */
 /* { dg-options "-O2 -fdump-rtl-ira" } */
-/* { dg-final { scan-rtl-dump-not "REG_EQUIV\[^\n\]*mem\[^\n\]*\"ip\"" "ira" } } */
+/* { dg-final { scan-rtl-dump-not "REG_EQUIV\[^\n\]*mem\[^\n\]*\"ip\".*subreg" "ira" } } */

 char ip[10];
 int total;

