On 09/07/2016 02:09 PM, Richard Biener wrote:
> On Wed, Sep 7, 2016 at 1:37 PM, Martin Liška <mli...@suse.cz> wrote:
>> On 08/18/2016 06:06 PM, Richard Biener wrote:
>>> On August 18, 2016 5:54:49 PM GMT+02:00, Jakub Jelinek <ja...@redhat.com> 
>>> wrote:
>>>> On Thu, Aug 18, 2016 at 08:51:31AM -0700, Andi Kleen wrote:
>>>>>> I'd prefer to make updates atomic in multi-threaded applications.
>>>>>> The best proxy we have for that is -pthread.
>>>>>>
>>>>>> Is it slower, most definitely, but odds are we're giving folks
>>>>>> garbage data otherwise, which in many ways is even worse.
>>>>>
>>>>> It will likely be catastrophically slower in some cases.
>>>>>
>>>>> Catastrophically as in too slow to be usable.
>>>>>
>>>>> An atomic instruction is a lot more expensive than a single
>>>> increment. Also
>>>>> they sometimes are really slow depending on the state of the machine.
>>>>
>>>> Can't we just have thread-local copies of all the counters (perhaps
>>>> using
>>>> __thread pointer as base) and just atomically merge at thread
>>>> termination?
>>>
>>> I suggested that as well but of course it'll have its own class of issues 
>>> (short lived threads, so we need to somehow re-use counters from terminated 
>>> threads, large number of threads and thus using too much memory for the 
>>> counters)
>>>
>>> Richard.
>>
>> Hello.
>>
>> I've put this approach on my TODO list; let's see whether it's doable in a
>> reasonable amount of time.
>>
>> I've just finished some measurements to illustrate the slow-down of the
>> -fprofile-update=atomic approach.
>> All numbers are: no profile, -fprofile-generate, -fprofile-generate 
>> -fprofile-update=atomic
>> c-ray benchmark (utilizing 8 threads, -O3): 1.7, 15.5, 38.1s
>> unrar (utilizing 8 threads, -O3): 3.6, 11.6, 38s
>> tramp3d (1 thread, -O3): 18.0, 46.6, 168s
>>
>> So the slow-down is roughly 300% compared to -fprofile-generate. I don't
>> have much experience with default option selection, but these numbers can
>> probably help.
>>
>> Thoughts?
> 
> Look at the generated code for an instrumented simple loop and see that for
> the non-atomic updates we happily apply store-motion to the counter update
> and thus we only get one counter update per loop exit rather than one per
> loop iteration.  Now see what happens for the atomic case (I suspect you
> get one per iteration).
> 
> I'll bet this accounts for most of the slowdown.
> 
> Back then, ICC, which had atomic counter updates (but using function
> calls - ugh!), had a > 1000% overhead with FDO for tramp3d (they also
> didn't have early inlining -- removing abstraction helps reduce the number
> of counters significantly).
> 
> Richard.

Hi.

During Cauldron I discussed with Richi approaches to speed up ARCS
profile counter updates. My first attempt utilizes TLS storage: every
function accumulates its arc counters in thread-local copies, which are
eventually added (using atomic operations) to the global counters at the
very end of the function. Currently the patch relies on target support
for TLS; it's questionable whether to make that a requirement for
-fprofile-update=atomic, or whether to add a new option value like
-fprofile-update=atomic-tls.
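
Sketched as plain C (illustrative names only; the real symbols are
generated by build_var), the instrumented function would look roughly
like this:

```c
/* Illustrative sketch of the transformation, not actual GCC-emitted code.
   __gcov4_foo and N_ARCS are made-up names for the example.  */
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

#define N_ARCS 2

static _Atomic long long __gcov4_foo[N_ARCS];           /* global counters */
static _Thread_local long long __gcov4_tls_foo[N_ARCS]; /* per-thread copy */

void foo (int n)
{
  /* Function entry: zero the TLS counters (the patch emits a bzero call).  */
  memset (__gcov4_tls_foo, 0, sizeof (__gcov4_tls_foo));

  for (int i = 0; i < n; i++)
    __gcov4_tls_foo[0]++;   /* plain update; store motion still applies */

  /* Function exit: merge the TLS counters into the globals atomically.  */
  for (int i = 0; i < N_ARCS; i++)
    atomic_fetch_add_explicit (&__gcov4_foo[i], __gcov4_tls_foo[i],
			       memory_order_relaxed);
}
```

So the hot path keeps cheap non-atomic increments and only pays for one
atomic RMW per counter per function invocation.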

Running the patch on tramp3d, the benchmark now takes 88s to finish,
roughly half the time of the current -fprofile-update=atomic
implementation (168s above).
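
For reference, the store-motion effect is easy to see in a toy example
(plain C, not GCC-emitted code):

```c
/* Toy illustration of why per-iteration atomic updates are so costly:
   the plain counter can live in a register and be stored once after the
   loop, while each atomic fetch-add is a full read-modify-write that the
   compiler must not hoist out of the loop.  */
#include <assert.h>
#include <stdatomic.h>

long long counter_plain;
_Atomic long long counter_atomic;

void loop_plain (int n)
{
  for (int i = 0; i < n; i++)
    counter_plain++;               /* store motion: one store per loop exit */
}

void loop_atomic (int n)
{
  for (int i = 0; i < n; i++)      /* one locked RMW per iteration */
    atomic_fetch_add_explicit (&counter_atomic, 1, memory_order_relaxed);
}
```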

Thoughts?
Martin

> 
>> Martin
>>
>>>
>>>>      Jakub
>>>
>>>
>>

From 91b5342c422950b32d1ba7d616bda418c7993a84 Mon Sep 17 00:00:00 2001
From: marxin <mli...@suse.cz>
Date: Thu, 15 Sep 2016 09:49:41 +0200
Subject: [PATCH] Improve implementation of -fprofile-update=atomic

---
 gcc/coverage.c     | 99 ++++++++++++++++++++++++++++++++++++++++++++++++++----
 gcc/coverage.h     |  6 ++--
 gcc/profile.c      |  2 ++
 gcc/tree-profile.c | 45 +++++++++----------------
 4 files changed, 114 insertions(+), 38 deletions(-)

diff --git a/gcc/coverage.c b/gcc/coverage.c
index a6a888a..1e0052f 100644
--- a/gcc/coverage.c
+++ b/gcc/coverage.c
@@ -48,6 +48,11 @@ along with GCC; see the file COPYING3.  If not see
 #include "intl.h"
 #include "params.h"
 #include "auto-profile.h"
+#include "varasm.h"
+#include "gimple.h"
+#include "gimple-iterator.h"
+#include "tree-vrp.h"
+#include "tree-ssanames.h"
 
 #include "gcov-io.c"
 
@@ -93,6 +98,9 @@ static GTY(()) tree fn_v_ctrs[GCOV_COUNTERS];   /* counter variables.  */
 static unsigned fn_n_ctrs[GCOV_COUNTERS]; /* Counters allocated.  */
 static unsigned fn_b_ctrs[GCOV_COUNTERS]; /* Allocation base.  */
 
+/* Thread-local storage variable of ARCS counters.  */
+static GTY(()) tree arcs_tls_ctr;
+
 /* Coverage info VAR_DECL and function info type nodes.  */
 static GTY(()) tree gcov_info_var;
 static GTY(()) tree gcov_fn_info_type;
@@ -127,7 +135,8 @@ static const char *const ctr_names[GCOV_COUNTERS] = {
 
 /* Forward declarations.  */
 static void read_counts_file (void);
-static tree build_var (tree, tree, int);
+static tree build_var (tree fn_decl, tree type, int counter,
+		       bool is_tls = false);
 static void build_fn_info_type (tree, unsigned, tree);
 static void build_info_type (tree, tree);
 static tree build_fn_info (const struct coverage_data *, tree, tree);
@@ -442,6 +451,10 @@ coverage_counter_alloc (unsigned counter, unsigned num)
 
       fn_v_ctrs[counter]
 	= build_var (current_function_decl, array_type, counter);
+
+      if (counter == GCOV_COUNTER_ARCS)
+	arcs_tls_ctr = build_var (current_function_decl, array_type, counter,
+				  true);
     }
 
   fn_b_ctrs[counter] = fn_n_ctrs[counter];
@@ -454,7 +467,8 @@ coverage_counter_alloc (unsigned counter, unsigned num)
 /* Generate a tree to access COUNTER NO.  */
 
 tree
-tree_coverage_counter (unsigned counter, unsigned no, coverage_usage_type type)
+tree_coverage_counter (unsigned counter, unsigned no, coverage_usage_type type,
+		       bool is_tls)
 {
   tree gcov_type_node = get_gcov_type ();
 
@@ -463,7 +477,8 @@ tree_coverage_counter (unsigned counter, unsigned no, coverage_usage_type type)
   no += fn_b_ctrs[counter];
   
   /* "no" here is an array index, scaled to bytes later.  */
-  tree v = build4 (ARRAY_REF, gcov_type_node, fn_v_ctrs[counter],
+  tree array = is_tls ? arcs_tls_ctr : fn_v_ctrs[counter];
+  tree v = build4 (ARRAY_REF, gcov_type_node, array,
 		   build_int_cst (integer_type_node, no), NULL, NULL);
 
   if (type == COVERAGE_ADDR)
@@ -471,6 +486,59 @@ tree_coverage_counter (unsigned counter, unsigned no, coverage_usage_type type)
 
   return v;
 }
+
+/* Emit code to zero the TLS ARCS counters at function entry and merge
+   them into the global counters at function exit.  */
+
+void
+generate_arcs_tls_update (void)
+{
+  edge_iterator ei;
+  edge e;
+
+  if (flag_profile_update != PROFILE_UPDATE_ATOMIC)
+    return;
+
+  unsigned counter_count = fn_n_ctrs[GCOV_COUNTER_ARCS];
+  size_t mode_size = LONG_LONG_TYPE_SIZE > 32 ? 8 : 4;
+
+  tree atomic_add_fn = builtin_decl_explicit (mode_size == 8
+					      ? BUILT_IN_ATOMIC_FETCH_ADD_8:
+					      BUILT_IN_ATOMIC_FETCH_ADD_4);
+  /* Zero the TLS ARCS counters.  */
+  basic_block entry_bb = ENTRY_BLOCK_PTR_FOR_FN (cfun);
+  FOR_EACH_EDGE (e, ei, entry_bb->succs)
+    {
+      size_t size = counter_count * mode_size;
+      tree fn = builtin_decl_explicit (BUILT_IN_BZERO);
+      tree addr = tree_coverage_counter (GCOV_COUNTER_ARCS, 0,
+					 COVERAGE_ADDR, true);
+      gcall *call = gimple_build_call (fn, 2, addr,
+				       build_int_cst (integer_type_node, size));
+      gsi_insert_on_edge (e, call);
+    }
+
+  /* Update global counters with the values stored in TLS.  */
+  basic_block exit_bb = EXIT_BLOCK_PTR_FOR_FN (cfun);
+  FOR_EACH_EDGE (e, ei, exit_bb->preds)
+    {
+      for (unsigned i = 0; i < counter_count; i++)
+	{
+	  tree ref = tree_coverage_counter (GCOV_COUNTER_ARCS, i, COVERAGE_REF,
+					    true);
+	  tree gcov_tmp_var = make_temp_ssa_name (get_gcov_type (),
+						  NULL, "PROF_edge_counter");
+	  gassign *stmt = gimple_build_assign (gcov_tmp_var, ref);
+	  gsi_insert_on_edge (e, stmt);
+	  tree addr = tree_coverage_counter (GCOV_COUNTER_ARCS, i,
+					     COVERAGE_ADDR);
+
+	  /* __atomic_fetch_add (&global_counter, tls_counter,
+	     MEMMODEL_RELAXED); */
+	  gcall *call = gimple_build_call (atomic_add_fn, 3, addr, gcov_tmp_var,
+					   build_int_cst (integer_type_node,
+							  MEMMODEL_RELAXED));
+	  gsi_insert_on_edge (e, call);
+	}
+    }
+}
 
 
 /* Generate a checksum for a string.  CHKSUM is the current
@@ -706,6 +774,14 @@ coverage_end_function (unsigned lineno_checksum, unsigned cfg_checksum)
 	      DECL_SIZE (var) = TYPE_SIZE (array_type);
 	      DECL_SIZE_UNIT (var) = TYPE_SIZE_UNIT (array_type);
 	      varpool_node::finalize_decl (var);
+
+	      if (i == GCOV_COUNTER_ARCS)
+		{
+		  TREE_TYPE (arcs_tls_ctr) = array_type;
+		  DECL_SIZE (arcs_tls_ctr) = TYPE_SIZE (array_type);
+		  DECL_SIZE_UNIT (arcs_tls_ctr) = TYPE_SIZE_UNIT (array_type);
+		  varpool_node::finalize_decl (arcs_tls_ctr);
+		}
 	    }
 	  
 	  fn_b_ctrs[i] = fn_n_ctrs[i] = 0;
@@ -717,10 +793,12 @@ coverage_end_function (unsigned lineno_checksum, unsigned cfg_checksum)
 }
 
 /* Build a coverage variable of TYPE for function FN_DECL.  If COUNTER
-   >= 0 it is a counter array, otherwise it is the function structure.  */
+   >= 0 it is a counter array, otherwise it is the function structure.
+   If IS_TLS is true, the newly added variable will live in thread-local
+   storage.  */
 
 static tree
-build_var (tree fn_decl, tree type, int counter)
+build_var (tree fn_decl, tree type, int counter, bool is_tls)
 {
   tree var = build_decl (BUILTINS_LOCATION, VAR_DECL, NULL_TREE, type);
   const char *fn_name = IDENTIFIER_POINTER (DECL_ASSEMBLER_NAME (fn_decl));
@@ -729,13 +807,18 @@ build_var (tree fn_decl, tree type, int counter)
 
   fn_name = targetm.strip_name_encoding (fn_name);
   fn_name_len = strlen (fn_name);
-  buf = XALLOCAVEC (char, fn_name_len + 8 + sizeof (int) * 3);
+  buf = XALLOCAVEC (char, fn_name_len + 13 + sizeof (int) * 3);
 
   if (counter < 0)
     strcpy (buf, "__gcov__");
   else
     sprintf (buf, "__gcov%u_", counter);
+
+  len = strlen (buf);
+  if (is_tls)
+    strcpy (buf + len, "tls_");
   len = strlen (buf);
+
   buf[len - 1] = symbol_table::symbol_suffix_separator ();
   memcpy (buf + len, fn_name, fn_name_len + 1);
   DECL_NAME (var) = get_identifier (buf);
@@ -744,6 +827,10 @@ build_var (tree fn_decl, tree type, int counter)
   DECL_NONALIASED (var) = 1;
   SET_DECL_ALIGN (var, TYPE_ALIGN (type));
 
+  /* TODO: decide what to do for targets without TLS support.  */
+  if (is_tls && targetm.have_tls)
+    set_decl_tls_model (var, decl_default_tls_model (var));
+
   return var;
 }
 
diff --git a/gcc/coverage.h b/gcc/coverage.h
index 253aca4..49a332a 100644
--- a/gcc/coverage.h
+++ b/gcc/coverage.h
@@ -53,8 +53,10 @@ enum coverage_usage_type
 };
 
 /* Use a counter from the most recent allocation.  */
-extern tree tree_coverage_counter (unsigned, unsigned, coverage_usage_type);
-
+extern tree tree_coverage_counter (unsigned counter, unsigned no,
+				   coverage_usage_type type,
+				   bool is_tls = false);
+extern void generate_arcs_tls_update (void);
 /* Get all the counters for the current function.  */
 extern gcov_type *get_coverage_counts (unsigned /*counter*/,
 				       unsigned /*expected*/,
diff --git a/gcc/profile.c b/gcc/profile.c
index 4519e7d..de20293 100644
--- a/gcc/profile.c
+++ b/gcc/profile.c
@@ -143,6 +143,8 @@ instrument_edges (struct edge_list *el)
 	}
     }
 
+  generate_arcs_tls_update ();
+
   total_num_blocks_created += num_edges;
   if (dump_file)
     fprintf (dump_file, "%d edges instrumented\n", num_instr_edges);
diff --git a/gcc/tree-profile.c b/gcc/tree-profile.c
index 7b69338..a43a117 100644
--- a/gcc/tree-profile.c
+++ b/gcc/tree-profile.c
@@ -255,36 +255,21 @@ gimple_gen_edge_profiler (int edgeno, edge e)
 
   one = build_int_cst (gcov_type_node, 1);
 
-  if (flag_profile_update == PROFILE_UPDATE_ATOMIC)
-    {
-      /* __atomic_fetch_add (&counter, 1, MEMMODEL_RELAXED); */
-      tree addr = tree_coverage_counter (GCOV_COUNTER_ARCS, edgeno,
-					 COVERAGE_ADDR);
-      tree f = builtin_decl_explicit (LONG_LONG_TYPE_SIZE > 32
-				      ? BUILT_IN_ATOMIC_FETCH_ADD_8:
-				      BUILT_IN_ATOMIC_FETCH_ADD_4);
-      gcall *stmt = gimple_build_call (f, 3, addr, one,
-				       build_int_cst (integer_type_node,
-						      MEMMODEL_RELAXED));
-      gsi_insert_on_edge (e, stmt);
-    }
-  else
-    {
-      tree ref = tree_coverage_counter (GCOV_COUNTER_ARCS, edgeno,
-					COVERAGE_REF);
-      tree gcov_type_tmp_var = make_temp_ssa_name (gcov_type_node,
-						   NULL, "PROF_edge_counter");
-      gassign *stmt1 = gimple_build_assign (gcov_type_tmp_var, ref);
-      gcov_type_tmp_var = make_temp_ssa_name (gcov_type_node,
-					      NULL, "PROF_edge_counter");
-      gassign *stmt2 = gimple_build_assign (gcov_type_tmp_var, PLUS_EXPR,
-					    gimple_assign_lhs (stmt1), one);
-      gassign *stmt3 = gimple_build_assign (unshare_expr (ref),
-					    gimple_assign_lhs (stmt2));
-      gsi_insert_on_edge (e, stmt1);
-      gsi_insert_on_edge (e, stmt2);
-      gsi_insert_on_edge (e, stmt3);
-    }
+  tree ref
+    = tree_coverage_counter (GCOV_COUNTER_ARCS, edgeno, COVERAGE_REF,
+			     flag_profile_update == PROFILE_UPDATE_ATOMIC);
+  tree gcov_type_tmp_var = make_temp_ssa_name (gcov_type_node,
+					       NULL, "PROF_edge_counter");
+  gassign *stmt1 = gimple_build_assign (gcov_type_tmp_var, ref);
+  gcov_type_tmp_var = make_temp_ssa_name (gcov_type_node,
+					  NULL, "PROF_edge_counter");
+  gassign *stmt2 = gimple_build_assign (gcov_type_tmp_var, PLUS_EXPR,
+					gimple_assign_lhs (stmt1), one);
+  gassign *stmt3 = gimple_build_assign (unshare_expr (ref),
+					gimple_assign_lhs (stmt2));
+  gsi_insert_on_edge (e, stmt1);
+  gsi_insert_on_edge (e, stmt2);
+  gsi_insert_on_edge (e, stmt3);
 }
 
 /* Emits code to get VALUE to instrument at GSI, and returns the
-- 
2.9.2
