Hi,

  I am working on user-directed and compiler-directed function
multiversioning which has been discussed in these threads:

1) http://gcc.gnu.org/ml/gcc-patches/2011-04/msg02344.html
2)  http://gcc.gnu.org/ml/gcc/2011-08/msg00298.html

The gist of the discussions for user-directed multiversioning is that
it should use a function overloading mechanism to allow the user to
specify multiple versions of the same function and/or use a function
attribute to specify that a particular function must be multiversioned.
Afterwards, a call to such a function is appropriately dispatched by
the compiler. This work is in progress. However, this patch is *not*
about this.

This patch implements compiler-directed multiversioning, which allows
the compiler to automatically version functions in order to
exploit uArch features and maximize performance for a set of selected
target platforms in the same binary. I have added a new flag, -mvarch,
that lets the user specify the arches on which the generated
binary will run.
More than one arch name is allowed, for instance -mvarch=core2,corei7
(the arch names are the same as those accepted by -march). The compiler
will then automatically create function versions that are specialized for
these arches by tagging "-mtune=<arch>" on the versions. It will only
create versions of those functions where it sees opportunities for
performance improvement.

As a use case, I have added versioning for core2 where the function
will be optimized for vectorization. I recently submitted a patch that
disables vectorization of loops on core2 when unaligned vector
loads/stores would be generated, as these are very slow there:
http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00955.html
With -mvarch=core2, the compiler will identify functions with
vectorizable loads/stores in loops and create a version for core2. The
core2 version will not have unaligned vector loads/stores whereas the
default will.
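
For illustration only, here is a hypothetical function of the kind this
criterion is meant to catch. The name scale_by_two is made up and the
function is not part of the patch or its testcase. The alignment of the two
pointers is unknown at compile time, so a vectorized version of the loop may
need unaligned vector loads and stores:

```c
#include <stddef.h>

/* Hypothetical example, not part of the patch.  The alignment of DST and
   SRC is unknown at compile time, so vectorizing this loop may require
   unaligned vector loads and stores.  */
void
scale_by_two (float *dst, const float *src, size_t n)
{
  size_t i;

  for (i = 0; i < n; i++)
    dst[i] = src[i] * 2.0f;
}
```

Under the scheme described above, one would expect the core2 version of such
a function to keep the scalar loop while the default version is free to
vectorize it.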

It is also easy to add versioning criteria for other arches or add more
versioning criteria for core2. I already experimented with one other
versioning criterion for corei7 and I plan to add that in a follow-up
patch. Basically, any mtune specific optimizations can be plugged in
as a criterion to version. For this patch, only the vectorization based
versioning is shown.

The version dispatch happens via the IFUNC mechanism to keep the
run-time overhead of version dispatch minimal. When the compiler has
to version a function foo, it makes two copies, foo.autoclone.original
and foo.autoclone.0. It then replaces the body of foo with an
ifunc call that resolves to one of these two versions based on the
outcome of a run-time check for the processor type.
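
The shape of the generated dispatch can be approximated in plain C. This is
only a rough, portable sketch: the identifiers below are hypothetical
stand-ins (C identifiers cannot contain '.'), the condition function is a
stub rather than a real processor check, and a cached function pointer
stands in for the one-time resolution that IFUNC performs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two copies the pass creates
   (foo.autoclone.original and foo.autoclone.0).  Both compute the same
   result; only the tuning would differ.  */
static int foo_autoclone_original (int x) { return x + 1; }
static int foo_autoclone_0 (int x) { return x + 1; }

typedef int (*foo_fn) (int);

/* Stub condition function (an assumption of this sketch).  It returns 0
   when the first clone should run and any other value for the default.
   A real one would check the processor type.  */
static int cpu_version = -1;
static int foo_condition (void) { return cpu_version; }

/* Selector: call the condition function once, compare against each
   version index, and fall back to the default version.  */
static foo_fn
foo_ifunc_selector (void)
{
  int cond = foo_condition ();

  if (cond == 0)
    return foo_autoclone_0;
  return foo_autoclone_original;
}

/* With real IFUNC the symbol is resolved once by the dynamic loader.
   Here a cached pointer approximates that behaviour.  */
static foo_fn foo_resolved = NULL;

int
foo (int x)
{
  if (foo_resolved == NULL)
    foo_resolved = foo_ifunc_selector ();
  assert (foo_resolved != NULL);
  return foo_resolved (x);
}
```

With real IFUNC, foo would instead carry the ifunc attribute naming the
selector, so callers pay no per-call check once the symbol is resolved.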

The function cloning for preventing vectorization on core2 is done
aggressively, as it conservatively checks if a particular function can
generate unaligned accesses. This is necessary as the cloning pass
happens early whereas the actual vectorization happens much later in
the loop optimization passes.  Hence, the vectorization pass sees
a different IR and the same checks to detect unaligned accesses cannot
be reused here.  So, it could turn out that the final code generated
in all the function versions is identical. I am working on solutions
to this problem; in the meantime, the ICF feature in the gold linker
can detect identical function bodies and merge them. Note that this
need not always be the case for other optimizations, if it is possible
to detect with high accuracy whether a function will benefit from
versioning.

Regarding the placement of the versioning pass in the pass order, it
comes after inlining, since otherwise the ifunc calls would prevent
inlining of the functions.  Also, it was desired that all versioning
decisions happen at one point, before any target-specific
optimizations kick in. Hence, it was chosen to come just after the
ipa-inline pass.

This optimization was tested on one of our image processing related
benchmarks and the text size bloat from auto-versioning was about 20%.
The performance improvement from tree vectorization was ~22% on corei7,
10% on AMD Istanbul and ~2% on core2. Without this versioning, tree
vectorization was deteriorating the performance by ~6%.


        * mversn-dispatch.c (make_name): Use '.' to concatenate to suffix
        mangled names.
        (clone_function): Do not make clones ctors/dtors. Recompute dominance
        info.
        (make_bb_flow): New function.
        (get_selector_gimple_seq): New function.
        (make_selector_function): New function.
        (make_attribute): New function.
        (make_ifunc_function): New function.
        (copy_decl_attributes): New function.
        (dispatch_using_ifunc): New function.
        (purge_function_body): New function.
        (function_can_make_abnormal_goto): New function.
        (mark_function_not_cloneable): New function.
        (do_auto_clone): New function.
        (pass_auto_clone): New gimple pass.
        * passes.c (init_optimization_passes): Add pass_auto_clone to list.
        * tree-pass.h (pass_auto_clone): New pass.
        * params.def (PARAM_MAX_FUNCTION_SIZE_FOR_AUTO_CLONING): New param.
        * target.def (mversion_function): New target hook.
        * config/i386/i386.c (ix86_option_override_internal): Check correctness
        of ix86_mv_arch_string.
        (add_condition_to_bb): New function.
        (make_empty_function): New function.
        (make_condition_function): New function.
        (is_loop_form_vectorizable): New function.
        (is_loop_stmts_vectorizable): New function.
        (any_loops_vectorizable_with_load_store): New function.
        (mversion_for_core2): New function.
        (ix86_mversion_function): New function.
        * config/i386/i386.opt (mvarch): New option.
        * doc/tm.texi (TARGET_MVERSION_FUNCTION): Document.
        * doc/tm.texi.in (TARGET_MVERSION_FUNCTION): Document.
        * testsuite/gcc.dg/automversn_1.c: New testcase.

Index: doc/tm.texi
===================================================================
--- doc/tm.texi (revision 182355)
+++ doc/tm.texi (working copy)
@@ -10927,6 +10927,11 @@ The result is another tree containing a simplified
 call's result.  If @var{ignore} is true the value will be ignored.
 @end deftypefn
 
+@deftypefn {Target Hook} int TARGET_MVERSION_FUNCTION (tree @var{fndecl}, tree *@var{optimization_node_chain}, tree *@var{cond_func_decl})
+Check if a function needs to be multi-versioned to support variants of
+this architecture.  @var{fndecl} is the declaration of the function.
+@end deftypefn
+
 @deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_VECTOR_MEMOP (void)
 Return true if unaligned vector memory load/store is a slow operation
 on this target.
Index: doc/tm.texi.in
===================================================================
--- doc/tm.texi.in      (revision 182355)
+++ doc/tm.texi.in      (working copy)
@@ -10873,6 +10873,11 @@ The result is another tree containing a simplified
 call's result.  If @var{ignore} is true the value will be ignored.
 @end deftypefn
 
+@hook TARGET_MVERSION_FUNCTION
+Check if a function needs to be multi-versioned to support variants of
+this architecture.  @var{fndecl} is the declaration of the function.
+@end deftypefn
+
 @hook TARGET_SLOW_UNALIGNED_VECTOR_MEMOP
 Return true if unaligned vector memory load/store is a slow operation
 on this target.
Index: target.def
===================================================================
--- target.def  (revision 182355)
+++ target.def  (working copy)
@@ -1277,6 +1277,12 @@ DEFHOOK
  "",
  bool, (void), NULL)
 
+/* Target hook to check if this function should be versioned.  */
+DEFHOOK
+(mversion_function,
+ "",
+ int, (tree fndecl, tree *optimization_node_chain, tree *cond_func_decl), NULL)
+
 /* Returns a code for a target-specific builtin that implements
    reciprocal of the function, or NULL_TREE if not available.  */
 DEFHOOK
Index: tree-pass.h
===================================================================
--- tree-pass.h (revision 182355)
+++ tree-pass.h (working copy)
@@ -449,6 +449,7 @@ extern struct gimple_opt_pass pass_split_functions
 extern struct gimple_opt_pass pass_feedback_split_functions;
 extern struct gimple_opt_pass pass_threadsafe_analyze;
 extern struct gimple_opt_pass pass_tree_convert_builtin_dispatch;
+extern struct gimple_opt_pass pass_auto_clone;
 
 /* IPA Passes */
 extern struct simple_ipa_opt_pass pass_ipa_lower_emutls;
Index: testsuite/gcc.dg/automversn_1.c
===================================================================
--- testsuite/gcc.dg/automversn_1.c     (revision 0)
+++ testsuite/gcc.dg/automversn_1.c     (revision 0)
@@ -0,0 +1,27 @@
+/* Check that the auto_clone pass works correctly.  Function foo must be cloned
+   because it is hot and has a vectorizable store.  */
+
+/* { dg-options "-O2 -ftree-vectorize -mvarch=core2 -fdump-tree-auto_clone" } */
+/* { dg-do run } */
+
+char a[16];
+
+int __attribute__ ((hot)) __attribute__ ((noinline))
+foo (void)
+{
+  int i;
+  for (i = 0; i< 16; i++)
+    a[i] = 0;
+  return 0;
+}
+
+int
+main ()
+{
+  return foo ();
+}
+
+
+/* { dg-final { scan-tree-dump "foo\.autoclone\.original" "auto_clone" } } */
+/* { dg-final { scan-tree-dump "foo\.autoclone\.0" "auto_clone" } } */
+/* { dg-final { cleanup-tree-dump "auto_clone" } } */
Index: mversn-dispatch.c
===================================================================
--- mversn-dispatch.c   (revision 182355)
+++ mversn-dispatch.c   (working copy)
@@ -135,6 +135,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "output.h"
 #include "vecprim.h"
 #include "gimple-pretty-print.h"
+#include "target.h"
+#include "cfgloop.h"
 
 typedef struct cgraph_node* NODEPTR;
 DEF_VEC_P (NODEPTR);
@@ -212,8 +214,7 @@ function_args_count (tree fntype)
   return num;
 }
 
-/* Return the variable name (global/constructor) to use for the
-   version_selector function with name of DECL by appending SUFFIX. */
+/* Return a new name by appending SUFFIX to the DECL name. */
 
 static char *
 make_name (tree decl, const char *suffix)
@@ -226,7 +227,8 @@ make_name (tree decl, const char *suffix)
 
   name_len = strlen (name) + strlen (suffix) + 2;
   global_var_name = (char *) xmalloc (name_len);
-  snprintf (global_var_name, name_len, "%s_%s", name, suffix);
+  /* Use '.' to concatenate names as it is demangler friendly.  */
+  snprintf (global_var_name, name_len, "%s.%s", name, suffix);
   return global_var_name;
 }
 
@@ -246,9 +248,9 @@ static char*
 make_feature_test_global_name (tree decl, bool is_constructor)
 {
   if (is_constructor)
-    return make_name (decl, "version_selector_constructor");
+    return make_name (decl, "version.selector.constructor");
 
-  return make_name (decl, "version_selector_global");
+  return make_name (decl, "version.selector.global");
 }
 
 /* This function creates a new VAR_DECL with attributes set
@@ -865,6 +867,9 @@ empty_function_body (tree fndecl)
   e = make_edge (new_bb, EXIT_BLOCK_PTR, 0);
   gcc_assert (e != NULL);
 
+  if (dump_file)
+    dump_function_to_file (current_function_decl, dump_file, TDF_BLOCKS);
+
   current_function_decl = old_current_function_decl;
   pop_cfun ();
   return new_bb;
@@ -921,6 +926,10 @@ clone_function (tree orig_fndecl, const char *name
   push_cfun (DECL_STRUCT_FUNCTION (new_decl));
   current_function_decl = new_decl;
 
+  /* The clones should not be ctors or dtors.  */
+  DECL_STATIC_CONSTRUCTOR (new_decl) = 0;
+  DECL_STATIC_DESTRUCTOR (new_decl) = 0;
+
   TREE_READONLY (new_decl) = TREE_READONLY (orig_fndecl);
   TREE_STATIC (new_decl) = TREE_STATIC (orig_fndecl);
   TREE_USED (new_decl) = TREE_USED (orig_fndecl);
@@ -954,6 +963,12 @@ clone_function (tree orig_fndecl, const char *name
   cgraph_call_function_insertion_hooks (new_version);
   cgraph_mark_needed_node (new_version);
 
+  
+  free_dominance_info (CDI_DOMINATORS);
+  free_dominance_info (CDI_POST_DOMINATORS);
+  calculate_dominance_info (CDI_DOMINATORS); 
+  calculate_dominance_info (CDI_POST_DOMINATORS);
+
   pop_cfun ();
   current_function_decl = old_current_function_decl;
 
@@ -1034,9 +1049,9 @@ make_specialized_call_to_clone (gimple generic_stm
   gcc_assert (generic_fndecl != NULL);
 
   if (side == 0)
-    new_name = make_name (generic_fndecl, "clone_0");
+    new_name = make_name (generic_fndecl, "clone.0");
   else
-    new_name = make_name (generic_fndecl, "clone_1");
+    new_name = make_name (generic_fndecl, "clone.1");
 
   slot = htab_find_slot_with_hash (name_decl_htab, new_name,
                                    htab_hash_string (new_name), NO_INSERT);
@@ -1764,3 +1779,700 @@ struct gimple_opt_pass pass_tree_convert_builtin_d
   TODO_update_ssa | TODO_verify_ssa
  }
 };
+
+/* This function generates gimple code in NEW_BB to check if COND_VAR
+   is equal to WHICH_VERSION and return FN_VER pointer if it is equal.
+   The basic block returned is the block where the control flows if
+   the equality is false.  */
+
+static basic_block
+make_bb_flow (basic_block new_bb, tree cond_var, tree fn_ver,
+             int which_version, tree bindings)
+{
+  tree result_var;
+  tree convert_expr;
+
+  basic_block bb1, bb2, bb3;
+  edge e12, e23;
+
+  gimple if_else_stmt;
+  gimple if_stmt;
+  gimple return_stmt;
+  gimple_seq gseq = bb_seq (new_bb);
+
+  /* Check if the value of cond_var is equal to which_version.  */
+  if_else_stmt = gimple_build_cond (EQ_EXPR, cond_var,
+                                   build_int_cst (NULL, which_version),
+                                   NULL_TREE, NULL_TREE);
+
+  mark_symbols_for_renaming (if_else_stmt);
+  gimple_seq_add_stmt (&gseq, if_else_stmt);
+  gimple_set_block (if_else_stmt, bindings);
+  gimple_set_bb (if_else_stmt, new_bb);
+
+  result_var = create_tmp_var (ptr_type_node, NULL);
+  add_referenced_var (result_var);
+
+  convert_expr = build1 (CONVERT_EXPR, ptr_type_node, fn_ver);
+  if_stmt = gimple_build_assign (result_var, convert_expr);
+  mark_symbols_for_renaming (if_stmt);
+  gimple_seq_add_stmt (&gseq, if_stmt);
+  gimple_set_block (if_stmt, bindings);
+
+  return_stmt = gimple_build_return (result_var);
+  mark_symbols_for_renaming (return_stmt);
+  gimple_seq_add_stmt (&gseq, return_stmt);
+
+  set_bb_seq (new_bb, gseq);
+
+  bb1 = new_bb;
+  e12 = split_block (bb1, if_else_stmt);
+  bb2 = e12->dest;
+  e12->flags &= ~EDGE_FALLTHRU;
+  e12->flags |= EDGE_TRUE_VALUE;
+
+  e23 = split_block (bb2, return_stmt);
+  gimple_set_bb (if_stmt, bb2);
+  gimple_set_bb (return_stmt, bb2);
+  bb3 = e23->dest;
+  make_edge (bb1, bb3, EDGE_FALSE_VALUE); 
+
+  remove_edge (e23);
+  make_edge (bb2, EXIT_BLOCK_PTR, 0);
+
+  return bb3;
+}
+
+/* Given the pointer to the condition function COND_FUNC_ARG, whose return
+   value decides the version that gets executed, and the pointers to the
+   function versions, FN_VER_LIST, this function generates control-flow to
+   return the appropriate function version pointer based on the return value
+   of the conditional function.   The condition function is assumed to return
+   values 0, 1, 2, ... */
+
+static gimple_seq
+get_selector_gimple_seq (tree cond_func_arg, tree fn_ver_list, tree default_ver,
+                        basic_block new_bb, tree bindings)
+{
+  basic_block final_bb;
+
+  gimple return_stmt, default_stmt;
+  gimple_seq gseq = NULL;
+  gimple_seq gseq_final = NULL;
+  gimple call_cond_stmt;
+
+  tree result_var;
+  tree convert_expr;
+  tree p;
+  tree cond_var;
+
+  int which_version;
+
+  /* Call the condition function once and store the outcome in cond_var.  */
+  cond_var = create_tmp_var (integer_type_node, NULL);
+  call_cond_stmt = gimple_build_call (cond_func_arg, 0);
+  gimple_call_set_lhs (call_cond_stmt, cond_var);
+  add_referenced_var (cond_var);
+  mark_symbols_for_renaming (call_cond_stmt);
+
+  gimple_seq_add_stmt (&gseq, call_cond_stmt);
+  gimple_set_block (call_cond_stmt, bindings);
+  gimple_set_bb (call_cond_stmt, new_bb);
+
+  set_bb_seq (new_bb, gseq);
+
+  final_bb = new_bb;
+
+  which_version = 0; 
+  for (p = fn_ver_list; p != NULL_TREE; p = TREE_CHAIN (p))
+    {
+      tree ver = TREE_PURPOSE (p);
+      /* Return this version's pointer, VER, if the value returned by the
+        condition function is equal to WHICH_VERSION.  */
+      final_bb = make_bb_flow (final_bb, cond_var, ver, which_version,
+                              bindings);
+      which_version++;
+    }
+
+  result_var = create_tmp_var (ptr_type_node, NULL);
+  add_referenced_var (result_var);
+
+  /* Return the default version function pointer as the default.  */
+  convert_expr = build1 (CONVERT_EXPR, ptr_type_node, default_ver);
+  default_stmt = gimple_build_assign (result_var, convert_expr);
+  mark_symbols_for_renaming (default_stmt);
+  gimple_seq_add_stmt (&gseq_final, default_stmt);
+  gimple_set_block (default_stmt, bindings);
+  gimple_set_bb (default_stmt, final_bb);
+
+  return_stmt = gimple_build_return (result_var);
+  mark_symbols_for_renaming (return_stmt);
+  gimple_seq_add_stmt (&gseq_final, return_stmt);
+  gimple_set_bb (return_stmt, final_bb);
+
+  set_bb_seq (final_bb, gseq_final);
+
+  return gseq; 
+}
+
+/* Make the ifunc selector function which calls function pointed to by
+   COND_FUNC_ARG and checks the value to return the appropriate function
+   version pointer.  */
+
+static tree
+make_selector_function (const char *name, tree cond_func_arg,
+                       tree fn_ver_list, tree default_ver)
+{
+  tree decl, type, t;
+  basic_block new_bb;
+  tree old_current_function_decl;
+  tree decl_name;
+
+  /* The selector function should return a (void *). */
+  type = build_function_type_list (ptr_type_node, NULL_TREE);
+ 
+  decl = build_fn_decl (name, type);
+
+  decl_name = get_identifier (name);
+  SET_DECL_ASSEMBLER_NAME (decl, decl_name);
+  DECL_NAME (decl) = decl_name;
+  gcc_assert (cgraph_node (decl) != NULL);
+
+  TREE_USED (decl) = 1;
+  DECL_ARTIFICIAL (decl) = 1;
+  DECL_IGNORED_P (decl) = 0;
+  TREE_PUBLIC (decl) = 0;
+  DECL_UNINLINABLE (decl) = 1;
+  DECL_EXTERNAL (decl) = 0;
+  DECL_CONTEXT (decl) = NULL_TREE;
+  DECL_INITIAL (decl) = make_node (BLOCK);
+  DECL_STATIC_CONSTRUCTOR (decl) = 0;
+  TREE_READONLY (decl) = 0;
+  DECL_PURE_P (decl) = 0;
+ 
+  /* Build result decl and add to function_decl. */
+  t = build_decl (UNKNOWN_LOCATION, RESULT_DECL, NULL_TREE, ptr_type_node);
+  DECL_ARTIFICIAL (t) = 1;
+  DECL_IGNORED_P (t) = 1;
+  DECL_RESULT (decl) = t;
+
+  gimplify_function_tree (decl);
+
+  old_current_function_decl = current_function_decl;
+  push_cfun (DECL_STRUCT_FUNCTION (decl));
+  current_function_decl = decl;
+  init_empty_tree_cfg_for_function (DECL_STRUCT_FUNCTION (decl));
+
+  cfun->curr_properties |=
+    (PROP_gimple_lcf | PROP_gimple_leh | PROP_cfg | PROP_referenced_vars |
+     PROP_ssa);
+
+  new_bb = create_empty_bb (ENTRY_BLOCK_PTR);
+  make_edge (ENTRY_BLOCK_PTR, new_bb, EDGE_FALLTHRU);
+  make_edge (new_bb, EXIT_BLOCK_PTR, 0);
+
+  /* This call is very important if this pass runs when the IR is in
+     SSA form.  It breaks things in strange ways otherwise. */
+  init_tree_ssa (DECL_STRUCT_FUNCTION (decl));
+  init_ssa_operands ();
+
+  /* Make the body of the selector function.  */
+  get_selector_gimple_seq (cond_func_arg, fn_ver_list, default_ver, new_bb,
+                          DECL_INITIAL (decl));
+
+  cgraph_add_new_function (decl, true);
+  cgraph_call_function_insertion_hooks (cgraph_node (decl));
+  cgraph_mark_needed_node (cgraph_node (decl));
+
+  if (dump_file)
+    dump_function_to_file (decl, dump_file, TDF_BLOCKS);
+
+  pop_cfun ();
+  current_function_decl = old_current_function_decl;
+  return decl;
+}
+
+/* Makes a function attribute of the form NAME(ARG_NAME) and chains
+   it to CHAIN.  */
+
+static tree
+make_attribute (const char *name, const char *arg_name, tree chain)
+{
+  tree attr_name;
+  tree attr_arg_name;
+  tree attr_args;
+  tree attr;
+
+  attr_name = get_identifier (name);
+  attr_arg_name = build_string (strlen (arg_name), arg_name);
+  attr_args = tree_cons (NULL_TREE, attr_arg_name, NULL_TREE);
+  attr = tree_cons (attr_name, attr_args, chain);
+  return attr;
+}
+
+/* This creates the ifunc function IFUNC_NAME whose selector function is
+   SELECTOR_NAME. */
+
+static tree
+make_ifunc_function (const char* ifunc_name, const char *selector_name,
+                    tree fn_type)
+{
+  tree type;
+  tree decl;
+
+  /* The signature of the ifunc function is set to the
+     type of any version.  */
+  type = build_function_type (TREE_TYPE (fn_type), TYPE_ARG_TYPES (fn_type));
+  decl = build_fn_decl (ifunc_name, type);
+
+  DECL_CONTEXT (decl) = NULL_TREE;
+  DECL_INITIAL (decl) = error_mark_node;
+
+  /* Set ifunc attribute */
+  DECL_ATTRIBUTES (decl)
+    = make_attribute ("ifunc", selector_name, DECL_ATTRIBUTES (decl));
+
+  assemble_alias (decl, get_identifier (selector_name)); 
+
+  return decl;
+}
+
+/* Copy the decl attributes from from_decl to to_decl, except
+   DECL_ARTIFICIAL and TREE_PUBLIC.  */
+
+static void
+copy_decl_attributes (tree to_decl, tree from_decl)
+{
+  TREE_READONLY (to_decl) = TREE_READONLY (from_decl);
+  TREE_USED (to_decl) = TREE_USED (from_decl);
+  DECL_ARTIFICIAL (to_decl) = 1;
+  DECL_IGNORED_P (to_decl) = DECL_IGNORED_P (from_decl);
+  TREE_PUBLIC (to_decl) = 0;
+  DECL_CONTEXT (to_decl) = DECL_CONTEXT (from_decl);
+  DECL_EXTERNAL (to_decl) = DECL_EXTERNAL (from_decl);
+  DECL_COMDAT (to_decl) = DECL_COMDAT (from_decl);
+  DECL_COMDAT_GROUP (to_decl) = DECL_COMDAT_GROUP (from_decl);
+  DECL_VIRTUAL_P (to_decl) = DECL_VIRTUAL_P (from_decl);
+  DECL_WEAK (to_decl) = DECL_WEAK (from_decl);
+}
+
+/* This function does the multi-version run-time dispatch using IFUNC.  Given
+   NUM_VERSIONS versions of a function with the decls in FN_VER_LIST along
+   with a default version in DEFAULT_VER.  Also given is a condition function,
+   COND_FUNC_ADDR, whose return value decides the version that gets executed.
+   This function generates the necessary code to dispatch the right function
+   version and returns this as a GIMPLE_SEQ. The decls of the ifunc function and
+   the selector function that are created are stored in IFUNC_DECL and
+   SELECTOR_DECL.  */
+
+static gimple_seq
+dispatch_using_ifunc (int num_versions, tree orig_func_decl,
+                     tree cond_func_addr, tree fn_ver_list,
+                     tree default_ver, tree *selector_decl,
+                     tree *ifunc_decl)
+{
+  char *selector_name;
+  char *ifunc_name;
+  tree ifunc_function;
+  tree selector_function;
+  tree return_type;
+  VEC (tree, heap) *nargs = NULL;
+  tree arg;
+  gimple ifunc_call_stmt;
+  gimple return_stmt;
+  gimple_seq gseq = NULL;
+
+  gcc_assert (cond_func_addr != NULL
+             && num_versions > 0
+             && orig_func_decl != NULL
+             && fn_ver_list != NULL);
+
+  /* The return type of any function version.  */
+  return_type = TREE_TYPE (TREE_TYPE (orig_func_decl));
+
+  nargs = VEC_alloc (tree, heap, 4);
+
+  for (arg = DECL_ARGUMENTS (orig_func_decl);
+       arg; arg = TREE_CHAIN (arg))
+    {
+      VEC_safe_push (tree, heap, nargs, arg);
+      add_referenced_var (arg);
+    }
+
+  /* Assign names to ifunc and ifunc_selector functions. */
+  selector_name = make_name (orig_func_decl, "ifunc.selector");
+  ifunc_name = make_name (orig_func_decl, "ifunc");
+
+  /* Make a selector function which returns the appropriate function
+     version pointer based on the outcome of the condition function
+     execution.  */
+  selector_function = make_selector_function (selector_name, cond_func_addr,
+                                             fn_ver_list, default_ver);
+  *selector_decl = selector_function;
+
+  /* Make a new ifunc function.  */
+  ifunc_function = make_ifunc_function  (ifunc_name, selector_name,
+                                        TREE_TYPE (orig_func_decl));
+  *ifunc_decl = ifunc_function;
+
+  /* Make selector and ifunc shadow the attributes of the original function.  */
+  copy_decl_attributes (ifunc_function, orig_func_decl);
+  copy_decl_attributes (selector_function, orig_func_decl);
+ 
+  ifunc_call_stmt = gimple_build_call_vec (ifunc_function, nargs);
+  gimple_seq_add_stmt (&gseq, ifunc_call_stmt); 
+
+  /* Make function return the value if it is a non-void type.  */
+  if (TREE_CODE (return_type) != VOID_TYPE)
+    {
+      tree lhs_var;
+      tree lhs_var_ssa_name;
+      tree result_decl;
+
+      result_decl = DECL_RESULT (orig_func_decl);
+
+      if (result_decl
+         && aggregate_value_p (result_decl, orig_func_decl)
+         && !TREE_ADDRESSABLE (result_decl))
+       {
+         /* Build a RESULT_DECL rather than a VAR_DECL for this case.
+            See tree-nrv.c: tree_nrv. It checks if the DECL_RESULT and the
+            return value are the same.  */
+         lhs_var = build_decl (UNKNOWN_LOCATION, RESULT_DECL, NULL,
+                               return_type);
+         DECL_ARTIFICIAL (lhs_var) = 1;
+         DECL_IGNORED_P (lhs_var) = 1;
+         TREE_READONLY (lhs_var) = 0;
+         DECL_EXTERNAL (lhs_var) = 0;
+         TREE_STATIC (lhs_var) = 0;
+         TREE_USED (lhs_var) = 1;
+
+          add_referenced_var (lhs_var);
+          DECL_RESULT (orig_func_decl) = lhs_var;
+       }
+     else if (!TREE_ADDRESSABLE (return_type)
+              && COMPLETE_TYPE_P (return_type))
+        {
+          lhs_var = create_tmp_var (return_type, NULL);
+          add_referenced_var (lhs_var);
+        }
+      else
+       {
+          lhs_var = create_tmp_var_raw (return_type, NULL);
+         TREE_ADDRESSABLE (lhs_var) = 1;
+         gimple_add_tmp_var (lhs_var);
+          add_referenced_var (lhs_var);
+       }
+
+      if (AGGREGATE_TYPE_P (return_type)
+         || TREE_CODE (return_type) == COMPLEX_TYPE)
+        {
+          gimple_call_set_lhs (ifunc_call_stmt, lhs_var);
+          return_stmt = gimple_build_return (lhs_var);
+       }
+      else
+       {
+         lhs_var_ssa_name = make_ssa_name (lhs_var, ifunc_call_stmt);
+         gimple_call_set_lhs (ifunc_call_stmt, lhs_var_ssa_name);
+         return_stmt = gimple_build_return (lhs_var_ssa_name);
+       }
+    }
+  else
+    {
+      return_stmt = gimple_build_return (NULL_TREE);
+    }
+
+  mark_symbols_for_renaming (ifunc_call_stmt);
+  mark_symbols_for_renaming (return_stmt);
+  gimple_seq_add_stmt (&gseq, return_stmt); 
+
+  VEC_free (tree, heap, nargs);
+  return gseq;
+}
+
+/* Empty the function body of function fndecl.  Retain just one basic block
+   along with the ENTRY and EXIT block.  Return the retained basic block.  */
+
+static basic_block
+purge_function_body (tree fndecl)
+{
+  basic_block bb, new_bb;
+  edge first_edge, last_edge;
+  tree old_current_function_decl;
+
+  old_current_function_decl = current_function_decl;
+  push_cfun (DECL_STRUCT_FUNCTION (fndecl));
+  current_function_decl = fndecl;
+
+  /* Set new_bb to be the first block after ENTRY_BLOCK_PTR. */
+
+  first_edge  = VEC_index (edge, ENTRY_BLOCK_PTR->succs, 0);
+  new_bb = first_edge->dest;
+  gcc_assert (new_bb != NULL);
+
+  for (bb = ENTRY_BLOCK_PTR; bb != NULL;)
+    {
+      edge_iterator ei;
+      edge e;
+      basic_block bb_next;
+      bb_next = bb->next_bb;
+      if (bb == EXIT_BLOCK_PTR)
+        {
+         VEC_truncate (edge, EXIT_BLOCK_PTR->preds, 0);
+        }
+      else if (bb == ENTRY_BLOCK_PTR)
+       {
+         VEC_truncate (edge, ENTRY_BLOCK_PTR->succs, 0);
+       }
+      else
+        {
+          remove_phi_nodes (bb);
+          if (bb_seq (bb) != NULL)
+            {
+              gimple_stmt_iterator i;
+              for (i = gsi_start_bb (bb); !gsi_end_p (i);)
+               {
+                 gimple stmt = gsi_stmt (i);
+                 unlink_stmt_vdef (stmt);
+                 reset_debug_uses (stmt);
+                  gsi_remove (&i, true);
+                 release_defs (stmt);
+               }
+            }
+         FOR_EACH_EDGE (e, ei, bb->succs)
+           {
+             n_edges--;
+             ggc_free (e);
+           }
+         VEC_truncate (edge, bb->succs, 0);
+         VEC_truncate (edge, bb->preds, 0);
+          bb->prev_bb = NULL;
+          bb->next_bb = NULL;
+         if (bb == new_bb)
+           {
+             bb = bb_next;
+             continue;
+           }
+          bb->il.gimple = NULL;
+          SET_BASIC_BLOCK (bb->index, NULL);
+          n_basic_blocks--;
+        }
+      bb = bb_next;
+    }
+
+
+  /* This is to allow iterating over the basic blocks. */
+  new_bb->next_bb = EXIT_BLOCK_PTR;
+  EXIT_BLOCK_PTR->prev_bb = new_bb;
+
+  new_bb->prev_bb = ENTRY_BLOCK_PTR;
+  ENTRY_BLOCK_PTR->next_bb = new_bb;
+
+  gcc_assert (find_edge (new_bb, EXIT_BLOCK_PTR) == NULL);
+  last_edge = make_edge (new_bb, EXIT_BLOCK_PTR, 0);
+  gcc_assert (last_edge);
+
+  gcc_assert (find_edge (ENTRY_BLOCK_PTR, new_bb) == NULL);
+  last_edge = make_edge (ENTRY_BLOCK_PTR, new_bb, EDGE_FALLTHRU);
+  gcc_assert (last_edge);
+
+  free_dominance_info (CDI_DOMINATORS);
+  free_dominance_info (CDI_POST_DOMINATORS);
+  calculate_dominance_info (CDI_DOMINATORS); 
+  calculate_dominance_info (CDI_POST_DOMINATORS);
+
+  current_function_decl = old_current_function_decl;
+  pop_cfun ();
+
+  return new_bb;
+}
+
+/* Returns true if function FUNC_DECL contains abnormal goto statements.  */
+
+static bool
+function_can_make_abnormal_goto (tree func_decl)
+{
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, DECL_STRUCT_FUNCTION (func_decl))
+    {
+      gimple_stmt_iterator gsi;
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+        {
+         gimple stmt = gsi_stmt (gsi);
+         if (stmt_can_make_abnormal_goto (stmt))
+           return true;
+       }
+    }
+  return false;
+}
+
+/* Has an entry for every cloned function and auxiliaries that have been
+   generated by auto cloning.  These cannot be further cloned.  */
+
+htab_t cloned_function_decls_htab = NULL;
+
+/* Adds function FUNC_DECL to the cloned_function_decls_htab.  */
+
+static void
+mark_function_not_cloneable (tree func_decl)
+{
+  void **slot;
+
+  slot = htab_find_slot_with_hash (cloned_function_decls_htab, func_decl,
+                                  htab_hash_pointer (func_decl), INSERT);
+  gcc_assert (*slot == NULL);
+  *slot = func_decl;
+}
+
+/* Entry point for the auto clone pass.  Calls the target hook to determine if
+   this function must be cloned.  */
+
+static unsigned int
+do_auto_clone (void)
+{
+  tree opt_node = NULL_TREE;
+  int num_versions = 0;
+  int i = 0;
+  tree fn_ver_addr_chain = NULL_TREE;
+  tree default_ver = NULL_TREE;
+  tree cond_func_decl = NULL_TREE;
+  tree cond_func_addr;
+  tree default_decl;
+  basic_block empty_bb;
+  gimple_seq gseq = NULL;
+  gimple_stmt_iterator gsi;
+  tree selector_decl;
+  tree ifunc_decl;
+  void **slot;
+  struct cgraph_node *node;
+
+  node = cgraph_node (current_function_decl);
+
+  if (lookup_attribute ("noclone", DECL_ATTRIBUTES (current_function_decl))
+      != NULL)
+    {
+      if (dump_file)
+       fprintf (dump_file, "Not cloning, noclone attribute set\n");
+      return 0;
+    }
+
+  /* Check if function size is within permissible limits for cloning.  */
+  if (node->global.size
+      > PARAM_VALUE (PARAM_MAX_FUNCTION_SIZE_FOR_AUTO_CLONING))
+    {
+      if (dump_file)
+        fprintf (dump_file, "Function size exceeds auto cloning threshold.\n");
+      return 0;
+    }
+
+  if (cloned_function_decls_htab == NULL)
+    cloned_function_decls_htab = htab_create (10, htab_hash_pointer,
+                                             htab_eq_pointer, NULL);
+
+
+  /* If this function is a clone or other, like the selector function, pass.  */
+  slot = htab_find_slot_with_hash (cloned_function_decls_htab,
+                                  current_function_decl,
+                                  htab_hash_pointer (current_function_decl),
+                                  NO_INSERT);
+
+  if (slot != NULL)
+    return 0;
+
+  if (profile_status == PROFILE_READ
+      && !hot_function_p (cgraph_node (current_function_decl)))
+    return 0;
+
+  /* Ignore functions with abnormal gotos; it is not correct to clone them.  */
+  if (function_can_make_abnormal_goto (current_function_decl))
+    return 0;
+
+  if (!targetm.mversion_function)
+    return 0;
+
+  /* Call the target hook to see if this function needs to be versioned.  */
+  num_versions = targetm.mversion_function (current_function_decl, &opt_node,
+                                           &cond_func_decl);
+      
+  /* Nothing more to do if versions are not to be created.  */
+  if (num_versions == 0)
+    return 0;
+
+  mark_function_not_cloneable (cond_func_decl);
+  copy_decl_attributes (cond_func_decl, current_function_decl);
+
+  /* Make as many clones as requested.  */
+  for (i = 0; i < num_versions; ++i)
+    {
+      tree cloned_decl;
+      char clone_name[100];
+
+      sprintf (clone_name, "autoclone.%d", i);
+      cloned_decl = clone_function (current_function_decl, clone_name);
+      gcc_assert (cloned_decl != NULL);
+      fn_ver_addr_chain = tree_cons (build_fold_addr_expr (cloned_decl),
+                                    NULL, fn_ver_addr_chain);
+      mark_function_not_cloneable (cloned_decl);
+      DECL_FUNCTION_SPECIFIC_TARGET (cloned_decl)
+        = TREE_PURPOSE (opt_node);
+      opt_node = TREE_CHAIN (opt_node);
+    }
+
+  /* The current function is replaced by an ifunc call to the right version.
+     Make another clone for the default.  */
+  default_decl = clone_function (current_function_decl, "autoclone.original");
+  mark_function_not_cloneable (default_decl);
+  /* Empty the body of the current function.  */
+  empty_bb = purge_function_body (current_function_decl);
+  default_ver = build_fold_addr_expr (default_decl);
+  cond_func_addr = build_fold_addr_expr (cond_func_decl);
+
+  /* Get the gimple sequence to replace the current function's body with an
+     ifunc dispatch call to the right version.  */
+  gseq = dispatch_using_ifunc (num_versions, current_function_decl,
+                              cond_func_addr, fn_ver_addr_chain,
+                              default_ver, &selector_decl, &ifunc_decl);
+
+  mark_function_not_cloneable (selector_decl);
+  mark_function_not_cloneable (ifunc_decl);
+
+  for (gsi = gsi_start (gseq); !gsi_end_p (gsi); gsi_next (&gsi))
+    gimple_set_bb (gsi_stmt (gsi), empty_bb);
+   
+  set_bb_seq (empty_bb, gseq);
+
+  if (dump_file)
+    dump_function_to_file (current_function_decl, dump_file, TDF_BLOCKS);
+
+  update_ssa (TODO_update_ssa_no_phi);
+  
+  return 0;
+}
+
+static bool
+gate_auto_clone (void)
+{
+  /* Turned on at -O2 and above.  */
+  return optimize >= 2;
+}
+
+struct gimple_opt_pass pass_auto_clone =
+{
+ {
+  GIMPLE_PASS,
+  "auto_clone",                                /* name */
+  gate_auto_clone,                     /* gate */
+  do_auto_clone,                       /* execute */
+  NULL,                                        /* sub */
+  NULL,                                        /* next */
+  0,                                   /* static_pass_number */
+  TV_MVERSN_DISPATCH,                  /* tv_id */
+  PROP_cfg,                            /* properties_required */
+  PROP_cfg,                            /* properties_provided */
+  0,                                   /* properties_destroyed */
+  0,                                   /* todo_flags_start */
+  TODO_dump_func |                     /* todo_flags_finish */
+  TODO_cleanup_cfg | TODO_dump_cgraph |
+  TODO_update_ssa | TODO_verify_ssa
+ }
+};
Index: passes.c
===================================================================
--- passes.c    (revision 182355)
+++ passes.c    (working copy)
@@ -1278,6 +1278,7 @@ init_optimization_passes (void)
   /* These passes are run after IPA passes on every function that is being
      output to the assembler file.  */
   p = &all_passes;
+  NEXT_PASS (pass_auto_clone);
   NEXT_PASS (pass_direct_call_profile);
   NEXT_PASS (pass_lower_eh_dispatch);
   NEXT_PASS (pass_all_optimizations);
Index: config/i386/i386.opt
===================================================================
--- config/i386/i386.opt        (revision 182355)
+++ config/i386/i386.opt        (working copy)
@@ -101,6 +101,10 @@ march=
 Target RejectNegative Joined Var(ix86_arch_string)
 Generate code for given CPU
 
+mvarch=
+Target RejectNegative Joined Var(ix86_mv_arch_string)
+Multiversion for the given CPU(s)
+
 masm=
 Target RejectNegative Joined Var(ix86_asm_string)
 Use given assembler dialect
Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c  (revision 182355)
+++ config/i386/i386.c  (working copy)
@@ -60,7 +60,11 @@ along with GCC; see the file COPYING3.  If not see
 #include "fibheap.h"
 #include "tree-flow.h"
 #include "tree-pass.h"
+#include "tree-dump.h"
+#include "gimple-pretty-print.h"
 #include "cfgloop.h"
+#include "tree-scalar-evolution.h"
+#include "tree-vectorizer.h"
 
 enum upper_128bits_state
 {
@@ -2353,6 +2357,8 @@ enum processor_type ix86_tune;
 /* Which instruction set architecture to use.  */
 enum processor_type ix86_arch;
 
+char ix86_varch[PROCESSOR_max];
+
 /* true if sse prefetch instruction is not NOOP.  */
 int x86_prefetch_sse;
 
@@ -2492,6 +2498,7 @@ static enum calling_abi ix86_function_abi (const_t
 /* Whether -mtune= or -march= were specified */
 static int ix86_tune_defaulted;
 static int ix86_arch_specified;
+static int ix86_varch_specified;
 
 /* A mask of ix86_isa_flags that includes bit X if X
    was set or cleared on the command line.  */
@@ -4316,6 +4323,36 @@ ix86_option_override_internal (bool main_args_p)
       /* Disable vzeroupper pass if TARGET_AVX is disabled.  */
       target_flags &= ~MASK_VZEROUPPER;
     }
+
+  /* Handle ix86_mv_arch_string.  The values allowed are the same as
+     -march=<>.  More than one value is allowed and values must be
+     comma separated.  */
+  if (ix86_mv_arch_string)
+    {
+      char *token;
+      char *varch;
+      int i;
+
+      ix86_varch_specified = 1;
+      memset (ix86_varch, 0, sizeof (ix86_varch));
+      token = XNEWVEC (char, strlen (ix86_mv_arch_string) + 1);
+      strcpy (token, ix86_mv_arch_string);
+      varch = strtok ((char *)token, ",");
+      while (varch != NULL)
+        {
+          for (i = 0; i < pta_size; i++)
+            if (!strcmp (varch, processor_alias_table[i].name))
+             {
+               ix86_varch[processor_alias_table[i].processor] = 1;
+               break;
+             }
+          if (i == pta_size)
+            error ("bad value (%s) for %svarch=%s %s",
+                  varch, prefix, suffix, sw);
+         varch = strtok (NULL, ",");
+       }
+      free (token);
+    }
 }
 
 /* Return TRUE if VAL is passed in register with 256bit AVX modes.  */
@@ -26120,6 +26157,489 @@ ix86_fold_builtin (tree fndecl, int n_args ATTRIBU
   return NULL_TREE;
 }
 
+/* This adds a condition to the basic_block NEW_BB in function FUNCTION_DECL
+   to return integer VERSION_NUM if the outcome of the function PREDICATE_DECL
+   is true (or false if INVERT_CHECK is true).  This function will be called
+   during version dispatch to decide which function version to execute.  */
+
+static basic_block
+add_condition_to_bb (tree function_decl, int version_num,
+                    basic_block new_bb, tree predicate_decl,
+                    bool invert_check)
+{
+  gimple return_stmt;
+  gimple call_cond_stmt;
+  gimple if_else_stmt;
+
+  basic_block bb1, bb2, bb3;
+  edge e12, e23;
+
+  tree cond_var;
+  gimple_seq gseq;
+
+  tree old_current_function_decl;
+
+  old_current_function_decl = current_function_decl;
+  push_cfun (DECL_STRUCT_FUNCTION (function_decl));
+  current_function_decl = function_decl;
+
+  gcc_assert (new_bb != NULL);
+  gseq = bb_seq (new_bb);
+
+  if (predicate_decl == NULL_TREE)
+    {
+      return_stmt = gimple_build_return (build_int_cst (NULL, version_num));
+      gimple_seq_add_stmt (&gseq, return_stmt);
+      set_bb_seq (new_bb, gseq);
+      gimple_set_bb (return_stmt, new_bb);
+      pop_cfun ();
+      current_function_decl = old_current_function_decl;
+      return new_bb;
+    }
+
+  cond_var = create_tmp_var (integer_type_node, NULL);
+  call_cond_stmt = gimple_build_call (predicate_decl, 0);
+  gimple_call_set_lhs (call_cond_stmt, cond_var);
+  add_referenced_var (cond_var);
+  mark_symbols_for_renaming (call_cond_stmt); 
+
+  gimple_set_block (call_cond_stmt, DECL_INITIAL (function_decl));
+  gimple_set_bb (call_cond_stmt, new_bb);
+  gimple_seq_add_stmt (&gseq, call_cond_stmt);
+
+  if (!invert_check)
+    if_else_stmt = gimple_build_cond (GT_EXPR, cond_var,
+                                     integer_zero_node,
+                                     NULL_TREE, NULL_TREE);
+  else
+    if_else_stmt = gimple_build_cond (LE_EXPR, cond_var,
+                                     integer_zero_node,
+                                     NULL_TREE, NULL_TREE);
+
+  mark_symbols_for_renaming (if_else_stmt);
+  gimple_set_block (if_else_stmt, DECL_INITIAL (function_decl));
+  gimple_set_bb (if_else_stmt, new_bb);
+  gimple_seq_add_stmt (&gseq, if_else_stmt);
+
+  return_stmt = gimple_build_return (build_int_cst (NULL, version_num));
+  gimple_seq_add_stmt (&gseq, return_stmt);
+
+ 
+  set_bb_seq (new_bb, gseq);
+
+  bb1 = new_bb;
+  e12 = split_block (bb1, if_else_stmt);
+  bb2 = e12->dest;
+  e12->flags &= ~EDGE_FALLTHRU;
+  e12->flags |= EDGE_TRUE_VALUE;
+
+  e23 = split_block (bb2, return_stmt);
+  gimple_set_bb (return_stmt, bb2);
+  bb3 = e23->dest;
+  make_edge (bb1, bb3, EDGE_FALSE_VALUE); 
+
+  remove_edge (e23);
+  make_edge (bb2, EXIT_BLOCK_PTR, 0);
+
+  free_dominance_info (CDI_DOMINATORS);
+  free_dominance_info (CDI_POST_DOMINATORS);
+  calculate_dominance_info (CDI_DOMINATORS);
+  calculate_dominance_info (CDI_POST_DOMINATORS);
+  rebuild_cgraph_edges ();
+  update_ssa (TODO_update_ssa);
+  if (dump_file)
+    dump_function_to_file (current_function_decl, dump_file, TDF_BLOCKS);
+
+  pop_cfun ();
+  current_function_decl = old_current_function_decl;
+
+  return bb3;
+}
+
+/* This makes an empty function with one empty basic block *CREATED_BB
+   apart from the ENTRY and EXIT blocks.  */
+
+static tree
+make_empty_function (basic_block *created_bb)
+{
+  tree decl, type, t;
+  basic_block new_bb;
+  tree old_current_function_decl;
+  tree decl_name;
+  char name[1000];
+  static int num = 0;
+
+  /* The condition function should return an integer. */
+  type = build_function_type_list (integer_type_node, NULL_TREE);
+ 
+  sprintf (name, "cond_%d", num);
+  num++;
+  decl = build_fn_decl (name, type);
+
+  decl_name = get_identifier (name);
+  SET_DECL_ASSEMBLER_NAME (decl, decl_name);
+  DECL_NAME (decl) = decl_name;
+  gcc_assert (cgraph_node (decl) != NULL);
+
+  TREE_USED (decl) = 1;
+  DECL_ARTIFICIAL (decl) = 1;
+  DECL_IGNORED_P (decl) = 0;
+  TREE_PUBLIC (decl) = 0;
+  DECL_UNINLINABLE (decl) = 1;
+  DECL_EXTERNAL (decl) = 0;
+  DECL_CONTEXT (decl) = NULL_TREE;
+  DECL_INITIAL (decl) = make_node (BLOCK);
+  DECL_STATIC_CONSTRUCTOR (decl) = 0;
+  TREE_READONLY (decl) = 0;
+  DECL_PURE_P (decl) = 0;
+ 
+  /* Build result decl and add to function_decl. */
+  t = build_decl (UNKNOWN_LOCATION, RESULT_DECL, NULL_TREE, integer_type_node);
+  DECL_ARTIFICIAL (t) = 1;
+  DECL_IGNORED_P (t) = 1;
+  DECL_RESULT (decl) = t;
+
+  gimplify_function_tree (decl);
+
+  old_current_function_decl = current_function_decl;
+  push_cfun (DECL_STRUCT_FUNCTION (decl));
+  current_function_decl = decl;
+  init_empty_tree_cfg_for_function (DECL_STRUCT_FUNCTION (decl));
+
+  cfun->curr_properties |=
+    (PROP_gimple_lcf | PROP_gimple_leh | PROP_cfg | PROP_referenced_vars |
+     PROP_ssa);
+
+  new_bb = create_empty_bb (ENTRY_BLOCK_PTR);
+  make_edge (ENTRY_BLOCK_PTR, new_bb, EDGE_FALLTHRU);
+  make_edge (new_bb, EXIT_BLOCK_PTR, 0);
+
+  /* This call is very important if this pass runs when the IR is in
+     SSA form.  It breaks things in strange ways otherwise. */
+  init_tree_ssa (DECL_STRUCT_FUNCTION (decl));
+  init_ssa_operands ();
+
+  cgraph_add_new_function (decl, true);
+  cgraph_call_function_insertion_hooks (cgraph_node (decl));
+  cgraph_mark_needed_node (cgraph_node (decl));
+
+  if (dump_file)
+    dump_function_to_file (decl, dump_file, TDF_BLOCKS);
+
+  pop_cfun ();
+  current_function_decl = old_current_function_decl;
+  *created_bb = new_bb;
+  return decl; 
+}
+
+/* This function conservatively checks if loop LOOP is tree vectorizable.
+   The code is adapted from tree-vectorizer.c and tree-vect-stmts.c.  */
+
+static bool
+is_loop_form_vectorizable (struct loop *loop)
+{
+  /* Innermost loops should have 2 basic blocks.  */
+  if (!loop->inner)
+    {
+      /* This is the innermost loop.  */
+      if (loop->num_nodes != 2)
+        return false;
+      /* Empty loop.  */
+      if (empty_block_p (loop->header))
+        return false;
+    }
+  else
+    {
+      /* Bail if there are multiple nested loops.  */ 
+      if ((loop->inner)->inner || (loop->inner)->next)
+       return false;
+      /* Recursive call for the inner loop.  */
+      if (!is_loop_form_vectorizable (loop->inner))
+        return false;
+      if (loop->num_nodes != 5)
+       return false;
+      /* The loop has 0 iterations.  */
+      if (TREE_INT_CST_LOW (number_of_latch_executions (loop)) == 0)
+       return false;
+    }
+
+   return true;        
+}
+
+/* This function checks if there is at least one vectorizable
+   load/store in loop LOOP.  Code adapted from tree-vect-stmts.c.  */
+
+static bool
+is_loop_stmts_vectorizable (struct loop *loop)
+{
+  basic_block *body;
+  unsigned int i;
+  bool vect_load_store = false;
+
+  body = get_loop_body (loop);
+
+  for (i = 0; i < loop->num_nodes; i++)
+    {
+      gimple_stmt_iterator gsi;
+      for (gsi = gsi_start_bb (body[i]); !gsi_end_p (gsi); gsi_next (&gsi))
+        {
+         gimple stmt = gsi_stmt (gsi);
+         enum gimple_code code = gimple_code (stmt);
+
+         if (gimple_has_volatile_ops (stmt))
+           return false;
+
+         /* Does it have a vectorizable store or load in a hot bb? */
+         if (code == GIMPLE_ASSIGN)
+           {
+             enum tree_code lhs_code = TREE_CODE (gimple_assign_lhs (stmt));
+             enum tree_code rhs_code = gimple_assign_rhs_code (stmt);
+
+             /* Only look at hot vectorizable loads/stores.  */
+             if (profile_status == PROFILE_READ
+                 && !maybe_hot_bb_p (body[i]))
+               continue;
+
+             if (lhs_code == ARRAY_REF
+                 || lhs_code == INDIRECT_REF
+                 || lhs_code == COMPONENT_REF
+                 || lhs_code == IMAGPART_EXPR
+                 || lhs_code == REALPART_EXPR
+                 || lhs_code == MEM_REF)
+               vect_load_store = true;
+             else if (rhs_code == ARRAY_REF
+                 || rhs_code == INDIRECT_REF
+                 || rhs_code == COMPONENT_REF
+                 || rhs_code == IMAGPART_EXPR
+                 || rhs_code == REALPART_EXPR
+                 || rhs_code == MEM_REF)
+               vect_load_store = true;
+           }
+       }
+    }
+
+  return vect_load_store;
+}
+
+/* This function checks if there are any vectorizable loops present
+   in CURRENT_FUNCTION_DECL.  This function is called before the
+   loop optimization passes and is therefore very conservative in
+   checking for vectorizable loops.  Also, not all the checks used in the
+   vectorizer pass can be used here, since many loop optimizations
+   that could change the loop structure and the stmts have not yet
+   occurred.
+
+   The conditions for a loop being vectorizable are adapted from
+   tree-vectorizer.c, tree-vect-stmts.c. */
+
+static bool
+any_loops_vectorizable_with_load_store (void)
+{
+  unsigned int vect_loops_num;
+  loop_iterator li;
+  struct loop *loop;
+  bool vectorizable_loop_found = false;
+
+  loop_optimizer_init (LOOPS_NORMAL | LOOPS_HAVE_RECORDED_EXITS);
+
+  vect_loops_num = number_of_loops ();
+
+  /* Bail out if there are no loops.  */
+  if (vect_loops_num <= 1)
+    {
+      loop_optimizer_finalize ();
+      return false;
+    }
+
+  scev_initialize ();
+
+  /* This is iterating over all loops.  */
+  FOR_EACH_LOOP (li, loop, 0)
+    if (optimize_loop_nest_for_speed_p (loop))
+      {
+       if (!is_loop_form_vectorizable (loop))
+         continue;
+       if (!is_loop_stmts_vectorizable (loop))
+         continue;
+        vectorizable_loop_found = true;
+       FOR_EACH_LOOP_BREAK (li);
+      }
+
+
+  scev_finalize ();
+  loop_optimizer_finalize ();
+
+  return vectorizable_loop_found;
+}
+
+/* This makes the function that chooses the version to execute based
+   on the condition.  This condition function will decide which version
+   of the function to execute.  It should look like this:
+
+   int cond_i ()
+   {
+      __builtin_cpu_init (); // Get the cpu type.
+      a =  __builtin_cpu_is_<type1> ();
+      if (a)
+        return 0; // first version created.
+      a =  __builtin_cpu_is_<type2> ();
+      if (a)
+        return 1; // second version created.
+      ...
+      return N; // the default version, N = number of versions created.
+   }
+
+   NEW_BB is the new last basic block of this function and to which more
+   conditions can be added.  It is updated by this function.  */
+
+static tree
+make_condition_function (basic_block *new_bb)
+{
+  gimple ifunc_cpu_init_stmt;
+  gimple_seq gseq;
+  tree cond_func_decl;
+  tree old_current_function_decl;
+ 
+
+  cond_func_decl = make_empty_function (new_bb);
+
+  old_current_function_decl = current_function_decl;
+  push_cfun (DECL_STRUCT_FUNCTION (cond_func_decl));
+  current_function_decl = cond_func_decl;
+
+  gseq = bb_seq (*new_bb);
+
+  /* Since this is possibly dispatched with IFUNC, call builtin_cpu_init
+     explicitly, as the constructor will only fire after IFUNC
+     initializers. */
+  ifunc_cpu_init_stmt = gimple_build_call_vec (
+                     ix86_builtins [(int) IX86_BUILTIN_CPU_INIT], NULL);
+  gimple_seq_add_stmt (&gseq, ifunc_cpu_init_stmt);
+  gimple_set_bb (ifunc_cpu_init_stmt, *new_bb);
+  set_bb_seq (*new_bb, gseq);
+      
+  pop_cfun ();
+  current_function_decl = old_current_function_decl;
+  return cond_func_decl;
+}
+
+/* Create a new target optimization node with tune set to ARCH_TUNE.  */
+static tree
+create_mtune_target_opt_node (const char *arch_tune)
+{
+  struct cl_target_option target_options;
+  const char *old_tune_string;
+  tree optimization_node;
+  
+  /* Build an optimization node that is the same as the current one except with
+     "tune=arch_tune".  */
+  cl_target_option_save (&target_options, &global_options);
+  old_tune_string = ix86_tune_string;
+
+  ix86_tune_string = arch_tune;
+  ix86_option_override_internal (false);
+
+  optimization_node = build_target_option_node ();
+
+  ix86_tune_string = old_tune_string;
+  cl_target_option_restore (&global_options, &target_options);
+
+  return optimization_node;
+}
+
+/* Should a version of this function be specially optimized for core2?
+
+   This function should have checks to see if there are any opportunities for
+   core2 specific optimizations, otherwise do not create a clone.  The
+   following opportunities are checked.
+   
+   * Check if this function has vectorizable loads/stores as it is known that
+     unaligned 128-bit movs to/from memory (movdqu) are very expensive on
+     core2 whereas the later generations like corei7 have no additional
+     overhead.
+
+     This versioning is triggered only when -ftree-vectorize is turned on
+     and when multi-versioning for core2 is requested using -mvarch=core2.
+
+   Return false if no versioning is required.  Return true if a version must
+   be created.  Generate the *OPTIMIZATION_NODE that must be used to optimize
+   the newly created version, that is tag "tune=core2" on the new version.  */
+
+static bool
+mversion_for_core2 (tree *optimization_node,
+                   tree *cond_func_decl, basic_block *new_bb)
+{
+  tree predicate_decl;
+  bool is_mversion_target_core2 = false;
+  bool create_version = false;
+
+  if (ix86_varch_specified
+      && ix86_varch[PROCESSOR_CORE2_64])
+    is_mversion_target_core2 = true;
+
+  /* Check for criteria to create a new version for core2.  */
+
+  /* If -ftree-vectorize is not used or MV is not requested, bail.  */
+  if (flag_tree_vectorize && is_mversion_target_core2)
+    {
+      /* Check if there is at least one loop that has a vectorizable
+         load/store.  These are the ones that can generate the unaligned mov
+         which is known to be very slow on core2.  */
+      if (any_loops_vectorizable_with_load_store ())
+        create_version = true;
+    }
+  /* else if XXX: Add more criteria to version for core2.  */
+
+  if (!create_version)
+    return false;
+
+  /* If the condition function's body has not been created, create it now.  */
+  if (*cond_func_decl == NULL)
+    *cond_func_decl = make_condition_function (new_bb);
+
+  *optimization_node = create_mtune_target_opt_node ("core2");
+
+  predicate_decl = ix86_builtins [(int) IX86_BUILTIN_CPU_IS_INTEL_CORE2];
+  *new_bb = add_condition_to_bb (*cond_func_decl, 0, *new_bb,
+                                predicate_decl, false);
+  return true;
+}
+
+/* Decide whether the function CURRENT_FUNCTION_DECL should be
+   multi-versioned.  If so, return the number of versions to be created
+   (other than the original); otherwise return 0.  The outcome of
+   COND_FUNC_DECL will decide the version to be executed.  The
+   OPTIMIZATION_NODE_CHAIN has a unique node for each new version.  */
+
+static int
+ix86_mversion_function (tree fndecl ATTRIBUTE_UNUSED,
+                       tree *optimization_node_chain,
+                       tree *cond_func_decl)
+{
+  basic_block new_bb;
+  tree optimization_node;
+  int num_versions_created = 0;
+
+  if (ix86_mv_arch_string == NULL)
+    return 0;
+
+  if (mversion_for_core2 (&optimization_node, cond_func_decl, &new_bb))
+    num_versions_created++;
+
+  if (!num_versions_created)
+    return 0;
+
+  *optimization_node_chain = tree_cons (optimization_node,
+                                       NULL_TREE, *optimization_node_chain);
+
+  /* Return the default version as the last stmt in cond_func_decl.  */
+  if (*cond_func_decl != NULL)
+    new_bb = add_condition_to_bb (*cond_func_decl, num_versions_created,
+                                  new_bb, NULL_TREE, false);
+
+  return num_versions_created;
+}
+
 /* A builtin to init/return the cpu type or feature.  Returns an
    integer and the type is a const if IS_CONST is set. */
 
@@ -35608,6 +36128,9 @@ ix86_loop_unroll_adjust (unsigned nunroll, struct
 #undef TARGET_FOLD_BUILTIN
 #define TARGET_FOLD_BUILTIN ix86_fold_builtin
 
+#undef TARGET_MVERSION_FUNCTION
+#define TARGET_MVERSION_FUNCTION ix86_mversion_function
+
 #undef TARGET_SLOW_UNALIGNED_VECTOR_MEMOP
 #define TARGET_SLOW_UNALIGNED_VECTOR_MEMOP ix86_slow_unaligned_vector_memop
 
Index: params.def
===================================================================
--- params.def  (revision 182355)
+++ params.def  (working copy)
@@ -1037,6 +1037,11 @@ DEFPARAM (PARAM_PMU_PROFILE_N_ADDRESS,
          "While doing PMU profiling symbolize this many top addresses.",
          50, 1, 10000)
 
+DEFPARAM (PARAM_MAX_FUNCTION_SIZE_FOR_AUTO_CLONING,
+         "autoclone-function-size-limit",
+         "Do not auto clone functions beyond this size.",
+         450, 0, 100000)
+
 /*
 Local variables:
 mode:c

--
This patch is available for review at http://codereview.appspot.com/5490054
