[PATCH RFC]Pair load store instructions using a generic scheduling fusion pass

Bin Cheng Tue, 30 Sep 2014 02:25:28 -0700

Hi,
Last time I posted the patch pairing consecutive load/store instructions on
ARM, the patch got some review comments.  The most important one, as
suggested by Jeff and Mike, is about to do the load/store pairing using
existing scheduling facility.  In the initial investigation, I cleared up
Mike's patch and fixed some implementation bugs in it, it can now find as
many load/store pairs as my old patch.  Then I decided to take one step
forward to introduce a generic instruction fusion infrastructure in GCC,
because in essence, load/store pair is nothing different with other
instruction fusion, all these optimizations want is to push instructions
together in instruction flow.
So here comes this patch.  It adds a new sched_fusion pass just before
peephole2.  The methodology is like:
1) The priority in scheduler is extended into [fusion_priority, priority]
pair, with fusion_priority as the major key and priority as the minor key.
2) The back-end assigns priorities pair to each instruction, instructions
want to be fused together get same fusion_priority assigned.
3) The haifa scheduler schedules instructions based on fusion priorities,
all other parts are just like the original sched2 pass.  Of course, some
functions can be simplified/skipped in this process.
4) With instructions fused together in flow, the following peephole2 pass
can easily transform interesting instructions into other forms, just like
ldrd/strd for ARM.


The new infrastructure can handle different kinds of fusion in one pass.
It's also easy to extend for new fusion cases, all it takes is to identify
instructions which want to be fused and assign new fusion priorities to
them.  Also as Mike suggested last time, the patch can be very simple by
reusing existing scheduler facility.

I collected performance data for both cortex-a15 and cortex-a57 (with a
local peephole ldp/stp patch), the benchmarks can be obviously improved on
arm/aarch64.  I also collected instrument data about how many load/store
pairs are found.  For the four versions of load/store pair patches:
0) The original Mike's patch.
1) My original prototype patch.
2) Cleaned up pass of Mike (with implementation bugs resolved).
3) This new prototype fusion pass.

The numbers of paired opportunities satisfy below relations:
3 * N0 ~ N1 ~ N2 < N3
For example, for one benchmark suite, we have:
N0 ~= 1300
N1/N2 ~= 5000
N3 ~= 7500

As a matter of fact, if we move sched_fusion and peephole2 pass after
register renaming (~11000 for above benchmark suite), then enable register
renaming pass, this patch can find even more load store pairs.  But rename
pass has its own impact on performance and we need more benchmark data
before doing that change.

Of course, this patch is no the perfect solution, it does miss load/store
pair in some corner cases which have complicated instruction dependencies.
Actually it breaks one load/store pair test on armv6 because of the corner
case, that's why the pass is disabled by default on non-armv7 processors.  I
may investigate the failure and try to enable the pass for all arm targets
in the future.

So any comments on this?

2014-09-30  Bin Cheng  <bin.ch...@arm.com>
            Mike Stump  <mikest...@comcast.net>

        * timevar.def (TV_SCHED_FUSION): New time var.
        * passes.def (pass_sched_fusion): New pass.
        * config/arm/arm.c (TARGET_SCHED_FUSION_PRIORITY): New.
        (extract_base_offset_in_addr, fusion_load_store): New.
        (arm_sched_fusion_priority): New.
        (arm_option_override): Disable scheduling fusion on non-armv7
        processors by default.
        * sched-int.h (struct _haifa_insn_data): New field.
        (INSN_FUSION_PRIORITY, FUSION_MAX_PRIORITY, sched_fusion): New.
        * sched-rgn.c (rest_of_handle_sched_fusion): New.
        (pass_data_sched_fusion, pass_sched_fusion): New.
        (make_pass_sched_fusion): New.
        * haifa-sched.c (sched_fusion): New.
        (insn_cost): Handle sched_fusion.
        (priority): Handle sched_fusion by calling target hook.
        (enum rfs_decision): New enum value.
        (rfs_str): New element for RFS_FUSION.
        (rank_for_schedule): Support sched_fusion.
        (schedule_insn, max_issue, prune_ready_list): Handle sched_fusion.
        (schedule_block, fix_tick_ready): Handle sched_fusion.
        * common.opt (flag_schedule_fusion): New.
        * tree-pass.h (make_pass_sched_fusion): New.
        * target.def (fusion_priority): New.
        * doc/tm.texi.in (TARGET_SCHED_FUSION_PRIORITY): New.
        * doc/tm.texi: Regenerated.
        * doc/invoke.texi (-fschedule-fusion): New.

gcc/testsuite/ChangeLog
2014-09-30  Bin Cheng  <bin.ch...@arm.com>

        * gcc.target/arm/ldrd-strd-pair-1.c: New test.
        * gcc.target/arm/vfp-1.c: Improve scanning string.

Index: gcc/timevar.def
===================================================================
--- gcc/timevar.def     (revision 215662)
+++ gcc/timevar.def     (working copy)
@@ -244,6 +244,7 @@ DEFTIMEVAR (TV_IFCVT2                    , "if-conversion 
2")
 DEFTIMEVAR (TV_COMBINE_STACK_ADJUST  , "combine stack adjustments")
 DEFTIMEVAR (TV_PEEPHOLE2             , "peephole 2")
 DEFTIMEVAR (TV_RENAME_REGISTERS      , "rename registers")
+DEFTIMEVAR (TV_SCHED_FUSION          , "scheduling fusion")
 DEFTIMEVAR (TV_CPROP_REGISTERS       , "hard reg cprop")
 DEFTIMEVAR (TV_SCHED2                , "scheduling 2")
 DEFTIMEVAR (TV_MACH_DEP              , "machine dep reorg")
Index: gcc/doc/tm.texi
===================================================================
--- gcc/doc/tm.texi     (revision 215662)
+++ gcc/doc/tm.texi     (working copy)
@@ -6677,6 +6677,29 @@ This hook is called by tree reassociator to determ
 parallelism required in output calculations chain.
 @end deftypefn
 
+@deftypefn {Target Hook} void TARGET_SCHED_FUSION_PRIORITY (rtx_insn 
*@var{insn}, int @var{max_pri}, int *@var{fusion_pri}, int *@var{pri})
+This hook is called by scheduling fusion pass.  It calculates fusion
+priorities for each instruction passed in by parameter.  The priorities
+are returned via pointer parameters.
+
+@var{insn} is the instruction whose priorities need to be calculated.
+@var{max_pri} is the maximum priority can be returned in any cases.
+@var{fusion_pri} is the pointer parameter through which @var{insn}'s
+fusion priority should be calculated and returned.
+@var{pri} is the pointer parameter through which @var{insn}'s priority
+should be calculated and returned.
+
+Same @var{fusion_pri} should be returned for instructions which should
+be scheduled together.  Different @var{pri} should be returned for
+instructions with same @var{fusion_pri}.  All instructions will be
+scheduled according to the two priorities.  @var{fusion_pri} is the major
+sort key, @var{pri} is the minor sort key.  All priorities calculated
+should be between 0 (exclusive) and @var{max_pri} (inclusive).  To avoid
+false dependencies, @var{fusion_pri} of instructions which need to be
+scheduled together should be smaller than @var{fusion_pri} of irrelevant
+instructions.
+@end deftypefn
+
 @node Sections
 @section Dividing the Output into Sections (Texts, Data, @dots{})
 @c the above section title is WAY too long.  maybe cut the part between
Index: gcc/doc/tm.texi.in
===================================================================
--- gcc/doc/tm.texi.in  (revision 215662)
+++ gcc/doc/tm.texi.in  (working copy)
@@ -4817,6 +4817,8 @@ them: try the first ones in this list first.
 
 @hook TARGET_SCHED_REASSOCIATION_WIDTH
 
+@hook TARGET_SCHED_FUSION_PRIORITY
+
 @node Sections
 @section Dividing the Output into Sections (Texts, Data, @dots{})
 @c the above section title is WAY too long.  maybe cut the part between
Index: gcc/doc/invoke.texi
===================================================================
--- gcc/doc/invoke.texi (revision 215662)
+++ gcc/doc/invoke.texi (working copy)
@@ -404,7 +404,7 @@ Objective-C and Objective-C++ Dialects}.
 -fprofile-correction -fprofile-dir=@var{path} -fprofile-generate @gol
 -fprofile-generate=@var{path} @gol
 -fprofile-use -fprofile-use=@var{path} -fprofile-values 
-fprofile-reorder-functions @gol
--freciprocal-math -free -frename-registers -freorder-blocks @gol
+-freciprocal-math -free -frename-registers -fschedule-fusion -freorder-blocks 
@gol
 -freorder-blocks-and-partition -freorder-functions @gol
 -frerun-cse-after-loop -freschedule-modulo-scheduled-loops @gol
 -frounding-math -fsched2-use-superblocks -fsched-pressure @gol
@@ -9459,6 +9459,14 @@ a ``home register''.
 
 Enabled by default with @option{-funroll-loops} and @option{-fpeel-loops}.
 
+@item -fschedule-fusion
+@opindex fschedule-fusion
+Performs a target dependent pass over the instruction stream to schedule
+instructions of same type together because target machine can execute them
+more efficiently if they are adjacent to each other in the instruction flow.
+
+Enabled at levels @option{-O2}, @option{-O3}, @option{-Os}.
+
 @item -ftracer
 @opindex ftracer
 Perform tail duplication to enlarge superblock size.  This transformation
Index: gcc/common.opt
===================================================================
--- gcc/common.opt      (revision 215662)
+++ gcc/common.opt      (working copy)
@@ -1806,6 +1806,10 @@ frename-registers
 Common Report Var(flag_rename_registers) Init(2) Optimization
 Perform a register renaming optimization pass
 
+fschedule-fusion
+Common Report Var(flag_schedule_fusion) Init(2) Optimization
+Perform a target dependent instruction fusion optimization pass
+
 freorder-blocks
 Common Report Var(flag_reorder_blocks) Optimization
 Reorder basic blocks to improve code placement
Index: gcc/haifa-sched.c
===================================================================
--- gcc/haifa-sched.c   (revision 215662)
+++ gcc/haifa-sched.c   (working copy)
@@ -1370,6 +1370,9 @@ insn_cost (rtx_insn *insn)
 {
   int cost;
 
+  if (sched_fusion)
+    return 0;
+
   if (sel_sched_p ())
     {
       if (recog_memoized (insn) < 0)
@@ -1581,6 +1584,8 @@ dep_list_size (rtx insn, sd_list_types_def list)
 
   return nodbgcount;
 }
+
+bool sched_fusion;
 
 /* Compute the priority number for INSN.  */
 static int
@@ -1596,7 +1601,15 @@ priority (rtx_insn *insn)
     {
       int this_priority = -1;
 
-      if (dep_list_size (insn, SD_LIST_FORW) == 0)
+      if (sched_fusion)
+       {
+         int this_fusion_priority;
+
+         targetm.sched.fusion_priority (insn, FUSION_MAX_PRIORITY,
+                                        &this_fusion_priority, &this_priority);
+         INSN_FUSION_PRIORITY (insn) = this_fusion_priority;
+       }
+      else if (dep_list_size (insn, SD_LIST_FORW) == 0)
        /* ??? We should set INSN_PRIORITY to insn_cost when and insn has
           some forward deps but all of them are ignored by
           contributes_to_priority hook.  At the moment we set priority of
@@ -2527,7 +2540,7 @@ enum rfs_decision {
   RFS_SCHED_GROUP, RFS_PRESSURE_DELAY, RFS_PRESSURE_TICK,
   RFS_FEEDS_BACKTRACK_INSN, RFS_PRIORITY, RFS_SPECULATION,
   RFS_SCHED_RANK, RFS_LAST_INSN, RFS_PRESSURE_INDEX,
-  RFS_DEP_COUNT, RFS_TIE, RFS_N };
+  RFS_DEP_COUNT, RFS_TIE, RFS_FUSION, RFS_N };
 
 /* Corresponding strings for print outs.  */
 static const char *rfs_str[RFS_N] = {
@@ -2535,7 +2548,7 @@ static const char *rfs_str[RFS_N] = {
   "RFS_SCHED_GROUP", "RFS_PRESSURE_DELAY", "RFS_PRESSURE_TICK",
   "RFS_FEEDS_BACKTRACK_INSN", "RFS_PRIORITY", "RFS_SPECULATION",
   "RFS_SCHED_RANK", "RFS_LAST_INSN", "RFS_PRESSURE_INDEX",
-  "RFS_DEP_COUNT", "RFS_TIE" };
+  "RFS_DEP_COUNT", "RFS_TIE", "RFS_FUSION" };
 
 /* Statistical breakdown of rank_for_schedule decisions.  */
 typedef struct { unsigned stats[RFS_N]; } rank_for_schedule_stats_t;
@@ -2596,6 +2609,53 @@ rank_for_schedule (const void *x, const void *y)
   /* Make sure that priority of TMP and TMP2 are initialized.  */
   gcc_assert (INSN_PRIORITY_KNOWN (tmp) && INSN_PRIORITY_KNOWN (tmp2));
 
+  if (sched_fusion)
+    {
+      /* The instruction that has the same fusion priority as the last
+        instruction is the instruction we picked next.  If that is not
+        the case, we sort ready list firstly by fusion priority, then
+        by priority, and at last by INSN_LUID.  */
+      int a = INSN_FUSION_PRIORITY (tmp);
+      int b = INSN_FUSION_PRIORITY (tmp2);
+      int last = -1;
+
+      if (last_nondebug_scheduled_insn
+         && !NOTE_P (last_nondebug_scheduled_insn)
+         && BLOCK_FOR_INSN (tmp)
+              == BLOCK_FOR_INSN (last_nondebug_scheduled_insn))
+       last = INSN_FUSION_PRIORITY (last_nondebug_scheduled_insn);
+
+      if (a != last && b != last)
+       {
+         if (a == b)
+           {
+             a = INSN_PRIORITY (tmp);
+             b = INSN_PRIORITY (tmp2);
+           }
+         if (a != b)
+           return rfs_result (RFS_FUSION, b - a);
+         else
+           return rfs_result (RFS_FUSION, INSN_LUID (tmp) - INSN_LUID (tmp2));
+       }
+      else if (a == b)
+       {
+         gcc_assert (last_nondebug_scheduled_insn
+                     && !NOTE_P (last_nondebug_scheduled_insn));
+         last = INSN_PRIORITY (last_nondebug_scheduled_insn);
+
+         a = abs (INSN_PRIORITY (tmp) - last);
+         b = abs (INSN_PRIORITY (tmp2) - last);
+         if (a != b)
+           return rfs_result (RFS_FUSION, a - b);
+         else
+           return rfs_result (RFS_FUSION, INSN_LUID (tmp) - INSN_LUID (tmp2));
+       }
+      else if (a == last)
+       return rfs_result (RFS_FUSION, -1);
+      else
+       return rfs_result (RFS_FUSION, 1);
+    }
+
   if (sched_pressure != SCHED_PRESSURE_NONE)
     {
       /* Prefer insn whose scheduling results in the smallest register
@@ -3914,8 +3974,8 @@ schedule_insn (rtx_insn *insn)
   gcc_assert (INSN_TICK (insn) >= MIN_TICK);
   if (INSN_TICK (insn) > clock_var)
     /* INSN has been prematurely moved from the queue to the ready list.
-       This is possible only if following flag is set.  */
-    gcc_assert (flag_sched_stalled_insns);
+       This is possible only if following flags are set.  */
+    gcc_assert (flag_sched_stalled_insns || sched_fusion);
 
   /* ??? Probably, if INSN is scheduled prematurely, we should leave
      INSN_TICK untouched.  This is a machine-dependent issue, actually.  */
@@ -5413,6 +5473,9 @@ max_issue (struct ready_list *ready, int privilege
   struct choice_entry *top;
   rtx_insn *insn;
 
+  if (sched_fusion)
+    return 0;
+
   n_ready = ready->n_ready;
   gcc_assert (dfa_lookahead >= 1 && privileged_n >= 0
              && privileged_n <= n_ready);
@@ -5768,6 +5831,9 @@ prune_ready_list (state_t temp_state, bool first_c
   bool sched_group_found = false;
   int min_cost_group = 1;
 
+  if (sched_fusion)
+    return;
+
   for (i = 0; i < ready.n_ready; i++)
     {
       rtx_insn *insn = ready_element (&ready, i);
@@ -6375,7 +6441,7 @@ schedule_block (basic_block *target_bb, state_t in
            {
              memcpy (temp_state, curr_state, dfa_state_size);
              cost = state_transition (curr_state, insn);
-             if (sched_pressure != SCHED_PRESSURE_WEIGHTED)
+             if (sched_pressure != SCHED_PRESSURE_WEIGHTED && !sched_fusion)
                gcc_assert (cost < 0);
              if (memcmp (temp_state, curr_state, dfa_state_size) != 0)
                cycle_issued_insns++;
@@ -7195,7 +7261,7 @@ fix_tick_ready (rtx_insn *next)
   INSN_TICK (next) = tick;
 
   delay = tick - clock_var;
-  if (delay <= 0 || sched_pressure != SCHED_PRESSURE_NONE)
+  if (delay <= 0 || sched_pressure != SCHED_PRESSURE_NONE || sched_fusion)
     delay = QUEUE_READY;
 
   change_queue_index (next, delay);
Index: gcc/sched-int.h
===================================================================
--- gcc/sched-int.h     (revision 215662)
+++ gcc/sched-int.h     (working copy)
@@ -806,6 +806,9 @@ struct _haifa_insn_data
   /* A priority for each insn.  */
   int priority;
 
+  /* The fusion priority for each insn.  */
+  int fusion_priority;
+
   /* The minimum clock tick at which the insn becomes ready.  This is
      used to note timing constraints for the insns in the pending list.  */
   int tick;
@@ -901,6 +904,7 @@ extern vec<haifa_insn_data_def> h_i_d;
 /* Accessor macros for h_i_d.  There are more in haifa-sched.c and
    sched-rgn.c.  */
 #define INSN_PRIORITY(INSN) (HID (INSN)->priority)
+#define INSN_FUSION_PRIORITY(INSN) (HID (INSN)->fusion_priority)
 #define INSN_REG_PRESSURE(INSN) (HID (INSN)->reg_pressure)
 #define INSN_MAX_REG_PRESSURE(INSN) (HID (INSN)->max_reg_pressure)
 #define INSN_REG_USE_LIST(INSN) (HID (INSN)->reg_use_list)
@@ -1618,6 +1622,10 @@ extern void sd_copy_back_deps (rtx_insn *, rtx_ins
 extern void sd_delete_dep (sd_iterator_def);
 extern void sd_debug_lists (rtx, sd_list_types_def);
 
+/* Macros and declarations for scheduling fusion.  */
+#define FUSION_MAX_PRIORITY (INT_MAX)
+extern bool sched_fusion;
+
 #endif /* INSN_SCHEDULING */
 
 #endif /* GCC_SCHED_INT_H */
Index: gcc/sched-rgn.c
===================================================================
--- gcc/sched-rgn.c     (revision 215662)
+++ gcc/sched-rgn.c     (working copy)
@@ -3647,6 +3647,17 @@ rest_of_handle_sched2 (void)
   return 0;
 }
 
+static unsigned int
+rest_of_handle_sched_fusion (void)
+{
+#ifdef INSN_SCHEDULING
+  sched_fusion = true;
+  schedule_insns ();
+  sched_fusion = false;
+#endif
+  return 0;
+}
+
 namespace {
 
 const pass_data pass_data_live_range_shrinkage =
@@ -3789,3 +3800,53 @@ make_pass_sched2 (gcc::context *ctxt)
 {
   return new pass_sched2 (ctxt);
 }
+
+namespace {
+
+const pass_data pass_data_sched_fusion =
+{
+  RTL_PASS, /* type */
+  "sched_fusion", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_SCHED_FUSION, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_sched_fusion : public rtl_opt_pass
+{
+public:
+  pass_sched_fusion (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_sched_fusion, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *);
+  virtual unsigned int execute (function *)
+    {
+      return rest_of_handle_sched_fusion ();
+    }
+
+}; // class pass_sched2
+
+bool
+pass_sched_fusion::gate (function *)
+{
+#ifdef INSN_SCHEDULING
+  return (optimize > 0 && flag_schedule_fusion
+    && targetm.sched.fusion_priority != NULL);
+#else
+  return 0;
+#endif
+}
+
+} // anon namespace
+
+rtl_opt_pass *
+make_pass_sched_fusion (gcc::context *ctxt)
+{
+  return new pass_sched_fusion (ctxt);
+}
Index: gcc/target.def
===================================================================
--- gcc/target.def      (revision 215662)
+++ gcc/target.def      (working copy)
@@ -1507,6 +1507,32 @@ parallelism required in output calculations chain.
 int, (unsigned int opc, enum machine_mode mode),
 hook_int_uint_mode_1)
 
+/* The following member value is a function that returns priority for
+   fusion of each instruction via pointer parameters.  */
+DEFHOOK
+(fusion_priority,
+"This hook is called by scheduling fusion pass.  It calculates fusion\n\
+priorities for each instruction passed in by parameter.  The priorities\n\
+are returned via pointer parameters.\n\
+\n\
+@var{insn} is the instruction whose priorities need to be calculated.\n\
+@var{max_pri} is the maximum priority can be returned in any cases.\n\
+@var{fusion_pri} is the pointer parameter through which @var{insn}'s\n\
+fusion priority should be calculated and returned.\n\
+@var{pri} is the pointer parameter through which @var{insn}'s priority\n\
+should be calculated and returned.\n\
+\n\
+Same @var{fusion_pri} should be returned for instructions which should\n\
+be scheduled together.  Different @var{pri} should be returned for\n\
+instructions with same @var{fusion_pri}.  All instructions will be\n\
+scheduled according to the two priorities.  @var{fusion_pri} is the major\n\
+sort key, @var{pri} is the minor sort key.  All priorities calculated\n\
+should be between 0 (exclusive) and @var{max_pri} (inclusive).  To avoid\n\
+false dependencies, @var{fusion_pri} of instructions which need to be\n\
+scheduled together should be smaller than @var{fusion_pri} of irrelevant\n\
+instructions.",
+void, (rtx_insn *insn, int max_pri, int *fusion_pri, int *pri), NULL)
+
 HOOK_VECTOR_END (sched)
 
 /* Functions relating to OpenMP and Cilk Plus SIMD clones.  */
Index: gcc/testsuite/gcc.target/arm/vfp-1.c
===================================================================
--- gcc/testsuite/gcc.target/arm/vfp-1.c        (revision 215662)
+++ gcc/testsuite/gcc.target/arm/vfp-1.c        (working copy)
@@ -126,7 +126,7 @@ void test_convert () {
 }
 
 void test_ldst (float f[], double d[]) {
-  /* { dg-final { scan-assembler "vldr.32.+ \\\[r0, #1020\\\]" } } */
+  /* { dg-final { scan-assembler "vldr.32.+ \\\[r0, #-?\[0-9\]+\\\]" } } */
   /* { dg-final { scan-assembler "vldr.32.+ \\\[r\[0-9\], #-1020\\\]" { target 
{ arm32 && { ! arm_thumb2_ok } } } } } */
   /* { dg-final { scan-assembler "add.+ r0, #1024" } } */
   /* { dg-final { scan-assembler "vstr.32.+ \\\[r\[0-9\]\\\]\n" } } */
Index: gcc/testsuite/gcc.target/arm/ldrd-strd-pair-1.c
===================================================================
--- gcc/testsuite/gcc.target/arm/ldrd-strd-pair-1.c     (revision 0)
+++ gcc/testsuite/gcc.target/arm/ldrd-strd-pair-1.c     (revision 0)
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_prefer_ldrd_strd } */
+/* { dg-options "-O2" } */
+
+struct
+{
+  int x;
+  int y;
+  char c;
+  int d;
+}a;
+
+int foo(int x, int y)
+{
+  int c;
+  a.x = x;
+  c = a.x;
+  a.d = c;
+  a.y = y;
+
+  return 0;
+}
+/* { dg-final { scan-assembler "strd\t" } } */
Index: gcc/config/arm/arm.c
===================================================================
--- gcc/config/arm/arm.c        (revision 215662)
+++ gcc/config/arm/arm.c        (working copy)
@@ -291,6 +291,8 @@ static unsigned arm_add_stmt_cost (void *data, int
 static void arm_canonicalize_comparison (int *code, rtx *op0, rtx *op1,
                                         bool op0_preserve_value);
 static unsigned HOST_WIDE_INT arm_asan_shadow_offset (void);
+
+static void arm_sched_fusion_priority (rtx_insn *, int, int *, int*);
 
 /* Table of machine attributes.  */
 static const struct attribute_spec arm_attribute_table[] =
@@ -688,6 +690,9 @@ static const struct attribute_spec arm_attribute_t
 #undef TARGET_CALL_FUSAGE_CONTAINS_NON_CALLEE_CLOBBERS
 #define TARGET_CALL_FUSAGE_CONTAINS_NON_CALLEE_CLOBBERS true
 
+#undef TARGET_SCHED_FUSION_PRIORITY
+#define TARGET_SCHED_FUSION_PRIORITY arm_sched_fusion_priority
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 /* Obstack for minipool constant handling.  */
@@ -3116,6 +3121,10 @@ arm_option_override (void)
   if (target_slow_flash_data)
     arm_disable_literal_pool = true;
 
+  /* Only enable scheduling fusion for armv7 processors by default.  */
+  if (flag_schedule_fusion == 2 && !arm_arch7)
+    flag_schedule_fusion = 0;
+
   /* Register global variables with the garbage collector.  */
   arm_add_gc_roots ();
 }
@@ -32289,4 +32298,124 @@ arm_is_constant_pool_ref (rtx x)
          && CONSTANT_POOL_ADDRESS_P (XEXP (x, 0)));
 }
 
+/* If MEM is in the form of [base+offset], extract the two parts
+   of address and set to BASE and OFFSET, otherwise return false
+   after clearing BASE and OFFSET.  */
+
+static bool
+extract_base_offset_in_addr (rtx mem, rtx *base, rtx *offset)
+{
+  rtx addr;
+
+  gcc_assert (MEM_P (mem));
+
+  addr = XEXP (mem, 0);
+
+  /* Strip off const from addresses like (const (addr)).  */
+  if (GET_CODE (addr) == CONST)
+    addr = XEXP (addr, 0);
+
+  if (GET_CODE (addr) == REG)
+    {
+      *base = addr;
+      *offset = const0_rtx;
+      return true;
+    }
+
+  if (GET_CODE (addr) == PLUS
+      && GET_CODE (XEXP (addr, 0)) == REG
+      && CONST_INT_P (XEXP (addr, 1)))
+    {
+      *base = XEXP (addr, 0);
+      *offset = XEXP (addr, 1);
+      return true;
+    }
+
+  *base = NULL_RTX;
+  *offset = NULL_RTX;
+
+  return false;
+}
+
+/* If INSN is a load or store of address in the form of [base+offset],
+   extract the two parts and set to BASE and OFFSET. IS_LOAD is set
+   to TRUE if it's a load.  Return TRUE if INSN is such an instruction,
+   otherwise return FALSE.  */
+
+static bool
+fusion_load_store (rtx_insn *insn, rtx *base, rtx *offset, bool *is_load)
+{
+  rtx x, dest, src;
+
+  gcc_assert (INSN_P (insn));
+  x = PATTERN (insn);
+  if (GET_CODE (x) != SET)
+    return false;
+
+  src = SET_SRC (x);
+  dest = SET_DEST (x);
+  if (GET_CODE (src) == REG && GET_CODE (dest) == MEM)
+    {
+      *is_load = false;
+      extract_base_offset_in_addr (dest, base, offset);
+    }
+  else if (GET_CODE (src) == MEM && GET_CODE (dest) == REG)
+    {
+      *is_load = true;
+      extract_base_offset_in_addr (src, base, offset);
+    }
+  else
+    return false;
+
+  return (*base != NULL_RTX && *offset != NULL_RTX);
+}
+
+/* Implement the TARGET_SCHED_FUSION_PRIORITY hook.
+
+   Currently we only support to fuse ldr or str instructions, so FUSION_PRI
+   and PRI are only calculated for these instructions.  For other instruction,
+   FUSION_PRI and PRI are simply set to MAX_PRI.  In the future, other kind
+   instruction fusion can be supported by returning different priorities.
+
+   It's important that irrelevant instructions get the largest FUSION_PRI.  */
+
+static void
+arm_sched_fusion_priority (rtx_insn *insn, int max_pri,
+                          int *fusion_pri, int *pri)
+{
+  int tmp, off_val;
+  bool is_load;
+  rtx base, offset;
+
+  gcc_assert (INSN_P (insn));
+
+  tmp = max_pri - 1;
+  if (!fusion_load_store (insn, &base, &offset, &is_load))
+    {
+      *pri = tmp;
+      *fusion_pri = tmp;
+      return;
+    }
+
+  /* Load goes first.  */
+  if (is_load)
+    *fusion_pri = tmp - 1;
+  else
+    *fusion_pri = tmp - 2;
+
+  tmp /= 2;
+
+  /* INSN with smaller base register goes first.  */
+  tmp -= ((REGNO (base) & 0xff) << 20);
+
+  /* INSN with smaller offset goes first.  */
+  off_val = (int)(INTVAL (offset));
+  if (off_val >= 0)
+    tmp -= (off_val & 0xfffff);
+  else
+    tmp += (off_val & 0xfffff);
+
+  *pri = tmp;
+  return;
+}
 #include "gt-arm.h"
Index: gcc/passes.def
===================================================================
--- gcc/passes.def      (revision 215662)
+++ gcc/passes.def      (working copy)
@@ -400,6 +400,7 @@ along with GCC; see the file COPYING3.  If not see
          NEXT_PASS (pass_stack_adjustments);
          NEXT_PASS (pass_jump2);
          NEXT_PASS (pass_duplicate_computed_gotos);
+         NEXT_PASS (pass_sched_fusion);
          NEXT_PASS (pass_peephole2);
          NEXT_PASS (pass_if_after_reload);
          NEXT_PASS (pass_regrename);
Index: gcc/tree-pass.h
===================================================================
--- gcc/tree-pass.h     (revision 215662)
+++ gcc/tree-pass.h     (working copy)
@@ -542,6 +542,7 @@ extern rtl_opt_pass *make_pass_branch_target_load_
 extern rtl_opt_pass *make_pass_thread_prologue_and_epilogue (gcc::context
                                                             *ctxt);
 extern rtl_opt_pass *make_pass_stack_adjustments (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_sched_fusion (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_peephole2 (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_if_after_reload (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_regrename (gcc::context *ctxt);

[PATCH RFC]Pair load store instructions using a generic scheduling fusion pass

Reply via email to