On Mon, Oct 16, 2017 at 5:27 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Mon, Oct 16, 2017 at 5:00 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Mon, Oct 16, 2017 at 2:56 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>> On Thu, Oct 12, 2017 at 2:43 PM, Richard Biener
>>> <richard.guent...@gmail.com> wrote:
>>>> On Thu, Oct 5, 2017 at 3:17 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>>> Hi,
>>>>> This patch merges adjacent memset builtin partitions if possible.  It is
>>>>> a useful special-case optimization transforming the code below:
>>>>>
>>>>> #define M (256)
>>>>> #define N (512)
>>>>>
>>>>> struct st
>>>>> {
>>>>>   int a[M][N];
>>>>>   int c[M];
>>>>>   int b[M][N];
>>>>> };
>>>>>
>>>>> void
>>>>> foo (struct st *p)
>>>>> {
>>>>>   for (unsigned i = 0; i < M; ++i)
>>>>>     {
>>>>>       p->c[i] = 0;
>>>>>       for (unsigned j = N; j > 0; --j)
>>>>>         {
>>>>>           p->a[i][j - 1] = 0;
>>>>>           p->b[i][j - 1] = 0;
>>>>>         }
>>>>>     }
>>>>> }
>>>>>
>>>>> into a single memset function call, rather than three calls initializing
>>>>> the structure field by field.
>>>>>
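>>>>> Roughly, the whole loop nest above should then end up as a single call
>>>>> like the following (the size is what the new test ldist-32.c checks for;
>>>>> the exact destination expression in the dump may differ):
>>>>>
>>>>>   __builtin_memset (p, 0, 1049600);   /* 1049600 == sizeof (struct st) */
>>>>>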
>>>>> Bootstrapped and tested with the patch set on x86_64 and AArch64, is it OK?
>>>>
>>>> +      /* Insertion sort is good enough for the small sub-array.  */
>>>> +      for (k = i + 1; k < j; ++k)
>>>> +       {
>>>> +         part1 = (*partitions)[k];
>>>> +         for (l = k; l > i; --l)
>>>> +           {
>>>> +             part2 = (*partitions)[l - 1];
>>>> +             if (part1->builtin->dst_base_offset
>>>> +                   >= part2->builtin->dst_base_offset)
>>>> +               break;
>>>> +
>>>> +             (*partitions)[l] = part2;
>>>> +           }
>>>> +         (*partitions)[l] = part1;
>>>> +       }
>>>>
>>>> so we want to sort [i, j[ after dst_base_offset.  I realize you don't want
>>>> to write a qsort helper for this but I can't wrap my head around the above
>>>> in 5 minutes so ... please!
>>> Hmm, I thought twice about this and now I believe stable sorting (hence
>>> insertion sort) is required here.  Please see below for the explanation.
>>>
>>>>
>>>> You don't seem to check the sizes of the memsets given that they
>>>> obviously cannot overlap(!?)
>>> Yes, given it's quite a special-case transformation, I did add code
>>> checking for overlapping cases.
>>>>
>>>> Also why handle memset and not memcpy/memmove or combinations of
>>>> them (for sorting)?
>>>>
>>>>   for ()
>>>>    {
>>>>       p->a[i] = 0;
>>>>       p->c[i] = q->c[i];
>>>>       p->b[i] = 0;
>>>>    }
>>>>
>>>> with a and b adjacent.  I suppose p->c could be computed by a
>>>> non-builtin partition as well.
>>> Yes, the two memset builtin partitions can be merged in this case, but...
>>>> So don't we want to see if dependences allow sorting all builtin
>>>> partitions next to each other
>>>> as much as possible?  (maybe we do that already?)
>>> The answer to this, to the above partition merging, and to the use of
>>> qsort is no; I think all three are really the same question here.  For
>>> now we only do a topological sort of the partitions.  Maximizing
>>> parallelism (either by merging normal parallel partitions or by merging
>>> builtin partitions) requires fine-tuned ordering between partitions that
>>> don't depend on each other.
>>> In order to sort all memset/memcpy/memmove partitions, we would need to
>>> check dependences between all data references in the different
>>> partitions.  For example, I created the new test ldist-36.c, which
>>> illustrates that sorting a memcpy along with memsets would generate
>>> wrong code because a dependence would be broken.  The same holds for
>>> qsort: in the extreme case where the same array is set twice with
>>> different rhs values, the order between the two sets needs to be
>>> preserved.  Unfortunately, qsort is unstable and could reorder the two
>>> sets, which would break the output dependence.
>>> At the point this function runs, the dependence graph has already been
>>> destroyed, so we can't do much beyond special-case handling for memset.
>>> A full solution would require a customized topological sorting process.
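>>>
>>> As a minimal made-up illustration (not in the patch) of the output
>>> dependence that an unstable sort could break:
>>>
>>>   for (i = 0; i < n; ++i)
>>>     a[i] = 0;     /* would become memset builtin partition 1 */
>>>   for (i = 0; i < n; ++i)
>>>     a[i] = -1;    /* would become memset builtin partition 2 */
>>>
>>> Both partitions would have the same dst_base_base and dst_base_offset, so
>>> an unstable sort may swap them, leaving the array filled with 0 instead of
>>> -1; a stable sort keeps the original order and preserves the dependence.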
>>>
>>> So, this updated patch keeps the insertion sort, with an additional
>>> comment explaining why.  Two test cases are also added, showing when
>>> memset partitions should be merged (we can't yet) and when they must not
>>> be merged.
>> Hmm, I can factor out the sorting loop into a function; that might make
>> it easier to read.
> Okay, I will use std::stable_sort here; that's exactly what we want for
> this case.
Hi Richard,
This is the third version of the patch; it uses std::stable_sort, which
gives a stable sort and keeps the code simple.
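
To recap the mechanism with a small sketch (names as in the patch; dst_base
here stands for a partition's recorded destination base, e.g. &obj + 100 as
in the comment above fuse_memset_builtins):

  /* classify_builtin_st splits the destination base into base + offset.  */
  unsigned HOST_WIDE_INT off;
  tree base = strip_offset (dst_base, &off);   /* base == &obj, off == 100 */

Partitions with equal base are then stable-sorted by offset, and adjacent
ones with the same byte value and matching constant sizes are fused.
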
Bootstrapped and tested on x86_64 and AArch64.  Is it OK?

Thanks,
bin
2017-10-17  Bin Cheng  <bin.ch...@arm.com>

    * tree-loop-distribution.c (INCLUDE_ALGORITHM): Define, for
    std::stable_sort.
    (tree-ssa-loop-ivopts.h): Include new header file.
    (struct builtin_info): New fields.
    (classify_builtin_st): Compute and record base and offset parts for
    memset builtin partition by calling strip_offset.
    (offset_cmp, fuse_memset_builtins): New functions.
    (finalize_partitions): Fuse adjacent memset partitions by calling
    above function.
    * tree-ssa-loop-ivopts.c (strip_offset): Delete static declaration.
    Expose the interface.
    * tree-ssa-loop-ivopts.h (strip_offset): New declaration.

gcc/testsuite/ChangeLog
2017-10-17  Bin Cheng  <bin.ch...@arm.com>

    * gcc.dg/tree-ssa/ldist-17.c: Adjust test string.
    * gcc.dg/tree-ssa/ldist-32.c: New test.
    * gcc.dg/tree-ssa/ldist-35.c: New test.
    * gcc.dg/tree-ssa/ldist-36.c: New test.
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-17.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-17.c
index 4efc0a4..b3617f6 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ldist-17.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-17.c
@@ -45,5 +45,5 @@ mad_synth_mute (struct mad_synth *synth)
   return;
 }
 
-/* { dg-final { scan-tree-dump "distributed: split to 0 loops and 4 library calls" "ldist" } } */
-/* { dg-final { scan-tree-dump-times "generated memset zero" 4 "ldist" } } */
+/* { dg-final { scan-tree-dump "Loop nest . distributed: split to 0 loops and 1 library calls" "ldist" } } */
+/* { dg-final { scan-tree-dump-times "generated memset zero" 1 "ldist" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-32.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-32.c
new file mode 100644
index 0000000..477d222
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-32.c
@@ -0,0 +1,29 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fdump-tree-ldist-details" } */
+
+#define M (256)
+#define N (512)
+
+struct st
+{
+  int a[M][N];
+  int c[M];
+  int b[M][N];
+};
+
+void
+foo (struct st *p)
+{
+  for (unsigned i = 0; i < M; ++i)
+    {
+      p->c[i] = 0;
+      for (unsigned j = N; j > 0; --j)
+       {
+         p->a[i][j - 1] = 0;
+         p->b[i][j - 1] = 0;
+       }
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "Loop nest . distributed: split to 0 loops and 1 library" 1 "ldist" } } */
+/* { dg-final { scan-tree-dump-times "__builtin_memset \\(.*, 0, 1049600\\);" 1 "ldist" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-35.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-35.c
new file mode 100644
index 0000000..445d23d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-35.c
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fdump-tree-ldist-details" } */
+
+#define M (256)
+#define N (512)
+
+struct st
+{
+  int a[M][N];
+  int c[M];
+  int b[M][N];
+};
+
+void
+foo (struct st * restrict p, struct st * restrict q)
+{
+  for (unsigned i = 0; i < M; ++i)
+    {
+      p->c[i] = 0;
+      for (unsigned j = N; j > 0; --j)
+       {
+         p->a[i][j - 1] = q->a[i][j - 1];
+         p->b[i][j - 1] = 0;
+       }
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "Loop nest . distributed: split to 0 loops and 1 library" 1 "ldist" { xfail *-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-36.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-36.c
new file mode 100644
index 0000000..0e843f4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-36.c
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fdump-tree-ldist-details" } */
+
+#define M (256)
+#define N (512)
+
+struct st
+{
+  int a[M][N];
+  int c[M];
+  int b[M][N];
+};
+
+void
+foo (struct st * restrict p)
+{
+  for (unsigned i = 0; i < M; ++i)
+    {
+      p->c[i] = 0;
+      for (unsigned j = N; j > 0; --j)
+       {
+         p->b[i][j - 1] = p->a[i][j - 1];
+         p->a[i][j - 1] = 0;
+       }
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "Loop nest . distributed: split to 0 loops and 3 library" 1 "ldist" } } */
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 5e835be..fdf5bab 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -90,6 +90,7 @@ along with GCC; see the file COPYING3.  If not see
        data reuse.  */
 
 #include "config.h"
+#define INCLUDE_ALGORITHM /* stable_sort */
 #include "system.h"
 #include "coretypes.h"
 #include "backend.h"
@@ -106,6 +107,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "stor-layout.h"
 #include "tree-cfg.h"
 #include "tree-ssa-loop-manip.h"
+#include "tree-ssa-loop-ivopts.h"
 #include "tree-ssa-loop.h"
 #include "tree-into-ssa.h"
 #include "tree-ssa.h"
@@ -604,6 +606,10 @@ struct builtin_info
   tree dst_base;
   tree src_base;
   tree size;
+  /* Base and offset part of dst_base after stripping constant offset.  This
+     is only used in memset builtin distribution for now.  */
+  tree dst_base_base;
+  unsigned HOST_WIDE_INT dst_base_offset;
 };
 
 /* Partition for loop distribution.  */
@@ -1500,7 +1506,11 @@ classify_builtin_st (loop_p loop, partition *partition, data_reference_p dr)
   if (!compute_access_range (loop, dr, &base, &size))
     return;
 
-  partition->builtin = alloc_builtin (dr, NULL, base, NULL_TREE, size);
+  struct builtin_info *builtin;
+  builtin = alloc_builtin (dr, NULL, base, NULL_TREE, size);
+  builtin->dst_base_base = strip_offset (builtin->dst_base,
+                                        &builtin->dst_base_offset);
+  partition->builtin = builtin;
   partition->kind = PKIND_MEMSET;
 }
 
@@ -2476,6 +2486,113 @@ version_for_distribution_p (vec<struct partition *> *partitions,
   return (alias_ddrs->length () > 0);
 }
 
+/* Compare base offset of builtin mem* partitions P1 and P2.  */
+
+static bool
+offset_cmp (struct partition *p1, struct partition *p2)
+{
+  gcc_assert (p1 != NULL && p1->builtin != NULL);
+  gcc_assert (p2 != NULL && p2->builtin != NULL);
+  return p1->builtin->dst_base_offset < p2->builtin->dst_base_offset;
+}
+
+/* Fuse adjacent memset builtin PARTITIONS if possible.  This is a special
+   case optimization transforming the code below:
+
+     __builtin_memset (&obj, 0, 100);
+     _1 = &obj + 100;
+     __builtin_memset (_1, 0, 200);
+     _2 = &obj + 300;
+     __builtin_memset (_2, 0, 100);
+
+   into:
+
+     __builtin_memset (&obj, 0, 400);
+
+   Note we don't have dependence information between different partitions
+   at this point; as a result, we can't handle nonadjacent memset builtin
+   partitions since dependence might be broken.  */
+
+static void
+fuse_memset_builtins (vec<struct partition *> *partitions)
+{
+  unsigned i, j;
+  struct partition *part1, *part2;
+
+  for (i = 0; partitions->iterate (i, &part1);)
+    {
+      if (part1->kind != PKIND_MEMSET)
+       {
+         i++;
+         continue;
+       }
+
+      /* Find sub-array of memset builtins of the same base.  Index range
+        of the sub-array is [i, j) with "j > i".  */
+      for (j = i + 1; partitions->iterate (j, &part2); ++j)
+       {
+         if (part2->kind != PKIND_MEMSET
+             || !operand_equal_p (part1->builtin->dst_base_base,
+                                  part2->builtin->dst_base_base, 0))
+           break;
+       }
+
+      /* Stable sort is required in order to avoid breaking dependence.  */
+      std::stable_sort (&(*partitions)[i],
+                       &(*partitions)[i] + j - i, offset_cmp);
+      /* Continue with next partition.  */
+      i = j;
+    }
+
+  /* Merge all consecutive memset builtin partitions.  */
+  for (i = 0; i < partitions->length () - 1;)
+    {
+      part1 = (*partitions)[i];
+      if (part1->kind != PKIND_MEMSET)
+       {
+         i++;
+         continue;
+       }
+
+      part2 = (*partitions)[i + 1];
+      /* Only merge memset partitions of the same base and with constant
+        access sizes.  */
+      if (part2->kind != PKIND_MEMSET
+         || TREE_CODE (part1->builtin->size) != INTEGER_CST
+         || TREE_CODE (part2->builtin->size) != INTEGER_CST
+         || !operand_equal_p (part1->builtin->dst_base_base,
+                              part2->builtin->dst_base_base, 0))
+       {
+         i++;
+         continue;
+       }
+      tree rhs1 = gimple_assign_rhs1 (DR_STMT (part1->builtin->dst_dr));
+      tree rhs2 = gimple_assign_rhs1 (DR_STMT (part2->builtin->dst_dr));
+      int bytev1 = const_with_all_bytes_same (rhs1);
+      int bytev2 = const_with_all_bytes_same (rhs2);
+      /* Only merge memset partitions of the same value.  */
+      if (bytev1 != bytev2 || bytev1 == -1)
+       {
+         i++;
+         continue;
+       }
+      wide_int end1 = wi::add (part1->builtin->dst_base_offset,
+                              wi::to_wide (part1->builtin->size));
+      /* Only merge adjacent memset partitions.  */
+      if (wi::ne_p (end1, part2->builtin->dst_base_offset))
+       {
+         i++;
+         continue;
+       }
+      /* Merge partitions[i] and partitions[i+1].  */
+      part1->builtin->size = fold_build2 (PLUS_EXPR, sizetype,
+                                         part1->builtin->size,
+                                         part2->builtin->size);
+      partition_free (part2);
+      partitions->ordered_remove (i + 1);
+    }
+}
+
 /* Fuse PARTITIONS of LOOP if necessary before finalizing distribution.
    ALIAS_DDRS contains ddrs which need runtime alias check.  */
 
@@ -2519,6 +2636,10 @@ finalize_partitions (struct loop *loop, vec<struct partition *> *partitions,
        }
       partitions->truncate (1);
     }
+
+  /* Fuse memset builtins if possible.  */
+  if (partitions->length () > 1)
+    fuse_memset_builtins (partitions);
 }
 
 /* Distributes the code from LOOP in such a way that producer statements
diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 2a71027..4eb0e73 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -1550,9 +1550,6 @@ record_invariant (struct ivopts_data *data, tree op, bool nonlinear_use)
   bitmap_set_bit (data->relevant, SSA_NAME_VERSION (op));
 }
 
-static tree
-strip_offset (tree expr, unsigned HOST_WIDE_INT *offset);
-
 /* Record a group of TYPE.  */
 
 static struct iv_group *
@@ -2863,7 +2860,7 @@ strip_offset_1 (tree expr, bool inside_addr, bool top_compref,
 
 /* Strips constant offsets from EXPR and stores them to OFFSET.  */
 
-static tree
+tree
 strip_offset (tree expr, unsigned HOST_WIDE_INT *offset)
 {
   HOST_WIDE_INT off;
diff --git a/gcc/tree-ssa-loop-ivopts.h b/gcc/tree-ssa-loop-ivopts.h
index f8f31e9..bd92051 100644
--- a/gcc/tree-ssa-loop-ivopts.h
+++ b/gcc/tree-ssa-loop-ivopts.h
@@ -28,6 +28,7 @@ extern void dump_cand (FILE *, struct iv_cand *);
 extern bool contains_abnormal_ssa_name_p (tree);
 extern struct loop *outermost_invariant_loop_for_expr (struct loop *, tree);
 extern bool expr_invariant_in_loop_p (struct loop *, tree);
+extern tree strip_offset (tree, unsigned HOST_WIDE_INT *);
 bool may_be_nonaddressable_p (tree expr);
 void tree_ssa_iv_optimize (void);
 
