On Mon, Oct 16, 2017 at 5:27 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Mon, Oct 16, 2017 at 5:00 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Mon, Oct 16, 2017 at 2:56 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>> On Thu, Oct 12, 2017 at 2:43 PM, Richard Biener
>>> <richard.guent...@gmail.com> wrote:
>>>> On Thu, Oct 5, 2017 at 3:17 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>>> Hi,
>>>>> This patch merges adjacent memset builtin partitions if possible.  It is
>>>>> a useful special-case optimization transforming the code below:
>>>>>
>>>>> #define M (256)
>>>>> #define N (512)
>>>>>
>>>>> struct st
>>>>> {
>>>>>   int a[M][N];
>>>>>   int c[M];
>>>>>   int b[M][N];
>>>>> };
>>>>>
>>>>> void
>>>>> foo (struct st *p)
>>>>> {
>>>>>   for (unsigned i = 0; i < M; ++i)
>>>>>     {
>>>>>       p->c[i] = 0;
>>>>>       for (unsigned j = N; j > 0; --j)
>>>>>         {
>>>>>           p->a[i][j - 1] = 0;
>>>>>           p->b[i][j - 1] = 0;
>>>>>         }
>>>>>     }
>>>>> }
>>>>>
>>>>> into a single memset function call, rather than three calls initializing
>>>>> the structure field by field.
>>>>>
>>>>> Bootstrapped and tested as part of the patch set on x86_64 and AArch64,
>>>>> is it OK?
>>>>
>>>> +  /* Insertion sort is good enough for the small sub-array.  */
>>>> +  for (k = i + 1; k < j; ++k)
>>>> +    {
>>>> +      part1 = (*partitions)[k];
>>>> +      for (l = k; l > i; --l)
>>>> +        {
>>>> +          part2 = (*partitions)[l - 1];
>>>> +          if (part1->builtin->dst_base_offset
>>>> +              >= part2->builtin->dst_base_offset)
>>>> +            break;
>>>> +
>>>> +          (*partitions)[l] = part2;
>>>> +        }
>>>> +      (*partitions)[l] = part1;
>>>> +    }
>>>>
>>>> so we want to sort [i, j[ by dst_base_offset.  I realize you don't want
>>>> to write a qsort helper for this but I can't wrap my head around the above
>>>> in 5 minutes so ... please!
>>> Hmm, I thought twice about this and now I believe stable sorting (thus
>>> insertion sort) is required here.  Please see below for an explanation.
>>>
>>>> You don't seem to check the sizes of the memsets given that they
>>>> obviously cannot overlap(!?)
>>> Yes, given it's quite a special-case transformation, I did add code
>>> checking for overlap cases.
>>>>
>>>> Also why handle memset and not memcpy/memmove or combinations of
>>>> them (for sorting)?
>>>>
>>>>   for ()
>>>>     {
>>>>       p->a[i] = 0;
>>>>       p->c[i] = q->c[i];
>>>>       p->b[i] = 0;
>>>>     }
>>>>
>>>> with a and b adjacent.  I suppose p->c could be computed by a
>>>> non-builtin partition as well.
>>> Yes, the two memset builtin partitions can be merged in this case, but...
>>>> So don't we want to see if dependences allow sorting all builtin
>>>> partitions next to each other as much as possible?  (maybe we do that
>>>> already?)
>>> The answer to this, to the above partition merging, and to the use of
>>> qsort is no; I think all three are really the same question here.  For
>>> now we only do a topological sort of the partitions.  Maximizing
>>> parallelism (either by merging normal parallel partitions or by merging
>>> builtin partitions) requires fine-tuned sorting of partitions that don't
>>> depend on each other.
>>> In order to sort all memset/memcpy/memmove partitions, we would need to
>>> check dependence between all data references in different partitions.
>>> For example, I created the new test ldist-36.c illustrating that sorting
>>> a memcpy along with memsets would generate wrong code because a
>>> dependence is broken.  It's the same for qsort.  In an extreme case, if
>>> the same array is set twice with different rhs values, the order between
>>> the two sets needs to be preserved.  Unfortunately, qsort is unstable and
>>> could reorder the two sets.  This would break the output dependence.
>>> At the point of this function, the dependence graph has been destroyed,
>>> so we can't do much beyond special-case handling for memset.  A full
>>> solution would require a customized topological sorting process.
>>>
>>> So, this updated patch keeps the insertion sort with an additional
>>> comment explaining why.  Also, two test cases are added showing when
>>> memset partitions should be merged (we can't do it for now) and when
>>> they should not be merged.
>> Hmm, I can factor out the sorting loop into a function; that might make
>> it easier to read.
> Okay, I will use std::stable_sort in this case; that's exactly what we
> want here.

Hi Richard,
This is the 3rd version of the patch; it uses std::stable_sort, which gives
a stable sort and simpler code.  Bootstrapped and tested.  Is it OK?
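For reference, the output dependence that the stable sort preserves looks
roughly like the loop below (a minimal sketch for illustration only, not one
of the new tests; the function name is made up):

/* Both stores are memset candidates with the same destination base and
   the same dst_base_offset, so their original order must be kept.  An
   unstable sort keyed only on dst_base_offset could swap the two
   partitions and change the value the array ends up holding.  */
void
bar (char *a, int n)
{
  for (int i = 0; i < n; ++i)
    {
      a[i] = 0;   /* first memset candidate */
      a[i] = 1;   /* second memset candidate, same destination */
    }
}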
Thanks,
bin

2017-10-17  Bin Cheng  <bin.ch...@arm.com>

	* tree-loop-distribution.c (INCLUDE_ALGORITHM): New header file.
	(tree-ssa-loop-ivopts.h): New header file.
	(struct builtin_info): New fields.
	(classify_builtin_st): Compute and record base and offset parts for
	memset builtin partition by calling strip_offset.
	(offset_cmp, fuse_memset_builtins): New functions.
	(finalize_partitions): Fuse adjacent memset partitions by calling
	above function.
	* tree-ssa-loop-ivopts.c (strip_offset): Delete static declaration.
	Expose the interface.
	* tree-ssa-loop-ivopts.h (strip_offset): New declaration.

gcc/testsuite/ChangeLog
2017-10-17  Bin Cheng  <bin.ch...@arm.com>

	* gcc.dg/tree-ssa/ldist-17.c: Adjust test string.
	* gcc.dg/tree-ssa/ldist-32.c: New test.
	* gcc.dg/tree-ssa/ldist-35.c: New test.
	* gcc.dg/tree-ssa/ldist-36.c: New test.
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-17.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-17.c
index 4efc0a4..b3617f6 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ldist-17.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-17.c
@@ -45,5 +45,5 @@ mad_synth_mute (struct mad_synth *synth)
   return;
 }
 
-/* { dg-final { scan-tree-dump "distributed: split to 0 loops and 4 library calls" "ldist" } } */
-/* { dg-final { scan-tree-dump-times "generated memset zero" 4 "ldist" } } */
+/* { dg-final { scan-tree-dump "Loop nest . distributed: split to 0 loops and 1 library calls" "ldist" } } */
+/* { dg-final { scan-tree-dump-times "generated memset zero" 1 "ldist" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-32.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-32.c
new file mode 100644
index 0000000..477d222
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-32.c
@@ -0,0 +1,29 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fdump-tree-ldist-details" } */
+
+#define M (256)
+#define N (512)
+
+struct st
+{
+  int a[M][N];
+  int c[M];
+  int b[M][N];
+};
+
+void
+foo (struct st *p)
+{
+  for (unsigned i = 0; i < M; ++i)
+    {
+      p->c[i] = 0;
+      for (unsigned j = N; j > 0; --j)
+	{
+	  p->a[i][j - 1] = 0;
+	  p->b[i][j - 1] = 0;
+	}
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "Loop nest . distributed: split to 0 loops and 1 library" 1 "ldist" } } */
+/* { dg-final { scan-tree-dump-times "__builtin_memset \\(.*, 0, 1049600\\);" 1 "ldist" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-35.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-35.c
new file mode 100644
index 0000000..445d23d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-35.c
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fdump-tree-ldist-details" } */
+
+#define M (256)
+#define N (512)
+
+struct st
+{
+  int a[M][N];
+  int c[M];
+  int b[M][N];
+};
+
+void
+foo (struct st * restrict p, struct st * restrict q)
+{
+  for (unsigned i = 0; i < M; ++i)
+    {
+      p->c[i] = 0;
+      for (unsigned j = N; j > 0; --j)
+	{
+	  p->a[i][j - 1] = q->a[i][j - 1];
+	  p->b[i][j - 1] = 0;
+	}
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "Loop nest . distributed: split to 0 loops and 1 library" 1 "ldist" { xfail *-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ldist-36.c b/gcc/testsuite/gcc.dg/tree-ssa/ldist-36.c
new file mode 100644
index 0000000..0e843f4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ldist-36.c
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fdump-tree-ldist-details" } */
+
+#define M (256)
+#define N (512)
+
+struct st
+{
+  int a[M][N];
+  int c[M];
+  int b[M][N];
+};
+
+void
+foo (struct st * restrict p)
+{
+  for (unsigned i = 0; i < M; ++i)
+    {
+      p->c[i] = 0;
+      for (unsigned j = N; j > 0; --j)
+	{
+	  p->b[i][j - 1] = p->a[i][j - 1];
+	  p->a[i][j - 1] = 0;
+	}
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "Loop nest . distributed: split to 0 loops and 3 library" 1 "ldist" } } */
diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index 5e835be..fdf5bab 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -90,6 +90,7 @@ along with GCC; see the file COPYING3.  If not see
    data reuse.  */
 
 #include "config.h"
+#define INCLUDE_ALGORITHM /* stable_sort */
 #include "system.h"
 #include "coretypes.h"
 #include "backend.h"
@@ -106,6 +107,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "stor-layout.h"
 #include "tree-cfg.h"
 #include "tree-ssa-loop-manip.h"
+#include "tree-ssa-loop-ivopts.h"
 #include "tree-ssa-loop.h"
 #include "tree-into-ssa.h"
 #include "tree-ssa.h"
@@ -604,6 +606,10 @@ struct builtin_info
   tree dst_base;
   tree src_base;
   tree size;
+  /* Base and offset part of dst_base after stripping constant offset.  This
+     is only used in memset builtin distribution for now.  */
+  tree dst_base_base;
+  unsigned HOST_WIDE_INT dst_base_offset;
 };
 
 /* Partition for loop distribution.  */
@@ -1500,7 +1506,11 @@ classify_builtin_st (loop_p loop, partition *partition, data_reference_p dr)
   if (!compute_access_range (loop, dr, &base, &size))
     return;
 
-  partition->builtin = alloc_builtin (dr, NULL, base, NULL_TREE, size);
+  struct builtin_info *builtin;
+  builtin = alloc_builtin (dr, NULL, base, NULL_TREE, size);
+  builtin->dst_base_base = strip_offset (builtin->dst_base,
+					 &builtin->dst_base_offset);
+  partition->builtin = builtin;
   partition->kind = PKIND_MEMSET;
 }
 
@@ -2476,6 +2486,113 @@ version_for_distribution_p (vec<struct partition *> *partitions,
   return (alias_ddrs->length () > 0);
 }
 
+/* Compare base offset of builtin mem* partitions P1 and P2.  */
+
+static bool
+offset_cmp (struct partition *p1, struct partition *p2)
+{
+  gcc_assert (p1 != NULL && p1->builtin != NULL);
+  gcc_assert (p2 != NULL && p2->builtin != NULL);
+  return p1->builtin->dst_base_offset < p2->builtin->dst_base_offset;
+}
+
+/* Fuse adjacent memset builtin PARTITIONS if possible.  This is a special
+   case optimization transforming below code:
+
+     __builtin_memset (&obj, 0, 100);
+     _1 = &obj + 100;
+     __builtin_memset (_1, 0, 200);
+     _2 = &obj + 300;
+     __builtin_memset (_2, 0, 100);
+
+   into:
+
+     __builtin_memset (&obj, 0, 400);
+
+   Note we don't have dependence information between different partitions
+   at this point, as a result, we can't handle nonadjacent memset builtin
+   partitions since dependence might be broken.  */
+
+static void
+fuse_memset_builtins (vec<struct partition *> *partitions)
+{
+  unsigned i, j;
+  struct partition *part1, *part2;
+
+  for (i = 0; partitions->iterate (i, &part1);)
+    {
+      if (part1->kind != PKIND_MEMSET)
+	{
+	  i++;
+	  continue;
+	}
+
+      /* Find sub-array of memset builtins of the same base.  Index range
+	 of the sub-array is [i, j) with "j > i".  */
+      for (j = i + 1; partitions->iterate (j, &part2); ++j)
+	{
+	  if (part2->kind != PKIND_MEMSET
+	      || !operand_equal_p (part1->builtin->dst_base_base,
+				   part2->builtin->dst_base_base, 0))
+	    break;
+	}
+
+      /* Stable sort is required in order to avoid breaking dependence.  */
+      std::stable_sort (&(*partitions)[i],
+			&(*partitions)[i] + j - i, offset_cmp);
+      /* Continue with next partition.  */
+      i = j;
+    }
+
+  /* Merge all consecutive memset builtin partitions.  */
+  for (i = 0; i < partitions->length () - 1;)
+    {
+      part1 = (*partitions)[i];
+      if (part1->kind != PKIND_MEMSET)
+	{
+	  i++;
+	  continue;
+	}
+
+      part2 = (*partitions)[i + 1];
+      /* Only merge memset partitions of the same base and with constant
+	 access sizes.  */
+      if (part2->kind != PKIND_MEMSET
+	  || TREE_CODE (part1->builtin->size) != INTEGER_CST
+	  || TREE_CODE (part2->builtin->size) != INTEGER_CST
+	  || !operand_equal_p (part1->builtin->dst_base_base,
+			       part2->builtin->dst_base_base, 0))
+	{
+	  i++;
+	  continue;
+	}
+      tree rhs1 = gimple_assign_rhs1 (DR_STMT (part1->builtin->dst_dr));
+      tree rhs2 = gimple_assign_rhs1 (DR_STMT (part2->builtin->dst_dr));
+      int bytev1 = const_with_all_bytes_same (rhs1);
+      int bytev2 = const_with_all_bytes_same (rhs2);
+      /* Only merge memset partitions of the same value.  */
+      if (bytev1 != bytev2 || bytev1 == -1)
+	{
+	  i++;
+	  continue;
+	}
+      wide_int end1 = wi::add (part1->builtin->dst_base_offset,
+			       wi::to_wide (part1->builtin->size));
+      /* Only merge adjacent memset partitions.  */
+      if (wi::ne_p (end1, part2->builtin->dst_base_offset))
+	{
+	  i++;
+	  continue;
+	}
+      /* Merge partitions[i] and partitions[i+1].  */
+      part1->builtin->size = fold_build2 (PLUS_EXPR, sizetype,
+					  part1->builtin->size,
+					  part2->builtin->size);
+      partition_free (part2);
+      partitions->ordered_remove (i + 1);
+    }
+}
+
 /* Fuse PARTITIONS of LOOP if necessary before finalizing distribution.
    ALIAS_DDRS contains ddrs which need runtime alias check.  */
 
@@ -2519,6 +2636,10 @@ finalize_partitions (struct loop *loop, vec<struct partition *> *partitions,
 	}
       partitions->truncate (1);
     }
+
+  /* Fuse memset builtins if possible.  */
+  if (partitions->length () > 1)
+    fuse_memset_builtins (partitions);
 }
 
 /* Distributes the code from LOOP in such a way that producer statements
diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 2a71027..4eb0e73 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -1550,9 +1550,6 @@ record_invariant (struct ivopts_data *data, tree op, bool nonlinear_use)
   bitmap_set_bit (data->relevant, SSA_NAME_VERSION (op));
 }
 
-static tree
-strip_offset (tree expr, unsigned HOST_WIDE_INT *offset);
-
 /* Record a group of TYPE.  */
 
 static struct iv_group *
@@ -2863,7 +2860,7 @@ strip_offset_1 (tree expr, bool inside_addr, bool top_compref,
 
 /* Strips constant offsets from EXPR and stores them to OFFSET.  */
 
-static tree
+tree
 strip_offset (tree expr, unsigned HOST_WIDE_INT *offset)
 {
   HOST_WIDE_INT off;
diff --git a/gcc/tree-ssa-loop-ivopts.h b/gcc/tree-ssa-loop-ivopts.h
index f8f31e9..bd92051 100644
--- a/gcc/tree-ssa-loop-ivopts.h
+++ b/gcc/tree-ssa-loop-ivopts.h
@@ -28,6 +28,7 @@ extern void dump_cand (FILE *, struct iv_cand *);
 extern bool contains_abnormal_ssa_name_p (tree);
 extern struct loop *outermost_invariant_loop_for_expr (struct loop *, tree);
 extern bool expr_invariant_in_loop_p (struct loop *, tree);
+extern tree strip_offset (tree, unsigned HOST_WIDE_INT *);
 bool may_be_nonaddressable_p (tree expr);
 void tree_ssa_iv_optimize (void);
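For reference, the sizes in ldist-32.c work out as follows, assuming a
4-byte int: p->a covers 256 * 512 * 4 = 524288 bytes at offset 0, p->c
covers 256 * 4 = 1024 bytes at offset 524288, and p->b covers another
524288 bytes at offset 525312.  The three memset partitions therefore
share the same stripped base, are adjacent, and fuse into the single
1049600-byte call the test scans for.  Roughly (a sketch of the
distributed form, not the actual ldist dump):

void
foo (struct st *p)
{
  /* 524288 + 1024 + 524288 = 1049600 bytes, i.e. the whole struct st.  */
  __builtin_memset (p, 0, 1049600);
}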