On Tue, Oct 17, 2017 at 4:28 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: > On Mon, Oct 16, 2017 at 5:27 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: >> On Mon, Oct 16, 2017 at 5:00 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: >>> On Mon, Oct 16, 2017 at 2:56 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: >>>> On Thu, Oct 12, 2017 at 2:43 PM, Richard Biener >>>> <richard.guent...@gmail.com> wrote: >>>>> On Thu, Oct 5, 2017 at 3:17 PM, Bin Cheng <bin.ch...@arm.com> wrote: >>>>>> Hi, >>>>>> This patch merges adjacent memset builtin partitions if possible. It is >>>>>> a useful special case optimization transforming below code: >>>>>> >>>>>> #define M (256) >>>>>> #define N (512) >>>>>> >>>>>> struct st >>>>>> { >>>>>> int a[M][N]; >>>>>> int c[M]; >>>>>> int b[M][N]; >>>>>> }; >>>>>> >>>>>> void >>>>>> foo (struct st *p) >>>>>> { >>>>>> for (unsigned i = 0; i < M; ++i) >>>>>> { >>>>>> p->c[i] = 0; >>>>>> for (unsigned j = N; j > 0; --j) >>>>>> { >>>>>> p->a[i][j - 1] = 0; >>>>>> p->b[i][j - 1] = 0; >>>>>> } >>>>>> } >>>>>> >>>>>> into a single memset function call, rather than three calls initializing >>>>>> the structure field by field. >>>>>> >>>>>> Bootstrap and test in patch set on x86_64 and AArch64, is it OK? >>>>> >>>>> + /* Insertion sort is good enough for the small sub-array. */ >>>>> + for (k = i + 1; k < j; ++k) >>>>> + { >>>>> + part1 = (*partitions)[k]; >>>>> + for (l = k; l > i; --l) >>>>> + { >>>>> + part2 = (*partitions)[l - 1]; >>>>> + if (part1->builtin->dst_base_offset >>>>> + >= part2->builtin->dst_base_offset) >>>>> + break; >>>>> + >>>>> + (*partitions)[l] = part2; >>>>> + } >>>>> + (*partitions)[l] = part1; >>>>> + } >>>>> >>>>> so we want to sort [i, j[ after dst_base_offset. I realize you don't want >>>>> to write a qsort helper for this but I can't wrap my head around the above >>>>> in 5 minutes so ... please! >>>> Hmm, I thought twice about this and now I believe stable sorting (thus >>>> insertion sort) >>>> is required here. Please see below for explanation. >>>> >>>>> >>>>> You don't seem to check the sizes of the memsets given that they >>>>> obviously cannot overlap(!?) >>>> Yes, given it's quite special case transformation, I did add code >>>> checking overlap cases. >>>>> >>>>> Also why handle memset and not memcpy/memmove or combinations of >>>>> them (for sorting)? >>>>> >>>>> for () >>>>> { >>>>> p->a[i] = 0; >>>>> p->c[i] = q->c[i]; >>>>> p->b[i] = 0; >>>>> } >>>>> >>>>> with a and b adjacent. I suppose p->c could be computed by a >>>>> non-builtin partition as well. >>>> Yes, the two memset builtin partitions can be merged in this case, but... >>>>> So don't we want to see if dependences allow sorting all builtin >>>>> partitions next to each other >>>>> as much as possible? (maybe we do that already?) >>>> The answer for this, above partition merging and use of qsort is no. >>>> I think all the three are the same question here. For now we only do >>>> topological sort for partitions. To maximize parallelism (either by >>>> merging >>>> normal parallel partitions or merging builtin partitions) requires >>>> fine-tuned >>>> sorting between partitions that doesn't dependence on each other. >>>> In order to sort all memset/memcpy/memmove, we need check dependence >>>> between all data references between different partitions. For example, I >>>> created new test ldist-36.c illustrating sorting memcpy along with memset >>>> would generate wrong code because dependence is broken. It's the same >>>> for qsort. In extreme case, if the same array is set twice with different >>>> rhs >>>> value, the order between the two sets needs to be preserved. >>>> Unfortunately, >>>> qsort is unstable and could reorder different sets. This would break >>>> output >>>> dependence. >>>> At the point of this function, dependence graph is destroyed, we can't do >>>> much in addition to special case handling for memset. Full solution would >>>> require a customized topological sorting process. >>>> >>>> So, this updated patch keeps insertion sort with additional comment >>>> explaining >>>> why. Also two test cases added showing when memset partitions should be >>>> merged (we can't for now) and when memset partitions should not be merged. >>> Hmm, I can factor out the sorting loop into a function, that might >>> make it easier >>> to read. >> Okay, I will use std::stable_sort in this case, that's exactly what we >> want for this case. > Hi Richard, > This is the 3rd version of patch, it uses std::stable_sort which gives > stable sort and code simplicity. > Bootstrap and test. Is it OK?
Ok. Thanks, Richard. > Thanks, > bin > 2017-10-17 Bin Cheng <bin.ch...@arm.com> > > * tree-loop-distribution.c (INCLUDE_ALGORITHM): New header file. > (tree-ssa-loop-ivopts.h): New header file. > (struct builtin_info): New fields. > (classify_builtin_1): Compute and record base and offset parts for > memset builtin partition by calling strip_offset. > (offset_cmp, fuse_memset_builtins): New functions. > (finalize_partitions): Fuse adjacent memset partitions by calling > above function. > * tree-ssa-loop-ivopts.c (strip_offset): Delete static declaration. > Expose the interface. > * tree-ssa-loop-ivopts.h (strip_offset): New declaration. > > gcc/testsuite/ChangeLog > 2017-10-17 Bin Cheng <bin.ch...@arm.com> > > * gcc.dg/tree-ssa/ldist-17.c: Adjust test string. > * gcc.dg/tree-ssa/ldist-32.c: New test. > * gcc.dg/tree-ssa/ldist-35.c: New test. > * gcc.dg/tree-ssa/ldist-36.c: New test.