On Mon, Oct 16, 2017 at 2:56 PM, Bin.Cheng <amker.ch...@gmail.com> wrote: > On Thu, Oct 12, 2017 at 2:43 PM, Richard Biener > <richard.guent...@gmail.com> wrote: >> On Thu, Oct 5, 2017 at 3:17 PM, Bin Cheng <bin.ch...@arm.com> wrote: >>> Hi, >>> This patch merges adjacent memset builtin partitions if possible. It is >>> a useful special case optimization transforming below code: >>> >>> #define M (256) >>> #define N (512) >>> >>> struct st >>> { >>> int a[M][N]; >>> int c[M]; >>> int b[M][N]; >>> }; >>> >>> void >>> foo (struct st *p) >>> { >>> for (unsigned i = 0; i < M; ++i) >>> { >>> p->c[i] = 0; >>> for (unsigned j = N; j > 0; --j) >>> { >>> p->a[i][j - 1] = 0; >>> p->b[i][j - 1] = 0; >>> } >>> } >>> >>> into a single memset function call, rather than three calls initializing >>> the structure field by field. >>> >>> Bootstrap and test in patch set on x86_64 and AArch64, is it OK? >> >> + /* Insertion sort is good enough for the small sub-array. */ >> + for (k = i + 1; k < j; ++k) >> + { >> + part1 = (*partitions)[k]; >> + for (l = k; l > i; --l) >> + { >> + part2 = (*partitions)[l - 1]; >> + if (part1->builtin->dst_base_offset >> + >= part2->builtin->dst_base_offset) >> + break; >> + >> + (*partitions)[l] = part2; >> + } >> + (*partitions)[l] = part1; >> + } >> >> so we want to sort [i, j[ after dst_base_offset. I realize you don't want >> to write a qsort helper for this but I can't wrap my head around the above >> in 5 minutes so ... please! > Hmm, I thought twice about this and now I believe stable sorting (thus > insertion sort) > is required here. Please see below for explanation. > >> >> You don't seem to check the sizes of the memsets given that they >> obviously cannot overlap(!?) > Yes, given it's quite special case transformation, I did add code > checking overlap cases. >> >> Also why handle memset and not memcpy/memmove or combinations of >> them (for sorting)? >> >> for () >> { >> p->a[i] = 0; >> p->c[i] = q->c[i]; >> p->b[i] = 0; >> } >> >> with a and b adjacent. I suppose p->c could be computed by a >> non-builtin partition as well. > Yes, the two memset builtin partitions can be merged in this case, but... >> So don't we want to see if dependences allow sorting all builtin >> partitions next to each other >> as much as possible? (maybe we do that already?) > The answer for this, above partition merging and use of qsort is no. > I think all the three are the same question here. For now we only do > topological sort for partitions. To maximize parallelism (either by merging > normal parallel partitions or merging builtin partitions) requires fine-tuned > sorting between partitions that doesn't dependence on each other. > In order to sort all memset/memcpy/memmove, we need check dependence > between all data references between different partitions. For example, I > created new test ldist-36.c illustrating sorting memcpy along with memset > would generate wrong code because dependence is broken. It's the same > for qsort. In extreme case, if the same array is set twice with different rhs > value, the order between the two sets needs to be preserved. Unfortunately, > qsort is unstable and could reorder different sets. This would break output > dependence. > At the point of this function, dependence graph is destroyed, we can't do > much in addition to special case handling for memset. Full solution would > require a customized topological sorting process. > > So, this updated patch keeps insertion sort with additional comment explaining > why. Also two test cases added showing when memset partitions should be > merged (we can't for now) and when memset partitions should not be merged. Hmm, I can factor out the sorting loop into a function, that might make it easier to read.
Thanks, bin > > Bootstrap and test. Is it OK? > > Thanks, > bin > > 2017-10-14 Bin Cheng <bin.ch...@arm.com> > > * tree-loop-distribution.c (tree-ssa-loop-ivopts.h): New header file. > (struct builtin_info): New fields. > (classify_builtin_1): Compute and record base and offset parts for > memset builtin partition by calling strip_offset. > (fuse_memset_builtins): New function. > (finalize_partitions): Fuse adjacent memset partitions by calling > above function. > * tree-ssa-loop-ivopts.c (strip_offset): Delete static declaration. > Expose the interface. > * tree-ssa-loop-ivopts.h (strip_offset): New declaration. > > gcc/testsuite/ChangeLog > 2017-10-14 Bin Cheng <bin.ch...@arm.com> > > * gcc.dg/tree-ssa/ldist-17.c: Adjust test string. > * gcc.dg/tree-ssa/ldist-32.c: New test. > * gcc.dg/tree-ssa/ldist-35.c: New test. > * gcc.dg/tree-ssa/ldist-36.c: New test.