On Tue, Oct 17, 2017 at 4:28 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Mon, Oct 16, 2017 at 5:27 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Mon, Oct 16, 2017 at 5:00 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>> On Mon, Oct 16, 2017 at 2:56 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>>>> On Thu, Oct 12, 2017 at 2:43 PM, Richard Biener
>>>> <richard.guent...@gmail.com> wrote:
>>>>> On Thu, Oct 5, 2017 at 3:17 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>>>> Hi,
>>>>>> This patch merges adjacent memset builtin partitions if possible.  It is
>>>>>> a useful special case optimization transforming below code:
>>>>>>
>>>>>> #define M (256)
>>>>>> #define N (512)
>>>>>>
>>>>>> struct st
>>>>>> {
>>>>>>   int a[M][N];
>>>>>>   int c[M];
>>>>>>   int b[M][N];
>>>>>> };
>>>>>>
>>>>>> void
>>>>>> foo (struct st *p)
>>>>>> {
>>>>>>   for (unsigned i = 0; i < M; ++i)
>>>>>>     {
>>>>>>       p->c[i] = 0;
>>>>>>       for (unsigned j = N; j > 0; --j)
>>>>>>         {
>>>>>>           p->a[i][j - 1] = 0;
>>>>>>           p->b[i][j - 1] = 0;
>>>>>>         }
>>>>>>     }
>>>>>>
>>>>>> into a single memset function call, rather than three calls initializing
>>>>>> the structure field by field.
>>>>>>
>>>>>> Bootstrap and test in patch set on x86_64 and AArch64, is it OK?
>>>>>
>>>>> +      /* Insertion sort is good enough for the small sub-array.  */
>>>>> +      for (k = i + 1; k < j; ++k)
>>>>> +       {
>>>>> +         part1 = (*partitions)[k];
>>>>> +         for (l = k; l > i; --l)
>>>>> +           {
>>>>> +             part2 = (*partitions)[l - 1];
>>>>> +             if (part1->builtin->dst_base_offset
>>>>> +                   >= part2->builtin->dst_base_offset)
>>>>> +               break;
>>>>> +
>>>>> +             (*partitions)[l] = part2;
>>>>> +           }
>>>>> +         (*partitions)[l] = part1;
>>>>> +       }
>>>>>
>>>>> so we want to sort [i, j[ after dst_base_offset.  I realize you don't want
>>>>> to write a qsort helper for this but I can't wrap my head around the above
>>>>> in 5 minutes so ... please!
>>>> Hmm, I thought twice about this and now I believe stable sorting (thus
>>>> insertion sort)
>>>> is required here.  Please see below for explanation.
>>>>
>>>>>
>>>>> You don't seem to check the sizes of the memsets given that they
>>>>> obviously cannot overlap(!?)
>>>> Yes, given it's quite special case transformation, I did add code
>>>> checking overlap cases.
>>>>>
>>>>> Also why handle memset and not memcpy/memmove or combinations of
>>>>> them (for sorting)?
>>>>>
>>>>>   for ()
>>>>>    {
>>>>>       p->a[i] = 0;
>>>>>       p->c[i] = q->c[i];
>>>>>       p->b[i] = 0;
>>>>>    }
>>>>>
>>>>> with a and b adjacent.  I suppose p->c could be computed by a
>>>>> non-builtin partition as well.
>>>> Yes, the two memset builtin partitions can be merged in this case, but...
>>>>> So don't we want to see if dependences allow sorting all builtin
>>>>> partitions next to each other
>>>>> as much as possible?  (maybe we do that already?)
>>>> The answer for this, above partition merging and use of qsort is no.
>>>> I think all the three are the same question here.  For now we only do
>>>> topological sort for partitions.  To maximize parallelism (either by 
>>>> merging
>>>> normal parallel partitions or merging builtin partitions) requires 
>>>> fine-tuned
>>>> sorting between partitions that doesn't dependence on each other.
>>>> In order to sort all memset/memcpy/memmove, we need check dependence
>>>> between all data references between different partitions.  For example, I
>>>> created new test ldist-36.c illustrating sorting memcpy along with memset
>>>> would generate wrong code because dependence is broken.  It's the same
>>>> for qsort.  In extreme case, if the same array is set twice with different 
>>>> rhs
>>>> value, the order between the two sets needs to be preserved.  
>>>> Unfortunately,
>>>> qsort is unstable and could reorder different sets.  This would break 
>>>> output
>>>> dependence.
>>>> At the point of this function, dependence graph is destroyed, we can't do
>>>> much in addition to special case handling for memset.  Full solution would
>>>> require a customized topological sorting process.
>>>>
>>>> So, this updated patch keeps insertion sort with additional comment 
>>>> explaining
>>>> why.  Also two test cases added showing when memset partitions should be
>>>> merged (we can't for now) and when memset partitions should not be merged.
>>> Hmm, I can factor out the sorting loop into a function, that might
>>> make it easier
>>> to read.
>> Okay, I will use std::stable_sort in this case, that's exactly what we
>> want for this case.
> Hi Richard,
> This is the 3rd version of patch, it uses std::stable_sort which gives
> stable sort and code simplicity.
> Bootstrap and test.  Is it OK?

Ok.

Thanks,
Richard.

> Thanks,
> bin
> 2017-10-17  Bin Cheng  <bin.ch...@arm.com>
>
>     * tree-loop-distribution.c (INCLUDE_ALGORITHM): New header file.
>     (tree-ssa-loop-ivopts.h): New header file.
>     (struct builtin_info): New fields.
>     (classify_builtin_1): Compute and record base and offset parts for
>     memset builtin partition by calling strip_offset.
>     (offset_cmp, fuse_memset_builtins): New functions.
>     (finalize_partitions): Fuse adjacent memset partitions by calling
>     above function.
>     * tree-ssa-loop-ivopts.c (strip_offset): Delete static declaration.
>     Expose the interface.
>     * tree-ssa-loop-ivopts.h (strip_offset): New declaration.
>
> gcc/testsuite/ChangeLog
> 2017-10-17  Bin Cheng  <bin.ch...@arm.com>
>
>     * gcc.dg/tree-ssa/ldist-17.c: Adjust test string.
>     * gcc.dg/tree-ssa/ldist-32.c: New test.
>     * gcc.dg/tree-ssa/ldist-35.c: New test.
>     * gcc.dg/tree-ssa/ldist-36.c: New test.

Reply via email to