On Mon, Oct 16, 2017 at 5:00 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
> On Mon, Oct 16, 2017 at 2:56 PM, Bin.Cheng <amker.ch...@gmail.com> wrote:
>> On Thu, Oct 12, 2017 at 2:43 PM, Richard Biener
>> <richard.guent...@gmail.com> wrote:
>>> On Thu, Oct 5, 2017 at 3:17 PM, Bin Cheng <bin.ch...@arm.com> wrote:
>>>> Hi,
>>>> This patch merges adjacent memset builtin partitions if possible.  It is
>>>> a useful special case optimization transforming below code:
>>>>
>>>> #define M (256)
>>>> #define N (512)
>>>>
>>>> struct st
>>>> {
>>>>   int a[M][N];
>>>>   int c[M];
>>>>   int b[M][N];
>>>> };
>>>>
>>>> void
>>>> foo (struct st *p)
>>>> {
>>>>   for (unsigned i = 0; i < M; ++i)
>>>>     {
>>>>       p->c[i] = 0;
>>>>       for (unsigned j = N; j > 0; --j)
>>>>         {
>>>>           p->a[i][j - 1] = 0;
>>>>           p->b[i][j - 1] = 0;
>>>>         }
>>>>     }
>>>>
>>>> into a single memset function call, rather than three calls initializing
>>>> the structure field by field.
>>>>
>>>> Bootstrap and test in patch set on x86_64 and AArch64, is it OK?
>>>
>>> +      /* Insertion sort is good enough for the small sub-array.  */
>>> +      for (k = i + 1; k < j; ++k)
>>> +       {
>>> +         part1 = (*partitions)[k];
>>> +         for (l = k; l > i; --l)
>>> +           {
>>> +             part2 = (*partitions)[l - 1];
>>> +             if (part1->builtin->dst_base_offset
>>> +                   >= part2->builtin->dst_base_offset)
>>> +               break;
>>> +
>>> +             (*partitions)[l] = part2;
>>> +           }
>>> +         (*partitions)[l] = part1;
>>> +       }
>>>
>>> so we want to sort [i, j[ after dst_base_offset.  I realize you don't want
>>> to write a qsort helper for this but I can't wrap my head around the above
>>> in 5 minutes so ... please!
>> Hmm, I thought twice about this and now I believe stable sorting (thus
>> insertion sort)
>> is required here.  Please see below for explanation.
>>
>>>
>>> You don't seem to check the sizes of the memsets given that they
>>> obviously cannot overlap(!?)
>> Yes, given it's quite special case transformation, I did add code
>> checking overlap cases.
>>>
>>> Also why handle memset and not memcpy/memmove or combinations of
>>> them (for sorting)?
>>>
>>>   for ()
>>>    {
>>>       p->a[i] = 0;
>>>       p->c[i] = q->c[i];
>>>       p->b[i] = 0;
>>>    }
>>>
>>> with a and b adjacent.  I suppose p->c could be computed by a
>>> non-builtin partition as well.
>> Yes, the two memset builtin partitions can be merged in this case, but...
>>> So don't we want to see if dependences allow sorting all builtin
>>> partitions next to each other
>>> as much as possible?  (maybe we do that already?)
>> The answer for this, above partition merging and use of qsort is no.
>> I think all the three are the same question here.  For now we only do
>> topological sort for partitions.  To maximize parallelism (either by merging
>> normal parallel partitions or merging builtin partitions) requires fine-tuned
>> sorting between partitions that doesn't dependence on each other.
>> In order to sort all memset/memcpy/memmove, we need check dependence
>> between all data references between different partitions.  For example, I
>> created new test ldist-36.c illustrating sorting memcpy along with memset
>> would generate wrong code because dependence is broken.  It's the same
>> for qsort.  In extreme case, if the same array is set twice with different 
>> rhs
>> value, the order between the two sets needs to be preserved.  Unfortunately,
>> qsort is unstable and could reorder different sets.  This would break output
>> dependence.
>> At the point of this function, dependence graph is destroyed, we can't do
>> much in addition to special case handling for memset.  Full solution would
>> require a customized topological sorting process.
>>
>> So, this updated patch keeps insertion sort with additional comment 
>> explaining
>> why.  Also two test cases added showing when memset partitions should be
>> merged (we can't for now) and when memset partitions should not be merged.
> Hmm, I can factor out the sorting loop into a function, that might
> make it easier
> to read.
Okay, I will use std::stable_sort in this case, that's exactly what we
want for this case.

Thanks,
bin

Reply via email to