Re: [PATCH] Fix PR84512

Rainer Orth Tue, 20 Mar 2018 13:16:03 -0700

Hi Tom,

> On 03/19/2018 10:11 AM, Richard Biener wrote:
>> On Fri, 16 Mar 2018, Tom de Vries wrote:
>>
>>> On 03/16/2018 12:55 PM, Richard Biener wrote:
>>>> On Fri, 16 Mar 2018, Tom de Vries wrote:
>>>>
>>>>> On 02/27/2018 01:42 PM, Richard Biener wrote:
>>>>>> Index: gcc/testsuite/gcc.dg/tree-ssa/pr84512.c
>>>>>> ===================================================================
>>>>>> --- gcc/testsuite/gcc.dg/tree-ssa/pr84512.c      (nonexistent)
>>>>>> +++ gcc/testsuite/gcc.dg/tree-ssa/pr84512.c      (working copy)
>>>>>> @@ -0,0 +1,15 @@
>>>>>> +/* { dg-do compile } */
>>>>>> +/* { dg-options "-O3 -fdump-tree-optimized" } */
>>>>>> +
>>>>>> +int foo()
>>>>>> +{
>>>>>> +  int a[10];
>>>>>> +  for(int i = 0; i < 10; ++i)
>>>>>> +    a[i] = i*i;
>>>>>> +  int res = 0;
>>>>>> +  for(int i = 0; i < 10; ++i)
>>>>>> +    res += a[i];
>>>>>> +  return res;
>>>>>> +}
>>>>>> +
>>>>>> +/* { dg-final { scan-tree-dump "return 285;" "optimized" } } */
>>>>>
>>>>> This fails for nvptx, because it doesn't have the required vector
>>>>> operations.
>>>>> To fix the fail, I've added requiring effective target vect_int_mult.
>>>>
>>>> On targets that do not vectorize you should see the scalar loops unrolled
>>>> instead.  Or do you have only one loop vectorized?
>>>
>>> Sort of. Loop vectorization has no effect, and the scalar loops are
>>> completely unrolled. But then slp vectorization vectorizes the stores.
>>>
>>> So at optimized we have:
>>> ...
>>>    MEM[(int *)&a] = { 0, 1 };
>>>    MEM[(int *)&a + 8B] = { 4, 9 };
>>>    MEM[(int *)&a + 16B] = { 16, 25 };
>>>    MEM[(int *)&a + 24B] = { 36, 49 };
>>>    MEM[(int *)&a + 32B] = { 64, 81 };
>>>    _6 = a[0];
>>>    _28 = a[1];
>>>    res_29 = _6 + _28;
>>>    _35 = a[2];
>>>    res_36 = res_29 + _35;
>>>    _42 = a[3];
>>>    res_43 = res_36 + _42;
>>>    _49 = a[4];
>>>    res_50 = res_43 + _49;
>>>    _56 = a[5];
>>>    res_57 = res_50 + _56;
>>>    _63 = a[6];
>>>    res_64 = res_57 + _63;
>>>    _70 = a[7];
>>>    res_71 = res_64 + _70;
>>>    _77 = a[8];
>>>    res_78 = res_71 + _77;
>>>    _2 = a[9];
>>>    res_11 = _2 + res_78;
>>>    a ={v} {CLOBBER};
>>>    return res_11;
>>> ...
>>>
>>> The stores and loads are eliminated by dse1 in the rtl phase, and in the end
>>> we have:
>>> ...
>>> .visible .func (.param.u32 %value_out) foo
>>> {
>>>          .reg.u32 %value;
>>>          .local .align 16 .b8 %frame_ar[48];
>>>          .reg.u64 %frame;
>>>          cvta.local.u64 %frame, %frame_ar;
>>>          mov.u32 %value, 285;
>>>          st.param.u32    [%value_out], %value;
>>>          ret;
>>> }
>>> ...
>>>
>>>> That's precisely
>>>> what the PR was about...  which means it isn't fixed for nvptx :/
>>>
>>> Indeed the assembly is not optimal, and would be optimal if we'd have
>>> optimal code at optimized.
>>>
>>> FWIW, using this patch we generate optimal code at optimized:
>>> ...
>>> diff --git a/gcc/passes.def b/gcc/passes.def
>>> index 3ebcfc30349..6b64f600c4a 100644
>>> --- a/gcc/passes.def
>>> +++ b/gcc/passes.def
>>> @@ -325,6 +325,7 @@ along with GCC; see the file COPYING3.  If not see
>>>         NEXT_PASS (pass_tracer);
>>>         NEXT_PASS (pass_thread_jumps);
>>>         NEXT_PASS (pass_dominator, false /* may_peel_loop_headers_p */);
>>> +      NEXT_PASS (pass_fre);
>>>         NEXT_PASS (pass_strlen);
>>>         NEXT_PASS (pass_thread_jumps);
>>>         NEXT_PASS (pass_vrp, false /* warn_array_bounds_p */);
>>> ...
>>>
>>> and we get:
>>> ...
>>> .visible .func (.param.u32 %value_out) foo
>>> {
>>>          .reg.u32 %value;
>>>          mov.u32 %value, 285;
>>>          st.param.u32    [%value_out], %value;
>>>          ret;
>>> }
>>> ...
>>>
>>> I could file a missing optimization PR for nvptx, but I'm not sure where
>>> this should be fixed.
>>
>> Ah, yeah... the usual issue then.
>>
>> Can you please XFAIL the test on nvptx instead of requiring vect_int_mult?
>>
>
> Done.
>
> Committed as attached.


this caused the test to FAIL on 64-bit (only) sparc-sun-solaris2.11:

FAIL: gcc.dg/tree-ssa/pr84512.c scan-tree-dump optimized "return 285;"

where it was UNSUPPORTED before.

The dump has

;; Function foo (foo, funcdef_no=0, decl_uid=1557, cgraph_uid=0, symbol_order=0)

foo ()
{
  int res;
  int a[10];
  int _2;
  int _6;
  int _28;
  int _35;
  int _42;
  int _49;
  int _56;
  int _63;
  int _70;
  int _77;

  <bb 2> [local count: 97603132]:
  MEM[(int *)&a] = { 0, 1 };
  MEM[(int *)&a + 8B] = { 4, 9 };
  MEM[(int *)&a + 16B] = { 16, 25 };
  MEM[(int *)&a + 24B] = { 36, 49 };
  MEM[(int *)&a + 32B] = { 64, 81 };
  _6 = a[0];
  _28 = a[1];
  res_29 = _6 + _28;
  _35 = a[2];
  res_36 = res_29 + _35;
  _42 = a[3];
  res_43 = res_36 + _42;
  _49 = a[4];
  res_50 = res_43 + _49;
  _56 = a[5];
  res_57 = res_50 + _56;
  _63 = a[6];
  res_64 = res_57 + _63;
  _70 = a[7];
  res_71 = res_64 + _70;
  _77 = a[8];
  res_78 = res_71 + _77;
  _2 = a[9];
  res_11 = _2 + res_78;
  a ={v} {CLOBBER};
  return res_11;

}
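
This is the same statement sequence as in the nvptx dump quoted above: the
SLP-vectorized stores and the scalar loads are still present at optimized,
so the scan for "return 285;" cannot match. As a sanity check on the
expected constant, here is a minimal sketch that evaluates the dump by hand.
The function name sum_from_dump is made up for illustration and is not part
of the test case:

```c
/* Hand-evaluation of the "optimized" dump above.  Hypothetical helper,
   not part of gcc.dg/tree-ssa/pr84512.c.  */
int
sum_from_dump (void)
{
  /* The five two-element vector stores, written out as one initializer.  */
  int a[10] = { 0, 1, 4, 9, 16, 25, 36, 49, 64, 81 };

  /* The scalar loads and adds, in the order the dump performs them.  */
  int res = a[0] + a[1];	/* res_29 */
  res += a[2];			/* res_36 */
  res += a[3];			/* res_43 */
  res += a[4];			/* res_50 */
  res += a[5];			/* res_57 */
  res += a[6];			/* res_64 */
  res += a[7];			/* res_71 */
  res += a[8];			/* res_78 */
  res += a[9];			/* res_11 */
  return res;
}
```

The chain sums the squares of 0 through 9, which is the 285 the test scans
for, so the value is computable at compile time; what is missing here is
only the forwarding of the stored constants to the loads before the dump.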

        Rainer

-- 
-----------------------------------------------------------------------------
Rainer Orth, Center for Biotechnology, Bielefeld University
