Re: [ARM] implement division using vrecpe/vrecps with -funsafe-math-optimizations

Prathamesh Kulkarni Mon, 30 May 2016 01:23:29 -0700

On 23 May 2016 at 14:59, Prathamesh Kulkarni
<prathamesh.kulka...@linaro.org> wrote:
> On 5 February 2016 at 18:40, Prathamesh Kulkarni
> <prathamesh.kulka...@linaro.org> wrote:
>> On 4 February 2016 at 16:31, Ramana Radhakrishnan
>> <ramana....@googlemail.com> wrote:
>>> On Sun, Jan 17, 2016 at 9:06 AM, Prathamesh Kulkarni
>>> <prathamesh.kulka...@linaro.org> wrote:
>>>> On 31 July 2015 at 15:04, Ramana Radhakrishnan
>>>> <ramana.radhakrish...@foss.arm.com> wrote:
>>>>>
>>>>>
>>>>> On 29/07/15 11:09, Prathamesh Kulkarni wrote:
>>>>>> Hi,
>>>>>> This patch tries to implement division with multiplication by
>>>>>> reciprocal using vrecpe/vrecps
>>>>>> with -funsafe-math-optimizations and -freciprocal-math enabled.
>>>>>> Tested on arm-none-linux-gnueabihf using qemu.
>>>>>> OK for trunk ?
>>>>>>
>>>>>> Thank you,
>>>>>> Prathamesh
>>>>>>
>>>>>
>>>>> I've tried this in the past and never been convinced that 2 iterations 
>>>>> are enough to get to stability with this given that the results are only 
>>>>> precise for 8 bits / iteration. Thus I've always believed you need 3 
>>>>> iterations rather than 2 at which point I've never been sure that it's 
>>>>> worth it. So the testing that you've done with this currently is not 
>>>>> enough for this to go into the tree.
>>>>>
>>>>> I'd like this to be tested on a couple of different AArch32 
>>>>> implementations with a wider range of inputs to verify that the results 
>>>>> are acceptable as well as running something like SPEC2k(6) with atleast 
>>>>> one iteration to ensure correctness.
>>>> Hi,
>>>> I got results of SPEC2k6 fp benchmarks:
>>>> a15: +0.64% overall, 481.wrf: +6.46%
>>>> a53: +0.21% overall, 416.gamess: -1.39%, 481.wrf: +6.76%
>>>> a57: +0.35% overall, 481.wrf: +3.84%
>>>> The other benchmarks had (almost) identical results.
>>>
>>> Thanks for the benchmarking results -  Please repost the patch with
>>> the changes that I had requested in my previous review - given it is
>>> now stage4 , I would rather queue changes like this for stage1 now.
>> Hi,
>> Please find the updated patch attached.
>> It passes testsuite for arm-none-linux-gnueabi, arm-none-linux-gnueabihf and
>> arm-none-eabi.
>> However the test-case added in the patch (neon-vect-div-1.c) fails to
>> get vectorized at -O2
>> for armeb-none-linux-gnueabihf.
>> Charles suggested me to try with -O3, which worked.
>> It appears the test-case fails to get vectorized with
>> -fvect-cost-model=cheap (which is default enabled at -O2)
>> and passes for -fno-vect-cost-model / -fvect-cost-model=dynamic
>>
>> I can't figure out why it fails -fvect-cost-model=cheap.
>> From the vect dump (attached):
>> neon-vect-div-1.c:12:3: note: Setting misalignment to -1.
>> neon-vect-div-1.c:12:3: note: not vectorized: unsupported unaligned load.*_9
> Hi,
> I think I have some idea why the test-case fails attached with patch
> fail to get vectorized on armeb with -O2.
>
> Issue with big endian vectorizer:
> The patch does not cause regressions on big endian vectorizer but
> fails to vectorize the test-cases attached with the patch, while they
> get vectorized on
> litttle-endian.
> Fails with armeb with the following message in dump:
> note: not vectorized: unsupported unaligned load.*_9
>
> The behavior of big and little endian vectorizer seems to be different
> in arm_builtin_support_vector_misalignment() which overrides the hook
> targetm.vectorize.support_vector_misalignment().
>
> targetm.vectorize.support_vector_misalignment is called by
> vect_supportable_dr_alignment () which in turn is called
> by verify_data_refs_alignment ().
>
> Execution upto following condition is common between arm and armeb
> in vect_supportable_dr_alignment():
>
> if ((TYPE_USER_ALIGN (type) && !is_packed)
>       || targetm.vectorize.support_vector_misalignment (mode, type,
>                                             DR_MISALIGNMENT (dr), is_packed))
>         /* Can't software pipeline the loads, but can at least do them.  */
>         return dr_unaligned_supported;
>
> For little endian case:
> arm_builtin_support_vector_misalignment() is called with
> V2SF mode and misalignment == -1, and the following condition
> becomes true:
> /* If the misalignment is unknown, we should be able to handle the access
>          so long as it is not to a member of a packed data structure.  */
>   if (misalignment == -1)
>     return true;
>
> Since the hook returned true we enter the condition above in
> vect_supportable_dr_alignment() and return dr_unaligned_supported;
>
> For big-endian:
> arm_builtin_support_vector_misalignment() is called with V2SF mode.
> The following condition that gates the entire function body fails:
>  if (TARGET_NEON && !BYTES_BIG_ENDIAN && unaligned_access)
> and the default hook gets called with V2SF mode and the default hook
> returns false because
> movmisalign_optab does not exist for V2SF mode.
>
> So the condition above in vect_supportable_dr_alignment() fails
> and we come here:
>  /* Unsupported.  */
> return dr_unaligned_unsupported;
>
> And hence we get the unaligned load not supported message in the dump
> for armeb in verify_data_ref_alignment ():
>
> static bool
> verify_data_ref_alignment (data_reference_p dr)
> {
>   enum dr_alignment_support supportable_dr_alignment
>     = vect_supportable_dr_alignment (dr, false);
>   if (!supportable_dr_alignment)
>     {
>       if (dump_enabled_p ())
>         {
>           if (DR_IS_READ (dr))
>             dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>                              "not vectorized: unsupported unaligned load.");
>
> With -O3, the test-cases vectorize for armeb, because loop peeling for 
> alignment
> is turned on.
> The above behavior is also reproducible with test-case which is
> irrelevant to the patch.
> for instance, we get the same unsupported unaligned load for following
> test-case (replaced / with +)
>
> void
> foo (int len, float * __restrict p, float *__restrict x)
> {
>   len = len & ~31;
>   for (int i = 0; i < len; i++)
>     p[i] = p[i] + x[i];
> }
> Is the patch OK to commit after bootstrap+test ?
ping https://gcc.gnu.org/ml/gcc-patches/2016-05/msg01765.html


Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
>>
>> Thanks,
>> Prathamesh
>>>
>>> Thanks,
>>> Ramana
>>>
>>>>
>>>> Thanks,
>>>> Prathamesh
>>>>>
>>>>>
>>>>> moving on to the patches.
>>>>>
>>>>>> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
>>>>>> index 654d9d5..28c2e2a 100644
>>>>>> --- a/gcc/config/arm/neon.md
>>>>>> +++ b/gcc/config/arm/neon.md
>>>>>> @@ -548,6 +548,32 @@
>>>>>>                      (const_string "neon_mul_<V_elem_ch><q>")))]
>>>>>>  )
>>>>>>
>>>>>
>>>>> Please add a comment here.
>>>>>
>>>>>> +(define_expand "div<mode>3"
>>>>>> +  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
>>>>>> +        (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w")
>>>>>> +               (match_operand:VCVTF 2 "s_register_operand" "w")))]
>>>>>
>>>>> I want to double check that this doesn't collide with Alan's patches for 
>>>>> FP16 especially if he reuses the VCVTF iterator for all the vcvt f16 
>>>>> cases.
>>>>>
>>>>>> +  "TARGET_NEON && flag_unsafe_math_optimizations && 
>>>>>> flag_reciprocal_math"
>>>>>> +  {
>>>>>> +    rtx rec = gen_reg_rtx (<MODE>mode);
>>>>>> +    rtx vrecps_temp = gen_reg_rtx (<MODE>mode);
>>>>>> +
>>>>>> +    /* Reciprocal estimate */
>>>>>> +    emit_insn (gen_neon_vrecpe<mode> (rec, operands[2]));
>>>>>> +
>>>>>> +    /* Perform 2 iterations of Newton-Raphson method for better 
>>>>>> accuracy */
>>>>>> +    for (int i = 0; i < 2; i++)
>>>>>> +      {
>>>>>> +     emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));
>>>>>> +     emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));
>>>>>> +      }
>>>>>> +
>>>>>> +    /* We now have reciprocal in rec, perform operands[0] = operands[1] 
>>>>>> * rec */
>>>>>> +    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));
>>>>>> +    DONE;
>>>>>> +  }
>>>>>> +)
>>>>>> +
>>>>>> +
>>>>>>  (define_insn "mul<mode>3add<mode>_neon"
>>>>>>    [(set (match_operand:VDQW 0 "s_register_operand" "=w")
>>>>>>          (plus:VDQW (mult:VDQW (match_operand:VDQW 2 
>>>>>> "s_register_operand" "w")
>>>>>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-1.c 
>>>>>> b/gcc/testsuite/gcc.target/arm/vect-div-1.c
>>>>>> new file mode 100644
>>>>>> index 0000000..e562ef3
>>>>>> --- /dev/null
>>>>>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-1.c
>>>>>> @@ -0,0 +1,14 @@
>>>>>> +/* { dg-do compile } */
>>>>>> +/* { dg-require-effective-target arm_v8_neon_ok } */
>>>>>> +/* { dg-options "-O2 -funsafe-math-optimizations -ftree-vectorize 
>>>>>> -fdump-tree-vect-all" } */
>>>>>> +/* { dg-add-options arm_v8_neon } */
>>>>>
>>>>> No this is wrong.
>>>>>
>>>>> What is armv8 specific about this test ? This is just like another test 
>>>>> that is for Neon. vrecpe / vrecps are not instructions that were 
>>>>> introduced in the v8 version of the architecture. They've existed in the 
>>>>> base Neon instruction set. The code generation above in the patterns will 
>>>>> be enabled when TARGET_NEON is true which can happen when -mfpu=neon 
>>>>> -mfloat-abi={softfp/hard} is true.
>>>>>
>>>>>> +
>>>>>> +void
>>>>>> +foo (int len, float * __restrict p, float *__restrict x)
>>>>>> +{
>>>>>> +  len = len & ~31;
>>>>>> +  for (int i = 0; i < len; i++)
>>>>>> +    p[i] = p[i] / x[i];
>>>>>> +}
>>>>>> +
>>>>>> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } 
>>>>>> */
>>>>>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-2.c 
>>>>>> b/gcc/testsuite/gcc.target/arm/vect-div-2.c
>>>>>> new file mode 100644
>>>>>> index 0000000..8e15d0a
>>>>>> --- /dev/null
>>>>>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-2.c
>>>>>> @@ -0,0 +1,14 @@
>>>>>> +/* { dg-do compile } */
>>>>>> +/* { dg-require-effective-target arm_v8_neon_ok } */
>>>>>
>>>>> And likewise.
>>>>>
>>>>>> +/* { dg-options "-O2 -funsafe-math-optimizations -fno-reciprocal-math 
>>>>>> -ftree-vectorize -fdump-tree-vect-all" } */
>>>>>> +/* { dg-add-options arm_v8_neon } */
>>>>>> +
>>>>>> +void
>>>>>> +foo (int len, float * __restrict p, float *__restrict x)
>>>>>> +{
>>>>>> +  len = len & ~31;
>>>>>> +  for (int i = 0; i < len; i++)
>>>>>> +    p[i] = p[i] / x[i];
>>>>>> +}
>>>>>> +
>>>>>> +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } 
>>>>>> */
>>>>>
>>>>>
>>>>> regards
>>>>> Ramana

Re: [ARM] implement division using vrecpe/vrecps with -funsafe-math-optimizations

Reply via email to