Re: [ARM] implement division using vrecpe/vrecps with -funsafe-math-optimizations

Ramana Radhakrishnan Thu, 04 Feb 2016 03:02:03 -0800

On Sun, Jan 17, 2016 at 9:06 AM, Prathamesh Kulkarni
<prathamesh.kulka...@linaro.org> wrote:
> On 31 July 2015 at 15:04, Ramana Radhakrishnan
> <ramana.radhakrish...@foss.arm.com> wrote:
>>
>>
>> On 29/07/15 11:09, Prathamesh Kulkarni wrote:
>>> Hi,
>>> This patch tries to implement division with multiplication by
>>> reciprocal using vrecpe/vrecps
>>> with -funsafe-math-optimizations and -freciprocal-math enabled.
>>> Tested on arm-none-linux-gnueabihf using qemu.
>>> OK for trunk ?
>>>
>>> Thank you,
>>> Prathamesh
>>>
>>
>> I've tried this in the past and never been convinced that 2 iterations are 
>> enough to get to stability with this given that the results are only precise 
>> for 8 bits / iteration. Thus I've always believed you need 3 iterations 
>> rather than 2 at which point I've never been sure that it's worth it. So the 
>> testing that you've done with this currently is not enough for this to go 
>> into the tree.
>>
>> I'd like this to be tested on a couple of different AArch32 implementations 
>> with a wider range of inputs to verify that the results are acceptable as 
>> well as running something like SPEC2k(6) with atleast one iteration to 
>> ensure correctness.
> Hi,
> I got results of SPEC2k6 fp benchmarks:
> a15: +0.64% overall, 481.wrf: +6.46%
> a53: +0.21% overall, 416.gamess: -1.39%, 481.wrf: +6.76%
> a57: +0.35% overall, 481.wrf: +3.84%
> The other benchmarks had (almost) identical results.


Thanks for the benchmarking results -  Please repost the patch with
the changes that I had requested in my previous review - given it is
now stage4 , I would rather queue changes like this for stage1 now.

Thanks,
Ramana

>
> Thanks,
> Prathamesh
>>
>>
>> moving on to the patches.
>>
>>> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
>>> index 654d9d5..28c2e2a 100644
>>> --- a/gcc/config/arm/neon.md
>>> +++ b/gcc/config/arm/neon.md
>>> @@ -548,6 +548,32 @@
>>>                      (const_string "neon_mul_<V_elem_ch><q>")))]
>>>  )
>>>
>>
>> Please add a comment here.
>>
>>> +(define_expand "div<mode>3"
>>> +  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
>>> +        (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w")
>>> +               (match_operand:VCVTF 2 "s_register_operand" "w")))]
>>
>> I want to double check that this doesn't collide with Alan's patches for 
>> FP16 especially if he reuses the VCVTF iterator for all the vcvt f16 cases.
>>
>>> +  "TARGET_NEON && flag_unsafe_math_optimizations && flag_reciprocal_math"
>>> +  {
>>> +    rtx rec = gen_reg_rtx (<MODE>mode);
>>> +    rtx vrecps_temp = gen_reg_rtx (<MODE>mode);
>>> +
>>> +    /* Reciprocal estimate */
>>> +    emit_insn (gen_neon_vrecpe<mode> (rec, operands[2]));
>>> +
>>> +    /* Perform 2 iterations of Newton-Raphson method for better accuracy */
>>> +    for (int i = 0; i < 2; i++)
>>> +      {
>>> +     emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));
>>> +     emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));
>>> +      }
>>> +
>>> +    /* We now have reciprocal in rec, perform operands[0] = operands[1] * 
>>> rec */
>>> +    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));
>>> +    DONE;
>>> +  }
>>> +)
>>> +
>>> +
>>>  (define_insn "mul<mode>3add<mode>_neon"
>>>    [(set (match_operand:VDQW 0 "s_register_operand" "=w")
>>>          (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" 
>>> "w")
>>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-1.c 
>>> b/gcc/testsuite/gcc.target/arm/vect-div-1.c
>>> new file mode 100644
>>> index 0000000..e562ef3
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-1.c
>>> @@ -0,0 +1,14 @@
>>> +/* { dg-do compile } */
>>> +/* { dg-require-effective-target arm_v8_neon_ok } */
>>> +/* { dg-options "-O2 -funsafe-math-optimizations -ftree-vectorize 
>>> -fdump-tree-vect-all" } */
>>> +/* { dg-add-options arm_v8_neon } */
>>
>> No this is wrong.
>>
>> What is armv8 specific about this test ? This is just like another test that 
>> is for Neon. vrecpe / vrecps are not instructions that were introduced in 
>> the v8 version of the architecture. They've existed in the base Neon 
>> instruction set. The code generation above in the patterns will be enabled 
>> when TARGET_NEON is true which can happen when -mfpu=neon 
>> -mfloat-abi={softfp/hard} is true.
>>
>>> +
>>> +void
>>> +foo (int len, float * __restrict p, float *__restrict x)
>>> +{
>>> +  len = len & ~31;
>>> +  for (int i = 0; i < len; i++)
>>> +    p[i] = p[i] / x[i];
>>> +}
>>> +
>>> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
>>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-2.c 
>>> b/gcc/testsuite/gcc.target/arm/vect-div-2.c
>>> new file mode 100644
>>> index 0000000..8e15d0a
>>> --- /dev/null
>>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-2.c
>>> @@ -0,0 +1,14 @@
>>> +/* { dg-do compile } */
>>> +/* { dg-require-effective-target arm_v8_neon_ok } */
>>
>> And likewise.
>>
>>> +/* { dg-options "-O2 -funsafe-math-optimizations -fno-reciprocal-math 
>>> -ftree-vectorize -fdump-tree-vect-all" } */
>>> +/* { dg-add-options arm_v8_neon } */
>>> +
>>> +void
>>> +foo (int len, float * __restrict p, float *__restrict x)
>>> +{
>>> +  len = len & ~31;
>>> +  for (int i = 0; i < len; i++)
>>> +    p[i] = p[i] / x[i];
>>> +}
>>> +
>>> +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */
>>
>>
>> regards
>> Ramana

Re: [ARM] implement division using vrecpe/vrecps with -funsafe-math-optimizations

Reply via email to