Re: [ARM] implement division using vrecpe/vrecps with -funsafe-math-optimizations

Prathamesh Kulkarni Fri, 10 Jun 2016 01:57:13 -0700

On 7 June 2016 at 14:07, Ramana Radhakrishnan <ramana....@googlemail.com> wrote:
>>> Please find the updated patch attached.
>>> It passes testsuite for arm-none-linux-gnueabi, arm-none-linux-gnueabihf and
>>> arm-none-eabi.
>>> However the test-case added in the patch (neon-vect-div-1.c) fails to
>>> get vectorized at -O2
>>> for armeb-none-linux-gnueabihf.
>>> Charles suggested me to try with -O3, which worked.
>>> It appears the test-case fails to get vectorized with
>>> -fvect-cost-model=cheap (which is default enabled at -O2)
>>> and passes for -fno-vect-cost-model / -fvect-cost-model=dynamic
>>>
>>> I can't figure out why it fails -fvect-cost-model=cheap.
>>> From the vect dump (attached):
>>> neon-vect-div-1.c:12:3: note: Setting misalignment to -1.
>>> neon-vect-div-1.c:12:3: note: not vectorized: unsupported unaligned load.*_9
>> Hi,
>> I think I have some idea why the test-case fails attached with patch
>> fail to get vectorized on armeb with -O2.
>>
>> Issue with big endian vectorizer:
>> The patch does not cause regressions on big endian vectorizer but
>> fails to vectorize the test-cases attached with the patch, while they
>> get vectorized on
>> litttle-endian.
>> Fails with armeb with the following message in dump:
>> note: not vectorized: unsupported unaligned load.*_9
>>
>> The behavior of big and little endian vectorizer seems to be different
>> in arm_builtin_support_vector_misalignment() which overrides the hook
>> targetm.vectorize.support_vector_misalignment().
>>
>> targetm.vectorize.support_vector_misalignment is called by
>> vect_supportable_dr_alignment () which in turn is called
>> by verify_data_refs_alignment ().
>>
>> Execution upto following condition is common between arm and armeb
>> in vect_supportable_dr_alignment():
>>
>> if ((TYPE_USER_ALIGN (type) && !is_packed)
>>       || targetm.vectorize.support_vector_misalignment (mode, type,
>>                                             DR_MISALIGNMENT (dr), is_packed))
>>         /* Can't software pipeline the loads, but can at least do them.  */
>>         return dr_unaligned_supported;
>>
>> For little endian case:
>> arm_builtin_support_vector_misalignment() is called with
>> V2SF mode and misalignment == -1, and the following condition
>> becomes true:
>> /* If the misalignment is unknown, we should be able to handle the access
>>          so long as it is not to a member of a packed data structure.  */
>>   if (misalignment == -1)
>>     return true;
>>
>> Since the hook returned true we enter the condition above in
>> vect_supportable_dr_alignment() and return dr_unaligned_supported;
>>
>> For big-endian:
>> arm_builtin_support_vector_misalignment() is called with V2SF mode.
>> The following condition that gates the entire function body fails:
>>  if (TARGET_NEON && !BYTES_BIG_ENDIAN && unaligned_access)
>> and the default hook gets called with V2SF mode and the default hook
>> returns false because
>> movmisalign_optab does not exist for V2SF mode.
>>
>> So the condition above in vect_supportable_dr_alignment() fails
>> and we come here:
>>  /* Unsupported.  */
>> return dr_unaligned_unsupported;
>>
>> And hence we get the unaligned load not supported message in the dump
>> for armeb in verify_data_ref_alignment ():
>>
>> static bool
>> verify_data_ref_alignment (data_reference_p dr)
>> {
>>   enum dr_alignment_support supportable_dr_alignment
>>     = vect_supportable_dr_alignment (dr, false);
>>   if (!supportable_dr_alignment)
>>     {
>>       if (dump_enabled_p ())
>>         {
>>           if (DR_IS_READ (dr))
>>             dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>                              "not vectorized: unsupported unaligned load.");
>>
>> With -O3, the test-cases vectorize for armeb, because loop peeling for 
>> alignment
>> is turned on.
>> The above behavior is also reproducible with test-case which is
>> irrelevant to the patch.
>> for instance, we get the same unsupported unaligned load for following
>> test-case (replaced / with +)
>>
>> void
>> foo (int len, float * __restrict p, float *__restrict x)
>> {
>>   len = len & ~31;
>>   for (int i = 0; i < len; i++)
>>     p[i] = p[i] + x[i];
>> }
>> Is the patch OK to commit after bootstrap+test ?
>
>
> Thanks for the analysis - all the test needs is an additional marker
> to skip it on armeb (is there a helper for misaligned loads from the
> vectorizer ? ) - Ah probably vect_hw_misalign is sufficient for your
> usecase and you want to appropriately fix it for little endian arm
> with neon support enabled.
Hi,
I added dg-require-effective-target vect_hw_misalign to the tests in the patch,
and modified vect_hw_misalign to return true for little-endian arm configs
with neon support enabled. The patch makes the tests unsupported for
armeb.
Does it look correct ?


Unfortunately the change to vect_hw_misalign breaks gcc.dg/vect/vect-align-1.c,
which were passing before:
XPASS: gcc.dg/vect/vect-align-1.c scan-tree-dump-times vect "Alignment
of access forced using versioning" 1
FAIL: gcc.dg/vect/vect-align-1.c scan-tree-dump-times vect
"Vectorizing an unaligned access" 1

I am not sure how to fix this and would be grateful for suggestions.

Thanks,
Prathamesh
>
> From the patch.
>
>>>+   && flag_unsafe_math_optimizations && flag_reciprocal_math"
>
> Why do we need flag_unsafe_math_optimizations && flag_reciprocal_math
> ? flag_unsafe_math_optimizations should be sufficient since it enables
> flag_reciprocal_math - the reason for flag_unsafe_math_optimizations
> is to prevent loss of precision and the fact that on neon denormalized
> numbers are flushed to zero.
>
> Ok with that change and a quick test with vect_hw_misalign added to
> your testcase.
>
> Sorry about the delay in reviewing.
>
> Ramana
>
>
>>
>> Thanks,
>> Prathamesh
>>>
>>> Thanks,
>>> Prathamesh
>>>>
>>>> Thanks,
>>>> Ramana
>>>>
>>>>>
>>>>> Thanks,
>>>>> Prathamesh
>>>>>>
>>>>>>
>>>>>> moving on to the patches.
>>>>>>
>>>>>>> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
>>>>>>> index 654d9d5..28c2e2a 100644
>>>>>>> --- a/gcc/config/arm/neon.md
>>>>>>> +++ b/gcc/config/arm/neon.md
>>>>>>> @@ -548,6 +548,32 @@
>>>>>>>                      (const_string "neon_mul_<V_elem_ch><q>")))]
>>>>>>>  )
>>>>>>>
>>>>>>
>>>>>> Please add a comment here.
>>>>>>
>>>>>>> +(define_expand "div<mode>3"
>>>>>>> +  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
>>>>>>> +        (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w")
>>>>>>> +               (match_operand:VCVTF 2 "s_register_operand" "w")))]
>>>>>>
>>>>>> I want to double check that this doesn't collide with Alan's patches for 
>>>>>> FP16 especially if he reuses the VCVTF iterator for all the vcvt f16 
>>>>>> cases.
>>>>>>
>>>>>>> +  "TARGET_NEON && flag_unsafe_math_optimizations && 
>>>>>>> flag_reciprocal_math"
>>>>>>> +  {
>>>>>>> +    rtx rec = gen_reg_rtx (<MODE>mode);
>>>>>>> +    rtx vrecps_temp = gen_reg_rtx (<MODE>mode);
>>>>>>> +
>>>>>>> +    /* Reciprocal estimate */
>>>>>>> +    emit_insn (gen_neon_vrecpe<mode> (rec, operands[2]));
>>>>>>> +
>>>>>>> +    /* Perform 2 iterations of Newton-Raphson method for better 
>>>>>>> accuracy */
>>>>>>> +    for (int i = 0; i < 2; i++)
>>>>>>> +      {
>>>>>>> +     emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));
>>>>>>> +     emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));
>>>>>>> +      }
>>>>>>> +
>>>>>>> +    /* We now have reciprocal in rec, perform operands[0] = 
>>>>>>> operands[1] * rec */
>>>>>>> +    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));
>>>>>>> +    DONE;
>>>>>>> +  }
>>>>>>> +)
>>>>>>> +
>>>>>>> +
>>>>>>>  (define_insn "mul<mode>3add<mode>_neon"
>>>>>>>    [(set (match_operand:VDQW 0 "s_register_operand" "=w")
>>>>>>>          (plus:VDQW (mult:VDQW (match_operand:VDQW 2 
>>>>>>> "s_register_operand" "w")
>>>>>>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-1.c 
>>>>>>> b/gcc/testsuite/gcc.target/arm/vect-div-1.c
>>>>>>> new file mode 100644
>>>>>>> index 0000000..e562ef3
>>>>>>> --- /dev/null
>>>>>>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-1.c
>>>>>>> @@ -0,0 +1,14 @@
>>>>>>> +/* { dg-do compile } */
>>>>>>> +/* { dg-require-effective-target arm_v8_neon_ok } */
>>>>>>> +/* { dg-options "-O2 -funsafe-math-optimizations -ftree-vectorize 
>>>>>>> -fdump-tree-vect-all" } */
>>>>>>> +/* { dg-add-options arm_v8_neon } */
>>>>>>
>>>>>> No this is wrong.
>>>>>>
>>>>>> What is armv8 specific about this test ? This is just like another test 
>>>>>> that is for Neon. vrecpe / vrecps are not instructions that were 
>>>>>> introduced in the v8 version of the architecture. They've existed in the 
>>>>>> base Neon instruction set. The code generation above in the patterns 
>>>>>> will be enabled when TARGET_NEON is true which can happen when 
>>>>>> -mfpu=neon -mfloat-abi={softfp/hard} is true.
>>>>>>
>>>>>>> +
>>>>>>> +void
>>>>>>> +foo (int len, float * __restrict p, float *__restrict x)
>>>>>>> +{
>>>>>>> +  len = len & ~31;
>>>>>>> +  for (int i = 0; i < len; i++)
>>>>>>> +    p[i] = p[i] / x[i];
>>>>>>> +}
>>>>>>> +
>>>>>>> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } 
>>>>>>> */
>>>>>>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-2.c 
>>>>>>> b/gcc/testsuite/gcc.target/arm/vect-div-2.c
>>>>>>> new file mode 100644
>>>>>>> index 0000000..8e15d0a
>>>>>>> --- /dev/null
>>>>>>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-2.c
>>>>>>> @@ -0,0 +1,14 @@
>>>>>>> +/* { dg-do compile } */
>>>>>>> +/* { dg-require-effective-target arm_v8_neon_ok } */
>>>>>>
>>>>>> And likewise.
>>>>>>
>>>>>>> +/* { dg-options "-O2 -funsafe-math-optimizations -fno-reciprocal-math 
>>>>>>> -ftree-vectorize -fdump-tree-vect-all" } */
>>>>>>> +/* { dg-add-options arm_v8_neon } */
>>>>>>> +
>>>>>>> +void
>>>>>>> +foo (int len, float * __restrict p, float *__restrict x)
>>>>>>> +{
>>>>>>> +  len = len & ~31;
>>>>>>> +  for (int i = 0; i < len; i++)
>>>>>>> +    p[i] = p[i] / x[i];
>>>>>>> +}
>>>>>>> +
>>>>>>> +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } 
>>>>>>> */
>>>>>>
>>>>>>
>>>>>> regards
>>>>>> Ramana

diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index e2fdfbb..fbd4bb6 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -578,6 +578,38 @@
                     (const_string "neon_mul_<V_elem_ch><q>")))]
 )
 
+/* Perform division using multiply-by-reciprocal. 
+   Reciprocal is calculated using Newton-Raphson method.
+   Enabled with -funsafe-math-optimizations -freciprocal-math
+   and disabled for -Os since it increases code size .  */
+
+(define_expand "div<mode>3"
+  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
+        (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w")
+                 (match_operand:VCVTF 2 "s_register_operand" "w")))]
+  "TARGET_NEON && !optimize_size
+   && flag_unsafe_math_optimizations && flag_reciprocal_math"
+  {
+    rtx rec = gen_reg_rtx (<MODE>mode);
+    rtx vrecps_temp = gen_reg_rtx (<MODE>mode);
+
+    /* Reciprocal estimate.  */
+    emit_insn (gen_neon_vrecpe<mode> (rec, operands[2]));
+
+    /* Perform 2 iterations of newton-raphson method.  */
+    for (int i = 0; i < 2; i++)
+      {
+       emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));
+       emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));
+      }
+
+    /* We now have reciprocal in rec, perform operands[0] = operands[1] * rec. 
 */
+    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));
+    DONE;
+  }
+)
+
+
 (define_insn "mul<mode>3add<mode>_neon"
   [(set (match_operand:VDQW 0 "s_register_operand" "=w")
         (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" "w")
diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c 
b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c
new file mode 100644
index 0000000..dc507a0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c
@@ -0,0 +1,16 @@
+/* Test pattern div<mode>3.  */
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_neon_ok } */
+/* { dg-require-effective-target vect_hw_misalign } */
+/* { dg-options "-O2 -ftree-vectorize -funsafe-math-optimizations 
-fdump-tree-vect-all" } */
+/* { dg-add-options arm_neon } */
+
+void
+foo (int len, float * __restrict p, float *__restrict x)
+{
+  len = len & ~31;
+  for (int i = 0; i < len; i++)
+    p[i] = p[i] / x[i];
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c 
b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c
new file mode 100644
index 0000000..9654232
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c
@@ -0,0 +1,17 @@
+/* Test pattern div<mode>3.  */
+
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_neon_ok } */
+/* { dg-require-effective-target vect_hw_misalign } */
+/* { dg-options "-O3 -funsafe-math-optimizations -fno-reciprocal-math 
-fdump-tree-vect-all" } */
+/* { dg-add-options arm_neon } */
+
+void
+foo (int len, float * __restrict p, float *__restrict x)
+{
+  len = len & ~31;
+  for (int i = 0; i < len; i++)
+    p[i] = p[i] / x[i];
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */
diff --git a/gcc/testsuite/lib/target-supports.exp 
b/gcc/testsuite/lib/target-supports.exp
index 0b991a5..48feb99 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -4812,7 +4812,9 @@ proc check_effective_target_vect_hw_misalign { } {
         set et_vect_hw_misalign_saved 0
        if { [istarget i?86-*-*] || [istarget x86_64-*-*]
             || ([istarget powerpc*-*-*] && [check_p8vector_hw_available])
-           || [istarget aarch64*-*-*] } {
+           || [istarget aarch64*-*-*]
+            || ([istarget arm-*-*]
+               && [is-effective-target arm_neon_ok]) } {
           set et_vect_hw_misalign_saved 1
        }
     }

Re: [ARM] implement division using vrecpe/vrecps with -funsafe-math-optimizations

Reply via email to