On 7 June 2016 at 14:07, Ramana Radhakrishnan <ramana....@googlemail.com> wrote: >>> Please find the updated patch attached. >>> It passes testsuite for arm-none-linux-gnueabi, arm-none-linux-gnueabihf and >>> arm-none-eabi. >>> However the test-case added in the patch (neon-vect-div-1.c) fails to >>> get vectorized at -O2 >>> for armeb-none-linux-gnueabihf. >>> Charles suggested me to try with -O3, which worked. >>> It appears the test-case fails to get vectorized with >>> -fvect-cost-model=cheap (which is default enabled at -O2) >>> and passes for -fno-vect-cost-model / -fvect-cost-model=dynamic >>> >>> I can't figure out why it fails -fvect-cost-model=cheap. >>> From the vect dump (attached): >>> neon-vect-div-1.c:12:3: note: Setting misalignment to -1. >>> neon-vect-div-1.c:12:3: note: not vectorized: unsupported unaligned load.*_9 >> Hi, >> I think I have some idea why the test-case fails attached with patch >> fail to get vectorized on armeb with -O2. >> >> Issue with big endian vectorizer: >> The patch does not cause regressions on big endian vectorizer but >> fails to vectorize the test-cases attached with the patch, while they >> get vectorized on >> litttle-endian. >> Fails with armeb with the following message in dump: >> note: not vectorized: unsupported unaligned load.*_9 >> >> The behavior of big and little endian vectorizer seems to be different >> in arm_builtin_support_vector_misalignment() which overrides the hook >> targetm.vectorize.support_vector_misalignment(). >> >> targetm.vectorize.support_vector_misalignment is called by >> vect_supportable_dr_alignment () which in turn is called >> by verify_data_refs_alignment (). >> >> Execution upto following condition is common between arm and armeb >> in vect_supportable_dr_alignment(): >> >> if ((TYPE_USER_ALIGN (type) && !is_packed) >> || targetm.vectorize.support_vector_misalignment (mode, type, >> DR_MISALIGNMENT (dr), is_packed)) >> /* Can't software pipeline the loads, but can at least do them. */ >> return dr_unaligned_supported; >> >> For little endian case: >> arm_builtin_support_vector_misalignment() is called with >> V2SF mode and misalignment == -1, and the following condition >> becomes true: >> /* If the misalignment is unknown, we should be able to handle the access >> so long as it is not to a member of a packed data structure. */ >> if (misalignment == -1) >> return true; >> >> Since the hook returned true we enter the condition above in >> vect_supportable_dr_alignment() and return dr_unaligned_supported; >> >> For big-endian: >> arm_builtin_support_vector_misalignment() is called with V2SF mode. >> The following condition that gates the entire function body fails: >> if (TARGET_NEON && !BYTES_BIG_ENDIAN && unaligned_access) >> and the default hook gets called with V2SF mode and the default hook >> returns false because >> movmisalign_optab does not exist for V2SF mode. >> >> So the condition above in vect_supportable_dr_alignment() fails >> and we come here: >> /* Unsupported. */ >> return dr_unaligned_unsupported; >> >> And hence we get the unaligned load not supported message in the dump >> for armeb in verify_data_ref_alignment (): >> >> static bool >> verify_data_ref_alignment (data_reference_p dr) >> { >> enum dr_alignment_support supportable_dr_alignment >> = vect_supportable_dr_alignment (dr, false); >> if (!supportable_dr_alignment) >> { >> if (dump_enabled_p ()) >> { >> if (DR_IS_READ (dr)) >> dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, >> "not vectorized: unsupported unaligned load."); >> >> With -O3, the test-cases vectorize for armeb, because loop peeling for >> alignment >> is turned on. >> The above behavior is also reproducible with test-case which is >> irrelevant to the patch. >> for instance, we get the same unsupported unaligned load for following >> test-case (replaced / with +) >> >> void >> foo (int len, float * __restrict p, float *__restrict x) >> { >> len = len & ~31; >> for (int i = 0; i < len; i++) >> p[i] = p[i] + x[i]; >> } >> Is the patch OK to commit after bootstrap+test ? > > > Thanks for the analysis - all the test needs is an additional marker > to skip it on armeb (is there a helper for misaligned loads from the > vectorizer ? ) - Ah probably vect_hw_misalign is sufficient for your > usecase and you want to appropriately fix it for little endian arm > with neon support enabled. Hi, I added dg-require-effective-target vect_hw_misalign to the tests in the patch, and modified vect_hw_misalign to return true for little-endian arm configs with neon support enabled. The patch makes the tests unsupported for armeb. Does it look correct ?
Unfortunately the change to vect_hw_misalign breaks gcc.dg/vect/vect-align-1.c, which were passing before: XPASS: gcc.dg/vect/vect-align-1.c scan-tree-dump-times vect "Alignment of access forced using versioning" 1 FAIL: gcc.dg/vect/vect-align-1.c scan-tree-dump-times vect "Vectorizing an unaligned access" 1 I am not sure how to fix this and would be grateful for suggestions. Thanks, Prathamesh > > From the patch. > >>>+ && flag_unsafe_math_optimizations && flag_reciprocal_math" > > Why do we need flag_unsafe_math_optimizations && flag_reciprocal_math > ? flag_unsafe_math_optimizations should be sufficient since it enables > flag_reciprocal_math - the reason for flag_unsafe_math_optimizations > is to prevent loss of precision and the fact that on neon denormalized > numbers are flushed to zero. > > Ok with that change and a quick test with vect_hw_misalign added to > your testcase. > > Sorry about the delay in reviewing. > > Ramana > > >> >> Thanks, >> Prathamesh >>> >>> Thanks, >>> Prathamesh >>>> >>>> Thanks, >>>> Ramana >>>> >>>>> >>>>> Thanks, >>>>> Prathamesh >>>>>> >>>>>> >>>>>> moving on to the patches. >>>>>> >>>>>>> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md >>>>>>> index 654d9d5..28c2e2a 100644 >>>>>>> --- a/gcc/config/arm/neon.md >>>>>>> +++ b/gcc/config/arm/neon.md >>>>>>> @@ -548,6 +548,32 @@ >>>>>>> (const_string "neon_mul_<V_elem_ch><q>")))] >>>>>>> ) >>>>>>> >>>>>> >>>>>> Please add a comment here. >>>>>> >>>>>>> +(define_expand "div<mode>3" >>>>>>> + [(set (match_operand:VCVTF 0 "s_register_operand" "=w") >>>>>>> + (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w") >>>>>>> + (match_operand:VCVTF 2 "s_register_operand" "w")))] >>>>>> >>>>>> I want to double check that this doesn't collide with Alan's patches for >>>>>> FP16 especially if he reuses the VCVTF iterator for all the vcvt f16 >>>>>> cases. >>>>>> >>>>>>> + "TARGET_NEON && flag_unsafe_math_optimizations && >>>>>>> flag_reciprocal_math" >>>>>>> + { >>>>>>> + rtx rec = gen_reg_rtx (<MODE>mode); >>>>>>> + rtx vrecps_temp = gen_reg_rtx (<MODE>mode); >>>>>>> + >>>>>>> + /* Reciprocal estimate */ >>>>>>> + emit_insn (gen_neon_vrecpe<mode> (rec, operands[2])); >>>>>>> + >>>>>>> + /* Perform 2 iterations of Newton-Raphson method for better >>>>>>> accuracy */ >>>>>>> + for (int i = 0; i < 2; i++) >>>>>>> + { >>>>>>> + emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2])); >>>>>>> + emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp)); >>>>>>> + } >>>>>>> + >>>>>>> + /* We now have reciprocal in rec, perform operands[0] = >>>>>>> operands[1] * rec */ >>>>>>> + emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec)); >>>>>>> + DONE; >>>>>>> + } >>>>>>> +) >>>>>>> + >>>>>>> + >>>>>>> (define_insn "mul<mode>3add<mode>_neon" >>>>>>> [(set (match_operand:VDQW 0 "s_register_operand" "=w") >>>>>>> (plus:VDQW (mult:VDQW (match_operand:VDQW 2 >>>>>>> "s_register_operand" "w") >>>>>>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-1.c >>>>>>> b/gcc/testsuite/gcc.target/arm/vect-div-1.c >>>>>>> new file mode 100644 >>>>>>> index 0000000..e562ef3 >>>>>>> --- /dev/null >>>>>>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-1.c >>>>>>> @@ -0,0 +1,14 @@ >>>>>>> +/* { dg-do compile } */ >>>>>>> +/* { dg-require-effective-target arm_v8_neon_ok } */ >>>>>>> +/* { dg-options "-O2 -funsafe-math-optimizations -ftree-vectorize >>>>>>> -fdump-tree-vect-all" } */ >>>>>>> +/* { dg-add-options arm_v8_neon } */ >>>>>> >>>>>> No this is wrong. >>>>>> >>>>>> What is armv8 specific about this test ? This is just like another test >>>>>> that is for Neon. vrecpe / vrecps are not instructions that were >>>>>> introduced in the v8 version of the architecture. They've existed in the >>>>>> base Neon instruction set. The code generation above in the patterns >>>>>> will be enabled when TARGET_NEON is true which can happen when >>>>>> -mfpu=neon -mfloat-abi={softfp/hard} is true. >>>>>> >>>>>>> + >>>>>>> +void >>>>>>> +foo (int len, float * __restrict p, float *__restrict x) >>>>>>> +{ >>>>>>> + len = len & ~31; >>>>>>> + for (int i = 0; i < len; i++) >>>>>>> + p[i] = p[i] / x[i]; >>>>>>> +} >>>>>>> + >>>>>>> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } >>>>>>> */ >>>>>>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-2.c >>>>>>> b/gcc/testsuite/gcc.target/arm/vect-div-2.c >>>>>>> new file mode 100644 >>>>>>> index 0000000..8e15d0a >>>>>>> --- /dev/null >>>>>>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-2.c >>>>>>> @@ -0,0 +1,14 @@ >>>>>>> +/* { dg-do compile } */ >>>>>>> +/* { dg-require-effective-target arm_v8_neon_ok } */ >>>>>> >>>>>> And likewise. >>>>>> >>>>>>> +/* { dg-options "-O2 -funsafe-math-optimizations -fno-reciprocal-math >>>>>>> -ftree-vectorize -fdump-tree-vect-all" } */ >>>>>>> +/* { dg-add-options arm_v8_neon } */ >>>>>>> + >>>>>>> +void >>>>>>> +foo (int len, float * __restrict p, float *__restrict x) >>>>>>> +{ >>>>>>> + len = len & ~31; >>>>>>> + for (int i = 0; i < len; i++) >>>>>>> + p[i] = p[i] / x[i]; >>>>>>> +} >>>>>>> + >>>>>>> +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } >>>>>>> */ >>>>>> >>>>>> >>>>>> regards >>>>>> Ramana
diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md index e2fdfbb..fbd4bb6 100644 --- a/gcc/config/arm/neon.md +++ b/gcc/config/arm/neon.md @@ -578,6 +578,38 @@ (const_string "neon_mul_<V_elem_ch><q>")))] ) +/* Perform division using multiply-by-reciprocal. + Reciprocal is calculated using Newton-Raphson method. + Enabled with -funsafe-math-optimizations -freciprocal-math + and disabled for -Os since it increases code size . */ + +(define_expand "div<mode>3" + [(set (match_operand:VCVTF 0 "s_register_operand" "=w") + (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w") + (match_operand:VCVTF 2 "s_register_operand" "w")))] + "TARGET_NEON && !optimize_size + && flag_unsafe_math_optimizations && flag_reciprocal_math" + { + rtx rec = gen_reg_rtx (<MODE>mode); + rtx vrecps_temp = gen_reg_rtx (<MODE>mode); + + /* Reciprocal estimate. */ + emit_insn (gen_neon_vrecpe<mode> (rec, operands[2])); + + /* Perform 2 iterations of newton-raphson method. */ + for (int i = 0; i < 2; i++) + { + emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2])); + emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp)); + } + + /* We now have reciprocal in rec, perform operands[0] = operands[1] * rec. */ + emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec)); + DONE; + } +) + + (define_insn "mul<mode>3add<mode>_neon" [(set (match_operand:VDQW 0 "s_register_operand" "=w") (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" "w") diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c new file mode 100644 index 0000000..dc507a0 --- /dev/null +++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c @@ -0,0 +1,16 @@ +/* Test pattern div<mode>3. */ +/* { dg-do compile } */ +/* { dg-require-effective-target arm_neon_ok } */ +/* { dg-require-effective-target vect_hw_misalign } */ +/* { dg-options "-O2 -ftree-vectorize -funsafe-math-optimizations -fdump-tree-vect-all" } */ +/* { dg-add-options arm_neon } */ + +void +foo (int len, float * __restrict p, float *__restrict x) +{ + len = len & ~31; + for (int i = 0; i < len; i++) + p[i] = p[i] / x[i]; +} + +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */ diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c new file mode 100644 index 0000000..9654232 --- /dev/null +++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c @@ -0,0 +1,17 @@ +/* Test pattern div<mode>3. */ + +/* { dg-do compile } */ +/* { dg-require-effective-target arm_neon_ok } */ +/* { dg-require-effective-target vect_hw_misalign } */ +/* { dg-options "-O3 -funsafe-math-optimizations -fno-reciprocal-math -fdump-tree-vect-all" } */ +/* { dg-add-options arm_neon } */ + +void +foo (int len, float * __restrict p, float *__restrict x) +{ + len = len & ~31; + for (int i = 0; i < len; i++) + p[i] = p[i] / x[i]; +} + +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */ diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp index 0b991a5..48feb99 100644 --- a/gcc/testsuite/lib/target-supports.exp +++ b/gcc/testsuite/lib/target-supports.exp @@ -4812,7 +4812,9 @@ proc check_effective_target_vect_hw_misalign { } { set et_vect_hw_misalign_saved 0 if { [istarget i?86-*-*] || [istarget x86_64-*-*] || ([istarget powerpc*-*-*] && [check_p8vector_hw_available]) - || [istarget aarch64*-*-*] } { + || [istarget aarch64*-*-*] + || ([istarget arm-*-*] + && [is-effective-target arm_neon_ok]) } { set et_vect_hw_misalign_saved 1 } }