Re: [ARM] Implement division using vrecpe, vrecps
On Mon, 5 Nov 2018 at 19:22, Ramana Radhakrishnan wrote:
>
> On 26/10/2018 06:04, Prathamesh Kulkarni wrote:
> > Hi,
> > This is a rebased version of patch that adds a pattern to neon.md for
> > implementing division with multiplication by reciprocal using
> > vrecpe/vrecps with -funsafe-math-optimizations, excluding -Os.
> > The newly added test-cases are not vectorized on the armeb target with
> > -O2. I posted the analysis for that here:
> > https://gcc.gnu.org/ml/gcc-patches/2016-05/msg01765.html
> >
> > Briefly, the difference between the little and big-endian vectorizer is in
> > arm_builtin_support_vector_misalignment() which calls
> > default_builtin_support_vector_misalignment() for the big-endian case, and
> > that returns false because movmisalign_optab does not exist for V2SF mode.
> > This isn't observed with -O3 because loop peeling for alignment gets enabled.
> >
> > It seems that the test cases in the patch appear unsupported on armeb
> > after r221677, thus this patch requires no changes to
> > target-supports.exp to adjust for armeb (unlike last time, which
> > stalled the patch).
> >
> > Bootstrap+tested on arm-linux-gnueabihf.
> > Cross-tested on arm*-*-* variants.
> > OK for trunk ?
> >
> > Thanks,
> > Prathamesh
> >
> >
> > tcwg-319-3.txt
> >
> > 2018-10-26  Prathamesh Kulkarni
> >
> > 	* config/arm/neon.md (div<mode>3): New pattern.
> >
> > testsuite/
> > 	* gcc.target/arm/neon-vect-div-1.c: New test.
> > 	* gcc.target/arm/neon-vect-div-2.c: Likewise.
> >
> > diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
> > index 5aeee4b08c1..25ed45d381a 100644
> > --- a/gcc/config/arm/neon.md
> > +++ b/gcc/config/arm/neon.md
> > @@ -620,6 +620,38 @@
> >  (const_string "neon_mul_")))]
> >  )
> >
> > +/* Perform division using multiply-by-reciprocal.
> > +   Reciprocal is calculated using Newton-Raphson method.
> > +   Enabled with -funsafe-math-optimizations -freciprocal-math
> > +   and disabled for -Os since it increases code size.  */
> > +
> > +(define_expand "div<mode>3"
> > +  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
> > +        (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w")
> > +                   (match_operand:VCVTF 2 "s_register_operand" "w")))]
> > +  "TARGET_NEON && !optimize_size
> > +   && flag_unsafe_math_optimizations && flag_reciprocal_math"
>
> I would prefer this to be more granular than
> flag_unsafe_math_optimizations && flag_reciprocal_math, which really is
> flag_reciprocal_math as it is turned on by default with
> -funsafe-math-optimizations.
>
> I think this should really be just flag_reciprocal_math.
>
> Otherwise ok.

Thanks, committed it in r265948 after removing the check for
flag_unsafe_math_optimizations.

Regards,
Prathamesh

> regards
> Ramana
>
> > +  {
> > +    rtx rec = gen_reg_rtx (<MODE>mode);
> > +    rtx vrecps_temp = gen_reg_rtx (<MODE>mode);
> > +
> > +    /* Reciprocal estimate.  */
> > +    emit_insn (gen_neon_vrecpe<mode> (rec, operands[2]));
> > +
> > +    /* Perform 2 iterations of Newton-Raphson method.  */
> > +    for (int i = 0; i < 2; i++)
> > +      {
> > +        emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));
> > +        emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));
> > +      }
> > +
> > +    /* We now have reciprocal in rec, perform operands[0] = operands[1] * rec.  */
> > +    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));
> > +    DONE;
> > +  }
> > +)
> > +
> > +
> >  (define_insn "mul3add_neon"
> >    [(set (match_operand:VDQW 0 "s_register_operand" "=w")
> >          (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" "w")
> >
> > diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c
> > new file mode
> > index 000..50d04b4175b
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c
> > @@ -0,0 +1,16 @@
> > +/* Test pattern div<mode>3.  */
> > +/* { dg-do compile } */
> > +/* { dg-require-effective-target arm_neon_ok } */
> > +/* { dg-require-effective-target vect_hw_misalign } */
> > +/* { dg-options "-O2 -ftree-vectorize -funsafe-math-optimizations -fdump-tree-vect-details" } */
> > +/* { dg-add-options arm_neon } */
> > +
> > +void
> > +foo (int len, float * __restrict p, float *__restrict x)
> > +{
> > +  len = len & ~31;
> > +  for (int i = 0; i < len; i++)
> > +    p[i] = p[i] / x[i];
> > +}
> > +
> > +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
> > diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c
> > new file mode
> > index 000..606f54b4e0e
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c
> > @@ -0,0 +1,16 @@
> > +/* Test pattern div<mode>3.  */
> > +/* { dg-do compile } */
> > +/* { dg-require-effective-target arm_neon_ok } */
> > +/* { dg-require-effective-target vect_hw_misalign } */
> > +/* { dg-options "-O3 -ftree-vectorize
Re: [ARM] Implement division using vrecpe, vrecps
On 26/10/2018 06:04, Prathamesh Kulkarni wrote:
> Hi,
> This is a rebased version of patch that adds a pattern to neon.md for
> implementing division with multiplication by reciprocal using
> vrecpe/vrecps with -funsafe-math-optimizations, excluding -Os.
> The newly added test-cases are not vectorized on the armeb target with
> -O2. I posted the analysis for that here:
> https://gcc.gnu.org/ml/gcc-patches/2016-05/msg01765.html
>
> Briefly, the difference between the little and big-endian vectorizer is in
> arm_builtin_support_vector_misalignment() which calls
> default_builtin_support_vector_misalignment() for the big-endian case, and
> that returns false because movmisalign_optab does not exist for V2SF mode.
> This isn't observed with -O3 because loop peeling for alignment gets enabled.
>
> It seems that the test cases in the patch appear unsupported on armeb
> after r221677, thus this patch requires no changes to
> target-supports.exp to adjust for armeb (unlike last time, which
> stalled the patch).
>
> Bootstrap+tested on arm-linux-gnueabihf.
> Cross-tested on arm*-*-* variants.
> OK for trunk ?
>
> Thanks,
> Prathamesh
>
>
> tcwg-319-3.txt
>
> 2018-10-26  Prathamesh Kulkarni
>
> 	* config/arm/neon.md (div<mode>3): New pattern.
>
> testsuite/
> 	* gcc.target/arm/neon-vect-div-1.c: New test.
> 	* gcc.target/arm/neon-vect-div-2.c: Likewise.
>
> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
> index 5aeee4b08c1..25ed45d381a 100644
> --- a/gcc/config/arm/neon.md
> +++ b/gcc/config/arm/neon.md
> @@ -620,6 +620,38 @@
>  (const_string "neon_mul_")))]
>  )
>
> +/* Perform division using multiply-by-reciprocal.
> +   Reciprocal is calculated using Newton-Raphson method.
> +   Enabled with -funsafe-math-optimizations -freciprocal-math
> +   and disabled for -Os since it increases code size.  */
> +
> +(define_expand "div<mode>3"
> +  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
> +        (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w")
> +                   (match_operand:VCVTF 2 "s_register_operand" "w")))]
> +  "TARGET_NEON && !optimize_size
> +   && flag_unsafe_math_optimizations && flag_reciprocal_math"

I would prefer this to be more granular than
flag_unsafe_math_optimizations && flag_reciprocal_math, which really is
flag_reciprocal_math as it is turned on by default with
-funsafe-math-optimizations.

I think this should really be just flag_reciprocal_math.

Otherwise ok.

regards
Ramana

> +  {
> +    rtx rec = gen_reg_rtx (<MODE>mode);
> +    rtx vrecps_temp = gen_reg_rtx (<MODE>mode);
> +
> +    /* Reciprocal estimate.  */
> +    emit_insn (gen_neon_vrecpe<mode> (rec, operands[2]));
> +
> +    /* Perform 2 iterations of Newton-Raphson method.  */
> +    for (int i = 0; i < 2; i++)
> +      {
> +        emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));
> +        emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));
> +      }
> +
> +    /* We now have reciprocal in rec, perform operands[0] = operands[1] * rec.  */
> +    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));
> +    DONE;
> +  }
> +)
> +
> +
>  (define_insn "mul3add_neon"
>    [(set (match_operand:VDQW 0 "s_register_operand" "=w")
>          (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" "w")
>
> diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c
> new file mode
> index 000..50d04b4175b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c
> @@ -0,0 +1,16 @@
> +/* Test pattern div<mode>3.  */
> +/* { dg-do compile } */
> +/* { dg-require-effective-target arm_neon_ok } */
> +/* { dg-require-effective-target vect_hw_misalign } */
> +/* { dg-options "-O2 -ftree-vectorize -funsafe-math-optimizations -fdump-tree-vect-details" } */
> +/* { dg-add-options arm_neon } */
> +
> +void
> +foo (int len, float * __restrict p, float *__restrict x)
> +{
> +  len = len & ~31;
> +  for (int i = 0; i < len; i++)
> +    p[i] = p[i] / x[i];
> +}
> +
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c
> new file mode
> index 000..606f54b4e0e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c
> @@ -0,0 +1,16 @@
> +/* Test pattern div<mode>3.  */
> +/* { dg-do compile } */
> +/* { dg-require-effective-target arm_neon_ok } */
> +/* { dg-require-effective-target vect_hw_misalign } */
> +/* { dg-options "-O3 -ftree-vectorize -funsafe-math-optimizations -fdump-tree-vect-details -fno-reciprocal-math" } */
> +/* { dg-add-options arm_neon } */
> +
> +void
> +foo (int len, float * __restrict p, float *__restrict x)
> +{
> +  len = len & ~31;
> +  for (int i = 0; i < len; i++)
> +    p[i] = p[i] / x[i];
> +}
> +
> +/* { dg-final { scan-tree-dump-not "vectorized 1 loops" "vect" } } */
>
Re: [ARM] Implement division using vrecpe, vrecps
Hi Prathamesh,

Prathamesh Kulkarni wrote:

> Thanks for the suggestions. The last time I benchmarked the patch
> (around Jan 2016) I got following results with the patch for SPEC2006:
>
> a15: +0.64% overall, 481.wrf: +6.46%
> a53: +0.21% overall, 416.gamess: -1.39%, 481.wrf: +6.76%
> a57: +0.35% overall, 481.wrf: +3.84%
> (https://gcc.gnu.org/ml/gcc-patches/2016-01/msg01209.html)
>
> Do these numbers look acceptable ?
> I am benchmarking the patch on ToT, and will report if there are any
> performance improvements found with the patch.

Yes, those results are quite good - in fact they seemed too good to be
true at first. However, looking at arm/neon.md there isn't a division
pattern. So I think it's worth mentioning in the description that your
patch actually adds vectorization of division.

Disassembling the AArch64 wrf binary shows several hundred vector
division instructions - so the speedup makes sense now, since many more
loops are being vectorized. It's a shame this pattern wasn't added many
years ago...

It's a good idea to add a vectorized (r)sqrt too, as this will improve
wrf even further.

Wilco
Re: [ARM] Implement division using vrecpe, vrecps
On Fri, 2 Nov 2018 at 19:08, Wilco Dijkstra wrote:
>
> Prathamesh Kulkarni wrote:
>
> > This is a rebased version of patch that adds a pattern to neon.md for
> > implementing division with multiplication by reciprocal using
> > vrecpe/vrecps with -funsafe-math-optimizations, excluding -Os.
> > The newly added test-cases are not vectorized on the armeb target with
> > -O2. I posted the analysis for that here:
> > https://gcc.gnu.org/ml/gcc-patches/2016-05/msg01765.html
>
> I don't think doing this unconditionally for any CPU is a good idea. On
> AArch64 we don't enable this for any core since it's not really faster
> (newer CPUs have significantly improved division and the reciprocal
> instructions reduce throughput of other FMAs). On wrf doing reciprocal
> square root is far better than reciprocal division, but it's only faster
> on some specific CPUs, so it's not enabled by default.

Hi Wilco,
Thanks for the suggestions. The last time I benchmarked the patch
(around Jan 2016) I got following results with the patch for SPEC2006:

a15: +0.64% overall, 481.wrf: +6.46%
a53: +0.21% overall, 416.gamess: -1.39%, 481.wrf: +6.76%
a57: +0.35% overall, 481.wrf: +3.84%
(https://gcc.gnu.org/ml/gcc-patches/2016-01/msg01209.html)

Do these numbers look acceptable ?
I am benchmarking the patch on ToT, and will report if there are any
performance improvements found with the patch.

Thanks,
Prathamesh

>
> Wilco
Re: [ARM] Implement division using vrecpe, vrecps
Prathamesh Kulkarni wrote:

> This is a rebased version of patch that adds a pattern to neon.md for
> implementing division with multiplication by reciprocal using
> vrecpe/vrecps with -funsafe-math-optimizations, excluding -Os.
> The newly added test-cases are not vectorized on the armeb target with
> -O2. I posted the analysis for that here:
> https://gcc.gnu.org/ml/gcc-patches/2016-05/msg01765.html

I don't think doing this unconditionally for any CPU is a good idea. On
AArch64 we don't enable this for any core since it's not really faster
(newer CPUs have significantly improved division and the reciprocal
instructions reduce throughput of other FMAs). On wrf doing reciprocal
square root is far better than reciprocal division, but it's only faster
on some specific CPUs, so it's not enabled by default.

Wilco
Re: [ARM] Implement division using vrecpe, vrecps
On Fri, 26 Oct 2018 at 10:34, Prathamesh Kulkarni wrote:
>
> Hi,
> This is a rebased version of patch that adds a pattern to neon.md for
> implementing division with multiplication by reciprocal using
> vrecpe/vrecps with -funsafe-math-optimizations, excluding -Os.
> The newly added test-cases are not vectorized on the armeb target with
> -O2. I posted the analysis for that here:
> https://gcc.gnu.org/ml/gcc-patches/2016-05/msg01765.html
>
> Briefly, the difference between the little and big-endian vectorizer is in
> arm_builtin_support_vector_misalignment() which calls
> default_builtin_support_vector_misalignment() for the big-endian case, and
> that returns false because movmisalign_optab does not exist for V2SF mode.
> This isn't observed with -O3 because loop peeling for alignment gets enabled.
>
> It seems that the test cases in the patch appear unsupported on armeb
> after r221677, thus this patch requires no changes to
> target-supports.exp to adjust for armeb (unlike last time, which
> stalled the patch).
>
> Bootstrap+tested on arm-linux-gnueabihf.
> Cross-tested on arm*-*-* variants.
> OK for trunk ?

ping: https://gcc.gnu.org/ml/gcc-patches/2018-10/msg01645.html

Thanks,
Prathamesh

>
> Thanks,
> Prathamesh
[ARM] Implement division using vrecpe, vrecps
Hi,
This is a rebased version of patch that adds a pattern to neon.md for
implementing division with multiplication by reciprocal using
vrecpe/vrecps with -funsafe-math-optimizations, excluding -Os.
The newly added test-cases are not vectorized on the armeb target with
-O2. I posted the analysis for that here:
https://gcc.gnu.org/ml/gcc-patches/2016-05/msg01765.html

Briefly, the difference between the little and big-endian vectorizer is in
arm_builtin_support_vector_misalignment() which calls
default_builtin_support_vector_misalignment() for the big-endian case, and
that returns false because movmisalign_optab does not exist for V2SF mode.
This isn't observed with -O3 because loop peeling for alignment gets enabled.

It seems that the test cases in the patch appear unsupported on armeb
after r221677, thus this patch requires no changes to
target-supports.exp to adjust for armeb (unlike last time, which
stalled the patch).

Bootstrap+tested on arm-linux-gnueabihf.
Cross-tested on arm*-*-* variants.
OK for trunk ?

Thanks,
Prathamesh

2018-10-26  Prathamesh Kulkarni

	* config/arm/neon.md (div<mode>3): New pattern.

testsuite/
	* gcc.target/arm/neon-vect-div-1.c: New test.
	* gcc.target/arm/neon-vect-div-2.c: Likewise.

diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index 5aeee4b08c1..25ed45d381a 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -620,6 +620,38 @@
 (const_string "neon_mul_")))]
 )

+/* Perform division using multiply-by-reciprocal.
+   Reciprocal is calculated using Newton-Raphson method.
+   Enabled with -funsafe-math-optimizations -freciprocal-math
+   and disabled for -Os since it increases code size.  */
+
+(define_expand "div<mode>3"
+  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
+        (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w")
+                   (match_operand:VCVTF 2 "s_register_operand" "w")))]
+  "TARGET_NEON && !optimize_size
+   && flag_unsafe_math_optimizations && flag_reciprocal_math"
+  {
+    rtx rec = gen_reg_rtx (<MODE>mode);
+    rtx vrecps_temp = gen_reg_rtx (<MODE>mode);
+
+    /* Reciprocal estimate.  */
+    emit_insn (gen_neon_vrecpe<mode> (rec, operands[2]));
+
+    /* Perform 2 iterations of Newton-Raphson method.  */
+    for (int i = 0; i < 2; i++)
+      {
+        emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));
+        emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));
+      }
+
+    /* We now have reciprocal in rec, perform operands[0] = operands[1] * rec.  */
+    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));
+    DONE;
+  }
+)
+
+
 (define_insn "mul3add_neon"
   [(set (match_operand:VDQW 0 "s_register_operand" "=w")
         (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" "w")

diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c
new file mode
index 000..50d04b4175b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c
@@ -0,0 +1,16 @@
+/* Test pattern div<mode>3.  */
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_neon_ok } */
+/* { dg-require-effective-target vect_hw_misalign } */
+/* { dg-options "-O2 -ftree-vectorize -funsafe-math-optimizations -fdump-tree-vect-details" } */
+/* { dg-add-options arm_neon } */
+
+void
+foo (int len, float * __restrict p, float *__restrict x)
+{
+  len = len & ~31;
+  for (int i = 0; i < len; i++)
+    p[i] = p[i] / x[i];
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c
new file mode
index 000..606f54b4e0e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c
@@ -0,0 +1,16 @@
+/* Test pattern div<mode>3.  */
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_neon_ok } */
+/* { dg-require-effective-target vect_hw_misalign } */
+/* { dg-options "-O3 -ftree-vectorize -funsafe-math-optimizations -fdump-tree-vect-details -fno-reciprocal-math" } */
+/* { dg-add-options arm_neon } */
+
+void
+foo (int len, float * __restrict p, float *__restrict x)
+{
+  len = len & ~31;
+  for (int i = 0; i < len; i++)
+    p[i] = p[i] / x[i];
+}
+
+/* { dg-final { scan-tree-dump-not "vectorized 1 loops" "vect" } } */
Re: [ARM] implement division using vrecpe/vrecps with -funsafe-math-optimizations
On 7 June 2016 at 14:07, Ramana Radhakrishnan wrote:
>>> Please find the updated patch attached.
>>> It passes testsuite for arm-none-linux-gnueabi, arm-none-linux-gnueabihf
>>> and arm-none-eabi.
>>> However the test-case added in the patch (neon-vect-div-1.c) fails to
>>> get vectorized at -O2 for armeb-none-linux-gnueabihf.
>>> Charles suggested me to try with -O3, which worked.
>>> It appears the test-case fails to get vectorized with
>>> -fvect-cost-model=cheap (which is enabled by default at -O2)
>>> and passes for -fno-vect-cost-model / -fvect-cost-model=dynamic
>>>
>>> I can't figure out why it fails with -fvect-cost-model=cheap.
>>> From the vect dump (attached):
>>> neon-vect-div-1.c:12:3: note: Setting misalignment to -1.
>>> neon-vect-div-1.c:12:3: note: not vectorized: unsupported unaligned load.*_9
>>
>> Hi,
>> I think I have some idea why the test-cases attached with the patch
>> fail to get vectorized on armeb with -O2.
>>
>> Issue with big-endian vectorizer:
>> The patch does not cause regressions on the big-endian vectorizer but
>> fails to vectorize the test-cases attached with the patch, while they
>> get vectorized on little-endian.
>> Fails with armeb with the following message in the dump:
>> note: not vectorized: unsupported unaligned load.*_9
>>
>> The behavior of the big and little-endian vectorizer seems to be different
>> in arm_builtin_support_vector_misalignment() which overrides the hook
>> targetm.vectorize.support_vector_misalignment().
>>
>> targetm.vectorize.support_vector_misalignment is called by
>> vect_supportable_dr_alignment (), which in turn is called
>> by verify_data_refs_alignment ().
>>
>> Execution up to the following condition is common between arm and armeb
>> in vect_supportable_dr_alignment():
>>
>>   if ((TYPE_USER_ALIGN (type) && !is_packed)
>>       || targetm.vectorize.support_vector_misalignment (mode, type,
>>                                          DR_MISALIGNMENT (dr), is_packed))
>>     /* Can't software pipeline the loads, but can at least do them.  */
>>     return dr_unaligned_supported;
>>
>> For the little-endian case:
>> arm_builtin_support_vector_misalignment() is called with
>> V2SF mode and misalignment == -1, and the following condition
>> becomes true:
>>
>>   /* If the misalignment is unknown, we should be able to handle the access
>>      so long as it is not to a member of a packed data structure.  */
>>   if (misalignment == -1)
>>     return true;
>>
>> Since the hook returned true, we enter the condition above in
>> vect_supportable_dr_alignment() and return dr_unaligned_supported.
>>
>> For big-endian:
>> arm_builtin_support_vector_misalignment() is called with V2SF mode.
>> The following condition that gates the entire function body fails:
>>
>>   if (TARGET_NEON && !BYTES_BIG_ENDIAN && unaligned_access)
>>
>> and the default hook gets called with V2SF mode; the default hook
>> returns false because movmisalign_optab does not exist for V2SF mode.
>>
>> So the condition above in vect_supportable_dr_alignment() fails
>> and we come here:
>>
>>   /* Unsupported.  */
>>   return dr_unaligned_unsupported;
>>
>> And hence we get the "unaligned load not supported" message in the dump
>> for armeb in verify_data_ref_alignment ():
>>
>>   static bool
>>   verify_data_ref_alignment (data_reference_p dr)
>>   {
>>     enum dr_alignment_support supportable_dr_alignment
>>       = vect_supportable_dr_alignment (dr, false);
>>     if (!supportable_dr_alignment)
>>       {
>>         if (dump_enabled_p ())
>>           {
>>             if (DR_IS_READ (dr))
>>               dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>                                "not vectorized: unsupported unaligned load.");
>>
>> With -O3, the test-cases vectorize for armeb, because loop peeling for
>> alignment is turned on.
>> The above behavior is also reproducible with a test-case which is
>> irrelevant to the patch.
>> For instance, we get the same unsupported unaligned load for the following
>> test-case (replaced / with +):
>>
>>   void
>>   foo (int len, float * __restrict p, float *__restrict x)
>>   {
>>     len = len & ~31;
>>     for (int i = 0; i < len; i++)
>>       p[i] = p[i] + x[i];
>>   }
>>
>> Is the patch OK to commit after bootstrap+test ?
>
> Thanks for the analysis - all the test needs is an additional marker
> to skip it on armeb (is there a helper for misaligned loads from the
> vectorizer ?) - Ah, probably vect_hw_misalign is sufficient for your
> usecase and you want to appropriately fix it for little-endian arm
> with neon support enabled.

Hi,
I added dg-require-effective-target vect_hw_misalign to the tests in
the patch, and modified vect_hw_misalign to return true for
little-endian arm configs with neon support enabled.
The patch makes the tests unsupported for armeb. Does it look correct ?
Unfortunately the change to vect_hw_misalign breaks
gcc.dg/vect/vect-align-1.c, which was passing before:
XPASS: gcc.dg/vect/vect-align-1.c scan-tree-dump-times vect "Alignment
Re: [ARM] implement division using vrecpe/vrecps with -funsafe-math-optimizations
>> Please find the updated patch attached.
>> It passes testsuite for arm-none-linux-gnueabi, arm-none-linux-gnueabihf
>> and arm-none-eabi.
>> However the test-case added in the patch (neon-vect-div-1.c) fails to
>> get vectorized at -O2 for armeb-none-linux-gnueabihf.
>> Charles suggested me to try with -O3, which worked.
>> It appears the test-case fails to get vectorized with
>> -fvect-cost-model=cheap (which is enabled by default at -O2)
>> and passes for -fno-vect-cost-model / -fvect-cost-model=dynamic
>>
>> I can't figure out why it fails with -fvect-cost-model=cheap.
>> From the vect dump (attached):
>> neon-vect-div-1.c:12:3: note: Setting misalignment to -1.
>> neon-vect-div-1.c:12:3: note: not vectorized: unsupported unaligned load.*_9
>
> Hi,
> I think I have some idea why the test-cases attached with the patch
> fail to get vectorized on armeb with -O2.
>
> Issue with big-endian vectorizer:
> The patch does not cause regressions on the big-endian vectorizer but
> fails to vectorize the test-cases attached with the patch, while they
> get vectorized on little-endian.
> Fails with armeb with the following message in the dump:
> note: not vectorized: unsupported unaligned load.*_9
>
> The behavior of the big and little-endian vectorizer seems to be different
> in arm_builtin_support_vector_misalignment() which overrides the hook
> targetm.vectorize.support_vector_misalignment().
>
> targetm.vectorize.support_vector_misalignment is called by
> vect_supportable_dr_alignment (), which in turn is called
> by verify_data_refs_alignment ().
>
> Execution up to the following condition is common between arm and armeb
> in vect_supportable_dr_alignment():
>
>   if ((TYPE_USER_ALIGN (type) && !is_packed)
>       || targetm.vectorize.support_vector_misalignment (mode, type,
>                                          DR_MISALIGNMENT (dr), is_packed))
>     /* Can't software pipeline the loads, but can at least do them.  */
>     return dr_unaligned_supported;
>
> For the little-endian case:
> arm_builtin_support_vector_misalignment() is called with
> V2SF mode and misalignment == -1, and the following condition
> becomes true:
>
>   /* If the misalignment is unknown, we should be able to handle the access
>      so long as it is not to a member of a packed data structure.  */
>   if (misalignment == -1)
>     return true;
>
> Since the hook returned true, we enter the condition above in
> vect_supportable_dr_alignment() and return dr_unaligned_supported.
>
> For big-endian:
> arm_builtin_support_vector_misalignment() is called with V2SF mode.
> The following condition that gates the entire function body fails:
>
>   if (TARGET_NEON && !BYTES_BIG_ENDIAN && unaligned_access)
>
> and the default hook gets called with V2SF mode; the default hook
> returns false because movmisalign_optab does not exist for V2SF mode.
>
> So the condition above in vect_supportable_dr_alignment() fails
> and we come here:
>
>   /* Unsupported.  */
>   return dr_unaligned_unsupported;
>
> And hence we get the "unaligned load not supported" message in the dump
> for armeb in verify_data_ref_alignment ():
>
>   static bool
>   verify_data_ref_alignment (data_reference_p dr)
>   {
>     enum dr_alignment_support supportable_dr_alignment
>       = vect_supportable_dr_alignment (dr, false);
>     if (!supportable_dr_alignment)
>       {
>         if (dump_enabled_p ())
>           {
>             if (DR_IS_READ (dr))
>               dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>                                "not vectorized: unsupported unaligned load.");
>
> With -O3, the test-cases vectorize for armeb, because loop peeling for
> alignment is turned on.
> The above behavior is also reproducible with a test-case which is
> irrelevant to the patch.
> For instance, we get the same unsupported unaligned load for the following
> test-case (replaced / with +):
>
>   void
>   foo (int len, float * __restrict p, float *__restrict x)
>   {
>     len = len & ~31;
>     for (int i = 0; i < len; i++)
>       p[i] = p[i] + x[i];
>   }
>
> Is the patch OK to commit after bootstrap+test ?

Thanks for the analysis - all the test needs is an additional marker
to skip it on armeb (is there a helper for misaligned loads from the
vectorizer ?) - Ah, probably vect_hw_misalign is sufficient for your
usecase and you want to appropriately fix it for little-endian arm
with neon support enabled.

From the patch:

>> + && flag_unsafe_math_optimizations && flag_reciprocal_math"

Why do we need flag_unsafe_math_optimizations && flag_reciprocal_math ?
flag_unsafe_math_optimizations should be sufficient, since it enables
flag_reciprocal_math - the reason for flag_unsafe_math_optimizations is
to prevent loss of precision and the fact that on neon denormalized
numbers are flushed to zero.

Ok with that change and a quick test with vect_hw_misalign added to
your testcase.

Sorry about the delay in reviewing.

Ramana

> Thanks,
> Prathamesh
>>
>> Thanks,
>> Prathamesh
>>>
>>> Thanks,
>>> Ramana
>>> Thanks,
Re: [ARM] implement division using vrecpe/vrecps with -funsafe-math-optimizations
On 30 May 2016 at 13:52, Prathamesh Kulkarni wrote:
> On 23 May 2016 at 14:59, Prathamesh Kulkarni wrote:
>> On 5 February 2016 at 18:40, Prathamesh Kulkarni wrote:
>>> On 4 February 2016 at 16:31, Ramana Radhakrishnan wrote:
>>>> On Sun, Jan 17, 2016 at 9:06 AM, Prathamesh Kulkarni wrote:
>>>>> On 31 July 2015 at 15:04, Ramana Radhakrishnan wrote:
>>>>>> On 29/07/15 11:09, Prathamesh Kulkarni wrote:
>>>>>>> Hi,
>>>>>>> This patch tries to implement division with multiplication by
>>>>>>> reciprocal using vrecpe/vrecps
>>>>>>> with -funsafe-math-optimizations and -freciprocal-math enabled.
>>>>>>> Tested on arm-none-linux-gnueabihf using qemu.
>>>>>>> OK for trunk ?
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Prathamesh
>>>>>>
>>>>>> I've tried this in the past and never been convinced that 2 iterations
>>>>>> are enough to get to stability with this, given that the results are
>>>>>> only precise for 8 bits / iteration. Thus I've always believed you need
>>>>>> 3 iterations rather than 2, at which point I've never been sure that
>>>>>> it's worth it. So the testing that you've done with this currently is
>>>>>> not enough for this to go into the tree.
>>>>>>
>>>>>> I'd like this to be tested on a couple of different AArch32
>>>>>> implementations with a wider range of inputs to verify that the results
>>>>>> are acceptable, as well as running something like SPEC2k(6) with at
>>>>>> least one iteration to ensure correctness.
>>>>>
>>>>> Hi,
>>>>> I got results of SPEC2k6 fp benchmarks:
>>>>> a15: +0.64% overall, 481.wrf: +6.46%
>>>>> a53: +0.21% overall, 416.gamess: -1.39%, 481.wrf: +6.76%
>>>>> a57: +0.35% overall, 481.wrf: +3.84%
>>>>> The other benchmarks had (almost) identical results.
>>>>
>>>> Thanks for the benchmarking results - please repost the patch with
>>>> the changes that I had requested in my previous review - given it is
>>>> now stage4, I would rather queue changes like this for stage1 now.
>>>
>>> Hi,
>>> Please find the updated patch attached.
>>> It passes testsuite for arm-none-linux-gnueabi, arm-none-linux-gnueabihf
>>> and arm-none-eabi.
>>> However the test-case added in the patch (neon-vect-div-1.c) fails to
>>> get vectorized at -O2 for armeb-none-linux-gnueabihf.
>>> Charles suggested me to try with -O3, which worked.
>>> It appears the test-case fails to get vectorized with
>>> -fvect-cost-model=cheap (which is enabled by default at -O2)
>>> and passes for -fno-vect-cost-model / -fvect-cost-model=dynamic
>>>
>>> I can't figure out why it fails with -fvect-cost-model=cheap.
>>> From the vect dump (attached):
>>> neon-vect-div-1.c:12:3: note: Setting misalignment to -1.
>>> neon-vect-div-1.c:12:3: note: not vectorized: unsupported unaligned load.*_9
>>
>> Hi,
>> I think I have some idea why the test-cases attached with the patch
>> fail to get vectorized on armeb with -O2.
>>
>> Issue with big-endian vectorizer:
>> The patch does not cause regressions on the big-endian vectorizer but
>> fails to vectorize the test-cases attached with the patch, while they
>> get vectorized on little-endian.
>> Fails with armeb with the following message in the dump:
>> note: not vectorized: unsupported unaligned load.*_9
>>
>> The behavior of the big and little-endian vectorizer seems to be different
>> in arm_builtin_support_vector_misalignment() which overrides the hook
>> targetm.vectorize.support_vector_misalignment().
>>
>> targetm.vectorize.support_vector_misalignment is called by
>> vect_supportable_dr_alignment (), which in turn is called
>> by verify_data_refs_alignment ().
>>
>> Execution up to the following condition is common between arm and armeb
>> in vect_supportable_dr_alignment():
>>
>>   if ((TYPE_USER_ALIGN (type) && !is_packed)
>>       || targetm.vectorize.support_vector_misalignment (mode, type,
>>                                          DR_MISALIGNMENT (dr), is_packed))
>>     /* Can't software pipeline the loads, but can at least do them.  */
>>     return dr_unaligned_supported;
>>
>> For the little-endian case:
>> arm_builtin_support_vector_misalignment() is called with
>> V2SF mode and misalignment == -1, and the following condition
>> becomes true:
>>
>>   /* If the misalignment is unknown, we should be able to handle the access
>>      so long as it is not to a member of a packed data structure.  */
>>   if (misalignment == -1)
>>     return true;
>>
>> Since the hook returned true, we enter the condition above in
>> vect_supportable_dr_alignment() and return dr_unaligned_supported.
>>
>> For big-endian:
>> arm_builtin_support_vector_misalignment() is called with V2SF mode.
>> The following condition that gates the entire function body fails:
>>
>>   if (TARGET_NEON && !BYTES_BIG_ENDIAN && unaligned_access)
>>
>> and the default hook gets called with V2SF mode and the
Re: [ARM] implement division using vrecpe/vrecps with -funsafe-math-optimizations
On 23 May 2016 at 14:59, Prathamesh Kulkarniwrote: > On 5 February 2016 at 18:40, Prathamesh Kulkarni > wrote: >> On 4 February 2016 at 16:31, Ramana Radhakrishnan >> wrote: >>> On Sun, Jan 17, 2016 at 9:06 AM, Prathamesh Kulkarni >>> wrote: On 31 July 2015 at 15:04, Ramana Radhakrishnan wrote: > > > On 29/07/15 11:09, Prathamesh Kulkarni wrote: >> Hi, >> This patch tries to implement division with multiplication by >> reciprocal using vrecpe/vrecps >> with -funsafe-math-optimizations and -freciprocal-math enabled. >> Tested on arm-none-linux-gnueabihf using qemu. >> OK for trunk ? >> >> Thank you, >> Prathamesh >> > > I've tried this in the past and never been convinced that 2 iterations > are enough to get to stability with this given that the results are only > precise for 8 bits / iteration. Thus I've always believed you need 3 > iterations rather than 2 at which point I've never been sure that it's > worth it. So the testing that you've done with this currently is not > enough for this to go into the tree. > > I'd like this to be tested on a couple of different AArch32 > implementations with a wider range of inputs to verify that the results > are acceptable as well as running something like SPEC2k(6) with atleast > one iteration to ensure correctness. Hi, I got results of SPEC2k6 fp benchmarks: a15: +0.64% overall, 481.wrf: +6.46% a53: +0.21% overall, 416.gamess: -1.39%, 481.wrf: +6.76% a57: +0.35% overall, 481.wrf: +3.84% The other benchmarks had (almost) identical results. >>> >>> Thanks for the benchmarking results - Please repost the patch with >>> the changes that I had requested in my previous review - given it is >>> now stage4 , I would rather queue changes like this for stage1 now. >> Hi, >> Please find the updated patch attached. >> It passes testsuite for arm-none-linux-gnueabi, arm-none-linux-gnueabihf and >> arm-none-eabi. 
>> However the test-case added in the patch (neon-vect-div-1.c) fails to >> get vectorized at -O2 >> for armeb-none-linux-gnueabihf. >> Charles suggested me to try with -O3, which worked. >> It appears the test-case fails to get vectorized with >> -fvect-cost-model=cheap (which is default enabled at -O2) >> and passes for -fno-vect-cost-model / -fvect-cost-model=dynamic >> >> I can't figure out why it fails -fvect-cost-model=cheap. >> From the vect dump (attached): >> neon-vect-div-1.c:12:3: note: Setting misalignment to -1. >> neon-vect-div-1.c:12:3: note: not vectorized: unsupported unaligned load.*_9 > Hi, > I think I have some idea why the test-case fails attached with patch > fail to get vectorized on armeb with -O2. > > Issue with big endian vectorizer: > The patch does not cause regressions on big endian vectorizer but > fails to vectorize the test-cases attached with the patch, while they > get vectorized on > litttle-endian. > Fails with armeb with the following message in dump: > note: not vectorized: unsupported unaligned load.*_9 > > The behavior of big and little endian vectorizer seems to be different > in arm_builtin_support_vector_misalignment() which overrides the hook > targetm.vectorize.support_vector_misalignment(). > > targetm.vectorize.support_vector_misalignment is called by > vect_supportable_dr_alignment () which in turn is called > by verify_data_refs_alignment (). > > Execution upto following condition is common between arm and armeb > in vect_supportable_dr_alignment(): > > if ((TYPE_USER_ALIGN (type) && !is_packed) > || targetm.vectorize.support_vector_misalignment (mode, type, > DR_MISALIGNMENT (dr), is_packed)) > /* Can't software pipeline the loads, but can at least do them. 
*/ > return dr_unaligned_supported; > > For little endian case: > arm_builtin_support_vector_misalignment() is called with > V2SF mode and misalignment == -1, and the following condition > becomes true: > /* If the misalignment is unknown, we should be able to handle the access > so long as it is not to a member of a packed data structure. */ > if (misalignment == -1) > return true; > > Since the hook returned true we enter the condition above in > vect_supportable_dr_alignment() and return dr_unaligned_supported; > > For big-endian: > arm_builtin_support_vector_misalignment() is called with V2SF mode. > The following condition that gates the entire function body fails: > if (TARGET_NEON && !BYTES_BIG_ENDIAN && unaligned_access) > and the default hook gets called with V2SF mode and the default hook > returns false because > movmisalign_optab does not exist for V2SF mode. > > So the condition above in vect_supportable_dr_alignment() fails > and we come here: > /* Unsupported.
Re: [ARM] implement division using vrecpe/vrecps with -funsafe-math-optimizations
On 5 February 2016 at 18:40, Prathamesh Kulkarniwrote: > On 4 February 2016 at 16:31, Ramana Radhakrishnan > wrote: >> On Sun, Jan 17, 2016 at 9:06 AM, Prathamesh Kulkarni >> wrote: >>> On 31 July 2015 at 15:04, Ramana Radhakrishnan >>> wrote: On 29/07/15 11:09, Prathamesh Kulkarni wrote: > Hi, > This patch tries to implement division with multiplication by > reciprocal using vrecpe/vrecps > with -funsafe-math-optimizations and -freciprocal-math enabled. > Tested on arm-none-linux-gnueabihf using qemu. > OK for trunk ? > > Thank you, > Prathamesh > I've tried this in the past and never been convinced that 2 iterations are enough to get to stability with this given that the results are only precise for 8 bits / iteration. Thus I've always believed you need 3 iterations rather than 2 at which point I've never been sure that it's worth it. So the testing that you've done with this currently is not enough for this to go into the tree. I'd like this to be tested on a couple of different AArch32 implementations with a wider range of inputs to verify that the results are acceptable as well as running something like SPEC2k(6) with atleast one iteration to ensure correctness. >>> Hi, >>> I got results of SPEC2k6 fp benchmarks: >>> a15: +0.64% overall, 481.wrf: +6.46% >>> a53: +0.21% overall, 416.gamess: -1.39%, 481.wrf: +6.76% >>> a57: +0.35% overall, 481.wrf: +3.84% >>> The other benchmarks had (almost) identical results. >> >> Thanks for the benchmarking results - Please repost the patch with >> the changes that I had requested in my previous review - given it is >> now stage4 , I would rather queue changes like this for stage1 now. > Hi, > Please find the updated patch attached. > It passes testsuite for arm-none-linux-gnueabi, arm-none-linux-gnueabihf and > arm-none-eabi. > However the test-case added in the patch (neon-vect-div-1.c) fails to > get vectorized at -O2 > for armeb-none-linux-gnueabihf. > Charles suggested me to try with -O3, which worked. 
> It appears the test-case fails to get vectorized with > -fvect-cost-model=cheap (which is default enabled at -O2) > and passes for -fno-vect-cost-model / -fvect-cost-model=dynamic > > I can't figure out why it fails -fvect-cost-model=cheap. > From the vect dump (attached): > neon-vect-div-1.c:12:3: note: Setting misalignment to -1. > neon-vect-div-1.c:12:3: note: not vectorized: unsupported unaligned load.*_9 Hi, I think I have some idea why the test-case fails attached with patch fail to get vectorized on armeb with -O2. Issue with big endian vectorizer: The patch does not cause regressions on big endian vectorizer but fails to vectorize the test-cases attached with the patch, while they get vectorized on litttle-endian. Fails with armeb with the following message in dump: note: not vectorized: unsupported unaligned load.*_9 The behavior of big and little endian vectorizer seems to be different in arm_builtin_support_vector_misalignment() which overrides the hook targetm.vectorize.support_vector_misalignment(). targetm.vectorize.support_vector_misalignment is called by vect_supportable_dr_alignment () which in turn is called by verify_data_refs_alignment (). Execution upto following condition is common between arm and armeb in vect_supportable_dr_alignment(): if ((TYPE_USER_ALIGN (type) && !is_packed) || targetm.vectorize.support_vector_misalignment (mode, type, DR_MISALIGNMENT (dr), is_packed)) /* Can't software pipeline the loads, but can at least do them. */ return dr_unaligned_supported; For little endian case: arm_builtin_support_vector_misalignment() is called with V2SF mode and misalignment == -1, and the following condition becomes true: /* If the misalignment is unknown, we should be able to handle the access so long as it is not to a member of a packed data structure. 
*/ if (misalignment == -1) return true; Since the hook returned true we enter the condition above in vect_supportable_dr_alignment() and return dr_unaligned_supported; For big-endian: arm_builtin_support_vector_misalignment() is called with V2SF mode. The following condition that gates the entire function body fails: if (TARGET_NEON && !BYTES_BIG_ENDIAN && unaligned_access) and the default hook gets called with V2SF mode and the default hook returns false because movmisalign_optab does not exist for V2SF mode. So the condition above in vect_supportable_dr_alignment() fails and we come here: /* Unsupported. */ return dr_unaligned_unsupported; And hence we get the unaligned load not supported message in the dump for armeb in verify_data_ref_alignment (): static bool verify_data_ref_alignment (data_reference_p dr) { enum dr_alignment_support
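The little-endian/big-endian split traced above can be condensed into a small executable model. This is a hypothetical simplification for illustration only: the function names echo the GCC hooks, but the bodies model just the V2SF paths discussed in this message, not the real implementations.

```python
def arm_support_vector_misalignment(big_endian, unaligned_access,
                                    misalignment, is_packed):
    """Simplified model of arm_builtin_support_vector_misalignment() for V2SF.
    The real hook also checks TARGET_NEON and element sizes."""
    if not big_endian and unaligned_access:
        # Unknown misalignment (-1) is handled as long as the access is
        # not to a member of a packed data structure.
        if misalignment == -1:
            return not is_packed
        return True
    # Big-endian falls through to the default hook, which requires a
    # movmisalign optab for the mode; none exists for V2SF, so: False.
    return False

def vect_supportable_dr_alignment(hook_result):
    # Mirrors the condition in vect_supportable_dr_alignment() quoted above.
    return "dr_unaligned_supported" if hook_result else "dr_unaligned_unsupported"

# Little-endian: unknown misalignment on V2SF is supported.
print(vect_supportable_dr_alignment(
    arm_support_vector_misalignment(False, True, -1, False)))
# Big-endian: the default hook rejects it.
print(vect_supportable_dr_alignment(
    arm_support_vector_misalignment(True, True, -1, False)))
```

The second call returning "dr_unaligned_unsupported" for the big-endian case is exactly what produces the "not vectorized: unsupported unaligned load" message in the armeb vect dump.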
Re: [ARM] implement division using vrecpe/vrecps with -funsafe-math-optimizations
On 4 February 2016 at 16:31, Ramana Radhakrishnanwrote: > On Sun, Jan 17, 2016 at 9:06 AM, Prathamesh Kulkarni > wrote: >> On 31 July 2015 at 15:04, Ramana Radhakrishnan >> wrote: >>> >>> >>> On 29/07/15 11:09, Prathamesh Kulkarni wrote: Hi, This patch tries to implement division with multiplication by reciprocal using vrecpe/vrecps with -funsafe-math-optimizations and -freciprocal-math enabled. Tested on arm-none-linux-gnueabihf using qemu. OK for trunk ? Thank you, Prathamesh >>> >>> I've tried this in the past and never been convinced that 2 iterations are >>> enough to get to stability with this given that the results are only >>> precise for 8 bits / iteration. Thus I've always believed you need 3 >>> iterations rather than 2 at which point I've never been sure that it's >>> worth it. So the testing that you've done with this currently is not enough >>> for this to go into the tree. >>> >>> I'd like this to be tested on a couple of different AArch32 implementations >>> with a wider range of inputs to verify that the results are acceptable as >>> well as running something like SPEC2k(6) with atleast one iteration to >>> ensure correctness. >> Hi, >> I got results of SPEC2k6 fp benchmarks: >> a15: +0.64% overall, 481.wrf: +6.46% >> a53: +0.21% overall, 416.gamess: -1.39%, 481.wrf: +6.76% >> a57: +0.35% overall, 481.wrf: +3.84% >> The other benchmarks had (almost) identical results. > > Thanks for the benchmarking results - Please repost the patch with > the changes that I had requested in my previous review - given it is > now stage4 , I would rather queue changes like this for stage1 now. Hi, Please find the updated patch attached. It passes testsuite for arm-none-linux-gnueabi, arm-none-linux-gnueabihf and arm-none-eabi. However the test-case added in the patch (neon-vect-div-1.c) fails to get vectorized at -O2 for armeb-none-linux-gnueabihf. Charles suggested me to try with -O3, which worked. 
It appears the test-case fails to get vectorized with -fvect-cost-model=cheap (which is default enabled at -O2) and passes for -fno-vect-cost-model / -fvect-cost-model=dynamic I can't figure out why it fails -fvect-cost-model=cheap. From the vect dump (attached): neon-vect-div-1.c:12:3: note: Setting misalignment to -1. neon-vect-div-1.c:12:3: note: not vectorized: unsupported unaligned load.*_9 Thanks, Prathamesh > > Thanks, > Ramana > >> >> Thanks, >> Prathamesh >>> >>> >>> moving on to the patches. >>> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md index 654d9d5..28c2e2a 100644 --- a/gcc/config/arm/neon.md +++ b/gcc/config/arm/neon.md @@ -548,6 +548,32 @@ (const_string "neon_mul_")))] ) >>> >>> Please add a comment here. >>> +(define_expand "div3" + [(set (match_operand:VCVTF 0 "s_register_operand" "=w") +(div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w") + (match_operand:VCVTF 2 "s_register_operand" "w")))] >>> >>> I want to double check that this doesn't collide with Alan's patches for >>> FP16 especially if he reuses the VCVTF iterator for all the vcvt f16 cases. 
>>> + "TARGET_NEON && flag_unsafe_math_optimizations && flag_reciprocal_math" + { +rtx rec = gen_reg_rtx (mode); +rtx vrecps_temp = gen_reg_rtx (mode); + +/* Reciprocal estimate */ +emit_insn (gen_neon_vrecpe (rec, operands[2])); + +/* Perform 2 iterations of Newton-Raphson method for better accuracy */ +for (int i = 0; i < 2; i++) + { + emit_insn (gen_neon_vrecps (vrecps_temp, rec, operands[2])); + emit_insn (gen_mul3 (rec, rec, vrecps_temp)); + } + +/* We now have reciprocal in rec, perform operands[0] = operands[1] * rec */ +emit_insn (gen_mul3 (operands[0], operands[1], rec)); +DONE; + } +) + + (define_insn "mul3add_neon" [(set (match_operand:VDQW 0 "s_register_operand" "=w") (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" "w") diff --git a/gcc/testsuite/gcc.target/arm/vect-div-1.c b/gcc/testsuite/gcc.target/arm/vect-div-1.c new file mode 100644 index 000..e562ef3 --- /dev/null +++ b/gcc/testsuite/gcc.target/arm/vect-div-1.c @@ -0,0 +1,14 @@ +/* { dg-do compile } */ +/* { dg-require-effective-target arm_v8_neon_ok } */ +/* { dg-options "-O2 -funsafe-math-optimizations -ftree-vectorize -fdump-tree-vect-all" } */ +/* { dg-add-options arm_v8_neon } */ >>> >>> No this is wrong. >>> >>> What is armv8 specific about this test ? This is just like another test >>> that is for Neon. vrecpe / vrecps are not instructions that were introduced >>> in the v8 version of the
Re: [ARM] implement division using vrecpe/vrecps with -funsafe-math-optimizations
On Sun, Jan 17, 2016 at 9:06 AM, Prathamesh Kulkarniwrote: > On 31 July 2015 at 15:04, Ramana Radhakrishnan > wrote: >> >> >> On 29/07/15 11:09, Prathamesh Kulkarni wrote: >>> Hi, >>> This patch tries to implement division with multiplication by >>> reciprocal using vrecpe/vrecps >>> with -funsafe-math-optimizations and -freciprocal-math enabled. >>> Tested on arm-none-linux-gnueabihf using qemu. >>> OK for trunk ? >>> >>> Thank you, >>> Prathamesh >>> >> >> I've tried this in the past and never been convinced that 2 iterations are >> enough to get to stability with this given that the results are only precise >> for 8 bits / iteration. Thus I've always believed you need 3 iterations >> rather than 2 at which point I've never been sure that it's worth it. So the >> testing that you've done with this currently is not enough for this to go >> into the tree. >> >> I'd like this to be tested on a couple of different AArch32 implementations >> with a wider range of inputs to verify that the results are acceptable as >> well as running something like SPEC2k(6) with atleast one iteration to >> ensure correctness. > Hi, > I got results of SPEC2k6 fp benchmarks: > a15: +0.64% overall, 481.wrf: +6.46% > a53: +0.21% overall, 416.gamess: -1.39%, 481.wrf: +6.76% > a57: +0.35% overall, 481.wrf: +3.84% > The other benchmarks had (almost) identical results. Thanks for the benchmarking results - Please repost the patch with the changes that I had requested in my previous review - given it is now stage4 , I would rather queue changes like this for stage1 now. Thanks, Ramana > > Thanks, > Prathamesh >> >> >> moving on to the patches. >> >>> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md >>> index 654d9d5..28c2e2a 100644 >>> --- a/gcc/config/arm/neon.md >>> +++ b/gcc/config/arm/neon.md >>> @@ -548,6 +548,32 @@ >>> (const_string "neon_mul_")))] >>> ) >>> >> >> Please add a comment here. 
>> >>> +(define_expand "div3" >>> + [(set (match_operand:VCVTF 0 "s_register_operand" "=w") >>> +(div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w") >>> + (match_operand:VCVTF 2 "s_register_operand" "w")))] >> >> I want to double check that this doesn't collide with Alan's patches for >> FP16 especially if he reuses the VCVTF iterator for all the vcvt f16 cases. >> >>> + "TARGET_NEON && flag_unsafe_math_optimizations && flag_reciprocal_math" >>> + { >>> +rtx rec = gen_reg_rtx (mode); >>> +rtx vrecps_temp = gen_reg_rtx (mode); >>> + >>> +/* Reciprocal estimate */ >>> +emit_insn (gen_neon_vrecpe (rec, operands[2])); >>> + >>> +/* Perform 2 iterations of Newton-Raphson method for better accuracy */ >>> +for (int i = 0; i < 2; i++) >>> + { >>> + emit_insn (gen_neon_vrecps (vrecps_temp, rec, operands[2])); >>> + emit_insn (gen_mul3 (rec, rec, vrecps_temp)); >>> + } >>> + >>> +/* We now have reciprocal in rec, perform operands[0] = operands[1] * >>> rec */ >>> +emit_insn (gen_mul3 (operands[0], operands[1], rec)); >>> +DONE; >>> + } >>> +) >>> + >>> + >>> (define_insn "mul3add_neon" >>>[(set (match_operand:VDQW 0 "s_register_operand" "=w") >>> (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" >>> "w") >>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-1.c >>> b/gcc/testsuite/gcc.target/arm/vect-div-1.c >>> new file mode 100644 >>> index 000..e562ef3 >>> --- /dev/null >>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-1.c >>> @@ -0,0 +1,14 @@ >>> +/* { dg-do compile } */ >>> +/* { dg-require-effective-target arm_v8_neon_ok } */ >>> +/* { dg-options "-O2 -funsafe-math-optimizations -ftree-vectorize >>> -fdump-tree-vect-all" } */ >>> +/* { dg-add-options arm_v8_neon } */ >> >> No this is wrong. >> >> What is armv8 specific about this test ? This is just like another test that >> is for Neon. vrecpe / vrecps are not instructions that were introduced in >> the v8 version of the architecture. They've existed in the base Neon >> instruction set. 
The code generation above in the patterns will be enabled >> when TARGET_NEON is true which can happen when -mfpu=neon >> -mfloat-abi={softfp/hard} is true. >> >>> + >>> +void >>> +foo (int len, float * __restrict p, float *__restrict x) >>> +{ >>> + len = len & ~31; >>> + for (int i = 0; i < len; i++) >>> +p[i] = p[i] / x[i]; >>> +} >>> + >>> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */ >>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-2.c >>> b/gcc/testsuite/gcc.target/arm/vect-div-2.c >>> new file mode 100644 >>> index 000..8e15d0a >>> --- /dev/null >>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-2.c >>> @@ -0,0 +1,14 @@ >>> +/* { dg-do compile } */ >>> +/* { dg-require-effective-target arm_v8_neon_ok } */ >> >> And likewise. >> >>> +/* { dg-options "-O2 -funsafe-math-optimizations -fno-reciprocal-math >>>
Re: [ARM] implement division using vrecpe/vrecps with -funsafe-math-optimizations
On 31 July 2015 at 15:04, Ramana Radhakrishnanwrote: > > > On 29/07/15 11:09, Prathamesh Kulkarni wrote: >> Hi, >> This patch tries to implement division with multiplication by >> reciprocal using vrecpe/vrecps >> with -funsafe-math-optimizations and -freciprocal-math enabled. >> Tested on arm-none-linux-gnueabihf using qemu. >> OK for trunk ? >> >> Thank you, >> Prathamesh >> > > I've tried this in the past and never been convinced that 2 iterations are > enough to get to stability with this given that the results are only precise > for 8 bits / iteration. Thus I've always believed you need 3 iterations > rather than 2 at which point I've never been sure that it's worth it. So the > testing that you've done with this currently is not enough for this to go > into the tree. > > I'd like this to be tested on a couple of different AArch32 implementations > with a wider range of inputs to verify that the results are acceptable as > well as running something like SPEC2k(6) with atleast one iteration to ensure > correctness. Hi, I got results of SPEC2k6 fp benchmarks: a15: +0.64% overall, 481.wrf: +6.46% a53: +0.21% overall, 416.gamess: -1.39%, 481.wrf: +6.76% a57: +0.35% overall, 481.wrf: +3.84% The other benchmarks had (almost) identical results. Thanks, Prathamesh > > > moving on to the patches. > >> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md >> index 654d9d5..28c2e2a 100644 >> --- a/gcc/config/arm/neon.md >> +++ b/gcc/config/arm/neon.md >> @@ -548,6 +548,32 @@ >> (const_string "neon_mul_")))] >> ) >> > > Please add a comment here. > >> +(define_expand "div3" >> + [(set (match_operand:VCVTF 0 "s_register_operand" "=w") >> +(div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w") >> + (match_operand:VCVTF 2 "s_register_operand" "w")))] > > I want to double check that this doesn't collide with Alan's patches for FP16 > especially if he reuses the VCVTF iterator for all the vcvt f16 cases. 
> >> + "TARGET_NEON && flag_unsafe_math_optimizations && flag_reciprocal_math" >> + { >> +rtx rec = gen_reg_rtx (mode); >> +rtx vrecps_temp = gen_reg_rtx (mode); >> + >> +/* Reciprocal estimate */ >> +emit_insn (gen_neon_vrecpe (rec, operands[2])); >> + >> +/* Perform 2 iterations of Newton-Raphson method for better accuracy */ >> +for (int i = 0; i < 2; i++) >> + { >> + emit_insn (gen_neon_vrecps (vrecps_temp, rec, operands[2])); >> + emit_insn (gen_mul3 (rec, rec, vrecps_temp)); >> + } >> + >> +/* We now have reciprocal in rec, perform operands[0] = operands[1] * >> rec */ >> +emit_insn (gen_mul3 (operands[0], operands[1], rec)); >> +DONE; >> + } >> +) >> + >> + >> (define_insn "mul3add_neon" >>[(set (match_operand:VDQW 0 "s_register_operand" "=w") >> (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" >> "w") >> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-1.c >> b/gcc/testsuite/gcc.target/arm/vect-div-1.c >> new file mode 100644 >> index 000..e562ef3 >> --- /dev/null >> +++ b/gcc/testsuite/gcc.target/arm/vect-div-1.c >> @@ -0,0 +1,14 @@ >> +/* { dg-do compile } */ >> +/* { dg-require-effective-target arm_v8_neon_ok } */ >> +/* { dg-options "-O2 -funsafe-math-optimizations -ftree-vectorize >> -fdump-tree-vect-all" } */ >> +/* { dg-add-options arm_v8_neon } */ > > No this is wrong. > > What is armv8 specific about this test ? This is just like another test that > is for Neon. vrecpe / vrecps are not instructions that were introduced in the > v8 version of the architecture. They've existed in the base Neon instruction > set. The code generation above in the patterns will be enabled when > TARGET_NEON is true which can happen when -mfpu=neon > -mfloat-abi={softfp/hard} is true. 
> >> + >> +void >> +foo (int len, float * __restrict p, float *__restrict x) >> +{ >> + len = len & ~31; >> + for (int i = 0; i < len; i++) >> +p[i] = p[i] / x[i]; >> +} >> + >> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */ >> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-2.c >> b/gcc/testsuite/gcc.target/arm/vect-div-2.c >> new file mode 100644 >> index 000..8e15d0a >> --- /dev/null >> +++ b/gcc/testsuite/gcc.target/arm/vect-div-2.c >> @@ -0,0 +1,14 @@ >> +/* { dg-do compile } */ >> +/* { dg-require-effective-target arm_v8_neon_ok } */ > > And likewise. > >> +/* { dg-options "-O2 -funsafe-math-optimizations -fno-reciprocal-math >> -ftree-vectorize -fdump-tree-vect-all" } */ >> +/* { dg-add-options arm_v8_neon } */ >> + >> +void >> +foo (int len, float * __restrict p, float *__restrict x) >> +{ >> + len = len & ~31; >> + for (int i = 0; i < len; i++) >> +p[i] = p[i] / x[i]; >> +} >> + >> +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */ > > > regards > Ramana
Re: [ARM] implement division using vrecpe/vrecps with -funsafe-math-optimizations
On 31 July 2015 at 10:34, Ramana Radhakrishnan ramana.radhakrish...@foss.arm.com wrote:
> I've tried this in the past and never been convinced that 2 iterations are enough to get to stability with this given that the results are only precise for 8 bits / iteration. Thus I've always believed you need 3 iterations rather than 2, at which point I've never been sure that it's worth it. So the testing that you've done with this currently is not enough for this to go into the tree.

My understanding is that 2 iterations is sufficient for single-precision floating point (although not for double precision), because each iteration of Newton-Raphson doubles the number of bits of accuracy. I haven't worked through the maths myself, but https://en.wikipedia.org/wiki/Division_algorithm#Newton.E2.80.93Raphson_division says:

"This squaring of the error at each iteration step — the so-called quadratic convergence of Newton–Raphson's method — has the effect that the number of correct digits in the result roughly doubles for every iteration, a property that becomes extremely valuable when the numbers involved have many digits"

Therefore:
  vrecpe        -  8 bits of accuracy
  +1 iteration  - 16 bits of accuracy
  +2 iterations - 32 bits of accuracy (but in reality limited to the precision of a 32-bit float)

Since 32 bits is much more accuracy than the 24 bits of precision in a single-precision FP value, 2 iterations should be sufficient.

> I'd like this to be tested on a couple of different AArch32 implementations with a wider range of inputs to verify that the results are acceptable, as well as running something like SPEC2k(6) with at least one iteration to ensure correctness.

I can't argue with confirming theory matches practice :)

Some corner cases (eg numbers around FLT_MAX, FLT_MIN etc) may result in denormals or out-of-range values during the reciprocal calculation, which could result in answers which are less accurate than the typical case, but I think that is acceptable with -ffast-math.

Charles
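Charles's accuracy argument is easy to check numerically. The sketch below (a Python model using double precision as a stand-in for exact arithmetic; not GCC or hardware code) seeds a reciprocal estimate with a 2^-8 relative error, roughly what vrecpe provides, and applies the x' = x * (2 - d*x) refinement that a vrecps + vmul pair implements:

```python
def vrecps_model(d, x):
    # vrecps(d, x) computes 2 - d*x; multiplying the result by x gives
    # one Newton-Raphson step toward 1/d:  x' = x * (2 - d*x).
    return 2.0 - d * x

def refine(d, x, iterations):
    for _ in range(iterations):
        x = x * vrecps_model(d, x)
    return x

# Start from an estimate that is only ~8 bits accurate, as vrecpe is.
d = 3.7
x0 = (1.0 / d) * (1.0 + 2.0 ** -8)     # relative error 2^-8

err = lambda x: abs(x * d - 1.0)
e0 = err(x0)
e1 = err(refine(d, x0, 1))
e2 = err(refine(d, x0, 2))

# The error squares each step: ~2^-8 -> ~2^-16 -> ~2^-32, so two steps
# already exceed the 24-bit significand of an IEEE single-precision float.
print(e0, e1, e2)
```

If x = (1+e)/d, one step gives x' = x(2 - dx) = (1 - e^2)/d, so the relative error goes from e to e^2 exactly, which is the quadratic convergence the Wikipedia quote describes.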
Re: [ARM] implement division using vrecpe/vrecps with -funsafe-math-optimizations
On 29/07/15 11:09, Prathamesh Kulkarni wrote:
> Hi,
> This patch tries to implement division with multiplication by
> reciprocal using vrecpe/vrecps
> with -funsafe-math-optimizations and -freciprocal-math enabled.
> Tested on arm-none-linux-gnueabihf using qemu.
> OK for trunk ?
>
> Thank you,
> Prathamesh

I've tried this in the past and never been convinced that 2 iterations are enough to get to stability with this given that the results are only precise for 8 bits / iteration. Thus I've always believed you need 3 iterations rather than 2, at which point I've never been sure that it's worth it. So the testing that you've done with this currently is not enough for this to go into the tree.

I'd like this to be tested on a couple of different AArch32 implementations with a wider range of inputs to verify that the results are acceptable, as well as running something like SPEC2k(6) with at least one iteration to ensure correctness.

moving on to the patches.

> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
> index 654d9d5..28c2e2a 100644
> --- a/gcc/config/arm/neon.md
> +++ b/gcc/config/arm/neon.md
> @@ -548,6 +548,32 @@
>           (const_string "neon_mul_<V_elem_ch><q>")))]
>  )

Please add a comment here.

> +(define_expand "div<mode>3"
> +  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
> +        (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w")
> +                   (match_operand:VCVTF 2 "s_register_operand" "w")))]

I want to double check that this doesn't collide with Alan's patches for FP16, especially if he reuses the VCVTF iterator for all the vcvt f16 cases.
> +  "TARGET_NEON && flag_unsafe_math_optimizations && flag_reciprocal_math"
> +  {
> +    rtx rec = gen_reg_rtx (<MODE>mode);
> +    rtx vrecps_temp = gen_reg_rtx (<MODE>mode);
> +
> +    /* Reciprocal estimate */
> +    emit_insn (gen_neon_vrecpe<mode> (rec, operands[2]));
> +
> +    /* Perform 2 iterations of Newton-Raphson method for better accuracy */
> +    for (int i = 0; i < 2; i++)
> +      {
> +        emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));
> +        emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));
> +      }
> +
> +    /* We now have reciprocal in rec, perform operands[0] = operands[1] * rec */
> +    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));
> +    DONE;
> +  }
> +)
> +
> +
>  (define_insn "mul<mode>3add<mode>_neon"
>    [(set (match_operand:VDQW 0 "s_register_operand" "=w")
>          (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" "w")
> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-1.c b/gcc/testsuite/gcc.target/arm/vect-div-1.c
> new file mode 100644
> index 000..e562ef3
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/arm/vect-div-1.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target arm_v8_neon_ok } */
> +/* { dg-options "-O2 -funsafe-math-optimizations -ftree-vectorize -fdump-tree-vect-all" } */
> +/* { dg-add-options arm_v8_neon } */

No, this is wrong.

What is armv8 specific about this test ? This is just like another test that is for Neon. vrecpe / vrecps are not instructions that were introduced in the v8 version of the architecture. They've existed in the base Neon instruction set. The code generation above in the patterns will be enabled when TARGET_NEON is true, which can happen when -mfpu=neon -mfloat-abi={softfp/hard} is true.
> +
> +void
> +foo (int len, float * __restrict p, float *__restrict x)
> +{
> +  len = len & ~31;
> +  for (int i = 0; i < len; i++)
> +    p[i] = p[i] / x[i];
> +}
> +
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-2.c b/gcc/testsuite/gcc.target/arm/vect-div-2.c
> new file mode 100644
> index 000..8e15d0a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/arm/vect-div-2.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target arm_v8_neon_ok } */

And likewise.

> +/* { dg-options "-O2 -funsafe-math-optimizations -fno-reciprocal-math -ftree-vectorize -fdump-tree-vect-all" } */
> +/* { dg-add-options arm_v8_neon } */
> +
> +void
> +foo (int len, float * __restrict p, float *__restrict x)
> +{
> +  len = len & ~31;
> +  for (int i = 0; i < len; i++)
> +    p[i] = p[i] / x[i];
> +}
> +
> +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */

regards
Ramana
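For reference, the instruction sequence the define_expand under review emits (vrecpe, then two vrecps + vmul refinement steps, then a final multiply) corresponds to the scalar model below. This is a sketch in Python: `vrecpe_model` is a crude stand-in for the hardware estimate instruction (assumed ~8 correct bits), not its actual lookup-table behaviour.

```python
def vrecpe_model(d):
    # Hypothetical stand-in for vrecpe: the true reciprocal perturbed
    # so that only ~8 bits are correct.
    return (1.0 / d) * (1.0 + 2.0 ** -8)

def vrecps_model(d, x):
    # vrecps(d, x) = 2 - d*x, the Newton-Raphson step value.
    return 2.0 - d * x

def neon_div_model(n, d):
    rec = vrecpe_model(d)                   # vrecpe
    for _ in range(2):                      # two refinement iterations
        rec = rec * vrecps_model(d, rec)    # vrecps followed by vmul
    return n * rec                          # final vmul: n * (1/d)

print(neon_div_model(10.0, 4.0))            # close to 2.5
```

After two iterations the relative error is about (2^-8)^4 = 2^-32, below single-precision resolution, which is why the expander stops at two steps.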
Re: [ARM] implement division using vrecpe/vrecps with -funsafe-math-optimizations
On 29 July 2015 at 16:03, Kyrill Tkachov kyrylo.tkac...@arm.com wrote:
> Hi Prathamesh,
>
> This is probably not appropriate for -Os optimisation. And for speed optimisation I imagine it can vary a lot on the target the code is run. Do you have any benchmark results for this patch?

Hi Kyrill,
Thanks for the review. I have attempted to address your comments in the attached patch. Does it look OK from a correctness perspective ?
Unfortunately I haven't done benchmarking yet. I ran a test-case (attached) prepared by Charles for target arm-linux-gnueabihf (on APM Mustang), and it appeared to run faster with the patch:

Options passed: -O3 -mfpu=neon -mfloat-abi=hard -funsafe-math-optimizations

Before:
t8a, len =      32, took 2593977 ticks
t8a, len =     128, took 2408907 ticks
t8a, len =    1024, took 2354950 ticks
t8a, len =   65536, took 2365041 ticks
t8a, len = 1048576, took 2692928 ticks

After:
t8a, len =      32, took 2027323 ticks
t8a, len =     128, took 1920595 ticks
t8a, len =    1024, took 1827250 ticks
t8a, len =   65536, took 1797924 ticks
t8a, len = 1048576, took 2026274 ticks

I will get back to you soon with benchmarking results.

Thanks,
Prathamesh

> Thanks,
> Kyrill
>
> On 29/07/15 11:09, Prathamesh Kulkarni wrote:
>> Hi,
>> This patch tries to implement division with multiplication by
>> reciprocal using vrecpe/vrecps
>> with -funsafe-math-optimizations and -freciprocal-math enabled.
>> Tested on arm-none-linux-gnueabihf using qemu.
>> OK for trunk ?
>>
>> Thank you,
>> Prathamesh
>>
>> +    /* Perform 2 iterations of Newton-Raphson method for better accuracy */
>> +    for (int i = 0; i < 2; i++)
>> +      {
>> +        emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));
>> +        emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));
>> +      }
>> +
>> +    /* We now have reciprocal in rec, perform operands[0] = operands[1] * rec */
>> +    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));
>> +    DONE;
>> +  }
>> +)
>> +
>
> Full stop and two spaces at the end of the comments.
2015-07-28  Prathamesh Kulkarni  <prathamesh.kulka...@linaro.org>
	    Charles Baylis  <charles.bay...@linaro.org>

	* config/arm/neon.md (div<mode>3): New pattern.

	testsuite/
	* gcc.target/arm/vect-div-1.c: New test-case.
	* gcc.target/arm/vect-div-2.c: Likewise.

diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index 654d9d5..f2dbcc4 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -548,6 +548,33 @@
          (const_string "neon_mul_<V_elem_ch><q>")))]
 )
 
+(define_expand "div<mode>3"
+  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
+	(div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w")
+		   (match_operand:VCVTF 2 "s_register_operand" "w")))]
+  "TARGET_NEON && !optimize_size
+   && flag_unsafe_math_optimizations && flag_reciprocal_math"
+  {
+    rtx rec = gen_reg_rtx (<MODE>mode);
+    rtx vrecps_temp = gen_reg_rtx (<MODE>mode);
+
+    /* Reciprocal estimate.  */
+    emit_insn (gen_neon_vrecpe<mode> (rec, operands[2]));
+
+    /* Perform 2 iterations of Newton-Raphson method.  */
+    for (int i = 0; i < 2; i++)
+      {
+	emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));
+	emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));
+      }
+
+    /* We now have reciprocal in rec, perform operands[0] = operands[1] * rec.  */
+    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));
+    DONE;
+  }
+)
+
+
 (define_insn "mul<mode>3add<mode>_neon"
   [(set (match_operand:VDQW 0 "s_register_operand" "=w")
	(plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" "w")
diff --git a/gcc/testsuite/gcc.target/arm/vect-div-1.c b/gcc/testsuite/gcc.target/arm/vect-div-1.c
new file mode 100644
index 000..e562ef3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/vect-div-1.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_v8_neon_ok } */
+/* { dg-options "-O2 -funsafe-math-optimizations -ftree-vectorize -fdump-tree-vect-all" } */
+/* { dg-add-options arm_v8_neon } */
+
+void
+foo (int len, float * __restrict p, float *__restrict x)
+{
+  len = len & ~31;
+  for (int i = 0; i < len; i++)
+    p[i] = p[i] / x[i];
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.target/arm/vect-div-2.c b/gcc/testsuite/gcc.target/arm/vect-div-2.c
new file mode 100644
index 000..8e15d0a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/vect-div-2.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_v8_neon_ok } */
+/* { dg-options "-O2 -funsafe-math-optimizations -fno-reciprocal-math -ftree-vectorize -fdump-tree-vect-all" } */
+/* { dg-add-options arm_v8_neon } */
+
+void
+foo (int len, float * __restrict p, float *__restrict x)
+{
+  len = len & ~31;
+  for (int i = 0; i < len; i++)
+    p[i] = p[i] / x[i];
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */

test.tar.gz
Description: GNU Zip compressed data
[ARM] implement division using vrecpe/vrecps with -funsafe-math-optimizations
Hi,
This patch tries to implement division with multiplication by reciprocal
using vrecpe/vrecps with -funsafe-math-optimizations and -freciprocal-math
enabled.
Tested on arm-none-linux-gnueabihf using qemu.
OK for trunk?

Thank you,
Prathamesh

2015-07-28  Prathamesh Kulkarni  <prathamesh.kulka...@linaro.org>
	    Charles Baylis  <charles.bay...@linaro.org>

	* config/arm/neon.md (div<mode>3): New pattern.

	testsuite/
	* gcc.target/arm/vect-div-1.c: New test-case.
	* gcc.target/arm/vect-div-2.c: Likewise.

diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index 654d9d5..28c2e2a 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -548,6 +548,32 @@
          (const_string "neon_mul_<V_elem_ch><q>")))]
 )
 
+(define_expand "div<mode>3"
+  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
+	(div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w")
+		   (match_operand:VCVTF 2 "s_register_operand" "w")))]
+  "TARGET_NEON && flag_unsafe_math_optimizations && flag_reciprocal_math"
+  {
+    rtx rec = gen_reg_rtx (<MODE>mode);
+    rtx vrecps_temp = gen_reg_rtx (<MODE>mode);
+
+    /* Reciprocal estimate */
+    emit_insn (gen_neon_vrecpe<mode> (rec, operands[2]));
+
+    /* Perform 2 iterations of Newton-Raphson method for better accuracy */
+    for (int i = 0; i < 2; i++)
+      {
+	emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));
+	emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));
+      }
+
+    /* We now have reciprocal in rec, perform operands[0] = operands[1] * rec */
+    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));
+    DONE;
+  }
+)
+
+
 (define_insn "mul<mode>3add<mode>_neon"
   [(set (match_operand:VDQW 0 "s_register_operand" "=w")
	(plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" "w")
diff --git a/gcc/testsuite/gcc.target/arm/vect-div-1.c b/gcc/testsuite/gcc.target/arm/vect-div-1.c
new file mode 100644
index 000..e562ef3
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/vect-div-1.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_v8_neon_ok } */
+/* { dg-options "-O2 -funsafe-math-optimizations -ftree-vectorize -fdump-tree-vect-all" } */
+/* { dg-add-options arm_v8_neon } */
+
+void
+foo (int len, float * __restrict p, float *__restrict x)
+{
+  len = len & ~31;
+  for (int i = 0; i < len; i++)
+    p[i] = p[i] / x[i];
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.target/arm/vect-div-2.c b/gcc/testsuite/gcc.target/arm/vect-div-2.c
new file mode 100644
index 000..8e15d0a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/vect-div-2.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_v8_neon_ok } */
+/* { dg-options "-O2 -funsafe-math-optimizations -fno-reciprocal-math -ftree-vectorize -fdump-tree-vect-all" } */
+/* { dg-add-options arm_v8_neon } */
+
+void
+foo (int len, float * __restrict p, float *__restrict x)
+{
+  len = len & ~31;
+  for (int i = 0; i < len; i++)
+    p[i] = p[i] / x[i];
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */
Re: [ARM] implement division using vrecpe/vrecps with -funsafe-math-optimizations
Hi Prathamesh,

This is probably not appropriate for -Os optimisation. And for speed
optimisation I imagine it can vary a lot on the target the code is run on.
Do you have any benchmark results for this patch?

Thanks,
Kyrill

On 29/07/15 11:09, Prathamesh Kulkarni wrote:
> Hi,
> This patch tries to implement division with multiplication by reciprocal
> using vrecpe/vrecps with -funsafe-math-optimizations and
> -freciprocal-math enabled.
> Tested on arm-none-linux-gnueabihf using qemu.
> OK for trunk?
> Thank you,
> Prathamesh

> +    /* Perform 2 iterations of Newton-Raphson method for better accuracy */
> +    for (int i = 0; i < 2; i++)
> +      {
> +	emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));
> +	emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));
> +      }
> +
> +    /* We now have reciprocal in rec, perform operands[0] = operands[1] * rec */
> +    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));
> +    DONE;
> +  }
> +)
> +

Full stop and two spaces at the end of the comments.