On Fri, May 27, 2016 at 05:57:30PM -0500, Evandro Menezes wrote: > On 05/25/16 11:16, James Greenhalgh wrote: > >On Wed, Apr 27, 2016 at 04:15:53PM -0500, Evandro Menezes wrote: > >> gcc/ > >> * config/aarch64/aarch64-protos.h > >> (tune_params): Add new member "approx_div_modes". > >> (aarch64_emit_approx_div): Declare new function. > >> * config/aarch64/aarch64.c > >> (generic_tunings): New member "approx_div_modes". > >> (cortexa35_tunings): Likewise. > >> (cortexa53_tunings): Likewise. > >> (cortexa57_tunings): Likewise. > >> (cortexa72_tunings): Likewise. > >> (exynosm1_tunings): Likewise. > >> (thunderx_tunings): Likewise. > >> (xgene1_tunings): Likewise. > >> (aarch64_emit_approx_div): Define new function. > >> * config/aarch64/aarch64.md ("div<mode>3"): New expansion. > >> * config/aarch64/aarch64-simd.md ("div<mode>3"): Likewise. > >> * config/aarch64/aarch64.opt (-mlow-precision-div): Add new option. > >> * doc/invoke.texi (-mlow-precision-div): Describe new option. > >My comments from the other two patches around using a structure to > >group up the tuning flags and whether we really want the new option > >apply here too. > > > >This code has no consumers by default and is only used for > >-mlow-precision-div. Is this option likely to be useful to our users in > >practice? It might all be more palatable under something like the rs6000's > >-mrecip=opt . > > I agree. OK as a follow up?
Works for me. > >>diff --git a/gcc/config/aarch64/aarch64-simd.md > >>b/gcc/config/aarch64/aarch64-simd.md > >>index 47ccb18..7e99e16 100644 > >>--- a/gcc/config/aarch64/aarch64-simd.md > >>+++ b/gcc/config/aarch64/aarch64-simd.md > >>@@ -1509,7 +1509,19 @@ > >> [(set_attr "type" "neon_fp_mul_<Vetype><q>")] > >> ) > >>-(define_insn "div<mode>3" > >>+(define_expand "div<mode>3" > >>+ [(set (match_operand:VDQF 0 "register_operand") > >>+ (div:VDQF (match_operand:VDQF 1 "general_operand") > >What does this relaxation to general_operand give you? > > Hold that thought... > > >>+ (match_operand:VDQF 2 "register_operand")))] > >>+ "TARGET_SIMD" > >>+{ > >>+ if (aarch64_emit_approx_div (operands[0], operands[1], operands[2])) > >>+ DONE; > >>+ > >>+ operands[1] = force_reg (<MODE>mode, operands[1]); > >...other than the need to do this (sorry if I've missed something obvious). > > Hold on... > > >>+ if (num != CONST1_RTX (mode)) > >>+ { > >>+ /* Calculate the approximate division. */ > >>+ rtx xnum = force_reg (mode, num); > >>+ emit_set_insn (xrcp, gen_rtx_MULT (mode, xrcp, xnum)); > >>+ } > > About that relaxation, as you can see here, since the series > approximates the reciprocal of the denominator, if the numerator is > 1.0, a register can be spared, as the result is ready and the > numerator is not needed. But, in the case that the multiplication is by 1, can we not rely on the other optimization machinery eliminating it? I mean, I see the optimization that this enables for you, but can't you rely on future passes to do the cleanup, and save yourself the few lines of special casing? > +/* Emit the instruction sequence to compute the approximation for the > division > + of NUM by DEN and return whether the sequence was emitted or not. */ Needs a brief mention of what we use QUO for :). > + > +bool > +aarch64_emit_approx_div (rtx quo, rtx num, rtx den) > +{ > + machine_mode mode = GET_MODE (quo); > + bool use_approx_division_p = (flag_mlow_precision_div > + || (aarch64_tune_params.approx_modes->division > + & AARCH64_APPROX_MODE (mode))); > + > + if (!flag_finite_math_only > + || flag_trapping_math > + || !flag_unsafe_math_optimizations > + || optimize_function_for_size_p (cfun) > + || !use_approx_division_p) > + return false; > + > + /* Estimate the approximate reciprocal. */ > + rtx xrcp = gen_reg_rtx (mode); > + emit_insn ((*get_recpe_type (mode)) (xrcp, den)); > + > + /* Iterate over the series twice for SF and thrice for DF. */ > + int iterations = (GET_MODE_INNER (mode) == DFmode) ? 3 : 2; > + > + /* Optionally iterate over the series once less for faster performance, > + while sacrificing the accuracy. */ > + if (flag_mlow_precision_div) > + iterations--; > + > + /* Iterate over the series to calculate the approximate reciprocal. */ > + rtx xtmp = gen_reg_rtx (mode); > + while (iterations--) > + { > + emit_insn ((*get_recps_type (mode)) (xtmp, xrcp, den)); > + > + if (iterations > 0) > + emit_set_insn (xrcp, gen_rtx_MULT (mode, xrcp, xtmp)); > + } > + > + if (num != CONST1_RTX (mode)) > + { > + /* As the approximate reciprocal of the denominator is already > calculated, > + only calculate the approximate division when the numerator is not > 1.0. */ Long lines. > + rtx xnum = force_reg (mode, num); > + emit_set_insn (xrcp, gen_rtx_MULT (mode, xrcp, xnum)); > + } > + > + /* Finalize the approximation. */ > + emit_set_insn (quo, gen_rtx_MULT (mode, xrcp, xtmp)); > + return true; > +} > + > /* Return the number of instructions that can be issued per cycle. */ > static int > aarch64_sched_issue_rate (void) Thanks, James