On Fri, May 1, 2015 at 6:02 PM, Kyrill Tkachov
<kyrylo.tkac...@foss.arm.com> wrote:
> Hi all,
>
> GCC has some logic to expand calls to pow (x, 0.75), pow (0.25) and pow (x,
> (int)k + 0.5)
> using square roots. So, for the above examples it would generate sqrt (x) *
> sqrt (sqrt (x)),
> sqrt (sqrt (x)) and powi (x, k) * sqrt (x) (assuming k > 0. For k < 0 it
> will calculate the
> reciprocal of that).
>
> However, the implementation of these optimisations is done on a bit of an
> ad-hoc basis with
> the 0.25, 0.5, 0.75 cases hardcoded.
> Judging by
> https://gcc.gnu.org/wiki/summit2010?action=AttachFile&do=get&target=meissner2.pdf
> these are the most commonly used exponents (at least in SPEC ;))
>
> This patch generalises this optimisation into a (hopefully) more robust
> algorithm.
> In particular, it expands calls to pow (x, CST) by expanding the integer
> part of CST
> using a powi, like it does already, and then expanding the fractional part
> as a product
> of repeated applications of a square root if the fractional part can be
> expressed
> as a multiple of a power of 0.5.
>
> I try to explain the algorithm in more detail in the comments in the patch
> but, for example:
>
> pow (x, 5.625) is not currently handled, but with this patch will be
> expanded
> to powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))) because 5.625 == 5.0 +
> 0.5 + 0.5**3
>
> Negative exponents are handled in either of two ways, depending on the
> exponent value:
> * Using a simple reciprocal.
>   For example:
>   pow (x, -5.625) == 1.0 / pow (x, 5.625)
>     --> 1.0 / (powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))))
>
> * For pow (x, EXP) with negative exponent EXP with integer part INT and
> fractional part FRAC:
> pow (1.0 - FRAC) / powi (ceil (abs (EXP))).
>   For example:
>   pow (x, -5.875) == pow (x, 0.125) / powi (X, 6)
>     --> sqrt (sqrt (sqrt (x))) / (powi (x, 6))
>
>
> Since hardware square root instructions tend to be expensive, we may want to
> reduce the number
> of square roots we are willing to calculate. Since we reuse intermediate
> square root results,
> this boils down to restricting the depth of the square root chains. In all
> the examples above
> that depth is 3. I've made this maximum depth parametrisable in params.def.
> By adjusting that
> parameter we can adjust the resolution of this optimisation. So, if it's set
> to '4' then we
> will synthesize every exponent that is a multiple of 0.5**4 == 0.0625,
> including negative
> multiples. Currently, GCC will not try to expand negative multiples of
> anything else than 0.5
>
> I have tried to keep the existing functionality intact and activate this
> only for
> -funsafe-math-optimizations and only when the target has a sqrt instruction.
>  An exception to that is pow (x, 0.5) which we prefer to transform to sqrt
> even
> when a hardware sqrt is not available, presumably because the library
> function for
> sqrt is usually faster than pow (?).

Yes.  It's also a safe transform - which you seem to put under
flag_unsafe_math_optimizations only with your patch.

It would be clearer to just leave the special-case

-  /* Optimize pow(x,0.5) = sqrt(x).  This replacement is always safe
-     unless signed zeros must be maintained.  pow(-0,0.5) = +0, while
-     sqrt(-0) = -0.  */
-  if (sqrtfn
-      && REAL_VALUES_EQUAL (c, dconsthalf)
-      && !HONOR_SIGNED_ZEROS (mode))
-    return build_and_insert_call (gsi, loc, sqrtfn, arg0);

in as-is.

You also removed the Os constraint which you should put back in.
Basically if !optimize_function_for_speed_p then generate at most
two calls to sqrt (iff the HW has a sqrt instruction).

You fail to add a testcase that checks that the optimization applies.

Otherwise the idea looks good though there must be a better way
to compute the series than by using real-arithmetic and forcefully
trying out all possibilities...

Richard.

>
>
> Having seen the glibc implementation of a fully IEEE-754-compliant pow
> function, I think we
> would prefer synthesising the pow call whenever we can for -ffast-math.
>
> I have seen this optimisation trigger a few times in SPEC2k6, in particular
> in 447.dealII
> and 481.wrf where it replaced calls to powf (x, -0.25), pow (x, 0.125) and
> pow (x, 0.875)
> with square roots, multiplies and, in the case of -0.25, divides.
> On 481.wrf I saw it remove a total of 22 out of 322 calls to pow
>
> On 481.wrf on aarch64 I saw about a 1% improvement.
> The cycle count on x86_64 was also smaller, but not by a significant amount
> (the same calls to
> pow were eliminated).
>
> In general, I think this can shine if multiple expandable calls to pow
> appear together.
> So, for example for code:
> double
> baz (double a)
> {
>   return __builtin_pow (a, -1.25) + __builtin_pow (a, 5.75) - __builtin_pow
> (a, 3.375);
> }
>
> we can generate:
> baz:
>         fsqrt   d3, d0
>         fmul    d4, d0, d0
>         fmov    d5, 1.0e+0
>         fmul    d6, d0, d4
>         fsqrt   d2, d3
>         fmul    d1, d0, d2
>         fsqrt   d0, d2
>         fmul    d3, d3, d2
>         fdiv    d1, d5, d1
>         fmul    d3, d3, d6
>         fmul    d2, d2, d0
>         fmadd   d0, d4, d3, d1
>         fmsub   d0, d6, d2, d0
>         ret
>
> reusing the sqrt results and doing more optimisations rather than the
> current:
> baz:
>         stp     x29, x30, [sp, -48]!
>         fmov    d1, -1.25e+0
>         add     x29, sp, 0
>         stp     d8, d9, [sp, 16]
>         fmov    d9, d0
>         str     d10, [sp, 32]
>         bl      pow
>         fmov    d8, d0
>         fmov    d0, d9
>         fmov    d1, 5.75e+0
>         bl      pow
>         fmov    d10, d0
>         fmov    d0, d9
>         fmov    d1, 3.375e+0
>         bl      pow
>         fadd    d8, d8, d10
>         ldr     d10, [sp, 32]
>         fsub    d0, d8, d0
>         ldp     d8, d9, [sp, 16]
>         ldp     x29, x30, [sp], 48
>         ret
>
>
> Of course gcc could already do that if the exponents were to fall in the set
> {0.25, 0.75, k+0.5}
> but with this patch that set can be greatly expanded.
>
> I suppose if we're really lucky we might even open up new vectorisation
> opportunities.
> For example:
> void
> vecfoo (float *a, float *b)
> {
>   for (int i = 0; i < N; i++)
>     a[i] = __builtin_powf (a[i], 1.25f) - __builtin_powf (b[i], 3.625);
> }
>
> will now be vectorisable if we have a hardware vector sqrt instruction.
> Though I'm not sure
> how often this would appear in real code.
>
>
> I would like advice on some implementation details:
> - The new function representable_as_half_series_p is used to check if the
> fractional
> part of an exponent can be represented as a multiple of a power of 0.5. It
> does so
> by using REAL_VALUE_TYPE arithmetic and a loop. I wonder whether there is a
> smarter
> way of doing this, considering that REAL_VALUE_TYPE holds the bit
> representation of the
> floating point number?
>
> - Are there any correctness cases that I may have missed? This patch gates
> the optimisation
> on -funsafe-math-optimizations, but maybe there are some edge cases that I
> missed?
>
> - What should be the default value of the max-pow-sqrt-depth param? In this
> patch it's 5 which
> on second thought strikes me as a bit aggressive. To catch exponents that
> are multiples of
> 0.25 we need it to be 2. For multiples of 0.125 it has to be 3 etc... I
> suppose it depends on
> how likely more fine-grained powers are to appear in real code, how
> expensive the C library
> implementation of pow is, and how expensive are the sqrt instructions in
> hardware.
>
>
> Bootstrapped and tested on x86_64, aarch64, arm (pending)
> SPEC2k6 built and ran fine. Can anyone suggest anything other codebase that
> might
> be of interest to look at?
>
> Thanks,
> Kyrill
>
> 2015-05-01  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>
>
>     * params.def (PARAM_MAX_POW_SQRT_DEPTH): New param.
>     * tree-ssa-math-opts.c: Include params.h
>     (pow_synth_sqrt_info): New struct.
>     (representable_as_half_series_p): New function.
>     (get_fn_chain): Likewise.
>     (print_nested_fn): Likewise.
>     (dump_fractional_sqrt_sequence): Likewise.
>     (dump_integer_part): Likewise.
>     (expand_pow_as_sqrts): Likewise.
>     (gimple_expand_builtin_pow): Use above to attempt to expand
>     pow as series of square roots.  Removed now unused variables.
>
> 2015-05-01  Kyrylo Tkachov  <kyrylo.tkac...@arm.com>
>
>     * gcc.dg/pow-sqrt.x: New file.
>     * gcc.dg/pow-sqrt-1.c: New test.
>     * gcc.dg/pow-sqrt-2.c: Likewise.
>     * gcc.dg/pow-sqrt-3.c: Likewise.

Reply via email to