On Fri, May 1, 2015 at 6:02 PM, Kyrill Tkachov <kyrylo.tkac...@foss.arm.com> wrote: > Hi all, > > GCC has some logic to expand calls to pow (x, 0.75), pow (0.25) and pow (x, > (int)k + 0.5) > using square roots. So, for the above examples it would generate sqrt (x) * > sqrt (sqrt (x)), > sqrt (sqrt (x)) and powi (x, k) * sqrt (x) (assuming k > 0. For k < 0 it > will calculate the > reciprocal of that). > > However, the implementation of these optimisations is done on a bit of an > ad-hoc basis with > the 0.25, 0.5, 0.75 cases hardcoded. > Judging by > https://gcc.gnu.org/wiki/summit2010?action=AttachFile&do=get&target=meissner2.pdf > these are the most commonly used exponents (at least in SPEC ;)) > > This patch generalises this optimisation into a (hopefully) more robust > algorithm. > In particular, it expands calls to pow (x, CST) by expanding the integer > part of CST > using a powi, like it does already, and then expanding the fractional part > as a product > of repeated applications of a square root if the fractional part can be > expressed > as a multiple of a power of 0.5. > > I try to explain the algorithm in more detail in the comments in the patch > but, for example: > > pow (x, 5.625) is not currently handled, but with this patch will be > expanded > to powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x))) because 5.625 == 5.0 + > 0.5 + 0.5**3 > > Negative exponents are handled in either of two ways, depending on the > exponent value: > * Using a simple reciprocal. > For example: > pow (x, -5.625) == 1.0 / pow (x, 5.625) > --> 1.0 / (powi (x, 5) * sqrt (x) * sqrt (sqrt (sqrt (x)))) > > * For pow (x, EXP) with negative exponent EXP with integer part INT and > fractional part FRAC: > pow (1.0 - FRAC) / powi (ceil (abs (EXP))). > For example: > pow (x, -5.875) == pow (x, 0.125) / powi (X, 6) > --> sqrt (sqrt (sqrt (x))) / (powi (x, 6)) > > > Since hardware square root instructions tend to be expensive, we may want to > reduce the number > of square roots we are willing to calculate. Since we reuse intermediate > square root results, > this boils down to restricting the depth of the square root chains. In all > the examples above > that depth is 3. I've made this maximum depth parametrisable in params.def. > By adjusting that > parameter we can adjust the resolution of this optimisation. So, if it's set > to '4' then we > will synthesize every exponent that is a multiple of 0.5**4 == 0.0625, > including negative > multiples. Currently, GCC will not try to expand negative multiples of > anything else than 0.5 > > I have tried to keep the existing functionality intact and activate this > only for > -funsafe-math-optimizations and only when the target has a sqrt instruction. > An exception to that is pow (x, 0.5) which we prefer to transform to sqrt > even > when a hardware sqrt is not available, presumably because the library > function for > sqrt is usually faster than pow (?).
Yes. It's also a safe transform - which you seem to put under flag_unsafe_math_optimizations only with your patch. It would be clearer to just leave the special-case - /* Optimize pow(x,0.5) = sqrt(x). This replacement is always safe - unless signed zeros must be maintained. pow(-0,0.5) = +0, while - sqrt(-0) = -0. */ - if (sqrtfn - && REAL_VALUES_EQUAL (c, dconsthalf) - && !HONOR_SIGNED_ZEROS (mode)) - return build_and_insert_call (gsi, loc, sqrtfn, arg0); in as-is. You also removed the Os constraint which you should put back in. Basically if !optimize_function_for_speed_p then generate at most two calls to sqrt (iff the HW has a sqrt instruction). You fail to add a testcase that checks that the optimization applies. Otherwise the idea looks good though there must be a better way to compute the series than by using real-arithmetic and forcefully trying out all possibilities... Richard. > > > Having seen the glibc implementation of a fully IEEE-754-compliant pow > function, I think we > would prefer synthesising the pow call whenever we can for -ffast-math. > > I have seen this optimisation trigger a few times in SPEC2k6, in particular > in 447.dealII > and 481.wrf where it replaced calls to powf (x, -0.25), pow (x, 0.125) and > pow (x, 0.875) > with square roots, multiplies and, in the case of -0.25, divides. > On 481.wrf I saw it remove a total of 22 out of 322 calls to pow > > On 481.wrf on aarch64 I saw about a 1% improvement. > The cycle count on x86_64 was also smaller, but not by a significant amount > (the same calls to > pow were eliminated). > > In general, I think this can shine if multiple expandable calls to pow > appear together. > So, for example for code: > double > baz (double a) > { > return __builtin_pow (a, -1.25) + __builtin_pow (a, 5.75) - __builtin_pow > (a, 3.375); > } > > we can generate: > baz: > fsqrt d3, d0 > fmul d4, d0, d0 > fmov d5, 1.0e+0 > fmul d6, d0, d4 > fsqrt d2, d3 > fmul d1, d0, d2 > fsqrt d0, d2 > fmul d3, d3, d2 > fdiv d1, d5, d1 > fmul d3, d3, d6 > fmul d2, d2, d0 > fmadd d0, d4, d3, d1 > fmsub d0, d6, d2, d0 > ret > > reusing the sqrt results and doing more optimisations rather than the > current: > baz: > stp x29, x30, [sp, -48]! > fmov d1, -1.25e+0 > add x29, sp, 0 > stp d8, d9, [sp, 16] > fmov d9, d0 > str d10, [sp, 32] > bl pow > fmov d8, d0 > fmov d0, d9 > fmov d1, 5.75e+0 > bl pow > fmov d10, d0 > fmov d0, d9 > fmov d1, 3.375e+0 > bl pow > fadd d8, d8, d10 > ldr d10, [sp, 32] > fsub d0, d8, d0 > ldp d8, d9, [sp, 16] > ldp x29, x30, [sp], 48 > ret > > > Of course gcc could already do that if the exponents were to fall in the set > {0.25, 0.75, k+0.5} > but with this patch that set can be greatly expanded. > > I suppose if we're really lucky we might even open up new vectorisation > opportunities. > For example: > void > vecfoo (float *a, float *b) > { > for (int i = 0; i < N; i++) > a[i] = __builtin_powf (a[i], 1.25f) - __builtin_powf (b[i], 3.625); > } > > will now be vectorisable if we have a hardware vector sqrt instruction. > Though I'm not sure > how often this would appear in real code. > > > I would like advice on some implementation details: > - The new function representable_as_half_series_p is used to check if the > fractional > part of an exponent can be represented as a multiple of a power of 0.5. It > does so > by using REAL_VALUE_TYPE arithmetic and a loop. I wonder whether there is a > smarter > way of doing this, considering that REAL_VALUE_TYPE holds the bit > representation of the > floating point number? > > - Are there any correctness cases that I may have missed? This patch gates > the optimisation > on -funsafe-math-optimizations, but maybe there are some edge cases that I > missed? > > - What should be the default value of the max-pow-sqrt-depth param? In this > patch it's 5 which > on second thought strikes me as a bit aggressive. To catch exponents that > are multiples of > 0.25 we need it to be 2. For multiples of 0.125 it has to be 3 etc... I > suppose it depends on > how likely more fine-grained powers are to appear in real code, how > expensive the C library > implementation of pow is, and how expensive are the sqrt instructions in > hardware. > > > Bootstrapped and tested on x86_64, aarch64, arm (pending) > SPEC2k6 built and ran fine. Can anyone suggest anything other codebase that > might > be of interest to look at? > > Thanks, > Kyrill > > 2015-05-01 Kyrylo Tkachov <kyrylo.tkac...@arm.com> > > * params.def (PARAM_MAX_POW_SQRT_DEPTH): New param. > * tree-ssa-math-opts.c: Include params.h > (pow_synth_sqrt_info): New struct. > (representable_as_half_series_p): New function. > (get_fn_chain): Likewise. > (print_nested_fn): Likewise. > (dump_fractional_sqrt_sequence): Likewise. > (dump_integer_part): Likewise. > (expand_pow_as_sqrts): Likewise. > (gimple_expand_builtin_pow): Use above to attempt to expand > pow as series of square roots. Removed now unused variables. > > 2015-05-01 Kyrylo Tkachov <kyrylo.tkac...@arm.com> > > * gcc.dg/pow-sqrt.x: New file. > * gcc.dg/pow-sqrt-1.c: New test. > * gcc.dg/pow-sqrt-2.c: Likewise. > * gcc.dg/pow-sqrt-3.c: Likewise.