Re: [go-nuts] Re: How to optimize as with -O2 from Gcc? (50x speedup)

Michael Jones Fri, 04 Aug 2017 15:48:57 -0700

Dorival,

There is a great deal to learn from this issue. Some interesting issues:


Exponentiation by arbitrary real or complex powers is a little complicated
to evaluate quickly with accuracy. Other than special cases, you're forced
toward x**y ==> Exp(y*Log(x)), where each of Exp() and Log() are related to
Taylor series expansions.

Obviously if you know that your x is in (-1,0,1} then magic is possible.
Same if you have complex values and both R(x) and I(x) are in {-1,0,1}.

Likewise if y is integral. In that case, Exp/Log is not forced and
multiplications are possible. You never need more multiplies than the
number of bits in the binary expression of y. peterGo shared code with you
above. Mine looks similar:

// Integer power: compute a**b using binary powering algorithm
// See Donald Knuth, The Art of Computer Programming, Volume 2, Section
4.6.3
func Pow(a, b int) int {
p := 1
for b > 0 {
if b&1 != 0 {
p *= a
}
b >>= 1
a *= a
}
return p
}

This binary powering method is not necessarily optimal. There can be
shorter sequences than that chosen by the binary method. Searching for
"addition chains" will give background on this. I have a little Go program
for generating such things, and here is some simple output:

celeste:power mtj$ power -n 200
    1: cost=  0  parent=     0 needed=1
    2: cost=  1  parent=     1 needed=1 2
    3: cost=  2  parent=     2 needed=1 2 3
    4: cost=  2  parent=     2 needed=1 2 4
    5: cost=  3  parent=     3 needed=1 2 3 5
    6: cost=  3  parent=     3 needed=1 2 3 6
    7: cost=  4  parent=     4 needed=1 2 3 4 7
    8: cost=  3  parent=     4 needed=1 2 4 8
    9: cost=  4  parent=     6 needed=1 2 3 6 9
   10: cost=  4  parent=     5 needed=1 2 3 5 10
   11: cost=  5  parent=     6 needed=1 2 3 5 6 11
   12: cost=  4  parent=     6 needed=1 2 3 6 12
   13: cost=  5  parent=    10 needed=1 2 3 5 10 13
   14: cost=  5  parent=     7 needed=1 2 3 4 7 14
   15: cost=  5  parent=     9 needed=1 2 3 6 9 15
   16: cost=  4  parent=     8 needed=1 2 4 8 16
   17: cost=  5  parent=    16 needed=1 2 4 8 16 17
   18: cost=  5  parent=     9 needed=1 2 3 6 9 18
   19: cost=  6  parent=    16 needed=1 2 3 4 8 16 19
   20: cost=  5  parent=    10 needed=1 2 3 5 10 20
:
 4088: cost= 15  parent=  2044 needed=1 2 3 6 9 15 30 60 120 240 241 271
511 1022 2044 4088
 4089: cost= 15  parent=  2726 needed=1 2 3 5 10 20 40 80 160 320 640 643
1283 1363 2726 4089
 4090: cost= 15  parent=  2045 needed=1 2 4 8 16 17 32 49 98 196 392 409
818 1227 2045 4090
 4091: cost= 16  parent=  3994 needed=1 2 4 8 16 32 64 65 97 161 322 644
709 1288 1997 3994 4091
 4092: cost= 15  parent=  2046 needed=1 2 4 8 16 17 32 49 81 162 324 341
682 1023 2046 4092
 4093: cost= 16  parent=  3352 needed=1 2 4 8 16 32 64 65 97 161 322 419
741 838 1676 3352 4093
 4094: cost= 16  parent=  2047 needed=1 2 4 8 16 17 32 64 128 145 290 307
580 887 1160 2047 4094
 4095: cost= 15  parent=  2340 needed=1 2 4 8 16 32 64 65 130 195 390 585
1170 1755 2340 4095
 4096: cost= 12  parent=  2048 needed=1 2 4 8 16 32 64 128 256 512 1024
2048 4096


We can understand this as follows, x**18 has an inherent cost of 5
multiplies:

x2 := x*x
x3 := x2*x
x6 := x3*x3
x9 := x6*x3
x18 := x9*x9

The "parent" is the larger factor, x18 needs x9, x9 needs x6, and so on.

However, since you mentioned special cases for powers under 20, you should
realize that there is an issue with expressing this in Go. If you use a
switch statement, and since switch statements don't "pack" dense
alternatives using a jump table, the code for x18 will likely be at the end
of 18 if tests (if power==1, if power ==2, ..), which will cost more than
the multiplies do. When I need this kind of speed up, i use such programs
to generate assembler and also generate a jump table as well, which ends up
looking like (in a different application than powering) and in a Go setting:

var dispatchSLBS = []func(int, [][]int, int) [][]int{
        nil,
        solutionListBySum1,
        solutionListBySum2,
        solutionListBySum3,
        solutionListBySum4,
        solutionListBySum5,
        solutionListBySum6,
        solutionListBySum7,
        solutionListBySum8,
        solutionListBySum9,
        solutionListBySum10,
        solutionListBySum11,
        solutionListBySum12,
        solutionListBySum13,
        solutionListBySum14,
        solutionListBySum15,

I have never tried this for something like exponentiation, but an all-Go
solution with a jump table and autogenerated functions might work for you.

Michael

P.S. Early on, I urged the Go designers to provide an exponentiation
operator just for this situation...where the compiler could translate a
static x**18 into optimal code. IBM did it in 1959 so it seemed that maybe
we could learn from that. The vote against was logical---people don't write
code like that. But some do, and i guess that you and I are two of them. :-)


On Fri, Aug 4, 2017 at 5:36 AM, peterGo <go.peter...@gmail.com> wrote:

> Dorival Pedroso,
>
> PowP has a lot of code. PowG is simpler and it is modestly slower for
> small values of n and much faster for larger values of n.
>
> func PowG(x float64, n uint32) float64 {
>     y := 1.0
>     for i := n; i > 0; i >>= 1 {
>         if i&1 == 1 {
>             y *= x
>         }
>         x *= x
>     }
>     return y
> }
>
>
> BenchmarkPowP1      1000000000        2.89 ns/op
> BenchmarkPowP5       100000000       13.6 ns/op
> BenchmarkPowP10       50000000       29.2 ns/op
> BenchmarkPowP20       20000000       89.6 ns/op
> BenchmarkPowP200        200000    11666 ns/op
> BenchmarkPowG1      1000000000        2.66 ns/op
> BenchmarkPowG5       100000000       17.9 ns/op
> BenchmarkPowG10       30000000       45.1 ns/op
> BenchmarkPowG20       20000000       97.7 ns/op
> BenchmarkPowG200        300000     3972 ns/op
>
> Peter
>
>
> On Friday, August 4, 2017 at 10:08:36 AM UTC, Dorival Pedroso wrote:
>>
>> Since my ASM skills are limited, for positive integers, I'm planning on
>> using:
>>
>> // PowP computes real raised to positive integer xⁿ
>> func PowP(x float64, n uint32) (r float64) {
>> if n == 0 {
>> return 1.0
>> }
>> if n == 1 {
>> return x
>> }
>> if n == 2 {
>> return x * x
>> }
>> if n == 3 {
>> return x * x * x
>> }
>> if n == 4 {
>> r = x * x
>> return r * r
>> }
>> if n == 5 {
>> r = x * x
>> return r * r * x
>> }
>> if n == 6 {
>> r = x * x * x
>> return r * r
>> }
>> if n == 7 {
>> r = x * x * x
>> return r * r * x
>> }
>> if n == 8 {
>> r = x * x * x * x
>> return r * r
>> }
>> if n == 9 {
>> r = x * x * x
>> return r * r * r
>> }
>> if n == 10 {
>> r = x * x * x
>> return r * r * r * x
>> }
>> r = x * x * x
>> r = r * r * r * x
>> var i uint32
>> for i = 11; i <= 20; i++ {
>> r *= x
>> if n == i {
>> return
>> }
>> }
>> return math.Pow(x, float64(n))
>> }
>>
>> which gives a speedup of 5-8 times for small positive integers (<20). the
>> case of negative integers could be handled too.
>>
>> output:
>> BenchmarkPowP10-32         50000000        30.4 ns/op
>> BenchmarkPowP10std-32     5000000       247 ns/op
>> BenchmarkPowP20-32         20000000        98.2 ns/op
>> BenchmarkPowP20std-32     3000000       561 ns/op
>> BenchmarkPowP200-32        200000     10145 ns/op
>> BenchmarkPowP200std-32      200000      9231 ns/op
>>
>> all files are here: https://gist.github.com/
>> cpmech/9f871df5b59fa8407221b0d3fb361a3d
>>
>> I'm also looking at the Cephes library: http://www.netlib.org/cephes/
>> for ideas
>>
>> maybe we could have a similar function in Go (and many more optimised
>> ones, especially for complex numbers...)
>>
>> cheers
>> d
>>
>>
>>
>> On Friday, August 4, 2017 at 4:20:41 PM UTC+10, Dorival Pedroso wrote:
>>>
>>> I've noticed that this C code:
>>>
>>> #include "math.h"
>>> int main() {
>>>         double x = 2.5;
>>>         int Nmax = 10000000;
>>>         for (int N=0; N<Nmax; N++) {
>>>                 for (int i=0; i<20; i++) {
>>>                         pow(x, i);
>>>                 }
>>>         }
>>> }
>>>
>>> can run up to 50x faster than this Go code:
>>>
>>> package main
>>>
>>> import "math"
>>>
>>> func main() {
>>> x := 2.5
>>> Nmax := 10000000
>>> for N := 0; N < Nmax; N++ {
>>> for i := 0; i < 20; i++ {
>>> math.Pow(x, float64(i))
>>> }
>>> }
>>> }
>>>
>>> The C code was compiled with: gcc -O2 ccode.c -o ccode -lm
>>> then run with time ./ccode
>>>
>>> The Go code was compiled with: go build gcode.go
>>> then run with  time ./gcode
>>>
>>> I've used the time command on Linux (Ubuntu) to get some estimate.
>>>
>>> So the question is: how can we make the Go code faster?
>>>
>> --
> You received this message because you are subscribed to the Google Groups
> "golang-nuts" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to golang-nuts+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>



-- 
Michael T. Jones
michael.jo...@gmail.com

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: [go-nuts] Re: How to optimize as with -O2 from Gcc? (50x speedup)

Reply via email to