Re: [go-nuts] Any interest in nat.mulRange simplification/optimization?

Bakul Shah Sat, 13 Jan 2024 17:04:42 -0800

Fair enough!

I replied as I found a factor of 60 a bit surprising. Though I shouldn't have 
been -- my n digits of pi scheme program was consistently about 16-17 times 
slower than gmp based one (at least up to 1E9), using the same algorithm 
(Chudnovsky's), which is also highly dependent on multiplication speed. And 
both (gambit-scheme and gmp) use faster algorithms for very large numbers. 
Adding these faster algorithms to math/big is probably not justifiable....


Julia seems to use libgmp. "info gmp" reveals it implements Luchny's algorithm.

> On Jan 13, 2024, at 4:07 PM, Rob Pike <r...@golang.org> wrote:
> 
> This is getting interesting but maybe only for me, so I'll stop here.
> 
> I did some profiling and the majority of the time is spent in 
> math/big.addMulVVW. Thinking I might be doing something stupid, I compared it 
> with the Go implementation at 
> http://www.luschny.de/math/factorial/scala/FactorialScalaCsharp.htm, and 
> found that my version (although "straightforward"), is about the same amount 
> of code but without special casing and support tables, yet about 25% faster - 
> and yes, they get the same answer.
> 
> So I believe the factor of 60 comes from comparing an implementation in a 
> language and libraries designed for numerical computation against a much less 
> specialized and optimized world. Or perhaps from Julia using a different and 
> dramatically more efficient algorithm. It does seem like a big gap, but I am 
> no expert in this area.
> 
> Maybe worth investigating further but not by me.
> 
> -rob
> 
> On Sun, Jan 14, 2024 at 9:31 AM Rob Pike <r...@golang.org 
> <mailto:r...@golang.org>> wrote:
>> Oh, I did say my implementation was straightforward. It's free of any clever 
>> multiplication algorithms or mathematical delights. It could easily be 
>> giving up 10x or more for that reason alone. And I haven't even profiled it 
>> yet.
>> 
>> -rob
>> 
>> 
>> On Sat, Jan 13, 2024 at 7:04 PM Bakul Shah <ba...@iitbombay.org 
>> <mailto:ba...@iitbombay.org>> wrote:
>>> FYI Julia (on M1 MBP) seems much faster:
>>> 
>>> julia> @which factorial(big(100000000))
>>> factorial(x::BigInt) in Base.GMP at gmp.jl:645
>>> 
>>> julia> @time begin; factorial(big(100000000)); 1; end
>>>  27.849116 seconds (1.39 M allocations: 11.963 GiB, 0.22% gc time)
>>> 
>>> Probably they use Schönhage-Strassen multiplication algorithm for very 
>>> large numbers as the 1E8! result will have over a 3/4 billion digits. I 
>>> should try this in Gambit-Scheme (which has an excellent multiply 
>>> implementation).
>>> 
>>>> On Jan 12, 2024, at 9:32 PM, Rob Pike <r...@golang.org 
>>>> <mailto:r...@golang.org>> wrote:
>>>> 
>>>> Thanks for the tip. A fairly straightforward implementation of this 
>>>> algorithm gives me about a factor of two speedup for pretty much any 
>>>> value. I went up to 1e8!, which took about half an hour compared to nearly 
>>>> an hour for MulRange.
>>>> 
>>>> I'll probably stick in ivy after a little more tuning. I may even try 
>>>> parallelization.
>>>> 
>>>> -rob
>>>> 
>>>> 
>>>> On Tue, Jan 9, 2024 at 4:54 PM Bakul Shah <ba...@iitbombay.org 
>>>> <mailto:ba...@iitbombay.org>> wrote:
>>>>> For that you may wish to explore Peter Luschny's "prime swing" factorial 
>>>>> algorithm and variations!
>>>>> https://oeis.org/A000142/a000142.pdf
>>>>> 
>>>>> And implementations in various languages including go: 
>>>>> https://github.com/PeterLuschny/Fast-Factorial-Functions
>>>>> 
>>>>>> On Jan 8, 2024, at 9:22 PM, Rob Pike <r...@golang.org 
>>>>>> <mailto:r...@golang.org>> wrote:
>>>>>> 
>>>>>> Here's an example where it's the bottleneck: ivy factorial
>>>>>> 
>>>>>> 
>>>>>> !1e7
>>>>>> 1.20242340052e+65657059
>>>>>> 
>>>>>> )cpu
>>>>>> 1m10s (1m10s user, 167.330ms sys)
>>>>>> 
>>>>>> 
>>>>>> -rob
>>>>>> 
>>>>>> 
>>>>>> On Tue, Jan 9, 2024 at 2:21 PM Bakul Shah <ba...@iitbombay.org 
>>>>>> <mailto:ba...@iitbombay.org>> wrote:
>>>>>>> Perhaps you were thinking of this?
>>>>>>> 
>>>>>>> At iteration number k, the value xk contains O(klog(k)) digits, thus 
>>>>>>> the computation of xk+1 = kxk has cost O(klog(k)). Finally, the total 
>>>>>>> cost with this basic approach is O(2log(2)+¼+n log(n)) = O(n2log(n)).
>>>>>>> A better approach is the binary splitting : it just consists in 
>>>>>>> recursively cutting the product of m consecutive integers in half. It 
>>>>>>> leads to better results when products on large integers are performed 
>>>>>>> with a fast method.
>>>>>>> 
>>>>>>> http://numbers.computation.free.fr/Constants/Algorithms/splitting.html
>>>>>>> 
>>>>>>> I think you can do recursive splitting without using function recursion 
>>>>>>> by allocating N/2 array (where b = a+N-1) and iterating over it; each 
>>>>>>> time the array "shrinks" by half. A "cleverer" algorithm would allocate 
>>>>>>> an array of *words* of a bignum, as you know that the upper limit on 
>>>>>>> size is N*64 (for 64 bit numbers) so you can just reuse the same space 
>>>>>>> for each outer iteration (N/2 multiplie, N/4 ...) and apply Karatsuba 
>>>>>>> 2nd outer iteration onwards. Not sure if this is easy in Go.
>>>>>>> 
>>>>>>>> On Jan 8, 2024, at 11:47 AM, Robert Griesemer <g...@golang.org 
>>>>>>>> <mailto:g...@golang.org>> wrote:
>>>>>>>> 
>>>>>>>> Hello John;
>>>>>>>> 
>>>>>>>> Thanks for your interest in this code.
>>>>>>>> 
>>>>>>>> In a (long past) implementation of the factorial function, I noticed 
>>>>>>>> that computing a * (a+1) * (a+2) * ... (b-1) * b was much faster when 
>>>>>>>> computed in a recursive fashion than when computed iteratively: the 
>>>>>>>> reason (I believed) was that the iterative approach seemed to produce 
>>>>>>>> a lot more "internal fragmentation", that is medium-size intermediate 
>>>>>>>> results where the most significant word (or "limb" as is the term in 
>>>>>>>> other implementations) is only marginally used, resulting in more work 
>>>>>>>> than necessary if those words were fully used.
>>>>>>>> 
>>>>>>>> I never fully investigated, it was enough at the time that the 
>>>>>>>> recursive approach was much faster. In retrospect, I don't quite 
>>>>>>>> believe my own theory. Also, that implementation didn't have Karatsuba 
>>>>>>>> multiplication, it just used grade-school multiplication.
>>>>>>>> 
>>>>>>>> Since a, b are uint64 values (words), this could probably be 
>>>>>>>> implemented in terms of mulAddVWW directly, with a suitable initial 
>>>>>>>> allocation for the result - ideally this should just need one 
>>>>>>>> allocation (not sure how close we can get to the right size). That 
>>>>>>>> would cut down the allocations massively.
>>>>>>>> 
>>>>>>>> In a next step, one should benchmark the implementation again.
>>>>>>>> 
>>>>>>>> But at the very least, the overflow bug should be fixed, thanks for 
>>>>>>>> finding it! I will send out a CL to fix that today.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> - gri
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sun, Jan 7, 2024 at 4:47 AM John Jannotti <janno...@gmail.com 
>>>>>>>> <mailto:janno...@gmail.com>> wrote:
>>>>>>>>> Actually, both implementations have bugs!
>>>>>>>>> 
>>>>>>>>> The recursive implementation ends with:
>>>>>>>>> ```
>>>>>>>>> m := (a + b) / 2
>>>>>>>>> return z.mul(nat(nil).mulRange(a, m), nat(nil).mulRange(m+1, b))
>>>>>>>>> ```
>>>>>>>>> 
>>>>>>>>> That's a bug whenever `(a+b)` overflows, making `m` small. 
>>>>>>>>> FIX: `m := a + (b-a)/2`
>>>>>>>>> 
>>>>>>>>> My iterative implementation went into an infinite loop here:
>>>>>>>>> `for m := a + 1; m <= b; m++ {`
>>>>>>>>> if b is `math.MaxUint64`
>>>>>>>>> FIX: add `&& m > a` to the exit condition is an easy fix, but pays a 
>>>>>>>>> small penalty for the vast majority of calls that don't have 
>>>>>>>>> b=MaxUint64
>>>>>>>>> 
>>>>>>>>> I would add these to `mulRangesN` of the unit test:
>>>>>>>>> ```
>>>>>>>>>  {math.MaxUint64 - 3, math.MaxUint64 - 1, 
>>>>>>>>> "6277101735386680760773248120919220245411599323494568951784"},
>>>>>>>>> {math.MaxUint64 - 3, math.MaxUint64, 
>>>>>>>>> "115792089237316195360799967654821100226821973275796746098729803619699194331160"}
>>>>>>>>> ```
>>>>>>>>> 
>>>>>>>>> On Sun, Jan 7, 2024 at 6:34 AM John Jannotti <janno...@gmail.com 
>>>>>>>>> <mailto:janno...@gmail.com>> wrote:
>>>>>>>>>> I'm equally curious.
>>>>>>>>>> 
>>>>>>>>>> FWIW, I realized the loop should perhaps be
>>>>>>>>>> ```
>>>>>>>>>> mb := nat(nil).setUint64(b) // ensure mb starts big enough for b, 
>>>>>>>>>> even on 32-bit arch
>>>>>>>>>> for m := a + 1; m <= b; m++ {
>>>>>>>>>>   mb.setUint64(m)
>>>>>>>>>>   z = z.mul(z, mb)
>>>>>>>>>> }
>>>>>>>>>> ```
>>>>>>>>>> to avoid allocating repeatedly for `m`, which yields:
>>>>>>>>>> BenchmarkIterativeMulRangeN-10      354685      3032 ns/op    2129 
>>>>>>>>>> B/op      48 allocs/op
>>>>>>>>>> 
>>>>>>>>>> On Sun, Jan 7, 2024 at 2:41 AM Rob Pike <r...@golang.org 
>>>>>>>>>> <mailto:r...@golang.org>> wrote:
>>>>>>>>>>> It seems reasonable but first I'd like to understand why the 
>>>>>>>>>>> recursive method is used. I can't deduce why, but the CL that adds 
>>>>>>>>>>> it, by gri, does Karatsuba multiplication, which implies something 
>>>>>>>>>>> deep is going on. I'll add him to the conversation.
>>>>>>>>>>> 
>>>>>>>>>>> -rob
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Sun, Jan 7, 2024 at 5:46 PM John Jannotti <janno...@gmail.com 
>>>>>>>>>>> <mailto:janno...@gmail.com>> wrote:
>>>>>>>>>>>> I enjoy bignum implementations, so I was looking through nat.go 
>>>>>>>>>>>> and saw that `mulRange` is implemented in a surprising, recursive 
>>>>>>>>>>>> way,.  In the non-base case, `mulRange(a, b)` returns `mulrange(a, 
>>>>>>>>>>>> (a+b)/2) * mulRange(1+(a+b)/2, b)` (lots of big.Int ceremony 
>>>>>>>>>>>> elided).
>>>>>>>>>>>> 
>>>>>>>>>>>> That's fine, but I didn't see any advantage over the 
>>>>>>>>>>>> straightforward (and simpler?) for loop.
>>>>>>>>>>>> 
>>>>>>>>>>>> ```
>>>>>>>>>>>> z = z.setUint64(a)
>>>>>>>>>>>> for m := a + 1; m <= b; m++ {
>>>>>>>>>>>>    z = z.mul(z, nat(nil).setUint64(m))
>>>>>>>>>>>> }
>>>>>>>>>>>> return z
>>>>>>>>>>>> ```
>>>>>>>>>>>> 
>>>>>>>>>>>> In fact, I suspected the existing code was slower, and allocated a 
>>>>>>>>>>>> lot more.  That seems true. A quick benchmark, using the existing 
>>>>>>>>>>>> unit test as the benchmark, yields
>>>>>>>>>>>> BenchmarkRecusiveMulRangeN-10        169417              6856 
>>>>>>>>>>>> ns/op            9452 B/op        338 allocs/op
>>>>>>>>>>>> BenchmarkIterativeMulRangeN-10       265354              4269 
>>>>>>>>>>>> ns/op            2505 B/op        196 allocs/op
>>>>>>>>>>>> 
>>>>>>>>>>>> I doubt `mulRange` is a performance bottleneck in anyone's code! 
>>>>>>>>>>>> But it is exported as `int.MulRange` so I guess it's viewed with 
>>>>>>>>>>>> some value.  And seeing as how the for-loop seems even easier to 
>>>>>>>>>>>> understand that the recursive version, maybe it's worth submitting 
>>>>>>>>>>>> a PR? (If so, should I create an issue first?)
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> -- 
>>>>>>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>>>>>>> Groups "golang-nuts" group.
>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>>>>>>> send an email to golang-nuts+unsubscr...@googlegroups.com 
>>>>>>>>>>>> <mailto:golang-nuts+unsubscr...@googlegroups.com>.
>>>>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>>>>> https://groups.google.com/d/msgid/golang-nuts/e6ceb75a-f8b7-4f77-97dc-9445fb750782n%40googlegroups.com
>>>>>>>>>>>>  
>>>>>>>>>>>> <https://groups.google.com/d/msgid/golang-nuts/e6ceb75a-f8b7-4f77-97dc-9445fb750782n%40googlegroups.com?utm_medium=email&utm_source=footer>.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> -- 
>>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>>> Groups "golang-nuts" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>>>>> an email to golang-nuts+unsubscr...@googlegroups.com 
>>>>>>>> <mailto:golang-nuts+unsubscr...@googlegroups.com>.
>>>>>>>> To view this discussion on the web visit 
>>>>>>>> https://groups.google.com/d/msgid/golang-nuts/CAKy0tf7Lcd8hiF2Qv3NFfjGcfvXDn%2BA%2BxJ1bfKta1w9P-OAs%3Dw%40mail.gmail.com
>>>>>>>>  
>>>>>>>> <https://groups.google.com/d/msgid/golang-nuts/CAKy0tf7Lcd8hiF2Qv3NFfjGcfvXDn%2BA%2BxJ1bfKta1w9P-OAs%3Dw%40mail.gmail.com?utm_medium=email&utm_source=footer>.
>>>>>>> 
>>>>> 
>>> 

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/golang-nuts/4D0207B1-9A65-45E3-A63B-DD8FDE35591C%40iitbombay.org.

Re: [go-nuts] Any interest in nat.mulRange simplification/optimization?

Reply via email to