Re: [go-nuts] Any interest in nat.mulRange simplification/optimization?

2024-01-13 Thread Bakul Shah
Fair enough!

I replied as I found a factor of 60 a bit surprising. Though I shouldn't have 
been -- my n digits of pi scheme program was consistently about 16-17 times 
slower than gmp based one (at least up to 1E9), using the same algorithm 
(Chudnovsky's), which is also highly dependent on multiplication speed. And 
both (gambit-scheme and gmp) use faster algorithms for very large numbers. 
Adding these faster algorithms to math/big is probably not justifiable

Julia seems to use libgmp. "info gmp" reveals it implements Luchny's algorithm.

> On Jan 13, 2024, at 4:07 PM, Rob Pike  wrote:
> 
> This is getting interesting but maybe only for me, so I'll stop here.
> 
> I did some profiling and the majority of the time is spent in 
> math/big.addMulVVW. Thinking I might be doing something stupid, I compared it 
> with the Go implementation at 
> http://www.luschny.de/math/factorial/scala/FactorialScalaCsharp.htm, and 
> found that my version (although "straightforward"), is about the same amount 
> of code but without special casing and support tables, yet about 25% faster - 
> and yes, they get the same answer.
> 
> So I believe the factor of 60 comes from comparing an implementation in a 
> language and libraries designed for numerical computation against a much less 
> specialized and optimized world. Or perhaps from Julia using a different and 
> dramatically more efficient algorithm. It does seem like a big gap, but I am 
> no expert in this area.
> 
> Maybe worth investigating further but not by me.
> 
> -rob
> 
> On Sun, Jan 14, 2024 at 9:31 AM Rob Pike  > wrote:
>> Oh, I did say my implementation was straightforward. It's free of any clever 
>> multiplication algorithms or mathematical delights. It could easily be 
>> giving up 10x or more for that reason alone. And I haven't even profiled it 
>> yet.
>> 
>> -rob
>> 
>> 
>> On Sat, Jan 13, 2024 at 7:04 PM Bakul Shah > > wrote:
>>> FYI Julia (on M1 MBP) seems much faster:
>>> 
>>> julia> @which factorial(big(1))
>>> factorial(x::BigInt) in Base.GMP at gmp.jl:645
>>> 
>>> julia> @time begin; factorial(big(1)); 1; end
>>>  27.849116 seconds (1.39 M allocations: 11.963 GiB, 0.22% gc time)
>>> 
>>> Probably they use Schönhage-Strassen multiplication algorithm for very 
>>> large numbers as the 1E8! result will have over a 3/4 billion digits. I 
>>> should try this in Gambit-Scheme (which has an excellent multiply 
>>> implementation).
>>> 
 On Jan 12, 2024, at 9:32 PM, Rob Pike >>> > wrote:
 
 Thanks for the tip. A fairly straightforward implementation of this 
 algorithm gives me about a factor of two speedup for pretty much any 
 value. I went up to 1e8!, which took about half an hour compared to nearly 
 an hour for MulRange.
 
 I'll probably stick in ivy after a little more tuning. I may even try 
 parallelization.
 
 -rob
 
 
 On Tue, Jan 9, 2024 at 4:54 PM Bakul Shah >>> > wrote:
> For that you may wish to explore Peter Luschny's "prime swing" factorial 
> algorithm and variations!
> https://oeis.org/A000142/a000142.pdf
> 
> And implementations in various languages including go: 
> https://github.com/PeterLuschny/Fast-Factorial-Functions
> 
>> On Jan 8, 2024, at 9:22 PM, Rob Pike > > wrote:
>> 
>> Here's an example where it's the bottleneck: ivy factorial
>> 
>> 
>> !1e7
>> 1.20242340052e+65657059
>> 
>> )cpu
>> 1m10s (1m10s user, 167.330ms sys)
>> 
>> 
>> -rob
>> 
>> 
>> On Tue, Jan 9, 2024 at 2:21 PM Bakul Shah > > wrote:
>>> Perhaps you were thinking of this?
>>> 
>>> At iteration number k, the value xk contains O(klog(k)) digits, thus 
>>> the computation of xk+1 = kxk has cost O(klog(k)). Finally, the total 
>>> cost with this basic approach is O(2log(2)+¼+n log(n)) = O(n2log(n)).
>>> A better approach is the binary splitting : it just consists in 
>>> recursively cutting the product of m consecutive integers in half. It 
>>> leads to better results when products on large integers are performed 
>>> with a fast method.
>>> 
>>> http://numbers.computation.free.fr/Constants/Algorithms/splitting.html
>>> 
>>> I think you can do recursive splitting without using function recursion 
>>> by allocating N/2 array (where b = a+N-1) and iterating over it; each 
>>> time the array "shrinks" by half. A "cleverer" algorithm would allocate 
>>> an array of *words* of a bignum, as you know that the upper limit on 
>>> size is N*64 (for 64 bit numbers) so you can just reuse the same space 
>>> for each outer iteration (N/2 multiplie, N/4 ...) and apply Karatsuba 
>>> 2nd outer iteration onwards. Not sure if this is easy in Go.
>>> 
 On Jan 8, 

Re: [go-nuts] Any interest in nat.mulRange simplification/optimization?

2024-01-13 Thread Rob Pike
This is getting interesting but maybe only for me, so I'll stop here.

I did some profiling and the majority of the time is spent in
math/big.addMulVVW. Thinking I might be doing something stupid, I compared
it with the Go implementation at
http://www.luschny.de/math/factorial/scala/FactorialScalaCsharp.htm, and
found that my version (although "straightforward"), is about the same
amount of code but without special casing and support tables, yet about 25%
faster - and yes, they get the same answer.

So I believe the factor of 60 comes from comparing an implementation in a
language and libraries designed for numerical computation against a much
less specialized and optimized world. Or perhaps from Julia using a
different and dramatically more efficient algorithm. It does seem like a
big gap, but I am no expert in this area.

Maybe worth investigating further but not by me.

-rob

On Sun, Jan 14, 2024 at 9:31 AM Rob Pike  wrote:

> Oh, I did say my implementation was straightforward. It's free of any
> clever multiplication algorithms or mathematical delights. It could easily
> be giving up 10x or more for that reason alone. And I haven't even profiled
> it yet.
>
> -rob
>
>
> On Sat, Jan 13, 2024 at 7:04 PM Bakul Shah  wrote:
>
>> FYI Julia (on M1 MBP) seems much faster:
>>
>> julia> @which factorial(big(1))
>> factorial(x::BigInt) in Base.GMP at gmp.jl:645
>>
>> julia> @time begin; factorial(big(1)); 1; end
>>  27.849116 seconds (1.39 M allocations: 11.963 GiB, 0.22% gc time)
>>
>>
>> Probably they use Schönhage-Strassen multiplication algorithm for very
>> large numbers as the 1E8! result will have over a 3/4 billion digits. I
>> should try this in Gambit-Scheme (which has an excellent multiply
>> implementation).
>>
>> On Jan 12, 2024, at 9:32 PM, Rob Pike  wrote:
>>
>> Thanks for the tip. A fairly straightforward implementation of this
>> algorithm gives me about a factor of two speedup for pretty much any value.
>> I went up to 1e8!, which took about half an hour compared to nearly an hour
>> for MulRange.
>>
>> I'll probably stick in ivy after a little more tuning. I may even try
>> parallelization.
>>
>> -rob
>>
>>
>> On Tue, Jan 9, 2024 at 4:54 PM Bakul Shah  wrote:
>>
>>> For that you may wish to explore Peter Luschny's "prime swing" factorial
>>> algorithm and variations!
>>> https://oeis.org/A000142/a000142.pdf
>>>
>>> And implementations in various languages including go:
>>> https://github.com/PeterLuschny/Fast-Factorial-Functions
>>>
>>> On Jan 8, 2024, at 9:22 PM, Rob Pike  wrote:
>>>
>>> Here's an example where it's the bottleneck: ivy factorial
>>>
>>>
>>> !1e7
>>> 1.20242340052e+65657059
>>>
>>> )cpu
>>> 1m10s (1m10s user, 167.330ms sys)
>>>
>>>
>>> -rob
>>>
>>>
>>> On Tue, Jan 9, 2024 at 2:21 PM Bakul Shah  wrote:
>>>
 Perhaps you were thinking of this?

 At iteration number k, the value xk contains O(klog(k)) digits, thus
 the computation of xk+1 = kxk has cost O(klog(k)). Finally, the total
 cost with this basic approach is O(2log(2)+¼+n log(n)) = O(n2log(n)).

 A better approach is the *binary splitting* : it just consists in
 recursively cutting the product of m consecutive integers in half. It leads
 to better results when products on large integers are performed with a fast
 method.

 http://numbers.computation.free.fr/Constants/Algorithms/splitting.html


 I think you can do recursive splitting without using function recursion
 by allocating N/2 array (where b = a+N-1) and iterating over it; each time
 the array "shrinks" by half. A "cleverer" algorithm would allocate an array
 of *words* of a bignum, as you know that the upper limit on size is N*64
 (for 64 bit numbers) so you can just reuse the same space for each outer
 iteration (N/2 multiplie, N/4 ...) and apply Karatsuba 2nd outer iteration
 onwards. Not sure if this is easy in Go.

 On Jan 8, 2024, at 11:47 AM, Robert Griesemer  wrote:

 Hello John;

 Thanks for your interest in this code.

 In a (long past) implementation of the factorial function, I noticed
 that computing a * (a+1) * (a+2) * ... (b-1) * b was much faster when
 computed in a recursive fashion than when computed iteratively: the reason
 (I believed) was that the iterative approach seemed to produce a lot more
 "internal fragmentation", that is medium-size intermediate results where
 the most significant word (or "limb" as is the term in other
 implementations) is only marginally used, resulting in more work than
 necessary if those words were fully used.

 I never fully investigated, it was enough at the time that the
 recursive approach was much faster. In retrospect, I don't quite believe my
 own theory. Also, that implementation didn't have Karatsuba multiplication,
 it just used grade-school multiplication.

 Since a, b are uint64 values (words), this could probably 

Re: [go-nuts] Any interest in nat.mulRange simplification/optimization?

2024-01-13 Thread Rob Pike
Oh, I did say my implementation was straightforward. It's free of any
clever multiplication algorithms or mathematical delights. It could easily
be giving up 10x or more for that reason alone. And I haven't even profiled
it yet.

-rob


On Sat, Jan 13, 2024 at 7:04 PM Bakul Shah  wrote:

> FYI Julia (on M1 MBP) seems much faster:
>
> julia> @which factorial(big(1))
> factorial(x::BigInt) in Base.GMP at gmp.jl:645
>
> julia> @time begin; factorial(big(1)); 1; end
>  27.849116 seconds (1.39 M allocations: 11.963 GiB, 0.22% gc time)
>
>
> Probably they use Schönhage-Strassen multiplication algorithm for very
> large numbers as the 1E8! result will have over a 3/4 billion digits. I
> should try this in Gambit-Scheme (which has an excellent multiply
> implementation).
>
> On Jan 12, 2024, at 9:32 PM, Rob Pike  wrote:
>
> Thanks for the tip. A fairly straightforward implementation of this
> algorithm gives me about a factor of two speedup for pretty much any value.
> I went up to 1e8!, which took about half an hour compared to nearly an hour
> for MulRange.
>
> I'll probably stick in ivy after a little more tuning. I may even try
> parallelization.
>
> -rob
>
>
> On Tue, Jan 9, 2024 at 4:54 PM Bakul Shah  wrote:
>
>> For that you may wish to explore Peter Luschny's "prime swing" factorial
>> algorithm and variations!
>> https://oeis.org/A000142/a000142.pdf
>>
>> And implementations in various languages including go:
>> https://github.com/PeterLuschny/Fast-Factorial-Functions
>>
>> On Jan 8, 2024, at 9:22 PM, Rob Pike  wrote:
>>
>> Here's an example where it's the bottleneck: ivy factorial
>>
>>
>> !1e7
>> 1.20242340052e+65657059
>>
>> )cpu
>> 1m10s (1m10s user, 167.330ms sys)
>>
>>
>> -rob
>>
>>
>> On Tue, Jan 9, 2024 at 2:21 PM Bakul Shah  wrote:
>>
>>> Perhaps you were thinking of this?
>>>
>>> At iteration number k, the value xk contains O(klog(k)) digits, thus
>>> the computation of xk+1 = kxk has cost O(klog(k)). Finally, the total
>>> cost with this basic approach is O(2log(2)+¼+n log(n)) = O(n2log(n)).
>>>
>>> A better approach is the *binary splitting* : it just consists in
>>> recursively cutting the product of m consecutive integers in half. It leads
>>> to better results when products on large integers are performed with a fast
>>> method.
>>>
>>> http://numbers.computation.free.fr/Constants/Algorithms/splitting.html
>>>
>>>
>>> I think you can do recursive splitting without using function recursion
>>> by allocating N/2 array (where b = a+N-1) and iterating over it; each time
>>> the array "shrinks" by half. A "cleverer" algorithm would allocate an array
>>> of *words* of a bignum, as you know that the upper limit on size is N*64
>>> (for 64 bit numbers) so you can just reuse the same space for each outer
>>> iteration (N/2 multiplie, N/4 ...) and apply Karatsuba 2nd outer iteration
>>> onwards. Not sure if this is easy in Go.
>>>
>>> On Jan 8, 2024, at 11:47 AM, Robert Griesemer  wrote:
>>>
>>> Hello John;
>>>
>>> Thanks for your interest in this code.
>>>
>>> In a (long past) implementation of the factorial function, I noticed
>>> that computing a * (a+1) * (a+2) * ... (b-1) * b was much faster when
>>> computed in a recursive fashion than when computed iteratively: the reason
>>> (I believed) was that the iterative approach seemed to produce a lot more
>>> "internal fragmentation", that is medium-size intermediate results where
>>> the most significant word (or "limb" as is the term in other
>>> implementations) is only marginally used, resulting in more work than
>>> necessary if those words were fully used.
>>>
>>> I never fully investigated, it was enough at the time that the recursive
>>> approach was much faster. In retrospect, I don't quite believe my own
>>> theory. Also, that implementation didn't have Karatsuba multiplication, it
>>> just used grade-school multiplication.
>>>
>>> Since a, b are uint64 values (words), this could probably be implemented
>>> in terms of mulAddVWW directly, with a suitable initial allocation for the
>>> result - ideally this should just need one allocation (not sure how close
>>> we can get to the right size). That would cut down the allocations
>>> massively.
>>>
>>> In a next step, one should benchmark the implementation again.
>>>
>>> But at the very least, the overflow bug should be fixed, thanks for
>>> finding it! I will send out a CL to fix that today.
>>>
>>> Thanks,
>>> - gri
>>>
>>>
>>>
>>> On Sun, Jan 7, 2024 at 4:47 AM John Jannotti  wrote:
>>>
 Actually, both implementations have bugs!

 The recursive implementation ends with:
 ```
 m := (a + b) / 2
 return z.mul(nat(nil).mulRange(a, m), nat(nil).mulRange(m+1, b))
 ```

 That's a bug whenever `(a+b)` overflows, making `m` small.
 FIX: `m := a + (b-a)/2`

 My iterative implementation went into an infinite loop here:
 `for m := a + 1; m <= b; m++ {`
 if b is `math.MaxUint64`
 FIX: add `&& m > a` to the exit condition is an 

Re: [go-nuts] Any interest in nat.mulRange simplification/optimization?

2024-01-13 Thread Bakul Shah
FYI Julia (on M1 MBP) seems much faster:

julia> @which factorial(big(1))
factorial(x::BigInt) in Base.GMP at gmp.jl:645

julia> @time begin; factorial(big(1)); 1; end
 27.849116 seconds (1.39 M allocations: 11.963 GiB, 0.22% gc time)

Probably they use Schönhage-Strassen multiplication algorithm for very large 
numbers as the 1E8! result will have over a 3/4 billion digits. I should try 
this in Gambit-Scheme (which has an excellent multiply implementation).

> On Jan 12, 2024, at 9:32 PM, Rob Pike  wrote:
> 
> Thanks for the tip. A fairly straightforward implementation of this algorithm 
> gives me about a factor of two speedup for pretty much any value. I went up 
> to 1e8!, which took about half an hour compared to nearly an hour for 
> MulRange.
> 
> I'll probably stick in ivy after a little more tuning. I may even try 
> parallelization.
> 
> -rob
> 
> 
> On Tue, Jan 9, 2024 at 4:54 PM Bakul Shah  > wrote:
>> For that you may wish to explore Peter Luschny's "prime swing" factorial 
>> algorithm and variations!
>> https://oeis.org/A000142/a000142.pdf
>> 
>> And implementations in various languages including go: 
>> https://github.com/PeterLuschny/Fast-Factorial-Functions
>> 
>>> On Jan 8, 2024, at 9:22 PM, Rob Pike >> > wrote:
>>> 
>>> Here's an example where it's the bottleneck: ivy factorial
>>> 
>>> 
>>> !1e7
>>> 1.20242340052e+65657059
>>> 
>>> )cpu
>>> 1m10s (1m10s user, 167.330ms sys)
>>> 
>>> 
>>> -rob
>>> 
>>> 
>>> On Tue, Jan 9, 2024 at 2:21 PM Bakul Shah >> > wrote:
 Perhaps you were thinking of this?
 
 At iteration number k, the value xk contains O(klog(k)) digits, thus the 
 computation of xk+1 = kxk has cost O(klog(k)). Finally, the total cost 
 with this basic approach is O(2log(2)+¼+n log(n)) = O(n2log(n)).
 A better approach is the binary splitting : it just consists in 
 recursively cutting the product of m consecutive integers in half. It 
 leads to better results when products on large integers are performed with 
 a fast method.
 
 http://numbers.computation.free.fr/Constants/Algorithms/splitting.html
 
 I think you can do recursive splitting without using function recursion by 
 allocating N/2 array (where b = a+N-1) and iterating over it; each time 
 the array "shrinks" by half. A "cleverer" algorithm would allocate an 
 array of *words* of a bignum, as you know that the upper limit on size is 
 N*64 (for 64 bit numbers) so you can just reuse the same space for each 
 outer iteration (N/2 multiplie, N/4 ...) and apply Karatsuba 2nd outer 
 iteration onwards. Not sure if this is easy in Go.
 
> On Jan 8, 2024, at 11:47 AM, Robert Griesemer  > wrote:
> 
> Hello John;
> 
> Thanks for your interest in this code.
> 
> In a (long past) implementation of the factorial function, I noticed that 
> computing a * (a+1) * (a+2) * ... (b-1) * b was much faster when computed 
> in a recursive fashion than when computed iteratively: the reason (I 
> believed) was that the iterative approach seemed to produce a lot more 
> "internal fragmentation", that is medium-size intermediate results where 
> the most significant word (or "limb" as is the term in other 
> implementations) is only marginally used, resulting in more work than 
> necessary if those words were fully used.
> 
> I never fully investigated, it was enough at the time that the recursive 
> approach was much faster. In retrospect, I don't quite believe my own 
> theory. Also, that implementation didn't have Karatsuba multiplication, 
> it just used grade-school multiplication.
> 
> Since a, b are uint64 values (words), this could probably be implemented 
> in terms of mulAddVWW directly, with a suitable initial allocation for 
> the result - ideally this should just need one allocation (not sure how 
> close we can get to the right size). That would cut down the allocations 
> massively.
> 
> In a next step, one should benchmark the implementation again.
> 
> But at the very least, the overflow bug should be fixed, thanks for 
> finding it! I will send out a CL to fix that today.
> 
> Thanks,
> - gri
> 
> 
> 
> On Sun, Jan 7, 2024 at 4:47 AM John Jannotti  > wrote:
>> Actually, both implementations have bugs!
>> 
>> The recursive implementation ends with:
>> ```
>> m := (a + b) / 2
>> return z.mul(nat(nil).mulRange(a, m), nat(nil).mulRange(m+1, b))
>> ```
>> 
>> That's a bug whenever `(a+b)` overflows, making `m` small. 
>> FIX: `m := a + (b-a)/2`
>> 
>> My iterative implementation went into an infinite loop here:
>> `for m := a + 1; m <= b; m++ {`
>> if b is `math.MaxUint64`
>> FIX: add `&& m > 

Re: [go-nuts] Any interest in nat.mulRange simplification/optimization?

2024-01-12 Thread Rob Pike
Thanks for the tip. A fairly straightforward implementation of this
algorithm gives me about a factor of two speedup for pretty much any value.
I went up to 1e8!, which took about half an hour compared to nearly an hour
for MulRange.

I'll probably stick in ivy after a little more tuning. I may even try
parallelization.

-rob


On Tue, Jan 9, 2024 at 4:54 PM Bakul Shah  wrote:

> For that you may wish to explore Peter Luschny's "prime swing" factorial
> algorithm and variations!
> https://oeis.org/A000142/a000142.pdf
>
> And implementations in various languages including go:
> https://github.com/PeterLuschny/Fast-Factorial-Functions
>
> On Jan 8, 2024, at 9:22 PM, Rob Pike  wrote:
>
> Here's an example where it's the bottleneck: ivy factorial
>
>
> !1e7
> 1.20242340052e+65657059
>
> )cpu
> 1m10s (1m10s user, 167.330ms sys)
>
>
> -rob
>
>
> On Tue, Jan 9, 2024 at 2:21 PM Bakul Shah  wrote:
>
>> Perhaps you were thinking of this?
>>
>> At iteration number k, the value xk contains O(klog(k)) digits, thus the
>> computation of xk+1 = kxk has cost O(klog(k)). Finally, the total cost
>> with this basic approach is O(2log(2)+¼+n log(n)) = O(n2log(n)).
>>
>> A better approach is the *binary splitting* : it just consists in
>> recursively cutting the product of m consecutive integers in half. It leads
>> to better results when products on large integers are performed with a fast
>> method.
>>
>> http://numbers.computation.free.fr/Constants/Algorithms/splitting.html
>>
>>
>> I think you can do recursive splitting without using function recursion
>> by allocating N/2 array (where b = a+N-1) and iterating over it; each time
>> the array "shrinks" by half. A "cleverer" algorithm would allocate an array
>> of *words* of a bignum, as you know that the upper limit on size is N*64
>> (for 64 bit numbers) so you can just reuse the same space for each outer
>> iteration (N/2 multiplie, N/4 ...) and apply Karatsuba 2nd outer iteration
>> onwards. Not sure if this is easy in Go.
>>
>> On Jan 8, 2024, at 11:47 AM, Robert Griesemer  wrote:
>>
>> Hello John;
>>
>> Thanks for your interest in this code.
>>
>> In a (long past) implementation of the factorial function, I noticed that
>> computing a * (a+1) * (a+2) * ... (b-1) * b was much faster when computed
>> in a recursive fashion than when computed iteratively: the reason (I
>> believed) was that the iterative approach seemed to produce a lot more
>> "internal fragmentation", that is medium-size intermediate results where
>> the most significant word (or "limb" as is the term in other
>> implementations) is only marginally used, resulting in more work than
>> necessary if those words were fully used.
>>
>> I never fully investigated, it was enough at the time that the recursive
>> approach was much faster. In retrospect, I don't quite believe my own
>> theory. Also, that implementation didn't have Karatsuba multiplication, it
>> just used grade-school multiplication.
>>
>> Since a, b are uint64 values (words), this could probably be implemented
>> in terms of mulAddVWW directly, with a suitable initial allocation for the
>> result - ideally this should just need one allocation (not sure how close
>> we can get to the right size). That would cut down the allocations
>> massively.
>>
>> In a next step, one should benchmark the implementation again.
>>
>> But at the very least, the overflow bug should be fixed, thanks for
>> finding it! I will send out a CL to fix that today.
>>
>> Thanks,
>> - gri
>>
>>
>>
>> On Sun, Jan 7, 2024 at 4:47 AM John Jannotti  wrote:
>>
>>> Actually, both implementations have bugs!
>>>
>>> The recursive implementation ends with:
>>> ```
>>> m := (a + b) / 2
>>> return z.mul(nat(nil).mulRange(a, m), nat(nil).mulRange(m+1, b))
>>> ```
>>>
>>> That's a bug whenever `(a+b)` overflows, making `m` small.
>>> FIX: `m := a + (b-a)/2`
>>>
>>> My iterative implementation went into an infinite loop here:
>>> `for m := a + 1; m <= b; m++ {`
>>> if b is `math.MaxUint64`
>>> FIX: add `&& m > a` to the exit condition is an easy fix, but pays a
>>> small penalty for the vast majority of calls that don't have b=MaxUint64
>>>
>>> I would add these to `mulRangesN` of the unit test:
>>> ```
>>>  {math.MaxUint64 - 3, math.MaxUint64 - 1,
>>> "6277101735386680760773248120919220245411599323494568951784"},
>>> {math.MaxUint64 - 3, math.MaxUint64,
>>> "115792089237316195360799967654821100226821973275796746098729803619699194331160"}
>>> ```
>>>
>>> On Sun, Jan 7, 2024 at 6:34 AM John Jannotti  wrote:
>>>
 I'm equally curious.

 FWIW, I realized the loop should perhaps be
 ```
 mb := nat(nil).setUint64(b) // ensure mb starts big enough for b, even
 on 32-bit arch
 for m := a + 1; m <= b; m++ {
   mb.setUint64(m)
   z = z.mul(z, mb)
 }
 ```
 to avoid allocating repeatedly for `m`, which yields:
 BenchmarkIterativeMulRangeN-10  354685  3032 ns/op2129 B/op
  48 allocs/op

 On Sun, Jan 7, 2024 at 2:41 AM 

Re: [go-nuts] Any interest in nat.mulRange simplification/optimization?

2024-01-09 Thread John Jannotti
I wrote the single allocation version and added it to the issue, but Bakul
is absolutely right that the recursive solution has merit when the input is
large enough.  By building up large values on each "side" of the recursion,
Karatsuba gets used for the larger multiplies and the recursive version
begins to win. On `mulRange(1_000, 200_000)`, it's more than 20 times
faster than the single allocation iterative version.

I can put together a hybrid that uses iterative for shorter spans, and
recursive on larger, if that sounds interesting.

To get this off the mailing list, I will add this note to the issue, and
discussion can continue there. https://github.com/golang/go/issues/65027

On Mon, Jan 8, 2024 at 10:21 PM Bakul Shah  wrote:

> Perhaps you were thinking of this?
>
> At iteration number k, the value xk contains O(klog(k)) digits, thus the
> computation of xk+1 = kxk has cost O(klog(k)). Finally, the total cost
> with this basic approach is O(2log(2)+¼+n log(n)) = O(n2log(n)).
>
> A better approach is the *binary splitting* : it just consists in
> recursively cutting the product of m consecutive integers in half. It leads
> to better results when products on large integers are performed with a fast
> method.
>
> http://numbers.computation.free.fr/Constants/Algorithms/splitting.html
>
>
> I think you can do recursive splitting without using function recursion by
> allocating N/2 array (where b = a+N-1) and iterating over it; each time the
> array "shrinks" by half. A "cleverer" algorithm would allocate an array of
> *words* of a bignum, as you know that the upper limit on size is N*64 (for
> 64 bit numbers) so you can just reuse the same space for each outer
> iteration (N/2 multiplie, N/4 ...) and apply Karatsuba 2nd outer iteration
> onwards. Not sure if this is easy in Go.
>
> On Jan 8, 2024, at 11:47 AM, Robert Griesemer  wrote:
>
> Hello John;
>
> Thanks for your interest in this code.
>
> In a (long past) implementation of the factorial function, I noticed that
> computing a * (a+1) * (a+2) * ... (b-1) * b was much faster when computed
> in a recursive fashion than when computed iteratively: the reason (I
> believed) was that the iterative approach seemed to produce a lot more
> "internal fragmentation", that is medium-size intermediate results where
> the most significant word (or "limb" as is the term in other
> implementations) is only marginally used, resulting in more work than
> necessary if those words were fully used.
>
> I never fully investigated, it was enough at the time that the recursive
> approach was much faster. In retrospect, I don't quite believe my own
> theory. Also, that implementation didn't have Karatsuba multiplication, it
> just used grade-school multiplication.
>
> Since a, b are uint64 values (words), this could probably be implemented
> in terms of mulAddVWW directly, with a suitable initial allocation for the
> result - ideally this should just need one allocation (not sure how close
> we can get to the right size). That would cut down the allocations
> massively.
>
> In a next step, one should benchmark the implementation again.
>
> But at the very least, the overflow bug should be fixed, thanks for
> finding it! I will send out a CL to fix that today.
>
> Thanks,
> - gri
>
>
>
> On Sun, Jan 7, 2024 at 4:47 AM John Jannotti  wrote:
>
>> Actually, both implementations have bugs!
>>
>> The recursive implementation ends with:
>> ```
>> m := (a + b) / 2
>> return z.mul(nat(nil).mulRange(a, m), nat(nil).mulRange(m+1, b))
>> ```
>>
>> That's a bug whenever `(a+b)` overflows, making `m` small.
>> FIX: `m := a + (b-a)/2`
>>
>> My iterative implementation went into an infinite loop here:
>> `for m := a + 1; m <= b; m++ {`
>> if b is `math.MaxUint64`
>> FIX: add `&& m > a` to the exit condition is an easy fix, but pays a
>> small penalty for the vast majority of calls that don't have b=MaxUint64
>>
>> I would add these to `mulRangesN` of the unit test:
>> ```
>>  {math.MaxUint64 - 3, math.MaxUint64 - 1,
>> "6277101735386680760773248120919220245411599323494568951784"},
>> {math.MaxUint64 - 3, math.MaxUint64,
>> "115792089237316195360799967654821100226821973275796746098729803619699194331160"}
>> ```
>>
>> On Sun, Jan 7, 2024 at 6:34 AM John Jannotti  wrote:
>>
>>> I'm equally curious.
>>>
>>> FWIW, I realized the loop should perhaps be
>>> ```
>>> mb := nat(nil).setUint64(b) // ensure mb starts big enough for b, even
>>> on 32-bit arch
>>> for m := a + 1; m <= b; m++ {
>>>   mb.setUint64(m)
>>>   z = z.mul(z, mb)
>>> }
>>> ```
>>> to avoid allocating repeatedly for `m`, which yields:
>>> BenchmarkIterativeMulRangeN-10  354685  3032 ns/op2129 B/op
>>>  48 allocs/op
>>>
>>> On Sun, Jan 7, 2024 at 2:41 AM Rob Pike  wrote:
>>>
 It seems reasonable but first I'd like to understand why the recursive
 method is used. I can't deduce why, but the CL that adds it, by gri, does
 Karatsuba multiplication, which implies something deep is going on. I'll
 add 

Re: [go-nuts] Any interest in nat.mulRange simplification/optimization?

2024-01-08 Thread Bakul Shah
For that you may wish to explore Peter Luschny's "prime swing" factorial 
algorithm and variations!
https://oeis.org/A000142/a000142.pdf

And implementations in various languages including go: 
https://github.com/PeterLuschny/Fast-Factorial-Functions

> On Jan 8, 2024, at 9:22 PM, Rob Pike  wrote:
> 
> Here's an example where it's the bottleneck: ivy factorial
> 
> 
> !1e7
> 1.20242340052e+65657059
> 
> )cpu
> 1m10s (1m10s user, 167.330ms sys)
> 
> 
> -rob
> 
> 
> On Tue, Jan 9, 2024 at 2:21 PM Bakul Shah  > wrote:
>> Perhaps you were thinking of this?
>> 
>> At iteration number k, the value xk contains O(klog(k)) digits, thus the 
>> computation of xk+1 = kxk has cost O(klog(k)). Finally, the total cost with 
>> this basic approach is O(2log(2)+¼+n log(n)) = O(n2log(n)).
>> A better approach is the binary splitting : it just consists in recursively 
>> cutting the product of m consecutive integers in half. It leads to better 
>> results when products on large integers are performed with a fast method.
>> 
>> http://numbers.computation.free.fr/Constants/Algorithms/splitting.html
>> 
>> I think you can do recursive splitting without using function recursion by 
>> allocating N/2 array (where b = a+N-1) and iterating over it; each time the 
>> array "shrinks" by half. A "cleverer" algorithm would allocate an array of 
>> *words* of a bignum, as you know that the upper limit on size is N*64 (for 
>> 64 bit numbers) so you can just reuse the same space for each outer 
>> iteration (N/2 multiplie, N/4 ...) and apply Karatsuba 2nd outer iteration 
>> onwards. Not sure if this is easy in Go.
>> 
>>> On Jan 8, 2024, at 11:47 AM, Robert Griesemer >> > wrote:
>>> 
>>> Hello John;
>>> 
>>> Thanks for your interest in this code.
>>> 
>>> In a (long past) implementation of the factorial function, I noticed that 
>>> computing a * (a+1) * (a+2) * ... (b-1) * b was much faster when computed 
>>> in a recursive fashion than when computed iteratively: the reason (I 
>>> believed) was that the iterative approach seemed to produce a lot more 
>>> "internal fragmentation", that is medium-size intermediate results where 
>>> the most significant word (or "limb" as is the term in other 
>>> implementations) is only marginally used, resulting in more work than 
>>> necessary if those words were fully used.
>>> 
>>> I never fully investigated, it was enough at the time that the recursive 
>>> approach was much faster. In retrospect, I don't quite believe my own 
>>> theory. Also, that implementation didn't have Karatsuba multiplication, it 
>>> just used grade-school multiplication.
>>> 
>>> Since a, b are uint64 values (words), this could probably be implemented in 
>>> terms of mulAddVWW directly, with a suitable initial allocation for the 
>>> result - ideally this should just need one allocation (not sure how close 
>>> we can get to the right size). That would cut down the allocations 
>>> massively.
>>> 
>>> In a next step, one should benchmark the implementation again.
>>> 
>>> But at the very least, the overflow bug should be fixed, thanks for finding 
>>> it! I will send out a CL to fix that today.
>>> 
>>> Thanks,
>>> - gri
>>> 
>>> 
>>> 
>>> On Sun, Jan 7, 2024 at 4:47 AM John Jannotti >> > wrote:
 Actually, both implementations have bugs!
 
 The recursive implementation ends with:
 ```
 m := (a + b) / 2
 return z.mul(nat(nil).mulRange(a, m), nat(nil).mulRange(m+1, b))
 ```
 
 That's a bug whenever `(a+b)` overflows, making `m` small. 
 FIX: `m := a + (b-a)/2`
 
 My iterative implementation went into an infinite loop here:
 `for m := a + 1; m <= b; m++ {`
 if b is `math.MaxUint64`
 FIX: add `&& m > a` to the exit condition is an easy fix, but pays a small 
 penalty for the vast majority of calls that don't have b=MaxUint64
 
 I would add these to `mulRangesN` of the unit test:
 ```
  {math.MaxUint64 - 3, math.MaxUint64 - 1, 
 "6277101735386680760773248120919220245411599323494568951784"},
 {math.MaxUint64 - 3, math.MaxUint64, 
 "115792089237316195360799967654821100226821973275796746098729803619699194331160"}
 ```
 
 On Sun, Jan 7, 2024 at 6:34 AM John Jannotti >>> > wrote:
> I'm equally curious.
> 
> FWIW, I realized the loop should perhaps be
> ```
> mb := nat(nil).setUint64(b) // ensure mb starts big enough for b, even on 
> 32-bit arch
> for m := a + 1; m <= b; m++ {
>   mb.setUint64(m)
>   z = z.mul(z, mb)
> }
> ```
> to avoid allocating repeatedly for `m`, which yields:
> BenchmarkIterativeMulRangeN-10  354685  3032 ns/op2129 B/op   
>48 allocs/op
> 
> On Sun, Jan 7, 2024 at 2:41 AM Rob Pike  > wrote:
>> It seems reasonable but first I'd like to understand why the recursive 
>> method is used. 

Re: [go-nuts] Any interest in nat.mulRange simplification/optimization?

2024-01-08 Thread Rob Pike
Here's an example where it's the bottleneck: ivy factorial


!1e7
1.20242340052e+65657059

)cpu
1m10s (1m10s user, 167.330ms sys)


-rob


On Tue, Jan 9, 2024 at 2:21 PM Bakul Shah  wrote:

> Perhaps you were thinking of this?
>
> At iteration number k, the value xk contains O(klog(k)) digits, thus the
> computation of xk+1 = kxk has cost O(klog(k)). Finally, the total cost
> with this basic approach is O(2log(2)+¼+n log(n)) = O(n2log(n)).
>
> A better approach is the *binary splitting* : it just consists in
> recursively cutting the product of m consecutive integers in half. It leads
> to better results when products on large integers are performed with a fast
> method.
>
> http://numbers.computation.free.fr/Constants/Algorithms/splitting.html
>
>
> I think you can do recursive splitting without using function recursion by
> allocating N/2 array (where b = a+N-1) and iterating over it; each time the
> array "shrinks" by half. A "cleverer" algorithm would allocate an array of
> *words* of a bignum, as you know that the upper limit on size is N*64 (for
> 64 bit numbers) so you can just reuse the same space for each outer
> iteration (N/2 multiplie, N/4 ...) and apply Karatsuba 2nd outer iteration
> onwards. Not sure if this is easy in Go.
>
> On Jan 8, 2024, at 11:47 AM, Robert Griesemer  wrote:
>
> Hello John;
>
> Thanks for your interest in this code.
>
> In a (long past) implementation of the factorial function, I noticed that
> computing a * (a+1) * (a+2) * ... (b-1) * b was much faster when computed
> in a recursive fashion than when computed iteratively: the reason (I
> believed) was that the iterative approach seemed to produce a lot more
> "internal fragmentation", that is medium-size intermediate results where
> the most significant word (or "limb" as is the term in other
> implementations) is only marginally used, resulting in more work than
> necessary if those words were fully used.
>
> I never fully investigated, it was enough at the time that the recursive
> approach was much faster. In retrospect, I don't quite believe my own
> theory. Also, that implementation didn't have Karatsuba multiplication, it
> just used grade-school multiplication.
>
> Since a, b are uint64 values (words), this could probably be implemented
> in terms of mulAddVWW directly, with a suitable initial allocation for the
> result - ideally this should just need one allocation (not sure how close
> we can get to the right size). That would cut down the allocations
> massively.
>
> In a next step, one should benchmark the implementation again.
>
> But at the very least, the overflow bug should be fixed, thanks for
> finding it! I will send out a CL to fix that today.
>
> Thanks,
> - gri
>
>
>
> On Sun, Jan 7, 2024 at 4:47 AM John Jannotti  wrote:
>
>> Actually, both implementations have bugs!
>>
>> The recursive implementation ends with:
>> ```
>> m := (a + b) / 2
>> return z.mul(nat(nil).mulRange(a, m), nat(nil).mulRange(m+1, b))
>> ```
>>
>> That's a bug whenever `(a+b)` overflows, making `m` small.
>> FIX: `m := a + (b-a)/2`
>>
>> My iterative implementation went into an infinite loop here:
>> `for m := a + 1; m <= b; m++ {`
>> if b is `math.MaxUint64`
>> FIX: add `&& m > a` to the exit condition is an easy fix, but pays a
>> small penalty for the vast majority of calls that don't have b=MaxUint64
>>
>> I would add these to `mulRangesN` of the unit test:
>> ```
>>  {math.MaxUint64 - 3, math.MaxUint64 - 1,
>> "6277101735386680760773248120919220245411599323494568951784"},
>> {math.MaxUint64 - 3, math.MaxUint64,
>> "115792089237316195360799967654821100226821973275796746098729803619699194331160"}
>> ```
>>
>> On Sun, Jan 7, 2024 at 6:34 AM John Jannotti  wrote:
>>
>>> I'm equally curious.
>>>
>>> FWIW, I realized the loop should perhaps be
>>> ```
>>> mb := nat(nil).setUint64(b) // ensure mb starts big enough for b, even
>>> on 32-bit arch
>>> for m := a + 1; m <= b; m++ {
>>>   mb.setUint64(m)
>>>   z = z.mul(z, mb)
>>> }
>>> ```
>>> to avoid allocating repeatedly for `m`, which yields:
>>> BenchmarkIterativeMulRangeN-10  354685  3032 ns/op2129 B/op
>>>  48 allocs/op
>>>
>>> On Sun, Jan 7, 2024 at 2:41 AM Rob Pike  wrote:
>>>
 It seems reasonable but first I'd like to understand why the recursive
 method is used. I can't deduce why, but the CL that adds it, by gri, does
 Karatsuba multiplication, which implies something deep is going on. I'll
 add him to the conversation.

 -rob




 On Sun, Jan 7, 2024 at 5:46 PM John Jannotti 
 wrote:

> I enjoy bignum implementations, so I was looking through nat.go and
> saw that `mulRange` is implemented in a surprising, recursive way,.  In 
> the
> non-base case, `mulRange(a, b)` returns `mulrange(a, (a+b)/2) *
> mulRange(1+(a+b)/2, b)` (lots of big.Int ceremony elided).
>
> That's fine, but I didn't see any advantage over the straightforward
> (and simpler?) for loop.
>
> ```

Re: [go-nuts] Any interest in nat.mulRange simplification/optimization?

2024-01-08 Thread Bakul Shah
Perhaps you were thinking of this?

At iteration number k, the value xk contains O(klog(k)) digits, thus the 
computation of xk+1 = kxk has cost O(klog(k)). Finally, the total cost with 
this basic approach is O(2log(2)+¼+n log(n)) = O(n2log(n)).
A better approach is the binary splitting : it just consists in recursively 
cutting the product of m consecutive integers in half. It leads to better 
results when products on large integers are performed with a fast method.

http://numbers.computation.free.fr/Constants/Algorithms/splitting.html

I think you can do recursive splitting without using function recursion by 
allocating N/2 array (where b = a+N-1) and iterating over it; each time the 
array "shrinks" by half. A "cleverer" algorithm would allocate an array of 
*words* of a bignum, as you know that the upper limit on size is N*64 (for 64 
bit numbers) so you can just reuse the same space for each outer iteration (N/2 
multiplie, N/4 ...) and apply Karatsuba 2nd outer iteration onwards. Not sure 
if this is easy in Go.

> On Jan 8, 2024, at 11:47 AM, Robert Griesemer  wrote:
> 
> Hello John;
> 
> Thanks for your interest in this code.
> 
> In a (long past) implementation of the factorial function, I noticed that 
> computing a * (a+1) * (a+2) * ... (b-1) * b was much faster when computed in 
> a recursive fashion than when computed iteratively: the reason (I believed) 
> was that the iterative approach seemed to produce a lot more "internal 
> fragmentation", that is medium-size intermediate results where the most 
> significant word (or "limb" as is the term in other implementations) is only 
> marginally used, resulting in more work than necessary if those words were 
> fully used.
> 
> I never fully investigated, it was enough at the time that the recursive 
> approach was much faster. In retrospect, I don't quite believe my own theory. 
> Also, that implementation didn't have Karatsuba multiplication, it just used 
> grade-school multiplication.
> 
> Since a, b are uint64 values (words), this could probably be implemented in 
> terms of mulAddVWW directly, with a suitable initial allocation for the 
> result - ideally this should just need one allocation (not sure how close we 
> can get to the right size). That would cut down the allocations massively.
> 
> In a next step, one should benchmark the implementation again.
> 
> But at the very least, the overflow bug should be fixed, thanks for finding 
> it! I will send out a CL to fix that today.
> 
> Thanks,
> - gri
> 
> 
> 
> On Sun, Jan 7, 2024 at 4:47 AM John Jannotti  > wrote:
>> Actually, both implementations have bugs!
>> 
>> The recursive implementation ends with:
>> ```
>> m := (a + b) / 2
>> return z.mul(nat(nil).mulRange(a, m), nat(nil).mulRange(m+1, b))
>> ```
>> 
>> That's a bug whenever `(a+b)` overflows, making `m` small. 
>> FIX: `m := a + (b-a)/2`
>> 
>> My iterative implementation went into an infinite loop here:
>> `for m := a + 1; m <= b; m++ {`
>> if b is `math.MaxUint64`
>> FIX: add `&& m > a` to the exit condition is an easy fix, but pays a small 
>> penalty for the vast majority of calls that don't have b=MaxUint64
>> 
>> I would add these to `mulRangesN` of the unit test:
>> ```
>>  {math.MaxUint64 - 3, math.MaxUint64 - 1, 
>> "6277101735386680760773248120919220245411599323494568951784"},
>> {math.MaxUint64 - 3, math.MaxUint64, 
>> "115792089237316195360799967654821100226821973275796746098729803619699194331160"}
>> ```
>> 
>> On Sun, Jan 7, 2024 at 6:34 AM John Jannotti > > wrote:
>>> I'm equally curious.
>>> 
>>> FWIW, I realized the loop should perhaps be
>>> ```
>>> mb := nat(nil).setUint64(b) // ensure mb starts big enough for b, even on 
>>> 32-bit arch
>>> for m := a + 1; m <= b; m++ {
>>>   mb.setUint64(m)
>>>   z = z.mul(z, mb)
>>> }
>>> ```
>>> to avoid allocating repeatedly for `m`, which yields:
>>> BenchmarkIterativeMulRangeN-10  354685  3032 ns/op2129 B/op 
>>>  48 allocs/op
>>> 
>>> On Sun, Jan 7, 2024 at 2:41 AM Rob Pike >> > wrote:
 It seems reasonable but first I'd like to understand why the recursive 
 method is used. I can't deduce why, but the CL that adds it, by gri, does 
 Karatsuba multiplication, which implies something deep is going on. I'll 
 add him to the conversation.
 
 -rob
 
 
 
 
 On Sun, Jan 7, 2024 at 5:46 PM John Jannotti >>> > wrote:
> I enjoy bignum implementations, so I was looking through nat.go and saw 
> that `mulRange` is implemented in a surprising, recursive way,.  In the 
> non-base case, `mulRange(a, b)` returns `mulrange(a, (a+b)/2) * 
> mulRange(1+(a+b)/2, b)` (lots of big.Int ceremony elided).
> 
> That's fine, but I didn't see any advantage over the straightforward (and 
> simpler?) for loop.
> 
> ```
> z = z.setUint64(a)
> for m := a + 1; m <= b; m++ {
>   z = 

Re: [go-nuts] Any interest in nat.mulRange simplification/optimization?

2024-01-08 Thread Robert Griesemer
The overflow fix is pending: CL 554617
I filed https://github.com/golang/go/issues/65027 for a possibly faster
mulRange implementation.
- gri

On Mon, Jan 8, 2024 at 11:47 AM Robert Griesemer  wrote:

> Hello John;
>
> Thanks for your interest in this code.
>
> In a (long past) implementation of the factorial function, I noticed that
> computing a * (a+1) * (a+2) * ... (b-1) * b was much faster when computed
> in a recursive fashion than when computed iteratively: the reason (I
> believed) was that the iterative approach seemed to produce a lot more
> "internal fragmentation", that is medium-size intermediate results where
> the most significant word (or "limb" as is the term in other
> implementations) is only marginally used, resulting in more work than
> necessary if those words were fully used.
>
> I never fully investigated, it was enough at the time that the recursive
> approach was much faster. In retrospect, I don't quite believe my own
> theory. Also, that implementation didn't have Karatsuba multiplication, it
> just used grade-school multiplication.
>
> Since a, b are uint64 values (words), this could probably be implemented
> in terms of mulAddVWW directly, with a suitable initial allocation for the
> result - ideally this should just need one allocation (not sure how close
> we can get to the right size). That would cut down the allocations
> massively.
>
> In a next step, one should benchmark the implementation again.
>
> But at the very least, the overflow bug should be fixed, thanks for
> finding it! I will send out a CL to fix that today.
>
> Thanks,
> - gri
>
>
>
> On Sun, Jan 7, 2024 at 4:47 AM John Jannotti  wrote:
>
>> Actually, both implementations have bugs!
>>
>> The recursive implementation ends with:
>> ```
>> m := (a + b) / 2
>> return z.mul(nat(nil).mulRange(a, m), nat(nil).mulRange(m+1, b))
>> ```
>>
>> That's a bug whenever `(a+b)` overflows, making `m` small.
>> FIX: `m := a + (b-a)/2`
>>
>> My iterative implementation went into an infinite loop here:
>> `for m := a + 1; m <= b; m++ {`
>> if b is `math.MaxUint64`
>> FIX: add `&& m > a` to the exit condition is an easy fix, but pays a
>> small penalty for the vast majority of calls that don't have b=MaxUint64
>>
>> I would add these to `mulRangesN` of the unit test:
>> ```
>>  {math.MaxUint64 - 3, math.MaxUint64 - 1,
>> "6277101735386680760773248120919220245411599323494568951784"},
>> {math.MaxUint64 - 3, math.MaxUint64,
>> "115792089237316195360799967654821100226821973275796746098729803619699194331160"}
>> ```
>>
>> On Sun, Jan 7, 2024 at 6:34 AM John Jannotti  wrote:
>>
>>> I'm equally curious.
>>>
>>> FWIW, I realized the loop should perhaps be
>>> ```
>>> mb := nat(nil).setUint64(b) // ensure mb starts big enough for b, even
>>> on 32-bit arch
>>> for m := a + 1; m <= b; m++ {
>>>   mb.setUint64(m)
>>>   z = z.mul(z, mb)
>>> }
>>> ```
>>> to avoid allocating repeatedly for `m`, which yields:
>>> BenchmarkIterativeMulRangeN-10  354685  3032 ns/op2129 B/op
>>>  48 allocs/op
>>>
>>> On Sun, Jan 7, 2024 at 2:41 AM Rob Pike  wrote:
>>>
 It seems reasonable but first I'd like to understand why the recursive
 method is used. I can't deduce why, but the CL that adds it, by gri, does
 Karatsuba multiplication, which implies something deep is going on. I'll
 add him to the conversation.

 -rob




 On Sun, Jan 7, 2024 at 5:46 PM John Jannotti 
 wrote:

> I enjoy bignum implementations, so I was looking through nat.go and
> saw that `mulRange` is implemented in a surprising, recursive way,.  In 
> the
> non-base case, `mulRange(a, b)` returns `mulrange(a, (a+b)/2) *
> mulRange(1+(a+b)/2, b)` (lots of big.Int ceremony elided).
>
> That's fine, but I didn't see any advantage over the straightforward
> (and simpler?) for loop.
>
> ```
> z = z.setUint64(a)
> for m := a + 1; m <= b; m++ {
> z = z.mul(z, nat(nil).setUint64(m))
> }
> return z
> ```
>
> In fact, I suspected the existing code was slower, and allocated a lot
> more.  That seems true. A quick benchmark, using the existing unit test as
> the benchmark, yields
> BenchmarkRecusiveMulRangeN-10   169417   6856 ns/op 9452
> B/op  338 allocs/op
> BenchmarkIterativeMulRangeN-10   265354   4269 ns/op 2505
> B/op  196 allocs/op
>
> I doubt `mulRange` is a performance bottleneck in anyone's code! But
> it is exported as `int.MulRange` so I guess it's viewed with some value.
> And seeing as how the for-loop seems even easier to understand that the
> recursive version, maybe it's worth submitting a PR? (If so, should I
> create an issue first?)
>
>
>
> --
> You received this message because you are subscribed to the Google
> Groups "golang-nuts" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to 

Re: [go-nuts] Any interest in nat.mulRange simplification/optimization?

2024-01-08 Thread Robert Griesemer
Hello John;

Thanks for your interest in this code.

In a (long past) implementation of the factorial function, I noticed that
computing a * (a+1) * (a+2) * ... (b-1) * b was much faster when computed
in a recursive fashion than when computed iteratively: the reason (I
believed) was that the iterative approach seemed to produce a lot more
"internal fragmentation", that is medium-size intermediate results where
the most significant word (or "limb" as is the term in other
implementations) is only marginally used, resulting in more work than
necessary if those words were fully used.

I never fully investigated, it was enough at the time that the recursive
approach was much faster. In retrospect, I don't quite believe my own
theory. Also, that implementation didn't have Karatsuba multiplication, it
just used grade-school multiplication.

Since a, b are uint64 values (words), this could probably be implemented in
terms of mulAddVWW directly, with a suitable initial allocation for the
result - ideally this should just need one allocation (not sure how close
we can get to the right size). That would cut down the allocations
massively.

In a next step, one should benchmark the implementation again.

But at the very least, the overflow bug should be fixed, thanks for finding
it! I will send out a CL to fix that today.

Thanks,
- gri



On Sun, Jan 7, 2024 at 4:47 AM John Jannotti  wrote:

> Actually, both implementations have bugs!
>
> The recursive implementation ends with:
> ```
> m := (a + b) / 2
> return z.mul(nat(nil).mulRange(a, m), nat(nil).mulRange(m+1, b))
> ```
>
> That's a bug whenever `(a+b)` overflows, making `m` small.
> FIX: `m := a + (b-a)/2`
>
> My iterative implementation went into an infinite loop here:
> `for m := a + 1; m <= b; m++ {`
> if b is `math.MaxUint64`
> FIX: add `&& m > a` to the exit condition is an easy fix, but pays a small
> penalty for the vast majority of calls that don't have b=MaxUint64
>
> I would add these to `mulRangesN` of the unit test:
> ```
>  {math.MaxUint64 - 3, math.MaxUint64 - 1,
> "6277101735386680760773248120919220245411599323494568951784"},
> {math.MaxUint64 - 3, math.MaxUint64,
> "115792089237316195360799967654821100226821973275796746098729803619699194331160"}
> ```
>
> On Sun, Jan 7, 2024 at 6:34 AM John Jannotti  wrote:
>
>> I'm equally curious.
>>
>> FWIW, I realized the loop should perhaps be
>> ```
>> mb := nat(nil).setUint64(b) // ensure mb starts big enough for b, even on
>> 32-bit arch
>> for m := a + 1; m <= b; m++ {
>>   mb.setUint64(m)
>>   z = z.mul(z, mb)
>> }
>> ```
>> to avoid allocating repeatedly for `m`, which yields:
>> BenchmarkIterativeMulRangeN-10  354685  3032 ns/op2129 B/op
>>48 allocs/op
>>
>> On Sun, Jan 7, 2024 at 2:41 AM Rob Pike  wrote:
>>
>>> It seems reasonable but first I'd like to understand why the recursive
>>> method is used. I can't deduce why, but the CL that adds it, by gri, does
>>> Karatsuba multiplication, which implies something deep is going on. I'll
>>> add him to the conversation.
>>>
>>> -rob
>>>
>>>
>>>
>>>
>>> On Sun, Jan 7, 2024 at 5:46 PM John Jannotti  wrote:
>>>
 I enjoy bignum implementations, so I was looking through nat.go and saw
 that `mulRange` is implemented in a surprising, recursive way,.  In the
 non-base case, `mulRange(a, b)` returns `mulrange(a, (a+b)/2) *
 mulRange(1+(a+b)/2, b)` (lots of big.Int ceremony elided).

 That's fine, but I didn't see any advantage over the straightforward
 (and simpler?) for loop.

 ```
 z = z.setUint64(a)
 for m := a + 1; m <= b; m++ {
 z = z.mul(z, nat(nil).setUint64(m))
 }
 return z
 ```

 In fact, I suspected the existing code was slower, and allocated a lot
 more.  That seems true. A quick benchmark, using the existing unit test as
 the benchmark, yields
 BenchmarkRecusiveMulRangeN-10   169417   6856 ns/op 9452
 B/op  338 allocs/op
 BenchmarkIterativeMulRangeN-10   265354   4269 ns/op 2505
 B/op  196 allocs/op

 I doubt `mulRange` is a performance bottleneck in anyone's code! But it
 is exported as `int.MulRange` so I guess it's viewed with some value.  And
 seeing as how the for-loop seems even easier to understand that the
 recursive version, maybe it's worth submitting a PR? (If so, should I
 create an issue first?)



 --
 You received this message because you are subscribed to the Google
 Groups "golang-nuts" group.
 To unsubscribe from this group and stop receiving emails from it, send
 an email to golang-nuts+unsubscr...@googlegroups.com.
 To view this discussion on the web visit
 https://groups.google.com/d/msgid/golang-nuts/e6ceb75a-f8b7-4f77-97dc-9445fb750782n%40googlegroups.com
 
 .

>>>

-- 
You received 

Re: [go-nuts] Any interest in nat.mulRange simplification/optimization?

2024-01-07 Thread John Jannotti
Actually, both implementations have bugs!

The recursive implementation ends with:
```
m := (a + b) / 2
return z.mul(nat(nil).mulRange(a, m), nat(nil).mulRange(m+1, b))
```

That's a bug whenever `(a+b)` overflows, making `m` small.
FIX: `m := a + (b-a)/2`

My iterative implementation went into an infinite loop here:
`for m := a + 1; m <= b; m++ {`
if b is `math.MaxUint64`
FIX: add `&& m > a` to the exit condition is an easy fix, but pays a small
penalty for the vast majority of calls that don't have b=MaxUint64

I would add these to `mulRangesN` of the unit test:
```
 {math.MaxUint64 - 3, math.MaxUint64 - 1,
"6277101735386680760773248120919220245411599323494568951784"},
{math.MaxUint64 - 3, math.MaxUint64,
"115792089237316195360799967654821100226821973275796746098729803619699194331160"}
```

On Sun, Jan 7, 2024 at 6:34 AM John Jannotti  wrote:

> I'm equally curious.
>
> FWIW, I realized the loop should perhaps be
> ```
> mb := nat(nil).setUint64(b) // ensure mb starts big enough for b, even on
> 32-bit arch
> for m := a + 1; m <= b; m++ {
>   mb.setUint64(m)
>   z = z.mul(z, mb)
> }
> ```
> to avoid allocating repeatedly for `m`, which yields:
> BenchmarkIterativeMulRangeN-10  354685  3032 ns/op2129 B/op
>48 allocs/op
>
> On Sun, Jan 7, 2024 at 2:41 AM Rob Pike  wrote:
>
>> It seems reasonable but first I'd like to understand why the recursive
>> method is used. I can't deduce why, but the CL that adds it, by gri, does
>> Karatsuba multiplication, which implies something deep is going on. I'll
>> add him to the conversation.
>>
>> -rob
>>
>>
>>
>>
>> On Sun, Jan 7, 2024 at 5:46 PM John Jannotti  wrote:
>>
>>> I enjoy bignum implementations, so I was looking through nat.go and saw
>>> that `mulRange` is implemented in a surprising, recursive way,.  In the
>>> non-base case, `mulRange(a, b)` returns `mulrange(a, (a+b)/2) *
>>> mulRange(1+(a+b)/2, b)` (lots of big.Int ceremony elided).
>>>
>>> That's fine, but I didn't see any advantage over the straightforward
>>> (and simpler?) for loop.
>>>
>>> ```
>>> z = z.setUint64(a)
>>> for m := a + 1; m <= b; m++ {
>>> z = z.mul(z, nat(nil).setUint64(m))
>>> }
>>> return z
>>> ```
>>>
>>> In fact, I suspected the existing code was slower, and allocated a lot
>>> more.  That seems true. A quick benchmark, using the existing unit test as
>>> the benchmark, yields
>>> BenchmarkRecusiveMulRangeN-10   169417   6856 ns/op 9452
>>> B/op  338 allocs/op
>>> BenchmarkIterativeMulRangeN-10   265354   4269 ns/op 2505
>>> B/op  196 allocs/op
>>>
>>> I doubt `mulRange` is a performance bottleneck in anyone's code! But it
>>> is exported as `int.MulRange` so I guess it's viewed with some value.  And
>>> seeing as how the for-loop seems even easier to understand that the
>>> recursive version, maybe it's worth submitting a PR? (If so, should I
>>> create an issue first?)
>>>
>>>
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "golang-nuts" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to golang-nuts+unsubscr...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/golang-nuts/e6ceb75a-f8b7-4f77-97dc-9445fb750782n%40googlegroups.com
>>> 
>>> .
>>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/golang-nuts/CALZHwko9GC7C_JThfVdFxBQtCmu9%2B%2BNYun8Tope4608oiio_6g%40mail.gmail.com.


Re: [go-nuts] Any interest in nat.mulRange simplification/optimization?

2024-01-07 Thread John Jannotti
I'm equally curious.

FWIW, I realized the loop should perhaps be
```
mb := nat(nil).setUint64(b) // ensure mb starts big enough for b, even on
32-bit arch
for m := a + 1; m <= b; m++ {
  mb.setUint64(m)
  z = z.mul(z, mb)
}
```
to avoid allocating repeatedly for `m`, which yields:
BenchmarkIterativeMulRangeN-10  354685  3032 ns/op2129 B/op
 48 allocs/op

On Sun, Jan 7, 2024 at 2:41 AM Rob Pike  wrote:

> It seems reasonable but first I'd like to understand why the recursive
> method is used. I can't deduce why, but the CL that adds it, by gri, does
> Karatsuba multiplication, which implies something deep is going on. I'll
> add him to the conversation.
>
> -rob
>
>
>
>
> On Sun, Jan 7, 2024 at 5:46 PM John Jannotti  wrote:
>
>> I enjoy bignum implementations, so I was looking through nat.go and saw
>> that `mulRange` is implemented in a surprising, recursive way,.  In the
>> non-base case, `mulRange(a, b)` returns `mulrange(a, (a+b)/2) *
>> mulRange(1+(a+b)/2, b)` (lots of big.Int ceremony elided).
>>
>> That's fine, but I didn't see any advantage over the straightforward (and
>> simpler?) for loop.
>>
>> ```
>> z = z.setUint64(a)
>> for m := a + 1; m <= b; m++ {
>> z = z.mul(z, nat(nil).setUint64(m))
>> }
>> return z
>> ```
>>
>> In fact, I suspected the existing code was slower, and allocated a lot
>> more.  That seems true. A quick benchmark, using the existing unit test as
>> the benchmark, yields
>> BenchmarkRecusiveMulRangeN-10   169417   6856 ns/op 9452 B/op
>>338 allocs/op
>> BenchmarkIterativeMulRangeN-10   265354   4269 ns/op 2505
>> B/op  196 allocs/op
>>
>> I doubt `mulRange` is a performance bottleneck in anyone's code! But it
>> is exported as `int.MulRange` so I guess it's viewed with some value.  And
>> seeing as how the for-loop seems even easier to understand that the
>> recursive version, maybe it's worth submitting a PR? (If so, should I
>> create an issue first?)
>>
>>
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "golang-nuts" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to golang-nuts+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/golang-nuts/e6ceb75a-f8b7-4f77-97dc-9445fb750782n%40googlegroups.com
>> 
>> .
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/golang-nuts/CALZHwkqd7ZQ6_VxYBJx8GWMjTLcqGo-3ausEC9rc21z3nHjyQA%40mail.gmail.com.


Re: [go-nuts] Any interest in nat.mulRange simplification/optimization?

2024-01-06 Thread Rob Pike
It seems reasonable but first I'd like to understand why the recursive
method is used. I can't deduce why, but the CL that adds it, by gri, does
Karatsuba multiplication, which implies something deep is going on. I'll
add him to the conversation.

-rob




On Sun, Jan 7, 2024 at 5:46 PM John Jannotti  wrote:

> I enjoy bignum implementations, so I was looking through nat.go and saw
> that `mulRange` is implemented in a surprising, recursive way,.  In the
> non-base case, `mulRange(a, b)` returns `mulrange(a, (a+b)/2) *
> mulRange(1+(a+b)/2, b)` (lots of big.Int ceremony elided).
>
> That's fine, but I didn't see any advantage over the straightforward (and
> simpler?) for loop.
>
> ```
> z = z.setUint64(a)
> for m := a + 1; m <= b; m++ {
> z = z.mul(z, nat(nil).setUint64(m))
> }
> return z
> ```
>
> In fact, I suspected the existing code was slower, and allocated a lot
> more.  That seems true. A quick benchmark, using the existing unit test as
> the benchmark, yields
> BenchmarkRecusiveMulRangeN-10   169417   6856 ns/op 9452 B/op
>338 allocs/op
> BenchmarkIterativeMulRangeN-10   265354   4269 ns/op 2505 B/op
>196 allocs/op
>
> I doubt `mulRange` is a performance bottleneck in anyone's code! But it is
> exported as `int.MulRange` so I guess it's viewed with some value.  And
> seeing as how the for-loop seems even easier to understand that the
> recursive version, maybe it's worth submitting a PR? (If so, should I
> create an issue first?)
>
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "golang-nuts" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to golang-nuts+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/golang-nuts/e6ceb75a-f8b7-4f77-97dc-9445fb750782n%40googlegroups.com
> 
> .
>

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/golang-nuts/CAOXNBZRw27kJkxzXJE8emV4o2OFSjTkc%2B65dJnJBT-b%2B%2BnoZJQ%40mail.gmail.com.