-heap-size problem ?

2024-05-22 Thread T.D. Telford
I recently posted a problem using big numbers that ran quite a bit slower than 
the current Racket.  Peter Bex supplied 2 patches that were a great improvement.
While trying to increase the speed I used the csc option    -heap-size 1000M
where I varied the size from 1000M to 1M.  I have 32 GB of memory and use 
linux mint 21.3.  In all cases I would get the run time message[panic] out 
of memory - heap full - execution terminated

If I don't use this option, the program runs ok.  The program is very recursive 
and uses big numbers (80 to 160 decimal digits). 
Any ideas why this would occur?



Re: Improve "busy" numeric code's performance [was: Re: Big Integers]

2024-05-22 Thread T.D. Telford
 With patch 0001 the elapsed time went from 33.7 seconds to 24.5 seconds.
With patch 0002 the elapsed time went to 23.4 seconds.
Good work -- Doug
On Wednesday, May 22, 2024 at 08:54:49 AM MDT, Peter Bex 
 wrote:  
 
 On Wed, May 22, 2024 at 02:42:38PM +0200, Peter Bex wrote:
> Attached are two patches, one which has this bigger improvement, and
> another which is a minor improvement which translates to shaving about
> a second of runtime off your program (at least on my machine).

The minor patch was incorrect.  I copied some code from C_s_a_i_remainder
into C_s_a_i_modulo inline, but that code had an early return for the
case where both numbers are flonums.  This code needs to be adjusted to
handle the case when the arguments aren't of the same sign, just like we
do after returning from integer_divrem().

I'm sending both patches again for your convenience.  The second patch
has a fix for the aforementioned issue.

Cheers,
Peter
  

Re: Big Integers

2024-05-22 Thread T.D. Telford
 Hello Mario,
Yes, please add the program to the chicken-benchmarks.
Regards,Doug
On Wednesday, May 22, 2024 at 12:50:56 PM MDT, Mario Domenech Goulart 
 wrote:  
 
 Hi Doug,

On Tue, 21 May 2024 21:35:33 + (UTC) "T.D. Telford"  
wrote:

> Thanks for the reply.  The elapsed timings for the program rho3rec are:
>
> chicken 5.3.0:  33.6 seconds
> Racket v8.2 [cs]  : 18.1 seconds
> Dr Racket : 20.6 seconds  (1 MB memory)
>
> The program uses the Pollard rho algorithm to find a factor of 2^257 -1.  The 
> program is highly recursive, but I have a
> non recursive version that has the same timing as the above.  I am using an 
> AMD Ryzen 5600 cpu and 32 GB of main memory.
>
> I tried all of the csc options that appeared related to speed, and the 
> maximum improvement was 0.1 seconds. 
>
> The one option I did not get to work was
>    -heap-size 1000M
> where I varied the size from 1000M to 1M.  In all cases I would get the 
> run time message
>    [panic] out of memory - heap full - execution terminated
>
> I have include the program listing below and also as an attachment.
>
> Regards,
> Doug
>
> ;;;
> ;;#lang racket        ;; uncomment for racket
>
> (define (rho n u v c iter prod)
>  (let* ( [u1 (modulo (+ (* u u) c) n)]
>          [v1 (modulo (+ (* v v) c) n)]
>          [v2 (modulo (+ (* v1 v1) c) n)]
>          [done #f]
>          [max_iter 3000]
>          [iter2 (+ iter 1)]
>          [prod2 (modulo (* prod (- u1 v2)) n)] )
>
>    (if (= (modulo iter2 150) 0)
>      (begin    ; modulo true
>        (let ( [g (gcd prod2 n) ] )
>          (if (and (> g 1) (< g n))
>            (begin ; factor
>              (display "factor = ") (display g) (newline)
>              (display "iterations = ") (display iter2) (newline)
>              (set! done #t)
>            )
>            (set! prod2 1) ; no factor
>          ) ; end if factor
>        ) ; end let 
>      ) ; end begin for modulo true
>      #f ;action for modulo false
>    ) ; end major if
>
>    (if (and (< iter2 max_iter) (not done)) 
>      (rho n u1 v2 c iter2 prod2)
>      (if done ; either found factor or max iterations
>        (display "normal termination \n")
>        #f
>      ) ; if done
>    ) ; if and 
>  ) ; end let*
> )
>
> (let ( [n (- (expt 2 257) 1)] [u 2] [v 11] [c 7] [iter 1] [prod 1] )
>    (display "factor n = ") (display n) (newline)
>    (time (rho n u v c iter prod))
> )
>
> ;;;

Thanks for providing the program.

Would it be ok to add it as a benchmark program to
https://github.com/mario-goulart/chicken-benchmarks?

All the best.
Mario
-- 
http://parenteses.org/mario
  

Re: Big Integers

2024-05-22 Thread Mario Domenech Goulart
Hi Doug,

On Tue, 21 May 2024 21:35:33 + (UTC) "T.D. Telford"  
wrote:

> Thanks for the reply.  The elapsed timings for the program rho3rec are:
>
> chicken 5.3.0:  33.6 seconds
> Racket v8.2 [cs]  : 18.1 seconds
> Dr Racket : 20.6 seconds  (1 MB memory)
>
> The program uses the Pollard rho algorithm to find a factor of 2^257 -1.  The 
> program is highly recursive, but I have a
> non recursive version that has the same timing as the above.  I am using an 
> AMD Ryzen 5600 cpu and 32 GB of main memory.
>
> I tried all of the csc options that appeared related to speed, and the 
> maximum improvement was 0.1 seconds. 
>
> The one option I did not get to work was
> -heap-size 1000M
> where I varied the size from 1000M to 1M.  In all cases I would get the 
> run time message
> [panic] out of memory - heap full - execution terminated
>
> I have include the program listing below and also as an attachment.
>
> Regards,
> Doug
>
> ;;;
> ;;#lang racket;; uncomment for racket
>
> (define (rho n u v c iter prod)
>   (let* ( [u1 (modulo (+ (* u u) c) n)]
>   [v1 (modulo (+ (* v v) c) n)]
>   [v2 (modulo (+ (* v1 v1) c) n)]
>   [done #f]
>   [max_iter 3000]
>   [iter2 (+ iter 1)]
>   [prod2 (modulo (* prod (- u1 v2)) n)] )
>
> (if (= (modulo iter2 150) 0)
>   (begin; modulo true
> (let ( [g (gcd prod2 n) ] )
>   (if (and (> g 1) (< g n))
> (begin ; factor
>   (display "factor = ") (display g) (newline)
>   (display "iterations = ") (display iter2) (newline)
>   (set! done #t)
> )
> (set! prod2 1) ; no factor
>   ) ; end if factor
> ) ; end let 
>   ) ; end begin for modulo true
>   #f ;action for modulo false
> ) ; end major if
>
> (if (and (< iter2 max_iter) (not done)) 
>   (rho n u1 v2 c iter2 prod2)
>   (if done ; either found factor or max iterations
> (display "normal termination \n")
> #f
>   ) ; if done
> ) ; if and 
>   ) ; end let*
> )
>
> (let ( [n (- (expt 2 257) 1)] [u 2] [v 11] [c 7] [iter 1] [prod 1] )
> (display "factor n = ") (display n) (newline)
> (time (rho n u v c iter prod))
> )
>
> ;;;

Thanks for providing the program.

Would it be ok to add it as a benchmark program to
https://github.com/mario-goulart/chicken-benchmarks?

All the best.
Mario
-- 
http://parenteses.org/mario



Re: Improve "busy" numeric code's performance [was: Re: Big Integers]

2024-05-22 Thread felix . winkelmann
> On Wed, May 22, 2024 at 02:42:38PM +0200, Peter Bex wrote:
> > Attached are two patches, one which has this bigger improvement, and
> > another which is a minor improvement which translates to shaving about
> > a second of runtime off your program (at least on my machine).
>
> The minor patch was incorrect.  I copied some code from C_s_a_i_remainder
> into C_s_a_i_modulo inline, but that code had an early return for the
> case where both numbers are flonums.  This code needs to be adjusted to
> handle the case when the arguments aren't of the same sign, just like we
> do after returning from integer_divrem().
>
> I'm sending both patches again for your convenience.  The second patch
> has a fix for the aforementioned issue.
>

Appled - thanks!


felix




Re: Improve "busy" numeric code's performance [was: Re: Big Integers]

2024-05-22 Thread Peter Bex
On Wed, May 22, 2024 at 02:42:38PM +0200, Peter Bex wrote:
> Attached are two patches, one which has this bigger improvement, and
> another which is a minor improvement which translates to shaving about
> a second of runtime off your program (at least on my machine).

The minor patch was incorrect.  I copied some code from C_s_a_i_remainder
into C_s_a_i_modulo inline, but that code had an early return for the
case where both numbers are flonums.  This code needs to be adjusted to
handle the case when the arguments aren't of the same sign, just like we
do after returning from integer_divrem().

I'm sending both patches again for your convenience.  The second patch
has a fix for the aforementioned issue.

Cheers,
Peter
>From d2999bb77a6ba8665be9ca997a85481b670705ea Mon Sep 17 00:00:00 2001
From: Peter Bex 
Date: Wed, 22 May 2024 13:53:28 +0200
Subject: [PATCH 1/2] Don't free the memory for the scratchspace on minor GC

This is too costly if we're in a loop that's operating on the
scratchspace, because it will just reallocate the scratchspace, and
use a small scratchspace which will cause excessive reallocation.
---
 runtime.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/runtime.c b/runtime.c
index 9becd2aa..5116ed37 100644
--- a/runtime.c
+++ b/runtime.c
@@ -3675,13 +3675,16 @@ C_regparm void C_fcall C_reclaim(void *trampoline, 
C_word c)
   }
 
   /* GC will have copied any live objects out of scratch space: clear it */
-  if (C_scratchspace_start != NULL) {
-C_free(C_scratchspace_start);
-C_scratchspace_start = NULL;
-C_scratchspace_top = NULL;
-C_scratchspace_limit = NULL;
+  if (C_scratchspace_start != C_scratchspace_top) {
+/* And drop the scratchspace in case of a major or reallocating collection 
*/
+if (gc_mode != GC_MINOR) {
+  C_free(C_scratchspace_start);
+  C_scratchspace_start = NULL;
+  C_scratchspace_limit = NULL;
+  scratchspace_size = 0;
+}
+C_scratchspace_top = C_scratchspace_start;
 C_scratch_usage = 0;
-scratchspace_size = 0;
   }
 
   if(gc_mode == GC_MAJOR) {
-- 
2.42.0

>From 24a7600f3766c0313ef2cbc611499c10e70b5d40 Mon Sep 17 00:00:00 2001
From: Peter Bex 
Date: Wed, 22 May 2024 13:58:11 +0200
Subject: [PATCH 2/2] Call integer_divrem straight from the modulo operations

These would call C_s_a_i_remainder initially, which performs many of
the same checks that we've just already done before actually calling
integer_divrem.  It also allocates more memory on the stack which is
not necessary.

In the case of the generic operator, doing so requires duplicating the
code which converts floating-point integers to exact integers and back
again.

In the case of the integer-specific operator, there's no point in
calling the *generic* remainder function in the first place!
---
 runtime.c | 37 +
 1 file changed, 33 insertions(+), 4 deletions(-)

diff --git a/runtime.c b/runtime.c
index 5116ed37..29dbcb00 100644
--- a/runtime.c
+++ b/runtime.c
@@ -9243,7 +9243,8 @@ C_s_a_u_i_integer_remainder(C_word **ptr, C_word n, 
C_word x, C_word y)
 C_regparm C_word C_fcall
 C_s_a_i_modulo(C_word **ptr, C_word n, C_word x, C_word y)
 {
-  C_word ab[C_SIZEOF_FIX_BIGNUM], *a = ab, r;
+  C_word ab[C_SIZEOF_FIX_BIGNUM], *a = ab, r,
+ nx = C_SCHEME_FALSE, ny = C_SCHEME_FALSE;
 
   if (!C_truep(C_i_integerp(x)))
 barf(C_BAD_ARGUMENT_TYPE_NO_INTEGER_ERROR, "modulo", x);
@@ -9251,13 +9252,41 @@ C_s_a_i_modulo(C_word **ptr, C_word n, C_word x, C_word 
y)
 barf(C_BAD_ARGUMENT_TYPE_NO_INTEGER_ERROR, "modulo", y);
   if (C_truep(C_i_zerop(y))) C_div_by_zero_error("modulo");
 
-  r = C_s_a_i_remainder(, 2, x, y);
-  if (C_i_positivep(y) != C_i_positivep(r) && !C_truep(C_i_zerop(r))) {
+  if (C_truep(C_i_flonump(x))) {
+if C_truep(C_i_flonump(y)) {
+  double dx = C_flonum_magnitude(x), dy = C_flonum_magnitude(y), tmp;
+
+  C_modf(dx / dy, );
+  tmp = dx - tmp * dy;
+  if ((dx > 0.0) != (dy > 0.0) && tmp != 0.0) {
+return C_flonum(ptr, tmp + dy);
+  } else {
+return C_flonum(ptr, tmp);
+  }
+}
+x = nx = C_s_a_u_i_flo_to_int(, 1, x);
+  }
+  if (C_truep(C_i_flonump(y))) {
+y = ny = C_s_a_u_i_flo_to_int(, 1, y);
+  }
+
+  integer_divrem(, x, y, NULL, );
+  if (C_i_positivep(y) != C_i_positivep(r) && r != C_fix(0)) {
 C_word m = C_s_a_i_plus(ptr, 2, r, y);
 m = move_buffer_object(ptr, ab, m);
 clear_buffer_object(ab, r);
 r = m;
   }
+
+  if (C_truep(nx) || C_truep(ny)) {
+C_word newr = C_a_i_exact_to_inexact(ptr, 1, r);
+clear_buffer_object(ab, r);
+r = newr;
+
+clear_buffer_object(ab, nx);
+clear_buffer_object(ab, ny);
+  }
+
   return move_buffer_object(ptr, ab, r);
 }
 
@@ -9267,7 +9296,7 @@ C_s_a_u_i_integer_modulo(C_word **ptr, C_word n, C_word 
x, C_word y)
   C_word ab[C_SIZEOF_FIX_BIGNUM], *a = ab, r;
   if (y == C_fix(0)) C_div_by_zero_error("modulo");
 
-  r = C_s_a_i_remainder(, 2, 

Improve "busy" numeric code's performance [was: Re: Big Integers]

2024-05-22 Thread Peter Bex
Hello Doug and CHICKEN hackers,

Thanks for the benchmarking program.  Somehow your e-mail is a bit
garbled, hence the top-posting, for which I apologize.

The benchmark in question deals mostly with small-sized bignums, which
means we're not even trigger the fancier division algorithms.  This is
all well within the range where the classical Knuth / schoolbook
division algorithm performs fine (which Chez also uses in this case).

In particular, the algorithm calls modulo many many times, which causes
some thrashing in the "scratch space" that we set aside for temporary
calculations when the code in question cannot allocate in the heap.
You can see this when you run your program with -:d (for debug).
Note: the code gets compiled to almost optimal code - it results in a
tight C loop that only occasionally jumps out to display something.
So that particular loop is very "busy" in that it calls out to numeric
code (which allocates in the scratch space) a lot.

I've had a look and I think we can improve this considerably by not
deallocating the scratch space memory area on every minor GC.  I think I
originally did this because I didn't fully trust the scratch space code
and didn't want it to eat up too much memory.  But I think the code is
stable enough now that we can postpone the deallocation (or who knows,
we could even get rid of it entirely?  This would be at the cost of
slightly higher memory use than strictly necessary).

Attached are two patches, one which has this bigger improvement, and
another which is a minor improvement which translates to shaving about
a second of runtime off your program (at least on my machine).

These are for -hackers to sign off on and push to git, but you should
also be able to apply them to the runtime.c of a downloaded copy of
CHICKEN so you don't need to go through the bootstrapping procedure if
you don't want to build from git.

For me, together this reduces the runtime from 58 seconds to 39 seconds,
which is about a 30% speedup.  If that translates to the same kind of
speedup on your machine, we'll be a lot closer to Racket.  I think the
rest is probably overhead which we might be able to shave off but it'll
be a lot more effort (though I could be wrong of course).

Finally, you can squeeze out a little bit more performance by wrapping
the body of "rho" in an assume, like so:

(define (rho n u v c iter prod)
  (assume ((n integer)
   (u integer)
   (v integer)
   (c integer)
   (iter integer)
   (prod integer))
   (let* ( [u1 (modulo (+ (* u u) c) n)]
 

This will avoid emitting generic code which has to do a type check every
single time it calls modulo, + or *.  The somewhat limited flow analysis
we have apparently is unable to figure out that these integer constants
make their way into rho's arguments, even in block compilation mode.

Cheers,
Peter

On Tue, May 21, 2024 at 09:35:33PM +, T.D. Telford wrote:
>  Hello Peter,
> Thanks for the reply.  The elapsed timings for the program rho3rec are:
> chicken 5.3.0:  33.6 secondsRacket v8.2 [cs]  : 18.1 secondsDr Racket : 20.6 
> seconds  (1 MB memory)
> The program uses the Pollard rho algorithm to find a factor of 2^257 -1.  The 
> program is highly recursive, but I have a non recursive version that has the 
> same timing as the above.  I am using an AMD Ryzen 5600 cpu and 32 GB of main 
> memory.
> I tried all of the csc options that appeared related to speed, and the 
> maximum improvement was 0.1 seconds. 
> The one option I did not get to work was    -heap-size 1000M
> where I varied the size from 1000M to 1M.  In all cases I would get the 
> run time message[panic] out of memory - heap full - execution terminated
> 
> I have include the program listing below and also as an attachment.
> Regards,Doug
> ;#lang
>  racket        ;; uncomment for racket
> 
> (define (rho n u v c iter prod)  (let* ( [u1 (modulo (+ (* u u) c) n)]        
>   [v1 (modulo (+ (* v v) c) n)]          [v2 (modulo (+ (* v1 v1) c) n)]      
>     [done #f]          [max_iter 3000]          [iter2 (+ iter 1)]        
>   [prod2 (modulo (* prod (- u1 v2)) n)] )
>     (if (= (modulo iter2 150) 0)      (begin    ; modulo true        (let ( 
> [g (gcd prod2 n) ] )          (if (and (> g 1) (< g n))            (begin ; 
> factor              (display "factor = ") (display g) (newline)              
> (display "iterations = ") (display iter2) (newline)              (set! done 
> #t)            )            (set! prod2 1) ; no factor          ) ; end if 
> factor        ) ; end let       ) ; end begin for modulo true      #f ;action 
> for modulo false    ) ; end major if
>     (if (and (< iter2 max_iter) (not done))       (rho n u1 v2 c iter2 prod2) 
>      (if done ; either found factor or max iterations        (display "normal 
> termination \n")        #f      ) ; if done    ) ; if and   ) ; end