Re: Performance question concerning chicken flonum vs "foreign flonum"

2022-04-10 Thread felix . winkelmann
> Dear felix,
>
> after coming back to this function and the associated issues regularly, I 
> revised my opinion on integrating "fp+*" into (chicken flonum), given it uses 
> the C99-fma function. On the one hand, this operation is so fundamental in 
> numerical computations that it warrants a specialized function, on the other 
> hand the (somewhat) improved rounding could help a little. Finally, Gauche ( 
> https://practical-scheme.net/gauche/man/gauche-refe/R7RS-large.html#index-fl_002b_002a
>  ) and MIT Scheme ( 
> https://www.gnu.org/software/mit-scheme/documentation/stable/mit-scheme-ref.html#Flonum-Operations
>  ) provide this functionality. All in all, I would really appreciate if an 
> inclusion of a fma-based "fp+*" function into the (chicken flonum) module 
> could be considered in future versions of CHICKEN Scheme. Maybe your provided 
> patch reduces the effort for this.

All right, I'll submit the existing patch to the mailing list. Thanks for your
suggestion - it makes sense to follow the other implementations here.


cheers,
felix




Re: Performance question concerning chicken flonum vs "foreign flonum"

2022-04-08 Thread Christian Himpe


Christian Himpe wrote on 2021-11-07:

> felix.winkelm...@bevuta.com wrote on 2021-11-07:
> > > Dear Felix,
> > >
> > > Thank you for the patch. I built the current git head with your patch.
> > > After importing chicken.flonum, I get the following error when calling 
> > > fp*+:
> > >

> > I'm terribly sorry. I'm an ass, I didn't even test it in the interpreter.
> > Please find attached a revised patch.


> > felix

> Dear felix,

> the latest patch works. I extended my test code and here are the results:

> without -C -mfma:

> csc -O5 -d0 -C -O3 fma-test.scm && ./fma-test
> 7.998s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
> 10.104s CPU time, 0/256410 GCs (major/minor), maximum live heap: 30.78 MiB
> 10.69s CPU time, 0/311364 GCs (major/minor), maximum live heap: 30.78 MiB

> with -C -mfma:

> csc -O5 -d0 -C -O3 -C -mfma fma-test.scm && ./fma-test
> 7.697s CPU time, 0/238095 GCs (major/minor), maximum live heap: 30.78 MiB
> 9.135s CPU time, 0/262467 GCs (major/minor), maximum live heap: 30.78 MiB
> 11.008s CPU time, 0/317460 GCs (major/minor), maximum live heap: 30.78 MiB

> It seems the number of GCs for fp*+ is a lot higher than for fp*/fp+ or 
> c99-fma, with or without the fma compiler flag. So currently there seems to 
> be no benefit in integrating C99's fma as fp*+, apart from a slightly smaller 
> rounding error. For me, at least, this is unexpected.

> Thank you for providing the patch. If you want to test something in this 
> regard in the future, I am happy to test further patches.

> Cheers

> Christian

Dear felix,

after coming back to this function and the associated issues regularly, I 
revised my opinion on integrating "fp+*" into (chicken flonum), given it uses 
the C99-fma function. On the one hand, this operation is so fundamental in 
numerical computations that it warrants a specialized function, on the other 
hand the (somewhat) improved rounding could help a little. Finally, Gauche ( 
https://practical-scheme.net/gauche/man/gauche-refe/R7RS-large.html#index-fl_002b_002a
 ) and MIT Scheme ( 
https://www.gnu.org/software/mit-scheme/documentation/stable/mit-scheme-ref.html#Flonum-Operations
 ) provide this functionality. All in all, I would really appreciate if an 
inclusion of a fma-based "fp+*" function into the (chicken flonum) module could 
be considered in future versions of CHICKEN Scheme. Maybe your provided patch 
reduces the effort for this.

Thank you very much

Christian 



Re: Performance question concerning chicken flonum vs "foreign flonum"

2021-11-07 Thread Christian Himpe


felix.winkelm...@bevuta.com wrote on 2021-11-07:
> > Dear Felix,
> >
> > Thank you for the patch. I built the current git head with your patch.
> > After importing chicken.flonum, I get the following error when calling fp*+:
> >

> I'm terribly sorry. I'm an ass, I didn't even test it in the interpreter.
> Please find attached a revised patch.


> felix

Dear felix,

the latest patch works. I extended my test code and here are the results:

without -C -mfma:

csc -O5 -d0 -C -O3 fma-test.scm && ./fma-test
7.998s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
10.104s CPU time, 0/256410 GCs (major/minor), maximum live heap: 30.78 MiB
10.69s CPU time, 0/311364 GCs (major/minor), maximum live heap: 30.78 MiB

with -C -mfma:

csc -O5 -d0 -C -O3 -C -mfma fma-test.scm && ./fma-test 
7.697s CPU time, 0/238095 GCs (major/minor), maximum live heap: 30.78 MiB
9.135s CPU time, 0/262467 GCs (major/minor), maximum live heap: 30.78 MiB
11.008s CPU time, 0/317460 GCs (major/minor), maximum live heap: 30.78 MiB

It seems the number of GCs for fp*+ is a lot higher than for fp*/fp+ or c99-fma, 
with or without the fma compiler flag. So currently there seems to be no benefit 
in integrating C99's fma as fp*+, apart from a slightly smaller rounding error. 
For me, at least, this is unexpected.

Thank you for providing the patch. If you want to test something in this regard 
in the future, I am happy to test further patches.

Cheers

Christian



Re: Performance question concerning chicken flonum vs "foreign flonum"

2021-11-07 Thread felix . winkelmann
> Dear Felix,
>
> Thank you for the patch. I built the current git head with your patch.
> After importing chicken.flonum, I get the following error when calling fp*+:
>

I'm terribly sorry. I'm an ass, I didn't even test it in the interpreter. Please
find attached a revised patch.


felix
From 29b7abfd1a990e1fe4fc10f3d2532eadd079151f Mon Sep 17 00:00:00 2001
From: felix 
Date: Sun, 7 Nov 2021 13:48:31 +0100
Subject: [PATCH] Add support for fused-multiply-add

(suggested by Christian Himpe on chicken-users)
---
 NEWS                           | 4 ++++
 c-platform.scm                 | 3 ++-
 chicken.h                      | 2 ++
 lfa2.scm                       | 2 ++
 library.scm                    | 6 ++++++
 manual/Module (chicken flonum) | 3 ++-
 types.db                       | 3 +++
 7 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/NEWS b/NEWS
index de01c00e..69fe5054 100644
--- a/NEWS
+++ b/NEWS
@@ -4,6 +4,10 @@
   - Default "cc" on BSD systems for building CHICKEN to avoid ABI problems
 when linking with C++ code.
 
+- Core libraries
+  - Added "fp*+" (fused multiply-add) to "chicken.flonum" module 
+(suggested by Christian Himpe).
+
 5.3.0rc4
 
 - Compiler
diff --git a/c-platform.scm b/c-platform.scm
index 00960c82..e59b1f1c 100644
--- a/c-platform.scm
+++ b/c-platform.scm
@@ -149,7 +149,7 @@
 
 (define-constant +flonum-bindings+
   (map (lambda (x) (symbol-append 'chicken.flonum# x))
-   '(fp/? fp+ fp- fp* fp/ fp> fp< fp= fp>= fp<= fpmin fpmax fpneg fpgcd
+   '(fp/? fp+ fp- fp* fp/ fp> fp< fp= fp>= fp<= fpmin fpmax fpneg fpgcd fp*+
 fpfloor fpceiling fptruncate fpround fpsin fpcos fptan fpasin fpacos
 fpatan fpatan2 fpexp fpexpt fplog fpsqrt fpabs fpinteger?)))
 
@@ -652,6 +652,7 @@
 (rewrite 'chicken.flonum#fp/? 16 2 "C_a_i_flonum_quotient_checked" #f words-per-flonum)
 (rewrite 'chicken.flonum#fpneg 16 1 "C_a_i_flonum_negate" #f words-per-flonum)
 (rewrite 'chicken.flonum#fpgcd 16 2 "C_a_i_flonum_gcd" #f words-per-flonum)
+(rewrite 'chicken.flonum#fp*+ 16 3 "C_a_i_flonum_multiply_add" #f words-per-flonum)
 
 (rewrite 'scheme#zero? 5 "C_eqp" 0 'fixnum)
 (rewrite 'scheme#zero? 2 1 "C_u_i_zerop2" #f)
diff --git a/chicken.h b/chicken.h
index 7e51a38f..ba075471 100644
--- a/chicken.h
+++ b/chicken.h
@@ -1204,6 +1204,7 @@ typedef void (C_ccall *C_proc)(C_word, C_word *) C_noret;
 #define C_a_i_flonum_plus(ptr, c, n1, n2) C_flonum(ptr, C_flonum_magnitude(n1) + C_flonum_magnitude(n2))
 #define C_a_i_flonum_difference(ptr, c, n1, n2) C_flonum(ptr, C_flonum_magnitude(n1) - C_flonum_magnitude(n2))
 #define C_a_i_flonum_times(ptr, c, n1, n2) C_flonum(ptr, C_flonum_magnitude(n1) * C_flonum_magnitude(n2))
+#define C_a_i_flonum_multiply_add(ptr, c, n1, n2, n3) C_flonum(ptr, fma(C_flonum_magnitude(n1), C_flonum_magnitude(n2), C_flonum_magnitude(n3)))
 #define C_a_i_flonum_quotient(ptr, c, n1, n2) C_flonum(ptr, C_flonum_magnitude(n1) / C_flonum_magnitude(n2))
 #define C_a_i_flonum_negate(ptr, c, n)  C_flonum(ptr, -C_flonum_magnitude(n))
 #define C_a_u_i_flonum_signum(ptr, n, x) (C_flonum_magnitude(x) == 0.0 ? (x) : ((C_flonum_magnitude(x) < 0.0) ? C_flonum(ptr, -1.0) : C_flonum(ptr, 1.0)))
@@ -1513,6 +1514,7 @@ typedef void (C_ccall *C_proc)(C_word, C_word *) C_noret;
 #define C_ub_i_flonum_difference(x, y)  ((x) - (y))
 #define C_ub_i_flonum_times(x, y)   ((x) * (y))
 #define C_ub_i_flonum_quotient(x, y)((x) / (y))
+#define C_ub_i_flonum_multiply_add(x, y, z)fma((x), (y), (z))
 
 #define C_ub_i_flonum_equalp(n1, n2)C_mk_bool((n1) == (n2))
 #define C_ub_i_flonum_greaterp(n1, n2)  C_mk_bool((n1) > (n2))
diff --git a/lfa2.scm b/lfa2.scm
index 45057578..e4bd308e 100644
--- a/lfa2.scm
+++ b/lfa2.scm
@@ -191,6 +191,7 @@
 ("C_a_i_flonum_sqrt" float)
 ("C_a_i_flonum_tan" float)
 ("C_a_i_flonum_times" float)
+("C_a_i_flonum_multiply_add" float)
 ("C_a_i_flonum_truncate" float)
 ("C_a_u_i_f64vector_ref" float)
 ("C_a_u_i_f32vector_ref" float)
@@ -201,6 +202,7 @@
   '(("C_a_i_flonum_plus" "C_ub_i_flonum_plus" op)
 ("C_a_i_flonum_difference" "C_ub_i_flonum_difference" op)
 ("C_a_i_flonum_times" "C_ub_i_flonum_times" op)
+("C_a_i_flonum_multiply_add" "C_ub_i_flonum_multiply_add" op)
 ("C_a_i_flonum_quotient" "C_ub_i_flonum_quotient" op)
 ("C_flonum_equalp" "C_ub_i_flonum_equalp" pred)
 ("C_flonum_greaterp" "C_ub_i_flonum_greaterp" pred)
diff --git a/library.scm b/library.scm
index 6c6a6942..45182e84 100644
--- a/library.scm
+++ b/library.scm
@@ -1590,6 +1590,12 @@ EOF
   (fp-check-flonums x y 'fp/)
   (##core#inline_allocate ("C_a_i_flonum_quotient" 4) x y) )
 
+(define (fp*+ x y z) 
+  (unless (and (flonum? x) (flonum? y) (flonum? z))
+(##sys#error-hook (foreign-value "C_BAD_ARGUMENT_TYPE_NO_FLONUM_ERROR" int)
+  'fp*+ x y z) )
+  (##core#inline_allocate ("C_a_i_flonum_multiply_add" 4) x y z) )
+
 (define (fpgcd x y)
   (fp-check-flonums x y 'fpgcd)
   (##core#inline_allocate 

Re: Performance question concerning chicken flonum vs "foreign flonum"

2021-11-07 Thread Christian Himpe
Dear Felix,

Thank you for the patch. I built the current git head with your patch.
After importing chicken.flonum, I get the following error when calling fp*+:

#;2> (fp*+ 1.0 2.0 3.0)

Error: unbound variable: g18021803

Call history:

  (fp*+ 1.0 2.0 3.0)
(fp*+ 1.0 2.0 3.0)<--

But, fp*+ is found:

#;2> (procedure? fp*+)
#t

I performed the following build steps:

git clone git://code.call-cc.org/chicken-core
cd, mv etc.
patch -p1 < 0001-Add-support-for-fused-multiply-add.patch
make PREFIX=XXX PLATFORM=linux OPTIMIZE_FOR_SPEED=1 
CHICKEN=XXX/chicken52/bin/chicken
make PREFIX=XXX PLATFORM=linux install

Best

Christian



felix.winkelm...@bevuta.com wrote on 2021-11-07:
> Hi!

> Here is a patch against the current git HEAD, adding support for "fp*+". 
> Please give it a try, if you want.
> This is experimental; if people consider this worthwhile, I can submit it for 
> addition to the core system. Note that you may still need to pass extra 
> C-compiler options to enable inlining of the fma(3) call.


> cheers,
> felix

-- 
Dr. rer. nat. Christian Himpe
University of Münster / Applied Mathematics Münster
Orléans-Ring 10 / 48149 Münster / Germany
https://himpe.science



Re: Performance question concerning chicken flonum vs "foreign flonum"

2021-11-07 Thread felix . winkelmann
Hi!

Here is a patch against the current git HEAD, adding support for "fp*+". Please 
give it a try, if you want.
This is experimental; if people consider this worthwhile, I can submit it for 
addition to the core system. Note that you may still need to pass extra 
C-compiler options to enable inlining of the fma(3) call.


cheers,
felix
From 0f9c68a2b3954eb7c7d2a6075d6b4dfa3dcfb2a5 Mon Sep 17 00:00:00 2001
From: felix 
Date: Sun, 7 Nov 2021 13:48:31 +0100
Subject: [PATCH] Add support for fused-multiply-add

(suggested by Christian Himpe on chicken-users)
---
 NEWS                           | 4 ++++
 c-platform.scm                 | 3 ++-
 chicken.h                      | 2 ++
 lfa2.scm                       | 2 ++
 library.scm                    | 4 ++++
 manual/Module (chicken flonum) | 3 ++-
 types.db                       | 3 +++
 7 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/NEWS b/NEWS
index de01c00e..69fe5054 100644
--- a/NEWS
+++ b/NEWS
@@ -4,6 +4,10 @@
   - Default "cc" on BSD systems for building CHICKEN to avoid ABI problems
 when linking with C++ code.
 
+- Core libraries
+  - Added "fp*+" (fused multiply-add) to "chicken.flonum" module 
+(suggested by Christian Himpe).
+
 5.3.0rc4
 
 - Compiler
diff --git a/c-platform.scm b/c-platform.scm
index 00960c82..e59b1f1c 100644
--- a/c-platform.scm
+++ b/c-platform.scm
@@ -149,7 +149,7 @@
 
 (define-constant +flonum-bindings+
   (map (lambda (x) (symbol-append 'chicken.flonum# x))
-   '(fp/? fp+ fp- fp* fp/ fp> fp< fp= fp>= fp<= fpmin fpmax fpneg fpgcd
+   '(fp/? fp+ fp- fp* fp/ fp> fp< fp= fp>= fp<= fpmin fpmax fpneg fpgcd fp*+
 fpfloor fpceiling fptruncate fpround fpsin fpcos fptan fpasin fpacos
 fpatan fpatan2 fpexp fpexpt fplog fpsqrt fpabs fpinteger?)))
 
@@ -652,6 +652,7 @@
 (rewrite 'chicken.flonum#fp/? 16 2 "C_a_i_flonum_quotient_checked" #f words-per-flonum)
 (rewrite 'chicken.flonum#fpneg 16 1 "C_a_i_flonum_negate" #f words-per-flonum)
 (rewrite 'chicken.flonum#fpgcd 16 2 "C_a_i_flonum_gcd" #f words-per-flonum)
+(rewrite 'chicken.flonum#fp*+ 16 3 "C_a_i_flonum_multiply_add" #f words-per-flonum)
 
 (rewrite 'scheme#zero? 5 "C_eqp" 0 'fixnum)
 (rewrite 'scheme#zero? 2 1 "C_u_i_zerop2" #f)
diff --git a/chicken.h b/chicken.h
index 7e51a38f..ba075471 100644
--- a/chicken.h
+++ b/chicken.h
@@ -1204,6 +1204,7 @@ typedef void (C_ccall *C_proc)(C_word, C_word *) C_noret;
 #define C_a_i_flonum_plus(ptr, c, n1, n2) C_flonum(ptr, C_flonum_magnitude(n1) + C_flonum_magnitude(n2))
 #define C_a_i_flonum_difference(ptr, c, n1, n2) C_flonum(ptr, C_flonum_magnitude(n1) - C_flonum_magnitude(n2))
 #define C_a_i_flonum_times(ptr, c, n1, n2) C_flonum(ptr, C_flonum_magnitude(n1) * C_flonum_magnitude(n2))
+#define C_a_i_flonum_multiply_add(ptr, c, n1, n2, n3) C_flonum(ptr, fma(C_flonum_magnitude(n1), C_flonum_magnitude(n2), C_flonum_magnitude(n3)))
 #define C_a_i_flonum_quotient(ptr, c, n1, n2) C_flonum(ptr, C_flonum_magnitude(n1) / C_flonum_magnitude(n2))
 #define C_a_i_flonum_negate(ptr, c, n)  C_flonum(ptr, -C_flonum_magnitude(n))
 #define C_a_u_i_flonum_signum(ptr, n, x) (C_flonum_magnitude(x) == 0.0 ? (x) : ((C_flonum_magnitude(x) < 0.0) ? C_flonum(ptr, -1.0) : C_flonum(ptr, 1.0)))
@@ -1513,6 +1514,7 @@ typedef void (C_ccall *C_proc)(C_word, C_word *) C_noret;
 #define C_ub_i_flonum_difference(x, y)  ((x) - (y))
 #define C_ub_i_flonum_times(x, y)   ((x) * (y))
 #define C_ub_i_flonum_quotient(x, y)((x) / (y))
+#define C_ub_i_flonum_multiply_add(x, y, z)fma((x), (y), (z))
 
 #define C_ub_i_flonum_equalp(n1, n2)C_mk_bool((n1) == (n2))
 #define C_ub_i_flonum_greaterp(n1, n2)  C_mk_bool((n1) > (n2))
diff --git a/lfa2.scm b/lfa2.scm
index 45057578..e4bd308e 100644
--- a/lfa2.scm
+++ b/lfa2.scm
@@ -191,6 +191,7 @@
 ("C_a_i_flonum_sqrt" float)
 ("C_a_i_flonum_tan" float)
 ("C_a_i_flonum_times" float)
+("C_a_i_flonum_multiply_add" float)
 ("C_a_i_flonum_truncate" float)
 ("C_a_u_i_f64vector_ref" float)
 ("C_a_u_i_f32vector_ref" float)
@@ -201,6 +202,7 @@
   '(("C_a_i_flonum_plus" "C_ub_i_flonum_plus" op)
 ("C_a_i_flonum_difference" "C_ub_i_flonum_difference" op)
 ("C_a_i_flonum_times" "C_ub_i_flonum_times" op)
+("C_a_i_flonum_multiply_add" "C_ub_i_flonum_multiply_add" op)
 ("C_a_i_flonum_quotient" "C_ub_i_flonum_quotient" op)
 ("C_flonum_equalp" "C_ub_i_flonum_equalp" pred)
 ("C_flonum_greaterp" "C_ub_i_flonum_greaterp" pred)
diff --git a/library.scm b/library.scm
index 6c6a6942..2fb82557 100644
--- a/library.scm
+++ b/library.scm
@@ -1590,6 +1590,10 @@ EOF
   (fp-check-flonums x y 'fp/)
   (##core#inline_allocate ("C_a_i_flonum_quotient" 4) x y) )
 
+(define (fp*+ x y z) 
+  (fp-check-flonums x y z 'fp*+)
+  (##core#inline_allocate ("C_a_i_flonum_multiply_add" 4) x y z) )
+
 (define (fpgcd x y)
   (fp-check-flonums x y 'fpgcd)
   (##core#inline_allocate ("C_a_i_flonum_gcd" 4) x y))
diff --git a/manual/Module (chicken flonum) b/manual/Module 

Re: Performance question concerning chicken flonum vs "foreign flonum"

2021-11-06 Thread felix . winkelmann
> a patch would  be great, if it is not too much work. Attached you find three 
> assembly language files:
>
> * fma-test_original.s (unchanged csc c to assembly)
> * fma-test_modified.s (modified csc c from previous mail)
> * fma-test_modified_mfma.s (modified csc c and -mfma gcc option)
>
> all files were created with the additional gcc arguments -O3 -S 
> -fverbose-asm. Are these files sufficient?

Yes, thanks a lot. The third case shows that the intrinsic is indeed used.

>
> I hoped the fma libc function would insulate one from intrinsics; the 
> compiler option -mfma should activate (I think via defining a C macro) the 
> use of the corresponding CPU instruction (fma3 on current x86), which my CPU 
> supports, but using it does not seem to make a difference.

Surprising, but I guess the speedup is too minor in the presence of all the 
noise regarding stack usage, etc. In straight-line, dumb, cache-friendly C code 
it may make a difference, I guess...

I will provide an experimental patch for the new operation. Stay tuned.


felix




Re: Performance question concerning chicken flonum vs "foreign flonum"

2021-11-05 Thread felix . winkelmann
> modified code:
>
> 7.378s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
> 8.498s CPU time, 0/238095 GCs (major/minor), maximum live heap: 30.78 MiB
>
> Both were compiled with -O3 optimization level in gcc.
>
> I am fine with these results, given your explanation of the internals.
>
> Would it be conceivable to include such fma functionality directly in 
> chicken.flonum, i.e. as fp+*, or are the included modules typically left 
> unaltered?

The core modules like chicken.flonum can be optimized freely, as they are always
delivered with the base system and the compiler is often tuned to treat these
specially. I wonder why the speed difference still exists. Could you send me the
generated assembly code for the test program, as produced by your compiler? I'd
like to see how far the C compiler goes at inlining the fma operation.
If this can give a noticeable speedup, I see no reason not to add such an
operation, but it would be nice to measure the effect before we do this. I can
send you a patch for testing if you like.

Note that one may have to use compiler intrinsics or special C compiler options
to enable this, see for example:


https://stackoverflow.com/questions/15933100/how-to-use-fused-multiply-add-fma-instructions-with-sse-avx
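
A quick way to see what a given set of compiler options provides is to test two preprocessor macros, as in this small sketch. The function names are made up here. __FMA__ is predefined by GCC and Clang on x86 when -mfma (or a -march setting that implies it) is in effect, and FP_FAST_FMA is the C99 macro from math.h that an implementation may define when fma is about as fast as a separate multiply and add.

```c
#include <math.h>

/* Returns 1 if this translation unit was compiled with hardware FMA
   code generation enabled on x86 (for example via -mfma), else 0. */
int compiled_with_hw_fma(void)
{
#ifdef __FMA__
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 if the implementation defines the C99 macro FP_FAST_FMA,
   meaning fma() is expected to be about as fast as multiply plus add. */
int implementation_reports_fast_fma(void)
{
#ifdef FP_FAST_FMA
    return 1;
#else
    return 0;
#endif
}
```

If the first function returns 0, calls to fma will typically go through the library routine instead of a single instruction.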


felix




Re: Performance question concerning chicken flonum vs "foreign flonum"

2021-11-05 Thread Christian Himpe


felix.winkelm...@bevuta.com wrote on 2021-11-04:
> > 7.558s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
> > 8.839s CPU time, 0/256410 GCs (major/minor), maximum live heap: 30.78 MiB
> >
> >[...]
> >
> > It would be great to get some help or explanation with this issue.

> Hi!

> I have similar timings, and the difference in the number of minor GCs indicates
> that the c99-fma variant allocates more stack space and thus causes more
> minor GCs.

> Looking at the generated C file ("csc -k"), we see that scm-fma unboxes the 
> intermediate result and thus generates relatively decent code:

> /* scm-fma in k183 in k180 in k177 in k174 */
> static void C_ccall f_187(C_word c,C_word *av){
> C_word tmp;
> C_word t0=av[0];
> C_word t1=av[1];
> C_word t2=av[2];
> C_word t3=av[3];
> C_word t4=av[4];
> C_word t5;
> double f0;
> C_word *a;
> if(C_unlikely(!C_demand(C_calculate_demand(4,c,1)))){
> C_save_and_reclaim((void *)f_187,c,av);}
> a=C_alloc(4);
> f0=C_ub_i_flonum_times(C_flonum_magnitude(t2),C_flonum_magnitude(t3));
> t5=t1;{
> C_word *av2=av;
> av2[0]=t5;
> av2[1]=C_flonum(&a,C_ub_i_flonum_plus(C_flonum_magnitude(t4),f0));
> ((C_proc)(void*)(*((C_word*)t5+1)))(2,av2);}}

> The other version allocates a bytevector to hold the result:

> /* c99-fma in k183 in k180 in k177 in k174 */
> static void C_ccall f_197(C_word c,C_word *av){
> C_word tmp;
> C_word t0=av[0];
> C_word t1=av[1];
> C_word t2=av[2];
> C_word t3=av[3];
> C_word t4=av[4];
> C_word t5;
> C_word t6;
> C_word *a;
> if(C_unlikely(!C_demand(C_calculate_demand(6,c,1)))){
> C_save_and_reclaim((void *)f_197,c,av);}
> a=C_alloc(6);
> t5=C_a_i_bytevector(&a,1,C_fix(4));
> t6=t1;{
> C_word *av2=av;
> av2[0]=t6;
> av2[1]=stub21(t5,t2,t3,t4);
> ((C_proc)(void*)(*((C_word*)t6+1)))(2,av2);}}

> I thought that the allocation of 4 words for the bytevector (which is more 
> than needed on a 64-bit machine) makes the difference, but it turns out to be 
> negligible. Changing it to 2 and also adjusting the values for 
> C_calculate_demand and C_alloc doesn't seem to change a lot, but you may want 
> to try that - just modify the C code and compile it with the same options as 
> the .scm file.

> On my laptop fma is a library call, so currently my guess is simply that
> the scm-fma code is tighter and avoids 3 additional function calls (one to 
> the stub,
> one to C_a_i_bytevector and one to fma). The increased number of GCs may
> also be caused by the bytevector above, which is used as a placeholder for
> the flonum result, which wastes one word.

> There is room for improvement for the compiler, though: the C_fix(4) is overly
> conservative (4 words are correct on 32-bit, taking care of flonum alignment, 
> but
> unnecessary on 64 bits). Also, the bytevector thing is a bit of a hack - we
> could actually just pass "a" to stub21 directly. You may want to try this out:

> /* c99-fma in k183 in k180 in k177 in k174 (modified) */
> static void C_ccall f_197(C_word c,C_word *av){
> C_word tmp;
> C_word t0=av[0];
> C_word t1=av[1];
> C_word t2=av[2];
> C_word t3=av[3];
> C_word t4=av[4];
> C_word t6;
> C_word *a;
> if(C_unlikely(!C_demand(C_calculate_demand(4,c,1)))){
> C_save_and_reclaim((void *)f_197,c,av);}
> a=C_alloc(4);
> t6=t1;{
> C_word *av2=av;
> av2[0]=t6;
> av2[1]=stub21((C_word)a,t2,t3,t4);
> ((C_proc)(void*)(*((C_word*)t6+1)))(2,av2);}}

> This reduces minor GCs on my machine to roughly the same number. If your
> compiler inlines stub21 and fma, then you should see comparable performance.
> Also, default optimization-levels for C are -Os (pass -v to csc to see what is
> passed to the C compiler), so using -O2 instead should make a difference.


> felix

Dear Felix,

thank you for your explanations. I tested your modified source and indeed the 
number of GCs is significantly reduced, but the timing difference remains:

original code:

7.656s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
8.849s CPU time, 0/256410 GCs (major/minor), maximum live heap: 30.78 MiB

modified code:

7.378s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
8.498s CPU time, 0/238095 GCs (major/minor), maximum live heap: 30.78 MiB

Both were compiled with -O3 optimization level in gcc.

I am fine with these results, given your explanation of the internals.

Would it be conceivable to include such fma functionality directly in 
chicken.flonum, i.e. as fp+*, or are the included modules typically left unaltered?

Thank you

Christian



Re: Performance question concerning chicken flonum vs "foreign flonum"

2021-11-04 Thread felix . winkelmann
> 7.558s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
> 8.839s CPU time, 0/256410 GCs (major/minor), maximum live heap: 30.78 MiB
>
>[...]
>
> It would be great to get some help or explanation with this issue.

Hi!

I have similar timings, and the difference in the number of minor GCs indicates
that the c99-fma variant allocates more stack space and thus causes more
minor GCs.
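
As a rough illustration of why a larger per-call allocation leads to more minor GCs, consider this toy model in C. It is only a sketch under simplifying assumptions and not how the CHICKEN runtime is implemented: every call takes a fixed number of words from a fixed-size nursery, and a minor GC empties the nursery whenever the next call would not fit. The function name and parameters are hypothetical.

```c
#include <stddef.h>

/* Toy model of stack (nursery) allocation: each call bumps the usage by
   words_per_call; when the remaining space is too small, a "minor GC"
   empties the nursery before the call proceeds. Returns the GC count.
   Assumes words_per_call <= nursery_words. */
unsigned long count_minor_gcs(unsigned long calls,
                              size_t words_per_call,
                              size_t nursery_words)
{
    unsigned long gcs = 0;
    size_t used = 0;
    for (unsigned long i = 0; i < calls; i++) {
        if (used + words_per_call > nursery_words) {
            gcs++;        /* nursery exhausted: collect and start over */
            used = 0;
        }
        used += words_per_call;
    }
    return gcs;
}
```

In this model, 1000 calls into a 100-word nursery trigger 39 collections at 4 words per call and 62 at 6 words per call. The real numbers also depend on argument vectors and continuation frames, so the model only shows the direction of the effect.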

Looking at the generated C file ("csc -k"), we see that scm-fma unboxes the 
intermediate result and thus generates relatively decent code:

/* scm-fma in k183 in k180 in k177 in k174 */
static void C_ccall f_187(C_word c,C_word *av){
C_word tmp;
C_word t0=av[0];
C_word t1=av[1];
C_word t2=av[2];
C_word t3=av[3];
C_word t4=av[4];
C_word t5;
double f0;
C_word *a;
if(C_unlikely(!C_demand(C_calculate_demand(4,c,1)))){
C_save_and_reclaim((void *)f_187,c,av);}
a=C_alloc(4);
f0=C_ub_i_flonum_times(C_flonum_magnitude(t2),C_flonum_magnitude(t3));
t5=t1;{
C_word *av2=av;
av2[0]=t5;
av2[1]=C_flonum(&a,C_ub_i_flonum_plus(C_flonum_magnitude(t4),f0));
((C_proc)(void*)(*((C_word*)t5+1)))(2,av2);}}

The other version allocates a bytevector to hold the result:

/* c99-fma in k183 in k180 in k177 in k174 */
static void C_ccall f_197(C_word c,C_word *av){
C_word tmp;
C_word t0=av[0];
C_word t1=av[1];
C_word t2=av[2];
C_word t3=av[3];
C_word t4=av[4];
C_word t5;
C_word t6;
C_word *a;
if(C_unlikely(!C_demand(C_calculate_demand(6,c,1)))){
C_save_and_reclaim((void *)f_197,c,av);}
a=C_alloc(6);
t5=C_a_i_bytevector(&a,1,C_fix(4));
t6=t1;{
C_word *av2=av;
av2[0]=t6;
av2[1]=stub21(t5,t2,t3,t4);
((C_proc)(void*)(*((C_word*)t6+1)))(2,av2);}}

I thought that the allocation of 4 words for the bytevector (which is more than
needed on a 64-bit machine) makes the difference, but it turns out to be
negligible. Changing it to 2 and also adjusting the values for C_calculate_demand
and C_alloc doesn't seem to change a lot, but you may want to try that -
just modify the C code and compile it with the same options as the .scm file.
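
The word counts of 4 and 2 can be reproduced with a small calculation, sketched here in C under the assumption that a boxed flonum is one header word followed by an 8-byte double that must start on an 8-byte boundary. The function name is hypothetical and this is a model of the layout, not the runtime's own definition.

```c
#include <stddef.h>

/* Words needed for a boxed double under the stated assumptions: one
   header word, padding so the payload starts on an 8-byte boundary,
   then the 8-byte payload itself. word_bytes is 4 or 8. */
size_t flonum_words(size_t word_bytes)
{
    size_t header_bytes = word_bytes;
    size_t payload_offset = (header_bytes + 7) / 8 * 8;  /* round up to 8 */
    size_t total_bytes = payload_offset + 8;
    return (total_bytes + word_bytes - 1) / word_bytes;
}
```

With 4-byte words this gives 4 words (header, one word of padding, two words of payload), and with 8-byte words it gives 2.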

On my laptop fma is a library call, so currently my guess is simply that
the scm-fma code is tighter and avoids 3 additional function calls (one to the
stub, one to C_a_i_bytevector and one to fma). The increased number of GCs may
also be caused by the bytevector above, which is used as a placeholder for
the flonum result, which wastes one word.

There is room for improvement for the compiler, though: the C_fix(4) is overly
conservative (4 words are correct on 32-bit, taking care of flonum alignment,
but unnecessary on 64-bit). Also, the bytevector thing is a bit of a hack - we
could actually just pass "a" to stub21 directly. You may want to try this out:

/* c99-fma in k183 in k180 in k177 in k174 (modified) */
static void C_ccall f_197(C_word c,C_word *av){
C_word tmp;
C_word t0=av[0];
C_word t1=av[1];
C_word t2=av[2];
C_word t3=av[3];
C_word t4=av[4];
C_word t6;
C_word *a;
if(C_unlikely(!C_demand(C_calculate_demand(4,c,1)))){
C_save_and_reclaim((void *)f_197,c,av);}
a=C_alloc(4);
t6=t1;{
C_word *av2=av;
av2[0]=t6;
av2[1]=stub21((C_word)a,t2,t3,t4);
((C_proc)(void*)(*((C_word*)t6+1)))(2,av2);}}

This reduces minor GCs on my machine to roughly the same number. If your
compiler inlines stub21 and fma, then you should see comparable performance.
Also, default optimization-levels for C are -Os (pass -v to csc to see what is
passed to the C compiler), so using -O2 instead should make a difference.


felix




Re: Performance question concerning chicken flonum vs "foreign flonum"

2021-11-04 Thread Jörg F. Wittenberger
Hi Christian,

this might be a case of "never trust a statistic you did not falsify
yourself".

Not bothering to speculate about explanations, I tend to ask how stable
the results are wrt. larger N's, repetition etc.

IMHO the results are too close to call.  Roughly this looks like 91%
memory usage (minor GCs) going along with 85% runtime.  Ergo: GC takes
time. My first guess: There may be allocation going on in the FFI
accounting for the increased memory usage.

I'm in no way competent to actually confirm or rule out that
hypothesis.  Please take my whole assessment with a grain of salt; it is just
a first guess.

Written on Thu, 04 Nov 2021 16:46:50 +0100 (CET):

> Dear All,
> 
> I am currently experimenting with Chicken Scheme and I would like to
> ask about the following situation: I am comparing a "pure" Scheme
> fused-multiply-add (fma) using chicken.flonum against C99's fma via
> chicken.foreign. Here is my test code:
> 
>  fma-test.scm
> 
> (import (chicken flonum) (chicken foreign) srfi-4)
> 
> (foreign-declare "#include <math.h>")
> 
> ;; FMA via nested fp+ and fp* from chicken-flonum
> (define (scm-fma x y z)
>   (fp+ z (fp* x y)))
> 
> ;; FMA via C99 function through chicken-foreign
> (define c99-fma (foreign-lambda double "fma" double double double))
> 
> ;; Test function for FMAs
> (define (dot fma a b)
>   (do [(idx 0 (add1 idx))
>        (dim (f64vector-length a))
>        (ret 0.0 (fma (f64vector-ref a idx) (f64vector-ref b idx) ret))]
>     ((= idx dim) ret)))
> 
> ;; Test vector dimension
> (define dim 200)
> 
> ;; Test vector 1
> (define a (make-f64vector dim 1.2345))
> 
> ;; Test vector 2
> (define b (make-f64vector dim 0.9876))
> 
> ;; Test repetitions
> (define N 200)
> 
> ;; Test scm-dot
> (time (do [(n 0 (add1 n))]
> ((= n N))
> (dot scm-fma a b)))
> 
> ;; Test fma-dot
> (time (do [(n 0 (add1 n))]
> ((= n N))
> (dot c99-fma a b)))
> 
> ;eof
> 
> Running this code as follows:
> 
> csc -O5 fma-test.scm && ./fma-test
> 
> yields the results in:
> 
> 7.558s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
> 8.839s CPU time, 0/256410 GCs (major/minor), maximum live heap: 30.78 MiB
> 
> Now I wonder why C's single function (instruction) is slower than two
> Scheme function calls. I have four potential explanations:
> 
> 1. chicken.foreign needs to do some type conversion for each argument
> and return value which accounts for the extra time. If so could this
> be avoided by type declarations somehow?
> 
> 2. chicken.flonum does something to make fpX calls very fast. If so
> can this be done for the foreign fma, too?
> 
> 3. I am using chicken.foreign inefficiently, but I think srfi-144 is
> using it similarly.
> 
> 4. This is an effect only on my machine?
> 
> It would be great to get some help or explanation with this issue.
> 
> Here is my setup:
> 
> CHICKEN Scheme 5.2.0
> gcc 10.3.0
> Ubuntu 20.04
> AMD Ryzen 5 4500U with 16GB
> 
> Thank you very much
> 
> Christian
> 




Performance question concerning chicken flonum vs "foreign flonum"

2021-11-04 Thread christian.himpe
Dear All,

I am currently experimenting with Chicken Scheme and I would like to ask about 
the following situation: I am comparing a "pure" Scheme fused-multiply-add 
(fma) using chicken.flonum against C99's fma via chicken.foreign. Here is my 
test code:

 fma-test.scm

(import (chicken flonum) (chicken foreign) srfi-4)

(foreign-declare "#include <math.h>")

;; FMA via nested fp+ and fp* from chicken-flonum
(define (scm-fma x y z)
  (fp+ z (fp* x y)))

;; FMA via C99 function through chicken-foreign
(define c99-fma (foreign-lambda double "fma" double double double))

;; Test function for FMAs
(define (dot fma a b)
  (do [(idx 0 (add1 idx))
   (dim (f64vector-length a))
   (ret 0.0 (fma (f64vector-ref a idx) (f64vector-ref b idx) ret))]
((= idx dim) ret)))

;; Test vector dimension
(define dim 200)

;; Test vector 1
(define a (make-f64vector dim 1.2345))

;; Test vector 2
(define b (make-f64vector dim 0.9876))

;; Test repetitions
(define N 200)

;; Test scm-dot
(time (do [(n 0 (add1 n))]
((= n N))
(dot scm-fma a b)))

;; Test fma-dot
(time (do [(n 0 (add1 n))]
((= n N))
(dot c99-fma a b)))

;eof

Running this code as follows:

csc -O5 fma-test.scm && ./fma-test

yields the results in:

7.558s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
8.839s CPU time, 0/256410 GCs (major/minor), maximum live heap: 30.78 MiB

Now I wonder why C's single function (instruction) is slower than two Scheme 
function calls. I have four potential explanations:

1. chicken.foreign needs to do some type conversion for each argument and 
return value which accounts for the extra time. If so could this be avoided by 
type declarations somehow?

2. chicken.flonum does something to make fpX calls very fast. If so can this be 
done for the foreign fma, too?

3. I am using chicken.foreign inefficiently, but I think srfi-144 is using it 
similarly.

4. This is an effect only on my machine?

It would be great to get some help or explanation with this issue.

Here is my setup:

CHICKEN Scheme 5.2.0
gcc 10.3.0
Ubuntu 20.04
AMD Ryzen 5 4500U with 16GB

Thank you very much

Christian