Re: Performance question concerning chicken flonum vs "foreign flonum"
> Dear felix,
>
> after coming back to this function and the associated issues regularly, I
> revised my opinion on integrating "fp+*" into (chicken flonum), given it uses
> the C99 fma function. On the one hand, this operation is so fundamental in
> numerical computations that it warrants a specialized function; on the other
> hand, the (somewhat) improved rounding could help a little. Finally, Gauche (
> https://practical-scheme.net/gauche/man/gauche-refe/R7RS-large.html#index-fl_002b_002a
> ) and MIT Scheme (
> https://www.gnu.org/software/mit-scheme/documentation/stable/mit-scheme-ref.html#Flonum-Operations
> ) provide this functionality. All in all, I would really appreciate it if the
> inclusion of an fma-based "fp+*" function into the (chicken flonum) module
> could be considered for future versions of CHICKEN Scheme. Maybe the patch you
> provided reduces the effort for this.

All right, I'll submit the existing patch to the mailing list. Thanks for
your suggestion - it makes sense to follow the other implementations here.

cheers,
felix
Re: Performance question concerning chicken flonum vs "foreign flonum"
Christian Himpe wrote on 2021-11-07:
> felix.winkelm...@bevuta.com wrote on 2021-11-07:
> > > Dear Felix,
> > >
> > > Thank you for the patch. I built the current git head with your patch.
> > > After importing chicken.flonum, I get the following error when calling
> > > fp*+:
> > >
> > I'm terribly sorry. I'm an ass, I didn't even test it in the interpreter.
> > Please find attached a revised patch.
> >
> > felix
>
> Dear felix,
>
> the latest patch works. I extended my test code and here are the results:
>
> without -C -mfma:
>
> csc -O5 -d0 -C -O3 fma-test.scm && ./fma-test
>
> 7.998s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
> 10.104s CPU time, 0/256410 GCs (major/minor), maximum live heap: 30.78 MiB
> 10.69s CPU time, 0/311364 GCs (major/minor), maximum live heap: 30.78 MiB
>
> with -C -mfma:
>
> csc -O5 -d0 -C -O3 -C -mfma fma-test.scm && ./fma-test
>
> 7.697s CPU time, 0/238095 GCs (major/minor), maximum live heap: 30.78 MiB
> 9.135s CPU time, 0/262467 GCs (major/minor), maximum live heap: 30.78 MiB
> 11.008s CPU time, 0/317460 GCs (major/minor), maximum live heap: 30.78 MiB
>
> It seems the number of GCs for fp*+ is a lot higher than for fp*/fp+ or
> c99-fma, with or without the fma compiler flag. So currently, there seems to
> be no benefit in integrating C99's fma as fp*+ besides a slightly smaller
> rounding error. At least to me, this is unexpected.
>
> Thank you for providing the patch. If you want to test something in this
> regard in the future, I am happy to test further patches.
>
> Cheers
> Christian

Dear felix,

after coming back to this function and the associated issues regularly, I
revised my opinion on integrating "fp+*" into (chicken flonum), given it uses
the C99 fma function. On the one hand, this operation is so fundamental in
numerical computations that it warrants a specialized function; on the other
hand, the (somewhat) improved rounding could help a little.

Finally, Gauche (
https://practical-scheme.net/gauche/man/gauche-refe/R7RS-large.html#index-fl_002b_002a
) and MIT Scheme (
https://www.gnu.org/software/mit-scheme/documentation/stable/mit-scheme-ref.html#Flonum-Operations
) provide this functionality. All in all, I would really appreciate it if the
inclusion of an fma-based "fp+*" function into the (chicken flonum) module
could be considered for future versions of CHICKEN Scheme. Maybe the patch you
provided reduces the effort for this.

Thank you very much
Christian
Re: Performance question concerning chicken flonum vs "foreign flonum"
felix.winkelm...@bevuta.com wrote on 2021-11-07:
> > Dear Felix,
> >
> > Thank you for the patch. I built the current git head with your patch.
> > After importing chicken.flonum, I get the following error when calling fp*+:
> >
> I'm terribly sorry. I'm an ass, I didn't even test it in the interpreter.
> Please find attached a revised patch.
>
> felix

Dear felix,

the latest patch works. I extended my test code and here are the results:

without -C -mfma:

csc -O5 -d0 -C -O3 fma-test.scm && ./fma-test

7.998s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
10.104s CPU time, 0/256410 GCs (major/minor), maximum live heap: 30.78 MiB
10.69s CPU time, 0/311364 GCs (major/minor), maximum live heap: 30.78 MiB

with -C -mfma:

csc -O5 -d0 -C -O3 -C -mfma fma-test.scm && ./fma-test

7.697s CPU time, 0/238095 GCs (major/minor), maximum live heap: 30.78 MiB
9.135s CPU time, 0/262467 GCs (major/minor), maximum live heap: 30.78 MiB
11.008s CPU time, 0/317460 GCs (major/minor), maximum live heap: 30.78 MiB

It seems the number of GCs for fp*+ is a lot higher than for fp*/fp+ or
c99-fma, with or without the fma compiler flag. So currently, there seems to
be no benefit in integrating C99's fma as fp*+ besides a slightly smaller
rounding error. At least to me, this is unexpected.

Thank you for providing the patch. If you want to test something in this
regard in the future, I am happy to test further patches.

Cheers
Christian
Re: Performance question concerning chicken flonum vs "foreign flonum"
> Dear Felix,
>
> Thank you for the patch. I built the current git head with your patch.
> After importing chicken.flonum, I get the following error when calling fp*+:
>

I'm terribly sorry. I'm an ass, I didn't even test it in the interpreter.
Please find attached a revised patch.

felix

From 29b7abfd1a990e1fe4fc10f3d2532eadd079151f Mon Sep 17 00:00:00 2001
From: felix
Date: Sun, 7 Nov 2021 13:48:31 +0100
Subject: [PATCH] Add support for fused-multiply-add (suggested by Christian
 Himpe on chicken-users)

---
 NEWS                           | 4
 c-platform.scm                 | 3 ++-
 chicken.h                      | 2 ++
 lfa2.scm                       | 2 ++
 library.scm                    | 6 ++
 manual/Module (chicken flonum) | 3 ++-
 types.db                       | 3 +++
 7 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/NEWS b/NEWS
index de01c00e..69fe5054 100644
--- a/NEWS
+++ b/NEWS
@@ -4,6 +4,10 @@
   - Default "cc" on BSD systems for building CHICKEN to avoid ABI problems
     when linking with C++ code.
 
+- Core libraries
+  - Added "fp*+" (fused multiply-add) to "chicken.flonum" module
+    (suggested by Christian Himpe).
+
 5.3.0rc4
 
 - Compiler
diff --git a/c-platform.scm b/c-platform.scm
index 00960c82..e59b1f1c 100644
--- a/c-platform.scm
+++ b/c-platform.scm
@@ -149,7 +149,7 @@
 (define-constant +flonum-bindings+
   (map (lambda (x) (symbol-append 'chicken.flonum# x))
-       '(fp/? fp+ fp- fp* fp/ fp> fp< fp= fp>= fp<= fpmin fpmax fpneg fpgcd
+       '(fp/? fp+ fp- fp* fp/ fp> fp< fp= fp>= fp<= fpmin fpmax fpneg fpgcd fp*+
          fpfloor fpceiling fptruncate fpround fpsin fpcos fptan fpasin fpacos
          fpatan fpatan2 fpexp fpexpt fplog fpsqrt fpabs fpinteger?)))
@@ -652,6 +652,7 @@
 (rewrite 'chicken.flonum#fp/? 16 2 "C_a_i_flonum_quotient_checked" #f words-per-flonum)
 (rewrite 'chicken.flonum#fpneg 16 1 "C_a_i_flonum_negate" #f words-per-flonum)
 (rewrite 'chicken.flonum#fpgcd 16 2 "C_a_i_flonum_gcd" #f words-per-flonum)
+(rewrite 'chicken.flonum#fp*+ 16 3 "C_a_i_flonum_multiply_add" #f words-per-flonum)
 
 (rewrite 'scheme#zero? 5 "C_eqp" 0 'fixnum)
 (rewrite 'scheme#zero? 2 1 "C_u_i_zerop2" #f)
diff --git a/chicken.h b/chicken.h
index 7e51a38f..ba075471 100644
--- a/chicken.h
+++ b/chicken.h
@@ -1204,6 +1204,7 @@ typedef void (C_ccall *C_proc)(C_word, C_word *) C_noret;
 #define C_a_i_flonum_plus(ptr, c, n1, n2) C_flonum(ptr, C_flonum_magnitude(n1) + C_flonum_magnitude(n2))
 #define C_a_i_flonum_difference(ptr, c, n1, n2) C_flonum(ptr, C_flonum_magnitude(n1) - C_flonum_magnitude(n2))
 #define C_a_i_flonum_times(ptr, c, n1, n2) C_flonum(ptr, C_flonum_magnitude(n1) * C_flonum_magnitude(n2))
+#define C_a_i_flonum_multiply_add(ptr, c, n1, n2, n3) C_flonum(ptr, fma(C_flonum_magnitude(n1), C_flonum_magnitude(n2), C_flonum_magnitude(n3)))
 #define C_a_i_flonum_quotient(ptr, c, n1, n2) C_flonum(ptr, C_flonum_magnitude(n1) / C_flonum_magnitude(n2))
 #define C_a_i_flonum_negate(ptr, c, n) C_flonum(ptr, -C_flonum_magnitude(n))
 #define C_a_u_i_flonum_signum(ptr, n, x) (C_flonum_magnitude(x) == 0.0 ? (x) : ((C_flonum_magnitude(x) < 0.0) ? C_flonum(ptr, -1.0) : C_flonum(ptr, 1.0)))
@@ -1513,6 +1514,7 @@ typedef void (C_ccall *C_proc)(C_word, C_word *) C_noret;
 #define C_ub_i_flonum_difference(x, y)  ((x) - (y))
 #define C_ub_i_flonum_times(x, y)       ((x) * (y))
 #define C_ub_i_flonum_quotient(x, y)    ((x) / (y))
+#define C_ub_i_flonum_multiply_add(x, y, z) fma((x), (y), (z))
 
 #define C_ub_i_flonum_equalp(n1, n2)    C_mk_bool((n1) == (n2))
 #define C_ub_i_flonum_greaterp(n1, n2)  C_mk_bool((n1) > (n2))
diff --git a/lfa2.scm b/lfa2.scm
index 45057578..e4bd308e 100644
--- a/lfa2.scm
+++ b/lfa2.scm
@@ -191,6 +191,7 @@
     ("C_a_i_flonum_sqrt" float)
     ("C_a_i_flonum_tan" float)
     ("C_a_i_flonum_times" float)
+    ("C_a_i_flonum_multiply_add" float)
     ("C_a_i_flonum_truncate" float)
     ("C_a_u_i_f64vector_ref" float)
     ("C_a_u_i_f32vector_ref" float)
@@ -201,6 +202,7 @@
   '(("C_a_i_flonum_plus" "C_ub_i_flonum_plus" op)
     ("C_a_i_flonum_difference" "C_ub_i_flonum_difference" op)
     ("C_a_i_flonum_times" "C_ub_i_flonum_times" op)
+    ("C_a_i_flonum_multiply_add" "C_ub_i_flonum_multiply_add" op)
     ("C_a_i_flonum_quotient" "C_ub_i_flonum_quotient" op)
     ("C_flonum_equalp" "C_ub_i_flonum_equalp" pred)
     ("C_flonum_greaterp" "C_ub_i_flonum_greaterp" pred)
diff --git a/library.scm b/library.scm
index 6c6a6942..45182e84 100644
--- a/library.scm
+++ b/library.scm
@@ -1590,6 +1590,12 @@ EOF
   (fp-check-flonums x y 'fp/)
   (##core#inline_allocate ("C_a_i_flonum_quotient" 4) x y) )
 
+(define (fp*+ x y z)
+  (unless (and (flonum? x) (flonum? y) (flonum? z))
+    (##sys#error-hook (foreign-value "C_BAD_ARGUMENT_TYPE_NO_FLONUM_ERROR" int)
+      'fp*+ x y z) )
+  (##core#inline_allocate ("C_a_i_flonum_multiply_add" 4) x y z) )
+
 (define (fpgcd x y)
   (fp-check-flonums x y 'fpgcd)
   (##core#inline_allocate
Re: Performance question concerning chicken flonum vs "foreign flonum"
Dear Felix,

Thank you for the patch. I built the current git head with your patch.
After importing chicken.flonum, I get the following error when calling fp*+:

#;2> (fp*+ 1.0 2.0 3.0)

Error: unbound variable: g18021803

Call history:

(fp*+ 1.0 2.0 3.0)
(fp*+ 1.0 2.0 3.0)<--

But fp*+ is found:

#;2> (procedure? fp*+)
#t

I performed the following build steps:

git clone git://code.call-cc.org/chicken-core
cd, mv etc.
patch -p1 < 0001-Add-support-for-fused-multiply-add.patch
make PREFIX=XXX PLATFORM=linux OPTIMIZE_FOR_SPEED=1 CHICKEN=XXX/chicken52/bin/chicken
make PREFIX=XXX PLATFORM=linux install

Best
Christian

felix.winkelm...@bevuta.com wrote on 2021-11-07:
> Hi!
>
> Here is a patch against the current git HEAD, adding support for "fp*+".
> Please give it a try, if you want.
>
> This is experimental; if people consider this worthwhile, I can submit it for
> adding to the core system. Note that you may still need to pass extra
> C compiler options to enable inlining of the fma(3) call.
>
> cheers,
> felix

--
Dr. rer. nat. Christian Himpe
University of Münster / Applied Mathematics Münster
Orléans-Ring 10 / 48149 Münster / Germany
https://himpe.science
Re: Performance question concerning chicken flonum vs "foreign flonum"
Hi!

Here is a patch against the current git HEAD, adding support for "fp*+".
Please give it a try, if you want.

This is experimental; if people consider this worthwhile, I can submit it for
adding to the core system. Note that you may still need to pass extra
C compiler options to enable inlining of the fma(3) call.

cheers,
felix

From 0f9c68a2b3954eb7c7d2a6075d6b4dfa3dcfb2a5 Mon Sep 17 00:00:00 2001
From: felix
Date: Sun, 7 Nov 2021 13:48:31 +0100
Subject: [PATCH] Add support for fused-multiply-add (suggested by Christian
 Himpe on chicken-users)

---
 NEWS                           | 4
 c-platform.scm                 | 3 ++-
 chicken.h                      | 2 ++
 lfa2.scm                       | 2 ++
 library.scm                    | 4
 manual/Module (chicken flonum) | 3 ++-
 types.db                       | 3 +++
 7 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/NEWS b/NEWS
index de01c00e..69fe5054 100644
--- a/NEWS
+++ b/NEWS
@@ -4,6 +4,10 @@
   - Default "cc" on BSD systems for building CHICKEN to avoid ABI problems
     when linking with C++ code.
 
+- Core libraries
+  - Added "fp*+" (fused multiply-add) to "chicken.flonum" module
+    (suggested by Christian Himpe).
+
 5.3.0rc4
 
 - Compiler
diff --git a/c-platform.scm b/c-platform.scm
index 00960c82..e59b1f1c 100644
--- a/c-platform.scm
+++ b/c-platform.scm
@@ -149,7 +149,7 @@
 (define-constant +flonum-bindings+
   (map (lambda (x) (symbol-append 'chicken.flonum# x))
-       '(fp/? fp+ fp- fp* fp/ fp> fp< fp= fp>= fp<= fpmin fpmax fpneg fpgcd
+       '(fp/? fp+ fp- fp* fp/ fp> fp< fp= fp>= fp<= fpmin fpmax fpneg fpgcd fp*+
          fpfloor fpceiling fptruncate fpround fpsin fpcos fptan fpasin fpacos
          fpatan fpatan2 fpexp fpexpt fplog fpsqrt fpabs fpinteger?)))
@@ -652,6 +652,7 @@
 (rewrite 'chicken.flonum#fp/? 16 2 "C_a_i_flonum_quotient_checked" #f words-per-flonum)
 (rewrite 'chicken.flonum#fpneg 16 1 "C_a_i_flonum_negate" #f words-per-flonum)
 (rewrite 'chicken.flonum#fpgcd 16 2 "C_a_i_flonum_gcd" #f words-per-flonum)
+(rewrite 'chicken.flonum#fp*+ 16 3 "C_a_i_flonum_multiply_add" #f words-per-flonum)
 
 (rewrite 'scheme#zero? 5 "C_eqp" 0 'fixnum)
 (rewrite 'scheme#zero? 2 1 "C_u_i_zerop2" #f)
diff --git a/chicken.h b/chicken.h
index 7e51a38f..ba075471 100644
--- a/chicken.h
+++ b/chicken.h
@@ -1204,6 +1204,7 @@ typedef void (C_ccall *C_proc)(C_word, C_word *) C_noret;
 #define C_a_i_flonum_plus(ptr, c, n1, n2) C_flonum(ptr, C_flonum_magnitude(n1) + C_flonum_magnitude(n2))
 #define C_a_i_flonum_difference(ptr, c, n1, n2) C_flonum(ptr, C_flonum_magnitude(n1) - C_flonum_magnitude(n2))
 #define C_a_i_flonum_times(ptr, c, n1, n2) C_flonum(ptr, C_flonum_magnitude(n1) * C_flonum_magnitude(n2))
+#define C_a_i_flonum_multiply_add(ptr, c, n1, n2, n3) C_flonum(ptr, fma(C_flonum_magnitude(n1), C_flonum_magnitude(n2), C_flonum_magnitude(n3)))
 #define C_a_i_flonum_quotient(ptr, c, n1, n2) C_flonum(ptr, C_flonum_magnitude(n1) / C_flonum_magnitude(n2))
 #define C_a_i_flonum_negate(ptr, c, n) C_flonum(ptr, -C_flonum_magnitude(n))
 #define C_a_u_i_flonum_signum(ptr, n, x) (C_flonum_magnitude(x) == 0.0 ? (x) : ((C_flonum_magnitude(x) < 0.0) ? C_flonum(ptr, -1.0) : C_flonum(ptr, 1.0)))
@@ -1513,6 +1514,7 @@ typedef void (C_ccall *C_proc)(C_word, C_word *) C_noret;
 #define C_ub_i_flonum_difference(x, y)  ((x) - (y))
 #define C_ub_i_flonum_times(x, y)       ((x) * (y))
 #define C_ub_i_flonum_quotient(x, y)    ((x) / (y))
+#define C_ub_i_flonum_multiply_add(x, y, z) fma((x), (y), (z))
 
 #define C_ub_i_flonum_equalp(n1, n2)    C_mk_bool((n1) == (n2))
 #define C_ub_i_flonum_greaterp(n1, n2)  C_mk_bool((n1) > (n2))
diff --git a/lfa2.scm b/lfa2.scm
index 45057578..e4bd308e 100644
--- a/lfa2.scm
+++ b/lfa2.scm
@@ -191,6 +191,7 @@
     ("C_a_i_flonum_sqrt" float)
     ("C_a_i_flonum_tan" float)
     ("C_a_i_flonum_times" float)
+    ("C_a_i_flonum_multiply_add" float)
     ("C_a_i_flonum_truncate" float)
     ("C_a_u_i_f64vector_ref" float)
     ("C_a_u_i_f32vector_ref" float)
@@ -201,6 +202,7 @@
   '(("C_a_i_flonum_plus" "C_ub_i_flonum_plus" op)
     ("C_a_i_flonum_difference" "C_ub_i_flonum_difference" op)
     ("C_a_i_flonum_times" "C_ub_i_flonum_times" op)
+    ("C_a_i_flonum_multiply_add" "C_ub_i_flonum_multiply_add" op)
     ("C_a_i_flonum_quotient" "C_ub_i_flonum_quotient" op)
     ("C_flonum_equalp" "C_ub_i_flonum_equalp" pred)
     ("C_flonum_greaterp" "C_ub_i_flonum_greaterp" pred)
diff --git a/library.scm b/library.scm
index 6c6a6942..2fb82557 100644
--- a/library.scm
+++ b/library.scm
@@ -1590,6 +1590,10 @@ EOF
   (fp-check-flonums x y 'fp/)
   (##core#inline_allocate ("C_a_i_flonum_quotient" 4) x y) )
 
+(define (fp*+ x y z)
+  (fp-check-flonums x y z 'fp*+)
+  (##core#inline_allocate ("C_a_i_flonum_multiply_add" 4) x y z) )
+
 (define (fpgcd x y)
   (fp-check-flonums x y 'fpgcd)
   (##core#inline_allocate ("C_a_i_flonum_gcd" 4) x y))
diff --git a/manual/Module (chicken flonum) b/manual/Module
Re: Performance question concerning chicken flonum vs "foreign flonum"
> a patch would be great, if it is not too much work. Attached you find three
> assembly language files:
>
> * fma-test_original.s (unchanged csc C to assembly)
> * fma-test_modified.s (modified csc C from previous mail)
> * fma-test_modified_mfma.s (modified csc C and -mfma gcc option)
>
> All files were created with the additional gcc arguments -O3 -S
> -fverbose-asm. Are these files sufficient?

Yes, thanks a lot. The third case shows that the intrinsic is indeed used.

>
> I hoped the fma libc function would insulate one from intrinsics; the
> compiler option -mfma should activate (I think via defining a C macro) the
> use of the corresponding CPU instruction (fma3 on current x86), which my CPU
> supports, but using it does not seem to make a difference.

Surprising, but I guess the speedup is too minor in the presence of all the
noise regarding stack usage, etc. In straight-line, dumb, cache-friendly C
code it may make a difference, I guess...

I will provide an experimental patch for the new operation. Stay tuned.

felix
Re: Performance question concerning chicken flonum vs "foreign flonum"
> modified code:
>
> 7.378s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
> 8.498s CPU time, 0/238095 GCs (major/minor), maximum live heap: 30.78 MiB
>
> Both were compiled with -O3 optimization level in gcc.
>
> I am fine with these results given your outline of the internals in the
> background.
>
> Would it be conceivable in principle to include such fma functionality
> directly into chicken.flonum, i.e. as fp+*, or are included modules typically
> unaltered?

The core modules like chicken.flonum can be optimized freely, as they are
always delivered with the base system, and the compiler is often tuned to
treat them specially.

I wonder why the speed difference still exists. Could you send me the
generated assembly code for the test program, as produced by your compiler?
I'd like to see how far the C compiler goes at inlining the fma operation.

If this can give a noticeable speedup, I see no reason not to add such an
operation, but it would be nice to measure the effect before we do this.
I can send you a patch for testing if you like.

Note that one may have to use compiler intrinsics or special C compiler
options to enable this, see for example:

https://stackoverflow.com/questions/15933100/how-to-use-fused-multiply-add-fma-instructions-with-sse-avx

felix
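Whether a given build can turn fma() into a hardware instruction can be checked at compile time. The sketch below is an illustration only, with a made-up function name; it relies on the optional C99 macro FP_FAST_FMA from math.h and on __FMA__, which GCC and Clang predefine on x86 when -mfma (or a matching -march) is in effect.

```c
#include <math.h>

/* Returns a short hint about how fma() is likely to be compiled.
   FP_FAST_FMA: optional C99 macro, defined when fma() is generally about
   as fast as a separate multiply and add of doubles.
   __FMA__: predefined by GCC and Clang on x86 when the FMA instruction
   set is enabled, e.g. via -mfma. */
const char *fma_support_hint(void) {
#if defined(FP_FAST_FMA)
    return "FP_FAST_FMA defined: fma() is expected to be fast";
#elif defined(__FMA__)
    return "__FMA__ defined: the x86 FMA instruction set is enabled";
#else
    return "no compile-time hint: fma() may be a libm call";
#endif
}
```

Printing the returned string from a file compiled once with and once without -mfma is a quick way to see which case applies to a particular compiler and platform.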
Re: Performance question concerning chicken flonum vs "foreign flonum"
felix.winkelm...@bevuta.com wrote on 2021-11-04:
> > 7.558s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
> > 8.839s CPU time, 0/256410 GCs (major/minor), maximum live heap: 30.78 MiB
> >
> >[...]
> >
> > It would be great to get some help or explanation with this issue.
>
> Hi!
>
> I have similar timings, and the difference in the number of minor GCs
> indicates that the c99-fma variant allocates more stack space and thus
> causes more minor GCs.
>
> Looking at the generated C file ("csc -k"), we see that scm-fma unboxes the
> intermediate result and thus generates relatively decent code:
>
> /* scm-fma in k183 in k180 in k177 in k174 */
> static void C_ccall f_187(C_word c,C_word *av){
> C_word tmp;
> C_word t0=av[0];
> C_word t1=av[1];
> C_word t2=av[2];
> C_word t3=av[3];
> C_word t4=av[4];
> C_word t5;
> double f0;
> C_word *a;
> if(C_unlikely(!C_demand(C_calculate_demand(4,c,1)))){
> C_save_and_reclaim((void *)f_187,c,av);}
> a=C_alloc(4);
> f0=C_ub_i_flonum_times(C_flonum_magnitude(t2),C_flonum_magnitude(t3));
> t5=t1;{
> C_word *av2=av;
> av2[0]=t5;
> av2[1]=C_flonum(&a,C_ub_i_flonum_plus(C_flonum_magnitude(t4),f0));
> ((C_proc)(void*)(*((C_word*)t5+1)))(2,av2);}}
>
> The other version allocates a bytevector to hold the result:
>
> /* c99-fma in k183 in k180 in k177 in k174 */
> static void C_ccall f_197(C_word c,C_word *av){
> C_word tmp;
> C_word t0=av[0];
> C_word t1=av[1];
> C_word t2=av[2];
> C_word t3=av[3];
> C_word t4=av[4];
> C_word t5;
> C_word t6;
> C_word *a;
> if(C_unlikely(!C_demand(C_calculate_demand(6,c,1)))){
> C_save_and_reclaim((void *)f_197,c,av);}
> a=C_alloc(6);
> t5=C_a_i_bytevector(&a,1,C_fix(4));
> t6=t1;{
> C_word *av2=av;
> av2[0]=t6;
> av2[1]=stub21(t5,t2,t3,t4);
> ((C_proc)(void*)(*((C_word*)t6+1)))(2,av2);}}
>
> I thought that the allocation of 4 words for the bytevector (which is more
> than needed on a 64-bit machine) makes the difference, but it turns out to
> be negligible. Changing it to 2 and also adjusting the values for
> C_calculate_demand and C_alloc doesn't seem to change a lot, but you may
> want to try that - just modify the C code and compile it with the same
> options as the .scm file.
>
> On my laptop fma is a library call, so currently my guess is simply that
> the scm-fma code is tighter and avoids 3 additional function calls (one to
> the stub, one to C_a_i_bytevector and one to fma). The increased number of
> GCs may also be caused by the bytevector above, which is used as a
> placeholder for the flonum result and wastes one word.
>
> There is room for improvement for the compiler, though: the C_fix(4) is
> overly conservative (4 words are correct on 32-bit, taking care of flonum
> alignment, but unnecessary on 64 bits). Also, the bytevector thing is a bit
> of a hack - we could actually just pass "a" to stub21 directly. You may want
> to try this out:
>
> /* c99-fma in k183 in k180 in k177 in k174 (modified) */
> static void C_ccall f_197(C_word c,C_word *av){
> C_word tmp;
> C_word t0=av[0];
> C_word t1=av[1];
> C_word t2=av[2];
> C_word t3=av[3];
> C_word t4=av[4];
> C_word t6;
> C_word *a;
> if(C_unlikely(!C_demand(C_calculate_demand(4,c,1)))){
> C_save_and_reclaim((void *)f_197,c,av);}
> a=C_alloc(4);
> t6=t1;{
> C_word *av2=av;
> av2[0]=t6;
> av2[1]=stub21((C_word)a,t2,t3,t4);
> ((C_proc)(void*)(*((C_word*)t6+1)))(2,av2);}}
>
> This reduces minor GCs on my machine to roughly the same number. If your
> compiler inlines stub21 and fma, then you should see comparable performance.
>
> Also, the default optimization level for C is -Os (pass -v to csc to see
> what is passed to the C compiler), so using -O2 instead should make a
> difference.
>
> felix

Dear Felix,

thank you for your explanations.

I tested your modified source and indeed the number of GCs is significantly
reduced, but the timing difference remains:

original code:

7.656s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
8.849s CPU time, 0/256410 GCs (major/minor), maximum live heap: 30.78 MiB

modified code:

7.378s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
8.498s CPU time, 0/238095 GCs (major/minor), maximum live heap: 30.78 MiB

Both were compiled with -O3 optimization level in gcc.

I am fine with these results given your outline of the internals in the
background.

Would it be conceivable in principle to include such fma functionality
directly into chicken.flonum, i.e. as fp+*, or are included modules typically
unaltered?

Thank you
Christian
Re: Performance question concerning chicken flonum vs "foreign flonum"
> 7.558s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
> 8.839s CPU time, 0/256410 GCs (major/minor), maximum live heap: 30.78 MiB
>
>[...]
>
> It would be great to get some help or explanation with this issue.

Hi!

I have similar timings, and the difference in the number of minor GCs
indicates that the c99-fma variant allocates more stack space and thus
causes more minor GCs.

Looking at the generated C file ("csc -k"), we see that scm-fma unboxes the
intermediate result and thus generates relatively decent code:

/* scm-fma in k183 in k180 in k177 in k174 */
static void C_ccall f_187(C_word c,C_word *av){
C_word tmp;
C_word t0=av[0];
C_word t1=av[1];
C_word t2=av[2];
C_word t3=av[3];
C_word t4=av[4];
C_word t5;
double f0;
C_word *a;
if(C_unlikely(!C_demand(C_calculate_demand(4,c,1)))){
C_save_and_reclaim((void *)f_187,c,av);}
a=C_alloc(4);
f0=C_ub_i_flonum_times(C_flonum_magnitude(t2),C_flonum_magnitude(t3));
t5=t1;{
C_word *av2=av;
av2[0]=t5;
av2[1]=C_flonum(&a,C_ub_i_flonum_plus(C_flonum_magnitude(t4),f0));
((C_proc)(void*)(*((C_word*)t5+1)))(2,av2);}}

The other version allocates a bytevector to hold the result:

/* c99-fma in k183 in k180 in k177 in k174 */
static void C_ccall f_197(C_word c,C_word *av){
C_word tmp;
C_word t0=av[0];
C_word t1=av[1];
C_word t2=av[2];
C_word t3=av[3];
C_word t4=av[4];
C_word t5;
C_word t6;
C_word *a;
if(C_unlikely(!C_demand(C_calculate_demand(6,c,1)))){
C_save_and_reclaim((void *)f_197,c,av);}
a=C_alloc(6);
t5=C_a_i_bytevector(&a,1,C_fix(4));
t6=t1;{
C_word *av2=av;
av2[0]=t6;
av2[1]=stub21(t5,t2,t3,t4);
((C_proc)(void*)(*((C_word*)t6+1)))(2,av2);}}

I thought that the allocation of 4 words for the bytevector (which is more
than needed on a 64-bit machine) makes the difference, but it turns out to
be negligible. Changing it to 2 and also adjusting the values for
C_calculate_demand and C_alloc doesn't seem to change a lot, but you may
want to try that - just modify the C code and compile it with the same
options as the .scm file.

On my laptop fma is a library call, so currently my guess is simply that
the scm-fma code is tighter and avoids 3 additional function calls (one to
the stub, one to C_a_i_bytevector and one to fma). The increased number of
GCs may also be caused by the bytevector above, which is used as a
placeholder for the flonum result and wastes one word.

There is room for improvement for the compiler, though: the C_fix(4) is
overly conservative (4 words are correct on 32-bit, taking care of flonum
alignment, but unnecessary on 64 bits). Also, the bytevector thing is a bit
of a hack - we could actually just pass "a" to stub21 directly. You may want
to try this out:

/* c99-fma in k183 in k180 in k177 in k174 (modified) */
static void C_ccall f_197(C_word c,C_word *av){
C_word tmp;
C_word t0=av[0];
C_word t1=av[1];
C_word t2=av[2];
C_word t3=av[3];
C_word t4=av[4];
C_word t6;
C_word *a;
if(C_unlikely(!C_demand(C_calculate_demand(4,c,1)))){
C_save_and_reclaim((void *)f_197,c,av);}
a=C_alloc(4);
t6=t1;{
C_word *av2=av;
av2[0]=t6;
av2[1]=stub21((C_word)a,t2,t3,t4);
((C_proc)(void*)(*((C_word*)t6+1)))(2,av2);}}

This reduces minor GCs on my machine to roughly the same number. If your
compiler inlines stub21 and fma, then you should see comparable performance.

Also, the default optimization level for C is -Os (pass -v to csc to see
what is passed to the C compiler), so using -O2 instead should make a
difference.

felix
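The link between per-call stack demand and the number of minor GCs can be illustrated with a toy model. This is a sketch for intuition only and is not how the CHICKEN runtime is implemented: it assumes a fixed-size nursery that is emptied whenever the next request does not fit, and the function name and parameters are invented for the example.

```c
/* Toy model, not CHICKEN's runtime: a bump-allocated nursery of
   `capacity` words that is emptied (one "minor GC") whenever the
   next request of `demand` words would not fit. */
long count_minor_gcs(long capacity, long demand, long calls) {
    long used = 0;
    long gcs = 0;
    for (long i = 0; i < calls; i++) {
        if (used + demand > capacity) {
            gcs++;      /* nursery full: collect and start over */
            used = 0;
        }
        used += demand;
    }
    return gcs;
}
```

In this model, 1000 calls against a 1000-word nursery trigger 3 collections at 4 words per call and 6 collections at 6 words per call. The real counts in the thread differ by much less, since each call also consumes stack for things other than the flonum result.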
Re: Performance question concerning chicken flonum vs "foreign flonum"
Hi Christian,

this might be a case of "never trust a statistic you did not falsify
yourself".

Not bothering to speculate about explanations, I tend to ask how stable the
results are with respect to larger N, repetition etc.

IMHO the results are too close for a call. Roughly this looks like 91%
memory usage (minor GCs) going along with 85% runtime. Ergo: GC takes time.

My first guess: there may be allocation going on in the FFI, accounting for
the increased memory usage. I'm in no way competent to actually confirm or
rule out that hypothesis.

Please take my whole assessment with a grain of salt; it is just a first
guess.

Written on Thu, 04 Nov 2021 16:46:50 +0100 (CET):

> Dear All,
>
> I am currently experimenting with Chicken Scheme and I would like to
> ask about the following situation: I am comparing a "pure" Scheme
> fused-multiply-add (fma) using chicken.flonum against C99's fma via
> chicken.foreign. Here is my test code:
>
> fma-test.scm
>
> (import (chicken flonum) (chicken foreign) srfi-4)
>
> (foreign-declare "#include <math.h>")
>
> ;; FMA via nested fp+ and fp* from chicken-flonum
> (define (scm-fma x y z)
>   (fp+ z (fp* x y)))
>
> ;; FMA via C99 function through chicken-foreign
> (define c99-fma (foreign-lambda double "fma" double double double))
>
> ;; Test function for FMAs
> (define (dot fma a b)
>   (do [(idx 0 (add1 idx))
>        (dim (f64vector-length a))
>        (ret 0.0 (fma (f64vector-ref a idx) (f64vector-ref b idx) ret))]
>       ((= idx dim) ret)))
>
> ;; Test vector dimension
> (define dim 200)
>
> ;; Test vector 1
> (define a (make-f64vector dim 1.2345))
>
> ;; Test vector 2
> (define b (make-f64vector dim 0.9876))
>
> ;; Test repetitions
> (define N 200)
>
> ;; Test scm-dot
> (time (do [(n 0 (add1 n))]
>           ((= n N))
>         (dot scm-fma a b)))
>
> ;; Test fma-dot
> (time (do [(n 0 (add1 n))]
>           ((= n N))
>         (dot c99-fma a b)))
>
> ;eof
>
> Running this code as follows:
>
> csc -O5 fma-test.scm && ./fma-test
>
> yields these results:
>
> 7.558s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
> 8.839s CPU time, 0/256410 GCs (major/minor), maximum live heap: 30.78 MiB
>
> Now I wonder why C's single function (instruction) is slower than two
> Scheme function calls. I have four potential explanations:
>
> 1. chicken.foreign needs to do some type conversion for each argument
> and return value, which accounts for the extra time. If so, could this
> be avoided by type declarations somehow?
>
> 2. chicken.flonum does something to make fpX calls very fast. If so,
> can this be done for the foreign fma, too?
>
> 3. I am using chicken.foreign inefficiently, but I think srfi-144 is
> using it similarly.
>
> 4. This is an effect only on my machine?
>
> It would be great to get some help or explanation with this issue.
>
> Here is my setup:
>
> CHICKEN Scheme 5.2.0
> gcc 10.3.0
> Ubuntu 20.04
> AMD Ryzen 5 4500U with 16GB
>
> Thank you very much
>
> Christian
>
Performance question concerning chicken flonum vs "foreign flonum"
Dear All,

I am currently experimenting with Chicken Scheme and I would like to ask
about the following situation: I am comparing a "pure" Scheme
fused-multiply-add (fma) using chicken.flonum against C99's fma via
chicken.foreign. Here is my test code:

fma-test.scm

(import (chicken flonum) (chicken foreign) srfi-4)

(foreign-declare "#include <math.h>")

;; FMA via nested fp+ and fp* from chicken-flonum
(define (scm-fma x y z)
  (fp+ z (fp* x y)))

;; FMA via C99 function through chicken-foreign
(define c99-fma (foreign-lambda double "fma" double double double))

;; Test function for FMAs
(define (dot fma a b)
  (do [(idx 0 (add1 idx))
       (dim (f64vector-length a))
       (ret 0.0 (fma (f64vector-ref a idx) (f64vector-ref b idx) ret))]
      ((= idx dim) ret)))

;; Test vector dimension
(define dim 200)

;; Test vector 1
(define a (make-f64vector dim 1.2345))

;; Test vector 2
(define b (make-f64vector dim 0.9876))

;; Test repetitions
(define N 200)

;; Test scm-dot
(time (do [(n 0 (add1 n))]
          ((= n N))
        (dot scm-fma a b)))

;; Test fma-dot
(time (do [(n 0 (add1 n))]
          ((= n N))
        (dot c99-fma a b)))

;eof

Running this code as follows:

csc -O5 fma-test.scm && ./fma-test

yields these results:

7.558s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
8.839s CPU time, 0/256410 GCs (major/minor), maximum live heap: 30.78 MiB

Now I wonder why C's single function (instruction) is slower than two
Scheme function calls. I have four potential explanations:

1. chicken.foreign needs to do some type conversion for each argument and
return value, which accounts for the extra time. If so, could this be
avoided by type declarations somehow?

2. chicken.flonum does something to make fpX calls very fast. If so, can
this be done for the foreign fma, too?

3. I am using chicken.foreign inefficiently, but I think srfi-144 is using
it similarly.

4. This is an effect only on my machine?

It would be great to get some help or explanation with this issue.

Here is my setup:

CHICKEN Scheme 5.2.0
gcc 10.3.0
Ubuntu 20.04
AMD Ryzen 5 4500U with 16GB

Thank you very much

Christian