Re: Performance question concerning chicken flonum vs "foreign flonum"
> 7.558s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
> 8.839s CPU time, 0/256410 GCs (major/minor), maximum live heap: 30.78 MiB
>
> [...]
>
> It would be great to get some help or explanation with this issue.

Hi!

I have similar timings, and the difference in the number of minor GCs
indicates that the c99-fma variant allocates more stack space and thus
causes more minor GCs.

Looking at the generated C file ("csc -k"), we see that scm-fma unboxes
the intermediate result and thus generates relatively decent code:

  /* scm-fma in k183 in k180 in k177 in k174 */
  static void C_ccall f_187(C_word c,C_word *av){
  C_word tmp;
  C_word t0=av[0];
  C_word t1=av[1];
  C_word t2=av[2];
  C_word t3=av[3];
  C_word t4=av[4];
  C_word t5;
  double f0;
  C_word *a;
  if(C_unlikely(!C_demand(C_calculate_demand(4,c,1)))){
  C_save_and_reclaim((void *)f_187,c,av);}
  a=C_alloc(4);
  f0=C_ub_i_flonum_times(C_flonum_magnitude(t2),C_flonum_magnitude(t3));
  t5=t1;{
  C_word *av2=av;
  av2[0]=t5;
  av2[1]=C_flonum(&a,C_ub_i_flonum_plus(C_flonum_magnitude(t4),f0));
  ((C_proc)(void*)(*((C_word*)t5+1)))(2,av2);}}

The other version allocates a bytevector to hold the result:

  /* c99-fma in k183 in k180 in k177 in k174 */
  static void C_ccall f_197(C_word c,C_word *av){
  C_word tmp;
  C_word t0=av[0];
  C_word t1=av[1];
  C_word t2=av[2];
  C_word t3=av[3];
  C_word t4=av[4];
  C_word t5;
  C_word t6;
  C_word *a;
  if(C_unlikely(!C_demand(C_calculate_demand(6,c,1)))){
  C_save_and_reclaim((void *)f_197,c,av);}
  a=C_alloc(6);
  t5=C_a_i_bytevector(&a,1,C_fix(4));
  t6=t1;{
  C_word *av2=av;
  av2[0]=t6;
  av2[1]=stub21(t5,t2,t3,t4);
  ((C_proc)(void*)(*((C_word*)t6+1)))(2,av2);}}

I thought that the allocation of 4 words for the bytevector (which is
more than needed on a 64-bit machine) makes the difference, but it turns
out to be negligible. Changing it to 2, and also adjusting the values for
C_calculate_demand and C_alloc, doesn't seem to change a lot, but you may
want to try that: just modify the C code and compile it with the same
options as the .scm file.

On my laptop fma is a library call, so currently my guess is simply that
the scm-fma code is tighter and avoids 3 additional function calls (one
to the stub, one to C_a_i_bytevector and one to fma). The increased
number of GCs may also be caused by the bytevector above, which is used
as a placeholder for the flonum result and wastes one word.

There is room for improvement in the compiler, though: the C_fix(4) is
overly conservative (4 words are correct on 32-bit, taking care of
flonum alignment, but unnecessary on 64-bit). Also, the bytevector thing
is a bit of a hack - we could actually just pass "a" to stub21 directly.
You may want to try this out:

  /* c99-fma in k183 in k180 in k177 in k174 (modified) */
  static void C_ccall f_197(C_word c,C_word *av){
  C_word tmp;
  C_word t0=av[0];
  C_word t1=av[1];
  C_word t2=av[2];
  C_word t3=av[3];
  C_word t4=av[4];
  C_word t6;
  C_word *a;
  if(C_unlikely(!C_demand(C_calculate_demand(4,c,1)))){
  C_save_and_reclaim((void *)f_197,c,av);}
  a=C_alloc(4);
  t6=t1;{
  C_word *av2=av;
  av2[0]=t6;
  av2[1]=stub21((C_word)a,t2,t3,t4);
  ((C_proc)(void*)(*((C_word*)t6+1)))(2,av2);}}

This reduces the number of minor GCs on my machine to roughly the same
as for scm-fma. If your compiler inlines stub21 and fma, then you should
see comparable performance. Also, the default optimization level for C
is -Os (pass -v to csc to see what is passed to the C compiler), so
using -O2 instead should make a difference.

felix
Re: Performance question concerning chicken flonum vs "foreign flonum"
Hi Christian,

this might be a case of "never trust a statistic you did not falsify
yourself".

Not bothering to speculate about explanations, I tend to ask how stable
the results are wrt. larger N's, repetition etc. IMHO the results are
too close to call.

Roughly, this looks like 88% of the memory usage (minor GCs) going along
with 85% of the runtime. Ergo: GC takes time.

My first guess: there may be allocation going on in the FFI accounting
for the increased memory usage. I'm in no way competent to actually
confirm or rule out that hypothesis.

Please take my whole assessment with a grain of salt; just a first guess.

On Thu, 04 Nov 2021 16:46:50 +0100 (CET) wrote:

> Dear All,
>
> I am currently experimenting with Chicken Scheme and I would like to
> ask about the following situation: I am comparing a "pure" Scheme
> fused-multiply-add (fma) using chicken.flonum against C99's fma via
> chicken.foreign. Here is my test code:
>
> fma-test.scm
>
> (import (chicken flonum) (chicken foreign) srfi-4)
>
> (foreign-declare "#include <math.h>")
>
> ;; FMA via nested fp+ and fp* from chicken-flonum
> (define (scm-fma x y z)
>   (fp+ z (fp* x y)))
>
> ;; FMA via C99 function through chicken-foreign
> (define c99-fma (foreign-lambda double "fma" double double double))
>
> ;; Test function for FMAs
> (define (dot fma a b)
>   (do [(idx 0 (add1 idx))
>        (dim (f64vector-length a))
>        (ret 0.0 (fma (f64vector-ref a idx) (f64vector-ref b idx)
>                      ret))]
>       ((= idx dim) ret)))
>
> ;; Test vector dimension
> (define dim 200)
>
> ;; Test vector 1
> (define a (make-f64vector dim 1.2345))
>
> ;; Test vector 2
> (define b (make-f64vector dim 0.9876))
>
> ;; Test repetitions
> (define N 200)
>
> ;; Test scm-dot
> (time (do [(n 0 (add1 n))]
>           ((= n N))
>         (dot scm-fma a b)))
>
> ;; Test fma-dot
> (time (do [(n 0 (add1 n))]
>           ((= n N))
>         (dot c99-fma a b)))
>
> ;eof
>
> Running this code as follows:
>
> csc -O5 fma-test.scm && ./fma-test
>
> yields the following results:
>
> 7.558s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
> 8.839s CPU time, 0/256410 GCs (major/minor), maximum live heap: 30.78 MiB
>
> Now I wonder why C's single function (instruction) is slower than two
> Scheme function calls. I have four potential explanations:
>
> 1. chicken.foreign needs to do some type conversion for each argument
> and return value which accounts for the extra time. If so, could this
> be avoided by type declarations somehow?
>
> 2. chicken.flonum does something to make fpX calls very fast. If so,
> can this be done for the foreign fma, too?
>
> 3. I am using chicken.foreign inefficiently, but I think srfi-144 is
> using it similarly.
>
> 4. This is an effect only on my machine?
>
> It would be great to get some help or explanation with this issue.
>
> Here is my setup:
>
> CHICKEN Scheme 5.2.0
> gcc 10.3.0
> Ubuntu 20.04
> AMD Ryzen 5 4500U with 16GB
>
> Thank you very much
>
> Christian
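One way to check how stable such timings are is to repeat the measurement several times and look at the spread rather than a single number. The following is a minimal C sketch of that idea. All names in it (timing_summary, time_repeated, sample_work) are hypothetical, and the busy loop is only a stand-in for the real benchmark body.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Summary of repeated CPU-time measurements, in seconds. */
typedef struct {
    double min;
    double mean;
    double max;
} timing_summary;

/* Placeholder workload (hypothetical); replace with the code to measure. */
static volatile double sink;
void sample_work(void) {
    double acc = 0.0;
    for (int i = 0; i < 100000; ++i)
        acc += 1.2345 * 0.9876;
    sink = acc; /* volatile store keeps the loop from being optimized away */
}

/* Run `work` `reps` times and record min/mean/max CPU time per run. */
timing_summary time_repeated(void (*work)(void), size_t reps) {
    assert(work != NULL && reps > 0);
    timing_summary s = {0.0, 0.0, 0.0};
    double total = 0.0;
    for (size_t i = 0; i < reps; ++i) {
        clock_t start = clock();
        work();
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (i == 0 || secs < s.min) s.min = secs;
        if (i == 0 || secs > s.max) s.max = secs;
        total += secs;
    }
    s.mean = total / (double)reps;
    return s;
}
```

If the gap between min and max for one variant is as large as the gap between the two variants, the difference is probably within the noise.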
Re: New egg: nng
Hi Ariela,

On Sun, 31 Oct 2021 17:18:45 -0300 Ariela Wenner wrote:

> So, here's a followup
>
> I linked pthreads and added the things you suggested to the wiki.
>
> As for the hangs, I hate to say this but it seems like the test was the
> problem.
> More precisely, it seems like the topic for the subscriber socket wasn't set
> fast enough, so when the publisher socket sent a message, the subscriber
> wasn't ready for it. Hang.
>
> I added some coordination in the test itself and ran the tests over 100 times
> just to be sure. No hang. Maybe I'm just being extremely unlucky though, so as
> always feedback is appreciated.
>
> Just pushed a new tag with the changes, fingers crossed.

Thanks. I'm sad to report, though, that the tests of 0.2.1 still hang here.

I'm testing it with the following command:

  $ test-new-egg nng https://gitlab.com/ariSun/chicken-nng/-/raw/main/nng.release-info

I executed the command above three times and the tests hung consistently.

I'm using strace to verify that the test process is not actually doing
anything:

  $ strace -p 14836
  strace: Process 14836 attached
  restart_syscall(<... resuming interrupted poll ...>

All the best.
Mario
--
http://parenteses.org/mario
Re: new egg: cmark
On Thu, 04 Nov 2021 12:52:50 + "Caolan McMahon" wrote:

>> Caolan: would it be ok for you if cmark for CHICKEN 5 points to
>> Harley's implementation?
>
> Yes, please go ahead - and thanks to Harley for creating a CHICKEN 5 version
> :)

Thanks, Caolan.

Harley: thanks again. Your egg has been added to the coop.

All the best.
Mario
--
http://parenteses.org/mario
Performance question concerning chicken flonum vs "foreign flonum"
Dear All,

I am currently experimenting with Chicken Scheme and I would like to
ask about the following situation: I am comparing a "pure" Scheme
fused-multiply-add (fma) using chicken.flonum against C99's fma via
chicken.foreign. Here is my test code:

fma-test.scm

(import (chicken flonum) (chicken foreign) srfi-4)

(foreign-declare "#include <math.h>")

;; FMA via nested fp+ and fp* from chicken-flonum
(define (scm-fma x y z)
  (fp+ z (fp* x y)))

;; FMA via C99 function through chicken-foreign
(define c99-fma (foreign-lambda double "fma" double double double))

;; Test function for FMAs
(define (dot fma a b)
  (do [(idx 0 (add1 idx))
       (dim (f64vector-length a))
       (ret 0.0 (fma (f64vector-ref a idx) (f64vector-ref b idx)
                     ret))]
      ((= idx dim) ret)))

;; Test vector dimension
(define dim 200)

;; Test vector 1
(define a (make-f64vector dim 1.2345))

;; Test vector 2
(define b (make-f64vector dim 0.9876))

;; Test repetitions
(define N 200)

;; Test scm-dot
(time (do [(n 0 (add1 n))]
          ((= n N))
        (dot scm-fma a b)))

;; Test fma-dot
(time (do [(n 0 (add1 n))]
          ((= n N))
        (dot c99-fma a b)))

;eof

Running this code as follows:

csc -O5 fma-test.scm && ./fma-test

yields the following results:

7.558s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
8.839s CPU time, 0/256410 GCs (major/minor), maximum live heap: 30.78 MiB

Now I wonder why C's single function (instruction) is slower than two
Scheme function calls. I have four potential explanations:

1. chicken.foreign needs to do some type conversion for each argument
and return value which accounts for the extra time. If so, could this
be avoided by type declarations somehow?

2. chicken.flonum does something to make fpX calls very fast. If so,
can this be done for the foreign fma, too?

3. I am using chicken.foreign inefficiently, but I think srfi-144 is
using it similarly.

4. This is an effect only on my machine?

It would be great to get some help or explanation with this issue.

Here is my setup:

CHICKEN Scheme 5.2.0
gcc 10.3.0
Ubuntu 20.04
AMD Ryzen 5 4500U with 16GB

Thank you very much

Christian
Re: new egg: cmark
> Caolan: would it be ok for you if cmark for CHICKEN 5 points to
> Harley's implementation?

Yes, please go ahead - and thanks to Harley for creating a CHICKEN 5
version :)

Caolan
Re: new egg: cmark
On Thu, 04 Nov 2021 01:14:13 + "Harley Swick" wrote:

> Mario,
>
> Looks like I missed a step. I have added a release-info file and tested it
> with test-new-egg.
>
> Jim,
>
> Thanks for the feedback. I have updated the wiki with your suggestions.

Thanks, Harley.

Let's just double-check with Caolan whether he has any objection
regarding reusing the egg name from CHICKEN 4, as he is the author of
that egg.

Caolan: would it be ok for you if cmark for CHICKEN 5 points to
Harley's implementation?

All the best.
Mario
--
http://parenteses.org/mario