Re: Performance question concerning chicken flonum vs "foreign flonum"

2021-11-04 Thread felix . winkelmann
> 7.558s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
> 8.839s CPU time, 0/256410 GCs (major/minor), maximum live heap: 30.78 MiB
>
>[...]
>
> It would be great to get some help or explanation with this issue.

Hi!

I have similar timings and the difference in the number of minor GC indicates
that the c99-fma variant allocates more stack space and thus causes more
minor GCs.

Looking at the generated C file ("csc -k"), we see that scm-fma unboxes the
intermediate result and thus generates relatively decent code:

/* scm-fma in k183 in k180 in k177 in k174 */
static void C_ccall f_187(C_word c,C_word *av){
C_word tmp;
C_word t0=av[0];
C_word t1=av[1];
C_word t2=av[2];
C_word t3=av[3];
C_word t4=av[4];
C_word t5;
double f0;
C_word *a;
if(C_unlikely(!C_demand(C_calculate_demand(4,c,1)))){
C_save_and_reclaim((void *)f_187,c,av);}
a=C_alloc(4);
f0=C_ub_i_flonum_times(C_flonum_magnitude(t2),C_flonum_magnitude(t3));
t5=t1;{
C_word *av2=av;
av2[0]=t5;
av2[1]=C_flonum(&a,C_ub_i_flonum_plus(C_flonum_magnitude(t4),f0));
((C_proc)(void*)(*((C_word*)t5+1)))(2,av2);}}

The other version allocates a bytevector to hold the result:

/* c99-fma in k183 in k180 in k177 in k174 */
static void C_ccall f_197(C_word c,C_word *av){
C_word tmp;
C_word t0=av[0];
C_word t1=av[1];
C_word t2=av[2];
C_word t3=av[3];
C_word t4=av[4];
C_word t5;
C_word t6;
C_word *a;
if(C_unlikely(!C_demand(C_calculate_demand(6,c,1)))){
C_save_and_reclaim((void *)f_197,c,av);}
a=C_alloc(6);
t5=C_a_i_bytevector(&a,1,C_fix(4));
t6=t1;{
C_word *av2=av;
av2[0]=t6;
av2[1]=stub21(t5,t2,t3,t4);
((C_proc)(void*)(*((C_word*)t6+1)))(2,av2);}}

I thought that the allocation of 4 words for the bytevector (which is more than
needed on a 64-bit machine) makes the difference, but it turns out to be
negligible. Changing it to 2 and also adjusting the values for
C_calculate_demand and C_alloc doesn't seem to change a lot, but you may want
to try that: just modify the C code and compile it with the same options as
the .scm file.

On my laptop fma is a library call, so currently my guess is simply that
the scm-fma code is tighter and avoids 3 additional function calls (one to the
stub, one to C_a_i_bytevector and one to fma). The increased number of GCs may
also be caused by the bytevector above, which is used as a placeholder for
the flonum result and wastes one word.

There is room for improvement for the compiler, though: the C_fix(4) is overly
conservative (4 words are correct on 32-bit, taking care of flonum alignment,
but unnecessary on 64 bits). Also, the bytevector thing is a bit of a hack: we
could actually just pass "a" to stub21 directly. You may want to try this out:

/* c99-fma in k183 in k180 in k177 in k174 (modified) */
static void C_ccall f_197(C_word c,C_word *av){
C_word tmp;
C_word t0=av[0];
C_word t1=av[1];
C_word t2=av[2];
C_word t3=av[3];
C_word t4=av[4];
C_word t6;
C_word *a;
if(C_unlikely(!C_demand(C_calculate_demand(4,c,1)))){
C_save_and_reclaim((void *)f_197,c,av);}
a=C_alloc(4);
t6=t1;{
C_word *av2=av;
av2[0]=t6;
av2[1]=stub21((C_word)a,t2,t3,t4);
((C_proc)(void*)(*((C_word*)t6+1)))(2,av2);}}

This reduces the number of minor GCs on my machine to roughly the same as for
scm-fma. If your compiler inlines stub21 and fma, then you should see
comparable performance. Also, the default optimization level for C is -Os
(pass -v to csc to see what is passed to the C compiler), so using -O2
instead should make a difference.
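
The idea of handing the stub its result space directly, instead of wrapping
that space in a bytevector first, can be modelled in plain C. The sketch below
is only a toy model: the type toy_word and all toy_* functions are hypothetical
names made up for illustration and are not part of the CHICKEN runtime, and the
arithmetic is a plain multiply and add rather than a call to fma.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy model only: hypothetical names, not CHICKEN's runtime API. */
typedef uint64_t toy_word;

/* One header word plus one 8-byte double, as on a 64-bit machine. */
enum { TOY_FLONUM_WORDS = 2 };

/* Writes a "boxed" double into caller-provided scratch space. */
toy_word *toy_box_double(toy_word *scratch, double value)
{
    assert(scratch != NULL);
    scratch[0] = sizeof value;                  /* pretend header: payload size */
    memcpy(&scratch[1], &value, sizeof value);  /* payload */
    return scratch;
}

/* Reads the payload back out of a boxed double. */
double toy_unbox_double(const toy_word *box)
{
    double value;
    assert(box != NULL);
    memcpy(&value, &box[1], sizeof value);
    return value;
}

/* Stub in the style of the modified listing: the result is written
   straight into the space the caller reserved, with no wrapper object. */
toy_word *toy_stub_fma(toy_word *scratch, double x, double y, double z)
{
    return toy_box_double(scratch, z + x * y);  /* a real stub would call fma() */
}
```

A caller would reserve TOY_FLONUM_WORDS words on its own stack and pass them to
toy_stub_fma, which mirrors passing "a" to stub21 in the listing above.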


felix




Re: Performance question concerning chicken flonum vs "foreign flonum"

2021-11-04 Thread Jörg F. Wittenberger
Hi Christian,

this might be a case of "never trust a statistic you did not falsify
yourself".

Not bothering to speculate about explanations, I tend to ask how stable
the results are wrt. larger N's, repetition etc.

IMHO the results are too close to call.  Roughly this looks like 88%
memory usage (minor GCs) going along with 85% runtime.  Ergo: GC takes
time. My first guess: There may be allocation going on in the FFI
accounting for the increased memory usage.

I'm in no way competent to actually confirm or rule out that
hypothesis.  Please take my whole assessment with a grain of salt; it is
just a first guess.
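
For reference, the two ratios can be recomputed from the figures in the quoted
"time" output. The sketch below is plain arithmetic on those numbers; the
function names ratio and print_ratios are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: scm-fma figure divided by the c99-fma figure. */
double ratio(double scm, double c99)
{
    assert(c99 != 0.0);
    return scm / c99;
}

/* Prints both ratios as percentages, using the figures quoted in the thread. */
void print_ratios(void)
{
    printf("minor GCs: %.1f%%\n", 100.0 * ratio(225861.0, 256410.0));
    printf("CPU time:  %.1f%%\n", 100.0 * ratio(7.558, 8.839));
}
```

Both ratios land between 85% and 90%, so the minor GC count and the run time
shrink by a similar factor.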

On Thu, 04 Nov 2021 16:46:50 +0100 (CET)
wrote :

> Dear All,
> 
> I am currently experimenting with Chicken Scheme and I would like to
> ask about the following situation: I am comparing a "pure" Scheme
> fused-multiply-add (fma) using chicken.flonum against C99's fma via
> chicken.foreign. Here is my test code:
> 
>  fma-test.scm
> 
> (import (chicken flonum) (chicken foreign) srfi-4)
> 
> (foreign-declare "#include <math.h>")
> 
> ;; FMA via nested fp+ and fp* from chicken-flonum
> (define (scm-fma x y z)
>   (fp+ z (fp* x y)))
> 
> ;; FMA via C99 function through chicken-foreign
> (define c99-fma (foreign-lambda double "fma" double double double))
> 
> ;; Test function for FMAs
> (define (dot fma a b)
>   (do [(idx 0 (add1 idx))
>        (dim (f64vector-length a))
>        (ret 0.0 (fma (f64vector-ref a idx) (f64vector-ref b idx) ret))]
>       ((= idx dim) ret)))
> 
> ;; Test vector dimension
> (define dim 200)
> 
> ;; Test vector 1
> (define a (make-f64vector dim 1.2345))
> 
> ;; Test vector 2
> (define b (make-f64vector dim 0.9876))
> 
> ;; Test repetitions
> (define N 200)
> 
> ;; Test scm-dot
> (time (do [(n 0 (add1 n))]
>           ((= n N))
>         (dot scm-fma a b)))
> 
> ;; Test fma-dot
> (time (do [(n 0 (add1 n))]
>           ((= n N))
>         (dot c99-fma a b)))
> 
> ;eof
> 
> Running this code as follows:
> 
> csc -O5 fma-test.scm && ./fma-test
> 
> yields the results in:
> 
> 7.558s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
> 8.839s CPU time, 0/256410 GCs (major/minor), maximum live heap: 30.78 MiB
> 
> Now I wonder why C's single function (instruction) is slower than two
> Scheme function calls. I have four potential explanations:
> 
> 1. chicken.foreign needs to do some type conversion for each argument
> and return value which accounts for the extra time. If so could this
> be avoided by type declarations somehow?
> 
> 2. chicken.flonum does something to make fpX calls very fast. If so
> can this be done for the foreign fma, too?
> 
> 3. I am using chicken.foreign inefficiently, but I think srfi-144 is
> using it similarly.
> 
> 4. This is an effect only on my machine?
> 
> It would be great to get some help or explanation with this issue.
> 
> Here is my setup:
> 
> CHICKEN Scheme 5.2.0
> gcc 10.3.0
> Ubuntu 20.04
> AMD Ryzen 5 4500U with 16GB
> 
> Thank you very much
> 
> Christian
> 




Re: New egg: nng

2021-11-04 Thread Mario Domenech Goulart
Hi Ariela,

On Sun, 31 Oct 2021 17:18:45 -0300 Ariela Wenner  wrote:

> So, here's a followup
>
> I linked pthreads and added the things you suggested to the wiki.
>
> As for the hangs, I hate to say this but it seems like the test was the
> problem. More precisely, it seems like the topic for the subscriber socket
> wasn't set fast enough, so when the publisher socket sent a message, the
> subscriber wasn't ready for it. Hang.
>
> I added some coordination in the test itself and ran the tests over 100 times
> just to be sure. No hang. Maybe I'm just being extremely unlucky though, so as
> always feedback is appreciated.
>
> Just pushed a new tag with the changes, fingers crossed.

Thanks.

I'm sad to report, though, that tests of 0.2.1 still hang here.

I'm testing it with the following command:

  $ test-new-egg nng https://gitlab.com/ariSun/chicken-nng/-/raw/main/nng.release-info

I executed the command above three times and the tests hung consistently.

I'm using strace to verify that the test process is not actually doing
anything:

$ strace -p 14836
strace: Process 14836 attached
restart_syscall(<... resuming interrupted poll ...>

All the best.
Mario
-- 
http://parenteses.org/mario



Re: new egg: cmark

2021-11-04 Thread Mario Domenech Goulart
On Thu, 04 Nov 2021 12:52:50 + "Caolan McMahon"  wrote:

>> Caolan: would it be ok with you if cmark for CHICKEN 5 points to
>> Harley's implementation?
>
> Yes, please go ahead - and thanks to Harley for creating a CHICKEN 5
> version :)

Thanks, Caolan.

Harley: thanks again.  Your egg has been added to the coop.

All the best.
Mario
-- 
http://parenteses.org/mario



Performance question concerning chicken flonum vs "foreign flonum"

2021-11-04 Thread christian.himpe
Dear All,

I am currently experimenting with Chicken Scheme and I would like to ask about 
the following situation: I am comparing a "pure" Scheme fused-multiply-add 
(fma) using chicken.flonum against C99's fma via chicken.foreign. Here is my 
test code:

 fma-test.scm

(import (chicken flonum) (chicken foreign) srfi-4)

(foreign-declare "#include <math.h>")

;; FMA via nested fp+ and fp* from chicken-flonum
(define (scm-fma x y z)
  (fp+ z (fp* x y)))

;; FMA via C99 function through chicken-foreign
(define c99-fma (foreign-lambda double "fma" double double double))

;; Test function for FMAs
(define (dot fma a b)
  (do [(idx 0 (add1 idx))
       (dim (f64vector-length a))
       (ret 0.0 (fma (f64vector-ref a idx) (f64vector-ref b idx) ret))]
      ((= idx dim) ret)))

;; Test vector dimension
(define dim 200)

;; Test vector 1
(define a (make-f64vector dim 1.2345))

;; Test vector 2
(define b (make-f64vector dim 0.9876))

;; Test repetitions
(define N 200)

;; Test scm-dot
(time (do [(n 0 (add1 n))]
          ((= n N))
        (dot scm-fma a b)))

;; Test fma-dot
(time (do [(n 0 (add1 n))]
          ((= n N))
        (dot c99-fma a b)))

;eof

Running this code as follows:

csc -O5 fma-test.scm && ./fma-test

yields the results in:

7.558s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
8.839s CPU time, 0/256410 GCs (major/minor), maximum live heap: 30.78 MiB

Now I wonder why C's single function (instruction) is slower than two Scheme
function calls. I have four potential explanations:

1. chicken.foreign needs to do some type conversion for each argument and 
return value which accounts for the extra time. If so could this be avoided by 
type declarations somehow?

2. chicken.flonum does something to make fpX calls very fast. If so can this be 
done for the foreign fma, too?

3. I am using chicken.foreign inefficiently, but I think srfi-144 is using it 
similarly.

4. This is an effect only on my machine?

It would be great to get some help or explanation with this issue.

Here is my setup:

CHICKEN Scheme 5.2.0
gcc 10.3.0
Ubuntu 20.04
AMD Ryzen 5 4500U with 16GB

Thank you very much

Christian
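
For comparison, the benchmark's dot routine can be written in plain C with the
same shape: the multiply-add is passed in as a function, so a fused and an
unfused variant can be swapped freely. This is only a sketch; the names fma_fn,
mul_add and dot are chosen here to mirror the Scheme code and are not taken
from any library.

```c
#include <assert.h>
#include <stddef.h>

/* Signature shared by any multiply-add candidate, fused or not. */
typedef double (*fma_fn)(double x, double y, double z);

/* Unfused variant, the counterpart of (fp+ z (fp* x y)). */
double mul_add(double x, double y, double z)
{
    return z + x * y;
}

/* Counterpart of the Scheme dot: accumulates through the supplied function. */
double dot(fma_fn f, const double *a, const double *b, size_t n)
{
    double ret = 0.0;
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; ++i)
        ret = f(a[i], b[i], ret);
    return ret;
}
```

C99's fma from <math.h> has the same signature, so it can be passed as f in
place of mul_add (this typically requires linking with -lm).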



Re: new egg: cmark

2021-11-04 Thread Caolan McMahon
> Caolan: would it be ok with you if cmark for CHICKEN 5 points to
> Harley's implementation?

Yes, please go ahead - and thanks to Harley for creating a CHICKEN 5 version :)

Caolan



Re: new egg: cmark

2021-11-04 Thread Mario Domenech Goulart
On Thu, 04 Nov 2021 01:14:13 + "Harley Swick"  wrote:

> Mario,
>
> Looks like I missed a step. I have added a release-info file and tested it
> with test-new-egg.
>
> Jim,
>
> Thanks for the feedback. I have updated the wiki with your suggestions.

Thanks, Harley.

Let's just double-check with Caolan whether he has any objection
regarding reusing the egg name from CHICKEN 4, as he is the author of
that egg.

Caolan: would it be ok with you if cmark for CHICKEN 5 points to
Harley's implementation?

All the best.
Mario
-- 
http://parenteses.org/mario