Re: Performance question concerning chicken flonum vs "foreign flonum"

2021-11-05 Thread felix . winkelmann
> modified code:
>
> 7.378s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
> 8.498s CPU time, 0/238095 GCs (major/minor), maximum live heap: 30.78 MiB
>
> Both were compiled with -O3 optimization level in gcc.
>
> I am fine with these results given your layout of the internals in the 
> background.
>
> Would it be theoretically thinkable to include such fma functionality 
> directly into chicken.flonum, i.e. as fp+*, or are included modules typically 
> unaltered?

The core modules like chicken.flonum can be optimized freely, as they are always
delivered with the base system and the compiler is often tuned to treat these
specially.
I wonder why the speed difference still exists. Could you send me the generated
assembly code for the test program, as produced by your compiler? I'd like to see
how far the C compiler goes at inlining the fma operation.
If this can give a noticeable speedup, I see no reason not to add such an
operation, but it would be nice to measure the effect before we do this. I can send
you a patch for testing if you like.

Note that one may have to use compiler intrinsics or special C compiler options
to enable this, see for example:


https://stackoverflow.com/questions/15933100/how-to-use-fused-multiply-add-fma-instructions-with-sse-avx


felix




Re: Performance question concerning chicken flonum vs "foreign flonum"

2021-11-05 Thread Christian Himpe


felix.winkelm...@bevuta.com wrote on 2021-11-04:
> > 7.558s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
> > 8.839s CPU time, 0/256410 GCs (major/minor), maximum live heap: 30.78 MiB
> >
> >[...]
> >
> > It would be great to get some help or explanation with this issue.

> Hi!

> I have similar timings and the difference in the number of minor GC indicates
> that the c99-fma variant allocates more stack space and thus causes more
> minor GCs.

> Looking at the generated C file ("csc -k"), we see that scm-fma unboxes the
> intermediate result and thus generates relatively decent code:

> /* scm-fma in k183 in k180 in k177 in k174 */
> static void C_ccall f_187(C_word c,C_word *av){
> C_word tmp;
> C_word t0=av[0];
> C_word t1=av[1];
> C_word t2=av[2];
> C_word t3=av[3];
> C_word t4=av[4];
> C_word t5;
> double f0;
> C_word *a;
> if(C_unlikely(!C_demand(C_calculate_demand(4,c,1)))){
> C_save_and_reclaim((void *)f_187,c,av);}
> a=C_alloc(4);
> f0=C_ub_i_flonum_times(C_flonum_magnitude(t2),C_flonum_magnitude(t3));
> t5=t1;{
> C_word *av2=av;
> av2[0]=t5;
> av2[1]=C_flonum(&a,C_ub_i_flonum_plus(C_flonum_magnitude(t4),f0));
> ((C_proc)(void*)(*((C_word*)t5+1)))(2,av2);}}

> The other version allocates a bytevector to hold the result:

> /* c99-fma in k183 in k180 in k177 in k174 */
> static void C_ccall f_197(C_word c,C_word *av){
> C_word tmp;
> C_word t0=av[0];
> C_word t1=av[1];
> C_word t2=av[2];
> C_word t3=av[3];
> C_word t4=av[4];
> C_word t5;
> C_word t6;
> C_word *a;
> if(C_unlikely(!C_demand(C_calculate_demand(6,c,1)))){
> C_save_and_reclaim((void *)f_197,c,av);}
> a=C_alloc(6);
> t5=C_a_i_bytevector(&a,1,C_fix(4));
> t6=t1;{
> C_word *av2=av;
> av2[0]=t6;
> av2[1]=stub21(t5,t2,t3,t4);
> ((C_proc)(void*)(*((C_word*)t6+1)))(2,av2);}}

> I thought that the allocation of 4 words for the bytevector (which is more than
> needed on a 64 bit machine) makes the difference, but it turns out to be negligible.
> Changing it to 2 and also adjusting the values for C_calculate_demand and
> C_alloc doesn't seem to change a lot, but you may want to try that -
> just modify the C code and compile it with the same options as the .scm file.

> On my laptop fma is a library call, so currently my guess is simply that
> the scm-fma code is tighter and avoids 3 additional function calls (one to the stub,
> one to C_a_i_bytevector and one to fma). The increased number of GCs may
> also be caused by the bytevector above, which is used as a placeholder for
> the flonum result, which wastes one word.

> There is room for improvement for the compiler, though: the C_fix(4) is overly
> conservative (4 words are correct on 32-bit, taking care of flonum alignment, but
> unnecessary on 64 bits). Also, the bytevector thing is a bit of a hack - we
> could actually just pass "a" to stub21 directly. You may want to try this out:

> /* c99-fma in k183 in k180 in k177 in k174 (modified) */
> static void C_ccall f_197(C_word c,C_word *av){
> C_word tmp;
> C_word t0=av[0];
> C_word t1=av[1];
> C_word t2=av[2];
> C_word t3=av[3];
> C_word t4=av[4];
> C_word t6;
> C_word *a;
> if(C_unlikely(!C_demand(C_calculate_demand(4,c,1)))){
> C_save_and_reclaim((void *)f_197,c,av);}
> a=C_alloc(4);
> t6=t1;{
> C_word *av2=av;
> av2[0]=t6;
> av2[1]=stub21((C_word)a,t2,t3,t4);
> ((C_proc)(void*)(*((C_word*)t6+1)))(2,av2);}}

> This reduces minor GCs on my machine to roughly the same number. If your
> compiler inlines stub21 and fma, then you should see comparable performance.
> Also, the default optimization level for C is -Os (pass -v to csc to see what is
> passed to the C compiler), so using -O2 instead should make a difference.


> felix

Dear Felix,

thank you for your explanations. I tested your modified source and indeed the
number of GCs is significantly reduced, but the timing difference remains:

original code:

7.656s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
8.849s CPU time, 0/256410 GCs (major/minor), maximum live heap: 30.78 MiB

modified code:

7.378s CPU time, 0/225861 GCs (major/minor), maximum live heap: 30.78 MiB
8.498s CPU time, 0/238095 GCs (major/minor), maximum live heap: 30.78 MiB

Both were compiled with -O3 optimization level in gcc.

I am fine with these results given your layout of the internals in the 
background.

Would it be theoretically thinkable to include such fma functionality directly 
into chicken.flonum, i.e. as fp+*, or are included modules typically unaltered?

Thank you

Christian



Re: New egg: nng

2021-11-05 Thread Ariela Wenner
Hi Mario!

Welp... that's a bummer. I was sure it was a timing issue with the tests.

I'll keep poking at it on different machines to see what I'm missing.

Thanks for giving it a try! Cheers!

On November 4, 2021 4:16:46 PM GMT-03:00, Mario Domenech Goulart wrote:
>Hi Ariela,
>
>On Sun, 31 Oct 2021 17:18:45 -0300 Ariela Wenner wrote:
>
>> So, here's a followup
>>
>> I linked pthreads and added the things you suggested to the wiki.
>>
>> As for the hangs, I hate to say this but it seems like the test was the problem.
>> More precisely, it seems like the topic for the subscriber socket wasn't set
>> fast enough, so when the publisher socket sent a message, the subscriber wasn't
>> ready for it. Hang.
>>
>> I added some coordination in the test itself and ran the tests over 100 times
>> just to be sure. No hang. Maybe I'm just being extremely unlucky though, so as
>> always feedback is appreciated.
>>
>> Just pushed a new tag with the changes, fingers crossed.
>
>Thanks.
>
>I'm sad to report, though, that tests of 0.2.1 still hang here.
>
>I'm testing it with the following command:
>
>  $ test-new-egg nng https://gitlab.com/ariSun/chicken-nng/-/raw/main/nng.release-info
>
>I executed the command above three times and the tests hung consistently.
>
>I'm using strace to verify that the test process is not actually doing
>anything:
>
>$ strace -p 14836
>strace: Process 14836 attached
>restart_syscall(<... resuming interrupted poll ...>
>
>All the best.
>Mario
>-- 
>http://parenteses.org/mario

-- 
Ariela Wenner