[Mono-list] numerical performance comparison [C++ vs JVM vs Mono]

2010-10-04 Thread Jonathan Shore
Hi,

I am looking forward to moving all of my code from Java / C++ to F# / C# in the
very near future.  I took the nbody code from the language shootout and ran it
with 500 million iterations (far more than the shootout uses, to provide a
fairer comparison) on Ubuntu server on a Core i7 920 box.

I used:

- C++ (g++ -O3 with various MMX related flags as done in the shootout)
- Java 7  -server
- Mono 2.4.4, compiling with -optimize:+

I had the following results in seconds:

1.  C++:   98 seconds
2.  JVM:  126 seconds,  a 28% performance gap against C++
3.  Mono: 191 seconds,  a 50% performance gap with the JVM
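
For reference, the gaps quoted above can be reproduced as a relative
slowdown, (slower / baseline - 1) * 100. The snippet below is only a minimal
sketch of that arithmetic; `percent_gap` is a hypothetical helper name and is
not part of the benchmark code.

```cpp
#include <cassert>

// Relative slowdown of `slower` against `baseline`, in percent.
// Both arguments are wall-clock times in seconds.
// (Hypothetical helper, used only to illustrate the arithmetic.)
double percent_gap(double baseline, double slower) {
    assert(baseline > 0.0);
    return (slower / baseline - 1.0) * 100.0;
}
```

With the timings above, percent_gap(98, 126) comes to roughly 28.6 and
percent_gap(126, 191) to roughly 51.6, which matches the rounded figures
in the list.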

Because the nbody problem uses sqrt for the Euclidean distance in each loop,
I thought that the discrepancy might be more related to the implementation
of Sqrt().
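
To show where the sqrt sits, here is a rough sketch of one pairwise update in
the style of the shootout nbody kernel. It is written from memory of the
general algorithm, not copied from the benchmark source; the `Body` struct and
the `interact` function name are assumptions made for this illustration.

```cpp
#include <cmath>

// Position, velocity and mass of one body (illustrative layout).
struct Body { double x, y, z, vx, vy, vz, mass; };

// Apply the mutual gravitational pull of two bodies to their velocities
// for one time step dt. One sqrt is evaluated per pair per step.
void interact(Body& a, Body& b, double dt) {
    const double dx = a.x - b.x;
    const double dy = a.y - b.y;
    const double dz = a.z - b.z;
    const double d2 = dx * dx + dy * dy + dz * dz;  // squared distance
    const double distance = std::sqrt(d2);          // the call under discussion
    const double mag = dt / (d2 * distance);        // dt / distance^3

    a.vx -= dx * b.mass * mag;
    a.vy -= dy * b.mass * mag;
    a.vz -= dz * b.mass * mag;
    b.vx += dx * a.mass * mag;
    b.vy += dy * a.mass * mag;
    b.vz += dz * a.mass * mag;
}
```

Since this runs for every pair of bodies on every iteration, the cost of a
single sqrt is multiplied by the full iteration count.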

I implemented a (very poor) numerical algorithm as a substitute for the sqrt() 
function in each implementation to provide an apples-to-apples comparison.
The new numbers became:

1.  C++:  517 seconds
2.  JVM:  527 seconds
3.  Mono: 223 seconds (wow, a surprise here)
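
The post does not say which numerical algorithm was substituted, so the
following is only an assumed stand-in: a plain Newton-Raphson iteration with a
fixed iteration count. `newton_sqrt` and its default of 30 iterations are
choices made for this sketch, not details from the original test.

```cpp
#include <cassert>

// Software square root via Newton-Raphson: repeatedly average the current
// guess with x / guess. Deliberately naive (fixed iteration count, one
// division per iteration), in the spirit of a "very poor" substitute.
double newton_sqrt(double x, int iterations = 30) {
    assert(x >= 0.0);
    if (x == 0.0) return 0.0;
    double guess = x > 1.0 ? x : 1.0;  // start at or above the true root
    for (int i = 0; i < iterations; ++i) {
        guess = 0.5 * (guess + x / guess);
    }
    return guess;
}
```

A loop like this replaces a single hardware instruction with dozens of
floating-point operations, so a large jump in absolute run time is expected
in every language once the substitute is in place.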

I noticed that the Mono runtime libraries use an internal implementation of
Sqrt() that seems to resolve to an opcode.  I am wondering, ultimately, what
implementation this maps to?  Clearly the Sqrt implementation in Mono is 2x as
slow (or access through the layers is 2x as slow) as the libc implementation.

I do mostly numerical work, so I am concerned about sqrt as well as other
fundamental functions in this regard.  Are these custom implementations in
assembler for each arch?  Would it be reasonable to try to map these to the
existing libc library when available?

Thanks

--
Jonathan Shore
Systematic Trading Group

___
Mono-list maillist  -  Mono-list@lists.ximian.com
http://lists.ximian.com/mailman/listinfo/mono-list


Re: [Mono-list] numerical performance comparison [C++ vs JVM vs Mono]

2010-10-04 Thread ted leslie
I did the exact same thing, and dumped the assembly generated by C and C# for
nbody. In addition to the sqrt issue, which I believe comes down to pre- versus
post-SSEn instructions (Mono isn't as up to date on full use of SSEn), the
other issue was that Mono did a few sets of unnecessary register transfers
compared to the assembly generated by C. With these two issues resolved, the
benchmark times would have matched. I am not even sure Mono's older sqrt call
accounted for the majority of the difference in the benchmark; I believe it
was the unnecessary register transfers.

You will also notice in that benchmark game that many of the C versions (the
most recent version of a given benchmark) are often not solutions that can
even be compared. I think there is even one winner that is a threaded
solution, while Mono uses a single thread. Really, that "game" has far more to
do with people making better and better algorithms for a given language's
solution than it is a comparison of languages. Having said that, it seems to
me, given my comparison of the assembly code of a few of them, that aside from
obvious issues such as array bounds checks and so on for safety, the main
performance killer (for Mono) appears to be non-optimal use of registers, with
too much unnecessary transfer/setup. This, however, is most noticeable only in
these huge loops, which for many people using Mono isn't an issue. Another
issue I noticed is that in the latest SSE4.? there are 16 registers to use,
but I see Mono shuffling within 8 (I think, if I remember correctly). Oddly
enough, I didn't see GNU gcc using the available 16 either.

tl

On Mon, 4 Oct 2010 10:43:18 -0400
Jonathan Shore  wrote:



-- 
ted leslie 


Re: [Mono-list] numerical performance comparison [C++ vs JVM vs Mono]

2010-10-04 Thread Rodrigo Kumpera
Hi Jonathan,

Mono on 64-bit uses the x87 square root instruction, which is far from optimal
since it requires 2 stores and 1 load.
Zoltan, is it possible to switch to SQRTSD?
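
As an illustration of what SQRTSD looks like from the C++ side, the sketch
below uses the SSE2 compiler intrinsics. It assumes an x86 or x86-64 target
with SSE2 available and is not Mono's JIT code; `sqrt_sse2` is a hypothetical
wrapper name.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (x86 / x86-64 only)

// Scalar double-precision square root computed in an XMM register.
// _mm_sqrt_sd is the intrinsic that corresponds to the SQRTSD instruction.
double sqrt_sse2(double x) {
    __m128d v = _mm_set_sd(x);   // place x in the low lane, zero the high lane
    v = _mm_sqrt_sd(v, v);       // low lane = sqrt(low lane); high lane kept
    return _mm_cvtsd_f64(v);     // extract the low lane as a double
}
```

The value stays in the SSE register file throughout, so there is no need to
move it through memory onto the x87 stack and back.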

And, yes, Mono does have some issues with register allocation. It's
something that will be worked on in the mid term, though.




On Mon, Oct 4, 2010 at 11:43 AM, Jonathan Shore wrote:
