Hello all,

recently I was looking at the n-body test on the shootout (http://shootout.alioth.debian.org/u64q/performance.php?test=nbody) in the context of Mono. The program is very simple, so it is a nice piece for analyzing sources of performance problems. I've also contributed a SIMD version, but it is of minor importance since it wasn't significantly faster than the regular one.
I also compared the performance of this test on Windows, using .NET. It took about 70% of the time that Mono needed. The number itself is not very interesting, but the reasons behind it may be. I did a small analysis of the code emitted by the JIT in both cases. Let's look at one part of the code, specifically the statements that compute the squared length of the vector that is the difference between the positions of two bodies:

    double dx = bi.x - bj.x, dy = bi.y - bj.y, dz = bi.z - bj.z;
    double d2 = dx * dx + dy * dy + dz * dz;

On .NET the emitted JIT code is:

    fld   qword ptr [edx+4]
    fsub  qword ptr [eax+4]
    fld   qword ptr [edx+0Ch]
    fsub  qword ptr [eax+0Ch]
    fld   qword ptr [edx+14h]
    fsub  qword ptr [eax+14h]
    // the differences are now in st(0) through st(2)
    fld   st(2)
    fmul  st,st(3)
    fld   st(2)
    fmul  st,st(3)
    faddp st(1),st
    fld   st(1)
    fmul  st,st(2)
    faddp st(1),st

Here is the code generated by Mono's JIT (in AT&T syntax, but that shouldn't make it hard to read):

    fldl   0x8(%ebx)
    fldl   0x8(%edi)
    fsubrp %st,%st(1)
    fstpl  -0x18(%ebp)
    fldl   0x10(%ebx)
    fldl   0x10(%edi)
    fsubrp %st,%st(1)
    fstpl  -0x20(%ebp)
    fldl   0x18(%ebx)
    fldl   0x18(%edi)
    fsubrp %st,%st(1)
    fstpl  -0x28(%ebp)
    // the differences are now at three offsets from ebp (local variables)
    fldl   -0x18(%ebp)
    fldl   -0x18(%ebp)
    fmulp  %st,%st(1)
    fldl   -0x20(%ebp)
    fldl   -0x20(%ebp)
    fmulp  %st,%st(1)
    faddp  %st,%st(1)
    fldl   -0x28(%ebp)
    fldl   -0x28(%ebp)
    fmulp  %st,%st(1)
    faddp  %st,%st(1)
    fstpl  -0x30(%ebp)

Like .NET, Mono holds the pointers to the objects bi and bj in registers (edi and ebx in Mono's case, as one can see). There is, however, an important difference: .NET treats registers (floating point stack positions) as local variables, while Mono always stores results from registers to stack-allocated memory and loads them back again. In this case the stack variables for dx, dy and dz can be eliminated entirely, and this is what the .NET JIT does. The optimization that should be adopted here is to bind local variables to registers where possible.
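To make the difference easier to experiment with outside of Mono, here is a rough C analogue of the two code shapes. It is only a sketch: the `body_pos` struct and both function names are made up for illustration and are not taken from the benchmark or from Mono. The first function leaves the differences in plain locals, which an optimizing compiler will typically keep in registers, much like the .NET listing. The second marks them `volatile`, which forces each difference to be stored to the stack frame and reloaded on every use, roughly imitating the store/reload pattern in the Mono listing.

```c
/* Hypothetical stand-in for the benchmark's body object (position only). */
typedef struct {
    double x, y, z;
} body_pos;

/* Differences live in ordinary locals, so the compiler is free to keep
 * them in registers (x87 stack slots or xmm registers). */
double dist2_in_registers(const body_pos *bi, const body_pos *bj)
{
    double dx = bi->x - bj->x;
    double dy = bi->y - bj->y;
    double dz = bi->z - bj->z;
    return dx * dx + dy * dy + dz * dz;
}

/* volatile forces every write to be stored to memory and every read to be
 * reloaded from it, imitating locals that always live on the stack. */
double dist2_via_stack(const body_pos *bi, const body_pos *bj)
{
    volatile double dx = bi->x - bj->x;
    volatile double dy = bi->y - bj->y;
    volatile double dz = bi->z - bj->z;
    return dx * dx + dy * dy + dz * dz;
}
```

Both functions compute the same value. Compiling them with optimizations enabled and comparing the generated assembly should show the same kind of difference as between the two listings above, although the exact instructions depend on the compiler and target.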
On the amd64 architecture the problem is very similar, but with respect to the xmm registers. It affects all programs that do a lot of floating point computation. Agreed, these may be a minority of the software run on Mono, but still. Has anyone investigated this kind of optimization? Is it too hard to achieve because of the nature of Mini, or for some other reason? If I were interested in providing such an optimization, where should I start in Mini's code?

Thanks in advance for answers,
regards,
Konrad