Re: Inlining Code Test

Nick Voronin Thu, 16 Dec 2010 22:55:24 -0800

On Mon, 13 Dec 2010 01:50:50 +0300, Craig Black <craigbla...@cox.net> wrote:

The following program illustrates the problems with inlining in the dmd compiler. Perhaps with some more work I can reduce it to a smaller test case. I was playing around with a simple Array template, and noticed that similar C++ code performs much better. This is due, at least in part, to opIndex not being properly inlined by dmd. There are two sort functions, quickSort1 and quickSort2. quickSort1 indexes an Array data structure. quickSort2 indexes raw pointers. quickSort2 is roughly 20% faster on my Core i7.


Compiled with dmd v2.050/win32  -g -O -inline -release

First, I looked at the actual asm in a debugger, and I must say the inlining is done very well. The code for the two versions is almost identical, with slight overhead in the Array case because there is an extra level of indirection in data access, inlining or not.

Second, I see anywhere from a 3.3% to a 6.7% difference in performance, but no more than that. Tested on a Core2Duo E6300, Windows XP SP3. I increased the number of iterations for benchmark!() to 5 to reduce the volatility of the results. That's the only change I made to the source.

Third... Now here is a funny thing. The absolute times and the difference between the implementations depend on how you run the program. I was dumbfounded as to how that could matter, but the fact is that I get the aforementioned average 5% difference if I run it from the command line as "inline.exe". If I run it as "inline", without the extension, I get a difference of around 15%, and the absolute times are notably smaller.



X:\d\tests\craig>inline.exe
Sorting with Array.opIndex: 6533
Sorting with pointers: 6264
4.11756 percent faster

X:\d\tests\craig>inline
Sorting with Array.opIndex: 5390
Sorting with pointers: 4674
13.2839 percent faster

Something like that. It's not a fluke. I tested it on my old Athlon XP with XP SP2 and saw exactly the same picture (btw, the difference in % between the implementations was about the same).

I ran both variants under stracent and found no difference, except that one pointer on the stack, at the points where LeaveCriticalSection and GetCurrentThreadId are called, was always off by 4 bytes. This got me thinking. The only observable difference is the length of the command line. And indeed, renaming the program showed that only the length of the command line matters, not its content.

Further tests suggest that some value is either aligned to 8 bytes or not, depending on the length of the command line, and this makes all the difference (which happens to be greater than the difference between the sorting implementations). I couldn't find which value causes the slowdown, though.


--
Using Opera's revolutionary email client: http://www.opera.com/mail/
