On Mon, 13 Dec 2010 01:50:50 +0300, Craig Black <craigbla...@cox.net> wrote:

The following program illustrates the problems with inlining in the dmd compiler. Perhaps with some more work I can reduce it to a smaller test case. I was playing around with a simple Array template, and noticed that similar C++ code performs much better. This is due, at least in part, to opIndex not being properly inlined by dmd. There are two sort functions, quickSort1 and quickSort2. quickSort1 indexes an Array data structure. quickSort2 indexes raw pointers. quickSort2 is roughly 20% faster on my core i7.

Compiled with dmd v2.050/win32  -g -O -inline -release

First, I looked at the actual asm in a debugger, and I must say the inlining is done very well. The code for the two versions is almost identical, with a slight overhead in the Array case because of an extra level of indirection in data access, which is there whether opIndex is inlined or not.
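To picture the extra indirection, here is a minimal sketch of the two access paths. It is not Craig's program: the Array struct, sumViaArray and sumViaPointer are hypothetical stand-ins, and summing is used instead of sorting to keep it short.

```d
import std.stdio;

// Hypothetical stand-in for the Array template under discussion;
// the original benchmark source is not reproduced here.
struct Array(T)
{
    T* ptr;
    size_t length;

    // Element access goes through opIndex. Even when this call is
    // inlined, ptr still has to be loaded from the struct first.
    ref T opIndex(size_t i)
    {
        return ptr[i];
    }
}

// Reads every element through Array.opIndex.
int sumViaArray(ref Array!int a)
{
    int total = 0;
    foreach (i; 0 .. a.length)
        total += a[i];
    return total;
}

// Reads the same elements through a raw pointer.
int sumViaPointer(int* p, size_t n)
{
    int total = 0;
    foreach (i; 0 .. n)
        total += p[i];
    return total;
}

void main()
{
    int[] data = [5, 3, 8, 1];
    auto a = Array!int(data.ptr, data.length);
    writeln(sumViaArray(a), " ", sumViaPointer(data.ptr, data.length));
}
```

Both functions compute the same result; the only difference is that the first one reaches the data through the struct's ptr field.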

Second, I see anywhere from 3.3% to 6.7% difference in performance, but no more than that. Tested on a Core2Duo E6300, Windows XP SP3. I increased the number of iterations for benchmark!() to 5 to reduce the volatility of the results. That's the only change I made to the source.

Third... Now here is a funny thing. The absolute times and the difference between the implementations depend on how you run the program. I was dumbfounded as to how that could matter, but the fact is that I get the aforementioned average 5% difference if I run it with the command line "inline.exe". If I run it as "inline", without the extension, I get a difference of around 15%, and the absolute times are notably smaller.


X:\d\tests\craig>inline.exe
Sorting with Array.opIndex: 6533
Sorting with pointers: 6264
4.11756 percent faster

X:\d\tests\craig>inline
Sorting with Array.opIndex: 5390
Sorting with pointers: 4674
13.2839 percent faster
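
For reference, the printed percentages are consistent with (slow - fast) / slow * 100. The formula is inferred from the numbers above, not taken from the benchmark source, and percentFaster is a hypothetical helper name.

```d
import std.stdio;

// Time saved by the faster run, as a percentage of the slower run.
// Inferred from the printed figures; the benchmark may compute it differently.
double percentFaster(double slow, double fast)
{
    return (slow - fast) / slow * 100.0;
}

void main()
{
    // Timings from the "inline.exe" run.
    writefln("%.4f percent faster", percentFaster(6533, 6264));
    // Timings from the "inline" run.
    writefln("%.4f percent faster", percentFaster(5390, 4674));
}
```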

Something like that. It's not a fluke. I tested it on my old AthlonXP with XP SP2 and saw exactly the same picture (btw, the difference in % between the implementations was about the same).

I ran both variants under StraceNT and found no difference, except that one pointer on the stack at the calls to LeaveCriticalSection and GetCurrentThreadId was always off by 4 bytes. This got me thinking. The only observable difference is the length of the command line. And indeed, renaming the program showed that only the length of the command line matters, not its content.

Further tests suggest that some value ends up either aligned to 8 bytes or not, depending on the length of the command line, and this makes all the difference (which happens to be greater than the difference between the two sorting implementations). I couldn't find which value causes the slowdown, though.
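
One rough way to probe this is to print the address of a stack variable together with the length of args[0], and compare runs started under different names. This is only a sketch: isAligned is a hypothetical helper, and local is an arbitrary example variable, not the value that actually causes the slowdown (which remains unidentified).

```d
import std.stdio;

// True when the address is a multiple of alignment.
// alignment is assumed to be a power of two.
bool isAligned(const(void)* p, size_t alignment)
{
    return (cast(size_t) p & (alignment - 1)) == 0;
}

void main(string[] args)
{
    // An arbitrary stack variable; whether its address changes with the
    // command line depends on the compiler, runtime and OS.
    double local = 0.0;

    writefln("args[0] length: %s", args[0].length);
    writefln("&local = %s, 8-byte aligned: %s", &local, isAligned(&local, 8));
}
```

If the reported alignment flips when the executable is renamed to a name of different length, that would fit the observation above, but it would not by itself show which value matters.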

