On 8/2/2013 6:16 AM, Dmitry Olshansky wrote:
31-Jul-2013 22:20, Walter Bright wrote:
On 7/31/2013 8:26 AM, Dmitry Olshansky wrote:
Ouch... to boot it's always aligned by word size, so
key % sizeof(size_t) == 0
...
rendering the lower 2-3 bits useless; that would make the straight
"slice the lower bits" approach rather weak :)

Yeah, I realized that, too. Gotta shift it right 3 or 4 bits.
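
For readers following along, here is a minimal sketch of the point being made. It is not taken from the DMD sources: the function names and the fixed shift of 4 are assumptions chosen for illustration. It compares slicing the low bits of an aligned address directly with shifting first, for a table whose size is a power of 2.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helpers for illustration only; not DMD code.

// Naive approach: take the low bits of the address as the bucket index.
// With word-aligned (or 16-byte-aligned) keys the low bits are always zero,
// so many keys land in the same few buckets.
inline std::size_t sliceLowBits(std::uintptr_t address, std::size_t tableSize)
{
    // The mask trick below only works for power-of-2 table sizes.
    assert(tableSize != 0 && (tableSize & (tableSize - 1)) == 0);
    return static_cast<std::size_t>(address) & (tableSize - 1);
}

// Shift the address right first (4 bits here, an assumed value) so that
// the bits that actually vary between keys reach the masked range.
inline std::size_t shiftThenSlice(std::uintptr_t address, std::size_t tableSize)
{
    assert(tableSize != 0 && (tableSize & (tableSize - 1)) == 0);
    return static_cast<std::size_t>(address >> 4) & (tableSize - 1);
}
```

With a 16-entry table and the 16-byte-aligned addresses 0x1010 and 0x1020, sliceLowBits puts both keys in bucket 0, while shiftThenSlice spreads them over buckets 1 and 2.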

And that helped a bit... Anyhow, after applying a somewhat more thorough integer
hash, power-of-2 tables live up to their promise.

The pull that reaps the minor speed benefit over the original (~2% speed gain!):
https://github.com/D-Programming-Language/dmd/pull/2436

2% is worth taking.


Not bad given that _aaGetRValue takes only a fraction of time itself.

I failed to see much of any improvement on Win32 though; allocations are
dominating the picture.

And sharing the joy of having a nice sampling profiler, here is what AMD
CodeAnalyst has to say (top X functions by CPU clocks not halted).

Original DMD:

Function                    CPU clocks   DC accesses   DC misses
RTLHeap::Alloc              49410        520           3624
Obj::ledata                 10300        1308          3166
Obj::fltused                6464         3218          6
cgcs_term                   4018         1328          626
TemplateInstance::semantic  3362         2396          26
Obj::byte                   3212         506           692
vsprintf                    3030         3060          2
ScopeDsymbol::search        2780         1592          244
_pformat                    2506         2772          16
_aaGetRvalue                2134         806           304
memmove                     1904         1084          28
strlen                      1804         486           36
malloc                      1282         786           40
Parameter::foreach          1240         778           34
StringTable::search         952          220           42
MD5Final                    918          318

Variation of DMD with pow-2 tables:

Function                    CPU clocks   DC accesses   DC misses
RTLHeap::Alloc              51638        552           3538
Obj::ledata                 9936         1346          3290
Obj::fltused                7392         2948          6
cgcs_term                   3892         1292          638
TemplateInstance::semantic  3724         2346          20
Obj::byte                   3280         548           676
vsprintf                    3056         3006          4
ScopeDsymbol::search        2648         1706          220
_pformat                    2560         2718          26
memcpy                      2014         1122          46
strlen                      1694         494           32
_aaGetRvalue                1588         658           278
Parameter::foreach          1266         658           38
malloc                      1198         758           44
StringTable::search         970          214           24
MD5Final                    866          274           2


This underlines the point that the DMC RTL allocator is the biggest speed detractor.
It is "followed" by ledata (could it be due to the linear search inside?) and,
surprisingly, the tiny Obj::fltused is draining lots of cycles (is it called that
often?).

It's not fltused() that is taking up time, it is the static function following it. The sampling profiler you're using is unaware of non-global function names.
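
A rough sketch of why that happens, assuming the profiler maps each sampled address to the nearest preceding symbol it knows about. The Symbol struct, the attribute function, the addresses, and the name static_helper are all made up for illustration; this is not how CodeAnalyst is actually implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical symbol-table entry: start address and name.
struct Symbol {
    std::uintptr_t address;
    std::string name;
};

// Attribute a sampled address to the closest symbol that starts at or
// before it. The table must be sorted by address.
inline std::string attribute(const std::vector<Symbol> &symbols,
                             std::uintptr_t sample)
{
    assert(std::is_sorted(symbols.begin(), symbols.end(),
        [](const Symbol &a, const Symbol &b) { return a.address < b.address; }));

    // First symbol that starts strictly after the sample.
    auto it = std::upper_bound(symbols.begin(), symbols.end(), sample,
        [](std::uintptr_t addr, const Symbol &s) { return addr < s.address; });

    if (it == symbols.begin())
        return "<unknown>";       // sample lies before every known symbol
    return std::prev(it)->name;   // nearest preceding symbol gets the blame
}
```

If the table holds only global symbols, a sample that falls inside a static function is charged to whichever global function precedes it, which would explain a tiny function like Obj::fltused showing up with a large clock count.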
