On Friday, 23 November 2018 at 14:32:39 UTC, Vladimir Panteleev wrote:
On Friday, 23 November 2018 at 13:23:22 UTC, welkam wrote:
If we run these steps in different threads on the same core with SMT we could better use the core's resources: reading the file with the kernel, decoding UTF-8 with vector instructions, and lexing/parsing with scalar operations, while all communication is done through the L1 and L2 caches.

You might save some pages from the data cache, but by doing more work at once, the code might stop fitting in the execution-related caches (code pages, microcode, branch prediction) instead.

It's not about saving TLB pages or fitting better in the cache. Compilers are considered streaming applications - they don't utilize CPU caches effectively. You can't read one character and emit machine code, then read the next character; you have to go over all the data multiple times while modifying it. I can find white papers, if you're interested, where people test GCC with different cache architectures and it doesn't make much of a difference. GCC is a popular application for testing caches.

Here is profiling data from DMD:
 Performance counter stats for 'dmd -c main.d':

            600.77 msec task-clock:u              #    0.803 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
            33,209      page-faults:u             # 55348.333 M/sec
     1,072,289,307      cycles:u                  # 1787148.845 GHz
       870,175,210      stalled-cycles-frontend:u #   81.15% frontend cycles idle
       721,897,927      stalled-cycles-backend:u  #   67.32% backend cycles idle
       881,895,208      instructions:u            #    0.82  insn per cycle
                                                  #    0.99  stalled cycles per insn
       171,211,752      branches:u                # 285352920.000 M/sec
        11,287,327      branch-misses:u           #    6.59% of all branches

       0.747720395 seconds time elapsed

       0.497698000 seconds user
       0.104165000 seconds sys
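
In case you want to reproduce this: the numbers above come from something like "perf stat dmd -c main.d" on my machine; the :u suffix on the events just means they count user space only.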

The most important number in this conversation is 0.82 instructions per cycle. My CPU can do ~2 IPC, so there are plenty of CPU resources sitting unused; new Intel desktop processors are designed to do 4 instructions per cycle. What limits DMD's performance is slow RAM and data fetching, not the things you listed.
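To put numbers on that: 881,895,208 instructions / 1,072,289,307 cycles ≈ 0.82 IPC, so at a sustained ~2 IPC the same instruction stream would finish in roughly 441 million cycles, about 2.4x fewer than it actually took.
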
Code pages - do you mean the TLB here?

Microcode cache. Not all processors have one, and those that do only see a benefit on trivial loops. DMD has complex loops.

Branch prediction. More entries in the branch predictor won't help here, because branches are mispredicted due to unpredictable data, not because there are too many branches. Also, the branch misprediction penalty is around 30 cycles, while a read from RAM can take over 200 cycles.

L1 code cache. You didn't mention this, but running those tasks in SMT mode might thrash the L1 instruction cache, so execution might not be optimal.

Rather than parallel reading of imports, what DMD needs is more data-oriented data structures in place of its old OOP-inspired ones. I'll give an example of why that is the case.

Consider
struct {
    bool isAlive;
    // <other data, at least 7 bytes in size>
}

If you want to read that bool, the CPU has to pull in the whole 8-byte struct (in reality a full 64-byte cache line, which is even worse). That means that for one bit of information the CPU fetches at least 64 bits of data, giving a 1/64 = 0.015625, or ~1.6%, signal-to-noise ratio. This is terrible!

AFAIK DMD doesn't make exactly this kind of mistake, but it is full of large structs and classes that are not efficient to read. To fix this we need to split those large data structures into smaller ones that contain only what a particular algorithm needs (a rough sketch of what I mean is below). I predict a 2x speed improvement if we transform all the data structures in DMD. That's an improvement from only changing data structures, without improving any algorithms. This is getting too long, so I will stop right here.
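
To make it concrete, here is a minimal sketch of the kind of hot/cold split I mean. The type and field names are invented for illustration, they are not actual DMD data structures:

// Before: one fat struct, the hot flag gets dragged around with the cold data.
struct FatNode {
    bool isAlive;   // 1 bit of useful information...
    long payload;   // ...sitting next to 8 bytes of cold data
}

// After: the hot data lives in its own dense array, the cold data in another.
struct Nodes {
    bool[] alive;    // 1 byte per node, so one 64-byte cache line holds 64 flags
    long[] payload;  // only touched by the passes that actually need it

    size_t countAlive() const {
        size_t n;
        foreach (a; alive)  // streams through the flags at full cache-line density
            if (a) ++n;
        return n;
    }
}

A pass that only cares about liveness now reads 64 flags per cache line instead of 8 structs' worth of mostly cold bytes.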
