On Friday, 23 November 2018 at 14:32:39 UTC, Vladimir Panteleev wrote:
On Friday, 23 November 2018 at 13:23:22 UTC, welkam wrote:
If we run these steps in different threads on the same core with SMT we could better use the core's resources: reading the file with the kernel, decoding UTF-8 with vector instructions, and lexing/parsing with scalar operations, while all communication is done through the L1 and L2 caches.

You might save some pages from the data cache, but by doing more work at once, the code might stop fitting in the execution-related caches (code pages, microcode, branch prediction) instead.

It's not about saving TLB pages or fitting better in the cache. Compilers are considered streaming applications - they don't utilize CPU caches effectively. You can't read one character and emit machine code, then read the next character; you have to go over all the data multiple times while modifying it. I can find white papers, if you're interested, where people test GCC with different cache architectures and it doesn't make much of a difference. GCC is a popular application for testing caches.

Here is profiling data from DMD:
 Performance counter stats for 'dmd -c main.d':

            600.77 msec task-clock:u              #    0.803 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
            33,209      page-faults:u             # 55348.333 M/sec
     1,072,289,307      cycles:u                  # 1787148.845 GHz
       870,175,210      stalled-cycles-frontend:u #   81.15% frontend cycles idle
       721,897,927      stalled-cycles-backend:u  #   67.32% backend cycles idle
       881,895,208      instructions:u            #    0.82  insn per cycle
                                                  #    0.99  stalled cycles per insn
       171,211,752      branches:u                # 285352920.000 M/sec
        11,287,327      branch-misses:u           #    6.59% of all branches

       0.747720395 seconds time elapsed

       0.497698000 seconds user
       0.104165000 seconds sys
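
In case you want to reproduce this: the numbers above come from something like "perf stat dmd -c main.d" on my machine; the :u suffix on the events just means they count user space only.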

The most important number in this conversation is 0.82 instructions per cycle. My CPU can do ~2 IPC, so there are plenty of CPU resources sitting unused; new Intel desktop processors are designed to do 4 instructions per cycle. What limits DMD's performance is slow RAM and data fetching, not the things you listed.
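To put numbers on that: 881,895,208 instructions / 1,072,289,307 cycles ≈ 0.82 IPC, so at a sustained ~2 IPC the same instruction stream would finish in roughly 441 million cycles, about 2.4x fewer than it actually took.
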
Code pages - do you mean the TLB here?

Microcode cache. Not all processors have one, and those that do only see a benefit on trivial loops. DMD has complex loops.

Branch prediction. More entries in the branch predictor won't help here, because branches are mispredicted due to unpredictable data, not because there are too many branches. Also, the branch misprediction penalty is around 30 cycles, while a read from RAM can take over 200 cycles.

L1 code cache. You didn't mention this, but running those tasks in SMT mode might thrash the L1 instruction cache, so execution might not be optimal.

Rather than parallel reading of imports, what DMD needs is more data-oriented data structures in place of its old OOP-inspired ones. I'll give an example of why that is the case.

Consider
struct {
    bool isAlive;
    // <other data, at least 7 bytes in size>
}

If you want to read that bool, the CPU has to pull in the whole 8-byte struct (in reality a full 64-byte cache line, which is even worse). That means that for one bit of information the CPU fetches at least 64 bits of data, giving a 1/64 = 0.015625, or ~1.6%, signal-to-noise ratio. This is terrible!

AFAIK DMD doesn't make exactly this kind of mistake, but it is full of large structs and classes that are not efficient to read. To fix this we need to split those large data structures into smaller ones that contain only what a particular algorithm needs (a rough sketch of what I mean is below). I predict a 2x speed improvement if we transform all the data structures in DMD. That's an improvement from only changing data structures, without improving any algorithms. This is getting too long, so I will stop right here.
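
To make it concrete, here is a minimal sketch of the kind of hot/cold split I mean. The type and field names are invented for illustration, they are not actual DMD data structures:

// Before: one fat struct, the hot flag gets dragged around with the cold data.
struct FatNode {
    bool isAlive;   // 1 bit of useful information...
    long payload;   // ...sitting next to 8 bytes of cold data
}

// After: the hot data lives in its own dense array, the cold data in another.
struct Nodes {
    bool[] alive;    // 1 byte per node, so one 64-byte cache line holds 64 flags
    long[] payload;  // only touched by the passes that actually need it

    size_t countAlive() const {
        size_t n;
        foreach (a; alive)  // streams through the flags at full cache-line density
            if (a) ++n;
        return n;
    }
}

A pass that only cares about liveness now reads 64 flags per cache line instead of 8 structs' worth of mostly cold bytes.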
