@snej - the `gcc` people say that the biggest boosts from PGO come from
better inlining decisions. A function call can cost like 14 cycles (or 14*2..6
issue = 28..84 superscalar dynamic instruction slots). That's a lot of
(potential) work, and much Nim code might have small non-inlined functions that
only do like 3-15 dynamic instructions. So, speed-up potential might be
1.87x .. 28x (28/15 .. 84/3), which is a lot, but L1 i-cache and the uop cache
(on Intel, anyhow) are also very scarce and also have large speed multipliers,
so inlining more is not automatically a win. So, it's kind of a tricky
"dynamic stew" and the claims of the `gcc` people are plausible.
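
To make that back-of-envelope estimate concrete, here is a small Nim sketch of the same arithmetic. The 14-cycle call cost, the 2..6 issue width and the 3-15 instruction body sizes are just the rough guesses from above, not measurements, and `overheadRatio` is a made-up helper name.

```nim
# Back-of-envelope numbers from the post; none of these are measurements.
const
  callCycles = 14        # assumed cost of one non-inlined call, in cycles
  issueWidths = [2, 6]   # assumed instructions issued per cycle (low, high)
  bodyInstrs = [15, 3]   # assumed useful instructions per small proc (large, small)

proc overheadRatio(cycles, issue, body: int): float =
  ## Issue slots spent on the call, divided by useful instructions in the body.
  float(cycles * issue) / float(body)

when isMainModule:
  # Pair the narrow machine with the large body and the wide machine with the
  # small body to get the two ends of the range.
  for i in 0 ..< issueWidths.len:
    echo issueWidths[i], "-wide issue, ", bodyInstrs[i], "-instruction body: ",
         overheadRatio(callCycles, issueWidths[i], bodyInstrs[i])
```

This only counts lost issue slots; it ignores the i-cache and uop cache pressure that extra inlining adds, which is the other half of the stew.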

In light of that claim, you may be able to get `--passC:-flto` performance 
closer to PGO performance by better static inline decisions, such as adding the 
`{.inline.}` pragma (more?) judiciously. Or there could be some core `system` 
procs that just need some `{.inline.}` annotation as "low hanging fruit".
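
As a minimal sketch of what such an annotation looks like (the `Vec2`, `dot` and `sumDots` names are made up for illustration, not taken from `system` or any library):

```nim
# Hypothetical example: `Vec2`, `dot` and `sumDots` are illustrative names only.
type Vec2 = object
  x, y: float

proc dot(a, b: Vec2): float {.inline.} =
  ## Only a handful of instructions, so call overhead would dominate the
  ## useful work if the C compiler left this as a real call.
  a.x * b.x + a.y * b.y

proc sumDots(vs: openArray[Vec2]): float =
  ## Hot loop that invokes the small proc once per element.
  for v in vs:
    result += dot(v, v)

when isMainModule:
  echo sumDots([Vec2(x: 1.0, y: 2.0), Vec2(x: 3.0, y: 4.0)])
```

Note that `{.inline.}` is only a hint: Nim passes it on to the C compiler, which still makes the final decision.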

In a more perfect world, these C/C++ compilers might have option flags to "show 
their work" in PGO mode. Then it might be easier to investigate on a 
case-by-case basis. You can diff the output, of course, but that's not easy 
work. I haven't really looked. Maybe `clang` or `gcc` have such "show their 
work" features.
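
For what it's worth, both compilers do have inlining report flags: `gcc` has the `-fopt-info` family and `clang` has `-Rpass` remarks. I have not checked how well either reflects profile-driven decisions, so treat the following `config.nims` fragment as a starting point under that assumption, not a recipe.

```nim
# config.nims (NimScript), placed next to the main module.
# Sketch only: assumes the backend C compiler is gcc.

# Ask gcc to report which calls it inlined and which it declined to inline.
switch("passC", "-fopt-info-inline-optimized-missed")

# For a clang backend, the rough equivalent would be optimization remarks:
# switch("passC", "-Rpass=inline")
# switch("passC", "-Rpass-missed=inline")

# Assumption: in an LTO build the inliner runs at link time, so the report
# flag probably has to be passed via passL as well.
```

Diffing that report between a plain build and a PGO build should be less painful than diffing the generated assembly.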
