@snej - the `gcc` people say that the biggest boosts from PGO come from better inlining decisions. A function call can cost something like 14 cycles (or 14 * 2..6 issue width = 28..84 superscalar dynamic instruction slots). That's a lot of (potential) work, and much Nim code might have small non-inlined functions that only do like 3-15 dynamic instructions. So, the call overhead might be 1.87x .. 28x the useful work (28/15 .. 84/3), which is a lot of speed-up potential, but inlining also grows code, and L1 i-cache and the uop cache (on Intel, anyhow) are very scarce and also have large speed multipliers. So, it's kind of a tricky "dynamic stew", and the claims of the `gcc` people are plausible.
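To make the back-of-envelope arithmetic above easy to poke at, here is a tiny sketch. Every constant is one of the rough guesses from this post (not a measurement), and the names are made up for illustration.

```nim
# Back-of-envelope call-overhead estimate; all inputs are rough assumptions.
const
  callCycles = 14   # assumed cost of one non-inlined call, in cycles
  issueLo = 2       # assumed sustained issue width, low end
  issueHi = 6       # assumed sustained issue width, high end
  workLo = 3        # dynamic instructions of useful work in a tiny proc
  workHi = 15       # dynamic instructions of useful work in a larger small proc

let
  overheadLo = callCycles * issueLo   # instruction slots lost per call, low end
  overheadHi = callCycles * issueHi   # instruction slots lost per call, high end

echo "overhead slots: ", overheadLo, " .. ", overheadHi
# `/` on two ints yields a float in Nim, so these are ratios, not truncated.
echo "overhead/work:  ", overheadLo / workHi, " .. ", overheadHi / workLo
```

Changing the issue width or the work range moves the ratio a lot, which is sort of the point: the estimate is only good to an order of magnitude.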
In light of that claim, you may be able to get `--passC:-flto` performance closer to PGO performance through better static inlining decisions, such as adding the `{.inline.}` pragma (more?) judiciously. Or there could be some core `system` procs that just need an `{.inline.}` annotation as "low hanging fruit". In a more perfect world, these C/C++ compilers would have option flags to "show their work" in PGO mode. Then it might be easier to investigate on a case-by-case basis. You can diff the generated code with and without PGO, of course, but that's not easy work. I haven't really looked. Maybe `clang` or `gcc` have such "show their work" features.
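As a minimal sketch of the kind of annotation I mean: `sq` and `sumSquares` below are hypothetical stand-ins for a small hot proc and its caller, not anything from `system`. The compile lines in the comments are assumptions about a typical setup. For what it's worth, the GCC docs do list an `-fopt-info-inline` family of flags and clang has `-Rpass=inline` remarks, but I have not checked how informative either is under PGO.

```nim
# Hypothetical hot helper: small enough that call overhead could dominate.
# {.inline.} asks the Nim compiler to mark the generated C function inline.
proc sq(x: float): float {.inline.} =
  x * x

# Hypothetical caller with a tight loop around the small proc.
proc sumSquares(xs: openArray[float]): float =
  for x in xs:
    result += sq(x)

echo sumSquares([1.0, 2.0, 3.0])

# Possible build lines (assumed file name t.nim):
#   nim c -d:release --passC:-flto --passL:-flto t.nim
# Asking the C compiler to report inlining decisions (untested under PGO):
#   gcc backend:   --passC:-fopt-info-inline
#   clang backend: --passC:-Rpass=inline
```

Comparing timings with and without the pragma on a few suspect procs is probably less painful than diffing assembly.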