https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99386
--- Comment #4 from Milian Wolff <mail at milianw dot de> ---

Ah, but LTO only helps with the variant that contains a single type. The variant with two types remains very slow.

Variant with a single type:

```
 Performance counter stats for './variant 1' (5 runs):

            264.14 msec task-clock                #    0.999 CPUs utilized              ( +-  0.13% )
                 0      context-switches          #    0.001 K/sec                      ( +-100.00% )
                 0      cpu-migrations            #    0.000 K/sec
               380      page-faults               #    0.001 M/sec                      ( +-  0.13% )
     1,182,582,454      cycles                    #    4.477 GHz                        ( +-  0.06% )  (62.52%)
           634,015      stalled-cycles-frontend   #    0.05% frontend cycles idle       ( +-  3.72% )  (62.52%)
     1,044,218,220      stalled-cycles-backend    #   88.30% backend cycles idle        ( +-  0.16% )  (62.52%)
     1,187,317,899      instructions              #    1.00  insn per cycle
                                                  #    0.88  stalled cycles per insn    ( +-  0.11% )  (62.52%)
       132,470,519      branches                  #  501.512 M/sec                      ( +-  0.09% )  (62.53%)
             2,967      branch-misses             #    0.00% of all branches            ( +-  7.80% )  (62.47%)
       788,740,131      L1-dcache-loads           # 2986.044 M/sec                      ( +-  0.16% )  (62.47%)
        16,466,669      L1-dcache-load-misses     #    2.09% of all L1-dcache accesses  ( +-  0.16% )  (62.46%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses

          0.264412 +- 0.000379 seconds time elapsed  ( +-  0.14% )
```

The above measurement is in the same ballpark as the no-variant baseline without LTO.
But check out the following for using a variant with two types:

```
 Performance counter stats for './variant 2' (5 runs):

          1,807.01 msec task-clock                #    1.000 CPUs utilized              ( +-  0.04% )
                 4      context-switches          #    0.002 K/sec                      ( +- 11.59% )
                 0      cpu-migrations            #    0.000 K/sec                      ( +- 61.24% )
               383      page-faults               #    0.212 K/sec                      ( +-  0.27% )
     8,093,139,812      cycles                    #    4.479 GHz                        ( +-  0.01% )  (62.35%)
         1,393,308      stalled-cycles-frontend   #    0.02% frontend cycles idle       ( +-  5.84% )  (62.52%)
     7,257,955,665      stalled-cycles-backend    #   89.68% backend cycles idle        ( +-  0.08% )  (62.62%)
     4,728,542,717      instructions              #    0.58  insn per cycle
                                                  #    1.53  stalled cycles per insn    ( +-  0.02% )  (62.65%)
       395,189,246      branches                  #  218.698 M/sec                      ( +-  0.02% )  (62.65%)
            17,570      branch-misses             #    0.00% of all branches            ( +- 12.38% )  (62.55%)
     3,806,321,294      L1-dcache-loads           # 2106.424 M/sec                      ( +-  0.02% )  (62.39%)
        16,753,910      L1-dcache-load-misses     #    0.44% of all L1-dcache accesses  ( +-  0.11% )  (62.28%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses

          1.807335 +- 0.000776 seconds time elapsed  ( +-  0.04% )
```

Again, performance suffers dramatically
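
To put a number on the slowdown, here is a small sketch that divides the two-type counters by the single-type counters. The values are copied from the two `perf stat` runs quoted in this comment; the dictionary keys and the `ratios` helper are names made up for this illustration, not anything perf provides.

```python
# Counter values copied from the two `perf stat` runs quoted in this comment.
# The key names are illustrative labels, not perf event names.
single_type = {
    "seconds": 0.264412,
    "cycles": 1_182_582_454,
    "instructions": 1_187_317_899,
    "branches": 132_470_519,
    "l1d_loads": 788_740_131,
    "l1d_load_misses": 16_466_669,
}
two_types = {
    "seconds": 1.807335,
    "cycles": 8_093_139_812,
    "instructions": 4_728_542_717,
    "branches": 395_189_246,
    "l1d_loads": 3_806_321_294,
    "l1d_load_misses": 16_753_910,
}


def ratios(base, other):
    """Return other/base for every counter present in base."""
    return {name: other[name] / base[name] for name in base}


slowdown = ratios(single_type, two_types)
for name, factor in slowdown.items():
    print(f"{name:16s} {factor:5.2f}x")
```

Run as-is, this shows wall time and cycles growing by roughly 6.8x, retired instructions by roughly 4x, and L1 data cache loads by roughly 4.8x, while L1 load misses stay almost unchanged. That suggests the two-type build executes far more instructions and memory loads rather than missing the cache more often, although the counters alone do not identify which code is responsible.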