https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82004
--- Comment #36 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #35)
> Created attachment 43763 [details]
> pr82004_dumps.tar.xz
>
> Dumps.  For lto I've just added the init_sw_absorption function parts of the
> dump, the dumps are too large.

Skipping partial redundancy for expression
{plus_expr,logchl_926,1.00000000000000002081668171172168513294309377670288085938e-2}
(0812), no redundancy on to be optimized for speed edge
Skipping partial redundancy for expression
{call_expr<__builtin_pow>,real_cst<1.0e+1>,logchl_1040} (0813), no redundancy
on to be optimized for speed edge

so with LTO we have "better" profile estimates and the entry edge is
considered cold...

LTO:

  <bb 33> [local count: 3813]:
...
  <bb 34> [local count: 16255]:
  # n_925 = PHI <0(33), _1128(129)>
  # logchl_926 = PHI <-3.0099999999999997868371792719699442386627197265625e+0(33), logchl_1040(129)>

non-LTO:

  <bb 33> [local count: 10616]:
...
  <bb 34> [local count: 85892]:
  # logchl_591 = PHI <-3.0099999999999997868371792719699442386627197265625e+0(33), logchl_701(129)>

In general, optimizing the redundancy on the entry edge isn't worth it, given
that it often increases register pressure by introducing loop-carried
dependences.  So I think LTO is "correct" here, even if that's unfortunate... :/