https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153
--- Comment #23 from Nathan Kurz <nate at verse dot com> ---

> 1. As a correction: *without* the count takes twice as long to run as with,
> or when using bitset<>.

Oops, I did say that backwards. My tests agree with what you say.

> 2. As a heuristic, favoring a branch to skip a biggish loop body evidently
> has much less downside than favoring the branch into it.

A correctly predicted branch-not-taken will be the same speed or one cycle faster than a correctly predicted branch-taken. Modern Intel can sustain 2 fused cmp/jcc µops per cycle, but only if the first is not-taken.

> Maybe Gcc already has such a heuristic, and the addition of 7 conditional
> increments in the loop, or whatever overhead bitset<> adds, was enough
> to push it over?

I presume this is true, but don't know the GCC internals. I did test, though, with clang++ 3.8.0 and icpc 16.01, and found that both made the same layout choices as GCC in the absence of __builtin_expect() and had approximately the same speed on Skylake (about 2 cycles per iteration). With USE_EXPECT, clang and gcc were equally fast (one cycle), and icc was at an intermediate speed (1.2-ish).

My reason for thinking this is not a bug is that the fastest choice will depend on the contents of the word list. Regardless of layout, there will be one option that is slightly faster than the other. I guess it's reasonable to ask, though, whether it's better by default to try to save one cycle on an already very fast empty loop, or to save one cycle on a more expensive loop. But the real gain (if there is one) will come from matching the layout to the runtime behavior, for which the compiler requires outside information.

> Westmere runs both instruction sequences (with and without __builtin_expect)
> the same. Maybe on Westmere the loop takes two cycles regardless of code
> placement, and Gcc is (still) tuned for the older timings?
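For reference, a minimal sketch of the kind of USE_EXPECT-style switch being compared above. The macro name UNLIKELY and the function count_in_set are illustrative inventions, not the bug's actual testcase:

```cpp
// Hedged sketch: names here are assumptions, not the original testcase.
// USE_EXPECT mirrors the compile-time switch discussed above.
#include <cstring>

#if defined(USE_EXPECT)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define UNLIKELY(x) (x)
#endif

// Count the characters of 'word' that appear in 'set'. Hinting the match
// as unlikely asks the compiler to move the increment off the fall-through
// path, so the hot branch is not-taken and can macro-fuse with its compare.
int count_in_set(const char *word, const char *set) {
    int count = 0;
    for (const char *p = word; *p; ++p) {
        if (UNLIKELY(std::strchr(set, *p) != nullptr))
            ++count;
    }
    return count;
}
```

Whether the hinted or unhinted layout wins depends, as noted, on how often the branch is actually taken at runtime.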
This matches my belief: Haswell and Skylake have the potential to do the inner loop in a single cycle if the layout matches the runtime pattern, while Nehalem (Westmere's sibling) takes at least 2 cycles regardless. This is due to the changing capabilities for "Macro-op Fusion" across the generations. See the sections with that heading for Nehalem, Sandy Bridge, Haswell, and Skylake here: http://www.agner.org/optimize/microarchitecture.pdf.

Often the limiting factor for very tight loops like this is that only 4 µops can be "retired" per cycle. If both cmp/jcc instructions are fused ('cmp' here stands in for a different subset of cmp/test/add/sub on different generations), and if the first branch is not taken, it's possible to do both plus the 'mov' and 'add' in a single cycle on Haswell and Skylake. If, as on previous generations, they are not both fused (see Agner's PDF), or if the first branch is taken, there is a two-cycle minimum for the loop.

My recommendation to any GCC internals experts reading this would be to read Agner's suggestions for promoting fusion, and to change the loop optimizers to respect them. While it's not happening in this case, I often see situations where GCC puts 'mov' or other non-flag-changing instructions between the 'cmp' and 'jcc', defeating macro-op fusion and creating slower loops. Intel's icc frequently does a better job of avoiding this, and thus often produces slightly faster loops.
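To make the µop accounting concrete, here is a hedged sketch (assumed function and data, not the original word-list loop) whose inner loop has the shape described: one load, one add, and two compare-and-branch pairs:

```cpp
// Illustrative sketch only; 'scan' and its rare-match predicate are
// assumptions, not the bug's testcase. In the favorable layout the
// compiled inner loop is roughly:
//     movzx  (load byte)          ; 1 fused-domain uop
//     cmp/jcc (rare match)        ; 1 uop if macro-fused, usually not taken
//     add    (advance index)      ; 1 uop
//     cmp/jcc (loop back-edge)    ; 1 uop if macro-fused
// Four fused-domain uops fit the 4-per-cycle retire limit, so Haswell and
// Skylake can sustain one iteration per cycle when both pairs fuse and the
// first branch falls through; otherwise the loop needs at least two cycles.
#include <cstddef>

long scan(const unsigned char *s, std::size_t n) {
    long hits = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (__builtin_expect(s[i] == 'x', 0))  // rare match: branch not taken
            ++hits;  // increment stays off the fall-through path
    }
    return hits;
}
```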