https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153
--- Comment #22 from ncm at cantrip dot org --- (In reply to Nathan Kurz from comment #21) > My current belief is > that everything here is expected behavior, and there is no bug with either > the compiler or processor. > > The code spends most of its time in a tight loop that depends on runtime > input, and the compiler doesn't have any reason to know which branch is more > likely. The addition of "count" changes the heuristic in recent compilers, > and by chance, changes it for the worse. I am going to disagree, carefully. It seems clear, in any case, that Haswell is off the hook. 1. As a correction: *without* the count takes twice as long to run as with, or when using bitset<>. 2. As a heuristic, favoring a branch to skip a biggish loop body evidently has much less downside than favoring the branch into it. Maybe Gcc already has such a heuristic, and the addition of 7 conditional increments in the loop, or whatever overhead bitset<> adds, was enough to push it over? Westmere runs both instruction sequences (with and without __builtin_expect) the same. Maybe on Westmere the loop takes two cycles regardless of code placement, and Gcc is (still) tuned for the older timings?