https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #23 from Nathan Kurz <nate at verse dot com> ---
> 1. As a correction: *without* the count takes twice as long to run as with,
>    or when using bitset<>.

Oops, I did say that backwards.  My tests agree with what you say.  

> 2. As a heuristic, favoring a branch to skip a biggish loop body evidently 
>    has much less downside than favoring the branch into it.  

A correctly predicted branch-not-taken will be the same speed or one cycle
faster than a correctly predicted branch-taken.   Modern Intel can sustain 2
fused cmp/jcc µops per cycle, but only if the first is not-taken.  

> Maybe Gcc already has such a heuristic, and the addition of 7 conditional 
> increments in the loop, or whatever overhead bitset<> adds, was enough
> to push it over?  

I presume this is true, but don't know the GCC internals.  I did test though
with clang++ 3.8.0 and icpc 16.01, and found that both made the same layout
choices as GCC in the absence of __builtin_expect() and had approximately the
same speed on Skylake (about 2 cycles per iteration).  With USE_EXPECT, clang
and gcc were equally fast (one cycle), and icc was at an intermediate speed
(1.2-ish).  

My reason for thinking this is not a bug is that the fastest choice will depend
on the contents of the word list.   Regardless of layout, there will be one
option that is slightly faster than the other.   I guess it's reasonable to
ask, though, whether it's better by default to try to save one cycle on an
already very fast empty loop, or to save one cycle on a more expensive loop.
But the real gain (if there is one) will be matching the layout to the runtime
behavior, for which the compiler requires outside information.   

> Westmere runs both instruction sequences (with and without __builtin_expect)
> the same.  Maybe on Westmere the loop takes two cycles regardless of code 
> placement, and Gcc is (still) tuned for the older timings?



This matches my belief:  Haswell and Skylake have the potential of doing the
inner loop in a single cycle if the layout matches the runtime pattern, while
Nehalem (Westmere's sibling) takes at least 2 cycles regardless.   This is due
to the differing "Macro-op Fusion" capabilities of the successive generations.
See the sections with that heading for Nehalem, Sandy Bridge, Haswell, and
Skylake here: http://www.agner.org/optimize/microarchitecture.pdf. 

Often the limiting factor for very tight loops like this is that only 4 µops
can be "retired" per cycle.   If both cmp/jcc instructions are fused (cmp here
includes a different subset of cmp/test/add/sub for different generations), and
if the first branch is not taken, it's possible to do both plus the 'mov' and
'add' in a single cycle on Haswell and Skylake.  If as on previous generations
they are not both fused (see Agner's PDF), or if the first branch is taken,
there is a two cycle minimum for the loop.  
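As a concrete sketch (illustrative only, not the actual compiler output from this bug; the .match label is hypothetical), the fast layout looks something like:

```
# Six instructions, but the two cmp/jcc pairs macro-fuse on Haswell and
# Skylake, leaving 4 fused uops -- exactly the per-cycle retirement
# limit -- so the loop can run in one cycle as long as the first branch
# falls through.  (AT&T syntax.)
.loop:
        movl    (%rdi), %eax      # load next word
        addq    $4, %rdi          # advance pointer
        cmpl    %esi, %eax        # fuses with the following je
        je      .match            # hot path: predicted not-taken
        cmpq    %rdx, %rdi        # second fused pair: loop-exit test
        jne     .loop
```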

My recommendation to any GCC internals experts reading this would be to read
Agner's suggestions for promoting fusion, and change the loop optimizers to
respect them.  While it did not happen in this case, I often see situations
where GCC puts 'mov' or other non-flag-changing instructions between the
'cmp' and 'jcc', defeating macro-op fusion and creating slower loops.  Intel's
icc frequently does a better job of avoiding this, and thus often produces
slightly faster loops.
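For instance, a schedule like the following (an illustrative sequence, not GCC's actual output here) breaks the pair apart:

```
# The mov between cmp and je prevents the pair from macro-fusing, so
# they issue and retire as two separate uops.
        cmpl    %esi, %eax
        movl    (%rdi), %eax      # unrelated, non-flag-writing instruction
        je      .match
```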
