https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #16 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to rguent...@suse.de from comment #15)
> which is what I referred to for branch prediction.  Your & prompts me
> to a way to do sth similar as duffs device, turning the loop into a nest.
>
> head:
>   if (n == 0) exit;
>   <1 iteration>
>   if (n & 1)
>     n -= 1, goto head;
>   <1 iteration>
>   if (n & 2)
>     n -= 2, goto head;
>   <2 iterations>
>   if (n & 4)
>     n -= 4, goto head;
>   <4 iterations>
>   n -= 8, goto head;
>
> the inner loop tests should become well-predicted quickly.
>
> But as always - more branches make it more likely that one is
> mispredicted.  For a single non-unrolled loop you usually only
> get the actual exit mispredicted.

Yes, overlapping the branches for the tail loop and the main loop will result
in more mispredictions. And there are still 4 branches for an 8x unrolled loop,
blocking optimizations and scheduling. So Duff's device is always inefficient -
the above loop is much faster like this:

if (n & 1)
  do 1 iteration
if (n & 2)
  do 2 iterations
if (n >= 4)
  do
    4 iterations
  while ((n -= 4) > 0)
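
For illustration, here is a minimal C sketch of the remainder-first form above. It is not code from GCC or from this bug. The function name `sum_remainder_first` and the loop body (summing an int array) are made up for the example. It also assumes the low two bits of n are cleared after the remainder steps, which the pseudocode leaves implicit; without that, the `(n -= 4) > 0` test would run one pass too many for n = 5.

```c
#include <stddef.h>

/* Hypothetical example: sum n ints, handling the n % 4 leftover
 * iterations first and then running a 4x unrolled main loop. */
long sum_remainder_first(const int *a, size_t n)
{
    long sum = 0;
    size_t i = 0;

    if (n & 1) {                 /* 1 leftover iteration */
        sum += a[i];
        i += 1;
    }
    if (n & 2) {                 /* 2 leftover iterations */
        sum += a[i];
        sum += a[i + 1];
        i += 2;
    }

    /* Assumption: drop the low two bits so n is a multiple of 4
     * and the loop test below terminates exactly at zero. */
    n &= ~(size_t)3;

    if (n >= 4) {
        do {                     /* 4 iterations per pass, one branch */
            sum += a[i];
            sum += a[i + 1];
            sum += a[i + 2];
            sum += a[i + 3];
            i += 4;
        } while ((n -= 4) > 0);
    }
    return sum;
}
```

The two remainder tests run once each, outside the loop, so the unrolled body contains a single backward branch and no tests that interrupt scheduling across the four iterations.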