ixid:
In any case, with large values of k the branch prediction will be right almost all of the time. That explains why this form is faster than modulo: modulo is fairly slow, while this is a correctly predicted branch plus an addition, if the compiler doesn't make it branchless.
That seems to be the explanation.
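To make the comparison concrete, here is a minimal C sketch of the two forms as I understand them. The function names and the "advance an index in [0, k)" setup are my assumptions for illustration, not the code from the thread:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: advance i by one inside [0, k) using modulo.
   The division behind % is comparatively slow when k is not a constant. */
size_t next_mod(size_t i, size_t k) {
    assert(k > 0 && i < k);
    return (i + 1) % k;
}

/* Hypothetical helper: same result with a compare instead of modulo.
   The wrap happens only once every k steps, so for large k a branch
   here is predicted correctly almost every time. */
size_t next_branch(size_t i, size_t k) {
    assert(k > 0 && i < k);
    i++;
    if (i == k)
        i = 0;
    return i;
}
```

Both functions return the same values. Whether the second one ends up as a real branch or as a conditional move is up to the compiler.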
The branchless version gives the same timing as the branched one. Is there a way to force that line not to be optimized, so I can compare it against the predicted version?
I don't fully understand the question. Do you mean annotations like the __builtin_expect of GCC?
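In case that is what you mean, this is roughly how such an annotation looks in C with GCC (Clang accepts it too). The UNLIKELY macro and the function names are hypothetical, chosen for this sketch. Note that the hint only tells the compiler which outcome to expect; as far as I know it does not guarantee that a branch is emitted rather than a conditional move, so you would still need to check the generated assembly:

```c
#include <assert.h>
#include <stddef.h>

/* UNLIKELY is a hypothetical convenience macro around the GCC builtin. */
#if defined(__GNUC__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define UNLIKELY(x) (x)
#endif

/* Branched form with a hint that the wrap-around is the rare case. */
size_t next_hinted(size_t i, size_t k) {
    assert(k > 0 && i < k);
    i++;
    if (UNLIKELY(i == k))
        i = 0;
    return i;
}

/* Explicitly branchless form for comparison: the mask is all ones
   when i != k and zero when i == k, so the AND either keeps i or
   resets it to 0. */
size_t next_masked(size_t i, size_t k) {
    assert(k > 0 && i < k);
    i++;
    return i & -(size_t)(i != k);
}
```

Timing next_hinted against next_masked, after confirming in the assembly that the first one really contains a jump, should show whether the predicted branch or the branchless code wins for your k.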
Bye, bearophile