Out of curiosity, have you actually checked whether the branchless version is faster?
For me, using the following code is actually faster by a factor of almost three than any of the variants using bit masking or modulo: j -= 1 if j < 0: j = 3 Run This is most likely because branch predictors tend to be very good at such regular branching patterns. Results may vary if there's a lot of interleaved code, but as always: measure first, then optimize. I'll also note that you have to be careful with assuming that `x mod n` will be optimized to an `and` operation where `n` is a power of 2. It generally will _not_ happen, as the compiler can only rarely prove that `x` is non-negative. The reason is that in C99, integer modulo is defined to yield negative results for negative `xx`. You may get something like this for `x mod 8`, where `x` is a 32-bit integer. movl %edi, %eax sarl $31, %eax shrl $29, %eax addl %edi, %eax andl $-8, %eax subl %eax, %edi movl %edi, %eax Run A safer approach would be a template that special-cases the case where the modulo is a power of 2, e.g. via: when (m and (m-1)) == 0: # m is a power of 2 if non-negative. x and (m-1) else: # modulo code for arbitrary x and m Run Alternatively, cast to `uint` before the `mod` if you know that the value is non-negative, and back to an `int` afterwards.