Out of curiosity, have you actually checked whether the branchless version is 
faster?

For me, using the following code is actually faster by a factor of almost three 
than any of the variants using bit masking or modulo:
    
    
    j -= 1
    if j < 0:
      j = 3
    
    
    Run

This is most likely because branch predictors tend to be very good at such 
regular branching patterns. Results may vary if there's a lot of interleaved 
code, but as always: measure first, then optimize.

I'll also note that you have to be careful with assuming that `x mod n` will be 
optimized to an `and` operation where `n` is a power of 2. It generally will 
_not_ happen, as the compiler can only rarely prove that `x` is non-negative. 
The reason is that in C99, integer modulo is defined to yield negative results 
for negative `xx`. You may get something like this for `x mod 8`, where `x` is 
a 32-bit integer.
    
    
            movl    %edi, %eax
            sarl    $31, %eax
            shrl    $29, %eax
            addl    %edi, %eax
            andl    $-8, %eax
            subl    %eax, %edi
            movl    %edi, %eax
    
    
    Run

A safer approach would be a template that special-cases the case where the 
modulo is a power of 2, e.g. via:
    
    
    when (m and (m-1)) == 0: # m is a power of 2 if non-negative.
      x and (m-1)
    else:
      # modulo code for arbitrary x and m
    
    
    Run

Alternatively, cast to `uint` before the `mod` if you know that the value is 
non-negative, and back to an `int` afterwards.

Reply via email to