P.S. The cited paper calls your algorithm the "OpenBSD algorithm" and has a bunch of benchmarks comparing it to others in Fisher-Yates shuffles of sizes 1e3..1e9.
Including all overhead (base PRNG, shuffle), it's 3x slower for 32-bit operations and 8x slower for 64-bit up to arrays of size 1e6, after which cache misses slow all algorithms, reducing the ratio. If you want a faster division-based agorithm, the "Java algorithm" does 1+retries divides: unsigned long java(unsigned long s) { unsigned long x, r; do { x = random_integer(); r = x % s; } while (x - r > -s); return r; }