On Sun, 24 Mar 2019 at 21:47:50 +0100, Jason A. Donenfeld wrote:
> I generally use a slightly simpler algorithm in various different projects:
>
> //[0, bound)
> static unsigned long random_bounded(unsigned long bound)
> {
> 	unsigned long ret;
> 	const unsigned long max_mod_bound = (1 + ~bound) % bound;
>
> 	if (bound < 2)
> 		return 0;
> 	do
> 		ret = random_integer();
> 	while (ret < max_mod_bound);
> 	return ret % bound;
> }
>
> Is the motivation behind using Lemire that you avoid the division (via
> the modulo) in favor of a multiplication?
Yes.  If we define eps = max_mod_bound * ldexp(1.0, -BITS_PER_LONG) as the
probability of one retry, and retries = eps / (1 - eps) as the expected
number of retries, then both algorithms take 1+retries random_integer()s.

The above algorithm takes 2 divisions, always.  Divides are slow, and
usually not pipelined, so two in short succession incur a latency penalty.

Lemire's multiplicative algorithm takes 1 multiplication on the fast path
(probability 1 - 2*eps on average), 1 additional division on the slow path
(probability 2*eps), and 1 multiplication per retry.  (A sketch is appended
at the end of this message.)

In the common case, when bound is much less than ULONG_MAX, eps is tiny,
the fast path is taken almost all the time, and it's a huge win.  Even in
the absolute worst case of bound = ULONG_MAX/2 + 2, when eps ~ 0.5
(2 multiplies, 0.5 divides; there's no 2*eps penalty in this case), it's
faster as long as 2 multiplies cost less than 1.5 divides.

If you want simpler code, we could omit the fast path and still get a
speedup.  But a predictable branch for a divide seemed like a worthwhile
trade.

(FYI, this all came about as a side project of a kernel-janitor project to
replace "prandom_u32() % range" by "prandom_u32() * range >> 32".  I'm also
annoyed that get_random_u32() and get_random_u64() have separate buffers,
even if EFFICIENT_UNALIGNED_ACCESS, but that's a separate complaint.)
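In case it helps to see it spelled out, here's a sketch of the
multiplicative version, assuming unsigned long is 64 bits and the compiler
provides unsigned __int128 for the double-width product; random_integer()
is the same full-width generator as in your version:

/*
 * [0, bound) -- sketch of Lemire's multiplicative rejection method.
 * prod = x * bound is a 128-bit value; its high 64 bits are the
 * candidate result and its low 64 bits decide acceptance.
 */
static unsigned long random_bounded(unsigned long bound)
{
	unsigned __int128 prod;
	unsigned long low;

	if (bound < 2)
		return 0;
	prod = (unsigned __int128)random_integer() * bound;
	low = (unsigned long)prod;
	/* Fast path: if low >= bound, no division is ever needed. */
	if (low < bound) {
		/* Slow path: one divide, to compute 2^64 % bound. */
		const unsigned long threshold = -bound % bound;

		/* Retry while the low half falls in the biased region. */
		while (low < threshold) {
			prod = (unsigned __int128)random_integer() * bound;
			low = (unsigned long)prod;
		}
	}
	return (unsigned long)(prod >> 64);
}

On x86-64 the 128-bit product is a single mul instruction, so the fast
path really is one multiply plus one well-predicted branch.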
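And for completeness, the bias-tolerating replacement from the FYI above
is just the high half of a 32x32->64 multiply; a minimal sketch, with
prandom_below() as a made-up name:

/*
 * Divide-free map of prandom_u32() into [0, range).  Slightly
 * nonuniform unless range divides 2^32, which prandom users accept.
 * (prandom_below is a hypothetical name, not an existing kernel API.)
 */
static u32 prandom_below(u32 range)
{
	return (u32)(((u64)prandom_u32() * range) >> 32);
}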