[ https://issues.apache.org/jira/browse/RNG-187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876297#comment-17876297 ]
Alex Herbert commented on RNG-187: ---------------------------------- The code examples provided by the authors are given in C and require unsigned 64-bit integers with support for the unsigned 128-bit product. This would require JDK 21 to access the intrinsic method for [Math.unsignedMultiplyHigh(long, long)|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/Math.html#unsignedMultiplyHigh(long,long)]. I limited my test implementation to 32-bit integers. When generating multiple integers in ranges [0, n1), [0, n2), etc from a single random value there is a limit on the size of each range n. The total number of bits to represent the product n1 * n2 * ... must be less than the number of bits in the random number. In practice it should be smaller again to avoid excess rejection in the algorithm. The paper by Brackett-Rozinsky & Lemire provides an estimate of the length of an array, and the number of integers k that can be generated from a single L-bit source: ||k||L=64||L=32|| |2|1358187913|20724| |3|929104|581| |4|26573|109| |5|3225| | |6|815| | In practice these thresholds are lowered to the next power of 2 down. Shuffling is performed from the end of the array visiting each item once. When the length is above the limit for k=2 then the shuffle can use a classic Fisher-Yates algorithm which swaps each item with a randomly selected one from the preceding items. This is equivalent to a batch size k=1. Otherwise the algorithm generates random indices for a batch of size k and swaps the current top k elements with the preceding selected items. When using a 64-bit generator the length of the array when the batched integer generation can be used is large. The number of k that can be supported is also larger. I tested only k=2 for a 32-bit generator which allows some optimisations of the batch shuffle code. h2. Results I have implemented the shuffle2 algorithm in the RNG JMH examples module. The results are from: {noformat} Java version: 21.0.3, vendor: Eclipse Adoptium MacOS Sonoma 14.6.1 Apple M2 Max{noformat} The following shows the time per item for various array lengths when using a fast RNG (XO_RO_SHI_RO_128_PP): !shuffle-1.png|width=800,height=600! This speed-up is comparable to the performance results in the paper for the shuffle2 algorithm. Note that the added complexity of shuffle2 does not impact small arrays until they are very small. The following table compares the relative performance when using an RNG of different speed for size 16384 (2^14): |RandomSource|shuffle|shuffle2|Relative| |JDK|116758.9|61857.5|0.530| |MWC_256|43716.6|27806.1|0.636| |XO_RO_SHI_RO_128_PP|28785.2|19496.2|0.677| The slower the RNG, the greater the performance gain from the batched integer generation. Speed-up is around 30% for a fast RNG. The performance gain is largest at a size below the threshold for switching to batching (2^15=32768). h2. Summary This method provides a noticeable performance gain for shuffling of smaller arrays with no penalty for larger arrays. Note: I did not implement a k=3 or k=4 batch shuffle. The original paper reports that the speed-up when moving from k=1 to k=2 is much larger than k=2 to k=6. Only when using a very slow RNG such as a cryptographically secure generator is the additional performance gain more noticeable. When limited to a 32-bit algorithm only k=3 and k=4 are practical. Future work could add these for comparison. > Performance improvements for an array shuffle > --------------------------------------------- > > Key: RNG-187 > URL: https://issues.apache.org/jira/browse/RNG-187 > Project: Commons RNG > Issue Type: New Feature > Components: sampling > Affects Versions: 1.6 > Reporter: Alex Herbert > Priority: Minor > Attachments: shuffle-1.png, shuffle.png > > > Improve the performance of array shuffling by generation of multiple indices > in a range [0, n) from a single sample from a random source. > The current UniformRandomProvider method to sample an integer in a range [0, > n) uses a rejection method based on multiplication with a single use of a > slow modulus operator if rejection is required. The author of that method has > extended it to allow generation of multiple integers from 2 or more ranges > from the same random sample. > When applied to a shuffle algorithm the speed-up can be significant, > typically in the range of up to 2-fold for smaller arrays depending on the > speed of the random generator; slower generators see a greater speed-up. > The method is described in: > {noformat} > Nevin Brackett-Rozinsky, Daniel Lemire, > Batched Ranged Random Integer Generation, > Software: Practice and Experience (to appear) > {noformat} > See [arXiv:2408.06213M|https://arxiv.org/abs/2408.06213]. > See also [Daniel Lemire Blog: Faster random integer generation with > batching|https://lemire.me/blog/2024/08/17/faster-random-integer-generation-with-batching/] > -- This message was sent by Atlassian Jira (v8.20.10#820010)