[ 
https://issues.apache.org/jira/browse/RNG-187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876297#comment-17876297
 ] 

Alex Herbert commented on RNG-187:
----------------------------------

The code examples provided by the authors are given in C and require unsigned 
64-bit integers with support for the unsigned 128-bit product. This would 
require JDK 21 to access the intrinsic method for 
[Math.unsignedMultiplyHigh(long, 
long)|https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/Math.html#unsignedMultiplyHigh(long,long)].
 I limited my test implementation to 32-bit integers.

When generating multiple integers in ranges [0, n1), [0, n2), etc from a single 
random value there is a limit on the size of each range n. The total number of 
bits to represent the product n1 * n2 * ... must be less than the number of 
bits in the random number. In practice it should be smaller again to avoid 
excess rejection in the algorithm. The paper by Brackett-Rozinsky & Lemire 
provides an estimate of the length of an array, and the number of integers k 
that can be generated from a single L-bit source:
||k||L=64||L=32||
|2|1358187913|20724|
|3|929104|581|
|4|26573|109|
|5|3225| |
|6|815| |

In practice these thresholds are lowered to the next power of 2 down. Shuffling 
is performed from the end of the array visiting each item once. When the length 
is above the limit for k=2 then the shuffle can use a classic Fisher-Yates 
algorithm which swaps each item with a randomly selected one from the preceding 
items. This is equivalent to a batch size k=1. Otherwise the algorithm 
generates random indices for a batch of size k and swaps the current top k 
elements with the preceding selected items.

When using a 64-bit generator the length of the array when the batched integer 
generation can be used is large. The number of k that can be supported is also 
larger. I tested only k=2 for a 32-bit generator which allows some 
optimisations of the batch shuffle code.
h2. Results

I have implemented the shuffle2 algorithm in the RNG JMH examples module. The 
results are from:
{noformat}
Java version: 21.0.3, vendor: Eclipse Adoptium
MacOS Sonoma 14.6.1
Apple M2 Max{noformat}
The following shows the time per item for various array lengths when using a 
fast RNG (XO_RO_SHI_RO_128_PP):

!shuffle-1.png|width=800,height=600!

This speed-up is comparable to the performance results in the paper for the 
shuffle2 algorithm. Note that the added complexity of shuffle2 does not impact 
small arrays until they are very small.

The following table compares the relative performance when using an RNG of 
different speed for size 16384 (2^14):
|RandomSource|shuffle|shuffle2|Relative|
|JDK|116758.9|61857.5|0.530|
|MWC_256|43716.6|27806.1|0.636|
|XO_RO_SHI_RO_128_PP|28785.2|19496.2|0.677|

The slower the RNG, the greater the performance gain from the batched integer 
generation. Speed-up is around 30% for a fast RNG. The performance gain is 
largest at a size below the threshold for switching to batching (2^15=32768).
h2. Summary

This method provides a noticeable performance gain for shuffling of smaller 
arrays with no penalty for larger arrays.

Note: I did not implement a k=3 or k=4 batch shuffle. The original paper 
reports that the speed-up when moving from k=1 to k=2 is much larger than k=2 
to k=6. Only when using a very slow RNG such as a cryptographically secure 
generator is the additional performance gain more noticeable. When limited to a 
32-bit algorithm only k=3 and k=4 are practical. Future work could add these 
for comparison.

 

> Performance improvements for an array shuffle
> ---------------------------------------------
>
>                 Key: RNG-187
>                 URL: https://issues.apache.org/jira/browse/RNG-187
>             Project: Commons RNG
>          Issue Type: New Feature
>          Components: sampling
>    Affects Versions: 1.6
>            Reporter: Alex Herbert
>            Priority: Minor
>         Attachments: shuffle-1.png, shuffle.png
>
>
> Improve the performance of array shuffling by generation of multiple indices 
> in a range [0, n) from a single sample from a random source.
> The current UniformRandomProvider method to sample an integer in a range [0, 
> n) uses a rejection method based on multiplication with a single use of a 
> slow modulus operator if rejection is required. The author of that method has 
> extended it to allow generation of multiple integers from 2 or more ranges 
> from the same random sample.
> When applied to a shuffle algorithm the speed-up can be significant, 
> typically in the range of up to 2-fold for smaller arrays depending on the 
> speed of the random generator; slower generators see a greater speed-up.
> The method is described in:
> {noformat}
> Nevin Brackett-Rozinsky, Daniel Lemire,
> Batched Ranged Random Integer Generation,
> Software: Practice and Experience (to appear)
> {noformat}
> See [arXiv:2408.06213M|https://arxiv.org/abs/2408.06213].
> See also [Daniel Lemire Blog: Faster random integer generation with 
> batching|https://lemire.me/blog/2024/08/17/faster-random-integer-generation-with-batching/]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to