[
https://issues.apache.org/jira/browse/RNG-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18074869#comment-18074869
]
Alex Herbert commented on RNG-54:
---------------------------------
Revisiting this ticket 8 years later...
A formal StringSampler to select from a provided alphabet can now be
implemented using the DiscreteUniformSampler. This has optimised methods to
sample from [0, n). When n is a power of 2 it uses a bit shift on a random
integer, so it should perform as well as the RadixStringSampler for that case.
When n is not a power of 2 it uses a precomputed modulus (2^32 % n) to allow
efficient sampling of the number from the range.
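
As a rough illustration of the two cases described above, the sketch below draws an index in [0, n) from a stream of random 32-bit integers. It is not the Commons RNG source: the class name BoundedIndexSketch, the use of an IntSupplier as the bit source, and the choice of a multiply-and-reject scheme for the non-power-of-2 case are assumptions made for this example. Only the bit shift and the precomputed 2^32 % n value come from the description above.

```java
import java.util.SplittableRandom;
import java.util.function.IntSupplier;

/** Hypothetical sketch of bounded index sampling; not the Commons RNG implementation. */
final class BoundedIndexSketch {
    private final IntSupplier bits;   // source of uniformly random 32-bit ints (assumed)
    private final int n;              // exclusive upper bound, n >= 1
    private final boolean powerOf2;   // true when n == 2^k
    private final int shift;          // 32 - k, used when n == 2^k
    private final long threshold;     // 2^32 % n, used otherwise

    BoundedIndexSketch(IntSupplier bits, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive: " + n);
        }
        this.bits = bits;
        this.n = n;
        this.powerOf2 = (n & (n - 1)) == 0;
        // For n = 2^k, numberOfLeadingZeros(n) == 31 - k, so this is 32 - k.
        this.shift = Integer.numberOfLeadingZeros(n) + 1;
        this.threshold = (1L << 32) % n;
    }

    int sample() {
        if (n == 1) {
            // Avoids an int shift by 32, which Java treats as a shift by 0.
            return 0;
        }
        if (powerOf2) {
            // Keep the upper k bits of the random integer.
            return bits.getAsInt() >>> shift;
        }
        // Multiply an unsigned 32-bit value by n. The upper 32 bits of the
        // product are the candidate index; reject the draw when the lower
        // 32 bits fall below 2^32 % n so every index is equally likely.
        long m;
        do {
            m = (bits.getAsInt() & 0xffffffffL) * n;
        } while ((m & 0xffffffffL) < threshold);
        return (int) (m >>> 32);
    }

    public static void main(String[] args) {
        SplittableRandom rng = new SplittableRandom();
        BoundedIndexSketch hexIndex = new BoundedIndexSketch(rng::nextInt, 16);
        BoundedIndexSketch decimalIndex = new BoundedIndexSketch(rng::nextInt, 10);
        System.out.println(hexIndex.sample() + " " + decimalIndex.sample());
    }
}
```

The rejection loop rarely repeats for a small alphabet, since the rejected fraction is (2^32 % n) / 2^32.
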
An implementation for a provided alphabet and String length can use a
Supplier<String> to create samples, for example:
{code:java}
UniformRandomProvider rng = ...
char[] alphabet = {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b',
                   'c', 'd', 'e', 'f'};
// Sample an index in [0, alphabet.length - 1]; both bounds are inclusive
SharedStateDiscreteSampler s = DiscreteUniformSampler.of(rng, 0,
    alphabet.length - 1);
int length = 10;
// Array can be reused as String::valueOf uses a copy
char[] c = new char[length];
Supplier<String> gen = () -> {
    for (int i = 0; i < length; i++) {
        c[i] = alphabet[s.sample()];
    }
    return String.valueOf(c);
};
// Random hex string of length 10
String sample = gen.get();
{code}
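
The same pattern can be wrapped in a small class so that the alphabet, the length, and the scratch buffer travel together. The following is a minimal sketch under stated assumptions: the class name AlphabetStringSketch is hypothetical, and java.util.SplittableRandom.nextInt(int) stands in for the DiscreteUniformSampler only so that the example runs without Commons RNG on the classpath.

```java
import java.util.SplittableRandom;
import java.util.function.Supplier;

/** Hypothetical sketch; SplittableRandom is a stand-in for the Commons RNG sampler. */
final class AlphabetStringSketch implements Supplier<String> {
    private final SplittableRandom rng;
    private final char[] alphabet;
    private final char[] buffer;   // reused between calls; String.valueOf copies it

    AlphabetStringSketch(SplittableRandom rng, String alphabet, int length) {
        if (alphabet.isEmpty()) {
            throw new IllegalArgumentException("alphabet must not be empty");
        }
        if (length < 0) {
            throw new IllegalArgumentException("length must not be negative: " + length);
        }
        this.rng = rng;
        this.alphabet = alphabet.toCharArray();
        this.buffer = new char[length];
    }

    @Override
    public String get() {
        for (int i = 0; i < buffer.length; i++) {
            // nextInt(bound) returns a value in [0, bound)
            buffer[i] = alphabet[rng.nextInt(alphabet.length)];
        }
        return String.valueOf(buffer);
    }

    public static void main(String[] args) {
        Supplier<String> hex =
            new AlphabetStringSketch(new SplittableRandom(), "0123456789abcdef", 10);
        // Prints a random hex string of length 10
        System.out.println(hex.get());
    }
}
```

Because the buffer is shared between calls, an instance like this (and the Supplier above) should not be used from several threads at once.
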
> StringSampler
> -------------
>
> Key: RNG-54
> URL: https://issues.apache.org/jira/browse/RNG-54
> Project: Commons RNG
> Issue Type: Improvement
> Components: sampling
> Affects Versions: 1.1
> Reporter: Alex Herbert
> Priority: Minor
>
> There is currently no equivalent for the function
> {{org.apache.commons.math3.random.RandomDataGenerator.nextHexString(int)}}.
> Here is the original version adapted to use the {{UniformRandomProvider}}:
> {code:java}
> public String nextHexStringOriginal(UniformRandomProvider ran, int len) {
>     // Initialize output buffer
>     StringBuilder outBuffer = new StringBuilder();
>     // Get int(len/2)+1 random bytes
>     byte[] randomBytes = new byte[(len / 2) + 1];
>     ran.nextBytes(randomBytes);
>     // Convert each byte to 2 hex digits
>     for (int i = 0; i < randomBytes.length; i++) {
>         Integer c = Integer.valueOf(randomBytes[i]);
>         /*
>          * Add 128 to byte value to make interval 0-255 before doing hex
>          * conversion. This guarantees <= 2 hex digits from toHexString();
>          * toHexString would otherwise add 2^32 to negative arguments.
>          */
>         String hex = Integer.toHexString(c.intValue() + 128);
>         // Make sure we add 2 hex digits for each byte
>         if (hex.length() == 1) {
>             hex = "0" + hex;
>         }
>         outBuffer.append(hex);
>     }
>     return outBuffer.toString().substring(0, len);
> }
> {code}
> Note: I removed the length check to make the speed test (see below) fair.
> This makes use of {{StringBuilder}} and is not very efficient. I have created
> a version based on the Hex encoding within
> {{org.apache.commons.codec.digest.DigestUtils}} and
> {{org.apache.commons.codec.binary.Hex}}. This uses a direct look-up of the
> hex character using the index from successive 4 bits of a byte array to form
> an index from 0-15.
> Here's the function without details of how the {{byte[]}} is correctly sized:
> {code:java}
> private static final char[] DIGITS_LOWER = {
>     '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd',
>     'e', 'f' };
> private static String nextHexString(UniformRandomProvider rng, byte[] bytes,
>                                     int length) {
>     rng.nextBytes(bytes);
>     // Use the upper and lower 4 bits of each byte as an
>     // index in the range 0-15 for each hex character.
>     final char[] out = new char[length];
>     // Run the loop without checking index j by producing characters
>     // up to the size below the desired length.
>     final int loopLimit = length / 2;
>     int i = 0, j = 0;
>     while (i < loopLimit) {
>         final byte b = bytes[i];
>         // 0x0F == 0x01 | 0x02 | 0x04 | 0x08
>         out[j++] = DIGITS_LOWER[(b >>> 4) & 0x0F];
>         out[j++] = DIGITS_LOWER[b & 0x0F];
>         i++;
>     }
>     // The final character
>     if (j < length) {
>         out[j++] = DIGITS_LOWER[(bytes[i] >>> 4) & 0x0F];
>     }
>     return new String(out);
> }
> {code}
> I've compared this to the original function and a modified one below that
> computes the exact same strings:
> {code:java}
> public String nextHexStringModified(UniformRandomProvider ran, int len) {
>     // Initialize output buffer
>     StringBuilder outBuffer = new StringBuilder();
>     // byte[] randomBytes = new byte[(len/2) + 1]; // ORIGINAL
>     byte[] randomBytes = new byte[(len + 1) / 2];
>     ran.nextBytes(randomBytes);
>     // Convert each byte to 2 hex digits
>     for (int i = 0; i < randomBytes.length; i++) {
>         // ORIGINAL
>         // Integer c = Integer.valueOf(randomBytes[i]);
>         // String hex = Integer.toHexString(c.intValue() + 128);
>         String hex = Integer.toHexString(randomBytes[i] & 0xff);
>         // Make sure we add 2 hex digits for each byte
>         if (hex.length() == 1) {
>             outBuffer.append('0');
>         }
>         outBuffer.append(hex);
>     }
>     return outBuffer.toString().substring(0, len);
> }
> {code}
> The timings are:
>
> ||Name||Time||Relative||
> |StringSampler|316103|0.073|
> |nextHexStringModified|3708104|0.853|
> |nextHexStringOriginal|4348063|1.000|
> These timings were not measured with JMH, but they show that the
> {{StringSampler}} method performs markedly better.
> The full {{StringSampler}} class supports a radix of 2, 8, and 16 for binary,
> octal and hex strings.
> JUnit tests show: the sampler computes the same values as
> {{nextHexStringModified(int)}}; edge cases are handled with exceptions; and
> the output strings are uniform for each of the supported character sets
> (using a Chi Squared test).
> Can I create a PR for a {{org.apache.commons.rng.sampling.StringSampler}}?
>