[ 
https://issues.apache.org/jira/browse/RNG-57?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16641910#comment-16641910
 ] 

Alex D Herbert commented on RNG-57:
-----------------------------------

Apologies in advance for the long post. This stems from the desire to 
understand why my simple changes for RNG-57 have broken the test suite for 
commons-rng-core.

 

I ran the {{ProvidersCommonParametricTest}} test 1000 times with a random seed 
generated by SecureRandom as a 1024 byte array. Code snippet to do this is 
posted in [#comment-16640722].

The raw data were large (>40MB) so I extracted the number of failures for each 
test and have uploaded this to my current PC. Unfortunately I produced the raw 
data incorrectly as the test stops at the first failure and later tests will be 
missing repeats. So for example this test for nextInteger has incomplete data 
because early tests fail:
{code:java}
    @Test
    public void testUniformNextIntegerInRange() {
        checkNextIntegerInRange(4, 1000);
        checkNextIntegerInRange(10, 1000);
        checkNextIntegerInRange(12, 1000);

        // If this fails ...
        checkNextIntegerInRange(31, 1000);

        // Data for the remaining will be missing !!!
        checkNextIntegerInRange(32, 1000);
        checkNextIntegerInRange(2016128993, 1000);
        checkNextIntegerInRange(1834691456, 1000);
        checkNextIntegerInRange(869657561, 1000);
        checkNextIntegerInRange(1570504788, 1000);
    }
{code}
However the number of tests is always above 900 and so I did a Chi squared test 
to determine if the number of failures fits a Binomial distribution. Here are 
some demo results for the MWC_256 method.

The table shows the method under test, the mean number of failures, the p-value 
from a chi squared test matching the histogram of failures to a Binomial and 
then a cumulative histogram of the number of failures. The final column value 
is the number of tests (sometimes less than 1000).
|Method|Mean fails|p(Binomial(n,p=0.01)|Cumul Fails 
(X<=x\|x=0)|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|
|nextInteger(1834691456)|5.027|0.999|6|49|130|256|406|565|710|812|892|923|937|947|950|950|951|
 |
|nextLong(6917529027641081853)|4.904|1|9|41|127|252|418|580|719|828|867|898|920|926|928|
 | | |
|nextDouble()|4.928|0.989|8|42|131|265|446|633|772|876|940|973|989|998|999|1000|
 | |
|nextInteger(32)|4.889|0.997|6|38|130|268|451|619|761|848|909|941|958|964|967|969|
 | |
|nextInteger(10)|4.914|0.991|9|36|121|270|460|636|776|872|940|976|994|997|999|1000|
 | |
|nextInteger(1570504788)|4.902|1|4|38|118|256|429|594|725|809|867|899|918|921|923|925|925|926|
|nextInteger(2016128993)|4.98|1|6|39|110|230|428|585|749|839|900|935|951|954|957|958|
 | |
|nextInteger(31)|4.969|0.996|9|40|132|265|450|606|753|839|901|952|969|975|979|980|
 | |
|nextLong(31)|4.99|0.999|7|32|106|253|429|601|747|845|911|947|964|968|969|970|971|
 |
|nextLong(4611686018427387902)|4.989|1|7|44|122|254|404|572|721|819|876|912|928|940|940|940|941|
 |
|nextLong(32)|5.053|1|7|36|113|240|408|585|728|845|897|929|949|961|963|964| | |
|nextInteger(12)|5.016|0.994|6|44|129|261|436|610|752|856|925|961|980|985|992|993|994|
 |
|nextFloat()|4.953|0.991|8|42|129|261|428|625|777|878|949|974|984|994|998|1000| 
| |
|nextLong(11)|4.922|0.991|7|36|135|272|440|631|776|890|940|973|986|995|997|1000|
 | |
|nextLong(2305843009213693951)|5.046|1|1|33|109|241|401|582|704|826|894|924|941|945|947|949|
 | |
|nextInteger(4)|0|0|1000| | | | | | | | | | | | | | | |
|nextLong(4)|0|0|1000| | | | | | | | | | | | | | | |
|nextLong(19)|4.977|0.997|5|37|130|271|448|620|744|855|915|948|971|983|984|986| 
| |
|nextInteger(869657561)|4.918|1|5|33|98|252|409|599|749|831|886|915|926|934|936|937|
 | |

Here are some conclusions:
 - Repeating the Chi squared test for uniformity 500 times with a critical 
p-value of 0.01 produces counts that can be modelled as a binomial distribution.
 - The mean is 5 (n*p = 500*0.01).
 - The test for nextInteger/Long(4) always passes. The test is performed using 
10 bins and there only 4 discrete values. Basically the test is invalid and the 
Chi square test should be done using 3 degrees of freedom.
 - The test fail threshold of 10 allowed failures corresponds to a cumulative 
probability of 0.9868 (see [#comment-16640891]). This means it will fail 1.32% 
of the time.

Note that currently there are 19 tests using this approach. We can discount 2 
(nextInteger(4) and nextLong(4)) leaving 17. There are also 3 random walk tests 
with a 1% critical value on the range allowed and 2 tests for uniform next 
bytes using a single chi squared p-value of 1%.

So for a single RNG the probability that no tests will fail is:
{noformat}
(1 - 0.0132)^17 * (1 - 0.01)^3 * (1 - 0.01)^2 = 0.7587
{noformat}
The test is performed on 16 variants of the RNG. So the probability no tests 
will fail is:
{noformat}
0.7587^16 = 0.0121
{noformat}
This seems very low. I don't have data for the random walk and nextBytes tests. 
But the data for the 19 variants of the nextUniform tests show the following 
histogram of the highest fail count recorded for a single RNG:
|RNG|0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|
|ISAACRandom|0|0|0|0|0|0|13|85|206|279|217|130|43|19|6|1|0|0|1|
|JDKRandom|0|0|0|0|0|0|9|76|206|275|225|120|60|17|10|0|2| | |
|KISSRandom|0|0|0|0|0|0|18|87|221|272|203|122|46|20|7|3|0|1| |
|MersenneTwister|0|0|0|0|0|0|9|73|209|300|187|141|47|21|6|5|1|1| |
|MersenneTwister64|0|0|0|0|0|0|7|73|204|312|197|120|48|28|9|1|1| | |
|MultiplyWithCarry256|0|0|0|0|0|0|12|82|233|288|206|113|40|21|4|1|0| | |
|SplitMix64|0|0|0|0|0|0|9|67|236|299|200|114|47|21|6|1| | | |
|TwoCmres|0|0|0|0|0|1|10|83|225|287|189|122|52|25|5|1| | | |
|TwoCmres2|0|0|0|0|0|0|12|93|223|276|194|129|43|24|6| | | | |
|Well1024a|0|0|0|0|0|0|8|67|223|283|207|130|47|23|9|3| | | |
|Well19937a|0|0|0|0|0|0|4|82|226|266|210|135|49|16|9|2|1| | |
|Well19937c|0|0|0|0|0|0|11|83|192|273|239|123|45|21|10|2|1| | |
|Well44497a|0|0|0|0|0|0|9|90|242|276|198|108|37|28|9|3| | | |
|Well44497b|0|0|0|0|0|0|11|68|232|271|234|111|41|18|8|5|1|0| |
|Well512a|0|0|0|0|0|0|13|87|214|263|211|121|54|27|4|4|1|1| |
|XorShift1024Star|0|0|0|0|0|0|14|63|210|283|240|102|60|18|4|6| | | |
|Total|0|0|0|0|0|1|169|1259|3502|4503|3357|1941|759|347|112|38|8|3|1|
|Cumulative| 
|0|0|0|0|1|170|1429|4931|9434|12791|14732|15491|15838|15950|15988|15996|15999|16000|
|Mean pass rate| 
|0|0|0|0|0.0000625|0.010625|0.0893125|0.3081875|0.589625|0.7994375|0.92075|0.9681875|0.989875|0.996875|0.99925|0.99975|0.9999375|1|

 
 The mean pass rate for a single RNG is 78.9% if the allowed fail count is 10. 
This matches the theory of
{noformat}
(1 - 0.0132)^17 = 0.7978
{noformat}
Here is the histogram for the highest fail count across the 16 RNGs:
 |Max Fail (x)|Count|Frequency|Cumulative|P(X<=x)|
|10|24|0.024|24|0.024|
|11|252|0.252|276|0.276|
|12|327|0.327|603|0.603|
|13|254|0.254|857|0.857|
|14|95|0.095|952|0.952|
|15|36|0.036|988|0.988|
|16|8|0.008|996|0.996|
|17|3|0.003|999|0.999|
|18|1|0.001|1000|1|
 
So the entire test suite of nextUniform tests only passes 2.4% of the time from 
1000 repeats. This matches the theory of:
{noformat}
0.7978^16 = 0.0269
{noformat}
I.e. 97.3% of the time the test suite will fail.

I think it is fair to say that the test suite is operating as expected. All the 
RNGs can be modelled as perfect uniform providers in that they pass the tests 
with the expected probabilities.

However due to the number of tests being run the test suite is likely to fail 
if random seeds are used.

Q. What to do about it?

A simple solution is to increase the cut-off for failures in the repeat 
Chi-square test to 15. Then the suite will pass a lot more. This is equivalent 
to setting the p-value for the test from 0.0132 to <0.0001. So actually makes 
the test very weak as a standalone test.

A [Bonferroni correction|https://en.wikipedia.org/wiki/Bonferroni_correction] 
could be used to get a different p-value other than 1%. This also makes each 
individual test very weak and relies on the suite having a lot of tests.

An alternative approach would be to reduce the number of tests per RNG. These 
could be limited to nextBytes and either nextInt or nextLong depending on the 
type of provider. The later may be redundant and the abstract 
{{IntProvider/LongProvider}} implementation of nextInt(int)/nextLong(long) 
tested using an external source of randomness, i.e. something not in the core 
library. SecureRandom can be configured (at least in JDK 8) to use the 
underlying operating systems source of entropy for seeding and should be a 
strong source of randomness. So a test could be written for the abstract 
providers using e.g.
{code:java}
public class SecureRandom32 extends IntProvider {
    /** Delegate. */
    private SecureRandom delegate;

    /**
     * Creates an instance.
     */
    public SecureRandom32() {
        delegate = new SecureRandom();
    }

    @Override
    public int next() {
        return delegate.nextInt();
    }
}
{code}
The tests for preconditions and save/restore state can be in one class. A test 
for randomness could be moved to a dedicated class for the provider depending 
on the type of random source.

If future providers are developed that do not produce a source of int/long 
randomness, e.g. directly a source of float randomness, then those can be 
tested differently.

If the number of tests is limited to just 1 (i.e. nextBytes) then a p-value of 
1% can be used and the suite will pass at the following rate for 16 providers:
{noformat}
0.99^16 = 0.85
{noformat}
Obviously this will have to be revisited if future providers are added.

However it will allow the suite to run most of the time with a random seed and 
Travis can handle repeats. It is not very convenient for local development 
though as builds will fail a fair amount. Possibly fixed using fixed seeds for 
local builds and a random seed configured via a Maven profile for Travis.

 

> CachedUniformRandomProvider for nextBoolean() and nextInt()
> -----------------------------------------------------------
>
>                 Key: RNG-57
>                 URL: https://issues.apache.org/jira/browse/RNG-57
>             Project: Commons RNG
>          Issue Type: Improvement
>          Components: sampling
>    Affects Versions: 1.2
>            Reporter: Alex D Herbert
>            Priority: Minor
>              Labels: performance
>
> Implement a wrapper around a {{UniformRandomProvider}} that can cache the 
> underlying source of random bytes for use in the methods {{nextBoolean()}} 
> and {{nextInt()}} (in the case of {{LongProvider}}). E.g.
> {code:java}
> LongProvider provider = RandomSource.create(RandomSource.SPLIT_MIX_64);
> CachedLongProvider rng = new CachedLongProvider(provider);
> // Uses cached nextLong() 64 times
> rng.nextBoolean();
> // Uses cached nextLong() twice
> rng.nextInt();
> IntProvider provider = RandomSource.create(RandomSource.KISS);
> CachedIntProvider rng2 = new CachedIntProvider(provider);
> // Uses cached nextInt() 32 times
> rng2.nextBoolean();
> // This could be wrapped by a factory method:
> UniformRandomProvider rng = CachedUniformRandomProviderFactory.wrap(
>         // Any supported source: IntProvider or LongProvider
>         RandomSource.create(RandomSource...));
> {code}
> The implementation should be speed tested to determine the benefit for 
> {{nextBoolean()}} and if {{nextInt()}} can be improved for {{LongProviders}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to