[ 
https://issues.apache.org/jira/browse/RNG-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901147#comment-16901147
 ] 

Alex D Herbert commented on RNG-109:
------------------------------------

I have created a benchmark that tests different distributions:
 * 4-sided loaded die: p = 1/2, 1/3, 1/12, 1/12
 * Binomial(n=67, p=0.7)
 * Geometric(p=0.2)
 * Poisson(mean=3.14)
 * Bimodal distribution of Poisson(mean=10) + Poisson(mean=20)
 * Random probability arrays with different sizes

The benchmark was performed using XO_RO_SHI_RO_128_PLUS generator.

*Sample*
 This tests the time for a sample from an already created sampler.
||Distribution||BinarySearch||GuideTable||AliasMethod||MarsagliaTsangWang||
|4 Sided Loaded Die|10.91|14.42|8.09|3.31|
|Binomial(n=67, p=0.7)|23.11|9.19|8.91|4.94|
|Geometric(p=0.2)|21.09|11.28|7.37|4.44|
|Poisson(mean=3.14)|17.84|10.81|9.05|3.86|
|Bimodal|25.85|10.70|8.12|5.36|
|Random 6|16.08|14.97|8.56|3.61|
|Random 96|37.24|13.95|9.11|6.99|
|Random 3072|63.47|14.18|11.09|6.56|
|*Total Result*|*215.59*|*99.50*|*70.30*|*39.06*|

The table shows that the samplers should be ranked by speed as:
 # MarsagliaTsangWang
 # Alias Method
 # Guide Table

The difference to the MarsagliaTsangWang is larger than the difference between 
the Alias and guide table samplers.

*Single Sample*
 This tests the time to create a sampler then perform a single sample.
||Distribution||BinarySearch||GuideTable||AliasMethod||MarsagliaTsangWang||
|4 Sided Loaded Die|46.34|102.85|118.91|263.75|
|Binomial(n=67, p=0.7)|435.92|681.25|1,510.39|2,305.91|
|Geometric(p=0.2)|682.97|1,025.44|1,721.29|3,764.49|
|Poisson(mean=3.14)|150.18|190.06|434.88|766.69|
|Bimodal|305.20|471.47|838.82|2,682.14|
|Random 6|51.91|82.26|150.36|312.62|
|Random 96|644.97|812.66|1,333.96|5,528.96|
|Random 3072|19,007.48|29,940.95|71,336.64|363,975.19|
|*Total Result*|*21,324.97*|*33,306.93*|*77,445.26*|*379,599.75*|

The table shows that faster samplers take longer to construct.

Note that most of the tested distributions are small (less than 100 
probabilities). When the list of probabilities is large (Random 3072) then the 
construction time for the MarsagliaTsangWang is very large.

The Alias method should have construction time of {{Order( n )}}. It typically 
takes around 1.5 to 2 fold longer to construct than the guide table sampler. 
The construction time for the large random table is 2.4x longer. Due to the 
algorithm used to pad the internal alias tables to a power of 2 the 
construction time may vary in relation to the input data length.

The guide table sampler takes around 1.5x longer to construct than the binary 
search table. This is because both methods create a cumulative probability 
table but the guide table sampler also writes to the guide table in the same 
pass over the input data. A second pass is then required to complete the guide 
table. Construction time for the guide table sampler will be very stable.

*Which sampler to use?*

The choice of sampler is a compromise between construction time and sample 
time. The MarsagliaTsangWang is slow to construct and is specialised for 
probabilities than sum to 2^30. Any probability less than 2^-31 will be 
ignored. It is not as generally applicable as the other samplers.

The choice is between the guide table and alias method. For minimal impact on 
construction time the guide table method is best. For faster sampling the alias 
method is best.

The benchmark was performed using the fastest generator in the library: 
XO_RO_SHI_RO_128_PLUS. A slower generator may show larger differences in sample 
time.

A conservative choice would be to use the guide table method. An option could 
be added to the DiscreteProbabilityCollectionSampler to allow the 
implementation to be chosen.

[PR 62|https://github.com/apache/commons-rng/pull/62] contains an example of 
how the sampler can be updated and simplified to use a delegate for the 
collection sampling. Changing the sampler implementation is a single line 
change in the code.

 

> DiscreteProbabilityCollectionSampler to use an internal DiscreteSampler
> -----------------------------------------------------------------------
>
>                 Key: RNG-109
>                 URL: https://issues.apache.org/jira/browse/RNG-109
>             Project: Commons RNG
>          Issue Type: Improvement
>          Components: sampling
>    Affects Versions: 1.3
>            Reporter: Alex D Herbert
>            Assignee: Alex D Herbert
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The DiscreteProbabilityCollectionSampler builds a cumulative probability 
> distribution from an input of items and their associated probabilities. This 
> is sampled using a binary search (O(log n) time).
> The library now contains DiscreteSamplers that can sample efficiently 
> (approaching O(1) time) from a cumulative probability distribution:
>  * AliasMethodDiscreteSampler
>  * GuideTableDiscreteSampler
> Test these samplers to determine which is appropriate to use as the algorithm 
> behind the DiscreteProbabilityCollectionSampler. The test should include:
>  * Construction time
>  * Sample time
>  * A range of distributions (uniform, linear ramp, Poisson, Geometric, 
> BiModal, TriModel, Random scatter)
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

Reply via email to