[ https://issues.apache.org/jira/browse/RNG-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901147#comment-16901147 ]
Alex D Herbert commented on RNG-109: ------------------------------------ I have created a benchmark that tests different distributions: * 4-sided loaded die: p = 1/2, 1/3, 1/12, 1/12 * Binomial(n=67, p=0.7) * Geometric(p=0.2) * Poisson(mean=3.14) * Bimodal distribution of Poisson(mean=10) + Poisson(mean=20) * Random probability arrays with different sizes The benchmark was performed using XO_RO_SHI_RO_128_PLUS generator. *Sample* This tests the time for a sample from an already created sampler. ||Distribution||BinarySearch||GuideTable||AliasMethod||MarsagliaTsangWang|| |4 Sided Loaded Die|10.91|14.42|8.09|3.31| |Binomial(n=67, p=0.7)|23.11|9.19|8.91|4.94| |Geometric(p=0.2)|21.09|11.28|7.37|4.44| |Poisson(mean=3.14)|17.84|10.81|9.05|3.86| |Bimodal|25.85|10.70|8.12|5.36| |Random 6|16.08|14.97|8.56|3.61| |Random 96|37.24|13.95|9.11|6.99| |Random 3072|63.47|14.18|11.09|6.56| |*Total Result*|*215.59*|*99.50*|*70.30*|*39.06*| The table shows that the samplers should be ranked by speed as: # MarsagliaTsangWang # Alias Method # Guide Table The difference to the MarsagliaTsangWang is larger than the difference between the Alias and guide table samplers. *Single Sample* This tests the time to create a sampler then perform a single sample. ||Distribution||BinarySearch||GuideTable||AliasMethod||MarsagliaTsangWang|| |4 Sided Loaded Die|46.34|102.85|118.91|263.75| |Binomial(n=67, p=0.7)|435.92|681.25|1,510.39|2,305.91| |Geometric(p=0.2)|682.97|1,025.44|1,721.29|3,764.49| |Poisson(mean=3.14)|150.18|190.06|434.88|766.69| |Bimodal|305.20|471.47|838.82|2,682.14| |Random 6|51.91|82.26|150.36|312.62| |Random 96|644.97|812.66|1,333.96|5,528.96| |Random 3072|19,007.48|29,940.95|71,336.64|363,975.19| |*Total Result*|*21,324.97*|*33,306.93*|*77,445.26*|*379,599.75*| The table shows that faster samplers take longer to construct. Note that most of the tested distributions are small (less than 100 probabilities). When the list of probabilities is large (Random 3072) then the construction time for the MarsagliaTsangWang is very large. The Alias method should have construction time of {{Order( n )}}. It typically takes around 1.5 to 2 fold longer to construct than the guide table sampler. The construction time for the large random table is 2.4x longer. Due to the algorithm used to pad the internal alias tables to a power of 2 the construction time may vary in relation to the input data length. The guide table sampler takes around 1.5x longer to construct than the binary search table. This is because both methods create a cumulative probability table but the guide table sampler also writes to the guide table in the same pass over the input data. A second pass is then required to complete the guide table. Construction time for the guide table sampler will be very stable. *Which sampler to use?* The choice of sampler is a compromise between construction time and sample time. The MarsagliaTsangWang is slow to construct and is specialised for probabilities than sum to 2^30. Any probability less than 2^-31 will be ignored. It is not as generally applicable as the other samplers. The choice is between the guide table and alias method. For minimal impact on construction time the guide table method is best. For faster sampling the alias method is best. The benchmark was performed using the fastest generator in the library: XO_RO_SHI_RO_128_PLUS. A slower generator may show larger differences in sample time. A conservative choice would be to use the guide table method. An option could be added to the DiscreteProbabilityCollectionSampler to allow the implementation to be chosen. [PR 62|https://github.com/apache/commons-rng/pull/62] contains an example of how the sampler can be updated and simplified to use a delegate for the collection sampling. Changing the sampler implementation is a single line change in the code. > DiscreteProbabilityCollectionSampler to use an internal DiscreteSampler > ----------------------------------------------------------------------- > > Key: RNG-109 > URL: https://issues.apache.org/jira/browse/RNG-109 > Project: Commons RNG > Issue Type: Improvement > Components: sampling > Affects Versions: 1.3 > Reporter: Alex D Herbert > Assignee: Alex D Herbert > Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > The DiscreteProbabilityCollectionSampler builds a cumulative probability > distribution from an input of items and their associated probabilities. This > is sampled using a binary search (O(log n) time). > The library now contains DiscreteSamplers that can sample efficiently > (approaching O(1) time) from a cumulative probability distribution: > * AliasMethodDiscreteSampler > * GuideTableDiscreteSampler > Test these samplers to determine which is appropriate to use as the algorithm > behind the DiscreteProbabilityCollectionSampler. The test should include: > * Construction time > * Sample time > * A range of distributions (uniform, linear ramp, Poisson, Geometric, > BiModal, TriModel, Random scatter) > -- This message was sent by Atlassian JIRA (v7.6.14#76016)