> On 10 May 2019, at 15:07, Gilles Sadowski <gillese...@gmail.com> wrote:
> 
> Hi.
> 
> On Fri, 10 May 2019 at 15:53, Alex Herbert <alex.d.herb...@gmail.com> wrote:
>> 
>> 
>> On 10/05/2019 14:27, Gilles Sadowski wrote:
>>> Hi Alex.
>>> 
>>> On Fri, 10 May 2019 at 13:57, Alex Herbert <alex.d.herb...@gmail.com> wrote:
>>>> Can I get a review of the PR for RNG-101 please.
>>> Thanks for this work!
>>> 
>>> I didn't go into the details; however, I see many fields and methods like
>>>   table1 ... table5
>>>   fillTable1 ... fillTable5
>>>   getTable1 ... getTable5
>>> Wouldn't it be possible to use a 2D table:
>>>   table[5][];
>>> so that e.g. only one "fillTable(int tableIndex, /* other args */)" method
>>> is necessary (where "tableIndex" runs from 0 to 4)?
>> 
>> Yes. The design is based around using 5 tables as per the example code.
>> 
>> The sample() method knows which table it needs so it can directly jump
>> to the table in question. I'd have to look at the difference in speed
>> when using a 2D table as you are adding another array access but
>> reducing the number of possible method calls (although you still need a
>> method call). Maybe this will be optimised out by the JVM.
>> 
>> If the speed is not a factor then I'll rewrite it. Otherwise it's
>> probably better done for speed as this is the entire point of the
>> sampler given it disregards any probability under 2^-31 (i.e. it's not a
>> perfectly fair sampler).
>> 
>> Note that 5 tables are needed for the 5 base-64 digits (base 2^6) of the
>> 30-bit fraction. The paper states that using 3 tables of base 2^10 gives a
>> speed increase (roughly 1.16x) at the cost of storage (roughly 9x). Changing
>> to 2 tables of base 2^15 does not make it much faster again.
>> 
>> I'll have a rethink to see if I can make the design work for different
>> base sizes.
> 
> That could be an extension made easier with the 2D table, but
> I quite agree that given the relatively minor speed improvement
> to be expected, it is not the main reason; the rationale was just to
> make the code more compact and a little easier to grasp (IMHO).
> 
> Gilles

I’ve taken a more extensive look at the implications of changing the 
implementation of the algorithm. This tested: 1D or 2D tables; storage behind 
an interface so the table type can be chosen dynamically; base 6 or base 10 
(i.e. base 2^6 or 2^10) for the algorithm; and allowing the base to be chosen. 
Results are in the Jira ticket. In short, 2D arrays are slower, and supporting 
a choice of backing storage or base for the algorithm is also slower.
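
For illustration, the two table layouts that were compared look roughly like 
this (a sketch with invented field and method names, not code from the PR):

class TableLayoutSketch {
    // 1D layout: one field per table; sample() reads the field it needs
    // directly.
    private final short[] table1;
    private final short[] table2; // ... table3 to table5 omitted

    // 2D layout: a single jagged array; sample() must first index into it,
    // adding an extra array access (and bounds check) on the hot path.
    private final short[][] tables;

    TableLayoutSketch(short[][] tables) {
        this.tables = tables;
        this.table1 = tables[0];
        this.table2 = tables[1];
    }

    int sample1D(int index) {
        return table1[index] & 0xffff;
    }

    int sample2D(int t, int index) {
        return tables[t][index] & 0xffff;
    }
}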

Supporting the Poisson and Binomial samplers only requires 16-bit storage, so a 
dedicated sampler using base 6 and short for the tables will be the best 
compromise between storage space and speed. The base 10 sampler is faster but 
takes about 9-10x more memory.
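
To show where the base choice enters (a sketch of my understanding, not the PR 
code): each probability is scaled to a fraction with denominator 2^30, and the 
scaled value is split into digits; the digit values drive the table sizes, 
which is where the storage difference comes from.

final class DigitSketch {
    /** Split a 30-bit scaled probability into digits of the given width. */
    static int[] digits(int scaled, int bitsPerDigit) {
        int nDigits = 30 / bitsPerDigit;  // 5 digits for base 2^6, 3 for base 2^10
        int mask = (1 << bitsPerDigit) - 1;
        int[] d = new int[nDigits];
        for (int i = nDigits - 1; i >= 0; i--) {
            d[i] = scaled & mask;
            scaled >>>= bitsPerDigit;
        }
        return d;
    }
}
// digits(scaled, 6)  -> five digits in [0, 63]:    five small tables
// digits(scaled, 10) -> three digits in [0, 1023]: three larger tables (the
//                       9-10x memory noted above)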

Note that I originally wrote the sampler to use only 16-bit storage. I then 
modified it to use dynamic storage without measuring the performance, which 
made it slightly slower.

The question is: does the library even need a 32-bit storage implementation? 
This would only be used for a probability distribution with more than 2^16 
different possible samples, which I think is an edge case. Here the memory 
requirements will be in the tens of MB at a minimum but may balloon to 
something much larger. In that case a different algorithm, such as the Alias 
method or a guide table, is more memory stable as it only requires 12 bytes of 
storage per index, irrespective of the shape of the probability distribution.
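
For a sense of scale, reading the 12 bytes as one 8-byte probability plus one 
4-byte alias index per entry (my own back-of-envelope figure):

public class AliasMemorySketch {
    public static void main(String[] args) {
        int n = 1 << 16;  // a distribution with 65536 possible sample values
        long bytes = (long) n * (Double.BYTES + Integer.BYTES);  // 12 bytes per index
        System.out.println(bytes / 1024 + " KiB");  // prints "768 KiB", whatever the shape
    }
}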

If different implementations (of this algorithm) are added to the library then 
the performance cost of a sampler that dynamically chooses the storage and/or 
base for the algorithm is noticeable. In that case they would be better served 
by a factory:

class DiscreteProbabilitySamplerFactory {
    DiscreteSampler createDiscreteProbabilitySampler(UniformRandomProvider rng,
                                                     double[] probabilities);
}

But if specifically targeting this algorithm it could be:

class MarsagliaTsangWangDiscreteProbabilitySamplerFactory {
    DiscreteSampler createDiscreteProbabilitySampler(UniformRandomProvider rng,
                                                     double[] probabilities,
                                                     boolean useBase10);
}

Or something similar. The user can then choose to use a base 10 algorithm if 
memory is not a concern.
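
For example, usage might then look something like this (the factory is the 
hypothetical one sketched above; the rest assumes the usual commons-rng 
classes):

import org.apache.commons.rng.UniformRandomProvider;
import org.apache.commons.rng.sampling.distribution.DiscreteSampler;
import org.apache.commons.rng.simple.RandomSource;

public class FactoryUsageSketch {
    public static void main(String[] args) {
        UniformRandomProvider rng = RandomSource.create(RandomSource.SPLIT_MIX_64);
        double[] probabilities = {0.1, 0.2, 0.3, 0.4};

        // false = base 6 with compact tables; true would trade memory for speed.
        DiscreteSampler sampler = new MarsagliaTsangWangDiscreteProbabilitySamplerFactory()
            .createDiscreteProbabilitySampler(rng, probabilities, false);

        System.out.println(sampler.sample());
    }
}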

I am wary of making this too complicated for just this sampler. So I would vote 
for ignoring the base 10 version and either sticking to the interfaced storage 
implementation in the current PR or reverting back to the 16-bit storage and 
not supporting very large distributions. The latter case is at least partially 
supported by the fact that the method only supports probabilities down to 
2^-31; anything smaller is set to zero after scaling by 2^30 and rounding. So 
large probability distributions are more likely to have values that are 
misrepresented by the conversion of probabilities to fractions with a 
denominator of 2^30.
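
To make the cut-off concrete, the scaling step is roughly (a sketch, not the PR 
code):

final class FractionSketch {
    /** Convert a probability to a fraction with denominator 2^30. */
    static int toFraction(double p) {
        // Anything below 2^-31 rounds to a numerator of 0 and is dropped.
        return (int) Math.round(p * (1 << 30));
    }
}
// toFraction(0x1.0p-31) == 1   (smallest probability that survives)
// toFraction(0x1.0p-32) == 0   (rounded away)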

Thoughts on this?

Alex


> 
>> 
>>> 
>>> The diff for "DiscreteSamplersList.java" refers to
>>>    MarsagliaTsangWangBinomialSampler
>>> but
>>>   MarsagliaTsangWangSmallMeanPoissonSampler
>>> seems to be missing.
>> 
>> Oops, I missed adding that back. I built the PR from code where I was
>> testing lots of implementations.
>> 
>> I've just added it back and it is still passing locally. Travis should
>> see that too as I pushed the change to the PR.
>> 
>>> 
>>> Regards,
>>> Gilles
>>> 
>>>> This is a new sampler based on the source code from the paper:
>>>> 
>>>> George Marsaglia, Wai Wan Tsang, Jingbo Wang (2004)
>>>> Fast Generation of Discrete Random Variables.
>>>> Journal of Statistical Software, Vol. 11, Issue 3, pp. 1-11.
>>>> 
>>>> https://www.jstatsoft.org/article/view/v011i03
>>>> 
>>>> The code has no explicit licence.
>>>> 
>>>> The paper states:
>>>> 
>>>> "We have provided C versions of the two methods described here, for
>>>> inclusion in the “Browse
>>>> files”section of the journal. ... You may then want to examine the
>>>> components of the two files, for illumination
>>>> or for extracting portions that might be usefully applied to your
>>>> discrete distributions."
>>>> 
>>>> So I am assuming that it can be incorporated with little modification.
>>>> 
>>>> The Java implementation has been rewritten to allow the storage to be
>>>> optimised for the required size. The generation of the tables has been
>>>> adapted appropriately and checks have been added on the input parameters
>>>> to ensure the sampler does not generate exceptions once constructed (I
>>>> found out the hard way that the original code was not entirely correct).
>>>> 
>>>> Thanks.
>>>> 
>>>> Alex
> 
