Re: [rng] RNG-101 new MarsagliaTsangWang discrete probability sampler

2019-05-11 Thread Alex Herbert


> On 10 May 2019, at 15:07, Gilles Sadowski  wrote:
> 
> Hi.
> 
> On Fri, 10 May 2019 at 15:53, Alex Herbert wrote:
>> 
>> 
>> On 10/05/2019 14:27, Gilles Sadowski wrote:
>>> Hi Alex.
>>> 
>>> On Fri, 10 May 2019 at 13:57, Alex Herbert wrote:
 Can I get a review of the PR for RNG-101 please.
>>> Thanks for this work!
>>> 
>>> I didn't go into the details; however, I see many fields and methods like
>>>   table1 ... table5
>>>   fillTable1 ... fillTable5
>>>   getTable1 ... getTable5
>>> Wouldn't it be possible to use a 2D table:
>>>   table[5][];
>>> so that e.g. only one "fillTable(int tableIndex, /* other args */)" method
>>> is necessary (where "tableIndex" runs from 0 to 4)?
>> 
>> Yes. The design is based around using 5 tables as per the example code.
>> 
>> The sample() method knows which table it needs so it can directly jump
>> to the table in question. I'd have to look at the difference in speed
>> when using a 2D table as you are adding another array access but
>> reducing the number of possible method calls (although you still need a
>> method call). Maybe this will be optimised out by the JVM.
>> 
>> If the speed is not a factor then I'll rewrite it. Otherwise it's
>> probably better done for speed as this is the entire point of the
>> sampler given it disregards any probability under 2^-31 (i.e. it's not a
>> perfectly fair sampler).
>> 
>> Note that 5 tables are needed for the 5 base-2^6 digits. The paper
>> states that using 3 tables with base 2^10 gives a speed increase
>> (roughly 1.16x) at the cost of storage (roughly 9x). Changing to 2
>> tables with base 2^15 does not make it much faster again.
>> 
>> I'll have a rethink to see if I can make the design work for different
>> base sizes.
> 
> That could be an extension made easier with the 2D table, but
> I quite agree that given the relatively minor speed improvement
> to be expected, it is not the main reason; the rationale was just to
> make the code more compact and a little easier to grasp (IMHO).
> 
> Gilles
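
For illustration, the 2D-table layout suggested in the quoted message might
look roughly like the following (a sketch only; the fill method's arguments
are hypothetical and this is not the code from the PR):

    // One 2D array replaces the five table1..table5 fields.
    private final int[][] tables = new int[5][];

    // A single fill method replaces fillTable1..fillTable5; the argument
    // names here are hypothetical.
    private void fillTable(int tableIndex, int from, int to, int value) {
        for (int i = from; i < to; i++) {
            tables[tableIndex][i] = value;
        }
    }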

I've taken a more extensive look at the implications of changing the
implementation of the algorithm. This tested: 1D versus 2D tables; storage
accessed through an interface to allow dynamic table types; base 2^6 versus
base 2^10 for the algorithm; and allowing the base to be chosen. Results are
in the Jira ticket. In short, 2D arrays are slower, and supporting a choice of
backing storage or algorithm base is also slower.

Supporting the Poisson and Binomial samplers requires only 16-bit storage. So
a dedicated sampler using base 2^6 and short for the tables will be the best
compromise between storage space and speed. The base-2^10 sampler is faster
but takes about 9-10x more space in memory.
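
For reference, the five-table lookup described above has roughly this shape (a
sketch under stated assumptions, not the code from the PR):

    // A rough sketch of the five-table lookup for base-2^6 digits
    // (illustrative only). It assumes the probabilities have already been
    // converted to fractions of 2^30, each sample value has been entered
    // into table k once per unit of its k-th base-64 digit, and the
    // cumulative thresholds have been precomputed as
    // n1 = table1.length << 24, n2 = n1 + (table2.length << 18), and so on.
    public int sample() {
        // A 30-bit uniform deviate.
        final int j = rng.nextInt() >>> 2;
        if (j < n1) {
            return table1[j >>> 24];
        }
        if (j < n2) {
            return table2[(j - n1) >>> 18];
        }
        if (j < n3) {
            return table3[(j - n2) >>> 12];
        }
        if (j < n4) {
            return table4[(j - n3) >>> 6];
        }
        return table5[j - n4];
    }

Each branch jumps straight to one table, which is why the flat layout avoids
the extra indirection of a 2D array; the same structure with 10-bit digits
needs only three tables but roughly 9x the storage, as noted above.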

Note that I originally wrote the sampler to use only 16-bit storage. I then
modified it to use dynamic storage without measuring the performance, and so
made it slightly slower.

The question is: does the library even need a 32-bit storage implementation?
This would be used for a probability distribution with more than 2^16
different possible samples, which I think would be an edge case. Here the
memory requirements will be in the tens of MB at a minimum, but may balloon to
become much larger. In this case a different algorithm, such as the Alias
method or a guide table, has more predictable memory use, as it only requires
12 bytes of storage per index irrespective of the shape of the probability
distribution.
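
For example, with 2^16 possible sample values the Alias method's tables occupy
about 2^16 x 12 bytes, roughly 0.75 MB, regardless of the distribution's shape
(this figure is just the arithmetic implied by the 12 bytes per index above),
whereas the digit tables described here start in the tens of MB and can grow
much larger for an awkwardly shaped distribution.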

If different implementations (of this algorithm) are added to the library,
then the effect of a sampler that dynamically chooses the storage type and/or
base for the algorithm is noticeable in the performance. In that case they
would be better served by a factory:

class DiscreteProbabilitySamplerFactory {
    DiscreteSampler createDiscreteProbabilitySampler(UniformRandomProvider rng,
                                                     double[] probabilities);
}

But if specifically targeting this algorithm it could be:

class MarsagliaTsangWangDiscreteProbabilitySamplerFactory {
    DiscreteSampler createDiscreteProbabilitySampler(UniformRandomProvider rng,
                                                     double[] probabilities,
                                                     boolean useBase10);
}

Or something similar. The user can then choose to use the base-2^10 algorithm
if memory is not a concern.
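
Hypothetical usage of such a factory might look like this (a sketch only; it
assumes a static factory method and the commons-rng-simple RandomSource API,
with "false" selecting the base-2^6 tables):

    UniformRandomProvider rng = RandomSource.create(RandomSource.SPLIT_MIX_64);
    double[] probabilities = {0.1, 0.2, 0.3, 0.4};
    DiscreteSampler sampler = MarsagliaTsangWangDiscreteProbabilitySamplerFactory
        .createDiscreteProbabilitySampler(rng, probabilities, false);
    int value = sampler.sample();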

I am wary of making this too complicated for just this sampler. So I would
vote for ignoring the base-2^10 version and either sticking with the
interfaced-storage implementation in the current PR, or reverting to the
16-bit storage and not supporting very large distributions. In the latter case
this is at least partially supported by the fact that the method only supports
probabilities down to 2^-31. Anything smaller is set to zero after scaling by
2^30 and rounding. So large probability distributions are more likely to have
values that are misrepresented due to the conversion of probabilities to
fractions with a denominator of 2^30.
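
As an illustration of that limit: a probability of 1e-10 scales to
1e-10 x 2^30, which is about 0.107 and so rounds to zero, meaning that outcome
could never be sampled.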

Thoughts on this?

Alex


> 
>> 
>>> 
>>> The diff for "DiscreteSamplersList.java" refers to
>>>MarsagliaTsangWangBinomialSampler
>>> but
>>>   

Re: [statistics] Mode function for Cauchy distribution

2019-05-11 Thread Gilles Sadowski
Hi.

On Fri, 10 May 2019 at 14:45, Udit Arora wrote:
>
> I am not sure what to say... I completely agree that most distributions have
> undefined statistical values. I don't really have any particular reason for
> adding mode to the interface, like the one mentioned by Sir Alex for mean and
> variance. Please let me know if I should go ahead.

If you don't see a reason, it's reason enough for not doing it. ;-)

Perhaps a more straightforward way to start contributing is to
browse the list of open issues; see e.g. the "Numbers"
project[1].  Help is most needed to progress towards a release,
because "Statistics", and others, depend on it.

Regards,
Gilles

[1] 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20NUMBERS%20AND%20status%20%3D%20Open

>
> On Fri, 10 May 2019, 2:15 am Alex Herbert,  wrote:
>
> >
> >
> > > On 9 May 2019, at 21:17, Eric Barnhill  wrote:
> > >
> > > Awesome!
> > >
> > > On Thu, May 9, 2019 at 10:44 AM Udit Arora 
> > wrote:
> > >
> > >> I will see what I can do. It will take some time, but I will get to know
> > >> more about the other distributions.
> > >>
> > >>
> > >> On Thu, 9 May 2019, 10:58 pm Eric Barnhill, 
> > >> wrote:
> > >>
> > >>> Udit, is it clear what to do here? Gilles recommends you propose some
> > >>> edits to ContinuousDistribution instead, to return Mode and Median.
> > >>>
> > >>> But then, if an interface is altered, all the classes that implement that
> > >>> interface need to have these functions added, so we hope you are up for
> > >>> all that additional work. We can help you.
> >
> > I think it would be prudent to go through all the distributions and see
> > what is defined for each. Wikipedia has a helper table for all its
> > distributions containing:
> >
> > Mean
> > Median
> > Mode
> > Variance
> > Skewness
> > Ex. kurtosis
> > Entropy
> > Fisher Information
> >
> > If many are undefined then you are adding to an interface something not
> > generally supported.
> >
> > Currently the ContinuousDistribution interface only has the mean and the
> > variance. But note that these are used by the inverse cumulative
> > probability method in the base abstract class. Same goes for the
> > DiscreteDistribution.
> >
> > I am +0 for adding more methods. I don’t see a reason not to. But nor do I
> > see a need (within the library) to have them at the interface level if the
> > mode or median for example are not required in a generic way.
> >
> > >>>
> > >>> Last is the idea of accessor methods. If the method starts with get_()
> > >>> then in principle this is just returning a field already present. But
> > >>> with that in mind, I don't know why we already have a method name like
> > >>> getMean() in this interface. We don't really know whether, for a given
> > >>> distribution, that would be a true accessor or need to be calculated.
> > >>> So I think all these method names should just be mean(), mode(),
> > >>> median(), etc.
> > >>>
> > >>> So sorry if this is blowing up into more work than you expected. It
> > >>> often works that way! I certainly think these changes are worthwhile,
> > >>> however.
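
A rough sketch of the shape described above (illustrative only; the current
interface exposes accessors such as getMean()):

    public interface ContinuousDistribution {
        double mean();
        double median();
        double mode();
        double variance();
        // ... the existing density and cumulative probability methods
        // would remain unchanged.
    }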
> > >>>
> > >>>
> > >>>
> > >>> On Thu, May 9, 2019 at 7:17 AM Gilles Sadowski 
> > >>> wrote:
> > >>>
> >  Hi Udit.
> > 
> >  On Thu, 9 May 2019 at 12:52, Udit Arora wrote:
> > >
> > > I intend to add a mode function for the Cauchy Distribution. It is a
> > > small addition which I thought might be helpful.
> > 
> >  How will it be helpful?  I.e. what would an application developer
> >  be able to do, that he can't with the current code?
> > 
> >  You've surely noted that the class you want to modify is but
> >  one of the implementations of the interface "ContinuousDistribution".
> >  So if you propose to change the API, the change should be done
> >  at the interface level, and the appropriate computation performed, or
> >  method overloads defined, for all implementations.
> > 
> >  The "accessor" methods refer to fields that were set by the constructor;
> >  e.g. for "CauchyDistribution", "median" and "scale".
> >  In this case, it happens that "mode" has the same value as "median",
> >  but does this warrant an additional method?
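
As an illustration of that point (a sketch only, not existing API): since the
Cauchy mode coincides with the location parameter, which is also the median,
a mode accessor on CauchyDistribution could simply be:

    public double mode() {
        return getMedian();
    }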
> > 
> >  Regards,
> >  Gilles
> > 
> > > Thanks

Re: Immutable Keys message

2019-05-11 Thread sebb
On Sat, 11 May 2019 at 13:19, Blaise Carie  wrote:
>
> To Whom This May Concern:
>  I received the following message when applying for a job.  I did not
> receive this message when applying for it yesterday, and today I am unable
> to access my application or other jobs.
> Entry.next=null, data[removeIndex]=241265648=true previous=241265648=true
> key=651070971 value=true size=4096 maxSize=4096 Please check that your keys
> are immutable, and that you have used synchronization properly. If so, then
> please report this to dev@commons.apache.org as a bug.

The above is part of an IllegalStateException message that is
generated under certain error conditions.

The information is intended for the developers using the relevant Commons
component as part of their application (in this case it appears to be
Commons Collections).

It is not intended for end-users.
Such exceptions should normally be caught by the application and
brought to the attention of the application developers.

> May you please help me with this error in order to be able to navigate the
> website ALSDE?

You should report the error to the ALSDE website maintainers.
They need to check whether the pre-conditions in the message
(immutable keys, synchronisation) have been complied with.
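
As an illustration of those pre-conditions (the map class below is an
assumption; the message does not name it, only that it appears to come from
Commons Collections):

    import java.util.Collections;
    import java.util.Map;
    import org.apache.commons.collections4.map.LRUMap;

    // A bounded LRU map shared between threads should be wrapped for
    // synchronization, and objects used as keys must not be modified
    // after they have been put into the map.
    Map<String, Boolean> cache =
            Collections.synchronizedMap(new LRUMap<String, Boolean>(4096));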

> Thank you for your help and please let me know if you have any questions.
>
> Sincerely,
>
> Blaise
>
> Blaise Carie
> *Cold Springs High School*
> Science Teacher
> XC and Track Coach
> *Email*: bca...@ccboe.org




Immutable Keys message

2019-05-11 Thread Blaise Carie
To Whom This May Concern:
 I received the following message when applying for a job.  I did not
receive this message when applying for it yesterday, and today I am unable
to access my application or other jobs.
Entry.next=null, data[removeIndex]=241265648=true previous=241265648=true
key=651070971 value=true size=4096 maxSize=4096 Please check that your keys
are immutable, and that you have used synchronization properly. If so, then
please report this to dev@commons.apache.org as a bug.

May you please help me with this error in order to be able to navigate the
website ALSDE?

Thank you for your help and please let me know if you have any questions.

Sincerely,

Blaise

Blaise Carie
*Cold Springs High School*
Science Teacher
XC and Track Coach
*Email*: bca...@ccboe.org


Re: Commons Validator

2019-05-11 Thread sebb
On Sat, 11 May 2019 at 08:30, Andre van der Wal  wrote:
>
> Hi,
>
> When can we expect a new release of the validator?

When someone decides to release it...

> New email domains we
> need were added in April 2017 but the latest release is still from Feb 2017.

The list changes too frequently for releases to keep up with the changes.

You can update the GENERIC_TLDS array programmatically, so long as you
do it before first usage.

http://commons.apache.org/proper/commons-validator/apidocs/org/apache/commons/validator/routines/DomainValidator.html#updateTLDOverride(org.apache.commons.validator.routines.DomainValidator.ArrayType,%20java.lang.String[])
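
A minimal sketch of that approach (assuming a validator release where this
method exists; the added TLD is just a placeholder):

    import org.apache.commons.validator.routines.DomainValidator;
    import org.apache.commons.validator.routines.EmailValidator;

    public class TldOverrideExample {
        public static void main(String[] args) {
            // Must run before the first use of any validator that relies
            // on DomainValidator.
            DomainValidator.updateTLDOverride(
                    DomainValidator.ArrayType.GENERIC_PLUS,
                    new String[] {"newgtld"});  // placeholder TLD

            EmailValidator emailValidator = EmailValidator.getInstance();
            System.out.println(emailValidator.isValid("someone@example.newgtld"));
        }
    }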




Commons Validator

2019-05-11 Thread Andre van der Wal
Hi,

When can we expect a new release of the validator? New email domains we
need were added in April 2017 but the latest release is still from Feb 2017.