Re: [statistics] Release 1.0

Alex Herbert Wed, 15 Sep 2021 15:46:34 -0700

On Wed, 15 Sept 2021 at 17:10, Gilles Sadowski <gillese...@gmail.com> wrote:


> Hello.
>
> On Tue, 14 Sept 2021 at 17:13, Alex Herbert <alex.d.herb...@gmail.com>
> wrote:
> >
> > The statistics component is a candidate for a release.
> >
> > The statistics distributions module contains mature functionality
> > ported from CM. The dependency on [numbers] is now satisfied as that
> > has had an official release. There is nothing outstanding in the
> > project Jira. Thus a first release of this component can be performed.
> >
> > Items before a release:
> >
> > - Remaining Jira tickets should be checked and resolved
>
> Can we set to "resolved" the following two:
>   https://issues.apache.org/jira/projects/STATISTICS/issues/STATISTICS-3


The JDK Math functions exp, log1p and pow (used in place of CM
AccurateMath) are supposed to be within 1 ULP of the correct value, so the
accuracy issues are due to rounding of intermediates. Looking at the tests,
the tolerance for the mean ULP in one test has increased from 1.5 to 2.
This seems an acceptably low ULP error from the exact result. In the other
test the tolerance increased from 220 to 230 ULP for the standard deviation
of the ULP (which must have a mean below 160). Although high in ULPs, the
increase is less than 5% of a tolerance which appears rather arbitrary to
begin with (and was probably chosen just to make the test pass). I would
say this ticket is not a problem.
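
For reference, a minimal sketch of how an error in ULPs and its mean and
standard deviation could be measured. This is not the code used in the
existing tests; the class and method names are made up for illustration,
the reference value is assumed to be available as a BigDecimal, and the
sample (n - 1) form of the standard deviation is an assumption.

```java
import java.math.BigDecimal;

/** Hypothetical helper; not part of the statistics test suite. */
final class UlpError {
    /**
     * Absolute error of a computed value, in units of the ULP of the
     * exact value rounded to double. The computed value must be finite
     * (new BigDecimal(double) rejects NaN and infinity).
     */
    static double ulpError(double computed, BigDecimal exact) {
        BigDecimal diff = new BigDecimal(computed).subtract(exact).abs();
        return diff.doubleValue() / Math.ulp(exact.doubleValue());
    }

    /** Arithmetic mean of the errors. */
    static double mean(double[] values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Sample standard deviation (divides by n - 1). */
    static double standardDeviation(double[] values) {
        double m = mean(values);
        double ss = 0;
        for (double v : values) {
            double d = v - m;
            ss += d * d;
        }
        return Math.sqrt(ss / (values.length - 1));
    }

    public static void main(String[] args) {
        // Reference digits of e; Math.exp(1) is documented to be within 1 ULP.
        BigDecimal e = new BigDecimal(
            "2.71828182845904523536028747135266249775724709369995");
        System.out.println(ulpError(Math.exp(1.0), e));
    }
}
```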


>
>   https://issues.apache.org/jira/projects/STATISTICS/issues/STATISTICS-25


Here the implemented fix computes results almost as well as Python and R,
which compute up to 1e10 to 1e20 degrees of freedom versus the threshold of
2.99e6 used in statistics. I would say it is not resolved but is not a
blocker.


>
> ?
>
> It would be nice to have
>   https://issues.apache.org/jira/projects/STATISTICS/issues/STATISTICS-9
> in an "examples" module.
>

OK. The examples module can be based on code in RNG which uses picocli to
build programs. I suggest a program that accepts the name of the
distribution as the command. Each command would have inputs for the
distribution parameters, the range of points to evaluate and the number of
steps in the range. It would output CSV in the format:

x,pdf(x)

For example for the exponential:

java -jar statistics.jar exp --mean 3.45 --min 0 --max 20 --steps 200 --function cdf

This should not be too complicated.
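
A rough sketch of the core loop, to show the amount of code involved. It
is plain Java with no picocli and no call into the statistics API: the
exponential functions are written out by hand so the block is
self-contained, and the class and method names are hypothetical. It also
assumes that "steps" means the number of intervals, so steps + 1 points
are written.

```java
import java.util.function.DoubleUnaryOperator;

/** Hypothetical sketch; a real command would delegate to the distribution classes. */
final class DensityTable {
    /** Hand-written exponential density with the given mean. */
    static DoubleUnaryOperator exponentialPdf(double mean) {
        return x -> x < 0 ? 0 : Math.exp(-x / mean) / mean;
    }

    /** Hand-written exponential cumulative probability with the given mean. */
    static DoubleUnaryOperator exponentialCdf(double mean) {
        return x -> x <= 0 ? 0 : -Math.expm1(-x / mean);
    }

    /** Evaluates f at steps + 1 evenly spaced points in [min, max] and formats as CSV. */
    static String toCsv(String name, DoubleUnaryOperator f,
                        double min, double max, int steps) {
        StringBuilder sb = new StringBuilder();
        sb.append("x,").append(name).append("(x)\n");
        for (int i = 0; i <= steps; i++) {
            double x = min + (max - min) * i / steps;
            sb.append(x).append(',').append(f.applyAsDouble(x)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Mirrors: exp --mean 3.45 --min 0 --max 20 --steps 200 --function cdf
        System.out.print(toCsv("cdf", exponentialCdf(3.45), 0, 20, 200));
    }
}
```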

If the desire is to generate data for figures with multiple
parameterisations then the parameters can be comma delimited:

java -jar statistics.jar gamma --shape 1,2,3 --scale 2,2,2 --min 0 --max 20 --steps 200 --function pdf

Output would be multiple columns:

x,gamma(shape=1;scale=2),gamma(shape=2;scale=2),gamma(shape=3;scale=2)
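
A possible way to turn the comma-delimited options into column labels in
this style, again as a sketch only. The names are hypothetical, and it
assumes the i-th shape is paired with the i-th scale and that mismatched
lengths are an error; with picocli the splitting could probably be left to
the option parser instead.

```java
import java.util.StringJoiner;

/** Hypothetical sketch of parameter parsing and header generation. */
final class ParameterColumns {
    /** Parses a comma-delimited list such as "1,2,3". */
    static double[] parse(String list) {
        String[] tokens = list.split(",");
        double[] values = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            values[i] = Double.parseDouble(tokens[i].trim());
        }
        return values;
    }

    /** Builds the CSV header; shape[i] is paired with scale[i]. */
    static String header(String name, double[] shape, double[] scale) {
        if (shape.length != scale.length) {
            throw new IllegalArgumentException("Parameter count mismatch: "
                + shape.length + " != " + scale.length);
        }
        StringJoiner joiner = new StringJoiner(",");
        joiner.add("x");
        for (int i = 0; i < shape.length; i++) {
            joiner.add(name + "(shape=" + format(shape[i])
                + ";scale=" + format(scale[i]) + ")");
        }
        return joiner.toString();
    }

    /** Prints moderate whole numbers without a trailing ".0". */
    static String format(double v) {
        return v == Math.rint(v) && Math.abs(v) < 1e15
            ? Long.toString((long) v)
            : Double.toString(v);
    }

    public static void main(String[] args) {
        System.out.println(header("gamma", parse("1,2,3"), parse("2,2,2")));
    }
}
```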

An alternative is to have the input points determined by a file:

java -jar statistics.jar gamma --shape 1,2,3 --scale 2,2,2 --points input.txt --function pdf

Functions to support are: pdf, cdf, inverse cdf and survival probability.
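
To illustrate the dispatch on --function, here is a sketch using the
exponential distribution, where all four functions have a closed form. The
option values "icdf" and "survival" and all names in the code are
assumptions for the example; a real command would call the corresponding
methods on the distribution rather than these hand-written formulas.

```java
import java.util.Locale;

/** Hypothetical sketch of selecting the function to evaluate. */
final class ExponentialFunctions {
    /** The function kinds; the option spellings are an assumption. */
    enum Kind {
        PDF, CDF, ICDF, SURVIVAL;

        /** Case-insensitive lookup, e.g. "cdf" gives CDF. */
        static Kind parse(String option) {
            return valueOf(option.trim().toUpperCase(Locale.ROOT));
        }
    }

    /**
     * Evaluates the chosen function of an exponential distribution.
     * For ICDF the argument is a probability in [0, 1].
     */
    static double evaluate(Kind kind, double mean, double x) {
        switch (kind) {
            case PDF:
                return x < 0 ? 0 : Math.exp(-x / mean) / mean;
            case CDF:
                return x <= 0 ? 0 : -Math.expm1(-x / mean);
            case SURVIVAL:
                return x <= 0 ? 1 : Math.exp(-x / mean);
            case ICDF:
                if (x < 0 || x > 1) {
                    throw new IllegalArgumentException("Not a probability: " + x);
                }
                return -mean * Math.log1p(-x);
            default:
                throw new AssertionError(kind);
        }
    }

    public static void main(String[] args) {
        System.out.println(evaluate(Kind.parse("survival"), 3.45, 2.0));
    }
}
```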

Thoughts on this?

Alex
