[ 
https://issues.apache.org/jira/browse/STATISTICS-85?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Herbert updated STATISTICS-85:
-----------------------------------
    Description: 
Add a quantile implementation. This will interpolate the value of a sorted 
array of data for probability p in [0, 1].

Replace the legacy API from Commons Math Percentile with an updated API. The 
new API should:
 * Decouple estimation of quantile positions inside data of length n; and the 
selection of correctly-ordered indices in array data.
 * Support multiple data types.
 * Support pre-sorted data.
 * Avoid performance issues observed in the CM Percentile implementation.

h2. Proposed API
{code:java}
org.apache.commons.statistics.descriptive

public final class Quantile {
    // overwrite=true; EstimationMethod.HF8; NaNPolicy.ERROR
    public static Quantile withDefaults();
    public Quantile withOverwrite(boolean);
    public Quantile with(EstimationMethod);

    // Could support NaN handling ... see below for NaNPolicy
    public Quantile with(NaNPolicy);

    // Create n uniform probabilities in range [p1, p2]
    public static double[] probabilities(int n);
    public static double[] probabilities(int n, double p1, double p2);

    // Quantiles on sorted data a of size n
    public double evaluate(int n, java.util.function.IntToDoubleFunction a, 
double p);
    public double[] evaluate(int n, java.util.function.IntToDoubleFunction a, 
double... p);

    // Quantiles on the primitive types that cannot be easily sorted
    public double evaluate(double[] a, double p);
    public double[] evaluate(double[] a, double... p);
    public double evaluate(int[] a, double p);
    public double[] evaluate(int[] a, double... p);
    public double evaluate(long[] a, double p);
    public double[] evaluate(long[] a, double... p);
    public double evaluate(float[] a, double p);
    public double[] evaluate(float[] a, double... p);

    // Provide the 9 methods in Hyndman and Fan (1996)
    // Sample Quantiles in Statistical Packages.
    // The American Statistician, 50, 361-365.
    public abstract class Quantile$EstimationMethod extends
            java.lang.Enum<Quantile$EstimationMethod> {
        public static final Quantile$EstimationMethod HF1;
        public static final Quantile$EstimationMethod HF2;
        public static final Quantile$EstimationMethod HF3;
        public static final Quantile$EstimationMethod HF4;
        public static final Quantile$EstimationMethod HF5;
        public static final Quantile$EstimationMethod HF6;
        public static final Quantile$EstimationMethod HF7;
        public static final Quantile$EstimationMethod HF8;
        public static final Quantile$EstimationMethod HF9;
    }
}

Note: The CM API used the 9 methods from Hyndman and Fann but labelled them as 
R1-9; this may be derived from the same names used in the R language. I propose 
to rename as HF1-9 to reflect the origin.

{code}
h2. NaNPolicy

There are multiple options here. For reference R and Python's numpy only 
provide the option to exclude NaN:
 * R: quantile errors if NaN is present. median returns NaN. They is an option 
to exclude NaN.
 * numpy: two methods are provided: median/nanmedian + quantile/nanquantile 
(the non-nan versions will return NaN if any NaNs are present)

Commons Math provides a remapping. Note the Statistics ranking module has the 
same NaNStrategy as that in CM:
 * MINIMAL: map to -infinity
 * MAXIMAL: map to +infinity
 * REMOVED: ignore from the data
 * FIXED: leave in place. This makes no sense for quantiles. It is done by 
moving to the end following the order imposed by Double.compare.
 * FAILED: raise an exception

I favour the simpler option of: treating NaN so they are above/below all other 
values; removing them from the data; or raising an exception. I do not see the 
requirement to remap NaN to infinity. This can be done by the user.

The API can be simplified by using:
{code:java}
public final class NaNPolicy extends java.lang.Enum<NaNPolicy> {
    public static final NaNPolicy LAST;    // Move to end of data
    public static final NaNPolicy FIRST;   // Move to start of data
    public static final NaNPolicy REMOVE;  // Remove from data
    public static final NaNPolicy ERROR;   // Raise exception
}
{code}
Note that the FIRST option is not strictly required. But if there is an option 
to order the NaNs (i.e. LAST) then both orders should be supported.

Which option to use as the default is not clear. As a drop in substitute for 
Arrays.sort the default would be handle NaN with a move to the end (LAST). As 
an API to signal to the user that the quantiles are possibly corrupted then 
ERROR would be the default. A user can then decide what to do if they receive 
errors during their analysis. Note that a separation of the Quantile API from 
the partitioning API (see below) may result in doing NaN processing twice 
introducing a performance overhead. If this will occur by default it should be 
documented so a user can override the NaN behaviour if they do not expect NaNs.

Commons Math places the Percentile class in the rank package. I propose to move 
this implementation to place it in the descriptive package where it will sit 
beside other descriptive statistics for data such as mean and standard 
deviation.
h2. Examples
{code:java}
double[] data = ...
double q = Quantile.withDefaults().evaluate(data, 0.25);
double[] quantiles = Quantile.withDefaults().evaluate(data, 0.25, 0.5, 0.75);
// Use cases:
// 1. Median / other quantile estimation
// 2. Box plot of data
// 3. Interquartile range analysis

double[] p = Quantile.probabilities(100, 0.01, 0.99);
double[] quantiles = Quantile.withDefaults().evaluate(data, p);
// Use cases:
// 1. plot p vs quantiles
// 2. use p with an expected (inverse) probability distribution to create a QQ 
plot

// Sorted input / unsupported datatype
short[] data = ...
Arrays.sort(data);
double[] quantiles = Quantile.withDefaults().evaluate(data.length, i -> 
data[i], p);
{code}
h2. Implementation

The Quantile API and the underlying implementation can be separated. Performing 
quantile estimation requires knowing the value of elements (i, i+1) in a sorted 
array a. Note that i+1 is used for interpolation:
{noformat}
value = a[i] + alpha * (a[i+1] - a[i]) ; alpha in [0, 1)
{noformat}
Note: The alpha factor and the index i are chosen from the percentile p using 
the EstimationMethod.

The array does not have to be fully sorted. It can be partitioned so that each 
required element i is equal to the value in a fully sorted array. A 
partitioning API can be placed in Commons Numbers:
{code:java}
org.apache.commons.numbers.arrays

public final class Selection {
    // Operate in-place and support a range as per Arrays.sort
    public void select(double[] a, int k);
    public void select(double[] a, int... k);
    public void select(int from, int to, double[] a, int k);
    public void select(int from, int to, double[] a, int... k);

    // Extend to other primitive types where sorting is slow

    // Sorting of 16-bit data is fast using a histogram so selection
    // has no major speed gain above a small length:
    // byte[] does counting sort at length 64 (temurin JDK 11)
    // short[]/char[] at length 1750 (temurin JDK 11)
}
{code}
h3. Examples
{code:java}
// Find the bottom 100 (with 0-based indexing)
Selection.select(data, 99);
double[] bottom100 = Arrays.copyOf(data, 100);

// Find the bottom and top 10
Selection.select(data, 9, data.length - 10);
double[] bottom10 = Arrays.copyOf(data, 10);
double[] top10 = Arrays.copyOfRange(data, data.length - 10, data.length);
{code}

  was:
Add a quantile implementation. This will interpolate the value of a sorted 
array of data for probability p in [0, 1].

Replace the legacy API from Commons Math Percentile with an updated API. The 
new API should:
 * Decouple estimation of quantile positions inside data of length n; and the 
selection of correctly-ordered indices in array data.
 * Support multiple data types.
 * Support pre-sorted data.
 * Avoid performance issues observed in the CM Percentile implementation.

h2. Proposed API
{code:java}
org.apache.commons.statistics.descriptive

public final class Quantile {
    // overwrite=true; EstimationMethod.HF8; NaNPolicy.ERROR
    public static Quantile withDefaults();
    public Quantile withOverwrite(boolean);
    public Quantile with(EstimationMethod);

    // Could support NaN handling ... see below for NaNPolicy
    public Quantile with(NaNPolicy);

    // Create n uniform probabilities in range [p1, p2]
    public static double[] probabilities(int n);
    public static double[] probabilities(int n, double p1, double p2);

    // Quantiles on sorted data a of size n
    public double evaluate(int n, java.util.function.IntToDoubleFunction a, 
double p);
    public double[] evaluate(int n, java.util.function.IntToDoubleFunction a, 
double... p);

    // Quantiles on the primitive types that cannot be easily sorted
    public double evaluate(double[] a, double p);
    public double[] evaluate(double[] a, double... p);
    public double evaluate(int[] a, double p);
    public double[] evaluate(int[] a, double... p);
    public double evaluate(long[] a, double p);
    public double[] evaluate(long[] a, double... p);
    public double evaluate(float[] a, double p);
    public double[] evaluate(float[] a, double... p);

    // Provide the 9 methods in Hyndman and Fan (1996)
    // Sample Quantiles in Statistical Packages.
    // The American Statistician, 50, 361-365.
    public abstract class Quantile$EstimationMethod extends
            java.lang.Enum<Quantile$EstimationMethod> {
        public static final Quantile$EstimationMethod HF1;
        public static final Quantile$EstimationMethod HF2;
        public static final Quantile$EstimationMethod HF3;
        public static final Quantile$EstimationMethod HF4;
        public static final Quantile$EstimationMethod HF5;
        public static final Quantile$EstimationMethod HF6;
        public static final Quantile$EstimationMethod HF7;
        public static final Quantile$EstimationMethod HF8;
        public static final Quantile$EstimationMethod HF9;
    }
}

Note: The CM API used the 9 methods from Hyndman and Fann but labelled them as 
R1-9; this may be derived from the same names used in the R language. I propose 
to rename as HF1-9 to reflect the origin.

{code}
h2. NaNPolicy

There are multiple options here. For reference R and Python's numpy only 
provide the option to exclude NaN:
 * R: quantile errors if NaN is present. median returns NaN. They is an option 
to exclude NaN.
 * numpy: two methods are provided: median/nanmedian + quantile/nanquantile 
(the non-nan versions will return NaN if any NaNs are present)

Commons Math provides a remapping. Note the Statistics ranking module has the 
same NaNStrategy as that in CM:
 * MINIMAL: map to -infinity
 * MAXIMAL: map to +infinity
 * REMOVED: ignore from the data
 * FIXED: leave in place. This makes no sense for quantiles. It is done by 
moving to the end following the order imposed by Double.compare.
 * FAILED: raise an exception

I favour the simpler option of: treating NaN so they are above/below all other 
values; removing them from the data; or raising an exception. I do not see the 
requirement to remap NaN to infinity. This can be done by the user.

The API can be simplified by using:
{code:java}
public final class NaNPolicy extends java.lang.Enum<NaNPolicy> {
    public static final NaNPolicy LAST;    // Move to end of data
    public static final NaNPolicy FIRST;   // Move to start of data
    public static final NaNPolicy REMOVE;  // Remove from data
    public static final NaNPolicy ERROR;   // Raise exception
}
{code}
Note that the FIRST option is not strictly required. But if there is an option 
to order the NaNs (i.e. LAST) then both orders should be supported.

Which option to use as the default is not clear. As a drop in substitute for 
Arrays.sort the default would be handle NaN with a move to the end (LAST). As 
an API to signal to the user that the quantiles are possibly corrupted then 
ERROR would be the default. A user can then decide what to do if they receive 
errors during their analysis. Note that a separation of the Quantile API from 
the partitioning API (see below) may result in doing NaN processing twice 
introducing a performance overhead. If this will occur by default it should be 
documented so a user can override the NaN behaviour if they do not expect NaNs.

Commons Math places the Percentile class in the rank package. I propose to move 
this implementation to place it in the descriptive package where it will sit 
beside other descriptive statistics for data such as mean and standard 
deviation.
h2. Examples
{code:java}
double[] data = ...
double q = Quantile.withDefaults().evaluate(data, 0.25);
double[] quantiles = Quantile.withDefaults().evaluate(data, 0.25, 0.5, 0.75);
// Use cases:
// 1. Median / other quantile estimation
// 2. Box plot of data
// 3. Interquartile range analysis

double[] p = Quantile.probabilities(100, 0.01, 0.99);
double[] quantiles = Quantile.withDefaults().evaluate(data, p);
// Use cases:
// 1. plot p vs quantiles
// 2. use p with an expected (inverse) probability distribution to create a QQ 
plot

// Sorted input / unsupported datatype
short[] data = ...
Arrays.sort(data);
double[] quantiles = Quantile.withDefaults().evaluate(data.length, i -> 
data[i], p);
{code}
h2. Implementation

The Quantile API and the underlying implementation can be separated. Performing 
quantile estimation requires knowing the value of elements (i, i+1) in a sorted 
array a. Note that i+1 is used for interpolation:
{noformat}
value = a[i] + alpha * a[i+1] ; alpha in [0, 1)
{noformat}
Note: The alpha factor and the index i are chosen from the percentile p using 
the EstimationMethod.

The array does not have to be fully sorted. It can be partitioned so that each 
required element i is equal to the value in a fully sorted array. A 
partitioning API can be placed in Commons Numbers:
{code:java}
org.apache.commons.numbers.arrays

public final class Selection {
    // Operate in-place and support a range as per Arrays.sort
    public void select(double[] a, int k);
    public void select(double[] a, int... k);
    public void select(int from, int to, double[] a, int k);
    public void select(int from, int to, double[] a, int... k);

    // Extend to other primitive types where sorting is slow

    // Sorting of 16-bit data is fast using a histogram so selection
    // has no major speed gain above a small length:
    // byte[] does counting sort at length 64 (temurin JDK 11)
    // short[]/char[] at length 1750 (temurin JDK 11)
}
{code}
h3. Examples
{code:java}
// Find the bottom 100 (with 0-based indexing)
Selection.select(data, 99);
double[] bottom100 = Arrays.copyOf(data, 100);

// Find the bottom and top 10
Selection.select(data, 9, data.length - 10);
double[] bottom10 = Arrays.copyOf(data, 10);
double[] top10 = Arrays.copyOfRange(data, data.length - 10, data.length);
{code}


> Quantile implementation
> -----------------------
>
>                 Key: STATISTICS-85
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-85
>             Project: Commons Statistics
>          Issue Type: New Feature
>          Components: descriptive
>            Reporter: Alex Herbert
>            Priority: Major
>
> Add a quantile implementation. This will interpolate the value of a sorted 
> array of data for probability p in [0, 1].
> Replace the legacy API from Commons Math Percentile with an updated API. The 
> new API should:
>  * Decouple estimation of quantile positions inside data of length n; and the 
> selection of correctly-ordered indices in array data.
>  * Support multiple data types.
>  * Support pre-sorted data.
>  * Avoid performance issues observed in the CM Percentile implementation.
> h2. Proposed API
> {code:java}
> org.apache.commons.statistics.descriptive
> public final class Quantile {
>     // overwrite=true; EstimationMethod.HF8; NaNPolicy.ERROR
>     public static Quantile withDefaults();
>     public Quantile withOverwrite(boolean);
>     public Quantile with(EstimationMethod);
>     // Could support NaN handling ... see below for NaNPolicy
>     public Quantile with(NaNPolicy);
>     // Create n uniform probabilities in range [p1, p2]
>     public static double[] probabilities(int n);
>     public static double[] probabilities(int n, double p1, double p2);
>     // Quantiles on sorted data a of size n
>     public double evaluate(int n, java.util.function.IntToDoubleFunction a, 
> double p);
>     public double[] evaluate(int n, java.util.function.IntToDoubleFunction a, 
> double... p);
>     // Quantiles on the primitive types that cannot be easily sorted
>     public double evaluate(double[] a, double p);
>     public double[] evaluate(double[] a, double... p);
>     public double evaluate(int[] a, double p);
>     public double[] evaluate(int[] a, double... p);
>     public double evaluate(long[] a, double p);
>     public double[] evaluate(long[] a, double... p);
>     public double evaluate(float[] a, double p);
>     public double[] evaluate(float[] a, double... p);
>     // Provide the 9 methods in Hyndman and Fan (1996)
>     // Sample Quantiles in Statistical Packages.
>     // The American Statistician, 50, 361-365.
>     public abstract class Quantile$EstimationMethod extends
>             java.lang.Enum<Quantile$EstimationMethod> {
>         public static final Quantile$EstimationMethod HF1;
>         public static final Quantile$EstimationMethod HF2;
>         public static final Quantile$EstimationMethod HF3;
>         public static final Quantile$EstimationMethod HF4;
>         public static final Quantile$EstimationMethod HF5;
>         public static final Quantile$EstimationMethod HF6;
>         public static final Quantile$EstimationMethod HF7;
>         public static final Quantile$EstimationMethod HF8;
>         public static final Quantile$EstimationMethod HF9;
>     }
> }
> Note: The CM API used the 9 methods from Hyndman and Fann but labelled them 
> as R1-9; this may be derived from the same names used in the R language. I 
> propose to rename as HF1-9 to reflect the origin.
> {code}
> h2. NaNPolicy
> There are multiple options here. For reference R and Python's numpy only 
> provide the option to exclude NaN:
>  * R: quantile errors if NaN is present. median returns NaN. They is an 
> option to exclude NaN.
>  * numpy: two methods are provided: median/nanmedian + quantile/nanquantile 
> (the non-nan versions will return NaN if any NaNs are present)
> Commons Math provides a remapping. Note the Statistics ranking module has the 
> same NaNStrategy as that in CM:
>  * MINIMAL: map to -infinity
>  * MAXIMAL: map to +infinity
>  * REMOVED: ignore from the data
>  * FIXED: leave in place. This makes no sense for quantiles. It is done by 
> moving to the end following the order imposed by Double.compare.
>  * FAILED: raise an exception
> I favour the simpler option of: treating NaN so they are above/below all 
> other values; removing them from the data; or raising an exception. I do not 
> see the requirement to remap NaN to infinity. This can be done by the user.
> The API can be simplified by using:
> {code:java}
> public final class NaNPolicy extends java.lang.Enum<NaNPolicy> {
>     public static final NaNPolicy LAST;    // Move to end of data
>     public static final NaNPolicy FIRST;   // Move to start of data
>     public static final NaNPolicy REMOVE;  // Remove from data
>     public static final NaNPolicy ERROR;   // Raise exception
> }
> {code}
> Note that the FIRST option is not strictly required. But if there is an 
> option to order the NaNs (i.e. LAST) then both orders should be supported.
> Which option to use as the default is not clear. As a drop in substitute for 
> Arrays.sort the default would be handle NaN with a move to the end (LAST). As 
> an API to signal to the user that the quantiles are possibly corrupted then 
> ERROR would be the default. A user can then decide what to do if they receive 
> errors during their analysis. Note that a separation of the Quantile API from 
> the partitioning API (see below) may result in doing NaN processing twice 
> introducing a performance overhead. If this will occur by default it should 
> be documented so a user can override the NaN behaviour if they do not expect 
> NaNs.
> Commons Math places the Percentile class in the rank package. I propose to 
> move this implementation to place it in the descriptive package where it will 
> sit beside other descriptive statistics for data such as mean and standard 
> deviation.
> h2. Examples
> {code:java}
> double[] data = ...
> double q = Quantile.withDefaults().evaluate(data, 0.25);
> double[] quantiles = Quantile.withDefaults().evaluate(data, 0.25, 0.5, 0.75);
> // Use cases:
> // 1. Median / other quantile estimation
> // 2. Box plot of data
> // 3. Interquartile range analysis
> double[] p = Quantile.probabilities(100, 0.01, 0.99);
> double[] quantiles = Quantile.withDefaults().evaluate(data, p);
> // Use cases:
> // 1. plot p vs quantiles
> // 2. use p with an expected (inverse) probability distribution to create a 
> QQ plot
> // Sorted input / unsupported datatype
> short[] data = ...
> Arrays.sort(data);
> double[] quantiles = Quantile.withDefaults().evaluate(data.length, i -> 
> data[i], p);
> {code}
> h2. Implementation
> The Quantile API and the underlying implementation can be separated. 
> Performing quantile estimation requires knowing the value of elements (i, 
> i+1) in a sorted array a. Note that i+1 is used for interpolation:
> {noformat}
> value = a[i] + alpha * (a[i+1] - a[i]) ; alpha in [0, 1)
> {noformat}
> Note: The alpha factor and the index i are chosen from the percentile p using 
> the EstimationMethod.
> The array does not have to be fully sorted. It can be partitioned so that 
> each required element i is equal to the value in a fully sorted array. A 
> partitioning API can be placed in Commons Numbers:
> {code:java}
> org.apache.commons.numbers.arrays
> public final class Selection {
>     // Operate in-place and support a range as per Arrays.sort
>     public void select(double[] a, int k);
>     public void select(double[] a, int... k);
>     public void select(int from, int to, double[] a, int k);
>     public void select(int from, int to, double[] a, int... k);
>     // Extend to other primitive types where sorting is slow
>     // Sorting of 16-bit data is fast using a histogram so selection
>     // has no major speed gain above a small length:
>     // byte[] does counting sort at length 64 (temurin JDK 11)
>     // short[]/char[] at length 1750 (temurin JDK 11)
> }
> {code}
> h3. Examples
> {code:java}
> // Find the bottom 100 (with 0-based indexing)
> Selection.select(data, 99);
> double[] bottom100 = Arrays.copyOf(data, 100);
> // Find the bottom and top 10
> Selection.select(data, 9, data.length - 10);
> double[] bottom10 = Arrays.copyOf(data, 10);
> double[] top10 = Arrays.copyOfRange(data, data.length - 10, data.length);
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to