[ https://issues.apache.org/jira/browse/STATISTICS-85?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848547#comment-17848547 ]
Alex Herbert commented on STATISTICS-85:
----------------------------------------

h2. Background

This ticket is related to STATISTICS-18: Port the Percentile class from Commons Math. I originally started doing that and found several issues.

The CM API is based on implementing the UnivariateStatistic interface. This creates a single value from an input array. This is not ideal as it is often required to obtain multiple percentiles from the same array. This is solved by allowing the Percentile to cache a work array that is used for partitioning, and cache known values within that array that are correctly sorted (this speeds up repeated calls). The API is thus bloated by this restriction. The API is also tied to the type of data being processed (double[]). The full API is:

{code:java}
org.apache.commons.math4.legacy.stat.descriptive.rank

public class Percentile extends org.apache.commons.math4.legacy.stat.descriptive.AbstractUnivariateStatistic {
    public Percentile();
    public Percentile(double);
    public Percentile(Percentile);
    public void setData(double[]);
    public void setData(double[], double[]);
    public void setData(double[], int, int);
    public void setData(double[], double[], int, int);
    public double evaluate(double);
    public double evaluate(double[], double);
    public double evaluate(double[], double[], double);
    public double evaluate(double[], int, int);
    public double evaluate(double[], double[], int, int);
    public double evaluate(double[], int, int, double);
    public double evaluate(double[], double[], int, int, double);
    public double getQuantile();
    public void setQuantile(double);
    public Percentile copy();
    public Percentile$EstimationType getEstimationType();
    public Percentile withEstimationType(Percentile$EstimationType);
    public NaNStrategy getNaNStrategy();
    public Percentile withNaNStrategy(NaNStrategy);
    public KthSelector getKthSelector();
    public PivotingStrategy getPivotingStrategy();
    public Percentile withKthSelector(KthSelector);
    public org.apache.commons.math4.legacy.stat.descriptive.UnivariateStatistic copy();

    public abstract class Percentile$EstimationType extends java.lang.Enum<Percentile$EstimationType> {
        public static final Percentile$EstimationType LEGACY;
        public static final Percentile$EstimationType R_1;
        public static final Percentile$EstimationType R_2;
        public static final Percentile$EstimationType R_3;
        public static final Percentile$EstimationType R_4;
        public static final Percentile$EstimationType R_5;
        public static final Percentile$EstimationType R_6;
        public static final Percentile$EstimationType R_7;
        public static final Percentile$EstimationType R_8;
        public static final Percentile$EstimationType R_9;
        // Ties the enum to the double[] data type and to the KthSelector implementation
        public double evaluate(double[], double, KthSelector);
        // Only implemented in R_7: data, weights, p
        public abstract double evaluate(double[], double[], double);
    }
}

// Depends on:
org.apache.commons.math4.legacy.stat.ranking

public final class NaNStrategy extends java.lang.Enum<NaNStrategy> {
    public static final NaNStrategy MINIMAL;
    public static final NaNStrategy MAXIMAL;
    public static final NaNStrategy REMOVED;
    public static final NaNStrategy FIXED;
    public static final NaNStrategy FAILED;
}
{code}

The new Quantile API removes the coupling to the double[] data type and allows multiple p values to be passed together. This allows the underlying implementation to reuse the algorithm from CM, or find an improvement given that all the positions to be sorted in the array are known.

h2. Bugs

There are two bugs in CM Percentile as shown:

{code:java}
@Test
void testCMBugs() {
    // Always interpolates pairs so creates NaN when this is the upper value
    // even if the upper value should be ignored.
    // Uses: low + 0.0 * (NaN - low)
    // This also does an additional search for the upper value even when it
    // is not used.
    Percentile p = new Percentile().withNaNStrategy(NaNStrategy.FIXED);
    Assertions.assertEquals(1, p.evaluate(new double[] {0, 1, 2}, 50));
    Assertions.assertEquals(Double.NaN, p.evaluate(new double[] {0, 1, Double.NaN}, 50));
    // Cannot use a NaNStrategy on the work array
    Percentile p2 = new Percentile().withNaNStrategy(NaNStrategy.FAILED);
    double[] x = {0, 1, Double.NaN};
    Assertions.assertThrows(NotANumberException.class, () -> p2.evaluate(x, 50));
    // With a work array
    p2.setData(x);
    Assertions.assertDoesNotThrow(() -> p2.evaluate(50));
}
{code}

Note that the requirement to always interpolate v = a[i] + alpha * (a[i+1] - a[i]), even if alpha is zero, is a performance overhead that can be measured, for example when computing the median on odd length arrays.

h2. Quickselect

The CM Percentile implementation uses a quickselect algorithm. Select: arrange the data so that index k holds the value it would have in a fully sorted array:

{noformat}
data[i < k] <= data[k] <= data[i > k]
{noformat}

Selection is closely related to sorting. Here we introduce quicksort and the related quickselect. Quicksort divides an array around a pivot point that holds the correct value in the sorted array. The two partitions either side of the pivot are recursively processed in the same way. This can be done in-place. At small lengths the selection of a pivot value and rearrangement of the data can be stopped and the sort switched to an insertion sort.

Quickselect is related to quicksort. It was introduced by Hoare (1961). It partitions the data using a pivot value but only recurses into the partition that contains the index of interest k. At small lengths the selection can use an insertion sort, or another method such as heap select to identify k. Heap select builds a min/max heap of size m = k-l+1 or m = r-k+1 at one end of the array and scans the rest of the data, inserting elements if they are smaller/larger than the current top of the heap.
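As an illustration, heap select for an index k near the left end can be sketched as follows. This is a hypothetical standalone class (not the CM code): a max-heap of the m = k+1 smallest values is built at the start of the array, so the heap top ends up as the k-th order statistic.

```java
/** Illustrative heap-select sketch: place the k-th smallest value of a at index k. */
public final class HeapSelect {
    /** Select index k (0-based, near the left end) using a max-heap of size k+1. */
    static void select(double[] a, int k) {
        int m = k + 1;
        // Build a max-heap from a[0..k]
        for (int i = m / 2 - 1; i >= 0; i--) {
            siftDown(a, i, m);
        }
        // Scan the rest; keep the m smallest values in the heap
        for (int i = m; i < a.length; i++) {
            if (a[i] < a[0]) {
                double t = a[0]; a[0] = a[i]; a[i] = t;
                siftDown(a, 0, m);
            }
        }
        // The heap top is the largest of the k+1 smallest, i.e. the k-th smallest
        double t = a[0]; a[0] = a[k]; a[k] = t;
    }

    /** Restore the max-heap property for a heap of size n rooted at index i. */
    private static void siftDown(double[] a, int i, int n) {
        double v = a[i];
        for (int c; (c = 2 * i + 1) < n; i = c) {
            if (c + 1 < n && a[c + 1] > a[c]) {
                c++;
            }
            if (a[c] <= v) {
                break;
            }
            a[i] = a[c];
        }
        a[i] = v;
    }
}
```

For k near the right end the mirror image applies: a min-heap of size r-k+1 at the other end of the array.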
When finished, the top of the heap can be placed in the correct location k.

A single-pivot quickselect should ideally divide the data into 2 equal partitions using a pivot value p. This is repeated in the partition that contains the index to select:

{noformat}
l---------------k----------------p---------------r
l---------p-----k---------------r
          l----k--p------------r
          l--p-k-r
             lk-r
{noformat}

Order(n + n/2 + n/4 + n/8 + ...) = Order( n )

The number of quickselect partitions is expected to be log2( n ). The choice of the pivot value can be random, or based on sampling of the range such as using the median of a number of samples. A sampling approach will be more likely to find a value in the middle of the data but will be more expensive to compute.

h2. CM quickselect

The implementation has a cache of pivot points (known locations that hold the value from the corresponding location in a sorted array). This cache uses a heap structure with a fixed depth of 10. It allows the quickselect algorithm to store the first 10 decisions during a partitioning search. However quickselect requires log2( n ) branches for length n on average. So the depth of 10 is too large for small arrays and too small for large arrays. The cache is also created and used on a one-time call where it is empty and will never be reused. This is inefficient for single usage and small arrays, and inadequate for large arrays, e.g. 1 million elements will require on average ceil(log2(1e6)) = 20 quickselect partitions. So the cache only stores the initial decisions. However it is expected that these are the major steps in reducing the array size, so here performance is not degraded too much.

The pivot selection uses a median-of-3 strategy. This is vulnerable to median-of-3 killer sequences which cause the algorithm to only partition 2 elements per iteration and result in Order( n^2 ) worst case performance. This can be changed by changing the pivoting strategy.
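For reference, a minimal single-pivot quickselect along these lines, with a median-of-3 pivot and a sort fallback at small lengths, might look like the following. This is an illustrative sketch (hypothetical class name), not the CM implementation:

```java
import java.util.Arrays;

/** Minimal single-pivot quickselect sketch (median-of-3 pivot, no introspection). */
public final class QuickSelect {
    /** Arrange a so that index k holds its sorted value: a[i<k] <= a[k] <= a[i>k]. */
    static void select(double[] a, int k) {
        int l = 0;
        int r = a.length - 1;
        while (r - l > 8) {
            // Median-of-3 pivot value from the ends and middle of the range
            int m = (l + r) >>> 1;
            double pivot = median3(a[l], a[m], a[r]);
            // Hoare-style partition around the pivot value
            int i = l;
            int j = r;
            while (i <= j) {
                while (a[i] < pivot) i++;
                while (a[j] > pivot) j--;
                if (i <= j) {
                    double t = a[i]; a[i] = a[j]; a[j] = t;
                    i++; j--;
                }
            }
            // Descend only into the side containing k
            if (k <= j) {
                r = j;
            } else if (k >= i) {
                l = i;
            } else {
                return; // k lies between j and i: a[k] == pivot, already in place
            }
        }
        // Small range: finish with a sort (stand-in for an insertion sort)
        Arrays.sort(a, l, r + 1);
    }

    private static double median3(double x, double y, double z) {
        return Math.max(Math.min(x, y), Math.min(Math.max(x, y), z));
    }
}
```

Note this sketch shares the weaknesses discussed here: the median-of-3 pivot is vulnerable to killer sequences and there is no monitoring of the recursion.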
But it does not remove the fact that the algorithm does not monitor its own performance during partitioning. This is known as introspection and was introduced by Musser in Introsort/Introselect. [Musser (1997) Introspective Sorting and Selection Algorithms. Software: Practice and Experience 27, 983-993|https://doi.org/10.1002/(SICI)1097-024X(199708)27:8%3C983::AID-SPE117%3E3.0.CO;2-%23]

The recursion depth of the quickselect can vary with the data and the pivot selection method. If pivot selection is poor then quickselect recurses too far and performance becomes quadratic. It is easy to determine if quickselect is not converging as expected by checking the recursion depth against the expected number, or combined size, of ideal partitions. In this case the selection can change to a different method. This idea of monitoring the quickselect behaviour during execution is used in introsort/introselect (Musser 1997). In introsort the quicksort is changed to a heapsort (Order(n log n)) if quicksort fails to converge. For introselect the quickselect is changed to a heapselect, or a linearselect algorithm. The switch to a stopper algorithm provides a stable worst case performance for data that is unsuitable for quickselect.

The CM quickselect algorithm partitions elements using a 2-state partition with {{<, >}}. This suffers performance degradation when there are many repeated elements. It can be fixed by using a 3-state partition with {{<, ==, >}} where all values equal to the pivot value are collected in the centre.

In summary, any replacement of the CM quickselect implementation should address:
* An efficient strategy to partition data for multiple keys
* Introspection to avoid quadratic worst case performance
* Efficient partitioning of data with repeated elements

I have benchmarked various implementations that address these issues and can significantly outperform the CM Percentile class on input data that trigger them.
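The 3-state {{<, ==, >}} partition mentioned above can be sketched as a Dutch-national-flag style pass (hypothetical standalone class, illustrative only, not one of the benchmarked implementations):

```java
/** Illustrative 3-way (<, ==, >) partition in Dutch-national-flag style. */
public final class ThreeWayPartition {
    /**
     * Partition a[l..r] around the pivot value; returns {lo, hi} where
     * a[l..lo-1] < pivot, a[lo..hi] == pivot, a[hi+1..r] > pivot.
     */
    static int[] partition(double[] a, int l, int r, double pivot) {
        int lt = l;  // boundary of the < region
        int gt = r;  // boundary of the > region
        int i = l;
        while (i <= gt) {
            if (a[i] < pivot) {
                double t = a[lt]; a[lt] = a[i]; a[i] = t;
                lt++; i++;
            } else if (a[i] > pivot) {
                double t = a[gt]; a[gt] = a[i]; a[i] = t;
                gt--;
            } else {
                i++;
            }
        }
        return new int[] {lt, gt};
    }
}
```

Any index k that falls in the centre block [lo, hi] is already in its sorted position, so data with many repeated elements no longer degrades the selection towards Order( n^2 ).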
> Quantile implementation
> -----------------------
>
>                 Key: STATISTICS-85
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-85
>             Project: Commons Statistics
>          Issue Type: New Feature
>          Components: descriptive
>            Reporter: Alex Herbert
>            Priority: Major
>
> Add a quantile implementation. This will interpolate the value of a sorted
> array of data for probability p in [0, 1].
> Replace the legacy API from Commons Math Percentile with an updated API. The
> new API should:
> * Decouple the estimation of quantile positions inside data of length n from
> the selection of correctly-ordered indices in the array data.
> * Support multiple data types.
> * Support pre-sorted data.
> * Avoid performance issues observed in the CM Percentile implementation.
> h2. Proposed API
> {code:java}
> org.apache.commons.statistics.descriptive
>
> public final class Quantile {
>     // overwrite=true; EstimationMethod.HF8; NaNPolicy.ERROR
>     public static Quantile withDefaults();
>     public Quantile withOverwrite(boolean);
>     public Quantile with(EstimationMethod);
>     // Could support NaN handling ... see below for NaNPolicy
>     public Quantile with(NaNPolicy);
>     // Create n uniform probabilities in range [p1, p2]
>     public static double[] probabilities(int n);
>     public static double[] probabilities(int n, double p1, double p2);
>     // Quantiles on sorted data a of size n
>     public double evaluate(int n, java.util.function.IntToDoubleFunction a, double p);
>     public double[] evaluate(int n, java.util.function.IntToDoubleFunction a, double... p);
>     // Quantiles on the primitive types that cannot be easily sorted
>     public double evaluate(double[] a, double p);
>     public double[] evaluate(double[] a, double... p);
>     public double evaluate(int[] a, double p);
>     public double[] evaluate(int[] a, double... p);
>     public double evaluate(long[] a, double p);
>     public double[] evaluate(long[] a, double... p);
>     public double evaluate(float[] a, double p);
>     public double[] evaluate(float[] a, double... p);
>
>     // Provide the 9 methods in Hyndman and Fan (1996)
>     // Sample Quantiles in Statistical Packages.
>     // The American Statistician, 50, 361-365.
>     public abstract class Quantile$EstimationMethod extends java.lang.Enum<Quantile$EstimationMethod> {
>         public static final Quantile$EstimationMethod HF1;
>         public static final Quantile$EstimationMethod HF2;
>         public static final Quantile$EstimationMethod HF3;
>         public static final Quantile$EstimationMethod HF4;
>         public static final Quantile$EstimationMethod HF5;
>         public static final Quantile$EstimationMethod HF6;
>         public static final Quantile$EstimationMethod HF7;
>         public static final Quantile$EstimationMethod HF8;
>         public static final Quantile$EstimationMethod HF9;
>     }
> }
> {code}
> Note: The CM API used the 9 methods from Hyndman and Fan but labelled them
> as R1-9; this may be derived from the same names used in the R language. I
> propose to rename them as HF1-9 to reflect the origin.
> h2. NaNPolicy
> There are multiple options here. For reference, R and Python's numpy only
> provide the option to exclude NaN:
> * R: quantile errors if NaN is present. median returns NaN. There is an
> option to exclude NaN.
> * numpy: two methods are provided: median/nanmedian + quantile/nanquantile
> (the non-nan versions will return NaN if any NaNs are present)
> Commons Math provides a remapping. Note the Statistics ranking module has the
> same NaNStrategy as that in CM:
> * MINIMAL: map to -infinity
> * MAXIMAL: map to +infinity
> * REMOVED: ignore from the data
> * FIXED: leave in place. This makes no sense for quantiles. It is done by
> moving to the end following the order imposed by Double.compare.
> * FAILED: raise an exception
> I favour the simpler options of: treating NaN so they are above/below all
> other values; removing them from the data; or raising an exception. I do not
> see the requirement to remap NaN to infinity. This can be done by the user.
> The API can be simplified by using:
> {code:java}
> public final class NaNPolicy extends java.lang.Enum<NaNPolicy> {
>     public static final NaNPolicy LAST;   // Move to end of data
>     public static final NaNPolicy FIRST;  // Move to start of data
>     public static final NaNPolicy REMOVE; // Remove from data
>     public static final NaNPolicy ERROR;  // Raise exception
> }
> {code}
> Note that the FIRST option is not strictly required. But if there is an
> option to order the NaNs (i.e. LAST) then both orders should be supported.
> Which option to use as the default is not clear. As a drop-in substitute for
> Arrays.sort the default would be to handle NaN with a move to the end (LAST).
> As an API to signal to the user that the quantiles are possibly corrupted,
> ERROR would be the default. A user can then decide what to do if they receive
> errors during their analysis. Note that a separation of the Quantile API from
> the partitioning API (see below) may result in doing NaN processing twice,
> introducing a performance overhead. If this will occur by default it should
> be documented so a user can override the NaN behaviour if they do not expect
> NaNs.
> Commons Math places the Percentile class in the rank package. I propose to
> move this implementation to the descriptive package where it will sit beside
> other descriptive statistics for data such as mean and standard deviation.
> h2. Examples
> {code:java}
> double[] data = ...
> double q = Quantile.withDefaults().evaluate(data, 0.25);
> double[] quantiles = Quantile.withDefaults().evaluate(data, 0.25, 0.5, 0.75);
> // Use cases:
> // 1. Median / other quantile estimation
> // 2. Box plot of data
> // 3. Interquartile range analysis
>
> double[] p = Quantile.probabilities(100, 0.01, 0.99);
> double[] quantiles2 = Quantile.withDefaults().evaluate(data, p);
> // Use cases:
> // 1. plot p vs quantiles
> // 2. use p with an expected (inverse) probability distribution to create a QQ plot
>
> // Sorted input / unsupported datatype
> short[] data2 = ...
> Arrays.sort(data2);
> double[] quantiles3 = Quantile.withDefaults().evaluate(data2.length, i -> data2[i], p);
> {code}
> h2. Implementation
> The Quantile API and the underlying implementation can be separated.
> Performing quantile estimation requires knowing the value of elements (i,
> i+1) in a sorted array a. Note that i+1 is used for interpolation:
> {noformat}
> value = a[i] + alpha * (a[i+1] - a[i]) ; alpha in [0, 1)
> {noformat}
> Note: The alpha factor and the index i are chosen from the probability p
> using the EstimationMethod.
> The array does not have to be fully sorted. It can be partitioned so that
> each required element i is equal to the value in a fully sorted array. A
> partitioning API can be placed in Commons Numbers:
> {code:java}
> org.apache.commons.numbers.arrays
>
> public final class Selection {
>     // Operate in-place and support a range as per Arrays.sort
>     public static void select(double[] a, int k);
>     public static void select(double[] a, int... k);
>     public static void select(int from, int to, double[] a, int k);
>     public static void select(int from, int to, double[] a, int... k);
>     // Extend to other primitive types where sorting is slow.
>     // Sorting of 16-bit data is fast using a histogram so selection
>     // has no major speed gain above a small length:
>     // byte[] does counting sort at length 64 (temurin JDK 11)
>     // short[]/char[] at length 1750 (temurin JDK 11)
> }
> {code}
> h3. Examples
> {code:java}
> // Find the bottom 100 (with 0-based indexing)
> Selection.select(data, 99);
> double[] bottom100 = Arrays.copyOf(data, 100);
>
> // Find the bottom and top 10
> Selection.select(data, 9, data.length - 10);
> double[] bottom10 = Arrays.copyOf(data, 10);
> double[] top10 = Arrays.copyOfRange(data, data.length - 10, data.length);
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)