[jira] [Commented] (STATISTICS-85) Quantile implementation

Alex Herbert (Jira) Wed, 12 Jun 2024 03:12:05 -0700


    [ 
https://issues.apache.org/jira/browse/STATISTICS-85?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17854364#comment-17854364
 ]


Alex Herbert commented on STATISTICS-85:
----------------------------------------

h2. Median

Note for the median that an odd length input array requires finding one index 
(k = n/2); an even length input requires finding indices (k-1, k). Data were 
created using a given length and extended with an additional range of 1 (e.g. 
n=[1000, 1001]). For the BM data all distributions at n and n+1 are created. 
For random data the length is randomly selected from the range. This creates a 
50:50 mixture of odd and even length data. The data samples are randomly 
shuffled per iteration to reduce branch prediction effects. 

Tested on the BM distribution.
||n||CM3||CM4||JDK||Statistics||
|1000|3568053.381|4778367.635|7302762.526|2295992.742|
|Relative|0.489|0.654|1.0|0.314|
|100000|6242554116|6350143191|1375191908|314671031.5|
|Relative|4.539|4.618|1.0|0.229|

The CM implementation handles the small data. For larger data it is 
significantly worse than a full sort. The Statistics implementation is robust 
to the BM data distributions and 3-4x faster than a sort.

Tested using random data.
||n||CM3||CM4||JDK||Statistics||Relative to JDK||Relative to CM3||
|100|4.289E+04|4.277E+04|1.277E+05|3.424E+04|0.27|0.80|
|1000|1.007E+06|9.976E+05|2.900E+06|6.872E+05|0.24|0.68|
|10000|1.020E+07|1.001E+07|3.944E+07|7.215E+06|0.18|0.71|
|100000|9.613E+07|9.483E+07|5.040E+08|6.346E+07|0.13|0.66|

!MedianRandomData.png|width=1000,height=800!

On random data the CM implementation is faster than sorting. The statistics 
implementation is consistently faster than CM and the relative difference 
increases with length.

 

> Quantile implementation
> -----------------------
>
>                 Key: STATISTICS-85
>                 URL: https://issues.apache.org/jira/browse/STATISTICS-85
>             Project: Commons Statistics
>          Issue Type: New Feature
>          Components: descriptive
>            Reporter: Alex Herbert
>            Priority: Major
>         Attachments: 100QuantilesRandomDataLength10000.png, 
> MedianRandomData.png
>
>
> Add a quantile implementation. This will interpolate the value of a sorted 
> array of data for probability p in [0, 1].
> Replace the legacy API from Commons Math Percentile with an updated API. The 
> new API should:
>  * Decouple estimation of quantile positions inside data of length n; and the 
> selection of correctly-ordered indices in array data.
>  * Support multiple data types.
>  * Support pre-sorted data.
>  * Avoid performance issues observed in the CM Percentile implementation.
> h2. Proposed API
> {code:java}
> org.apache.commons.statistics.descriptive
> public final class Quantile {
>     // overwrite=true; EstimationMethod.HF8; NaNPolicy.ERROR
>     public static Quantile withDefaults();
>     public Quantile withOverwrite(boolean);
>     public Quantile with(EstimationMethod);
>     // Could support NaN handling ... see below for NaNPolicy
>     public Quantile with(NaNPolicy);
>     // Create n uniform probabilities in range [p1, p2]
>     public static double[] probabilities(int n);
>     public static double[] probabilities(int n, double p1, double p2);
>     // Quantiles on sorted data a of size n
>     public double evaluate(int n, java.util.function.IntToDoubleFunction a, 
> double p);
>     public double[] evaluate(int n, java.util.function.IntToDoubleFunction a, 
> double... p);
>     // Quantiles on the primitive types that cannot be easily sorted
>     public double evaluate(double[] a, double p);
>     public double[] evaluate(double[] a, double... p);
>     public double evaluate(int[] a, double p);
>     public double[] evaluate(int[] a, double... p);
>     public double evaluate(long[] a, double p);
>     public double[] evaluate(long[] a, double... p);
>     public double evaluate(float[] a, double p);
>     public double[] evaluate(float[] a, double... p);
>     // Provide the 9 methods in Hyndman and Fan (1996)
>     // Sample Quantiles in Statistical Packages.
>     // The American Statistician, 50, 361-365.
>     public abstract class Quantile$EstimationMethod extends
>             java.lang.Enum<Quantile$EstimationMethod> {
>         public static final Quantile$EstimationMethod HF1;
>         public static final Quantile$EstimationMethod HF2;
>         public static final Quantile$EstimationMethod HF3;
>         public static final Quantile$EstimationMethod HF4;
>         public static final Quantile$EstimationMethod HF5;
>         public static final Quantile$EstimationMethod HF6;
>         public static final Quantile$EstimationMethod HF7;
>         public static final Quantile$EstimationMethod HF8;
>         public static final Quantile$EstimationMethod HF9;
>     }
> }
> Note: The CM API used the 9 methods from Hyndman and Fann but labelled them 
> as R1-9; this may be derived from the same names used in the R language. I 
> propose to rename as HF1-9 to reflect the origin.
> {code}
> h2. NaNPolicy
> There are multiple options here. For reference R and Python's numpy only 
> provide the option to exclude NaN:
>  * R: quantile errors if NaN is present. median returns NaN. They is an 
> option to exclude NaN.
>  * numpy: two methods are provided: median/nanmedian + quantile/nanquantile 
> (the non-nan versions will return NaN if any NaNs are present)
> Commons Math provides a remapping. Note the Statistics ranking module has the 
> same NaNStrategy as that in CM:
>  * MINIMAL: map to -infinity
>  * MAXIMAL: map to +infinity
>  * REMOVED: ignore from the data
>  * FIXED: leave in place. This makes no sense for quantiles. It is done by 
> moving to the end following the order imposed by Double.compare.
>  * FAILED: raise an exception
> I favour the simpler option of: treating NaN so they are above/below all 
> other values; removing them from the data; or raising an exception. I do not 
> see the requirement to remap NaN to infinity. This can be done by the user.
> The API can be simplified by using:
> {code:java}
> public final class NaNPolicy extends java.lang.Enum<NaNPolicy> {
>     public static final NaNPolicy LAST;    // Move to end of data
>     public static final NaNPolicy FIRST;   // Move to start of data
>     public static final NaNPolicy REMOVE;  // Remove from data
>     public static final NaNPolicy ERROR;   // Raise exception
> }
> {code}
> Note that the FIRST option is not strictly required. But if there is an 
> option to order the NaNs (i.e. LAST) then both orders should be supported.
> Which option to use as the default is not clear. As a drop in substitute for 
> Arrays.sort the default would be handle NaN with a move to the end (LAST). As 
> an API to signal to the user that the quantiles are possibly corrupted then 
> ERROR would be the default. A user can then decide what to do if they receive 
> errors during their analysis. Note that a separation of the Quantile API from 
> the partitioning API (see below) may result in doing NaN processing twice 
> introducing a performance overhead. If this will occur by default it should 
> be documented so a user can override the NaN behaviour if they do not expect 
> NaNs.
> Commons Math places the Percentile class in the rank package. I propose to 
> move this implementation to place it in the descriptive package where it will 
> sit beside other descriptive statistics for data such as mean and standard 
> deviation.
> h2. Examples
> {code:java}
> double[] data = ...
> double q = Quantile.withDefaults().evaluate(data, 0.25);
> double[] quantiles = Quantile.withDefaults().evaluate(data, 0.25, 0.5, 0.75);
> // Use cases:
> // 1. Median / other quantile estimation
> // 2. Box plot of data
> // 3. Interquartile range analysis
> double[] p = Quantile.probabilities(100, 0.01, 0.99);
> double[] quantiles = Quantile.withDefaults().evaluate(data, p);
> // Use cases:
> // 1. plot p vs quantiles
> // 2. use p with an expected (inverse) probability distribution to create a 
> QQ plot
> // Sorted input / unsupported datatype
> short[] data = ...
> Arrays.sort(data);
> double[] quantiles = Quantile.withDefaults().evaluate(data.length, i -> 
> data[i], p);
> {code}
> h2. Implementation
> The Quantile API and the underlying implementation can be separated. 
> Performing quantile estimation requires knowing the value of elements (i, 
> i+1) in a sorted array a. Note that i+1 is used for interpolation:
> {noformat}
> value = a[i] + alpha * (a[i+1] - a[i]) ; alpha in [0, 1)
> {noformat}
> Note: The alpha factor and the index i are chosen from the percentile p using 
> the EstimationMethod.
> The array does not have to be fully sorted. It can be partitioned so that 
> each required element i is equal to the value in a fully sorted array. A 
> partitioning API can be placed in Commons Numbers:
> {code:java}
> org.apache.commons.numbers.arrays
> public final class Selection {
>     // Operate in-place and support a range as per Arrays.sort
>     public void select(double[] a, int k);
>     public void select(double[] a, int... k);
>     public void select(int from, int to, double[] a, int k);
>     public void select(int from, int to, double[] a, int... k);
>     // Extend to other primitive types where sorting is slow
>     // Sorting of 16-bit data is fast using a histogram so selection
>     // has no major speed gain above a small length:
>     // byte[] does counting sort at length 64 (temurin JDK 11)
>     // short[]/char[] at length 1750 (temurin JDK 11)
> }
> {code}
> h3. Examples
> {code:java}
> // Find the bottom 100 (with 0-based indexing)
> Selection.select(data, 99);
> double[] bottom100 = Arrays.copyOf(data, 100);
> // Find the bottom and top 10
> Selection.select(data, 9, data.length - 10);
> double[] bottom10 = Arrays.copyOf(data, 10);
> double[] top10 = Arrays.copyOfRange(data, data.length - 10, data.length);
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (STATISTICS-85) Quantile implementation

Reply via email to