[ https://issues.apache.org/jira/browse/STATISTICS-85?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17854364#comment-17854364 ]
Alex Herbert commented on STATISTICS-85: ---------------------------------------- h2. Median Note for the median that an odd length input array requires finding one index (k = n/2); an even length input requires finding indices (k-1, k). Data were created using a given length and extended with an additional range of 1 (e.g. n=[1000, 1001]). For the BM data all distributions at n and n+1 are created. For random data the length is randomly selected from the range. This creates a 50:50 mixture of odd and even length data. The data samples are randomly shuffled per iteration to reduce branch prediction effects. Tested on the BM distribution. ||n||CM3||CM4||JDK||Statistics|| |1000|3568053.381|4778367.635|7302762.526|2295992.742| |Relative|0.489|0.654|1.0|0.314| |100000|6242554116|6350143191|1375191908|314671031.5| |Relative|4.539|4.618|1.0|0.229| The CM implementation handles the small data. For larger data it is significantly worse than a full sort. The Statistics implementation is robust to the BM data distributions and 3-4x faster than a sort. Tested using random data. ||n||CM3||CM4||JDK||Statistics||Relative to JDK||Relative to CM3|| |100|4.289E+04|4.277E+04|1.277E+05|3.424E+04|0.27|0.80| |1000|1.007E+06|9.976E+05|2.900E+06|6.872E+05|0.24|0.68| |10000|1.020E+07|1.001E+07|3.944E+07|7.215E+06|0.18|0.71| |100000|9.613E+07|9.483E+07|5.040E+08|6.346E+07|0.13|0.66| !MedianRandomData.png|width=1000,height=800! On random data the CM implementation is faster than sorting. The statistics implementation is consistently faster than CM and the relative difference increases with length. > Quantile implementation > ----------------------- > > Key: STATISTICS-85 > URL: https://issues.apache.org/jira/browse/STATISTICS-85 > Project: Commons Statistics > Issue Type: New Feature > Components: descriptive > Reporter: Alex Herbert > Priority: Major > Attachments: 100QuantilesRandomDataLength10000.png, > MedianRandomData.png > > > Add a quantile implementation. This will interpolate the value of a sorted > array of data for probability p in [0, 1]. > Replace the legacy API from Commons Math Percentile with an updated API. The > new API should: > * Decouple estimation of quantile positions inside data of length n; and the > selection of correctly-ordered indices in array data. > * Support multiple data types. > * Support pre-sorted data. > * Avoid performance issues observed in the CM Percentile implementation. > h2. Proposed API > {code:java} > org.apache.commons.statistics.descriptive > public final class Quantile { > // overwrite=true; EstimationMethod.HF8; NaNPolicy.ERROR > public static Quantile withDefaults(); > public Quantile withOverwrite(boolean); > public Quantile with(EstimationMethod); > // Could support NaN handling ... see below for NaNPolicy > public Quantile with(NaNPolicy); > // Create n uniform probabilities in range [p1, p2] > public static double[] probabilities(int n); > public static double[] probabilities(int n, double p1, double p2); > // Quantiles on sorted data a of size n > public double evaluate(int n, java.util.function.IntToDoubleFunction a, > double p); > public double[] evaluate(int n, java.util.function.IntToDoubleFunction a, > double... p); > // Quantiles on the primitive types that cannot be easily sorted > public double evaluate(double[] a, double p); > public double[] evaluate(double[] a, double... p); > public double evaluate(int[] a, double p); > public double[] evaluate(int[] a, double... p); > public double evaluate(long[] a, double p); > public double[] evaluate(long[] a, double... p); > public double evaluate(float[] a, double p); > public double[] evaluate(float[] a, double... p); > // Provide the 9 methods in Hyndman and Fan (1996) > // Sample Quantiles in Statistical Packages. > // The American Statistician, 50, 361-365. > public abstract class Quantile$EstimationMethod extends > java.lang.Enum<Quantile$EstimationMethod> { > public static final Quantile$EstimationMethod HF1; > public static final Quantile$EstimationMethod HF2; > public static final Quantile$EstimationMethod HF3; > public static final Quantile$EstimationMethod HF4; > public static final Quantile$EstimationMethod HF5; > public static final Quantile$EstimationMethod HF6; > public static final Quantile$EstimationMethod HF7; > public static final Quantile$EstimationMethod HF8; > public static final Quantile$EstimationMethod HF9; > } > } > Note: The CM API used the 9 methods from Hyndman and Fann but labelled them > as R1-9; this may be derived from the same names used in the R language. I > propose to rename as HF1-9 to reflect the origin. > {code} > h2. NaNPolicy > There are multiple options here. For reference R and Python's numpy only > provide the option to exclude NaN: > * R: quantile errors if NaN is present. median returns NaN. They is an > option to exclude NaN. > * numpy: two methods are provided: median/nanmedian + quantile/nanquantile > (the non-nan versions will return NaN if any NaNs are present) > Commons Math provides a remapping. Note the Statistics ranking module has the > same NaNStrategy as that in CM: > * MINIMAL: map to -infinity > * MAXIMAL: map to +infinity > * REMOVED: ignore from the data > * FIXED: leave in place. This makes no sense for quantiles. It is done by > moving to the end following the order imposed by Double.compare. > * FAILED: raise an exception > I favour the simpler option of: treating NaN so they are above/below all > other values; removing them from the data; or raising an exception. I do not > see the requirement to remap NaN to infinity. This can be done by the user. > The API can be simplified by using: > {code:java} > public final class NaNPolicy extends java.lang.Enum<NaNPolicy> { > public static final NaNPolicy LAST; // Move to end of data > public static final NaNPolicy FIRST; // Move to start of data > public static final NaNPolicy REMOVE; // Remove from data > public static final NaNPolicy ERROR; // Raise exception > } > {code} > Note that the FIRST option is not strictly required. But if there is an > option to order the NaNs (i.e. LAST) then both orders should be supported. > Which option to use as the default is not clear. As a drop in substitute for > Arrays.sort the default would be handle NaN with a move to the end (LAST). As > an API to signal to the user that the quantiles are possibly corrupted then > ERROR would be the default. A user can then decide what to do if they receive > errors during their analysis. Note that a separation of the Quantile API from > the partitioning API (see below) may result in doing NaN processing twice > introducing a performance overhead. If this will occur by default it should > be documented so a user can override the NaN behaviour if they do not expect > NaNs. > Commons Math places the Percentile class in the rank package. I propose to > move this implementation to place it in the descriptive package where it will > sit beside other descriptive statistics for data such as mean and standard > deviation. > h2. Examples > {code:java} > double[] data = ... > double q = Quantile.withDefaults().evaluate(data, 0.25); > double[] quantiles = Quantile.withDefaults().evaluate(data, 0.25, 0.5, 0.75); > // Use cases: > // 1. Median / other quantile estimation > // 2. Box plot of data > // 3. Interquartile range analysis > double[] p = Quantile.probabilities(100, 0.01, 0.99); > double[] quantiles = Quantile.withDefaults().evaluate(data, p); > // Use cases: > // 1. plot p vs quantiles > // 2. use p with an expected (inverse) probability distribution to create a > QQ plot > // Sorted input / unsupported datatype > short[] data = ... > Arrays.sort(data); > double[] quantiles = Quantile.withDefaults().evaluate(data.length, i -> > data[i], p); > {code} > h2. Implementation > The Quantile API and the underlying implementation can be separated. > Performing quantile estimation requires knowing the value of elements (i, > i+1) in a sorted array a. Note that i+1 is used for interpolation: > {noformat} > value = a[i] + alpha * (a[i+1] - a[i]) ; alpha in [0, 1) > {noformat} > Note: The alpha factor and the index i are chosen from the percentile p using > the EstimationMethod. > The array does not have to be fully sorted. It can be partitioned so that > each required element i is equal to the value in a fully sorted array. A > partitioning API can be placed in Commons Numbers: > {code:java} > org.apache.commons.numbers.arrays > public final class Selection { > // Operate in-place and support a range as per Arrays.sort > public void select(double[] a, int k); > public void select(double[] a, int... k); > public void select(int from, int to, double[] a, int k); > public void select(int from, int to, double[] a, int... k); > // Extend to other primitive types where sorting is slow > // Sorting of 16-bit data is fast using a histogram so selection > // has no major speed gain above a small length: > // byte[] does counting sort at length 64 (temurin JDK 11) > // short[]/char[] at length 1750 (temurin JDK 11) > } > {code} > h3. Examples > {code:java} > // Find the bottom 100 (with 0-based indexing) > Selection.select(data, 99); > double[] bottom100 = Arrays.copyOf(data, 100); > // Find the bottom and top 10 > Selection.select(data, 9, data.length - 10); > double[] bottom10 = Arrays.copyOf(data, 10); > double[] top10 = Arrays.copyOfRange(data, data.length - 10, data.length); > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)