[ https://issues.apache.org/jira/browse/NUMBERS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17851712#comment-17851712 ]
Alex Herbert commented on NUMBERS-206:
--------------------------------------

h2. Single k selection

Note that development started with a method based on the Commons Math Percentile implementation of quickselect. This was changed to allow multiple indices to be passed in a single method call for the first generation functions, which use a variety of partition algorithms. This was then updated in a second development generation to add more options to configure the algorithm for multiple indices; only one implementation was provided, using the SBM partition algorithm. The introselect based algorithms allow configuration of the single/dual-pivot partition algorithm and the introspection. Typically introspection does not detect the need to switch to the alternative (stopper) algorithm. Here the stopper is the QuickselectAdaptive algorithm. Various linear select algorithms are provided, broadly grouped into a, b, or c based on their implementation complexity.

Note: The algorithms are configured using parameters harvested from the name. The default name is the base configuration for the method. This can be changed by appending parameters to the name, as is done for some introselect implementations that can use FR sub-sampling.

||Name||Generation||Description||
|SortJDK| |Sort the data using the JDK's Arrays.sort.|
|SPH|0|Single-pivot using a heap to cache pivots. This is the method used in the Commons Math Percentile implementation. The cache is null for a single index and ignored. Uses a single index input.|
|SP|1|Single-pivot using binary partitioning.|
|BM|1|Single-pivot using Bentley-McIlroy ternary partitioning. Main comparators are <=, >=.|
|SBM|1|Single-pivot using Sedgewick's Bentley-McIlroy ternary partitioning. Main comparators are <, >.|
|DP|1|Dual-pivot using Yaroslavskiy's original 1/div for pivot selection. The divisor (div) is initialised at 3 and is increased when the central partition is too large.|
|5DP|1|Dual-pivot using the 2nd and 4th positions from 5 sorted points as the pivots. The 5 points are sampled evenly from the range.|
|2SBM|2|2nd generation single-pivot using Sedgewick's Bentley-McIlroy ternary partitioning. This is the most consistently performing single-pivot partition method.|
|ISP|3|Single-pivot introselect. Pivot chosen using the dynamic strategy of BM: above 40, the ninther strategy of a median of 3 medians-of-3 from 9 points; otherwise a median-of-3.|
|IDP|3|Dual-pivot introselect. Pivots chosen using the 2nd and 4th of 5 sorted points.|
|LSP|a|Median-of-medians using medians of 5. Sample is placed on the left side of the data.|
|Linear BFPRT IM|b|Median-of-medians using medians of 5. Sample is placed in the centre of the data.|
|Linear BFPRTA|b|Median-of-medians using medians of 5. Sample is placed in the centre of the data. The pivot is chosen using an adaptive k.|
|Linear RS IM|b|Repeated-step. Sample is placed in the centre of the data.|
|Linear RSA|b|Repeated-step. Sample is placed in the centre of the data. The pivot is chosen using an adaptive k.|
|QA_CF16|c|QuickselectAdaptive original method. Uses adaptive k for all steps.|
|QA|c|QuickselectAdaptive using the original far step method, but with updated adaption of k for the far step to reduce the probability of placing the target in the larger partition.|
|QA_CF8|c|QuickselectAdaptive using the far step 2 method. Uses adaptive k for all steps.|
|FR| |Single-pivot using Floyd-Rivest sub-sampling when the range is >600.|
|KFR| |Kiwiel's dual-pivot using Floyd-Rivest sub-sampling when the range is >600.|
|ISP_SU600_PAIRED_KEYS|3|Single-pivot introselect. Pivot chosen using the dynamic strategy of BM. Uses FR sub-sampling when the range is >600.|
|ISP_SU1200_PAIRED_KEYS|3|Single-pivot introselect. Pivot chosen using the dynamic strategy of BM. Uses FR sub-sampling when the range is >1200.|
|ISP_SU1200_PAIRED_KEYS_CF2|3|Single-pivot introselect. Pivot chosen using the dynamic strategy of BM. Uses FR sub-sampling with a random sample when the range is >1200.|
|SELECT| |Commons Numbers arrays Selection implementation. Uses FR sub-sampling and falls back to QuickselectAdaptive if FR fails to hit the expected margins.|

Note: The PAIRED_KEYS name is required to activate a method that uses FR sub-sampling. The default method uses a collection of keys with only 1 key. That method is for selecting multiple keys together; as such it is not optimised for FR sub-sampling, which must target a single key.

h2. Results

Warning: Results may vary by machine. The current code has settings that perform well across several machines. One key development goal was that the speed should never be obviously slow, rather than being the optimum.

Tested on MacOSX 14.5; M2 Max CPU with 96GB RAM; Eclipse Adoptium JDK 21.0.3.

BM test data; range [50000, 50001]; n=984. Selection of a single k in 10 samples: 10 uniformly spaced indices are generated in the range and each index is one sample. The order of the 10 samples is permuted each benchmark invocation.

||Name||Score||Error||
|DP|3.12E+10|2.89E+09|
|SortJDK|6.22E+09|2.64E+08|
|LSP|3.99E+09|6.37E+07|
|Linear_BFPRT_IM|2.94E+09|1.39E+08|
|Linear_BFPRTA|2.45E+09|5.50E+07|
|Linear_RS_IM|2.42E+09|9.16E+07|
|Linear_RSA|2.27E+09|9.44E+07|
|SP|1.89E+09|5.30E+07|
|SPH|1.87E+09|5.66E+07|
|SBM|1.74E+09|1.43E+08|
|2SBM|1.70E+09|1.01E+08|
|BM|1.69E+09|7.11E+07|
|ISP|1.66E+09|1.38E+08|
|QA_CF16|1.62E+09|7.34E+07|
|QA|1.51E+09|5.45E+07|
|5DP|1.49E+09|2.76E+07|
|QA_CF8|1.47E+09|6.10E+07|
|IDP|1.47E+09|6.50E+07|
|ISP_SU600_PAIRED_KEYS|1.44E+09|5.43E+07|
|ISP_SU1200_PAIRED_KEYS|1.42E+09|6.25E+07|
|KFR|1.41E+09|5.10E+07|
|SELECT|1.40E+09|6.53E+07|
|FR|1.37E+09|1.29E+08|
|ISP_SU1200_PAIRED_KEYS_CF2|1.28E+09|1.31E+08|

h2. Observations

Sorting the entire data is slower than selection. The DP method is slower than a sort. This is an example where the method has had quadratic quickselect performance on at least one of the data samples. It is a demonstration that the method must be robust to all types of input data. Here are the two DP methods benchmarked on only random data:
{noformat}
DP     41881485.087 ± 2467338.158 ns/op
5DP    43475788.348 ± 2617480.169 ns/op
{noformat}
Here the DP method is within error of the alternative 5DP (which only changes the pivot strategy).

The improved version of the linear select algorithm (with a central layout of the pivot sample) is faster than the version which places the sample on the left (LSP vs Linear_BFPRT_IM). Here the only difference is the sample layout; the computed pivot for the partition will be the same. The adaptive positioning of k within the sample improves performance. The repeated-step algorithm is faster than median-of-medians.

The slowest quickselect is the original single-pivot quickselect adapted from Commons Math (SP; SPH). Switching to ternary partitioning methods (BM, SBM) improves performance.
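To illustrate, ternary partitioning gathers values equal to the pivot into a central region that never needs to be visited again, so a selection can return immediately when k falls in that region. A minimal sketch of the idea (a simple three-way partition for illustration only; this is not the Bentley-McIlroy or SBM code benchmarked here):
{code:java}
/**
 * Illustrative three-way partition (sketch only; NaN handling omitted).
 * On return: a[lo..lt-1] < pivot, a[lt..gt] == pivot, a[gt+1..hi] > pivot.
 * A selection for index k recurses into one side, or stops if lt <= k <= gt.
 */
static int[] partition3(double[] a, int lo, int hi) {
    final double pivot = a[lo];    // trivial pivot choice for the sketch
    int lt = lo;
    int gt = hi;
    int i = lo + 1;
    while (i <= gt) {
        if (a[i] < pivot) {
            swap(a, lt++, i++);
        } else if (a[i] > pivot) {
            swap(a, i, gt--);
        } else {
            i++;                   // equal to the pivot: leave it in the central region
        }
    }
    return new int[] {lt, gt};
}

static void swap(double[] a, int i, int j) {
    final double t = a[i];
    a[i] = a[j];
    a[j] = t;
}
{code}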
Note that the BM data creates many samples with a high number of repeated elements. Thus partitioning where values equal to the pivot are identified is an advantage.

There is no difference between the first and second generation selection methods using the same partition algorithm (SBM vs 2SBM). However, when introspection is added so the algorithm can detect poor quickselect performance (ISP) there is a small improvement in speed. This may be within the error margin, or could be an observation of introspection working.

The linear select algorithms are slower than quickselect with a simple pivot strategy. However, the QuickselectAdaptive methods are faster. The original QA_CF16 method is slowest; improvements to the adaption of k for the far step (QA), and then the change to the far step 2 method (QA_CF8), increase performance.

The dual-pivot introselect is faster on the BM data than the single-pivot introselect. However, note that the difference is an average: on some data the ISP is faster; on other data the IDP. The relative speed is also dependent on the location of k in the data, e.g. 10%, 25%, 50%. This benchmark uses a uniform selection of k and is thus representative of any k location.

The single-pivot methods are improved by using Floyd-Rivest sampling. The original FR method is very fast. This uses a sub-sampling size of 600; the ISP based methods use 1200. This value was arrived at by benchmarking on various data lengths: the sub-sampling size can be gradually increased until a noticeable difference in performance between ISP and ISP+FR is observed. The higher value indicates that the pivot selection strategy of the ISP method (using the BM dynamic selection) is more robust than the original FR pivot selection (which simply uses the target k). Note that FR is faster than ISP_SU1200_PAIRED_KEYS although they perform similar pivot selection; the differences are the sub-sampling size, binary vs ternary partitioning, and introspection, any of which may be changing the performance. The KFR method is slightly slower than the original FR method. Adding random sub-sampling to FR increases performance to create the fastest result (ISP_SU1200_PAIRED_KEYS_CF2).

The SELECT method is as fast as the FR methods. It is not as fast as the FR method with random sub-sampling. However it will be stable for repeated input as it will not move data out of sorted order, and it will have guaranteed linear performance on any input data as introspection will detect bad partitions and revert to QuickselectAdaptive.

The performance gain from using an FR-based algorithm can be observed using large random data, for example 50 million uniform random elements (10 repeats with a random k):
{noformat}
SP                       4295643800.200 ± 988383992.147  ns/op
QA_CF8                   3088731608.400 ± 1060959744.803 ns/op
FR                       1661633891.600 ± 1069784031.796 ns/op
SELECT                   1868087800.000 ± 1157148907.486 ns/op
ISP_SU1200_PAIRED_KEYS   1751146783.400 ± 1166364039.049 ns/op
{noformat}
Here the FR algorithm is fastest but the other FR-based methods are close given the error margins. The fastest QuickselectAdaptive configuration is approximately in between the FR methods and a standard quickselect implementation.
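For reference, the FR sub-sampling used by these methods follows the scheme of the classic Floyd-Rivest SELECT: recursively select on a small, carefully positioned sub-range so that a[k] becomes a pivot close to its final sorted position, then partition. The following is a sketch of the published algorithm with its original 600 threshold, not the exact code benchmarked here (which combines the sampling with the ternary partitioning and introspection discussed above):
{code:java}
/**
 * Classic Floyd-Rivest SELECT (after Floyd & Rivest, 1975): a sketch of the
 * published algorithm. Moves the k-th smallest element of a[left..right] to
 * index k.
 */
static void frSelect(double[] a, int left, int right, int k) {
    while (right > left) {
        if (right - left > 600) {
            // Recursively select on a small sub-range chosen so that a[k]
            // becomes a pivot very likely to be close to its final position.
            int n = right - left + 1;
            int m = k - left + 1;
            double z = Math.log(n);
            double s = 0.5 * Math.exp(2 * z / 3);
            double sd = 0.5 * Math.sqrt(z * s * (n - s) / n) * Math.signum(m - n / 2.0);
            int newLeft = Math.max(left, (int) (k - m * s / n + sd));
            int newRight = Math.min(right, (int) (k + (n - m) * s / n + sd));
            frSelect(a, newLeft, newRight, k);
        }
        // Partition a[left..right] around the pivot t = a[k].
        double t = a[k];
        int i = left;
        int j = right;
        swap(a, left, k);
        if (a[right] > t) {
            swap(a, right, left);
        }
        while (i < j) {
            swap(a, i, j);
            i++;
            j--;
            while (a[i] < t) {
                i++;
            }
            while (a[j] > t) {
                j--;
            }
        }
        if (a[left] == t) {
            swap(a, left, j);
        } else {
            j++;
            swap(a, j, right);
        }
        // Narrow [left, right] to the side containing index k.
        if (j <= k) {
            left = j + 1;
        }
        if (k <= j) {
            right = j - 1;
        }
    }
}

static void swap(double[] a, int i, int j) {
    double t = a[i];
    a[i] = a[j];
    a[j] = t;
}
{code}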
> Selection API
> -------------
>
> Key: NUMBERS-206
> URL: https://issues.apache.org/jira/browse/NUMBERS-206
> Project: Commons Numbers
> Issue Type: New Feature
> Components: arrays
> Reporter: Alex Herbert
> Priority: Major
>
> Create a selection API to select the k-th largest element in an array. This
> places at k the same value that would be at k in a fully sorted array.
> {code:java}
> public final class Selection {
>     public static void select(double[] a, int k);
>     public static void select(double[] a, int from, int to, int k);
>     public static void select(double[] a, int[] k);
>     public static void select(double[] a, int from, int to, int[] k);
>     // Extend to other primitive data types that are not easily sorted (e.g. long, float, int)
> }
> {code}
> Note: This API will support multiple points (int[] k) for use in quantile
> estimation of array data by interpolation of neighbouring values (see
> STATISTICS-85).
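For illustration, intended usage of the proposed API might look like the following. This is a sketch assuming the signatures quoted above and that the class lands in the commons-numbers-arrays module (package org.apache.commons.numbers.arrays); quantile interpolation per STATISTICS-85 is not shown.
{code:java}
import org.apache.commons.numbers.arrays.Selection;  // assumed location of the new class

public class SelectionExample {
    public static void main(String[] args) {
        double[] data = {5.0, 3.0, 8.0, 1.0, 9.0, 2.0, 7.0};

        // Single index: places at index 3 the value it would hold in a fully
        // sorted array (here the median of the 7 values).
        double[] a = data.clone();
        Selection.select(a, 3);
        System.out.println("median = " + a[3]);

        // Multiple indices in one call, e.g. neighbouring ranks for quantile interpolation.
        double[] b = data.clone();
        Selection.select(b, new int[] {1, 3, 5});
        System.out.println(b[1] + " " + b[3] + " " + b[5]);
    }
}
{code}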