[ https://issues.apache.org/jira/browse/NUMBERS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17852511#comment-17852511 ]
Alex Herbert commented on NUMBERS-206: -------------------------------------- h1. Multiple Index Selection Results Multi-index selection methods are based on single-pivot or dual-pivot partitioning. Pivots may be cached to allow bracketing searches, or indices may be passed to the quicksort to allow selective recursion to regions of interest. h2. Double Data The < relation does not impose a total order on all floating-point values. All methods respect the ordering imposed by {{{}Double#compare(double, double){}}}. -0.0 is treated as less than value 0.0; NaN is considered greater than any other value; and all NaN values are considered equal. This is implemented with several strategies: # Compare all values with an equivalent of Double.compare. Special routines were written to return a boolean value to avoid the use of Double.compare(x, y) > 0 or Double.compare(x, y) < 0. This did increase performance over use of Double.compare. # Preprocess data: Move NaN values to the end. Partition using the < operator. Detect and correct reordered signed zeros if a pivot is zero. Note if a pivot is not zero then the order of zeros below / above pivots is not important. # Preprocess data: Move NaN values to the end; count and strip signed zeros. Partition using the < operator. Postprocess data to restore the first 'count' zeros as -0.0. Note that stripping the sign from zeros allows the pivot value to be used as a proxy for elements that match the pivot value. Other strategies instead must always use a[i] as the value if a[i] == pivotValue when moving the element to preserve the count of signed zeros (since a[i] could be -0.0 or 0.0). h2. Methods ||Method||Generation||Double Strategy||Index Strategy||Index Structure||Partition||Description|| |SortJDK| |3| | | |Sort the data using JDK's Arrays.sort.| |SPH_QS16|0|1|Sparse pivot cache|Heap|binary|This is the method used in the Commons Math Percentile implementation with modifications for efficient double compare comparisons.| |SP_QS16|1|1|Pivot cache|BitSet|binary|As per the Commons Math Percentile but switches the heap to a BitSet to store pivots.| |SPN_QS16|1|2|Pivot cache|BitSet|binary|Pre-sorts NaN and allows uses of < and > comparators.| |SBM_QS16|1|2|Pivot cache|BitSet|ternary|Pre-sorts NaN and allows uses of <, == and > comparators.| |2SBM_SEQUENTIAL|2|2|Pivot ranges|IndexSet|ternary|Sorts indices and processes them sequentially in intervals [ka1, kb1], [ka2, kb2], etc. Joins indices with a minimum separation of 8 and can sort saturated ranges. Each interval is processed separately.| |2SBM_PIVOT_CACHE|2|2|Pivot cache|IndexSet|ternary|Cache indices within [minK, maxK]. Partitions minK, then maxK, then internal indices.| |2SBM_INDEX_SET|2|2|Pivot cache|IndexSet|ternary|Cache indices in [left, right]. Processes indices in input order.| |ISP_ORDERED_KEYS|3|3|Index splitting|Sorted array|ternary|Partitioning uses an array of indices and markers into the array for the current interval [ka, kb]. This allows efficient bracketing for splitting the range. Cannot be easily abstracted to an interface.| |ISP_SCANNING_KEY_SEARCHABLE_INTERVAL|3|3|Index splitting|Sorted array|ternary|Indices wrapped in a data structure which uses a linear scan of the entire range. Fingers are used to jump to the start point.| |ISP_SEARCH_KEY_SEARCHABLE_INTERVAL|3|3|Index splitting|Sorted array|ternary|Indices wrapped in a data structure which uses a binary search of the entire range.| |ISP_INDEX_SET|3|3|Index splitting|IndexSet|ternary|Indices stored in a data structure which uses linear scan of the range from the starting index.| |ISP_COMPRESSED_INDEX_SET|3|3|Sparse index splitting|IndexSet|ternary|Indices stored in a data structure which uses linear scan of the range from the starting index. Index compression is 2x, i.e. every pair of indices is stored. Look-up of indices returns k or k+1 depending on the left/right search context. Compression means more regions must be processed as sorted to ensure uncompressed indices are correct. Compression reduces memory and increases search speed.| |ISP_KEY_UPDATING_INTERVAL|3|3|Index splitting|Sorted array|ternary|Indices wrapped in a data structure with start and end markers. Search (either linear/binary) only within the markers.| |ISP_INDEX_SET_UPDATING_INTERVAL|3|3|Index splitting|IndexSet|ternary|Indices stored in a data structure with start and end markers. Search uses linear scan of the range from the starting index.| |ISP_INDEX_ITERATOR|3|3|Index iterator|Sorted array|ternary|Indices wrapped in a data structure which iterates over start and end markers of the current range [ka1, kb1], [ka2, kb2], etc. Joins indices within a separation of 2. The iterator is used during introsort recursion to mark region to process. Can sort saturated ranges.| Note: QS16: Switch to a full sort when the length is below 16. Introselect methods use a sortselect (partial sort of m elements in the range) when the length is below 20. Ternary partitioning schemes use a variant of Sedgewick's Bently-McIlroy partition method. IndexSet is a specialised data structure like a BitSet that is constructed with a fixed backing array and an offset from zero. This allows it to use the minimum storage for [left, right]. Several variants of index splitting for introsort have been tested. Most use an abstraction to an interface of the index structure. This allows it to be optimised for the index density. Only the raw use of the sorted index array with pointers ia and ib to indices ka and kb does not use a data abstraction. The direct use of pointers allows the method to know the number of keys within the current range ka to kb and make decisions based on this. All single-pivot introsort based methods (ISP) are duplicated for dual-pivot (IDP) withe the exception of SEQUENTIAL. Dual-pivot partitioning is by default ternary (<p1, p1<= && <=p2, >p2). If the central region is larger than expected (it should be 1/3) then a second pass looks for equal values in the central partition. This can thus detect large numbers of repeat elements if chosen as a pivot. It may miss low numbers of repeats. Single-pivot methods use the dynamic pivoting strategy (>40 the ninther median of three medians-of-3; otherwise median-of-3). Dual-pivot methods use 2 and 4 of 5 sorted points. h2. Results Tested on MacOSX 14.5; M2 Max cpu with 96Gb RAM; with Eclipse Adoptium JDK 21.0.3. BM test data; range [50000, 50001]; n=984. Selection of a multiple k in 10 repeats. Indices are randomly distributed in the range. The order of the 10 repeats are permuted each benchmark invocation. An identical random seed is used for all benchmarks for partity. ||Method||2||3||5||10||50||100||Grand Total|| |SPH_QS16|2935212592|3958556275|5633278208|11551652125|37139022858|80272651625|1.4149E+11| |SortJDK|6299855558|6342845575|6311112558|6387061133|6362373433|6328273125|38031521383| |SP_QS16|2393546808|2853239850|3487652925|4851848592|6828236108|8697585558|29112109841| |SPN_QS16|2331264341|2702162708|3271903567|4025006367|6189642300|6974660825|25494640108| |2SBM_SEQUENTIAL|2225606992|2602926625|3047485683|3737730492|5126969350|6128371800|22869090942| |2SBM_PIVOT_CACHE|2220154233|2572551317|3049332759|3761262358|5343840267|5787662817|22734803751| |2SBM_INDEX_SET|2230833175|2560114183|3065098058|3712469800|5090656275|6071284967|22730456458| |SBM_QS16|2235443058|2551651625|3020757975|3728188058|5043798092|5763689042|22343527849| |ISP_ORDERED_KEYS|2135611809|2449368742|2876315734|3518354391|5015248084|5711437400|21706336158| |ISP_SCANNING_KEY_SEARCHABLE_INTERVAL|2194975300|2435411650|2887126867|3494219883|4999780300|5656170075|21667684075| |ISP_INDEX_SET|2221913283|2428864892|2864250867|3508743058|4979822733|5638445650|21642040483| |ISP_KEY_UPDATING_INTERVAL|2165703533|2428744367|2885891458|3510715042|4992191050|5638971317|21622216767| |ISP_INDEX_ITERATOR|2133941483|2460761150|2878827175|3493392616|5003309233|5639837942|21610069599| |ISP_COMPRESSED_INDEX_SET|2131995642|2420814917|2876507675|3518649408|5016361509|5645549583|21609878734| |ISP_INDEX_SET_UPDATING_INTERVAL|2146630950|2443437916|2896159000|3498545675|4978484875|5626868958|21590127375| |ISP_SEARCH_KEY_SEARCHABLE_INTERVAL|2150356833|2441610508|2872189367|3491902042|4986592392|5638279608|21580930750| |SELECT|1818460975|2104669217|2441513925|3017016341|4335819217|4867003575|18584483250| |IDP_INDEX_ITERATOR|1853175783|2078940700|2429721533|2950167450|4212081559|4753586842|18277673867| |IDP_INDEX_SET|1877362700|2093442934|2429127142|2927412758|4209576308|4738519742|18275441583| |IDP_KEY_UPDATING_INTERVAL|1840146392|2071597192|2448446875|2938923542|4225289234|4746312525|18270715759| |IDP_SEARCH_KEY_SEARCHABLE_INTERVAL|1832577742|2085849084|2449065825|2922292666|4211871958|4755141175|18256798450| |IDP_COMPRESSED_INDEX_SET|1835870150|2083533359|2426267442|2937226617|4216215283|4749168433|18248281283| |IDP_INDEX_SET_UPDATING_INTERVAL|1827451708|2083277416|2424767633|2926749558|4236465517|4742716934|18241428766| |IDP_SCANNING_KEY_SEARCHABLE_INTERVAL|1842850375|2084051475|2429099483|2933754617|4222331508|4710917383|18223004842| h2. Observations The use of the heap to store pivots is slow (SPH_QS16). Even 2 indices is slower than using the BitSet (SP_QS16). The heap method requires that the algorithm partitions k1 and stores pivots. It then revisits the entire array but uses the pivots from the first pass to inform branch decisions about k2; it skips partition steps if the branch has been previously performed. As such it does not directly bracket the target index with the closest pivots but instead runs through partitioning until it finds a region it does not know about. With a default heap size of 10 this is not enough to remember the full recursion depth. A single-pivot quickselect would use on average log2(n) steps = 15.6; it may use a lot more. This is evident with a large number of indices as the algorithm will have visited regions before but not remember them. At k=100 the performance is 12x slower than a full sort. The sort is essentially the same speed for any k. The benchmark does a sort, then extracts an array of length k with the values at each index. This step is a negligible overhead. The change from SP_QS16 to SPN_QS16 is a switch from using the custom Double.compare method to a method that pre-sorts NaN and uses the <,> operators. This is similar in speed at k=2 but is increasingly faster with higher k. This is because the preprocessing step is a fixed cost and the use of <,> operators is faster than the Double.compare. The change from SPN_QS16 to SBM_QS_16 is a switch from binary to ternary partitioning. Ternary partitioning collects values equal to the pivot. This is noticeably faster at higher k. Note the BM data contains a lot of samples with repeat elements. The change from SBM_QS16 to 2SBM_XXX is a switch of the method architecture to allow the index structure to be specified separately to the partitioning method. All these methods are basically the same speed. They use SBM ternary partitioning and fixed signed zeros when encountered. The SEQUENTIAL_PROCESSING may be slightly slower than the others that use BitSet-type pivot storage. Single-pivot introsort (ISP) based methods are slightly faster than the 2nd generation SBM methods. This may be due to: introspection on some data switching the algorithm to heapselect; the change to not store pivots, but instead pass the keys to the partitioning method to inform decisions on which ranges to process; the use of sortselect as the finishing method instead of a sort of small ranges. Dual-pivot introsort (IDP) based methods are faster than the ISP methods. The ratio between the two is stable across k. Summing all times from matching ISP or IDP results: | |2|3|5|10|50|100|Total| |ISP|15145517025|17059645400|20160952408|24516167725|34956542092|39484123133|1.51323E+11| |IDP|12909434850|14580692159|17036495934|20536527208|29533831367|33196363033|1.27793E+11| |Ratio|1.173213018|1.170016157|1.183397835|1.193783519|1.183610134|1.189411114|1.18412229| The IDP methods for k=100 are faster than a sort. The average separation is 500 so the keys are not close to saturated. h2. SELECT performance The SELECT method is marginally slower than any IDP method. The SELECT method current implements a dual-pivot introsort similar method to the KEY_UPDATING_INTERVAL or INDEX_SET_UPDATING_INTERVAL. It uses an interface to define the functionality to update the left/right bounds and split the interval. The implementation is dynamically chosen. This allows it to use a key interval for a small number of keys, or an index set interval for dense keys. The method also validates all input indices. This is an overhead not present in any of the benchmarking methods. It may add approximately 1% to the runtime. One other difference is SELECT will switch to the single-pivot select when the range of keys is small. This may add an overhead for when the index density is such that ranges are small when the single-pivot method is invoked. When ranges are very large then the single-pivot search can use the adaptive quickselect with Floyd-Rivest sub-sampling. For example a length of 5 million, with k=5 (10 repeats) on 3 data samples: {noformat} SELECT 1720664425.200 ± 219670852.166 ns/op IDP 2076975425.000 ± 193513007.771 ns/op{noformat} Here the SELECT algorithm is noticeably faster then the IDP method (with defaults = INDEX_SET). > Selection API > ------------- > > Key: NUMBERS-206 > URL: https://issues.apache.org/jira/browse/NUMBERS-206 > Project: Commons Numbers > Issue Type: New Feature > Components: arrays > Reporter: Alex Herbert > Priority: Major > > Create a selection API to select the k-th largest element in an array. This > places at k the same value that would be at k in a fully sorted array. > {code:java} > public final class Selection { > public static void select(double[] a, int k); > public static void select(double[] a, int from, int to, int k); > public static void select(double[] a, int[] k); > public static void select(double[] a, int from, int to, int[] k); > // Extend to other primitive data types that are not easily sorted (e.g. > long, float, int) > {code} > Note: This API will support multiple points (int[] k) for use in quantile > estimation of array data by interpolation of neighbouring values (see > STATISTICS-85). -- This message was sent by Atlassian Jira (v8.20.10#820010)