[jira] [Commented] (NUMBERS-206) Selection API

Alex Herbert (Jira) Wed, 05 Jun 2024 09:31:06 -0700


    [ 
https://issues.apache.org/jira/browse/NUMBERS-206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17852511#comment-17852511
 ]


Alex Herbert commented on NUMBERS-206:
--------------------------------------

h1. Multiple Index Selection Results

Multi-index selection methods are based on single-pivot or dual-pivot 
partitioning. Pivots may be cached to allow bracketing searches, or indices may 
be passed to the quicksort to allow selective recursion to regions of interest.
h2. Double Data

The < relation does not impose a total order on all floating-point values. All 
methods respect the ordering imposed by {{{}Double#compare(double, double){}}}. 
-0.0 is treated as less than value 0.0; NaN is considered greater than any 
other value; and all NaN values are considered equal. This is implemented with 
several strategies:
 # Compare all values with an equivalent of Double.compare. Special routines 
were written to return a boolean value to avoid the use of Double.compare(x, y) 
> 0 or Double.compare(x, y) < 0. This did increase performance over use of 
Double.compare.
 # Preprocess data: Move NaN values to the end. Partition using the < operator. 
Detect and correct reordered signed zeros if a pivot is zero. Note if a pivot 
is not zero then the order of zeros below / above pivots is not important.
 # Preprocess data: Move NaN values to the end; count and strip signed zeros. 
Partition using the < operator. Postprocess data to restore the first 'count' 
zeros as -0.0.

Note that stripping the sign from zeros allows the pivot value to be used as a 
proxy for elements that match the pivot value. Other strategies instead must 
always use a[i] as the value if a[i] == pivotValue when moving the element to 
preserve the count of signed zeros (since a[i] could be -0.0 or 0.0).
h2. Methods 
||Method||Generation||Double Strategy||Index Strategy||Index 
Structure||Partition||Description||
|SortJDK| |3| | | |Sort the data using JDK's Arrays.sort.|
|SPH_QS16|0|1|Sparse pivot cache|Heap|binary|This is the method used in the 
Commons Math Percentile implementation with modifications for efficient double 
compare comparisons.|
|SP_QS16|1|1|Pivot cache|BitSet|binary|As per the Commons Math Percentile but 
switches the heap to a BitSet to store pivots.|
|SPN_QS16|1|2|Pivot cache|BitSet|binary|Pre-sorts NaN and allows uses of < and 
> comparators.|
|SBM_QS16|1|2|Pivot cache|BitSet|ternary|Pre-sorts NaN and allows uses of <, == 
and > comparators.|
|2SBM_SEQUENTIAL|2|2|Pivot ranges|IndexSet|ternary|Sorts indices and processes 
them sequentially in intervals [ka1, kb1], [ka2, kb2], etc. Joins indices with 
a minimum separation of 8 and can sort saturated ranges. Each interval is 
processed separately.|
|2SBM_PIVOT_CACHE|2|2|Pivot cache|IndexSet|ternary|Cache indices within [minK, 
maxK]. Partitions minK, then maxK, then internal indices.|
|2SBM_INDEX_SET|2|2|Pivot cache|IndexSet|ternary|Cache indices in [left, 
right]. Processes indices in input order.|
|ISP_ORDERED_KEYS|3|3|Index splitting|Sorted array|ternary|Partitioning uses an 
array of indices and markers into the array for the current interval [ka, kb]. 
This allows efficient bracketing for splitting the range. Cannot be easily 
abstracted to an interface.|
|ISP_SCANNING_KEY_SEARCHABLE_INTERVAL|3|3|Index splitting|Sorted 
array|ternary|Indices wrapped in a data structure which uses a linear scan of 
the entire range. Fingers are used to jump to the start point.|
|ISP_SEARCH_KEY_SEARCHABLE_INTERVAL|3|3|Index splitting|Sorted 
array|ternary|Indices wrapped in a data structure which uses a binary search of 
the entire range.|
|ISP_INDEX_SET|3|3|Index splitting|IndexSet|ternary|Indices stored in a data 
structure which uses linear scan of the range from the starting index.|
|ISP_COMPRESSED_INDEX_SET|3|3|Sparse index splitting|IndexSet|ternary|Indices 
stored in a data structure which uses linear scan of the range from the 
starting index. Index compression is 2x, i.e. every pair of indices is stored. 
Look-up of indices returns k or k+1 depending on the left/right search context. 
Compression means more regions must be processed as sorted to ensure 
uncompressed indices are correct. Compression reduces memory and increases 
search speed.|
|ISP_KEY_UPDATING_INTERVAL|3|3|Index splitting|Sorted array|ternary|Indices 
wrapped in a data structure with start and end markers. Search (either 
linear/binary) only within the markers.|
|ISP_INDEX_SET_UPDATING_INTERVAL|3|3|Index splitting|IndexSet|ternary|Indices 
stored in a data structure with start and end markers. Search uses linear scan 
of the range from the starting index.|
|ISP_INDEX_ITERATOR|3|3|Index iterator|Sorted array|ternary|Indices wrapped in 
a data structure which iterates over start and end markers of the current range 
[ka1, kb1], [ka2, kb2], etc. Joins indices within a separation of 2. The 
iterator is used during introsort recursion to mark region to process. Can sort 
saturated ranges.|

Note: QS16: Switch to a full sort when the length is below 16. Introselect 
methods use a sortselect (partial sort of m elements in the range) when the 
length is below 20.

Ternary partitioning schemes use a variant of Sedgewick's Bently-McIlroy 
partition method.

IndexSet is a specialised data structure like a BitSet that is constructed with 
a fixed backing array and an offset from zero. This allows it to use the 
minimum storage for [left, right].

Several variants of index splitting for introsort have been tested. Most use an 
abstraction to an interface of the index structure. This allows it to be 
optimised for the index density. Only the raw use of the sorted index array 
with pointers ia and ib to indices ka and kb does not use a data abstraction. 
The direct use of pointers allows the method to know the number of keys within 
the current range ka to kb and make decisions based on this.

All single-pivot introsort based methods (ISP) are duplicated for dual-pivot 
(IDP) withe the exception of SEQUENTIAL. Dual-pivot partitioning is by default 
ternary (<p1, p1<= && <=p2, >p2). If the central region is larger than expected 
(it should be 1/3) then a second pass looks for equal values in the central 
partition. This can thus detect large numbers of repeat elements if chosen as a 
pivot. It may miss low numbers of repeats.

Single-pivot methods use the dynamic pivoting strategy (>40 the ninther median 
of three medians-of-3; otherwise median-of-3).

Dual-pivot methods use 2 and 4 of 5 sorted points.
h2. Results

Tested on MacOSX 14.5; M2 Max cpu with 96Gb RAM; with Eclipse Adoptium JDK 
21.0.3.

BM test data; range [50000, 50001]; n=984.

Selection of a multiple k in 10 repeats. Indices are randomly distributed in 
the range. The order of the 10 repeats are permuted each benchmark invocation. 
An identical random seed is used for all benchmarks for partity.
||Method||2||3||5||10||50||100||Grand Total||
|SPH_QS16|2935212592|3958556275|5633278208|11551652125|37139022858|80272651625|1.4149E+11|
|SortJDK|6299855558|6342845575|6311112558|6387061133|6362373433|6328273125|38031521383|
|SP_QS16|2393546808|2853239850|3487652925|4851848592|6828236108|8697585558|29112109841|
|SPN_QS16|2331264341|2702162708|3271903567|4025006367|6189642300|6974660825|25494640108|
|2SBM_SEQUENTIAL|2225606992|2602926625|3047485683|3737730492|5126969350|6128371800|22869090942|
|2SBM_PIVOT_CACHE|2220154233|2572551317|3049332759|3761262358|5343840267|5787662817|22734803751|
|2SBM_INDEX_SET|2230833175|2560114183|3065098058|3712469800|5090656275|6071284967|22730456458|
|SBM_QS16|2235443058|2551651625|3020757975|3728188058|5043798092|5763689042|22343527849|
|ISP_ORDERED_KEYS|2135611809|2449368742|2876315734|3518354391|5015248084|5711437400|21706336158|
|ISP_SCANNING_KEY_SEARCHABLE_INTERVAL|2194975300|2435411650|2887126867|3494219883|4999780300|5656170075|21667684075|
|ISP_INDEX_SET|2221913283|2428864892|2864250867|3508743058|4979822733|5638445650|21642040483|
|ISP_KEY_UPDATING_INTERVAL|2165703533|2428744367|2885891458|3510715042|4992191050|5638971317|21622216767|
|ISP_INDEX_ITERATOR|2133941483|2460761150|2878827175|3493392616|5003309233|5639837942|21610069599|
|ISP_COMPRESSED_INDEX_SET|2131995642|2420814917|2876507675|3518649408|5016361509|5645549583|21609878734|
|ISP_INDEX_SET_UPDATING_INTERVAL|2146630950|2443437916|2896159000|3498545675|4978484875|5626868958|21590127375|
|ISP_SEARCH_KEY_SEARCHABLE_INTERVAL|2150356833|2441610508|2872189367|3491902042|4986592392|5638279608|21580930750|
|SELECT|1818460975|2104669217|2441513925|3017016341|4335819217|4867003575|18584483250|
|IDP_INDEX_ITERATOR|1853175783|2078940700|2429721533|2950167450|4212081559|4753586842|18277673867|
|IDP_INDEX_SET|1877362700|2093442934|2429127142|2927412758|4209576308|4738519742|18275441583|
|IDP_KEY_UPDATING_INTERVAL|1840146392|2071597192|2448446875|2938923542|4225289234|4746312525|18270715759|
|IDP_SEARCH_KEY_SEARCHABLE_INTERVAL|1832577742|2085849084|2449065825|2922292666|4211871958|4755141175|18256798450|
|IDP_COMPRESSED_INDEX_SET|1835870150|2083533359|2426267442|2937226617|4216215283|4749168433|18248281283|
|IDP_INDEX_SET_UPDATING_INTERVAL|1827451708|2083277416|2424767633|2926749558|4236465517|4742716934|18241428766|
|IDP_SCANNING_KEY_SEARCHABLE_INTERVAL|1842850375|2084051475|2429099483|2933754617|4222331508|4710917383|18223004842|
h2. Observations

The use of the heap to store pivots is slow (SPH_QS16). Even 2 indices is 
slower than using the BitSet (SP_QS16). The heap method requires that the 
algorithm partitions k1 and stores pivots. It then revisits the entire array 
but uses the pivots from the first pass to inform branch decisions about k2; it 
skips partition steps if the branch has been previously performed. As such it 
does not directly bracket the target index with the closest pivots but instead 
runs through partitioning until it finds a region it does not know about. With 
a default heap size of 10 this is not enough to remember the full recursion 
depth. A single-pivot quickselect would use on average log2(n) steps = 15.6; it 
may use a lot more. This is evident with a large number of indices as the 
algorithm will have visited regions before but not remember them. At k=100 the 
performance is 12x slower than a full sort.

The sort is essentially the same speed for any k. The benchmark does a sort, 
then extracts an array of length k with the values at each index. This step is 
a negligible overhead.

The change from SP_QS16 to SPN_QS16 is a switch from using the custom 
Double.compare method to a method that pre-sorts NaN and uses the <,> 
operators. This is similar in speed at k=2 but is increasingly faster with 
higher k. This is because the preprocessing step is a fixed cost and the use of 
<,> operators is faster than the Double.compare.

The change from SPN_QS16 to SBM_QS_16 is a switch from binary to ternary 
partitioning. Ternary partitioning collects values equal to the pivot. This is 
noticeably faster at higher k. Note the BM data contains a lot of samples with 
repeat elements.

The change from SBM_QS16 to 2SBM_XXX is a switch of the method architecture to 
allow the index structure to be specified separately to the partitioning 
method. All these methods are basically the same speed. They use SBM ternary 
partitioning and fixed signed zeros when encountered. The SEQUENTIAL_PROCESSING 
may be slightly slower than the others that use BitSet-type pivot storage.

Single-pivot introsort (ISP) based methods are slightly faster than the 2nd 
generation SBM methods. This may be due to: introspection on some data 
switching the algorithm to heapselect; the change to not store pivots, but 
instead pass the keys to the partitioning method to inform decisions on which 
ranges to process; the use of sortselect as the finishing method instead of a 
sort of small ranges.

Dual-pivot introsort (IDP) based methods are faster than the ISP methods. The 
ratio between the two is stable across k. Summing all times from matching ISP 
or IDP results:
| |2|3|5|10|50|100|Total|
|ISP|15145517025|17059645400|20160952408|24516167725|34956542092|39484123133|1.51323E+11|
|IDP|12909434850|14580692159|17036495934|20536527208|29533831367|33196363033|1.27793E+11|
|Ratio|1.173213018|1.170016157|1.183397835|1.193783519|1.183610134|1.189411114|1.18412229|

The IDP methods for k=100 are faster than a sort. The average separation is 500 
so the keys are not close to saturated.
h2. SELECT performance

The SELECT method is marginally slower than any IDP method. The SELECT method 
current implements a dual-pivot introsort similar method to the 
KEY_UPDATING_INTERVAL or INDEX_SET_UPDATING_INTERVAL. It uses an interface to 
define the functionality to update the left/right bounds and split the 
interval. The implementation is dynamically chosen. This allows it to use a key 
interval for a small number of keys, or an index set interval for dense keys. 
The method also validates all input indices. This is an overhead not present in 
any of the benchmarking methods. It may add approximately 1% to the runtime. 

One other difference is SELECT will switch to the single-pivot select when the 
range of keys is small. This may add an overhead for when the index density is 
such that ranges are small when the single-pivot method is invoked. When ranges 
are very large then the single-pivot search can use the adaptive quickselect 
with Floyd-Rivest sub-sampling. For example a length of 5 million, with k=5 (10 
repeats) on 3 data samples:
{noformat}
SELECT           1720664425.200 ± 219670852.166  ns/op
   IDP           2076975425.000 ± 193513007.771  ns/op{noformat}
Here the SELECT algorithm is noticeably faster then the IDP method (with 
defaults = INDEX_SET).

 

> Selection API
> -------------
>
>                 Key: NUMBERS-206
>                 URL: https://issues.apache.org/jira/browse/NUMBERS-206
>             Project: Commons Numbers
>          Issue Type: New Feature
>          Components: arrays
>            Reporter: Alex Herbert
>            Priority: Major
>
> Create a selection API to select the k-th largest element in an array. This 
> places at k the same value that would be at k in a fully sorted array.
> {code:java}
> public final class Selection {
>     public static void select(double[] a, int k);
>     public static void select(double[] a, int from, int to, int k);
>     public static void select(double[] a, int[] k);
>     public static void select(double[] a, int from, int to, int[] k);
>     // Extend to other primitive data types that are not easily sorted (e.g. 
> long, float, int)
> {code}
> Note: This API will support multiple points (int[] k) for use in quantile 
> estimation of array data by interpolation of neighbouring values (see 
> STATISTICS-85).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (NUMBERS-206) Selection API

Reply via email to