jimczi opened a new pull request, #16202: URL: https://github.com/apache/lucene/pull/16202
Adds `BinarySortField`, letting a single-valued `BinaryDocValues` field be used as an index sort (and search sort). Values are compared directly (default unsigned byte order, custom `Comparator<BytesRef>` supported) instead of through a term dictionary and ordinal map, so it suits high-cardinality fields where a dictionary is pure overhead. Because nothing is dictionary-encoded or compressed, it is expected to be faster at high cardinality, most visibly on the compaction (force) merge, which avoids rebuilding the global ordinal map. Implementation notes: - `IndexSorter` gains a `ComparableValues` abstraction for the cross-segment order so it no longer has to be a `long`; the default adapts the existing `ComparableProvider` with no behaviour change, leaving numeric and string sorts untouched. - `BinarySorter` reads each segment strictly sequentially (`nextDoc` only) during merge — efficient even for block-compressed binary formats — and compares slices in place at flush. - Fixes `BinaryDocValuesWriter` to freeze its `PagedBytes` lazily, since index sorting reads the in-RAM values before `flush()`. The change is additive and non-breaking: no existing method signature changes (`getComparableProviders` becomes a `default`, which is binary- and source-compatible), the index-sort serialization format is unchanged, and the index sort remains immutable (a binary sort can only be set on a new index). ## Benchmark `IndexSortDocValuesBenchmark` (JMH, `SingleShotTime`) builds a sorted index of 1M documents with 16-byte values, using the default flush/merge configuration, comparing `BinarySortField` (`BinaryDocValues`) against a `STRING` sort (`SortedDocValues`). `indexWithNaturalMerges` indexes everything and waits for the naturally scheduled merges; `indexThenForceMerge` additionally force-merges to a single segment. | cardinality | operation | binary (ms) | sorted (ms) | |---|---|---|---| | low | index + natural merges | 397 ± 24 | 260 ± 69 | | low | index + forceMerge(1) | 390 ± 20 | 249 ± 33 | | high | index + natural merges | 708 ± 91 | 595 ± 58 | | high | index + forceMerge(1) | 728 ± 130 | 884 ± 90 | At low cardinality the ordinal-based `SortedDocValues` sort wins, as expected. At high cardinality the force-merge step adds ~20 ms for binary versus ~290 ms for sorted (728−708 vs 884−595): the compaction merge is far cheaper for binary because it does not rebuild the global ordinal map. The same shape holds at 10M documents. ## Tests Cover flush/merge order, reverse, missing first/last, empty values, nested document blocks, deletes, multi-field sorts, `addIndexes`, search sorting, a custom comparator, serialization round-trip, index-sort immutability, and a randomized stress test. Search-time sort uses a linear scan (no skipping optimization). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
