[PR] Add BinarySortField: index sorting on BinaryDocValues [lucene]

via GitHub Fri, 05 Jun 2026 03:25:32 -0700


jimczi opened a new pull request, #16202:
URL: https://github.com/apache/lucene/pull/16202


   Adds `BinarySortField`, letting a single-valued `BinaryDocValues` field be 
used as an index sort (and search sort). Values are compared directly (default 
unsigned byte order, custom `Comparator<BytesRef>` supported) instead of 
through a term dictionary and ordinal map, so it suits high-cardinality fields 
where a dictionary is pure overhead. Because nothing is dictionary-encoded or 
compressed, it is expected to be faster at high cardinality, most visibly on 
the compaction (force) merge, which avoids rebuilding the global ordinal map.
   
   Implementation notes:
   - `IndexSorter` gains a `ComparableValues` abstraction for the cross-segment 
order so it no longer has to be a `long`; the default adapts the existing 
`ComparableProvider` with no behaviour change, leaving numeric and string sorts 
untouched.
   - `BinarySorter` reads each segment strictly sequentially (`nextDoc` only) 
during merge — efficient even for block-compressed binary formats — and 
compares slices in place at flush.
   - Fixes `BinaryDocValuesWriter` to freeze its `PagedBytes` lazily, since 
index sorting reads the in-RAM values before `flush()`.
   
   The change is additive and non-breaking: no existing method signature 
changes (`getComparableProviders` becomes a `default`, which is binary- and 
source-compatible), the index-sort serialization format is unchanged, and the 
index sort remains immutable (a binary sort can only be set on a new index).
   
   ## Benchmark
   
   `IndexSortDocValuesBenchmark` (JMH, `SingleShotTime`) builds a sorted index 
of 1M documents with 16-byte values, using the default flush/merge 
configuration, comparing `BinarySortField` (`BinaryDocValues`) against a 
`STRING` sort (`SortedDocValues`). `indexWithNaturalMerges` indexes everything 
and waits for the naturally scheduled merges; `indexThenForceMerge` 
additionally force-merges to a single segment.
   
   | cardinality | operation | binary (ms) | sorted (ms) |
   |---|---|---|---|
   | low | index + natural merges | 397 ± 24 | 260 ± 69 |
   | low | index + forceMerge(1) | 390 ± 20 | 249 ± 33 |
   | high | index + natural merges | 708 ± 91 | 595 ± 58 |
   | high | index + forceMerge(1) | 728 ± 130 | 884 ± 90 |
   
   At low cardinality the ordinal-based `SortedDocValues` sort wins, as 
expected. At high cardinality the force-merge step adds ~20 ms for binary 
versus ~290 ms for sorted (728−708 vs 884−595): the compaction merge is far 
cheaper for binary because it does not rebuild the global ordinal map. The same 
shape holds at 10M documents.
   
   ## Tests
   
   Cover flush/merge order, reverse, missing first/last, empty values, nested 
document blocks, deletes, multi-field sorts, `addIndexes`, search sorting, a 
custom comparator, serialization round-trip, index-sort immutability, and a 
randomized stress test. Search-time sort uses a linear scan (no skipping 
optimization).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] Add BinarySortField: index sorting on BinaryDocValues [lucene]

Reply via email to