[PR] feat: add NumericFieldStats for unified global min/max retrieval [lucene]

via GitHub Mon, 23 Feb 2026 05:30:24 -0800


salvatorecampagna opened a new pull request, #15758:
URL: https://github.com/apache/lucene/pull/15758


   ## Problem
   
   Lucene currently has two independent APIs for retrieving the global min/max 
of a numeric field:
   
   `PointValues.getMinPackedValue()` returns a `byte[]` or `null` when no data 
exists.
   `DocValuesSkipper.globalMinValue()` returns a `long`, but uses sentinel 
values (`Long.MIN_VALUE` / `Long.MAX_VALUE`) when no data exists.
   
   Callers have to figure out which data structure is available, call the right 
API, and deal with two different "no data" conventions. The sentinel approach 
is particularly problematic because `Long.MIN_VALUE` and `Long.MAX_VALUE` are 
also legitimate field values, so there is no way for a caller to distinguish 
"no data" from "the actual min is Long.MIN_VALUE".
   
   This inconsistency might lead to subtle bugs. For example, 
`SortedNumericDocValuesRangeQuery.rewrite()` was using `DocValuesSkipper` 
exclusively. When the skip index was absent (it requires opt-in via 
`DocValuesSkipIndexType.RANGE`), the sentinel values made the rewrite 
optimization a no-op: the range check `lowerValue > Long.MAX_VALUE` is always 
false, so the query could never short-circuit to `MatchNoDocsQuery` even when 
the query range was completely outside the field's actual value range.
   
   ## Solution
   
   This PR introduces `NumericFieldStats`, a utility class in 
`org.apache.lucene.search` that provides a unified API with clean null 
semantics. It exposes two static methods:
   
   `globalMinMax(IndexSearcher, String)` returns a `MinMax` record containing 
the global min and max as `long` values, or `null` when the stats cannot be 
determined.
   
   `globalDocCount(IndexSearcher, String)` returns the total number of 
documents that have a value for the field, or `0`.
   
   Both methods probe `PointValues` first and fall back to `DocValuesSkipper`. 
PointValues goes first because it is always available for standard numeric 
fields (LongField, IntField, etc.), already returns null for "no data" (no 
sentinel ambiguity), and does not require any opt-in configuration. The 
DocValuesSkipper fallback covers doc-values-only fields that were indexed with 
a skip index but without points.
   
   ## Why not fix DocValuesSkipper directly?
   
   An alternative would be to change 
`DocValuesSkipper.globalMinValue()`/`globalMaxValue()` to return `Long` instead 
of `long`, eliminating the sentinel ambiguity at the source. However, the 
per-segment `minValue()`/`maxValue()` methods are called on hot scoring paths 
where boxing to `Long` on every call would add overhead. The static 
`globalMinValue`/`globalMaxValue` methods are also public API, so changing 
their return type would break all existing callers. A higher-level utility that 
abstracts over both data structures avoids these issues while giving callers 
the clean null semantics they need.
   
   ## Packed value decoding
   
   PointValues stores numeric values as big-endian byte arrays with the sign 
bit flipped for sortable byte ordering. IntField produces 4-byte packed values 
and LongField produces 8-byte packed values. The internal `decodeLong` method 
dispatches on the array length to call the appropriate `NumericUtils` decoder 
(`sortableBytesToInt` for 4 bytes, `sortableBytesToLong` for 8 bytes).
   
   The API always returns `long` rather than exposing `byte[]` or providing 
separate int/long variants. This matches how the query layer works internally: 
even IntField queries widen to `long` bounds by the time they reach 
`SortedNumericDocValuesRangeQuery`. When an int field's packed value is 
decoded, the resulting `int` widens to `long` via standard Java sign extension, 
which preserves the original numeric value. Callers that need an `int` back can 
safely use `Math.toIntExact()`, which will never throw for values that 
originated from an IntField.
   
   ## Migration: SortedNumericDocValuesRangeQuery.rewrite()
   
   The `rewrite()` method previously called 
`DocValuesSkipper.globalMinValue/globalMaxValue` directly. It now calls 
`NumericFieldStats.globalMinMax()` instead. This is not just a refactor; it is 
a behavioral improvement. Before, if a field had PointValues but no 
DocValuesSkipper, the rewrite optimization was dead code. Now it works for any 
numeric field that has either data structure available.
   
   ## Tests
   
   14 tests covering both data sources (PointValues and DocValuesSkipper 
paths), edge cases (empty index, nonexistent field, doc values without skip 
index, segments where some have the field and some do not), boundary values 
(Long.MIN_VALUE/Long.MAX_VALUE with LongField, 
Integer.MIN_VALUE/Integer.MAX_VALUE with IntField to verify int-to-long sign 
extension), multi-valued fields, multi-segment indexes, and doc count from both 
PointValues and DocValuesSkipper.
   
   There is also an integration test verifying that 
`SortedNumericDocValuesRangeQuery.rewrite()` now correctly rewrites to 
`MatchNoDocsQuery` and `MatchAllDocsQuery` when only PointValues are available 
(no skip index). This scenario was missing before this change.
   
   `TestNumericFieldStats`:
   ```
   ./gradlew -p lucene/core test --tests 
"org.apache.lucene.search.TestNumericFieldStats"
   ```
   
   `TestSortedNumericDocValuesRangeQuery` (existing):
   ```
   ./gradlew -p lucene/core test --tests 
"org.apache.lucene.document.TestSortedNumericDocValuesRangeQuery"
   ```
   
   Closes #15740
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] feat: add NumericFieldStats for unified global min/max retrieval [lucene]

Reply via email to