salvatorecampagna opened a new pull request, #15760: URL: https://github.com/apache/lucene/pull/15760
## Problem

Lucene has two independent APIs for retrieving the global min/max of a numeric field: `PointValues.getMinPackedValue()` (returns `byte[]` or `null`) and `DocValuesSkipper.globalMinValue()` (returns `long` with sentinel values for "no data"). The sentinel approach is problematic because `Long.MIN_VALUE`/`Long.MAX_VALUE` are legitimate field values, making "no data" indistinguishable from real values. This caused `SortedNumericDocValuesRangeQuery.rewrite()` to silently skip its optimization when only PointValues were available (no skip index), since the sentinel-based range check could never trigger.

## Solution

`NumericFieldStats` is a utility class that retrieves global field statistics from index metadata structures (`PointValues`, `DocValuesSkipper`) without per-document access. It exposes a single method: `getStats(IndexReader, String)` returns a `Stats` record `(long min, long max, int docCount)`, or `null` when no metadata structure is available. It probes `PointValues` first (always available for standard numeric fields, null-based "no data"), then falls back to `DocValuesSkipper` (covers doc-values-only fields with skip index).

`SortedNumericDocValuesRangeQuery.rewrite()` now uses this API, fixing the optimization for fields with PointValues but no skip index.

## Packed value decoding

PointValues stores values as big-endian byte arrays with the sign bit flipped. IntField produces 4-byte arrays, LongField produces 8-byte arrays. The internal `decodeLong` method dispatches on array length to call the right `NumericUtils` decoder. The int case widens to long via Java sign extension, which preserves the value.

The API returns long unconditionally because the query layer already works with long bounds internally (even for IntField queries). Callers that need int can safely narrow with `Math.toIntExact()`.

## Why not fix DocValuesSkipper directly?
Changing `DocValuesSkipper.globalMinValue()`/`globalMaxValue()` to return `Long` instead of `long` would fix the sentinel ambiguity at the source. But the per-segment `minValue()`/`maxValue()` methods are called on hot scoring paths where boxing would add overhead, and the static global methods are public API so changing their return type would break existing callers. A higher-level utility avoids both issues.

## Tests

Multiple tests in `TestNumericFieldStats` cover both PointValues and DocValuesSkipper data sources, edge cases (empty index, nonexistent field, doc values without skip index, mixed segments), boundary values, multi-valued fields, and doc count. An integration test in `TestDocValuesQueries` verifies that `rewrite()` now correctly produces `MatchNoDocsQuery`/`MatchAllDocsQuery` when only PointValues are available.

```
./gradlew -p lucene/core test --tests "org.apache.lucene.search.TestNumericFieldStats"
./gradlew -p lucene/core test --tests "org.apache.lucene.search.TestDocValuesQueries"
```

Closes #15740
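For readers who want to see the packed value decoding concretely, here is a minimal standalone sketch of the length-based dispatch. It is not the code from this PR: the class name `PackedValueDecodeSketch` and the `encodeInt`/`encodeLong` helpers are hypothetical, and the byte layout is re-implemented with `java.nio.ByteBuffer` on the assumption that it matches what `NumericUtils.sortableBytesToInt` and `NumericUtils.sortableBytesToLong` decode (big-endian, sign bit flipped).

```java
import java.nio.ByteBuffer;

// Illustrative only: mimics the packed layout described above without Lucene.
final class PackedValueDecodeSketch {

    // Hypothetical helper: 4-byte big-endian encoding with the sign bit flipped.
    static byte[] encodeInt(int value) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(value ^ 0x80000000).array();
    }

    // Hypothetical helper: 8-byte big-endian encoding with the sign bit flipped.
    static byte[] encodeLong(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value ^ 0x8000000000000000L).array();
    }

    // Dispatch on array length; the int branch widens to long by sign extension.
    static long decodeLong(byte[] packed) {
        if (packed.length == Integer.BYTES) {
            int decoded = ByteBuffer.wrap(packed).getInt() ^ 0x80000000;
            return decoded; // implicit int -> long widening preserves the value
        }
        if (packed.length == Long.BYTES) {
            return ByteBuffer.wrap(packed).getLong() ^ 0x8000000000000000L;
        }
        throw new IllegalArgumentException("Unsupported packed length: " + packed.length);
    }

    public static void main(String[] args) {
        long fromInt = decodeLong(encodeInt(-5));
        long fromLong = decodeLong(encodeLong(Long.MIN_VALUE));
        System.out.println(fromInt + " " + fromLong);
        // A caller that knows the field is an IntField can narrow safely.
        int narrowed = Math.toIntExact(fromInt);
        System.out.println(narrowed);
    }
}
```

Because the 8-byte branch returns the decoded value directly, `Long.MIN_VALUE` round-trips as an ordinary value. In this sketch "no data" would have to be signalled separately (for example by a `null` array), which is the distinction the sentinel-based API cannot make.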
