salvatorecampagna opened a new pull request, #15758: URL: https://github.com/apache/lucene/pull/15758
## Problem Lucene currently has two independent APIs for retrieving the global min/max of a numeric field: `PointValues.getMinPackedValue()` returns a `byte[]` or `null` when no data exists. `DocValuesSkipper.globalMinValue()` returns a `long`, but uses sentinel values (`Long.MIN_VALUE` / `Long.MAX_VALUE`) when no data exists. Callers have to figure out which data structure is available, call the right API, and deal with two different "no data" conventions. The sentinel approach is particularly problematic because `Long.MIN_VALUE` and `Long.MAX_VALUE` are also legitimate field values, so there is no way for a caller to distinguish "no data" from "the actual min is Long.MIN_VALUE". This inconsistency might lead to subtle bugs. For example, `SortedNumericDocValuesRangeQuery.rewrite()` was using `DocValuesSkipper` exclusively. When the skip index was absent (it requires opt-in via `DocValuesSkipIndexType.RANGE`), the sentinel values made the rewrite optimization a no-op: the range check `lowerValue > Long.MAX_VALUE` is always false, so the query could never short-circuit to `MatchNoDocsQuery` even when the query range was completely outside the field's actual value range. ## Solution This PR introduces `NumericFieldStats`, a utility class in `org.apache.lucene.search` that provides a unified API with clean null semantics. It exposes two static methods: `globalMinMax(IndexSearcher, String)` returns a `MinMax` record containing the global min and max as `long` values, or `null` when the stats cannot be determined. `globalDocCount(IndexSearcher, String)` returns the total number of documents that have a value for the field, or `0`. Both methods probe `PointValues` first and fall back to `DocValuesSkipper`. PointValues goes first because it is always available for standard numeric fields (LongField, IntField, etc.), already returns null for "no data" (no sentinel ambiguity), and does not require any opt-in configuration. The DocValuesSkipper fallback covers doc-values-only fields that were indexed with a skip index but without points. ## Why not fix DocValuesSkipper directly? An alternative would be to change `DocValuesSkipper.globalMinValue()`/`globalMaxValue()` to return `Long` instead of `long`, eliminating the sentinel ambiguity at the source. However, the per-segment `minValue()`/`maxValue()` methods are called on hot scoring paths where boxing to `Long` on every call would add overhead. The static `globalMinValue`/`globalMaxValue` methods are also public API, so changing their return type would break all existing callers. A higher-level utility that abstracts over both data structures avoids these issues while giving callers the clean null semantics they need. ## Packed value decoding PointValues stores numeric values as big-endian byte arrays with the sign bit flipped for sortable byte ordering. IntField produces 4-byte packed values and LongField produces 8-byte packed values. The internal `decodeLong` method dispatches on the array length to call the appropriate `NumericUtils` decoder (`sortableBytesToInt` for 4 bytes, `sortableBytesToLong` for 8 bytes). The API always returns `long` rather than exposing `byte[]` or providing separate int/long variants. This matches how the query layer works internally: even IntField queries widen to `long` bounds by the time they reach `SortedNumericDocValuesRangeQuery`. When an int field's packed value is decoded, the resulting `int` widens to `long` via standard Java sign extension, which preserves the original numeric value. Callers that need an `int` back can safely use `Math.toIntExact()`, which will never throw for values that originated from an IntField. ## Migration: SortedNumericDocValuesRangeQuery.rewrite() The `rewrite()` method previously called `DocValuesSkipper.globalMinValue/globalMaxValue` directly. It now calls `NumericFieldStats.globalMinMax()` instead. This is not just a refactor; it is a behavioral improvement. Before, if a field had PointValues but no DocValuesSkipper, the rewrite optimization was dead code. Now it works for any numeric field that has either data structure available. ## Tests 14 tests covering both data sources (PointValues and DocValuesSkipper paths), edge cases (empty index, nonexistent field, doc values without skip index, segments where some have the field and some do not), boundary values (Long.MIN_VALUE/Long.MAX_VALUE with LongField, Integer.MIN_VALUE/Integer.MAX_VALUE with IntField to verify int-to-long sign extension), multi-valued fields, multi-segment indexes, and doc count from both PointValues and DocValuesSkipper. There is also an integration test verifying that `SortedNumericDocValuesRangeQuery.rewrite()` now correctly rewrites to `MatchNoDocsQuery` and `MatchAllDocsQuery` when only PointValues are available (no skip index). This scenario was missing before this change. `TestNumericFieldStats`: ``` ./gradlew -p lucene/core test --tests "org.apache.lucene.search.TestNumericFieldStats" ``` `TestSortedNumericDocValuesRangeQuery` (existing): ``` ./gradlew -p lucene/core test --tests "org.apache.lucene.document.TestSortedNumericDocValuesRangeQuery" ``` Closes #15740 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
