salvatorecampagna opened a new pull request, #15760:
URL: https://github.com/apache/lucene/pull/15760

   ## Problem
   
   Lucene has two independent APIs for retrieving the global min/max of a 
numeric field: `PointValues.getMinPackedValue()` (returns `byte[]` or `null`) 
and `DocValuesSkipper.globalMinValue()` (returns `long` with sentinel values 
for "no data"). The sentinel approach is problematic because 
`Long.MIN_VALUE`/`Long.MAX_VALUE` are legitimate field values, making "no data" 
indistinguishable from real values.
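
The ambiguity can be reproduced without Lucene. The sketch below is illustrative only: `sentinelMin` and `nullableMin` are hypothetical helpers, not Lucene methods, and the choice of `Long.MAX_VALUE` as the "no data" sentinel is an assumption made for the demonstration.

```java
final class SentinelSketch {
  // Hypothetical sentinel-style API: Long.MAX_VALUE doubles as "no data".
  static long sentinelMin(long[] values) {
    long min = Long.MAX_VALUE;
    for (long v : values) {
      min = Math.min(min, v);
    }
    return min;
  }

  // Hypothetical null-based API: absence is reported explicitly.
  static Long nullableMin(long[] values) {
    if (values.length == 0) {
      return null;
    }
    long min = values[0];
    for (long v : values) {
      min = Math.min(min, v);
    }
    return min;
  }

  public static void main(String[] args) {
    long[] empty = {};
    long[] onlyMax = {Long.MAX_VALUE};
    // Both inputs yield the same long, so the caller cannot tell them apart.
    System.out.println(sentinelMin(empty) == sentinelMin(onlyMax)); // prints "true"
    // The null-based variant keeps the two cases distinct.
    System.out.println(nullableMin(empty));   // prints "null"
    System.out.println(nullableMin(onlyMax)); // prints "9223372036854775807"
  }
}
```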
   
   This caused `SortedNumericDocValuesRangeQuery.rewrite()` to silently skip 
its optimization when only PointValues were available (no skip index), since 
the sentinel-based range check could never trigger.
   
   ## Solution
   
   `NumericFieldStats` is a utility class that retrieves global field 
statistics from index metadata structures (`PointValues`, `DocValuesSkipper`) 
without per-document access. It exposes a single method:
   
   `getStats(IndexReader, String)` returns a `Stats` record `(long min, long 
max, int docCount)`, or `null` when no metadata structure is available.
   
   It probes `PointValues` first, which is always available for standard 
numeric fields and signals "no data" with `null`, then falls back to 
`DocValuesSkipper`, which covers doc-values-only fields that have a skip index.
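
The probe order described above can be sketched as follows. This is not the code from the PR: the two `Supplier` parameters are hypothetical stand-ins for "read stats from `PointValues`" and "read stats from `DocValuesSkipper`", so that the example runs without a Lucene index. Only the `Stats` record shape and the `null` return for "no metadata" are taken from the description.

```java
import java.util.function.Supplier;

final class NumericFieldStatsSketch {
  // Same shape as the record described in the PR text.
  record Stats(long min, long max, int docCount) {}

  // Hypothetical signature: each supplier returns null when its
  // metadata structure is unavailable for the field.
  static Stats getStats(Supplier<Stats> fromPoints, Supplier<Stats> fromSkipper) {
    Stats stats = fromPoints.get();
    if (stats != null) {
      return stats; // points metadata wins when present
    }
    // Fall back to the skip index; this may also be null,
    // in which case the caller sees "no metadata available".
    return fromSkipper.get();
  }
}
```

A caller checks the result for `null` before reading `min()` or `max()`, so a field whose real minimum is `Long.MIN_VALUE` is never confused with a field that has no statistics.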
   
   `SortedNumericDocValuesRangeQuery.rewrite()` now uses this API, fixing the 
optimization for fields with PointValues but no skip index.
   
   ## Packed value decoding
   
   PointValues stores values as big-endian byte arrays with the sign bit 
flipped. IntField produces 4-byte arrays, LongField produces 8-byte arrays. The 
internal `decodeLong` method dispatches on array length to call the right 
`NumericUtils` decoder. The int case widens to long via Java sign extension, 
which preserves the value. The API returns long unconditionally because the 
query layer already works with long bounds internally (even for IntField 
queries). Callers that need int can safely narrow with `Math.toIntExact()`.
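
To make the encoding concrete, here is a minimal standalone sketch of the round trip. It reimplements the big-endian, sign-bit-flipped layout by hand instead of calling `NumericUtils`, so it runs without Lucene on the classpath. The class name and the `encodeInt`/`encodeLong` helpers are hypothetical, and `decodeLong` here only mirrors the dispatch-on-length idea described above, not the PR's actual method body.

```java
final class PackedValueSketch {
  // Flip the sign bit, then write 4 bytes big-endian.
  static byte[] encodeInt(int value) {
    int bits = value ^ 0x80000000;
    byte[] out = new byte[Integer.BYTES];
    for (int i = out.length - 1; i >= 0; i--) {
      out[i] = (byte) bits;
      bits >>>= 8;
    }
    return out;
  }

  // Flip the sign bit, then write 8 bytes big-endian.
  static byte[] encodeLong(long value) {
    long bits = value ^ 0x8000000000000000L;
    byte[] out = new byte[Long.BYTES];
    for (int i = out.length - 1; i >= 0; i--) {
      out[i] = (byte) bits;
      bits >>>= 8;
    }
    return out;
  }

  // Dispatch on array length; the 4-byte case widens int to long.
  static long decodeLong(byte[] packed) {
    switch (packed.length) {
      case Integer.BYTES: {
        int bits = 0;
        for (byte b : packed) {
          bits = (bits << 8) | (b & 0xFF);
        }
        return bits ^ 0x80000000; // int result, sign-extended to long
      }
      case Long.BYTES: {
        long bits = 0;
        for (byte b : packed) {
          bits = (bits << 8) | (b & 0xFFL);
        }
        return bits ^ 0x8000000000000000L;
      }
      default:
        throw new IllegalArgumentException("unsupported length: " + packed.length);
    }
  }
}
```

For example, `Math.toIntExact(PackedValueSketch.decodeLong(PackedValueSketch.encodeInt(-1)))` yields `-1`, which shows that the widening step preserves negative int values.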
   
   ## Why not fix DocValuesSkipper directly?
   
   Changing `DocValuesSkipper.globalMinValue()`/`globalMaxValue()` to return 
`Long` instead of `long` would fix the sentinel ambiguity at the source. But 
the per-segment `minValue()`/`maxValue()` methods are called on hot scoring 
paths where boxing would add overhead, and the static global methods are public 
API so changing their return type would break existing callers. A higher-level 
utility avoids both issues.
   
   ## Tests
   
   Multiple tests in `TestNumericFieldStats` cover both PointValues and 
DocValuesSkipper data sources, edge cases (empty index, nonexistent field, doc 
values without skip index, mixed segments), boundary values, multi-valued 
fields, and doc count.
   
   An integration test in `TestDocValuesQueries` verifies that `rewrite()` now 
correctly produces `MatchNoDocsQuery`/`MatchAllDocsQuery` when only PointValues 
are available.
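
The kind of decision that `rewrite()` can make from these statistics is sketched below. This is a simplified model, not the PR's code: the `Decision` enum and `decide` method are hypothetical, and the requirement that `docCount` equal `maxDoc` before rewriting to match-all is an assumption (a document without a value must not match a range query).

```java
final class RewriteSketch {
  record Stats(long min, long max, int docCount) {}

  enum Decision { MATCH_NONE, MATCH_ALL, KEEP_QUERY }

  // Hypothetical helper: decide how a range query [lower, upper] could be rewritten.
  static Decision decide(Stats stats, long lower, long upper, int maxDoc) {
    if (stats == null) {
      return Decision.KEEP_QUERY; // no metadata, nothing can be concluded
    }
    if (lower > stats.max() || upper < stats.min()) {
      return Decision.MATCH_NONE; // query range and value range are disjoint
    }
    // Assumption: match-all is only safe when every document has a value.
    if (lower <= stats.min() && upper >= stats.max() && stats.docCount() == maxDoc) {
      return Decision.MATCH_ALL;
    }
    return Decision.KEEP_QUERY;
  }
}
```

Because "no data" is represented by `null` rather than by a sentinel, the disjointness check can fire even when the field's real bounds are `Long.MIN_VALUE` or `Long.MAX_VALUE`.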
   
   ```
   ./gradlew -p lucene/core test --tests "org.apache.lucene.search.TestNumericFieldStats"
   ./gradlew -p lucene/core test --tests "org.apache.lucene.search.TestDocValuesQueries"
   ```
   
   Closes #15740
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

