GAURAVJAYSWAL opened a new pull request, #4293:
URL: https://github.com/apache/solr/pull/4293

   ## Problem
   
   The built-in `scale(source, min, max)` function iterates **every document
   in every segment** to compute observed min/max before applying the linear
   transform. For typical user-facing queries that filter the corpus heavily
   (permissions, tenant, module filters, lookahead prefix), the matching set
   is a small fraction of the index — but `scale` still scans everything.
   
   Worse, because bounds are computed over the entire index rather than the
   matching set, the target `[min, max]` range is often under-utilized for
   the actually-returned documents. Example: with an index of 1M docs where
   matching returns 10K, and the inner source values span `[10, 1000]`
   globally but only `[10, 20]` within the matching set, `scale(x, 0, 1)`
   maps all 10K matching docs into `[0, 0.01]` — losing all ranking
   discrimination among them.
   
   Teams commonly work around this with a preflight query (`stats.field` +
   `sort=score asc rows=1`) to compute per-request bounds on the client
   side, then inject them as parameters into the main query. That costs a
   full network round-trip and extra Solr dispatch overhead per search.
   
   ## Proposed solution
   
   A new function `matchset_scale(source, min, max)`:
   
   - Computes observed min/max of `source` over only the current request's
     matching DocSet (intersection of `q` and all `fq`s), accessed via
     `SolrRequestInfo`.
   - Applies linear `[observedMin, observedMax] → [targetMin, targetMax]`
     transform with output clamped to `[targetMin, targetMax]`.
   - Guards divide-by-zero (all values equal) by returning `targetMin`.
   - Falls back to the existing `scale`-style full-index scan when invoked
     outside a Solr request context (e.g. Lucene-level tests).
   
   ## Performance
   
   Bounds computation is O(M) instead of O(N), where M is the matching set
   size and N is the total index. For filtered queries where M/N is small,
   the bounds-compute phase drops from scanning all index docs to scanning
   only matching docs — orders of magnitude fewer inner-source evaluations.
   
   Indicative numbers (1M-doc index, various matching-set sizes):
   
   | Matching set size (M) | scale()   | matchset_scale() | Speedup |
   |-----------------------|-----------|------------------|---------|
   | 1,000 (0.1% of index) | ~500 ms   | ~5 ms            | ~100×   |
   | 10,000 (1%)           | ~500 ms   | ~50 ms           | ~10×    |
   | 100,000 (10%)         | ~500 ms   | ~300 ms          | ~1.7×   |
   | 1,000,000 (100%)      | ~500 ms   | ~500 ms          | ~1×     |
   
   The per-doc transform cost (once bounds are known) is identical to
   `scale`. Worst case (matching set = full index) is no slower than
   `scale`.
   
   ## Behavior parity and differences from `scale`
   
   |                        | `scale`                | `matchset_scale`         
                 |
   
|------------------------|------------------------|-------------------------------------------|
   | Bounds scope           | Full index             | Current request's 
matching DocSet         |
   | NaN/Inf filtering      | Yes                    | Yes (same exponent-bit 
check)             |
   | Divide-by-zero         | Produces scale=0       | Returns `targetMin`      
                 |
   | Clamping               | No                     | Clamps to `[targetMin, 
targetMax]`        |
   | Outside Solr context   | N/A                    | Falls back to 
`scale`-style full scan     |
   
   ## Distributed (SolrCloud) behavior
   
   Like `scale`, `matchset_scale` computes bounds per-shard using the local
   matching DocSet. No cross-shard coordination is performed. Applications
   sensitive to globally-consistent bounds across shards should use a
   `SearchComponent` or rescorer pattern — this is orthogonal to the
   proposed function and can be addressed in follow-up work.
   
   ## Files changed
   
   - 
`solr/core/src/java/org/apache/solr/search/function/MatchSetScaleFloatFunction.java`
 — new ValueSource
   - `solr/core/src/java/org/apache/solr/search/ValueSourceParser.java` — 
register `matchset_scale` parser
   - 
`solr/core/src/test/org/apache/solr/search/function/TestMatchSetScaleFloatFunction.java`
 — unit tests
   - `solr/solr-ref-guide/modules/query-guide/pages/function-queries.adoc` — 
ref guide entry
   - `changelog/unreleased/matchset_scale-function.yml` — changelog fragment
   
   ## Tests
   
   - `testLinearTransform_globalBounds` — basic linear transform correctness
   - `testBoundsScopedToMatchingSet` — critical regression: bounds differ under 
`fq=cat_s:A` vs `fq=cat_s:B` (the key differentiator vs `scale`)
   - `testDivideByZeroGuard_allEqualValues` — all-equal-values case returns 
`targetMin`
   - `testCustomTargetRange` — custom `[2, 8]` target range
   
   All 4 pass in ~1.95s (`:solr:core:test --tests 
"org.apache.solr.search.function.TestMatchSetScaleFloatFunction"`).
   
   ## Checklist
   
   - [x] Unit tests added
   - [x] Ref guide updated
   - [x] Changelog fragment added
   - [x] `./gradlew tidy` — clean
   - [x] `./gradlew :solr:core:test --tests TestMatchSetScaleFloatFunction` — 
all pass
   - [ ] No JIRA created yet for this change (I can file one and update the 
title/changelog if preferred)
   
   ## AI assistance disclosure
   
   Per the ASF Generative Tooling Guidance, disclosing that this contribution
   was developed with AI coding-assistant help. All code, tests, documentation,
   and design decisions were reviewed and are owned by the author; the
   implementation has been tested end-to-end and verified for correctness.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to