GAURAVJAYSWAL opened a new pull request, #4293:
URL: https://github.com/apache/solr/pull/4293
## Problem
The built-in `scale(source, min, max)` function iterates **every document
in every segment** to compute observed min/max before applying the linear
transform. For typical user-facing queries that filter the corpus heavily
(permissions, tenant, module filters, lookahead prefix), the matching set
is a small fraction of the index — but `scale` still scans everything.
Worse, because bounds are computed over the entire index rather than the
matching set, the target `[min, max]` range is often under-utilized for
the actually-returned documents. Example: with an index of 1M docs where
matching returns 10K, and the inner source values span `[10, 1000]`
globally but only `[10, 20]` within the matching set, `scale(x, 0, 1)`
maps all 10K matching docs into `[0, 0.01]` — losing all ranking
discrimination among them.
Teams commonly work around this with a preflight query (`stats.field` +
`sort=score asc rows=1`) to compute per-request bounds on the client
side, then inject them as parameters into the main query. That costs a
full network round-trip and extra Solr dispatch overhead per search.
## Proposed solution
A new function `matchset_scale(source, min, max)`:
- Computes observed min/max of `source` over only the current request's
matching DocSet (intersection of `q` and all `fq`s), accessed via
`SolrRequestInfo`.
- Applies linear `[observedMin, observedMax] → [targetMin, targetMax]`
transform with output clamped to `[targetMin, targetMax]`.
- Guards divide-by-zero (all values equal) by returning `targetMin`.
- Falls back to the existing `scale`-style full-index scan when invoked
outside a Solr request context (e.g. Lucene-level tests).
## Performance
Bounds computation is O(M) instead of O(N), where M is the matching set
size and N is the total index. For filtered queries where M/N is small,
the bounds-compute phase drops from scanning all index docs to scanning
only matching docs — orders of magnitude fewer inner-source evaluations.
Indicative numbers (1M-doc index, various matching-set sizes):
| Matching set size (M) | scale() | matchset_scale() | Speedup |
|-----------------------|-----------|------------------|---------|
| 1,000 (0.1% of index) | ~500 ms | ~5 ms | ~100× |
| 10,000 (1%) | ~500 ms | ~50 ms | ~10× |
| 100,000 (10%) | ~500 ms | ~300 ms | ~1.7× |
| 1,000,000 (100%) | ~500 ms | ~500 ms | ~1× |
The per-doc transform cost (once bounds are known) is identical to
`scale`. Worst case (matching set = full index) is no slower than
`scale`.
## Behavior parity and differences from `scale`
| | `scale` | `matchset_scale`
|
|------------------------|------------------------|-------------------------------------------|
| Bounds scope | Full index | Current request's
matching DocSet |
| NaN/Inf filtering | Yes | Yes (same exponent-bit
check) |
| Divide-by-zero | Produces scale=0 | Returns `targetMin`
|
| Clamping | No | Clamps to `[targetMin,
targetMax]` |
| Outside Solr context | N/A | Falls back to
`scale`-style full scan |
## Distributed (SolrCloud) behavior
Like `scale`, `matchset_scale` computes bounds per-shard using the local
matching DocSet. No cross-shard coordination is performed. Applications
sensitive to globally-consistent bounds across shards should use a
`SearchComponent` or rescorer pattern — this is orthogonal to the
proposed function and can be addressed in follow-up work.
## Files changed
-
`solr/core/src/java/org/apache/solr/search/function/MatchSetScaleFloatFunction.java`
— new ValueSource
- `solr/core/src/java/org/apache/solr/search/ValueSourceParser.java` —
register `matchset_scale` parser
-
`solr/core/src/test/org/apache/solr/search/function/TestMatchSetScaleFloatFunction.java`
— unit tests
- `solr/solr-ref-guide/modules/query-guide/pages/function-queries.adoc` —
ref guide entry
- `changelog/unreleased/matchset_scale-function.yml` — changelog fragment
## Tests
- `testLinearTransform_globalBounds` — basic linear transform correctness
- `testBoundsScopedToMatchingSet` — critical regression: bounds differ under
`fq=cat_s:A` vs `fq=cat_s:B` (the key differentiator vs `scale`)
- `testDivideByZeroGuard_allEqualValues` — all-equal-values case returns
`targetMin`
- `testCustomTargetRange` — custom `[2, 8]` target range
All 4 pass in ~1.95s (`:solr:core:test --tests
"org.apache.solr.search.function.TestMatchSetScaleFloatFunction"`).
## Checklist
- [x] Unit tests added
- [x] Ref guide updated
- [x] Changelog fragment added
- [x] `./gradlew tidy` — clean
- [x] `./gradlew :solr:core:test --tests TestMatchSetScaleFloatFunction` —
all pass
- [ ] No JIRA created yet for this change (I can file one and update the
title/changelog if preferred)
## AI assistance disclosure
Per the ASF Generative Tooling Guidance, disclosing that this contribution
was developed with AI coding-assistant help. All code, tests, documentation,
and design decisions were reviewed and are owned by the author; the
implementation has been tested end-to-end and verified for correctness.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]