drempapis opened a new pull request, #16031:
URL: https://github.com/apache/lucene/pull/16031
### The need
`AutomatonQuery` and its subclasses (`RegexpQuery`, `WildcardQuery`,
`PrefixQuery`, `TermRangeQuery`) build a `CompiledAutomaton` eagerly in their
constructor and retain it as a field. Because `AutomatonQuery` implements
`Accountable`, callers can perform request-scoped memory accounting by reading
`Accountable#ramBytesUsed()` at construction time.
`FuzzyQuery`, by contrast, constructs its Levenshtein automata lazily inside
`FuzzyTermsEnum`, storing them on the search-scoped `AttributeSource` so they
can be shared across segments during a single rewrite. As a result, there is
currently no public way to ask a `FuzzyQuery` how much RAM its automata will
cost without actually executing it: `FuzzyQuery` does not implement
`Accountable`, and the `AutomatonAttribute` mechanism inside `FuzzyTermsEnum`
is `private`.
### Changes
This PR introduces two additions and one visibility relaxation, with no
behavioural changes:
1. `FuzzyQuery` now `implements Accountable`. Its `ramBytesUsed()` returns a
stable value: `shallowSizeOfInstance(FuzzyQuery.class) + term.ramBytesUsed()`.
It excludes the Levenshtein automata because the query does not retain them;
they live per-search on an `AttributeSource`. Folding them into `ramBytesUsed()`
would make the value jump from 0 to N after the first execution and inflate
query-cache accounting with memory the query does not own (see LUCENE-9350).
2. A new public method `FuzzyQuery#computeAutomataRamBytes(AttributeSource
atts)` returns the aggregate RAM cost of the `CompiledAutomaton[]` used to
execute the query, building them on the supplied `AttributeSource` via the same
sharing mechanism `FuzzyTermsEnum` already uses across segments. It returns `0L`
when `maxEdits == 0`. Calling `getTermsEnum(terms, atts)` afterwards with the
same `AttributeSource` reuses the primed automata instead of rebuilding them.
3. `FuzzyTermsEnum.AutomatonAttribute` and
`FuzzyTermsEnum.AutomatonAttributeImpl` are widened from `private` to
package-private so `FuzzyQuery` can install/read the same attribute type. They
remain non-public Lucene internals.
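
The split described above, a stable `ramBytesUsed()` on the query and automata that live on a search-scoped attribute holder, can be illustrated with a small standalone model. This is only a sketch and does not use Lucene classes: `AttrSource`, `AutomataHolder`, `ToyFuzzyQuery`, `buildCount()`, and all byte sizes are hypothetical stand-ins invented for illustration, and the one-automaton-per-edit-distance layout is an assumption of the model.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for a search-scoped AttributeSource: one attribute instance per type.
final class AttrSource {
  private final Map<Class<?>, Object> attrs = new HashMap<>();

  <T> T getOrCreate(Class<T> type, Supplier<T> factory) {
    return type.cast(attrs.computeIfAbsent(type, k -> factory.get()));
  }
}

// Hypothetical stand-in for the automaton attribute: holds the automata once they are built.
final class AutomataHolder {
  long[] automataBytes; // null until primed; each entry models one automaton's RAM cost
}

// Simplified model of the query; not the Lucene class.
final class ToyFuzzyQuery {
  private final String term;
  private final int maxEdits;
  private int builds = 0; // how many times automata were actually built

  ToyFuzzyQuery(String term, int maxEdits) {
    this.term = term;
    this.maxEdits = maxEdits;
  }

  // Stable: depends only on what the query object itself retains (made-up sizes).
  long ramBytesUsed() {
    return 32L + 2L * term.length();
  }

  // Builds the automata on the attribute holder if absent, then sums their sizes.
  long computeAutomataRamBytes(AttrSource atts) {
    if (maxEdits == 0) {
      return 0L; // exact match: no automata needed
    }
    AutomataHolder holder = atts.getOrCreate(AutomataHolder.class, AutomataHolder::new);
    if (holder.automataBytes == null) { // already primed: skip the build
      builds++;
      holder.automataBytes = new long[maxEdits + 1];
      for (int d = 0; d <= maxEdits; d++) {
        holder.automataBytes[d] = 1024L * (d + 1) * term.length(); // invented cost model
      }
    }
    long total = 0L;
    for (long bytes : holder.automataBytes) {
      total += bytes;
    }
    return total;
  }

  int buildCount() {
    return builds;
  }
}

class SharingDemo {
  public static void main(String[] args) {
    AttrSource atts = new AttrSource();
    ToyFuzzyQuery query = new ToyFuzzyQuery("lucene", 2);
    System.out.println("query bytes before: " + query.ramBytesUsed());
    System.out.println("automata bytes:     " + query.computeAutomataRamBytes(atts));
    System.out.println("automata bytes:     " + query.computeAutomataRamBytes(atts));
    System.out.println("query bytes after:  " + query.ramBytesUsed());
    System.out.println("builds:             " + query.buildCount());
  }
}
```

In this model the query's own size is identical before and after the automata are built, and a second call on the same holder only re-sums the existing array.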
### How clients can use it
Clients who run `FuzzyQuery` and need to bound or report memory now have two
complementary, non-disruptive primitives:
- `ramBytesUsed()`: cheap, stable, query-identity-preserving accounting of
the query object itself. Safe to fold into existing `Accountable` walks.
- `computeAutomataRamBytes(atts)`: a pre-flight handle on the dominant cost
(the `CompiledAutomaton[]` transition tables). Two usage patterns:
1. **Pre-flight, reuse path.** Pass an `AttributeSource` into
`computeAutomataRamBytes`, account/charge the returned bytes, then pass the
*same* `AttributeSource` into `getTermsEnum`. The automata are built once and
reused across all segments, with no duplicate work.
2. **In-flight observation.** Subclass `FuzzyQuery` and override
`getTermsEnum(Terms, AttributeSource)`; after `super.getTermsEnum` returns,
call `computeAutomataRamBytes(atts)` on the same `atts`. `init` is idempotent
on a primed attribute, so this only walks the already-built array (no second
build). This lets callers attach accounting to query execution without
changing the Lucene execution path.
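
The two usage patterns can be sketched with a standalone model. As before, this is not Lucene code: `SearchAttrs`, `SketchFuzzyQuery`, `AccountingFuzzyQuery`, `UsagePatterns.runWithBudget`, the string-valued `getTermsEnum`, and the cost numbers are all hypothetical names and values chosen for illustration. The sketch only mirrors the control flow described above, under the assumption that priming is idempotent.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a search-scoped AttributeSource holding the shared automata.
class SearchAttrs {
  long[] automataBytes; // null until the automata have been built
}

// Simplified model of the query; not the Lucene class.
class SketchFuzzyQuery {
  private final String term;
  private final int maxEdits;
  int builds = 0; // how many times automata were actually built

  SketchFuzzyQuery(String term, int maxEdits) {
    this.term = term;
    this.maxEdits = maxEdits;
  }

  // Idempotent: builds only when the attribute has not been primed yet.
  private void init(SearchAttrs atts) {
    if (atts.automataBytes != null) {
      return;
    }
    builds++;
    atts.automataBytes = new long[maxEdits + 1];
    for (int d = 0; d <= maxEdits; d++) {
      atts.automataBytes[d] = 512L * (d + 1) * term.length(); // invented cost model
    }
  }

  long computeAutomataRamBytes(SearchAttrs atts) {
    if (maxEdits == 0) {
      return 0L;
    }
    init(atts);
    long total = 0L;
    for (long bytes : atts.automataBytes) {
      total += bytes;
    }
    return total;
  }

  // Models per-segment enumeration; returns a label instead of a real TermsEnum.
  String getTermsEnum(String segment, SearchAttrs atts) {
    if (maxEdits > 0) {
      init(atts);
    }
    return "enum(" + term + "@" + segment + ")";
  }
}

// Pattern 2: observe the cost during execution by overriding getTermsEnum.
class AccountingFuzzyQuery extends SketchFuzzyQuery {
  final List<Long> observed = new ArrayList<>();

  AccountingFuzzyQuery(String term, int maxEdits) {
    super(term, maxEdits);
  }

  @Override
  String getTermsEnum(String segment, SearchAttrs atts) {
    String termsEnum = super.getTermsEnum(segment, atts);
    observed.add(computeAutomataRamBytes(atts)); // only sums the already-built array
    return termsEnum;
  }
}

class UsagePatterns {
  // Pattern 1: price the automata first, then reuse the same attributes for every segment.
  static boolean runWithBudget(SketchFuzzyQuery query, long budgetBytes, String... segments) {
    SearchAttrs atts = new SearchAttrs();
    long cost = query.computeAutomataRamBytes(atts);
    if (cost > budgetBytes) {
      return false; // reject before enumerating any segment
    }
    for (String segment : segments) {
      query.getTermsEnum(segment, atts); // reuses the primed automata
    }
    return true;
  }

  public static void main(String[] args) {
    SketchFuzzyQuery preflight = new SketchFuzzyQuery("lucene", 2);
    System.out.println("accepted: " + runWithBudget(preflight, 1_000_000L, "s0", "s1", "s2"));
    System.out.println("builds:   " + preflight.builds);

    AccountingFuzzyQuery inflight = new AccountingFuzzyQuery("lucene", 2);
    SearchAttrs atts = new SearchAttrs();
    inflight.getTermsEnum("s0", atts);
    inflight.getTermsEnum("s1", atts);
    System.out.println("observed: " + inflight.observed + ", builds: " + inflight.builds);
  }
}
```

In both patterns the model builds the automata exactly once per attribute holder. The pre-flight variant can refuse a query before any segment is touched, while the subclass variant leaves the execution path unchanged and only records what was built.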