drempapis opened a new pull request, #16031:
URL: https://github.com/apache/lucene/pull/16031

   
   ### The need
   `AutomatonQuery` and its subclasses (`RegexpQuery`, `WildcardQuery`, 
`PrefixQuery`, `TermRangeQuery`) build a `CompiledAutomaton` eagerly in their 
constructor and retain it as a field. Because `AutomatonQuery` implements 
`Accountable`, callers can perform request-scoped memory accounting by reading 
`Accountable#ramBytesUsed()` at construction time.
   
   `FuzzyQuery`, by contrast, constructs the Levenshtein automata lazily inside 
`FuzzyTermsEnum`, storing them on the search-scoped `AttributeSource` so they 
can be shared across segments during a single rewrite. As a result there is 
currently no public way to ask a `FuzzyQuery` how much RAM its automata will 
cost without actually executing it; `FuzzyQuery` does not implement 
`Accountable`, and the `AutomatonAttribute` mechanism inside `FuzzyTermsEnum` 
is `private`. 
   
   ### Changes
   This PR introduces two additions and one visibility relaxation, with no 
behavioural changes:
   
   1. `FuzzyQuery` now `implements Accountable`. `ramBytesUsed()` returns a 
stable value: `shallowSizeOfInstance(FuzzyQuery.class) + term.ramBytesUsed()`. 
It excludes the Levenshtein automata because the query does not retain them; 
they live per-search on an `AttributeSource`. Folding them into `ramBytesUsed()` 
would make the value jump from 0 to N after the first execution and inflate 
query-cache accounting with memory the query does not own (see LUCENE-9350).
   2. A new public method `FuzzyQuery#computeAutomataRamBytes(AttributeSource 
atts)` returns the aggregate RAM cost of the `CompiledAutomaton[]` used to 
execute the query, building them on the supplied `AttributeSource` via the same 
sharing mechanism `FuzzyTermsEnum` already uses across segments. It returns `0L` 
when `maxEdits == 0`. Calling `getTermsEnum(terms, atts)` afterwards with the 
same `AttributeSource` reuses the primed automata instead of rebuilding them.
   3. `FuzzyTermsEnum.AutomatonAttribute` and 
`FuzzyTermsEnum.AutomatonAttributeImpl` are widened from `private` to 
package-private so `FuzzyQuery` can install and read the same attribute type. 
They remain non-public Lucene internals.
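
   To make the split between the two numbers concrete, here is a minimal, 
self-contained sketch. It does not use Lucene: `SketchAttributeSource`, 
`SketchAutomatonAttribute` and `SketchFuzzyQuery` are hypothetical stand-ins, 
and every byte figure (including one automaton per edit distance) is a made-up 
assumption. It only illustrates the intended contract: `ramBytesUsed()` stays 
constant, while the automata cost is computed separately against a per-search 
attribute bag and is built at most once per bag.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for Lucene's AttributeSource: a per-search bag of shared state.
final class SketchAttributeSource {
  private final Map<Class<?>, Object> attributes = new HashMap<>();

  // Returns the existing attribute of this type, or creates and stores one.
  <T> T addAttribute(Class<T> type, Supplier<T> factory) {
    return type.cast(attributes.computeIfAbsent(type, k -> factory.get()));
  }
}

// Hypothetical stand-in for the automaton attribute shared across segments.
final class SketchAutomatonAttribute {
  long[] automataBytes; // null until primed
  int builds;           // counts real builds, to show that init is idempotent

  void init(int maxEdits, int termLength) {
    if (automataBytes != null) {
      return; // already primed: nothing is rebuilt
    }
    automataBytes = new long[maxEdits + 1]; // assumption: one automaton per edit distance
    for (int e = 0; e <= maxEdits; e++) {
      automataBytes[e] = 64L * termLength * (e + 1); // made-up cost model
    }
    builds++;
  }
}

// Hypothetical stand-in for FuzzyQuery; not the real implementation.
final class SketchFuzzyQuery {
  private static final long SHALLOW_BYTES = 40L; // made-up shallow instance size
  final String term;
  final int maxEdits;

  SketchFuzzyQuery(String term, int maxEdits) {
    this.term = term;
    this.maxEdits = maxEdits;
  }

  // Stable: depends only on the query object and its term, never on per-search state.
  long ramBytesUsed() {
    return SHALLOW_BYTES + 2L * term.length();
  }

  // Primes the automata on the supplied attribute bag and sums their sizes.
  long computeAutomataRamBytes(SketchAttributeSource atts) {
    if (maxEdits == 0) {
      return 0L; // exact match needs no Levenshtein automata
    }
    SketchAutomatonAttribute att =
        atts.addAttribute(SketchAutomatonAttribute.class, SketchAutomatonAttribute::new);
    att.init(maxEdits, term.length());
    long total = 0L;
    for (long bytes : att.automataBytes) {
      total += bytes;
    }
    return total;
  }
}
```

   In this model, calling `computeAutomataRamBytes` twice with the same bag 
returns the same total and builds the array only once, and `ramBytesUsed()` is 
unchanged before and after.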
   
   ### How clients can use it
   Clients who run `FuzzyQuery` and need to bound or report memory now have two 
complementary, non-disruptive primitives:
   - `ramBytesUsed()`: cheap, stable, query-identity-preserving accounting of 
the query object itself. Safe to fold into existing `Accountable` walks.
   - `computeAutomataRamBytes(atts)`: a pre-flight handle on the dominant cost 
(the `CompiledAutomaton[]` transition tables). Two usage patterns:
     1. **Pre-flight, reuse path.** Pass an `AttributeSource` into 
`computeAutomataRamBytes`, account for (or charge) the returned bytes, then pass 
the *same* `AttributeSource` into `getTermsEnum`. The automata are built once 
and reused across all segments, with no duplicate work.
     2. **In-flight observation.** Subclass `FuzzyQuery` and override 
`getTermsEnum(Terms, AttributeSource)`; after `super.getTermsEnum` returns, 
call `computeAutomataRamBytes(atts)` on the same `atts`. `init` is idempotent 
on a primed attribute, so this only walks the already-built array (no second 
build). This lets callers attach accounting to query execution without 
changing the Lucene execution path.
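
   The two usage patterns above can be sketched as follows. This is a model, 
not Lucene code: `ModelFuzzyQuery`, `ObservingFuzzyQuery` and 
`FuzzyAccountingDemo` are hypothetical names, a plain `Map` stands in for 
`AttributeSource`, segments are plain strings, and the 1024-byte cost per 
automaton and the budget check are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for FuzzyQuery; a Map plays the per-search AttributeSource.
class ModelFuzzyQuery {
  static final String AUTOMATA_KEY = "automata"; // hypothetical attribute key
  final int maxEdits;
  int builds = 0; // how many times the automata were actually built

  ModelFuzzyQuery(int maxEdits) {
    this.maxEdits = maxEdits;
  }

  // Builds the automata on atts only if they are not already there (idempotent).
  private long[] init(Map<String, long[]> atts) {
    return atts.computeIfAbsent(AUTOMATA_KEY, k -> {
      builds++;
      long[] sizes = new long[maxEdits + 1];
      Arrays.fill(sizes, 1024L); // made-up per-automaton cost
      return sizes;
    });
  }

  long computeAutomataRamBytes(Map<String, long[]> atts) {
    if (maxEdits == 0) {
      return 0L;
    }
    long total = 0L;
    for (long size : init(atts)) {
      total += size;
    }
    return total;
  }

  // Stand-in for getTermsEnum(Terms, AttributeSource); returns a label, not a TermsEnum.
  String getTermsEnum(String segment, Map<String, long[]> atts) {
    if (maxEdits > 0) {
      init(atts); // reuses primed automata when atts was used for pre-flight
    }
    return "enum:" + segment;
  }
}

// Pattern 2: observe the cost in-flight by overriding getTermsEnum.
class ObservingFuzzyQuery extends ModelFuzzyQuery {
  long observedBytes = -1L;

  ObservingFuzzyQuery(int maxEdits) {
    super(maxEdits);
  }

  @Override
  String getTermsEnum(String segment, Map<String, long[]> atts) {
    String termsEnum = super.getTermsEnum(segment, atts);
    observedBytes = computeAutomataRamBytes(atts); // walks the already-built array
    return termsEnum;
  }
}

final class FuzzyAccountingDemo {
  // Pattern 1: price the automata first, then search every segment with the same atts.
  static long preflightThenSearch(ModelFuzzyQuery query, long budgetBytes, String... segments) {
    Map<String, long[]> atts = new HashMap<>();
    long cost = query.computeAutomataRamBytes(atts);
    if (cost > budgetBytes) {
      throw new IllegalStateException(
          "fuzzy automata need " + cost + " bytes, budget is " + budgetBytes);
    }
    for (String segment : segments) {
      query.getTermsEnum(segment, atts); // same atts, so nothing is rebuilt
    }
    return cost;
  }
}
```

   In both patterns the model builds the automata exactly once per attribute 
map; a real caller would replace the budget check with whatever accounting or 
circuit-breaking hook it already uses.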


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

