drempapis commented on PR #16031: URL: https://github.com/apache/lucene/pull/16031#issuecomment-4372596390
@rmuir thank you for your feedback. Fair on the complexity, let me clarify the trade-off, because I think the change is smaller than it reads, and the gap it fills is real. **On complexity**. The PR doesn't introduce any new construction, caching, or sharing mechanism. The two real additions are implements `Accountable` with a stable shallowSize + term.ramBytesUsed() (three lines) and `computeAutomataRamBytes(AttributeSource)`, which simply walks the existing `AutomatonAttribute` path that `FuzzyTermsEnum` already uses to share automata across segment; same supplier, same idempotent init, same array. The only structural change is widening `AutomatonAttribute` / `AutomatonAttributeImpl` from private to package-private so `FuzzyQuery` can install/read the same attribute type the enum already installs. They remain non-public Lucene internals. On **inear in input string**. Agreed, per query, the DFA is `O(n)` with `k ≤ 2` and a small constant. The motivation for the API isn't a single FuzzyQuery, it's the aggregate. In real workloads, a parsed query is often a `BooleanQuery` with many fuzzy clauses against fields with long terms; multiply by concurrent requests and the aggregate `O(m · n)` heap charge is large enough, and unpredictable enough from the host's point of view, to push the JVM toward OOM with no graceful degradation. Lucene rightly doesn't bound any of this, m, n, the per-segment fan-out, or the number of in-flight rewrites, so the host has to. **What this enables**. Hosts that already do request-scoped accounting for the rest of the family, `RegexpQuery`, `WildcardQuery`, `PrefixQuery`, `TermRangeQuery`, all of which expose their compiled-automaton cost through `Accountable#ramBytesUsed()` because they retain the automaton. Currently can't extend the same scheme to `FuzzyQuery`, because FuzzyQuery deliberately doesn't retain its automata on the query. The important part, the `computeAutomataRamBytes(atts)` lets a host pre-compute the automata cost on demand, charge it against a circuit breaker (or any metric applied), and then run the query, passing the same`AttributeSource` into rewrite so the automata are not built twice. So the marginal CPU cost of measurement is zero in the steady-state path: you'd have built those automata during search anyway. **What it does not change**. No effect on `equals/hashCode` (so query-cache keys are unchanged), no behavioral change to rewrite, scoring, or cross-segment automata sharing, no public-API change on `FuzzyTermsEnum`, no default-path change for callers that don't touch the new API. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
