Re: [PR] Implement Accountable and expose Levenshtein automata RAM cost in FuzzyQuery [lucene]

via GitHub Mon, 04 May 2026 09:13:23 -0700


drempapis commented on PR #16031:
URL: https://github.com/apache/lucene/pull/16031#issuecomment-4372596390


   @rmuir thank you for your feedback.
   
   Fair on the complexity,  let me clarify the trade-off, because I think the 
change is smaller than it reads, and the gap it fills is real.
   
   **On complexity**. The PR doesn't introduce any new construction, caching, 
or sharing mechanism. The two real additions are implements `Accountable` with 
a stable shallowSize + term.ramBytesUsed() (three lines) and 
`computeAutomataRamBytes(AttributeSource)`, which simply walks the existing 
`AutomatonAttribute` path that `FuzzyTermsEnum` already uses to share automata 
across segment; same supplier, same idempotent init, same array. The only 
structural change is widening `AutomatonAttribute` / `AutomatonAttributeImpl` 
from private to package-private so `FuzzyQuery` can install/read the same 
attribute type the enum already installs. They remain non-public Lucene 
internals.
   
   On **inear in input string**. Agreed, per query, the  DFA is `O(n)` with `k 
≤ 2` and a small constant. The motivation for the API isn't a single 
FuzzyQuery, it's the aggregate. In real workloads, a parsed query is often a 
`BooleanQuery` with many fuzzy clauses against fields with long terms; multiply 
by concurrent requests and the aggregate `O(m · n)` heap charge is large 
enough, and unpredictable enough from the host's point of view, to push the JVM 
toward OOM with no graceful degradation. Lucene rightly doesn't bound any of 
this,  m, n, the per-segment fan-out, or the number of in-flight rewrites, so 
the host has to.
   
   **What this enables**. Hosts that already do request-scoped accounting for 
the rest of the family, `RegexpQuery`, `WildcardQuery`, `PrefixQuery`, 
`TermRangeQuery`, all of which expose their compiled-automaton cost through 
`Accountable#ramBytesUsed()` because they retain the automaton. Currently can't 
extend the same scheme to `FuzzyQuery`, because FuzzyQuery deliberately doesn't 
retain its automata on the query. The important part, the 
`computeAutomataRamBytes(atts)` lets a host pre-compute the automata cost on 
demand, charge it against a circuit breaker (or any metric applied), and then 
run the query, passing the same`AttributeSource` into rewrite so the automata 
are not built twice. So the marginal CPU cost of measurement is zero in the 
steady-state path: you'd have built those automata during search anyway. 
   
   **What it does not change**. No effect on `equals/hashCode` (so query-cache 
keys are unchanged), no behavioral change to rewrite, scoring, or cross-segment 
automata sharing, no public-API change on `FuzzyTermsEnum`, no default-path 
change for callers that don't touch the new API.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Implement Accountable and expose Levenshtein automata RAM cost in FuzzyQuery [lucene]

Reply via email to