No, I don't think NFA will help in this case. It avoids the cost of the NFA-to-DFA determinization, but if the DFA needs to scan the whole term dictionary, then the NFA needs to do the same, I think.
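To make that scan cost concrete, here is a minimal sketch (the class `TermDictScan` and its shape are hypothetical — this is not Lucene's FST intersection code): a contains-style pattern like `*a*` gives the automaton no literal prefix to seek with, so whether the pattern runs as a DFA or an NFA, every term in the sorted dictionary gets visited. The `budget` parameter sketches the "accounting so it can early-terminate" idea raised further down the thread.

```java
import java.util.List;
import java.util.function.Predicate;

public class TermDictScan {

    static class BudgetExceededException extends RuntimeException {
        BudgetExceededException(int budget) {
            super("visited more than " + budget + " terms");
        }
    }

    /**
     * Count how many terms the intersection must visit. A non-empty literal
     * prefix (e.g. from "ab*") lets the scan skip whole regions of the sorted
     * dictionary; a contains-style pattern ("*a*") has no prefix, so every
     * term is visited regardless of how few actually match. The budget is the
     * "accounting" idea: give up instead of silently scanning everything.
     */
    static int termsVisited(List<String> sortedTerms, String prefix,
                            Predicate<String> matches, int budget) {
        int visited = 0;
        for (String term : sortedTerms) {
            if (!prefix.isEmpty()) {
                if (term.compareTo(prefix) < 0) continue; // seek skips these for free
                if (!term.startsWith(prefix)) break;      // past the prefix block: done
            }
            if (++visited > budget) throw new BudgetExceededException(budget);
            matches.test(term); // run the (DFA or NFA) matcher on this term
        }
        return visited;
    }
}
```

With a million terms of which only one contains an 'a', the `*a*` case still visits all million — that is the devolution-to-term-scan asked about below; only a literal prefix lets the scan stay proportional to the matching region.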
On Tue, Aug 6, 2024, 08:01 Michael Sokolov <msoko...@gmail.com> wrote:

> But actually Patrick Zhai added support for nondeterministic regexes
> that might help with cases like that? There is this in TestRegexpQuery:
>
> /** Test worst-case for getCommonSuffix optimization */
> public void testSlowCommonSuffix() throws Exception {
>   expectThrows(
>       TooComplexToDeterminizeException.class,
>       () -> {
>         new RegexpQuery(new Term("stringvalue", "(.*a){2000}"));
>       });
> }
>
> On Tue, Aug 6, 2024 at 10:56 AM Michael Sokolov <msoko...@gmail.com> wrote:
> >
> > Yes, I think degenerate regexes like *a* are potentially costly.
> > Actually something like *Ⱗ* is probably worse, since it would still
> > need to scan the entire FST (which probably has some a's in it?)
> >
> > I don't see any way around that aside from: (1) telling the user not
> > to do that, or (2) putting some accounting on the FST scan so it can
> > early-terminate.
> >
> > On Fri, Aug 2, 2024 at 8:17 PM Michael Froh <msf...@gmail.com> wrote:
> > >
> > > Incidentally, speaking as someone with only a superficial
> > > understanding of how the FSTs work, I'm wondering if there is a risk
> > > of cost in expanding the first few terms.
> > >
> > > Say we have a million terms, but only one contains an 'a'. If
> > > someone searches for '*a*', does that devolve into a term scan? Or
> > > can the FST efficiently identify an arc with an 'a' and efficiently
> > > identify terms containing that arc?
> > >
> > > Thanks,
> > > Froh
> > >
> > > On Fri, Aug 2, 2024 at 3:50 PM Michael Froh <msf...@gmail.com> wrote:
> > >>
> > >> Exactly!
> > >>
> > >> My initial implementation added some potential cost. (I think I
> > >> enumerated up to 128 terms before giving up.) Now that Mayya moved
> > >> the (probably tiny) cost of expanding the first 16 terms upfront,
> > >> my change is theoretically "free".
> > >>
> > >> Froh
> > >>
> > >> On Fri, Aug 2, 2024 at 3:25 PM Greg Miller <gsmil...@gmail.com> wrote:
> > >>>
> > >>> Hey Froh-
> > >>>
> > >>> I got some time to look through your PR (most of the time was
> > >>> actually refreshing my memory on the change history leading up to
> > >>> your PR and digesting the issue described). I think this makes a
> > >>> ton of sense. If I'm understanding properly, the latest version of
> > >>> your PR essentially takes advantage of Mayya's recent change
> > >>> (https://github.com/apache/lucene/pull/13454) to the score-supplier
> > >>> behavior, which now does _some_ up-front work to iterate the first
> > >>> <= 16 terms when building the scoreSupplier, and computes a more
> > >>> accurate/reasonable cost from that already-done work. Am I getting
> > >>> this right? If so, this seems like it has no downsides and all
> > >>> upside.
> > >>>
> > >>> I'll do a proper pass through the PR here shortly, but I love the
> > >>> idea (assuming I'm understanding it properly on a Friday afternoon
> > >>> after a long-ish week...).
> > >>>
> > >>> Cheers,
> > >>> -Greg
> > >>>
> > >>> On Thu, Aug 1, 2024 at 7:47 PM Greg Miller <gsmil...@gmail.com> wrote:
> > >>>>
> > >>>> Hi Froh-
> > >>>>
> > >>>> Thanks for raising this, and sorry I missed your tag in GH#13201
> > >>>> back in June (had some vacation and was generally away). I'd be
> > >>>> interested to see what others think as well, but I'll at least
> > >>>> commit to looking through your PR tomorrow or Monday to get a
> > >>>> better handle on what's being proposed. We went through a few
> > >>>> iterations of this originally before we landed on the current
> > >>>> version. One promising approach was to have a more intelligent
> > >>>> query that would load some number of terms up-front to get a
> > >>>> better cost estimate before making a decision, but it required a
> > >>>> custom query implementation that generally didn't get favorable
> > >>>> feedback (it's nice to be able to use the existing
> > >>>> IndexOrDocValuesQuery abstraction instead).
> > >>>> I can dig up some of that conversation if it's helpful, but I'll
> > >>>> get a better understanding of what you've got in mind first.
> > >>>>
> > >>>> Unwinding a bit, though: I'm also in favor, in general, of doing
> > >>>> a better job estimating cost here. I think the tricky part is how
> > >>>> we go about doing that effectively. Thanks again for kicking off
> > >>>> this thread!
> > >>>>
> > >>>> Cheers,
> > >>>> -Greg
> > >>>>
> > >>>> On Thu, Aug 1, 2024 at 5:58 PM Michael Froh <msf...@gmail.com> wrote:
> > >>>>>
> > >>>>> Hi there,
> > >>>>>
> > >>>>> For a few months, some of us have been running into issues with
> > >>>>> the cost estimate from AbstractMultiTermQueryConstantScoreWrapper
> > >>>>> (https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java#L300).
> > >>>>>
> > >>>>> In https://github.com/apache/lucene/issues/13029, the problem was
> > >>>>> raised in terms of queries not being cached because the estimated
> > >>>>> cost was too high.
> > >>>>>
> > >>>>> We've also run into problems in OpenSearch since we started
> > >>>>> wrapping MultiTermQueries in IndexOrDocValuesQuery. The MTQ gets
> > >>>>> an exaggerated cost estimate, so IndexOrDocValuesQuery decides it
> > >>>>> should be a DV query, even though the MTQ would really only match
> > >>>>> a handful of docs (and should be the lead iterator).
> > >>>>>
> > >>>>> I opened a PR back in March
> > >>>>> (https://github.com/apache/lucene/pull/13201) to try to handle
> > >>>>> the case where a MultiTermQuery matches a small number of terms.
> > >>>>> Since Mayya pulled the rewrite logic that expands up to 16 terms
> > >>>>> (to rewrite as a Boolean disjunction) earlier into the workflow
> > >>>>> (in https://github.com/apache/lucene/pull/13454), we get the
> > >>>>> better cost estimate for MTQs on few terms "for free".
> > >>>>>
> > >>>>> What do folks think?
> > >>>>>
> > >>>>> Thanks,
> > >>>>> Froh

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
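Tying the thread's two halves together in a hedged sketch (the class and method names here are hypothetical — the real logic lives in AbstractMultiTermQueryConstantScoreWrapper and in IndexOrDocValuesQuery's scorer supplier): if expanding up to 16 terms up front exhausts the enumeration, the summed doc frequencies are an exact cost rather than a field-level guess, and that exact cost is what lets an IndexOrDocValuesQuery-style decision correctly pick the postings-based side as the lead iterator for a sparse MultiTermQuery.

```java
import java.util.Iterator;

public class MtqCostSketch {

    static final int MAX_UPFRONT_TERMS = 16;

    /**
     * Expand up to 16 matching terms eagerly. If the enumeration is
     * exhausted within that bound, the summed doc frequencies are an exact
     * cost; otherwise fall back to a (often wildly pessimistic) field-level
     * heuristic like the one that caused the problems described above.
     */
    static long estimateCost(Iterator<Long> docFreqs, long heuristicFallback) {
        long cost = 0;
        int expanded = 0;
        while (docFreqs.hasNext() && expanded < MAX_UPFRONT_TERMS) {
            cost += docFreqs.next();
            expanded++;
        }
        return docFreqs.hasNext() ? heuristicFallback : cost;
    }

    /**
     * The IndexOrDocValuesQuery-style choice, boiled down to its essence:
     * a cheap term query should drive iteration from the postings; an
     * expensive one should verify against doc values instead.
     */
    static String choose(long termQueryCost, long leadCost) {
        return termQueryCost <= leadCost ? "index" : "doc_values";
    }
}
```

Under the old estimate, an MTQ that truly matches a handful of docs could report a cost in the millions and get demoted to the doc-values side; with the exact cost from the 16-term expansion, the same query correctly becomes the lead iterator.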