Hi Froh,

Thanks for raising this, and sorry I missed your tag in GH#13201 back in June (had some vacation and was generally away). I'd be interested to see what others think as well, but I'll at least commit to looking through your PR tomorrow or Monday to get a better handle on what's being proposed.

We went through a few iterations of this originally before we landed on the current version. One promising approach was to have a more intelligent query that would load some number of terms up-front to get a better cost estimate before making a decision, but it required a custom query implementation that generally didn't get favorable feedback (it's nice to be able to use the existing IndexOrDocValuesQuery abstraction instead). I can dig up some of that conversation if it's helpful, but I'd like to better understand what you've got in mind first.

Unwinding a bit though, I also agree in general that we should be able to do a better job of estimating cost here. I think the tricky part is how we go about doing that effectively. Thanks again for kicking off this thread!

Cheers,
-Greg

On Thu, Aug 1, 2024 at 5:58 PM Michael Froh <msf...@gmail.com> wrote:

> Hi there,
>
> For a few months, some of us have been running into issues with the cost
> estimate from AbstractMultiTermQueryConstantScoreWrapper.
> (https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java#L300)
>
> In https://github.com/apache/lucene/issues/13029, the problem was raised
> in terms of queries not being cached, because the estimated cost was too
> high.
>
> We've also run into problems in OpenSearch, since we started wrapping
> MultiTermQueries in IndexOrDocValuesQuery. The MTQ gets an exaggerated cost
> estimate, so IndexOrDocValuesQuery decides it should be a DV query, even
> though the MTQ would really only match a handful of docs (and should be
> the lead iterator).
>
> I opened a PR back in March (https://github.com/apache/lucene/pull/13201)
> to try to handle the case where a MultiTermQuery matches a small number of
> terms. Since Mayya pulled the rewrite logic that expands up to 16 terms (to
> rewrite as a Boolean disjunction) earlier in the workflow (in
> https://github.com/apache/lucene/pull/13454), we get the better cost
> estimate for MTQs on few terms "for free".
>
> What do folks think?
>
> Thanks,
> Froh