Marvin Humphrey wrote:

I'm seeing sizable performance gains by "taking advantage" of Matchers that could expose a random access API. It would be good if we could query a
Matcher to see if it supports random access...

I hadn't replied to this before now for the same reason I hadn't posted a response to the JIRA issue you opened: I'm not fond of the proposed solution,
but I don't have any better ideas. :(

Yeah, I'm not sure how we can cleanly solve it in Lucene, either.
There's this one too, which we should factor into Matcher:

    https://issues.apache.org/jira/browse/LUCENE-1252

ie, for "interesting" queries, whose constraints can be broken into
AND'd sub-constraints, you may want to reach in and test the first
sub-constraint against the other clauses in the main query, and only
go back to the more expensive second sub-constraint if all the main
constraints pass.  Perhaps, during rewrite, such queries could return
2 AND'd constraints somehow.
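
Roughly, I'm picturing something like the sketch below.  (Just an
illustration -- the interfaces and names are invented, not the real
Lucene API.)  The cheap sub-constraint gets driven in lockstep with
the other required clauses, and the expensive sub-constraint only
runs against docs that survive everything else:

    // Hypothetical iterator over matching doc IDs, in increasing order.
    interface DocIterator {
        int NO_MORE_DOCS = Integer.MAX_VALUE;
        int nextDoc();            // advance to the next matching doc
        int advance(int target);  // first doc >= target, or NO_MORE_DOCS
    }

    // Hypothetical two-part constraint an "interesting" query could
    // return from rewrite: a cheap approximation plus an expensive
    // per-doc verification step.
    interface SplitConstraint {
        DocIterator cheapPart();           // eg a term-based approximation
        boolean expensivePart(int docID);  // eg a positions/geo check
    }

    class ConjunctionSketch {
        // Drive the cheap approximation against another required clause,
        // and only pay for the expensive check on docs that pass both.
        static int countMatches(SplitConstraint split, DocIterator other) {
            DocIterator cheap = split.cheapPart();
            int count = 0;
            int doc = cheap.nextDoc();
            while (doc != DocIterator.NO_MORE_DOCS) {
                int otherDoc = other.advance(doc);
                if (otherDoc == doc) {
                    // Every cheap constraint agrees; now run the
                    // expensive one.
                    if (split.expensivePart(doc)) {
                        count++;
                    }
                    doc = cheap.nextDoc();
                } else {
                    // Leapfrog: let the cheap part catch up.
                    doc = cheap.advance(otherDoc);
                }
            }
            return count;
        }
    }

The tricky part is getting rewrite to hand back the two pieces
without exposing a pile of new API.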

Reaching into a Matcher and choosing which path to go down depending on whether it supports random access or not is going to result in a lot of duplicated code. Everywhere we want to optimize, we need one track for the random access method, and a fallback track for the iterator. Furthermore, the tracks aren't very similar, so the complexity cost to realize the optimization
is high.

It's worse: you need the ability to distribute a random-access Matcher
"down" to all the sub-Matchers (ie, to match how deleted docs are
applied in Lucene today).
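
To make the "two track" point concrete, here's a rough sketch
(hypothetical names, not the real Matcher API) of what each
optimization point ends up looking like:

    import java.util.BitSet;

    interface Matcher {
        int NO_MORE_DOCS = Integer.MAX_VALUE;
        int nextDoc();

        // Returns null if this matcher can't offer random access.
        BitSet randomAccessBits();
    }

    class FilteredCount {
        // Count the docs from main that are accepted by filter.
        static int count(Matcher main, Matcher filter) {
            BitSet bits = filter.randomAccessBits();
            if (bits != null) {
                // Track 1: random access -- cheap per-doc membership test.
                int count = 0;
                int doc = main.nextDoc();
                while (doc != Matcher.NO_MORE_DOCS) {
                    if (bits.get(doc)) count++;
                    doc = main.nextDoc();
                }
                return count;
            }
            // Track 2: iterator fallback -- leapfrog the two streams.
            int count = 0;
            int d1 = main.nextDoc(), d2 = filter.nextDoc();
            while (d1 != Matcher.NO_MORE_DOCS && d2 != Matcher.NO_MORE_DOCS) {
                if (d1 == d2) {
                    count++;
                    d1 = main.nextDoc();
                    d2 = filter.nextDoc();
                } else if (d1 < d2) {
                    d1 = main.nextDoc();
                } else {
                    d2 = filter.nextDoc();
                }
            }
            return count;
        }
    }

Multiply that by every place we'd want the optimization, and that's
before the plumbing to hand the random-access bits down to the
sub-matchers.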

There's a classic OO code simplification technique you may be familiar with: "replace conditionals with polymorphism". This is the exact opposite of that:
"replace polymorphism with conditionals".  Ick.

I fear the high-performance solution will simply be messy.  Query
execution/optimization is likely a problem whose best-performing
solution is absurdly hairy.  And, worse, the elegant, compact solution
is not just a little bit slower; it seems to be a lot slower.

EG I could imagine a solution where we generate Java code on the
fly, specialized to scoring the exact complex query, compile it, and
run it.  That would likely perform well, but it sure is a crazy
departure from what we do today.

BTW I think such query optimization only applies to "hard"
queries.  Simple queries (whose approxCount() seems low) should just
execute the default way.
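
Something like the sketch below is what I have in mind for that gate
(approxCount() doesn't exist yet; the estimates and the threshold are
invented for illustration):

    import java.util.List;

    abstract class EstimatingMatcher {
        // Rough upper bound on how many docs this matcher could match.
        abstract int approxCount();
    }

    class TermMatcherEstimate extends EstimatingMatcher {
        private final int docFreq;  // from the term dictionary
        TermMatcherEstimate(int docFreq) {
            this.docFreq = docFreq;
        }
        int approxCount() { return docFreq; }  // exact for a single term
    }

    class AndMatcherEstimate extends EstimatingMatcher {
        private final List<EstimatingMatcher> children;
        AndMatcherEstimate(List<EstimatingMatcher> children) {
            this.children = children;
        }
        int approxCount() {
            // An AND can match at most as many docs as its rarest child.
            int min = Integer.MAX_VALUE;
            for (EstimatingMatcher child : children) {
                min = Math.min(min, child.approxCount());
            }
            return children.isEmpty() ? 0 : min;
        }
    }

    class MatchingOptimizer {
        static final int HARD_QUERY_THRESHOLD = 50000;  // invented cutoff

        static boolean worthOptimizing(EstimatingMatcher m) {
            // Simple (low-count) queries just execute the default way.
            return m.approxCount() >= HARD_QUERY_THRESHOLD;
        }
    }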

At present, I don't have the benchmarking tools at my disposal to verify that your benchmark results also hold true for C. Unfortunately, I suspect they do; the results might even be worse if there's automatic bounds
checking required by Java that we can figure out how to forgo in C.

Yeah, I'm guessing in C you'll see it even worse than we do in Java.

I'm waiting for some intuitive leap to come along and rescue me.

OK :)

Another thing you might want to add to Matcher is "approxCount()",
which would make a rough guess at the possible number of matches, to
help a future "matching optimizer" pick a strategy.

I can see us doing this within the core. It could be handy when, say, deciding whether to invert a sparse filter and apply it using "next match" instead of
"next miss".

However, Matcher is supposed to be a public class and I would be reluctant to add approxCount() as a public method. Since Lucy cannot assume that its target platforms support sane deprecation, once we add a public method it's there forever. Therefore, we need to be conservative about exposing public methods in general, and extremely conservative about exposing methods whose
sole purpose is optimization.

Hmm good point.
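
For the sparse-filter case you mention, I'm imagining something like
this inside the core (hypothetical names; the 10% cutoff is
arbitrary) -- approxCount() is just there to pick which way to drive
the iteration:

    class SparseFilterSketch {
        // Hypothetical filter interface, for illustration only.
        interface Filter {
            int approxCount();           // rough guess at docs it accepts
            int nextMatch(int fromDoc);  // first accepted doc >= fromDoc, or -1
            boolean accepts(int doc);
        }

        static long countAccepted(Filter filter, int maxDoc) {
            long count = 0;
            if (filter.approxCount() < maxDoc / 10) {
                // Sparse: drive iteration from the filter's own matches.
                int doc = filter.nextMatch(0);
                while (doc != -1) {
                    count++;
                    doc = filter.nextMatch(doc + 1);
                }
            } else {
                // Dense: cheaper to walk every doc and test membership.
                for (int doc = 0; doc < maxDoc; doc++) {
                    if (filter.accepts(doc)) count++;
                }
            }
            return count;
        }
    }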

Mike
