Marvin Humphrey wrote:
I'm seeing sizable performance gains by "taking advantage" of Matchers that could expose a random access API. It would be good if we could query a Matcher to see if it supports random access...
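For the sake of discussion, a capability-probe design might look something like the following sketch (Java; all names are invented for illustration and are neither Lucene's nor Lucy's actual API):

```java
// Hypothetical sketch: a Matcher normally iterates doc IDs in order, but
// some implementations (e.g. one backed by a bit vector) can also answer
// membership queries in O(1). A capability probe lets callers detect and
// exploit that. All names here are invented for illustration.
interface Matcher {
    int NO_MORE_DOCS = Integer.MAX_VALUE;

    int next(); // advance to the next matching doc ID, or NO_MORE_DOCS

    default boolean supportsRandomAccess() { return false; }

    // Only valid to call when supportsRandomAccess() returns true.
    default boolean matches(int docId) {
        throw new UnsupportedOperationException();
    }
}

// A bit-vector-backed matcher can serve both access patterns.
class BitSetMatcher implements Matcher {
    private final java.util.BitSet bits;
    private int doc = -1;

    BitSetMatcher(java.util.BitSet bits) { this.bits = bits; }

    @Override public int next() {
        int n = bits.nextSetBit(doc + 1);
        doc = (n < 0) ? NO_MORE_DOCS : n;
        return doc;
    }

    @Override public boolean supportsRandomAccess() { return true; }
    @Override public boolean matches(int docId) { return bits.get(docId); }
}
```

A caller that finds the probe returning true can skip the iterator entirely and test candidate docs directly.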
I hadn't replied to this before now for the same reason I hadn't posted a response to the JIRA issue you opened: I'm not fond of the proposed solution, but I don't have any better ideas. :(
Yeah I'm not sure how we can cleanly solve it in Lucene, either.
There's this one too, which we should factor into Matcher:
https://issues.apache.org/jira/browse/LUCENE-1252
i.e., for "interesting" queries, whose constraints can be broken into AND'd sub-constraints, you may want to reach in and test the first sub-constraint against the other clauses in the main query, and only go back to the more expensive second sub-constraint if all main constraints pass. Perhaps, during rewrite, such queries could return 2 AND'd constraints somehow.
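To make that idea concrete, here's a rough sketch in the spirit of LUCENE-1252 (invented names, not Lucene's actual API) of a conjunction that runs every clause's cheap sub-constraint first and defers the expensive one:

```java
import java.util.List;

// Hypothetical two-phase constraint: a cheap approximate test that may
// yield false positives, plus an expensive exact test that is only
// consulted for docs that survive every cheap test.
interface TwoPhaseCheck {
    boolean cheapTest(int docId);
    boolean expensiveTest(int docId);
}

class Conjunction {
    // Accept a doc only if every cheap test passes; run the expensive
    // confirmations only once all cheap tests have succeeded.
    static boolean accept(int docId, List<TwoPhaseCheck> clauses) {
        for (TwoPhaseCheck c : clauses) {
            if (!c.cheapTest(docId)) return false;      // reject early, cheaply
        }
        for (TwoPhaseCheck c : clauses) {
            if (!c.expensiveTest(docId)) return false;  // confirm survivors
        }
        return true;
    }
}
```

The payoff is that a doc rejected by any cheap test never pays for any expensive test at all.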
Reaching into a Matcher and choosing which path to go down depending on whether it supports random access or not is going to result in a lot of duplicated code. Everywhere we want to optimize, we need one track for the random access method, and a fallback track for the iterator. Furthermore, the tracks aren't very similar, so the complexity cost to realize the optimization is high.
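As a small self-contained illustration of that duplication (invented names; M stands in for a Matcher-like type), note how every optimized call site ends up carrying two dissimilar tracks:

```java
// Minimal sketch of the two-track duplication. All names are invented.
interface M {
    int NO_MORE_DOCS = Integer.MAX_VALUE;
    int next();
    boolean supportsRandomAccess();
    boolean matches(int docId);
}

class Intersect {
    // Count how many candidate docs the matcher accepts. Both code paths
    // must be written, tested, and maintained at every such call site.
    static int count(M m, int[] sortedCandidates) {
        int count = 0;
        if (m.supportsRandomAccess()) {
            // Track 1: probe each candidate directly.
            for (int doc : sortedCandidates) {
                if (m.matches(doc)) count++;
            }
        } else {
            // Track 2: merge two sorted doc-ID streams via the iterator.
            int i = 0, doc = m.next();
            while (i < sortedCandidates.length && doc != M.NO_MORE_DOCS) {
                if (doc == sortedCandidates[i]) { count++; i++; doc = m.next(); }
                else if (doc < sortedCandidates[i]) { doc = m.next(); }
                else { i++; }
            }
        }
        return count;
    }
}
```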
It's worse: you need the ability to distribute a random-access Matcher
"down" to all the sub-Matchers (ie, to match how deleted docs are
applied in Lucene today).
There's a classic OO code simplification technique you may be familiar with: "replace conditionals with polymorphism". This is the exact opposite of that: "replace polymorphism with conditionals". Ick.
I fear the high performance solution will simply be messy. Query
execution/optimization is likely a problem whose best-performing
solution is absurdly hairy. And, worse, the elegant compact solution
is not just a little bit slower, it seems to be a lot slower.
EG I could imagine a solution where we create dynamic Java code on the
fly, specialized to scoring the exact complex query, compile it, and
run it. This likely would perform well, but sure is a crazy departure
from what we do today.
BTW, I think such query optimization only applies to "hard" queries. Simple queries (whose approxCount() seems low) should just execute the default way.
At present, I don't have the necessary benchmarking tools at my disposal to verify that your benchmarking tests also hold true for C. Unfortunately, I suspect they do; results might even be worse if there's some automatic bounds checking needed by Java that we can figure out how to forego in C.
Yeah I'm guessing in C you'll see it even worse than we see in Java.
I'm waiting for some intuitive leap to come along and rescue me.
OK :)
Another thing you might want to add to Matcher is "approxCount()",
which would make a rough guess at the possible number of matches, to
help a future "matching optimizer" pick a strategy.
I can see us doing this within the core. It could be handy when, say, deciding whether to invert a sparse filter and apply it using "next match" instead of "next miss".
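One way to picture that kind of approxCount()-driven choice (Java sketch with invented names; the threshold is purely illustrative, not anything from Lucy's core):

```java
import java.util.BitSet;

// Illustrative sketch of how an approximate match count could steer
// filter application: drive the matching loop from whichever side of
// the filter is sparse. All names and thresholds are invented.
class FilterStrategy {
    // Iterate the matches when few docs match; otherwise it may be
    // cheaper to enumerate and skip the few misses instead.
    static boolean driveByMatches(int approxCount, int maxDoc) {
        return approxCount * 2 < maxDoc; // illustrative midpoint threshold
    }

    // Enumerating the sparse side of a dense filter: its misses.
    static int[] misses(BitSet matches, int maxDoc) {
        BitSet inverted = (BitSet) matches.clone();
        inverted.flip(0, maxDoc);
        return inverted.stream().toArray();
    }
}
```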
However, Matcher is supposed to be a public class and I would be reluctant to add approxCount() as a public method. Since Lucy cannot assume that its target platforms support sane deprecation, once we add a public method it's there forever. Therefore, we need to be conservative about exposing public methods in general, and extremely conservative about exposing methods whose sole purpose is optimization.
Hmm good point.
Mike