Marvin Humphrey wrote:
I'm seeing sizable performance gains by "taking advantage" of Matchers that could expose a random access API. It would be good if we could query a Matcher to see if it supports random access...
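For the sake of discussion, a capability-probe design might look something like the following sketch (Java; all names are invented for illustration and are neither Lucene's nor Lucy's actual API):

```java
// Hypothetical sketch: a Matcher normally iterates doc IDs in order, but
// some implementations (e.g. one backed by a bit vector) can also answer
// membership queries in O(1). A capability probe lets callers detect and
// exploit that. All names here are invented for illustration.
interface Matcher {
    int NO_MORE_DOCS = Integer.MAX_VALUE;

    int next(); // advance to the next matching doc ID, or NO_MORE_DOCS

    default boolean supportsRandomAccess() { return false; }

    // Only valid to call when supportsRandomAccess() returns true.
    default boolean matches(int docId) {
        throw new UnsupportedOperationException();
    }
}

// A bit-vector-backed matcher can serve both access patterns.
class BitSetMatcher implements Matcher {
    private final java.util.BitSet bits;
    private int doc = -1;

    BitSetMatcher(java.util.BitSet bits) { this.bits = bits; }

    @Override public int next() {
        int n = bits.nextSetBit(doc + 1);
        doc = (n < 0) ? NO_MORE_DOCS : n;
        return doc;
    }

    @Override public boolean supportsRandomAccess() { return true; }
    @Override public boolean matches(int docId) { return bits.get(docId); }
}
```

A caller that finds the probe returning true can skip the iterator entirely and test candidate docs directly.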
I hadn't replied to this before now for the same reason I hadn't posted a response to the JIRA issue you opened: I'm not fond of the proposed solution, but I don't have any better ideas. :(
Yeah I'm not sure how we can cleanly solve it in Lucene, either.
There's this one too, which we should factor into Matcher:
https://issues.apache.org/jira/browse/LUCENE-1252
i.e., for "interesting" queries, whose constraints can be broken into AND'd sub-constraints, you may want to reach in and test the first sub-constraint against the other clauses in the main query, and only go back to the more expensive second sub-constraint if all main constraints pass. Perhaps, during rewrite, such queries could return 2 AND'd constraints somehow.
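To make that idea concrete, here's a rough sketch in the spirit of LUCENE-1252 (invented names, not Lucene's actual API) of a conjunction that runs every clause's cheap sub-constraint first and defers the expensive one:

```java
import java.util.List;

// Hypothetical two-phase constraint: a cheap approximate test that may
// yield false positives, plus an expensive exact test that is only
// consulted for docs that survive every cheap test.
interface TwoPhaseCheck {
    boolean cheapTest(int docId);
    boolean expensiveTest(int docId);
}

class Conjunction {
    // Accept a doc only if every cheap test passes; run the expensive
    // confirmations only once all cheap tests have succeeded.
    static boolean accept(int docId, List<TwoPhaseCheck> clauses) {
        for (TwoPhaseCheck c : clauses) {
            if (!c.cheapTest(docId)) return false;      // reject early, cheaply
        }
        for (TwoPhaseCheck c : clauses) {
            if (!c.expensiveTest(docId)) return false;  // confirm survivors
        }
        return true;
    }
}
```

The payoff is that a doc rejected by any cheap test never pays for any expensive test at all.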
Reaching into a Matcher and choosing which path to go down depending on whether it supports random access or not is going to result in a lot of duplicated code. Everywhere we want to optimize, we need one track for the random access method, and a fallback track for the iterator. Furthermore, the tracks aren't very similar, so the complexity cost to realize the optimization is high.
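As a small self-contained illustration of that duplication (invented names; M stands in for a Matcher-like type), note how every optimized call site ends up carrying two dissimilar tracks:

```java
// Minimal sketch of the two-track duplication. All names are invented.
interface M {
    int NO_MORE_DOCS = Integer.MAX_VALUE;
    int next();
    boolean supportsRandomAccess();
    boolean matches(int docId);
}

class Intersect {
    // Count how many candidate docs the matcher accepts. Both code paths
    // must be written, tested, and maintained at every such call site.
    static int count(M m, int[] sortedCandidates) {
        int count = 0;
        if (m.supportsRandomAccess()) {
            // Track 1: probe each candidate directly.
            for (int doc : sortedCandidates) {
                if (m.matches(doc)) count++;
            }
        } else {
            // Track 2: merge two sorted doc-ID streams via the iterator.
            int i = 0, doc = m.next();
            while (i < sortedCandidates.length && doc != M.NO_MORE_DOCS) {
                if (doc == sortedCandidates[i]) { count++; i++; doc = m.next(); }
                else if (doc < sortedCandidates[i]) { doc = m.next(); }
                else { i++; }
            }
        }
        return count;
    }
}
```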
It's worse: you need the ability to distribute a random-access Matcher
"down" to all the sub-Matchers (ie, to match how deleted docs are
applied in Lucene today).
There's a classic OO code simplification technique you may be familiar with: "replace conditionals with polymorphism". This is the exact opposite of that: "replace polymorphism with conditionals". Ick.
I fear the high performance solution will simply be messy. Query
execution/optimization is likely a problem whose best-performing
solution is absurdly hairy. And, worse, the elegant compact solution
is not just a little bit slower, it seems to be a lot slower.
EG I could imagine a solution where we create dynamic Java code on the
fly, specialized to scoring the exact complex query, compile it, and
run it. This likely would perform well, but sure is a crazy departure
from what we do today.
BTW, I think such query optimization only applies to "hard" queries. Simple queries (whose approxCount() seems low) should just execute the default way.
At present, I don't have the necessary benchmarking tools at my disposal to verify that your benchmarking tests also hold true for C. Unfortunately, I suspect they do; results might even be worse if there's some automatic bounds checking needed by Java that we can figure out how to forego in C.
Yeah I'm guessing in C you'll see it even worse than we see in Java.
I'm waiting for some intuitive leap to come along and rescue me.
OK :)
Another thing you might want to add to Matcher is "approxCount()",
which would make a rough guess at the possible number of matches, to
help a future "matching optimizer" pick a strategy.
I can see us doing this within the core. It could be handy when, say, deciding whether to invert a sparse filter and apply it using "next match" instead of "next miss".
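One way to picture that kind of approxCount()-driven choice (Java sketch with invented names; the threshold is purely illustrative, not anything from Lucy's core):

```java
import java.util.BitSet;

// Illustrative sketch of how an approximate match count could steer
// filter application: drive the matching loop from whichever side of
// the filter is sparse. All names and thresholds are invented.
class FilterStrategy {
    // Iterate the matches when few docs match; otherwise it may be
    // cheaper to enumerate and skip the few misses instead.
    static boolean driveByMatches(int approxCount, int maxDoc) {
        return approxCount * 2 < maxDoc; // illustrative midpoint threshold
    }

    // Enumerating the sparse side of a dense filter: its misses.
    static int[] misses(BitSet matches, int maxDoc) {
        BitSet inverted = (BitSet) matches.clone();
        inverted.flip(0, maxDoc);
        return inverted.stream().toArray();
    }
}
```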
However, Matcher is supposed to be a public class and I would be reluctant to add approxCount() as a public method. Since Lucy cannot assume that its target platforms support sane deprecation, once we add a public method it's there forever. Therefore, we need to be conservative about exposing public methods in general, and extremely conservative about exposing methods whose sole purpose is optimization.
Hmm good point.
Mike