Re: Matcher

Michael McCandless Fri, 06 Feb 2009 02:56:16 -0800

I like this approach Marvin, and I appreciate the meandering path you
went through to arrive at it!


I very much like putting the invocation of Matcher.score()
(Scorer_Tally()) into the collector's hands.  In Lucene we always
score prior to collection, but eg if your app will sort by something
other than relevance, and does not intend to present relevance, then
you don't need to compute the score.

One thing I'm struggling with now (in Lucene), that I'm not sure
applies to KS/Lucy, is how to best take "random access Matchers" into
account.  Not all Matchers are created equal.  Some really want to do
the next()/skipTo() iteration (eg a posting list), but others (deleted
docs currently, or cached filters, in Lucene) are "best" accessed by
random access accept(int docID) API, perhaps permuted down to all leaf
TermScorers, though they could also do iteration.  I'm seeing sizable
performance gains by "taking advantage" of Matchers that could expose
a random access API.  It would be good if we could query a Matcher to
see if it supports random access...

Another thing you might want to add to Matcher is "approxCount()",
which would make a rough guess at the possible number of matches, to
help a future "matching optimizer" pick a strategy.

Mike

Marvin Humphrey wrote:

Greets,
Many ideas have come together in a new KS class, Matcher, which Ithink ought
to replace Scorer in both KS and Lucy.
Conceptually, Matcher began as a simple iterator over a set of docnums. Itwas born because KS needed an opaque super class to supportiterating overdeletions -- which might be backed either by a bit vector or by whatwe're
calling "tombstones".

In Lucene, that's the role played by "DocIdSetIterator".  The name
"DocIdSetIterator" is precise and descriptive, but since it's a tadverbose I
stumbled around for a bit looking for an alternative... and arrived at
"Matcher".
The idea of a "Matcher" class has been batted around many timesbefore, butusually in the context of "Scorer minus the scoring". This"Matcher" would be
a more general class, usable not only for iterating over hits, but for
iterating over any set of doc nums in any context.
Around this time, Nate and I had a private email exchange on thesubject of
deletions-as-iterators.  Nate:
That said, deletions as filters seems like a good way to viewthings,
   though.  I'd probably go further, and not have any special handling
   for them in the base, then subclass Scorer_Collect to
   Scorer_Collect_Filtered, and allow for stacked filters.
Interesting... So, if you had both a QueryFilter and some deletions,you mightmerge them together into a single iterator whose output would ORtogethertheir doc num sets. You'd probably want to do this with a priorityqueue...Hmm, that's the same technique that we'd be using for mergingtombstone rows-- thus a generic priority queue class which unions the output ofmultiple
Matcher iterators would kill two birds with one stone.
But wait -- we already have such a class! ScorerDocQueue is used byORScorerto OR together the doc num sets of multiple Scorers. And yet, itdoesn'thandle any scoring duties itself -- the only methods it calls areNext(),Advance() and Get_Doc_Num(), all of which are defined by Scorer'sparent classMatcher. All we have to do to adapt it is change the queue elementsfrom
Scorers to Matchers.  *Three* birds with one stone!

Can we pursue this still further?
The original concept for "Matcher" was a class that matches but doesnot
score.  Instead of a HitCollector, one thought went, we might use a
"MatchCollector" which collects only document numbers.
The KS HitCollector's Collect() method has recently changed toaddress the"match-only" case. MatchCollector's Collect() method wouldtheoretically havetaken only a document number, saving cycles by not requiring theScorer to
calculate a score:

  HC_Collect(collector, doc_num, Scorer_Tally(scorer));
  /* vs. */
  MatchColl_Collect(match_collector, doc_num);
However, passing arguments is cheap. How about, if instead ofpassing scoring
information we pass the Scorer itself, and leave it up to individual
HitCollectors whether to request scoring info?

  HC_Collect(collector, doc_num, scorer);
That way, classes like BitCollector can just accept the doc_num andignore thescorer. Turns out tere's no need for a MatchCollector class, afterall.
But then, what if we want HC_Collect() to accept a Matcher ratherthan requirea Scorer? Logically, we ought to be able to, but Matcher doesn'tcurrentlyknow how to supply the scoring info that some HitCollectors wouldneed.
Well, we can fix that by moving the Scorer_Tally() method up intoMatcher, andhaving it supply a default score of 0.0. Then, *any* Matcheriterating over a
set of doc ids can be used as a source of hit data.
This is another win for code reuse. Classes which performsophisticated
scoring powered by Similarity, such as TF/IDF-enabled TermScorer and
PhraseScorer, or compound scorers like
ANDScorer/ORScorer/RequiredOptionalScorer which callSimilarity.coord(), cansubclass constant-scoring Matcher classes, e.g. ORMatcher. However,classes
like ORMatcher can also be be used in non-scoring contexts.
In addition, we make certain kinds of subclasses simpler to write.Nateexpressed a desire a while back to subclass the KS boolean query/scorerhierarchy so that he could take advantage of its logical matchingcapabilitiesyet implement custom scoring algos -- but having to cancel out allthe TF/IDFstuff imposed an annoying complexity cost. Subclassing booleanMatcher
classes would have made that project simpler.
Now, let's assess where we've ended up. Matcher was originallysupposed to bea simple iterator class. It's still quite simple, but with theadded methodsit looks almost identical to the old Scorer -- the only differenceis themissing Similarity member var. That begs the question, why not justuse
Scorer everywhere?  Use Scorers to iterate over deletions, Scorers for
filtering, etc.
It's possible. But I think it makes more sense to give our searchenginelibrary a broader mandate and re-conceptualize its search apparatusas amatching tool which can be extended with scoring. We can achievethat by
replacing Scorer with Matcher and by using Matcher in contexts where a
"Scorer" wouldn't have seemed appropriate.

Marvin Humphrey

Re: Matcher

Reply via email to