I like this approach Marvin, and I appreciate the meandering path you
went through to arrive at it!

I very much like putting the invocation of Matcher.score()
(Scorer_Tally()) into the collector's hands.  In Lucene we always
score prior to collection, but eg if your app will sort by something
other than relevance, and does not intend to present relevance, then
you don't need to compute the score.

One thing I'm struggling with now (in Lucene), that I'm not sure
applies to KS/Lucy, is how to best take "random access Matchers" into
account.  Not all Matchers are created equal.  Some really want to do
the next()/skipTo() iteration (eg a posting list), but others (deleted
docs currently, or cached filters, in Lucene) are "best" accessed by
random access accept(int docID) API, perhaps permuted down to all leaf
TermScorers, though they could also do iteration.  I'm seeing sizable
performance gains by "taking advantage" of Matchers that could expose
a random access API.  It would be good if we could query a Matcher to
see if it supports random access...

Another thing you might want to add to Matcher is "approxCount()",
which would make a rough guess at the possible number of matches, to
help a future "matching optimizer" pick a strategy.

Mike

Marvin Humphrey wrote:

Greets,

Many ideas have come together in a new KS class, Matcher, which I think ought
to replace Scorer in both KS and Lucy.

Conceptually, Matcher began as a simple iterator over a set of doc nums. It was born because KS needed an opaque super class to support iterating over deletions -- which might be backed either by a bit vector or by what we're
calling "tombstones".

In Lucene, that's the role played by "DocIdSetIterator".  The name
"DocIdSetIterator" is precise and descriptive, but since it's a tad verbose I
stumbled around for a bit looking for an alternative... and arrived at
"Matcher".

The idea of a "Matcher" class has been batted around many times before, but usually in the context of "Scorer minus the scoring". This "Matcher" would be
a more general class, usable not only for iterating over hits, but for
iterating over any set of doc nums in any context.

Around this time, Nate and I had a private email exchange on the subject of
deletions-as-iterators.  Nate:

That said, deletions as filters seems like a good way to view things,
   though.  I'd probably go further, and not have any special handling
   for them in the base, then subclass Scorer_Collect to
   Scorer_Collect_Filtered, and allow for stacked filters.

Interesting... So, if you had both a QueryFilter and some deletions, you might merge them together into a single iterator whose output would OR together their doc num sets. You'd probably want to do this with a priority queue... Hmm, that's the same technique that we'd be using for merging tombstone rows -- thus a generic priority queue class which unions the output of multiple
Matcher iterators would kill two birds with one stone.

But wait -- we already have such a class! ScorerDocQueue is used by ORScorer to OR together the doc num sets of multiple Scorers. And yet, it doesn't handle any scoring duties itself -- the only methods it calls are Next(), Advance() and Get_Doc_Num(), all of which are defined by Scorer's parent class Matcher. All we have to do to adapt it is change the queue elements from
Scorers to Matchers.  *Three* birds with one stone!

Can we pursue this still further?

The original concept for "Matcher" was a class that matches but does not
score.  Instead of a HitCollector, one thought went, we might use a
"MatchCollector" which collects only document numbers.

The KS HitCollector's Collect() method has recently changed to address the "match-only" case. MatchCollector's Collect() method would theoretically have taken only a document number, saving cycles by not requiring the Scorer to
calculate a score:

  HC_Collect(collector, doc_num, Scorer_Tally(scorer));
  /* vs. */
  MatchColl_Collect(match_collector, doc_num);

However, passing arguments is cheap. How about, if instead of passing scoring
information we pass the Scorer itself, and leave it up to individual
HitCollectors whether to request scoring info?

  HC_Collect(collector, doc_num, scorer);

That way, classes like BitCollector can just accept the doc_num and ignore the scorer. Turns out tere's no need for a MatchCollector class, after all.

But then, what if we want HC_Collect() to accept a Matcher rather than require a Scorer? Logically, we ought to be able to, but Matcher doesn't currently know how to supply the scoring info that some HitCollectors would need.

Well, we can fix that by moving the Scorer_Tally() method up into Matcher, and having it supply a default score of 0.0. Then, *any* Matcher iterating over a
set of doc ids can be used as a source of hit data.

This is another win for code reuse. Classes which perform sophisticated
scoring powered by Similarity, such as TF/IDF-enabled TermScorer and
PhraseScorer, or compound scorers like
ANDScorer/ORScorer/RequiredOptionalScorer which call Similarity.coord(), can subclass constant-scoring Matcher classes, e.g. ORMatcher. However, classes
like ORMatcher can also be be used in non-scoring contexts.

In addition, we make certain kinds of subclasses simpler to write. Nate expressed a desire a while back to subclass the KS boolean query/ scorer hierarchy so that he could take advantage of its logical matching capabilities yet implement custom scoring algos -- but having to cancel out all the TF/IDF stuff imposed an annoying complexity cost. Subclassing boolean Matcher
classes would have made that project simpler.

Now, let's assess where we've ended up. Matcher was originally supposed to be a simple iterator class. It's still quite simple, but with the added methods it looks almost identical to the old Scorer -- the only difference is the missing Similarity member var. That begs the question, why not just use
Scorer everywhere?  Use Scorers to iterate over deletions, Scorers for
filtering, etc.

It's possible. But I think it makes more sense to give our search engine library a broader mandate and re-conceptualize its search apparatus as a matching tool which can be extended with scoring. We can achieve that by
replacing Scorer with Matcher and by using Matcher in contexts where a
"Scorer" wouldn't have seemed appropriate.

Marvin Humphrey


Reply via email to