Re: Is TopDocCollector's collect() implementation correct?

Shai Erera Tue, 24 Mar 2009 13:30:29 -0700

>
> Would TopDocsCollector subclass HitCollector or
> MultiReaderHitCollector?
>

Well ... we've been there as well already :). I don't think there's an easy
answer here. I guess if MRHC is the better approach, and we think all
Top***DocCollector would want to have the MRHC functionality, then I'd say
let's extend MRHC. Otherwise, I don't have a good answer. When I started
this thread, I only knew of HitCollector, so things were simpler at the
time.

Well... I understand the problem: we have a real bug.  If you subclass
> TopDocColector or TopScoreDocCollector, and don't do your sorting by
> score, then collect() does the wrong thing.  So we need to fix that.

I actually wanted to avoid fixing it. Hypothetically, if we had such a
TopsDocCollector abstract class, TopScoreDocCollector could have extended
it, and declared final. Its implementation makes a lot of sense if you want
to only provide a top-scoring docs order, which is the most common search
operation. Therefore making it efficient is very important.
The current situation introduces a bug, that's true. However, unless
something better pops up, shouldn't we just make it final? After all, the
class is now called TopScoreDocCollector, which is more aimed at scoring, as
opposed to TopDocCollector, which just declares it collects 'top docs' w/o
committing to score or anything. So why would someone want to extend TSDC to
provide a sort by date for example? The names would not make sense.
And if someone really wants to have a different TSDC implementation, it can
just copy the implementation. There's no real need to extend TSDC. It's a
short, clean and easy to understand implementation.

> Actually, it was Nadav who first proposed the "read interface", to
> solve the "there's no common way for reading its output" problem.
> With an interface (say TopDocsOutput), then you could have some method
> somewhere:
>
>  renderResults(TopDocsOutput results)
>
> and then any collector, independent of how it *collects* results,
> could implement TopDocsOutput if appropriate.

You'd still need to cast the collector to TopDocsOutput, won't you? How's
that different than the code snippet I showed above?

I think we should aim to separate "collection", which is how Lucene
> delivers hits to the collector you passed in, from "results delivery",
> which is how your code talks to the collector to read back the
> results.

I thought we already do that .. HitCollector has a simple collect() method,
which is the only one IndexSearcher knows of and cares about. No runtime
code executed in Lucene has to know that the concrete HItCollector class it
deals with has a topDocs() API.
On the other hand, my application does know that, and therefore I can
instantiate a HitCollector which has more methods than just collect(). I
won't use HitCollector, but that base TopDocsOutputCollector (?), pass it in
to Searcher (which knows only of HitCollector) and when it's back, I don't
need to do any casting.
If only HitCollector was an interface .... :).

May I suggest something else? What if MRHC was actually an interface? It
doesn't contain any implementation. Just an abstract method. That I think
would have solved anything for us:

   - We'd extend HitCollector with a TopDocsOutputCollector (?) that
   provides topDocs(), getTotalHits() and a protected constructor which accepts
   a PQ.
   - Every implementation which extends MRHC can move to extend HC and
   implement MRHC.
   - Nobody outside the Lucene code should care whether it's a MRHC or HC
   that was instantiated. This is very internal to Lucene. That's why making
   MRHC an interface makes sense - the Lucene code still deals w/ HitCollector,
   and wherever it expected or wanted to utilize MRHC, it can still do that (by
   checking instanceof, as I've seen in some places in the code).
   - Since it's still in the trunk, we shouldn't have any back compat
   issues, right?

The point here is that MRHC is really internal to Lucene, while HitCollector
is not (it's part of the API). Therefore making MRHC an interface should not
make any difference to any search application (it's a property of a
HitCollector implementation), while HitCollector and its sub-classes are.

I have a feeling this might work, but I could be missing something.

Shai

On Tue, Mar 24, 2009 at 7:51 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> Shai Erera <ser...@gmail.com> wrote:
>
> > But what about cases like collectors chaining, extensions and
> > running w/ several collectors? If each collector will need to
> > request for the document's score, it might be computed over and over
> > again. Consider a case for example, of a TopScoreDocCollector,
> > running together w/ another collector that extracts information
> > about the document, and uses the score to compute something (easy to
> > write a collector which delegates the collect() call to each of
> > them). Today, I could just call collect(doc, score) on each
> > collector. But in the proposed way, I'd call collect(doc) and then
> > each of them will need to request the score.
>
> Yeah good point... we'd somehow need such "chaining collectors" to
> simply return the score if getScore() had already been called?
>
> > Perhaps we can introduce a collect(doc) on HitCollector which does not
> > accept score, but keep the other collect? I am not sure if that's any
> > better, because then the Lucene search code would need to decide to which
> > collect method to call ...
>
> Yeah not sure I like that... and it doesn't scale up well (to other
> getXXX's that we may want to make optionally available from Scorer).
>
> > Regarding the TopDocs interface (we should have a better name as
> > TopDocs is already taken), my point was that just introducing an
> > interface will not solve the problem. If HitCollector was an
> > interface then yes, that would have solved the problem, as we could
> > have introduced a new interface TopDocsCollector that extends
> > HitCollector (the interface) and expose a topDocs() and
> > getTotalHits() methods. But since HitCollector is an abstract class,
> > this will not solve the problem.  The reason is that IndexSearcher
> > accepts a HitCollector.
>
> Actually, it was Nadav who first proposed the "read interface", to
> solve the "there's no common way for reading its output" problem.
> With an interface (say TopDocsOutput), then you could have some method
> somewhere:
>
>  renderResults(TopDocsOutput results)
>
> and then any collector, independent of how it *collects* results,
> could implement TopDocsOutput if appropriate.
>
> > So the code I'd need to write is:
> >
> > HitCollector hc = new TopScoreDocCollector();
> > searcher.search(query, hc);
> > TopDocs td = ((TopDocsCollector) hc).topDocs();
> >
> > That case looks very strange and is completely not safe, as someone
> > could change TopScoreDocCollector to not implement the
> > TopDocsCollector interface, and I'd discover that only at runtime.
>
> I agree one should rely as little as possible on runtime casting,
> though we wouldn't make such a change to TopScoreDocCollector (it
> breaks back compat).
>
> > Instead, I could have written:
> >
> > TopDocsCollctor hc = new TopScoreDocCollector();
> > searcher.search(query, hc);
> > TopDocs td = hc.topDocs();
> >
> > If someone changes TSDC to not implement TDC, I'd get a compilation
> > error and no runtime exceptions will be thrown.
>
> I think we should aim to separate "collection", which is how Lucene
> delivers hits to the collector you passed in, from "results delivery",
> which is how your code talks to the collector to read back the
> results.
>
> > What I also wanted to achieve by introducing a TopDocsCollector
> > abstract class is the shared implementation of topDocs() and
> > getTotalHits(). All of the Top***DocCollector extensions can just
> > implement collect() and use the base implementation of topDocs(). If
> > there's a collector that needs a different implementation, like
> > TopFieldCollector, it can override it.
>
> Would TopDocsCollector subclass HitCollector or
> MultiReaderHitCollector?
>
> > Is my purpose better explained now?
>
> Well... I understand the problem: we have a real bug.  If you subclass
> TopDocColector or TopScoreDocCollector, and don't do your sorting by
> score, then collect() does the wrong thing.  So we need to fix that.
>
> We've gone through various options for fixing it, all of which have
> various drawbacks (breaks back compat, possible performance hit,
> etc.), so we're still trying to converge on the lesser evil.
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>

Re: Is TopDocCollector's collect() implementation correct?

Reply via email to