Re: Is TopDocCollector's collect() implementation correct?

Shai Erera Tue, 24 Mar 2009 05:47:40 -0700

I agree about the unnecessary method call - we should make a collector's
implementation as efficient as possible.

One concern I have about the direction you'd like a HitCollector to evolve
to is that if a collector will need to ask for the document's score instead
of retrieving it, we might face other performance problems. I agree that not
all collectors need it, and therefore there's no need to compute the score
if no one will use it.
But what about cases like collectors chaining, extensions and running w/
several collectors? If each collector will need to request for the
document's score, it might be computed over and over again. Consider a case
for example, of a TopScoreDocCollector, running together w/ another
collector that extracts information about the document, and uses the score
to compute something (easy to write a collector which delegates the
collect() call to each of them). Today, I could just call collect(doc,
score) on each collector. But in the proposed way, I'd call collect(doc) and
then each of them will need to request the score.

Perhaps we can introduce a collect(doc) on HitCollector which does not
accept score, but keep the other collect? I am not sure if that's any
better, because then the Lucene search code would need to decide to which
collect method to call ...

Regarding the TopDocs interface (we should have a better name as TopDocs is
already taken), my point was that just introducing an interface will not
solve the problem. If HitCollector was an interface then yes, that would
have solved the problem, as we could have introduced a new interface
TopDocsCollector that extends HitCollector (the interface) and expose a
topDocs() and getTotalHits() methods. But since HitCollector is an abstract
class, this will not solve the problem.
The reason is that IndexSearcher accepts a HitCollector. So the code I'd
need to write is:

HitCollector hc = new TopScoreDocCollector();
searcher.search(query, hc);
TopDocs td = ((TopDocsCollector) hc).topDocs();

That case looks very strange and is completely not safe, as someone could
change TopScoreDocCollector to not implement the TopDocsCollector interface,
and I'd discover that only at runtime.

Instead, I could have written:
TopDocsCollctor hc = new TopScoreDocCollector();
searcher.search(query, hc);
TopDocs td = hc.topDocs();
If someone changes TSDC to not implement TDC, I'd get a compilation error
and no runtime exceptions will be thrown.

What I also wanted to achieve by introducing a TopDocsCollector abstract
class is the shared implementation of topDocs() and getTotalHits(). All of
the Top***DocCollector extensions can just implement collect() and use the
base implementation of topDocs(). If there's a collector that needs a
different implementation, like TopFieldCollector, it can override it.

Is my purpose better explained now?

Shai

On Mon, Mar 23, 2009 at 8:56 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> Shai Erera <ser...@gmail.com> wrote:
>
> > As a side comment, why not add setNextReader to HitCollector and
> > then a getDocId(int doc) method which will do the doc + base
> > arithmetic?
>
> One problem is this breaks back compatibility on any current
> subclasses of HitCollector.
>
> Another problem is: not all collectors would need to add the base on
> each doc.  EG a collector that puts hits into separate pqueues per
> segment could defer the addition until the end when only the top
> results are pulled out of each pqueue.
>
> Also, I am concerned about the method call overhead.  This is the
> absolute ultimate hot spot for Lucene and we should worry about
> causing even a single added instruction in this path.
>
> That said... I would like to [eventually] change the collection API
> along the lines of what Marvin proposed for "Matcher" in Lucy, here:
>
>  http://markmail.org/message/jxshhiqr6wvq77xu
>
> Specifically, I think it should be the collector's job to ask for the
> score for this doc, rather than Lucene's job to pre-compute it, so
> that collectors that don't need the score won't waste CPU.  EG, if you
> are sorting by field (and don't present the relevance score) you
> shouldn't compute it.
>
> Then, we could add other "somewhat expensive" things you might
> retrieve, such as a way to ask which terms participated in the match
> (discussed today on java-user), and/or all term positions that
> participated (discussed in LUCENE-1522).  EG, a top doc collector
> could choose to call these methods only when the doc was competitive.
>
> > Anyway, I don't want to add topDocs and getTotalHits to
> > HitCollector, it will destroy its generic purpose.
>
> I agree.
>
> > An interface is also problematic, as it just means all of these
> > collectors have these methods declared, but they need to implement
> > them. An abstract class grants you w/ both.
>
> I'm confused on this objection -- only collectors that do let you ask
> for the top N set of docs would implement this interface?  (Ie it'd
> only be the TopXXXCollector's that'd implement the interface).  While
> interfaces clearly have the future problem of back-compatibility, this
> case may be simple enough to make an exception.
>
> > So it looks like HitCollector itself is "deprecated" as far as the
> > Lucene core code sees it.
>
> I think HitCollector has a purpose, which is to be the simplest way to
> make a custom collector.  Ie I think it makes sense to offer a simple
> way and a high performance way.
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>

Re: Is TopDocCollector's collect() implementation correct?

Reply via email to