Ok so I feel we're coming to an end. Few comments though:

The difference is for the new code, it's an upcast, which catches any
> errors at compile time, not run time.  The compiler determines that
> the class implements the required interface.
>

I still don't understand how a compiler can detect at compilation time that
a HitCollector instance that is used in method A, and is casted to a
TopDocsOutput instance by calling method B (from A) is actually ok ... I may
be missing some Java information here, but I simply don't see how that
happens at compilation time instead of at runtime ...

But that leaves no way forward for current users subclassing
> TopDocCollector (for the freedom of providing your own pqueue).
>

TopDocCollector is actually marked deprecated. So nobody should continue
subclassing it, as it will be removed in 3.0 no? Perhaps you mean
TopScoreDocCollector? If yes, then I think nobody should subclass it (like
you write at the end of the email).

Maybe we should in fact simply deprecate HitCollector (in favor of
> MultiReaderHitCollector)?  After all, making your own HitCollector is
> an advanced thing; expecting you to properly implement setNextReader
> may be fine.
>

I personally don't understand why MRHC was invented in the first place.
What's wrong with adding a setNextReader to HitCollector with an empty
implementation which does nothing? That doesn't break back-compat, and
everyone can still override that method and do whatever they need.

Notice also that Lucene *lies* to its users about the existence of HC and
MRHC. While I'm passing HC to IndexSearcher, I believe I noticed a code
segment like:

MultiReaderHitCollector mrhc;
if (c instanceof MRHC) {
   mrhc = (MRHC) c;
else {
   mrhc = new MRHC(c);
}

So in fact the internal Lucene code expects only MRHC from a certain point,
and so even if I wrote a HC and passed it on Searcher, it's still converted
to MRHC, with an empty setNextReader() method implementation. That's why I
meant that HC is already deprecated, whether it's marked as deprecated or
not.

What you say about deprecating HC to me is unnecessary. Simply pull
setNextReader up with an empty implementation, get rid of all the
instanceof, casting and wrapping code and you're fine. Nothing is broken.
All works well and better (instanceof, casting and wrapping have their
overheads).
Isn't that right?

I'm all for your suggestions, I just think we can achieve the same thing w/
HC. We can also not deprecate TopDocCollector and remove
TopScoreDocCollector (merging implementations so that TopDocCollector is
more efficient), as well as introduce that in between TopDocsOutputCollector
(or a better name) with the topDocs() and getTotalHits() methods ...

Regarding interfaces .. I don't think they're that bad. Perhaps a different
viewing angle might help. Lucene processes a query and fires events. Today
it fires an event every time a doc's score has been computed, and recently
every time it starts to process a different reader. HitCollector is a
listener implementation on the "doc-score event", while MRHC is a listener
on both.
To me, interfaces play just nicely here. Assume that you have the following
interfaces:
- DocScoreEvent
- ChangeReaderEvent
- EndProcessingEvent (thrown when the query has finished processing -
perhaps to aid collectors to free resources)
- <any other events you foresee?>
The Lucene code receives a HitCollector which listens on all events. In the
future, Lucene might throw other events, but still get a HitCollector. Those
methods will check for instanceof, and you as a user will know that if you
want to catch those events, you pass in a collector implementation which
does. Those events cannot of course be main-stream events (like
DocScoreEvent), but new ones, perhaps even experimental.
Since HitCollector is a concrete class, we can always add new interfaces to
it in the future with empty implementations?

Anyway, this is just a thought. I wouldn't want to deviate the discussion
we're having.

So how about what I propose:

   1. Pull setNextReader up to HitCollector, w/ empty implementation (easy
   change).
   2. Get rid of MRHC and change the classes which extend it (easy change).
   3. Get rid of all the instanceof checks (easy change - the compiler will
   warn you right away since MRHC is deleted).
   4. Create a new TopDocOutputCollector which allows you to set PQ and
   implements topDocs() and getTotalHits().
   5. Un-deprecate TopDocCollector, merge its implementation with
   TopScoreDocCollector and make it final. Override topDocs() and getTotalHits
   (if necessary).

Did I miss anything?

Shai

On Wed, Mar 25, 2009 at 3:38 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> This stuff is surprisingly hard to think about!
>
> >> Actually, it was Nadav who first proposed the "read interface", to
> >> solve the "there's no common way for reading its output" problem.
> >> With an interface (say TopDocsOutput), then you could have some
> >> method somewhere:
> >>
> >>  renderResults(TopDocsOutput results)
> >>
> >> and then any collector, independent of how it *collects* results,
> >> could implement TopDocsOutput if appropriate.
> >
> > You'd still need to cast the collector to TopDocsOutput, won't you?
> > How's that different than the code snippet I showed above?
>
> The difference is for the new code, it's an upcast, which catches any
> errors at compile time, not run time.  The compiler determines that
> the class implements the required interface.
>
> > The current situation introduces a bug, that's true. However, unless
> > something better pops up, shouldn't we just make it final?
>
> But that leaves no way forward for current users subclassing
> TopDocCollector (for the freedom of providing your own pqueue).
>
> > May I suggest something else? What if MRHC was actually an
> > interface?
>
> I think interface is too dangerous in this case (the future back
> compatibility problem).  EG here we are wanting to explore a way to
> not pre-compute the score; had we released MRHC as an interface we'd
> be in trouble.  (We may still be in trouble, anyway!).
>
> >> Would TopDocsCollector subclass HitCollector or
> >> MultiReaderHitCollector?
> >
> > Well ... we've been there as well already :). I don't think there's
> > an easy answer here. I guess if MRHC is the better approach, and we
> > think all Top***DocCollector would want to have the MRHC
> > functionality, then I'd say let's extend MRHC. Otherwise, I don't
> > have a good answer. When I started this thread, I only knew of
> > HitCollector, so things were simpler at the time.
>
> We have challenging goals here:
>
>  * The "collect top N by score" collector should be final, use
>    ScorerDocQueue, specialized to sorting by score/docID: performance
>    is important.
>
>  * Likewise for the "collect top N by sorted field" collector, though
>    it does provide extensibility by letting you make a custom
>    comparator (FieldComparatorSource).  Ideally this'd allow with and
>    without computing score (it does not today).
>
>  * A "top N by my own pqueue" collector (this is what
>    TopDocCollector/TopScoreDocsCollector allow today, but it has the
>    bug).
>
>  * Allow fully custom collection, with and without score.
>
> Maybe we should in fact simply deprecate HitCollector (in favor of
> MultiReaderHitCollector)?  After all, making your own HitCollector is
> an advanced thing; expecting you to properly implement setNextReader
> may be fine.
>
> And then we can subclass MultiReaderHitCollector to TopDocsCollector
> (which adds the totalHits/topDocs "results delivery" API).
>
> And then the "collect top docs by score", and "collect top docs by
> fields" collectors subclass TopDocsCollector?
>
> Finally, we add a "collect top docs according to my own pqueue"
> class.
>
> Then we wouldn't need an interface; this works because all core
> collectors deliver top N results in the end.
>
> All that's missing is a way to NOT compute score if it's not needed.
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>

Reply via email to