BTW Mike, I noticed that TopFieldDocCollector extends TopScoreDocCollector. That is a problem if we want to make TSDC final. Now, TFDC is marked deprecated, so it will be removed in the future. I think an easy fix is just to have TFDC extend TopResultsCollector, right?
On Thu, Mar 26, 2009 at 2:52 PM, Shai Erera <ser...@gmail.com> wrote: > I may have lost the context here... but here's what I thought we were >> talking about. >> >> If we choose the interface option (adding a "ProvidesTopDocsResults" >> (say) interface), then you would create method >> renderResults(ProvidesTopDocResults). >> >> Then, any collector implementing that interface could be safely passed >> in, as the upcast is done at compile time not runtime. >> > > Consider this code snippet: > > HitCollector hc = condition? new TopDocCollector() : > TopFieldDocCollector(); > searcher.search(hc); > > The problem is that I need a base class for both collectors. If I use the > interface ProvidesTopDocsResults, then I cannot pass it to searcher, w/o > casting to HitCollector. If I use HitCollector, then I need to cast it > before passing it into rederResults(). Only when both class have the same > base class which is also a HitCollector, I don't need any casting. I.e., > after I submit a patch that develops what we've agreed on, the code can look > like this: > > TopResultsCollector trc = condition ? new TopScoreDocCollector() : new > TopFieldCollector(); > searcher.search(trc); > renderResults(trc); > > Here I can pass 'trc' to both methods since it both a HitCollector and a > TopResultsCollector. That's what I was missing in your proposal. > > Shai do you want to take a first cut at making a patch? Can you open >> an issue? Thanks. > > > I can certainly do that. I think the summary of the steps make sense. I'll > check if TopScoreDocCollector and TopFieldCollector can also extend that > "you provide your own pqueue and we collect top N according to it" > collector, passing a null PQ and extending topDocs(). > > I also would like to propose an additional method to topDocs(), topDocs(int > start, int howMany) which will be more efficient to call in case of paging > through results. The reason is that topDocs() pops everything from the PQ, > then allocates a ScoreDoc[] array of size of number of results and returns a > TopDocs object. You can then choose just the ones you need. > On the other hand, topDocs(start, howMany) would allocate exactly the size > of array needed. E.g., in case someone pages through results 91-100, you > allocate an array of size 10, instead of 100. > It is not a huge improvement, but it does save some allocations, as well as > it's a convenient method. > > BTW, I like the name ResultsCollector, as it's just like HitCollector, but > does not commit too much to "hits" .. i.e., facets aren't hits ... I think? > > Shai > > On Thu, Mar 26, 2009 at 12:55 PM, Michael McCandless < > luc...@mikemccandless.com> wrote: > >> Shai Erera <ser...@gmail.com> wrote: >> >> >> The difference is for the new code, it's an upcast, which catches any >> >> errors at compile time, not run time. The compiler determines that >> >> the class implements the required interface. >> > >> > I still don't understand how a compiler can detect at compilation time >> that >> > a HitCollector instance that is used in method A, and is casted to a >> > TopDocsOutput instance by calling method B (from A) is actually ok ... I >> may >> > be missing some Java information here, but I simply don't see how that >> > happens at compilation time instead of at runtime ... >> >> I may have lost the context here... but here's what I thought we were >> talking about. >> >> If we choose the interface option (adding a "ProvidesTopDocsResults" >> (say) interface), then you would create method >> renderResults(ProvidesTopDocResults). >> >> Then, any collector implementing that interface could be safely passed >> in, as the upcast is done at compile time not runtime. >> >> > So in fact the internal Lucene code expects only MRHC from a certain >> point, >> > and so even if I wrote a HC and passed it on Searcher, it's still >> converted >> > to MRHC, with an empty setNextReader() method implementation. That's why >> I >> > meant that HC is already deprecated, whether it's marked as deprecated >> or >> > not. >> >> The setNextReader() impl is not empty; it does the re-basing of docID >> on behalf of the HC. >> >> > What you say about deprecating HC to me is unnecessary. Simply pull >> > setNextReader up with an empty implementation, get rid of all the >> > instanceof, casting and wrapping code and you're fine. Nothing is >> broken. >> > All works well and better (instanceof, casting and wrapping have their >> > overheads). >> > Isn't that right? >> >> I think we need to deprecate HC, in favor of MRHC (or if we can think >> of a better name... ResultCollector?). >> >> > Regarding interfaces .. I don't think they're that bad. Perhaps a >> different >> > viewing angle might help. Lucene processes a query and fires events. >> Today >> > it fires an event every time a doc's score has been computed, and >> recently >> > every time it starts to process a different reader. HitCollector is a >> > listener implementation on the "doc-score event", while MRHC is a >> listener >> > on both. >> > To me, interfaces play just nicely here. Assume that you have the >> following >> > interfaces: >> > - DocScoreEvent >> > - ChangeReaderEvent >> > - EndProcessingEvent (thrown when the query has finished processing - >> > perhaps to aid collectors to free resources) >> > - <any other events you foresee?> >> > The Lucene code receives a HitCollector which listens on all events. In >> the >> > future, Lucene might throw other events, but still get a HitCollector. >> Those >> > methods will check for instanceof, and you as a user will know that if >> you >> > want to catch those events, you pass in a collector implementation which >> > does. Those events cannot of course be main-stream events (like >> > DocScoreEvent), but new ones, perhaps even experimental. >> > Since HitCollector is a concrete class, we can always add new interfaces >> to >> > it in the future with empty implementations? >> >> I agree interfaces clearly have useful properties, but their achilles >> heel for Lucene in all but the most trivial needs is the >> non-back-compatibility when you want to add a method to the interface. >> There have been a number of java-dev discussons on this problem. >> >> So, I think something like this: >> >> * Deprecate HitCollector in favor of MultiReaderHitCollector (any >> better name here?) abstract class. If you want to make a fully >> custom HitCollector, you subclass this calss. >> >> Let's change MRHC's collect to take only an [unbased] docID, and >> expose a setScorer(Scorer) method. Then if the collector needs >> score it can call Scorer.score(). >> >> * Subclass that to create an abstract "tracks top N results >> collector" (TopDocsCollector? TopHitsCollector?) >> >> * Subclass TopDocsCollector to a final, fast "top N sorted by score" >> collector (exists already: TopScoreDocCollector) >> >> * Subclass TopDocsCollector to a final, fast "top N sorted by field" >> collector (exists already: TopFieldCollector) >> >> * Subclass TopDocsCollector to a "you provide your own pqueue and we >> collect top N according to it" collector (does not yet exist -- >> name?). This is the way forward for existing subclasses of >> TopDocCollector. >> >> Shai do you want to take a first cut at making a patch? Can you open >> an issue? Thanks. >> >> Mike >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-dev-h...@lucene.apache.org >> >> >