Re: Is TopDocCollector's collect() implementation correct?
2009/3/27 Shai Erera ser...@gmail.com: I really liked HItCollector and hate to give up the name ... However Collector is fine with me either, and it is at least more generic than HitCollector ... Hitable sounds too aggressive/violent to me :) Now that you raised this objection, I can't get the image out of my head, and I agree :) BTW, I guess I should create some new searcher API which receives this Collector class (is Collector the chosen name?) and deprecate those who accept HitCollector? Those can also skip the instanceof check, and wrapping of HC to MRHC ... OK that'd be great. There are also default score methods in Scorer that take HitCollector which'll have to be deprecated as well, and some scorers (eg BooleanScorer) define those methods. That also means that I should throw that MRHC wrapper (which rebases doc Ids)? If HitCollector is deprecated, then there's no need to keep it. But perhaps we want it there in 2.9 for easier migration? Personally I think it's redundant since in 3.0 people will need to change all their collectors anyway (since HitCollector will be removed, and every class which extends HitCollector will need be modified). What do you think? I think discard it, assuming the above full-replacement approach doesn't hit any snags. Also, there's no need to deprecate MRHC, since it's only in the trunk - I can simply rename it, right? Exactly :) A nice freedom... maybe we should never make any releases! (kidding) Ok I'll go ahead and prepare a patch. We can discuss the name more, at the end it will just be a short refactor action in Eclipse, so that shouldn't hold us (or me) up. OK thanks Shai. Make sure you mark the issue as Fix Version 2.9. Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
I created LUCENE-1575 which summarizes what we've agreed to do in this issue. Please review it. Also, let's move the discussion around the names to the issue. I hope I covered everything we agreed on - I had to create the issue twice as it contains a lot of text and the session expired on me in the middle, and everything I typed got lost ... damn. Shai On Fri, Mar 27, 2009 at 11:51 AM, Michael McCandless luc...@mikemccandless.com wrote: 2009/3/27 Shai Erera ser...@gmail.com: I really liked HItCollector and hate to give up the name ... However Collector is fine with me either, and it is at least more generic than HitCollector ... Hitable sounds too aggressive/violent to me :) Now that you raised this objection, I can't get the image out of my head, and I agree :) BTW, I guess I should create some new searcher API which receives this Collector class (is Collector the chosen name?) and deprecate those who accept HitCollector? Those can also skip the instanceof check, and wrapping of HC to MRHC ... OK that'd be great. There are also default score methods in Scorer that take HitCollector which'll have to be deprecated as well, and some scorers (eg BooleanScorer) define those methods. That also means that I should throw that MRHC wrapper (which rebases doc Ids)? If HitCollector is deprecated, then there's no need to keep it. But perhaps we want it there in 2.9 for easier migration? Personally I think it's redundant since in 3.0 people will need to change all their collectors anyway (since HitCollector will be removed, and every class which extends HitCollector will need be modified). What do you think? I think discard it, assuming the above full-replacement approach doesn't hit any snags. Also, there's no need to deprecate MRHC, since it's only in the trunk - I can simply rename it, right? Exactly :) A nice freedom... maybe we should never make any releases! (kidding) Ok I'll go ahead and prepare a patch. We can discuss the name more, at the end it will just be a short refactor action in Eclipse, so that shouldn't hold us (or me) up. OK thanks Shai. Make sure you mark the issue as Fix Version 2.9. Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
On Fri, Mar 27, 2009 at 12:46 PM, Shai Erera ser...@gmail.com wrote: I hope I covered everything we agreed on - I had to create the issue twice as it contains a lot of text and the session expired on me in the middle, and everything I typed got lost ... damn. Bummer. Strangely, this also happened to me today. Maybe something changed on how aggressively Jira logs you out. But: when this happens, Jira parrots back the text you had just attempted to submit and gives you a chance to copy/paste it. (At least, it do so when I tried to append a comment and got logged out between when I started and when I finished... maybe we all just need to type faster!). Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
RE: Is TopDocCollector's collect() implementation correct?
Hi Shai, hi Mike, You can also use the license of JIRA Client from the contributor's SVN checkout (if you are committer and so have access to it). With it you can manage your issues offline (it's a Java app for windows and other systems). - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Friday, March 27, 2009 9:02 PM To: java-dev@lucene.apache.org Subject: Re: Is TopDocCollector's collect() implementation correct? On Fri, Mar 27, 2009 at 12:46 PM, Shai Erera ser...@gmail.com wrote: I hope I covered everything we agreed on - I had to create the issue twice as it contains a lot of text and the session expired on me in the middle, and everything I typed got lost ... damn. Bummer. Strangely, this also happened to me today. Maybe something changed on how aggressively Jira logs you out. But: when this happens, Jira parrots back the text you had just attempted to submit and gives you a chance to copy/paste it. (At least, it do so when I tried to append a comment and got logged out between when I started and when I finished... maybe we all just need to type faster!). Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
Mark Miller markrmil...@gmail.com wrote: bq. I personally don't understand why MRHC was invented in the first place. The evolution of MRHC is in the comments of LUCENE-1483 - a lot of comments to wade through though. MRHC was created because simply adding setNextReader to HC would break back compat, because collect(...) is called on the un-rebased doc. Ie we need a new class so we can tell that it will handle re-basing the doc itself. Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
Shai Erera ser...@gmail.com wrote: The difference is for the new code, it's an upcast, which catches any errors at compile time, not run time. The compiler determines that the class implements the required interface. I still don't understand how a compiler can detect at compilation time that a HitCollector instance that is used in method A, and is casted to a TopDocsOutput instance by calling method B (from A) is actually ok ... I may be missing some Java information here, but I simply don't see how that happens at compilation time instead of at runtime ... I may have lost the context here... but here's what I thought we were talking about. If we choose the interface option (adding a ProvidesTopDocsResults (say) interface), then you would create method renderResults(ProvidesTopDocResults). Then, any collector implementing that interface could be safely passed in, as the upcast is done at compile time not runtime. So in fact the internal Lucene code expects only MRHC from a certain point, and so even if I wrote a HC and passed it on Searcher, it's still converted to MRHC, with an empty setNextReader() method implementation. That's why I meant that HC is already deprecated, whether it's marked as deprecated or not. The setNextReader() impl is not empty; it does the re-basing of docID on behalf of the HC. What you say about deprecating HC to me is unnecessary. Simply pull setNextReader up with an empty implementation, get rid of all the instanceof, casting and wrapping code and you're fine. Nothing is broken. All works well and better (instanceof, casting and wrapping have their overheads). Isn't that right? I think we need to deprecate HC, in favor of MRHC (or if we can think of a better name... ResultCollector?). Regarding interfaces .. I don't think they're that bad. Perhaps a different viewing angle might help. Lucene processes a query and fires events. Today it fires an event every time a doc's score has been computed, and recently every time it starts to process a different reader. HitCollector is a listener implementation on the doc-score event, while MRHC is a listener on both. To me, interfaces play just nicely here. Assume that you have the following interfaces: - DocScoreEvent - ChangeReaderEvent - EndProcessingEvent (thrown when the query has finished processing - perhaps to aid collectors to free resources) - any other events you foresee? The Lucene code receives a HitCollector which listens on all events. In the future, Lucene might throw other events, but still get a HitCollector. Those methods will check for instanceof, and you as a user will know that if you want to catch those events, you pass in a collector implementation which does. Those events cannot of course be main-stream events (like DocScoreEvent), but new ones, perhaps even experimental. Since HitCollector is a concrete class, we can always add new interfaces to it in the future with empty implementations? I agree interfaces clearly have useful properties, but their achilles heel for Lucene in all but the most trivial needs is the non-back-compatibility when you want to add a method to the interface. There have been a number of java-dev discussons on this problem. So, I think something like this: * Deprecate HitCollector in favor of MultiReaderHitCollector (any better name here?) abstract class. If you want to make a fully custom HitCollector, you subclass this calss. Let's change MRHC's collect to take only an [unbased] docID, and expose a setScorer(Scorer) method. Then if the collector needs score it can call Scorer.score(). * Subclass that to create an abstract tracks top N results collector (TopDocsCollector? TopHitsCollector?) * Subclass TopDocsCollector to a final, fast top N sorted by score collector (exists already: TopScoreDocCollector) * Subclass TopDocsCollector to a final, fast top N sorted by field collector (exists already: TopFieldCollector) * Subclass TopDocsCollector to a you provide your own pqueue and we collect top N according to it collector (does not yet exist -- name?). This is the way forward for existing subclasses of TopDocCollector. Shai do you want to take a first cut at making a patch? Can you open an issue? Thanks. Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
On Mar 26, 2009, at 6:55 AM, Michael McCandless wrote: think we need to deprecate HC, in favor of MRHC (or if we can think of a better name... ResultCollector?). I like your suggestion for the name. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
You're right ... I missed that. My fault :) On Thu, Mar 26, 2009 at 12:18 PM, Michael McCandless luc...@mikemccandless.com wrote: Mark Miller markrmil...@gmail.com wrote: bq. I personally don't understand why MRHC was invented in the first place. The evolution of MRHC is in the comments of LUCENE-1483 - a lot of comments to wade through though. MRHC was created because simply adding setNextReader to HC would break back compat, because collect(...) is called on the un-rebased doc. Ie we need a new class so we can tell that it will handle re-basing the doc itself. Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
I'd say it is a bad name. Raw hit is way far from being result of a search. If you're already breaking back compat with 3.0 release (by incrementing java version), maybe its worthy to break it in some more places, just so ugly names like MRHC and special code paths that check for n-year-old interfaces won't haunt us for the next century. On Thu, Mar 26, 2009 at 14:15, DM Smith dmsmith...@gmail.com wrote: On Mar 26, 2009, at 6:55 AM, Michael McCandless wrote: think we need to deprecate HC, in favor of MRHC (or if we can think of a better name... ResultCollector?). I like your suggestion for the name. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
Earwin Burrfoot ear...@gmail.com wrote: I'd say it is a bad name. Raw hit is way far from being result of a search. First off, from Lucene's standpoint, the docID *is* the result of the search. Your application will do further things (load titles, do higlighting, etc.) with that result. Second off, since ResultCollector is an abstract base class, it would be subclassed to concrete versions that do more interesting things (call Scorer.score(), etc) so as to make up what your application considers the result. So I understand your objection, but I still feel ResultCollector is an OK name. Or do you have an alternative? Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
I may have lost the context here... but here's what I thought we were talking about. If we choose the interface option (adding a ProvidesTopDocsResults (say) interface), then you would create method renderResults(ProvidesTopDocResults). Then, any collector implementing that interface could be safely passed in, as the upcast is done at compile time not runtime. Consider this code snippet: HitCollector hc = condition? new TopDocCollector() : TopFieldDocCollector(); searcher.search(hc); The problem is that I need a base class for both collectors. If I use the interface ProvidesTopDocsResults, then I cannot pass it to searcher, w/o casting to HitCollector. If I use HitCollector, then I need to cast it before passing it into rederResults(). Only when both class have the same base class which is also a HitCollector, I don't need any casting. I.e., after I submit a patch that develops what we've agreed on, the code can look like this: TopResultsCollector trc = condition ? new TopScoreDocCollector() : new TopFieldCollector(); searcher.search(trc); renderResults(trc); Here I can pass 'trc' to both methods since it both a HitCollector and a TopResultsCollector. That's what I was missing in your proposal. Shai do you want to take a first cut at making a patch? Can you open an issue? Thanks. I can certainly do that. I think the summary of the steps make sense. I'll check if TopScoreDocCollector and TopFieldCollector can also extend that you provide your own pqueue and we collect top N according to it collector, passing a null PQ and extending topDocs(). I also would like to propose an additional method to topDocs(), topDocs(int start, int howMany) which will be more efficient to call in case of paging through results. The reason is that topDocs() pops everything from the PQ, then allocates a ScoreDoc[] array of size of number of results and returns a TopDocs object. You can then choose just the ones you need. On the other hand, topDocs(start, howMany) would allocate exactly the size of array needed. E.g., in case someone pages through results 91-100, you allocate an array of size 10, instead of 100. It is not a huge improvement, but it does save some allocations, as well as it's a convenient method. BTW, I like the name ResultsCollector, as it's just like HitCollector, but does not commit too much to hits .. i.e., facets aren't hits ... I think? Shai On Thu, Mar 26, 2009 at 12:55 PM, Michael McCandless luc...@mikemccandless.com wrote: Shai Erera ser...@gmail.com wrote: The difference is for the new code, it's an upcast, which catches any errors at compile time, not run time. The compiler determines that the class implements the required interface. I still don't understand how a compiler can detect at compilation time that a HitCollector instance that is used in method A, and is casted to a TopDocsOutput instance by calling method B (from A) is actually ok ... I may be missing some Java information here, but I simply don't see how that happens at compilation time instead of at runtime ... I may have lost the context here... but here's what I thought we were talking about. If we choose the interface option (adding a ProvidesTopDocsResults (say) interface), then you would create method renderResults(ProvidesTopDocResults). Then, any collector implementing that interface could be safely passed in, as the upcast is done at compile time not runtime. So in fact the internal Lucene code expects only MRHC from a certain point, and so even if I wrote a HC and passed it on Searcher, it's still converted to MRHC, with an empty setNextReader() method implementation. That's why I meant that HC is already deprecated, whether it's marked as deprecated or not. The setNextReader() impl is not empty; it does the re-basing of docID on behalf of the HC. What you say about deprecating HC to me is unnecessary. Simply pull setNextReader up with an empty implementation, get rid of all the instanceof, casting and wrapping code and you're fine. Nothing is broken. All works well and better (instanceof, casting and wrapping have their overheads). Isn't that right? I think we need to deprecate HC, in favor of MRHC (or if we can think of a better name... ResultCollector?). Regarding interfaces .. I don't think they're that bad. Perhaps a different viewing angle might help. Lucene processes a query and fires events. Today it fires an event every time a doc's score has been computed, and recently every time it starts to process a different reader. HitCollector is a listener implementation on the doc-score event, while MRHC is a listener on both. To me, interfaces play just nicely here. Assume that you have the following interfaces: - DocScoreEvent - ChangeReaderEvent - EndProcessingEvent (thrown when the query has finished processing - perhaps to aid collectors to free resources) - any other events you foresee? The Lucene
Re: Is TopDocCollector's collect() implementation correct?
BTW Mike, I noticed that TopFieldDocCollector extends TopScoreDocCollector. That is a problem if we want to make TSDC final. Now, TFDC is marked deprecated, so it will be removed in the future. I think an easy fix is just to have TFDC extend TopResultsCollector, right? On Thu, Mar 26, 2009 at 2:52 PM, Shai Erera ser...@gmail.com wrote: I may have lost the context here... but here's what I thought we were talking about. If we choose the interface option (adding a ProvidesTopDocsResults (say) interface), then you would create method renderResults(ProvidesTopDocResults). Then, any collector implementing that interface could be safely passed in, as the upcast is done at compile time not runtime. Consider this code snippet: HitCollector hc = condition? new TopDocCollector() : TopFieldDocCollector(); searcher.search(hc); The problem is that I need a base class for both collectors. If I use the interface ProvidesTopDocsResults, then I cannot pass it to searcher, w/o casting to HitCollector. If I use HitCollector, then I need to cast it before passing it into rederResults(). Only when both class have the same base class which is also a HitCollector, I don't need any casting. I.e., after I submit a patch that develops what we've agreed on, the code can look like this: TopResultsCollector trc = condition ? new TopScoreDocCollector() : new TopFieldCollector(); searcher.search(trc); renderResults(trc); Here I can pass 'trc' to both methods since it both a HitCollector and a TopResultsCollector. That's what I was missing in your proposal. Shai do you want to take a first cut at making a patch? Can you open an issue? Thanks. I can certainly do that. I think the summary of the steps make sense. I'll check if TopScoreDocCollector and TopFieldCollector can also extend that you provide your own pqueue and we collect top N according to it collector, passing a null PQ and extending topDocs(). I also would like to propose an additional method to topDocs(), topDocs(int start, int howMany) which will be more efficient to call in case of paging through results. The reason is that topDocs() pops everything from the PQ, then allocates a ScoreDoc[] array of size of number of results and returns a TopDocs object. You can then choose just the ones you need. On the other hand, topDocs(start, howMany) would allocate exactly the size of array needed. E.g., in case someone pages through results 91-100, you allocate an array of size 10, instead of 100. It is not a huge improvement, but it does save some allocations, as well as it's a convenient method. BTW, I like the name ResultsCollector, as it's just like HitCollector, but does not commit too much to hits .. i.e., facets aren't hits ... I think? Shai On Thu, Mar 26, 2009 at 12:55 PM, Michael McCandless luc...@mikemccandless.com wrote: Shai Erera ser...@gmail.com wrote: The difference is for the new code, it's an upcast, which catches any errors at compile time, not run time. The compiler determines that the class implements the required interface. I still don't understand how a compiler can detect at compilation time that a HitCollector instance that is used in method A, and is casted to a TopDocsOutput instance by calling method B (from A) is actually ok ... I may be missing some Java information here, but I simply don't see how that happens at compilation time instead of at runtime ... I may have lost the context here... but here's what I thought we were talking about. If we choose the interface option (adding a ProvidesTopDocsResults (say) interface), then you would create method renderResults(ProvidesTopDocResults). Then, any collector implementing that interface could be safely passed in, as the upcast is done at compile time not runtime. So in fact the internal Lucene code expects only MRHC from a certain point, and so even if I wrote a HC and passed it on Searcher, it's still converted to MRHC, with an empty setNextReader() method implementation. That's why I meant that HC is already deprecated, whether it's marked as deprecated or not. The setNextReader() impl is not empty; it does the re-basing of docID on behalf of the HC. What you say about deprecating HC to me is unnecessary. Simply pull setNextReader up with an empty implementation, get rid of all the instanceof, casting and wrapping code and you're fine. Nothing is broken. All works well and better (instanceof, casting and wrapping have their overheads). Isn't that right? I think we need to deprecate HC, in favor of MRHC (or if we can think of a better name... ResultCollector?). Regarding interfaces .. I don't think they're that bad. Perhaps a different viewing angle might help. Lucene processes a query and fires events. Today it fires an event every time a doc's score has been computed, and recently every time it starts to process a different reader. HitCollector is a
Re: Is TopDocCollector's collect() implementation correct?
BTW, I like the name ResultsCollector, as it's just like HitCollector, but does not commit too much to hits .. i.e., facets aren't hits ... I think? What this class consumes and what it produces is a totally different thing. HitCollector always collects 'hits', and then produces whatever implementor needs. For example mine collects hits, then collapses 1..N sequential hits into a 'metahit', calculates facets, sorts, takes top and loads some fields. And another one simply counts the hits without doing anything else. But oh, my, I'm not implementing anything like void collect(Facet f) method. It's common sense to name consumer interfaces after what they consume, not what their implementations might do. Or do you have an alternative? HitCollector is absolutely cool with me. Okay, maybe DocCollector, or DocIdCollector. Since that is exactly what 'all' of its implementations do. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
On Thu, Mar 26, 2009 at 08:44:57AM -0400, Michael McCandless wrote: do you have an alternative? Brainstorming * Harvester * Trawler * HitPicker * HitGrabber Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
On Thu, Mar 26, 2009 at 08:44:57AM -0400, Michael McCandless wrote: do you have an alternative? Brainstorming * Harvester * Trawler * HitPicker * HitGrabber Marvin Humphrey NitPicker - that absolutely made my day -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
I still think that ResultsCollector does what you describe. It simply collects results, while the word 'result' is quite *open* and does not commit to anything ... How about dropping the word Collector, since it might not collect anything, and just save the highest score, or compute some facets .. What about something with a *Listener like: DocIdListener, SearchListener, MatchListener (it listens on search matches). On Thu, Mar 26, 2009 at 3:48 PM, Marvin Humphrey mar...@rectangular.comwrote: On Thu, Mar 26, 2009 at 08:44:57AM -0400, Michael McCandless wrote: do you have an alternative? Brainstorming * Harvester * Trawler * HitPicker * HitGrabber Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
Shai Erera ser...@gmail.com wrote: I may have lost the context here... but here's what I thought we were talking about. If we choose the interface option (adding a ProvidesTopDocsResults (say) interface), then you would create method renderResults(ProvidesTopDocResults). Then, any collector implementing that interface could be safely passed in, as the upcast is done at compile time not runtime. Consider this code snippet: HitCollector hc = condition? new TopDocCollector() : TopFieldDocCollector(); searcher.search(hc); The problem is that I need a base class for both collectors. If I use the interface ProvidesTopDocsResults, then I cannot pass it to searcher, w/o casting to HitCollector. If I use HitCollector, then I need to cast it before passing it into rederResults(). Only when both class have the same base class which is also a HitCollector, I don't need any casting. I.e., after I submit a patch that develops what we've agreed on, the code can look like this: TopResultsCollector trc = condition ? new TopScoreDocCollector() : new TopFieldCollector(); searcher.search(trc); renderResults(trc); Here I can pass 'trc' to both methods since it both a HitCollector and a TopResultsCollector. That's what I was missing in your proposal. OK I agree. And with our proposed changes (TopResultsCollector), you can do this. Shai do you want to take a first cut at making a patch? Can you open an issue? Thanks. I can certainly do that. I think the summary of the steps make sense. I'll check if TopScoreDocCollector and TopFieldCollector can also extend that you provide your own pqueue and we collect top N according to it collector, passing a null PQ and extending topDocs(). OK, thanks. I also would like to propose an additional method to topDocs(), topDocs(int start, int howMany) which will be more efficient to call in case of paging through results. The reason is that topDocs() pops everything from the PQ, then allocates a ScoreDoc[] array of size of number of results and returns a TopDocs object. You can then choose just the ones you need. On the other hand, topDocs(start, howMany) would allocate exactly the size of array needed. E.g., in case someone pages through results 91-100, you allocate an array of size 10, instead of 100. It is not a huge improvement, but it does save some allocations, as well as it's a convenient method. Though... this is somewhat tricky to implement efficiently when using pqueue: you pop the worst scoring hit first, then next worst scoring, etc, into an array (in reverse order). It would be conceivable to do a [separate] partial sort of the queue to more efficiently retrieve a top subset of N to save some time on the extraction. But my guess is extraction time is trivial; I don't think we need to optimize it. That being said, we could make the API like this, but under the hood simply do what we do today the first time it's called, leaving as a future optimization to speed it up. Alternatively we could make a ScoreDoc result(int n) to retrieve each result one by one... or maybe doc(int n) and score(int n) since some collectors won't score (but, then we'd need to handle FieldDoc, which is used to more generically return the sort field values for each result). BTW, I like the name ResultsCollector, as it's just like HitCollector, but does not commit too much to hits .. i.e., facets aren't hits ... I think? I like it too! Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
Shai Erera ser...@gmail.com wrote: BTW Mike, I noticed that TopFieldDocCollector extends TopScoreDocCollector. Weird. Probably we could put that back to extending [deprecated] TopDocCollector? That is a problem if we want to make TSDC final. Now, TFDC is marked deprecated, so it will be removed in the future. Right. I think an easy fix is just to have TFDC extend TopResultsCollector, right? Or, back to the way it was pre-1483 (extend TopDocCollector). Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
In 1483 Doug had also suggested: * Hitable I suppose Collector shouldn't really be in the name, since the class may not actually collect the results (eg if it simply counts). Mike Marvin Humphrey mar...@rectangular.com wrote: On Thu, Mar 26, 2009 at 08:44:57AM -0400, Michael McCandless wrote: do you have an alternative? Brainstorming * Harvester * Trawler * HitPicker * HitGrabber Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
On Thu, Mar 26, 2009 at 04:01:26PM +0200, Shai Erera wrote: I still think that ResultsCollector does what you describe. It simply collects results, while the word 'result' is quite *open* and does not commit to anything ... Another advantage of ResultsCollector is that the name suggests good self-documenting subclass names and variable names. For instance, it's reasonably clear what a BitSetCollector or a TopDocsCollector might do. And when there's only one var around, the name collector is an obvious choice no matter what the class. This is all possible because there's no other use of Collector within Lucene. I just think ResultsCollector is less euphonic and zippy than HitCollector, so it's worth exploring alternatives. How about dropping the word Collector, since it might not collect anything, and just save the highest score, or compute some facets .. Sure, wiping the slate clean and re-examining HitCollector's actual purpose and usage to discover new names is a good approach. Similar thinking went into Hitable and Harvester. FWIW, I'd have to disagree that HitCollector doesn't collect anything. It may not collect hits per se, but it's definitely iterating over hits (in the sense of successful matches) and with only rare exceptions, it's collecting *something*. What about something with a *Listener like: DocIdListener, SearchListener, MatchListener (it listens on search matches). Considering how we attach HITCOLLECTORTHINGY onto the matching process is a novel take and clarifying to see. However, maybe it's just me, but *Listener evokes the JavaScript EventListener stuff, which seems radically different. Also, if I saw a listener variable in scoring loop code, or a TopDocsListener module in the JavaDocs, it wouldn't spring out to me that it would be doing what a HitCollector does right now. Cheers, Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
On Thu, Mar 26, 2009 at 5:28 PM, Marvin Humphrey mar...@rectangular.com wrote: On Thu, Mar 26, 2009 at 04:01:26PM +0200, Shai Erera wrote: I still think that ResultsCollector does what you describe. It simply collects results, while the word 'result' is quite *open* and does not commit to anything ... Another advantage of ResultsCollector is that the name suggests good self-documenting subclass names and variable names. For instance, it's reasonably clear what a BitSetCollector or a TopDocsCollector might do. And when there's only one var around, the name collector is an obvious choice no matter what the class. This is all possible because there's no other use of Collector within Lucene. I think ResultsCollector (or maybe ResultCollector) is my favorite so far... But how about simply Collector? (I realize it's very generic... but we don't collect anything else in Lucene?). What about something with a *Listener like: DocIdListener, SearchListener, MatchListener (it listens on search matches). Considering how we attach HITCOLLECTORTHINGY onto the matching process is a novel take and clarifying to see. However, maybe it's just me, but *Listener evokes the JavaScript EventListener stuff, which seems radically different. Also, if I saw a listener variable in scoring loop code, or a TopDocsListener module in the JavaDocs, it wouldn't spring out to me that it would be doing what a HitCollector does right now. Yeah I'm not really a fan of Listener either. Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
On Thu, Mar 26, 2009 at 06:03:07PM -0400, Michael McCandless wrote: But how about simply Collector? (I realize it's very generic... but we don't collect anything else in Lucene?). +1 Honorable mention to NitPicker, LOL. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
I think ResultsCollector (or maybe ResultCollector) is my favorite so far... But how about simply Collector? (I realize it's very generic... but we don't collect anything else in Lucene?). That's exactly what I'm using in my app - abstract class Collector extends HitCollector, that serves as a base for all my custom collectors :) So, yeah, I like this name. Yeah I'm not really a fan of Listener either. +1 -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
I really liked HItCollector and hate to give up the name ... However Collector is fine with me either, and it is at least more generic than HitCollector ... Hitable sounds too aggressive/violent to me :) BTW, I guess I should create some new searcher API which receives this Collector class (is Collector the chosen name?) and deprecate those who accept HitCollector? Those can also skip the instanceof check, and wrapping of HC to MRHC ... That also means that I should throw that MRHC wrapper (which rebases doc Ids)? If HitCollector is deprecated, then there's no need to keep it. But perhaps we want it there in 2.9 for easier migration? Personally I think it's redundant since in 3.0 people will need to change all their collectors anyway (since HitCollector will be removed, and every class which extends HitCollector will need be modified). What do you think? Also, there's no need to deprecate MRHC, since it's only in the trunk - I can simply rename it, right? Ok I'll go ahead and prepare a patch. We can discuss the name more, at the end it will just be a short refactor action in Eclipse, so that shouldn't hold us (or me) up. Shai On Fri, Mar 27, 2009 at 1:24 AM, Earwin Burrfoot ear...@gmail.com wrote: I think ResultsCollector (or maybe ResultCollector) is my favorite so far... But how about simply Collector? (I realize it's very generic... but we don't collect anything else in Lucene?). That's exactly what I'm using in my app - abstract class Collector extends HitCollector, that serves as a base for all my custom collectors :) So, yeah, I like this name. Yeah I'm not really a fan of Listener either. +1 -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
This stuff is surprisingly hard to think about! Actually, it was Nadav who first proposed the read interface, to solve the there's no common way for reading its output problem. With an interface (say TopDocsOutput), then you could have some method somewhere: renderResults(TopDocsOutput results) and then any collector, independent of how it *collects* results, could implement TopDocsOutput if appropriate. You'd still need to cast the collector to TopDocsOutput, won't you? How's that different than the code snippet I showed above? The difference is for the new code, it's an upcast, which catches any errors at compile time, not run time. The compiler determines that the class implements the required interface. The current situation introduces a bug, that's true. However, unless something better pops up, shouldn't we just make it final? But that leaves no way forward for current users subclassing TopDocCollector (for the freedom of providing your own pqueue). May I suggest something else? What if MRHC was actually an interface? I think interface is too dangerous in this case (the future back compatibility problem). EG here we are wanting to explore a way to not pre-compute the score; had we released MRHC as an interface we'd be in trouble. (We may still be in trouble, anyway!). Would TopDocsCollector subclass HitCollector or MultiReaderHitCollector? Well ... we've been there as well already :). I don't think there's an easy answer here. I guess if MRHC is the better approach, and we think all Top***DocCollector would want to have the MRHC functionality, then I'd say let's extend MRHC. Otherwise, I don't have a good answer. When I started this thread, I only knew of HitCollector, so things were simpler at the time. We have challenging goals here: * The collect top N by score collector should be final, use ScorerDocQueue, specialized to sorting by score/docID: performance is important. * Likewise for the collect top N by sorted field collector, though it does provide extensibility by letting you make a custom comparator (FieldComparatorSource). Ideally this'd allow with and without computing score (it does not today). * A top N by my own pqueue collector (this is what TopDocCollector/TopScoreDocsCollector allow today, but it has the bug). * Allow fully custom collection, with and without score. Maybe we should in fact simply deprecate HitCollector (in favor of MultiReaderHitCollector)? After all, making your own HitCollector is an advanced thing; expecting you to properly implement setNextReader may be fine. And then we can subclass MultiReaderHitCollector to TopDocsCollector (which adds the totalHits/topDocs results delivery API). And then the collect top docs by score, and collect top docs by fields collectors subclass TopDocsCollector? Finally, we add a collect top docs according to my own pqueue class. Then we wouldn't need an interface; this works because all core collectors deliver top N results in the end. All that's missing is a way to NOT compute score if it's not needed. Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
Ok so I feel we're coming to an end. Few comments though: The difference is for the new code, it's an upcast, which catches any errors at compile time, not run time. The compiler determines that the class implements the required interface. I still don't understand how a compiler can detect at compilation time that a HitCollector instance that is used in method A, and is casted to a TopDocsOutput instance by calling method B (from A) is actually ok ... I may be missing some Java information here, but I simply don't see how that happens at compilation time instead of at runtime ... But that leaves no way forward for current users subclassing TopDocCollector (for the freedom of providing your own pqueue). TopDocCollector is actually marked deprecated. So nobody should continue subclassing it, as it will be removed in 3.0 no? Perhaps you mean TopScoreDocCollector? If yes, then I think nobody should subclass it (like you write at the end of the email). Maybe we should in fact simply deprecate HitCollector (in favor of MultiReaderHitCollector)? After all, making your own HitCollector is an advanced thing; expecting you to properly implement setNextReader may be fine. I personally don't understand why MRHC was invented in the first place. What's wrong with adding a setNextReader to HitCollector with an empty implementation which does nothing? That doesn't break back-compat, and everyone can still override that method and do whatever they need. Notice also that Lucene *lies* to its users about the existence of HC and MRHC. While I'm passing HC to IndexSearcher, I believe I noticed a code segment like: MultiReaderHitCollector mrhc; if (c instanceof MRHC) { mrhc = (MRHC) c; else { mrhc = new MRHC(c); } So in fact the internal Lucene code expects only MRHC from a certain point, and so even if I wrote a HC and passed it on Searcher, it's still converted to MRHC, with an empty setNextReader() method implementation. That's why I meant that HC is already deprecated, whether it's marked as deprecated or not. What you say about deprecating HC to me is unnecessary. Simply pull setNextReader up with an empty implementation, get rid of all the instanceof, casting and wrapping code and you're fine. Nothing is broken. All works well and better (instanceof, casting and wrapping have their overheads). Isn't that right? I'm all for your suggestions, I just think we can achieve the same thing w/ HC. We can also not deprecate TopDocCollector and remove TopScoreDocCollector (merging implementations so that TopDocCollector is more efficient), as well as introduce that in between TopDocsOutputCollector (or a better name) with the topDocs() and getTotalHits() methods ... Regarding interfaces .. I don't think they're that bad. Perhaps a different viewing angle might help. Lucene processes a query and fires events. Today it fires an event every time a doc's score has been computed, and recently every time it starts to process a different reader. HitCollector is a listener implementation on the doc-score event, while MRHC is a listener on both. To me, interfaces play just nicely here. Assume that you have the following interfaces: - DocScoreEvent - ChangeReaderEvent - EndProcessingEvent (thrown when the query has finished processing - perhaps to aid collectors to free resources) - any other events you foresee? The Lucene code receives a HitCollector which listens on all events. In the future, Lucene might throw other events, but still get a HitCollector. Those methods will check for instanceof, and you as a user will know that if you want to catch those events, you pass in a collector implementation which does. Those events cannot of course be main-stream events (like DocScoreEvent), but new ones, perhaps even experimental. Since HitCollector is a concrete class, we can always add new interfaces to it in the future with empty implementations? Anyway, this is just a thought. I wouldn't want to deviate the discussion we're having. So how about what I propose: 1. Pull setNextReader up to HitCollector, w/ empty implementation (easy change). 2. Get rid of MRHC and change the classes which extend it (easy change). 3. Get rid of all the instanceof checks (easy change - the compiler will warn you right away since MRHC is deleted). 4. Create a new TopDocOutputCollector which allows you to set PQ and implements topDocs() and getTotalHits(). 5. Un-deprecate TopDocCollector, merge its implementation with TopScoreDocCollector and make it final. Override topDocs() and getTotalHits (if necessary). Did I miss anything? Shai On Wed, Mar 25, 2009 at 3:38 PM, Michael McCandless luc...@mikemccandless.com wrote: This stuff is surprisingly hard to think about! Actually, it was Nadav who first proposed the read interface, to solve the there's no common way for reading its output problem. With an interface (say TopDocsOutput), then you could have some method somewhere:
Re: Is TopDocCollector's collect() implementation correct?
Just minor correction - we can't make TopDocCollector final, as it will break back-compat. So we keep it deprecated, keep TopScoreDocCollector and make it extend that TopDocOutputCollector. On Wed, Mar 25, 2009 at 8:55 PM, Shai Erera ser...@gmail.com wrote: Ok so I feel we're coming to an end. Few comments though: The difference is for the new code, it's an upcast, which catches any errors at compile time, not run time. The compiler determines that the class implements the required interface. I still don't understand how a compiler can detect at compilation time that a HitCollector instance that is used in method A, and is casted to a TopDocsOutput instance by calling method B (from A) is actually ok ... I may be missing some Java information here, but I simply don't see how that happens at compilation time instead of at runtime ... But that leaves no way forward for current users subclassing TopDocCollector (for the freedom of providing your own pqueue). TopDocCollector is actually marked deprecated. So nobody should continue subclassing it, as it will be removed in 3.0 no? Perhaps you mean TopScoreDocCollector? If yes, then I think nobody should subclass it (like you write at the end of the email). Maybe we should in fact simply deprecate HitCollector (in favor of MultiReaderHitCollector)? After all, making your own HitCollector is an advanced thing; expecting you to properly implement setNextReader may be fine. I personally don't understand why MRHC was invented in the first place. What's wrong with adding a setNextReader to HitCollector with an empty implementation which does nothing? That doesn't break back-compat, and everyone can still override that method and do whatever they need. Notice also that Lucene *lies* to its users about the existence of HC and MRHC. While I'm passing HC to IndexSearcher, I believe I noticed a code segment like: MultiReaderHitCollector mrhc; if (c instanceof MRHC) { mrhc = (MRHC) c; else { mrhc = new MRHC(c); } So in fact the internal Lucene code expects only MRHC from a certain point, and so even if I wrote a HC and passed it on Searcher, it's still converted to MRHC, with an empty setNextReader() method implementation. That's why I meant that HC is already deprecated, whether it's marked as deprecated or not. What you say about deprecating HC to me is unnecessary. Simply pull setNextReader up with an empty implementation, get rid of all the instanceof, casting and wrapping code and you're fine. Nothing is broken. All works well and better (instanceof, casting and wrapping have their overheads). Isn't that right? I'm all for your suggestions, I just think we can achieve the same thing w/ HC. We can also not deprecate TopDocCollector and remove TopScoreDocCollector (merging implementations so that TopDocCollector is more efficient), as well as introduce that in between TopDocsOutputCollector (or a better name) with the topDocs() and getTotalHits() methods ... Regarding interfaces .. I don't think they're that bad. Perhaps a different viewing angle might help. Lucene processes a query and fires events. Today it fires an event every time a doc's score has been computed, and recently every time it starts to process a different reader. HitCollector is a listener implementation on the doc-score event, while MRHC is a listener on both. To me, interfaces play just nicely here. Assume that you have the following interfaces: - DocScoreEvent - ChangeReaderEvent - EndProcessingEvent (thrown when the query has finished processing - perhaps to aid collectors to free resources) - any other events you foresee? The Lucene code receives a HitCollector which listens on all events. In the future, Lucene might throw other events, but still get a HitCollector. Those methods will check for instanceof, and you as a user will know that if you want to catch those events, you pass in a collector implementation which does. Those events cannot of course be main-stream events (like DocScoreEvent), but new ones, perhaps even experimental. Since HitCollector is a concrete class, we can always add new interfaces to it in the future with empty implementations? Anyway, this is just a thought. I wouldn't want to deviate the discussion we're having. So how about what I propose: 1. Pull setNextReader up to HitCollector, w/ empty implementation (easy change). 2. Get rid of MRHC and change the classes which extend it (easy change). 3. Get rid of all the instanceof checks (easy change - the compiler will warn you right away since MRHC is deleted). 4. Create a new TopDocOutputCollector which allows you to set PQ and implements topDocs() and getTotalHits(). 5. Un-deprecate TopDocCollector, merge its implementation with TopScoreDocCollector and make it final. Override topDocs() and getTotalHits (if necessary). Did I miss anything? Shai On Wed, Mar 25, 2009 at 3:38
Re: Is TopDocCollector's collect() implementation correct?
bq. I personally don't understand why MRHC was invented in the first place. The evolution of MRHC is in the comments of LUCENE-1483 - a lot of comments to wade through though. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
Ok I didn't phrase it well ... what I meant to say is I don't understand why you had to invent a new class MRHC instead of just coding that logic into HC, with an empty implementation like I suggest. I'm not saying this is something that should have been thought of, but that MRHC in and on itself does not do anything, besides declaring an abstract method. And since the Lucene code has moved internally to use MRHC, I think that my suggestion to pull that method up to HC w/ an empty implementation is a reasonable one, and it simplifies things a lot. Because now, like Mike asked somewhere at the beginning of this thread, one does not know which collector to extend - HC or MRHC. The decisions is even tougher because I believe MRHC does not solve anything for the regular TopDocCollector (original) case? Meaning, isn't Lucene already iterating over the segments one by one? Isn't MRHC good mostly for TopFieldCollector? Don't get me wrong, I think that MRHC makes sense for someone who wants to write a HitCollector and take advantage of knowing when a reader changes / moving to a new segment. But asking everyone to choose between HC and MRHC is a bit too much IMO. If HC would have had this setNextReader() with empty implementation, all that someone had to choose from is whether to override that method or not. Now he needs to choose which HC implementation to extend, which is at the core of this thread's problems. On Wed, Mar 25, 2009 at 10:14 PM, Mark Miller markrmil...@gmail.com wrote: bq. I personally don't understand why MRHC was invented in the first place. The evolution of MRHC is in the comments of LUCENE-1483 - a lot of comments to wade through though. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
I agree about the unnecessary method call - we should make a collector's implementation as efficient as possible. One concern I have about the direction you'd like a HitCollector to evolve to is that if a collector will need to ask for the document's score instead of retrieving it, we might face other performance problems. I agree that not all collectors need it, and therefore there's no need to compute the score if no one will use it. But what about cases like collectors chaining, extensions and running w/ several collectors? If each collector will need to request for the document's score, it might be computed over and over again. Consider a case for example, of a TopScoreDocCollector, running together w/ another collector that extracts information about the document, and uses the score to compute something (easy to write a collector which delegates the collect() call to each of them). Today, I could just call collect(doc, score) on each collector. But in the proposed way, I'd call collect(doc) and then each of them will need to request the score. Perhaps we can introduce a collect(doc) on HitCollector which does not accept score, but keep the other collect? I am not sure if that's any better, because then the Lucene search code would need to decide to which collect method to call ... Regarding the TopDocs interface (we should have a better name as TopDocs is already taken), my point was that just introducing an interface will not solve the problem. If HitCollector was an interface then yes, that would have solved the problem, as we could have introduced a new interface TopDocsCollector that extends HitCollector (the interface) and expose a topDocs() and getTotalHits() methods. But since HitCollector is an abstract class, this will not solve the problem. The reason is that IndexSearcher accepts a HitCollector. So the code I'd need to write is: HitCollector hc = new TopScoreDocCollector(); searcher.search(query, hc); TopDocs td = ((TopDocsCollector) hc).topDocs(); That case looks very strange and is completely not safe, as someone could change TopScoreDocCollector to not implement the TopDocsCollector interface, and I'd discover that only at runtime. Instead, I could have written: TopDocsCollctor hc = new TopScoreDocCollector(); searcher.search(query, hc); TopDocs td = hc.topDocs(); If someone changes TSDC to not implement TDC, I'd get a compilation error and no runtime exceptions will be thrown. What I also wanted to achieve by introducing a TopDocsCollector abstract class is the shared implementation of topDocs() and getTotalHits(). All of the Top***DocCollector extensions can just implement collect() and use the base implementation of topDocs(). If there's a collector that needs a different implementation, like TopFieldCollector, it can override it. Is my purpose better explained now? Shai On Mon, Mar 23, 2009 at 8:56 PM, Michael McCandless luc...@mikemccandless.com wrote: Shai Erera ser...@gmail.com wrote: As a side comment, why not add setNextReader to HitCollector and then a getDocId(int doc) method which will do the doc + base arithmetic? One problem is this breaks back compatibility on any current subclasses of HitCollector. Another problem is: not all collectors would need to add the base on each doc. EG a collector that puts hits into separate pqueues per segment could defer the addition until the end when only the top results are pulled out of each pqueue. Also, I am concerned about the method call overhead. This is the absolute ultimate hot spot for Lucene and we should worry about causing even a single added instruction in this path. That said... I would like to [eventually] change the collection API along the lines of what Marvin proposed for Matcher in Lucy, here: http://markmail.org/message/jxshhiqr6wvq77xu Specifically, I think it should be the collector's job to ask for the score for this doc, rather than Lucene's job to pre-compute it, so that collectors that don't need the score won't waste CPU. EG, if you are sorting by field (and don't present the relevance score) you shouldn't compute it. Then, we could add other somewhat expensive things you might retrieve, such as a way to ask which terms participated in the match (discussed today on java-user), and/or all term positions that participated (discussed in LUCENE-1522). EG, a top doc collector could choose to call these methods only when the doc was competitive. Anyway, I don't want to add topDocs and getTotalHits to HitCollector, it will destroy its generic purpose. I agree. An interface is also problematic, as it just means all of these collectors have these methods declared, but they need to implement them. An abstract class grants you w/ both. I'm confused on this objection -- only collectors that do let you ask for the top N set of docs would implement this interface? (Ie it'd only be the TopXXXCollector's that'd implement the interface). While interfaces
Re: Is TopDocCollector's collect() implementation correct?
Shai Erera ser...@gmail.com wrote: But what about cases like collectors chaining, extensions and running w/ several collectors? If each collector will need to request for the document's score, it might be computed over and over again. Consider a case for example, of a TopScoreDocCollector, running together w/ another collector that extracts information about the document, and uses the score to compute something (easy to write a collector which delegates the collect() call to each of them). Today, I could just call collect(doc, score) on each collector. But in the proposed way, I'd call collect(doc) and then each of them will need to request the score. Yeah good point... we'd somehow need such chaining collectors to simply return the score if getScore() had already been called? Perhaps we can introduce a collect(doc) on HitCollector which does not accept score, but keep the other collect? I am not sure if that's any better, because then the Lucene search code would need to decide to which collect method to call ... Yeah not sure I like that... and it doesn't scale up well (to other getXXX's that we may want to make optionally available from Scorer). Regarding the TopDocs interface (we should have a better name as TopDocs is already taken), my point was that just introducing an interface will not solve the problem. If HitCollector was an interface then yes, that would have solved the problem, as we could have introduced a new interface TopDocsCollector that extends HitCollector (the interface) and expose a topDocs() and getTotalHits() methods. But since HitCollector is an abstract class, this will not solve the problem. The reason is that IndexSearcher accepts a HitCollector. Actually, it was Nadav who first proposed the read interface, to solve the there's no common way for reading its output problem. With an interface (say TopDocsOutput), then you could have some method somewhere: renderResults(TopDocsOutput results) and then any collector, independent of how it *collects* results, could implement TopDocsOutput if appropriate. So the code I'd need to write is: HitCollector hc = new TopScoreDocCollector(); searcher.search(query, hc); TopDocs td = ((TopDocsCollector) hc).topDocs(); That case looks very strange and is completely not safe, as someone could change TopScoreDocCollector to not implement the TopDocsCollector interface, and I'd discover that only at runtime. I agree one should rely as little as possible on runtime casting, though we wouldn't make such a change to TopScoreDocCollector (it breaks back compat). Instead, I could have written: TopDocsCollctor hc = new TopScoreDocCollector(); searcher.search(query, hc); TopDocs td = hc.topDocs(); If someone changes TSDC to not implement TDC, I'd get a compilation error and no runtime exceptions will be thrown. I think we should aim to separate collection, which is how Lucene delivers hits to the collector you passed in, from results delivery, which is how your code talks to the collector to read back the results. What I also wanted to achieve by introducing a TopDocsCollector abstract class is the shared implementation of topDocs() and getTotalHits(). All of the Top***DocCollector extensions can just implement collect() and use the base implementation of topDocs(). If there's a collector that needs a different implementation, like TopFieldCollector, it can override it. Would TopDocsCollector subclass HitCollector or MultiReaderHitCollector? Is my purpose better explained now? Well... I understand the problem: we have a real bug. If you subclass TopDocColector or TopScoreDocCollector, and don't do your sorting by score, then collect() does the wrong thing. So we need to fix that. We've gone through various options for fixing it, all of which have various drawbacks (breaks back compat, possible performance hit, etc.), so we're still trying to converge on the lesser evil. Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
On Tue, Mar 24, 2009 at 02:47:07PM +0200, Shai Erera wrote: I agree about the unnecessary method call - we should make a collector's implementation as efficient as possible. Maybe it makes sense to just bite the bullet and duplicate the unrolled code? There's precedent: ScorerDocQueue is not a subclass of PriorityQueue. But what about cases like collectors chaining, extensions and running w/ several collectors? If each collector will need to request for the document's score, it might be computed over and over again. Consider a case for example, of a TopScoreDocCollector, running together w/ another collector that extracts information about the document, and uses the score to compute something (easy to write a collector which delegates the collect() call to each of them). Today, I could just call collect(doc, score) on each collector. But in the proposed way, I'd call collect(doc) and then each of them will need to request the score. In such a case, perhaps it would be possible to supply a trivial Scorer wrapper subclass which caches a score. Then you still have the overhead of the method calls, but not the overhead of calculating the score. That's not ideal, but I think the case of matching-without-scoring is more important to optimize for. Perhaps we can introduce a collect(doc) on HitCollector which does not accept score, but keep the other collect? I am not sure if that's any better, because then the Lucene search code would need to decide to which collect method to call ... Also, passing arguments is dirt cheap. A HitCollector that only cares about adding doc nums to a BitVector can just ignore the second argument. Regarding the TopDocs interface (we should have a better name as TopDocs is already taken), Winners. Marvin Humphrey - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
Would TopDocsCollector subclass HitCollector or MultiReaderHitCollector? Well ... we've been there as well already :). I don't think there's an easy answer here. I guess if MRHC is the better approach, and we think all Top***DocCollector would want to have the MRHC functionality, then I'd say let's extend MRHC. Otherwise, I don't have a good answer. When I started this thread, I only knew of HitCollector, so things were simpler at the time. Well... I understand the problem: we have a real bug. If you subclass TopDocColector or TopScoreDocCollector, and don't do your sorting by score, then collect() does the wrong thing. So we need to fix that. I actually wanted to avoid fixing it. Hypothetically, if we had such a TopsDocCollector abstract class, TopScoreDocCollector could have extended it, and declared final. Its implementation makes a lot of sense if you want to only provide a top-scoring docs order, which is the most common search operation. Therefore making it efficient is very important. The current situation introduces a bug, that's true. However, unless something better pops up, shouldn't we just make it final? After all, the class is now called TopScoreDocCollector, which is more aimed at scoring, as opposed to TopDocCollector, which just declares it collects 'top docs' w/o committing to score or anything. So why would someone want to extend TSDC to provide a sort by date for example? The names would not make sense. And if someone really wants to have a different TSDC implementation, it can just copy the implementation. There's no real need to extend TSDC. It's a short, clean and easy to understand implementation. Actually, it was Nadav who first proposed the read interface, to solve the there's no common way for reading its output problem. With an interface (say TopDocsOutput), then you could have some method somewhere: renderResults(TopDocsOutput results) and then any collector, independent of how it *collects* results, could implement TopDocsOutput if appropriate. You'd still need to cast the collector to TopDocsOutput, won't you? How's that different than the code snippet I showed above? I think we should aim to separate collection, which is how Lucene delivers hits to the collector you passed in, from results delivery, which is how your code talks to the collector to read back the results. I thought we already do that .. HitCollector has a simple collect() method, which is the only one IndexSearcher knows of and cares about. No runtime code executed in Lucene has to know that the concrete HItCollector class it deals with has a topDocs() API. On the other hand, my application does know that, and therefore I can instantiate a HitCollector which has more methods than just collect(). I won't use HitCollector, but that base TopDocsOutputCollector (?), pass it in to Searcher (which knows only of HitCollector) and when it's back, I don't need to do any casting. If only HitCollector was an interface :). May I suggest something else? What if MRHC was actually an interface? It doesn't contain any implementation. Just an abstract method. That I think would have solved anything for us: - We'd extend HitCollector with a TopDocsOutputCollector (?) that provides topDocs(), getTotalHits() and a protected constructor which accepts a PQ. - Every implementation which extends MRHC can move to extend HC and implement MRHC. - Nobody outside the Lucene code should care whether it's a MRHC or HC that was instantiated. This is very internal to Lucene. That's why making MRHC an interface makes sense - the Lucene code still deals w/ HitCollector, and wherever it expected or wanted to utilize MRHC, it can still do that (by checking instanceof, as I've seen in some places in the code). - Since it's still in the trunk, we shouldn't have any back compat issues, right? The point here is that MRHC is really internal to Lucene, while HitCollector is not (it's part of the API). Therefore making MRHC an interface should not make any difference to any search application (it's a property of a HitCollector implementation), while HitCollector and its sub-classes are. I have a feeling this might work, but I could be missing something. Shai On Tue, Mar 24, 2009 at 7:51 PM, Michael McCandless luc...@mikemccandless.com wrote: Shai Erera ser...@gmail.com wrote: But what about cases like collectors chaining, extensions and running w/ several collectors? If each collector will need to request for the document's score, it might be computed over and over again. Consider a case for example, of a TopScoreDocCollector, running together w/ another collector that extracts information about the document, and uses the score to compute something (easy to write a collector which delegates the collect() call to each of them). Today, I could just call collect(doc, score) on each collector. But in the proposed way, I'd call collect(doc) and then each
Re: Is TopDocCollector's collect() implementation correct?
If we're already creating a new TopScoreDocCollector (when was it added? I must have been dozing off while this happened...) This was LUCENE-1483. How about if we introduce an abstract ScoringCollector (about the name later) which implements topDocs() and getTotalHits() and there will be several implementations of it, such as: TopScoreDocCollector, which sorts the documents by their score, in descending order only, TopFieldDocCollector - for sorting by fields, and additional sort-by collectors. This sounds good... but the challenge is we also need to get both HitCollector and MultiReaderHitCollector in there. HitCollector is the simplest way to create a custom collector. MultiReaderHitCollector (added with LUCENE-1483) is the more performant way, since it lets your collector operate per-segment. All non-deprecated core collectors in Lucene now subclass MultiReaderHitCollector. So would we make separate subclasses for each of them to add getTotalHits() / topDocs()? EG TopDocsHitCollector and TopDocsMultiReaderHitCollector? It's getting confusing. Or maybe we just add totalHits() and topDocs() to HitCollector even though for advanced case (non-top-N-collection) the methods would not be used? Or... maybe this is a time when an interface is the lesser evil: we could make a TopDocs interface that the necessary classes implement? Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
ok I missed 1483 completely. As a side comment, why not add setNextReader to HitCollector and then a getDocId(int doc) method which will do the doc + base arithmetic? I think it's very easy for someone to forget to add that (+ base) to doc. You could then just change TopDocCollector to call getDocId() instead of duplicating it into TopScoreDocCollector. Isn't that something you'd want all HitCollector implementations to use? I consider some extensions of HitCollector we have - we now will probably want to change them to extend MultiReaderHitCollector, but we'll have to remember to do that +base arithmatic everywhere, instead of calling getDocId(). I understand that changing the call to getDocId is the same as adding + base, from an effort perspective, but I think it's better this way. It does involve an additional method call, but I wonder how good compilers will handle that. Anyway, I don't want to add topDocs and getTotalHits to HitCollector, it will destroy its generic purpose. An interface is also problematic, as it just means all of these collectors have these methods declared, but they need to implement them. An abstract class grants you w/ both. So in case you agree that the logic of MultiReaderHitCollector can (and should?) be in HitCollector, we can create an abstract class called ScoringCollector (or if nobody objects TopDocsCollector) which will implement these two methods. In case you disagree, we can have that abstract class extend MultiReaderHitCollector instead. I'm in favor of the first option as at least as it looks in the code, HitCollector is not extended by any class anymore, except TopDocCollector which is marked as deprecated, and 3 anonymous implementations. So it looks like HitCollector itself is deprecated as far as the Lucene core code sees it. What do you think? Shai On Mon, Mar 23, 2009 at 4:43 PM, Michael McCandless luc...@mikemccandless.com wrote: If we're already creating a new TopScoreDocCollector (when was it added? I must have been dozing off while this happened...) This was LUCENE-1483. How about if we introduce an abstract ScoringCollector (about the name later) which implements topDocs() and getTotalHits() and there will be several implementations of it, such as: TopScoreDocCollector, which sorts the documents by their score, in descending order only, TopFieldDocCollector - for sorting by fields, and additional sort-by collectors. This sounds good... but the challenge is we also need to get both HitCollector and MultiReaderHitCollector in there. HitCollector is the simplest way to create a custom collector. MultiReaderHitCollector (added with LUCENE-1483) is the more performant way, since it lets your collector operate per-segment. All non-deprecated core collectors in Lucene now subclass MultiReaderHitCollector. So would we make separate subclasses for each of them to add getTotalHits() / topDocs()? EG TopDocsHitCollector and TopDocsMultiReaderHitCollector? It's getting confusing. Or maybe we just add totalHits() and topDocs() to HitCollector even though for advanced case (non-top-N-collection) the methods would not be used? Or... maybe this is a time when an interface is the lesser evil: we could make a TopDocs interface that the necessary classes implement? Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
Shai Erera ser...@gmail.com wrote: As a side comment, why not add setNextReader to HitCollector and then a getDocId(int doc) method which will do the doc + base arithmetic? One problem is this breaks back compatibility on any current subclasses of HitCollector. Another problem is: not all collectors would need to add the base on each doc. EG a collector that puts hits into separate pqueues per segment could defer the addition until the end when only the top results are pulled out of each pqueue. Also, I am concerned about the method call overhead. This is the absolute ultimate hot spot for Lucene and we should worry about causing even a single added instruction in this path. That said... I would like to [eventually] change the collection API along the lines of what Marvin proposed for Matcher in Lucy, here: http://markmail.org/message/jxshhiqr6wvq77xu Specifically, I think it should be the collector's job to ask for the score for this doc, rather than Lucene's job to pre-compute it, so that collectors that don't need the score won't waste CPU. EG, if you are sorting by field (and don't present the relevance score) you shouldn't compute it. Then, we could add other somewhat expensive things you might retrieve, such as a way to ask which terms participated in the match (discussed today on java-user), and/or all term positions that participated (discussed in LUCENE-1522). EG, a top doc collector could choose to call these methods only when the doc was competitive. Anyway, I don't want to add topDocs and getTotalHits to HitCollector, it will destroy its generic purpose. I agree. An interface is also problematic, as it just means all of these collectors have these methods declared, but they need to implement them. An abstract class grants you w/ both. I'm confused on this objection -- only collectors that do let you ask for the top N set of docs would implement this interface? (Ie it'd only be the TopXXXCollector's that'd implement the interface). While interfaces clearly have the future problem of back-compatibility, this case may be simple enough to make an exception. So it looks like HitCollector itself is deprecated as far as the Lucene core code sees it. I think HitCollector has a purpose, which is to be the simplest way to make a custom collector. Ie I think it makes sense to offer a simple way and a high performance way. Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
On Sat, Mar 21, 2009, Michael McCandless wrote about Re: Is TopDocCollector's collect() implementation correct?: I think I'd lean towards a third solution: tighten up TopScoreDocCollector (make it final, remove ability to change its PQ, make things private) and have it focus on high performance collection by score. The problem I see is this: TopDocCollector currently does not implement any sort of interface for reading its output. This means that if you create a completely different class implementing, for example, a different sorting criteria (e.g., sort by date), there will be no base class or interface that you could use for both of them, to allow changing the sort criterion easily at run time. On the other hand, with the existing collector, you can subclass TopDocCollector, and use that as the common base class. If we're already creating a new TopScoreDocCollector (when was it added? I must have been dozing off while this happened...) perhaps we can create a read interface for it (with the getTotalHits() and topDocs() methods), and have this class implement that interface? Then, indeed, nobody will have any reason to extend the TopScoreDocCollector class, and it can be final. -- Nadav Har'El|Sunday, Mar 22 2009, 26 Adar 5769 IBM Haifa Research Lab |- |Did you sleep well? No, I made a http://nadav.harel.org.il |couple of mistakes. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
How about if we introduce an abstract ScoringCollector (about the name later) which implements topDocs() and getTotalHits() and there will be several implementations of it, such as: TopScoreDocCollector, which sorts the documents by their score, in descending order only, TopFieldDocCollector - for sorting by fields, and additional sort-by collectors. All these can share the topDocs and getTotalHits methods. Nadav - this is just like your proposed interface, but I would like to propose an abstract class which will implement the common functionality. The only non-common functionality is collect(), and this one will be implemented by subclasses. That way, all of these can be of the same type, which makes it easier to write search applications who offer the user to sort results based on other attributes than just score. This class can have a protected c'tor which accepts a PQ and nothing more. It will also make its PQ and totalHits protected. About the name - TopDocCollector or TopDocsCollector is the perfect name for this class. But the first one is already taken and the second one will just confuse users (with the first one). Unless we can decide to make TopDocCollector abstract n 2.9, instead of just removing it? Or if you are not happy with ScoringCollector, please provide a better name. The more I think about it, I realize that my intentions with 1356 were to make TopDocCollector a superclass for all scoring documents, and sharing its PQ led to the problem I reported in this thread. Perhaps it was better than to define that abstract class .. but better later than never. What do you think? Shai On Sun, Mar 22, 2009 at 3:16 PM, Nadav Har'El n...@math.technion.ac.ilwrote: On Sat, Mar 21, 2009, Michael McCandless wrote about Re: Is TopDocCollector's collect() implementation correct?: I think I'd lean towards a third solution: tighten up TopScoreDocCollector (make it final, remove ability to change its PQ, make things private) and have it focus on high performance collection by score. The problem I see is this: TopDocCollector currently does not implement any sort of interface for reading its output. This means that if you create a completely different class implementing, for example, a different sorting criteria (e.g., sort by date), there will be no base class or interface that you could use for both of them, to allow changing the sort criterion easily at run time. On the other hand, with the existing collector, you can subclass TopDocCollector, and use that as the common base class. If we're already creating a new TopScoreDocCollector (when was it added? I must have been dozing off while this happened...) perhaps we can create a read interface for it (with the getTotalHits() and topDocs() methods), and have this class implement that interface? Then, indeed, nobody will have any reason to extend the TopScoreDocCollector class, and it can be final. -- Nadav Har'El|Sunday, Mar 22 2009, 26 Adar 5769 IBM Haifa Research Lab |- |Did you sleep well? No, I made a http://nadav.harel.org.il |couple of mistakes. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
I think I'd lean towards a third solution: tighten up TopScoreDocCollector (make it final, remove ability to change its PQ, make things private) and have it focus on high performance collection by score. This is the default collector for Lucene searches so I think keeping it high performance (not adding additional method calls nor boolean checks) is important. You choose this for performance Then introduce a new hit collector that makes subclassing / changing the PQ easier. You choose this one for flexibility. And 2.9 is a great chance to do this since we've deprecated the old class (TopDocCollector) in favor of TopScoreDocCollector. Mike Shai Erera wrote: Thanks Chris/Hoss (not sure who sent the original reply). I don't like calling pq.lessThan, as pq.insert and pq.insertWithOverflow call it anyway internally and since it would add a method call (something that was tried to be avoided in the current implementation), I prefer the code I proposed below. BTW, I introduced 1356, so I take full responsibility on this overlooking. The main reason for 1356 was to allow creating extensions of TopDocCollector so they can be of the same type, and share the topDocs() and totalHIts() implementations. I can file an issue. Any other comments? Shai On Sat, Mar 21, 2009 at 3:48 AM, Chris Hostetter hossman_luc...@fucit.org wrote: (resending msg from earlier today during @apache mail outage -- i didn't get a copy from the list, so i'm assuming no one did) -- Forwarded message -- Date: Fri, 20 Mar 2009 15:29:13 -0700 (PDT) : TopDocCollector's (TDC) implementation of collect() seems a bit problematic : to me. This code isn't an area i'm very familiar with, but your assessment seems correct ... it looks like when LUCENE-1356 introduced the ability to provide a PriorityQueue to the constructor, the existing optimization when the score was obvoiusly too low was overlooked. It looks like this same bug got propogated to TopScoreDocCollector when it was introduced as well. : Introduce in TDC a private boolean which signals whether the default PQ is : used or not. If it's not used, don't do the 'else if' at all. If it is used, : then the 'else if' is safe. Then code could look like: my vote would just be to change the = comarison to a hq.lessThan call ... but i can understand how your proposal might be more efficient -- I'll let the performance experts fight it out ... but i definitely think you should fil a bug. -Hoss - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
(resending msg from earlier today during @apache mail outage -- i didn't get a copy from the list, so i'm assuming no one did) -- Forwarded message -- Date: Fri, 20 Mar 2009 15:29:13 -0700 (PDT) : TopDocCollector's (TDC) implementation of collect() seems a bit problematic : to me. This code isn't an area i'm very familiar with, but your assessment seems correct ... it looks like when LUCENE-1356 introduced the ability to provide a PriorityQueue to the constructor, the existing optimization when the score was obvoiusly too low was overlooked. It looks like this same bug got propogated to TopScoreDocCollector when it was introduced as well. : Introduce in TDC a private boolean which signals whether the default PQ is : used or not. If it's not used, don't do the 'else if' at all. If it is used, : then the 'else if' is safe. Then code could look like: my vote would just be to change the = comarison to a hq.lessThan call ... but i can understand how your proposal might be more efficient -- I'll let the performance experts fight it out ... but i definitely think you should fil a bug. -Hoss - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Is TopDocCollector's collect() implementation correct?
Thanks Chris/Hoss (not sure who sent the original reply). I don't like calling pq.lessThan, as pq.insert and pq.insertWithOverflow call it anyway internally and since it would add a method call (something that was tried to be avoided in the current implementation), I prefer the code I proposed below. BTW, I introduced 1356, so I take full responsibility on this overlooking. The main reason for 1356 was to allow creating extensions of TopDocCollector so they can be of the same type, and share the topDocs() and totalHIts() implementations. I can file an issue. Any other comments? Shai On Sat, Mar 21, 2009 at 3:48 AM, Chris Hostetter hossman_luc...@fucit.orgwrote: (resending msg from earlier today during @apache mail outage -- i didn't get a copy from the list, so i'm assuming no one did) -- Forwarded message -- Date: Fri, 20 Mar 2009 15:29:13 -0700 (PDT) : TopDocCollector's (TDC) implementation of collect() seems a bit problematic : to me. This code isn't an area i'm very familiar with, but your assessment seems correct ... it looks like when LUCENE-1356 introduced the ability to provide a PriorityQueue to the constructor, the existing optimization when the score was obvoiusly too low was overlooked. It looks like this same bug got propogated to TopScoreDocCollector when it was introduced as well. : Introduce in TDC a private boolean which signals whether the default PQ is : used or not. If it's not used, don't do the 'else if' at all. If it is used, : then the 'else if' is safe. Then code could look like: my vote would just be to change the = comarison to a hq.lessThan call ... but i can understand how your proposal might be more efficient -- I'll let the performance experts fight it out ... but i definitely think you should fil a bug. -Hoss - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Is TopDocCollector's collect() implementation correct?
Hi TopDocCollector's (TDC) implementation of collect() seems a bit problematic to me. if (reusableSD == null) { reusableSD = new ScoreDoc(doc, score); } else if (score = reusableSD.score) { // reusableSD holds the last rejected entry, so, if // this new score is not better than that, there's no // need to try inserting it reusableSD.doc = doc; reusableSD.score = score; } else { return; } reusableSD = (ScoreDoc) hq.insertWithOverflow(reusableSD); The first 'if' is safe, as it assumes that if reusableSD is null, then there is more room in PQ. The second (else if) is a bit problematic. It assumes that if the given score is less than the latest item rejected from PQ, then it should not be added. The problem I see here is if someone extends TDC and provides his own PQ, which for example, sorts based on the lowest scoring documents, or just by document IDs. TDC has a protected constructor which allows that. In that case, comparing the current score to the latest rejected ScoreDoc is wrong. Also, PQ itself will check that (using lessThan) before attempting to call the insert itself. On the other hand, this check saves two method calls (PQ.insert and PQ.lessThan) in case someone wants to sort by top ranking documents, which is the default behavior. Of course we could say to that person who wants to sort by docIDs, to extend TDC and override collect() itself, but I would like to propose an alternative, which will allow extending TDC by providing a different PQ, while still using its collect() method. Introduce in TDC a private boolean which signals whether the default PQ is used or not. If it's not used, don't do the 'else if' at all. If it is used, then the 'else if' is safe. Then code could look like: if (reusableSD == null) { reusableSD = new ScoreDoc(doc, score); } else if (useDefaultPQ){ if (score = reusableSD.score) { // reusableSD holds the last rejected entry, so, if // this new score is not better than that, there's no // need to try inserting it reusableSD.doc = doc; reusableSD.score = score; } else { return; } } else { reusableSD.doc = doc; reusableSD.score = score; } reusableSD = (ScoreDoc) hq.insertWithOverflow(reusableSD); I don't think that adding useDefaultPQ will have any performance implications, but it does add one extra boolean check. Therefore if you agree that this should be fixed in TDC, rather than forcing anyone who provides his own PQ to also override collect() (in the cases I've noted above), then I can submit a patch. Shai