[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695575#action_12695575 ]
Shai Erera commented on LUCENE-1575: ------------------------------------ How do I run such a test? Is there an algorithm for that in the benchmark package? I compared the new TSDC to the trunk's version and the new code does ('-' means a negative change, '+' means a positive change, '|' means neither/undetermined): * adds one collector.setScorer() call to each query. (-) * The scorer.score() call in collect() was just moved from whoever called collect() to inside collect(), so I don't think there's a difference. (|) * Does not check if score > 0.0f in each collect (+) * implements the new topDocs() method. Previously, it just implemented topDocs() which returned everything. Now, topDocs() calls topDocs(0, pq.size()), which verifies parameters and such - since that's executed once at the end of the search, I doubt that it has any effect major effect on the results. BTW, as I scanned through the code I noticed that previously TSDC returned maxScore = Float.NEGATIVE_INFINITY in case there were 0 results to the query, and now it returns Float.NaN. I'm not sure however if this breaks anything, since maxScore is probably used (if at all) for normalization of scores, and in case there are 0 results you don't really have anything to normalize? However I'm not sure ... Regarding TopFieldDocs I am quite surprised. I assume the test uses the OneComparatorScoringCollector, which means scores are computed: * It has the same issue as in TSDC regarding topDocs(). So I think it should be changed here as well, however I doubt that's the cause for the performance hit. * It computes the score and then does super.collect(), which adds a method call (-) * It doesn't check if the score is > 0 (+) * It calls comparator.setScorer, which is ignored in all comparators besides RelevanceComparator. Not sure if it has any performance effects (|) The rest of the code in collect() is exactly the same. Can it be that super.collect() has such an effect? When I think on the results of TSDC (-3%) vs. TFC (-28% on avg.), I think it might be since setScorer() is called once before the series of collect() calls, however super.collect() is called for every document. Your index is large (>2M documents, right?) and I don't know how many results are for each query, if they are in the range of 100Ks, then that could be the explanation. Mike - in case it's faster for you to run it, can you try to run the test again with a change in the code which inlines super.collect() into OneComparatorScoringCollector and compare the results again? I will run it also after you tell me which algorithm you used, but only tomorrow morning, so if you get to do it before then, that'd be great. I doubt that the change in topDocs() affects the query time that much, since it's called at the end of the search, and doing 4-5 'if' statements is really not that expensive (I mean once per the entire search), comparing to ScoreDoc[] array allocation, fetching Stored fields from the index etc. So I'd hate to implement all 3 topDocs() in each of the TopDocsCollector extensions unless it proves to be a problem. Shai On Fri, Apr 3, 2009 at 10:02 PM, Michael McCandless (JIRA) <j...@apache.org>wrote: > Refactoring Lucene collectors (HitCollector and extensions) > ----------------------------------------------------------- > > Key: LUCENE-1575 > URL: https://issues.apache.org/jira/browse/LUCENE-1575 > Project: Lucene - Java > Issue Type: Improvement > Components: Search > Reporter: Shai Erera > Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, > LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, > LUCENE-1575.6.patch, LUCENE-1575.patch, LUCENE-1575.patch > > > This issue is a result of a recent discussion we've had on the mailing list. > You can read the thread > [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. > We have agreed to do the following refactoring: > * Rename MultiReaderHitCollector to Collector, with the purpose that it will > be the base class for all Collector implementations. > * Deprecate HitCollector in favor of the new Collector. > * Introduce new methods in IndexSearcher that accept Collector, and deprecate > those that accept HitCollector. > ** Create a final class HitCollectorWrapper, and use it in the deprecated > methods in IndexSearcher, wrapping the given HitCollector. > ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, > when we remove HitCollector. > ** It will remove any instanceof checks that currently exist in IndexSearcher > code. > * Create a new (abstract) TopDocsCollector, which will: > ** Leave collect and setNextReader unimplemented. > ** Introduce protected members PriorityQueue and totalHits. > ** Introduce a single protected constructor which accepts a PriorityQueue. > ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. > These can be used as-are by extending classes, as well as be overridden. > ** Introduce a new topDocs(start, howMany) method which will be used a > convenience method when implementing a search application which allows paging > through search results. It will also attempt to improve the memory > allocation, by allocating a ScoreDoc[] of the requested size only. > * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() > and getTotalHits() implementations as they are from TopDocsCollector. The > class will also be made final. > * Change TopFieldCollector to extend TopDocsCollector, and make the class > final. Also implement topDocs(start, howMany). > * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, > instead of TopScoreDocCollector. Implement topDocs(start, howMany) > * Review other places where HitCollector is used, such as in Scorer, > deprecate those places and use Collector instead. > Additionally, the following proposal was made w.r.t. decoupling score from > collect(): > * Change collect to accecpt only a doc Id (unbased). > * Introduce a setScorer(Scorer) method. > * If during collect the implementation needs the score, it can call > scorer.score(). > If we do this, then we need to review all places in the code where > collect(doc, score) is called, and assert whether Scorer can be passed. Also > this raises few questions: > * What if during collect() Scorer is null? (i.e., not set) - is it even > possible? > * I noticed that many (if not all) of the collect() implementations discard > the document if its score is not greater than 0. Doesn't it mean that score > is needed in collect() always? > Open issues: > * The name for Collector > * TopDocsCollector was mentioned on the thread as TopResultsCollector, but > that was when we thought to call Colletor ResultsColletor. Since we decided > (so far) on Collector, I think TopDocsCollector makes sense, because of its > TopDocs output. > * Decoupling score from collect(). > I will post a patch a bit later, as this is expected to be a very large > patch. I will split it into 2: (1) code patch (2) test cases (moving to use > Collector instead of HitCollector, as well as testing the new topDocs(start, > howMany) method. > There might be even a 3rd patch which handles the setScorer thing in > Collector (maybe even a different issue?) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org