[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

Michael McCandless (JIRA) Fri, 03 Apr 2009 12:02:40 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695537#action_12695537
 ]


Michael McCandless commented on LUCENE-1575:
--------------------------------------------


I ran a "first do no harm" perf test, comparing trunk with this patch:

||query||sort||hits||qps||qpsnew||pctg||
|147|score|   6953|3631.1|3641.8|  0.3%|
|147|title|   6953|2916.7|2255.6|-22.7%|
|147|doc|   6953|3251.2|2676.8|-17.7%|
|text|score| 157101| 208.1| 202.1| -2.9%|
|text|title| 157101|  96.7|  84.8|-12.3%|
|text|doc| 157101| 174.0| 115.2|-33.8%|
|1|score| 565452|  58.0|  56.4| -2.8%|
|1|title| 565452|  44.5|  34.1|-23.4%|
|1|doc| 565452|  49.2|  32.8|-33.3%|
|1 OR 2|score| 784928|  14.1|  13.7| -2.8%|
|1 OR 2|title| 784928|  12.5|  11.5| -8.0%|
|1 OR 2|doc| 784928|  13.0|  11.9| -8.5%|
|1 AND 2|score| 333153|  15.5|  15.5|  0.0%|
|1 AND 2|title| 333153|  14.8|  13.7| -7.4%|
|1 AND 2|doc| 333153|  15.2|  14.2| -6.6%|

Looks like:
 
  * Sort by relevance got maybe a tad slower (~3%)

  * Sort by field is now quite a bit slower (23-33% on term query '1')

This was on a full wikipedia index, with 14 segments, Sun java
1.6.0_07 on OS X Mac Pro quad core, on Intel X25M 160 GB
SSD.

I think we need to iterate some to try to get some performance back.


> Refactoring Lucene collectors (HitCollector and extensions)
> -----------------------------------------------------------
>
>                 Key: LUCENE-1575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1575
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Shai Erera
>            Assignee: Michael McCandless
>             Fix For: 2.9
>
>         Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, 
> LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, 
> LUCENE-1575.6.patch, LUCENE-1575.patch, LUCENE-1575.patch
>
>
> This issue is a result of a recent discussion we've had on the mailing list. 
> You can read the thread 
> [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
> We have agreed to do the following refactoring:
> * Rename MultiReaderHitCollector to Collector, with the purpose that it will 
> be the base class for all Collector implementations.
> * Deprecate HitCollector in favor of the new Collector.
> * Introduce new methods in IndexSearcher that accept Collector, and deprecate 
> those that accept HitCollector.
> ** Create a final class HitCollectorWrapper, and use it in the deprecated 
> methods in IndexSearcher, wrapping the given HitCollector.
> ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, 
> when we remove HitCollector.
> ** It will remove any instanceof checks that currently exist in IndexSearcher 
> code.
> * Create a new (abstract) TopDocsCollector, which will:
> ** Leave collect and setNextReader unimplemented.
> ** Introduce protected members PriorityQueue and totalHits.
> ** Introduce a single protected constructor which accepts a PriorityQueue.
> ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. 
> These can be used as-are by extending classes, as well as be overridden.
> ** Introduce a new topDocs(start, howMany) method which will be used a 
> convenience method when implementing a search application which allows paging 
> through search results. It will also attempt to improve the memory 
> allocation, by allocating a ScoreDoc[] of the requested size only.
> * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() 
> and getTotalHits() implementations as they are from TopDocsCollector. The 
> class will also be made final.
> * Change TopFieldCollector to extend TopDocsCollector, and make the class 
> final. Also implement topDocs(start, howMany).
> * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, 
> instead of TopScoreDocCollector. Implement topDocs(start, howMany)
> * Review other places where HitCollector is used, such as in Scorer, 
> deprecate those places and use Collector instead.
> Additionally, the following proposal was made w.r.t. decoupling score from 
> collect():
> * Change collect to accecpt only a doc Id (unbased).
> * Introduce a setScorer(Scorer) method.
> * If during collect the implementation needs the score, it can call 
> scorer.score().
> If we do this, then we need to review all places in the code where 
> collect(doc, score) is called, and assert whether Scorer can be passed. Also 
> this raises few questions:
> * What if during collect() Scorer is null? (i.e., not set) - is it even 
> possible?
> * I noticed that many (if not all) of the collect() implementations discard 
> the document if its score is not greater than 0. Doesn't it mean that score 
> is needed in collect() always?
> Open issues:
> * The name for Collector
> * TopDocsCollector was mentioned on the thread as TopResultsCollector, but 
> that was when we thought to call Colletor ResultsColletor. Since we decided 
> (so far) on Collector, I think TopDocsCollector makes sense, because of its 
> TopDocs output.
> * Decoupling score from collect().
> I will post a patch a bit later, as this is expected to be a very large 
> patch. I will split it into 2: (1) code patch (2) test cases (moving to use 
> Collector instead of HitCollector, as well as testing the new topDocs(start, 
> howMany) method.
> There might be even a 3rd patch which handles the setScorer thing in 
> Collector (maybe even a different issue?)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

Reply via email to