[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

Shai Erera (JIRA) Fri, 03 Apr 2009 09:34:43 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695478#action_12695478
 ]


Shai Erera commented on LUCENE-1575:
------------------------------------

bq. I like "TimeLimitingCollector", or maybe "TimeoutCollector"?

I like TimeLimitingCollector better, as I think the name makes the class more 
self explanatory.

bq. TopFieldCollector's updateBottom & add methods take score, and are passed 
score from the non-scoring collectors, but shouldn't?

At the end of the day, even the non-scoring collectors store a score in 
ScoreDoc, which is Float.NaN. So they should pass a score. Unlike the scoring 
ones, they always pass Float.NaN without ever calling scorer.score(). That's 
the cleanest way I've found I can make the changes to that class, w/o 
duplicating implementation all over the place. Notice that the scoring versions 
extend the non-scoring, and just add score computation, which resulted in a 
very clean implementation.

bq. TermScorer need not override score(HitCollector hc) (super does the same 
thing).

Agreed.

bq. The changes to TermScorer make me a bit nervous.

Since we pass Sorer to Collector, I thought we cannot really rely on anyone not 
calling scorer.doc() or getSimilarity ever - it is in the API. Since doc() is 
abstract, I had to implement it and just thought that retuning the current doc 
is better than -1 for example. There are some alternatives I see to resolve it:
# Create an abstract ScoringOnlyScorer which extends Scorer and implements all 
methods to throw UOE (also as final), besides score() which it will define 
abstract. We then define a ScoringOnlyScorerWrapper which takes a Scorer and 
delegates the score() calls. We use SOSW in places where we can't extend SOS. 
Where we can, we just extend it directly and implement score(), like in the 
InternalScorer case.
# Create a new class which implements just score() (I've yet to come with a 
good name since Scorer is already taken) and create a wrapper which takes a 
Scorer and delegates the score() calls to it. Then Collector will use that new 
class, and we're sure that only score() can be called.

The last two comments are completely an overlook by my side. I'm not so sure 
about your proposal though. If we add to Searcher a concrete impl which throws 
UOE, how would that work in 3.0? How would anyone who extends Searcher know 
that it has to extend this method? Maybe do it now, and document that in 3.0 it 
will become abstract again?
About Searchable, I wonder how many do implement Searchable, rather than extend 
IndexSearcher. Perhaps instead of making any changes in back-compat and add 
documentation to CHANGES I'll just comment out this method with a TODO to 
re-enstate in 3.0?

> Refactoring Lucene collectors (HitCollector and extensions)
> -----------------------------------------------------------
>
>                 Key: LUCENE-1575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1575
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Shai Erera
>            Assignee: Michael McCandless
>             Fix For: 2.9
>
>         Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, 
> LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, 
> LUCENE-1575.6.patch, LUCENE-1575.patch, LUCENE-1575.patch
>
>
> This issue is a result of a recent discussion we've had on the mailing list. 
> You can read the thread 
> [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
> We have agreed to do the following refactoring:
> * Rename MultiReaderHitCollector to Collector, with the purpose that it will 
> be the base class for all Collector implementations.
> * Deprecate HitCollector in favor of the new Collector.
> * Introduce new methods in IndexSearcher that accept Collector, and deprecate 
> those that accept HitCollector.
> ** Create a final class HitCollectorWrapper, and use it in the deprecated 
> methods in IndexSearcher, wrapping the given HitCollector.
> ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, 
> when we remove HitCollector.
> ** It will remove any instanceof checks that currently exist in IndexSearcher 
> code.
> * Create a new (abstract) TopDocsCollector, which will:
> ** Leave collect and setNextReader unimplemented.
> ** Introduce protected members PriorityQueue and totalHits.
> ** Introduce a single protected constructor which accepts a PriorityQueue.
> ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. 
> These can be used as-are by extending classes, as well as be overridden.
> ** Introduce a new topDocs(start, howMany) method which will be used a 
> convenience method when implementing a search application which allows paging 
> through search results. It will also attempt to improve the memory 
> allocation, by allocating a ScoreDoc[] of the requested size only.
> * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() 
> and getTotalHits() implementations as they are from TopDocsCollector. The 
> class will also be made final.
> * Change TopFieldCollector to extend TopDocsCollector, and make the class 
> final. Also implement topDocs(start, howMany).
> * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, 
> instead of TopScoreDocCollector. Implement topDocs(start, howMany)
> * Review other places where HitCollector is used, such as in Scorer, 
> deprecate those places and use Collector instead.
> Additionally, the following proposal was made w.r.t. decoupling score from 
> collect():
> * Change collect to accecpt only a doc Id (unbased).
> * Introduce a setScorer(Scorer) method.
> * If during collect the implementation needs the score, it can call 
> scorer.score().
> If we do this, then we need to review all places in the code where 
> collect(doc, score) is called, and assert whether Scorer can be passed. Also 
> this raises few questions:
> * What if during collect() Scorer is null? (i.e., not set) - is it even 
> possible?
> * I noticed that many (if not all) of the collect() implementations discard 
> the document if its score is not greater than 0. Doesn't it mean that score 
> is needed in collect() always?
> Open issues:
> * The name for Collector
> * TopDocsCollector was mentioned on the thread as TopResultsCollector, but 
> that was when we thought to call Colletor ResultsColletor. Since we decided 
> (so far) on Collector, I think TopDocsCollector makes sense, because of its 
> TopDocs output.
> * Decoupling score from collect().
> I will post a patch a bit later, as this is expected to be a very large 
> patch. I will split it into 2: (1) code patch (2) test cases (moving to use 
> Collector instead of HitCollector, as well as testing the new topDocs(start, 
> howMany) method.
> There might be even a 3rd patch which handles the setScorer thing in 
> Collector (maybe even a different issue?)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

Reply via email to