[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

Shai Erera (JIRA) Tue, 28 Apr 2009 01:53:56 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703540#action_12703540
 ]


Shai Erera commented on LUCENE-1593:
------------------------------------

bq. I think I'd lean towards the 12 impls now. They are tiny classes.

If we resolve everything else, that should not hold us back. I'll do it.

bq. We can mull it over some more... sleep on it.

Ok sleeping did help. Originally I thought that the difference between our 
thinking is that you think that PQ should know how to construct a sentinel 
object, while I thought the code which uses PQ should know that. Now I realize 
both are true - the code which uses PQ, or at least instantiates PQ, already 
*knows* how to create those sentinel objects, since it determines which PQ impl 
to instantiate. I forgot for a moment that PQ is not a concrete class, and 
anyone using it should create his own specialized PQ, or reuse an existing one, 
but anyway that specialized PQ should know how to create the sentinel objects 
and compare them to real objects.

So I'm ok with it - I'll make the following changes, as you suggest:
# Add a protected getSentinelObject() which returns null. Use it in PQ.init() 
to fill the queue if it's not null.
# Make the necessary changes to HitQueue.
# Remove the addSentinelObjects from PQ and the code from TSDC.

BTW, we should be aware that this means anyone using HitQueue needs to know 
that upon initialization it's filled with sentinel objects, and that its size() 
will be maxSize etc. Since HQ is package private I don't have a problem with 
it. Generally speaking, the code which instantiates a PQ and the code that uses 
it must be in sync ... i.e., if I instantiate a PQ and pass it to some other 
code which just receives a PQ and adds elements to it, that code should not 
rely on size() being smaller or anything. I don't feel it complicates things 
... and anyway someone can always create a PQ impl which receives a boolean 
that determines whether sentinel objects should be created or not and if not 
return null in its getSentinelObject().

bq. Maybe we should add a "docsInOrder()" method to Scorer?

I'm not sure that will solve it .. BS2 consults its allowDocsOutOfOrder only if 
score(Collector) is called, which it then instantiates a BS and delegates the 
score(Collector) to. So suppose that BS.docsInOrder return false, what will BS2 
return? Remember that it may be used by IndexSearcher in two modes: (1) without 
a filter - BS2.score(Collector), (2) with filter - BS2.next() and skipTo(). So 
it cannot consult its own allowDocsOutOfOrder (even though it gets it as a 
parameter) since depending on how it will be used, the answer is different.
BTW, IndexSearch.doSearch creates the Scorer, but already receives the 
Collector as argument, therefore at this point it's too late to make any 
decisions regarding orderness of docs, no?

There are few issues that we need to solve:
# A user can set BooleanQuery.setAllowDocsOutOfOrder today, which may trigger 
BS2.score(Collector) to instantiate BS, which may screw up the Collector's 
logic if it assumes in-order documents. IndexSearcher creates the Collector 
before it knows whether BQ is used or not so it cannot do any intelligent 
checks. I see two possible solutions, which only 1 of them may be implemented 
now and the other in 3.0:
## Add docsInOrder to Weight (it's an interface, therefore just in 3.0), since 
that seems to allow IS to check if the current query may visit documents 
out-of-order.
##* Actually, maybe we add it to Query, which is abstract, and in IS we do 
weight.getQuery().docsInOrder()?
## In IS we check BQ.getAllowDocsOutOfOrder() and if true we always create 
out-of-order collectors. That might impact performance if there are no BQ 
clauses, but I assume it is not used much? And this doesn't break back-compat 
since that's the only way to instantiate an out-of-order Scorer today (besides 
creating your own).
# Someone can create his own Collector, but will have no way to know if the 
docs will be sent in-order or not. Whatever we do, we have to allow people to 
correctly 'predict' the behavior of their Collectors, that's why I like the BQ 
static setting of that variant. The user is the one that sets it to true, so 
he/she should know that and create their appropriate Collector instance.
#* On the other hand, if we choose to add that information to Query, those 
Collectors may not have that information in hand when they are instantiated ...

So I'm torn here. Adding that information to Query will solve it for those that 
use the convenient search methods (i.e., those that don't receive a Collector), 
but provide their own Query impl, since if we add a default impl to Query which 
returns false (i.e., out-of-order), it should not change the behavior for them. 
And if they always return docs in-order, they can override it to return true.

About those that pass in Collector ... do we really have a problem? I mean, 
today that have no choice but to pass in a Collector that expects out-of-order 
docs, right? We did not make any promises in 2.4.1 regarding documents order? 
So in the worse case their code will be slightly inefficient by perhaps 
unnecessarily attempts to insert docs that compare equal to the top of the 
queue.
And since they can always create the Query and only then create the Collector, 
if we add that info to Query they should have enough information at hand to 
create the proper Collector instance.

If we do add it to Query, then I'd like to deprecate BQ's static setter and 
getter of that attribute and provide a docsInOrder() impl, but we need to 
resolve how it will know whether it will use BS or BS2.

I apologize for the long post, but I'd like to verify that I didn't miss 
anything regarding back-compat and possible usage, so that we make the right 
decision.

> Optimizations to TopScoreDocCollector and TopFieldCollector
> -----------------------------------------------------------
>
>                 Key: LUCENE-1593
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1593
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Shai Erera
>             Fix For: 2.9
>
>         Attachments: LUCENE-1593.patch, PerfTest.java
>
>
> This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code 
> to remove unnecessary checks. The plan is:
> # Ensure that IndexSearcher returns segements in increasing doc Id order, 
> instead of numDocs().
> # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs 
> will always have larger ids and therefore cannot compete.
> # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) 
> and remove the check if reusableSD == null.
> # Also move to use "changing top" and then call adjustTop(), in case we 
> update the queue.
> # some methods in Sort explicitly add SortField.FIELD_DOC as a "tie breaker" 
> for the last SortField. But, doing so should not be necessary (since we 
> already break ties by docID), and is in fact less efficient (once the above 
> optimization is in).
> # Investigate PQ - can we deprecate insert() and have only 
> insertWithOverflow()? Add a addDummyObjects method which will populate the 
> queue without "arranging" it, just store the objects in the array (this can 
> be used to pre-populate sentinel values)?
> I will post a patch as well as some perf measurements as soon as I have them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector

Reply via email to