[
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124720#comment-13124720
]
Uwe Schindler commented on LUCENE-1536:
---------------------------------------
{quote}
hack patch that computes the heuristic up front in weight init, so it scores
all segments consistently and returns the proper scoresDocsOutOfOrder for BS1.
Uwe's new test (the nestedFilterQuery) doesnt pass yet, don't know why.
{quote}
Very easy to explain: because it's a hack! The problem is simple: the new test
explicitly checks that acceptDocs are correctly handled by the query, which is
not the case with your modifications. In createWeight you get the first segment
and create the filter's DocIdSet on it, passing *liveDocs* (because you have
nothing else). You cache this first DocIdSet (to avoid executing
getDocIdSet twice for the first filter) and thereby miss the real acceptDocs
(which are != liveDocs in this test). The first segment therefore returns more
documents than it should.
Altogether, the hack is of course uncommittable, and the source of our problem
lies only in the out-of-order setting. The fix in your patch is fine, but goes
too far. The scoresDocsOutOfOrder method should simply return what the inner
weight returns, because it *may* return docs out of order. It can still return
them in order (if a filter needs to be applied using an iterator). This is no
different from the behaviour before. So the fix is easy: do the same as in
ConstantScoreQuery, where we return the setting from the inner weight.
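The proposed fix can be sketched as follows. The types below are simplified stand-ins for illustration only, not the real Lucene classes; the point is purely the delegation pattern:

```java
// Simplified stand-in for Lucene's Weight, reduced to the one method at issue.
interface Weight {
    boolean scoresDocsOutOfOrder();
}

// A filtered weight should not guess or cache a heuristic up front; it simply
// delegates scoresDocsOutOfOrder() to the wrapped weight, the same way
// ConstantScoreQuery's weight reports its inner weight's setting.
class FilteredWeight implements Weight {
    private final Weight innerWeight;

    FilteredWeight(Weight innerWeight) {
        this.innerWeight = innerWeight;
    }

    @Override
    public boolean scoresDocsOutOfOrder() {
        // The filtered weight *may* score out of order exactly when the inner
        // weight may; it can still end up in order if the filter is applied
        // via an iterator.
        return innerWeight.scoresDocsOutOfOrder();
    }
}

public class Demo {
    public static void main(String[] args) {
        Weight inOrder = () -> false;
        Weight outOfOrder = () -> true;
        System.out.println(new FilteredWeight(inOrder).scoresDocsOutOfOrder());
        System.out.println(new FilteredWeight(outOfOrder).scoresDocsOutOfOrder());
    }
}
```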
Being consistent in selecting scorer implementations between segments is not an
issue of this special case; it's a general problem and cannot be solved by a
hack. The selection of the Scorer for BooleanQuery can be different even without
FilteredQuery, as BooleanWeight might return different scorers, too
(so the problem is that BooleanScorer does the selection of its Scorer
per segment). To fix this, BooleanWeight must make all the scorer decisions in
its ctor, so we would also need to pass scoreInOrder and other parameters to
the Weight's ctor.
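A hedged sketch of that general idea, with hypothetical simplified types (the real BooleanWeight API differs; the constructor parameters here are assumptions made up for illustration): the scorer decision is made once, in the constructor, so every segment sees the same choice.

```java
// Hypothetical sketch: decide the scorer strategy once in the weight's ctor
// instead of per segment, so all segments are scored consistently.
class BooleanWeightSketch {
    private final boolean useOutOfOrderScorer;

    // scoreDocsInOrder would have to be passed down into the constructor,
    // rather than being evaluated per segment inside scorer(...).
    BooleanWeightSketch(boolean scoreDocsInOrder, int optionalClauses) {
        // One global decision: an out-of-order scorer only when the collector
        // permits it and there are enough optional clauses to benefit.
        this.useOutOfOrderScorer = !scoreDocsInOrder && optionalClauses > 1;
    }

    // Per-segment call, but the decision is already fixed.
    String scorerFor(String segmentName) {
        return useOutOfOrderScorer ? "out-of-order" : "in-order";
    }
}

public class ScorerDecisionDemo {
    public static void main(String[] args) {
        BooleanWeightSketch w = new BooleanWeightSketch(false, 3);
        // Every segment gets the same choice, decided once up front.
        System.out.println(w.scorerFor("_0"));
        System.out.println(w.scorerFor("_1"));
    }
}
```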
Please remove the hack, and only implement scoresDocsOutOfOrder correctly
(which is the reason for the problem, as it suddenly returns documents in a
different order). With that patch we can still get the documents in a different
order if random access is enabled together with the filter, whereas the old
IndexSearcher used a DocIdSetIterator (in-order). We should ignore those
differences in document order if the score is identical (and Mike's output shows
the scores are equal). If we want to check that the results are identical, the
benchmark test must explicitly request docs-in-order on trunk vs. patch to be
consistent. But then it's no longer a benchmark.
Conclusion: In general we have explained the differences between the patches, and I
think my original patch is fine except for Weight.scoresDocsOutOfOrder, which
should return the inner Weight's setting (like CSQ does) - no magic needed. Our
patch does *not* return wrong documents; just the order of equal-scoring
documents is different, which is perfectly fine.
> if a filter can support random access API, we should use it
> -----------------------------------------------------------
>
> Key: LUCENE-1536
> URL: https://issues.apache.org/jira/browse/LUCENE-1536
> Project: Lucene - Java
> Issue Type: Improvement
> Components: core/search
> Affects Versions: 2.4
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Labels: gsoc2011, lucene-gsoc-11, mentor
> Fix For: 4.0
>
> Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch,
> LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch,
> LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch,
> LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch,
> LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch,
> LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch,
> LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch,
> LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch,
> LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch,
> LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch,
> LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch,
> LUCENE-1536_hack.patch, changes-yonik-uwe.patch, luceneutil.patch
>
>
> I ran some performance tests, comparing applying a filter via
> random-access API instead of current trunk's iterator API.
> This was inspired by LUCENE-1476, where we realized deletions should
> really be implemented just like a filter, but then in testing found
> that switching deletions to iterator was a very sizable performance
> hit.
> Some notes on the test:
> * Index is first 2M docs of Wikipedia. Test machine is Mac OS X
> 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
> * I test across multiple queries. 1-X means an OR query, eg 1-4
> means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
> AND 3 AND 4. "u s" means "united states" (phrase search).
> * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
> 95, 98, 99, 99.99999 (filter is non-null but all bits are set),
> 100 (filter=null, control)).
> * Method high means I use random-access filter API in
> IndexSearcher's main loop. Method low means I use random-access
> filter API down in SegmentTermDocs (just like deleted docs
> today).
> * Baseline (QPS) is current trunk, where filter is applied as iterator up
> "high" (ie in IndexSearcher's search loop).