[jira] [Commented] (LUCENE-1536) if a filter can support random access API, we should use it

Uwe Schindler (Commented) (JIRA) Tue, 11 Oct 2011 05:55:38 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124993#comment-13124993
 ]


Uwe Schindler commented on LUCENE-1536:
---------------------------------------

Sorry Robert, if that works it's fine, but the test is still failing, so 
something is wrong.

My main problem with the patch is that for most filters, the call to 
getDocIdSet() is the most expensive one, so you are right to cache the 
result. But we actually calculate the DocIdSet twice (unless we use 
CachingWrapperFilter) if acceptDocs != liveDocs. And as the first segment 
is generally the largest one, this is even worse.
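
As a minimal sketch of the double-computation concern, the following uses 
hypothetical stand-in classes (CountingFilter and CachingFilterSketch are 
invented for illustration and are not Lucene classes). It only mimics the 
general idea of caching a per-segment result; it is not how 
CachingWrapperFilter is implemented.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a filter whose per-segment set is costly to build.
final class CountingFilter {
    int computations = 0;

    // Pretend this is the expensive getDocIdSet() call for one segment.
    int[] getDocIdSet(int segment) {
        computations++;
        return new int[] {segment, segment + 10};
    }
}

// Memoizes the per-segment result so a second request does not recompute it.
final class CachingFilterSketch {
    private final CountingFilter delegate;
    private final Map<Integer, int[]> cache = new HashMap<>();

    CachingFilterSketch(CountingFilter delegate) {
        this.delegate = delegate;
    }

    int[] getDocIdSet(int segment) {
        // Only calls the delegate the first time a segment is requested.
        return cache.computeIfAbsent(segment, delegate::getDocIdSet);
    }
}
```

Asking the uncached filter twice for the same segment runs the expensive 
call twice, while the caching wrapper runs it once.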

In my opinion, the whole approach of looking at the sparseness of the 
DocIdSet is broken for this case: we can only do this correctly per segment, 
but we later require all segments to use the same scorer implementation. I have 
no idea how to solve this. It would not even be enough, as in Chris's/Mike's 
original approaches, to support something like DocIdSet.useBits()/isSparse() 
or whatever, as this is also per segment.

There is also a second problem: it might happen that a filter returns a 
DocIdSet that does not support bits() for one segment, but does support it for 
other segments. How do we handle that? There is one case where this happens 
(DocIdSet.EMPTY_DOCIDSET always returns null for bits), but this one is 
gracefully handled by an early exit condition, so we won't get an NPE.
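
To make the mixed case concrete, here is a rough sketch with hypothetical 
stand-in types (SegmentDocs and PerSegmentFilterSketch are invented for 
illustration, not Lucene classes). It assumes candidate doc ids arrive in 
increasing order and are smaller than maxDoc. It shows only the mechanical 
null check and fallback for a single segment; it does not address the 
requirement that all segments use the same scorer implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a per-segment doc id set; not the Lucene class.
final class SegmentDocs {
    private final int[] sortedDocs;   // accepted doc ids, increasing order
    private final int maxDoc;
    private final boolean randomAccess;

    SegmentDocs(int[] sortedDocs, int maxDoc, boolean randomAccess) {
        this.sortedDocs = sortedDocs;
        this.maxDoc = maxDoc;
        this.randomAccess = randomAccess;
    }

    // Random-access view, or null when this segment cannot provide one.
    boolean[] bits() {
        if (!randomAccess) {
            return null;
        }
        boolean[] b = new boolean[maxDoc];
        for (int d : sortedDocs) {
            b[d] = true;
        }
        return b;
    }

    // Sequential view, always available.
    int[] iterator() {
        return sortedDocs;
    }
}

final class PerSegmentFilterSketch {
    // Returns the candidates accepted by the filter for one segment.
    static List<Integer> apply(SegmentDocs filter, int[] sortedCandidates) {
        List<Integer> hits = new ArrayList<>();
        boolean[] bits = filter.bits();
        if (bits != null) {
            // Random-access path: test each candidate directly.
            for (int doc : sortedCandidates) {
                if (bits[doc]) {
                    hits.add(doc);
                }
            }
        } else {
            // Fallback path: advance through the sorted accepted docs.
            int[] accepted = filter.iterator();
            int i = 0;
            for (int doc : sortedCandidates) {
                while (i < accepted.length && accepted[i] < doc) {
                    i++;
                }
                if (i < accepted.length && accepted[i] == doc) {
                    hits.add(doc);
                }
            }
        }
        return hits;
    }
}
```

Both paths accept the same documents; what differs per segment is only which 
access pattern is available.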

The only possible solution is to make Filters always request in-order scoring, 
but this would limit our optimization possibilities.

Finally, I still think we should fix BS1 and BS2 to return identical scores (and 
write a test for that which compares scores). Second, in Mike's document/score 
listing above with/without the patch, I see no score differences; only the order 
of docs is different (which is caused by out-of-order missing), so where is the 
problem?
                
> if a filter can support random access API, we should use it
> -----------------------------------------------------------
>
>                 Key: LUCENE-1536
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1536
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: 2.4
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>             Fix For: 4.0
>
>         Attachments: CachedFilterIndexReader.java, LUCENE-1536-rewrite.patch, 
> LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
> LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
> LUCENE-1536-rewrite.patch, LUCENE-1536-rewrite.patch, 
> LUCENE-1536-rewrite.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
> LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
> LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
> LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
> LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
> LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
> LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
> LUCENE-1536.patch, LUCENE-1536_hack.patch, changes-yonik-uwe.patch, 
> luceneutil.patch
>
>
> I ran some performance tests, comparing applying a filter via
> random-access API instead of current trunk's iterator API.
> This was inspired by LUCENE-1476, where we realized deletions should
> really be implemented just like a filter, but then in testing found
> that switching deletions to iterator was a very sizable performance
> hit.
> Some notes on the test:
>   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
>     10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
>   * I test across multiple queries.  1-X means an OR query, eg 1-4
>     means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
>     AND 3 AND 4.  "u s" means "united states" (phrase search).
>   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
>     95, 98, 99, 99.99999 (filter is non-null but all bits are set),
>     100 (filter=null, control)).
>   * Method high means I use random-access filter API in
>     IndexSearcher's main loop.  Method low means I use random-access
>     filter API down in SegmentTermDocs (just like deleted docs
>     today).
>   * Baseline (QPS) is current trunk, where filter is applied as iterator up
>     "high" (ie in IndexSearcher's search loop).
