[jira] Commented: (LUCENE-1536) if a filter can support random access API, we should use it

Michael McCandless (JIRA) Wed, 07 Apr 2010 10:36:56 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854612#action_12854612
 ]


Michael McCandless commented on LUCENE-1536:
--------------------------------------------

With flex, you can now get the deleted docs from a reader (returns a Bits, yet 
another interface for bitsets), and, Docs/AndPositionsEnums require that you 
pass in the skipDocs.  Ie they no longer skip deleted docs by default.

I think this makes this issue quite a bit simpler?  EG OpenBitSets implements 
Bits (as of flex landing), so... we just need a way to pass this Bits down to 
the low level scorers that actually pull a postings list.  But, we should only 
do this if the filter is not sparse.

Also: the filter must be inverted, and, ORd with the deleted docs.

This can result in enormous perf gains for searches doing filtering where the 
filter is relatively static (can be cached & shared across searches).

> if a filter can support random access API, we should use it
> -----------------------------------------------------------
>
>                 Key: LUCENE-1536
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1536
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.4
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
> LUCENE-1536.patch, LUCENE-1536.patch
>
>
> I ran some performance tests, comparing applying a filter via
> random-access API instead of current trunk's iterator API.
> This was inspired by LUCENE-1476, where we realized deletions should
> really be implemented just like a filter, but then in testing found
> that switching deletions to iterator was a very sizable performance
> hit.
> Some notes on the test:
>   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
>     10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
>   * I test across multiple queries.  1-X means an OR query, eg 1-4
>     means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
>     AND 3 AND 4.  "u s" means "united states" (phrase search).
>   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
>     95, 98, 99, 99.99999 (filter is non-null but all bits are set),
>     100 (filter=null, control)).
>   * Method high means I use random-access filter API in
>     IndexSearcher's main loop.  Method low means I use random-access
>     filter API down in SegmentTermDocs (just like deleted docs
>     today).
>   * Baseline (QPS) is current trunk, where filter is applied as iterator up
>     "high" (ie in IndexSearcher's search loop).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] Commented: (LUCENE-1536) if a filter can support random access API, we should use it

Reply via email to