[jira] [Updated] (LUCENE-1536) if a filter can support random access API, we should use it

Michael McCandless (JIRA) Sun, 26 Jun 2011 11:28:10 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Michael McCandless updated LUCENE-1536:
---------------------------------------

    Attachment: LUCENE-1536.patch

Initial patch for trunk... lots of nocommits, but tests all pass and I
think this is [roughly] the approach we should take to get fast(er)
Filter perf.

Conceptually, this change is fairly easy, because the flex APIs all
accept a Bits to apply low-level filtering.  However, this Bits is
inverted vs the Filter that callers pass to IndexSearcher (skipDocs vs
keepDocs), so my patch inverts 1) the meaning of the first arg to
the Docs/AndPositions enums (it becomes acceptDocs instead of
skipDocs), and 2) the deleted docs coming back from IndexReaders
(renames IR.getDeletedDocs -> IR.getNotDeletedDocs).
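
To make the inversion concrete, here is a minimal sketch (not code from
the patch).  The nested Bits interface is a local stand-in that only
mirrors the get/length shape of Lucene's Bits, and the names
AcceptDocsSketch, of and invert are made up for illustration.

```java
import java.util.BitSet;

// Sketch only: Bits below is a local stand-in, not the Lucene class.
final class AcceptDocsSketch {
  interface Bits {
    boolean get(int index);
    int length();
  }

  // Adapts a java.util.BitSet to the stand-in Bits interface.
  static Bits of(final BitSet set, final int length) {
    return new Bits() {
      @Override public boolean get(int index) { return set.get(index); }
      @Override public int length() { return length; }
    };
  }

  // Flips the sense: a set bit in skipDocs (deleted) becomes a clear
  // bit in the returned acceptDocs view, and vice versa.
  static Bits invert(final Bits skipDocs) {
    return new Bits() {
      @Override public boolean get(int index) { return !skipDocs.get(index); }
      @Override public int length() { return skipDocs.length(); }
    };
  }

  public static void main(String[] args) {
    BitSet deleted = new BitSet();
    deleted.set(1); // doc 1 is deleted
    Bits acceptDocs = invert(of(deleted, 3));
    for (int doc = 0; doc < acceptDocs.length(); doc++) {
      System.out.println("doc " + doc + " accepted=" + acceptDocs.get(doc));
    }
  }
}
```

With acceptDocs semantics, a Filter's bits and the not-deleted bits mean
the same thing (set = keep), so they can be combined or substituted
without flipping either one.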

That change (inverting the Bits to be keepDocs not skipDocs) is the
vast majority of the patch.

The "real" change is to add DocIdSet.getRandomAccessBits and
bitsIncludesDeletedDocs, which IndexSearcher then consults to figure
out whether to push the filter "low" instead of "high".  I then fixed
OpenBitSet to return itself from getRandomAccessBits, and fixed
CachingWrapperFilter to turn random access on/off as well as state
whether deleted docs were folded into the filter.
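
A rough sketch of how that decision could look, for illustration only.
The two method names come from the description above; their
signatures, the null-means-iterator-only default, and the names
PushDownSketch, SketchDocIdSet, BitSetDocIdSet and choose are all
assumptions, not the patch's actual code.

```java
import java.util.BitSet;

final class PushDownSketch {
  // Local stand-in mirroring the get/length shape of Lucene's Bits.
  interface Bits {
    boolean get(int index);
    int length();
  }

  // Hypothetical shape of the two new DocIdSet hooks (defaults assumed).
  static abstract class SketchDocIdSet {
    // null means "no random access; iterate instead".
    Bits getRandomAccessBits() { return null; }
    // true means deleted docs are already cleared in the bits.
    boolean bitsIncludesDeletedDocs() { return false; }
  }

  // A bit-set-backed set that can hand itself out as random-access Bits.
  static final class BitSetDocIdSet extends SketchDocIdSet implements Bits {
    private final BitSet bits;
    private final int maxDoc;

    BitSetDocIdSet(BitSet bits, int maxDoc) {
      this.bits = bits;
      this.maxDoc = maxDoc;
    }

    @Override public boolean get(int index) { return bits.get(index); }
    @Override public int length() { return maxDoc; }
    @Override Bits getRandomAccessBits() { return this; }
  }

  enum Mode { LOW, HIGH }

  // The searcher-side check: push the filter down only if the set
  // offers random access; otherwise keep applying it as an iterator.
  static Mode choose(SketchDocIdSet set) {
    return set.getRandomAccessBits() != null ? Mode.LOW : Mode.HIGH;
  }

  public static void main(String[] args) {
    BitSet bits = new BitSet();
    bits.set(2);
    System.out.println(choose(new BitSetDocIdSet(bits, 4)));
    System.out.println(choose(new SketchDocIdSet() {}));
  }
}
```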

This means filters cached with CachingWrapperFilter will apply "low".
If it's DeletesMode.RECACHE then a single filter is applied; otherwise
I wrap it with an AND NOT deleted check per docID.  Custom filters are
also free to implement these methods to have their filters applied
"low".
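
The two cases can be sketched like this (illustrative only, not the
patch's code).  Bits is again a local stand-in for Lucene's interface,
and LowFilterSketch, of, andNotDeleted and acceptDocs are hypothetical
names.

```java
import java.util.BitSet;

final class LowFilterSketch {
  // Local stand-in mirroring the get/length shape of Lucene's Bits.
  interface Bits {
    boolean get(int index);
    int length();
  }

  // Adapts a java.util.BitSet to the stand-in Bits interface.
  static Bits of(final BitSet set, final int length) {
    return new Bits() {
      @Override public boolean get(int index) { return set.get(index); }
      @Override public int length() { return length; }
    };
  }

  // Per-docID "AND NOT deleted": a doc passes only if the filter
  // accepts it and it is still live.
  static Bits andNotDeleted(final Bits filter, final Bits notDeleted) {
    return new Bits() {
      @Override public boolean get(int index) {
        return filter.get(index) && notDeleted.get(index);
      }
      @Override public int length() { return filter.length(); }
    };
  }

  // If deletes were already folded into the cached filter (the RECACHE
  // case), use it directly; otherwise add the per-docID check.
  static Bits acceptDocs(Bits filter, boolean bitsIncludesDeletedDocs,
                         Bits notDeleted) {
    return bitsIncludesDeletedDocs ? filter : andNotDeleted(filter, notDeleted);
  }

  public static void main(String[] args) {
    BitSet f = new BitSet();
    f.set(0);
    f.set(1);
    BitSet live = new BitSet();
    live.set(1);
    live.set(2);
    Bits accept = acceptDocs(of(f, 3), false, of(live, 3));
    for (int doc = 0; doc < accept.length(); doc++) {
      System.out.println("doc " + doc + " accepted=" + accept.get(doc));
    }
  }
}
```

The wrapped form costs one extra get per docID, which is why folding
deletes into the cached bits leaves only a single check at the low
level.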


> if a filter can support random access API, we should use it
> -----------------------------------------------------------
>
>                 Key: LUCENE-1536
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1536
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: 2.4
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>             Fix For: 4.0
>
>         Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, 
> LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, 
> LUCENE-1536.patch
>
>
> I ran some performance tests, comparing applying a filter via
> random-access API instead of current trunk's iterator API.
> This was inspired by LUCENE-1476, where we realized deletions should
> really be implemented just like a filter, but then in testing found
> that switching deletions to iterator was a very sizable performance
> hit.
> Some notes on the test:
>   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
>     10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
>   * I test across multiple queries.  1-X means an OR query, eg 1-4
>     means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
>     AND 3 AND 4.  "u s" means "united states" (phrase search).
>   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
>     95, 98, 99, 99.99999 (filter is non-null but all bits are set),
>     100 (filter=null, control)).
>   * Method high means I use random-access filter API in
>     IndexSearcher's main loop.  Method low means I use random-access
>     filter API down in SegmentTermDocs (just like deleted docs
>     today).
>   * Baseline (QPS) is current trunk, where filter is applied as iterator up
>     "high" (ie in IndexSearcher's search loop).

