[
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670455#action_12670455
]
Michael McCandless commented on LUCENE-1536:
--------------------------------------------
Some ideas / further things to explore:
* Deletions, top-level filters, and a BooleanQuery that "factors" to a
top-level AND really should be handled by the same code.
* Even an AND'd filter on a sub-clause of a BooleanQuery can be
pushed down to the TermDocs under that tree.
* That common code should send a top-level filter down to the lowest
level, used by random access API, if the filter supports random
access (not all do) and it's not super sparse.
* I think one thing slowing down trunk is the lack of a
Scorer.skipToButNotNext API. We now ask the filter for its
next(), which gives us a filterDocID. Then we call
Scorer.skipTo(filterDocID). If the scorer does not match that
filterDocID, it internally does next(), which for an expensive
scorer is a lot of likely wasted work: it advances to a docID that
the filter may not accept. If we had a "skipToButNotNext" API we
could avoid that wasted work. I'm curious what gains this change
alone would provide.
* I'm thinking (though I haven't tested this) that if the filter is
relatively sparse compared to the other iterators, it'd be better to
convert it to a sparse repr (eg SortedVIntList) and drive the search by
iterating through the filter, after fixing the above skipTo issue.
Maybe a "low iterator" access.
* We may need a "filter optimizer" utility class somewhere, somehow.
For filters you do not plan to re-use, you would not bother with
this. But for filters that will be re-used, you should 1) convert
them to sparse or non-sparse repr depending on their density, 2)
maybe invert them and make sparse if they are close to 100%
density, 3) maybe factor in deletions to the filter so there is
only a single top-level filter to apply.
* I'm not yet sure how to make this change cleanly to the APIs...
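To make the leapfrog idea in the bullets above concrete, here is a minimal, self-contained sketch of driving the search by iterating a sparse filter and only asking the scorer to skip to filter-accepted docs. IntIterator, skipTo, and leapfrog are illustrative stand-ins, not Lucene APIs; in particular, a real "skipToButNotNext" would let the scorer position at a target without its internal next(), which this toy model does not capture.

```java
import java.util.Arrays;

// Toy leapfrog between a sparse filter and a scorer, both modeled as
// simple iterators over sorted doc IDs. Stand-in for Lucene's
// Scorer/DocIdSetIterator; names here are assumptions for illustration.
public class LeapfrogSketch {

  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  static class IntIterator {
    private final int[] docs; // sorted doc IDs this iterator matches
    private int idx = -1;

    IntIterator(int... docs) { this.docs = docs; }

    int doc() {
      return idx < 0 ? -1 : idx >= docs.length ? NO_MORE_DOCS : docs[idx];
    }

    // Advance to the first doc >= target (the skipTo described above).
    int skipTo(int target) {
      idx = Math.max(idx, 0);
      while (idx < docs.length && docs[idx] < target) idx++;
      return doc();
    }

    int next() {
      idx++;
      return doc();
    }
  }

  // Drive the search by the (sparse) filter: the scorer is only asked to
  // skip to docs the filter accepts, and when the scorer lands past the
  // filter's doc, the filter catches up instead of the scorer re-advancing.
  static int[] leapfrog(IntIterator filter, IntIterator scorer) {
    java.util.List<Integer> hits = new java.util.ArrayList<>();
    int f = filter.next();
    while (f != NO_MORE_DOCS) {
      int s = scorer.skipTo(f);
      if (s == NO_MORE_DOCS) break;
      if (s == f) {          // both match: collect this doc
        hits.add(s);
        f = filter.next();
      } else {               // scorer is ahead: let the filter catch up
        f = filter.skipTo(s);
      }
    }
    return hits.stream().mapToInt(Integer::intValue).toArray();
  }

  public static void main(String[] args) {
    IntIterator filter = new IntIterator(2, 5, 7, 9, 12);
    IntIterator scorer = new IntIterator(1, 5, 6, 9, 10, 12);
    System.out.println(Arrays.toString(leapfrog(filter, scorer)));
    // prints [5, 9, 12]
  }
}
```

With a real expensive scorer, the win the comment is after would come from the scorer never doing speculative next() work on docs the filter would reject.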
> if a filter can support random access API, we should use it
> -----------------------------------------------------------
>
> Key: LUCENE-1536
> URL: https://issues.apache.org/jira/browse/LUCENE-1536
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.4
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Attachments: LUCENE-1536.patch
>
>
> I ran some performance tests, comparing applying a filter via
> random-access API instead of current trunk's iterator API.
> This was inspired by LUCENE-1476, where we realized deletions should
> really be implemented just like a filter, but then in testing found
> that switching deletions to iterator was a very sizable performance
> hit.
> Some notes on the test:
> * Index is first 2M docs of Wikipedia. Test machine is Mac OS X
> 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
> * I test across multiple queries. 1-X means an OR query, eg 1-4
> means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
> AND 3 AND 4. "u s" means "united states" (phrase search).
> * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
> 95, 98, 99, 99.99999 (filter is non-null but all bits are set),
> 100 (filter=null, control)).
> * Method high means I use random-access filter API in
> IndexSearcher's main loop. Method low means I use random-access
> filter API down in SegmentTermDocs (just like deleted docs
> today).
> * Baseline (QPS) is current trunk, where filter is applied as iterator up
> "high" (ie in IndexSearcher's search loop).
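The "high" vs "low" methods quoted above can be modeled with a toy sketch: both apply the same random-access filter bits, but "low" rejects docs before any scoring work happens, the way deleted docs are skipped in SegmentTermDocs. The class and counters here are illustrative assumptions, not Lucene code.

```java
import java.util.BitSet;

// Toy model of the two random-access methods: "high" checks the filter in
// the outer search loop after the scorer has matched a doc; "low" pushes
// the filter bits down to the postings level so rejected docs are never
// scored at all. scoredOut[0] counts how many docs reached "scoring".
public class HighVsLowSketch {

  // "high": filter consulted at collect time, in the search loop.
  static int countHitsHigh(int[] scorerDocs, BitSet filter, int[] scoredOut) {
    int hits = 0, scored = 0;
    for (int doc : scorerDocs) {
      scored++;                       // scorer did full work for this doc
      if (filter.get(doc)) hits++;    // filter applied up high
    }
    scoredOut[0] = scored;
    return hits;
  }

  // "low": filter applied at the postings level, like deleted docs.
  static int countHitsLow(int[] postingsDocs, BitSet filter, int[] scoredOut) {
    int hits = 0, scored = 0;
    for (int doc : postingsDocs) {
      if (!filter.get(doc)) continue; // rejected before scoring
      scored++;                       // only surviving docs are scored
      hits++;
    }
    scoredOut[0] = scored;
    return hits;
  }

  public static void main(String[] args) {
    int[] docs = {1, 3, 4, 7, 8, 10};
    BitSet filter = new BitSet();
    filter.set(3);
    filter.set(8);                    // a sparse filter (2 of 6 docs)

    int[] scoredHigh = new int[1], scoredLow = new int[1];
    int high = countHitsHigh(docs, filter, scoredHigh);
    int low = countHitsLow(docs, filter, scoredLow);
    System.out.println("hits " + high + "/" + low
        + ", scored high=" + scoredHigh[0] + " low=" + scoredLow[0]);
    // prints hits 2/2, scored high=6 low=2
  }
}
```

Both methods return the same hits; the difference the benchmark measures is how much scoring work each avoids, which is why "low" helps most when the filter is dense enough to reject many docs cheaply.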
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.