[
https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670452#action_12670452
]
Michael McCandless commented on LUCENE-1536:
--------------------------------------------
It's a ridiculous amount of data to digest, but here are some initial
observations/thoughts:
* There are very sizable gains here by switching to random-access
low (ie, handling top-level filter the way we now handle deletes).
I'm especially interested in gains in the slowest queries. EG the
phrase query "united states" sees QPS gains of 16% to 69%. The
10-clause OR query 1-10 sees QPS gains of 8% to 130%.
* Results are consistent with LUCENE-1476: random-access low gives
the best performance when filter density is >= 1%.
* High is worse than trunk up until ~25% density, which makes sense
since we are asking Scorer to do a lot of work producing docIDs
that we then nix with the filter.
* Low is consistently better than high, though as filter density
gets higher the gap between them narrows. I'll drop high from
future tests.
* The gains are generally strongest in the "moderate" density range,
5-25%.
* The degenerate 0% case is clearly far, far worse than the iterator
approach, which is expected since the iterator scans the bits,
finds none set, and quickly ends the search. For very low density
filters we should continue to use the iterator.
* The "control" 100% case (where filter is null) is about the same,
which is expected.
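
A minimal sketch of how the density observations above could drive the
choice between the two approaches. This is not Lucene API: the class,
enum, and method names are hypothetical, java.util.BitSet stands in for
the filter's bits, and the 1% cutoff is simply the ">= 1%" figure from
the results, not a tuned constant.

```java
import java.util.BitSet;

// Hypothetical helper, not part of Lucene.
final class FilterStrategy {
    enum Mode { ITERATOR, RANDOM_ACCESS }

    // Assumption: cutoff taken from the ">= 1%" observation in the results.
    static final double RANDOM_ACCESS_MIN_DENSITY = 0.01;

    // Fraction of documents in [0, maxDoc) that the filter accepts.
    static double density(BitSet filter, int maxDoc) {
        if (maxDoc <= 0) {
            return 0.0;
        }
        return (double) filter.cardinality() / maxDoc;
    }

    // Sparse filters keep the iterator; denser ones use random access.
    static Mode choose(BitSet filter, int maxDoc) {
        return density(filter, maxDoc) >= RANDOM_ACCESS_MIN_DENSITY
                ? Mode.RANDOM_ACCESS
                : Mode.ITERATOR;
    }
}
```

Computing cardinality() costs a pass over the bits, so a real
implementation would likely cache the density per filter rather than
recompute it per query.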
> if a filter can support random access API, we should use it
> -----------------------------------------------------------
>
> Key: LUCENE-1536
> URL: https://issues.apache.org/jira/browse/LUCENE-1536
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.4
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Attachments: LUCENE-1536.patch
>
>
> I ran some performance tests, comparing applying a filter via
> random-access API instead of current trunk's iterator API.
> This was inspired by LUCENE-1476, where we realized deletions should
> really be implemented just like a filter, but then in testing found
> that switching deletions to iterator was a very sizable performance
> hit.
> Some notes on the test:
> * Index is first 2M docs of Wikipedia. Test machine is Mac OS X
> 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
> * I test across multiple queries. 1-X means an OR query, eg 1-4
> means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
> AND 3 AND 4. "u s" means "united states" (phrase search).
> * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
> 95, 98, 99, 99.99999 (filter is non-null but all bits are set),
> 100 (filter=null, control)).
> * Method high means I use random-access filter API in
> IndexSearcher's main loop. Method low means I use random-access
> filter API down in SegmentTermDocs (just like deleted docs
> today).
> * Baseline (QPS) is current trunk, where the filter is applied as an
> iterator up "high" (ie in IndexSearcher's search loop).
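
To make the iterator vs random-access distinction in the notes above
concrete, here is a simplified sketch. It does not use Lucene classes:
a sorted int[] stands in for the docIDs a Scorer would produce,
java.util.BitSet stands in for the filter, and the class and method
names are hypothetical. It only illustrates the two access patterns,
not where in the search stack ("high" or "low") the check happens.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical illustration, not Lucene code.
final class FilterApplication {

    // Iterator style: leapfrog between two sorted docID streams.
    // postings must be strictly increasing and non-negative.
    static List<Integer> viaIterator(int[] postings, BitSet filter) {
        List<Integer> hits = new ArrayList<>();
        int i = 0;
        int f = filter.nextSetBit(0); // -1 if no bit is set
        while (i < postings.length && f >= 0) {
            int doc = postings[i];
            if (doc == f) {
                hits.add(doc);
                i++;
                f = filter.nextSetBit(f + 1);
            } else if (doc < f) {
                i++; // query side is behind; advance it
            } else {
                f = filter.nextSetBit(doc); // filter side is behind; skip ahead
            }
        }
        return hits;
    }

    // Random-access style: test each candidate docID against the bits,
    // the same way a deleted-docs check works.
    static List<Integer> viaRandomAccess(int[] postings, BitSet filter) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : postings) {
            if (filter.get(doc)) {
                hits.add(doc);
            }
        }
        return hits;
    }
}
```

With an empty filter, viaIterator exits before touching any posting,
while viaRandomAccess still visits every docID, which matches the
behaviour reported for the 0% density case.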