[
https://issues.apache.org/jira/browse/LUCENE-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596546#action_12596546
]
Mark Harwood commented on LUCENE-1187:
--------------------------------------
Paul,
Good work.
Just tried the patch and ran some pre and post-patch benchmarks.
I wanted to measure the overhead of:
the new OpenBitSetDISI.inPlaceOr(DocIdSetIterator)
vs
the previous scheme of BitSet.or(BitSet).
My test was on the biggest index I have here, which was 3 million Wikipedia
docs. I had 2 cached TermsFilters on very popular terms (500k docs in each) and
was measuring the cost of combining these as 2 "shoulds" in a BooleanFilter.
The expectation was that the new scheme would add some overhead in extra method
calls.
The average cost of iterating across BooleanFilter.getDocIdSet() was:
old BitSet scheme: 78 milliseconds
new DISI scheme: 156 milliseconds.
To address this I tried adding this optimisation into BooleanFilter...
  DocIdSet dis = ((Filter) shouldFilters.get(i)).getDocIdSet(reader);
  if (dis instanceof OpenBitSet)
  {
    res.or((OpenBitSet) dis); // go-faster method
  }
  else
  {
    res.inPlaceOr(getDISI(shouldFilters, i, reader)); // your patch code
  }
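The idea behind the optimisation above can be sketched without Lucene. This is only an illustration under stated assumptions: java.util.BitSet stands in for OpenBitSet, and the names DocIds, BitSetDocIds and orInto are hypothetical, not part of the patch or of any Lucene API. When the source set is bitset-backed, the OR runs word by word in bulk; otherwise it falls back to setting one bit per iterated doc id, which is where the extra method calls come from.

```java
import java.util.BitSet;
import java.util.PrimitiveIterator;
import java.util.stream.IntStream;

public class OrFastPathSketch {

    /** Hypothetical stand-in for a DocIdSet: enumerates doc ids in increasing order. */
    interface DocIds {
        PrimitiveIterator.OfInt iterator();
    }

    /** Hypothetical stand-in for an OpenBitSet-backed DocIdSet. */
    static final class BitSetDocIds implements DocIds {
        final BitSet bits;

        BitSetDocIds(BitSet bits) {
            this.bits = bits;
        }

        @Override
        public PrimitiveIterator.OfInt iterator() {
            return bits.stream().iterator();
        }
    }

    /**
     * ORs src into res. Returns true when the bulk (word-wise) path was taken,
     * false when the per-document iterator path was used.
     */
    static boolean orInto(BitSet res, DocIds src) {
        if (src instanceof BitSetDocIds) {
            // Fast path: one bulk OR over the underlying words.
            res.or(((BitSetDocIds) src).bits);
            return true;
        }
        // Generic path: one iterator call and one set() per matching doc.
        PrimitiveIterator.OfInt it = src.iterator();
        while (it.hasNext()) {
            res.set(it.nextInt());
        }
        return false;
    }

    public static void main(String[] args) {
        BitSet res = new BitSet();
        res.set(1);
        res.set(5);

        BitSet other = new BitSet();
        other.set(5);
        other.set(7);

        orInto(res, new BitSetDocIds(other));          // bulk path
        orInto(res, () -> IntStream.of(2, 9).iterator()); // iterator path

        System.out.println(res); // prints {1, 2, 5, 7, 9}
    }
}
```

Both paths produce the same result set; only the cost per document differs, which is what the timings here are measuring.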
Before I could benchmark this I had to amend TermsFilter to use OpenBitSet
rather than plain old BitSet.
avg speed of your patch with OpenBitSet-enabled TermsFilter: 100 milliseconds
avg speed of your patch with OpenBitSet-enabled TermsFilter and above
optimisation: 70 milliseconds
I'll try and post a proper patch when I get more time to look at this...
Cheers,
Mark
> Things to be done now that Filter is independent from BitSet
> ------------------------------------------------------------
>
> Key: LUCENE-1187
> URL: https://issues.apache.org/jira/browse/LUCENE-1187
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/*, Search
> Reporter: Paul Elschot
> Assignee: Michael Busch
> Priority: Minor
> Attachments: BooleanFilter20080325.patch,
> ChainedFilterAndCachingFilterTest.patch, Contrib20080325.patch,
> Contrib20080326.patch, Contrib20080427.patch, javadocsZero2Match.patch,
> OpenBitSetDISI-20080322.patch
>
>
> (Aside: where is the documentation on how to mark up text in jira comments?)
> The following things are left over after LUCENE-584 :
> For Lucene 3.0 Filter.bits() will have to be removed.
> There is a CHECKME in IndexSearcher about using ConjunctionScorer to have the
> boolean behaviour of a Filter.
> I have not looked into Filter caching yet, but I suppose there will be some
> room for improvement there.
> Iirc the current core has moved to use OpenBitSetFilter and that is probably
> what is being cached.
> In some cases it might be better to cache a SortedVIntList instead.
> Boolean logic on DocIdSetIterator is already available for Scorers (that
> inherit from DocIdSetIterator) in the search package. This is currently
> implemented by ConjunctionScorer, DisjunctionSumScorer,
> ReqOptSumScorer and ReqExclScorer.
> Boolean logic on BitSets is available in contrib/misc and contrib/queries.
> DisjunctionSumScorer calls score() on its subscorers before the score value
> is actually needed.
> This could be a reason to introduce a DisjunctionDocIdSetIterator, perhaps as
> a superclass of DisjunctionSumScorer.
> To fully implement non scoring queries a TermDocIdSetIterator will be needed,
> perhaps as a superclass of TermScorer.
> The javadocs in org.apache.lucene.search using matching vs non-zero score:
> I'll investigate this soon, and provide a patch when necessary.
> An early version of the patches of LUCENE-584 contained a class Matcher,
> that differs from the current DocIdSet in that Matcher has an explain()
> method.
> It remains to be seen whether such a Matcher could be useful between
> DocIdSet and Scorer.
> The semantics of scorer.skipTo(scorer.doc()) were discussed briefly.
> This was also discussed in another issue recently, so perhaps it is worthwhile
> to open a separate issue for this.
> Skipping on a SortedVIntList is done using linear search, this could be
> improved by adding multilevel skiplist info much like in the Lucene index for
> documents containing a term.
> One comment by me of 3 Dec 2008:
> A few complete (test) classes are deprecated, it might be good to add the
> target release for removal there.