[ https://issues.apache.org/jira/browse/LUCENE-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596546#action_12596546 ]
Mark Harwood commented on LUCENE-1187:
--------------------------------------

Paul,

Good work. I just tried the patch and ran some pre- and post-patch
benchmarks.

I wanted to measure the overhead of the new
OpenBitSetDISI.inPlaceOr(DocIdSetIterator) versus the previous scheme of
BitSet.or(BitSet). My test was on the biggest index I have here, which is
3 million Wikipedia docs. I had 2 cached TermsFilters on very popular terms
(500k docs in each) and was measuring the cost of combining these as 2
"shoulds" in a BooleanFilter. The expectation was that the new scheme would
add some overhead in extra method calls.

The average cost of iterating across BooleanFilter.getDocIdSet() was:

  old BitSet scheme:  78 milliseconds
  new DISI scheme:   156 milliseconds

To address this I tried adding this optimisation into BooleanFilter...

    DocIdSet dis = ((Filter) shouldFilters.get(i)).getDocIdSet(reader);
    if (dis instanceof OpenBitSet)
    {
        res.or((OpenBitSet) dis); // go-faster method
    }
    else
    {
        res.inPlaceOr(getDISI(shouldFilters, i, reader)); // your patch code
    }

Before I could benchmark this I had to amend TermsFilter to use OpenBitSet
rather than plain old BitSet.

  avg speed of your patch with OpenBitSet-enabled TermsFilter:
      100 milliseconds
  avg speed of your patch with OpenBitSet-enabled TermsFilter and the
  above optimisation:
      70 milliseconds

I'll try and post a proper patch when I get more time to look at this...

Cheers,
Mark

> Things to be done now that Filter is independent from BitSet
> ------------------------------------------------------------
>
>          Key: LUCENE-1187
>          URL: https://issues.apache.org/jira/browse/LUCENE-1187
>      Project: Lucene - Java
>   Issue Type: Improvement
>   Components: contrib/*, Search
>     Reporter: Paul Elschot
>     Assignee: Michael Busch
>     Priority: Minor
>  Attachments: BooleanFilter20080325.patch,
>               ChainedFilterAndCachingFilterTest.patch,
>               Contrib20080325.patch, Contrib20080326.patch,
>               Contrib20080427.patch, javadocsZero2Match.patch,
>               OpenBitSetDISI-20080322.patch
>
>
> (Aside: where is the documentation on how to mark up text in jira
> comments?)
> The following things are left over after LUCENE-584:
> For Lucene 3.0 Filter.bits() will have to be removed.
> There is a CHECKME in IndexSearcher about using ConjunctionScorer to have
> the boolean behaviour of a Filter.
> I have not looked into Filter caching yet, but I suppose there will be
> some room for improvement there.
> Iirc the current core has moved to use OpenBitSetFilter and that is
> probably what is being cached.
> In some cases it might be better to cache a SortedVIntList instead.
> Boolean logic on DocIdSetIterator is already available for Scorers (that
> inherit from DocIdSetIterator) in the search package. This is currently
> implemented by ConjunctionScorer, DisjunctionSumScorer,
> ReqOptSumScorer and ReqExclScorer.
> Boolean logic on BitSets is available in contrib/misc and contrib/queries.
> DisjunctionSumScorer calls score() on its subscorers before the score
> value is actually needed.
> This could be a reason to introduce a DisjunctionDocIdSetIterator, perhaps
> as a superclass of DisjunctionSumScorer.
> To fully implement non scoring queries a TermDocIdSetIterator will be
> needed, perhaps as a superclass of TermScorer.
> The javadocs in org.apache.lucene.search using matching vs non-zero score:
> I'll investigate this soon, and provide a patch when necessary.
> An early version of the patches of LUCENE-584 contained a class Matcher,
> that differs from the current DocIdSet in that Matcher has an explain()
> method.
> It remains to be seen whether such a Matcher could be useful between
> DocIdSet and Scorer.
> The semantics of scorer.skipTo(scorer.doc()) was discussed briefly.
> This was also discussed at another issue recently, so perhaps it is
> worthwhile to open a separate issue for this.
> Skipping on a SortedVIntList is done using linear search, this could be
> improved by adding multilevel skiplist info much like in the Lucene index
> for documents containing a term.
> One comment by me of 3 Dec 2008:
> A few complete (test) classes are deprecated, it might be good to add the
> target release for removal there.
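
The fast path proposed in the comment above (use a word-wise OR when the
DocIdSet is already a bit set, otherwise fall back to the iterator) can be
illustrated without Lucene on the classpath. What follows is only a minimal
sketch under stated assumptions: DocSet, BitDocSet, SortedIntDocSet and
orInto are hypothetical stand-ins, not Lucene classes, and java.util.BitSet
plays the role that OpenBitSet plays in the comment. It shows the shape of
the optimisation, not the actual patch.

```java
import java.util.BitSet;
import java.util.PrimitiveIterator;
import java.util.stream.IntStream;

class ShouldFilterOr {

    /** Hypothetical stand-in for a DocIdSet: iterates its doc ids in increasing order. */
    interface DocSet {
        PrimitiveIterator.OfInt iterator();
    }

    /** Stand-in for a bit-set-backed DocIdSet (java.util.BitSet instead of OpenBitSet). */
    static final class BitDocSet implements DocSet {
        final BitSet bits;
        BitDocSet(BitSet bits) { this.bits = bits; }
        public PrimitiveIterator.OfInt iterator() { return bits.stream().iterator(); }
    }

    /** Stand-in for a DocIdSet that is not a bit set, e.g. a sorted list of doc ids. */
    static final class SortedIntDocSet implements DocSet {
        final int[] docs;
        SortedIntDocSet(int... docs) { this.docs = docs; }
        public PrimitiveIterator.OfInt iterator() { return IntStream.of(docs).iterator(); }
    }

    /** ORs one "should" clause into result, taking the cheaper path when it is available. */
    static void orInto(BitSet result, DocSet clause) {
        if (clause instanceof BitDocSet) {
            // Fast path: word-wise OR, 64 doc ids per operation.
            result.or(((BitDocSet) clause).bits);
        } else {
            // General path: one iterator call and one set() per matching doc.
            PrimitiveIterator.OfInt it = clause.iterator();
            while (it.hasNext()) {
                result.set(it.nextInt());
            }
        }
    }

    public static void main(String[] args) {
        BitSet popular = new BitSet();
        popular.set(0, 5); // docs 0..4

        BitSet result = new BitSet();
        orInto(result, new BitDocSet(popular));        // takes the fast path
        orInto(result, new SortedIntDocSet(3, 7, 9));  // takes the iterator path
        System.out.println(result); // prints {0, 1, 2, 3, 4, 7, 9}
    }
}
```

Both paths produce the same result set; the difference is only in how many
calls are needed per clause, which is the overhead the benchmark figures in
the comment are measuring.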