[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746017#action_12746017 ]

Michael Busch commented on LUCENE-584:
--------------------------------------

Mark, are you working on this? Wanna assign this to you?

Decouple Filter from BitSet
---------------------------

Key: LUCENE-584
URL: https://issues.apache.org/jira/browse/LUCENE-584
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Affects Versions: 2.1
Reporter: Peter Schäfer
Assignee: Michael Busch
Priority: Minor
Fix For: 2.9
Attachments: bench-diff.txt, bench-diff.txt, CHANGES.txt.patch, ContribQueries20080111.patch, lucene-584-take2.patch, lucene-584-take3-part1.patch, lucene-584-take3-part2.patch, lucene-584-take4-part1.patch, lucene-584-take4-part2.patch, lucene-584-take5-part1.patch, lucene-584-take5-part2.patch, lucene-584.patch, Matcher-20070905-2default.patch, Matcher-20070905-3core.patch, Matcher-20071122-1ground.patch, Some Matchers.zip, Test20080111.patch

{code}
package org.apache.lucene.search;

import java.io.IOException;
import org.apache.lucene.index.IndexReader;

public abstract class Filter implements java.io.Serializable {
  public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
}

public interface AbstractBitSet {
  public boolean get(int index);
}
{code}

It would be useful if the method =Filter.bits()= returned an abstract interface, instead of =java.util.BitSet=.

Use case: there is a very large index, and, depending on the user's privileges, only a small portion of the index is actually visible. Sparsely populated =java.util.BitSet=s are not efficient and waste lots of memory. It would be desirable to have an alternative BitSet implementation with smaller memory footprint.

Though it _is_ possible to derive classes from =java.util.BitSet=, it was obviously not designed for that purpose. That's why I propose to use an interface instead. The default implementation could still delegate to =java.util.BitSet=.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
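The use case in the issue description (a huge index of which only a few documents are visible) can be illustrated with a minimal sketch. Only the =AbstractBitSet= interface comes from the proposal; the two implementing classes and the demo class are hypothetical names invented here, and nothing below is Lucene code.

```java
import java.util.Arrays;
import java.util.BitSet;

// The interface as proposed in the issue description.
interface AbstractBitSet {
    boolean get(int index);
}

// Hypothetical sparse implementation: keeps only the set indexes, sorted.
final class SortedIntArrayBitSet implements AbstractBitSet {
    private final int[] sortedDocs;

    SortedIntArrayBitSet(int... docs) {
        int[] copy = docs.clone();
        Arrays.sort(copy);
        this.sortedDocs = copy;
    }

    @Override
    public boolean get(int index) {
        // Binary search: O(log n) lookups, memory proportional to the set bits only.
        return Arrays.binarySearch(sortedDocs, index) >= 0;
    }
}

// Hypothetical default implementation that delegates to java.util.BitSet.
final class JavaUtilBitSetAdapter implements AbstractBitSet {
    private final BitSet bits;

    JavaUtilBitSetAdapter(BitSet bits) {
        this.bits = bits;
    }

    @Override
    public boolean get(int index) {
        return bits.get(index);
    }
}

class SparseBitSetDemo {
    public static void main(String[] args) {
        // Three visible documents in an index of roughly 90 million documents.
        AbstractBitSet visible = new SortedIntArrayBitSet(90_000_000, 7, 123_456);
        System.out.println(visible.get(7));
        System.out.println(visible.get(8));
    }
}
```

In this sketch the sparse set stores three ints, whereas a =java.util.BitSet= with bit 90,000,000 set has to allocate on the order of 11 MB. The trade-off is that lookups become logarithmic rather than constant time.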
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643767#action_12643767 ]

Paul Elschot commented on LUCENE-584:
-------------------------------------

Wouter, about this:

{{java.lang.ClassCastException: java.util.BitSet cannot be cast to org.apache.lucene.search.DocIdSet}}

LUCENE-1187 should have fixed this, so could you file a bug report? In case you need a workaround, also have a look at LUCENE-1296.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643495#action_12643495 ]

Wouter Heijke commented on LUCENE-584:
--------------------------------------

We got the same error here on a 15Gb index with Lucene 2.4.0:

java.lang.ClassCastException: java.util.BitSet cannot be cast to org.apache.lucene.search.DocIdSet
  org.apache.lucene.search.CachingWrapperFilter.getDocIdSet(CachingWrapperFilter.java:76)
  org.apache.lucene.misc.ChainedFilter.getDocIdSet(ChainedFilter.java:200)
  org.apache.lucene.misc.ChainedFilter.getDocIdSet(ChainedFilter.java:145)
  org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:140)
  org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:112)
  org.apache.lucene.search.Searcher.search(Searcher.java:136)
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577326#action_12577326 ]

Paul Elschot commented on LUCENE-584:
-------------------------------------

From the traceback I suppose this happened at the end, using the ChainedFilter?

IIRC ChainedFilter is from contrib/..., and it is mentioned at LUCENE-1187 as one of the things to be done. Could you contribute this code as a contrib/... test case there?

Sorry, I don't remember exactly which contrib module ChainedFilter is from.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577207#action_12577207 ]

Mark Miller commented on LUCENE-584:
------------------------------------

I think there is still an issue here. The code below just broke for me.

java.lang.ClassCastException: org.apache.lucene.util.OpenBitSet cannot be cast to java.util.BitSet
  at org.apache.lucene.search.CachingWrapperFilter.bits(CachingWrapperFilter.java:55)
  at org.apache.lucene.misc.ChainedFilter.bits(ChainedFilter.java:177)
  at org.apache.lucene.misc.ChainedFilter.bits(ChainedFilter.java:152)
  at org.apache.lucene.search.Filter.getDocIdSet(Filter.java:49)

{code}
public void testChainedCachedQueryFilter() throws IOException, ParseException {
  String path = "c:/TestIndex";
  Analyzer analyzer = new WhitespaceAnalyzer();
  IndexWriter writer = new IndexWriter(path, analyzer, true);

  // Three "red" documents and two "blue" ones.
  Document doc = new Document();
  doc.add(new Field("category", "red", Store.YES, Index.TOKENIZED));
  doc.add(new Field("content", "the big bad fox", Store.NO, Index.TOKENIZED));
  writer.addDocument(doc);

  doc = new Document();
  doc.add(new Field("category", "red", Store.YES, Index.TOKENIZED));
  doc.add(new Field("content", "the big bad pig", Store.NO, Index.TOKENIZED));
  writer.addDocument(doc);

  doc = new Document();
  doc.add(new Field("category", "red", Store.YES, Index.TOKENIZED));
  doc.add(new Field("content", "the horrific girl", Store.NO, Index.TOKENIZED));
  writer.addDocument(doc);

  doc = new Document();
  doc.add(new Field("category", "blue", Store.YES, Index.TOKENIZED));
  doc.add(new Field("content", "the dirty boy", Store.NO, Index.TOKENIZED));
  writer.addDocument(doc);

  doc = new Document();
  doc.add(new Field("category", "blue", Store.YES, Index.TOKENIZED));
  doc.add(new Field("content", "the careful bad fox", Store.NO, Index.TOKENIZED));
  writer.addDocument(doc);
  writer.addDocument(doc);

  Searcher searcher = null;
  searcher = new IndexSearcher(path);

  QueryParser qp = new QueryParser("field", new KeywordAnalyzer());
  Query query = qp.parse("content:fox");
  QueryWrapperFilter queryFilter = new QueryWrapperFilter(query);
  CachingWrapperFilter cwf = new CachingWrapperFilter(queryFilter);

  TopDocs hits = searcher.search(query, cwf, 1);
  System.out.println("hits: " + hits.totalHits);

  // Chain the cached "fox" filter with a cached "red" filter.
  queryFilter = new QueryWrapperFilter(qp.parse("category:red"));
  CachingWrapperFilter fcwf = new CachingWrapperFilter(queryFilter);
  Filter[] chain = new Filter[2];
  chain[0] = cwf;
  chain[1] = fcwf;
  ChainedFilter cf = new ChainedFilter(chain, ChainedFilter.AND);

  hits = searcher.search(new MatchAllDocsQuery(), cf, 1);
  System.out.println("red: " + hits.totalHits);

  // Chain the cached "fox" filter with a cached "blue" filter.
  queryFilter = new QueryWrapperFilter(qp.parse("category:blue"));
  CachingWrapperFilter fbcwf = new CachingWrapperFilter(queryFilter);
  chain = new Filter[2];
  chain[0] = cwf;
  chain[1] = fbcwf;
  cf = new ChainedFilter(chain, ChainedFilter.AND);

  hits = searcher.search(new MatchAllDocsQuery(), cf, 1);
  System.out.println("blue: " + hits.totalHits);
}
{code}
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12564878#action_12564878 ]

Michael Busch commented on LUCENE-584:
--------------------------------------

Thanks, Paul, for testing and reviewing. I'll correct the javadocs.

OK, I will commit this tomorrow if nobody objects!
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12564875#action_12564875 ]

Paul Elschot commented on LUCENE-584:
-------------------------------------

The take5 patch tests ok here.

One very minor remark: the javadoc at RangeFilter.getDocIdSet still mentions BitSet.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12564415#action_12564415 ]

Mark Harwood commented on LUCENE-584:
-------------------------------------

Hi Paul,
Just eyeballed the code but not had a chance to patch and run it yet.

I was wondering about the return type for skipTo() after looking at these types of calls:

  if (docIdSetIterator.skipTo(i) && (docIdSetIterator.doc() == i))

You could save a method invocation in cases like this if skipTo() returned the next doc id rather than a boolean. Returning a -1 would be the equivalent of what used to be false.

Not tried benchmarking it but does this seem like something worth considering?

Cheers
Mark
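The two skipTo() styles compared in this comment can be put side by side in a minimal sketch. It assumes a hypothetical iterator over a sorted array of doc ids; the class and method names (=SortedDocIterator=, =skipToDoc=, =SkipToDemo=) are invented for illustration and are not part of any patch attached to the issue.

```java
// Hypothetical iterator over an ascending array of doc ids, offering both styles.
final class SortedDocIterator {
    private final int[] docs;
    private int pos = -1;

    SortedDocIterator(int... sortedDocs) {
        this.docs = sortedDocs;
    }

    // Boolean style: advance at least once, stop at the first doc >= target.
    boolean skipTo(int target) {
        do {
            pos++;
        } while (pos < docs.length && docs[pos] < target);
        return pos < docs.length;
    }

    // Only valid after skipTo() returned true.
    int doc() {
        return docs[pos];
    }

    // Suggested variant: return the doc id reached, or -1 when exhausted.
    int skipToDoc(int target) {
        return skipTo(target) ? docs[pos] : -1;
    }
}

final class SkipToDemo {
    // Membership test with the boolean style: two calls on the iterator.
    static boolean containsBooleanStyle(int[] docs, int i) {
        SortedDocIterator it = new SortedDocIterator(docs);
        return it.skipTo(i) && it.doc() == i;
    }

    // Membership test with the int style: one call, compared against i.
    // Doc ids are non-negative, so the -1 sentinel can never equal i.
    static boolean containsIntStyle(int[] docs, int i) {
        return new SortedDocIterator(docs).skipToDoc(i) == i;
    }
}
```

Whether the single-call form is actually faster depends on inlining and on how often the caller needs the doc id right away, so this sketch only shows the shape of the two call sites, not a performance result.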
Re: [jira] Commented: (LUCENE-584) Decouple Filter from BitSet
On Jan 31, 2008, at 9:29 AM, Mark Harwood (JIRA) wrote:

> You could save a method invocation in cases like this if skipTo()
> returned the next doc id rather than a boolean. Returning a -1 would be
> the equivalent of what used to be false.
>
> Not tried benchmarking it but does this seem like something worth
> considering?

A contributor to KinoSearch persuaded me to have document numbers begin at 1 for this reason.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: [jira] Commented: (LUCENE-584) Decouple Filter from BitSet
On Thursday 31 January 2008 18:29:12, Mark Harwood (JIRA) wrote:

> [ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12564415#action_12564415 ]
>
> Mark Harwood commented on LUCENE-584:
>
> Hi Paul,
> Just eyeballed the code but not had a chance to patch and run it yet.
>
> I was wondering about the return type for skipTo() after looking at
> these types of calls:
>
>   if (docIdSetIterator.skipTo(i) && (docIdSetIterator.doc() == i))
>
> You could save a method invocation in cases like this if skipTo()
> returned the next doc id rather than a boolean. Returning a -1 would be
> the equivalent of what used to be false.
>
> Not tried benchmarking it but does this seem like something worth
> considering?
>
> Cheers
> Mark

Performance is always worth consideration, but this is another issue.

Returning -1 is not without cost either: it's a constant that needs to be loaded on the called side and tested against on the calling side. A returned boolean may have to be loaded and can be tested directly, so with good inlining I'd expect it to be faster in the normal case in which the document number is not needed immediately.

The code shown is likely from an explain() method, and not from a next() or skipTo() implementation, and then it's not the normal case.

Less (using a boolean) is more (performance) in this case, I think, but benchmarking may show something else.

This skipTo() is also Scorer.skipTo(), so a change there could have an even bigger impact than a change in Filter. Have a look at the size of the take4 patch at LUCENE-584 before trying to change skipTo() at home :)

Regards,
Paul Elschot
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12564620#action_12564620 ]

Michael Busch commented on LUCENE-584:
--------------------------------------

{quote}
You could save a method invocation in cases like this if skipTo() returned the next doc id rather than a boolean. Returning a -1 would be the equivalent of what used to be false.
{quote}

To change the signature of skipTo() would be an API change, because with this patch Scorer extends DocIdSetIterator.

-Michael
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558997#action_12558997 ]

Eks Dev commented on LUCENE-584:
--------------------------------

Michael, would this work?

1. Provide a default implementation for the basic methods that uses the skipping iterator (which is always there), so it works by default for *all* implementations. Something along the lines of:

  /**
   * A DocIdSet contains a set of doc ids. Implementing classes must provide
   * a {@link DocIdSetIterator} to access the set.
   */
  public abstract class DocIdSet {
    public abstract DocIdSetIterator iterator();

    public DocIdSet and(DocIdSet dis) {
      // default implementation using *iterator*
    }

    public DocIdSet or(DocIdSet dis) {
      // default implementation using iterator
    }
  }

2. And then we *optimize* particular cases, e.g.:

  public class DocIdBitSet extends DocIdSet {
    BitSet bits; // Must be there in order for iterator to work

    public DocIdSetIterator iterator() {
      // this is easy...
    }

    public DocIdSet and(DocIdSet dis) {
      if (dis instanceof DocIdBitSet) {
        // not exactly like this, but the idea is there
        this.bits.and(((DocIdBitSet) dis).bits);
        return this;
      }
      return super.and(dis);
    }
  }

So it always works, and it works fast if need be; one instanceof check does not hurt there. Did I miss something obvious?
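The idea in this comment (a generic iterator-based default plus a fast path when both sides are bit sets) can be made concrete in a self-contained sketch. All class names below carry a "Sketch" suffix to mark them as stand-ins invented here; the iterator is reduced to next()/doc(), and the fast path copies the bits instead of mutating them in place, which is a choice made for this example and not something the comment prescribes.

```java
import java.util.BitSet;

// Minimal iterator stand-in: next() advances, doc() returns the current id.
interface DocIterSketch {
    boolean next();
    int doc();
}

abstract class DocIdSetSketch {
    abstract DocIterSketch iterator();

    // Default conjunction: merge two ascending iterators; works for any implementation.
    DocIdSetSketch and(DocIdSetSketch other) {
        BitSet result = new BitSet();
        DocIterSketch a = iterator();
        DocIterSketch b = other.iterator();
        boolean hasA = a.next();
        boolean hasB = b.next();
        while (hasA && hasB) {
            if (a.doc() == b.doc()) {
                result.set(a.doc());
                hasA = a.next();
                hasB = b.next();
            } else if (a.doc() < b.doc()) {
                hasA = a.next();
            } else {
                hasB = b.next();
            }
        }
        return new BitSetDocIdSetSketch(result);
    }
}

// Bit-set-backed set with an optimized conjunction.
final class BitSetDocIdSetSketch extends DocIdSetSketch {
    final BitSet bits;

    BitSetDocIdSetSketch(BitSet bits) {
        this.bits = bits;
    }

    @Override
    DocIterSketch iterator() {
        return new DocIterSketch() {
            private int doc = -1;
            public boolean next() {
                doc = bits.nextSetBit(doc + 1);
                return doc >= 0;
            }
            public int doc() {
                return doc;
            }
        };
    }

    @Override
    DocIdSetSketch and(DocIdSetSketch other) {
        if (other instanceof BitSetDocIdSetSketch) {
            // Fast path: word-wise AND on a copy, so the operands stay unchanged.
            BitSet copy = (BitSet) bits.clone();
            copy.and(((BitSetDocIdSetSketch) other).bits);
            return new BitSetDocIdSetSketch(copy);
        }
        return super.and(other);
    }
}

// A set that is not bit-set-backed, to exercise the default path.
final class SortedArrayDocIdSetSketch extends DocIdSetSketch {
    private final int[] docs;

    SortedArrayDocIdSetSketch(int... sortedDocs) {
        this.docs = sortedDocs;
    }

    @Override
    DocIterSketch iterator() {
        return new DocIterSketch() {
            private int pos = -1;
            public boolean next() {
                return ++pos < docs.length;
            }
            public int doc() {
                return docs[pos];
            }
        };
    }
}
```

Both paths produce the same result here; the difference is only that the fast path never touches the iterators.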
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559168#action_12559168 ]

Paul Elschot commented on LUCENE-584:
-------------------------------------

I indeed recall having a problem with remote filter caching. At the time I thought it was related to serialization but I could not resolve it that way. Never mind, it does not matter anymore.

BooleanFilter and ChainedFilter have the same issue here. As they provide just about the same functionality, could they perhaps be merged?

The solution using DocIdSet.and() and DocIdSet.or() looks good to me, but it will require some form of collector for the results, much like HitCollector.collect(doc, score) now and MatchCollector.collect(doc) in the Matcher...patch. The boolean operations could then be accumulated into a BitSet or into an OpenBitSet, using a special case for DocId(Open)BitSet.

I'd like these boolean operations on DocIdSets to be general enough for use in Scorers, for example for the conjunctions in ConjunctionScorer, PhraseScorer and in the two NearSpans. But that is another issue.
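The collector idea mentioned in this comment can be sketched as a callback that receives each matching doc id, so the caller decides where results accumulate. The interface and class names below (=DocCollectorSketch=, =ConjunctionSketch=) are hypothetical and only loosely modeled on the collect(doc) signature named above; the inputs are plain sorted arrays to keep the example self-contained.

```java
import java.util.BitSet;

// Hypothetical callback, loosely modeled on a collect(doc) style collector.
interface DocCollectorSketch {
    void collect(int doc);
}

final class ConjunctionSketch {
    // Intersects two ascending doc id arrays and reports each common doc to the collector.
    static void and(int[] a, int[] b, DocCollectorSketch collector) {
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                collector.collect(a[i]);
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++;
            } else {
                j++;
            }
        }
    }

    // One possible collector: accumulate the results into a java.util.BitSet.
    static BitSet andToBitSet(int[] a, int[] b) {
        final BitSet result = new BitSet();
        and(a, b, new DocCollectorSketch() {
            public void collect(int doc) {
                result.set(doc);
            }
        });
        return result;
    }
}
```

Because the conjunction only talks to the callback, the same loop could feed a different bit set implementation or a simple counter without being changed.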
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558856#action_12558856 ]

Michael Busch commented on LUCENE-584:

I think I understand now which problems you had when you wanted to change BooleanFilter and xml-query-parser to use the new Filter APIs.

BooleanFilter is optimized to use BitSets for fast boolean operations. If we change BooleanFilter to use the new DocIdSetIterator, then it can't use the fast BitSet operations (e.g. union for or, intersect for and) anymore.

We could introduce BitSetFilter, as you suggested and as I did in the take4 patch. But here's the problem: introducing subclasses of Filter doesn't play nicely with the caching mechanism in Lucene. For example, if we change BooleanFilter to only work with BitSetFilters, then it won't work with a CachingWrapperFilter anymore, because CachingWrapperFilter extends Filter. We would then have to introduce a new CachingWrapper***Filter for each kind of Filter subclass, which is a bad design, as Mark pointed out in his comment: https://issues.apache.org/jira/browse/LUCENE-584?focusedCommentId=12547901#action_12547901

One solution would be to add a getBitSet() method to DocIdBitSet. DocIdBitSet is a new class that is basically just a wrapper around a Java BitSet and provides a DocIdSetIterator to access the BitSet. Then BooleanFilter could do something like this:

{code:java}
DocIdSet docIdSet = filter.getDocIdSet();
if (docIdSet instanceof DocIdBitSet) {
  BitSet bits = ((DocIdBitSet) docIdSet).getBitSet();
  ... // existing code
} else {
  throw new UnsupportedOperationException("BooleanFilter only supports Filters that use DocIdBitSet.");
}
{code}

But then, changing the core filters to use OpenBitSets instead of Java BitSets is technically an API change, because BooleanFilter would not work anymore with the core filters. So if we took this approach we would have to wait until 3.0 to move the core from BitSet to OpenBitSet and also change BooleanFilter then to use OpenBitSets. BooleanFilter could then also work with either of the two BitSet implementations, but probably not with those two mixed.

Any feedback about this is very welcome. I'll try to further think about how to marry the new Filter API, caching mechanism and Filter implementations like BooleanFilter nicely.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557935#action_12557935 ]

Michael Busch commented on LUCENE-584:

{quote}
As for PrefixGenerator: in my (up to date) trunk directory, this command:
find . -name '*PrefixGenerator*'
only gave this result:
./build/classes/java/org/apache/lucene/search/PrefixGenerator.class
and that disappeared after ant clean. It seems that the source class was removed from the trunk.
{quote}

As I said, PrefixGenerator is defined in PrefixFilter.java.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557955#action_12557955 ]

Paul Elschot commented on LUCENE-584:

I'm sorry about my PrefixGenerator remarks, I did not read your answer accurately.

On the take4 patch of 11 Jan 2008:

I started from a fresh trunk checkout that passed all tests. Both parts of take4 apply cleanly, using patch -p0 ... . ant jar, ant test-core and ant test-contrib all pass nicely.

I remember having problems with moving contrib/xml-queryparser from Filter to BitSetFilter, see my comment of 30 July 2007. So I'd like to verify that this can be done, and I hope Mark Harwood can give some hints as to how to do this.

For me, this was the main reason to make this move: from Filter with subclass BitSetFilter (as in the take4 patch, and in my first attempts) to MatchFilter with subclass Filter (as in the Matcher... patches of Sep and Nov 2007). In these Matcher... patches no changes were necessary to contrib/xml-queryparser.

Less important for now:

The test classes extend TestCase, but iirc there is also a LuceneTestCase for this.

On the take4 patch, ant javadocs-core gives this:
BitSetFilter.java:40: warning - Tag @link: reference not found: DocIdBitSetIterator
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557999#action_12557999 ]

Eks Dev commented on LUCENE-584:

It looks like ChainedFilter could become obsolete if Filter/DocIdSetIterator gets added as a Clause to the BooleanQuery?

I am thinking along these lines: ChainedFilter evaluates a boolean expression of doc ids, and that is exactly what BooleanQuery does, plus a bit more (scoring)...
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558076#action_12558076 ]

Paul Elschot commented on LUCENE-584:

{quote}
it looks like ChainedFilter could become obsolete if Filter/DocIdSetIterator gets added as a Clause to the BooleanQuery?
{quote}

The function is indeed the same, but ChainedFilter works directly on BitSets, whereas BooleanQuery works on input Scorers/DocIdSetIterators and outputs collected docids (and score values). Working directly on (Open)BitSets is normally faster, so ChainedFilter can have a good use. Also, boolean operations on DocIdSets are not (yet) directly available in Lucene. The various boolean scorers have the logic, but currently only for Scorers.

That leaves the question of what to do with ChainedFilter here. Any ideas? The easiest way is to open another issue for it. This will have to be resolved before Filter.bits() is removed.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558115#action_12558115 ]

Eks Dev commented on LUCENE-584:

Hmm, in order to have fast and/or operations we need to know the type of the underlying object in Filter, and sometimes we must use iterators (e.g. the case where one Filter/DocIdSet is an int list and another a hash bit set). I guess knowing the type of DocIdSet is the trick to pull.

The default implementation of ChainedFilter (there is also BooleanFilter somewhere in contrib, I like it more) should use iterators (like scorers), and at a few key points check

if (first instanceof SomeInstanceOfDocIdSet && second instanceof SomeInstanceOfDocIdSet) first.doFastOR/AND(second);

Something in that direction looks reasonable to me for ChainedFilter, if it proves to be really better to have it around.

I am still of the opinion that it would be better to integrate DocIdSet into BooleanQuery as a clause, somehow. That would be some kind of ConstantBoolean(MUST/SHOULD/NOT)Clause, much cleaner from a design/usability point of view, even at some minor penalty in performance (anyhow, you can always combine filters before you enter scorers). But you are right, that is another issue... let us stop polluting this issue :)
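The type check with an iterator fallback suggested above might be sketched as follows. This is an illustration under assumptions, not code from any patch: SketchDocSet, SketchBitDocSet, SketchIntListDocSet and SketchChainedOps are hypothetical names, and the sketch uses Java 8 streams for brevity, which the Lucene code base of the time could not.

```java
import java.util.BitSet;
import java.util.PrimitiveIterator;
import java.util.stream.IntStream;

/** Hypothetical minimal doc id set; every implementation can at least iterate. */
interface SketchDocSet {
    PrimitiveIterator.OfInt iterator();
}

/** Doc id set backed by a java.util.BitSet, eligible for the fast path. */
final class SketchBitDocSet implements SketchDocSet {
    final BitSet bits;
    SketchBitDocSet(BitSet bits) { this.bits = bits; }
    public PrimitiveIterator.OfInt iterator() { return bits.stream().iterator(); }
}

/** Doc id set backed by a sorted int list; only reachable through its iterator. */
final class SketchIntListDocSet implements SketchDocSet {
    private final int[] sortedDocs;
    SketchIntListDocSet(int... sortedDocs) { this.sortedDocs = sortedDocs; }
    public PrimitiveIterator.OfInt iterator() { return IntStream.of(sortedDocs).iterator(); }
}

final class SketchChainedOps {
    static BitSet or(SketchDocSet first, SketchDocSet second) {
        if (first instanceof SketchBitDocSet && second instanceof SketchBitDocSet) {
            // Fast path: word-wise OR on a copy, leaving the inputs untouched.
            BitSet result = (BitSet) ((SketchBitDocSet) first).bits.clone();
            result.or(((SketchBitDocSet) second).bits);
            return result;
        }
        // Fallback: materialize both sides through their iterators.
        BitSet result = drain(first);
        result.or(drain(second));
        return result;
    }

    static BitSet and(SketchDocSet first, SketchDocSet second) {
        if (first instanceof SketchBitDocSet && second instanceof SketchBitDocSet) {
            BitSet result = (BitSet) ((SketchBitDocSet) first).bits.clone();
            result.and(((SketchBitDocSet) second).bits);
            return result;
        }
        BitSet result = drain(first);
        result.and(drain(second));
        return result;
    }

    /** Copies all doc ids of a set into a fresh BitSet via its iterator. */
    private static BitSet drain(SketchDocSet set) {
        BitSet bits = new BitSet();
        for (PrimitiveIterator.OfInt it = set.iterator(); it.hasNext(); ) {
            bits.set(it.nextInt());
        }
        return bits;
    }
}
```

The fallback here materializes both inputs, which is simple but not memory-friendly for sparse sets; a merge over the two iterators would avoid that at the cost of more code.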
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557595#action_12557595 ]

Paul Elschot commented on LUCENE-584:

On the take3 patch of 10 Jan 2008:

SortedVIntList extends DocIdSet: nice, thanks.

PrefixGenerator is used but not defined in the patch, so it will not compile. Nevertheless, with all tests passing, I think this is a good way to make Filter independent of BitSet.

Minor concerns:

There is neither a BitSetFilter nor an OpenBitSetFilter in the patch. These might be useful for existing code currently implementing Filter to overcome the deprecation of Filter.bits(). With the current core moving to OpenBitSet, it might also use an explicit OpenBitSetFilter.

Some javadoc changes did not make it into the take3 patch, I'll check these later.

FilteredQuery.explain(): When a document does not pass the Filter, I think it would be better not to use setValue(0.0f) on the resulting Explanation. However, this may be necessary for backward compatibility.

For the future:

About adding a Filter as a clause to BooleanScorer, and adding DocIdSetIterator as a Scorer to ConjunctionScorer: this is the reason for the CHECKME in IndexSearcher for using ConjunctionScorer when a filter is given. A ConjunctionScorer that accepts a DocIdSetIterator could also be used in FilteredQuery.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557755#action_12557755 ]

Michael Busch commented on LUCENE-584:

{quote}
On the take3 patch of 10 Jan 2008:
{quote}

Thanks for the review!

{quote}
PrefixGenerator is used but not defined in the patch, so it will not compile.
{quote}

Not sure I understand what you mean. PrefixGenerator is (and was) defined in PrefixFilter.java. It compiles for me.

{quote}
There is neither a BitSetFilter nor an OpenBitSetFilter in the patch. These might be useful for existing code currently implementing Filter to overcome the deprecation of Filter.bits(). With the current core moving to OpenBitSet, it might also use an explicit OpenBitSetFilter.
{quote}

I think that it should be straightforward for users having filters that use BitSets to wrap the new DocIdBitSet around the BitSet, just as Filter currently does for backwards compatibility?

{quote}
Some javadoc changes did not make it into the take3 patch, I'll check these later.
{quote}

Oh, which ones?

{quote}
FilteredQuery.explain(): When a document does not pass the Filter I think it would be better not to use setValue(0.0f) on the resulting Explanation. However, this may be necessary for backward compatibility.
{quote}

Yeah, it used to work this way, that's why I didn't change it, for backwards-compatibility reasons.

{quote}
About adding a Filter as a clause to BooleanScorer, and adding DocSetIdIterator as a Scorer to ConjunctionScorer: This is the reason for the CHECKME in IndexSearcher for using ConjunctionScorer when a filter is given. A ConjunctionScorer that accepts a DocIdSetIterator could also be used in FilteredQuery.
{quote}

Well, let's address this with a different issue after this one is committed. I might have some concerns here, but I have to think about it further.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557814#action_12557814 ]

Paul Elschot commented on LUCENE-584:

As for PrefixGenerator: in my (up to date) trunk directory, this command:
find . -name '*PrefixGenerator*'
only gave this result:
./build/classes/java/org/apache/lucene/search/PrefixGenerator.class
and that disappeared after ant clean. It seems that the source class was removed from the trunk.

{quote}
I think that it should be straightforward for users having filters that use BitSets to wrap the new DocIdBitSet around the BitSet, just as Filter currently does for backwards compatibility?
{quote}

BitSetFilter would inherit from Filter, and have an abstract bits() method that is not deprecated. This would be useful for people that don't want to move to OpenBitSet yet. A rename (and maybe a package change) from Filter to BitSetFilter should be sufficient in their code to get rid of the deprecation warning for Filter.bits().

OpenBitSetFilter would be similar, and it could be used in a few places in the patch, iirc.

The javadoc changes I meant came with Matcher and use 'match' consistently for documents that are collected during a query search.
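To make the two options in this exchange concrete, here is a rough sketch of a BitSetFilter that keeps an abstract bits() method and implements the new-style accessor by wrapping the BitSet. All Stub* types and EvenDocsFilter are hypothetical stand-ins written for this illustration; they only mimic the shape of the classes discussed (IndexReader, DocIdSet, DocIdSetIterator, DocIdBitSet, Filter) and are not taken from the patches.

```java
import java.util.BitSet;

/** Stand-in for IndexReader; only maxDoc() is modelled. */
final class StubReader {
    private final int maxDoc;
    StubReader(int maxDoc) { this.maxDoc = maxDoc; }
    int maxDoc() { return maxDoc; }
}

/** Stand-in for the iterator type proposed in this issue. */
abstract class StubDocIdSetIterator {
    abstract boolean next();
    abstract int doc();
}

abstract class StubDocIdSet {
    abstract StubDocIdSetIterator iterator();
}

/** Wraps a java.util.BitSet, as DocIdBitSet is described in the thread. */
final class StubDocIdBitSet extends StubDocIdSet {
    private final BitSet bits;
    StubDocIdBitSet(BitSet bits) { this.bits = bits; }
    BitSet getBitSet() { return bits; }

    @Override StubDocIdSetIterator iterator() {
        return new StubDocIdSetIterator() {
            private int doc = -1;
            private int upcoming = bits.nextSetBit(0); // -1 when no bit is left

            @Override boolean next() {
                if (upcoming < 0) return false;
                doc = upcoming;
                upcoming = bits.nextSetBit(doc + 1);
                return true;
            }

            @Override int doc() { return doc; }
        };
    }
}

/** New-style filter: exposes a doc id set instead of a BitSet. */
abstract class StubFilter {
    abstract StubDocIdSet getDocIdSet(StubReader reader);
}

/** Existing BitSet-based filters would extend this and keep their bits() method. */
abstract class StubBitSetFilter extends StubFilter {
    abstract BitSet bits(StubReader reader);

    @Override final StubDocIdSet getDocIdSet(StubReader reader) {
        return new StubDocIdBitSet(bits(reader));
    }
}

/** Example legacy filter: matches every even doc id. */
final class EvenDocsFilter extends StubBitSetFilter {
    @Override BitSet bits(StubReader reader) {
        BitSet b = new BitSet(reader.maxDoc());
        for (int i = 0; i < reader.maxDoc(); i += 2) b.set(i);
        return b;
    }
}
```

Under this arrangement a legacy filter only changes its superclass, while callers that know about StubDocIdBitSet can still reach the raw BitSet through getBitSet().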
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548208 ]

Mark Harwood commented on LUCENE-584:

For the data structures (bitset/openbitset/sorted VintList/) I would suggest one of these: IntSet, IntegerSet or IntegerSequence as names for the common interface. I did a quick Google for IntegerSet and you are in the number one spot, Paul :) [http://www.google.com/search?hl=en&q=integerset+bitset]

{code}
// A cachable, immutable, sorted, threadsafe collection of ints.
interface IntegerSet {
  IntegerSetIterator getIterator();
  int size(); // negative numbers could be used to represent estimates?
}

// A single-use thread-unsafe iterator.
interface IntegerSetIterator {
  boolean next();
  boolean skipTo(int next);
  int currentValue();
}
{code}

If _detailed_ explanations of hits are required, these should really sit with the source, not the result, i.e. with the Filters. They contain all the match criteria used to populate IntegerSets and can be thought of more generically as IntegerSetBuilder.

{code}
// Contains criteria to create a set of matching documents.
// MUST implement hashcode and equals based on this criteria to enable use as cache keys for IntegerSets.
interface IntegerSetBuilder extends Serializable {
  IntegerSet build(IndexReader reader);
  Explanation explain(int docNr);
}
{code}

A single CachingIntegerSetBuilder class would be able to take ANY IntegerSetBuilder as a source, cache ANY type of IntegerSet it produced, and defer back to the original IntegerSetBuilder for a full and thorough explanation of a match, even when the match occurred on a cached IntegerSet, if required.

{code}
class CachingIntegerSetBuilder implements IntegerSetBuilder {
  private WeakHashMap perIndexReaderCache;
  public CachingIntegerSetBuilder(IntegerSetBuilder delegate) {}
  ...
}
{code}

The reason for introducing IntegerSetBuilder as a more generic name than Filter is that IntegerSets have uses outside of filtering, e.g. to do category counts or clustering. In these use cases they don't actually perform anything to do with filtering. It may actually be better named DocIdSetBuilder, given that it is tied to Lucene's IndexReader and therefore limited to producing sets of document ids.
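One way the single caching wrapper described above could be filled in is sketched below. It is only an assumption about how the elided body might work: the Sketch* names are hypothetical, the type parameter R stands in for IndexReader so the example stays self-contained, and the explain() delegation from the proposal is left out.

```java
import java.util.Map;
import java.util.WeakHashMap;

/** Hypothetical stand-in for the cachable, immutable set of doc ids. */
interface SketchIntegerSet {
    int size();
}

/** Hypothetical builder; R stands in for Lucene's IndexReader. */
interface SketchIntegerSetBuilder<R> {
    SketchIntegerSet build(R reader);
}

/** Caches whatever kind of set the delegate produces, one entry per reader. */
final class SketchCachingBuilder<R> implements SketchIntegerSetBuilder<R> {
    private final SketchIntegerSetBuilder<R> delegate;
    // Weak keys: an entry can be dropped once its reader is no longer referenced.
    private final Map<R, SketchIntegerSet> perReaderCache = new WeakHashMap<>();

    SketchCachingBuilder(SketchIntegerSetBuilder<R> delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized SketchIntegerSet build(R reader) {
        SketchIntegerSet cached = perReaderCache.get(reader);
        if (cached == null) {
            cached = delegate.build(reader); // the expensive step, done once per reader
            perReaderCache.put(reader, cached);
        }
        return cached;
    }
}
```

Because the wrapper never inspects the concrete set type, the same class would serve bit sets, sorted int lists or any other implementation, which is the point of the proposal.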
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548341 ] Paul Elschot commented on LUCENE-584:

For the moment a DocId is an int, but that might change to long sooner than we think. So DocIdSet... would be a better name than IntegerSet..., and it's better to use an abstract superclass than an interface:

{code}
abstract class DocIdSetIterator {
  abstract boolean next();
  abstract boolean skipTo(int next);
  abstract int doc();
}

// and the rest is in the patch, except the superclass for Matcher:
abstract class Matcher extends DocIdSetIterator {
  abstract Explanation explain(int doc);
}

abstract class Scorer extends Matcher {
  abstract float score();
  // ...
}
{code}

Would this DocIdSetIterator be close enough to the IntegerSetIterator?
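As a rough illustration of the abstract superclass proposed above, here is a minimal sketch of one concrete iterator over a sorted int array. The class name SortedIntArrayIterator and the parameter name target are assumptions made for this example and do not come from any of the attached patches; skipTo() is assumed to always advance beyond the current position, as Scorer.skipTo() does.

```java
import java.util.Arrays;

// Abstract superclass with the three methods named in the comment above.
abstract class DocIdSetIterator {
    abstract boolean next();
    abstract boolean skipTo(int target);
    abstract int doc();
}

// Hypothetical implementation for illustration only; not part of any patch.
final class SortedIntArrayIterator extends DocIdSetIterator {
    private final int[] docs; // sorted ascending, no duplicates
    private int pos = -1;     // -1 means "not started yet"

    SortedIntArrayIterator(int[] sortedDocs) {
        this.docs = sortedDocs;
    }

    @Override
    boolean next() {
        if (pos < docs.length) {
            pos++;
        }
        return pos < docs.length;
    }

    @Override
    boolean skipTo(int target) {
        // Assumed semantics: move to the first entry beyond the current
        // position whose doc id is >= target.
        int from = Math.min(pos + 1, docs.length);
        int idx = Arrays.binarySearch(docs, from, docs.length, target);
        pos = idx >= 0 ? idx : -idx - 1;
        return pos < docs.length;
    }

    @Override
    int doc() {
        // Only valid after next() or skipTo() has returned true.
        return docs[pos];
    }
}
```

A caller would loop with `while (it.next()) { use(it.doc()); }` and call `skipTo()` whenever another clause has already moved past the current doc.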
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547901 ] Mark Harwood commented on LUCENE-584:

To go back to post #1 on this topic:

_Sparsely populated =java.util.BitSet=s are not efficient and waste lots of memory. It would be desirable to have an alternative BitSet implementation with smaller memory footprint._

Given the motivation to move to more memory-efficient structures, why is the only attempt at caching dedicated exclusively to caching the very structures we were trying to move away from?

_I deprecated also CachingWrapperFilter and RemoteCachingWrapperFilter and added corresponding CachingBitSetFilter and RemoteCachingBitSetFilter_

Does this suggest we are to have type-specific CachingXFilters and RemoteCachingXFilters created for every new filter type? Why not provide a single caching mechanism that works for all those other, new, more memory-efficient structures?

I believe the reason this hasn't been done is due to the issue I highlighted earlier: the cachable artefacts (what I chose to call DocIdSet here: [#action_12518642] ) are not modelled in a way which promotes re-use. That's why we would end up needing a specialised caching implementation for each type.

If we are to move forward from the existing Lucene implementation it's important to note the change:
* Filters currently produce, at great cost, BitSets. BitSets provide both a cachable data structure and a thread-safe, reusable means of iterating across the contents.
* By replacing BitSets with Matchers this proposal has removed an important aspect of the existing design: the visibility (and therefore cachability) of these expensive-to-recreate data structures. Matchers are single-use, non-threadsafe objects and hide the data structure over which they iterate.

With this change, if I want to implement a caching mechanism in my application I need to know the Filter type and what sort of data structure it returns, and get it from it directly:

  if (myFilter instanceof BitSetFilter)
    wrap specific data structure using CachingBitSetFilter
  else if (myFilter instanceof OpenBitSetFilter)
    wrap specific data structure using CachingXFilter
  else ...

...looks like an anti-pattern to me. Worse, this ties the choice of data structure to the type of Filter that produces it. Why can't my RangeFilter be free to create a SortedVIntList or a BitSet depending on the sparseness of matches for a particular set of criteria?

I'm not saying let's just stick with BitSets, just consider caching more in the design. Post [#action_12518642] lays out how this could be modelled with the introduction of DocIdSet and DocIdSetIterator as separate responsibilities (whereas Matcher currently combines them both).

Cheers
Mark
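The separation argued for above can be sketched roughly as follows, assuming a reusable DocIdSet that hands out single-use DocIdSetIterators. Only the names DocIdSet and DocIdSetIterator come from the thread; DocIdSetProducer (a stand-in for Filter, with the IndexReader reduced to an opaque key), CachingDocIdSetProducer, DocIdSets and the two concrete set classes are hypothetical names for this illustration. A plain sorted int array stands in for SortedVIntList.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Single-use, non-thread-safe cursor (skipTo omitted for brevity).
abstract class DocIdSetIterator {
    abstract boolean next();
    abstract int doc();
}

// The cacheable, reusable structure; every call returns a fresh iterator.
abstract class DocIdSet {
    abstract DocIdSetIterator iterator();
}

final class BitSetDocIdSet extends DocIdSet {
    private final BitSet bits;
    BitSetDocIdSet(BitSet bits) { this.bits = bits; }

    @Override
    DocIdSetIterator iterator() {
        return new DocIdSetIterator() {
            private int doc = -1;
            private boolean exhausted;
            @Override boolean next() {
                if (exhausted) return false;
                doc = bits.nextSetBit(doc + 1);
                exhausted = doc < 0;
                return !exhausted;
            }
            @Override int doc() { return doc; }
        };
    }
}

final class SortedIntArrayDocIdSet extends DocIdSet {
    private final int[] docs; // sorted ascending
    SortedIntArrayDocIdSet(int[] docs) { this.docs = docs; }

    @Override
    DocIdSetIterator iterator() {
        return new DocIdSetIterator() {
            private int pos = -1;
            @Override boolean next() {
                if (pos < docs.length) pos++;
                return pos < docs.length;
            }
            @Override int doc() { return docs[pos]; }
        };
    }
}

final class DocIdSets {
    // An int costs 32 bits and a BitSet costs about 1 bit per document,
    // so the array is smaller while matches * 32 < maxDoc.
    static DocIdSet of(int[] sortedDocs, int maxDoc) {
        if ((long) sortedDocs.length * 32 < maxDoc) {
            return new SortedIntArrayDocIdSet(sortedDocs);
        }
        BitSet bits = new BitSet(maxDoc);
        for (int d : sortedDocs) bits.set(d);
        return new BitSetDocIdSet(bits);
    }
}

// Hypothetical stand-in for Filter.
interface DocIdSetProducer {
    DocIdSet getDocIdSet(Object readerKey);
}

// One cache for every producer, whatever structure it returns.
final class CachingDocIdSetProducer implements DocIdSetProducer {
    private final DocIdSetProducer delegate;
    private final Map<Object, DocIdSet> cache = new HashMap<>();

    CachingDocIdSetProducer(DocIdSetProducer delegate) { this.delegate = delegate; }

    @Override
    public synchronized DocIdSet getDocIdSet(Object readerKey) {
        return cache.computeIfAbsent(readerKey, delegate::getDocIdSet);
    }
}
```

Because the cache only sees DocIdSet, no instanceof chain is needed, and a producer remains free to pick the representation per query. A real implementation would probably key the cache weakly on the reader; that detail is left out here.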
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547895 ] Paul Elschot commented on LUCENE-584:

A few remarks on the lucene-584-take2 patch:

In the @deprecated javadoc at Filter.bits() a reference to BitSetFilter could be added.

While Filter.bits() is still deprecated, one could also use the BitSet in IndexSearcher in case this turns out to be performance sensitive; see also my remark of 28 November.

A few complete (test) classes are deprecated; it might be good to add the target release for removal there.

For the rest this patch looks good to me. Did you also run ant test-contrib?
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547958 ] Paul Elschot commented on LUCENE-584:

Mark, in the latest Matcher-2default.patch there is the org.apache.lucene.MatcherProvider interface with this javadoc:

  /** To be used in a cache to implement caching for a MatchFilter. */

This interface has only one method:

  public Matcher getMatcher();

There is also a cache for filters in the Matcher3core.patch in the class CachingWrapperFilter.

Would those be a good starting point?
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547988 ] Mark Harwood commented on LUCENE-584:

I'm getting lost as to which patches we're considering here. I was looking at the lucene-584-take2 patch. MatcherProvider in the earlier patch does look like something that will help with caching.

Would those be a good starting point?

Overall I feel uncomfortable with a lot of the classnames. I think the use of Matcher says more about what you want to do with the class in this particular case rather than what _it_ does generally. I have other uses in mind for these classes that are outside of filtering search results.

For me, these classes can be thought of much more simply as utility classes in the same mould as the Java Collections API. Fundamentally, they are efficient implementations of sets/lists of integers with support for iterators. The whole thing would be a lot cleaner if classes were named around this scheme. MatcherProvider, for example, is essentially a DocIdSet which creates forms of DocIdSetIterators (Matchers) and could also usefully have a size() method.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548030 ] Paul Elschot commented on LUCENE-584:

In case there is a better name than Matcher for a Scorer without a score() method (and maybe without an explain() method), I'm all ears. Names are important, and at this point they can still be changed very easily.

For Matcher I'd rather have a method to estimate the number of matching docs than a size() method. This estimate would be useful in implementing conjunctions, as the Matchers with the lowest estimates could be used first. However, this is another issue.
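To make the remark about estimates concrete, here is a minimal sketch of a conjunction that orders its clauses by estimated match count and lets the sparsest one lead. The method name estimatedMatches() and the classes ArrayMatcher and Conjunction are assumptions for this illustration; nothing like them is confirmed to exist in the patches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

abstract class Matcher {
    abstract boolean next();
    abstract boolean skipTo(int target); // assumed to always advance
    abstract int doc();
    abstract int estimatedMatches();     // hypothetical estimate method
}

// Hypothetical array-backed matcher whose estimate is simply its length.
final class ArrayMatcher extends Matcher {
    private final int[] docs; // sorted ascending
    private int pos = -1;

    ArrayMatcher(int... sortedDocs) { this.docs = sortedDocs; }

    @Override boolean next() {
        if (pos < docs.length) pos++;
        return pos < docs.length;
    }

    @Override boolean skipTo(int target) {
        int from = Math.min(pos + 1, docs.length);
        int idx = Arrays.binarySearch(docs, from, docs.length, target);
        pos = idx >= 0 ? idx : -idx - 1;
        return pos < docs.length;
    }

    @Override int doc() { return docs[pos]; }
    @Override int estimatedMatches() { return docs.length; }
}

final class Conjunction {
    // Returns the doc ids matched by every matcher.
    static List<Integer> intersect(List<Matcher> matchers) {
        List<Integer> hits = new ArrayList<>();
        if (matchers.isEmpty()) return hits;

        // Lowest estimate first: the sparsest clause drives the loop.
        List<Matcher> ordered = new ArrayList<>(matchers);
        ordered.sort(Comparator.comparingInt(Matcher::estimatedMatches));

        // Position every matcher on its first doc.
        for (Matcher m : ordered) {
            if (!m.next()) return hits;
        }

        Matcher lead = ordered.get(0);
        while (true) {
            int candidate = lead.doc();
            boolean agreed = true;
            for (int i = 1; i < ordered.size(); i++) {
                Matcher m = ordered.get(i);
                if (m.doc() < candidate && !m.skipTo(candidate)) return hits;
                if (m.doc() > candidate) {
                    // m overshot: move the lead up to m's doc and retry.
                    if (!lead.skipTo(m.doc())) return hits;
                    agreed = false;
                    break;
                }
            }
            if (agreed) {
                hits.add(candidate);
                if (!lead.next()) return hits;
            }
        }
    }
}
```

With the sparsest matcher leading, the denser ones are only ever asked to skipTo() a small number of candidates, which is where the estimate pays off.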
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547612 ] Paul Elschot commented on LUCENE-584:

I tried implementing a Searchable, and indeed ran into compilation errors. So, backward compatibility is indeed not complete. Also, Searchable is an interface, so it should not be changed. In case there are other interfaces affected by the patch, these should not be changed either.

There are two ways out of this:

Do a name change on MatchFilter/Filter -> Filter/BitSetFilter. Changing the current Filter to BitSetFilter gives other problems with contrib packages. I tried this some time ago, see above, but I could not make it work.

I'd prefer to add an interface (or abstract class?) like Searchable that uses MatchFilter for those implementers that want to take advantage of MatchFilter. I don't expect problems from leaving the Searchable interface available unchanged. Other interfaces that use Filter can be treated the same way, in case there are any.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547681 ] Michael Busch commented on LUCENE-584:

Why do we actually need the new MatchFilter class at all? Filter is an abstract class, not an interface. So I think we could simply add the new method getMatcher() like you already did in your patch:

{code:java}
/**
 * @return A Matcher constructed from the provided BitSet.
 * @see DefaultMatcher#defaultMatcher(BitSet)
 */
public Matcher getMatcher(IndexReader reader) throws IOException {
  return new BitSetMatcher(bits(reader));
}
{code}

This shouldn't break existing Filter implementations? Maybe I'm missing an apparent reason why we need the MatchFilter class?
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547686 ] Paul Elschot commented on LUCENE-584:

For example, OpenBitSetFilter would look like this:

{code}
class OpenBitSetFilter /* ... */ {
  OpenBitSet bits(reader) { ... }
  Matcher getMatcher(reader) { ... }
}
{code}

Since the only thing needed by an IndexSearcher would be the Matcher, MatchFilter is what Filter and OpenBitSetFilter have in common: the getMatcher() method.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547689 ] Michael Busch commented on LUCENE-584:

What about adding the getMatcher() method to Filter and deprecating bits(IndexReader)? Then when we release 3.0 we can remove bits() and the only method in Filter will be getMatcher(). Then this patch should be backwards compatible and we'd do the API change with the next major release. Any objections?
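The deprecation path suggested above might look roughly like the following self-contained sketch. IndexReaderStub and EvenDocsFilter are hypothetical names introduced only so the example compiles without Lucene, and the Matcher and BitSetMatcher classes here are simplified stand-ins, not the ones from the attached patches.

```java
import java.io.IOException;
import java.util.BitSet;

// Hypothetical stand-in for org.apache.lucene.index.IndexReader.
final class IndexReaderStub {
    private final int maxDoc;
    IndexReaderStub(int maxDoc) { this.maxDoc = maxDoc; }
    int maxDoc() { return maxDoc; }
}

// Simplified stand-in: only the iteration part of a Matcher.
abstract class Matcher {
    abstract boolean next();
    abstract int doc();
}

final class BitSetMatcher extends Matcher {
    private final BitSet bits;
    private int doc = -1;
    private boolean exhausted;

    BitSetMatcher(BitSet bits) { this.bits = bits; }

    @Override boolean next() {
        if (exhausted) return false;
        doc = bits.nextSetBit(doc + 1);
        exhausted = doc < 0;
        return !exhausted;
    }

    @Override int doc() { return doc; }
}

abstract class Filter {
    /** @deprecated callers should use getMatcher() instead. */
    @Deprecated
    abstract BitSet bits(IndexReaderStub reader) throws IOException;

    // Default implementation: existing subclasses that only know about
    // bits() keep working without any source change.
    Matcher getMatcher(IndexReaderStub reader) throws IOException {
        return new BitSetMatcher(bits(reader));
    }
}

// A "legacy" filter written against the old API only.
final class EvenDocsFilter extends Filter {
    @Override
    BitSet bits(IndexReaderStub reader) {
        BitSet bits = new BitSet(reader.maxDoc());
        for (int d = 0; d < reader.maxDoc(); d += 2) bits.set(d);
        return bits;
    }
}
```

Search code would then call only getMatcher(); once bits() is removed in a later major release, getMatcher() could become the abstract method.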
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547691 ] Paul Elschot commented on LUCENE-584:

I had not thought about deprecating yet, but it should work nicely. I suppose you want to add a class BitSetFilter (subclass of Filter) as the preferred alternative to the deprecated method?

Initially Filter and BitSetFilter would be very similar, except that Filter.bits() would be deprecated. Later, after removal of Filter.bits(), Filter.getMatcher() would be declared abstract.

I tried to do something pretty close to this for contrib/xml-query-parser, but I could not make that work, which is why I changed to adding a new superclass MatchFilter. Nevertheless, I think the deprecation above should work, but at the moment I can't foresee the consequences.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547524 ] Michael Busch commented on LUCENE-584:

{quote}
The patch is backwards compatible,
{quote}

I think that custom Searcher or Searchable implementations won't compile anymore? Because the signature of some abstract methods changed, e.g. in Searchable:

{code:java}
@@ -86,13 +86,14 @@
    * <p>Called by {@link Hits}.
    *
    * <p>Applications should usually call {@link Searcher#search(Query)} or
-   * {@link Searcher#search(Query,Filter)} instead.
+   * {@link Searcher#search(Query,MatchFilter)} instead.
    * @throws BooleanQuery.TooManyClauses
    */
-  TopDocs search(Weight weight, Filter filter, int n) throws IOException;
+  TopDocs search(Weight weight, MatchFilter filter, int n) throws IOException;
{code}
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546166 ] Paul Elschot commented on LUCENE-584:

The patch is backwards compatible, except for current subclasses of Filter that already have a getMatcher method. The fact that no changes are needed to contrib confirms the compatibility.

I have made no performance tests on BitSetMatcher for two reasons. The first reason is that OpenBitSet is actually faster than BitSet (have a look at the graph in the SomeMatchers.zip file attachment by Eks Dev), so it seems to be better to go in that direction. The second is that it is easy to do the skipping in IndexSearcher on a BitSet directly by using nextSetBit on the BitSet instead of skipTo on the BitSetMatcher. For this it would only be necessary to check whether the given MatchFilter is a Filter. Anyway, I prefer to see where the real performance bottlenecks are before optimizing for performance.

DefaultMatcher should be in the ...2default... patch. The change in Hits to use MatchFilter should be in the ...3core.. patch.

So far, I never tried to use these patches on their own; I have only split them for a better overview. Splitting the combined patches to iterate would need a different split, as you found out. It might even be necessary to split within a single class, but I'll gladly do that.
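The idea of skipping directly on the BitSet can be sketched as below, using java.util.BitSet.nextSetBit() in place of a Matcher's skipTo(). The class BitSetSkipping and the int array standing in for the docs produced by a scorer are assumptions for this illustration, not code from IndexSearcher.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

final class BitSetSkipping {
    // Keeps the scored docs (sorted ascending) whose bit is set in the filter.
    static List<Integer> filterDocs(int[] scoredDocs, BitSet filterBits) {
        List<Integer> kept = new ArrayList<>();
        int i = 0;
        int filterDoc = filterBits.nextSetBit(0); // -1 when no bit is set
        while (i < scoredDocs.length && filterDoc >= 0) {
            int doc = scoredDocs[i];
            if (doc == filterDoc) {
                kept.add(doc);
                i++;
                filterDoc = filterBits.nextSetBit(doc + 1);
            } else if (doc < filterDoc) {
                // The scorer is behind; a real scorer would skipTo(filterDoc).
                i++;
            } else {
                // The filter is behind; jump straight to the scorer's doc.
                filterDoc = filterBits.nextSetBit(doc);
            }
        }
        return kept;
    }
}
```

The loop never wraps the BitSet in an iterator object, which is the shortcut described above; whether it matters in practice would need the benchmarks mentioned in the thread.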
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546393 ] Paul Elschot commented on LUCENE-584:

With the full patch applied, the following test cases use a BitSetMatcher:

TestQueryParser
TestComplexExplanations
TestComplexExplanationsOfNonMatches
TestConstantScoreRangeQuery
TestDateFilter
TestFilteredQuery
TestMultiSearcherRanking
TestPrefixFilter
TestRangeFilter
TestRemoteCachingWrapperFilter
TestRemoteSearchable
TestScorerPerf
TestSimpleExplanations
TestSimpleExplanationsOfNonMatches
TestSort

so I don't think it is necessary to provide separate test cases.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546420 ]

Michael Busch commented on LUCENE-584:

Yes, you're right. I ran the tests with code coverage analysis enabled, and the BitSetMatcher is fully covered. Good!
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546148 ]

Michael Busch commented on LUCENE-584:

{quote}
1. introduce Matcher as superclass of Scorer and adapt javadocs to use matching consistently.
2. introduce MatchFilter as superclass of Filter and add a minimal DefaultMatcher to be used in IndexSearcher, i.e. add BitSetMatcher
{quote}

Paul, I like the iterative plan you suggested. I started reviewing Matcher-20071122-1ground.patch and have some questions:

- Is the API fully backwards compatible?
- Did you run performance tests to check whether BitSetMatcher is slower than using a BitSet directly?
- With just the mentioned patch applied I get compile errors, because the DefaultMatcher is missing. Could you provide a patch that also includes the BitSetMatcher, with Filter#getMatcher() returning it?

Also, shouldn't the patch modify Hits.java to use MatchFilter instead of Filter? And a unit test that tests the BitSetMatcher would be nice!
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12529632 ]

Paul Elschot commented on LUCENE-584:

As the current patch set is large, I've been pondering how to do this in a series of smaller patches that can each be applied by itself. This is possible in the following way:

1. Introduce Matcher as superclass of Scorer and adapt javadocs to use matching consistently.
2. Introduce MatchFilter as superclass of Filter and add a minimal DefaultMatcher to be used in IndexSearcher, i.e. add BitSetMatcher.
3. Change the current Searcher/Searchable API to use MatchFilter instead of Filter.

Step 1 can reasonably be done before a new release. After step 2 this issue might be closed, and all the rest could be treated as new issues. After that, three (almost) independent paths can be followed:

4. Add more data structures to be used for filter caches.
5. Adapt CachingWrapperFilter to provide a Matcher from a cached data structure, for example SortedVIntList, BitSet or OpenBitSet.
6. Make further use of Matcher, mostly in BooleanScorer2.

My question is: shall I go ahead and provide a patch for step 1?

At the moment I'm refining BooleanScorer2 to use Matcher. This is for the case of multiple prohibited clauses, and also to allow the use of required and prohibited Matchers, so that filtering clauses can be added to BooleanQuery.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528539 ]

Paul Elschot commented on LUCENE-584:

The posted patch proposes to use this class to determine which documents should be filtered:

    public abstract class Matcher {
      public abstract boolean next() throws IOException;
      public abstract boolean skipTo(int target) throws IOException;
      public abstract int doc();
      // plus a few more methods
    }

This class is then used as a superclass of org.apache.lucene.search.Scorer.
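The Matcher contract quoted above can be illustrated with a small, self-contained Java sketch. It is not the BitSetMatcher from the patch: the class name SketchBitSetMatcher and its internals are assumptions made for illustration, and skipTo() is assumed to follow the Scorer convention of advancing to the first match beyond the current one whose doc id is at least the target.

```java
import java.io.IOException;
import java.util.BitSet;

// The three methods quoted in the comment; the real class has a few more.
abstract class Matcher {
    /** Advances to the next matching document; returns false when there is none. */
    public abstract boolean next() throws IOException;

    /** Advances to the first match beyond the current one with doc id >= target. */
    public abstract boolean skipTo(int target) throws IOException;

    /** Current doc id; only meaningful after next() or skipTo() returned true. */
    public abstract int doc();
}

// Hypothetical BitSet-backed implementation (an assumption, not the patch's code).
class SketchBitSetMatcher extends Matcher {
    private final BitSet bits;
    private int doc = -1;
    private boolean exhausted = false;

    SketchBitSetMatcher(BitSet bits) {
        this.bits = bits;
    }

    // Moves to the first set bit at or after 'from'; stays exhausted once it fails.
    private boolean advanceFrom(int from) {
        if (exhausted) {
            return false;
        }
        doc = bits.nextSetBit(from);
        exhausted = doc < 0;
        return !exhausted;
    }

    @Override
    public boolean next() {
        return advanceFrom(doc + 1);
    }

    @Override
    public boolean skipTo(int target) {
        // Never move backwards and never stay on the current document.
        return advanceFrom(Math.max(target, doc + 1));
    }

    @Override
    public int doc() {
        return doc;
    }
}
```

Because the iteration position lives inside the object, an instance like this is meant for one thread and one pass, which is the property the caching discussion in this thread turns on.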
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12522030 ]

Hoss Man commented on LUCENE-584:

I, unfortunately, haven't had the time to read through everything in the latest patches, but while catching up on my jira mail one of Paul's comments jumped out at me, so I wanted to make sure it's completely clear:

This latest set of patches completely breaks backwards compatibility for any clients who have Filter subclasses, or methods that take a Filter as a param, since the Filter class now has an abstract getMatcher method and no longer supports an abstract BitSet method -- presumably the expectation being that all client code should have a search/replace done from Filter => BitSetFilter.

Which raises the question: why not eliminate BitSetFilter and move its getMatcher impl to the Filter class? (If the concern is just that there be a higher level class in which both methods are abstract, why not insert a parent with some new name above the Filter class?)

For the record: it really bothers me that the old attachments got deleted ... the inability to refresh my memory by looking at the older patches and comparing them with the current patches is extremely frustrating.
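The backwards-compatible layout suggested in this comment (a new parent above Filter, with Filter keeping its BitSet method and deriving a matcher from it) might look roughly like the following Java sketch. Everything here is an assumption for illustration: ReaderStub stands in for IndexReader, DocMatcher is a reduced stand-in for the patch's Matcher, and the parent name MatchFilter is borrowed from the plan discussed elsewhere in this thread.

```java
import java.util.BitSet;

// Stand-in for Lucene's IndexReader so the sketch compiles on its own (assumption).
interface ReaderStub {
    int maxDoc();
}

// Reduced stand-in for the patch's Matcher: iteration only, no skipTo().
interface DocMatcher {
    boolean next();
    int doc();
}

// New parent class: the only abstract method is the matcher-based one.
abstract class MatchFilter {
    public abstract DocMatcher getMatcher(ReaderStub reader);
}

// The existing name keeps its abstract bits() method, so subclasses written
// against the old API still compile; getMatcher() is derived from bits().
abstract class Filter extends MatchFilter {
    public abstract BitSet bits(ReaderStub reader);

    @Override
    public DocMatcher getMatcher(ReaderStub reader) {
        final BitSet bits = bits(reader);
        return new DocMatcher() {
            private int doc = -1;
            private boolean exhausted = false;

            public boolean next() {
                if (exhausted) {
                    return false;
                }
                doc = bits.nextSetBit(doc + 1);
                exhausted = doc < 0;
                return !exhausted;
            }

            public int doc() {
                return doc;
            }
        };
    }
}
```

With this shape, code that only needs iteration can accept a MatchFilter, while existing Filter subclasses need no search/replace.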
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518642 ]

Mark Harwood commented on LUCENE-584:

Some further thoughts on the roles/responsibilities of the various components:

Given a blank sheet of paper (a luxury we may not have), the minimum requirements I would have could be met with the following. (Note that the words Matcher and Filter etc. have been removed because sets of doc IDs have applications outside of filtering/querying, e.g. category counts.)

    interface DocIdSetFactory {
      DocIdSet getDocIdSet(IndexReader reader);
    }

This is more or less equivalent to the purpose of the existing Filter: different implementations define their own selection criteria and produce a set of matching doc IDs, e.g. the equivalent of RangeFilter. Each implementation must implement hashCode and equals methods based on its criteria so the factory can be cached and reused (in the same way Query objects are expected to). The existing CachedFilterBuilder in the XMLQueryParser provides one example of a strategy for caching Filters using this facility.

    interface DocIdSet {
      DocIdSetIterator getIterator();
    }

This interface defines an immutable, threadsafe (and therefore cachable) collection of doc IDs. Different implementations provide space-efficient alternatives for sparse or heavily populated sets, e.g. BitSet, OpenBitSet, SortedVIntList. As an example caching strategy, the existing CachingWrapperFilter would cache these objects in a WeakHashMap keyed on IndexReader.

    interface DocIdSetIterator {
      boolean next();
      int getDoc();
      // etc.
    }

A thread-unsafe, single-use object (probably with only one implementation) that is used to iterate across any DocIdSet. Not cachable, and used by Scorers.

In the existing proposal it feels like DocIdSet and DocIdSetIterator are rolled into one in the form of the Matcher, which complicates/prevents caching strategies.

Cheers
Mark
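The split between a shareable set and a single-use iterator proposed in this comment can be sketched in self-contained Java as follows. The two interfaces use the method names from the comment; the implementations BitSetDocSet and SortedIntDocSet are illustrative assumptions (a plain sorted int array stands in for a compressed list such as SortedVIntList), and the factory is left out because it would need an IndexReader.

```java
import java.util.BitSet;

// Single-use, not threadsafe: holds the iteration position.
interface DocIdSetIterator {
    boolean next();
    int getDoc(); // only valid after next() returned true
}

// Immutable and therefore safe to cache and share between threads.
interface DocIdSet {
    DocIdSetIterator getIterator();
}

// Dense representation (illustrative implementation).
final class BitSetDocSet implements DocIdSet {
    private final BitSet bits;

    BitSetDocSet(BitSet bits) {
        this.bits = (BitSet) bits.clone(); // defensive copy keeps the set immutable
    }

    public DocIdSetIterator getIterator() {
        return new DocIdSetIterator() {
            private int doc = -1;
            private boolean done = false;

            public boolean next() {
                if (done) {
                    return false;
                }
                doc = bits.nextSetBit(doc + 1);
                done = doc < 0;
                return !done;
            }

            public int getDoc() {
                return doc;
            }
        };
    }
}

// Sparse representation: doc ids in ascending order (illustrative implementation).
final class SortedIntDocSet implements DocIdSet {
    private final int[] docs;

    SortedIntDocSet(int... sortedDocs) {
        this.docs = sortedDocs.clone();
    }

    public DocIdSetIterator getIterator() {
        return new DocIdSetIterator() {
            private int pos = -1;

            public boolean next() {
                if (pos < docs.length) {
                    pos++;
                }
                return pos < docs.length;
            }

            public int getDoc() {
                return docs[pos];
            }
        };
    }
}
```

Each call to getIterator() returns a fresh cursor, so a cached set can serve concurrent searches without the iterators interfering with each other.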
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518825 ]

Paul Elschot commented on LUCENE-584:

Mark,

I said: "there is never a threadsafety problem. (See BitSetMatcher.getMatcher() which uses a local class for the resulting Matcher.)"

That was a mistake. BitSetMatcher is a Matcher constructed from a BitSet, and SortedVIntList has a getMatcher() method, and I confused the two. A Matcher is intended to be used in a single thread, so I don't expect thread safety problems.

The problem for the XML parser is that with this patch, the implementing data structure of a Filter becomes inaccessible from the Filter class, so it cannot be cached from there. That means that some cached data structure will have to be chosen, and one way to do that is by using class BitSetFilter from the patch. This has a bits() method just like the current Filter class. CachingWrapperFilter could then become a cache for BitSetFilter.

There is indeed no caching of filters in this patch. The reason for that is that some Filters do not need a cache. For example:

    class TermFilter {
      private final Term term;
      TermFilter(Term t) { this.term = t; }
      Matcher getMatcher(IndexReader reader) throws IOException {
        return new TermMatcher(reader.termDocs(this.term));
      }
    }

TermMatcher does not exist (yet), but it could be easily introduced by leaving all the scoring out of the current TermScorer.

As for DocIdSet, as long as this provides a Matcher as an iterator, it can be used to implement a (caching) filter.

I don't think this patch complicates the implementation of caching strategies. For example one could define:

    class CachableFilter extends Filter {
      ... some methods to access the underlying data structure to be cached ...
    }

or write a similar adapter for some subclass of Filter and then write a FilterCache that caches these.

I did consider defining Matcher as an interface, but I preferred not to do that because of the default explain() method in the Matcher class of the patch.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518845 ]

Mark Harwood commented on LUCENE-584:

Hi Paul,

Not sure we've reached a common understanding here yet.

You said: "That was a mistake. BitSetMatcher is a Matcher constructed from a BitSet, and SortedVIntList has a getMatcher() method, and I confused the two."

OK, thanks for the clarification. I still feel uncomfortable because the method getMatcher() is not abstracted to a common interface. This was the thinking behind my getIterator method on DocIdSet.

I too made a mistake in my earlier comments. DocIdSetIterator does NOT have "probably one implementation". There would be an implementation for each different type of DocIdSet (BitSet/OpenBitSet/VIntList).

You said: "some Filters do not need a cache. For example: TermFilter."

I'm not sure why that has been singled out as not worthy of caching. I have certain terms (e.g. gender:male) where the TermDocs is very large (50% of all docs in the index!), so multiple calls to TermDocs for term gender:male (if that is what you are suggesting) are highly undesirable. These are typically handled in the XMLQueryParser using syntax like this:

    <CachedFilter>
      <TermsFilter fieldName="gender">male</TermsFilter>
    </CachedFilter>

You said: "CachingWrapperFilter could then become a cache for BitSetFilter."

This means that the only caching strategy is one based on bitsets. Does this not lose perhaps the main benefit of your whole proposal, the ability to have alternative space-efficient storage of sets of document ids, e.g. SortedVIntList? If this is undesirable (my guess is yes) then the proposal in my previous comment is a solution which allows for caching of any/all types of the new sets (OpenBitSet, BitSet, SortedVIntList etc.).

Regardless of my choice of class names or decisions over interfaces vs abstract classes, do you not at least agree on the need for 3 types of functionality:

1) A factory for instantiating sets of document ids matching a particular set of criteria (which can be costly to call). While the factory is not expected to implement a caching strategy, it is expected to implement hashCode/equals simply to aid any caching services, which would need this help to identify previously instantiated sets that share the same criteria as any new requests. (This service I identified as my DocIdSetFactory, and TermsFilter/RangeFilter would be example implementations.)

2) An object representing an instantiated set of document ids which can be cached and can create iterators for use in separate threads (identified as my DocIdSet, example implementations being called something like BitSetDocSet, SortedVIntList).

3) An iterator for a set of document ids (my DocIdSetIterator, example impls being called something like BitSetDocSetIterator, SortedVIntListIterator).

Each type of functionality can have different implementations, so the functionality must be defined using an interface or abstract class.

If we can agree this much as a set of responsibilities then we can begin to map these services onto something more concrete.

Cheers
Mark
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518858 ]

Paul Elschot commented on LUCENE-584:

Mark,

I think we are on the same line, it's just that I don't want to go that far now.

Have another look at the title of this issue. It may be in your title bar, but otherwise it's quite a bit of scrolling, so I'll repeat it here: "Decouple Filter from BitSet". That is the main thing that this patch tries to do. And that also makes it a starting point for caching of different data structures for Filters.

Caching of Filters is very much needed, but I'd rather see that as another issue. The DefaultMatcher class tries to do some compression by using a SortedVIntList when that is smaller than a BitSet, and that is about as far as I'd like to go now.

Proost,
Paul Elschot
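The size trade-off behind choosing a SortedVIntList over a BitSet can be made concrete with a rough estimate. The Java sketch below is not the DefaultMatcher logic from the patch, which is not shown in this thread. It assumes the sorted list stores the gaps between consecutive doc ids as variable-length integers (7 payload bits per byte) and that a bit set costs one bit per document up to maxDoc. The class and method names are hypothetical.

```java
// Hypothetical helper for comparing the two representations by size.
final class FilterSizeEstimate {
    private FilterSizeEstimate() {
    }

    /** Bytes needed for one non-negative value at 7 payload bits per byte. */
    static int vIntBytes(int value) {
        int n = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            n++;
        }
        return n;
    }

    /** Bytes needed to store ascending doc ids as VInt-encoded gaps. */
    static long deltaVIntBytes(int[] sortedDocs) {
        long total = 0;
        int prev = 0;
        for (int doc : sortedDocs) {
            total += vIntBytes(doc - prev);
            prev = doc;
        }
        return total;
    }

    /** Bytes needed for a bit set with one bit per document. */
    static long bitSetBytes(int maxDoc) {
        return (maxDoc + 7L) / 8;
    }

    /** True when the gap-encoded list is estimated to be smaller than the bit set. */
    static boolean preferSortedList(int[] sortedDocs, int maxDoc) {
        return deltaVIntBytes(sortedDocs) < bitSetBytes(maxDoc);
    }
}
```

Under these assumptions, three matching documents in an index of one million documents take a handful of bytes as a gap list versus 125,000 bytes as a bit set, while a filter matching every document is smaller as a bit set.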
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518868 ]

Mark Harwood commented on LUCENE-584:

OK, I appreciate caching may not be a top priority in this proposal, but I have live systems in production using XMLQueryParser which rely on the existing core facilities for caching. As it stands this proposal breaks this functionality (see the FIXME in contrib's CachedFilterBuilder and my concerns over the use of a non-threadsafe Matcher in the core class CachingWrapperFilter).

I am obviously concerned by this and keen to help shape a solution which preserves the existing capabilities while adding your new functionality. I'm not sure I share your view that support for caching can be treated as a separate issue to be dealt with at a later date. There are a large number of changes proposed in this patch, and if the design does not at least consider future caching issues now, I suspect much will have to be reworked later. The change I can envisage most clearly is expressed in my concern that the DocIdSet and DocIdSetIterator services I outlined are being combined in Matcher as it stands now, and these functions will have to be separated.

Cheers
Mark
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518569 ] Mark Harwood commented on LUCENE-584: - Hi Paul, Many thanks for your responses. Sorry for the delay in communications - just got back from 2 weeks holiday and slowly picking my way through this patch. You said: there is never a threadsafety problem. (See BitSetMatcher.getMatcher() which uses a local class for the resulting Matcher.) Did you mean BitSetFilter.getMatcher()? BitSetMatcher has no getMatcher method. If so, doesn't my original thread safety issue still stand? - CachingWrapperFilter is caching Matchers (not Filters which are factories for matchers). The existing approach of adding a CachedFilter tag around my XML-based query templates offers a major speed up in my applications and I don't see this supported in this patch currently which gives me some concern. This existing caching technique is based on the use of CachingWrapperFilter. The proposed framework seems to be missing a means of caching reusable, threadsafe Matchers in a type-independent fashion. One solution (which I think you may be suggesting with the getMatcher comment) is to cache Filter objects and use Filter.getMatcher(reader) as a factory method for thread-specific, single-use Matchers but this would suggest that any caching then becomes an implied responsibility/overhead of each Filter implementation. Not too great. CachingWrapperFilter is an example of a better design where the caching policy has been implemented in a single class and it can be used to decorate any Filter implementation (RangeFilter etc) with the required caching behaviour. Unfortunately with this proposed patch there is no way that any such single caching policy can work with any Filter because Matcher is not reusable/cachable. Time to remove any thread-specific state from Matcher? 
Cheers Mark
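One way to read the caching concern above is that a single decorator can cache the reusable document set per reader and hand out a fresh, single-use iterator on every call. The following is only a sketch under that assumption: `DocMatcher`, `MatcherFactory` and `CachingMatcherFactory` are hypothetical names, not classes from the patch, and a generic reader type `R` stands in for IndexReader.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

/** Hypothetical stand-in for the patch's Matcher: a stateful, single-use iterator. */
interface DocMatcher {
    /** Returns the next matching doc id, or -1 when exhausted. */
    int nextDoc();
}

/** Hypothetical stand-in for a Filter that acts as a factory for matchers. */
interface MatcherFactory<R> {
    DocMatcher getMatcher(R reader);
}

/** Caching policy in one class: caches the bit set per reader, never the iterator. */
final class CachingMatcherFactory<R> implements MatcherFactory<R> {
    private final Function<R, BitSet> compute;               // the expensive part
    private final Map<R, BitSet> cache = new WeakHashMap<>(); // keyed by reader
    int computations = 0;                                     // illustration only

    CachingMatcherFactory(Function<R, BitSet> compute) {
        this.compute = compute;
    }

    @Override
    public synchronized DocMatcher getMatcher(R reader) {
        BitSet bits = cache.get(reader);
        if (bits == null) {
            // Defensive copy so the cached set is never modified afterwards.
            bits = (BitSet) compute.apply(reader).clone();
            cache.put(reader, bits);
            computations++;
        }
        final BitSet shared = bits;
        return new DocMatcher() {
            private int cursor = 0; // per-iterator state, not shared between callers

            @Override
            public int nextDoc() {
                if (cursor < 0) {
                    return -1;
                }
                int doc = shared.nextSetBit(cursor);
                cursor = (doc < 0) ? -1 : doc + 1;
                return doc;
            }
        };
    }
}
```

In this sketch the only shared object is the cached BitSet, which is read-only after caching; the position (and the exhausted state) lives in the iterator that each caller receives.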
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516502 ] Paul Elschot commented on LUCENE-584: - Some more remarks on the 20070730 patches. To recap, this introduces Matcher as a superclass of Scorer to take the role that BitSet currently has in Filter. The total number of java files changed/added by these patches is 47, so some extra care will be needed. The following issues are still pending: What approach should be taken for the API change to Filter (see above, 2 comments up)? I'd like to get all test cases to pass again. TestRemoteCachingWrapperFilter still does not pass, and I don't know why. For xml-query-parser in contrib I'd like to know in which direction to proceed (see 1 comment up). Does it make sense to try and get the TestQueryTemplateManager test to pass again? The ..default.. patch has taken OpenBitSet and friends from solr to have a default implementation. However, I have not checked whether there is unused code in there, so some trimming may still be appropriate. Once these issues have been resolved far enough, I would recommend to introduce this shortly after a release so there is some time to let things settle. 
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516154 ] Paul Elschot commented on LUCENE-584: - Mark, An easy way to keep things like BooleanFilter working could be to introduce a subclass of Filter, say BitsFilter, that adds a bits(IndexReader) method. This class should also implement getMatcher(); the default implementation could be used for that initially. Then BooleanFilter could simply be a subclass of BitsFilter, possibly without further modifications, although I would prefer to rename it to BooleanBitsFilter. That would only involve some deprecation warnings in BitsFilter for the period that Filter.bits() is deprecated. I would not even mind cooking this up as a patch to contrib. Thoughts?
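The BitsFilter idea can be sketched roughly as follows. This is an illustration only, not code from the patches: `DocIterator` and `SketchFilter` are hypothetical stand-ins for Matcher and the new Filter API, and an `int maxDoc` parameter replaces IndexReader to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical stand-in for Matcher: forward-only iterator over doc ids. */
interface DocIterator {
    /** Returns the next doc id, or -1 when exhausted. */
    int next();
}

/** Hypothetical stand-in for a Filter that only promises a matcher. */
abstract class SketchFilter {
    abstract DocIterator getMatcher(int maxDoc);
}

/** Subclass that additionally exposes a BitSet and derives the matcher from it. */
abstract class BitsFilter extends SketchFilter {
    abstract BitSet bits(int maxDoc);

    @Override
    DocIterator getMatcher(int maxDoc) {
        final BitSet bits = bits(maxDoc);
        return new DocIterator() {
            private int cursor = 0;

            @Override
            public int next() {
                int doc = (cursor < 0) ? -1 : bits.nextSetBit(cursor);
                cursor = (doc < 0) ? -1 : doc + 1;
                return doc;
            }
        };
    }
}

/** Boolean combination that keeps working because its clauses are BitsFilters. */
final class BooleanBitsFilter extends BitsFilter {
    private final List<BitsFilter> must = new ArrayList<>();
    private final List<BitsFilter> mustNot = new ArrayList<>();

    BooleanBitsFilter must(BitsFilter f) { must.add(f); return this; }
    BooleanBitsFilter mustNot(BitsFilter f) { mustNot.add(f); return this; }

    @Override
    BitSet bits(int maxDoc) {
        BitSet result = new BitSet(maxDoc);
        result.set(0, maxDoc);                 // start with every document
        for (BitsFilter f : must) {
            result.and(f.bits(maxDoc));        // one call per required clause
        }
        for (BitsFilter f : mustNot) {
            result.andNot(f.bits(maxDoc));     // one call per prohibited clause
        }
        return result;
    }
}
```

The design choice illustrated here is that only filters which really need whole-set boolean operations pay for a BitSet; everything else can implement getMatcher() directly.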
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515630 ] Paul Elschot commented on LUCENE-584: - Mark, The exhausted flag is only in the iterator/Matcher, not in the underlying set data structure. One can use as many iterators as necessary, for example one per thread, and then there is never a threadsafety problem. (See BitSetMatcher.getMatcher() which uses a local class for the resulting Matcher.) You wrote: I use BooleanFilter a lot for security where many large sets are cached and combined on the fly - caching all the possible combinations as single bitsets would lead to too many possible combinations. That can still be done, but one needs to get to the BitSets for example by caching them outside the Filters and constructing the resulting BitSetMatcher for the combined Filter on the fly. An alternative would be to have a BooleanQuery.add(Matcher, Occur), where the occurrence can only be required or prohibited. Then there is no need to construct any resulting filter because the boolean logic will be executed during the search. This might even be more efficient than combining the full BitSets ahead of the search. And with many large BitSets cache memory savings from more compact implementations can also be helpful. 
Re: [jira] Commented: (LUCENE-584) Decouple Filter from BitSet
Mark Harwood commented on LUCENE-584:

Hi Mark, we used to use Filters a lot... and concluded that Matcher is great! It just takes some time to get your head around it; let me try to help you get there :)

> I saw BitSetMatcher etc and appreciate the motivation behind the design for alternative implementations. What concerns me with the Matcher API in general is that Matchers have non-threadsafe state (i.e. the current position required to support next()) and as such aren't safely cachable in the same way as BitSets. I see the searcher code uses the safer skipTo() rather than next() but there's still the if(exhausted) thread safety problem to worry about, which is why I raised points 1 and 4.

1. Caching issue: You do not want to cache a Matcher. It is just an iterator with a forward skipping possibility; why would one cache iterators? (It could be done by introducing rewind(), maybe not a bad idea?) What you really need to put in the cache is an object that implements the Matcher interface, or some object for which it is easy to provide a Matcher interface.

2. Thread safety issue: I did not get it, what scenario do you see here?

> Additionally, combining BitSets using Boolean logic is one method call whereas combining heterogeneous Matchers using Boolean logic requires iteration across them and therefore potentially many method calls (point 3).

3. Lucene core uses next() and skipTo() to combine Filter/Query today; there is no BitSet.and(BitSet) in Lucene core, and this is not going to change. If you need to combine bit sets, you can do it easily on classes that implement Matcher (imagine you have two OpenBitSets and they implement Matcher; nothing prevents you from OpenBitSet.and(OpenBitSet)-ing these implementing objects). You are not less flexible due to Matcher: you can do everything as before, you are just not bound to the memory hungry, slow BitSet ...

> I haven't benchmarked this but I imagine it to be significantly slower?

Sure, but you do not have to do your Filter arithmetic via Matcher, just do it directly on your implementing classes.

> I use BooleanFilter a lot for security where many large sets are cached and combined on the fly - caching all the possible combinations as single bitsets would lead to too many possible combinations.

You can freely keep something like BooleanFilter, even make it faster with OpenBitSet or something else that is even faster and more memory efficient, and then, once you have the Filter content you'd like to use, just pass it as a Matcher to the search() method and ta da, you have it.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515313 ] Paul Elschot commented on LUCENE-584: - There is some code in contrib where a Filter is assumed to have a BitSet available: contrib/queries/src/java/org/apache/lucene/search/BooleanFilter.java and contrib/miscellaneous/src/java/org/apache/lucene/misc/ChainedFilter.java. When Filter is going to move from BitSet to Matcher, these will have to be reimplemented. They basically use Filters to provide BitSets, but it seems to me that they also could use lists of BitSets, for example.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515395 ] Mark Harwood commented on LUCENE-584: - Hi Paul, Not sure if I'm missing something but I think this patch may not work for scenarios other than the simple option of a single filter being used on a search. A Matcher does not have the same utility as a BitSet because using a BitSet you can: 1) Iterate across it using multiple threads. 2) Clone it. 3) Merge it quickly with other bitsets using Boolean logic. 4) Use it more than once. I think these differences become important in the following scenarios: In CachingWrapperFilter I don't think you can cache Matchers instead of bitsets, because Matchers don't have features 1 and 4. BooleanFilter and ChainedFilter in contrib don't work with Matchers because there is no support for 3). Is there something obvious I've missed? Cheers Mark
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515434 ] Paul Elschot commented on LUCENE-584: - Have a look at BitSetMatcher in the -default patch. It is constructed from a BitSet, and it has a method getMatcher() that returns a Matcher that acts as a searching iterator over the BitSet. So that is 1) to 4), at least potentially. A clone() method is currently not implemented iirc, but each call to getMatcher() will return a new iterator over the underlying BitSet. And when guaranteed non-modifiability is needed, a constructor can take a copy of the given document set, in whatever form. The point of Matcher is that it allows implementations other than BitSet, like OpenBitSet and SortedVIntList. Both have the properties that you are looking for. SortedVIntList can save a lot of memory when compared to (Open)BitSet, and OpenBitSet is somewhat faster than BitSet. I'd like to have a skip list version of SortedVIntList, too. This would be slightly larger than SortedVIntList, but more efficient on skipTo(). But the first thing that is necessary is to have Filter independent from BitSet. The real pain with that is going to be the code that currently implements Filters outside the lucene code base, and a default implementation of a Matcher should be of help there, just as it is in the -core patch now. The default implementation will probably need to be improved from its current state, but that can be done later. For example, one could also use OpenBitSet in all cases, and even collect the filtered documents directly in that.
Cheers, Paul Elschot
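To make the memory argument concrete: a bit set over N documents needs roughly N/8 bytes no matter how few bits are set, whereas a sorted list of doc ids can store only the gaps between consecutive ids in a variable-length encoding. The class below is a minimal sketch of that idea and is not the SortedVIntList from the attachments; the name `DeltaVIntList` and its methods are made up for illustration.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch: sorted doc ids stored as variable-length encoded deltas. */
final class DeltaVIntList {
    private final byte[] bytes; // encoded gaps between consecutive doc ids
    private final int size;     // number of doc ids stored

    DeltaVIntList(int[] sortedDocs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int doc : sortedDocs) {
            if (doc < prev) {
                throw new IllegalArgumentException("doc ids must be sorted and non-negative");
            }
            int delta = doc - prev;
            // Emit 7 bits per byte, low bits first; the high bit marks "more bytes follow".
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80);
                delta >>>= 7;
            }
            out.write(delta);
            prev = doc;
        }
        this.bytes = out.toByteArray();
        this.size = sortedDocs.length;
    }

    /** Number of bytes used by the encoded list. */
    int byteSize() {
        return bytes.length;
    }

    /** Decodes the whole list back into doc ids, in increasing order. */
    int[] toArray() {
        int[] docs = new int[size];
        int pos = 0;
        int prev = 0;
        for (int i = 0; i < size; i++) {
            int delta = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += delta;
            docs[i] = prev;
        }
        return docs;
    }
}
```

Decoding is strictly sequential here, which is why a skip list on top of such a structure would help skipTo(), as mentioned above.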
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515437 ] Paul Elschot commented on LUCENE-584: - I forgot to mention that boolean logic on Matchers is already present in BooleanScorer2. This is because each Scorer is a Matcher.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515494 ] Mark Harwood commented on LUCENE-584: - Thanks for the reply, Paul. I saw BitSetMatcher etc and appreciate the motivation behind the design for alternative implementations. What concerns me with the Matcher API in general is that Matchers have non-threadsafe state (i.e. the current position required to support next()) and as such aren't safely cachable in the same way as BitSets. I see the searcher code uses the safer skipTo() rather than next() but there's still the if(exhausted) thread safety problem to worry about, which is why I raised points 1 and 4. Additionally, combining BitSets using Boolean logic is one method call whereas combining heterogeneous Matchers using Boolean logic requires iteration across them and therefore potentially many method calls (point 3). I haven't benchmarked this but I imagine it to be significantly slower? I use BooleanFilter a lot for security where many large sets are cached and combined on the fly - caching all the possible combinations as single bitsets would lead to too many possible combinations. Cheers Mark
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511186 ] Paul Elschot commented on LUCENE-584: - With 2.2 out, and LUCENE-730 out of the way, wouldn't this be a good moment for some progress with this issue? The patch still applies cleanly, and I'd like to start working on a skipping extension of SortedVIntList, much like the latest index format for document lists.
Re: [jira] Commented: (LUCENE-584) Decouple Filter from BitSet; relation with LUCENE-730
A totally different view on Filters would be to think about them as index slicers, at the lowest possible level, TermDocs. Basically all document ids in such a Filter would appear, at the TermDocs level, as if not in the index: simply invisible. That means a TermDocs that is aware of filtered doc ids (doing skipping over Filter AND Term). For example, one could extend FilterIndexReader, provide a setFilter(Matcher) method on it, and then the termDocs() method would need to check if Matcher == null and return a TermDocs instance that hides filtered documents or not. It looks too simple to be real; the nice thing about it is that, as far as I can tell, it does not require any changes in Lucene core! Conceptually, it filters some documents out of the index, simply providing another view on the index (hence FilterIndexReader). It is the same as the current Filter, but with a slightly shifted perspective on providing index views. It is quite possible that this idea sucks big time, please let me know if you see anything super wrong with it.
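A rough sketch of the "index view" idea, under loud assumptions: `DocStream` is a hypothetical two-method subset of what a TermDocs-like enumeration offers, the filter is represented as a plain BitSet of visible documents, and documents accepted by the filter are taken to be the ones that stay visible. None of these names come from Lucene or the patches.

```java
import java.util.BitSet;

/** Hypothetical subset of a TermDocs-like enumeration. */
interface DocStream {
    boolean next(); // advance; false when there are no more documents
    int doc();      // current doc id, valid after next() returned true
}

/** Simple in-memory stream over sorted doc ids, standing in for a term's postings. */
final class ArrayDocStream implements DocStream {
    private final int[] docs;
    private int pos = -1;

    ArrayDocStream(int[] docs) { this.docs = docs; }

    @Override public boolean next() { return ++pos < docs.length; }
    @Override public int doc() { return docs[pos]; }
}

/** Wrapper that makes documents outside the visible set look as if they were not indexed. */
final class FilteredDocStream implements DocStream {
    private final DocStream in;
    private final BitSet visible; // null means "no filter set": everything is visible

    FilteredDocStream(DocStream in, BitSet visible) {
        this.in = in;
        this.visible = visible;
    }

    @Override
    public boolean next() {
        while (in.next()) {
            if (visible == null || visible.get(in.doc())) {
                return true; // first visible document at or after the current position
            }
        }
        return false;
    }

    @Override
    public int doc() {
        return in.doc();
    }
}
```

A reader wrapper in the spirit of the message would return the filtered stream from its termDocs() equivalent whenever a filter has been set, and the plain stream otherwise.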
Re: [jira] Commented: (LUCENE-584) Decouple Filter from BitSet; relation with LUCENE-730
Hoss, what about a radical approach :) Instead of "Decouple Filter from BitSet", change the target to "add support for Matcher", meaning: - do not change Filter and the existing search() method in IndexSearcher at all, leave it as it is, no new assumptions about anything - add an IndexSearcher.search() method that uses Matcher and makes the documented assumption that the Scorer used to score the supplied Query supports skipTo. How I see it, as long as we have this degree of freedom, optional support for skipTo() in Scorer, we will have to have implicit knowledge of this fact in any code that interacts with Scorer, one way or another. Making skipTo() required for scorers would be a nice, big, simplifying change, but it is *way out of my league* to argue something like that (I simply have no idea what implications, effort... this could have). This would work if the search method with Matcher gets expert status until we find a way to relax this assumption and actually do what we wanted in the first place: decouple Filter from BitSet. cheers, e. - Original Message From: Chris Hostetter [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Saturday, 14 April, 2007 1:13:21 AM Subject: Re: [jira] Commented: (LUCENE-584) Decouple Filter from BitSet; relation with LUCENE-730 : Hoss, would this work (is this what you said)? : public Matcher getMatcher(IndexReader reader) throws IOException { :if(bits() == null) throw new SomeException(Filter must implement at least : one of...); :return new BitsMatcher(bits()); : } Assuming BitsMatcher does what i think it does then yes, that's what i had in mind ... i was specifically saying to make a default Matcher implementation out of the code in the patched version of IndexSearcher that has the comment... +} else { // bits for filtering, skipTo() not used on scorer: : This will not work correctly when the Scorer for the query that is searched : with a filter does not implement skipTo(), for example BooleanScorer. : See also the javadoc of class IndexSearcher in the patch.
I don't get it, how would a Scorer not implement skipTo? ...oh... final class BooleanScorer extends Scorer { ... public boolean skipTo(int target) { throw new UnsupportedOperationException(); } ...so lemme see if i understand this: What's happening in the current trunk is that the only situations in which code will attempt to call skipTo on a Scorer are: a) From the score(HitCollector hc) method of the same Scorer class (you should know if you support it, you're in the class) b) From the skipTo method of an enclosing Scorer (If you add Scorer X to a wrapper Scorer Y, and Y implements skipTo, it can assume that X implements skipTo). Am I correct so far? In the latest version of the Matcher patch... https://issues.apache.org/jira/secure/attachment/12352057/Matcher20070226.patch ...this changes, such that IndexSearcher will assume a Scorer supports skipTo iff a Filter is used which implements getMatcher (I guess the assumption being that if the code being used is new enough to support Matchers, it's new enough to support Scorer.skipTo). *BUT* if it's an old Filter using a BitSet the code in IndexSearcher will continue with the same old assumptions about the Scorer. And the change eks describes (which is a much better way to describe what i was suggesting) would break this safety net by always assuming skipTo was safe to call. So really the issue is that the patch assumes one thing (Scorer supports skipTo) based on the presence of something that should be thought of as newer (Filter supports getMatcher) and relies on documentation to enforce this. Am I caught up now?
Off the top of my head, the best solution i can think of to this issue would be to add the naive implementation of skipTo to Scorer, remove the UnsupportedOperationException of skipTo from all Scorers in the core, and rev Lucene to version 3.0 since this would probably be considered a serious API change (method sigs don't change, but now we're requiring people to implement a method that we have said in the past (by example) can be Unsupported. In general i'm not fond of assuming Scorer.skipTo when Filter.getMatcher ... the concepts are really orthoginal and even if it's a decent assumption to make today, it doens't help us tomorow when we want to add a getMatcher method to all of the core Filter classes to improve performance. -Hoss - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] ___ Yahoo! Mail is the world's favourite email. Don't settle for less, sign up for your free account today http://uk.rd.yahoo.com/evt=44106/*http://uk.docs.yahoo.com/mail/winter07.html - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional
Re: [jira] Commented: (LUCENE-584) Decouple Filter from BitSet; relation with LUCENE-730
Hoss,

A bit long, sorry for that; sometimes things are just as complex as they are.

On Saturday 14 April 2007 01:13, Chris Hostetter wrote:
> ...
> I don't get it, how would a Scorer not implement skipTo? ...oh...
>
>   final class BooleanScorer extends Scorer {
>     ...
>     public boolean skipTo(int target) {
>       throw new UnsupportedOperationException();
>     }

Some history for the underlying reason for this:

Once upon a time no Scorer would implement skipTo(). Most people would use BooleanScorer for queries with multiple terms, and things worked well with the Scorer.next() method, especially for disjunctions. Occasionally documents would be scored out of document order, but that did not lead to problems because Hits would reorder the documents by score value anyway.

Then skipTo() was added to improve the speed of conjunctions. To do this each Scorer needs to score all documents in document number order and implement skipTo(), because skipTo() is used by ConjunctionScorer. BooleanScorer will only use ConjunctionScorer in very specific (but also frequently occurring) circumstances. At this point the index format was also changed to include the skip forward information.

As I said, the implementation of disjunctions in BooleanScorer does not score documents strictly in document order. It can be made to do that, but that would lead to some loss of performance. BooleanScorer uses a kind of distributive sort that is faster than the priority queue used by DisjunctionSumScorer.

Then BooleanScorer2 came along. BooleanScorer2 uses ConjunctionScorer in more circumstances than BooleanScorer, and it uses DisjunctionSumScorer for disjunctions. LUCENE-730 is an attempt to get the top level disjunction performance of BooleanScorer back.

Disjunctions below top level, for example in a query like this:

  +(a1 a2) +(b1 b2)

need skipTo() (called from ConjunctionScorer) on the two nested disjunctions, and for that DisjunctionSumScorer is used.
Currently for the top level disjunction case:

  a1 a2 b1 b2

DisjunctionSumScorer is normally used. But when the setUseScorer14() method is used, BooleanScorer will (always?) be used. The patch at LUCENE-584 tries to handle this setUseScorer14() case by also keeping the old filtering method that checks the bits individually in IndexSearcher. LUCENE-730 will always use BooleanScorer for the top level disjunctions, so with a bit of luck the setUseScorer14() method can also be deprecated/removed.

LUCENE-584 has another possible performance advantage in that it allows an implementation of filtering by using a ConjunctionScorer directly instead of doing the filtering in IndexSearcher, but that still needs to be added.

Regards,
Paul Elschot
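To make the role of skipTo() in conjunctions concrete, here is a minimal, self-contained sketch of two iterators "leapfrogging" each other. DocIterator, ArrayDocIterator and Conjunction are hypothetical names made up for this illustration; they are not Lucene classes, and the real ConjunctionScorer may well be organised differently. The sketch only assumes what the message above states: every participant returns documents in increasing document order and supports skipTo().

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the iteration part of a Scorer (not a Lucene type). */
interface DocIterator {
    boolean next();              // advance to the next doc; false when exhausted
    int doc();                   // current doc, valid only after next()/skipTo() returned true
    boolean skipTo(int target);  // advance to the first later doc >= target; false when exhausted
}

/** Iterates over document numbers given in strictly increasing order. */
final class ArrayDocIterator implements DocIterator {
    private final int[] docs;
    private int pos = -1;

    ArrayDocIterator(int... sortedDocs) { this.docs = sortedDocs; }

    public boolean next() { return ++pos < docs.length; }

    public int doc() { return docs[pos]; }

    public boolean skipTo(int target) {
        // Linear scan for brevity; an index with skip data could jump ahead instead.
        do {
            if (!next()) return false;
        } while (docs[pos] < target);
        return true;
    }
}

final class Conjunction {
    /** Returns the docs matched by both iterators; requires increasing doc order. */
    static List<Integer> intersect(DocIterator a, DocIterator b) {
        List<Integer> hits = new ArrayList<>();
        if (!a.next() || !b.next()) return hits;
        while (true) {
            int da = a.doc();
            int db = b.doc();
            if (da == db) {
                hits.add(da);                       // both agree: a match
                if (!a.next() || !b.next()) return hits;
            } else if (da < db) {
                if (!a.skipTo(db)) return hits;     // a is behind: jump it forward
            } else {
                if (!b.skipTo(da)) return hits;     // b is behind: jump it forward
            }
        }
    }
}

final class ConjunctionDemo {
    public static void main(String[] args) {
        System.out.println(Conjunction.intersect(
                new ArrayDocIterator(1, 3, 5, 7, 9),
                new ArrayDocIterator(3, 4, 7, 10)));  // prints [3, 7]
    }
}
```

If either iterator returned documents out of order, or threw from skipTo(), the loop above would miss matches or fail, which is the reason a scorer without ordered iteration cannot take part in this scheme.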
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488733 ]

Hoss Man commented on LUCENE-584:
---------------------------------

I'm still behind on following this issue, but Otis: if you are interested in moving forward with this, you might consider trying the changes i proposed in my 15/Mar/07 11:06 AM comment...

https://issues.apache.org/jira/browse/LUCENE-584#action_12481263

...I think it would keep IndexSearcher a little cleaner, and make it easier for people to migrate existing Filters gradually (without requiring extra work for people writing new Matcher-style Filters from scratch).

Decouple Filter from BitSet
---------------------------

Key: LUCENE-584
URL: https://issues.apache.org/jira/browse/LUCENE-584
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Affects Versions: 2.0.1
Reporter: Peter Schäfer
Priority: Minor
Attachments: bench-diff.txt, bench-diff.txt, BitsMatcher.java, Filter-20060628.patch, HitCollector-20060628.patch, IndexSearcher-20060628.patch, MatchCollector.java, Matcher.java, Matcher20070226.patch, Scorer-20060628.patch, Searchable-20060628.patch, Searcher-20060628.patch, Some Matchers.zip, SortedVIntList.java, TestSortedVIntList.java

{code}
package org.apache.lucene.search;

public abstract class Filter implements java.io.Serializable {
  public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
}

public interface AbstractBitSet {
  public boolean get(int index);
}
{code}

It would be useful if the method =Filter.bits()= returned an abstract interface, instead of =java.util.BitSet=.

Use case: there is a very large index, and, depending on the user's privileges, only a small portion of the index is actually visible. Sparsely populated =java.util.BitSet=s are not efficient and waste lots of memory. It would be desirable to have an alternative BitSet implementation with smaller memory footprint.

Though it _is_ possible to derive classes from =java.util.BitSet=, it was obviously not designed for that purpose. That's why I propose to use an interface instead. The default implementation could still delegate to =java.util.BitSet=.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
Re: [jira] Commented: (LUCENE-584) Decouple Filter from BitSet
Hoss, would this work (is this what you said)?

  public BitSet bits(IndexReader reader) throws IOException {
    return null;
  }

  public Matcher getMatcher(IndexReader reader) throws IOException {
    if (bits(reader) == null) throw new SomeException("Filter must implement at least one of...");
    return new BitsMatcher(bits(reader));
  }

and IndexSearcher does not have any logic; it just uses getMatcher(). Current implementations would work, and new ones as well.

----- Original Message -----
From: Hoss Man (JIRA)
Sent: Friday, 13 April, 2007 8:01:16 PM
Subject: [jira] Commented: (LUCENE-584) Decouple Filter from BitSet
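The proposal above can be sketched as a small, self-contained program. Everything here is a hypothetical illustration: the Matcher interface, BitsMatcher and SketchFilter are simplified stand-ins written for this example (the Matcher.java and BitsMatcher.java attached to the issue may look different), the IndexReader parameter is left out to keep the example runnable on its own, and IllegalStateException takes the place of the placeholder SomeException.

```java
import java.util.BitSet;

/** Hypothetical doc-number iterator; a simplified stand-in for the patch's Matcher. */
interface Matcher {
    boolean next();  // advance to the next matching doc; false when exhausted
    int doc();       // current doc, valid only after next() returned true
}

/** Hypothetical adapter that walks the set bits of a BitSet in increasing order. */
final class BitsMatcher implements Matcher {
    private static final int DONE = Integer.MAX_VALUE;
    private final BitSet bits;
    private int doc = -1;

    BitsMatcher(BitSet bits) { this.bits = bits; }

    public boolean next() {
        if (doc == DONE) return false;       // stay exhausted once the end is reached
        int n = bits.nextSetBit(doc + 1);    // -1 when no further bit is set
        doc = (n < 0) ? DONE : n;
        return doc != DONE;
    }

    public int doc() { return doc; }
}

/** Sketch of a Filter where old subclasses override bits() and new ones override getMatcher(). */
abstract class SketchFilter {
    /** Old-style entry point; returns null unless a subclass overrides it. */
    BitSet bits() { return null; }

    /** New-style entry point; by default adapts whatever bits() returns. */
    Matcher getMatcher() {
        BitSet bits = bits();
        if (bits == null) {
            throw new IllegalStateException("neither bits() nor getMatcher() is overridden");
        }
        return new BitsMatcher(bits);
    }
}

/** An "existing" filter that only knows about BitSets. */
final class EvenDocsFilter extends SketchFilter {
    @Override
    BitSet bits() {
        BitSet b = new BitSet();
        b.set(0);
        b.set(2);
        b.set(4);
        return b;
    }
}

final class DefaultMatcherDemo {
    public static void main(String[] args) {
        Matcher m = new EvenDocsFilter().getMatcher();
        while (m.next()) {
            System.out.println(m.doc());  // prints 0, 2 and 4 on separate lines
        }
    }
}
```

With this shape a searcher would only ever call getMatcher(). Whether the scorer being filtered can safely be driven through skipTo() is a separate question that this sketch does not address.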
Re: [jira] Commented: (LUCENE-584) Decouple Filter from BitSet; relation with LUCENE-730
On Friday 13 April 2007 22:10, eks dev wrote:
> Hoss, would this work (is this what you said)?
>
>   public BitSet bits(IndexReader reader) throws IOException {
>     return null;
>   }
>
>   public Matcher getMatcher(IndexReader reader) throws IOException {
>     if (bits(reader) == null) throw new SomeException("Filter must implement at least one of...");
>     return new BitsMatcher(bits(reader));
>   }

This will not work correctly when the Scorer for the query that is searched with a filter does not implement skipTo(), for example BooleanScorer. See also the javadoc of class IndexSearcher in the patch.

LUCENE-730 explicitly uses BooleanScorer, but only for the non-filtered case with a top level disjunction. I think that with LUCENE-730 also added, the filtered case with BooleanScorer would go away, which would allow this logic in IndexSearcher to be simplified. This simplification of IndexSearcher is not in the LUCENE-730 patch, because LUCENE-584 is not committed. At the moment I don't know precisely what IndexSearcher would look like after LUCENE-730.

With LUCENE-730, BooleanScorer.setUseScorer14() could also be removed/deprecated, but that is also not yet in the LUCENE-730 patch.

Regards,
Paul Elschot
Re: [jira] Commented: (LUCENE-584) Decouple Filter from BitSet; relation with LUCENE-730
OK, I see, thanks for the hand-holding here.

The simplest solution (without making another bigger/riskier patch) would be:
- commit LUCENE-584 as is; no harm to anyone, only some temporary complexity in IndexSearcher
- commit LUCENE-730; it does no harm
- open a new Jira issue, "Simplify Filter usage in IndexSearcher", and refactor Filter to behave as Hoss described

----- Original Message -----
From: Paul Elschot
Sent: Friday, 13 April, 2007 11:05:10 PM
Subject: Re: [jira] Commented: (LUCENE-584) Decouple Filter from BitSet; relation with LUCENE-730
Re: [jira] Commented: (LUCENE-584) Decouple Filter from BitSet; relation with LUCENE-730
: Hoss, would this work (is this what you said)?

: public Matcher getMatcher(IndexReader reader) throws IOException {
:   if (bits(reader) == null) throw new SomeException("Filter must implement at least one of...");
:   return new BitsMatcher(bits(reader));
: }

Assuming BitsMatcher does what i think it does then yes, that's what i had in mind ... i was specifically saying to make a default Matcher implementation out of the code in the patched version of IndexSearcher that has the comment...

  +} else { // bits for filtering, skipTo() not used on scorer:

: This will not work correctly when the Scorer for the query that is searched
: with a filter does not implement skipTo(), for example BooleanScorer.
: See also the javadoc of class IndexSearcher in the patch.

I don't get it, how would a Scorer not implement skipTo? ...oh...

  final class BooleanScorer extends Scorer {
    ...
    public boolean skipTo(int target) {
      throw new UnsupportedOperationException();
    }

...so lemme see if i understand this: What's happening in the current trunk is that the only situations in which code will attempt to call skipTo on a Scorer are:

  a) From the score(HitCollector hc) method of the same Scorer class (you should know if you support it, you're in the class)
  b) From the skipTo method of an enclosing Scorer (if you add Scorer X to a wrapper Scorer Y, and Y implements skipTo, it can assume that X implements skipTo).

Am I correct so far?

In the latest version of the Matcher patch...

https://issues.apache.org/jira/secure/attachment/12352057/Matcher20070226.patch

...this changes, such that IndexSearcher will assume a Scorer supports skipTo iff a Filter is used which implements getMatcher (I guess the assumption being that if the code being used is new enough to support Matchers, it's new enough to support Scorer.skipTo). *BUT* if it's an old Filter using a BitSet the code in IndexSearcher will continue with the same old assumptions about the Scorer.

And the change eks describes (which is a much better way to describe what i was suggesting) would break this safety net by always assuming skipTo was safe to call.

So really the issue is that the patch assumes one thing (Scorer supports skipTo) based on the presence of something that should be thought of as newer (Filter supports getMatcher), and relies on documentation to enforce this.

Am I caught up now?

Off the top of my head, the best solution i can think of to this issue would be to add the naive implementation of skipTo to Scorer, remove the UnsupportedOperationException of skipTo from all Scorers in the core, and rev Lucene to version 3.0, since this would probably be considered a serious API change (method sigs don't change, but now we're requiring people to implement a method that we have said in the past (by example) can be Unsupported).

In general i'm not fond of assuming Scorer.skipTo when Filter.getMatcher ... the concepts are really orthogonal, and even if it's a decent assumption to make today, it doesn't help us tomorrow when we want to add a getMatcher method to all of the core Filter classes to improve performance.

-Hoss
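As a rough illustration of the "naive implementation of skipTo" idea, the sketch below puts a default skipTo() built only on next() and doc() into a base class. SketchScorer and ArrayScorer are hypothetical classes invented for this example, not Lucene's Scorer. The sketch also assumes that next() returns documents in increasing order; a scorer whose next() can return documents out of order, as described elsewhere in this thread for BooleanScorer's disjunctions, would not become correct just by inheriting this default.

```java
/** Hypothetical base class standing in for Scorer; only iteration is modelled. */
abstract class SketchScorer {
    abstract boolean next();  // advance to the next doc; false when exhausted
    abstract int doc();       // current doc, valid only after next()/skipTo() returned true

    /**
     * Naive default: step with next() until the current doc reaches the target.
     * Always advances at least once, so it never returns the doc it started on.
     */
    boolean skipTo(int target) {
        do {
            if (!next()) return false;
        } while (doc() < target);
        return true;
    }
}

/** A scorer that implements only next() and doc() and inherits the naive skipTo(). */
final class ArrayScorer extends SketchScorer {
    private final int[] docs;
    private int pos = -1;

    ArrayScorer(int... sortedDocs) { this.docs = sortedDocs; }

    @Override
    boolean next() { return ++pos < docs.length; }

    @Override
    int doc() { return docs[pos]; }
}

final class NaiveSkipToDemo {
    public static void main(String[] args) {
        SketchScorer s = new ArrayScorer(2, 5, 8, 13);
        System.out.println(s.skipTo(6) + " " + s.doc());  // prints "true 8"
        System.out.println(s.skipTo(20));                 // prints "false"
    }
}
```

The default costs one next() call per skipped document, so it buys safety rather than speed; subclasses that can jump ahead more cheaply would still override it.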
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487706 ]

Paul Elschot commented on LUCENE-584:
-------------------------------------

That could be improved in a DisjunctionMatcher. With a bit of bookkeeping, DisjunctionSumScorer could also delay calling score() on the subscorers, but the bookkeeping would affect performance for the normal case.

For the usual queries the score() call will never have much of a performance impact. The reason for this is that TermScorer.score() is really very efficient; iirc it caches weighted tf() values for low term frequencies. All the rest is mostly additions, and occasionally a multiplication for a coordination factor.

To determine which documents match the query, the index needs to be accessed, and that takes more time than score value computations because the complete index almost never fits in the fastest cache.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487789 ]

Otis Gospodnetic commented on LUCENE-584:
-----------------------------------------

Ah, too bad. :( Last time I benchmarked Lucene searching on Sun's Niagara vs. non-massive Intel boxes, Intel boxes with Linux on them actually won, and my impression was that this was due to Niagara's weak FPU (a known weakness in Niagara, I believe). Thus, I thought, if we could just skip scoring and various floating point calculations, we'd see better performance, esp. on Niagara boxes.

Paul, when you say "fastest cache", what exactly are you referring to? The Niagara I tested things on had 32GB of RAM, and I gave the JVM 20+GB, so at least the JVM had plenty of RAM to work with.
Re: [jira] Commented: (LUCENE-584) Decouple Filter from BitSet
If I remember well, the last time we profiled search with high-density OR queries, scoring was taking up to 30% of the time. This was a collection of 8 million short documents fitting comfortably in RAM. So I am sure disabling scoring in some cases could bring us something. I am not familiar enough with the inner workings of scoring to stand 100% behind this statement, so please take it with some healthy reserve.

But anyhow, with Matcher in place, we have at least a chance to prove it brings something for this scenario. For the filtering case it definitely brings a lot.

On another note, Paul, would it be possible/easy to have something like the following? It looks easy to add, but I may be missing something:

  BooleanQuery.add(Matcher mtr, BooleanClause.Occur occur)

----- Original Message -----
From: Otis Gospodnetic (JIRA)
Sent: Tuesday, 10 April, 2007 5:11:32 PM
Subject: [jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487882 ]

Paul Elschot commented on LUCENE-584:
-------------------------------------

By "fastest cache" I meant the L1 cache of the processor. Its size is normally in the tens of kilobytes. An array lookup hitting that cache takes about as much time as a floating point addition.

During a query search, the use of (among others) the term frequencies, the proximity data, and the document weights normally causes an L1 cache miss. I would expect that by not doing the score value computations, only the cache misses for document weights can be saved.
Re: [jira] Commented: (LUCENE-584) Decouple Filter from BitSet
On Tuesday 10 April 2007 17:41, eks dev wrote:
> If I remember well, the last time we profiled search with high-density OR
> queries, scoring was taking up to 30% of the time. This was a collection of
> 8 million short documents fitting comfortably in RAM. So I am sure disabling
> scoring in some cases could bring us something. I am not familiar enough with
> the inner workings of scoring to stand 100% behind this statement, so please
> take it with some healthy reserve.

For high density OR I'd guess most of the work was spent maintaining the priority queue by document number. See also LUCENE-730.

> But anyhow, with Matcher in place, we have at least a chance to prove it
> brings something for this scenario. For the filtering case it definitely
> brings a lot.
>
> On another note, Paul, would it be possible/easy to have something like the
> following? It looks easy to add, but I may be missing something:
>
>   BooleanQuery.add(Matcher mtr, BooleanClause.Occur occur)

That's one of the things I'd like to see added. It would allow a single ConjunctionScorer to do a filtered search for a query with some required terms.

Regards,
Paul Elschot
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487940 ]

Hoss Man commented on LUCENE-584:
---------------------------------

I'm a little behind on following this issue, but if i can attempt to sum up the recent discussion about performance...

Migrating towards a Matcher API *may* allow some types of Queries to be faster in situations where clients can use a MatchCollector instead of a HitCollector, but this won't be a silver bullet performance win for all Query classes -- just those where some of the score calculations are (or can be) isolated to the score method (as opposed to skipTo or next).

I think it's important to remember that the motivation of this issue wasn't to improve the speed of non-scoring searches. It was to decouple the concept of filtering results from the need to populate a (potentially large) BitSet when the logic necessary for filtering can easily be expressed in terms of a doc iterator (aka: a Matcher) -- opening up the possibility of memory performance improvements.

A second benefit that has arisen as the issue evolved has been the API generalization of the Matcher concept to be a super class of Scorer, for simpler APIs moving forward.
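The memory argument in the summary above can be made concrete with a back-of-the-envelope sketch. SortedIntMatcher and SparseFilterDemo are hypothetical names for this illustration and do not reproduce the SortedVIntList.java attached to the issue. The index size (10 million documents) and the number of visible documents (1,000) are made-up figures, and the byte counts ignore object headers; they also assume the highest visible document is near the end of the index, since java.util.BitSet only needs storage up to its highest set bit.

```java
/** Hypothetical doc iterator over a sorted array of visible document numbers. */
final class SortedIntMatcher {
    private final int[] docs;
    private int pos = -1;

    SortedIntMatcher(int... sortedDocs) { this.docs = sortedDocs; }

    boolean next() { return ++pos < docs.length; }  // false when exhausted

    int doc() { return docs[pos]; }                 // valid only after next() returned true
}

final class SparseFilterDemo {
    /** Approximate bit storage for one bit per document, rounded up to 64-bit words. */
    static long bitSetBytes(int maxDoc) {
        return ((maxDoc + 63L) / 64L) * 8L;
    }

    /** Approximate storage for one 32-bit int per visible document. */
    static long intArrayBytes(int visibleDocs) {
        return 4L * visibleDocs;
    }

    public static void main(String[] args) {
        int maxDoc = 10_000_000;  // assumed index size
        int visible = 1_000;      // assumed number of docs this user may see
        System.out.println("bit set:   " + bitSetBytes(maxDoc) + " bytes");     // 1250000 bytes
        System.out.println("int array: " + intArrayBytes(visible) + " bytes");  // 4000 bytes

        SortedIntMatcher m = new SortedIntMatcher(17, 4_200, 9_999_999);
        while (m.next()) {
            System.out.println("visible doc " + m.doc());
        }
    }
}
```

The crossover depends on density: at one int per document the array stops paying off once more than roughly one document in 32 is visible, which is why a bit set can remain a sensible default for dense filters.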
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487966 ]

Otis Gospodnetic commented on LUCENE-584:

Right. I was under the wrong impression that the Matcher also happens to avoid scoring. However, now that we've all looked at this patch (still applies cleanly and unit tests all pass), and nobody had any criticisms, I think we should commit it, say this Friday. As I'm in the performance squeezing mode, I'll go look at LUCENE-730, another one of Paul's great patches, and see if I can measure performance improvement there.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487594 ]

Yonik Seeley commented on LUCENE-584:

> When you rerun, you may want to use my alg - to compare the two approaches in one run.

This is more dangerous though. GC from one method's garbage can penalize the second method's performance. Also, hotspot effects are hard to account for (if method1 and method2 use common methods, method2 will often execute faster than method1 because more optimization has been done on those common methods). The hotspot effect can be minimized by running the test multiple times in the same JVM instance and discarding the first runs, but it's not so easy for GC.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487613 ]

Mike Klaas commented on LUCENE-584:

Instead of discarding the first run, the approach I usually take is to run 3-4 times and pick the minimum. You can then run several of these sets and average over the minimum of each. GC is still an issue, though. It is hard to get around when it is a mark-sweep collector (reference counting is much friendlier in this regard).
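The "minimum of a few runs, averaged over several sets" approach can be sketched as a small timing helper. This is only an illustration under stated assumptions: MinOfRuns and its method names are made up here, it is not part of the Lucene benchmark package, and it times an arbitrary Runnable with System.nanoTime().

```java
/** Hypothetical helper: take the fastest of a few runs, then average those minima. */
final class MinOfRuns {

    /** Smallest wall-clock time, in nanoseconds, over `runs` executions of `work`. */
    static long minNanos(Runnable work, int runs) {
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            work.run();
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }

    /** Average, over `sets` sets, of the per-set minimum. */
    static double averageOfMinima(Runnable work, int sets, int runsPerSet) {
        long total = 0;
        for (int s = 0; s < sets; s++) {
            total += minNanos(work, runsPerSet);
        }
        return (double) total / sets;
    }

    public static void main(String[] args) {
        // Placeholder workload; a real comparison would run the search task here.
        Runnable work = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) sum += i;
            if (sum < 0) throw new IllegalStateException();
        };
        System.out.println("avg of minima (ns): " + averageOfMinima(work, 5, 4));
    }
}
```

Taking the minimum tends to filter out runs that were slowed by a collection or by not-yet-compiled code, but as noted above it does not make GC cost disappear; it only hides it from the reported number.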
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487616 ]

Doron Cohen commented on LUCENE-584:

> When you rerun, you may want to use my alg - to compare the two approaches in one run.
> This is more dangerous though.

Agree. I was trying to get rid of this by splitting each round into three steps: gc(), warm(), work(), where work() and warm() are the same, except that warm()'s stats are disregarded. Still, switching the order of "by match" and "by bits" yields different results.

Sometimes we would like not to disregard GC, in particular if one approach is creating more (or more complex) garbage than the other. Perhaps we should look at two measures: best and avg/sum (the second ignoring the first run, for hotspot).
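A round split into gc(), warm() and work() might look roughly like the sketch below. The class and method names are hypothetical and this is not the benchmark package's own implementation; note also that System.gc() is only a request to the JVM, so the "gc" step is best-effort.

```java
/** Hypothetical sketch of a gc / warm / work round for comparing two approaches. */
final class RoundSketch {

    /** One round: request a GC, run once untimed to warm up, then time an identical run. */
    static long timedRoundNanos(Runnable task) {
        System.gc();                       // gc(): only a hint, the JVM may ignore it
        task.run();                        // warm(): same work, stats disregarded
        long start = System.nanoTime();
        task.run();                        // work(): the measured run
        return System.nanoTime() - start;
    }

    /**
     * Times a then b, then b then a, and returns {a1, b1, b2, a2}
     * so that any effect of the execution order stays visible.
     */
    static long[] bothOrders(Runnable a, Runnable b) {
        long a1 = timedRoundNanos(a);
        long b1 = timedRoundNanos(b);
        long b2 = timedRoundNanos(b);
        long a2 = timedRoundNanos(a);
        return new long[] { a1, b1, b2, a2 };
    }
}
```

Reporting both orderings side by side, rather than a single number per approach, makes it easier to see whether a difference comes from the approach itself or from which one happened to run first.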
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487631 ]

Otis Gospodnetic commented on LUCENE-584:

Doron: just to address your question from Apr/7 - I expect/hope to see an improvement in performance because of this difference:

  hc.collect(doc(), score());
  mc.collect(doc());

the delta being the cost of the score() call that does the scoring. If I understand things correctly, that means that what Grant described at the bottom of http://lucene.apache.org/java/docs/scoring.html will all be skipped. No Scorer, no BooleanScorer(2), no ConjunctionScorer...
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487667 ]

Doron Cohen commented on LUCENE-584:

> No Scorer, no BooleanScorer(2), no ConjunctionScorer...

Thanks, I was reading score instead of score()... But there is a scorer in the process; it is used for next()-ing to matched docs. So most of the work - preparing to be able to compute the scores - was done already. The scorer doc queue is created and populated. Not calling score() saves the (final) looping over the scorers to aggregate their scores, multiply by the coord factor, etc. I assume this is why only a small speed-up is seen.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487674 ]

Otis Gospodnetic commented on LUCENE-584:

Ah. I'll look at the patch again tomorrow and follow what you said. All this time I was under the impression that one of the points, or at least side effects, of the Matcher was that scoring was skipped, which would be perfect where matches are ordered by anything other than relevance.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487675 ]

Marvin Humphrey commented on LUCENE-584:

DisjunctionSumScorer (the ORScorer) actually calls Scorer.score() on all of the matching scorers in the ScorerDocQueue during next(), in order to accumulate an aggregate score. The MatchCollector can't save you from that.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487456 ]

Doron Cohen commented on LUCENE-584:

...right, your diff-txt had the Match tasks - I missed that. I checked it, and it is exactly what I did, so we're ok here.

When you rerun, you may want to use my alg - to compare the two approaches in one run. You can run this with something like:

  ant run-task -Dtask.mem=256M -Dtask.alg=conf\matcher-vs-bitset.alg

Also, to get cleaner results, add the line:

  ResetSystemSoft

just at the beginning of the search round - this resets the (query) inputs and also calls GC.

I tried this twice, and got inconsistent results.

When the bitset searches preceded the match searches:

  [java] Operation            round  runCnt  recsPerRun  rec/s  elapsedSec  avgUsedMem  avgTotalMem
  [java] SrchBitsSamRdr_5000      -      10        5000  706.4       70.78   7,511,219   16,573,645
  [java] SrchMtchSamRdr_5000      -      10        5000  689.6       72.50   8,223,005   11,926,323
  [java] SrchBitsNewRdr_500       -      10         500  152.5       32.80  14,360,618   16,962,356
  [java] SrchMtchNewRdr_500       -      10         500  171.3       29.19  15,150,797   17,395,712

When the match searches preceded the bitset searches:

  [java] Operation            round  runCnt  recsPerRun  rec/s  elapsedSec  avgUsedMem  avgTotalMem
  [java] SrchMtchSamRdr_5000      -      10        5000  763.5       65.49   9,563,243   17,128,244
  [java] SrchBitsSamRdr_5000      -      10        5000  729.3       68.56  10,003,775   13,001,114
  [java] SrchMtchNewRdr_500       -      10         500  175.7       28.46  12,068,559   17,524,326
  [java] SrchBitsNewRdr_500       -      10         500  183.7       27.22  15,098,480   17,974,476

My conclusion from this is that the speed-up, if it exists, is minor, at least for the setup of this test. There are only 15 unique queries in this test (also printed in the log). Are these the queries you would expect to see savings in? I didn't follow this issue very closely, so I don't know where the saving is expected here.

Both SearchTask and MatchTask now do nothing in collect, so there is no difference at the actual collect() call. Also, Scorer.score(HitCollector) and Matcher.match(MatchCollector) are very similar:

  public void score(HitCollector hc) throws IOException {
    while (next()) {
      hc.collect(doc(), score());
    }
  }

  public void match(MatchCollector mc) throws IOException {
    while (next()) {
      mc.collect(doc());
    }
  }

especially when the collect() method does nothing, as in this test.

I think there is a potential gain for large boolean OR queries, because score() would have to call next() on all TermScorers and collect/sum their scores, while match() could use skipTo(last+1), because any match encountered is a match and there is no need to sum the individual scores for the same doc from the other scorers. However, as far as I can tell, the current match() implementation does not take advantage of this, but I may be overlooking something?
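The difference between the two loops quoted in this comment can be reproduced with a toy scorer that counts its score() calls. Everything in this sketch is hypothetical: SketchMatcher, SketchScorer, SketchHitCollector, SketchMatchCollector and CountingScorer are stand-ins written for illustration, not the classes from the patch or from Lucene, and the constant score is a placeholder.

```java
/** Hypothetical collectors standing in for MatchCollector and HitCollector. */
interface SketchMatchCollector { void collect(int doc); }
interface SketchHitCollector { void collect(int doc, float score); }

/** Stand-in for the Matcher idea: iteration only, no scores. */
abstract class SketchMatcher {
    abstract boolean next();
    abstract int doc();

    void match(SketchMatchCollector mc) {
        while (next()) {
            mc.collect(doc());            // score() is never called on this path
        }
    }
}

/** Stand-in for Scorer as a subclass of the matcher, adding score(). */
abstract class SketchScorer extends SketchMatcher {
    abstract float score();

    void score(SketchHitCollector hc) {
        while (next()) {
            hc.collect(doc(), score());   // one score() call per matching doc
        }
    }
}

/** Toy scorer over a sorted doc id array that counts how often score() runs. */
final class CountingScorer extends SketchScorer {
    private final int[] docs;
    private int pos = -1;
    int scoreCalls = 0;

    CountingScorer(int[] sortedDocs) { this.docs = sortedDocs; }

    boolean next() {
        if (pos + 1 >= docs.length) { pos = docs.length; return false; }
        pos++;
        return true;
    }

    int doc() { return docs[pos]; }

    float score() { scoreCalls++; return 1.0f; }   // placeholder score
}
```

In this toy the only work saved by match() is the final score() call per document. All the cost of next() is paid on both paths, which is consistent with the small differences in the benchmark numbers above.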
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487431 ]

Doron Cohen commented on LUCENE-584:

One line was cut out - here are the four lines again:

  Operation            round  runCnt  recsPerRun  rec/s  elapsedSec  avgUsedMem  avgTotalMem
  SrchMtchSamRdr_5000      -      10        5000  642.2       77.85  12,331,866   16,408,576
  SrchBitsSamRdr_5000      -      10        5000  586.9       85.20   9,515,875   12,009,472
  SrchMtchNewRdr_500       -      10         500  134.7       37.11  13,376,113   17,171,660
  SrchBitsNewRdr_500       -      10         500  154.0       32.47  15,351,395   17,522,688
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487432 ]

Otis Gospodnetic commented on LUCENE-584:

Doron, thanks for jumping on this!

1. I thought I'd see better performance with the Matcher because it skips scoring. While Paul's patch does make changes to the Filtering code, I'm more focused on HitCollector vs. MatchCollector performance here. Am I missing something here? If scoring is skipped, we should see at least some speed improvement, and your results show that.

2. You said you *did* see MatchCollector was faster than HitCollector. Hmmm, weird, not in my 4 runs:

  Matcher:
  [java] SearchSameRdr_5 - - - - - - - - 4 - - 5 - - 1,064.7 - - 187.84 - 11,060,036 - 14,806,016

  HitCollector:
  [java] SearchSameRdr_5 - - - - - - - - 4 - - 5 - - 1,070.3 - - 186.86 - 10,500,146 - 13,821,952

I'll try it again on a different computer. My previous runs were on a Mac with OSX.

3. My bench-diff.txt did include Match tasks:

  $ grep Match bench-diff.txt | grep class
  public class SearchMatchTask extends MatchTask {
  public abstract class MatchTask extends ReadTask {

... but I didn't svn add them, so I produced the diff by simply cat-ing the new tasks to bench-diff.txt. So if you used my bench-diff.txt as a patch, it wouldn't have worked. Not a big deal, just clarifying.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483868 ]

Yonik Seeley commented on LUCENE-584:

> BitsMatcher could also work without the exhausted flag, but then an infinite loop might occur when trying to continue after the first time next() or skipTo() returned false. Continuing after false was returned in these cases is a bug, however an infinite loop can be difficult to debug. I'd rather be on the safe side of that with the exhausted flag and wait for an actual profile to show the performance problem.

We know that matchers will be inner-loop stuff. It seems like any scorers that call next() after false was returned should be fixed.
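For readers without the attachment at hand, here is a rough sketch of what an "exhausted" flag buys in a BitSet-backed iterator. This is not the attached BitsMatcher.java; BitSetMatcherSketch is a hypothetical class written on top of java.util.BitSet.nextSetBit, and the wrap-around explanation in the comments applies to this sketch only.

```java
import java.util.BitSet;

/** Hypothetical BitSet-backed doc iterator with an explicit exhausted flag. */
final class BitSetMatcherSketch {
    private final BitSet bits;
    private int doc = -1;
    private boolean exhausted = false;

    BitSetMatcherSketch(BitSet bits) {
        this.bits = bits;
    }

    /** Moves to the next set bit; keeps returning false once the end was reached. */
    boolean next() {
        if (exhausted) {
            return false;
        }
        // nextSetBit returns -1 when no further bit is set. Without the flag, doc
        // would be -1 again and the following call would restart from bit 0, so a
        // caller that keeps calling next() would never terminate.
        doc = bits.nextSetBit(doc + 1);
        if (doc < 0) {
            exhausted = true;
            return false;
        }
        return true;
    }

    /** Moves to the first set bit at or after target that is beyond the current doc. */
    boolean skipTo(int target) {
        if (exhausted) {
            return false;
        }
        doc = bits.nextSetBit(Math.max(target, doc + 1));
        if (doc < 0) {
            exhausted = true;
            return false;
        }
        return true;
    }

    int doc() {
        return doc;
    }
}
```

The flag costs one extra boolean test per call, which is the inner-loop overhead being weighed against easier debugging in the exchange above.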
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483724 ]

Otis Gospodnetic commented on LUCENE-584:
-----------------------------------------

Paul: I applied the patch (it applied cleanly) and ran ant test - BUILD SUCCESSFUL :)

I'm primarily interested in using this in order to get matches, but avoid scoring. From what I can tell, I'd just need to switch to using the new match(Query, MatchCollector) method in IndexSearcher. However, I need Sort and TopFieldDocs, and I don't see a match method with those. Is there a reason why such a match method is not in the patch?

Decouple Filter from BitSet
---------------------------

Key: LUCENE-584
URL: https://issues.apache.org/jira/browse/LUCENE-584
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Affects Versions: 2.0.1
Reporter: Peter Schäfer
Priority: Minor
Attachments: BitsMatcher.java, Filter-20060628.patch, HitCollector-20060628.patch, IndexSearcher-20060628.patch, MatchCollector.java, Matcher.java, Matcher20070226.patch, Scorer-20060628.patch, Searchable-20060628.patch, Searcher-20060628.patch, Some Matchers.zip, SortedVIntList.java, TestSortedVIntList.java

{code}
package org.apache.lucene.search;

public abstract class Filter implements java.io.Serializable {
  public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
}

public interface AbstractBitSet {
  public boolean get(int index);
}
{code}

It would be useful if the method =Filter.bits()= returned an abstract interface, instead of =java.util.BitSet=.

Use case: there is a very large index, and, depending on the user's privileges, only a small portion of the index is actually visible. Sparsely populated =java.util.BitSet=s are not efficient and waste lots of memory. It would be desirable to have an alternative BitSet implementation with smaller memory footprint.

Though it _is_ possible to derive classes from =java.util.BitSet=, it was obviously not designed for that purpose. That's why I propose to use an interface instead. The default implementation could still delegate to =java.util.BitSet=.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
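The proposal in the issue description can be sketched as follows. This is only a minimal illustration, not code from any attached patch: the interface is the one from the description, while the class names BitSetAdapter and SparseDocSet are hypothetical and were chosen for this example.

```java
import java.util.Arrays;
import java.util.BitSet;

// The interface as proposed in the issue description.
interface AbstractBitSet {
    boolean get(int index);
}

// Hypothetical default implementation that simply delegates to java.util.BitSet.
final class BitSetAdapter implements AbstractBitSet {
    private final BitSet bits;

    BitSetAdapter(BitSet bits) {
        this.bits = bits;
    }

    public boolean get(int index) {
        return bits.get(index);
    }
}

// Hypothetical sparse implementation: it stores only the document numbers
// that are set, so memory grows with the number of visible documents
// rather than with the size of the index.
final class SparseDocSet implements AbstractBitSet {
    private final int[] sortedDocs;

    SparseDocSet(int[] docs) {
        sortedDocs = docs.clone();
        Arrays.sort(sortedDocs);
    }

    public boolean get(int index) {
        // binarySearch returns a non-negative position only when the key is present.
        return Arrays.binarySearch(sortedDocs, index) >= 0;
    }
}
```

Both classes answer get(int) identically for the same set of documents, so a caller written against AbstractBitSet would not need to know which representation a Filter chose.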
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483731 ]

Paul Elschot commented on LUCENE-584:
-------------------------------------

Otis:

> However, I need Sort and TopFieldDocs, and I don't see a match method with those. Is there a reason why such a match method is not in the patch?

A bit silly perhaps, but what sort criterion would you like to have used when no score() value is available?

I don't know the sorting code, but it might be possible to use a field value for sorting. In that case the sorting code for a Matcher would need to check that the sort criterion does not imply the use of a score value.

I personally have no use for sorting by field values, which is why I never thought of combining this with a Matcher.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12481734 ]

Paul Elschot commented on LUCENE-584:
-------------------------------------

Hoss,

> Paul: I notice Filter.getMatcher returns null, and IndexSearcher tests for that and uses it to decide whether or not to iterate over the (non null) Matcher, or over the BitSet from Filter.bits. is there any reason that logic can't be put in getMatcher, so that if subclasses of Filter don't override the getMatcher method it will call bits and then return a Matcher that iterates over the set Bits?

Two reasons:
- uncertainty over the performance of a Matcher instead of a BitSet,
- this way backward compatibility is very easily guaranteed.

There is also LUCENE-730, which may interfere with the removal of BitSet, since it allows documents to be scored out of order. However, LUCENE-730 should only be used at the top level of a query search and without a Filter. I cannot think of an actual case in which there might be interference, but I may not have looked into that deeply enough.

> we could even change Filter.bits so it's no longer abstract ... it could have an implementation that would call getMatcher, and iterate over all of the matched docs setting bits on a BitSet that is then returned ... the class would still be abstract, and the class javadocs would make it clear that subclasses must override at least one of the methods...

I must say that creating a BitSet from a Matcher never occurred to me. Anyway, when Filter.bits() is deprecated I have no preference about how it is actually removed.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12481263 ]

Hoss Man commented on LUCENE-584:
---------------------------------

It's been a while since I looked at this issue, but it's come up in discussion recently so I took another glance...

Paul: I notice Filter.getMatcher returns null, and IndexSearcher tests for that and uses it to decide whether or not to iterate over the (non null) Matcher, or over the BitSet from Filter.bits. Is there any reason that logic can't be put in getMatcher, so that if subclasses of Filter don't override the getMatcher method it will call bits and then return a Matcher that iterates over the set Bits?

(This is the roll-out approach I advocated a while back when discussing this on email, except that at the time Matcher was referred to as SearchFilter: http://www.nabble.com/RE%3A-Filter-p2605271.html )

Thinking about it now, we could even change Filter.bits so it's no longer abstract ... it could have an implementation that would call getMatcher, and iterate over all of the matched docs setting bits on a BitSet that is then returned ... the class would still be abstract, and the class javadocs would make it clear that subclasses must override at least one of the methods ... legacy Filters will work fine because they'll already have a bits method, and people writing new Filters will see that bits is deprecated, so they'll just write a getMatcher method and be done.

This appears to be the same approach taken with Analyzer.tokenStream back in 1.4.3...

http://lucene.apache.org/java/1_4_3/api/org/apache/lucene/analysis/Analyzer.html
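The roll-out approach described above, in which bits and getMatcher each get a default implementation in terms of the other, might look roughly like the sketch below. All types here are simplified, hypothetical stand-ins and not the API of the attached patches: DocMatcher replaces the patch's Matcher, SketchFilter replaces Filter, and an int maxDoc replaces the IndexReader argument.

```java
import java.util.BitSet;

// Hypothetical, simplified stand-in for the patch's Matcher.
interface DocMatcher {
    boolean next(); // advances to the next matching doc; false once exhausted
    int doc();      // current doc number, valid only after next() returned true
}

// Hypothetical stand-in for Filter. Subclasses must override at least one of
// the two methods; overriding neither would recurse until the stack overflows.
abstract class SketchFilter {
    /** Legacy method: by default, built by draining getMatcher(). */
    public BitSet bits(int maxDoc) {
        BitSet result = new BitSet(maxDoc);
        DocMatcher matcher = getMatcher(maxDoc);
        while (matcher.next()) {
            result.set(matcher.doc());
        }
        return result;
    }

    /** New method: by default, iterates over the set bits of bits(). */
    public DocMatcher getMatcher(int maxDoc) {
        final BitSet bits = bits(maxDoc);
        return new DocMatcher() {
            private int current = -1;
            private boolean exhausted = false;

            public boolean next() {
                if (exhausted) return false;
                current = bits.nextSetBit(current + 1);
                exhausted = current < 0;
                return !exhausted;
            }

            public int doc() {
                return current;
            }
        };
    }
}

// Legacy style: overrides only bits(), and inherits a working getMatcher().
final class LegacyEvenFilter extends SketchFilter {
    @Override
    public BitSet bits(int maxDoc) {
        BitSet b = new BitSet(maxDoc);
        for (int i = 0; i < maxDoc; i += 2) b.set(i);
        return b;
    }
}

// New style: overrides only getMatcher(), and inherits a working bits().
final class NewEvenFilter extends SketchFilter {
    @Override
    public DocMatcher getMatcher(final int maxDoc) {
        return new DocMatcher() {
            private int current = -2;

            public boolean next() {
                if (current + 2 >= maxDoc) return false;
                current += 2;
                return true;
            }

            public int doc() {
                return current;
            }
        };
    }
}
```

The cost of this arrangement is the one noted in the class comment: nothing at compile time forces a subclass to override either method, so the requirement has to live in the javadocs.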
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ http://issues.apache.org/jira/browse/LUCENE-584?page=comments#action_12450715 ]

Paul Elschot commented on LUCENE-584:
-------------------------------------

I have just resolved some minor local conflicts on the updated copyrights of four java files. Please holler when a fresh patch is needed.

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ http://issues.apache.org/jira/browse/LUCENE-584?page=comments#action_12437242 ]

Paul Elschot commented on LUCENE-584:
-------------------------------------

I wrote:

> One could add an abstract Scorer.explain() to catch these, or provide a default implementation for Scorer.explain().

by mistake. The good news is that the patch leaves the existing abstract Scorer.explain() method unaffected.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ http://issues.apache.org/jira/browse/LUCENE-584?page=comments#action_12434901 ]

Paul Elschot commented on LUCENE-584:
-------------------------------------

In the inheritance from Matcher to Scorer there is an asymmetry in this patch. Matcher provides a default implementation for Matcher.explain() but Scorer does not, and this might lead to surprises for future Scorers when the current Matcher.explain() is used. One could add an abstract Scorer.explain() to catch these, or provide a default implementation for Scorer.explain().

With matcher implementations, quite a few other implementation decisions need to be taken.

Also, any place in the current code where a Scorer is used, but none of the Scorer.score() methods, is a candidate for a change from Scorer to Matcher. This will be mostly the current filtering implementations, but ConstantScoreQuery is another nice example.

Regards,
Paul Elschot
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ http://issues.apache.org/jira/browse/LUCENE-584?page=comments#action_12434637 ]

Eks Dev commented on LUCENE-584:
--------------------------------

Paul,

What is next now? We ran enough experiments on our app and are now sure that this patch causes no incompatibilities. We also tried replacing our filters with OpenBitSet and VInt matchers, and the results there are more than good: our app showed a crazy 30% speed-up!!! It is hard to identify exactly where it comes from, but I suspect that the VInt matcher, in the case of not-too-dense BitVectors, increased our Filter Cache utilization significantly.

I would propose to commit this patch before we go further with something that would actually utilize Matcher, just to avoid creating a monster patch on top of a patch ...

This is groundwork, and now using Matcher will be pure poetry. I see a lot of places that could have a better life by using Matchers: ConstantScoreQuery, PrefixFilter, ChainedFilter (becomes obsolete now)... actually, replace all uses of BitSet with OpenBitSet (or, a bit smarter, with SortedIntList, VInt...)...

Then the question here: do we create a dependency on Solr from Lucene, or do we migrate OpenBitSet to Lucene (as this dependency already exists), or do we copy-paste and have two OpenBitSets, Yonik? As far as I am concerned, it makes no real difference.

Do you, or someone else, see things that still need to be done before committing this?
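To illustrate why a VInt-based representation can help with sparse filters, here is a minimal sketch of delta-plus-VInt encoding for a sorted list of document numbers. It is not the attached SortedVIntList.java; the class name VIntDeltaCodec and its methods are hypothetical. The byte layout follows the usual Lucene VInt convention (seven payload bits per byte, low-order bits first, high bit set when more bytes follow).

```java
import java.io.ByteArrayOutputStream;

// Hypothetical codec: stores ascending doc numbers as VInt-encoded gaps.
final class VIntDeltaCodec {

    /** Encodes ascending, non-negative doc numbers as variable-length deltas. */
    static byte[] encode(int[] sortedDocs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int doc : sortedDocs) {
            int delta = doc - previous;
            // Emit seven bits at a time; the high bit marks a continuation byte.
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80);
                delta >>>= 7;
            }
            out.write(delta);
            previous = doc;
        }
        return out.toByteArray();
    }

    /** Decodes the first count doc numbers from bytes produced by encode(). */
    static int[] decode(byte[] bytes, int count) {
        int[] docs = new int[count];
        int pos = 0;
        int previous = 0;
        for (int i = 0; i < count; i++) {
            int b = bytes[pos++];
            int delta = b & 0x7F;
            for (int shift = 7; (b & 0x80) != 0; shift += 7) {
                b = bytes[pos++];
                delta |= (b & 0x7F) << shift;
            }
            previous += delta;
            docs[i] = previous;
        }
        return docs;
    }
}
```

For the three documents 3, 130 and 1000000 the gaps are 3, 127 and 999870, which take 1 + 1 + 3 = 5 bytes, whereas a java.util.BitSet covering the same range needs roughly 125 KB. Smaller cached filters are one plausible reason for better cache utilization, although the thread does not establish the cause of the measured speed-up.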
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ http://issues.apache.org/jira/browse/LUCENE-584?page=comments#action_12434763 ]

Paul Elschot commented on LUCENE-584:
-------------------------------------

> Do you, or someone else see now things to be done before commiting this?

Yes. In the steps listed here:
http://wiki.apache.org/jakarta-lucene/HowToContribute
the next step is to be patient. Whether being patient is something that can be done is an open question...

Paul Elschot.
Re: [jira] Commented: (LUCENE-584) Decouple Filter from BitSet
Not being impatient, just asking whether all holes are covered. Matcher rocks, and I'd like to clean up a lot of the mess we created in our local copy in order to simulate what Matcher will permit us to do in a really elegant way...

If being patient is all it takes, cool ;)

----- Original Message ----
From: Paul Elschot (JIRA) [EMAIL PROTECTED]
To: java-dev@lucene.apache.org
Sent: Thursday, 14 September, 2006 8:41:25 PM
Subject: [jira] Commented: (LUCENE-584) Decouple Filter from BitSet

[ http://issues.apache.org/jira/browse/LUCENE-584?page=comments#action_12434763 ]

Paul Elschot commented on LUCENE-584:
-------------------------------------

> Do you, or someone else see now things to be done before commiting this?

Yes. In the steps listed here:
http://wiki.apache.org/jakarta-lucene/HowToContribute
the next step is to be patient. Whether being patient is something that can be done is an open question...

Paul Elschot.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ http://issues.apache.org/jira/browse/LUCENE-584?page=comments#action_12432435 ]

paul.elschot commented on LUCENE-584:
-------------------------------------

> No performance changes as well.

It's good to hear that. As mentioned earlier, this is groundwork only. Once an actual Matcher is used I expect some performance differences to show up.

Which comment of Yonik related to HitCollector do you mean?

> Early this week we will try to implement our first Matchers and see how they behave

BitsMatcher and SortedVIntList could start that. Also I'd like to see one on Solr's OpenBitSet...
Re: [jira] Commented: (LUCENE-584) Decouple Filter from BitSet
Yonik, is there any reason for the BitSetIterator method

int next(int fromIndex) {...

to be package protected?

It would be interesting to see how BitSetIterator works in a Matcher; skipping is needed there.

----- Original Message ----
From: paul.elschot (JIRA) [EMAIL PROTECTED]
To: java-dev@lucene.apache.org
Sent: Monday, 4 September, 2006 8:47:24 AM
Subject: [jira] Commented: (LUCENE-584) Decouple Filter from BitSet

[ http://issues.apache.org/jira/browse/LUCENE-584?page=comments#action_12432435 ]

paul.elschot commented on LUCENE-584:
-------------------------------------

> No performance changes as well.

It's good to hear that. As mentioned earlier, this is groundwork only. Once an actual Matcher is used I expect some performance differences to show up.

Which comment of Yonik related to HitCollector do you mean?

> Early this week we will try to implement our first Matchers and see how they behave

BitsMatcher and SortedVIntList could start that. Also I'd like to see one on Solr's OpenBitSet...
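As a rough illustration of why a "next set bit from index" operation matters for skipping, the sketch below builds a skipping matcher on top of java.util.BitSet.nextSetBit(int). It is an assumption-laden stand-in: it does not use Solr's BitSetIterator or OpenBitSet, and the class name BitSetSkippingMatcher is hypothetical.

```java
import java.util.BitSet;

// Hypothetical forward-only matcher over a java.util.BitSet.
final class BitSetSkippingMatcher {
    private final BitSet bits;
    private int current = -1;          // -1 before the first call to next()/skipTo()
    private boolean exhausted = false;

    BitSetSkippingMatcher(BitSet bits) {
        this.bits = bits;
    }

    /** Moves to the next set bit after the current document. */
    boolean next() {
        return advanceFrom(current + 1);
    }

    /** Moves to the first set bit at or beyond target, never moving backwards. */
    boolean skipTo(int target) {
        return advanceFrom(Math.max(target, current + 1));
    }

    /** Current document number; valid only after next()/skipTo() returned true. */
    int doc() {
        return current;
    }

    private boolean advanceFrom(int fromIndex) {
        if (exhausted) return false;
        // nextSetBit returns -1 when there is no set bit at or after fromIndex.
        current = bits.nextSetBit(fromIndex);
        exhausted = current < 0;
        return !exhausted;
    }
}
```

Without access to such a method, skipTo would have to be emulated by calling next() repeatedly, which gives up the main advantage of a bit set for skipping.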
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ http://issues.apache.org/jira/browse/LUCENE-584?page=comments#action_12432497 ]

Eks Dev commented on LUCENE-584:

Paul,

What are the exact semantics of skipTo(int) in Matcher?
- Is it OK to skip back and forth before I reach the end? E.g.: skipTo(0); skipTo(333); skipTo(0);
- Once I reach the end, skipTo(int) does nothing (BitsMatcher, exhausted). It is impossible to reposition the Matcher after that.

Is this the intended behavior: skip forward until you reach the end, and then you are at the end :) ?

Decouple Filter from BitSet
---

Key: LUCENE-584
URL: http://issues.apache.org/jira/browse/LUCENE-584
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Affects Versions: 2.0.1
Reporter: Peter Schäfer
Priority: Minor
Attachments: BitsMatcher.java, Filter-20060628.patch, HitCollector-20060628.patch, IndexSearcher-20060628.patch, MatchCollector.java, Matcher.java, Matcher20060830b.patch, Scorer-20060628.patch, Searchable-20060628.patch, Searcher-20060628.patch, SortedVIntList.java, TestSortedVIntList.java

{code}
package org.apache.lucene.search;

public abstract class Filter implements java.io.Serializable {
  public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
}

public interface AbstractBitSet {
  public boolean get(int index);
}
{code}

It would be useful if the method =Filter.bits()= returned an abstract interface, instead of =java.util.BitSet=.

Use case: there is a very large index, and, depending on the user's privileges, only a small portion of the index is actually visible. Sparsely populated =java.util.BitSet=s are not efficient and waste lots of memory. It would be desirable to have an alternative BitSet implementation with a smaller memory footprint.

Though it _is_ possible to derive classes from =java.util.BitSet=, it was obviously not designed for that purpose. That's why I propose to use an interface instead. The default implementation could still delegate to =java.util.BitSet=.
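To make the memory argument in the issue description concrete, here is a minimal sketch of the proposed AbstractBitSet interface with two implementations. Only the interface itself comes from the issue; SortedIntArrayBitSet and JavaBitSetAdapter are hypothetical names invented for this illustration and are not the SortedVIntList from the attachments. A java.util.BitSet needs one bit for every index up to its highest set bit, so in an index of 50 million documents a filter with a match near the end occupies roughly 6 MB, while a sorted int array needs 4 bytes per visible document.

```java
import java.util.Arrays;
import java.util.BitSet;

public class SparseBitSetSketch {
    /** Interface as proposed in the issue description. */
    interface AbstractBitSet {
        boolean get(int index);
    }

    /** Hypothetical sparse implementation: stores only the matching document numbers. */
    static final class SortedIntArrayBitSet implements AbstractBitSet {
        private final int[] sortedDocs;

        SortedIntArrayBitSet(int... docs) {
            sortedDocs = docs.clone();
            Arrays.sort(sortedDocs);
        }

        @Override
        public boolean get(int index) {
            // Binary search over the sorted document numbers.
            return Arrays.binarySearch(sortedDocs, index) >= 0;
        }
    }

    /** Hypothetical default implementation that delegates to java.util.BitSet. */
    static final class JavaBitSetAdapter implements AbstractBitSet {
        private final BitSet bits;

        JavaBitSetAdapter(BitSet bits) {
            this.bits = bits;
        }

        @Override
        public boolean get(int index) {
            return bits.get(index);
        }
    }

    public static void main(String[] args) {
        AbstractBitSet sparse = new SortedIntArrayBitSet(333, 7, 42);
        System.out.println(sparse.get(42)); // true
        System.out.println(sparse.get(43)); // false

        BitSet bits = new BitSet();
        bits.set(42);
        AbstractBitSet dense = new JavaBitSetAdapter(bits);
        System.out.println(dense.get(42));  // true
    }
}
```

The trade-off is lookup cost: get() on the sorted array is a binary search rather than a single word access, which is one reason the later discussion moves from random access towards an iterator with skipTo(int).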
Re: [jira] Commented: (LUCENE-584) Decouple Filter from BitSet
On Monday 04 September 2006 13:43, Eks Dev (JIRA) wrote:
> [ http://issues.apache.org/jira/browse/LUCENE-584?page=comments#action_12432497 ]
>
> Eks Dev commented on LUCENE-584:
>
> Paul,
>
> What are the exact semantics of skipTo(int) in Matcher?
> - Is it OK to skip back and forth before I reach the end? E.g.: skipTo(0); skipTo(333); skipTo(0);
> - Once I reach the end, skipTo(int) does nothing (BitsMatcher, exhausted). It is impossible to reposition the Matcher after that.
>
> Is this the intended behavior: skip forward until you reach the end, and then you are at the end :) ?

This last one. From the javadocs (in the patch):

  Skips to the first match whose document number is greater than or equal to a given target. If, after next() or skipTo(int) has been called the first time, the target is before or at the current document, the current document may change to the next matching document.

Regards,
Paul Elschot
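The forward-only contract quoted from the javadocs can be sketched roughly as follows. ForwardOnlyMatcher is a hypothetical class written for this illustration; it is not the Matcher or BitsMatcher from the patch, and it picks one permitted reading of "may change to the next matching document" by always advancing when the target is at or before the current document.

```java
import java.util.BitSet;

/** Hypothetical forward-only iterator over a BitSet; not the Matcher from the patch. */
public class ForwardOnlyMatcher {
    private final BitSet bits;
    private int doc = -1;               // -1 until next() or skipTo(int) is first called
    private boolean exhausted = false;

    public ForwardOnlyMatcher(BitSet bits) {
        this.bits = bits;
    }

    /** Current document number; only meaningful after a successful next() or skipTo(int). */
    public int doc() {
        return doc;
    }

    public boolean next() {
        return advanceTo(doc + 1);
    }

    /** Never moves backwards: a target at or before the current document advances by one match. */
    public boolean skipTo(int target) {
        return advanceTo(Math.max(target, doc + 1));
    }

    private boolean advanceTo(int from) {
        if (exhausted) {
            return false;               // once at the end, stay at the end
        }
        int found = bits.nextSetBit(from);
        if (found < 0) {
            exhausted = true;
            return false;
        }
        doc = found;
        return true;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet();
        bits.set(0);
        bits.set(5);
        bits.set(333);
        ForwardOnlyMatcher m = new ForwardOnlyMatcher(bits);
        System.out.println(m.skipTo(0) + " " + m.doc());   // true 0
        System.out.println(m.skipTo(333) + " " + m.doc()); // true 333
        System.out.println(m.skipTo(0) + " " + m.doc());   // false 333
    }
}
```

With the sequence from the question, skipTo(0); skipTo(333); skipTo(0);, the third call does not return to document 0. It looks for a match after 333, finds none, and leaves the iterator exhausted.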
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ http://issues.apache.org/jira/browse/LUCENE-584?page=comments#action_12432378 ]

Eks Dev commented on LUCENE-584:

Hi Paul,

For me, this patch did not cause any incompatibility issues. All our tests passed without any noticeable difference from the previous trunk version. No performance changes either (we use HitCollector only, so Yonik's comment does not apply here).

The tests are application level and keep the index hot: 6 hours of searches with a test batch of requests with known responses, 50 million non-artificial docs, real requests...

Early this week we will try to implement our first Matchers and see how they behave.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ http://issues.apache.org/jira/browse/LUCENE-584?page=comments#action_12431684 ]

Yonik Seeley commented on LUCENE-584:

Thanks Paul, I like the Matcher/Scorer relation.

It looks like no Filters currently return a matcher, so the current patch just lays the groundwork, right?

When some filters do start to return a matcher, it looks like support for the 1.4 BooleanScorer needs to be removed, or a check done in IndexSearcher.search() to disable skipping on the scorer if it's in use.

I wonder what the performance impact is... for a dense search with a dense bitset filter, it looks like quite a bit of overhead is added (two calls in order to get the next doc, use of nextSetBit() instead of get(), checking exhausted each time and checking for -1 to set exhausted). I suppose one can always drop back to using a HitCollector for special cases though.

Decouple Filter from BitSet
---

Key: LUCENE-584
URL: http://issues.apache.org/jira/browse/LUCENE-584
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Affects Versions: 2.0.1
Reporter: Peter Schäfer
Priority: Minor
Attachments: BitsMatcher.java, Filter-20060628.patch, HitCollector-20060628.patch, IndexSearcher-20060628.patch, MatchCollector.java, Matcher.java, Matcher20060830.patch, Scorer-20060628.patch, Searchable-20060628.patch, Searcher-20060628.patch, SortedVIntList.java, TestSortedVIntList.java
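The two access patterns Yonik compares can be put side by side in a small sketch. The class and method names below are hypothetical, and a sorted int array stands in for the scorer's document stream; this is not code from the patch and it makes no claim about measured performance. It only shows where the extra bookkeeping comes from: the random-access version does one get() per candidate, while the iterator version tracks a second cursor and checks for the -1 end marker on every step.

```java
import java.util.BitSet;

public class FilterIntersectionSketch {
    /** Random-access style: one get() per candidate document. */
    static int countWithGet(int[] scorerDocs, BitSet filter) {
        int count = 0;
        for (int doc : scorerDocs) {
            if (filter.get(doc)) {
                count++;
            }
        }
        return count;
    }

    /** Iterator style: leapfrog between the scorer docs and the filter via nextSetBit(). */
    static int countWithSkipping(int[] scorerDocs, BitSet filter) {
        int count = 0;
        int i = 0;
        int filterDoc = filter.nextSetBit(0);   // -1 means the filter is exhausted
        while (i < scorerDocs.length && filterDoc >= 0) {
            int scorerDoc = scorerDocs[i];
            if (scorerDoc == filterDoc) {
                count++;
                i++;
                filterDoc = filter.nextSetBit(filterDoc + 1);
            } else if (scorerDoc < filterDoc) {
                i++;                            // a real scorer could skipTo(filterDoc) here
            } else {
                filterDoc = filter.nextSetBit(scorerDoc);
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] scorerDocs = {1, 2, 3, 5, 8, 13}; // must be sorted ascending
        BitSet filter = new BitSet();
        for (int doc : new int[] {2, 3, 4, 8, 100}) {
            filter.set(doc);
        }
        System.out.println(countWithGet(scorerDocs, filter));      // 3
        System.out.println(countWithSkipping(scorerDocs, filter)); // 3
    }
}
```

When both sides are dense, almost every step of the second loop lands in the first branch, so the skipping machinery does extra work without skipping anything. When the filter is sparse, the third branch lets the filter jump ahead, and a real scorer could jump in the second branch, which is the case the Matcher approach targets.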