[jira] Commented: (LUCENE-794) SpanScorer and SimpleSpanFragmenter for Contrib Highlighter
[ https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487981 ] Sean O'Connor commented on LUCENE-794: -- Thanks Mark. I had the trunk from a few days ago (perhaps a week), so that was just me being lazy :-). Is there anything I should be aware of regarding the parser.setUseOldRangeQuery(true) call in doSearching(String queryString)? [about line 890 in SpanHighlighterTest.java] I've read the javadocs, which explain it a bit, but I don't think I understand enough to infer why you use it in SpanHighlighterTest.java. If I can (relatively) safely ignore that, I will. Sean Mark Miller (JIRA) wrote: [ [1]https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487860 ] Mark Miller commented on LUCENE-794: Sorry Sean, I forgot to mention that the patch is off of the latest Lucene trunk code. The range query test should fail because they switched the query parser to return a constant score query instead of a range query. Cannot highlight a constant score query. 
- Mark SpanScorer and SimpleSpanFragmenter for Contrib Highlighter --- Key: LUCENE-794 URL: [2]https://issues.apache.org/jira/browse/LUCENE-794 Project: Lucene - Java Issue Type: Improvement Components: Other Reporter: Mark Miller Priority: Minor Attachments: CachedTokenStream.java, CachedTokenStream.java, CachedTokenStream.java, DefaultEncoder.java, Encoder.java, Formatter.java, Highlighter.java, Highlighter.java, Highlighter.java, Highlighter.java, Highlighter.java, HighlighterTest.java, HighlighterTest.java, HighlighterTest.java, HighlighterTest.java, MemoryIndex.java, QuerySpansExtractor.java, QuerySpansExtractor.java, QuerySpansExtractor.java, QuerySpansExtractor.java, SimpleFormatter.java, spanhighlighter.patch, spanhighlighter2.patch, spanhighlighter3.patch, spanhighlighter5.patch, spanhighlighter_patch_4.zip, SpanHighlighterTest.java, SpanHighlighterTest.java, SpanScorer.java, SpanScorer.java, WeightedSpanTerm.java This patch adds a new Scorer class (SpanQueryScorer) to the Highlighter package that scores just like QueryScorer, but scores a 0 for Terms that did not cause the Query hit. This gives 'actual' hit highlighting for the range of SpanQuerys and PhraseQuery. There is also a new Fragmenter that attempts to fragment without breaking up Spans. See [3]http://issues.apache.org/jira/browse/LUCENE-403 for some background. There is a dependency on MemoryIndex. 
[1] https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487860 [2] https://issues.apache.org/jira/browse/LUCENE-794 [3] http://issues.apache.org/jira/browse/LUCENE-403 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-857) Remove BitSet caching from QueryFilter
[ https://issues.apache.org/jira/browse/LUCENE-857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved LUCENE-857. - Resolution: Fixed Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) But of course. Thanks for the catch! > Remove BitSet caching from QueryFilter > -- > > Key: LUCENE-857 > URL: https://issues.apache.org/jira/browse/LUCENE-857 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Otis Gospodnetic > Assigned To: Otis Gospodnetic >Priority: Minor > Attachments: LUCENE-857.patch, LUCENE-857.refactoring-approach.diff > > > Since caching is built into the public BitSet bits(IndexReader reader) > method, I don't see a way to deprecate that, which means I'll just cut it out > and document it in CHANGES.txt. Anyone who wants QueryFilter caching will be > able to get the caching back by wrapping the QueryFilter in the > CachingWrapperFilter.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487966 ] Otis Gospodnetic commented on LUCENE-584: - Right. I was under the wrong impression that the Matcher also happens to avoid scoring. However, now that we've all looked at this patch (still applies cleanly and unit tests all pass), and nobody had any criticisms, I think we should commit it, say this Friday. As I'm in the performance squeezing mode, I'll go look at LUCENE-730, another one of Paul's great patches, and see if I can measure performance improvement there. > Decouple Filter from BitSet > --- > > Key: LUCENE-584 > URL: https://issues.apache.org/jira/browse/LUCENE-584 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.0.1 >Reporter: Peter Schäfer >Priority: Minor > Attachments: bench-diff.txt, bench-diff.txt, BitsMatcher.java, > Filter-20060628.patch, HitCollector-20060628.patch, > IndexSearcher-20060628.patch, MatchCollector.java, Matcher.java, > Matcher20070226.patch, Scorer-20060628.patch, Searchable-20060628.patch, > Searcher-20060628.patch, Some Matchers.zip, SortedVIntList.java, > TestSortedVIntList.java > > > {code} > package org.apache.lucene.search; > public abstract class Filter implements java.io.Serializable > { > public abstract AbstractBitSet bits(IndexReader reader) throws IOException; > } > public interface AbstractBitSet > { > public boolean get(int index); > } > {code} > It would be useful if the method =Filter.bits()= returned an abstract > interface, instead of =java.util.BitSet=. > Use case: there is a very large index, and, depending on the user's > privileges, only a small portion of the index is actually visible. > Sparsely populated =java.util.BitSet=s are not efficient and waste lots of > memory. It would be desirable to have an alternative BitSet implementation > with smaller memory footprint. 
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was > obviously not designed for that purpose. > That's why I propose to use an interface instead. The default implementation > could still delegate to =java.util.BitSet=.
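The sparse use case described above can be sketched with a minimal implementation of the proposed interface. This is a hypothetical example, not code from the patch: AbstractBitSet is the interface from the issue description, while SortedIntBitSet is an invented name for a membership test backed by a sorted array of set doc ids instead of a dense bit array.

```java
import java.util.Arrays;

// The interface proposed in the issue description.
interface AbstractBitSet {
    boolean get(int index);
}

// Hypothetical sparse implementation: for an index where only a small
// portion is visible, storing just the sorted doc ids of the "set" bits
// uses O(n) ints instead of O(maxDoc) bits.
class SortedIntBitSet implements AbstractBitSet {
    private final int[] docs; // sorted, distinct doc ids that are set

    SortedIntBitSet(int[] sortedDocs) {
        this.docs = sortedDocs;
    }

    // O(log n) membership test via binary search.
    public boolean get(int index) {
        return Arrays.binarySearch(docs, index) >= 0;
    }
}
```

A filter over a 10M-doc index with 1K visible docs would then hold 1K ints rather than a 10M-bit java.util.BitSet.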
Re: Lucene Branding: the TLP, and "Lucene Java"
For some reason I've never been confused by the naming. I think in my mind and when I talk about this, I say "Lucene project" when I mean the TLP, and Lucene when I talk about the original Lucene. Though I'd personally be sad to see the original Lucene get renamed now, I'm open. :) I agree with Grant about where we are going with the Lucene TLP, and I'm very much looking forward to new things that will grow under the Lucene name. Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share - Original Message From: Grant Ingersoll <[EMAIL PROTECTED]> To: java-dev@lucene.apache.org Sent: Tuesday, April 10, 2007 9:13:36 PM Subject: Re: Lucene Branding: the TLP, and "Lucene Java" No, you are not the only one... Many a sleepless night spent on it... :-) I usually try to refer to it as Lucene Java, but old habits die hard and often times I just call it Lucene. I think the name has a good brand at this point and is very strongly associated w/ the Java library. I seem to recall when they were forming the TLP, that the original proposal was search.a.o, but then changed b/c the ASF didn't like generic names (or at least that is how I recall it.) And, of course, with Hadoop and the potential for Tika/Lius, it isn't just search anymore. I have often thought about an Apache "Text" project, that could eventually hold a whole family of text based tools like Lucene, Tika, Hadoop, Solr, etc. plus things like part of speech taggers, clustering/classification algorithms, UIMA, etc. all under one roof. But that is just my two cents and I don't know if it fits with what other people have in mind. There are a lot of OSS tools out there for these things, but none bring together a whole suite under a brand like Apache. -Grant On Apr 10, 2007, at 8:41 PM, Chris Hostetter wrote: > I was motivated to start this thread by LUCENE-860, but it's been in the back of my mind for a while. 
> As the Lucene Top Level Project grows and gets more Sub-Projects, I (personally) have been finding it hard in email/documentation/discussion to clarify when people are referring to the "Lucene" Top Level Project versus the "Lucene" java project. I can't help but wonder if the TLP should have a different name, or if "Lucene Java" should take on a more specific name that doesn't just sound like a name followed by a language -- ie: JLucene, LuceneJ ... anything that makes it more clear that when the word "Lucene" is used it's talking about the broader Top Level Project addressing all aspects of OSS Search Software. > > Am I the only one that wonders about this as time goes on? > > -Hoss -- Grant Ingersoll Center for Natural Language Processing http://www.cnlp.org Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
Re: Lucene Branding: the TLP, and "Lucene Java"
No, you are not the only one... Many a sleepless night spent on it... :-) I usually try to refer to it as Lucene Java, but old habits die hard and often times I just call it Lucene. I think the name has a good brand at this point and is very strongly associated w/ the Java library. I seem to recall when they were forming the TLP, that the original proposal was search.a.o, but then changed b/c the ASF didn't like generic names (or at least that is how I recall it.) And, of course, with Hadoop and the potential for Tika/Lius, it isn't just search anymore. I have often thought about an Apache "Text" project, that could eventually hold a whole family of text based tools like Lucene, Tika, Hadoop, Solr, etc. plus things like part of speech taggers, clustering/classification algorithms, UIMA, etc. all under one roof. But that is just my two cents and I don't know if it fits with what other people have in mind. There are a lot of OSS tools out there for these things, but none bring together a whole suite under a brand like Apache. -Grant On Apr 10, 2007, at 8:41 PM, Chris Hostetter wrote: I was motivated to start this thread by LUCENE-860, but it's been in the back of my mind for a while. As the Lucene Top Level Project grows and gets more Sub-Projects, I (personally) have been finding it hard in email/documentation/discussion to clarify when people are referring to the "Lucene" Top Level Project versus the "Lucene" java project. I can't help but wonder if the TLP should have a different name, or if "Lucene Java" should take on a more specific name that doesn't just sound like a name followed by a language -- ie: JLucene, LuceneJ ... anything that makes it more clear that when the word "Lucene" is used it's talking about the broader Top Level Project addressing all aspects of OSS Search Software. Am I the only one that wonders about this as time goes on? 
-Hoss -- Grant Ingersoll Center for Natural Language Processing http://www.cnlp.org Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
[jira] Reopened: (LUCENE-857) Remove BitSet caching from QueryFilter
[ https://issues.apache.org/jira/browse/LUCENE-857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man reopened LUCENE-857: - Lucene Fields: [New, Patch Available] (was: [New]) Actually Otis: for the backwards compatibility to work, QueryFilter needs to extend CachingWrapperFilter with a constructor like... public QueryFilter(Query query) { super(new QueryWrapperFilter(query)); } ...what you've committed eliminates the caching from QueryFilter. > Remove BitSet caching from QueryFilter > -- > > Key: LUCENE-857 > URL: https://issues.apache.org/jira/browse/LUCENE-857 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Otis Gospodnetic > Assigned To: Otis Gospodnetic >Priority: Minor > Attachments: LUCENE-857.patch, LUCENE-857.refactoring-approach.diff > > > Since caching is built into the public BitSet bits(IndexReader reader) > method, I don't see a way to deprecate that, which means I'll just cut it out > and document it in CHANGES.txt. Anyone who wants QueryFilter caching will be > able to get the caching back by wrapping the QueryFilter in the > CachingWrapperFilter.
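Hoss's one-line constructor is the whole fix; the sketch below fleshes out the delegation pattern it relies on. These are simplified stand-ins (a String "query", an Object "reader", a plain HashMap cache), not Lucene's real Filter, CachingWrapperFilter, or QueryWrapperFilter internals, and the placeholder matching logic is invented for illustration only.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Stand-in for Lucene's Filter base class.
abstract class Filter {
    public abstract BitSet bits(Object reader);
}

// Stand-in caching wrapper: one BitSet per reader, computed on first
// use and then reused, so callers keep the old per-reader caching.
class CachingWrapperFilter extends Filter {
    private final Filter wrapped;
    private final Map<Object, BitSet> cache = new HashMap<>();

    CachingWrapperFilter(Filter wrapped) { this.wrapped = wrapped; }

    public synchronized BitSet bits(Object reader) {
        return cache.computeIfAbsent(reader, wrapped::bits);
    }
}

// Stand-in for QueryWrapperFilter: turns a "query" into an uncached filter.
class QueryWrapperFilter extends Filter {
    private final String query; // placeholder for a real Query object
    QueryWrapperFilter(String query) { this.query = query; }
    public BitSet bits(Object reader) {
        BitSet b = new BitSet();
        b.set(query.length()); // placeholder "matching" logic
        return b;
    }
}

// The shape Hoss describes: caching preserved purely by delegation.
class QueryFilter extends CachingWrapperFilter {
    QueryFilter(String query) {
        super(new QueryWrapperFilter(query));
    }
}
```

With this shape, existing code that calls bits() twice on the same reader gets the same cached BitSet back, which is the backward-compatible behavior the plain removal lost.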
Lucene Branding: the TLP, and "Lucene Java"
I was motivated to start this thread by LUCENE-860, but it's been in the back of my mind for a while. As the Lucene Top Level Project grows and gets more Sub-Projects, I (personally) have been finding it hard in email/documentation/discussion to clarify when people are referring to the "Lucene" Top Level Project versus the "Lucene" java project. I can't help but wonder if the TLP should have a different name, or if "Lucene Java" should take on a more specific name that doesn't just sound like a name followed by a language -- ie: JLucene, LuceneJ ... anything that makes it more clear that when the word "Lucene" is used it's talking about the broader Top Level Project addressing all aspects of OSS Search Software. Am I the only one that wonders about this as time goes on? -Hoss
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487962 ] Hoss Man commented on LUCENE-855: - On Mon, 9 Apr 2007, Otis Gospodnetic (JIRA) wrote: : I'd love to know what Hoss and other big Filter users think about this. : Solr makes a lot of use of (Range?)Filters, I believe. This is one of those Jira issues that I didn't really have time to follow when it was first opened, and so the Jira emails have just been piling up waiting for me to read. Here are the raw notes I took as I read through the patches...

FieldCacheRangeFilter.patch from 10/Apr/07 01:52 PM
* javadoc cut/paste errors (FieldCache)
* FieldCacheRangeFilter should work with simple strings (using FieldCache.getStrings or FieldCache.getStringIndex) just like regular RangeFilter
* it feels like the various parser versions should be in separate subclasses (common abstract base class?)
* why does clone need to construct a raw BitSet? what exactly didn't work about ChainedFilter without this? (could cause other BitSet usage problems)
* or/and/andNot/xor can all be implemented using convertToBitSet
* need FieldCacheBitSet methods: cardinality, get(int,int)
* need equals and hashCode methods in all new classes
* FieldCacheBitSet.clear should be UnsuppOp
* convertToBitSet can be cached.
* FieldCacheBitSet should be abstract, requiring get(int) be implemented

MemoryCachedRangeFilter_1.4.patch from 06/Apr/07 06:14 AM
* "tuples" should be initialized to fieldCache.length ... serious ArrayList resizing going on there (why is it an ArrayList, why not just Tuples[]?)
* doesn't "cache" need synchronization? ... seems like the same CreationPlaceholder pattern used in FieldCache might make sense here.
* this looks wrong... 
} else if ( (!includeLower) && (lowerIndex >= 0) ) { ...consider the case where lower==5, includeLower==false, and all values in the index are 5: a binary search could leave us in the middle of the index, so we still need to move forward to the end?
* ditto the above concern for finding upperIndex
* what is the pathological worst case for rewind/forward when *lots* of duplicate values are in the index? should another binarySearch be used?
* a lot of code in MemoryCachedRangeFilter.bits for finding lowerIndex/upperIndex would probably make more sense as methods in SortedFieldCache
* only seems to handle longs, at a minimum should deal with arbitrary strings, with optional add-ons for longs/ints/etc...
* I can't help but wonder how MemoryCachedRangeFilter would compare if it used Solr's OpenBitSet (facaded to implement the BitSet API)

TestRangeFilterPerformanceComparison.java from 10/Apr/07
* I can't help but wonder how RangeFilter would compare if it used Solr's OpenBitSet (facaded to implement the BitSet API)
* no test of includeLower==false or includeUpper==false
* I don't think the ranges being compared are the same for RangeFilter as they are for the other Filters ... note the use of DateTools when building the index, vs straight string usage in RangeFilter, vs Long.parseLong in MemoryCachedRangeFilter and FieldCacheRangeFilter
* is it really a fair comparison to call MemoryCachedRangeFilter.warmup or FieldCacheRangeFilter.bits outside of the timing code? for indexes where the IndexReader is reopened periodically this may be a significant number to be aware of.

Questions about the legitimacy of the testing aside... In general, I like the approach of FieldCacheBitSet -- but it should be generalized into an "AbstractReadOnlyBitSet" where all methods are implemented via get(int) in subclasses -- we should make sure that every method in the BitSet API works as advertised in Java 1.4. I don't really like the various hoops FieldCacheRangeFilter has to jump through to support int/float/long ... 
I think at its core it should support simple Strings, with alternate/sub classes for dealing with other FieldCache formats ... I just really dislike all the crazy nested ifs to deal with the different Parser types; if there are going to be separate constructors for longs/floats/ints, they might as well be separate sub-classes. The really nice thing this has over RangeFilter is that people can index raw numeric values without needing to massage them into lexicographically ordered Strings (since the FieldCache will take care of parsing them appropriately). My gut tells me that the MemoryCachedRangeFilter approach will never ever be able to compete with the FieldCacheRangeFilter facading-BitSet approach, since it needs to build the FieldCache, then the SortedFieldCache, then a BitSet ... it seems like any optimization in that pipeline can always be beaten by using the same logic, but then facading the BitSet > MemoryCachedRangeFilter to boost performance of Range queries > ---
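The "AbstractReadOnlyBitSet" idea Hoss sketches can be illustrated as follows. This is a hedged sketch, not the patch's FieldCacheBitSet: extend java.util.BitSet but answer every query through an abstract get(int), so no bits are ever materialized. Only a few representative methods are shown; a real facade would need to cover the entire Java 1.4 BitSet API, as the notes above insist. The class name and the size field are invented for this example.

```java
import java.util.BitSet;

// Hypothetical "AbstractReadOnlyBitSet": all reads funnel through an
// abstract get(int); mutators are unsupported, matching the note that
// FieldCacheBitSet.clear should be UnsuppOp.
abstract class ReadOnlyBitSetFacade extends BitSet {
    private final int size; // logical size (e.g. maxDoc)

    protected ReadOnlyBitSetFacade(int size) { this.size = size; }

    // Subclasses answer membership directly (e.g. from a FieldCache).
    public abstract boolean get(int index);

    // cardinality via get(int) -- one of the missing methods Hoss lists.
    public int cardinality() {
        int n = 0;
        for (int i = 0; i < size; i++) if (get(i)) n++;
        return n;
    }

    public int nextSetBit(int from) {
        for (int i = from; i < size; i++) if (get(i)) return i;
        return -1;
    }

    // Read-only: mutation is an error.
    public void set(int index) { throw new UnsupportedOperationException(); }
    public void clear(int index) { throw new UnsupportedOperationException(); }
}
```

A FieldCacheRangeFilter-style subclass would implement get(int) as a range check against the cached value for that doc, so the "BitSet" costs no memory beyond the FieldCache itself.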
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487940 ] Hoss Man commented on LUCENE-584: - I'm a little behind on following this issue, but if I can attempt to sum up the recent discussion about performance... "Migrating towards a "Matcher" API *may* allow some types of Queries to be faster in situations where clients can use a MatchCollector instead of a HitCollector, but this won't be a silver bullet performance win for all Query classes -- just those where some of the score calculation is (or can be) isolated to the score method (as opposed to skipTo or next)" I think it's important to remember the motivation of this issue wasn't to improve the speed performance of non-scoring searches, it was to decouple the concept of "Filtering" results away from needing to populate a (potentially large) BitSet when the logic necessary for Filtering can easily be expressed in terms of a doc iterator (aka: a Matcher) -- opening up the possibility of memory performance improvements. A second benefit that has arisen as the issue evolved has been the API generalization of the "Matcher" concept to be a super class of Scorer, for simpler APIs moving forward. 
> Decouple Filter from BitSet > --- > > Key: LUCENE-584 > URL: https://issues.apache.org/jira/browse/LUCENE-584 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.0.1 >Reporter: Peter Schäfer >Priority: Minor > Attachments: bench-diff.txt, bench-diff.txt, BitsMatcher.java, > Filter-20060628.patch, HitCollector-20060628.patch, > IndexSearcher-20060628.patch, MatchCollector.java, Matcher.java, > Matcher20070226.patch, Scorer-20060628.patch, Searchable-20060628.patch, > Searcher-20060628.patch, Some Matchers.zip, SortedVIntList.java, > TestSortedVIntList.java > > > {code} > package org.apache.lucene.search; > public abstract class Filter implements java.io.Serializable > { > public abstract AbstractBitSet bits(IndexReader reader) throws IOException; > } > public interface AbstractBitSet > { > public boolean get(int index); > } > {code} > It would be useful if the method =Filter.bits()= returned an abstract > interface, instead of =java.util.BitSet=. > Use case: there is a very large index, and, depending on the user's > privileges, only a small portion of the index is actually visible. > Sparsely populated =java.util.BitSet=s are not efficient and waste lots of > memory. It would be desirable to have an alternative BitSet implementation > with smaller memory footprint. > Though it _is_ possible to derive classes from =java.util.BitSet=, it was > obviously not designed for that purpose. > That's why I propose to use an interface instead. The default implementation > could still delegate to =java.util.BitSet=.
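The "doc iterator (aka: a Matcher)" idea in Hoss's summary can be sketched as below. This is an illustrative shape only, not the patch's actual API: the method names are assumptions modeled on Scorer's next/skipTo/doc contract, and BitsMatcher is loosely inspired by the BitsMatcher.java attachment. A Scorer subclass would add score() on top, which is the "Matcher as super class of Scorer" generalization.

```java
import java.util.BitSet;

// Assumed Matcher shape: an iterator over matching doc ids, with no
// scoring obligation -- the decoupling this issue is about.
abstract class Matcher {
    public abstract boolean next();            // advance to next match
    public abstract boolean skipTo(int target); // advance to first match >= target
    public abstract int doc();                 // current doc id
}

// A filter expressed as a Matcher over a BitSet: callers iterate the
// matches instead of receiving a (potentially large) materialized BitSet.
class BitsMatcher extends Matcher {
    private final BitSet bits;
    private int doc = -1;

    BitsMatcher(BitSet bits) { this.bits = bits; }

    public boolean next() { return skipTo(doc + 1); }

    public boolean skipTo(int target) {
        doc = bits.nextSetBit(target);
        return doc != -1;
    }

    public int doc() { return doc; }
}
```

The same Matcher contract could equally be backed by a sparse structure like SortedVIntList, which is exactly the memory win the issue is after.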
[jira] Updated: (LUCENE-730) Restore top level disjunction performance
[ https://issues.apache.org/jira/browse/LUCENE-730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-730: Lucene Fields: [New, Patch Available] (was: [New]) > Restore top level disjunction performance > - > > Key: LUCENE-730 > URL: https://issues.apache.org/jira/browse/LUCENE-730 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Paul Elschot >Priority: Minor > Attachments: TopLevelDisjunction20061127.patch > > > This patch restores the performance of top level disjunctions. > The introduction of BooleanScorer2 had impacted this as reported > on java-user on 21 Nov 2006 by Stanislav Jordanov.
Re: Failed test: testExpirationTimeDeletionPolicy
"Otis Gospodnetic" <[EMAIL PROTECTED]> wrote: > Just saw this test fail: > > [junit] Testcase: > > testExpirationTimeDeletionPolicy(org.apache.lucene.index.TestDeletionPolicy): > FAILED > [junit] commit point was older than 2.0 seconds but did not get > deleted > [junit] junit.framework.AssertionFailedError: commit point was older > than 2.0 seconds but did not get deleted > [junit] at > > org.apache.lucene.index.TestDeletionPolicy.testExpirationTimeDeletionPolicy(TestDeletionPolicy.java:229) > [junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native > Method) > [junit] at > > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > [junit] at > > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > > Is my G4 Powerbook too slow? ;) It does take 15 minutes to run the > complete test suite. > > Subsequent runs of just this test were all successful, but it did fail > once, as shown above. Hmmm. That test verifies that a time based deletion policy (remove a commit point only if it's older than X seconds) is working properly. I added it (recently) for LUCENE-710. OK, I think I see where this test is wrongly sensitive to the speed of the machine it's running on and would then cause a false positive failure. I will commit a fix. Still, Otis, I think you should upgrade to a MacBook Pro :) Mike
[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Ericson updated LUCENE-855: Attachment: FieldCacheRangeFilter.patch Let's try this again. I am very sorry to everyone for the last patch. I had some trouble with my environment not correctly re-building. I have done an 'ant clean' before testing. Andy, take a look at this patch and tell me what you think. > MemoryCachedRangeFilter to boost performance of Range queries > - > > Key: LUCENE-855 > URL: https://issues.apache.org/jira/browse/LUCENE-855 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.1 >Reporter: Andy Liu > Assigned To: Otis Gospodnetic > Attachments: FieldCacheRangeFilter.patch, > FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, > FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, > MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, > TestRangeFilterPerformanceComparison.java, > TestRangeFilterPerformanceComparison.java > > > Currently RangeFilter uses TermEnum and TermDocs to find documents that fall > within the specified range. This requires iterating through every single > term in the index and can get rather slow for large document sets. > MemoryCachedRangeFilter reads all <docId, value> pairs of a given field, > sorts by value, and stores in a SortedFieldCache. During bits(), binary > searches are used to find the start and end indices of the lower and upper > bound values. The BitSet is populated by all the docId values that fall in > between the start and end indices. > TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed > index with random date values within a 5 year range. Executing bits() 1000 > times on standard RangeQuery using random date intervals took 63904ms. Using > MemoryCachedRangeFilter, it took 876ms. Performance increase is less > dramatic when a field has fewer unique terms or the index has fewer documents. 
> Currently MemoryCachedRangeFilter only works with numeric values (values are > stored in a long[] array) but it can be easily changed to support Strings. A > side "benefit" of storing the values as longs is that there's no > longer the need to make the values lexicographically comparable, i.e. padding > numeric values with zeros. > The downside of using MemoryCachedRangeFilter is there's a fairly significant > memory requirement. So it's designed to be used in situations where range > filter performance is critical and memory consumption is not an issue. The > memory requirements are: (sizeof(int) + sizeof(long)) * numDocs. > MemoryCachedRangeFilter also requires a warmup step which can take a while to > run on large datasets (it took 40s to run on a 3M document corpus). Warmup > can be called explicitly or is automatically called the first time > MemoryCachedRangeFilter is applied using a given field. > So in summary, MemoryCachedRangeFilter can be useful when: > - Performance is critical > - Memory is not an issue > - Field contains many unique numeric values > - Index contains a large amount of documents
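The core idea described above (cache every <docId, value> pair sorted by value, then answer a range with binary search instead of a TermEnum walk) can be sketched as follows. This is a hypothetical, simplified version: the class name, the assumption of one long value per doc, and the lowerBound helper are inventions for illustration, not the patch's SortedFieldCache code. The lowerBound search deliberately lands on the first element of a run of equal values, addressing the duplicate-value concern Hoss raises in his review.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical sketch of the MemoryCachedRangeFilter approach.
class SortedValueRangeSketch {
    private final long[] values; // field values, sorted ascending
    private final int[] docs;    // docs[i] is the doc holding values[i]

    // docValues[doc] = the field value for that doc (one value per doc).
    SortedValueRangeSketch(long[] docValues) {
        int n = docValues.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Long.compare(docValues[a], docValues[b]));
        values = new long[n];
        docs = new int[n];
        for (int i = 0; i < n; i++) {
            docs[i] = order[i];
            values[i] = docValues[order[i]];
        }
    }

    // Inclusive range: binary search for the start, then set bits for
    // every doc whose value falls within [lower, upper].
    BitSet bits(long lower, long upper) {
        BitSet result = new BitSet();
        for (int i = lowerBound(lower); i < values.length && values[i] <= upper; i++) {
            result.set(docs[i]);
        }
        return result;
    }

    // First index whose value >= target. A plain Arrays.binarySearch can
    // land anywhere inside a run of duplicates; this variant cannot.
    private int lowerBound(long target) {
        int lo = 0, hi = values.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (values[mid] < target) lo = mid + 1; else hi = mid;
        }
        return lo;
    }
}
```

Each bits() call costs O(log n) for the bound search plus O(k) for the k matching docs, which is where the reported speedup over the per-term TermEnum iteration comes from.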
[jira] Resolved: (LUCENE-857) Remove BitSet caching from QueryFilter
[ https://issues.apache.org/jira/browse/LUCENE-857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved LUCENE-857. - Resolution: Fixed Thanks for the persistence and patience, Hoss. I see the light now! The patch wouldn't apply to QueryFilter, so I made changes manually. Committed. > Remove BitSet caching from QueryFilter > -- > > Key: LUCENE-857 > URL: https://issues.apache.org/jira/browse/LUCENE-857 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Otis Gospodnetic > Assigned To: Otis Gospodnetic >Priority: Minor > Attachments: LUCENE-857.patch, LUCENE-857.refactoring-approach.diff > > > Since caching is built into the public BitSet bits(IndexReader reader) > method, I don't see a way to deprecate that, which means I'll just cut it out > and document it in CHANGES.txt. Anyone who wants QueryFilter caching will be > able to get the caching back by wrapping the QueryFilter in the > CachingWrapperFilter.
Failed test: testExpirationTimeDeletionPolicy
Just saw this test fail: [junit] Testcase: testExpirationTimeDeletionPolicy(org.apache.lucene.index.TestDeletionPolicy): FAILED [junit] commit point was older than 2.0 seconds but did not get deleted [junit] junit.framework.AssertionFailedError: commit point was older than 2.0 seconds but did not get deleted [junit] at org.apache.lucene.index.TestDeletionPolicy.testExpirationTimeDeletionPolicy(TestDeletionPolicy.java:229) [junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [junit] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) [junit] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) Is my G4 Powerbook too slow? ;) It does take 15 minutes to run the complete test suite. Subsequent runs of just this test were all successful, but it did fail once, as shown above. Otis
[jira] Updated: (LUCENE-860) site should call project "Lucene Java", not just "Lucene"
[ https://issues.apache.org/jira/browse/LUCENE-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated LUCENE-860: Lucene Fields: [Patch Available] (was: [New]) > site should call project "Lucene Java", not just "Lucene" > - > > Key: LUCENE-860 > URL: https://issues.apache.org/jira/browse/LUCENE-860 > Project: Lucene - Java > Issue Type: Improvement > Components: Website >Reporter: Doug Cutting >Priority: Minor > Attachments: LUCENE-860.patch > > > To avoid confusion with the top-level Lucene project, the Lucene Java website > should refer to itself as Lucene Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-860) site should call project "Lucene Java", not just "Lucene"
[ https://issues.apache.org/jira/browse/LUCENE-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Cutting updated LUCENE-860: Attachment: LUCENE-860.patch Here's a patch that replaces "Apache Lucene" with "Apache Lucene Java" in the website. It also fixes the breadcrumbs at the top of the web pages and the links on the logos. Is "Apache Lucene Java" too verbose? Should we instead just use "Lucene Java"? > site should call project "Lucene Java", not just "Lucene" > - > > Key: LUCENE-860 > URL: https://issues.apache.org/jira/browse/LUCENE-860 > Project: Lucene - Java > Issue Type: Improvement > Components: Website >Reporter: Doug Cutting >Priority: Minor > Attachments: LUCENE-860.patch > > > To avoid confusion with the top-level Lucene project, the Lucene Java website > should refer to itself as Lucene Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Created: (LUCENE-860) site should call project "Lucene Java", not just "Lucene"
site should call project "Lucene Java", not just "Lucene" - Key: LUCENE-860 URL: https://issues.apache.org/jira/browse/LUCENE-860 Project: Lucene - Java Issue Type: Improvement Components: Website Reporter: Doug Cutting Priority: Minor To avoid confusion with the top-level Lucene project, the Lucene Java website should refer to itself as Lucene Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487897 ] Andy Liu commented on LUCENE-855: -

Hey Matt, I get this exception when running your newest FCRF with the performance test. Can you check to see if you get this also?

java.lang.ArrayIndexOutOfBoundsException: 10
    at org.apache.lucene.search.FieldCacheRangeFilter$5.get(FieldCacheRangeFilter.java:231)
    at org.apache.lucene.search.IndexSearcher$1.collect(IndexSearcher.java:136)
    at org.apache.lucene.search.Scorer.score(Scorer.java:49)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:146)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:113)
    at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:74)
    at org.apache.lucene.search.Hits.<init>(Hits.java:53)
    at org.apache.lucene.search.Searcher.search(Searcher.java:46)
    at org.apache.lucene.misc.TestRangeFilterPerformanceComparison$Benchmark.go(TestRangeFilterPerformanceComparison.java:312)
    at org.apache.lucene.misc.TestRangeFilterPerformanceComparison.testPerformance(TestRangeFilterPerformanceComparison.java:201)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:585)
    at junit.framework.TestCase.runTest(TestCase.java:154)
    at junit.framework.TestCase.runBare(TestCase.java:127)
    at junit.framework.TestResult$1.protect(TestResult.java:106)
    at junit.framework.TestResult.runProtected(TestResult.java:124)
    at junit.framework.TestResult.run(TestResult.java:109)
    at junit.framework.TestCase.run(TestCase.java:118)
    at junit.framework.TestSuite.runTest(TestSuite.java:208)
    at junit.framework.TestSuite.run(TestSuite.java:203)
    at org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:128)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)

> MemoryCachedRangeFilter to boost performance of Range queries
> -
>
> Key: LUCENE-855
> URL: https://issues.apache.org/jira/browse/LUCENE-855
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.1
> Reporter: Andy Liu
> Assigned To: Otis Gospodnetic
> Attachments: FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, TestRangeFilterPerformanceComparison.java, TestRangeFilterPerformanceComparison.java
>
> Currently RangeFilter uses TermEnum and TermDocs to find documents that fall
> within the specified range. This requires iterating through every single
> term in the index and can get rather slow for large document sets.
> MemoryCachedRangeFilter reads all <docId, value> pairs of a given field,
> sorts by value, and stores them in a SortedFieldCache. During bits(), binary
> searches are used to find the start and end indices of the lower and upper
> bound values. The BitSet is populated by all the docId values that fall in
> between the start and end indices.
> TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed
> index with random date values within a 5 year range. Executing bits() 1000
> times on standard RangeQuery using random date intervals took 63904ms. Using
> MemoryCachedRangeFilter, it took 876ms. The performance increase is less
> dramatic when a field has fewer unique terms or when fewer documents are used.
> Currently MemoryCachedRangeFilter only works with numeric values (values are
> stored in a long[] array) but it can easily be changed to support Strings. A
> side "benefit" of storing the values as longs is that there is no longer a
> need to make the values lexicographically comparable, i.e. padding numeric
> values with zeros.
> The downside of using MemoryCachedRangeFilter is that there's a fairly
> significant memory requirement. So it's designed to be used in situations
> where range filter performance is critical and memory consumption is not an
> issue. The
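The strategy the issue describes (cache the field's pairs sorted by value, then answer each bits() call with two binary searches instead of a full term scan) can be sketched in a few lines of plain Java. This is an illustrative reconstruction, not the patch's actual SortedFieldCache code; loading the field from the index is omitted, the names are invented, and the arrays are assumed pre-sorted:

```java
import java.util.Arrays;
import java.util.BitSet;

// Illustrative reconstruction of the bits() strategy described in the issue:
// keep field values sorted ascending with a parallel docId array, then find
// the range's start and end positions with binary searches and set the docs
// in between.
class SortedRangeCache {
    private final long[] values; // field values, sorted ascending
    private final int[] docIds;  // docIds[i] is the document holding values[i]

    SortedRangeCache(long[] sortedValues, int[] parallelDocIds) {
        this.values = sortedValues;
        this.docIds = parallelDocIds;
    }

    // Index of the first element >= target (lower bound).
    private int lowerBound(long target) {
        int i = Arrays.binarySearch(values, target);
        if (i < 0) return -i - 1;          // not found: binarySearch encodes the insertion point
        while (i > 0 && values[i - 1] == target) i--; // found: step back to the first duplicate
        return i;
    }

    // Inclusive range [lower, upper]; assumes upper < Long.MAX_VALUE.
    BitSet bits(long lower, long upper) {
        BitSet result = new BitSet();
        int from = lowerBound(lower);
        int to = lowerBound(upper + 1);    // exclusive end of the inclusive range
        for (int i = from; i < to; i++) {
            result.set(docIds[i]);
        }
        return result;
    }
}
```

Each bits() call is then O(log n) search plus O(matches) to populate the BitSet, which is the source of the speedup the benchmark numbers show.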
Re: Why ORScorer delayed init?
: I thought it would avoid accessing the index as much as
: possible before actually doing a search, but I did not
: verify whether that is important.
: In case it is not, any simplification is of course welcome.

Conceptually: once Query.createWeight(Searcher) is called, the "search" has already begun, hasn't it? ... If not then, at the very least it has by the time Weight.scorer(IndexReader) is called, I would imagine.

-Hoss
Re: Why ORScorer delayed init?
On Tuesday 10 April 2007 20:24, Yonik Seeley wrote:
> On 4/10/07, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
> > In DisjunctionSumScorer, both skipTo() and next() invoke
> > initScorerDocQueue() on the first iteration. However, since all
> > subscorers are added en masse via the constructor instead of
> > individually via an add() method which does not exist for this class,
> > it would be possible to trigger initScorerDocQueue() at construction
> > time rather than defer it, slightly simplifying the inner loop methods.
>
> Yes, I think I made this change to one or two of the other scorers in the past.
> It makes more sense to me to pass everything needed in the constructor
> and get rid of the firstTime checks in next() and skipTo().

I kept this method of initializing because it was present in some other existing Scorers. I did not really like it at the time either. I thought it would avoid accessing the index as much as possible before actually doing a search, but I did not verify whether that is important. In case it is not, any simplification is of course welcome.

Regards,
Paul Elschot
Re: [jira] Commented: (LUCENE-584) Decouple Filter from BitSet
On Tuesday 10 April 2007 17:41, eks dev wrote: > > If I remember well, the last time we profiled search with "high density" OR queries scoring was taking up to 30% of the time. This was a 8Mio collection of short documents fitting comfortably in RAM. So I am sure disabling scoring in some cases could bring us something. > > I am not all that familiar with scoring inner workings to stand 100% behind this statement, so please take it with some healthy reserve. For "high density OR" I'd guess most of the work was spent maintaining the priority queue by document number. See also LUCENE-730 . > > But anyhow, with Matcher in place, we have at least a chance to prove it brings something for this scenario. For Filtering case it brings definitely a lot. > > on the other note, > Paul, would it be possible/easy to have something like. It looks easy to add it, but I may be missing something: > BooleanQuery.add(Matcher mtr, > BooleanClause.Occur occur) That's one of the things I'd like to see added. It would allow a single ConjunctionScorer to do a filtered search for a query with some required terms. Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487882 ] Paul Elschot commented on LUCENE-584: - By fastest cache I meant the L1 cache of the processor. The size is normally in tens of kilobytes. An array lookup hitting that cache takes about as much time as a floating point addition. During a query search the use of a.o. the term frequencies, the proximity data, and the document weights normally cause an L1 cache miss. I would expect that by not doing the score value computations, only the cache misses for document weights can be saved. > Decouple Filter from BitSet > --- > > Key: LUCENE-584 > URL: https://issues.apache.org/jira/browse/LUCENE-584 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.0.1 >Reporter: Peter Schäfer >Priority: Minor > Attachments: bench-diff.txt, bench-diff.txt, BitsMatcher.java, > Filter-20060628.patch, HitCollector-20060628.patch, > IndexSearcher-20060628.patch, MatchCollector.java, Matcher.java, > Matcher20070226.patch, Scorer-20060628.patch, Searchable-20060628.patch, > Searcher-20060628.patch, Some Matchers.zip, SortedVIntList.java, > TestSortedVIntList.java > > > {code} > package org.apache.lucene.search; > public abstract class Filter implements java.io.Serializable > { > public abstract AbstractBitSet bits(IndexReader reader) throws IOException; > } > public interface AbstractBitSet > { > public boolean get(int index); > } > {code} > It would be useful if the method =Filter.bits()= returned an abstract > interface, instead of =java.util.BitSet=. > Use case: there is a very large index, and, depending on the user's > privileges, only a small portion of the index is actually visible. > Sparsely populated =java.util.BitSet=s are not efficient and waste lots of > memory. It would be desirable to have an alternative BitSet implementation > with smaller memory footprint. 
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation
> could still delegate to =java.util.BitSet=.
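One way to see what the proposed interface buys for the sparse use case: an implementation can answer get() from a sorted array of set-bit indices, so memory scales with the number of matching documents rather than the index size. A hypothetical sketch follows; only the AbstractBitSet shape comes from the issue, while SparseBitSet is an invented illustration, not an actual Lucene class:

```java
import java.util.Arrays;

// Sketch of one implementation the proposed interface would permit: answer
// get() by binary-searching a sorted array of set-bit indices. For an index
// where only a small fraction of documents is visible, this costs
// 4 bytes per visible doc instead of 1 bit per doc in the whole index.
interface AbstractBitSet {
    boolean get(int index);
}

class SparseBitSet implements AbstractBitSet {
    private final int[] setBits; // sorted ascending, no duplicates

    SparseBitSet(int[] sortedSetBits) {
        this.setBits = sortedSetBits;
    }

    public boolean get(int index) {
        return Arrays.binarySearch(setBits, index) >= 0;
    }
}
```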
Re: Why ORScorer delayed init?
On 4/10/07, Marvin Humphrey <[EMAIL PROTECTED]> wrote: In DisjunctionSumScorer, both skipTo() and next() invoke initScorerDocQueue() on the first iteration. However, since all subscorers are added en masse via the constructor instead of individually via an add() method which does not exist for this class, it would be possible to trigger initScorerDocQueue() at construction time rather than defer it, slightly simplifying the inner loop methods. Yes, I think I made this change to one or two of the other scorers in the past. It makes more sense to me to pass everything needed in the constructor and get rid of the firstTime checks in next() and skipTo() -Yonik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-794) SpanScorer and SimpleSpanFragmenter for Contrib Highlighter
[ https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487860 ] Mark Miller commented on LUCENE-794: Sorry Sean, I forgot to mention that the patch is off of the latest Lucene trunk code. The range query test should fail because they switched the query parser to return a constant score query instead of a range query. Cannot highlight a constant score query. - Mark > SpanScorer and SimpleSpanFragmenter for Contrib Highlighter > --- > > Key: LUCENE-794 > URL: https://issues.apache.org/jira/browse/LUCENE-794 > Project: Lucene - Java > Issue Type: Improvement > Components: Other >Reporter: Mark Miller >Priority: Minor > Attachments: CachedTokenStream.java, CachedTokenStream.java, > CachedTokenStream.java, DefaultEncoder.java, Encoder.java, Formatter.java, > Highlighter.java, Highlighter.java, Highlighter.java, Highlighter.java, > Highlighter.java, HighlighterTest.java, HighlighterTest.java, > HighlighterTest.java, HighlighterTest.java, MemoryIndex.java, > QuerySpansExtractor.java, QuerySpansExtractor.java, QuerySpansExtractor.java, > QuerySpansExtractor.java, SimpleFormatter.java, spanhighlighter.patch, > spanhighlighter2.patch, spanhighlighter3.patch, spanhighlighter5.patch, > spanhighlighter_patch_4.zip, SpanHighlighterTest.java, > SpanHighlighterTest.java, SpanScorer.java, SpanScorer.java, > WeightedSpanTerm.java > > > This patch adds a new Scorer class (SpanQueryScorer) to the Highlighter > package that scores just like QueryScorer, but scores a 0 for Terms that did > not cause the Query hit. This gives 'actual' hit highlighting for the range > of SpanQuerys and PhraseQuery. There is also a new Fragmenter that attempts > to fragment without breaking up Spans. > See http://issues.apache.org/jira/browse/LUCENE-403 for some background. > There is a dependency on MemoryIndex. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Ericson updated LUCENE-855:

Attachment: FieldCacheRangeFilter.patch

Fixed a bug with the BitSet's nextSetBit(i) and nextClearBit(i). I wrote a test to verify that they return the same values as a normal BitSet; I don't use these functions myself, so if someone wants to verify my fix that would be great. Added the ASF license header to the top of each file, and fixed all of the bugs Otis found.

> MemoryCachedRangeFilter to boost performance of Range queries
> -
>
> Key: LUCENE-855
> URL: https://issues.apache.org/jira/browse/LUCENE-855
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.1
> Reporter: Andy Liu
> Assigned To: Otis Gospodnetic
> Attachments: FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, TestRangeFilterPerformanceComparison.java, TestRangeFilterPerformanceComparison.java
>
> Currently RangeFilter uses TermEnum and TermDocs to find documents that fall
> within the specified range. This requires iterating through every single
> term in the index and can get rather slow for large document sets.
> MemoryCachedRangeFilter reads all <docId, value> pairs of a given field,
> sorts by value, and stores them in a SortedFieldCache. During bits(), binary
> searches are used to find the start and end indices of the lower and upper
> bound values. The BitSet is populated by all the docId values that fall in
> between the start and end indices.
> TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed
> index with random date values within a 5 year range. Executing bits() 1000
> times on standard RangeQuery using random date intervals took 63904ms. Using
> MemoryCachedRangeFilter, it took 876ms. The performance increase is less
> dramatic when a field has fewer unique terms or when fewer documents are used.
> Currently MemoryCachedRangeFilter only works with numeric values (values are
> stored in a long[] array) but it can easily be changed to support Strings. A
> side "benefit" of storing the values as longs is that there is no longer a
> need to make the values lexicographically comparable, i.e. padding numeric
> values with zeros.
> The downside of using MemoryCachedRangeFilter is that there's a fairly
> significant memory requirement. So it's designed to be used in situations
> where range filter performance is critical and memory consumption is not an
> issue. The memory requirements are: (sizeof(int) + sizeof(long)) * numDocs.
> MemoryCachedRangeFilter also requires a warmup step which can take a while to
> run on large datasets (it took 40s to run on a 3M document corpus). Warmup
> can be called explicitly or is automatically called the first time
> MemoryCachedRangeFilter is applied using a given field.
> So in summary, MemoryCachedRangeFilter can be useful when:
> - Performance is critical
> - Memory is not an issue
> - Field contains many unique numeric values
> - Index contains a large number of documents
[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Ericson updated LUCENE-855:

Attachment: TestRangeFilterPerformanceComparison.java

Andy, thank you for that test. I took it and moved it to contrib/miscellaneous, and added a few more tests, including the ChainedFilter test. Here is my version. I also fixed a few bugs in my code, which I will be attaching next. I reformatted my results as well; I think they are a little easier to read. Here is what I get, and you're right: if you use a MatchAllDocsQuery, our two versions of the code are about the same.

[junit] - Standard Output ---
[junit] Start interval: Thu Apr 11 10:55:02 PDT 2002
[junit] End interval: Tue Apr 10 10:55:02 PDT 2007
[junit] Creating RAMDirectory index...
[junit] Reader opened with 10 documents. Creating RangeFilters...

Query               Filter                             Total     Bits   Search
TermQuery           FieldCacheRangeFilter               13ms      0ms      9ms
                    MemoryCachedRangeFilter            209ms     90ms    115ms
                    RangeFilter                      12068ms   6009ms   6051ms
                    Chained FieldCacheRangeFilter       15ms      1ms     10ms
                    Chained MemoryCachedRangeFilter    177ms     83ms     90ms
ConstantScoreQuery  FieldCacheRangeFilter              480ms      1ms    474ms
                    MemoryCachedRangeFilter            757ms     90ms    663ms
                    RangeFilter                      18749ms   6083ms  12655ms
                    Chained FieldCacheRangeFilter       11ms      0ms      8ms
                    Chained MemoryCachedRangeFilter    776ms     87ms    682ms
MatchAllDocsQuery   FieldCacheRangeFilter             1344ms      5ms   1334ms
                    MemoryCachedRangeFilter           1468ms     81ms   1381ms
                    RangeFilter                      13360ms   6091ms   7254ms
                    Chained FieldCacheRangeFilter      924ms      4ms    916ms
                    Chained MemoryCachedRangeFilter   1507ms     84ms   1415ms

> MemoryCachedRangeFilter to boost performance of Range queries
> -
>
> Key: LUCENE-855
> URL: https://issues.apache.org/jira/browse/LUCENE-855
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.1
> Reporter: Andy Liu
> Assigned To: Otis Gospodnetic
> Attachments: FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, TestRangeFilterPerformanceComparison.java, TestRangeFilterPerformanceComparison.java
>
> Currently RangeFilter uses TermEnum and TermDocs to find documents that fall
> within the specified range. This requires iterating through every single
> term in the index and can get rather slow for large document sets.
> MemoryCachedRangeFilter reads all <docId, value> pairs of a given field,
> sorts by value, and stores them in a SortedFieldCache. During bits(), binary
> searches are used to find the start and end indices of the lower and upper
> bound values. The BitSet is populated by all the docId values that fall in
> between the start and end indices.
> TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed
> index with random date values within a 5 year range. Executing bits() 1000
> times on standard RangeQuery using random date intervals took 63904ms. Using
> MemoryCachedRangeFilter, it took 876ms. The performance increase is less
> dramatic when a field has fewer unique terms or when fewer documents are used.
> Currently MemoryCachedRangeFilter only works with numeric values (values are
> stored in a long[] array) but it can easily be changed to support Strings. A
> side "benefit" of storing the values as longs is that there is no longer a
> need to make the values lexicographically comparable, i.e. padding numeric
> values with zeros.
> The downside of using MemoryCache
[jira] Commented: (LUCENE-794) SpanScorer and SimpleSpanFragmenter for Contrib Highlighter
[ https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487847 ] Sean O'Connor commented on LUCENE-794: -- I was able to apply the spanhighlighter5.patch. I'm inexperienced with ant and svn, so I assume the slight troubles I had were self-inflicted; I mention them in case they are of any help. I might have missed something, but my MemoryIndex.java seemed to be missing the implementation of the abstract isPayloadAvailable() method from TermPositions. That was causing my build to fail, so I added the method, simply returning false. After that change, the tests run, and life was good again. I do get a failed test at org.apache.lucene.search.highlight.HighlighterTest.testGetRangeFragments(HighlighterTest.java:137), but it looks like that might be expected. The search is "[kannedy TO kznnedy]". I am now looking into getting the total number of hits for a given query (for un-normalized scoring), and the hit positions (saved for larger scale analysis and browsing). I have code that does this, but hope I can improve on my existing approach by using this highlighting patch. 
Thanks, Sean > SpanScorer and SimpleSpanFragmenter for Contrib Highlighter > --- > > Key: LUCENE-794 > URL: https://issues.apache.org/jira/browse/LUCENE-794 > Project: Lucene - Java > Issue Type: Improvement > Components: Other >Reporter: Mark Miller >Priority: Minor > Attachments: CachedTokenStream.java, CachedTokenStream.java, > CachedTokenStream.java, DefaultEncoder.java, Encoder.java, Formatter.java, > Highlighter.java, Highlighter.java, Highlighter.java, Highlighter.java, > Highlighter.java, HighlighterTest.java, HighlighterTest.java, > HighlighterTest.java, HighlighterTest.java, MemoryIndex.java, > QuerySpansExtractor.java, QuerySpansExtractor.java, QuerySpansExtractor.java, > QuerySpansExtractor.java, SimpleFormatter.java, spanhighlighter.patch, > spanhighlighter2.patch, spanhighlighter3.patch, spanhighlighter5.patch, > spanhighlighter_patch_4.zip, SpanHighlighterTest.java, > SpanHighlighterTest.java, SpanScorer.java, SpanScorer.java, > WeightedSpanTerm.java > > > This patch adds a new Scorer class (SpanQueryScorer) to the Highlighter > package that scores just like QueryScorer, but scores a 0 for Terms that did > not cause the Query hit. This gives 'actual' hit highlighting for the range > of SpanQuerys and PhraseQuery. There is also a new Fragmenter that attempts > to fragment without breaking up Spans. > See http://issues.apache.org/jira/browse/LUCENE-403 for some background. > There is a dependency on MemoryIndex. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Why ORScorer delayed init?
Greets, In DisjunctionSumScorer, both skipTo() and next() invoke initScorerDocQueue() on the first iteration. However, since all subscorers are added en masse via the constructor instead of individually via an add() method which does not exist for this class, it would be possible to trigger initScorerDocQueue() at construction time rather than defer it, slightly simplifying the inner loop methods. Does the delay offer some advantage that I'm missing? It looks like an artifact. Marvin Humphrey Rectangular Research http://www.rectangular.com/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
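Marvin's point can be made concrete with a toy disjunction iterator. The firstTime flag below marks the lazy-init pattern in question; the suggested change is simply to call initQueue() from the constructor and delete the flag. All types here are simplified stand-ins for illustration, not the real DisjunctionSumScorer (no scoring, and a doc appearing in two sub-scorers is emitted twice rather than merged):

```java
import java.util.PriorityQueue;

// Toy model of the pattern under discussion: a disjunction iterator whose
// doc-queue build is deferred behind a "firstTime" flag checked on every
// call, even though all sub-scorers are already known at construction time.
class LazyInitScorer {
    private final int[][] subDocs;       // each sub-scorer's doc ids, ascending
    private PriorityQueue<int[]> queue;  // entries: {docId, subScorer, cursor}
    private boolean firstTime = true;

    LazyInitScorer(int[][] subDocs) {
        this.subDocs = subDocs;
        // Calling initQueue() here would remove the firstTime check below.
    }

    private void initQueue() {
        queue = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int i = 0; i < subDocs.length; i++) {
            if (subDocs[i].length > 0) queue.add(new int[] {subDocs[i][0], i, 0});
        }
    }

    // Returns the next doc id in ascending order, or -1 when exhausted.
    int next() {
        if (firstTime) {       // lazy init guarded on every call ...
            initQueue();       // ... which could instead happen once in the
            firstTime = false; // constructor, simplifying the inner loop
        }
        int[] top = queue.poll();
        if (top == null) return -1;
        int sub = top[1], cursor = top[2] + 1;
        if (cursor < subDocs[sub].length) {
            queue.add(new int[] {subDocs[sub][cursor], sub, cursor});
        }
        return top[0];
    }
}
```

The trade-off the thread settles on: constructor-time init touches the index slightly earlier, but removes a branch from every next()/skipTo() call.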
Hudson build is back to normal: Lucene-Nightly #53
See http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/53/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [jira] Commented: (LUCENE-584) Decouple Filter from BitSet
If I remember well, the last time we profiled search with "high density" OR queries scoring was taking up to 30% of the time. This was a 8Mio collection of short documents fitting comfortably in RAM. So I am sure disabling scoring in some cases could bring us something. I am not all that familiar with scoring inner workings to stand 100% behind this statement, so please take it with some healthy reserve. But anyhow, with Matcher in place, we have at least a chance to prove it brings something for this scenario. For Filtering case it brings definitely a lot. on the other note, Paul, would it be possible/easy to have something like. It looks easy to add it, but I may be missing something: BooleanQuery.add(Matcher mtr, BooleanClause.Occur occur) - Original Message From: Otis Gospodnetic (JIRA) <[EMAIL PROTECTED]> To: java-dev@lucene.apache.org Sent: Tuesday, 10 April, 2007 5:11:32 PM Subject: [jira] Commented: (LUCENE-584) Decouple Filter from BitSet [ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487789 ] Otis Gospodnetic commented on LUCENE-584: - Ah, too bad. :( Last time I benchmarked Lucene searching on Sun's Niagara vs. non-massive Intel boxes, Intel boxes with Linux on them actually won, and my impression was that this was due to Niagara's weak FPU (a known weakness in Niagara, I believe). Thus, I thought, if we could just skip scoring and various floating point calculations, we'd see better performance, esp. on Niagara boxes. Paul, when you say "fastest cache", what exactly are you referring to? The Niagara I tested things on had 32GB of RAM, and I gave the JVM 20+GB, so at least the JVM had plenty of RAM to work with. 
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487789 ]

Otis Gospodnetic commented on LUCENE-584:
-----------------------------------------

Ah, too bad. :( Last time I benchmarked Lucene searching on Sun's Niagara against non-massive Intel boxes, the Intel boxes running Linux actually won, and my impression was that this was due to Niagara's weak FPU (a known weakness in Niagara, I believe). Thus, I thought, if we could just skip scoring and the various floating-point calculations, we'd see better performance, especially on Niagara boxes.

Paul, when you say "fastest cache", what exactly are you referring to? The Niagara I tested things on had 32GB of RAM, and I gave the JVM 20+GB, so at least the JVM had plenty of RAM to work with.

> Decouple Filter from BitSet
> ---------------------------
>
>                 Key: LUCENE-584
>                 URL: https://issues.apache.org/jira/browse/LUCENE-584
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.0.1
>            Reporter: Peter Schäfer
>            Priority: Minor
>         Attachments: bench-diff.txt, bench-diff.txt, BitsMatcher.java,
>                      Filter-20060628.patch, HitCollector-20060628.patch,
>                      IndexSearcher-20060628.patch, MatchCollector.java, Matcher.java,
>                      Matcher20070226.patch, Scorer-20060628.patch, Searchable-20060628.patch,
>                      Searcher-20060628.patch, Some Matchers.zip, SortedVIntList.java,
>                      TestSortedVIntList.java
>
> {code}
> package org.apache.lucene.search;
>
> public abstract class Filter implements java.io.Serializable
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
>
> public interface AbstractBitSet
> {
>   public boolean get(int index);
> }
> {code}
>
> It would be useful if the method =Filter.bits()= returned an abstract
> interface instead of =java.util.BitSet=.
> Use case: there is a very large index and, depending on the user's
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of
> memory. It would be desirable to have an alternative BitSet implementation
> with a smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation
> could still delegate to =java.util.BitSet=.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
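The memory argument above can be illustrated with a minimal sketch. The interface mirrors the {code} block in the issue; SparseBitSet and its backing array are hypothetical names for illustration (the attached SortedVIntList.java presumably serves a similar purpose):

```java
import java.util.Arrays;

// The read-only abstraction the issue proposes Filter.bits() should return.
interface AbstractBitSet {
    boolean get(int index);
}

// Hypothetical sparse implementation: backed by a sorted array of the
// matching document ids instead of one bit per document in the index.
class SparseBitSet implements AbstractBitSet {
    private final int[] sortedDocIds; // ascending, no duplicates

    SparseBitSet(int[] sortedDocIds) {
        this.sortedDocIds = sortedDocIds;
    }

    @Override
    public boolean get(int index) {
        // Binary search instead of a word-per-64-docs array:
        // memory is O(matching docs), not O(maxDoc).
        return Arrays.binarySearch(sortedDocIds, index) >= 0;
    }
}
```

Lookup is O(log n), and memory is proportional to the number of visible documents rather than to the size of the index, which is exactly the privilege-filter use case described above.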
Maven artifacts for Lucene.*
I have been hoping to put up a mechanism for (easier) deployment of m2 artifacts to maven repositories (both the Apache snapshot repository and the main maven repository at ibiblio). The most convenient way would be to use maven2 to build the various lucene projects, but as the mailing-list conversation on this subject indicates, there is no common interest in changing the (working) ant-based build system to a maven-based one. The next best thing IMO would be to use the ant build as normal for the non-maven2 releases, and use maven2 for building the maven releases (.jar files, optionally also packages for the sources used to build the binary and packages for javadocs) with the related checksums and signatures.

To repeat it one more time: what I am proposing here is not meant to replace the current solid way of building the various Lucene projects - I am just trying to provide a convenient way to produce the release artifacts to be deployed to maven repositories. I have put together an initial set of poms (for lucene-java) to do this quite easily; basically all that is required is an installation of the maven2 binaries, the set of pom files, and a checkout of the lucene version to build. The various jars are built, packaged, checksummed, signed and optionally deployed with a single mvn command. So IMO it is quite an easy thing to do in addition to the normal release process. I can also, for an undefined time, volunteer to do these builds if it is too much of a burden for RMs.

There are however a couple of things I need your opinion about (or at least attention to):

1. There are differences when comparing to the ant-built jars: (due to Apache release policy) the built jars will contain LICENSE.txt and NOTICE.txt in /META-INF. Is this a problem?

2. I propose that we add an additional folder level, so the groupId for lucene java would be org.apache.lucene.java (it is now org.apache.lucene in the currently released artifacts).
The initial list of artifacts (the new proposed structure) is listed below:

groupId: org.apache.lucene
  lucene-parent (pom) (a top-level pom defining lucene-wide stuff that gets inherited by sub-project modules)

groupId: org.apache.lucene.java
  java-parent (pom)
  lucene-core (jar)
  lucene-demos (jar)
  contrib-parent (pom)
  lucene-analyzers (jar)
  lucene-benchmark (jar)
  lucene-highlighter (jar)
  lucene-misc (jar)
  lucene-queries (jar)
  lucene-regex (jar)
  lucene-snowball (jar)
  lucene-spellchecker (jar)
  lucene-surround (jar)
  lucene-swing (jar)
  lucene-wordnet (jar)
  lucene-xml-query-parser (jar)

groupId: org.apache.lucene.nutch (TODO)
  nutch-parent (pom)
  nutch-core (jar)
  nutch-plugins (pom)
  nutch-plugin-x (jar) (as soon as nutch plugins can be of format .jar)
  ...

groupId: org.apache.lucene.hadoop (TODO)
  hadoop-parent (pom)
  hadoop-core (jar)
  hadoop-streaming (jar)
  ...

groupId: org.apache.lucene.solr (TODO)
  solr-parent (pom)
  solr-core (jar)
  ...

3. Where to put the poms? They need to be put somewhere. I think it's not smart at this point to pollute the ant-driven folder structure with poms - they are better off in a separate dir structure. What is (in your opinion) the most convenient place for them? I would propose that every sub-project have a dir named maven (or something similar) containing the poms for that particular sub-project. The other possibility would be a lucene-level dir for maven stuff, with the poms maintained there.

The text above was my initial thought about this; however, there have been concerns that the procedure described here might not be the most optimal one. So far the arguments have been the following:

1. Two build systems to maintain. True. However I don't quite see it so black and white: you would in any case need to maintain the poms manually (if you care about the quality of the poms), or you would have to build some mechanism to generate them. Of course, in a situation where you would not actually build with maven, the poms could be a bit simpler.

2.
Two build systems producing different jars - would maven2 releases require a separate vote? Yes, the artifacts (jars) would be different, because you would need to add LICENSE and NOTICE files to them (because of Apache policy). I don't know about the vote; how do other projects deal with this kind of situation - anyone here able to tell? One solution to the jar mismatch would be changing the ant build to put those files into the produced jars.

3. Additional burden for the RM: the need to run an additional command and to install maven. There will be that external step for doing the maven release, and you need to install maven as well. But compare that to the current situation, where you would have to extract the jars, put some more files into them, sign them, modify the poms to reflect the correct version numbers, and upload them to the repositories manually.

The other way to do this would be changing the current build system to be more maven-friendly. This would probably mean the following:
- add poms for the artifacts into the svn repository (where?)
- add LICENSE and NOTICE into the jars
- add ant targets to
  - sign the jars
  - push artifacts into staging
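As a sketch only (coordinates and versions here are assumptions derived from the proposed structure above, not an actual released pom), the lucene-core pom under the new org.apache.lucene.java groupId might look like:

```xml
<!-- Hypothetical pom.xml for lucene-core under the proposed layout;
     the parent coordinates and the version are placeholders. -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <parent>
    <groupId>org.apache.lucene.java</groupId>
    <artifactId>java-parent</artifactId>
    <version>2.1.0</version>
  </parent>
  <artifactId>lucene-core</artifactId>
  <packaging>jar</packaging>
  <name>Lucene Core</name>
</project>
```

The groupId and version are inherited from java-parent, which in turn would inherit the lucene-wide settings from lucene-parent, so each sub-project pom stays this small.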
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487706 ]

Paul Elschot commented on LUCENE-584:
-------------------------------------

That could be improved in a DisjunctionMatcher. With a bit of bookkeeping, DisjunctionSumScorer could also delay calling score() on the subscorers, but the bookkeeping would affect performance in the normal case. For the usual queries the score() call will never have much of a performance impact. The reason for this is that TermScorer.score() is really very efficient; iirc it caches weighted tf() values for low term frequencies. All the rest is mostly additions, and occasionally a multiplication for a coordination factor. To determine which documents match the query, the index needs to be accessed, and that takes more time than score value computations because the complete index almost never fits in the fastest cache.
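Paul's distinction between matching and scoring can be sketched as follows: a minimal, hypothetical Matcher that only reports matching doc ids, so no floating-point score work happens at all. (The names echo the Matcher.java and MatchCollector.java attachments on this issue, but the bodies here are illustrative assumptions, not the attached code.)

```java
// Hypothetical sketch: a Matcher walks matching documents without
// computing scores, so the FPU is never involved.
interface Matcher {
    boolean next(); // advance to the next matching doc; false when exhausted
    int doc();      // current matching document id
}

// A trivial Matcher over a precomputed, ascending array of doc ids.
class ArrayMatcher implements Matcher {
    private final int[] docs;
    private int pos = -1;

    ArrayMatcher(int[] docs) { this.docs = docs; }

    public boolean next() { return ++pos < docs.length; }
    public int doc() { return docs[pos]; }
}

class Matchers {
    // Counting hits needs only integer work - the "skip scoring" path
    // Otis hoped would help on FPU-weak hardware like Niagara.
    static int countMatches(Matcher m) {
        int n = 0;
        while (m.next()) n++;
        return n;
    }
}
```

A Scorer would add a score() method on top of this; the point of the decoupling is that callers that only need the matching documents (filters, counts) never pay for the score computation.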