[jira] Updated: (LUCENE-857) Remove BitSet caching from QueryFilter
[ https://issues.apache.org/jira/browse/LUCENE-857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated LUCENE-857:
----------------------------

    Attachment: LUCENE-857.refactoring-approach.diff

An example of what I'm thinking would make sense from a backwards-compatibility standpoint ... javadocs could still use some improvement.

> Remove BitSet caching from QueryFilter
> --------------------------------------
>
>                 Key: LUCENE-857
>                 URL: https://issues.apache.org/jira/browse/LUCENE-857
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Otis Gospodnetic
>         Assigned To: Otis Gospodnetic
>            Priority: Minor
>         Attachments: LUCENE-857.patch, LUCENE-857.refactoring-approach.diff
>
>
> Since caching is built into the public BitSet bits(IndexReader reader)
> method, I don't see a way to deprecate that, which means I'll just cut it out
> and document it in CHANGES.txt. Anyone who wants QueryFilter caching will be
> able to get the caching back by wrapping the QueryFilter in the
> CachingWrapperFilter.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-857) Remove BitSet caching from QueryFilter
[ https://issues.apache.org/jira/browse/LUCENE-857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487679 ]

Hoss Man commented on LUCENE-857:
---------------------------------

I don't think it's a question of being careless about reading the changelog -- I just think that when dealing with a point release, we shouldn't require people to make code changes just to get the same behavior as before. If this were necessary to fix a bug it would be one thing, but what we're really talking about here is refactoring out a piece of functionality (using a Query as a Filter) so that it can be used independently of another piece of functionality (filter caching). Since that can be done in a backwards-compatible way, why not make it easy for people?

> With your suggestion one can't get a raw QueryFilter without getting it
> automatically cached. Isn't this inflexibility uncool?

Not quite -- I'm suggesting that the "raw" QueryFilter behavior be extracted into a new class (QueryWrapperFilter) and that the existing QueryFilter class continue to do exactly what it currently does, but refactored so that there is no duplicate code.
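For readers following along, the shape of this refactoring can be sketched in a few lines. This is a simplified, self-contained illustration, not Lucene's actual code: IndexReaderStub and the trivial bits() body are invented stand-ins, and the real QueryWrapperFilter/CachingWrapperFilter classes differ in detail.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.WeakHashMap;

class IndexReaderStub {}  // stand-in for IndexReader

abstract class Filter {
    public abstract BitSet bits(IndexReaderStub reader);
}

// The "raw" behavior, extracted: turn a query into a filter, no caching.
class QueryWrapperFilter extends Filter {
    public BitSet bits(IndexReaderStub reader) {
        BitSet result = new BitSet();
        result.set(0);  // placeholder for "run the query, set bits of matching docs"
        return result;
    }
}

// Caching as an independent, composable wrapper around any Filter.
class CachingWrapperFilter extends Filter {
    private final Filter filter;
    private final Map<IndexReaderStub, BitSet> cache = new WeakHashMap<>();

    CachingWrapperFilter(Filter filter) { this.filter = filter; }

    public synchronized BitSet bits(IndexReaderStub reader) {
        return cache.computeIfAbsent(reader, filter::bits);
    }
}

// Backwards-compatible QueryFilter: same observable behavior as the old
// caching QueryFilter, but with no duplicated code.
class QueryFilter extends CachingWrapperFilter {
    QueryFilter() { super(new QueryWrapperFilter()); }
}
```

With this split, callers who want an uncached query-as-filter use QueryWrapperFilter directly, while existing QueryFilter users see no behavior change.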
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487675 ]

Marvin Humphrey commented on LUCENE-584:
----------------------------------------

DisjunctionSumScorer (the ORScorer) actually calls Scorer.score() on all of the matching scorers in the ScorerDocQueue during next(), in order to accumulate an aggregate score. The MatchCollector can't save you from that.

> Decouple Filter from BitSet
> ---------------------------
>
>                 Key: LUCENE-584
>                 URL: https://issues.apache.org/jira/browse/LUCENE-584
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.0.1
>            Reporter: Peter Schäfer
>            Priority: Minor
>         Attachments: bench-diff.txt, bench-diff.txt, BitsMatcher.java,
> Filter-20060628.patch, HitCollector-20060628.patch,
> IndexSearcher-20060628.patch, MatchCollector.java, Matcher.java,
> Matcher20070226.patch, Scorer-20060628.patch, Searchable-20060628.patch,
> Searcher-20060628.patch, Some Matchers.zip, SortedVIntList.java,
> TestSortedVIntList.java
>
>
> {code}
> package org.apache.lucene.search;
>
> public abstract class Filter implements java.io.Serializable
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
>
> public interface AbstractBitSet
> {
>   public boolean get(int index);
> }
> {code}
>
> It would be useful if the method =Filter.bits()= returned an abstract
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of
> memory. It would be desirable to have an alternative BitSet implementation
> with a smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation
> could still delegate to =java.util.BitSet=.
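The interface Peter proposes is easy to prototype outside Lucene. Below is a minimal, self-contained sketch (the SparseBitSet and JavaBitSet class names are made up for illustration): the sparse implementation keeps only a sorted array of matching doc ids, so memory scales with the number of set bits rather than with the size of the index.

```java
import java.util.Arrays;
import java.util.BitSet;

// The abstraction proposed in the issue: a read-only bit-set view.
interface AbstractBitSet {
    boolean get(int index);
}

// Default implementation delegating to java.util.BitSet.
class JavaBitSet implements AbstractBitSet {
    private final BitSet bits;
    JavaBitSet(BitSet bits) { this.bits = bits; }
    public boolean get(int index) { return bits.get(index); }
}

// Sparse implementation: a sorted array of matching doc ids.
// Memory is proportional to the number of set bits, not to maxDoc.
class SparseBitSet implements AbstractBitSet {
    private final int[] docs;  // must be sorted ascending
    SparseBitSet(int[] sortedDocs) { this.docs = sortedDocs; }
    public boolean get(int index) {
        return Arrays.binarySearch(docs, index) >= 0;
    }
}
```

For the access-control use case in the issue (a handful of visible docs in a huge index), the sparse variant stores a few ints instead of one bit per document in the index.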
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487674 ]

Otis Gospodnetic commented on LUCENE-584:
-----------------------------------------

Ah. I'll look at the patch again tomorrow and follow what you said. All this time I was under the impression that one of the points, or at least side-effects, of the Matcher was that scoring was skipped, which would be perfect where matches are ordered by anything other than relevance.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487667 ]

Doron Cohen commented on LUCENE-584:
------------------------------------

> No Scorer, no BooleanScorer(2), no ConjunctionScorer...

Thanks, I was reading "score" instead of "score()"... But there is a scorer in the process; it is used for next()-ing to matched docs. So most of the work -- preparing to be able to compute the scores -- was already done: the scorer doc queue is created and populated. Not calling score() only saves the (final) loop over the scorers that aggregates their scores, multiplies by the coord factor, etc. I assume this is why only a small speed-up is seen.
[jira] Updated: (LUCENE-794) SpanScorer and SimpleSpanFragmenter for Contrib Highlighter
[ https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-794:
-------------------------------

    Attachment: spanhighlighter5.patch

Apologies for the delay on this -- I was pulled into a busy product launch. This adds the final piece, replacing TermModifer with multiple Memory Indexes. I also did a little refactoring, especially in the SpansExtractor. All tests now pass and I have been using this successfully for some time now.

For anyone new following this issue, ignore all of the files except for this one: spanhighlighter5.patch

- Mark

> SpanScorer and SimpleSpanFragmenter for Contrib Highlighter
> -----------------------------------------------------------
>
>                 Key: LUCENE-794
>                 URL: https://issues.apache.org/jira/browse/LUCENE-794
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Other
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: CachedTokenStream.java, CachedTokenStream.java,
> CachedTokenStream.java, DefaultEncoder.java, Encoder.java, Formatter.java,
> Highlighter.java, Highlighter.java, Highlighter.java, Highlighter.java,
> Highlighter.java, HighlighterTest.java, HighlighterTest.java,
> HighlighterTest.java, HighlighterTest.java, MemoryIndex.java,
> QuerySpansExtractor.java, QuerySpansExtractor.java, QuerySpansExtractor.java,
> QuerySpansExtractor.java, SimpleFormatter.java, spanhighlighter.patch,
> spanhighlighter2.patch, spanhighlighter3.patch, spanhighlighter5.patch,
> spanhighlighter_patch_4.zip, SpanHighlighterTest.java,
> SpanHighlighterTest.java, SpanScorer.java, SpanScorer.java,
> WeightedSpanTerm.java
>
>
> This patch adds a new Scorer class (SpanQueryScorer) to the Highlighter
> package that scores just like QueryScorer, but scores a 0 for Terms that did
> not cause the Query hit. This gives 'actual' hit highlighting for the range
> of SpanQuerys and PhraseQuery. There is also a new Fragmenter that attempts
> to fragment without breaking up Spans.
> See http://issues.apache.org/jira/browse/LUCENE-403 for some background.
> There is a dependency on MemoryIndex.
Re: Large scale sorting
> A memory saving optimization would be to not load the corresponding
> String[] in the string index (as discussed previously), but there is
> currently no way to tell the FieldCache that the strings are unneeded.
> The String values are only needed for merging results in a MultiSearcher.

Yep, which happens all the time for us specifically, because we have an 'archive' and a 'week' index. The week index is merged once per week, so the search is always a merged sort across the two. (The week index is reloaded every 5 seconds or so; the archive index is kept in memory once loaded.)
Re: Large scale sorting
On 4/9/07, jian chen <[EMAIL PROTECTED]> wrote:
> But, on a higher level, my idea is really just to create an array of
> integers for each sort field. The array length is NumOfDocs in the index.
> Each integer corresponds to a displayable string value. For example, if
> you have a field of different colors, you can assign integers like this:
> 0 <=> white
> 1 <=> blue
> 2 <=> yellow
> ...
> Thus, you don't need to use strings for sorting.

This is how it is currently done. Sorting using an IndexSearcher does not do string comparisons at all, but just compares ordinals retrieved from an int[].

A memory saving optimization would be to not load the corresponding String[] in the string index (as discussed previously), but there is currently no way to tell the FieldCache that the strings are unneeded. The String values are only needed for merging results in a MultiSearcher.

-Yonik
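Yonik's point -- sorting compares ordinals, not strings -- can be illustrated with a small stand-alone sketch of a FieldCache-style string index. The StringIndex layout here (an order array indexed by doc id plus a lookup table) is simplified for illustration; Lucene's actual FieldCache differs in detail:

```java
import java.util.*;

// Build a FieldCache-style "string index" for one field: every doc's value
// is reduced to an ordinal into a sorted lookup array, so sorting compares ints.
class StringIndex {
    final int[] order;      // order[docId] = ordinal of that doc's value
    final String[] lookup;  // lookup[ordinal] = the value itself (only needed for merging)

    StringIndex(String[] valueByDoc) {
        String[] sorted = valueByDoc.clone();
        Arrays.sort(sorted);
        // dedupe into the sorted lookup table
        lookup = Arrays.stream(sorted).distinct().toArray(String[]::new);
        order = new int[valueByDoc.length];
        for (int doc = 0; doc < valueByDoc.length; doc++) {
            order[doc] = Arrays.binarySearch(lookup, valueByDoc[doc]);
        }
    }

    // Compare two docs by this field without touching the strings.
    int compare(int docA, int docB) {
        return Integer.compare(order[docA], order[docB]);
    }
}
```

Note that for the ordinal comparison to be valid the ordinals must follow sort order, which is why they are assigned from the sorted term order here rather than arbitrarily (as in the 0 <=> white, 1 <=> blue example); the String[] lookup side is only consulted when merging results across searchers.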
Re: Large scale sorting
Hi, Paul,

I think whether to warm up or not needs some benchmarking for the specific application.

For the implementation of the sort fields, when I talk about norms in Lucene, I am thinking we could borrow the same implementation as the norms to do it.

But, on a higher level, my idea is really just to create an array of integers for each sort field. The array length is NumOfDocs in the index. Each integer corresponds to a displayable string value. For example, if you have a field of different colors, you can assign integers like this:
0 <=> white
1 <=> blue
2 <=> yellow
...
Thus, you don't need to use strings for sorting. For example, if you have document numbers 0, 1, 2, which store colors blue, white, yellow respectively, the array would be: {1, 0, 2}.

To do sorting, this array could be pre-loaded into memory (warming up the index), or, during collecting the hits (in HitCollector), the relevant integer values could be loaded from disk given a doc id.

If you have 10 million documents, for one sort field, you will have a 10M x 4 = 40 MB array.

Cheers,
Jian

On 4/9/07, Paul Smith <[EMAIL PROTECTED]> wrote:
>
> > In our application, we have to sync up the index pretty frequently,
> > the warm-up of the index is killing it.
>
> Yep, it speeds up the first sort, but at the cost of making all the
> others slower (maybe significantly so). That's obviously not ideal but
> could make use of sorts in larger indexes practical.
>
> > To address your concern about single sort locale, what about creating
> > a sort field for each sort locale? So, if you have, say, 10 locales,
> > you will have 10 sort fields, each utilizing the mechanism of
> > constructing the norms.
>
> I really don't understand norms properly so I'm not sure exactly how
> that would help. I'll have to go over your original email again to
> understand. My main goal is to get some discussion going amongst the
> community, which hopefully we've kicked along.
>
> Paul
Re: Large scale sorting
> In our application, we have to sync up the index pretty frequently, the
> warm-up of the index is killing it.

Yep, it speeds up the first sort, but at the cost of making all the others slower (maybe significantly so). That's obviously not ideal but could make use of sorts in larger indexes practical.

> To address your concern about single sort locale, what about creating a
> sort field for each sort locale? So, if you have, say, 10 locales, you
> will have 10 sort fields, each utilizing the mechanism of constructing
> the norms.

I really don't understand norms properly so I'm not sure exactly how that would help. I'll have to go over your original email again to understand. My main goal is to get some discussion going amongst the community, which hopefully we've kicked along.

Paul
Re: Large scale sorting
Hi, Paul,

Thanks for your reply. Regarding your previous email about the need for a disk-based sorting solution, I kind of agree with your points. One incentive for your approach is that we don't need to warm up the index anymore in case the index is huge. In our application, we have to sync up the index pretty frequently; the warm-up of the index is killing it.

To address your concern about single sort locale, what about creating a sort field for each sort locale? So, if you have, say, 10 locales, you will have 10 sort fields, each utilizing the mechanism of constructing the norms.

At query time, in the HitCollector, for each doc id matched, you can load the field value (integer) through the IndexReader. (Here you need to enhance the IndexReader to be able to load the sort field values.) Then, you can use that value to reject/accept the doc, or factor it into the score.

What do you think?

Jian

On 4/9/07, Paul Smith <[EMAIL PROTECTED]> wrote:
>
> > Now, if we could use integers to represent the sort field values,
> > which is typically the case for most applications, maybe we can
> > afford to have the sort field values stored on disk and do a disk
> > lookup for each document matched? The look up of the sort field value
> > will be as simple as docNo * 4 * offset.
> >
> > This way, we use the same approach as constructing the norms (proper
> > merging for incremental indexing), but, at search time, we don't load
> > the sort field values into memory, instead, just store them on disk.
> >
> > Will this approach be good enough?
>
> While a nifty idea, I think this only works for a single sort locale. I
> initially came up with a similar idea that the terms are already stored
> in 'sorted' order and one might be able to use the terms position for
> sorting, it's just that the terms ordering position is different in
> different locales.
>
> Paul
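The disk-resident sort field Jian describes can be sketched as a fixed-width column file: one 4-byte int per document, so the lookup is a single seek to docNo * 4 (the thread's "docNo * 4 * offset" presumably means a fixed-width offset computation of this kind). A hedged, self-contained sketch; the DiskSortField class and its layout are assumptions, not anything in Lucene:

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// One sort field stored column-wise on disk: 4 bytes per document.
// At search time we seek per lookup instead of holding an int[] in memory.
class DiskSortField {
    private final RandomAccessFile file;

    DiskSortField(Path path, int[] valueByDoc) {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(path)))) {
            for (int v : valueByDoc) out.writeInt(v);  // write the column
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        try {
            file = new RandomAccessFile(path.toFile(), "r");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Fixed-width records make the lookup a single seek: offset = docNo * 4.
    int value(int docNo) {
        try {
            file.seek((long) docNo * 4);
            return file.readInt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience for experimentation: back the field with a temp file.
    static DiskSortField createTemp(int[] values) {
        try {
            Path p = Files.createTempFile("sortfield", ".bin");
            p.toFile().deleteOnExit();
            return new DiskSortField(p, values);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Note that each value() call is a random disk access, which is exactly Doug's objection later in the thread: a query matching many documents pays one seek per match unless the OS buffer cache absorbs them.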
[jira] Commented: (LUCENE-859) Expose the number of deleted docs in index/segment
[ https://issues.apache.org/jira/browse/LUCENE-859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487644 ]

Yonik Seeley commented on LUCENE-859:
-------------------------------------

> Though it might still be handy to have something with main() that spits out
> the number of deleted documents, as SegmentReader has in my patch.

I don't understand that comment. I don't see anything in your patch besides the implementation of deletedDocs().

> Maybe that should be added to the existing IndexReader.main ?

That sounds fine.

> Expose the number of deleted docs in index/segment
> --------------------------------------------------
>
>                 Key: LUCENE-859
>                 URL: https://issues.apache.org/jira/browse/LUCENE-859
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Otis Gospodnetic
>         Assigned To: Otis Gospodnetic
>            Priority: Minor
>         Attachments: LUCENE-859
>
>
> Use case:
> We've got a lot of large, mostly search-only indices. These indices are not
> re-optimized once "deployed". Docs in them do not get updated, but they do
> get deleted. After a while, the number of deleted docs grows, but it's hard
> to tell how many documents have been deleted.
> Exposing the number of deleted docs via a *Reader.deletedDocs() method lets
> you get to this number.
> I'm attaching a patch that touches the following:
> M src/test/org/apache/lucene/index/TestSegmentReader.java
> M src/java/org/apache/lucene/index/MultiReader.java
> M src/java/org/apache/lucene/index/IndexReader.java
> M src/java/org/apache/lucene/index/FilterIndexReader.java
> M src/java/org/apache/lucene/index/ParallelReader.java
> M src/java/org/apache/lucene/index/SegmentReader.java
> SegmentReader also got a public static main(String[]) that takes 1
> command-line parameter, a path to the index to check, and prints out the
> number of deleted docs.
Re: Large scale sorting
Paul Smith wrote:
> I don't disagree with the premise that it involves substantial I/O and
> would increase the time taken to sort, and why this approach shouldn't
> be the default mechanism, but it's not too difficult to build a disk I/O
> subsystem that can allocate many spindles to service this and to allow
> the underlying OS to use its buffer cache (yes, this is sounding like a
> database server now, isn't it).

My guess is that it'd be cheaper to just buy more RAM.

> It would be better if the sorting mechanism in Lucene was a little more
> decoupled such that more customised designs could be utilised for
> specific scenarios. Right now it's a one-for-all approach without
> substantial gutting of the code.

That's just what most folks have found useful to date. If you have a patch to decouple it, and others find it useful, then it should be seriously considered. I do have some concerns about whether the approach you suggest is in fact useful, but am happy to be proven wrong.

Doug
Re: Large scale sorting
> Now, if we could use integers to represent the sort field values, which
> is typically the case for most applications, maybe we can afford to
> have the sort field values stored on disk and do a disk lookup for each
> document matched? The look up of the sort field value will be as simple
> as docNo * 4 * offset.
>
> This way, we use the same approach as constructing the norms (proper
> merging for incremental indexing), but, at search time, we don't load
> the sort field values into memory, instead, just store them on disk.
>
> Will this approach be good enough?

While a nifty idea, I think this only works for a single sort locale. I initially came up with a similar idea that the terms are already stored in 'sorted' order and one might be able to use the terms position for sorting, it's just that the terms ordering position is different in different locales.

Paul
Re: Large scale sorting
On 10/04/2007, at 4:18 AM, Doug Cutting wrote:
> Paul Smith wrote:
> > Disadvantages to this approach:
> > * It's a lot more I/O intensive
>
> I think this would be prohibitive. Queries matching more than a few
> hundred documents will take several seconds to sort, since random disk
> accesses are required per matching document. Such an approach is only
> practical if you can guarantee that queries match fewer than a hundred
> documents, which is not generally the case, especially with large
> collections.

I don't disagree with the premise that it involves substantial I/O and would increase the time taken to sort, and why this approach shouldn't be the default mechanism, but it's not too difficult to build a disk I/O subsystem that can allocate many spindles to service this and to allow the underlying OS to use its buffer cache (yes, this is sounding like a database server now, isn't it).

I'm working on the basis that it's a LOT harder/more expensive to simply allocate more heap size to cover the current sorting infrastructure. One hits memory limits faster. Not everyone can afford 64-bit hardware with many GB of RAM to allocate to a heap. It _is_ cheaper/easier to build a disk subsystem to tune this I/O approach, and one can still use any RAM as buffer cache for the memory-mapped file anyway.

> In my experience, raw search time starts to climb towards one second
> per query as collections grow to around 10M documents (in round figures
> and with lots of assumptions). Thus, searching on a single CPU is less
> practical as collections grow substantially larger than 10M documents,
> and distributed solutions are required. So it would be convenient if
> sorting is also practical for ~10M document collections on standard
> hardware. If 10M strings with 20 characters are required in memory for
> efficient search, this requires 400MB. This is a lot, but not an
> unusual amount on today's machines. However, if you have a large number
> of fields, then this approach may be problematic and force you to
> consider a distributed solution earlier than you might otherwise.

400MB is not a lot in and of itself, but when one has many of these types of indexes, with many sorting fields and many locales on the same host, it becomes problematic.

I'm sure there's a point where distributing doesn't work over really large collections, because even if one partitioned an index across many hosts, one still needs to merge-sort the results together. It would be disappointing if Lucene's innate design limited itself to 10M document collections before needing to consider distributed solutions. 10M is not that many.

It would be better if the sorting mechanism in Lucene was a little more decoupled such that more customised designs could be utilised for specific scenarios. Right now it's a one-for-all approach without substantial gutting of the code.

cheers,

Paul
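Doug's 400MB figure is straightforward to reproduce. A quick back-of-the-envelope sketch, counting only the UTF-16 character data (real Java String objects add per-object and array overhead, so the true footprint is higher):

```java
// 10M cached sort keys of 20 characters each, 2 bytes per char (UTF-16).
class SortMemoryEstimate {
    static long estimateBytes(long numDocs, long charsPerKey) {
        final long bytesPerChar = 2;  // Java strings store UTF-16 code units
        return numDocs * charsPerKey * bytesPerChar;
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(10_000_000L, 20);
        System.out.println(bytes / 1_000_000 + " MB");  // prints "400 MB"
    }
}
```

Multiply by the number of sorted fields and locales per host, as Paul does above, and the per-index cost compounds quickly.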
[jira] Commented: (LUCENE-859) Expose the number of deleted docs in index/segment
[ https://issues.apache.org/jira/browse/LUCENE-859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487640 ]

Otis Gospodnetic commented on LUCENE-859:
-----------------------------------------

Though it might still be handy to have something with main() that spits out the number of deleted documents, as SegmentReader has in my patch. What do you think about committing just that?

Maybe that should be added to the existing IndexReader.main ? Or maybe it's time to start an app/class in contrib/index that takes various command-line parameters and prints out information about the index? If so, I'll move that to a new JIRA issue.
[jira] Closed: (LUCENE-859) Expose the number of deleted docs in index/segment
[ https://issues.apache.org/jira/browse/LUCENE-859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic closed LUCENE-859.
-----------------------------------

       Resolution: Won't Fix
    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

Doh, of course! numDocs() looks like this:

{code}
  public int numDocs() {
    int n = maxDoc();
    if (deletedDocs != null)
      n -= deletedDocs.count();
    return n;
  }
{code}

Won't Fix.
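The identity behind the Won't Fix is easy to see in a self-contained stand-in. SegmentStub is invented here purely for illustration; Lucene's SegmentReader keeps deletions in its own bit vector with a count() method, modeled below with java.util.BitSet.cardinality():

```java
import java.util.BitSet;

// Minimal stand-in showing why a deletedDocs() accessor is redundant:
// the count is already recoverable as maxDoc() - numDocs().
class SegmentStub {
    private final int maxDoc;
    private final BitSet deletedDocs;  // null when the segment has no deletions

    SegmentStub(int maxDoc, BitSet deletedDocs) {
        this.maxDoc = maxDoc;
        this.deletedDocs = deletedDocs;
    }

    int maxDoc() { return maxDoc; }

    int numDocs() {                    // mirrors SegmentReader.numDocs()
        int n = maxDoc();
        if (deletedDocs != null)
            n -= deletedDocs.cardinality();
        return n;
    }

    int deletedDocsCount() {           // the proposed method, derived
        return maxDoc() - numDocs();
    }
}
```

Since both maxDoc() and numDocs() are already public on IndexReader, callers can compute the difference themselves without a new API method.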
[jira] Commented: (LUCENE-859) Expose the number of deleted docs in index/segment
[ https://issues.apache.org/jira/browse/LUCENE-859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487635 ]

Yonik Seeley commented on LUCENE-859:
-------------------------------------

Isn't this redundant with existing IndexReader methods?

deletedDocs() == maxDoc() - numDocs()
Re: Large scale sorting
Hi Doug, I have been thinking about this as well lately and have some thoughts similar to Paul's approach. Lucene has the norm data for each document field. Conceptually it is a byte array with one byte for each document field. At query time, I think the norm array is loaded into memory the first time it is accessed, allowing for efficient lookup of the norm value for each document. Now, if we could use integers to represent the sort field values, which is typically the case for most applications, maybe we can afford to have the sort field values stored on disk and do a disk lookup for each document matched? The lookup of the sort field value would be as simple as seeking to offset + docNo * 4. This way, we use the same approach as constructing the norms (proper merging for incremental indexing), but, at search time, we don't load the sort field values into memory; instead, we just keep them on disk. Will this approach be good enough? Thanks for your feedback. Jian On 4/9/07, Doug Cutting <[EMAIL PROTECTED]> wrote: Paul Smith wrote: > Disadvantages to this approach: > * It's a lot more I/O intensive I think this would be prohibitive. Queries matching more than a few hundred documents will take several seconds to sort, since random disk accesses are required per matching document. Such an approach is only practical if you can guarantee that queries match fewer than a hundred documents, which is not generally the case, especially with large collections. > I'm working on the basis that it's a LOT harder/more expensive to simply > allocate more heap size to cover the current sorting infrastructure. > One hits memory limits faster. Not everyone can afford 64-bit hardware > with many Gb RAM to allocate to a heap. It _is_ cheaper/easier to build > a disk subsystem to tune this I/O approach, and one can still use any > RAM as buffer cache for the memory-mapped file anyway. 
In my experience, raw search time starts to climb towards one second per query as collections grow to around 10M documents (in round figures and with lots of assumptions). Thus, searching on a single CPU is less practical as collections grow substantially larger than 10M documents, and distributed solutions are required. So it would be convenient if sorting is also practical for ~10M document collections on standard hardware. If 10M strings with 20 characters are required in memory for efficient search, this requires 400MB. This is a lot, but not an unusual amount on today's machines. However, if you have a large number of fields, then this approach may be problematic and force you to consider a distributed solution earlier than you might otherwise. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
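Doug's 400MB figure can be checked mechanically; the class and method names below are illustrative only (and object/array overhead is ignored):

```java
// Back-of-the-envelope check of the memory estimate above: 10M sort keys of
// 20 chars each, at 2 bytes per Java (UTF-16) char, is 400,000,000 bytes,
// i.e. "400MB" in round figures.
public class SortMemoryEstimate {
    public static long bytesForStrings(long numDocs, long charsPerString) {
        return numDocs * charsPerString * 2; // UTF-16 chars are 2 bytes each
    }

    public static void main(String[] args) {
        long bytes = bytesForStrings(10_000_000L, 20);
        System.out.println(bytes + " bytes"); // 400000000 bytes
    }
}
```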
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487631 ] Otis Gospodnetic commented on LUCENE-584: - Doron: just to address your question from Apr/7 - I expect/hope to see an improvement in performance because of this difference: hc.collect(doc(), score()); mc.collect(doc()); the delta being the cost of the score() call that does the scoring. If I understand things correctly, that means that what grant described at the bottom of http://lucene.apache.org/java/docs/scoring.html will all be skipped. No Scorer, no BooleanScorer(2), no ConjunctionScorer... > Decouple Filter from BitSet > --- > > Key: LUCENE-584 > URL: https://issues.apache.org/jira/browse/LUCENE-584 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.0.1 >Reporter: Peter Schäfer >Priority: Minor > Attachments: bench-diff.txt, bench-diff.txt, BitsMatcher.java, > Filter-20060628.patch, HitCollector-20060628.patch, > IndexSearcher-20060628.patch, MatchCollector.java, Matcher.java, > Matcher20070226.patch, Scorer-20060628.patch, Searchable-20060628.patch, > Searcher-20060628.patch, Some Matchers.zip, SortedVIntList.java, > TestSortedVIntList.java > > > {code} > package org.apache.lucene.search; > public abstract class Filter implements java.io.Serializable > { > public abstract AbstractBitSet bits(IndexReader reader) throws IOException; > } > public interface AbstractBitSet > { > public boolean get(int index); > } > {code} > It would be useful if the method =Filter.bits()= returned an abstract > interface, instead of =java.util.BitSet=. > Use case: there is a very large index, and, depending on the user's > privileges, only a small portion of the index is actually visible. > Sparsely populated =java.util.BitSet=s are not efficient and waste lots of > memory. It would be desirable to have an alternative BitSet implementation > with smaller memory footprint. 
> Though it _is_ possibly to derive classes from =java.util.BitSet=, it was > obviously not designed for that purpose. > That's why I propose to use an interface instead. The default implementation > could still delegate to =java.util.BitSet=. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
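The shape of the proposal quoted above can be sketched with a sorted-int-array implementation; `SparseBits` is hypothetical, for the "few visible docs in a huge index" use case, and is not one of the attached patches:

```java
import java.util.Arrays;

// Sketch of the proposed decoupling: Filter would return this interface
// instead of java.util.BitSet. SparseBits is a hypothetical sparse
// implementation for permission-style filters that match few documents.
interface AbstractBitSet {
    boolean get(int index);
}

class SparseBits implements AbstractBitSet {
    private final int[] sortedDocIds; // ascending doc ids that are "set"

    SparseBits(int[] sortedDocIds) {
        this.sortedDocIds = sortedDocIds;
    }

    // O(log n) in the number of set bits; memory is 4 bytes per set bit
    // rather than maxDoc/8 bytes for a dense java.util.BitSet.
    public boolean get(int index) {
        return Arrays.binarySearch(sortedDocIds, index) >= 0;
    }

    public static void main(String[] args) {
        AbstractBitSet bits = new SparseBits(new int[] {3, 17, 9_000_000});
        System.out.println(bits.get(17)); // true
        System.out.println(bits.get(18)); // false
    }
}
```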
[jira] Updated: (LUCENE-859) Expose the number of deleted docs in index/segment
[ https://issues.apache.org/jira/browse/LUCENE-859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-859: Attachment: LUCENE-859 El patcho. > Expose the number of deleted docs in index/segment > -- > > Key: LUCENE-859 > URL: https://issues.apache.org/jira/browse/LUCENE-859 > Project: Lucene - Java > Issue Type: New Feature > Components: Index >Reporter: Otis Gospodnetic > Assigned To: Otis Gospodnetic >Priority: Minor > Attachments: LUCENE-859 > > > Use case: > We've got a lot of large, mostly search-only indices. These indices are not > re-optimized once "deployed". Docs in them do not get updated, but they do > get deleted. After a while, the number of deleted docs grows, but it's hard > to tell how many documents have been deleted. > Exposing the number of deleted docs via *Reader.deletedDocs() method let's > you get to this number. > I'm attaching patch that touches the following: > M src/test/org/apache/lucene/index/TestSegmentReader.java > M src/java/org/apache/lucene/index/MultiReader.java > M src/java/org/apache/lucene/index/IndexReader.java > M src/java/org/apache/lucene/index/FilterIndexReader.java > M src/java/org/apache/lucene/index/ParallelReader.java > M src/java/org/apache/lucene/index/SegmentReader.java > SegmentReader also got a public static main(String[]) that takes 1 > command-line parameter, a path to the index to check, and prints out the > number of deleted docs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Created: (LUCENE-859) Expose the number of deleted docs in index/segment
Expose the number of deleted docs in index/segment -- Key: LUCENE-859 URL: https://issues.apache.org/jira/browse/LUCENE-859 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Otis Gospodnetic Assigned To: Otis Gospodnetic Priority: Minor Attachments: LUCENE-859 Use case: We've got a lot of large, mostly search-only indices. These indices are not re-optimized once "deployed". Docs in them do not get updated, but they do get deleted. After a while, the number of deleted docs grows, but it's hard to tell how many documents have been deleted. Exposing the number of deleted docs via the *Reader.deletedDocs() method lets you get to this number. I'm attaching a patch that touches the following: M src/test/org/apache/lucene/index/TestSegmentReader.java M src/java/org/apache/lucene/index/MultiReader.java M src/java/org/apache/lucene/index/IndexReader.java M src/java/org/apache/lucene/index/FilterIndexReader.java M src/java/org/apache/lucene/index/ParallelReader.java M src/java/org/apache/lucene/index/SegmentReader.java SegmentReader also got a public static main(String[]) that takes 1 command-line parameter, a path to the index to check, and prints out the number of deleted docs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff
[ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487617 ] Doron Cohen commented on LUCENE-848: Seems okay to me (since it's all in the benchmark). > Add supported for Wikipedia English as a corpus in the benchmarker stuff > > > Key: LUCENE-848 > URL: https://issues.apache.org/jira/browse/LUCENE-848 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/benchmark >Reporter: Steven Parkes > Assigned To: Steven Parkes >Priority: Minor > Fix For: 2.2 > > Attachments: LUCENE-848.txt, WikipediaHarvester.java > > > Add support for using Wikipedia for benchmarking. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487616 ] Doron Cohen commented on LUCENE-584: > > When you rerun, you may want to use my alg - to compare the two approaches > > in one run. > This is more dangerous though. Agree. I was trying to get rid of this by splitting each round into 3: - gc(), warm(), work() - where work() and warm() are the same, just that warm()'s stats are disregarded. Still, switching the order of "by match" and "by bits" yields different results. Sometimes we would like not to disregard GC - in particular if one approach is creating more (or more complex) garbage than another approach. Perhaps we should look at two measures: best & avg/sum (the 2nd ignoring the first run, for hotspot). > Decouple Filter from BitSet > --- > > Key: LUCENE-584 > URL: https://issues.apache.org/jira/browse/LUCENE-584 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.0.1 >Reporter: Peter Schäfer >Priority: Minor > Attachments: bench-diff.txt, bench-diff.txt, BitsMatcher.java, > Filter-20060628.patch, HitCollector-20060628.patch, > IndexSearcher-20060628.patch, MatchCollector.java, Matcher.java, > Matcher20070226.patch, Scorer-20060628.patch, Searchable-20060628.patch, > Searcher-20060628.patch, Some Matchers.zip, SortedVIntList.java, > TestSortedVIntList.java > > > {code} > package org.apache.lucene.search; > public abstract class Filter implements java.io.Serializable > { > public abstract AbstractBitSet bits(IndexReader reader) throws IOException; > } > public interface AbstractBitSet > { > public boolean get(int index); > } > {code} > It would be useful if the method =Filter.bits()= returned an abstract > interface, instead of =java.util.BitSet=. > Use case: there is a very large index, and, depending on the user's > privileges, only a small portion of the index is actually visible. 
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of > memory. It would be desirable to have an alternative BitSet implementation > with smaller memory footprint. > Though it _is_ possibly to derive classes from =java.util.BitSet=, it was > obviously not designed for that purpose. > That's why I propose to use an interface instead. The default implementation > could still delegate to =java.util.BitSet=. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487613 ] Mike Klaas commented on LUCENE-584: --- Instead of discarding the first run, the approach I usually take is to run 3-4 times and pick the minimum. You can then run several of these "sets" and average over the minimum of each. GC is still an issue, though. It is hard to get around when it is a mark&sweep collector (reference counting is much friendlier in this regard). > Decouple Filter from BitSet > --- > > Key: LUCENE-584 > URL: https://issues.apache.org/jira/browse/LUCENE-584 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.0.1 >Reporter: Peter Schäfer >Priority: Minor > Attachments: bench-diff.txt, bench-diff.txt, BitsMatcher.java, > Filter-20060628.patch, HitCollector-20060628.patch, > IndexSearcher-20060628.patch, MatchCollector.java, Matcher.java, > Matcher20070226.patch, Scorer-20060628.patch, Searchable-20060628.patch, > Searcher-20060628.patch, Some Matchers.zip, SortedVIntList.java, > TestSortedVIntList.java > > > {code} > package org.apache.lucene.search; > public abstract class Filter implements java.io.Serializable > { > public abstract AbstractBitSet bits(IndexReader reader) throws IOException; > } > public interface AbstractBitSet > { > public boolean get(int index); > } > {code} > It would be useful if the method =Filter.bits()= returned an abstract > interface, instead of =java.util.BitSet=. > Use case: there is a very large index, and, depending on the user's > privileges, only a small portion of the index is actually visible. > Sparsely populated =java.util.BitSet=s are not efficient and waste lots of > memory. It would be desirable to have an alternative BitSet implementation > with smaller memory footprint. > Though it _is_ possibly to derive classes from =java.util.BitSet=, it was > obviously not designed for that purpose. 
> That's why I propose to use an interface instead. The default implementation > could still delegate to =java.util.BitSet=. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
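Mike's run-several-times-and-take-the-minimum scheme might look like this in outline (names are illustrative, not benchmark-contrib API):

```java
// Sketch of the benchmarking approach discussed above: time a task several
// times in one JVM and keep the minimum, which filters out GC pauses and
// cold-hotspot runs better than a plain average of all runs would.
public class MinOfRuns {
    public static long minMillis(Runnable task, int runs) {
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            best = Math.min(best, elapsedMs);
        }
        return best;
    }

    public static void main(String[] args) {
        long best = minMillis(() -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) sum += i;
        }, 4);
        System.out.println("best of 4 runs: " + best + "ms");
    }
}
```

Several such "sets" can then be averaged, as Mike suggests, to smooth over remaining run-to-run noise.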
[jira] Commented: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff
[ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487609 ] Steven Parkes commented on LUCENE-848: -- That's what I meant (and did). If it's okay, I'll bundle it into 848. > Add supported for Wikipedia English as a corpus in the benchmarker stuff > > > Key: LUCENE-848 > URL: https://issues.apache.org/jira/browse/LUCENE-848 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/benchmark >Reporter: Steven Parkes > Assigned To: Steven Parkes >Priority: Minor > Fix For: 2.2 > > Attachments: LUCENE-848.txt, WikipediaHarvester.java > > > Add support for using Wikipedia for benchmarking. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff
[ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487608 ] Doron Cohen commented on LUCENE-848: > Also, I was going to add support to the algorithm format for setting max > field length ... If this means extending the algorithm language, it would be simpler to just base on a property here - in the alg file set that property - "max.field.length=2" - and then in OpenIndexTask read that new property (see how merge.factor property is read) and set it on the index. > Add supported for Wikipedia English as a corpus in the benchmarker stuff > > > Key: LUCENE-848 > URL: https://issues.apache.org/jira/browse/LUCENE-848 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/benchmark >Reporter: Steven Parkes > Assigned To: Steven Parkes >Priority: Minor > Fix For: 2.2 > > Attachments: LUCENE-848.txt, WikipediaHarvester.java > > > Add support for using Wikipedia for benchmarking. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff
[ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487600 ] Steven Parkes commented on LUCENE-848: -- By the way, that's a rough patch. I'm cleaning it up as I use it to test 847. Also, I was going to add support to the algorithm format for setting max field length ... > Add supported for Wikipedia English as a corpus in the benchmarker stuff > > > Key: LUCENE-848 > URL: https://issues.apache.org/jira/browse/LUCENE-848 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/benchmark >Reporter: Steven Parkes > Assigned To: Steven Parkes >Priority: Minor > Fix For: 2.2 > > Attachments: LUCENE-848.txt, WikipediaHarvester.java > > > Add support for using Wikipedia for benchmarking. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Liu updated LUCENE-855: Attachment: TestRangeFilterPerformanceComparison.java Here's my new benchmark. > MemoryCachedRangeFilter to boost performance of Range queries > - > > Key: LUCENE-855 > URL: https://issues.apache.org/jira/browse/LUCENE-855 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.1 >Reporter: Andy Liu > Assigned To: Otis Gospodnetic > Attachments: FieldCacheRangeFilter.patch, > FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, > MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, > TestRangeFilterPerformanceComparison.java > > > Currently RangeFilter uses TermEnum and TermDocs to find documents that fall > within the specified range. This requires iterating through every single > term in the index and can get rather slow for large document sets. > MemoryCachedRangeFilter reads all pairs of a given field, > sorts by value, and stores in a SortedFieldCache. During bits(), binary > searches are used to find the start and end indices of the lower and upper > bound values. The BitSet is populated by all the docId values that fall in > between the start and end indices. > TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed > index with random date values within a 5 year range. Executing bits() 1000 > times on standard RangeQuery using random date intervals took 63904ms. Using > MemoryCachedRangeFilter, it took 876ms. Performance increase is less > dramatic when you have less unique terms in a field or using less number of > documents. > Currently MemoryCachedRangeFilter only works with numeric values (values are > stored in a long[] array) but it can be easily changed to support Strings. A > side "benefit" of storing the values are stored as longs, is that there's no > longer the need to make the values lexographically comparable, i.e. 
padding > numeric values with zeros. > The downside of using MemoryCachedRangeFilter is there's a fairly significant > memory requirement. So it's designed to be used in situations where range > filter performance is critical and memory consumption is not an issue. The > memory requirements are: (sizeof(int) + sizeof(long)) * numDocs. > MemoryCachedRangeFilter also requires a warmup step which can take a while to > run in large datasets (it took 40s to run on a 3M document corpus). Warmup > can be called explicitly or is automatically called the first time > MemoryCachedRangeFilter is applied using a given field. > So in summery, MemoryCachedRangeFilter can be useful when: > - Performance is critical > - Memory is not an issue > - Field contains many unique numeric values > - Index contains large amount of documents -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487595 ] Andy Liu commented on LUCENE-855: - In your updated benchmark, you're combining the range filter with a term query that matches one document. I don't believe that's the typical use case for a range filter. Usually the user employs a range to filter a large document set. I created a different benchmark to compare the standard range filter, MemoryCachedRangeFilter, and Matt's FieldCacheRangeFilter using MatchAllDocsQuery, ConstantScoreQuery, and TermQuery (matching one doc like the last benchmark). Here are the results:
Reader opened with 10 documents. Creating RangeFilters...
RangeFilter w/MatchAllDocsQuery: * Bits: 4421 * Search: 5285
RangeFilter w/ConstantScoreQuery: * Bits: 4200 * Search: 8694
RangeFilter w/TermQuery: * Bits: 4088 * Search: 4133
MemoryCachedRangeFilter w/MatchAllDocsQuery: * Bits: 80 * Search: 1142
MemoryCachedRangeFilter w/ConstantScoreQuery: * Bits: 79 * Search: 482
MemoryCachedRangeFilter w/TermQuery: * Bits: 73 * Search: 95
FieldCacheRangeFilter w/MatchAllDocsQuery: * Bits: 0 * Search: 1146
FieldCacheRangeFilter w/ConstantScoreQuery: * Bits: 1 * Search: 356
FieldCacheRangeFilter w/TermQuery: * Bits: 0 * Search: 19
Here are some points:
1. When searching in a filter, bits() is called, so the search time includes bits() time.
2. Matt's FieldCacheRangeFilter is faster for ConstantScoreQuery, although not by much. Using MatchAllDocsQuery, they run neck-and-neck. FCRF is much faster for TermQuery since MCRF has to create the BitSet for the range before the search is executed.
3. I get fewer document hits when running FieldCacheRangeFilter with ConstantScoreQuery. Matt, there may be a bug in getNextSetBit(). Not sure if this would affect the benchmark.
4. I'd be interested to see performance numbers when FieldCacheRangeFilter is used with ChainedFilter. 
I suspect that MCRF would be faster in this case, since I'm assuming that FCRF has to reconstruct a standard BitSet during clone(). > MemoryCachedRangeFilter to boost performance of Range queries > - > > Key: LUCENE-855 > URL: https://issues.apache.org/jira/browse/LUCENE-855 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.1 >Reporter: Andy Liu > Assigned To: Otis Gospodnetic > Attachments: FieldCacheRangeFilter.patch, > FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, > MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch > > > Currently RangeFilter uses TermEnum and TermDocs to find documents that fall > within the specified range. This requires iterating through every single > term in the index and can get rather slow for large document sets. > MemoryCachedRangeFilter reads all pairs of a given field, > sorts by value, and stores in a SortedFieldCache. During bits(), binary > searches are used to find the start and end indices of the lower and upper > bound values. The BitSet is populated by all the docId values that fall in > between the start and end indices. > TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed > index with random date values within a 5 year range. Executing bits() 1000 > times on standard RangeQuery using random date intervals took 63904ms. Using > MemoryCachedRangeFilter, it took 876ms. Performance increase is less > dramatic when you have less unique terms in a field or using less number of > documents. > Currently MemoryCachedRangeFilter only works with numeric values (values are > stored in a long[] array) but it can be easily changed to support Strings. A > side "benefit" of storing the values are stored as longs, is that there's no > longer the need to make the values lexographically comparable, i.e. padding > numeric values with zeros. > The downside of using MemoryCachedRangeFilter is there's a fairly significant > memory requirement. 
So it's designed to be used in situations where range > filter performance is critical and memory consumption is not an issue. The > memory requirements are: (sizeof(int) + sizeof(long)) * numDocs. > MemoryCachedRangeFilter also requires a warmup step which can take a while to > run in large datasets (it took 40s to run on a 3M document corpus). Warmup > can be called explicitly or is automatically called the first time > MemoryCachedRangeFilter is applied using a given field. > So in summery, MemoryCachedRangeFilter can be useful when: >
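The core of the MemoryCachedRangeFilter idea described above - sorted field values plus binary searches per range - can be sketched as follows (class and field names are hypothetical, not the patch's actual API):

```java
import java.util.Arrays;
import java.util.BitSet;

// Sketch of MemoryCachedRangeFilter's core idea: field values are loaded once,
// sorted, and each range lookup is a binary search for the lower bound plus a
// sweep over the slice of values <= the upper bound.
public class SortedFieldRange {
    private final long[] sortedValues; // field value per position, ascending
    private final int[] docIds;        // docIds[i] is the doc holding sortedValues[i]

    public SortedFieldRange(long[] sortedValues, int[] docIds) {
        this.sortedValues = sortedValues;
        this.docIds = docIds;
    }

    // Set a bit for every doc whose value falls in [lower, upper].
    public BitSet bits(long lower, long upper, int maxDoc) {
        BitSet result = new BitSet(maxDoc);
        int start = lowerBound(lower);
        for (int i = start; i < sortedValues.length && sortedValues[i] <= upper; i++) {
            result.set(docIds[i]);
        }
        return result;
    }

    // First index whose value >= key (handles binarySearch's insertion-point return).
    private int lowerBound(long key) {
        int idx = Arrays.binarySearch(sortedValues, key);
        if (idx < 0) return -idx - 1;
        while (idx > 0 && sortedValues[idx - 1] == key) idx--; // rewind over duplicates
        return idx;
    }

    public static void main(String[] args) {
        SortedFieldRange f = new SortedFieldRange(
            new long[] {10, 20, 20, 30, 40}, new int[] {4, 0, 2, 1, 3});
        System.out.println(f.bits(15, 30, 5)); // {0, 1, 2}
    }
}
```

This also makes the stated memory cost concrete: one long per doc for the values and one int per doc for the ids, i.e. (sizeof(int) + sizeof(long)) * numDocs.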
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487594 ] Yonik Seeley commented on LUCENE-584: - > When you rerun, you may want to use my alg - to compare the two approaches in > one run. This is more dangerous though. GC from one method's garbage can penalize the 2nd methods performance. Also, hotspot effects are hard to account for (if method1 and method2 use common methods, method2 will often execute faster than method one because more optimization has been done on those common methods). The hotspot effect can be minimized by running the test multiple times in the same JVM instance and discarding the first runs, but it's not so easy for GC. > Decouple Filter from BitSet > --- > > Key: LUCENE-584 > URL: https://issues.apache.org/jira/browse/LUCENE-584 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.0.1 >Reporter: Peter Schäfer >Priority: Minor > Attachments: bench-diff.txt, bench-diff.txt, BitsMatcher.java, > Filter-20060628.patch, HitCollector-20060628.patch, > IndexSearcher-20060628.patch, MatchCollector.java, Matcher.java, > Matcher20070226.patch, Scorer-20060628.patch, Searchable-20060628.patch, > Searcher-20060628.patch, Some Matchers.zip, SortedVIntList.java, > TestSortedVIntList.java > > > {code} > package org.apache.lucene.search; > public abstract class Filter implements java.io.Serializable > { > public abstract AbstractBitSet bits(IndexReader reader) throws IOException; > } > public interface AbstractBitSet > { > public boolean get(int index); > } > {code} > It would be useful if the method =Filter.bits()= returned an abstract > interface, instead of =java.util.BitSet=. > Use case: there is a very large index, and, depending on the user's > privileges, only a small portion of the index is actually visible. > Sparsely populated =java.util.BitSet=s are not efficient and waste lots of > memory. 
It would be desirable to have an alternative BitSet implementation > with smaller memory footprint. > Though it _is_ possibly to derive classes from =java.util.BitSet=, it was > obviously not designed for that purpose. > That's why I propose to use an interface instead. The default implementation > could still delegate to =java.util.BitSet=. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff
[ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Parkes updated LUCENE-848: - Attachment: LUCENE-848.txt This patch is a first cut at Wikipedia benchmark support. It downloads the current English pages from the Wikipedia download site ... which, of course, is actually not there right now. I'm not quite sure what's up, but you can find the files at http://download.wikimedia.org/enwiki/20070402/ right now if you want to play. It adds ExtractWikipedia.java, which uses Xerces-J to grab the individual articles. It writes the articles in the same format as the Reuters stuff, so a genericised ReutersDocMaker, DirDocMaker, works. The current size of the download file is 2.1G bzip2'd. It's supposed to contain about 1.2M documents but I came out with 2 or 3, I think, so there may be "extra" files in there. (Some entries are links and I tried to get rid of those, but I may have missed a particular coding or case). For the first pass, I copied the Reuters steps of decompressing and parsing. This creates big temporary files. Moreover, it creates a big directory tree in the end. (The extractor uses a fixed number of documents per directory and grows the depth of the tree logarithmically, a lot like Lucene segments). It's not clear how this preprocessing-to-a-directory-tree compares to on-the-fly decompression, which would require fewer disk seeks on the input during indexing. May try that at some point ... > Add supported for Wikipedia English as a corpus in the benchmarker stuff > > > Key: LUCENE-848 > URL: https://issues.apache.org/jira/browse/LUCENE-848 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/benchmark >Reporter: Steven Parkes > Assigned To: Steven Parkes >Priority: Minor > Fix For: 2.2 > > Attachments: LUCENE-848.txt, WikipediaHarvester.java > > > Add support for using Wikipedia for benchmarking. -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Assigned: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic reassigned LUCENE-855: --- Assignee: Otis Gospodnetic > MemoryCachedRangeFilter to boost performance of Range queries > - > > Key: LUCENE-855 > URL: https://issues.apache.org/jira/browse/LUCENE-855 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.1 >Reporter: Andy Liu > Assigned To: Otis Gospodnetic > Attachments: FieldCacheRangeFilter.patch, > FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, > MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch > > > Currently RangeFilter uses TermEnum and TermDocs to find documents that fall > within the specified range. This requires iterating through every single > term in the index and can get rather slow for large document sets. > MemoryCachedRangeFilter reads all pairs of a given field, > sorts by value, and stores in a SortedFieldCache. During bits(), binary > searches are used to find the start and end indices of the lower and upper > bound values. The BitSet is populated by all the docId values that fall in > between the start and end indices. > TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed > index with random date values within a 5 year range. Executing bits() 1000 > times on standard RangeQuery using random date intervals took 63904ms. Using > MemoryCachedRangeFilter, it took 876ms. Performance increase is less > dramatic when you have less unique terms in a field or using less number of > documents. > Currently MemoryCachedRangeFilter only works with numeric values (values are > stored in a long[] array) but it can be easily changed to support Strings. A > side "benefit" of storing the values are stored as longs, is that there's no > longer the need to make the values lexographically comparable, i.e. padding > numeric values with zeros. 
> The downside of using MemoryCachedRangeFilter is that there's a fairly significant > memory requirement, so it's designed for situations where range > filter performance is critical and memory consumption is not an issue. The > memory requirement is: (sizeof(int) + sizeof(long)) * numDocs. > MemoryCachedRangeFilter also requires a warmup step which can take a while > on large datasets (it took 40s to run on a 3M document corpus). Warmup > can be called explicitly, or it is automatically run the first time > MemoryCachedRangeFilter is applied to a given field. > So in summary, MemoryCachedRangeFilter can be useful when: > - Performance is critical > - Memory is not an issue > - The field contains many unique numeric values > - The index contains a large number of documents -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
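The sorted-cache-plus-binary-search idea the issue describes can be sketched in plain Java, independent of the actual patch. This is a minimal illustration only: the class and method names below are hypothetical, not the ones in MemoryCachedRangeFilter, and it assumes the field values have already been read into a sorted long[] with a parallel int[] of docIds.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical sketch of the SortedFieldCache idea: field values sorted
// ascending in sortedValues, with docIds[i] holding the document that
// slot i came from. bits() then becomes two binary searches plus a scan.
public class RangeSketch {
    static BitSet rangeBits(long[] sortedValues, int[] docIds,
                            long lower, long upper, int numDocs) {
        BitSet bits = new BitSet(numDocs);
        int start = lowerBound(sortedValues, lower); // first slot >= lower
        int end = upperBound(sortedValues, upper);   // first slot > upper
        for (int i = start; i < end; i++) {
            bits.set(docIds[i]);
        }
        return bits;
    }

    static int lowerBound(long[] a, long key) {
        int i = Arrays.binarySearch(a, key);
        if (i < 0) return -i - 1;                 // insertion point
        while (i > 0 && a[i - 1] == key) i--;     // step back over duplicates
        return i;
    }

    static int upperBound(long[] a, long key) {
        int i = Arrays.binarySearch(a, key);
        if (i < 0) return -i - 1;
        while (i < a.length - 1 && a[i + 1] == key) i++; // skip duplicates
        return i + 1;                             // one past last occurrence
    }
}
```

Each bits() call is then O(log n + matches) rather than a full TermEnum walk, which is where the reported 63904ms-to-876ms difference comes from; the one-time sort of the cache is the warmup cost mentioned above.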
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487590 ] Otis Gospodnetic commented on LUCENE-855: - Comments about the patch so far: Cosmetics: - You don't want to refer to Andy's class in the javadocs, as that class won't go in unless Andy makes it faster. - I see some incorrect (copy/paste error) javadocs and javadocs/comments with typos in both the test classes and non-test classes. - Please configure your Lucene project in Eclipse to use 2 spaces instead of 4. In general, once you get the code formatting settings right, it's good practice to format your code with those settings before submitting a patch. Testing: - You can put the testPerformance() code from TestFieldCacheRangeFilterPerformance in the other unit test class you have there. - Your testPerformance() doesn't actually assert...() anything, it just prints numbers to stdout. You can keep the printing, but it would be better to also do some asserts, so we can always verify that the FCRangeFilter beats the vanilla RangeFilter without looking at stdout. - You may want to close that searcher in testPerformance() before opening a new one. Probably won't make any difference, but still. - You may also want to close the searcher at the end of the method. Impl: - In the inner FieldCacheBitSet class, I see:

+    public boolean intersects(BitSet set) {
+      for (int i = 0; i < length; i++) {
+        if (get(i) && set.get(i)) {
+          return true;
+        }
+      }
+      return false;
+    }

Is there room for a small optimization? What if the BitSets are not of equal size? Wouldn't it make sense to loop through the smaller BitSet then? Sorry if I'm off, I hardly ever work with BitSets. - I see you made the *_PARSERs in FCImpl public (they were private). Is that really needed? Would package-protected be enough? - Make sure the ASL is in all test and non-test classes, I don't see it there now. Overall, I like it - slick and elegant usage of FC! 
I'd love to know what Hoss and other big Filter users think about this. Solr makes a lot of use of (Range?)Filters, I believe. > MemoryCachedRangeFilter to boost performance of Range queries > - > > Key: LUCENE-855 > URL: https://issues.apache.org/jira/browse/LUCENE-855 > Project: Lucene - Java > Issue Type: Improvement > Components: Search > Affects Versions: 2.1 > Reporter: Andy Liu > [...]
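The small optimization Otis asks about can be sketched like this (a standalone illustration on java.util.BitSet, not the patch's inner FieldCacheBitSet): since bits at or beyond a set's length() are always false, the loop only needs to run up to the shorter set's length.

```java
import java.util.BitSet;

// Sketch of the suggested optimization: scan only up to the length of
// the shorter BitSet, because every bit past either length() is false,
// so no intersection can occur there.
public class IntersectsSketch {
    static boolean intersects(BitSet a, BitSet b) {
        int limit = Math.min(a.length(), b.length());
        for (int i = 0; i < limit; i++) {
            if (a.get(i) && b.get(i)) {
                return true;
            }
        }
        return false;
    }
}
```

For plain java.util.BitSet arguments there is also a built-in BitSet.intersects(BitSet) that works word-at-a-time; the loop form only matters for a custom BitSet subclass like the one in the patch, where get(i) is overridden.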
Re: optimize() method call
Otis Gospodnetic wrote: I'd advise against calling optimize() at all in an environment whose indices are constantly updated. +1 Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Large scale sorting
Paul Smith wrote: > Disadvantages to this approach: > * It's a lot more I/O intensive I think this would be prohibitive. Queries matching more than a few hundred documents will take several seconds to sort, since random disk accesses are required per matching document. Such an approach is only practical if you can guarantee that queries match fewer than a hundred documents, which is not generally the case, especially with large collections. > I'm working on the basis that it's a LOT harder/more expensive to simply > allocate more heap size to cover the current sorting infrastructure. One hits > memory limits faster. Not everyone can afford 64-bit hardware with many GB of > RAM to allocate to a heap. It _is_ cheaper/easier to build a disk subsystem to > tune this I/O approach, and one can still use any RAM as buffer cache for the > memory-mapped file anyway. In my experience, raw search time starts to climb towards one second per query as collections grow to around 10M documents (in round figures and with lots of assumptions). Thus, searching on a single CPU is less practical as collections grow substantially larger than 10M documents, and distributed solutions are required. So it would be convenient if sorting were also practical for ~10M document collections on standard hardware. If 10M strings of 20 characters each are required in memory for efficient search, this requires 400MB. This is a lot, but not an unusual amount on today's machines. However, if you have a large number of fields, then this approach may be problematic and force you to consider a distributed solution earlier than you might otherwise. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
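The figures quoted in this thread follow from simple per-document arithmetic. A back-of-the-envelope sketch (assumes Java's 2-byte chars and ignores per-object and array overhead, so real heap usage is higher):

```java
// Rough memory estimates for the two caching schemes discussed:
// - a String sort cache: numDocs * avgChars * 2 bytes (UTF-16 code units)
// - MemoryCachedRangeFilter: (sizeof(int) + sizeof(long)) per document
public class MemEstimate {
    static long stringCacheBytes(long numDocs, int avgChars) {
        return numDocs * avgChars * 2L; // 2 bytes per char in Java
    }

    static long mcrfBytes(long numDocs) {
        return numDocs * (4L + 8L);     // one int docId + one long value
    }
}
```

With these assumptions, 10M strings of 20 characters come to 400MB, matching Doug's figure, and MemoryCachedRangeFilter's 12 bytes/doc comes to 36MB for the 3M-document corpus mentioned in LUCENE-855.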
Re: Progressive Query Relaxation
The idea is to efficiently get the desired result set (top N) at once, without having to re-run different queries inside the application logic. Query relaxation avoids having several round trips, and could possibly be offered with and without deduplication. Maybe this is a feature required for Solr rather than for Lucene. Question: Even if Lucene's score is not absolute, does it somewhat determine a partial order among the results of different queries? J.D. 2007/4/9, Otis Gospodnetic <[EMAIL PROTECTED]>: Not that I know of. One typically puts that in application logic and re-runs or offers to run alternative queries. No de-duping there, unless you do it in your app. I think one problem with the described approach and Lucene would be that Lucene's scores are not "absolute". Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share - Original Message From: J. Delgado <[EMAIL PROTECTED]> To: java-dev@lucene.apache.org; solr-dev@lucene.apache.org Sent: Monday, April 9, 2007 3:46:40 AM Subject: Progressive Query Relaxation Has anyone within the Lucene or Solr community attempted to code a progressive query relaxation technique similar to the one described here for Oracle Text? http://www.oracle.com/technology/products/text/htdocs/prog_relax.html Thanks, -- J.D. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
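The application-level workaround Otis describes (re-run progressively looser queries, de-dupe in the app) can be sketched generically. This is a hypothetical illustration, not Lucene or Oracle Text code: the "query" is an opaque type and the search function is supplied by the caller.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of progressive relaxation done in application
// logic: run queries from strictest to loosest, de-duplicating hits by
// document id, and stop as soon as N results have been collected.
// Insertion order is preserved, so stricter matches rank first.
public class RelaxationSketch {
    static <Q> List<Integer> relax(List<Q> queriesStrictToLoose,
                                   Function<Q, List<Integer>> search,
                                   int n) {
        LinkedHashSet<Integer> hits = new LinkedHashSet<>();
        for (Q q : queriesStrictToLoose) {
            for (int doc : search.apply(q)) {
                hits.add(doc); // a duplicate leaves the set unchanged
                if (hits.size() == n) {
                    return new ArrayList<>(hits);
                }
            }
        }
        return new ArrayList<>(hits);
    }
}
```

Note this sidesteps the score-comparability problem Otis raises: results are ordered by which relaxation pass produced them, not by mixing scores across queries.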
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487587 ] Otis Gospodnetic commented on LUCENE-855: - OK. I'll wait for the new performance numbers before committing. Andy, if you see anything funky in Matt's patch or if you managed to make your version faster, let us know, please. > MemoryCachedRangeFilter to boost performance of Range queries > - > > Key: LUCENE-855 > URL: https://issues.apache.org/jira/browse/LUCENE-855 > Project: Lucene - Java > Issue Type: Improvement > Components: Search > Affects Versions: 2.1 > Reporter: Andy Liu > [...] -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-853) Caching does not work when using RMI
[ https://issues.apache.org/jira/browse/LUCENE-853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic resolved LUCENE-853. - Resolution: Fixed Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Committed, thanks Matt. > Caching does not work when using RMI > > > Key: LUCENE-853 > URL: https://issues.apache.org/jira/browse/LUCENE-853 > Project: Lucene - Java > Issue Type: New Feature > Components: Search > Affects Versions: 2.1 > Environment: All > Reporter: Matt Ericson > Priority: Minor > Attachments: RemoteCachingWrapperFilter.patch, > RemoteCachingWrapperFilter.patch, RemoteCachingWrapperFilter.patch, > RemoteCachingWrapperFilter.patch .patch > > > Filters and caching use transient maps, so caching does not work if you > are using RMI and a remote searcher > I want to add a new RemoteCachedFilter that will make sure that the caching > is done on the remote searcher side > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Ericson updated LUCENE-855: Attachment: FieldCacheRangeFilter.patch This version will create a real BitSet() when cloned and will allow ChainedFilter to work correctly > MemoryCachedRangeFilter to boost performance of Range queries > - > > Key: LUCENE-855 > URL: https://issues.apache.org/jira/browse/LUCENE-855 > Project: Lucene - Java > Issue Type: Improvement > Components: Search > Affects Versions: 2.1 > Reporter: Andy Liu > [...] -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Progressive Query Relaxation
Not that I know of. One typically puts that in application logic and re-runs or offers to run alternative queries. No de-duping there, unless you do it in your app. I think one problem with the described approach and Lucene would be that Lucene's scores are not "absolute". Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share - Original Message From: J. Delgado <[EMAIL PROTECTED]> To: java-dev@lucene.apache.org; solr-dev@lucene.apache.org Sent: Monday, April 9, 2007 3:46:40 AM Subject: Progressive Query Relaxation Has anyone within the Lucene or Solr community attempted to code a progressive query relaxation technique similar to the one described here for Oracle Text? http://www.oracle.com/technology/products/text/htdocs/prog_relax.html Thanks, -- J.D. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Created: (LUCENE-858) link from Lucene web page to API docs
link from Lucene web page to API docs - Key: LUCENE-858 URL: https://issues.apache.org/jira/browse/LUCENE-858 Project: Lucene - Java Issue Type: Improvement Reporter: Daniel Naber Assigned To: Grant Ingersoll There should be a way to link from e.g. http://lucene.apache.org/java/docs/gettingstarted.html to the API docs, but not just to the start page with the frame set but to a specific page, e.g. this: http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/overview-summary.html#overview_description To make this work, a way to set a relative link is needed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: linking the API docs
Hi Daniel, Can you file this as an issue and assign it to me? Nigel and I are still working through a few things w/ Hudson and the docs. The gist of it is that the API and website will be put back on people.a.o. This will mean that a relative link like api/overview-summary.html#overview_description should be sufficient. Thanks, Grant On Apr 7, 2007, at 4:01 PM, Daniel Naber wrote: On Saturday 07 April 2007 00:42, Chris Hostetter wrote: : I think you can put in the link, just use relative link like in the : site.xml. using a relative link is *key* ... it ensures not only that the static files built by the nightly build work, but also that the docs distributed with each release contain good local pointers. I'm not familiar with forrest, could you help me set the link? The pages to be linked are these: http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/overview-summary.html#overview_description http://lucene.apache.org/java/2_1_0/api/overview-summary.html#overview_description (etc) Note that this is not the API docs page (which contains the frameset) but a content page plus an anchor. So I cannot use href="ext:javadocs"> but doesn't work either. Regards Daniel -- http://www.danielnaber.de - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Progressive Query Relaxation
Has anyone within the Lucene or Solr community attempted to code a progressive query relaxation technique similar to the one described here for Oracle Text? http://www.oracle.com/technology/products/text/htdocs/prog_relax.html Thanks, -- J.D. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]