Re: Numeric Range Restrictions: Queries vs Filters
Of course, not only did I manage to forget to include the attachment, but when I sent a reply with the code, mail.apache.org rejected it because it was a ZIP file. So let's see how mail.apache.or feels about 6 seperate text files. : Date: Mon, 22 Nov 2004 18:25:24 -0800 (PST) : Subject: Numeric Range Restrictions: Queries vs Filters : : (NOTE: numbers in [] indicate Footnotes) : : I'm rather new to Lucene (and this list), so if I'm grossly : misunderstanding things, forgive me. : : One of my main needs as I investigate Search technologies is to restrict : results based on Ranges of numeric values. Looking over the archives of : this list, it seems that lots of people have run into problems dealing : with this. In particular, whenever someone asks a question about "Numeric : Ranges" the question seem to always involve one (or more) of the : following: : :(a) Lexical sorting puts 11 in the range "1 TO 5" :(b) Dates (or Dates and Times) :(c) BooleanQuery$TooManyClauses Exceptions :(d) Should I use a filter? : : (a) is a solved problem as long as you use a formatter like : LongField.java[1] : : (b) is really nothing more then a special case of dealing with generic : numeric values. While there are certainly special purposes solutions that : sometimes apply to dealing with Date ranges, any good solution for dealing : with raw numeric ranges can be applied to Dates (and Times) : : (c) is a situation that seems to come up a lot because of the way : RangeQuery works. The rewrite method walks all of the Terms in the index : starting with "lowerTerm" and builds up BooleanQuery containing a separate : TermQuery for every Term found, until it reaches the upperTerm. This : causes a range search of "0001 TO 1000" to generate a BooleanQuery with N : clauses, where N is the quantity of unique values in the field which are : lexically greater then 0001 and lexically less then 1000. depending on : the nature of your data, this might be 0 BooleanClauses, or it might be : 1000 BooleanClauses; but the list is built before the search is ever even : executed. : : At first, this may seem really strange -- I know I was certainly confused : -- but there is a very good reason for it: Ultimately RangeQuery still : provides you with a meaningful score for each document, based on the : frequency (and quantity) of terms that document has in the range [2]. In : order to do that, it has to expand itself, but what if you don't care if : your Range restriction impacts the Score? [3] : : Which brings us to... : : (c) Filtering. Filters in general make a lot of sense to me. They are a : way to specify (at query time) that only a certain subset of the index : should be considered for results. The Filter class has a very straight : forward API that seems very easy to subclass to get the behavior I want. : The Query API on the other hand ... I freely admit, that I can't make : heads or tails out of it. I don't even know where I would begin to try : and write a new subclass of Query if I wanted to. : : I would think that most people who want to do a "numeric range : restriction" on their data, probably don't care about the Scoring benefits : of RangeQuery. Looking at the code base, the way DateFilter works seems : like it provides an ideal solution to any sort of Range restriction (not : just Dates) that *should* be more efficient then using RangeQuery when : dealing with an unbounded value set. (Both approaches need to iterate over : all of the terms in the specified field using TermEnum, but RangeQuery has : to build up an set of BooleanQuery objects for each matching term, and : then each of those queries have to help score the documents -- DateFilter : on the other hand only has to maintain a single BitSet of documents that : it finds as it iterates) : : But I was surprised then to see the following quote from "Erik Hatcher" in : the archives: : : "In fact, DateFilter by itself is practically of no use, I think." [4] : : ...Erik goes on to suggest that given "a set of canned date ranges", it : doesn't really matter if you use a RangeQuery or a DateFilter -- as long : as you cache them to reuse them (with something like CachingWrappingFilter : or QueryFilter). I'm hoping that he might elaborate on that comment? : : As a test, I wrote a "RangeFilter" which borrows heavily from DateFilter : to both convince myself it could work, and to do a comparison between it : and RangeQuery. [5] Based on my limited tests, using a Filter to restrict : to a Range is a lot faster then using RangeQuery -- independent of : caching. : : The attachment contains my RangeFilter, a unit test that demonstrates it, : and a Benchmarking unit test that does a side-by-side comparison with : RangeQuery [6]. If developers feel that this class is useful, then by all : means roll it into the code base. (90% of it is cut/pasted from : DateFilter/RangeQuery anyway) : : : Comments? ... Questions? ... Answers? : : : : Footnotes: : : [1] It
Numeric Range Restrictions: Queries vs Filters
(NOTE: numbers in [] indicate Footnotes) I'm rather new to Lucene (and this list), so if I'm grossly misunderstanding things, forgive me. One of my main needs as I investigate Search technologies is to restrict results based on Ranges of numeric values. Looking over the archives of this list, it seems that lots of people have run into problems dealing with this. In particular, whenever someone asks a question about "Numeric Ranges" the question seem to always involve one (or more) of the following: (a) Lexical sorting puts 11 in the range "1 TO 5" (b) Dates (or Dates and Times) (c) BooleanQuery$TooManyClauses Exceptions (d) Should I use a filter? (a) is a solved problem as long as you use a formatter like LongField.java[1] (b) is really nothing more then a special case of dealing with generic numeric values. While there are certainly special purposes solutions that sometimes apply to dealing with Date ranges, any good solution for dealing with raw numeric ranges can be applied to Dates (and Times) (c) is a situation that seems to come up a lot because of the way RangeQuery works. The rewrite method walks all of the Terms in the index starting with "lowerTerm" and builds up BooleanQuery containing a separate TermQuery for every Term found, until it reaches the upperTerm. This causes a range search of "0001 TO 1000" to generate a BooleanQuery with N clauses, where N is the quantity of unique values in the field which are lexically greater then 0001 and lexically less then 1000. depending on the nature of your data, this might be 0 BooleanClauses, or it might be 1000 BooleanClauses; but the list is built before the search is ever even executed. At first, this may seem really strange -- I know I was certainly confused -- but there is a very good reason for it: Ultimately RangeQuery still provides you with a meaningful score for each document, based on the frequency (and quantity) of terms that document has in the range [2]. In order to do that, it has to expand itself, but what if you don't care if your Range restriction impacts the Score? [3] Which brings us to... (c) Filtering. Filters in general make a lot of sense to me. They are a way to specify (at query time) that only a certain subset of the index should be considered for results. The Filter class has a very straight forward API that seems very easy to subclass to get the behavior I want. The Query API on the other hand ... I freely admit, that I can't make heads or tails out of it. I don't even know where I would begin to try and write a new subclass of Query if I wanted to. I would think that most people who want to do a "numeric range restriction" on their data, probably don't care about the Scoring benefits of RangeQuery. Looking at the code base, the way DateFilter works seems like it provides an ideal solution to any sort of Range restriction (not just Dates) that *should* be more efficient then using RangeQuery when dealing with an unbounded value set. (Both approaches need to iterate over all of the terms in the specified field using TermEnum, but RangeQuery has to build up an set of BooleanQuery objects for each matching term, and then each of those queries have to help score the documents -- DateFilter on the other hand only has to maintain a single BitSet of documents that it finds as it iterates) But I was surprised then to see the following quote from "Erik Hatcher" in the archives: "In fact, DateFilter by itself is practically of no use, I think." [4] ...Erik goes on to suggest that given "a set of canned date ranges", it doesn't really matter if you use a RangeQuery or a DateFilter -- as long as you cache them to reuse them (with something like CachingWrappingFilter or QueryFilter). I'm hoping that he might elaborate on that comment? As a test, I wrote a "RangeFilter" which borrows heavily from DateFilter to both convince myself it could work, and to do a comparison between it and RangeQuery. [5] Based on my limited tests, using a Filter to restrict to a Range is a lot faster then using RangeQuery -- independent of caching. The attachment contains my RangeFilter, a unit test that demonstrates it, and a Benchmarking unit test that does a side-by-side comparison with RangeQuery [6]. If developers feel that this class is useful, then by all means roll it into the code base. (90% of it is cut/pasted from DateFilter/RangeQuery anyway) Comments? ... Questions? ... Answers? Footnotes: [1] It seems to me this class is extremely useful, does anyone know if there's a particular reason it hasn't been added to the main Lucene codebase? http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg04790.html [2] Take a look at RangeQueryScoreDemo.java in the attachment, which produces output something like this... Range Search for: 'apple' TO 'dog' 0.40924072 ... bed dog emu 0.38014847 ... DOG 0.2825246 ... cat 0.17657787 ... apple emu 0.12
Re: downloading Lucene 1.4.2
In the same Lucene News section where the announcement about 1.4.2 is listed, there is a link that says "Binary and source distributions are available here." ... http://cvs.apache.org/dist/jakarta/lucene/v1.4.2/ I got really confused yesterday after I already had the binary version and i was looking for the source and found the link you listed. Does anyone know how to go about getting http://www.apache.org/dist/ updated? : According to the Lucene homepage, Lucene 1.4.2 was released : on October 1, 2004 : : However, the "dist" on www.apache.org does not have a copy of : Lucene 1.4.2 : :http://www.apache.org/dist/jakarta/lucene/binaries/ : : Where can I download Lucene 1.4.2? -- --- "Oh, you're a tricky one."Chris M Hostetter -- Trisha Weir[EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]