Hits.doc(x) and range queries
Hi guys!

I've posted previously that Hits.doc(x) was taking a long time. It turns out this has to do with a date range in our query. We usually write date ranges like this:

Date:[(lucene date field) - (lucene date field)]

Sometimes the begin date is 0, which is what we get from DateField.dateToString(new Date(0)). That is when getting our search results from the Hits object takes an absurd amount of time. It usually happens each time the Hits object attempts to get more results from the IndexSearcher (i.e., every 100?). It also takes up more memory.

I was wondering why it affects the search so much even though we're only returning 350 or so results. Does the QueryParser do something similar to the DateFilter on range queries? Would it be better to use a DateFilter?

We're using Lucene 1.2 (with plans to upgrade). Do newer versions of Lucene have this problem?

Roy.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: Open-ended range queries
On Jun 10, 2004, at 10:37 PM, Terry Steichen wrote:
> Speaking for myself, only a small number of my code modules currently treat null as the open-ended range query term parameter. If the syntax change from 'null' --> '*' was deemed otherwise desirable and the syntax transition made very clearly, I could personally adjust to it without too much difficulty.
>
> I agree that the proposed '*' syntax does seem more logical. If a change to that syntax were made such that the old null syntax for the upper bound was retained for backward compatibility, such a transition would be completely painless.

Just to clarify, since Terry's response implies this is not understood: there is *nothing* special about null currently. It is simply being treated as term text. So adding special * handling would NOT change how null currently works. From what I see in the diff, null and NULL (and nULL, Null, etc.) were removed as being special in June of 2002 (!).

Furthermore, to achieve the proposed * handling, you can do this yourself now by subclassing QueryParser and overriding getRangeQuery:

    protected Query getRangeQuery(String field, Analyzer analyzer,
                                  String part1, String part2,
                                  boolean inclusive) throws ParseException {
      // A "*" on either side becomes a null Term, i.e. an open end.
      return new RangeQuery(
          "*".equals(part1) ? null : new Term(field, part1),
          "*".equals(part2) ? null : new Term(field, part2),
          inclusive);
    }

(A little more is needed if you want to keep the date range handling.) Note that you cannot do field:[* TO *] to make it wide open; RangeQuery does not allow this.

My proposal is this (_after_ 1.4 goes final):

- Add the above logic to QueryParser.
- Modify RangeQuery.toString to output the * when the term is null, and also if the start term is "" (RangeQuery's constructor modifies the beginning term to "" if it is null).

If there are no objections to this plan, I'll add this as a Bugzilla issue as a reminder. I don't want to touch 1.4's codebase; there is no point in adding a feature at this stage that can already be achieved with the simple code above.

	Erik
Re: Another way to handle large numeric range queries
On Jun 9, 2004, at 1:05 PM, Don Gilbert wrote:
>> I'm particularly interested in the XPath stuff I saw in LGQueryParser.
>
> * xpathFieldParse 'xpath' parser: param allfields[], with query or field[] possibly having wild-card notation: *.start annotation.*.text allowing '/' and '.' field separator
>
> This is an *unfinished* attempt to support xpath style queries with wild-cards or parts when you have indexed XML data, such as query: /annotation/*/text:term
>
> I had to put this aside when I saw the problem of pulling the xpath fields from a query string would take a fair amount of thought and code.

Keep us posted if/when you return to this. I would love to see XPath query capability on hierarchical content indexed in Lucene. Jakarta Slide, for example, is starting to build in Lucene indexing capability, although I don't think they are doing anything with XPath query expressions yet.

Anyone doing anything similar? Implementation ideas?

	Erik
Re: Open-ended range queries
At one point it definitely supported null for either term. I think that has been removed/forgotten in the later revisions of the QueryParser...

Scott

On Jun 10, 2004, at 1:24 PM, Erik Hatcher wrote:
> On Jun 10, 2004, at 2:13 PM, Terry Steichen wrote:
>> Actually, QueryParser does support open-ended ranges like: [term TO null]. Doesn't work for the lower end of the range (though that's usually less of a problem).
>
> It supports null? Are you sure? If so, I'm very confused about it because I don't see where in the grammar it has any special handling like that. Could you show an example that demonstrates this?
>
> 	Erik
Re: Open-ended range queries
Well, I'm using 1.4 RC3 and the null range upper limit works just fine for searches in two of my fields; one is in the form of a canonical date (e.g., 20040610) and the other is in the form of a padded word count (e.g., 01500 for 1500). The syntax would be pub_date:[20040501 TO null] (dates later than April 30, 2004) and s_words:[01000 TO null] (articles with 1000 or more words).

Regards,

Terry

PS: This use of null has worked this way since at least 1.2. As I recall, way back when, null also worked as the first term limit (but no longer does).
Re: Open-ended range queries
On Jun 10, 2004, at 4:07 PM, Terry Steichen wrote:
> Well, I'm using 1.4 RC3 and the null range upper limit works just fine for searches in two of my fields; one is in the form of a canonical date (e.g., 20040610) and the other is in the form of a padded word count (e.g., 01500 for 1500). The syntax would be pub_date:[20040501 TO null] (dates later than April 30, 2004) and s_words:[01000 TO null] (articles with 1000 or more words).

Ah.... It works for you because you have numeric values, and lexically null is greater than any of them. It is still being used as a lexical term value, and it does not truly make the end open-ended. This is why null doesn't work at the beginning for you either. It's just being treated as text, just like your numbers are.

> PS: This use of null has worked this way since at least 1.2. As I recall, way back when, null also worked as the first term limit (but no longer does).

If so, then something serious broke. I don't have the time to check the cvs logs on this, but I cannot imagine that we removed something like this. If anyone cares to dig up the diff where we removed/broke this, I'd be grateful.

	Erik
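The lexical-ordering point can be checked without Lucene at all. The following is a minimal sketch, assuming that plain `String.compareTo` is a fair stand-in for how range bounds are compared as term text; the class and method names are made up for illustration.

```java
final class LexicalBoundDemo {
    /** True if value falls in [lower, upper] under plain String ordering. */
    static boolean inRange(String value, String lower, String upper) {
        return value.compareTo(lower) >= 0 && value.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        // Digits sort before lowercase letters, so "null" behaves like a very high upper bound.
        System.out.println(inRange("20040610", "20040501", "null")); // true
        // As a lower bound, "null" sorts after every digit-only value, so nothing numeric matches.
        System.out.println(inRange("20040610", "null", "20041231")); // false
        // A term that sorts after "null" is excluded, so the range is not truly open-ended.
        System.out.println(inRange("zebra", "apple", "null"));       // false
    }
}
```

The third case is the one that matters for text fields: any term sorting after "null" silently drops out of the results.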
Re: Open-ended range queries
It looks to me like Revision 1.18 broke it.
Re: Open-ended range queries
Well, I do like the *, but apparently there are some people that are using this with the null...

Scott

On Jun 10, 2004, at 7:15 PM, Erik Hatcher wrote:
> On Jun 10, 2004, at 4:54 PM, Scott ganyo wrote:
>> It looks to me like Revision 1.18 broke it.
>
> It seems this could be it:
>
> revision 1.18
> date: 2002/06/25 00:05:31; author: briangoetz; state: Exp; lines: +62 -33
> Support for new range query syntax. The delimiter is " TO ", but is optional for backward compatibility with previous syntax. If the range arguments match the format supported by DateFormat.getDateInstance(DateFormat.SHORT), then they will be converted into the appropriate date strings a la DateField.
> Added Field.Keyword constructor for Date-valued arguments.
> Optimized DateField.timeToString function.
>
> But geez.... June 2002 and no one has complained since?
>
> Given that this is so outdated, I'm not sure what the right course of action is. There are lots more Lucene users now than there were then. Would adding NULL back be what folks want? What about simply an asterisk to denote open ended-ness? [* TO term] or [term TO *]
>
> For completeness, here is the diff:
>
> % cvs diff -u -r 1.17 -r 1.18 QueryParser.jj
> Index: QueryParser.jj
> ===================================================================
> RCS file: /home/cvs/jakarta-lucene/src/java/org/apache/lucene/queryParser/QueryParser.jj,v
> retrieving revision 1.17
> retrieving revision 1.18
> diff -u -r1.17 -r1.18
> --- QueryParser.jj 20 May 2002 15:45:43 -0000 1.17
> +++ QueryParser.jj 25 Jun 2002 00:05:31 -0000 1.18
> @@ -65,8 +65,11 @@
>  import java.util.Vector;
>  import java.io.*;
> +import java.text.*;
> +import java.util.*;
>  import org.apache.lucene.index.Term;
>  import org.apache.lucene.analysis.*;
> +import org.apache.lucene.document.*;
>  import org.apache.lucene.search.*;
>  /**
> @@ -218,35 +221,30 @@
>    private Query getRangeQuery(String field,
>                                Analyzer analyzer,
> -                              String queryText,
> +                              String part1,
> +                              String part2,
>                                boolean inclusive)
>    {
> -    // Use the analyzer to get all the tokens. There should be 1 or 2.
> -    TokenStream source = analyzer.tokenStream(field,
> -                                              new StringReader(queryText));
> -    Term[] terms = new Term[2];
> -    org.apache.lucene.analysis.Token t;
> +    boolean isDate = false, isNumber = false;
> -    for (int i = 0; i < 2; i++)
> -    {
> -      try
> -      {
> -        t = source.next();
> -      }
> -      catch (IOException e)
> -      {
> -        t = null;
> -      }
> -      if (t != null)
> -      {
> -        String text = t.termText();
> -        if (!text.equalsIgnoreCase("NULL"))
> -        {
> -          terms[i] = new Term(field, text);
> -        }
> -      }
> +    try {
> +      DateFormat df = DateFormat.getDateInstance(DateFormat.SHORT);
> +      df.setLenient(true);
> +      Date d1 = df.parse(part1);
> +      Date d2 = df.parse(part2);
> +      part1 = DateField.dateToString(d1);
> +      part2 = DateField.dateToString(d2);
> +      isDate = true;
>      }
> -    return new RangeQuery(terms[0], terms[1], inclusive);
> +    catch (Exception e) { }
> +
> +    if (!isDate) {
> +      // @@@ Add number support
> +    }
> +
> +    return new RangeQuery(new Term(field, part1),
> +                          new Term(field, part2),
> +                          inclusive);
>    }
>    public static void main(String[] args) throws Exception {
> @@ -282,7 +280,7 @@
>  | <#_WHITESPACE: ( " " | "\t" ) >
>  }
> -<DEFAULT> SKIP : {
> +<DEFAULT, RangeIn, RangeEx> SKIP : {
>    <<_WHITESPACE>>
>  }
> @@ -303,14 +301,28 @@
>  | <PREFIXTERM: <_TERM_START_CHAR> (<_TERM_CHAR>)* "*" >
>  | <WILDTERM: <_TERM_START_CHAR>
>               (<_TERM_CHAR> | ( [ "*", "?" ] ))* >
> -| <RANGEIN: "[" ( ~[ "]" ] )+ "]">
> -| <RANGEEX: "{" ( ~[ "}" ] )+ "}">
> +| <RANGEIN_START: "[" > : RangeIn
> +| <RANGEEX_START: "{" > : RangeEx
>  }
>  <Boost> TOKEN : {
>  <NUMBER: (<_NUM_CHAR>)+ ( "." (<_NUM_CHAR>)+ )? > : DEFAULT
>  }
> +<RangeIn> TOKEN : {
> +<RANGEIN_TO: "TO">
> +| <RANGEIN_END: "]"> : DEFAULT
> +| <RANGEIN_QUOTED: "\"" (~["\""])+ "\"">
> +| <RANGEIN_GOOP: (~[ " ", "]" ])+ >
> +}
> +
> +<RangeEx> TOKEN : {
> +<RANGEEX_TO: "TO">
> +| <RANGEEX_END: "}"> : DEFAULT
> +| <RANGEEX_QUOTED: "\"" (~["\""])+ "\"">
> +| <RANGEEX_GOOP: (~[ " ", "}" ])+ >
> +}
> +
>  // * Query ::= ( Clause )*
>  // * Clause ::= ["+", "-"] [<TERM> ":"] ( <TERM> | "(" Query ")" )
> @@ -387,7 +399,7 @@
>  Query Term(String field) : {
> -  Token term, boost=null, slop=null;
> +  Token term, boost=null, slop=null, goop1, goop2;
>    boolean prefix = false;
>    boolean wildcard = false;
>    boolean fuzzy = false;
> @@ -415,12 +427,29 @@
>         else
>           q = getFieldQuery(field, analyzer, term.image);
>       }
> -     | ( term=<RANGEIN> { rangein=true; } | term=<RANGEEX> )
> +     | ( <RANGEIN_START> ( goop1=<RANGEIN_GOOP>|goop1=<RANGEIN_QUOTED> )
> +         [ <RANGEIN_TO> ] ( goop2=<RANGEIN_GOOP>|goop2=<RANGEIN_QUOTED> )
> +         <RANGEIN_END> )
> +       [ <CARAT> boost=<NUMBER> ]
> +        {
> +          if (goop1.kind == RANGEIN_QUOTED)
> +            goop1.image = goop1.image.substring(1,
Re: Open-ended range queries
Speaking for myself, only a small number of my code modules currently treat null as the open-ended range query term parameter. If the syntax change from 'null' --> '*' was deemed otherwise desirable and the syntax transition made very clearly, I could personally adjust to it without too much difficulty.

I agree that the proposed '*' syntax does seem more logical. If a change to that syntax were made such that the old null syntax for the upper bound was retained for backward compatibility, such a transition would be completely painless.

Regards,

Terry
Re: Another way to handle large numeric range queries
On Jun 8, 2004, at 10:55 PM, Don Gilbert wrote:
> Find this as part of the 'LuceGene' package for searching genome and bioinformatics databases at http://www.gmod.org/lucegene/ with lucene related source code in cvs here:

Nice stuff!

> http://cvs.sourceforge.net/viewcvs.py/gmod/lucegene/src/org/eugenes/index/
> LGQueryParser.java -- extension of QueryParser for NumRangeQuery (& other)

I'm particularly interested in the XPath stuff I saw in LGQueryParser. Could you tell us more about what you do with that and how it works?

> BioDataAnalyzer.java -- NumberField formats field for indexing

*whew* - that is one complex piece of code. I like the DebugFilter idea.

	Erik
Re: Another way to handle large numeric range queries
Erik,

Thanks for the comments.

> I'm particularly interested in the XPath stuff I saw in LGQueryParser.

* xpathFieldParse 'xpath' parser: param allfields[], with query or field[] possibly having wild-card notation: *.start annotation.*.text allowing '/' and '.' field separator

This is an *unfinished* attempt to support xpath style queries with wild-cards or parts when you have indexed XML data, such as query: /annotation/*/text:term

I had to put this aside when I saw the problem of pulling the xpath fields from a query string would take a fair amount of thought and code.

> BioDataAnalyzer.java -- NumberField formats field for indexing
> *whew* - that is one complex piece of code. I like the DebugFilter

Mostly it is just a collection of small 10 line classes, packaged as inner classes (I hate java's insistence on 1 file/class :) Some of the complexity there is because the standard lucene analyzer won't work for biology data (which uses a lot of symbols, upper/lowercase, etc.) and this code allows one to build an analyzer/indexer which is tuned to different types in each field of data.

The configuration for a given biology database parsing includes statements like:

  ## field tokenizers - base CharTokenizer, work before Filters
  tokenizer.SYM=org.eugenes.index.BiodataAnalyzer$DataTokenizer
  ## field filters - base TokenFilter, only are used if fieldtype=Text or UnStored
  tokenfilter.BLOC.start=org.eugenes.index.BiodataAnalyzer$NumberFilter
  ## fieldrecoder classes manipulate data before indexing, maybe making new fields
  fieldrecoder.BLOC=LucegeneIndexers$Location_FieldRecoder

This method then generates a TokenStream using such field-specific parsers:

  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = null;
    // Field-specific tokenizer, falling back to the standard one.
    try {
      result = getTokenizer(fieldName, reader);
    } catch (Exception e) {
      result = new org.apache.lucene.analysis.standard.StandardTokenizer(reader);
    }
    // Field-specific filter, falling back to a lowercasing filter.
    try {
      result = getFilter(fieldName, result);
    } catch (Exception e) {
      LowerDataFilter ldf = new LowerDataFilter();
      ldf.setInput(result);
      result = ldf;
    }
    return result;
  }

-- Don Gilbert
-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- [EMAIL PROTECTED] -- http://marmot.bio.indiana.edu/
Re: Another way to handle large numeric range queries
first, Term last, boolean inc);
-- full numeric (integer) range query, can handle large ranges.
-- makes a BitSet of documents within range once, and feeds back to Searcher thru score(HitCollector c, int end) as often as called.
-- query semantics are same as for RangeQuery
-- implicit assumptions are
   -- first, last Term have integer values, as does indexed field
   -- indexed field is recoded for alphanumeric sorting; e.g. 2 -> 02, 10 -> 10, -3 -> -03

Find this as part of the 'LuceGene' package for searching genome and bioinformatics databases at http://www.gmod.org/lucegene/ with lucene related source code in cvs here:
http://cvs.sourceforge.net/viewcvs.py/gmod/lucegene/src/org/eugenes/index/

NumRangeQuery.java -- range searches of integer fields.
LGQueryParser.java -- extension of QueryParser for NumRangeQuery (& other)
BioDataAnalyzer.java -- NumberField formats field for indexing

-- Don Gilbert

Date: Tue, 18 May 2004 13:35:55 -0700
From: Andy Goodell [EMAIL PROTECTED]
Subject: How to handle range queries over large ranges and avoid Too Many Boolean cla

> In our application we had a similar problem with non-date ranges until we realized that it wasn't so much that we were searching for the values in the range as restricting the search to that range, and then we used an extension to the org.apache.lucene.search.Filter class, and our implementation got much simpler and faster.

-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- [EMAIL PROTECTED] -- http://marmot.bio.indiana.edu/
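The recoding step mentioned above (2 -> 02) exists because terms are compared as strings. A rough sketch of the idea, assuming non-negative values and a fixed width chosen up front; the helper name `pad` is hypothetical and this is not the LuceGene NumberField code. Negative values need their own encoding and are simply rejected here.

```java
import java.util.Arrays;

final class PaddedNumberDemo {
    /** Left-pads a non-negative value with zeros to a fixed width. */
    static String pad(int value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values need a separate encoding");
        }
        String s = Integer.toString(value);
        if (s.length() > width) {
            throw new IllegalArgumentException("value is wider than " + width + " digits");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = s.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(s).toString();
    }

    public static void main(String[] args) {
        String[] raw = {"10", "2", "1500"};
        String[] padded = {pad(10, 5), pad(2, 5), pad(1500, 5)};
        Arrays.sort(raw);
        Arrays.sort(padded);
        System.out.println(Arrays.toString(raw));    // [10, 1500, 2]
        System.out.println(Arrays.toString(padded)); // [00002, 00010, 01500]
    }
}
```

Unpadded, "2" sorts after "1500"; padded, string order and numeric order agree, which is what a term range needs.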
Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses
Claude Devarenne writes:
> Hi,
>
> I have over 60,000 documents in my index which is slightly over 1 GB in size. The documents range from the late seventies up to now. I have indexed dates as a keyword field using a string because the dates are in YYYYMMDD format. When I do range queries things are OK as long as I don't exceed the built-in number of boolean clauses, so that's a range of 3 years, e.g. 1979 to 1981. The users are not only doing complex queries but also want to query over long ranges, e.g. [19790101 TO 19991231].
>
> Given these requirements, I am thinking of doing a query without the date range, bring the unique ids back from the hits and then do a date query in the SQL database I have that contains the same data. Another alternative is to do the query without the date range in Lucene and then sort the results within the range. I still have to learn how to use the new sorting code and confess I have not had time to look at it yet.
>
> Is there a simpler, easier way to do this?

I think it would be worth taking a look at the sorting code.

The idea of the sorting code is to have an array of the dates for each doc in memory and access this array for sorting. Now sorting isn't the only thing one might use this array for. Doing a range check is another. So you might extend the sorting code by a range selection. There is no code for this in lucene and you have to create your own searcher, but it gives you a fast way to search and sort by date.

I did this independently from the new sorting code (I just started a little too early) and it works quite well. The only drawback of this (and the new sorting code) is that it requires an array of field values that must be rebuilt each time the index changes. Shouldn't be a problem for 60000 documents.

Morus
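As a rough illustration of the range check over an in-memory array, here is a sketch that assumes the dates have already been loaded into an `int[]` indexed by document number. The class name and data are hypothetical; loading the array from the index and wiring it into a searcher are left out.

```java
import java.util.BitSet;

final class FieldArrayRangeDemo {
    /** Marks every document whose cached value lies in [lower, upper]. */
    static BitSet docsInRange(int[] dateByDoc, int lower, int upper) {
        BitSet bits = new BitSet(dateByDoc.length);
        for (int doc = 0; doc < dateByDoc.length; doc++) {
            if (dateByDoc[doc] >= lower && dateByDoc[doc] <= upper) {
                bits.set(doc);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        // Index = document number, value = date as YYYYMMDD (made-up data).
        int[] dateByDoc = {19781105, 19790101, 19850615, 19991231, 20040518};
        System.out.println(docsInRange(dateByDoc, 19790101, 19991231)); // {1, 2, 3}
    }
}
```

The cost is one pass over the array per range, independent of how many distinct dates fall inside it, which is why no clause limit applies.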
Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses
Thanks, I will look at the sorting code. Sorting results by date is next on the list.

For now, I only have a small number of documents, but the set is to grow to over 8 million documents for the collection I am working on. Another collection we have is 40 million documents or so. From what you say, it seems to me that sorting will not scale when I get to a larger number of documents. I am considering using an SQL back end to implement sorting: bring back the unique IDs from lucene and then sort in SQL.

Claude
How to handle range queries over large ranges and avoid Too Many Boolean clauses
Hi,

I have over 60,000 documents in my index which is slightly over 1 GB in size. The documents range from the late seventies up to now. I have indexed dates as a keyword field using a string because the dates are in YYYYMMDD format. When I do range queries things are OK as long as I don't exceed the built-in number of boolean clauses, so that's a range of 3 years, e.g. 1979 to 1981. The users are not only doing complex queries but also want to query over long ranges, e.g. [19790101 TO 19991231].

Given these requirements, I am thinking of doing a query without the date range, bring the unique ids back from the hits and then do a date query in the SQL database I have that contains the same data. Another alternative is to do the query without the date range in Lucene and then sort the results within the range. I still have to learn how to use the new sorting code and confess I have not had time to look at it yet.

Is there a simpler, easier way to do this?

Claude
Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses
On Tuesday 18 May 2004 19:38, Claude Devarenne wrote:
> Hi,
>
> I have over 60,000 documents in my index which is slightly over 1 GB in size. The documents range from the late seventies up to now. I have indexed dates as a keyword field using a string because the dates are in YYYYMMDD format. When I do range queries things are OK as long as I don't exceed the built-in number of boolean clauses, so that's a range of 3 years, e.g. 1979 to 1981. The users are not only doing complex queries but also want to query over long ranges, e.g. [19790101 TO 19991231].
>
> Given these requirements, I am thinking of doing a query without the date range, bring the unique ids back from the hits and then do a date query in the SQL database I have that contains the same data. Another alternative is to do the query without the date range in Lucene and then sort the results within the range. I still have to learn how to use the new sorting code and confess I have not had time to look at it yet.
>
> Is there a simpler, easier way to do this?

I wouldn't know of a simpler and easier way, but there is another way to reduce the number of clauses involved in long date ranges. This can be done by indexing not only YYYYMMDD but also YYYYMM and YYYY, and adapting the query range mechanism to use the shorter term whenever possible. (YYY and YYYYMMD might also be useful.)

Kind regards,

Ype
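One way to picture "use the shorter term whenever possible" is a greedy walk over the date range. The sketch below is only an illustration under the assumption that every document is indexed with its YYYYMMDD, YYYYMM and YYYY terms; the class and method names are hypothetical, and it uses `java.time`, which postdates this thread.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

final class DatePrefixTerms {
    /** Covers [start, end] with whole-year, whole-month and single-day terms. */
    static List<String> cover(LocalDate start, LocalDate end) {
        List<String> terms = new ArrayList<>();
        LocalDate d = start;
        while (!d.isAfter(end)) {
            LocalDate yearEnd = d.withDayOfYear(d.lengthOfYear());
            LocalDate monthEnd = d.withDayOfMonth(d.lengthOfMonth());
            if (d.getDayOfYear() == 1 && !yearEnd.isAfter(end)) {
                // The whole year fits: one YYYY term.
                terms.add(String.format("%04d", d.getYear()));
                d = yearEnd.plusDays(1);
            } else if (d.getDayOfMonth() == 1 && !monthEnd.isAfter(end)) {
                // The whole month fits: one YYYYMM term.
                terms.add(String.format("%04d%02d", d.getYear(), d.getMonthValue()));
                d = monthEnd.plusDays(1);
            } else {
                // Otherwise fall back to a single YYYYMMDD term.
                terms.add(String.format("%04d%02d%02d",
                        d.getYear(), d.getMonthValue(), d.getDayOfMonth()));
                d = d.plusDays(1);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // [19790101 TO 19991231] collapses to one term per year.
        System.out.println(cover(LocalDate.of(1979, 1, 1), LocalDate.of(1999, 12, 31)).size()); // 21
        System.out.println(cover(LocalDate.of(2003, 12, 30), LocalDate.of(2004, 2, 1)));
        // [20031230, 20031231, 200401, 20040201]
    }
}
```

For the range in the question this yields 21 clauses instead of one per distinct day, at the price of three terms per document at index time.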
Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses
Thanks, I'll try that. It would be nice too if I could extend Field (it is a final class) and create a numerical field. Is that not desirable?

Claude
Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses
In our application we had a similar problem with non-date ranges until we realized that it wasn't so much that we were searching for the values in the range as restricting the search to that range, and then we used an extension to the org.apache.lucene.search.Filter class, and our implementation got much simpler and faster.

- andy g
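The distinction between searching for values and restricting a search to a range can be sketched without Lucene. The example assumes a sorted map from term text to document numbers as a stand-in for the index's term dictionary; `rangeBits` and the data are hypothetical, and a real Filter subclass would read terms from the index instead.

```java
import java.util.BitSet;
import java.util.TreeMap;

final class RangeFilterDemo {
    /** Builds a bit set of documents that have any term in [lower, upper]. */
    static BitSet rangeBits(TreeMap<String, int[]> postings,
                            String lower, String upper, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        // subMap walks only the terms inside the range, in sorted order.
        for (int[] docs : postings.subMap(lower, true, upper, true).values()) {
            for (int doc : docs) {
                bits.set(doc);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        TreeMap<String, int[]> postings = new TreeMap<>();
        postings.put("19781105", new int[] {0});
        postings.put("19790101", new int[] {1, 4});
        postings.put("19850615", new int[] {2});
        postings.put("20040518", new int[] {3});

        BitSet allowed = rangeBits(postings, "19790101", "19991231", 5);

        // Pretend the text query matched documents 0, 2, 3 and 4.
        BitSet queryHits = new BitSet(5);
        queryHits.set(0);
        queryHits.set(2);
        queryHits.set(3);
        queryHits.set(4);

        queryHits.and(allowed); // restrict the hits to the date range
        System.out.println(queryHits); // {2, 4}
    }
}
```

Because the range never becomes query clauses, the number of distinct terms inside it does not run into the boolean clause limit, and the range does not contribute to scoring.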
Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses
Is there a simpler, easier way to do this?

Yes. I have started implementing a QuickRangeQuery class that doesn't have the BooleanQuery limitation, but scores every matching document as 1.0. I will see if I can get it finished in the next 24 hours, and post back to this thread.

=Matt

PS: I'm not sure about the QuickRangeQuery class name... maybe NormalizedRangeQuery, RangeQuery2... *shrug*

Claude Devarenne wrote:

Hi, I have over 60,000 documents in my index, which is slightly over 1 GB in size. The documents range from the late seventies up to now. I have indexed dates as a keyword field using a string because the dates are in YYYYMMDD format. When I do range queries things are OK as long as I don't exceed the built-in number of boolean clauses, so that's a range of 3 years, e.g. 1979 to 1981. The users are not only doing complex queries but also want to query over long ranges, e.g. [19790101 TO 19991231]. Given these requirements, I am thinking of doing a query without the date range, bringing the unique ids back from the hits and then doing a date query in the SQL database I have that contains the same data. Another alternative is to do the query without the date range in Lucene and then sort the results within the range. I still have to learn how to use the new sorting code and confess I have not had time to look at it yet. Is there a simpler, easier way to do this?

Claude
Re: efficient refinement, order by and range queries
Geoffrey, You've done quite a thorough analysis of Lucene. I'll reply below with a few tidbits of Lucene trivia in hopes that will help.

On Dec 22, 2003, at 3:15 PM, Geoffrey Peddle wrote:

One of our applications is a catalog search application. In this application our documents are catalog items. Each item has a number of fields/attributes associated with it. For example Supplier, Part number, Price, Description. We use a search metaphor where end-users iterate issuing queries and getting feedback about what's available. So initially we may tell them that 600,000 items are available from 95 suppliers, and who those suppliers are. They may choose to do a free text search for the phrase blue pen. The result of that query may be to tell them that there are 240 items available from 2 suppliers which match that phrase, and who those suppliers are. They may pick one of the suppliers to see the list of blue pens available from that supplier.

To accomplish search within search, or search refinement, using a QueryFilter will do very nicely.

In addition to wanting the set of attribute values found in the result documents we would also want to return counts of the number of documents each attribute value occurs in within the result document set.

Again, I think a QueryFilter can work well. There are surely several ways to go about getting the number of documents in each bucket - perhaps additional queries should be made to give you those numbers, or perhaps walking the returned documents to get the unique values. Walking the documents could be expensive performance-wise, whereas doing some sub-queries would be quite fast.

Efficient range queries. There are certain attributes in our catalog item which we want to support range queries on. For example price. While it would be unusual for this to be the primary search criteria (people rarely say show me anything that costs greater than $10 in our application) it's important to have some support for this. The trick here is that the criteria may be very open ended. For example all items with price greater than $10 might involve tens of thousands of prices.

One suggestion I've seen posted is during indexing to use an additional field as a group.
In this case, it would be a price range group. Say A means $0 - $10, B means $10 - $100, and C means $100+, for example. Then you would only have a few terms in that field and a query would be quite fast. The drawback is that you need to know at index-time what the groups are. A custom range Filter is another option - it could be created at runtime, kept around, and only recreated when the index is modified. Look at the built-in DateFilter for an example to work with. This is a more pleasant option than doing a RangeQuery when the number of terms in the range is large.

Order by attributes. We need the ability to order the document results set by a pre-defined set of numeric attributes and would like the ability to order on alphabetic attributes as well.

This is an area where Lucene falls short. My best suggestion is to do the sorting yourself, which would require getting at all the documents in Hits, which for a large collection would be unreasonable. There are tricks that can be played with boosting during indexing where you can tier the boosts of a field in order - but this is really only a hint to the scorer to factor the order into the equation, and there are many other factors. I'm afraid there is no easy solution here that I'm aware of.

I have resources for code development and consider it to be in Ariba's best interest to contribute any code that we write in this area with the entire community. Our time frame is to develop a prototype in the next couple of months for proof of concept and benchmarking.

Excellent! We hope that we can get Lucene under the covers of your products - please continue to post to us with more questions and hopefully eventually code improvements!

Erik
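A small sketch of the index-time grouping Erik describes. The thresholds follow his example; the half-open boundaries (exactly $10 falls into B, exactly $100 into C) and the use of whole cents are assumptions made here, and the class name is hypothetical.

```java
// Hypothetical helper: maps a price to the group term stored in an extra field.
final class PriceGroup {

    /** Returns "A" for $0 up to (not including) $10, "B" up to $100, "C" from $100. */
    static String groupFor(long priceInCents) {
        if (priceInCents < 0) {
            throw new IllegalArgumentException("negative price: " + priceInCents);
        }
        if (priceInCents < 10_00) {
            return "A";
        }
        if (priceInCents < 100_00) {
            return "B";
        }
        return "C";
    }

    public static void main(String[] args) {
        // "Greater than $10" then needs only the terms B and C, plus a
        // post-check on the exact price if the boundary matters.
        System.out.println(groupFor(9_99) + " " + groupFor(10_00) + " " + groupFor(250_00));
    }
}
```

An open-ended query such as "price greater than $10" would then touch at most two terms rather than tens of thousands of distinct prices.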
efficient refinement, order by and range queries
I'm attempting to make Lucene the document search solution for the Ariba application suite. I have identified functionality gaps in the areas of refinement, order by and range queries. Below I attempt to describe our requirements and begin a discussion on how to solve them.

Refinement/attribute value collection. One of our applications is a catalog search application. In this application our documents are catalog items. Each item has a number of fields/attributes associated with it. For example Supplier, Part number, Price, Description. We use a search metaphor where end-users iterate issuing queries and getting feedback about what's available. So initially we may tell them that 600,000 items are available from 95 suppliers, and who those suppliers are. They may choose to do a free text search for the phrase blue pen. The result of that query may be to tell them that there are 240 items available from 2 suppliers which match that phrase, and who those suppliers are. They may pick one of the suppliers to see the list of blue pens available from that supplier. In addition to wanting the set of attribute values found in the result documents we would also want to return counts of the number of documents each attribute value occurs in within the result document set. To limit memory consumption for attributes with many values in the result document set we could specify a limit beyond which additional new values would not be collected, say return at most 200 suppliers which match the query.

Efficient range queries. There are certain attributes in our catalog item which we want to support range queries on. For example price. While it would be unusual for this to be the primary search criteria (people rarely say show me anything that costs greater than $10 in our application) it's important to have some support for this. The trick here is that the criteria may be very open ended. For example all items with price greater than $10 might involve tens of thousands of prices.
Order by attributes. We need the ability to order the document results set by a pre-defined set of numeric attributes and would like the ability to order on alphabetic attributes as well.

What we are currently thinking of as the solution to these first 3 gaps is an optional document-based payload or term vector. We think we could use this in a hit collector to post-filter any results outside our range queries, collect refinement values and order the results. It seems we could build an external mechanism similar to what's used in the search beans for this, or extend the core product. Does this seem like a reasonable approach? Any educated guesses on what the percent increase in search response time will be if we do a good job? The trickiest part seems to be efficiently collecting the refinement data for a set of 5-15 attributes. Any suggestions on data structures to use to encode/collect this? As we develop a design for this I'll post it for additional feedback.

I have resources for code development and consider it to be in Ariba's best interest to contribute any code that we write in this area with the entire community. Our time frame is to develop a prototype in the next couple of months for proof of concept and benchmarking.

-Geoffrey
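One possible data structure for the refinement counts Geoffrey asks about, sketched under assumptions: a collector calls `collect` once per matching document with that document's attribute value, and new values past the cap are dropped while values already seen keep counting. The class is hypothetical and independent of Lucene's hit collector API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical per-attribute counter with a cap on distinct values.
final class RefinementCounter {
    private final int maxValues;
    private final Map<String, Integer> counts = new LinkedHashMap<>();

    RefinementCounter(int maxValues) {
        this.maxValues = maxValues;
    }

    /** Records one matching document carrying the given attribute value. */
    void collect(String value) {
        Integer current = counts.get(value);
        if (current != null) {
            counts.put(value, current + 1);
        } else if (counts.size() < maxValues) {
            counts.put(value, 1);
        }
        // Otherwise the cap is reached and this new value is not collected.
    }

    Map<String, Integer> counts() {
        return counts;
    }

    public static void main(String[] args) {
        RefinementCounter suppliers = new RefinementCounter(200);
        for (String s : new String[] {"Acme", "Bic", "Acme"}) {
            suppliers.collect(s);
        }
        System.out.println(suppliers.counts());
    }
}
```

With 5-15 attributes, one such counter per attribute keeps memory bounded by the cap times the number of attributes, at the cost of one map lookup per attribute per hit.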
Date fields (was: Re: Range queries)
Again, my apologies to Terry, who was patient and confident despite my misunderstanding. I saw a change earlier today (on lucene-dev) with the QueryParser syntax updated with more details on the range query format. I think the ambiguity of saying "a date field" is still there though, since the Field.Keyword construct now supports Date natively as well as homegrown YYYYMMDD String fields.

My question now is: what benefit is Field.Keyword(String, Date) if it's easier to deal with dates in YYYYMMDD format? It's nice that QueryParser can deal with that type of field using the SHORT DateFormat - so I can see a slight advantage, when QueryParser is involved, to representing dates as Dates rather than Strings - but the range query constraints on open-ended begin or end dates seem to negate that benefit.

I'm just trying to understand in more detail the finer points of Lucene so that I can make more effective use of its features as well as help educate others on it.

Erik

On Thursday, January 23, 2003, at 06:03 AM, Terry Steichen wrote:

Erik, That's good. Now I don't have to keep proving what is, is. Glad it finally made sense. Regards, Terry
Re: Range queries
Tatu, I believe the range query syntax for the latest Lucene version is field:[lower TO upper], or field:[null TO upper], or field:[lower TO null]. In earlier versions replace TO with a dash (-). I also believe that multiple wildcards (? and/or *) work just fine (as long as they aren't the first character of the term).

HTH, Terry

- Original Message - From: Tatu Saloranta [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Wednesday, January 22, 2003 11:48 PM Subject: Range queries

My apologies if this is a FAQ (which is possible as I am new to Lucene; however, I tried checking the web page for the answer). I read through the Query syntax web page first, and then checked the matching query classes. It seems like the query syntax page is missing some details; the one I was wondering about was the range query. Since the query parser seems to construct these queries, I guess they have been implemented, even though the syntax page didn't explain them. Is that correct?

Looking at QueryParser, it seems that an inclusive range query uses [ and ], and an exclusive query { and }? Is this right? And does it expect exactly two arguments? Also, am I right in assuming that the range uses lexicographic ordering, so that it basically includes all possible words (terms) between the specified terms (which will work OK with numbers/dates as long as they have been padded with zeroes or such)?

Another question I have is regarding wildcard search. The page mentions that there is a restriction that a search term cannot start with a wildcard (as that would render the index useless I guess... would need a full scan?). However, it doesn't mention if multiple wildcards are allowed? All the example cases just have a single wildcard?

Sorry for the newbie questions,

-+ Tatu +-

ps. Thanks to the developers for the neat indexing engine. I am currently evaluating it for use in a large-scale enterprise content management system.
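Tatu's point about padding can be checked in plain Java, assuming terms are compared the way String.compareTo compares them. The helper below is a hypothetical sketch, not a Lucene utility: it left-pads non-negative numbers with zeroes so that text order agrees with numeric order.

```java
// Hypothetical helper for indexing numbers as range-friendly terms.
final class PaddedNumbers {

    /** Left-pads a non-negative value with zeroes to a fixed width. */
    static String pad(long value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values need a different encoding");
        }
        String digits = Long.toString(value);
        if (digits.length() > width) {
            throw new IllegalArgumentException(value + " does not fit in " + width + " digits");
        }
        StringBuilder sb = new StringBuilder(width);
        for (int i = digits.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(digits).toString();
    }

    public static void main(String[] args) {
        // Unpadded, "9" sorts after "10" as text; padded, the order is numeric again.
        System.out.println("9".compareTo("10") > 0);
        System.out.println(pad(9, 4).compareTo(pad(10, 4)) < 0);
    }
}
```

The same width has to be used at index time and at query time, otherwise the range bounds and the indexed terms no longer line up.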
Re: Range queries
Unfortunately, I don't believe date field range queries work with QueryParser, or at least not with human-readable dates. Is that correct? I think it supports date ranges if they are turned into a numeric format, but no human would type that kind of query in. I'm sure supporting true date range queries gets tricky with locale issues and such, too.

Erik

On Wednesday, January 22, 2003, at 09:19 AM, Terry Steichen wrote:

Tatu, I believe the range query syntax for the latest Lucene version is field:[lower TO upper], or field:[null TO upper], or field:[lower TO null]. In earlier versions replace TO with a dash (-). I also believe that multiple wildcards (? and/or *) work just fine (as long as they aren't the first character of the term).

HTH, Terry
Re: Range queries
Terry and Michael, Thanks for the clarification. For some reason I thought that date fields were represented more oddly than this format. I stand corrected!

Erik

On Wednesday, January 22, 2003, at 10:27 AM, Michael Barry wrote:

I utilize the earlier version, and queries such as this work fine with QueryParser: field:[ 20030120 - 20030125 ]. Of course, the back-end indexer canonicalizes all date fields to YYYYMMDD. The front-end search code is responsible for canonicalizing the user-entered dates to YYYYMMDD. I think the key here would be either to not allow users to enter free-form dates (provide some type of UI element to enter year, month, day separately) or to give some copy stating dates should be in YYYYMMDD format.

-Mike.
RE: Range queries
Does this have better performance than using DateFilter?

Regards, Hui

-Original Message- From: Michael Barry [mailto:[EMAIL PROTECTED]] Sent: Wed 1/22/2003 10:27 AM To: Lucene Users List Cc: Subject: Re: Range queries

I utilize the earlier version, and queries such as this work fine with QueryParser: field:[ 20030120 - 20030125 ]. Of course, the back-end indexer canonicalizes all date fields to YYYYMMDD. The front-end search code is responsible for canonicalizing the user-entered dates to YYYYMMDD. I think the key here would be either to not allow users to enter free-form dates (provide some type of UI element to enter year, month, day separately) or to give some copy stating dates should be in YYYYMMDD format.

-Mike.
Re: Range queries
On Wednesday 22 January 2003 07:49, Erik Hatcher wrote:

Unfortunately, I don't believe date field range queries work with QueryParser, or at least not with human-readable dates. Is that correct? I think it supports date ranges if they are turned into a numeric format, but no human would type that kind of query in. I'm sure supporting true date range queries gets tricky with locale issues and such, too.

Right. In my case that's OK -- the documents I'll be indexing are hybrid documents, with some structured/plain text content and additional metadata (in DB normalized form). Thus the dates (from normalized metadata fields) can easily be converted to numeric form and indexed (for things like last modified etc. that would normally be searched via the DB). The other part (the UI) needs more work... I either need to add a new quoting mechanism for dates (or just do that if a certain field prefix is used), or (more likely) the UI will use simple web forms for constructing the query.

Thanks to everyone for the quick replies,

-+ Tatu +-
Re: Range queries
On Wednesday 22 January 2003 08:27, Michael Barry wrote:

I utilize the earlier version, and queries such as this work fine with QueryParser: field:[ 20030120 - 20030125 ]. Of course, the back-end indexer canonicalizes all date fields to YYYYMMDD. The front-end search code is responsible for canonicalizing the user-entered dates to YYYYMMDD. I think the key here would be either to not allow users to enter free-form dates (provide some type of UI element to enter year, month, day separately) or to give some copy stating dates should be in YYYYMMDD format.

Thanks, this is along the lines I was thinking too.

-+ Tatu +-
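A sketch of the front-end canonicalization Michael describes, assuming the UI supplies year, month and day as separate numbers. The class name is hypothetical, and java.time is used here for validation even though it postdates the Lucene versions discussed in the thread.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical front-end helper that produces the YYYYMMDD index form.
final class DateCanonicalizer {

    /** Validates the parts and returns the date as YYYYMMDD; throws DateTimeException if invalid. */
    static String toIndexForm(int year, int month, int day) {
        LocalDate date = LocalDate.of(year, month, day);
        return date.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    public static void main(String[] args) {
        String begin = toIndexForm(2003, 1, 20);
        String end = toIndexForm(2003, 1, 25);
        // Range syntax of the later versions; earlier versions use a dash instead of TO.
        System.out.println("field:[" + begin + " TO " + end + "]");
        try {
            toIndexForm(2003, 2, 30);
        } catch (DateTimeException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Rejecting impossible dates before the query string is built keeps malformed bounds out of the range query altogether.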
Re: Range queries
I wanted to see this first-hand, so I wrote some test code to understand how dates are represented and how QueryParser deals with them. I've indexed 500 documents with random dates between 1/1/2002 and 12/31/2002. Here's what works:

    QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
    String begin = DateField.dateToString(new Date(102, 0, 1));   // 20020101
    String end = DateField.dateToString(new Date(102, 11, 31));   // 20021231
    String q = "date:[" + begin + " TO " + end + "]";
    System.out.println("q = " + q);
    Query query = parser.parse(q);
    System.out.println("query = " + query.toString("date"));
    Hits hits = searcher.search(query);
    System.out.println("# found = " + hits.length());

Here's the output:

    q = date:[0cvx9a8w0 TO 0daddkbk0]
    query = [0cvx9a8w0-0daddkbk0]
    # found = 500

If I change begin and end to 20020101 and 20021231 respectively, I get zero hits. I'm running the latest Lucene version from CVS, in case that makes a difference. So, while I would love it if QueryParser behaved with the YYYYMMDD syntax, it does not. Or am I missing something here? Any JavaCC wizards out there who could modify it to take readable date formats and construct the query using dateToString? That would be sweet! Has anyone created any JavaScript that mimics the dateToString functionality that you'd share?

Erik

On Wednesday, January 22, 2003, at 10:20 AM, Terry Steichen wrote:

Erik, I believe the question was on range queries in general, which of course work with the QueryParser. You can use range queries for dates, provided, as I believe you imply, the dates are in lexicographic order (i.e., 20030122). (As to whether dates expressed as such are too challenging for the average human being, I don't know.)

Regards, Terry

PS: Just to clarify, I believe that dates represented this way are internally treated as strings by Lucene.
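The two terms in Erik's output are consistent with a base-36 rendering of the millisecond timestamp, zero-padded to nine characters: 0cvx9a8w0 decodes to 1009861200000, which is midnight on 1 January 2002 at UTC-5. The sketch below mimics that encoding under this assumption; it is a hypothetical re-implementation for illustration, not the DateField source.

```java
// Hypothetical re-implementation of a base-36, fixed-width date term.
final class Base36Date {
    static final int LENGTH = 9;

    /** Encodes milliseconds since the epoch as a zero-padded base-36 string. */
    static String encode(long millis) {
        if (millis < 0) {
            throw new IllegalArgumentException("times before 1970 are not supported here");
        }
        String s = Long.toString(millis, Character.MAX_RADIX);
        if (s.length() > LENGTH) {
            throw new IllegalArgumentException("time too late to fit in " + LENGTH + " characters");
        }
        StringBuilder sb = new StringBuilder(LENGTH);
        for (int i = s.length(); i < LENGTH; i++) {
            sb.append('0');
        }
        return sb.append(s).toString();
    }

    /** Reverses encode. */
    static long decode(String term) {
        return Long.parseLong(term, Character.MAX_RADIX);
    }

    public static void main(String[] args) {
        System.out.println(encode(1009861200000L)); // 0cvx9a8w0
        System.out.println(decode("0daddkbk0"));    // 1041310800000
    }
}
```

Because the width is fixed, text order equals time order, which is why the encoded terms work as range bounds while a raw 20020101 typed into the query does not match them.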
Re: Range queries
Followup: after looking at QueryParser.jj (my apologies for not looking deeper before), it *does* support readable date range queries. But the dates must both be valid and adhere to the DateFormat.SHORT style. I changed the code I previously sent to this:

    String begin = "1/1/02";
    String end = "12/31/02";

And the results were the same: all expected documents were returned. It does not work to use null for begin or end to leave either side of the range open-ended.

Erik
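The behavior Erik observed can be reproduced with java.text.DateFormat alone. The sketch below assumes the US locale (the thread does not say which locale the parser ran under) and uses hypothetical names; it shows that 1/1/02 parses as a SHORT-style date while 20020101 does not.

```java
import java.text.DateFormat;
import java.text.ParseException;
import java.util.Date;
import java.util.Locale;

// Hypothetical probe for what DateFormat.SHORT accepts in the US locale.
final class ShortDateProbe {

    static Date parseShortUs(String text) throws ParseException {
        DateFormat df = DateFormat.getDateInstance(DateFormat.SHORT, Locale.US);
        df.setLenient(true);
        return df.parse(text);
    }

    /** True if the text parses as a SHORT-style US date such as 1/1/02. */
    static boolean looksLikeShortDate(String text) {
        try {
            parseShortUs(text);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(looksLikeShortDate("1/1/02"));
        System.out.println(looksLikeShortDate("20020101"));
    }
}
```

A range bound that fails this parse is left as plain term text, which would explain why a literal null or a YYYYMMDD string is never converted into the encoded date form.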
Re: Range queries
So you are using 1.3dev1? I'm not seeing the same effect you are. Looking at QueryParser.jj, it does this:

    private Query getRangeQuery(String field, Analyzer analyzer,
                                String part1, String part2,
                                boolean inclusive) {
      boolean isDate = false, isNumber = false;
      try {
        DateFormat df = DateFormat.getDateInstance(DateFormat.SHORT);
        df.setLenient(true);
        Date d1 = df.parse(part1);
        Date d2 = df.parse(part2);
        part1 = DateField.dateToString(d1);
        part2 = DateField.dateToString(d2);
        isDate = true;
      } catch (Exception e) { }
      . . .
    }

I get all documents if I use date:[0 TO 20021231] but none if I do date:[20020101 TO null]. From the code, I'm surmising I'm just lucky in the first case: it's using a string range query, and all the terms just happen to fall in that range (all dateToString values begin with 0). In the second case I'm getting nothing because no dateToString values fall in that range textually.

So are you certain you're seeing the results you report? I cannot explain the results you see from the code above. Maybe I'm way off in my understanding of this, but I really don't see how YYYYMMDD works accurately. DateFormat.parse throws an exception on strings in that format (using DateFormat.SHORT), so QueryParser would never think it's really a date and convert it properly to its string representation.

Ah, maybe it's that we are indexing our fields differently? How are you indexing your my_date_field? I'm using this syntax:

    Field.Keyword(fieldName, new Date())

Maybe you are indexing it as a String in YYYYMMDD format? If so, that explains it.

Erik

On Wednesday, January 22, 2003, at 10:55 PM, Terry Steichen wrote:

As of Lucene 1.3dev1 at least, some range query syntax changes were made. If you use dates expressed as YYYYMMDD (without doing the dateToString conversion), the expression my_date_field:[0 TO yyyymmdd] returns all entries up to the date yyyymmdd, and the expression my_date_field:[yyyymmdd TO null] returns all entries from the date yyyymmdd forward. (I just verified this.)

Terry
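The "just lucky" reasoning above comes down to plain string comparison. Here is a rough sketch of it, assuming term order behaves like String.compareTo; the class and method names are invented for the example, and the encoded terms are the ones printed earlier in the thread.

```java
// Hypothetical demo, not Lucene code: treats a range query as an inclusive
// lexicographic comparison on term text (assumption: String.compareTo order).
class StringRangeDemo {
    static boolean inRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        String encoded = "0cvx9a8w0";  // a dateToString-style term from the thread
        String plain = "20020615";     // a date indexed as a plain YYYYMMDD string

        // Encoded terms start with '0', so they sort between "0" and "20021231".
        System.out.println(inRange(encoded, "0", "20021231"));
        // ...but they sort before "20020101", so this range misses them.
        System.out.println(inRange(encoded, "20020101", "null"));
        // Plain YYYYMMDD terms sort before the literal text "null" (digits < letters).
        System.out.println(inRange(plain, "20020101", "null"));
    }
}
```

Seen this way, "null" is not an open end at all; it only acts like one for all-digit terms because every digit sorts before the letter n.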
Re: Range queries
I just realized how dense I've been on this thread. All along Terry has been saying he's indexing date fields as YYYYMMDD String fields and I just wasn't getting it. I had my brain locked into thinking the fields were being indexed as Date fields. My apologies for the runaround on this thread. Back to your regular programming...

On Wednesday, January 22, 2003, at 11:28 PM, Erik Hatcher wrote:

Ah, maybe it's that we are indexing our fields differently? How are you indexing your my_date_field? I'm using this syntax:

    Field.Keyword(fieldName, new Date())

Maybe you are indexing it as a String in YYYYMMDD format? If so, that explains it.

Erik
Re: Range queries
Hi,

Thanks man, you solved my problem.

Regards,
Saif...

----- Original Message -----
From: Terry Steichen [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, January 23, 2003 9:25 AM
Subject: Re: Range queries

As of Lucene 1.3dev1 at least, some range query syntax changes were made. If you use dates expressed as YYYYMMDD (without doing the dateToString conversion), the expression my_date_field:[0 TO yyyymmdd] returns all entries up to the date yyyymmdd, and the expression my_date_field:[yyyymmdd TO null] returns all entries from the date yyyymmdd forward. (I just verified this.)

Terry
Range queries
My apologies if this is a FAQ (which is possible, as I am new to Lucene; however, I did try checking the web page for the answer).

I read through the Query syntax web page first, and then checked the matching query classes. It seems the query syntax page is missing some details; the one I was wondering about was the range query. Since the query parser seems to construct these queries, I guess they have been implemented, even though the syntax page doesn't explain them. Is that correct?

Looking at QueryParser, it seems that an inclusive range query uses [ and ], and an exclusive one uses { and }. Is this right? And does it expect exactly two arguments?

Also, am I right in assuming that a range uses lexicographic ordering, so that it basically includes all possible words (terms) between the specified terms (which will work OK with numbers/dates as long as they have been padded with zeroes or such)?

Another question I have is regarding wildcard search. The page mentions a restriction that a search term cannot start with a wildcard (as that would render the index useless, I guess... it would need a full scan?). However, it doesn't mention whether multiple wildcards are allowed. All the example cases have just a single wildcard.

Sorry for the newbie questions,

-+ Tatu +-

ps. Thanks to the developers for the neat indexing engine. I am currently evaluating it for use in a large-scale enterprise content management system.
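The zero-padding point in the question above can be illustrated without Lucene. This is a small sketch in plain Java; the class name, the pad helper, and the width of 5 are all made up for the example, and it only handles non-negative numbers.

```java
import java.util.Locale;

// Hypothetical helper: pads non-negative integers to a fixed width so that
// lexicographic order of the strings matches numeric order of the values.
class PaddedTerms {
    static String pad(int value, int width) {
        return String.format(Locale.ROOT, "%0" + width + "d", value);
    }

    public static void main(String[] args) {
        // Unpadded: "9" sorts after "10" because '9' > '1'.
        System.out.println("9".compareTo("10") > 0);
        // Padded to the same width: "00009" sorts before "00010".
        System.out.println(pad(9, 5).compareTo(pad(10, 5)) < 0);
    }
}
```

The width has to be chosen up front and applied both when indexing and when building the range bounds, otherwise the two sides will not compare consistently.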
RE: Do range queries work?
First, let me apologize for not including a self-contained test in my original submission. I will do so for any future issues.

Regarding the test case that you included in your response: it looks like there is a bug (besides the StandardAnalyzer parsing 20-35 as a single term). The query in your example:

    search(searcher, analyzer, "FirstName:[a-k]");

is not finding the correct document. It is finding doc2; it should find doc1. QueryParser is parsing the query into FirstName:[k-null] when it should be FirstName:[a-k]. Is "a" being caught as a stop word?

Thanks for the quick response.

Paul

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: Thursday, November 01, 2001 11:36 AM
To: 'Lucene Users List'
Subject: RE: Do range queries work?

Can folks please try to include complete, self-contained test cases when submitting bugs? It's not that hard, and makes it much easier to figure out what is going on.

For example, I have attached a complete, self-contained test case for the bug reported below. It only took 50 lines. Interestingly, the results it returns are quite different from those reported. Not that it works as desired! My results are:

    age:[20-35-null]: 2 hits ( doc1 doc2 )
    age:[35-50-null]: 1 hits ( doc2 )
    FirstName:[k-null]: 1 hits ( doc2 )
    FirstName:[u-z]: 1 hits ( doc2 )
    FirstName:[adam-larry]: 1 hits ( doc1 )

(Note that I'm re-printing the parsed query to show how it was parsed.)

It looks like StandardAnalyzer parses 20-35 as a single term, causing problems with the age queries. The FirstName queries work as desired for me. I suspect that in your case the FirstName field may not have been added as a Field.Text, but as a case-sensitive Field.Keyword.

In any case, it is much easier to figure out what is going on if a complete test case is provided in the first place. So far as I can tell, RangeQuery is doing what it is supposed to.

Doug

-----Original Message-----
From: Paul Friedman [mailto:[EMAIL PROTECTED]]
Sent: Thursday, November 01, 2001 6:46 AM
To: 'Lucene Users List'
Subject: Do range queries work?

Is range searching working in RC2? I have the following documents in my index (using StandardAnalyzer):

    id:doc1 age:30 FirstName:John
    id:doc2 age:40 FirstName:Wendy

The following queries do not return the expected results (using QueryParser with StandardAnalyzer):

    age:[20-35] finds 2 documents, should find 1
    age:[35-50] finds 2 documents, should find 1
    FirstName:[a-k] finds 0 documents, should find 1
    FirstName:[u-z] finds 0 documents, should find 1
    FirstName:[adam-larry] finds 0 documents, should find 1

Thanks.
RE: Do range queries work?
From: Paul Friedman [mailto:[EMAIL PROTECTED]]

It looks like there is a bug (besides the StandardAnalyzer parsing 20-35 as a single term). The query in your example:

    search(searcher, analyzer, "FirstName:[a-k]");

is not finding the correct document. It is finding doc2; it should find doc1. QueryParser is parsing the query into FirstName:[k-null] when it should be FirstName:[a-k]. Is "a" being caught as a stop word?

It looks like it. I think the real bug here is that QueryParser should not analyze the terms in a range query. I have modified QueryParser.jj to not do this, and all of your examples work much better. The only downside that I can see is that range queries, like prefix queries, become case sensitive. That seems like a good tradeoff to me. Does anyone object to this?

Doug

QueryParser.jj
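To see how a stop word could turn [a-k] into [k-null], here is a toy sketch. It is not Lucene's StandardAnalyzer: the stop list, the tokenizing rule, and every name in it are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy stand-in for an analyzer (hypothetical, not Lucene code): lowercases,
// splits on non-alphanumeric characters, and drops a tiny assumed stop list.
class RangeBoundAnalysisDemo {
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "the"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String tok : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!tok.isEmpty() && !STOP_WORDS.contains(tok)) {
                tokens.add(tok);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "a" is dropped, so only "k" survives to serve as a range bound.
        System.out.println(analyze("a-k"));
        // Neither bound is a stop word here, so both survive.
        System.out.println(analyze("u-z"));
    }
}
```

If the range bounds are taken from whatever tokens survive analysis, a dropped "a" leaves "k" as the only bound, which is consistent with the FirstName:[k-null] parse reported above. Skipping analysis for range bounds avoids this, at the cost that the bounds are no longer lowercased.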
RE: Do range queries work?
Can folks please try to include complete, self-contained test cases when submitting bugs? It's not that hard, and makes it much easier to figure out what is going on.

In the WebMacro project, we've actually gone further than that -- no bug report is taken seriously without a JUnit test case that demonstrates it. This has been very successful for us, and along the way, we've built a nice little suite of tests.

JUnit test cases are really easy to write, but so far, the only one in the Lucene repo is the one for the query parser. I'd like to see the existing test programs converted into JUnit test cases -- I'm willing to do this if someone will tell me how they work, what they're supposed to output, and how to invoke them.

--
Brian Goetz
Quiotix Corporation
[EMAIL PROTECTED]
Tel: 650-843-1300  Fax: 650-324-8032
http://www.quiotix.com