Re: MultiFieldQueryParser seems broken... Fix attached.
Doug Cutting writes:
> >> http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1798116
> >
> > Yes, the approach there is similar. I attempted to complete the
> > solution and provide a working replacement for MultiFieldQueryParser.
>
> But, inspired by that message, couldn't MultiFieldQueryParser just be a
> subclass of QueryParser that overrides getFieldQuery()?

This wouldn't catch PrefixQueries or RangeQueries, etc., would it? If QueryParser.TermQuery() wasn't final, you could just override it (or fix it to do the right thing).

By the way, I've found a bug in my implementation of MultiFieldQueryParser. Single-word queries weren't being expanded properly. I've fixed that, and placed a revised copy of the code at ftp://ftp.parc.xerox.com/pub/transient/janssen/SearchTest.java.

See my original post at http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgNo=9757 for instructions on how to use it. Or just read the SearchTest.java code.

Bill

- To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: MultiFieldQueryParser seems broken... Fix attached.
Daniel Naber wrote:
> On Thursday 09 September 2004 18:52, Doug Cutting wrote:
> > I have not been able to construct a two-word query that returns a page
> > without both words in either the content, the title, the url or in a
> > single anchor. Can you?
>
> Like this one?
>
> konvens leitseite
>
> Leitseite is only in the title of the first match (www.gldv.org),
> konvens is only in the body.

Good job finding that! I guess I should fix Nutch's BasicQueryFilter.

Thanks,

Doug
Re: MultiFieldQueryParser seems broken... Fix attached.
> I reckon there has been a discussion (and solution :-) on how to achieve
> the functionality you've been after:
>
> http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1798116
>
> I'm not sure if this would be the same though.
>
> Best regards,
> René

Hi all,

I took the code indicated by René, but I've seen that it doesn't completely fit my requirements, because my application should provide the option to mark queries as fuzzy queries. So I modified the code to the following, and I added a test main method.

Hope it helps someone.

    package org.apache.lucene;

    /* @(#) CWK 1.5 10.09.2004
     *
     * Copyright 2003-2005 ConfigWorks Informationssysteme & Consulting GmbH
     * Universitätsstr. 94/7 9020 Klagenfurt Austria
     * www.configworks.com
     * All rights reserved.
     */

    import java.util.Vector;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.Query;

    /**
     * @author sergiu
     * This class is a patch for MultiFieldQueryParser.
     * Its behaviour can be tested by running the main method.
     *
     * Now:
     *   String[] fields = new String[] { "title", "abstract", "content" };
     *   QueryParser parser = new CustomQueryParser(fields, new SimpleAnalyzer());
     *   parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);
     *   Query query = parser.parse("foo -bar (baz OR title:bla)");
     *   System.out.println("? " + query);
     *
     * Produces:
     *   ? +(title:foo abstract:foo content:foo)
     *     -(title:bar abstract:bar content:bar)
     *     +((title:baz abstract:baz content:baz) title:bla)
     *
     * Perfect!
     *
     * @version 1.0
     * @since CWK 1.5
     */
    public class CustomQueryParser extends QueryParser {

        // Default fields searched when a term carries no explicit field.
        private String[] fields;
        private boolean fuzzySearch = false;

        public CustomQueryParser(String[] fields, Analyzer analyzer) {
            super(null, analyzer);
            this.fields = fields;
        }

        public CustomQueryParser(String[] fields, Analyzer analyzer, int defaultOperator) {
            super(null, analyzer);
            this.fields = fields;
            setOperator(defaultOperator);
        }

        protected Query getFieldQuery(String field, Analyzer analyzer, String queryText)
                throws ParseException {
            Query query = null;
            if (field == null) {
                // No explicit field: expand the term over all default fields,
                // combining the per-field clauses as optional (OR) clauses.
                Vector clauses = new Vector();
                for (int i = 0; i < fields.length; i++) {
                    if (isFuzzySearch())
                        clauses.add(new BooleanClause(
                                super.getFuzzyQuery(fields[i], queryText), false, false));
                    else
                        clauses.add(new BooleanClause(
                                super.getFieldQuery(fields[i], analyzer, queryText), false, false));
                }
                query = getBooleanQuery(clauses);
            } else {
                // Explicit field (e.g. title:bla): query that field only.
                if (isFuzzySearch())
                    query = super.getFuzzyQuery(field, queryText);
                else
                    query = super.getFieldQuery(field, analyzer, queryText);
            }
            return query;
        }

        public boolean isFuzzySearch() {
            return fuzzySearch;
        }

        public void setFuzzySearch(boolean fuzzySearch) {
            this.fuzzySearch = fuzzySearch;
        }

        public static void main(String[] args) throws Exception {
            String[] fields = new String[] { "title", "abstract", "content" };
            CustomQueryParser parser = new CustomQueryParser(fields, new StandardAnalyzer());
            parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);
            parser.setFuzzySearch(true);
            String queryString = "foo -bar (baz OR title:bla)";
            System.out.println(queryString);
            Query query = parser.parse(queryString);
            System.out.println("? " + query);
        }
    }
Re: MultiFieldQueryParser seems broken... Fix attached.
> But, inspired by that message, couldn't MultiFieldQueryParser just be a
> subclass of QueryParser that overrides getFieldQuery()?

I wasn't sure that everything "went through" getFieldQuery(). If so, yes, that should work.

In either case, I don't even think a subclass is necessary. Just have a different constructor for QueryParser that takes multiple default field names, and just add the behavior to QueryParser, keyed off that characteristic (more than one default field name).

Bill
Re: MultiFieldQueryParser seems broken... Fix attached.
> is it a problem if the users will search "coffee OR tea" as a search
> string in the case that MultiFieldQueryParser is
> modified as Bill suggested, and the default operator is set to AND?

Here's what you get (which is correct):

% java -classpath /usr/local/lib/lucene-1.4.1.jar:. \
    -DSearchTest.QueryDefaultOperator=AND \
    -DSearchTest.QueryParser=new SearchTest 'coffee OR tea'
query is (title:coffee authors:coffee contents:coffee) (title:tea authors:tea contents:tea)
%

Bill
Re: MultiFieldQueryParser seems broken... Fix attached.
On Thursday 09 September 2004 18:52, Doug Cutting wrote:
> I have not been
> able to construct a two-word query that returns a page without both
> words in either the content, the title, the url or in a single anchor.
> Can you?

Like this one?

konvens leitseite

Leitseite is only in the title of the first match (www.gldv.org), konvens is only in the body.

--
http://www.danielnaber.de
Re: MultiFieldQueryParser seems broken... Fix attached.
Bill Janssen wrote:
> I'd think that if a user specified a query "cutting lucene", with an
> implicit AND and the default fields "title" and "author", they'd
> expect to see a match in which both "cutting" and "lucene" appear.
> That is,
>
>   (title:cutting OR author:cutting) AND (title:lucene OR author:lucene)

Your proposal is certainly an improvement.

It's interesting to note that in Nutch I implemented something different. There, a search for "cutting lucene" expands to something like:

  (+url:cutting^4.0 +url:lucene^4.0 +url:"cutting lucene"~2147483647^4.0)
  (+anchor:cutting^2.0 +anchor:lucene^2.0 +anchor:"cutting lucene"~4^2.0)
  (+content:cutting +content:lucene +content:"cutting lucene"~2147483647)

So a page with "cutting" in the body and "lucene" in anchor text won't match: the body, anchor or url must contain all query terms. A single authority (content, url or anchor) must vouch for all attributes.

Note that Nutch also boosts matches where the terms are close together. Using "~2147483647" permits them to be anywhere in the document, but boosts more when they're closer and in-order. (The "~4" in anchor matches is to prohibit matches across different anchors. Each anchor is separated by a Token.positionIncrement() of 4.)

But perhaps this is not a feature. Perhaps Nutch should instead expand this to:

  +(url:cutting^4.0 anchor:cutting^2.0 content:cutting)
  +(url:lucene^4.0 anchor:lucene^2.0 content:lucene)
  url:"cutting lucene"~2147483647^4.0
  anchor:"cutting lucene"~4^2.0
  content:"cutting lucene"~2147483647

That would, e.g., permit a match with only "lucene" in an anchor and "cutting" in the content, which the earlier formulation would not.

Can anyone tell whether Google has this requirement? I have not been able to construct a two-word query that returns a page without both words in either the content, the title, the url or in a single anchor. Can you?
If you're interested, the Nutch query expansion code in question is:

http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/query-basic/src/java/net/nutch/searcher/basic/BasicQueryFilter.java?view=markup

To play with it you can download Nutch and use the command:

  bin/nutch net.nutch.searcher.Query

>> http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1798116
>
> Yes, the approach there is similar. I attempted to complete the solution
> and provide a working replacement for MultiFieldQueryParser.

But, inspired by that message, couldn't MultiFieldQueryParser just be a subclass of QueryParser that overrides getFieldQuery()?

Cheers,

Doug
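The difference between the two Nutch formulations discussed in this message can be illustrated without Lucene. The following is only a rough sketch under stated assumptions: a page is modelled as a map from field name to the set of terms in that field, boosts and the sloppy-phrase clauses are ignored entirely, and the class and method names (FieldMatchSketch, singleFieldMatchesAll, everyTermInSomeField) are made up for illustration and are not part of Nutch or Lucene.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not Nutch or Lucene code.
final class FieldMatchSketch {

    // First formulation: one field ("a single authority") must contain all terms.
    static boolean singleFieldMatchesAll(Map<String, Set<String>> doc, List<String> terms) {
        for (Set<String> fieldTerms : doc.values()) {
            if (fieldTerms.containsAll(terms)) {
                return true;
            }
        }
        return false;
    }

    // Alternative formulation: every term must occur in at least one field,
    // but different terms may be found in different fields.
    static boolean everyTermInSomeField(Map<String, Set<String>> doc, List<String> terms) {
        for (String term : terms) {
            boolean found = false;
            for (Set<String> fieldTerms : doc.values()) {
                if (fieldTerms.contains(term)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    // A page with "cutting" only in the content and "lucene" only in an anchor.
    static Map<String, Set<String>> exampleDoc() {
        Map<String, Set<String>> doc = new HashMap<>();
        doc.put("url", new HashSet<>(Arrays.asList("example", "org")));
        doc.put("anchor", new HashSet<>(Arrays.asList("lucene")));
        doc.put("content", new HashSet<>(Arrays.asList("cutting", "search")));
        return doc;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("cutting", "lucene");
        Map<String, Set<String>> doc = exampleDoc();
        System.out.println("single field must match all: " + singleFieldMatchesAll(doc, query));
        System.out.println("every term in some field:    " + everyTermInSomeField(doc, query));
    }
}
```

With the example page, the first check rejects the page and the second accepts it, which is the case described above of "lucene" in an anchor and "cutting" in the content.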
Re: Handling user queries (Was: Re: MultiFieldQueryParser seems broken... Fix attached.)
René Hackl wrote:
> > is it a problem if the users will search "coffee OR tea" as a search
> > string in the case that MultiFieldQueryParser is
> > modified as Bill suggested, and the default operator is set to AND?
>
> No. There's not a problem with the proposed correction to MFQP. MFQP
> should work the way Bill suggested.
>
> My babbling about coffee or tea was more aimed at Bill's referring to
> "darn users started demanding". So this is a totally different matter.
> In my experience, many users fall into everyday language traps, like in:
>
> "What do you want to drink, coffee or tea?"
>
> The answer normally isn't 'yes' to both, is it?

This problem may be solved if the users know the meaning of the following signs: - + "" * ~
This will improve the results more than our parsing does ...

> I have an app where in some cases I make subqueries for an initial
> user-stated query. The aim is to come up with pointers to partial
> matching docs. The background is, one ill-advised NOT can ruin a query.
>
> But this has nothing to do with MFQP. Just random thoughts about making
> users happy even when they are new to formulating queries :-)
>
> Cheers,
> René
Handling user queries (Was: Re: MultiFieldQueryParser seems broken... Fix attached.)
> is it a problem if the users will search "coffee OR tea" as a search
> string in the case that MultiFieldQueryParser is
> modified as Bill suggested, and the default operator is set to AND?

No. There's not a problem with the proposed correction to MFQP. MFQP should work the way Bill suggested.

My babbling about coffee or tea was more aimed at Bill's referring to "darn users started demanding". So this is a totally different matter. In my experience, many users fall into everyday language traps, like in:

"What do you want to drink, coffee or tea?"

The answer normally isn't 'yes' to both, is it?

I have an app where in some cases I make subqueries for an initial user-stated query. The aim is to come up with pointers to partial matching docs. The background is, one ill-advised NOT can ruin a query.

But this has nothing to do with MFQP. Just random thoughts about making users happy even when they are new to formulating queries :-)

Cheers,
René
Re: MultiFieldQueryParser seems broken... Fix attached.
René Hackl wrote:
> Bill,
>
> Thank you for clarifying on that issue. I missed the...
>
> > (title:cutting OR author:cutting) AND (title:lucene OR author:lucene)
> ...
> > (title:cutting OR title:lucene) AND (author:cutting OR author:lucene)
> >
> > Note that this would match even if only "lucene" occurred in the
>
> ... "only lucene"/"only cutting" match.
>
> > I'd think that if a user specified a query "cutting lucene", with an
> > implicit AND and the default fields "title" and "author", they'd
> > expect to see a match in which both "cutting" and "lucene" appear.
>
> Hopefully they'd expect that. Sometimes users assume that e.g. "coffee
> OR tea" would provide matches with either term, but not both. But this
> is already "user-attune your application" territory.
>
> Your proposal makes perfect sense, of course.
>
> René

Is it a problem if the users search for "coffee OR tea" as a search string in the case that MultiFieldQueryParser is modified as Bill suggested, and the default operator is set to AND?

I don't think so ... I think that the resulting Query should be:

  (title:cutting OR author:cutting) OR (title:lucene OR author:lucene)

And I think that the results will be correct. Am I wrong?

I don't know exactly what will happen with more complex queries that use grouping, exact matches and the NOT operator, like:

  (alcohol NOT tea) OR ("black tea" AND brandy)

What will happen if you send this to a MultiFieldQueryParser that searches in an index with the fields "drink" and "juices"?

Maybe this kind of search construction should be part of the JUnit tests, if it is not already there.

Thanks,

Sergiu
Re: MultiFieldQueryParser seems broken... Fix attached.
Bill,

Thank you for clarifying on that issue. I missed the...

> (title:cutting OR author:cutting) AND (title:lucene OR author:lucene)
...
> (title:cutting OR title:lucene) AND (author:cutting OR author:lucene)
>
> Note that this would match even if only "lucene" occurred in the

... "only lucene"/"only cutting" match.

> I'd think that if a user specified a query "cutting lucene", with an
> implicit AND and the default fields "title" and "author", they'd
> expect to see a match in which both "cutting" and "lucene" appears.

Hopefully they'd expect that. Sometimes users assume that e.g. "coffee OR tea" would provide matches with either term, but not both. But this is already "user-attune your application" territory.

Your proposal makes perfect sense, of course.

René
Re: MultiFieldQueryParser seems broken... Fix attached.
Hi Bill,

I think that more people are waiting for this patch to MultiFieldQueryParser. It would be nice if it were included in the next release candidate.

All the best,

Sergiu

Bill Janssen wrote:
> René,
>
> Thanks for your note.
>
> I'd think that if a user specified a query "cutting lucene", with an
> implicit AND and the default fields "title" and "author", they'd
> expect to see a match in which both "cutting" and "lucene" appear.
> That is,
>
>   (title:cutting OR author:cutting) AND (title:lucene OR author:lucene)
>
> Instead, what they'd get using the current (broken) strategy of outer
> combination used by the current MultiFieldQueryParser, would be
>
>   (title:cutting OR title:lucene) AND (author:cutting OR author:lucene)
>
> Note that this would match even if only "lucene" occurred in the
> document, as long as it occurred both in the title field and in the
> author field. Or, for that matter, it would also match "Cutting on
> Cutting", by Doug Cutting :-).
>
> > http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1798116
>
> Yes, the approach there is similar. I attempted to complete the
> solution and provide a working replacement for MultiFieldQueryParser.
>
> Bill
Re: MultiFieldQueryParser seems broken... Fix attached.
René,

Thanks for your note.

I'd think that if a user specified a query "cutting lucene", with an implicit AND and the default fields "title" and "author", they'd expect to see a match in which both "cutting" and "lucene" appear. That is,

  (title:cutting OR author:cutting) AND (title:lucene OR author:lucene)

Instead, what they'd get using the current (broken) strategy of outer combination used by the current MultiFieldQueryParser, would be

  (title:cutting OR title:lucene) AND (author:cutting OR author:lucene)

Note that this would match even if only "lucene" occurred in the document, as long as it occurred both in the title field and in the author field. Or, for that matter, it would also match "Cutting on Cutting", by Doug Cutting :-).

> http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1798116

Yes, the approach there is similar. I attempted to complete the solution and provide a working replacement for MultiFieldQueryParser.

Bill
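The two combination strategies contrasted in this message can be sketched as plain string building. This is only an illustration under stated assumptions: it does not use Lucene, it is not the code from SearchTest.java, and the names ExpansionSketch, expandOuter and expandInner are hypothetical. It simply renders the query shapes quoted in the thread, with "+" marking a required (AND-ed) group.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the NewMultiFieldQueryParser code.
final class ExpansionSketch {

    // Outer-level combination: one group per field, each listing every term.
    static String expandOuter(String[] fields, String[] terms, boolean and) {
        List<String> groups = new ArrayList<>();
        for (String field : fields) {
            List<String> clauses = new ArrayList<>();
            for (String term : terms) {
                clauses.add(field + ":" + term);
            }
            groups.add(group(clauses, and));
        }
        return String.join(" ", groups);
    }

    // Inner-level combination: one group per term, each listing every field.
    static String expandInner(String[] fields, String[] terms, boolean and) {
        List<String> groups = new ArrayList<>();
        for (String term : terms) {
            List<String> clauses = new ArrayList<>();
            for (String field : fields) {
                clauses.add(field + ":" + term);
            }
            groups.add(group(clauses, and));
        }
        return String.join(" ", groups);
    }

    // Wraps optional clauses in parentheses; "+" marks the group as required.
    private static String group(List<String> clauses, boolean and) {
        return (and ? "+" : "") + "(" + String.join(" ", clauses) + ")";
    }

    public static void main(String[] args) {
        String[] fields = {"title", "author"};
        String[] terms = {"cutting", "lucene"};
        System.out.println("outer: " + expandOuter(fields, terms, true));
        System.out.println("inner: " + expandInner(fields, terms, true));
    }
}
```

Only the inner form requires each term to appear somewhere; the outer form requires each field to contain something, which is why a document with "lucene" in both fields and no "cutting" anywhere slips through.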
Re: MultiFieldQueryParser seems broken... Fix attached.
The class is at the end of the message. But I think that a better solution is the one suggested by René:

http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1798116

Wermus Fernando wrote:
> Bill,
> I don't receive any .java. Could you send it again?
>
> Thanks.
>
> -----Original Message-----
> From: Bill Janssen [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, 07 September 2004 10:06 p.m.
> To: Lucene Users List
> CC: Ali Rouhi
> Subject: MultiFieldQueryParser seems broken... Fix attached.
>
> Hi!
>
> I'm using Lucene for an application which has lots of fields/document,
> in which the users can specify in their config files what fields they
> wish to be included by default in a search. I'd been happily using
> MultiFieldQueryParser to do the searches, but the darn users started
> demanding more Google-like searches; that is, they want the search
> terms to be implicitly AND-ed instead of implicitly OR-ed. No problem,
> thinks I, I'll just set the "operator".
>
> Only to find this has no effect on MultiFieldQueryParser.
>
> Once I looked at the code, I find that MultiFieldQueryParser combines
> the clauses at the wrong level -- it combines them at the outermost
> level instead of the innermost level. This means that if you have two
> fields, "author" and "title", and the search string "cutting lucene",
> you'll get the final query
>
>   (title:cutting title:lucene) (author:cutting author:lucene)
>
> If the search operator is "OR", this isn't a problem. But if it is
> "AND", you have two problems. The first is that MultiFieldQueryParser
> seems to ignore the operator entirely. But even if it didn't, the
> second problem is that the query formed would be
>
>   +(title:cutting title:lucene) +(author:cutting author:lucene)
>
> That is, if the word "Lucene" was in both the author field and the
> title field, the match would fit. This clearly isn't what the searcher
> intended.
>
> You can re-write MultiFieldQueryParser, as I've done in the example
> code which I append here.
> This little program allows you to run either my parser
> (-DSearchTest.QueryParser=new) or the old parser
> (-DSearchTest.QueryParser=old). It allows you to use either OR
> (-DSearchTest.QueryDefaultOperator=or) or AND
> (-DSearchTest.QueryDefaultOperator=and) as the operator. And it allows
> you to pick your favorite set of default search fields
> (-DSearchTest.QueryDefaultFields=author:title:body, for example). It
> takes one argument, a query string, and outputs the re-written query
> after running it through the query parser. So to evaluate the above
> query:
>
> % java -classpath /import/lucene/lucene-1.4.1.jar:. \
>     -DSearchTest.QueryDefaultFields="title:author" \
>     -DSearchTest.QueryDefaultOperator=AND \
>     -DSearchTest.QueryParser=old \
>     SearchTest "cutting lucene"
> query is (title:cutting title:lucene) (author:cutting author:lucene)
> %
>
> The class NewMultiFieldQueryParser does the combination at the inner
> level, using an override of "addClause", instead of the outer level.
> Note that it can't cover all cases (notably PhrasePrefixQuery, because
> that class has no access methods which allow one to introspect over
> it, and SpanQueries, because I don't understand them well enough :-).
> I post it here in advance of filing a formal bug report for early
> feedback. But it will show up in a bug report in the near future.
>
> Running the above query with the new parser gives:
>
> % java -classpath /import/lucene/lucene-1.4.1.jar:. \
>     -DSearchTest.QueryDefaultFields="title:author" \
>     -DSearchTest.QueryDefaultOperator=AND \
>     -DSearchTest.QueryParser=new \
>     SearchTest "cutting lucene"
> query is +(title:cutting author:cutting) +(title:lucene author:lucene)
> %
>
> which I claim is what the user is expecting.
>
> In addition, the new class uses an API more similar to QueryParser, so
> that the user has less to learn when using it. The code in it could
> probably just be folded into QueryParser, in fact.
> Bill
>
> the code for SearchTest:
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.index.TermDocs;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.search.Searcher;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.TermQuery;
> import org.apache.lucene.search.PhraseQuery;
> import org.apache.lucene.search.FuzzyQuery;
> import org.apache.lucene.search.WildcardQuery;
> import org.apache.lucene.search.PrefixQuery;
> import org.apache.lucene.search.PhraseQuery;
> import org.apache.lucene.search.RangeQuery;
> import org.apache.lucene.search.BooleanQuery;
> import org.apache.lucene.search.Hits;
> import org.apache.lucene.queryParser.QueryParser;
> import org.apache.lucene.queryParser.MultiField
RE: MultiFieldQueryParser seems broken... Fix attached.
Bill,
I don't receive any .java. Could you send it again?

Thanks.

-----Original Message-----
From: Bill Janssen [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 07 September 2004 10:06 p.m.
To: Lucene Users List
CC: Ali Rouhi
Subject: MultiFieldQueryParser seems broken... Fix attached.
Re: MultiFieldQueryParser seems broken... Fix attached.
Hi Bill,

> But even if it didn't, the second problem is that the query formed
> would be
>
>   +(title:cutting title:lucene) +(author:cutting author:lucene)
>
> That is, if the word "Lucene" was in both the author field and the
> title field, the match would fit. This clearly isn't what the searcher
> intended.

As far as my understanding of the query syntax goes, this would be interpreted as

  (A OR B) AND (C OR D)

which would produce the same set as

  (A OR C) AND (B OR D) == +(title:cutting author:cutting) +(title:lucene author:lucene).

But it would only be true for this special case with 2 terms and 2 fields.

I reckon there has been a discussion (and solution :-) on how to achieve the functionality you've been after:

http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1798116

I'm not sure if this would be the same though.

Best regards,
René
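Whether (A OR B) AND (C OR D) and (A OR C) AND (B OR D) select the same documents can be checked by brute force over all sixteen truth assignments. The sketch below is only an illustration, not part of any posted code; the class name EquivalenceCheck and the variable names are made up, with tc standing for "title contains cutting", tl for "title contains lucene", ac for "author contains cutting" and al for "author contains lucene".

```java
// Hypothetical illustration only; enumerates all 2^4 combinations of the four clauses.
final class EquivalenceCheck {

    // Outer combination: (title:cutting OR title:lucene) AND (author:cutting OR author:lucene)
    static boolean outer(boolean tc, boolean tl, boolean ac, boolean al) {
        return (tc || tl) && (ac || al);
    }

    // Inner combination: (title:cutting OR author:cutting) AND (title:lucene OR author:lucene)
    static boolean inner(boolean tc, boolean tl, boolean ac, boolean al) {
        return (tc || ac) && (tl || al);
    }

    // Counts the assignments on which the two formulations disagree.
    static int countDifferences() {
        int differences = 0;
        for (int bits = 0; bits < 16; bits++) {
            boolean tc = (bits & 1) != 0;
            boolean tl = (bits & 2) != 0;
            boolean ac = (bits & 4) != 0;
            boolean al = (bits & 8) != 0;
            if (outer(tc, tl, ac, al) != inner(tc, tl, ac, al)) {
                differences++;
                System.out.println("differs: title:cutting=" + tc + " title:lucene=" + tl
                        + " author:cutting=" + ac + " author:lucene=" + al);
            }
        }
        return differences;
    }

    public static void main(String[] args) {
        System.out.println(countDifferences() + " of 16 assignments differ");
    }
}
```

The two forms disagree on four assignments, so they are not interchangeable even with two terms and two fields. One of the four is the case raised elsewhere in the thread: "lucene" in both title and author with "cutting" absent satisfies the outer form but not the inner one.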
MultiFieldQueryParser seems broken... Fix attached.
Hi!

I'm using Lucene for an application which has lots of fields/document, in which the users can specify in their config files what fields they wish to be included by default in a search. I'd been happily using MultiFieldQueryParser to do the searches, but the darn users started demanding more Google-like searches; that is, they want the search terms to be implicitly AND-ed instead of implicitly OR-ed. No problem, thinks I, I'll just set the "operator".

Only to find this has no effect on MultiFieldQueryParser.

Once I looked at the code, I find that MultiFieldQueryParser combines the clauses at the wrong level -- it combines them at the outermost level instead of the innermost level. This means that if you have two fields, "author" and "title", and the search string "cutting lucene", you'll get the final query

  (title:cutting title:lucene) (author:cutting author:lucene)

If the search operator is "OR", this isn't a problem. But if it is "AND", you have two problems. The first is that MultiFieldQueryParser seems to ignore the operator entirely. But even if it didn't, the second problem is that the query formed would be

  +(title:cutting title:lucene) +(author:cutting author:lucene)

That is, if the word "Lucene" was in both the author field and the title field, the match would fit. This clearly isn't what the searcher intended.

You can re-write MultiFieldQueryParser, as I've done in the example code which I append here. This little program allows you to run either my parser (-DSearchTest.QueryParser=new) or the old parser (-DSearchTest.QueryParser=old). It allows you to use either OR (-DSearchTest.QueryDefaultOperator=or) or AND (-DSearchTest.QueryDefaultOperator=and) as the operator. And it allows you to pick your favorite set of default search fields (-DSearchTest.QueryDefaultFields=author:title:body, for example). It takes one argument, a query string, and outputs the re-written query after running it through the query parser.
So to evaluate the above query:

% java -classpath /import/lucene/lucene-1.4.1.jar:. \
    -DSearchTest.QueryDefaultFields="title:author" \
    -DSearchTest.QueryDefaultOperator=AND \
    -DSearchTest.QueryParser=old \
    SearchTest "cutting lucene"
query is (title:cutting title:lucene) (author:cutting author:lucene)
%

The class NewMultiFieldQueryParser does the combination at the inner level, using an override of "addClause", instead of the outer level. Note that it can't cover all cases (notably PhrasePrefixQuery, because that class has no access methods which allow one to introspect over it, and SpanQueries, because I don't understand them well enough :-). I post it here in advance of filing a formal bug report for early feedback. But it will show up in a bug report in the near future.

Running the above query with the new parser gives:

% java -classpath /import/lucene/lucene-1.4.1.jar:. \
    -DSearchTest.QueryDefaultFields="title:author" \
    -DSearchTest.QueryDefaultOperator=AND \
    -DSearchTest.QueryParser=new \
    SearchTest "cutting lucene"
query is +(title:cutting author:cutting) +(title:lucene author:lucene)
%

which I claim is what the user is expecting.

In addition, the new class uses an API more similar to QueryParser, so that the user has less to learn when using it. The code in it could probably just be folded into QueryParser, in fact.
Bill

the code for SearchTest:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.Searcher;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.WildcardQuery;
    import org.apache.lucene.search.PrefixQuery;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.RangeQuery;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.queryParser.MultiFieldQueryParser;
    import org.apache.lucene.queryParser.FastCharStream;
    import org.apache.lucene.queryParser.TokenMgrError;
    import org.apache.lucene.queryParser.ParseException;

    import java.io.File;
    import java.io.StringReader;
    import java.util.Date;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    class SearchTest {

        static class NewMultiFieldQueryParser extends QueryParser {

            static private final String DEFAULT_FIELD = "%%";

            private String[] fields = null;

            public NewMultiFieldQueryParser (String[] f, Analyzer a) {
                super(DEFAULT_FIELD, a);
                fields =