Re: Boolean Phrase Query question
On Apr 3, 2004, at 3:05 PM, Ankur Goel wrote:
> "By using true on the finalQuery.add calls, you have said that both fields must have the word "temp" in them. Is that what you meant? Or did you mean an OR type of query?"
>
> I need an OR type of query. I mean the word can be in the file name or in the contents of the file. But I am not able to do this. Can you tell me how to do it?

I did tell you how to do it. Use false for both the required and prohibited flags when adding queries to a BooleanQuery. Check the javadocs for more details.

Keep in mind (and see recent, and frequent, discussion on this topic) that your analyzer choice is very important. Look at my intro Lucene article for code that lets you view what is happening in the analysis process.

Erik

> Regards,
> Ankur
>
> -----Original Message-----
> From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> Sent: Sunday, April 04, 2004 1:27 AM
> To: Lucene Users List
> Subject: Re: Boolean Phrase Query question
>
> On Apr 3, 2004, at 12:13 PM, Ankur Goel wrote:
> > Hi,
> > I have to provide functionality for searching on both the file name and the contents of the file.
> >
> > For indexing I use the following code:
> >
> >     org.apache.lucene.document.Document doc = new org.apache.lucene.document.Document();
> >     doc.add(Field.Keyword("fileId", "" + document.getFileId()));
> >     doc.add(Field.Text("fileName", fileName));
> >     doc.add(Field.Text("contents", new FileReader(new File(fileName))));
>
> I'm not sure what you plan on doing with the fileName field, but you probably want to use a Keyword field for it. And you may want to glue the file name and contents together into a single field to facilitate searches that span both (be sure to put a space in between if you do this).
>
> > For searching a text, say "temp", I use the following code to look in both the file name and the contents of the file:
> >
> >     BooleanQuery finalQuery = new BooleanQuery();
> >     Query titleQuery = QueryParser.parse("temp", "fileName", analyzer);
> >     Query mainQuery = QueryParser.parse("temp", "contents", analyzer);
> >
> >     finalQuery.add(titleQuery, true, false);
> >     finalQuery.add(mainQuery, true, false);
> >
> >     Hits hits = is.search(finalQuery);
>
> By using true on the finalQuery.add calls, you have said that both fields must have the word "temp" in them. Is that what you meant? Or did you mean an OR type of query?
>
> Erik

- To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
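To make the effect of the two flags concrete, here is a minimal plain-Java sketch with no Lucene dependency. It only models how the required and prohibited flags passed to BooleanQuery.add combine; the class ClauseModel, its Clause type and the matches method are hypothetical names invented for this illustration and are not Lucene's implementation.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of boolean clause semantics; not Lucene code.
class ClauseModel {

    /** Whether one sub-query hit a document, plus the two flags given to add(). */
    static final class Clause {
        final boolean hit;
        final boolean required;
        final boolean prohibited;

        Clause(boolean hit, boolean required, boolean prohibited) {
            this.hit = hit;
            this.required = required;
            this.prohibited = prohibited;
        }
    }

    /** Returns true if a document with these per-clause hits matches overall. */
    static boolean matches(List<Clause> clauses) {
        boolean anyRequired = false;
        boolean anyOptionalHit = false;
        for (Clause c : clauses) {
            if (c.prohibited) {
                if (c.hit) {
                    return false;           // a prohibited clause matched
                }
            } else if (c.required) {
                anyRequired = true;
                if (!c.hit) {
                    return false;           // a required clause is missing
                }
            } else if (c.hit) {
                anyOptionalHit = true;      // optional clause matched
            }
        }
        // With no required clauses, at least one optional clause must match.
        return anyRequired || anyOptionalHit;
    }

    public static void main(String[] args) {
        // Assume "temp" occurs in the contents but not in the file name.
        boolean inFileName = false;
        boolean inContents = true;

        boolean andResult = matches(Arrays.asList(
                new Clause(inFileName, true, false),
                new Clause(inContents, true, false)));
        boolean orResult = matches(Arrays.asList(
                new Clause(inFileName, false, false),
                new Clause(inContents, false, false)));

        System.out.println("required=true  (AND): " + andResult); // false
        System.out.println("required=false (OR):  " + orResult);  // true
    }
}
```

In terms of the code from the question, this corresponds to calling finalQuery.add(titleQuery, false, false) and finalQuery.add(mainQuery, false, false).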
RE: Boolean Phrase Query question
Thanks, Erik, for the solution. I have to keep the fileName field because I have to give the end user the facility to search on the file name as well. That's why I am using Text for the file name too.

"By using true on the finalQuery.add calls, you have said that both fields must have the word "temp" in them. Is that what you meant? Or did you mean an OR type of query?"

I need an OR type of query. I mean the word can be in the file name or in the contents of the file. But I am not able to do this. Can you tell me how to do it?

Regards,
Ankur

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Sunday, April 04, 2004 1:27 AM
To: Lucene Users List
Subject: Re: Boolean Phrase Query question

On Apr 3, 2004, at 12:13 PM, Ankur Goel wrote:
> Hi,
> I have to provide functionality for searching on both the file name and the contents of the file.
>
> For indexing I use the following code:
>
>     org.apache.lucene.document.Document doc = new org.apache.lucene.document.Document();
>     doc.add(Field.Keyword("fileId", "" + document.getFileId()));
>     doc.add(Field.Text("fileName", fileName));
>     doc.add(Field.Text("contents", new FileReader(new File(fileName))));

I'm not sure what you plan on doing with the fileName field, but you probably want to use a Keyword field for it. And you may want to glue the file name and contents together into a single field to facilitate searches that span both (be sure to put a space in between if you do this).

> For searching a text, say "temp", I use the following code to look in both the file name and the contents of the file:
>
>     BooleanQuery finalQuery = new BooleanQuery();
>     Query titleQuery = QueryParser.parse("temp", "fileName", analyzer);
>     Query mainQuery = QueryParser.parse("temp", "contents", analyzer);
>
>     finalQuery.add(titleQuery, true, false);
>     finalQuery.add(mainQuery, true, false);
>
>     Hits hits = is.search(finalQuery);

By using true on the finalQuery.add calls, you have said that both fields must have the word "temp" in them. Is that what you meant? Or did you mean an OR type of query?

Erik
Re: Boolean Phrase Query question
On Apr 3, 2004, at 12:13 PM, Ankur Goel wrote:
> Hi,
> I have to provide functionality for searching on both the file name and the contents of the file.
>
> For indexing I use the following code:
>
>     org.apache.lucene.document.Document doc = new org.apache.lucene.document.Document();
>     doc.add(Field.Keyword("fileId", "" + document.getFileId()));
>     doc.add(Field.Text("fileName", fileName));
>     doc.add(Field.Text("contents", new FileReader(new File(fileName))));

I'm not sure what you plan on doing with the fileName field, but you probably want to use a Keyword field for it. And you may want to glue the file name and contents together into a single field to facilitate searches that span both (be sure to put a space in between if you do this).

> For searching a text, say "temp", I use the following code to look in both the file name and the contents of the file:
>
>     BooleanQuery finalQuery = new BooleanQuery();
>     Query titleQuery = QueryParser.parse("temp", "fileName", analyzer);
>     Query mainQuery = QueryParser.parse("temp", "contents", analyzer);
>
>     finalQuery.add(titleQuery, true, false);
>     finalQuery.add(mainQuery, true, false);
>
>     Hits hits = is.search(finalQuery);

By using true on the finalQuery.add calls, you have said that both fields must have the word "temp" in them. Is that what you meant? Or did you mean an OR type of query?

Erik
Boolean Phrase Query question
Hi,
I have to provide functionality for searching on both the file name and the contents of the file.

For indexing I use the following code:

    org.apache.lucene.document.Document doc = new org.apache.lucene.document.Document();
    doc.add(Field.Keyword("fileId", "" + document.getFileId()));
    doc.add(Field.Text("fileName", fileName));
    doc.add(Field.Text("contents", new FileReader(new File(fileName))));

For searching a text, say "temp", I use the following code to look in both the file name and the contents of the file:

    BooleanQuery finalQuery = new BooleanQuery();
    Query titleQuery = QueryParser.parse("temp", "fileName", analyzer);
    Query mainQuery = QueryParser.parse("temp", "contents", analyzer);

    finalQuery.add(titleQuery, true, false);
    finalQuery.add(mainQuery, true, false);

    Hits hits = is.search(finalQuery);

But I am not getting any results. I think the problem is due to searching on two fields. Can you please tell me how to go about it?

Regards,
Ankur
Re: Zero hits for queries ending with a number
Extremely well said, Tatu!

On Apr 3, 2004, at 11:24 AM, Tatu Saloranta wrote:
> On Saturday 03 April 2004 08:34, [EMAIL PROTECTED] wrote:
> > On Saturday 03 April 2004 17:11, Erik Hatcher wrote:
> > > No objections that error messages and such could be made clearer. Patches welcome! Care to submit better error message handling in this case? Or perhaps allow lower-case "to"?
> >
> > I think the best would be if Lucene would simply have a setCaseSensitive(boolean).
> >
> > IMHO it's in any case a bad idea to make searches case-sensitive (per default).
>
> I'd have to disagree. I think that a search engine core should not have to bother with details of character sets, such as lower-casing. Rules for lower/upper/initial/mixed case for all Unicode languages are rather involved... and if you tried to do that, the next thing would be whether accentuation and umlaut marks should matter or not (which is language-dependent).
>
> That's why to me the natural way to go is to do direct comparison, ignoring case when executing queries. This does not prevent anyone from implementing such functionality (see below).
>
> I think the architecture and design of the Lucene core are delightfully simple. One can easily create case-independent functionality by using proper analyzers and (for the most part) configuring QueryParser.
>
> I would agree, however, that QueryParser is a "victim of its success"; it's too often used in situations where one really should create a proper GUI that builds the query. Backend code can then mangle input as it sees fit, and build query objects. QueryParser is more natural for quick-n-dirty scenarios, where one just has to slap something together quickly, or if one only has a textual interface to deal with. It's a nice thing to have, but it has its limitations; there's no way to create one parser that's perfect for every use(r).
>
> What could be done would be to make sure all examples / demo web apps implement case-insensitive indexing and searching, since that is often what is needed?
>
> -+ Tatu +-
>
> > > But, also, folks need to really step back and practice basic troubleshooting skills. I asked you if that string was what you passed to the QueryParser and you said yes, when in fact it was not. And you
> >
> > I forgot that I did lower-case it. In fact I even output it in its original state but lower-case it just before I pass it to Lucene. That lower-casing is what I would call a hack and hence it's no surprise that I forgot it :-)
> >
> > Timo
Re: Zero hits for queries ending with a number
On Saturday 03 April 2004 08:34, [EMAIL PROTECTED] wrote:
> On Saturday 03 April 2004 17:11, Erik Hatcher wrote:
> > No objections that error messages and such could be made clearer. Patches welcome! Care to submit better error message handling in this case? Or perhaps allow lower-case "to"?
>
> I think the best would be if Lucene would simply have a setCaseSensitive(boolean).
>
> IMHO it's in any case a bad idea to make searches case-sensitive (per default).

I'd have to disagree. I think that a search engine core should not have to bother with details of character sets, such as lower-casing. Rules for lower/upper/initial/mixed case for all Unicode languages are rather involved... and if you tried to do that, the next thing would be whether accentuation and umlaut marks should matter or not (which is language-dependent).

That's why to me the natural way to go is to do direct comparison, ignoring case when executing queries. This does not prevent anyone from implementing such functionality (see below).

I think the architecture and design of the Lucene core are delightfully simple. One can easily create case-independent functionality by using proper analyzers and (for the most part) configuring QueryParser.

I would agree, however, that QueryParser is a "victim of its success"; it's too often used in situations where one really should create a proper GUI that builds the query. Backend code can then mangle input as it sees fit, and build query objects. QueryParser is more natural for quick-n-dirty scenarios, where one just has to slap something together quickly, or if one only has a textual interface to deal with. It's a nice thing to have, but it has its limitations; there's no way to create one parser that's perfect for every use(r).

What could be done would be to make sure all examples / demo web apps implement case-insensitive indexing and searching, since that is often what is needed?

-+ Tatu +-

> > But, also, folks need to really step back and practice basic troubleshooting skills. I asked you if that string was what you passed to the QueryParser and you said yes, when in fact it was not. And you
>
> I forgot that I did lower-case it. In fact I even output it in its original state but lower-case it just before I pass it to Lucene. That lower-casing is what I would call a hack and hence it's no surprise that I forgot it :-)
>
> Timo
Re: Zero hits for queries ending with a number
On Apr 3, 2004, at 10:34 AM, [EMAIL PROTECTED] wrote:
> I forgot that I did lower-case it. In fact I even output it in its original state but lower-case it just before I pass it to Lucene. That lower-casing is what I would call a hack and hence it's no surprise that I forgot it :-)

But why even lowercase? That is what an analyzer typically does anyway (look at the output from AnalysisDemo to see).

Note that there is a switch on QueryParser (MultiFieldQueryParser is lacking in this respect, another reason not to use it) that lowercases wildcard terms automatically: setLowercaseWildcardTerms(true). Wildcard terms are not analyzed by QueryParser, so this was added to account for it.

Erik
Re: Zero hits for queries ending with a number
On Saturday 03 April 2004 17:11, Erik Hatcher wrote:
> No objections that error messages and such could be made clearer. Patches welcome! Care to submit better error message handling in this case? Or perhaps allow lower-case "to"?

I think the best would be if Lucene would simply have a setCaseSensitive(boolean).

IMHO it's in any case a bad idea to make searches case-sensitive (per default).

> But, also, folks need to really step back and practice basic troubleshooting skills. I asked you if that string was what you passed to the QueryParser and you said yes, when in fact it was not. And you

I forgot that I did lower-case it. In fact I even output it in its original state but lower-case it just before I pass it to Lucene. That lower-casing is what I would call a hack and hence it's no surprise that I forgot it :-)

Timo
Re: Zero hits for queries ending with a number
On Apr 3, 2004, at 9:59 AM, [EMAIL PROTECTED] wrote:
> On Saturday 03 April 2004 15:19, Erik Hatcher wrote:
> > date:[20030101 TO 20030202]
>
> I found the/my bug. Since Lucene is case-sensitive, I do lower-case all queries for user's convenience. The ParseException is thrown because the "TO" becomes "to". Well, I really think Lucene needs to daff such stumbling blocks aside...

No objections that error messages and such could be made clearer. Patches welcome! Care to submit better error message handling in this case? Or perhaps allow lower-case "to"?

But, also, folks need to really step back and practice basic troubleshooting skills. I asked you if that string was what you passed to the QueryParser and you said yes, when in fact it was not. And you slowly fed more details of your scenario (MFQP, some German SnowballAnalyzer variant). Reduce the variables in the equation and narrow things down until it works and then incrementally add complexity. I cannot encourage folks enough to try some JUnit test-driven *learning* by exploring various scenarios.

Erik
Re: Zero hits for queries ending with a number
On Saturday 03 April 2004 15:19, Erik Hatcher wrote:
> date:[20030101 TO 20030202]

I found the/my bug. Since Lucene is case-sensitive, I do lower-case all queries for user's convenience. The ParseException is thrown because the "TO" becomes "to". Well, I really think Lucene needs to daff such stumbling blocks aside...
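If a query string really has to be lowercased before it reaches the parser, one way to avoid the failure described above is to leave the upper-case operator keywords untouched. The following is only a sketch in plain Java under stated assumptions: the class QueryCase and the method lowerCaseKeepingOperators are hypothetical names, tokens are split naively on whitespace, and the operator list (AND, OR, NOT, TO) is an assumption about which keywords the query syntax treats as case-sensitive. Note that it also lowercases field names, which only works if the indexed field names are lower-case.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; not part of Lucene.
class QueryCase {

    // Keywords assumed to be significant only in upper case.
    private static final Set<String> OPERATORS =
            new HashSet<>(Arrays.asList("AND", "OR", "NOT", "TO"));

    /** Lowercases every whitespace-separated token except operator keywords. */
    static String lowerCaseKeepingOperators(String query) {
        StringBuilder out = new StringBuilder();
        for (String token : query.trim().split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(OPERATORS.contains(token)
                    ? token
                    : token.toLowerCase(Locale.ROOT));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The range operator survives, so the range query stays parseable.
        System.out.println(lowerCaseKeepingOperators("date:[20030101 TO 20030202]"));
        // Ordinary terms and field names are lowercased; AND is kept.
        System.out.println(lowerCaseKeepingOperators("Lucene AND Title:Foo"));
    }
}
```

A plain query.toLowerCase() would turn the first example into "date:[20030101 to 20030202]", which is the input that triggered the ParseException in this thread.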
Re: Zero hits for queries ending with a number
Ok, we're getting somewhere now. So, where is the exception you encountered when using this utility code?! (i.e. it didn't throw an exception, so something is different in your usage in your code).

I tried this:

    Query query = MultiFieldQueryParser.parse("date:[20030101 TO 20030202]",
        new String[] { "id", "title", "summary", "contents", "date" },
        new GermanAnalyzer());
    System.out.println("query = " + query.toString());

And it worked fine (it only duplicated the query for each field). No exception at all. Of course I'm guessing on your analyzer, since you didn't provide that detail (although it shouldn't matter for the exception you experienced).

On Apr 3, 2004, at 6:06 AM, [EMAIL PROTECTED] wrote:
> SnowballAnalyzer("German2"):
>
> Analzying "http://www.yahoo.com/foo/bar.html";
> org.apache.lucene.analysis.snowball.SnowballAnalyzer:
>     [http] [www.yahoo.com] [foo] [bar.html]

So this is the analyzer you want to use, right? Wildcards should work on "www.yahoo.*"

What is the "German2" stemmer for Snowball? You've introduced a lot of variables to your equation here: MultiFieldQueryParser and a non-standard Snowball stemmer. I had to pull all of these details out of you, and each of them is critical to understanding the problem.

> > analyzer you are using, and also do the same on .toString of the query you parsed. Those two pieces of info will tell all.
>
> "url:www.yahoo* url:www.yahoo* url:www.yahoo* url:www.yahoo* url:www.yahoo* url:www.yahoo*"
>
> Well, I actually use a MultiFieldQueryParser; that's probably why the term appears so often. Strange parser: it should be clear that an explicit "url:xyz" should only look in the url field, shouldn't it?

Do you really need to query on multiple fields? Why not just use the plain QueryParser? If you need an aggregate field, create one at index time. QueryParsing is problematic enough, but adding in MFQP makes it even more complicated.

Which Analyzer are you using for indexing? This same SnowballAnalyzer with the "German2" stemmer?

Erik
Re: Zero hits for queries ending with a number
On Saturday 03 April 2004 11:48, Erik Hatcher wrote:
> Provide us the results of running your url through that, using the same

SnowballAnalyzer("German2"):

Analzying "http://www.yahoo.com/foo/bar.html";
org.apache.lucene.analysis.WhitespaceAnalyzer:
    [http://www.yahoo.com/foo/bar.html]
org.apache.lucene.analysis.SimpleAnalyzer:
    [http] [www] [yahoo] [com] [foo] [bar] [html]
org.apache.lucene.analysis.StopAnalyzer:
    [http] [www] [yahoo] [com] [foo] [bar] [html]
org.apache.lucene.analysis.standard.StandardAnalyzer:
    [http] [www.yahoo.com] [foo] [bar.html]
org.apache.lucene.analysis.snowball.SnowballAnalyzer:
    [http] [www.yahoo.com] [foo] [bar.html]

> analyzer you are using, and also do the same on .toString of the query you parsed. Those two pieces of info will tell all.

"url:www.yahoo* url:www.yahoo* url:www.yahoo* url:www.yahoo* url:www.yahoo* url:www.yahoo*"

Well, I actually use a MultiFieldQueryParser; that's probably why the term appears so often. Strange parser: it should be clear that an explicit "url:xyz" should only look in the url field, shouldn't it?

Timo
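The difference between the first two lines of that output can be reproduced with a small plain-Java approximation. This is a sketch, not Lucene code: TokenizeSketch and its two methods are hypothetical stand-ins that split on whitespace and on non-letter characters (with lowercasing) respectively, which is roughly the behaviour the output above shows for WhitespaceAnalyzer and SimpleAnalyzer. The grammar-based tokenization of StandardAnalyzer and SnowballAnalyzer is not modelled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical approximation of two tokenization strategies; not Lucene code.
class TokenizeSketch {

    /** Splits on runs of whitespace only; tokens are otherwise left intact. */
    static List<String> whitespaceTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    /** Keeps maximal runs of letters, lowercased; everything else separates tokens. */
    static List<String> lowercasedLetterTokens(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                out.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        String url = "http://www.yahoo.com/foo/bar.html";
        System.out.println(whitespaceTokens(url));       // [http://www.yahoo.com/foo/bar.html]
        System.out.println(lowercasedLetterTokens(url)); // [http, www, yahoo, com, foo, bar, html]
    }
}
```

Which of these token lists ends up in the index determines what a term or wildcard query on the url field can match.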
Re: Simple date/range question
On Saturday 03 April 2004 11:53, Erik Hatcher wrote:
> I didn't catch in your first message that it was throwing a ParseException; this is odd. Are you certain that "date:[20030101 TO 20030202]" is the complete string you're passing to QueryParser? Did

Yes.

> you subclass QueryParser? If so, what is that code? (what is the

No. I use a MultiFieldQueryParser:

    Query qQuery = MultiFieldQueryParser.parse(query,
        new String[] { "id", "title", "summary", "contents", "date" },
        GERMAN_ANALYZER);
    Hits hits = searcher.search(qQuery);

> complete stack trace?)

    [java] 12:38:03,109 ERROR [view.SearchAction] org.apache.lucene.queryParser.ParseException: Encountered "20030404" at line 1, column 18.
    [java] Was expecting:
    [java]     "]" ...
    [java] org.apache.lucene.queryParser.ParseException: Encountered "20030404" at line 1, column 18.
    [java] Was expecting:
    [java]     "]" ...
    [java]     at org.apache.lucene.queryParser.QueryParser.generateParseException(QueryParser.java:994)
    [java]     at org.apache.lucene.queryParser.QueryParser.jj_consume_token(QueryParser.java:874)
    [java]     at org.apache.lucene.queryParser.QueryParser.Term(QueryParser.java:657)
    [java]     at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:521)
    [java]     at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:464)
    [java]     at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:108)
    [java]     at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:87)
    [java]     at org.apache.lucene.queryParser.MultiFieldQueryParser.parse(MultiFieldQueryParser.java:115)
Re: Simple date/range question
On Apr 3, 2004, at 4:07 AM, [EMAIL PROTECTED] wrote:
> On Friday 02 April 2004 17:03, [EMAIL PROTECTED] wrote:
> > date:[20030101 TO 20030202]
>
>     [java] 11:05:53,735 ERROR [view.SearchAction] org.apache.lucene.queryParser.ParseException: Encountered "20030202" at line 1, column 18.
>     [java] Was expecting:
>     [java]     "]" ...
>
> Why is this?

I didn't catch in your first message that it was throwing a ParseException; this is odd. Are you certain that "date:[20030101 TO 20030202]" is the complete string you're passing to QueryParser? Did you subclass QueryParser? If so, what is that code? (what is the complete stack trace?)

Erik
Re: Zero hits for queries ending with a number
On Apr 3, 2004, at 3:19 AM, [EMAIL PROTECTED] wrote:
> > You *can* use wildcards with keywords (in fact, a keyword really has no meaning once indexed - everything is a "term" at that point).
>
> Well, I just tried. I also was surprised actually - but it just didn't work. I can use wildcards for
>
>     doc.add(Field.Text("url", row.getString("url")));
>
> but I cannot for
>
>     doc.add(Field.Keyword("url", row.getString("url")));
>
> > - create a utility (I've posted one on the list in the past) that shows what your analyzer is doing graphically.
>
> Interesting. Can you give me subject/date of that posting?

AnalysisDemo in this article: http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html

Provide us the results of running your url through that, using the same analyzer you are using, and also do the same on .toString of the query you parsed. Those two pieces of info will tell all.

Erik
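One likely explanation for the Keyword/Text difference, which the thread itself does not confirm, is that a Keyword field is indexed as a single untokenized term, so a prefix-style wildcard only matches if that whole term starts with the prefix. The sketch below models this in plain Java; PrefixSketch and anyTermStartsWith are hypothetical names, and the term lists are assumptions based on the example URL and the StandardAnalyzer output reported elsewhere in this thread.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of prefix matching against indexed terms; not Lucene code.
class PrefixSketch {

    /** A trailing-wildcard query matches if any indexed term starts with the prefix. */
    static boolean anyTermStartsWith(List<String> terms, String prefix) {
        for (String term : terms) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed index contents: an untokenized field holds the URL as one term ...
        List<String> keywordTerms = Arrays.asList("http://www.yahoo.com/foo/bar.html");
        // ... while a tokenized field holds the tokens reported for StandardAnalyzer.
        List<String> textTerms = Arrays.asList("http", "www.yahoo.com", "foo", "bar.html");

        System.out.println(anyTermStartsWith(textTerms, "www.yahoo"));           // true
        System.out.println(anyTermStartsWith(keywordTerms, "www.yahoo"));        // false
        System.out.println(anyTermStartsWith(keywordTerms, "http://www.yahoo")); // true
    }
}
```

Under this reading, wildcards do work on the untokenized field, but the prefix has to begin where the stored term begins, including the "http://" part.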
Re: Simple date/range question
On Friday 02 April 2004 17:03, [EMAIL PROTECTED] wrote:
> date:[20030101 TO 20030202]

    [java] 11:05:53,735 ERROR [view.SearchAction] org.apache.lucene.queryParser.ParseException: Encountered "20030202" at line 1, column 18.
    [java] Was expecting:
    [java]     "]" ...

Why is this?
Re: Zero hits for queries ending with a number
On Friday 02 April 2004 23:48, Erik Hatcher wrote:
> On Apr 2, 2004, at 10:00 AM, [EMAIL PROTECTED] wrote:
> > On Saturday 13 March 2004 11:06, Otis Gospodnetic wrote:
> > > Field.Keyword is suitable for storing data like Url. Give that a try.
> >
> > I just tried this a minute ago and found that I cannot use wildcards with Keywords: url:www.yahoo.*
>
> You *can* use wildcards with keywords (in fact, a keyword really has no meaning once indexed - everything is a "term" at that point).

Well, I just tried. I also was surprised actually - but it just didn't work. I can use wildcards for

    doc.add(Field.Text("url", row.getString("url")));

but I cannot for

    doc.add(Field.Keyword("url", row.getString("url")));

> - create a utility (I've posted one on the list in the past) that shows what your analyzer is doing graphically.

Interesting. Can you give me subject/date of that posting?

Timo