Re: MultiFieldQueryParser 1.8 isn't parsing phrases
Thanks

On Sat, 19 Feb 2005 16:09:49 +0100, Daniel Naber <[EMAIL PROTECTED]> wrote:
> On Saturday 19 February 2005 15:26, Ben wrote:
>
> > When I try to search for phrases using the MultiFieldQueryParser v1.8
> > from CVS, it gives me NullPointerException.
>
> This has just been fixed in SVN (I assume you mean SVN, CVS still exists
> but is read only and probably not updated anymore).
>
> Regards
> Daniel

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: MultiFieldQueryParser 1.8 isn't parsing phrases
On Saturday 19 February 2005 15:26, Ben wrote:
> When I try to search for phrases using the MultiFieldQueryParser v1.8
> from CVS, it gives me NullPointerException.

This has just been fixed in SVN (I assume you mean SVN; CVS still exists but is read-only and probably not updated anymore).

Regards
Daniel

--
http://www.danielnaber.de
MultiFieldQueryParser 1.8 isn't parsing phrases
Hi

When I try to search for phrases using the MultiFieldQueryParser v1.8 from CVS, it gives me a NullPointerException. The following query works:

    title:"IBM backs linux"

However, I get the exception if I use the following query:

    "IBM backs linux"

Any idea why? I am using this MultiFieldQueryParser with Lucene 1.4.3. Of course, I changed some of the boolean handling to make it work with the production release.

Thanks,
Ben
RE: analyzer effecting phrases?
: Therefore I turned back to the standard analyzer and now do some replacing
: of the underscores in my ID string to avoid my original problem. This solved

Maybe I'm missing something, but if you've got a field in your doc that represents an ID, why not create that field as "NonTokenized", so you don't have to worry about which characters the analyzer you're using thinks are special?

-Hoss
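To illustrate Hoss's suggestion, here is a minimal sketch in plain Java with no Lucene dependency. The class `IdFieldSketch`, its methods, and the sample ID are hypothetical. The `analyzed` method is only a stand-in for an analyzer that splits on non-alphanumeric characters; it is not StandardAnalyzer's real behaviour. In the Lucene 1.4 API the untokenized case would normally be created with `Field.Keyword(name, value)`, as far as I recall.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class IdFieldSketch {
    // Stand-in for an analyzed field (assumption: lowercases and splits
    // on anything that is not a letter or digit).
    static List<String> analyzed(String value) {
        List<String> terms = new ArrayList<>();
        for (String part : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }

    // Stand-in for an untokenized field: the whole value is a single term.
    static List<String> untokenized(String value) {
        List<String> terms = new ArrayList<>();
        terms.add(value);
        return terms;
    }

    public static void main(String[] args) {
        String id = "DOC__2004_12"; // hypothetical ID containing underscores
        System.out.println("analyzed:    " + analyzed(id));
        System.out.println("untokenized: " + untokenized(id));
        // An exact lookup of the full ID can only hit the untokenized terms.
        System.out.println("exact hit (analyzed):    " + analyzed(id).contains(id));
        System.out.println("exact hit (untokenized): " + untokenized(id).contains(id));
    }
}
```

With the ID kept as one term, the underscores never reach the analyzer, so no character replacement is needed before indexing.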
RE: Stopwords in phrases
> Are you also using the position increment of 0 for the "gram" tokens like Nutch does?

Yes. I don't think considering only "gram" tokens will work for me, because Nutch uses only bigrams, so it can have only one gram per token. In my case I have more than one gram per token, so even if I keep only the grams I will still have the same problem.

Ravi.
Re: Stopwords in phrases
On Dec 21, 2004, at 10:41 AM, Ravi wrote:
> I want to be able to use stopwords in exact phrase searches. I have looked at Nutch and used the same approach (replace common words with n-grams. Look at net.nutch.analysis.CommonGrams). So if "to","be","or" and "not" are stop words, for the string "to be or not to be", the analyzer produces the following tokens
>
> [to-be, to-be-or, to-be-or-not, to-be-or-not-to, to-be-or-not-to-be, be-or, be-or-not, be-or-not-to, be-or-not-to-be, or-not, or-not-to, or-not-to-be, not-to, not-to-be, to-be]

You've gone a bit beyond what Nutch is using. It creates bigrams, where you've expanded it to many more than that. Are you also using the position increment of 0 for the "gram" tokens like Nutch does?

> But I'm having a problem with the search. when I do a search on "not to be" the analyzer is converting my search into content:"not-to not-to-be to-be" because the analyzer produces the tokens "not-to","not-to-be","to-be"
>
> I'm getting 0 results on this as there is no token "not-to not-to-be to-be" in the index. I want just "not-to-be" from the analyzer during the search so when I search on "not to be" I will get the document which has "not-to-be" as a token. How can I use the same analyzer to get different results in indexing and searching?

Nutch does some different stuff between indexing and parsing queries...

    [java] 1: [the:] [the-quick:gram]
    [java] 2: [quick:]
    [java] 3: [brown:]
    [java] 4: [fox:]
    [java] query = (+url:"the quick brown"^4.0) (+anchor:"the quick brown"^2.0) (+content:"the-quick quick brown")

The first four lines show the analysis of "the quick brown fox". The last line is the resultant Lucene query for "the quick brown". Notice that only the "content" field gets analyzed specially, and that where a "gram" token exists at a position, only the "gram" is used in that field, not the plain token.

Does this help with your situation?

Erik
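The difference between index-time and query-time output that Erik shows can be sketched in plain Java as follows. This is a simplified illustration, not Nutch's actual CommonGrams code. The class name, the list of common words, and the rule "form a bigram only when the first word is common" are assumptions chosen so that the output lines up with the trace above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class CommonGramsSketch {
    // Hypothetical common-word list for the illustration.
    static final Set<String> COMMON =
            new HashSet<>(Arrays.asList("the", "to", "be", "or", "not"));

    // Index-time view: one inner list per position. A common word with a
    // successor gets a bigram stacked at the SAME position, which is what
    // a position increment of 0 means.
    static List<List<String>> indexTokens(String text) {
        String[] words = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        List<List<String>> positions = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            List<String> atPosition = new ArrayList<>();
            atPosition.add(words[i]);
            if (COMMON.contains(words[i]) && i + 1 < words.length) {
                atPosition.add(words[i] + "-" + words[i + 1]);
            }
            positions.add(atPosition);
        }
        return positions;
    }

    // Query-time view: where a gram exists at a position, keep only the gram;
    // otherwise keep the plain word.
    static List<String> queryTokens(String phrase) {
        List<String> out = new ArrayList<>();
        for (List<String> atPosition : indexTokens(phrase)) {
            out.add(atPosition.get(atPosition.size() - 1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(indexTokens("the quick brown fox"));
        System.out.println(queryTokens("the quick brown"));
    }
}
```

Because the gram and the plain word share a position at index time, the query phrase built from one token per position still lines up with the indexed positions.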
Stopwords in phrases
I want to be able to use stopwords in exact phrase searches. I have looked at Nutch and used the same approach (replace common words with n-grams; see net.nutch.analysis.CommonGrams). So if "to", "be", "or" and "not" are stop words, for the string "to be or not to be" the analyzer produces the following tokens:

[to-be, to-be-or, to-be-or-not, to-be-or-not-to, to-be-or-not-to-be, be-or, be-or-not, be-or-not-to, be-or-not-to-be, or-not, or-not-to, or-not-to-be, not-to, not-to-be, to-be]

This is exactly what I wanted from the analyzer during indexing. But I'm having a problem with the search. When I do a search on "not to be", the analyzer converts my search into

    content:"not-to not-to-be to-be"

because it produces the tokens "not-to", "not-to-be", "to-be". I'm getting 0 results on this, as there is no token "not-to not-to-be to-be" in the index. I want just "not-to-be" from the analyzer during the search, so that when I search on "not to be" I will get the document which has "not-to-be" as a token. How can I use the same analyzer to get different results in indexing and searching?

Thanks in advance,
Ravi.
Re: analyzer effecting phrases?
On Dec 20, 2004, at 12:43 PM, Peter Posselt Vestergaard wrote:
> Therefore I turned back to the standard analyzer and now do some replacing of the underscores in my ID string to avoid my original problem. This solved my phrase problem so that I can now search for phrases. However I still have the problem with ",.:" described above. As far as I can see the StandardAnalyzer (the StandardTokenizer that is) should tokenize words without the ",.:" characters. Am I mistaken? Is there a tokenizer that will do this?

StandardAnalyzer does tokenize without ",.:", though it will keep domain names together. Here's an example:

    $ ant -emacs AnalyzerDemo
    Buildfile: build.xml

    AnalyzerDemo:
    Demonstrates analysis of sample text.
    Refer to the "Analysis" chapter for much more on this extremely crucial topic.
    Press return to continue...
    String to analyze: [This string will be analyzed.]
    Example with commas, colons, and dots. You can get this code from http://www.lucenebook.com
    Running lia.analysis.AnalyzerDemo...
    Analyzing "Example with commas, colons, and dots. You can get this code from http://www.lucenebook.com"
    WhitespaceAnalyzer:
    [Example] [with] [commas,] [colons,] [and] [dots.] [You] [can] [get] [this] [code] [from] [http://www.lucenebook.com]
    SimpleAnalyzer:
    [example] [with] [commas] [colons] [and] [dots] [you] [can] [get] [this] [code] [from] [http] [www] [lucenebook] [com]
    StopAnalyzer:
    [example] [commas] [colons] [dots] [you] [can] [get] [code] [from] [http] [www] [lucenebook] [com]
    StandardAnalyzer:
    [example] [commas] [colons] [dots] [you] [can] [get] [code] [from] [http] [www.lucenebook.com]

    BUILD SUCCESSFUL
    Total time: 7 seconds
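For readers without the demo at hand, the first two outputs above can be approximated in plain Java. This is a rough sketch with hypothetical names, not the Lucene analyzers themselves. It assumes that whitespace tokenization splits only on whitespace, and that the SimpleAnalyzer behaviour amounts to lowercasing and splitting on every non-letter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class TokenizerSketch {
    // Split on whitespace only, so punctuation stays attached to the word.
    static List<String> whitespace(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    // Lowercase, then split on every character that is not a letter.
    static List<String> lowercaseLetters(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Example with commas, colons, and dots. "
                + "You can get this code from http://www.lucenebook.com";
        System.out.println(whitespace(text));
        System.out.println(lowercaseLetters(text));
    }
}
```

The whitespace variant is why "world" cannot be found in "Hello world." with a whitespace-based analyzer: the indexed term is "world." with the dot attached.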
RE: analyzer effecting phrases?
Hi again

Thanks for your answer, Otis. My analyzer did nothing other than use the WhitespaceAnalyzer/LowerCaseFilter. However, I found out that I got problems with characters such as ",.:" when searching because of my simple analyzer. (E.g. I would not be able to search for "world" in the string "Hello world." as the "." became part of the last word.)

Therefore I turned back to the standard analyzer and now do some replacing of the underscores in my ID string to avoid my original problem. This solved my phrase problem so that I can now search for phrases. However I still have the problem with ",.:" described above. As far as I can see the StandardAnalyzer (the StandardTokenizer that is) should tokenize words without the ",.:" characters. Am I mistaken? Is there a tokenizer that will do this?

Thanks for the help!
Regards
Peter

> Date: Mon, 20 Dec 2004 08:19:42 -0800 (PST)
> From: Otis Gospodnetic <[EMAIL PROTECTED]>
> Subject: analyzer effecting phrases?
> Content-Type: text/plain; charset=us-ascii
>
> When searching for phrases, what's important is the position of each
> token/word extracted by the Analyzer.
> WhitespaceAnalyzer/LowerCaseFilter don't do anything with the
> positional information. There is nothing else in your Analyzer?
>
> In any case, the following should help you see what your Analyzer is
> doing:
> http://wiki.apache.org/jakarta-lucene/AnalysisParalysis and you can
> augment the code there to provide positional information, too.
>
> Otis
>
> -Original Message-
> From: Peter Posselt Vestergaard [mailto:[EMAIL PROTECTED]
> Sent: 20. december 2004 15:24
> To: '[EMAIL PROTECTED]'
> Subject: analyzer effecting phrases?
>
> Hi
> I am building an index of texts, each related to a unique id.
> The unique ids might contain a number of underscores which
> will make the standardanalyzer shorten them after it sees the
> second underscore in a row. Furthermore many of the texts I
> am indexing is in Italian so the removal of 'trivial' words
> done by the standard analyzer is not necessarily meaningful
> for these texts. Therefore I am instead using an analyzer
> made from the WhitespaceTokenizer and the LowerCaseFilter.
> This works fine for me until I try searching for a phrase. I
> am searching for a simple phrase containing two words and
> with double-quotes around it. I have found the phrase in one
> of the texts so I know it should return at least one result,
> but none is found. If I remove the double-quotes and searches
> for the 2 words with AND between them I do find the story.
> Can anyone tell me if this is an obvious (side-)effect of not
> using the standard analyzer? And is there a better solution
> to my problem than using the very simple analyzer?
> Best regards
> Peter Vestergaard
> PS: I use the same analyzer for both searching and indexing
> (of course).
Re: analyzer effecting phrases?
When searching for phrases, what's important is the position of each token/word extracted by the Analyzer. WhitespaceAnalyzer/LowerCaseFilter don't do anything with the positional information. There is nothing else in your Analyzer?

In any case, the following should help you see what your Analyzer is doing: http://wiki.apache.org/jakarta-lucene/AnalysisParalysis and you can augment the code there to provide positional information, too.

Otis
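Otis's point about positions can be made concrete with a small plain-Java sketch. It is a simplified model of an exact phrase match, not Lucene's PhraseQuery implementation, and all names in it are hypothetical. Each term maps to the positions where it occurs, and a phrase matches only if its terms occur at consecutive positions in order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PhrasePositionsSketch {
    // Build term -> list of positions, in document order.
    static Map<String, List<Integer>> positions(String[] tokens) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int pos = 0; pos < tokens.length; pos++) {
            index.computeIfAbsent(tokens[pos], k -> new ArrayList<>()).add(pos);
        }
        return index;
    }

    // Exact phrase: term i of the phrase must sit at position start + i.
    static boolean matchesPhrase(String[] doc, String[] phrase) {
        if (phrase.length == 0) {
            return false;
        }
        Map<String, List<Integer>> index = positions(doc);
        List<Integer> starts = index.getOrDefault(phrase[0], Collections.emptyList());
        for (int start : starts) {
            boolean ok = true;
            for (int i = 1; i < phrase.length && ok; i++) {
                List<Integer> p = index.get(phrase[i]);
                ok = p != null && p.contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    // AND of terms: every term must occur somewhere; positions are ignored.
    static boolean matchesAll(String[] doc, String[] terms) {
        Map<String, List<Integer>> index = positions(doc);
        for (String term : terms) {
            if (!index.containsKey(term)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] doc = {"una", "bella", "giornata"};
        System.out.println(matchesPhrase(doc, new String[] {"bella", "giornata"}));
        System.out.println(matchesPhrase(doc, new String[] {"giornata", "bella"}));
        System.out.println(matchesAll(doc, new String[] {"giornata", "bella"}));
    }
}
```

A case where the AND query finds a document but the quoted phrase does not therefore points at the tokens or their positions, which is what the AnalysisParalysis code helps to inspect.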
analyzer effecting phrases?
Hi

I am building an index of texts, each related to a unique id. The unique ids might contain a number of underscores, which will make the StandardAnalyzer shorten them after it sees the second underscore in a row. Furthermore, many of the texts I am indexing are in Italian, so the removal of 'trivial' words done by the standard analyzer is not necessarily meaningful for these texts. Therefore I am instead using an analyzer made from the WhitespaceTokenizer and the LowerCaseFilter.

This works fine for me until I try searching for a phrase. I am searching for a simple phrase containing two words, with double quotes around it. I have found the phrase in one of the texts, so I know it should return at least one result, but none is found. If I remove the double quotes and search for the two words with AND between them, I do find the story.

Can anyone tell me if this is an obvious (side-)effect of not using the standard analyzer? And is there a better solution to my problem than using the very simple analyzer?

Best regards
Peter Vestergaard

PS: I use the same analyzer for both searching and indexing (of course).
Re: phrases
Try setting the slop factor on your phrase query. This should accomplish what you want. Set it to something like 10 and see what you get.

Erik

On Mar 16, 2004, at 8:55 PM, Supun Edirisinghe wrote:
> I have a field called businessname and this field contains keywords like
>
> "Georgian House"
> "Georgian"
> "The Georgian House Hotel"
> "Georgian blah blee bloo Hotel"
>
> along with 10,000s of other documents that have the word 'Hotel' somewhere in the businessname field.
>
> When I do a phrase query on "Georgian Hotel" I get only the one document back. I would like to get that one back as the top result but also the other stuff that has "Georgian" and "Hotel" too. Also, I'd like to have "Georgian House Hotel" show up before "Georgian blah blee bloo Hotel".
>
> Right now I do an or'd boolean query with each of the words in the search string as a Term in business name as well as the entire search string as an exact PhraseQuery and boost that by 3. But this doesn't allow me to ensure that "The Georgian House Hotel" will come before "Georgian blah blee bloo Hotel". (There are other fields queried besides business name, and in my instance of the index, "Georgian blah blee bloo Hotel" comes out with a better score because of other fields.)
>
> I would like the closeness of the phrase to be taken into account. Any ideas on constructing a good query for this situation?
>
> thanks
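To see what a slop of 10 would admit for the examples in the question, here is a plain-Java sketch with hypothetical names. It computes the smallest slop at which the phrase terms occur in order in a document. This is a simplification: as far as I know, Lucene's slop also tolerates reordered terms (at a higher cost), which this sketch ignores. In Lucene itself the slop is normally set with `PhraseQuery.setSlop(int)` or with the `"Georgian Hotel"~10` query syntax.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class SlopSketch {
    // Lowercase and split on whitespace (assumption: no stop-word removal).
    static List<String> tokens(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // Smallest number of extra positions between the phrase terms for an
    // in-order match, or -1 if the terms never occur in order.
    static int requiredSlop(List<String> doc, List<String> phrase) {
        if (phrase.isEmpty()) {
            return -1;
        }
        int best = -1;
        for (int start = 0; start < doc.size(); start++) {
            if (!doc.get(start).equals(phrase.get(0))) {
                continue;
            }
            int pos = start;
            boolean found = true;
            for (int i = 1; i < phrase.size() && found; i++) {
                int next = -1;
                // Take the nearest later occurrence of the next phrase term.
                for (int j = pos + 1; j < doc.size(); j++) {
                    if (doc.get(j).equals(phrase.get(i))) {
                        next = j;
                        break;
                    }
                }
                if (next < 0) {
                    found = false;
                } else {
                    pos = next;
                }
            }
            if (found) {
                int slop = (pos - start) - (phrase.size() - 1);
                if (best < 0 || slop < best) {
                    best = slop;
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> phrase = tokens("Georgian Hotel");
        for (String name : new String[] {"The Georgian House Hotel",
                "Georgian blah blee bloo Hotel", "Georgian House"}) {
            System.out.println(name + " -> " + requiredSlop(tokens(name), phrase));
        }
    }
}
```

Both hotel names fall within a slop of 10, and "The Georgian House Hotel" needs fewer extra positions than "Georgian blah blee bloo Hotel". To my understanding, Lucene's default scoring favours closer sloppy matches, which is the ordering asked for, though the other fields in the query still contribute to the final score.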
phrases
I have a field called businessname, and this field contains keywords like

    "Georgian House"
    "Georgian"
    "The Georgian House Hotel"
    "Georgian blah blee bloo Hotel"

along with 10,000s of other documents that have the word 'Hotel' somewhere in the businessname field.

When I do a phrase query on "Georgian Hotel" I get only the one document back. I would like to get that one back as the top result, but also the other stuff that has "Georgian" and "Hotel" too. Also, I'd like to have "Georgian House Hotel" show up before "Georgian blah blee bloo Hotel".

Right now I do an or'd boolean query with each of the words in the search string as a Term in business name, as well as the entire search string as an exact PhraseQuery, and boost that by 3. But this doesn't allow me to ensure that "The Georgian House Hotel" will come before "Georgian blah blee bloo Hotel". (There are other fields queried besides business name, and in my instance of the index, "Georgian blah blee bloo Hotel" comes out with a better score because of other fields.)

I would like the closeness of the phrase to be taken into account. Any ideas on constructing a good query for this situation?

thanks
Re: MultiFieldQueryParser & Phrases Problem
Cheers for that Erik.

> From: Erik Hatcher <[EMAIL PROTECTED]>
> Reply-To: "Lucene Users List" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Subject: Re: MultiFieldQueryParser & Phrases Problem
> Date: Sun, 21 Sep 2003 18:46:38 -0400
>
> StandardAnalyzer removes stop words and "a" is one of them. That is why you have issues with that phrase.
>
> Erik
Re: MultiFieldQueryParser & Phrases Problem
StandardAnalyzer removes stop words and "a" is one of them. That is why you have issues with that phrase.

Erik

On Sunday, September 21, 2003, at 06:13 PM, Niall Lennon wrote:
> I'm currently using the MultiFieldQueryParser to search across four fields. I'm searching for phrases so i've wrapped my search text in quotes... everything worked fine until i tried to execute a search ending with the 'A' and for some reason the A and quotes are ignored e.g.:
>
>     Analyzer analyzer = new StandardAnalyzer();
>     Searcher searcher = new IndexSearcher(IndexReader.open("dbindex"));
>     String[] fields = {"code_field", "short_description_field", "category_field", "manufacturer_field"};
>     int[] flags = {MultiFieldQueryParser.NORMAL_FIELD, MultiFieldQueryParser.NORMAL_FIELD, MultiFieldQueryParser.NORMAL_FIELD, MultiFieldQueryParser.NORMAL_FIELD};
>     Query query = MultiFieldQueryParser.parse("\"Category A\"", fields, flags, analyzer);
>     System.out.println("query -> " + query);
>     Hits hits = searcher.search(query);
>
> The System output for the above is as follows:
>
>     code_field:category short_description_field:category category_field:category manufacturer_field:category
>
> If i execute the same code with the following search text i get the expected results:
>
>     Query query = MultiFieldQueryParser.parse("\"Category Z\"", fields, flags, analyzer);
>
>     code_field:"category z" short_description_field:"category z" category_field:"category z" manufacturer_field:"category z"
>
> I'd appreciate any help with regards this matter...
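Erik's explanation can be traced step by step with a plain-Java sketch. The class and method names are hypothetical, and the stop list is a small illustrative subset rather than StandardAnalyzer's real list. The analyzer lowercases the text and drops stop words; if only one term survives, there is nothing left to form a phrase from, so the field clause ends up as a single term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordPhraseSketch {
    // Small illustrative subset of English stop words (assumption).
    static final Set<String> STOP =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to"));

    // Lowercase, split on non-alphanumerics, drop stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Render one field clause the way the output in the question looks:
    // one surviving term becomes field:term, several become field:"t1 t2".
    static String render(String field, String phraseText) {
        List<String> terms = analyze(phraseText);
        if (terms.isEmpty()) {
            return "";
        }
        if (terms.size() == 1) {
            return field + ":" + terms.get(0);
        }
        return field + ":\"" + String.join(" ", terms) + "\"";
    }

    public static void main(String[] args) {
        System.out.println(render("category_field", "Category A"));
        System.out.println(render("category_field", "Category Z"));
    }
}
```

If single letters such as "A" are meaningful category codes, one option is to search that field with an analyzer that does not remove stop words, using the same analyzer at index time.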
MultiFieldQueryParser & Phrases Problem
I'm currently using the MultiFieldQueryParser to search across four fields. I'm searching for phrases, so I've wrapped my search text in quotes. Everything worked fine until I tried to execute a search ending with 'A'; for some reason the A and the quotes are ignored, e.g.:

    Analyzer analyzer = new StandardAnalyzer();
    Searcher searcher = new IndexSearcher(IndexReader.open("dbindex"));
    String[] fields = {"code_field", "short_description_field", "category_field", "manufacturer_field"};
    int[] flags = {MultiFieldQueryParser.NORMAL_FIELD, MultiFieldQueryParser.NORMAL_FIELD, MultiFieldQueryParser.NORMAL_FIELD, MultiFieldQueryParser.NORMAL_FIELD};
    Query query = MultiFieldQueryParser.parse("\"Category A\"", fields, flags, analyzer);
    System.out.println("query -> " + query);
    Hits hits = searcher.search(query);

The System output for the above is as follows:

    code_field:category short_description_field:category category_field:category manufacturer_field:category

If I execute the same code with the following search text, I get the expected results:

    Query query = MultiFieldQueryParser.parse("\"Category Z\"", fields, flags, analyzer);

    code_field:"category z" short_description_field:"category z" category_field:"category z" manufacturer_field:"category z"

I'd appreciate any help with regards to this matter...
How does Lucene handle phrases containing words that are not indexed?
How does Lucene handle phrases (literals) containing words that are not indexed (e.g. stopwords, one-letter words, numbers)?

I did some tests (Lucene demo, my own 12 XML documents, Cocoon search), and in all cases it looks like a search for the phrase "a specification" also finds documents which contain "the specification" (or "D. Washington" instead of "G. Washington").

Of course you can change the indexing behaviour and make sure there are no stopwords, and that all one-letter words and numbers are indexed. But that seems a bad approach. A better approach:

1) find all indexed words in the phrase and, from these words, find all documents containing them;
2) check the occurrence of the phrase by opening the original document.

I am wondering: does Lucene perform step 2)? Of course this step burns some CPU cycles.

Hugo
[EMAIL PROTECTED]