Re: Query Question
Thanks Erik. Option 2 sounds like the path of least resistance.

Luke

----- Original Message -----
From: Erik Hatcher [EMAIL PROTECTED]
To: Lucene Users List lucene-user@jakarta.apache.org
Sent: Thursday, February 17, 2005 9:05 PM
Subject: Re: Query Question

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
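A minimal sketch of the option Luke settles on (Erik's second suggestion in this thread: replace asterisks in the indexed terms with a non-wildcard character, and make the matching substitution in queries). The class name, the choice of the currency sign as placeholder, and the convention that the user types \* for a literal asterisk are all assumptions made for illustration; none of them come from the thread or from Lucene. It uses only the JDK and presumes the field is indexed untokenized (as the Keyword field here is), so the placeholder reaches the index unchanged.

```java
/** Sketch of "option 2": swap literal asterisks for a character that is never a wildcard. */
final class AsteriskPlaceholder {
    // Hypothetical placeholder; pick any character that never occurs in the real data.
    static final char PLACEHOLDER = '\u00A4';

    /** Apply to a field value before it is indexed. */
    static String encodeForIndex(String value) {
        return value.replace('*', PLACEHOLDER);
    }

    /**
     * Turn user input into a wildcard pattern: "\*" means a literal asterisk
     * (mapped to the placeholder), while a bare "*" stays a wildcard.
     */
    static String encodeQueryPattern(String userPattern) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < userPattern.length(); i++) {
            char c = userPattern.charAt(i);
            boolean escapedStar = c == '\\'
                    && i + 1 < userPattern.length()
                    && userPattern.charAt(i + 1) == '*';
            if (escapedStar) {
                out.append(PLACEHOLDER); // literal asterisk becomes the placeholder
                i++;                     // skip the '*' that followed the backslash
            } else {
                out.append(c);           // wildcards and ordinary characters pass through
            }
        }
        return out.toString();
    }

    /** Restore asterisks when a stored value is shown to the user. */
    static String decodeForDisplay(String stored) {
        return stored.replace(PLACEHOLDER, '*');
    }

    public static void main(String[] args) {
        System.out.println(encodeForIndex("marcipan + home*"));
        System.out.println(encodeQueryPattern("*home\\**"));
    }
}
```

The string returned by encodeQueryPattern would then be used as the term text of the WildcardQuery, and every value written to the name field would go through encodeForIndex first. Existing documents would need to be reindexed for this to take effect.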
Re: Query Question
Hello;

My manager is now totally stuck on being able to query data with * in it.

Here are two queries:

    TermQuery(new Term("type", "203"));
    WildcardQuery(new Term("name", "*home\**"));

They are joined in a boolean query. That query gives this result when you call toString():

    +(type:203) +(name:*home\**)

This looks right to me. Any theories as to why it would not match:

Document (relevant fields):

    Keyword<type:203>
    Keyword<name:marcipan + home*>

Is the \ escaping both * characters?

Thanks,

Luke

----- Original Message -----
From: Luke Shannon [EMAIL PROTECTED]
To: Lucene Users List lucene-user@jakarta.apache.org
Sent: Thursday, February 17, 2005 2:44 PM
Subject: Query Question

Hello;

Why won't this query find the document below?

Query:

    +(type:203) +(name:*home\**)

Document (relevant fields):

    Keyword<type:203>
    Keyword<name:marcipan + home*>

I was hoping by escaping the * it would be treated as a string.

What am I doing wrong?

Thanks,

Luke
Re: Query Question
On Feb 17, 2005, at 5:51 PM, Luke Shannon wrote:
> My manager is now totally stuck about being able to query data with * in it.

He's gonna have to wait a bit longer, you've got a slightly tricky situation on your hands.

> WildcardQuery(new Term("name", "*home\**"));

The \* is the problem. WildcardQuery doesn't deal with escaping like you're trying. Your query is essentially this now:

    home\*

Where backslash has no special meaning at all... you're literally looking for all terms that start with home followed by a backslash. Two asterisks at the end really collapse into a single one logically.

> Any theories as to why it would not match:
>
> Document (relevant fields):
> Keyword<type:203>
> Keyword<name:marcipan + home*>
>
> Is the \ escaping both * characters?

So, again, no escaping is being done here. You're a bit stuck in this situation because * (and ?) are special to WildcardQuery, and it does no escaping. Two options I think of:

- Build your own clone of WildcardQuery that does escaping - or perhaps change the wildcard characters to something you do not index and use those instead.

- Replace asterisks in the terms indexed with some other non-wildcard character, then replace it on your queries as appropriate.

Erik
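To make the first option more concrete, here is a rough sketch of the matching logic that an escaping-aware clone of WildcardQuery would need: * matches any run of characters, ? matches exactly one, and a backslash makes the next pattern character literal. This is plain JDK code with a hypothetical class name. It is not Lucene's implementation, and how it would be wired into a term enumeration is deliberately left out. The recursive matching is simple rather than fast.

```java
/** Sketch of wildcard matching with backslash escaping (not Lucene code). */
final class EscapingWildcard {

    /** Returns true if the whole text matches the pattern. */
    static boolean matches(String pattern, String text) {
        return matches(pattern, 0, text, 0);
    }

    private static boolean matches(String p, int pi, String t, int ti) {
        while (pi < p.length()) {
            char c = p.charAt(pi);
            if (c == '*') {
                // Consecutive asterisks collapse into one.
                while (pi < p.length() && p.charAt(pi) == '*') {
                    pi++;
                }
                if (pi == p.length()) {
                    return true; // trailing '*' matches whatever is left
                }
                // Try every possible length for the run matched by '*'.
                for (int k = ti; k <= t.length(); k++) {
                    if (matches(p, pi, t, k)) {
                        return true;
                    }
                }
                return false;
            }
            if (ti >= t.length()) {
                return false; // pattern still needs a character, text is exhausted
            }
            if (c != '?') {
                if (c == '\\' && pi + 1 < p.length()) {
                    pi++;             // escaped: take the next character literally
                    c = p.charAt(pi);
                }
                if (c != t.charAt(ti)) {
                    return false;
                }
            }
            pi++;
            ti++;
        }
        return ti == t.length();
    }

    public static void main(String[] args) {
        // The pattern from this thread, written as a Java string literal.
        System.out.println(matches("*home\\**", "marcipan + home*"));
    }
}
```

With escaping in place, the pattern from the thread requires a literal asterisk after "home", so it matches the indexed term "marcipan + home*" but not a term such as "homework".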
Re: Boolean Phrase Query question
On Apr 3, 2004, at 12:13 PM, Ankur Goel wrote:
> Hi,
>
> I have to provide a functionality which provides search on both file name and contents of the file.
>
> For indexing I use the following code:
>
> org.apache.lucene.document.Document doc = new org.apache.lucene.document.Document();
> doc.add(Field.Keyword("fileId", "" + document.getFileId()));
> doc.add(Field.Text("fileName", fileName));
> doc.add(Field.Text("contents", new FileReader(new File(fileName))));

I'm not sure what you plan on doing with the fileName field, but you probably want to use a Keyword field for it. And you may want to glue the file name and contents together into a single field to facilitate searches to span both. (be sure to put a space in between if you do this)

> For searching a text say "temp" I use the following code to look both in file Name and contents of the file:
>
> BooleanQuery finalQuery = new BooleanQuery();
> Query titleQuery = QueryParser.parse("temp", "fileName", analyzer);
> Query mainQuery = QueryParser.parse("temp", "contents", analyzer);
> finalQuery.add(titleQuery, true, false);
> finalQuery.add(mainQuery, true, false);
> Hits hits = is.search(finalQuery);

By using true on the finalQuery.add calls, you have said that both fields must have the word "temp" in them. Is that what you meant? Or did you mean an OR type of query?

Erik
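A small illustration of why the space matters when gluing the file name and the contents into one field. This is only a sketch: the class and method names are made up, the whitespace split is a crude stand-in for a real analyzer, and it assumes the contents are available as a String (the code in the thread hands Lucene a Reader, so the file would have to be read into memory first).

```java
/** Sketch: combine file name and contents into one searchable string. */
final class CombinedField {

    /** Joins the two values with a single space so their tokens stay separate. */
    static String combine(String fileName, String contents) {
        return fileName + " " + contents;
    }

    /** Crude whitespace tokenizer, standing in for a real analyzer. */
    static String[] tokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        // With the space, the file name is its own token.
        for (String token : tokens(combine("temp.txt", "hello world"))) {
            System.out.println(token);
        }
        // Without it, the file name fuses with the first word of the contents.
        for (String token : tokens("temp.txt" + "hello world")) {
            System.out.println(token);
        }
    }
}
```

Without the separator, the last token of the file name and the first token of the contents run together, and a search for either one on its own would miss the document.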
RE: Boolean Phrase Query question
Thanks Erik for the solution. I have to keep the fileName field because I have to give the end user the facility to search on file name as well. That's why I am using Text for the file name too.

> By using true on the finalQuery.add calls, you have said that both fields must have the word "temp" in them. Is that what you meant? Or did you mean an OR type of query?

I need an OR type of query. I mean the word can be in the file name or in the contents of the file. But I am not able to do this. Can you tell me how to do it?

Regards,
Ankur

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Sunday, April 04, 2004 1:27 AM
To: Lucene Users List
Subject: Re: Boolean Phrase Query question
Re: Boolean Phrase Query question
On Apr 3, 2004, at 3:05 PM, Ankur Goel wrote:
>> By using true on the finalQuery.add calls, you have said that both fields must have the word "temp" in them. Is that what you meant? Or did you mean an OR type of query?
>
> I need an OR type of query. I mean the word can be in the filename or in the contents of the filename. But i am not able to do this. Can you tell me how to do it?

I did tell you how to do it. Use false for both required and prohibited flags when adding queries to a BooleanQuery. Check the javadocs for more details.

Keep in mind (and see recent, and frequent, discussion on this topic) that your analyzer choice is very important. Look at my intro Lucene article for code to allow you to view what is happening with the analysis process.

Erik
RE: Query question
Hi Erik,

Here is the IndexWriter with the Standard analyzer:

Class variable:

    IndexWriter writer;
    writer = new IndexWriter(indexDirectory, new StandardAnalyzer(), true);

While looping over the ResultSet I call this method:

    private void indexDoc(ResultSet rs) throws Exception {
        Document doc = new Document();
        doc.add(Field.UnIndexed("value", rs.getString("value")));
        doc.add(Field.UnIndexed("name", rs.getString("name")));
        doc.add(Field.UnStored("content", rs.getString("indexed")));
        writer.addDocument(doc);
    }

The indexed data is a concatenation of the Code and Descriptor(s) fields that they want to search by. They are concatenated with a space. Ex.

    Select col1 as value, col2 as name, col3 || ' ' || col2 || ' ' || col5 as indexed from tableName

Since there are many tables that are similar in structure, I wrote the queries like this so I could multi-thread the reindexing process on a frequent basis and use one generic class.

Here is my test search class:

    public IndexSearchTest(String search, String index) throws Exception {
        String indexName = dirLucene + index + "/";
        System.out.println("Index Name " + indexName);
        IndexSearcher searcher = new IndexSearcher(IndexReader.open(indexName));
        Query query = QueryParser.parse(search.toUpperCase(), "content", new StandardAnalyzer());
        Hits hits = searcher.search(query);
        Document result;
        System.out.println("Begin Search Results");
        for (int i = 0; i < hits.length(); i++) {
            result = hits.doc(i);
            System.out.println("Key : " + result.get("value") + " Desc: " + result.get("name"));
        }
        System.out.println("Finished Search: " + hits.length());
    }

Thanks in advance,
Justin

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Thursday, February 05, 2004 6:34 PM
To: Lucene Users List
Subject: Re: Query question

On Feb 5, 2004, at 3:27 PM, Justin Woody wrote:
> If I search the index for "building" it comes back fine (2 records) or "builder" (1 record), but if I search for "build*" I only receive one record, in my example, the second record. The client would like all 3 records to come back. Is there a way I can make that happen? I've been trying different query types and syntax, but haven't been able to succeed.

We need more details to know what is going on. What analyzer are you using with indexing? How are you building the query objects? QueryParser? Same Analyzer as with indexer? (Succinct) code is the best :)

Erik
Re: Query question
Everything you are doing looks ok to me. Next step is to run some sample text through something like the AnalyzerDemo.analyze method shown here:

http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html

Be sure to use real world data, although "builder building" would be a good first pass to ensure all is working well then.

If you are really searching for "build*" using the code you've shown (without the quotes!) then it should work from my quick look at what you've done.

Erik

On Feb 6, 2004, at 9:27 AM, Justin Woody wrote:
> Hi Erik,
>
> Here is the IndexWriter with the Standard analyzer:
> [...]
RE: Query question
Hi Erik,

The analysis class is parsing the terms as expected. However, no partial terms will return results. I've tried the following:

    build
    build*
    "build"
    "build*"

All return 0 hits unless the entire word (in this case build) appears. I've tried this with multiple keywords. Any other ideas?

Thanks,
Justin

Looking forward to your book, there's not enough info out there for Lucene.

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Friday, February 06, 2004 11:42 AM
To: Lucene Users List
Subject: Re: Query question
RE: Query question
Erik,

I think I found the problem. I thought queries were case sensitive, but after running your AnalyzerDemo, it turns out that all of my information was being indexed in lower case. When I did a toLowerCase() on my search string, instead of the toUpperCase() in the code I posted, the expected results were returned. Does this sound right?

Thanks
Justin

-----Original Message-----
From: Justin Woody [mailto:[EMAIL PROTECTED]
Sent: Friday, February 06, 2004 2:33 PM
To: 'Lucene Users List'
Subject: RE: Query question
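A sketch of the fix Justin describes, lowercasing the search string before it is parsed, with one refinement that is my addition rather than anything from the thread: the query parser's boolean operators (AND, OR, NOT) and the TO of range queries are only recognized in upper case, so a blanket toLowerCase() would turn them into ordinary terms. The class name is hypothetical and only the JDK is used. As far as I know, the reason whole words matched while "BUILD*" did not is that plain terms pass through the analyzer (which lowercases them) at query time, whereas wildcard and prefix terms bypass it.

```java
/** Sketch: lowercase a user query while keeping query-parser operators intact. */
final class QueryNormalizer {

    /** Lowercases each whitespace-separated token except AND, OR, NOT and TO. */
    static String normalize(String userQuery) {
        StringBuilder out = new StringBuilder();
        for (String token : userQuery.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // only happens for blank input
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            boolean operator = token.equals("AND") || token.equals("OR")
                    || token.equals("NOT") || token.equals("TO");
            out.append(operator ? token : token.toLowerCase(java.util.Locale.ROOT));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("BUILD*"));
        System.out.println(normalize("Builder AND BUILDING"));
    }
}
```

One caveat with this simple approach: a token such as Content:BUILD* would have its field name lowercased too, and field names are case sensitive, so it only suits queries against the default field or fields with all-lowercase names.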
RE: Newbie Phrase Query question
Actually, I found your QueryParser Rules article the most useful. It explained a number of things that I had puzzled about. Query.toString() helped also.

So, obvious in hindsight, an exact phrase match still goes through the tokenizer. If there are stop words, or you're stemming, etc., you need to tokenize the phrase before trying to get an exact match. Clearly, that has implications for what "exact phrase match" means. The toString() told me that the quotes are handled by the QueryParser. The WebLucene CJK tokenizer works just fine with it and I didn't make any changes to it.

The bad news is that after going through all of this, the code just started to work as expected. I'm not sure what I did to fix it.

There is a minor issue I found that I think works as documented, but I wonder why it's that way. If you enter a search string that's a hyphenated word such as "fred-bill" (w/o the quotes), the QueryParser generates a search string to find all documents with fred but w/o bill. I believe this is expected behavior based on the javadocs. The effect of this is that a hyphenated word gives unexpected results unless surrounded by quotes. Perhaps the syntax should have been "fred -bill" (space before the hyphen required) to indicate that you didn't want bill and that it's not a hyphenated word. Seems a tad more general.

It's an issue for me because my application deals with hyphenated words a lot, and I don't think my users would ever understand when quotes should be used and when they should not (most of them won't figure out how to use the "not" syntax). I can solve it by requiring the user to enter a space before the hyphen if they mean "not" and then have the search code automatically add the quotes for hyphenated words. It's just a little painful.

Just a thought for 1.4. ;-)

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 03, 2004 8:26 PM
To: Lucene Users List
Subject: Re: Newbie Phrase Query question
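The workaround Scott describes (treat a hyphen as negation only when it starts a token, and automatically quote hyphenated words before handing the string to QueryParser) could look roughly like this. It is a sketch under assumptions: the class name is invented, tokens are split on whitespace only, and tokens that already contain a double quote or a field prefix (a colon) are left untouched rather than handled properly.

```java
/** Sketch: wrap hyphenated words in quotes so the hyphen is not read as negation. */
final class HyphenQuoter {

    static String quoteHyphenated(String userQuery) {
        StringBuilder out = new StringBuilder();
        for (String token : userQuery.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // only happens for blank input
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            // Keep a leading '-' (prohibit) or '+' (require) outside the quotes.
            String prefix = "";
            String body = token;
            if (body.startsWith("-") || body.startsWith("+")) {
                prefix = body.substring(0, 1);
                body = body.substring(1);
            }
            int hyphen = body.indexOf('-', 1);
            boolean interiorHyphen = hyphen > 0 && hyphen < body.length() - 1;
            boolean leaveAlone = body.indexOf('"') >= 0 || body.indexOf(':') >= 0;
            if (interiorHyphen && !leaveAlone) {
                out.append(prefix).append('"').append(body).append('"');
            } else {
                out.append(token);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(quoteHyphenated("fred-bill"));
        System.out.println(quoteHyphenated("fred -bill"));
    }
}
```

A hyphen with whitespace before it stays a negation, while a hyphen inside a word causes the word to be quoted, so the whole unit reaches the analyzer as a phrase.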
Re: Newbie Phrase Query question
On Feb 5, 2004, at 8:19 PM, Scott Smith wrote:
> There is a minor issue I found that I think works as documented, but wonder why it's that way. If you enter a search string that's a hyphenated word such as "fred-bill" (w/o the quotes), the QueryParser generates a search string to find all documents with fred but w/o bill. I believe this is expected behavior based on the javadocs.

This is actually a documented bug that needs to be fixed. If there is no whitespace, the dash should not be taken as term negation, but rather the entire unit should be passed to the analyzer.

Erik
Newbie Phrase Query question
I'm having problems searching for an exact match with a phrase. Essentially, I think my problem is that the tokenizer is tossing the double quotes around the phrase and tokenizing each word, so I end up with the document hit I want plus several more I don't (the latter having some of the words, but not exact matches).

Here are the specifics.

First, I'm using the CJKTokenizer from WebLucene, which I believe is a modified version of the stopword tokenizer enhanced to handle Asian characters (that's according to the header; I don't think the Asian characters have anything to do with my problem).

The documents I need to search, for reasons related to the application, often end up with hyphenated words in critical places. For example, the original text to be indexed might be something like "this is Bill-Fred". When this is tokenized initially, I end up with two tokens, "bill" and "fred" (the tokenizer converts to lower case; "this" and "is" are removed as stop words; the hyphen is removed by the tokenizer). So far so good.

I pass the phrase I want an exact match on to a QueryParser in quotes (so "Bill-Fred" is the search string; quotes included). I watched the output of the tokenizer from the query parser and it is clearly tossing the double quotes and tokenizing each word separately. It passes the words "bill" and "fred" as separate entities back to the QueryParser. Looking at the tokenizer code, I understand why. Obviously, that's why I end up with documents that contain the words even if they are not exact matches.

Here's the question. I can modify the CJKTokenizer so that when it sees "Fred-Bill" it creates a single token that looks like "fred bill". Would this now work? Is this the right thing to do? I realize this means that I'd hit on "Fred-Bill" and "Fred Bill", but I can probably live with that.

However, it also seems like I now have a problem if the original text contains a quotation from someone that happens to be part of the document (i.e., the original text has double quotes in it). It seems like I need to ignore quotes for the initial index, but use them to build phrases when I'm tokenizing a search string in the QueryParser. Do I need two tokenizers?

Does any of this make any sense? I'm not quite sure what the QueryParser wants to see to properly do a phrase match. Is QueryParser the wrong thing to be using here?

Suggestions or comments?

Scott
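A toy model of the situation described in this message, to show how a phrase match relates to tokenization. It is plain JDK code and not Lucene: the class name, the four-word stop list and the split-on-non-alphanumerics rule are all assumptions chosen to mimic the behavior Scott reports (lowercasing, stop word removal, hyphen dropped). A phrase matches when its analyzed tokens appear consecutively, in order, among the document's analyzed tokens.

```java
/** Toy analyzer and phrase matcher (a stand-in for illustration, not Lucene). */
final class ToyPhraseMatch {
    // Hypothetical stop list; a real analyzer's list is longer.
    private static final java.util.Set<String> STOP_WORDS =
            new java.util.HashSet<String>(java.util.Arrays.asList("this", "is", "a", "the"));

    /** Lowercases, splits on anything that is not a letter or digit, drops stop words. */
    static java.util.List<String> analyze(String text) {
        java.util.List<String> tokens = new java.util.ArrayList<String>();
        for (String raw : text.toLowerCase(java.util.Locale.ROOT).split("[^a-z0-9]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    /** True if the analyzed phrase occurs as a consecutive run in the analyzed document. */
    static boolean containsPhrase(String document, String phrase) {
        java.util.List<String> wanted = analyze(phrase);
        if (wanted.isEmpty()) {
            return false;
        }
        return java.util.Collections.indexOfSubList(analyze(document), wanted) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(analyze("this is Bill-Fred"));
        System.out.println(containsPhrase("this is Bill-Fred", "\"Bill-Fred\""));
        System.out.println(containsPhrase("Fred met Bill", "\"Bill-Fred\""));
    }
}
```

In this model the quotes never need to reach the tokenizer: they only decide that the resulting tokens are matched as an ordered run rather than as independent words. That is also why "Bill-Fred" and "Bill Fred" are indistinguishable once analyzed, and why a single analyzer can serve both indexing and searching.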
Re: Newbie Phrase Query question
The best suggestion I have is to look at the code in my first java.net article (Intro Lucene) and borrow the Analyzer utility code to see what happens to a sample string as it is analyzed. Then pass that same string to QueryParser (along with the same analyzer) and see what the Query.toString(default field name) returns. This should shed light on the issue more clearly.

Erik

On Feb 3, 2004, at 10:01 PM, Scott Smith wrote:
> I'm having problems searching for an exact match with a phrase. [...]
Re: Query question
On Jan 12, 2004, at 7:49 PM, Scott Smith wrote:
> Does the following do that:
>
>     BooleanQuery QA = new BooleanQuery();
>     Query qa1 = QueryParser.parse("A1", "FieldA", analyzer());
>     Query qa2 = QueryParser.parse("A2", "FieldA", analyzer());
>     QA.add(qa1, false, false); // this term is not required
>     QA.add(qa2, false, false); // this term is not required
>
>     BooleanQuery QB = new BooleanQuery();
>     Query qb1 = QueryParser.parse("B1", "FieldB", analyzer());
>     Query qb2 = QueryParser.parse("B2", "FieldB", analyzer());
>     QB.add(qb1, false, false); // this term is not required
>     QB.add(qb2, false, false); // this term is not required
>
>     BooleanQuery Qfinal = new BooleanQuery();
>     Qfinal.add(QA, true, false); // gotta have at least one from here
>     Qfinal.add(QB, true, false); // gotta have at least one from here
>
>     hits = mySearcher.search(Qfinal);

Your use of QueryParser is unnecessary. Simply construct TermQuery's instead. Otherwise, what you are doing looks fine.

> I guess I'm assuming that if I add queries to a BooleanQuery and none
> of the items are required, there still needs to be a hit on at least
> one of the items for the Document to make it out of the BooleanQuery.

Right. A OR B means that either A or B has to be present; if neither is present, there is no match.

> Is this the right way to do this? Is there an easier/faster way to do
> the same thing?

You're asking a pretty general question - are you really just using two terms for each field? What you've shown based on the example (with the exception of using QueryParser) is fine.

Erik
RE: Query question
So I can write:

    Query q2 = new TermQuery(new Term("FieldA", "a1"));

and similar things for all of the QueryParser calls. This makes sense, and I assume it must be more efficient than using the QueryParser for simple terms. As you have guessed, there may be an arbitrary number of terms (not just 2), but they are all simple words. Some of the terms are generated programmatically and not entered explicitly by the user. But the code below (even using TermQuery) seems like it should generalize to an arbitrary number of terms.

I guess what is confusing me now is that the search code no longer references an analyzer???!!! How does it know how to tokenize, stem, etc. the search terms?

Thanks for the help,
Scott

-----Original Message-----
From: Erik Hatcher
Sent: Tuesday, January 13, 2004 6:27 AM
To: Lucene Users List
Subject: Re: Query question

> Your use of QueryParser is unnecessary. Simply construct TermQuery's
> instead. Otherwise, what you are doing looks fine. [...]
Re: Query question
On Jan 13, 2004, at 5:21 PM, Scott Smith wrote:
> I guess what is confusing me now is that the search code no longer
> references an analyzer???!!! How does it know how to tokenize, stem,
> etc. the search terms?

It doesn't. A TermQuery is exactly as-is. If you need the analysis part, you can use QueryParser or talk to an Analyzer directly and use the TokenStream it feeds you back to build TermQuery's by hand. I would not recommend using QueryParser for code-generated queries - there are just too many variables in that equation for comfort (to me).

Erik
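To make "exactly as-is" concrete, here is a small stand-alone sketch in plain Java. It does not use Lucene: `ExactTermSketch`, `analyze` and the two lookup helpers are hypothetical names, and the "analyzer" is just lower-casing plus splitting on non-alphanumerics. The point it tries to show is that a raw term only matches if it is already in the form the indexed terms were stored in.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model only: a set of strings stands in for the index's term dictionary.
final class ExactTermSketch {
    // Stand-in for an analyzer: lower-case and split on non-alphanumerics.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Terms as they would be stored after analysis at index time.
    static Set<String> indexTerms(String text) {
        return new HashSet<>(analyze(text));
    }

    // Analogue of a hand-built term query: the term is compared as given.
    static boolean exactLookup(Set<String> indexed, String rawTerm) {
        return indexed.contains(rawTerm);
    }

    // Analogue of analyzing first, then looking up each resulting token.
    static boolean analyzedLookup(Set<String> indexed, String rawTerm) {
        List<String> tokens = analyze(rawTerm);
        return !tokens.isEmpty() && indexed.containsAll(tokens);
    }

    public static void main(String[] args) {
        Set<String> indexed = indexTerms("Bill-Fred");
        System.out.println(exactLookup(indexed, "Bill"));    // false: case differs
        System.out.println(exactLookup(indexed, "bill"));    // true
        System.out.println(analyzedLookup(indexed, "Bill")); // true
    }
}
```

For code-generated terms this suggests running each word through the same normalisation used at index time before building the term, which is the "talk to an Analyzer directly" route mentioned above.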
Query question
I have two fields, call them FieldA and FieldB. I have a set of words I'm looking for in FieldA, call them A1 and A2. I have a different set of words for FieldB, call them B1 and B2. Now I want a hit list which contains items that have at least one A item in FieldA and one B item in FieldB. In essence, I think I'm saying I want

    (A1 OR A2) AND (B1 OR B2)

Does the following do that:

    BooleanQuery QA = new BooleanQuery();
    Query qa1 = QueryParser.parse("A1", "FieldA", analyzer());
    Query qa2 = QueryParser.parse("A2", "FieldA", analyzer());
    QA.add(qa1, false, false); // this term is not required
    QA.add(qa2, false, false); // this term is not required

    BooleanQuery QB = new BooleanQuery();
    Query qb1 = QueryParser.parse("B1", "FieldB", analyzer());
    Query qb2 = QueryParser.parse("B2", "FieldB", analyzer());
    QB.add(qb1, false, false); // this term is not required
    QB.add(qb2, false, false); // this term is not required

    BooleanQuery Qfinal = new BooleanQuery();
    Qfinal.add(QA, true, false); // gotta have at least one from here
    Qfinal.add(QB, true, false); // gotta have at least one from here

    hits = mySearcher.search(Qfinal);

I guess I'm assuming that if I add queries to a BooleanQuery and none of the items are required, there still needs to be a hit on at least one of the items for the Document to make it out of the BooleanQuery.

Is this the right way to do this? Is there an easier/faster way to do the same thing?

Scott
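The intended logic, (A1 OR A2) AND (B1 OR B2), can be checked against a few sample documents with a small plain-Java model. This is only a sketch of the boolean semantics, not Lucene code: `NestedBooleanSketch`, `anyOf` and `doc` are hypothetical helpers, and a document is modelled as a map from field name to its set of words.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of (A1 OR A2) AND (B1 OR B2); no scoring, no analysis.
final class NestedBooleanSketch {
    // Inner group: none of the words is individually required,
    // but at least one must be present in the given field.
    static boolean anyOf(Map<String, Set<String>> doc, String field, String... words) {
        Set<String> terms = doc.getOrDefault(field, Collections.emptySet());
        for (String w : words) {
            if (terms.contains(w)) {
                return true;
            }
        }
        return false;
    }

    // Outer query: both inner groups are required.
    static boolean matches(Map<String, Set<String>> doc) {
        return anyOf(doc, "FieldA", "A1", "A2") && anyOf(doc, "FieldB", "B1", "B2");
    }

    // Builds a two-field document from space-separated words.
    static Map<String, Set<String>> doc(String fieldA, String fieldB) {
        Map<String, Set<String>> d = new HashMap<>();
        d.put("FieldA", new HashSet<>(Arrays.asList(fieldA.split(" "))));
        d.put("FieldB", new HashSet<>(Arrays.asList(fieldB.split(" "))));
        return d;
    }

    public static void main(String[] args) {
        System.out.println(matches(doc("A1", "B2")));       // true
        System.out.println(matches(doc("A1 A2", "other"))); // false: nothing from the B group
        System.out.println(matches(doc("other", "B1")));    // false: nothing from the A group
    }
}
```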
RE: Query question
Otis,

Are you referring to this: "How do I retrieve all the values of a particular field that exists within an index, across all documents?"

I need a query to do it; the only way clients access the index is via queries, so they cannot write the code in the FAQ above.

Thanks,
Rob

-----Original Message-----
From: Otis Gospodnetic
Sent: Wednesday, September 10, 2003 5:05 PM
To: Lucene Users List
Subject: Re: Query question

> Go to Lucene FAQ at jGuru.com and search for the word 'all'.
RE: Query question
Aha. You can't do it with a query, unless you add a fixed-value field to all documents added to your index, e.g. field:X. Then you can get all documents by searching for +field:X

Otis

--- Rob Outar wrote:
> Are you referring to this: "How do I retrieve all the values of a
> particular field that exists within an index, across all documents?"
> I need a query to do it; the only way clients access the index is via
> queries, so they cannot write the code in the FAQ above.
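A rough sketch of the fixed-value field idea, adapted to the original question (finding every file that has an echelon field). It is plain Java rather than Lucene, and everything in it is an assumption made for illustration: the class `MarkerFieldSketch`, the marker field name `has_echelon` with value `true`, and the sample echelon values. The marker is added at indexing time only to documents that carry the field, so an ordinary exact-term search on the marker returns exactly those documents.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index keyed by "field:value"; not Lucene code.
final class MarkerFieldSketch {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private int nextId = 0;

    // Indexes one document and returns its id. If the document has an
    // "echelon" field, a fixed-value marker field is indexed as well.
    int add(Map<String, String> fields) {
        int id = nextId++;
        Map<String, String> toIndex = new HashMap<>(fields);
        if (fields.containsKey("echelon")) {
            toIndex.put("has_echelon", "true"); // hypothetical marker field
        }
        for (Map.Entry<String, String> e : toIndex.entrySet()) {
            postings.computeIfAbsent(e.getKey() + ":" + e.getValue(), k -> new TreeSet<>()).add(id);
        }
        return id;
    }

    // Exact-term lookup, the only kind of query this model supports.
    Set<Integer> search(String field, String value) {
        return postings.getOrDefault(field + ":" + value, Collections.emptySet());
    }

    public static void main(String[] args) {
        MarkerFieldSketch index = new MarkerFieldSketch();
        index.add(Collections.singletonMap("echelon", "alpha"));   // id 0
        index.add(Collections.singletonMap("name", "readme.txt")); // id 1
        index.add(Collections.singletonMap("echelon", "bravo"));   // id 2
        System.out.println(index.search("has_echelon", "true"));   // [0, 2]
    }
}
```

The trade-off is that the marker has to be written when documents are indexed, so existing documents would need re-indexing before such a query could find them.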
Query question
Hi all,

I have a field called echelon that is assigned to certain files. Is there a query I can write that will give me all files that have this field? I have tried stuff like echelon:.+*, echelon:*, etc... some give a query parser exception while others return nothing.

Let me know,
Rob
Re: Query question
Go to Lucene FAQ at jGuru.com and search for the word 'all'.

Otis

--- Rob Outar wrote:
> I have a field called echelon that are assigned to certain files. Is
> there a query I can write that will give me all files that have this
> field? [...]
Re: query question in trouble
Aviran Mordo wrote:
> "In" is probably a STOP word in your analyzer

Actually, I think it's not a good idea to apply stop words when the user searches for an exact string.

Ulrich
query question in trouble
Hello,

Upon reviewing the results of some queries recently I noticed that the query:

    in trouble

always searches for trouble. Is 'in' a keyword that I'm not aware of? I searched the whole query syntax page and didn't see it mentioned. I tried

    an trouble

and the query worked fine. The query parser appears to be stripping out 'in', but not doing anything with it. Here's my log:

    **Query: in trouble
    2003-06-11 12:08:50,540 DEBUG Searching for: textcontent:trouble (Query.toString())
    2003-06-11 12:08:50,569 DEBUG 6582 total matching documents

    **Query: an trouble
    2003-06-11 12:06:11,275 DEBUG Searching for: textcontent:an trouble (Query.toString())
    2003-06-11 12:06:12,342 DEBUG 1 total matching documents

Any ideas? Thanks.
RE: query question in trouble
"In" is probably a STOP word in your analyzer

-----Original Message-----
From: Ryan Clifton
Sent: Wednesday, June 11, 2003 3:13 PM
To: Lucene Users List
Subject: query question in trouble

> Upon reviewing the results of some queries recently I noticed that the
> query: in trouble always searches for trouble. Is 'in' a keyword that
> I'm not aware of? [...]
RE: query question in trouble
Actually, I'm using the StandardAnalyzer. I'm pretty much using an off-the-shelf implementation of Lucene.

-----Original Message-----
From: Aviran Mordo
Sent: Wednesday, June 11, 2003 12:50 PM
To: 'Lucene Users List'
Subject: RE: query question in trouble

> "In" is probably a STOP word in your analyzer
RE: query question in trouble
Ok, well you were right.

    public class StandardAnalyzer extends Analyzer {
      private Hashtable stopTable;

      /** An array containing some common English words that are usually not
          useful for searching. */
      public static final String[] STOP_WORDS = {
        "a", "and", "are", "as", "at", "be", "but", "by",
        "for", "if", "in", "into", "is", "it",
        "no", "not", "of", "on", "or", "s", "such",
        "t", "that", "the", "their", "then", "there", "these",
        "they", "this", "to", "was", "will", "with"
      };

Thanks.

-----Original Message-----
From: Ryan Clifton
Sent: Wednesday, June 11, 2003 12:52 PM
To: Lucene Users List
Subject: RE: query question in trouble

> Actually, I'm using the StandardAnalyzer. I pretty much using an
> off-the-shelf implementation of Lucene. [...]
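The stop list quoted above is enough to reproduce both log entries in a few lines of plain Java. This is a sketch of stop-word filtering in general, not the StandardAnalyzer implementation: `StopWordSketch` and `analyze` are hypothetical names, and tokenization here is just lower-casing and splitting on whitespace. Note that the list as quoted contains "in" but not "an".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy stop-word filter using the STOP_WORDS list quoted in the thread.
final class StopWordSketch {
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "and", "are", "as", "at", "be", "but", "by",
        "for", "if", "in", "into", "is", "it",
        "no", "not", "of", "on", "or", "s", "such",
        "t", "that", "the", "their", "then", "there", "these",
        "they", "this", "to", "was", "will", "with"));

    // Lower-case, split on whitespace, and drop anything in the stop list.
    static List<String> analyze(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(analyze("in trouble")); // [trouble]
        System.out.println(analyze("an trouble")); // [an, trouble]
    }
}
```

Because the same filtering is normally applied at index time, a stop word dropped from the query would not have been searchable in the index either.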
Re: Boolean Query Question
You should not get any documents that contain 'e' in the search results. -e means 'e' is verboten!

Otis

--- alex wrote:
> Hi all,
> If I enter a search, say: +a +b +c -e, this returns a set of results
> containing a AND b AND c. If I find in the results there is a term e
> as well, does this mean the search failed, or is this correct? Can
> someone explain please?
> Thanks,
> Alex
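The +/- semantics described above can be written down as a tiny plain-Java predicate. This is a model of required and prohibited clauses only, not Lucene's query parser: `RequiredProhibitedSketch` and `matches` are hypothetical names, a document is just a set of terms, and clauses without a prefix are ignored.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy model: "+term" must be present, "-term" must be absent.
final class RequiredProhibitedSketch {
    static boolean matches(Set<String> docTerms, String query) {
        for (String clause : query.trim().split("\\s+")) {
            if (clause.startsWith("+")) {
                if (!docTerms.contains(clause.substring(1))) {
                    return false; // a required term is missing
                }
            } else if (clause.startsWith("-")) {
                if (docTerms.contains(clause.substring(1))) {
                    return false; // a prohibited term is present
                }
            }
            // Unprefixed (optional) clauses are ignored by this model.
        }
        return true;
    }

    public static void main(String[] args) {
        String query = "+a +b +c -e";
        System.out.println(matches(new HashSet<>(Arrays.asList("a", "b", "c")), query));      // true
        System.out.println(matches(new HashSet<>(Arrays.asList("a", "b", "c", "e")), query)); // false
        System.out.println(matches(new HashSet<>(Arrays.asList("a", "b")), query));           // false
    }
}
```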
Re: Boolean Query Question
Thx for the answer

-----Original Message-----
From: Otis Gospodnetic
To: Lucene Users List
Sent: Friday, January 24, 2003 8:53 PM
Subject: Re: Boolean Query Question

> You should not get any documents that contain 'e' in the search
> results. -e means 'e' is verbotten! [...]