prefix search
Hi, when I index values like 'Joe Tom' and 'Terry' and then run a prefix query such as q=t*, I get both 'Joe Tom' and 'Terry' as results. But I only want results where the complete string starts with 'T', meaning I want only 'Terry' as the result. Can I do this? Thanks and regards, Radha Krishna.
Re: prefix search
That's because the phrases are being tokenized and then indexed by Solr. You have to define a new fieldType that is not tokenized: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory I'm not sure whether it will solve your problem.
-- Alireza Salimi, Java EE Developer
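To illustrate the difference between a tokenized and an untokenized field, here is a minimal Python sketch. It is not Solr code: `whitespace_tokens`, `keyword_token` and `prefix_search` are hypothetical helpers, and the lowercasing is an assumption made so that the prefix t matches 'Terry'.

```python
def whitespace_tokens(value):
    """Split a value into lowercased tokens, roughly like a tokenized text field."""
    return value.lower().split()


def keyword_token(value):
    """Keep the whole value as one lowercased token, like an untokenized field."""
    return [value.lower()]


def prefix_search(docs, prefix, tokenize):
    """Return the docs that have at least one indexed token starting with prefix."""
    prefix = prefix.lower()
    return [d for d in docs if any(t.startswith(prefix) for t in tokenize(d))]


docs = ["Joe Tom", "Terry"]

# Tokenized field: 'Joe Tom' matches too, because its token 'tom' starts with 't'.
tokenized_hits = prefix_search(docs, "t", whitespace_tokens)

# Untokenized field: only values whose complete string starts with 't' match.
keyword_hits = prefix_search(docs, "t", keyword_token)

print(tokenized_hits)
print(keyword_hits)
```

With the untokenized variant, only 'Terry' is returned for the prefix t, which is the behaviour asked for in the original question.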
Re: prefix search
I think what Radha Krishna (is this really her name?) means is something different: she wants to return only the matching token instead of the complete field value. Indeed, this is not possible. But you could use highlighting (http://wiki.apache.org/solr/HighlightingParameters) and then extract the matching part on your own. This shouldn't be too complicated.
-Kuli
meaning of underscore in prefix search.
Hello. I use facet.prefix and terms.prefix for my search. What is the meaning of the underscore _ in the results? When does Solr change part of a string into an underscore? Sometimes it makes no sense to show such a suggestion to the client ...

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
  <filter class="solr.TrimFilterFactory"/>
  <filter class="solr.StandardFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>

Thanks!
-- View this message in context: http://lucene.472066.n3.nabble.com/meaning-of-underscore-in-prefix-search-tp944120p944120.html
Re: Prefix-Search with Stopwords - no results?
On 28.05.2010 22:06, Chris Hostetter wrote:
> and one text_prefix defined similarly but with an additional EdgeNGramTokenFilter used when indexing to generate prefix tokens. then search those fields using dismax...

To be sure that I understand this correctly: am I right that I should not apply stopword filtering to the EdgeNGramTokenFilter field? Otherwise I would run into the same problems again, wouldn't I? Or, if stopword filtering is OK on this field: do you filter the stopwords before or after EdgeNGram tokenizing? Thanks, Gert
Re: Prefix-Search with Stopwords - no results?
Thank you, Chris and Erick, for the answers. It was new to me that the* is expanded to all known the* words in the index. Good to know. And yes, the AND operation between the query terms is certainly the problem. (I would like to switch to OR instead. The result set would grow the more words you search for, but as the results are ordered by hit quality this would be OK. But the customer does not like this behaviour, because he thinks that the more words you search for, the smaller the result set should become. So this is not an option.)

On 28.05.2010 22:06, Chris Hostetter wrote:
> word2*) ... in the client, that you instead consider using multiple fields -- one text defined as you have it now, and one text_prefix defined similarly but with an additional EdgeNGramTokenFilter used when indexing to generate prefix tokens. then search those fields using dismax...
>
>    q=word1 word2 word3
>    qf=text text_prefix
>    mm=100%
>    tie=0

OK, I will think about this. But I wonder whether this will be more efficient than just not filtering stopwords. (I have to study the EdgeNGram thing first, though. AFAIK it indexes all WORDS as WORDS, WORD, WOR, WO. So the index will be blown up, too?) What I do not understand in your idea is why I should use a second text_prefix field. Wouldn't it work with just this text_prefix field, without the normal text field, as I always search for word and word* and never without the prefix? Thanks, Gert
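The "WORDS, WORD, WOR, WO" expansion described above can be sketched in a few lines of Python. This is only an illustration of the edge n-gram idea, not Solr's filter: `edge_ngrams` is a hypothetical helper, and the `min_gram` and `max_gram` values are assumptions chosen to reproduce the example, not Solr defaults.

```python
def edge_ngrams(token, min_gram=2, max_gram=15):
    """Return the leading substrings of token with lengths min_gram..max_gram."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


# One indexed word becomes several indexed terms, one per prefix length.
grams = edge_ngrams("words")
print(grams)

# A prefix typed by the user is then found by an exact term lookup,
# without expanding a wildcard against the term list.
prefix_is_indexed = "wor" in grams
print(prefix_is_indexed)
```

Each word of length n contributes up to n - min_gram + 1 terms, which is where the index growth comes from.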
Re: Prefix-Search with Stopwords - no results?
Well, the index does, indeed, get bigger. But the searches get much faster because there's no term expansion going on. It's another time/space tradeoff. I'm afraid you'll just have to experiment a bit to see whether this is an acceptable tradeoff in your particular situation. The real memory hit in Lucene comes from *sorting* on a field with many unique terms. I don't think you'll sort on the NGram field, and disk space is cheap. Best, Erick
Prefix-Search with Stopwords - no results?
Hello, I am having some problems with Solr 1.4. I am indexing and querying data using the following fieldType:

<fieldType name="text_de_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de_de.txt" enablePositionIncrements="true"/>
    <filter class="solr.LengthFilterFactory" min="2" max="200"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_de_de.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de_de.txt" enablePositionIncrements="true"/>
    <filter class="solr.LengthFilterFactory" min="2" max="200"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

The application that uses Solr prepares the search string by filtering out dangerous characters such as brackets and wildcards, which might otherwise lead to invalid query syntax. Every word is searched for both as a normal word and as a prefix. E.g., the input

   für solr

is converted by the application to

   (für OR für*) AND (solr OR solr*)

This works fine for normal words. But if the input contains a stopword, like für in this example, Solr's stopword filtering turns the query into something like this:

   (für*) AND (solr OR solr*)

The problem now is (I think) that there is no für* in the indexed data anymore, because it was stopword filtered, too. If someone copies and pastes a sentence that contains a stopword from an indexed document, Solr will not find this document. As far as I understand, enablePositionIncrements=true only applies to phrase queries, not to my case of word OR word* queries. So, what should I do? Is there a better filter combination that I could try? Or am I doing something wrong conceptually? The only working solution I have found is not to use stopword filtering at all. Greetings, Gert
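The client-side rewrite described in this message can be sketched as follows. This is a minimal Python illustration, not the application's actual code: `build_prefix_query` is a hypothetical name, and the step that strips dangerous characters is left out.

```python
def build_prefix_query(user_input):
    """Turn 'w1 w2' into '(w1 OR w1*) AND (w2 OR w2*)'."""
    words = user_input.split()
    # Each word is searched both as an exact term and as a prefix.
    clauses = ["({0} OR {0}*)".format(word) for word in words]
    # All words are required, hence the AND between the clauses.
    return " AND ".join(clauses)


query = build_prefix_query("für solr")
print(query)
```

Because the clauses are joined with AND, a clause that can no longer match anything (such as the stopword case above) makes the whole query return no results.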
Re: Prefix-Search with Stopwords - no results?
Hmmm, I don't really see the problem here. I'll have to use English examples... Searching on the* (assuming "the" is a stopword) will search on (them OR theory OR thespian), assuming those three words are in your index. It will NOT search on "the". So I think you're OK, or are you seeing anomalous results? Conceptually, the underlying Lucene looks through your *existing* list of terms for the field to assemble a clause containing the OR of all the terms that match the wildcard. Since "the" isn't in the index, it doesn't get included. HTH, Erick
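The term expansion Erick describes can be pictured with a small Python sketch. It only models the concept (scan the existing terms, OR together those that match); it is not how Lucene is implemented, and `expand_prefix` and the sample term list are assumptions for illustration.

```python
def expand_prefix(index_terms, prefix):
    """Return the existing index terms that start with prefix, in sorted order."""
    return sorted(term for term in index_terms if term.startswith(prefix))


# "the" itself is missing: it was removed as a stopword at index time.
index_terms = {"them", "theory", "thespian", "solr", "search"}

expanded = expand_prefix(index_terms, "the")
clause = "(" + " OR ".join(expanded) + ")"
print(clause)
```

The cost of this step grows with the number of matching terms, which is the "term expansion" that a prefix-token field avoids.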
Re: Prefix-Search with Stopwords - no results?
: Searching on the* (assuming the is a stopword) will search on
: (them OR theory OR thespian) assuming those three words are in
: your index. It will NOT search on the. So I think you're OK, or are
: you seeing anomalous results?

I think the missing pieces of the puzzle here are:

1) Wildcard and prefix queries aren't analyzed, so the* (or für*) doesn't get analyzed, and the system has no way of spotting that it's a stopword that should be removed from the query -- nor should it in general, since the fact that "the" is a stopword doesn't mean the* is an invalid query. I could very conceivably be trying to find words like thespian.

2) By using the AND operator you are forcing both clauses to match...

: (für*) AND (solr OR solr*)

...so that query will only turn up results if a document containing a word that starts with solr and a word that starts with für exists in your index.

: The problem now is (as I think) that there is no für* anymore in the
: indexed data, because it was stopword filtered, too. If now someone

The *word* für doesn't exist in your index because it's a stopword, but there may be other words in your index starting with the prefix für -- and if those words appear in documents that also contain words starting with solr, then you will actually get matches.

: So, what should I do? Is there a better filter combination that I could
: try? Or am I doing something wrong conceptually? The only solution that I
: have found working is to not use stopword filtering at all.

I would suggest that instead of your existing approach of taking word1 word2 word3 ... and converting it to (word1 OR word1*) AND (word2 OR word2*) ... in the client, you consider using multiple fields -- one text field defined as you have it now, and one text_prefix field defined similarly but with an additional EdgeNGramTokenFilter used when indexing to generate prefix tokens. Then search those fields using dismax...

   q=word1 word2 word3
   qf=text text_prefix
   mm=100%
   tie=0

-Hoss
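As a rough illustration of how a client might send the parameters suggested above, here is a Python sketch using only the standard library. The q, qf, mm and tie values come from the message; `dismax_params` is a hypothetical helper, and selecting dismax via defType is an assumption, since how the parser is chosen depends on the Solr version and request handler configuration.

```python
from urllib.parse import urlencode


def dismax_params(user_input):
    """Build request parameters for a dismax search over both fields."""
    return {
        "defType": "dismax",       # assumption: may instead be set in the handler config
        "q": user_input,           # raw user words, no client-side OR/AND rewriting
        "qf": "text text_prefix",  # search the normal field and the prefix field
        "mm": "100%",              # every word must match in at least one field
        "tie": "0",
    }


params = dismax_params("word1 word2 word3")
query_string = urlencode(params)
print(query_string)
```

The client no longer appends * to each word; prefix matching comes from the prefix tokens indexed in text_prefix.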
Highlighting on Prefix-Search Bug/Workaround (Re: query with stemming, prefix and fuzzy?)
Mark Miller wrote:
>> Currently I am thinking about dropping the stemming and only using prefix search. But as highlighting does not work with a prefix query such as house*, this is a problem for me. The hint to use house?* instead does not work here.
>
> That's because wildcard queries are also not highlightable now. I actually have somewhat of a solution to this that I'll work on soon (I've gotten the groundwork for it in, or ready to be in, Lucene). No guarantee on when or if it will be accepted in Solr, though.

As I am writing in Perl (using WebService::Solr), I found a workaround: I use the Search::Tools module to highlight manually in those cases where Solr does not return snippets. This seems to work fine, but the drawback is that I need Solr to return the full data field in a query, which can be expensive for larger documents. But I hope this is just a temporary workaround until Solr 1.4... Thanks, Gert
Re: prefix-search ignores the lowerCaseFilter
On Thu, 2007-10-25 at 10:48 -0400, Yonik Seeley wrote:
> On 10/25/07, Max Scheffler wrote:
> > Is it possible that the prefix-processing ignores the filters?
>
> Yes, it's a known limitation that we haven't worked out a fix for yet. The issue is that you can't just run the prefix through the filters because of things like stop words, stemming, minimum length filters, etc.
>
> -Yonik

What about having, in addition to facet.prefix, a facet.filtered.prefix that runs the prefix through the filters? Would that be possible? Cheers, Martin
Re: prefix-search ignores the lowerCaseFilter
On 10/29/07, Martin Grotzke wrote:
> What about not having only facet.prefix but additionally facet.filtered.prefix that runs the prefix through the filters? Would that be possible?

The underlying issue remains - it's not safe to treat the prefix like any other word when running it through the filters. -Yonik
Re: prefix-search ignores the lowerCaseFilter
On Mon, 2007-10-29 at 13:31 -0400, Yonik Seeley wrote:
> The underlying issue remains - it's not safe to treat the prefix like any other word when running it through the filters.

Yes, the user of this feature should definitely know what it does - but at least there would be the possibility to run the prefix through, e.g., a LowerCaseFilter. After all, the user knows which filters he has configured. E.g., if you only want a case-insensitive prefix test, something like a facet.filtered.prefix would be really valuable. Cheers, Martin
prefix-search ignores the lowerCaseFilter
Hi, I want to perform a prefix search which ignores case. To do this I created a fieldType called suggest:

<fieldType name="suggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Entries (terms) could be 'foo', 'bar'... A request like

   http://localhost:8983/solr/select/?rows=0&facet=true&q=*:*&facet.field=suggest&facet.prefix=f

returns things like

<lst name="facet_counts">
  <lst name="facet_queries"/>
  <lst name="facet_fields">
    <lst name="suggest">
      <int name="foo">12</int>
    </lst>
  </lst>
</lst>

But a request like

   http://localhost:8983/solr/select/?rows=0&facet=true&q=*:*&facet.field=suggest&facet.prefix=F

returns just:

<lst name="facet_counts">
  <lst name="facet_queries"/>
  <lst name="facet_fields">
    <lst name="suggest"/>
  </lst>
</lst>

That's not what I expected, because the field definition contains a LowerCaseFilter. Is it possible that the prefix processing ignores the filters? Max
Re: prefix-search ignores the lowerCaseFilter
On 10/25/07, Max Scheffler wrote:
> Is it possible that the prefix-processing ignores the filters?

Yes, it's a known limitation that we haven't worked out a fix for yet. The issue is that you can't just run the prefix through the filters because of things like stop words, stemming, minimum length filters, etc. -Yonik
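Since the prefix is not run through the field's filters, one possible client-side workaround for the case-only situation is to lowercase the prefix before sending it. The Python sketch below is an assumption, not something proposed in this thread: `facet_prefix_params` is a hypothetical helper, and the approach only makes sense when lowercasing is the only normalization applied to the indexed terms.

```python
from urllib.parse import urlencode


def facet_prefix_params(field, typed_prefix):
    """Build facet request parameters with the prefix lowercased on the client."""
    return {
        "q": "*:*",
        "rows": "0",
        "facet": "true",
        "facet.field": field,
        # Indexed terms are lowercase, so the prefix must be lowercase as well.
        "facet.prefix": typed_prefix.lower(),
    }


params = facet_prefix_params("suggest", "F")
url = "http://localhost:8983/solr/select/?" + urlencode(params)
print(url)
```

This does not address stemming, stop words or length filters, which are the cases where transforming a partial word is unsafe.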