Solr Phonetic Search returning documents but not Highlight Information
We have a pretty simple Solr schema:

  <fields>
    <field name="DocId" type="long" indexed="true" stored="true" required="true"/>
    <field name="DocTitle" type="string" indexed="true" stored="true" required="true"/>
    <field name="Content" type="text_general" indexed="false" stored="true" required="true"/>
    <field name="ContentSearch" type="text_general" indexed="true" stored="false" multiValued="true"/>
    <field name="ContentSearchStemming" type="text_stem" indexed="true" stored="false" multiValued="true"/>
    <field name="ContentSearchPhonetic" type="text_phonetic" indexed="true" stored="false" multiValued="true"/>
    <field name="ContentSearchSynonym" type="text_synonym" indexed="true" stored="false" multiValued="true"/>
    <field name="_version_" type="long" indexed="true" stored="true"/>
  </fields>

  <uniqueKey>DocId</uniqueKey>

  <copyField source="Content" dest="ContentSearch"/>
  <copyField source="Content" dest="ContentSearchStemming"/>
  <copyField source="Content" dest="ContentSearchPhonetic"/>
  <copyField source="Content" dest="ContentSearchSynonym"/>

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <fieldType name="text_stem" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SnowballPorterFilterFactory"/>
    </analyzer>
  </fieldType>

  <fieldType name="text_phonetic" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.PhoneticFilterFactory" encoder="Soundex" inject="false"/>
    </analyzer>
  </fieldType>

  <fieldType name="text_synonym" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    </analyzer>
  </fieldType>

We are indexing documents in Solr using SolrNet and have a requirement to support phonetic search based on the Soundex algorithm. Once the documents are indexed, we can run a phonetic query in the Solr Admin Panel and the relevant document is returned in the search results, but the highlight collection is blank.

Example use case:

- We index a text document that contains the word electromagnetic (Soundex code: E423).
- We execute a search in the Solr Admin Panel using the query ContentSearchPhonetic:electing (Soundex code: E423).
- The search returns one document, but the highlight collection is blank.

Solr is definitely using the phonetic Soundex algorithm to locate the document, because the word electing is not present in it. But somehow it is not able to return the highlight data.

The same schema and config successfully return documents along with highlight data for other approximate searches such as synonym, fuzzy, or stemming. Only for phonetic search do we not get the highlight data.

A screenshot from the Solr Admin Panel is available here: http://lucene.472066.n3.nabble.com/file/n4075492/HighlightIssue.png

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-Phonetic-Search-returning-documents-but-not-Highlight-Information-tp4075492.html
Sent from the Solr - User mailing list archive at Nabble.com.
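For reference, the claim that both words share the code E423 can be checked with a small Python sketch of classic American Soundex. This is an illustration only: the function name `soundex` is a placeholder, and the encoder behind Solr's PhoneticFilterFactory is a separate implementation that is merely assumed to agree with this one for these inputs.

```python
# Letter-to-digit table for classic American Soundex.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex code of `word` (illustrative sketch)."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated code counts again
        if len(encoded) == 4:
            break
    return encoded.ljust(4, "0")


print(soundex("electromagnetic"))
print(soundex("electing"))
```

Both calls produce the same code, which is why a query on ContentSearchPhonetic can match a document that never contains the literal query word.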
Re: Stemming query in Solr
Hi Erick,

Thanks for the reply.

Here is what the situation is:

Relevant portion of the Solr schema:

  <field name="Content" type="text_general" indexed="false" stored="true" required="true"/>
  <field name="ContentSearch" type="text_general" indexed="true" stored="false" multiValued="true"/>
  <field name="ContentSearchStemming" type="text_stem" indexed="true" stored="false" multiValued="true"/>

  <copyField source="Content" dest="ContentSearch"/>
  <copyField source="Content" dest="ContentSearchStemming"/>

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <fieldType name="text_stem" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SnowballPorterFilterFactory"/>
    </analyzer>
  </fieldType>

When I index a document, the content gets stored as-is in the Content field and is copied over to ContentSearch and ContentSearchStemming for text-based search and stemming search respectively. So the ContentSearchStemming field does hold the stemmed (reduced) form of the terms. I have checked this with Luke as well as with Admin Schema Browser --> Term Info. In the Admin Analysis screen, I have tested and found that if I index the text burning, it gets reduced to and stored as burn. So far so good.

Now in the UI:

- Let's say the user enters the term burn and checks the stemming option. The expectation is that, since the user has specified stemming, results should be returned for the term burn as well as for all terms whose stem is burn, i.e. burning, burned, burns, etc.
- Let's say the user enters the term burning and checks the stemming option. The expectation is that, since the user has specified stemming, results should be returned for the term burning as well as for all terms whose stem is burn, i.e. burn, burned, burns, etc.

The query that gets submitted to Solr: q=ContentSearchStemming:burning

From the debug info:

  <str name="rawquerystring">ContentSearchStemming:burning</str>
  <str name="querystring">ContentSearchStemming:burning</str>
  <str name="parsedquery">ContentSearchStemming:burn</str>
  <str name="parsedquery_toString">ContentSearchStemming:burn</str>

So, when the results are returned, I am only getting hits highlighted for the term burn, though the same document contains terms like burning and burns.

I thought that stemming should work like this:

1. The stemming filter in the query analyzer chain reduces the input word to its stem: burning --> burn
2. The query component scans through the terms and matches those whose stem equals the stem of the input term: burns --> burn (matches), burning --> burn

The first point is happening. But it looks like Solr is executing the search as an exact text-based match against the stem burn. Hence, burns or burned are not getting returned.

Hope I was able to make myself clear.

On Fri, 28 Jun 2013 05:59:37 -0700, Erick Erickson [via Lucene] <ml-node+s472066n4073901...@n3.nabble.com> wrote:

> First, this is for the Java version, I hope it extends to C#.
>
> But in your configuration, when you're indexing, the stemmer should be storing the reduced form in the index. Then, when searching, the search should be against the reduced term. To check this, try
> 1> Using the Admin/Analysis page to see what gets stored in your index and what your query is transformed to, to ensure that you're getting what you expect.
>
> If you want to get in deeper to the details, try
> 1> use, say, the TermsComponent or Admin/Schema Browser or Luke to look in your index and see what's actually there.
> 2> use &debug=query or Admin/Analysis to see what the query actually looks like.
>
> Both your use-cases should work fine just with reduction _unless_ the particular word you look for doesn't happen to trip the stemmer. By that I mean that since it's algorithmically based, there may be some edge cases that seem like they should be reduced that aren't. I don't know whether fisherman would reduce to fish, for instance.
>
> So are you seeing things that really don't work as expected or are you just working from the docs? Because I really don't see why you wouldn't get what you want given your description.
>
> Best
> Erick
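The reduction on both the index side and the query side, as described in this exchange, can be sketched in Python. This is a simplified model, not Solr's implementation: `toy_stem` is a hypothetical suffix stripper standing in for SnowballPorterFilterFactory, and the sample documents are invented for illustration.

```python
from collections import defaultdict


def toy_stem(word):
    """Crude suffix stripper used as a stand-in for a real stemmer (NOT Porter)."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Index time: only the reduced form of each token is kept."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index, term):
    """Query time: the query term is reduced the same way, then looked up."""
    return sorted(index.get(toy_stem(term), set()))


# Invented sample documents.
docs = {1: "the fire burns", 2: "burning wood", 3: "a burned log", 4: "fresh fish"}
index = build_index(docs)

print(sorted(index))             # the surface forms burns/burning/burned are gone
print(search(index, "burn"))     # documents 1, 2 and 3
print(search(index, "burning"))  # the same documents
```

In this model a single reduced term matches every document whose text contained any variant, even though the variants themselves are no longer in the index.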
Re: Stemming query in Solr
> So the general solution is to index the field twice, once with stemming and once without in order to have the ability to do both stemmed and exact matches

I am already indexing the text twice, using the ContentSearch and ContentSearchStemming fields. What this gives me is the ability to return burning as well as burn if the user specifies burning as the input search term, burning being the exact match:

  ContentSearch:burning + ContentSearchStemming:burn (reduced from ContentSearchStemming:burning)

What I cannot figure out is how this helps me instruct Solr to execute the query for the different grammatical variations of the input term's stem, i.e. a stemming query for burning expanding to a text-based query for burn, burns, burned, burning, etc.

You mentioned something about synonyms. This was also mentioned in the Solr Wiki:

  A related technology to stemming is lemmatization, which allows for "stemming by expansion", taking a root word and 'expanding' it to all of its various forms. Lemmatization can be used either at insertion time or at query time. Lucene/Solr does not have built-in support for lemmatization but it can be simulated by using your own dictionaries and the SynonymFilterFactory

I think what I need is exactly this point:

  Lucene/Solr does not have built-in support for lemmatization but it can be simulated by using your own dictionaries and the SynonymFilterFactory

But I am not sure how to go about it, or exactly how synonyms can help me here, as I am not looking for synonyms but rather for the different expansions of the stemmed word.

On Mon, 01 Jul 2013 03:42:34 -0700, Erick Erickson [via Lucene] <ml-node+s472066n4074311...@n3.nabble.com> wrote:

> bq: But looks like it is executing the search for an exact text based match with the stem burn.
>
> Right. You need to appreciate index time as opposed to query time stemming. Your field definition has both turned on. The admin/analysis page will help here <G>..
>
> At index time, the terms are stemmed, and _only_ the reduced term is put in the index. At query time, the same thing happens and _only_ the reduced term is searched for.
>
> By stemming at index time, you lose the original form of the word, it's just gone and nothing about checking/unchecking the stem bits will recover it. So the general solution is to index the field twice, once with stemming and once without in order to have the ability to do both stemmed and exact matches. I think I saw a clever approach to doing this involving a custom filter but can't find it now. As I recall it indexed the un-stemmed version like a synonym with some kind of marker to indicate exact match when necessary
>
> Best
> Erick
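One way to read the wiki's suggestion is to treat all forms of a word as one synonym group. The Python sketch below generates lines in the comma-separated equivalence format that a synonyms file accepts. The `WORD_FORMS` dictionary and the function name are hypothetical; a real dictionary would have to come from an external source, and the resulting file would be used on a field type that does not stem, such as one built on SynonymFilterFactory with expand=true.

```python
def synonym_lines(word_forms):
    """Build one comma-separated equivalence line per root word."""
    lines = []
    for root in sorted(word_forms):
        # Put the root first, then its forms, dropping duplicates in order.
        group = list(dict.fromkeys([root, *word_forms[root]]))
        lines.append(", ".join(group))
    return lines


# Hypothetical hand-made dictionary of grammatical variations.
WORD_FORMS = {
    "burn": ["burns", "burned", "burning", "burnt"],
    "apply": ["applies", "applied", "applying"],
}

for line in synonym_lines(WORD_FORMS):
    print(line)
```

With expand=true, every term in a group is treated as equivalent to every other, so a query for any one form would be expanded to all forms in that line.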
Re: Stemming query in Solr
I was just wondering if another solution might work. If, for cases where the stemming option is selected, we extract the stem of the input search term (maybe using a C#-based stemmer, some open source implementation of the Porter algorithm) and submit the query to Solr as a multiple-character wildcard query on the stem, it should return all the different variations of the stemmed word.

Example:

  Search Term: burning
  Stem: burn
  Modified Query: burn*
  Results: burn, burning, burns, burnt, etc.

I am sure this is not the proper way of executing stemming by expansion, but it might just get the job done. What do you think? I am trying to think of a test case where this would fail.
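The wildcard idea can be tried out on paper with a short Python sketch that models stem* as a plain prefix match over indexed terms. The term list is invented, the stems are passed in by hand rather than computed, and the stem appli for apply/applies/applied reflects my understanding of the Porter rules rather than output taken from this thread.

```python
def wildcard_query(field, stem):
    """Build a trailing-wildcard query string for the given stem."""
    return f"{field}:{stem}*"


def prefix_matches(stem, indexed_terms):
    """Model what stem* would match: every term starting with the stem."""
    return [term for term in indexed_terms if term.startswith(stem)]


# Invented list of unstemmed terms, as they might appear in ContentSearch.
terms = ["burn", "burning", "burns", "burnt", "burner", "burnish",
         "apply", "applied", "applies"]

print(wildcard_query("ContentSearch", "burn"))
print(prefix_matches("burn", terms))   # also picks up burner and burnish
print(prefix_matches("appli", terms))  # misses apply
```

Two kinds of failure show up in this model: over-matching, where unrelated words share the prefix, and under-matching, where the stem is not a literal prefix of every surface form.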
Stemming query in Solr
We have a search system based on Solr, using the SolrNet library in C#, which supports some advanced search features like fuzzy, synonym and stemming search. While all of these work, the expectation from the stemming search seems to be a combination of stemming by reduction as well as stemming by expansion to cover grammatical variations on a word. A use case will make it clearer:

- a search for fish would also find fishing
- a search for applied would also find applying, applies, and apply

We implemented stemming using a copyField with SnowballPorterFilterFactory. As a result, when searching for burning, results are returned for burning and burn, but when searching for burn, results are not returned for burning, burnt or burns.

Since all stemmers supported by Lucene/Solr use stemming by reduction, we are not sure how to go about this. As per the Solr Wiki:

  A related technology to stemming is lemmatization, which allows for "stemming by expansion", taking a root word and 'expanding' it to all of its various forms. Lemmatization can be used either at insertion time or at query time. Lucene/Solr does not have built-in support for lemmatization but it can be simulated by using your own dictionaries and the SynonymFilterFactory

We are not sure exactly how to go about this in Solr. Any ideas?

We were also thinking of using some C#-based stemmer/lemmatizer library to get the root of the word, using a public database like WordNet to extract the different grammatical variations of the stem, and then sending all of these terms to Solr in the query. We have not yet done much research to find a stable C# stemmer/lemmatizer and a WordNet C# API, but it seems this would get too convoluted, and there should be a way to do it from within Solr.

--
View this message in context: http://lucene.472066.n3.nabble.com/Stemming-query-in-Solr-tp4073862.html
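The client-side variant described in the last paragraph, where the variations are looked up outside Solr and all sent in the query, might look roughly like the Python sketch below. The `WORD_FORMS` table is a hypothetical stand-in for whatever a lemmatizer or WordNet lookup would return, and the function does no escaping of query syntax characters.

```python
def expand_query(field, term, word_forms):
    """Return a boolean OR query over all known forms of `term`."""
    term = term.lower()
    for root, forms in word_forms.items():
        group = list(dict.fromkeys([root, *forms]))
        if term in group:
            return f"{field}:(" + " OR ".join(group) + ")"
    # Unknown word: fall back to a plain single-term query.
    return f"{field}:{term}"


# Hypothetical lookup table of grammatical variations.
WORD_FORMS = {
    "burn": ["burns", "burned", "burning", "burnt"],
    "fish": ["fishes", "fished", "fishing"],
}

print(expand_query("ContentSearch", "burning", WORD_FORMS))
print(expand_query("ContentSearch", "fish", WORD_FORMS))
print(expand_query("ContentSearch", "solr", WORD_FORMS))
```

The expanded query targets the unstemmed ContentSearch field, so each listed form is matched literally and the quality of the results depends entirely on the coverage of the lookup table.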