Re: Phonetic search and matching
What happens if you do NOT inject? Setting inject=false stores only the phonetic reduction, not the original text. In that case your false match on 13 would go away. Not sure what that means for the rest of your app, though.

Best
Erick

On Mon, Feb 6, 2012 at 5:44 AM, Dirk Högemann dirk.hoegem...@googlemail.com wrote:
> [...] Does anyone have an idea how to handle this in a reasonable way, so that a search for puf does not match 13 in the content?
Re: Phonetic search and matching
Thanks Erick.

At first we thought of removing numbers with a pattern filter. Setting inject to false will have the same effect. If we want to be able to search for numbers in the content, that solution will not work, but adding another field without phonetic filtering and searching in both fields would be OK, right?

Dirk

On 07.02.2012 at 14:01, Erick Erickson erickerick...@gmail.com wrote:
> What happens if you do NOT inject? Setting inject=false stores only the phonetic reduction, not the original text. [...]
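
One way the two-field idea above could look in schema.xml is sketched below. It is untested and rests on several assumptions: the field and type names (content_exact, content_phonetic, text_de_plain, text_de_phonetic) are made up for illustration, the WordDelimiterFilterFactory step from the original analyzer is left out for brevity, and the digit-dropping step (PatternReplaceFilterFactory followed by LengthFilterFactory) is only a guess at how the "pattern filter" might be set up. The intent is that digit-only tokens never reach the phonetic field, so a literal 13 cannot coincide with a phonetic code there, while numbers stay searchable through the plain field.

```xml
<!-- Hypothetical names throughout; adapt to the real schema. -->
<fieldType name="text_de_plain" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
  </analyzer>
</fieldType>

<fieldType name="text_de_phonetic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Assumption: blank out digit-only tokens, then drop the resulting empty tokens -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="^[0-9]+$" replacement="" replace="all"/>
    <filter class="solr.LengthFilterFactory" min="1" max="255"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
    <filter class="solr.PhoneticFilterFactory" encoder="ColognePhonetic" inject="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Plain field holds the article text; the phonetic field is filled by copyField -->
<field name="content_exact" type="text_de_plain" indexed="true" stored="true"/>
<field name="content_phonetic" type="text_de_phonetic" indexed="true" stored="false"/>
<copyField source="content_exact" dest="content_phonetic"/>
```

Because the same analyzer runs at query time, a query for 13 produces no token on content_phonetic and can only match through content_exact.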
Re: Phonetic search and matching
Yes, you could do that. I guess numbers will give you trouble under all circumstances. You may be able to search against your non-phonetic field with a higher boost, so that those matches are preferred.

Best
Erick

On Tue, Feb 7, 2012 at 2:30 PM, Dirk Högemann dirk.hoegem...@googlemail.com wrote:
> [...] another field without phonetic filtering and searching in both fields would be OK, right?
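
The boosting suggestion could be expressed with the (e)dismax query parser roughly as follows. This is a sketch only: the field names content_exact and content_phonetic are hypothetical, the boost factor 3 is arbitrary, and the handler name is just the common default.

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Hypothetical fields: exact matches weigh 3x more than phonetic ones -->
    <str name="qf">content_exact^3 content_phonetic</str>
  </lst>
</requestHandler>
```

The same settings can be passed per request as defType=edismax and qf=content_exact^3 content_phonetic instead of being stored as handler defaults.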
Phonetic search and matching
Hi,

I have a question on phonetic search and matching in Solr. In our application all the content of an article is written to a full-text search field, which provides stemming and a phonetic filter (Cologne phonetics for German). This is the relevant part of the configuration for the index analyzer (the search analyzer is analogous):

    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
    <filter class="solr.PhoneticFilterFactory" encoder="ColognePhonetic" inject="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

Unfortunately this sometimes results in strange, though explainable, matches. For example, the content field indexes the following string: Donnerstag von 13 bis 17 Uhr. A search for puf matches this content, because the phonetic filter encodes puf as 13. (As a consequence, the 13 is then also highlighted.)

Does anyone have an idea how to handle this in a reasonable way, so that a search for puf does not match 13 in the content?

Thanks in advance!
Dirk