Re: Phonetic search and matching

2012-02-07 Thread Erick Erickson
What happens if you do NOT inject? Setting inject=false
stores only the phonetic reduction, not the original text. In that
case your false match on 13 would go away.
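
As a minimal sketch, assuming the analyzer chain quoted below is otherwise left unchanged, only the inject attribute of the phonetic filter would differ:

```xml
<!-- Index only the Cologne phonetic code, not the original token. -->
<filter class="solr.PhoneticFilterFactory"
        encoder="ColognePhonetic" inject="false"/>
```

The same change would presumably be needed in the query analyzer, and existing documents have to be reindexed before it takes effect. How purely numeric tokens such as 13 come out of the filter is worth checking on the Solr analysis page rather than assuming.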

Not sure what that means for the rest of your app though.

Best
Erick

On Mon, Feb 6, 2012 at 5:44 AM, Dirk Högemann
dirk.hoegem...@googlemail.com wrote:
 Hi,

 I have a question on phonetic search and matching in Solr.
 In our application all the content of an article is written to a full-text
 search field, which provides stemming and a phonetic filter (Cologne
 phonetic for German).
 This is the relevant part of the configuration for the index analyzer
 (the search analyzer is analogous):

         <tokenizer class="solr.StandardTokenizerFactory"/>
         <filter class="solr.WordDelimiterFilterFactory"
                 generateWordParts="1" generateNumberParts="1" catenateWords="0"
                 catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
         <filter class="solr.PhoneticFilterFactory"
                 encoder="ColognePhonetic" inject="true"/>
         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

 Unfortunately, this sometimes results in strange, though explainable,
 matches.
 For example:

 The content field indexes the following string: Donnerstag von 13 bis 17 Uhr.

 This results in a match if we search for puf, because the phonetic
 filter's result for puf is 13.
 (As a consequence, the 13 is then also highlighted.)

 Does anyone have an idea how to handle this in a reasonable way, so that a
 search for puf does not match 13 in the content?

 Thanks in advance!

 Dirk


Re: Phonetic search and matching

2012-02-07 Thread Dirk Högemann
Thanks, Erick.
At first we thought of removing numbers with a pattern filter.
Setting inject to false will have the same effect.
If we want to be able to search for numbers in the content, this solution
will not work, but adding another field without phonetic filtering and
searching in both fields would be OK, right?
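
A sketch of that two-field setup in schema.xml might look like the following. The type and field names (text_de, text_de_phonetic, content, content_phonetic) are made up for illustration, and the WordDelimiterFilterFactory step from the original chain is left out for brevity.

```xml
<!-- schema.xml sketch; all type and field names are hypothetical. -->
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
  </analyzer>
</fieldType>

<fieldType name="text_de_phonetic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
    <filter class="solr.PhoneticFilterFactory"
            encoder="ColognePhonetic" inject="false"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<!-- The exact field keeps numbers searchable; the phonetic field handles sound-alike matches. -->
<field name="content" type="text_de" indexed="true" stored="true"/>
<field name="content_phonetic" type="text_de_phonetic" indexed="true" stored="false"/>
<copyField source="content" dest="content_phonetic"/>
```

With this layout the content is indexed once per field, so the index grows accordingly, and highlighting would normally be requested on the stored content field.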

Dirk

Re: Phonetic search and matching

2012-02-07 Thread Erick Erickson
Yes, you could do that. I suspect numbers will give you trouble
under any circumstances.

You may be able to search against your non-phonetic field with a
higher boost so that those matches are preferred.
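
One way to sketch this is a dismax handler in solrconfig.xml that queries both fields and weights the exact one higher. The handler name, the field names content and content_phonetic, and the boost of 5 are placeholders, not values from this thread.

```xml
<!-- solrconfig.xml sketch; handler name, field names, and boost are hypothetical. -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- Exact matches in content score higher than phonetic-only matches. -->
    <str name="qf">content^5 content_phonetic</str>
  </lst>
</requestHandler>
```

The boost only changes ranking. Documents that match solely through the phonetic field are still returned, just lower in the list.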

Best
Erick


Phonetic search and matching

2012-02-06 Thread Dirk Högemann
Hi,

I have a question on phonetic search and matching in Solr.
In our application all the content of an article is written to a full-text
search field, which provides stemming and a phonetic filter (Cologne
phonetic for German).
This is the relevant part of the configuration for the index analyzer
(the search analyzer is analogous):

<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="0"
        catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="German2"/>
<filter class="solr.PhoneticFilterFactory"
        encoder="ColognePhonetic" inject="true"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

Unfortunately, this sometimes results in strange, though explainable,
matches.
For example:

The content field indexes the following string: Donnerstag von 13 bis 17 Uhr.

This results in a match if we search for puf, because the phonetic
filter's result for puf is 13.
(As a consequence, the 13 is then also highlighted.)

Does anyone have an idea how to handle this in a reasonable way, so that a
search for puf does not match 13 in the content?

Thanks in advance!

Dirk