Re: WordDelimiterFilter Leading & Trailing Special Character

Upayavira Tue, 21 Jul 2015 01:02:36 -0700

The javadoc for WordDelimiterFilterFactory suggests this config:


 <fieldType name="text_wd" class="solr.TextField"
            positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.WordDelimiterFilterFactory"
             protected="protectedword.txt"
             preserveOriginal="0" splitOnNumerics="1"
             splitOnCaseChange="1"
             catenateWords="0" catenateNumbers="0" catenateAll="0"
             generateWordParts="1" generateNumberParts="1"
             stemEnglishPossessive="1"
             types="wdfftypes.txt" />
   </analyzer>
 </fieldType>

Note the protected="xxxxx" attribute. I suspect if you put Yahoo! into a
file referenced by that attribute, it may survive analysis. I'd be
curious to hear whether it works.
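
As a rough, untested sketch of what that might look like when applied to the
config quoted below: the file name protwords_wdf.txt is hypothetical, and
listing both casings assumes the protected list is matched case-sensitively,
since LowerCaseFilterFactory only runs after the word delimiter filter.

```xml
<!-- Sketch only. protwords_wdf.txt is a hypothetical file in the core's conf
     directory, with one protected term per line, for example:
         Yahoo!
         yahoo!
     Both casings are listed on the assumption that matching is case-sensitive. -->
<fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
  <!-- A single analyzer element is used for both indexing and querying. -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            protected="protwords_wdf.txt"
            splitOnCaseChange="0" generateWordParts="0" generateNumberParts="0"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            preserveOriginal="1"
            types="specialchartypes.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The Analysis screen in the Solr admin UI should show whether ‘Yahoo!’ comes
out of the filter chain with its ‘!’ intact.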

Upayavira

On Tue, Jul 21, 2015, at 12:51 AM, Sathiya N Sundararajan wrote:
> Question about WordDelimiterFilter. The search behavior that we get
> with WordDelimiterFilter works well for us, except for the case where
> there is a special character at either the leading or trailing end of
> the term.
> 
> For instance:
> 
> *‘d&b’*   ->  Works as expected. Finds all docs with ‘d&b’.
> *‘p!nk’*  ->  Works fine, as above.
> 
> But when there is a special character at the trailing end of the term,
> as in ‘Yahoo!’:
> 
> *‘yahoo!’* -> Turns into a search for just *‘yahoo’*, with the special
> character *‘!’* stripped out. This WordDelimiterFilter behavior is
> documented at
> http://lucene.apache.org/core/4_6_0/analyzers-common/index.html?org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html
> 
> What I would like is for the search to be performed without stripping
> out the leading and trailing special characters. Is there a way to
> achieve this behavior with WordDelimiterFilter?
> 
> This is the current config that we have for the field:
> 
> <fieldType name="text_wdf" class="solr.TextField"
>            positionIncrementGap="100">
>     <analyzer type="index">
>         <tokenizer class="solr.WhitespaceTokenizerFactory" />
>         <filter class="solr.WordDelimiterFilterFactory"
>                 splitOnCaseChange="0" generateWordParts="0" generateNumberParts="0"
>                 catenateWords="0" catenateNumbers="0" catenateAll="0"
>                 preserveOriginal="1"
>                 types="specialchartypes.txt"/>
>         <filter class="solr.LowerCaseFilterFactory" />
>     </analyzer>
>     <analyzer type="query">
>         <tokenizer class="solr.WhitespaceTokenizerFactory" />
>         <filter class="solr.WordDelimiterFilterFactory"
>                 splitOnCaseChange="0" generateWordParts="0" generateNumberParts="0"
>                 catenateWords="0" catenateNumbers="0" catenateAll="0"
>                 preserveOriginal="1"
>                 types="specialchartypes.txt"/>
>         <filter class="solr.LowerCaseFilterFactory" />
>     </analyzer>
> </fieldType>
> 
> 
> thanks
