Re: SynonymFilterFactory and Punctuation

Upayavira Thu, 21 Mar 2013 02:13:47 -0700

Something is stripping your punctuation before it gets to your synonym
filter. Presumably this is the StandardTokenizer, which discards
characters such as + and # when it splits the text into tokens. Try it
with the WhitespaceTokenizerFactory instead.
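
A minimal sketch of that change, assuming the text_general field type
from the quoted message below with only the tokenizer swapped to
solr.WhitespaceTokenizerFactory (all file names and filters are the
ones from that message, untested here):

```xml
<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <!-- Split on whitespace only, so "C++" and "c#" reach the synonym filter intact -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="punctuation-whitelist.txt" ignoreCase="true"
            expand="false"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- Use the same tokenizer at query time so both sides produce the same tokens -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="punctuation-whitelist.txt" ignoreCase="true"
            expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

One trade-off to keep in mind: a whitespace tokenizer leaves all other
punctuation attached to its word as well, so "experience," keeps its
trailing comma and "c#," would not match the "c#" synonym rule. You
will need to reindex after changing the index analyzer.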


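Also be careful to URL-encode plus signs in queries: an unencoded +
represents a space in a URL query string, so send C++ as C%2B%2B.
Likewise, # starts the URL fragment, so send c# as c%23.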

Upayavira

On Wed, Mar 20, 2013, at 10:52 PM, M W wrote:
> I have been reading threads all day regarding this topic and nothing
> seems to work the way it says it should. :)  I appreciate any and all
> help in this matter.
> 
> Solr 4 is working perfectly for me in all regards with this one exception.
> 
> My requirement from Solr4 is very simple.  I am storing a document
> like a job description in a text_general field.
> 
> I have added a filter for SynonymFilterFactory so that I can map C++
> => cplusplus and c# => csharp during indexing and querying.
> 
> Here is the field definition:
> 
>     <fieldType name="text_general" class="solr.TextField"
>                positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory"
>                 synonyms="punctuation-whitelist.txt" ignoreCase="true"
>                 expand="false"/>
>         <filter class="solr.StandardFilterFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
>                 words="stopwords.txt" enablePositionIncrements="true" />
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory"
>                 synonyms="punctuation-whitelist.txt" ignoreCase="true"
>                 expand="false"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
>                 words="stopwords.txt" enablePositionIncrements="true" />
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>     </fieldType>
> 
> Here are the contents of punctuation-whitelist.txt:
> 
> c++ => cplusplus
> C# => csharp
> 
> I have but one document indexed for the purpose of this test. When I
> search for resume_text:C++, I get the following result, which is the
> same result I get when I search for just resume_text:c.
> 
> You can see from the highlighting that Solr is matching on the "C" only.
> 
> 
> <response>
>       <lst name="responseHeader">
>               <int name="status">0</int>
>               <int name="QTime">20</int>
>       </lst>
>       <result name="response" numFound="1" start="0" maxScore="0.16273327">
>               <doc>
>                       <arr name="resume_text">
>                               <str>C++ Developer with c# experience, including .net</str>
>                       </arr>
>               </doc>
>       </result>
>       <lst name="highlighting">
>               <lst name="208645">
>                       <arr name="resume_text">
>                               <str>&lt;em&gt;C&lt;/em&gt;++ Developer with &lt;em&gt;c&lt;/em&gt;# experience, including .net</str>
>                       </arr>
>               </lst>
>       </lst>
> </response>
> 
> If I use the Analysis tool in the Solr Web UI, putting "C#" or "C++"
> into the Index or Query boxes translates to just "C" in all filters
> and tokenizers in the analysis output.
> 
> Can someone please explain the _best_ way to accomplish what I am
> trying to do, which is to accurately index, search and highlight text
> with words like C++ and C#?  I am looking for the "right way" and it's
> okay if I have started down the wrong path.
> 
> :)
> 
> Thank you.
> Dave
