[jira] Updated: (SOLR-319) changes SynonymFilterFactory to "Analyze" synonyms file

Hoss Man (JIRA) Fri, 14 Sep 2007 18:01:53 -0700

     [ https://issues.apache.org/jira/browse/SOLR-319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Hoss Man updated SOLR-319:
--------------------------

    Summary: changes SynonymFilterFactory to "Analyze" synonyms file
             (was: changes SynonymFilterFactory for N-gram tokenizer)

I've revised the summary line of this bug because it was a little confusing to 
me. The issue isn't really specific to n-gram based tokenizers: as you point 
out, it's a general problem that, when constructing the synonyms file, you 
currently have to be very aware of the analysis chain of your fieldtype. For 
example, if LowerCaseFilterFactory comes before SynonymFilterFactory, then all 
synonyms must be lowercased in your file.
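
As a rough illustration of the lowercase case (a sketch only: the fieldtype 
name "text_lower" and the sample synonym entries are made up for this example, 
and ignoreCase is explicitly turned off):

```xml
<fieldtype name="text_lower" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Tokens are lowercased before they reach the synonym filter. -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- With ignoreCase="false", an entry such as "NYC, New York City" in
         synonyms.txt would not match the lowercased tokens; it would have
         to be written as "nyc, new york city". -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="false" expand="true"/>
  </analyzer>
</fieldtype>
```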

The notion of specifying a TokenizerFactory as a property of the 
SynonymFilterFactory that tells it how to parse the synonyms.txt file is pretty 
clever, and would solve the CJKTokenizer problem you describe, but I don't 
think it really goes far enough. Consider the lowercase example: it would be 
good if you could have a synonyms file that contained proper names, and have it 
do the right thing when used in lowercased fields as well as exact-case fields.

To extend the tokenizer idea: what if you could specify the name of a 
fieldtype, and the entire Analyzer for that fieldtype would be used to parse 
the individual synonym records? This should simplify the patch a bit (since 
you don't have to worry about initializing any factories; the schema will take 
care of it for you) and make it a lot more powerful.
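
To make the idea concrete, here is one way the configuration might look. This 
is purely a sketch: the "analyzerFieldType" attribute does not exist in the 
patch or in Solr as far as this message is concerned, and the fieldtype names 
are invented for illustration.

```xml
<!-- Hypothetical fieldtype whose Analyzer would be used only to parse
     the records in synonyms.txt. -->
<fieldtype name="synonym_rules" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>

<fieldtype name="text_names" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- "analyzerFieldType" is an assumed attribute name, not a real one:
         it would tell the factory to run each synonym record through the
         Analyzer of the named fieldtype. -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            expand="true" analyzerFieldType="synonym_rules"/>
  </analyzer>
</fieldtype>
```

The sketch points at a separate fieldtype rather than at "text_names" itself, 
so that the Analyzer used to read the synonyms file does not contain the 
synonym filter that is being initialized.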

> changes SynonymFilterFactory to "Analyze" synonyms file
> -------------------------------------------------------
>
>                 Key: SOLR-319
>                 URL: https://issues.apache.org/jira/browse/SOLR-319
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Koji Sekiguchi
>            Priority: Minor
>         Attachments: SOLR-319.patch
>
>
> WHAT:
> Currently, SynonymFilterFactory works very well with N-gram tokenizers 
> (CJKTokenizer, for example), but we have to take care when writing the 
> rules in synonyms.txt.
> For example, if I use CJKTokenizer (which works as a bi-gram tokenizer for 
> CJK chars) and want C1C2C3 to map to C4C5C6, I have to write the rule as 
> follows:
> C1C2 C2C3 => C4C5 C5C6
> But I want to write it as "C1C2C3=>C4C5C6". This patch allows that. It is 
> also helpful for sharing synonyms.txt.
> HOW:
> A tokenFactory attribute is added to <filter class="solr.SynonymFilterFactory"/>.
> If the attribute is specified, SynonymFilterFactory uses that 
> TokenizerFactory to create a Tokenizer, and then uses the Tokenizer to get 
> tokens from the rules in the synonyms.txt file.
> sample-1: CJKTokenizer
>     <fieldtype name="text_cjk" class="solr.TextField" positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.CJKTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory" synonyms="ngram_synonym_test_ja.txt"
>                       ignoreCase="true" expand="true" tokenFactory="solr.CJKTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.CJKTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>     </fieldtype>
> sample-2: NGramTokenizer
>     <fieldtype name="text_ngram" class="solr.TextField" positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
>         <filter class="solr.SynonymFilterFactory" synonyms="ngram_synonym_test_ngram.txt"
>                       ignoreCase="true" expand="true"
>                       tokenFactory="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>     </fieldtype>
> backward compatibility:
> Yes. If you omit the tokenFactory attribute from the 
> <filter class="solr.SynonymFilterFactory"/> tag, it works as before.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
