Markus Klose created SOLR-3630:
----------------------------------

             Summary: latin text analysis
                 Key: SOLR-3630
                 URL: https://issues.apache.org/jira/browse/SOLR-3630
             Project: Solr
          Issue Type: New Feature
          Components: Schema and Analysis
            Reporter: Markus Klose
            Priority: Minor


Hi,

a workmate and I played around with Latin text analysis and created two filters
for the Solr trunk version.
One filter converts Roman numerals to numbers, e.g. 'iv' -> '4', 'v' -> '5',
'vi' -> '6' ...
The second filter is a stemmer for the most common suffixes.

The following schema configuration could be a use case for Latin stemming.

        <fieldType name="text_latin" class="solr.TextField"
                   positionIncrementGap="100">
          <analyzer>
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="org.apache.solr.analysis.LatinNumberConvertFilterFactory"
                    strictMode="true"/>
            <filter class="solr.KeywordMarkerFilterFactory"
                    protected="latin_protwords.txt"/>
            <filter class="org.apache.solr.analysis.LatinStemFilterFactory"/>
          </analyzer>
        </fieldType>
        
LatinNumberConvertFilterFactory has one property, "strictMode" (default is
false). This boolean controls how the value is computed, because not every
combination of letters is a "valid" number. With strictMode="true" the output
of "ic" is "ic"; with strictMode="false" the output of "ic" is "99".
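
To illustrate the strictMode behaviour described above, here is a minimal
standalone Java sketch. It is not the code from the patch: the class name
LatinNumberSketch, the method names, and the exact rules are assumptions. It
assumes that the lenient mode subtracts any smaller letter standing in front of
a larger one (so "ic" = 100 - 1 = 99), and that the strict mode only accepts
numerals in canonical form up to 3999.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not the LatinNumberConvertFilter from the patch.
final class LatinNumberSketch {
    private static final Map<Character, Integer> VALUES = Map.of(
            'i', 1, 'v', 5, 'x', 10, 'l', 50, 'c', 100, 'd', 500, 'm', 1000);

    private static final int[] ARABIC =
            {1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1};
    private static final String[] ROMAN =
            {"m", "cm", "d", "cd", "c", "xc", "l", "xl", "x", "ix", "v", "iv", "i"};

    // Lenient evaluation: a letter smaller than its successor is subtracted,
    // everything else is added. Returns -1 if the token is empty or contains
    // a letter that is not a Roman numeral.
    static int lenientValue(String lower) {
        if (lower.isEmpty()) return -1;
        int total = 0;
        for (int i = 0; i < lower.length(); i++) {
            Integer v = VALUES.get(lower.charAt(i));
            if (v == null) return -1;
            Integer next = i + 1 < lower.length() ? VALUES.get(lower.charAt(i + 1)) : null;
            if (next != null && v < next) total -= v; else total += v;
        }
        return total;
    }

    // Canonical Roman numeral (lower case) for a positive number.
    static String toRoman(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < ARABIC.length; i++) {
            while (n >= ARABIC[i]) {
                sb.append(ROMAN[i]);
                n -= ARABIC[i];
            }
        }
        return sb.toString();
    }

    // Returns the number as a string, or the unchanged token if it is not
    // accepted as a numeral in the chosen mode.
    static String convert(String token, boolean strictMode) {
        String lower = token.toLowerCase(Locale.ROOT);
        int value = lenientValue(lower);
        if (value <= 0) return token;
        // Assumed strict rule: the token must round-trip to its canonical form.
        if (strictMode && (value > 3999 || !toRoman(value).equals(lower))) return token;
        return Integer.toString(value);
    }

    public static void main(String[] args) {
        System.out.println(convert("iv", true));   // 4
        System.out.println(convert("ic", true));   // ic
        System.out.println(convert("ic", false));  // 99
    }
}
```

The round-trip check keeps the strict rule short: instead of listing every
invalid letter combination, a token is accepted only if converting its value
back yields the same spelling.
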
The LatinStemFilterFactory generates two output tokens for each input token: the
first stemmed as a noun and the second stemmed as a verb.
Both filters are aware of the KeywordMarkerFilterFactory.
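
The two-token output can be pictured with the following standalone Java sketch.
It is not the stemmer from the patch: the class name LatinStemSketch, the
heavily shortened suffix lists, and the minimum stem length of two letters are
all assumptions made for illustration. A plain set of protected words stands in
for the keyword flag that KeywordMarkerFilterFactory sets on a token, and the
sketch assumes that a protected token is passed through once, unchanged.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not the LatinStemFilter from the patch.
final class LatinStemSketch {
    // Illustrative suffix lists only, ordered longest first.
    private static final String[] NOUN_SUFFIXES = {
            "ibus", "orum", "arum", "ae", "am", "as", "em", "es",
            "is", "os", "um", "us", "a", "e", "i", "o"};
    private static final String[] VERB_SUFFIXES = {
            "mus", "tis", "nt", "re", "s", "t", "o", "m"};

    // Removes the first matching suffix, keeping at least two letters of stem.
    static String strip(String token, String[] suffixes) {
        for (String suffix : suffixes) {
            if (token.endsWith(suffix) && token.length() - suffix.length() >= 2) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // Returns [nounStem, verbStem], or the unchanged token if it is protected.
    static List<String> stem(String token, Set<String> protectedWords) {
        String lower = token.toLowerCase(Locale.ROOT);
        if (protectedWords.contains(lower)) {
            return List.of(token);
        }
        return List.of(strip(lower, NOUN_SUFFIXES), strip(lower, VERB_SUFFIXES));
    }

    public static void main(String[] args) {
        System.out.println(stem("rosam", Set.of()));       // [ros, rosa]
        System.out.println(stem("amant", Set.of()));       // [amant, ama]
        System.out.println(stem("roma", Set.of("roma")));  // [roma]
    }
}
```

Emitting both stems lets a query match whichever reading fits, since the
filter cannot tell from a single token whether it is a noun or a verb.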

I have attached the svn patch for both filters. In addition, I attached two zip
files that are needed by the filter tests (TestLatinNumberConvertFilter,
TestLatinStemFilter). I am sorry for that, but I did not find an option to
include them in the patch, if there is one.

The image latin_analysis.png is an example of the analysis done with the
configuration above. For this test we used the jar file latin.analysis.jar.


Have fun with Latin text analysis.
It would be great to get some feedback.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
