[
https://issues.apache.org/jira/browse/LUCENE-4229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Klose updated LUCENE-4229:
---------------------------------
Attachment: SOLR-3630.patch
Fixed an encoding issue in the patch file.
> latin text analysis
> -------------------
>
> Key: LUCENE-4229
> URL: https://issues.apache.org/jira/browse/LUCENE-4229
> Project: Lucene - Java
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Markus Klose
> Priority: Minor
> Attachments: SOLR-3630.patch, SOLR-3630.patch, latin.analysis.jar,
> latinNumberTestData.zip, latinTestData.zip, latin_analysis.png
>
>
> Hi,
> a workmate and I played a bit with Latin text analysis and created two
> filters for the Solr trunk version.
> One filter is designed for number conversion, e.g. 'iv' -> '4', 'v' -> '5',
> 'vi' -> '6' ...
> The second filter is a stemmer for the most common suffixes.
> The following schema configuration could be a use case for Latin stemming:
> <fieldType name="text_latin" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="org.apache.solr.analysis.LatinNumberConvertFilterFactory"
>             strictMode="true"/>
>     <filter class="solr.KeywordMarkerFilterFactory"
>             protected="latin_protwords.txt"/>
>     <filter class="org.apache.solr.analysis.LatinStemFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> LatinNumberConvertFilterFactory has one property, "strictMode" (default is
> false). This boolean controls how the value is computed, because not all
> letter combinations are "valid" numbers. With strictMode="true" the output
> of "ic" is "ic"; with strictMode="false" the output of "ic" is "99".
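The strictMode behaviour described above can be illustrated with a small standalone sketch. This is not the code from the attached patch: the class name LatinNumberSketch, the canonical-form regular expression, the lower-casing, and the plain subtractive evaluation are all assumptions chosen only to reproduce the examples quoted in the description ('iv' -> '4', and "ic" -> "ic" or "99" depending on strictMode).

```java
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration only; not taken from SOLR-3630.patch.
final class LatinNumberSketch {

    // Value of each Roman numeral letter (lower case).
    private static final Map<Character, Integer> VALUES = Map.of(
            'i', 1, 'v', 5, 'x', 10, 'l', 50, 'c', 100, 'd', 500, 'm', 1000);

    // Assumed definition of a "valid" number: canonical Roman numerals 1..3999.
    private static final Pattern STRICT = Pattern.compile(
            "m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})");

    // Returns the decimal value as a string, or the token unchanged
    // if it is not treated as a number.
    static String convert(String token, boolean strictMode) {
        String lower = token.toLowerCase(Locale.ROOT);
        if (lower.isEmpty()) {
            return token;
        }
        for (int i = 0; i < lower.length(); i++) {
            if (!VALUES.containsKey(lower.charAt(i))) {
                return token; // contains a non-numeral letter
            }
        }
        if (strictMode && !STRICT.matcher(lower).matches()) {
            return token; // e.g. "ic" is not canonical, so it is left alone
        }
        // Subtractive rule: a letter smaller than its successor is subtracted.
        int total = 0;
        for (int i = 0; i < lower.length(); i++) {
            int value = VALUES.get(lower.charAt(i));
            boolean smallerThanNext = i + 1 < lower.length()
                    && value < VALUES.get(lower.charAt(i + 1));
            total += smallerThanNext ? -value : value;
        }
        return total > 0 ? Integer.toString(total) : token;
    }

    public static void main(String[] args) {
        System.out.println(convert("iv", true));
        System.out.println(convert("ic", true));
        System.out.println(convert("ic", false));
    }
}
```

Under these assumptions, the lenient mode simply evaluates any sequence of numeral letters, while the strict mode first checks the token against the canonical pattern and passes it through untouched when it does not match.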
> The LatinStemFilterFactory generates two output tokens for each input token:
> the first stemmed as a noun and the second stemmed as a verb.
> Both filters are aware of the KeywordMarkerFilterFactory.
> I have attached the svn patch for both filters. In addition, I attached two
> zip files that are needed by the filter tests (TestLatinNumberConvertFilter,
> TestLatinStemFilter). I am sorry for that, but I did not find an option to
> include them in the patch, if there is one.
> The image latin_analysis.png shows an example of the analysis done with the
> configuration above. For this test we used the jar file latin.analysis.jar.
> Have fun with Latin text analysis.
> It would be great to get some feedback.