[ https://issues.apache.org/jira/browse/LUCENE-4229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Klose updated LUCENE-4229:
---------------------------------

    Attachment: SOLR-3630.patch

Fixed an encoding issue in the patch file.

> latin text analysis
> -------------------
>
>                 Key: LUCENE-4229
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4229
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Markus Klose
>            Priority: Minor
>         Attachments: SOLR-3630.patch, SOLR-3630.patch, latin.analysis.jar, latinNumberTestData.zip, latinTestData.zip, latin_analysis.png
>
>
> Hi,
> a workmate and I played a bit with Latin text analysis and created two filters for the Solr trunk version.
> The first filter converts Roman numerals to numbers, e.g. 'iv' -> '4', 'v' -> '5', 'vi' -> '6' ...
> The second filter is a stemmer for the most common suffixes.
> The following schema configuration could be a use case for Latin stemming:
> <fieldType name="text_latin" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="org.apache.solr.analysis.LatinNumberConvertFilterFactory" strictMode="true"/>
>     <filter class="solr.KeywordMarkerFilterFactory" protected="latin_protwords.txt" />
>     <filter class="org.apache.solr.analysis.LatinStemFilterFactory" />
>   </analyzer>
> </fieldType>
>
> LatinNumberConvertFilterFactory has one property, "strictMode" (default is false). This boolean controls how the value is computed, because not every letter combination is a "valid" number. With strictMode="true" the output for "ic" is "ic"; with strictMode="false" the output for "ic" is "99".
> The LatinStemFilterFactory generates two output tokens for each input token: the first stemmed as a noun and the second stemmed as a verb.
> Both filters are aware of the KeywordMarkerFilterFactory.
> I have attached the svn patch for both filters. In addition, I attached two zip files that are needed by the filter tests (TestLatinNumberConvertFilter, TestLatinStemFilter).
> I am sorry for that, but I did not find an option to include them in the patch, if there is one.
> The image latin_analysis.png is an example of the analysis done with the configuration above. For this test we used the jar file latin.analysis.jar.
> Have fun with Latin text analysis.
> It would be great to get some feedback.
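
To make the strictMode behaviour described above concrete, here is a minimal standalone sketch in Java. It is not taken from the attached patch: the class name LatinNumberSketch, the method convert, and the validation regex are assumptions made for illustration only. It assumes strict mode accepts only canonically formed Roman numerals and returns any other token unchanged, while non-strict mode applies the plain subtractive rule to any token made up of numeral letters.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class LatinNumberSketch {
    // Canonical Roman numerals (1 to 3999), lower-cased. Assumed strict-mode rule,
    // not necessarily the rule used by the attached patch.
    private static final Pattern STRICT = Pattern.compile(
        "m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})");

    // Value of a single numeral letter, or -1 if the character is not one.
    private static int valueOf(char c) {
        switch (c) {
            case 'i': return 1;
            case 'v': return 5;
            case 'x': return 10;
            case 'l': return 50;
            case 'c': return 100;
            case 'd': return 500;
            case 'm': return 1000;
            default:  return -1;
        }
    }

    // Returns the decimal value as a string, or the token unchanged if it is
    // not convertible under the selected mode.
    static String convert(String token, boolean strictMode) {
        String s = token.toLowerCase(Locale.ROOT);
        if (s.isEmpty()) {
            return token;
        }
        if (strictMode && !STRICT.matcher(s).matches()) {
            return token; // e.g. "ic" is not a canonical numeral
        }
        int total = 0;
        for (int i = 0; i < s.length(); i++) {
            int v = valueOf(s.charAt(i));
            if (v < 0) {
                return token; // contains a non-numeral letter
            }
            int next = (i + 1 < s.length()) ? valueOf(s.charAt(i + 1)) : 0;
            // Subtractive rule: a smaller value before a larger one is subtracted.
            total += (v < next) ? -v : v;
        }
        return Integer.toString(total);
    }

    public static void main(String[] args) {
        System.out.println(convert("iv", true));  // 4
        System.out.println(convert("ic", true));  // ic
        System.out.println(convert("ic", false)); // 99
    }
}
```

Under either mode, ordinary Latin words that happen to consist only of numeral letters (for example "mix") would also be converted, which is presumably where the protected-words list of KeywordMarkerFilterFactory comes in.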