[jira] [Updated] (LUCENE-4229) latin text analysis

Markus Klose (JIRA) Tue, 17 Jul 2012 03:50:41 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-4229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Markus Klose updated LUCENE-4229:
---------------------------------

    Attachment: SOLR-3630.patch

fix encoding issue in patch file
                
> latin text analysis
> -------------------
>
>                 Key: LUCENE-4229
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4229
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Markus Klose
>            Priority: Minor
>         Attachments: SOLR-3630.patch, SOLR-3630.patch, latin.analysis.jar, 
> latinNumberTestData.zip, latinTestData.zip, latin_analysis.png
>
>
> Hi,
> a workmate and I played a bit with Latin text analysis and created two 
> filters for the Solr trunk version.
> One filter is designed for number conversion, e.g. 'iv' -> '4', 'v' -> '5', 
> 'vi' -> '6' ...
> The second filter is a stemmer for the most common suffixes.
> The following schema configuration could be a use case for Latin stemming.
>       <fieldType name="text_latin" class="solr.TextField" positionIncrementGap="100">
>               <analyzer>
>                       <tokenizer class="solr.StandardTokenizerFactory"/>
>                       <filter class="org.apache.solr.analysis.LatinNumberConvertFilterFactory" strictMode="true"/>
>                       <filter class="solr.KeywordMarkerFilterFactory" protected="latin_protwords.txt"/>
>                       <filter class="org.apache.solr.analysis.LatinStemFilterFactory"/>
>               </analyzer>
>       </fieldType>
>       
> LatinNumberConvertFilterFactory has one property, "strictMode" (default is 
> false). This boolean controls how the value is computed, because not all 
> letter combinations are "valid" numbers. With strictMode="true" the output 
> of "ic" is "ic"; with strictMode="false" the output of "ic" is "99".
> The LatinStemFilterFactory generates two output tokens for each input token: 
> the first stemmed as a noun and the second stemmed as a verb.
> Both filters are aware of the KeywordMarkerFilterFactory.
> I have attached the svn patch for both filters. In addition, I attached two 
> zip files that are needed by the filter tests (TestLatinNumberConvertFilter, 
> TestLatinStemFilter). I am sorry for that, but I did not find an option to 
> include them in the patch, if there is one.
> The image latin_analysis.png is an example of the analysis done with the 
> configuration above. For this test we used the jar file latin.analysis.jar.
> Have fun with Latin text analysis. 
> It would be great to get some feedback.
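
For readers without the patch at hand, the strictMode behaviour described above can be illustrated with a small stand-alone sketch. This is not the code from the attached patch. The class and method names are hypothetical, and the definition of "strict" used here (the token must equal the canonical Roman spelling of its computed value) is an assumption that happens to reproduce the "ic" example from the description.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-alone sketch; not the LatinNumberConvertFilter from the patch.
class LatinNumberSketch {
    private static final Map<Character, Integer> VALUES = Map.of(
        'i', 1, 'v', 5, 'x', 10, 'l', 50, 'c', 100, 'd', 500, 'm', 1000);

    private static final int[] CANON_VALUES =
        {1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1};
    private static final String[] CANON_SYMBOLS =
        {"m", "cm", "d", "cd", "c", "xc", "l", "xl", "x", "ix", "v", "iv", "i"};

    // Lenient evaluation: a letter smaller than its right neighbour is subtracted,
    // every other letter is added. Returns -1 if the token is empty or contains
    // a character that is not a Roman numeral letter.
    static int lenientValue(String token) {
        if (token.isEmpty()) {
            return -1;
        }
        int total = 0;
        for (int i = 0; i < token.length(); i++) {
            Integer cur = VALUES.get(token.charAt(i));
            if (cur == null) {
                return -1;
            }
            Integer next = i + 1 < token.length() ? VALUES.get(token.charAt(i + 1)) : null;
            if (next != null && cur < next) {
                total -= cur;
            } else {
                total += cur;
            }
        }
        return total;
    }

    // Canonical (subtractive) Roman spelling of a positive number, in lower case.
    static String toCanonical(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < CANON_VALUES.length; i++) {
            while (n >= CANON_VALUES[i]) {
                sb.append(CANON_SYMBOLS[i]);
                n -= CANON_VALUES[i];
            }
        }
        return sb.toString();
    }

    // Assumption: in strict mode only canonical spellings are converted;
    // anything else is passed through unchanged.
    static String convert(String token, boolean strictMode) {
        String lower = token.toLowerCase(Locale.ROOT);
        int value = lenientValue(lower);
        if (value < 0) {
            return token;
        }
        if (strictMode && !toCanonical(value).equals(lower)) {
            return token;
        }
        return Integer.toString(value);
    }

    public static void main(String[] args) {
        System.out.println(convert("iv", true));
        System.out.println(convert("ic", true));
        System.out.println(convert("ic", false));
    }
}
```

With this sketch, "ic" evaluates leniently to 100 - 1 = 99, but its canonical spelling is "xcix", so strict mode leaves the token untouched.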

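The two-tokens-per-input behaviour of the stemmer can be sketched in plain Java as well. Again, this is not the patch's LatinStemFilter: the suffix lists below are deliberately tiny and hypothetical, the minimum stem length of two characters is an assumption, and so is the choice to emit a single unchanged token for words marked as keywords.

```java
import java.util.List;

// Hypothetical illustration only; the real suffix tables live in the attached patch.
class LatinStemSketch {
    // Tiny, made-up suffix lists, ordered longest first so the longest match wins.
    static final List<String> NOUN_SUFFIXES =
        List.of("ibus", "arum", "orum", "ae", "am", "um", "us", "is", "a");
    static final List<String> VERB_SUFFIXES =
        List.of("mus", "tis", "nt", "o", "s", "t");

    // Removes the first matching suffix, provided at least two characters remain.
    static String strip(String token, List<String> suffixes) {
        for (String suffix : suffixes) {
            int stemLength = token.length() - suffix.length();
            if (token.endsWith(suffix) && stemLength >= 2) {
                return token.substring(0, stemLength);
            }
        }
        return token;
    }

    // Returns [nounStem, verbStem]; keyword-protected tokens are passed through
    // as a single unchanged token (an assumption about the patch's behaviour).
    static List<String> stems(String token, boolean isKeyword) {
        if (isKeyword) {
            return List.of(token);
        }
        return List.of(strip(token, NOUN_SUFFIXES), strip(token, VERB_SUFFIXES));
    }

    public static void main(String[] args) {
        System.out.println(stems("amamus", false));
        System.out.println(stems("rosa", false));
    }
}
```

Emitting both stems lets a query match regardless of whether the surface form was a noun or a verb, at the cost of a larger index.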
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
