[ https://issues.apache.org/jira/browse/JENA-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Code Ferret updated JENA-1326:
------------------------------
    Fix Version/s: Jena 3.3.0

> Generic Lucene Analyzers
> ------------------------
>
>                 Key: JENA-1326
>                 URL: https://issues.apache.org/jira/browse/JENA-1326
>             Project: Apache Jena
>          Issue Type: New Feature
>          Components: Jena, Text
>    Affects Versions: Jena 3.2.0
>            Reporter: Code Ferret
>              Labels: Lucene, analyzers, jena-text
>             Fix For: Jena 3.3.0
>
>
> This issue proposes a jena-text assembler configuration feature that allows 
> a generic Lucene Analyzer to be specified by its fully qualified class name 
> and a list of parameters for one of the class's constructors.
> The parameters may be of the following types:
> {noformat}
>     string        String
>     set           org.apache.lucene.analysis.util.CharArraySet
>     file          java.io.FileReader
>     int           int
>     boolean       boolean
> {noformat}
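As a rough illustration of the idea, the sketch below resolves a constructor by reflection from a class name and a list of typed parameters. It is not the jena-text implementation: `GenericInstantiator` and `Param` are hypothetical names, and only the `string`, `int` and `boolean` types are handled, since `set` and `file` would need Lucene's `CharArraySet` and a real file on disk.

```java
import java.lang.reflect.Constructor;
import java.util.List;

/** Hypothetical sketch only; not the code that jena-text uses. */
final class GenericInstantiator {

    /** One constructor argument: a paramType name and its lexical value. */
    static final class Param {
        final String type;
        final String value;

        Param(String type, String value) {
            this.type = type;
            this.value = value;
        }
    }

    /** Maps a paramType name to the Java type used for constructor lookup. */
    static Class<?> javaType(String type) {
        switch (type) {
            case "string":  return String.class;
            case "int":     return int.class;
            case "boolean": return boolean.class;
            default:
                // "set" and "file" are deliberately left out of this sketch.
                throw new IllegalArgumentException("Unsupported paramType: " + type);
        }
    }

    /** Converts the lexical value to an object of the matching Java type. */
    static Object javaValue(Param p) {
        switch (p.type) {
            case "string":  return p.value;
            case "int":     return Integer.valueOf(p.value);
            case "boolean": return Boolean.valueOf(p.value);
            default:
                throw new IllegalArgumentException("Unsupported paramType: " + p.type);
        }
    }

    /** Finds the public constructor matching the parameter types and calls it. */
    static Object instantiate(String className, List<Param> params) {
        Class<?>[] types = new Class<?>[params.size()];
        Object[] args = new Object[params.size()];
        for (int i = 0; i < params.size(); i++) {
            types[i] = javaType(params.get(i).type);
            args[i] = javaValue(params.get(i));
        }
        try {
            Constructor<?> ctor = Class.forName(className).getConstructor(types);
            return ctor.newInstance(args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot construct " + className, e);
        }
    }
}
```

For example, calling `instantiate` with `"java.lang.StringBuilder"` and a single `string` parameter invokes the `StringBuilder(String)` constructor. An Analyzer class would be handled the same way once the `set` and `file` types are mapped.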
>  
> The list of types is not exhaustive, but it covers the most common cases. 
> For anything else, it is a simple matter to create a wrapper Analyzer that 
> reads a file containing the information needed to initialize whatever 
> parameters a given Analyzer requires.
> For example, {{org.apache.lucene.analysis.ja.JapaneseAnalyzer}} has a 
> constructor with 4 parameters: a {{UserDictionary}}, a {{CharArraySet}}, a 
> {{JapaneseTokenizer.Mode}}, and a {{Set<String>}}. So a simple wrapper can 
> extract the values needed for the various parameters with types not available 
> in this extension, construct the required instances, and instantiate the 
> {{JapaneseAnalyzer}}.
> Adding a custom Analyzer, such as the wrapper described above, only requires 
> adding the Analyzer class, along with any associated filters and tokenizers, 
> to the Jena classpath. All of the Analyzers included in the Lucene 
> distribution bundled with Jena are available as well.
> Each parameter object is specified with:
> - an optional {{text:paramName}} that may be used to document which parameter 
> is represented
> - a {{text:paramType}} which is one of: {{string, set, file, int, boolean}}.
> - a {{text:paramValue}} which is an {{xsd:string}}, {{xsd:boolean}} or 
> {{xsd:int}}.
> A parameter of type {{set}} _may have_ zero or more {{text:paramValue}} 
> entries. A parameter of type {{string}}, {{file}}, {{boolean}}, or {{int}} 
> _must have_ exactly one {{text:paramValue}}.
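The cardinality rules above could be checked along the following lines. This is a minimal sketch under assumed names (`ParamRules` and `isValid` are hypothetical and are not part of jena-text); it only counts values and does not check their datatypes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the value-count rule for each paramType. */
final class ParamRules {

    // Types that must carry exactly one text:paramValue.
    private static final Set<String> SINGLE_VALUED =
        new HashSet<>(Arrays.asList("string", "file", "int", "boolean"));

    /** Returns true if the number of values is allowed for the given paramType. */
    static boolean isValid(String paramType, List<String> values) {
        if ("set".equals(paramType)) {
            return true; // zero or more values are allowed
        }
        if (SINGLE_VALUED.contains(paramType)) {
            return values.size() == 1;
        }
        return false; // unknown paramType
    }
}
```

Under this sketch an empty `set` is accepted, while an `int` with two values or an unrecognized type name is rejected.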
> An example Analyzer configuration would look like:
> {noformat}
>     text:map (
>          [ text:field "text" ; 
>            text:predicate rdfs:label;
>            text:analyzer [
>                a text:GenericAnalyzer ;
>                text:class "org.apache.lucene.analysis.en.EnglishAnalyzer" ;
>                text:params (
>                     [ text:paramName "stopwords" ;
>                       text:paramType "set" ;
>                       text:paramValue ("the" "a" "an") ]
>                     [ text:paramName "stemExclusionSet" ;
>                       text:paramType "set" ;
>                       text:paramValue ("ing" "ed") ]
>                     )
>            ]  .
> . . .
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
