Hi,
This proposal aims to integrate language-specific support into jena-text.
It summarizes the changes (and several discussions) made in
https://github.com/apache/jena/pull/64 (JENA-928) and previously in
https://github.com/apache/jena/pull/52. The forked branch is available at
https://github.com/LICEF/jena/tree/jena-text-ml-single-index. A single patch
file is also attached.
The changes and new features are described below:
1) LocalizedAnalyzer
A new analyzer can now be specified (for the indexing or query phase) to take
advantage of Lucene's language-specific analyzers (stemming, stop words, ...).
Like the other existing analyzers (SimpleAnalyzer, KeywordAnalyzer, ...), it can
be used in assembler specifications, together with the relevant language:
    text:queryAnalyzer [
        a text:LocalizedAnalyzer ;
        text:language "en"
    ]
In Java code, it can be instantiated with the static getLocalizedAnalyzer(lang)
method of the org.apache.jena.query.text.analyzer.Util class.
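
To make the idea of a per-language analyzer concrete, here is a minimal sketch of a language-code lookup with a fallback. It is not the code from the patch: the class AnalyzerLookupSketch, the three-entry table, the reduction of a tag such as "fr-CA" to its primary subtag, and the fallback to StandardAnalyzer are all assumptions made for illustration. The Lucene analyzers are referenced only by class name (as strings), so the snippet runs without Lucene on the classpath.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class AnalyzerLookupSketch {
    // Assumed fallback when no language-specific analyzer is registered.
    static final String FALLBACK = "org.apache.lucene.analysis.standard.StandardAnalyzer";

    // Illustrative table only; the real patch may cover other languages.
    private static final Map<String, String> BY_LANG = new HashMap<>();
    static {
        BY_LANG.put("en", "org.apache.lucene.analysis.en.EnglishAnalyzer");
        BY_LANG.put("fr", "org.apache.lucene.analysis.fr.FrenchAnalyzer");
        BY_LANG.put("de", "org.apache.lucene.analysis.de.GermanAnalyzer");
    }

    /** Returns the analyzer class name for a language tag, or the fallback. */
    public static String analyzerClassFor(String lang) {
        if (lang == null || lang.isEmpty()) {
            return FALLBACK;
        }
        String lower = lang.toLowerCase(Locale.ROOT);
        // Keep only the primary subtag: "fr-ca" becomes "fr" (an assumption).
        int dash = lower.indexOf('-');
        String primary = dash >= 0 ? lower.substring(0, dash) : lower;
        return BY_LANG.getOrDefault(primary, FALLBACK);
    }

    public static void main(String[] args) {
        System.out.println(analyzerClassFor("en"));    // ...en.EnglishAnalyzer
        System.out.println(analyzerClassFor("fr-CA")); // ...fr.FrenchAnalyzer
        System.out.println(analyzerClassFor("xx"));    // ...standard.StandardAnalyzer
    }
}
```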
2) TextIndexLuceneMultilingual
This new subclass of TextIndexLucene dynamically selects the appropriate
localized analyzer based on the literal's language. The selected analyzer is
used both for indexing and for querying the index. The language is also stored
in the index by default.
To enable multilingual support, set the following option in the index
assembler spec:
    <#indexLucene> a text:TextIndexLucene ;
        text:directory "mem" ;
        text:multilingualSupport true ;
        .
3) Explicit language field in the index
Even when linguistic analyzers are not needed, the language of each literal can
be stored in the index to extend query capabilities. To do so, set the new
langField parameter in the EntityMap assembler:
    <#entMap> a text:EntityMap ;
        text:entityField "uri" ;
        text:defaultField "text" ;
        text:langField "lang" ;
        .
4) Usage
Once langField is present in the index, SPARQL queries can take it into
account with clauses like:
    ?s text:query (rdfs:label 'word' 'lang:en')    # target English literals
    ?s text:query (rdfs:label 'word' 'lang:none')  # target unlocalized literals
    ?s text:query (rdfs:label 'word')              # ignore language
The "lang:xx" parameter is removed from the argument list before the
objectToStruct processing to avoid possible conflicts.
Other extra parameters could be generalized in the same way, e.g. "limit:10",
"score:x", ... This would make the parameters optional and remove the
constraints on their order and number.
5) Refactoring
To simplify the TextDatasetFactory class, the TextIndexConfig class has been
introduced. It avoids adding a new method for each new parameter: the class
provides a setter for each configurable value.
EntityDefinition has been changed in the same way.
Example code and unit tests have been updated accordingly. However, the old
methods could be re-introduced for backward compatibility.
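
The "lang:xx" handling described under 4) Usage, and its possible generalization to "limit:10" or "score:x", can be sketched as follows. This is only an illustration under stated assumptions, not the code from the patch: the class QueryArgsSketch, the ParsedArgs holder and the KNOWN_KEYS list are hypothetical names. The sketch splits the raw argument list into "key:value" options and the remaining positional arguments, which could then be processed as before.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class QueryArgsSketch {
    /** Holder for the two halves of the argument list (hypothetical type). */
    public static final class ParsedArgs {
        public final List<String> positional = new ArrayList<>();
        public final Map<String, String> options = new LinkedHashMap<>();
    }

    // Assumed set of option keys; only "lang" exists in the proposal so far.
    private static final List<String> KNOWN_KEYS = List.of("lang", "limit", "score");

    /** Moves every "key:value" argument with a known key into the options map. */
    public static ParsedArgs parse(List<String> args) {
        ParsedArgs parsed = new ParsedArgs();
        for (String arg : args) {
            int colon = arg.indexOf(':');
            String key = colon > 0 ? arg.substring(0, colon) : null;
            if (key != null && KNOWN_KEYS.contains(key)) {
                parsed.options.put(key, arg.substring(colon + 1));
            } else {
                // Anything else, including prefixed names such as rdfs:label,
                // stays in the positional list in its original order.
                parsed.positional.add(arg);
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        ParsedArgs p = parse(List.of("rdfs:label", "word", "lang:en", "limit:10"));
        System.out.println(p.positional); // [rdfs:label, word]
        System.out.println(p.options);    // {lang=en, limit=10}
    }
}
```

Checking the key against a fixed list is what keeps a prefixed name like rdfs:label from being mistaken for an option. Because options are recognized by key rather than by position, they can appear in any order or be left out.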
Alexis Miara
Analyst Programmer
Centre de recherche LICEF
Télé-université (TÉLUQ)
Montréal (Québec), Canada