Jena-text multilingual implementation

Alexis Miara Wed, 20 May 2015 08:42:09 -0700

Hi,
This proposal aims to integrate language-specific support into jena-text.
It summarizes the changes (and several discussions) made in
https://github.com/apache/jena/pull/64 (JENA-928) and previously in
https://github.com/apache/jena/pull/52. The forked branch is available at
https://github.com/LICEF/jena/tree/jena-text-ml-single-index. A single patch
file is also attached.

 
The changes and new features are listed below:
 
1) LocalizedAnalyzer
A new analyzer can now be specified (for the indexing or query phase) to take
advantage of Lucene's language-specific analyzers (stemming, stop words, ...).
Like the other existing analyzers (SimpleAnalyzer, KeywordAnalyzer, ...), it can
be used in assembler specifications, together with the relevant language:

text:queryAnalyzer [
     a text:LocalizedAnalyzer ;
     text:language "en"
]

In Java code, it can be instantiated with the static getLocalizedAnalyzer(lang)
method of the org.apache.jena.query.text.analyzer.Util class.
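
As a rough illustration of the idea of resolving a language tag to a language-specific analyzer, here is a minimal, self-contained sketch. It is not the Util.getLocalizedAnalyzer code from the branch: the class name AnalyzerLookupSketch, the method analyzerClassFor, the three-language table and the fallback to StandardAnalyzer are all assumptions made for the example, and analyzers are represented by their Lucene class names rather than real instances so that the sketch runs without Lucene on the classpath.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only: NOT the Util.getLocalizedAnalyzer implementation.
final class AnalyzerLookupSketch {

    // Assumed fallback when no language-specific analyzer is known.
    public static final String DEFAULT_ANALYZER =
            "org.apache.lucene.analysis.standard.StandardAnalyzer";

    // A few Lucene language analyzers keyed by language code (deliberately incomplete).
    private static final Map<String, String> BY_LANGUAGE = Map.of(
            "en", "org.apache.lucene.analysis.en.EnglishAnalyzer",
            "fr", "org.apache.lucene.analysis.fr.FrenchAnalyzer",
            "de", "org.apache.lucene.analysis.de.GermanAnalyzer");

    /** Returns the analyzer class name for a language tag such as "en" or "fr-CA". */
    public static String analyzerClassFor(String langTag) {
        if (langTag == null || langTag.isEmpty()) {
            return DEFAULT_ANALYZER;
        }
        // Keep only the primary subtag: "fr-CA" becomes "fr".
        String primary = langTag.split("[-_]", 2)[0].toLowerCase(Locale.ROOT);
        return BY_LANGUAGE.getOrDefault(primary, DEFAULT_ANALYZER);
    }

    public static void main(String[] args) {
        System.out.println(analyzerClassFor("en"));
        System.out.println(analyzerClassFor("fr-CA"));
        System.out.println(analyzerClassFor("xx"));
    }
}
```

Whether unknown languages should fall back to a generic analyzer or be rejected is a design choice; the fallback shown here is only one option.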
 
2) TextIndexLuceneMultilingual
This new subclass of TextIndexLucene dynamically selects the right localized
analyzer depending on the literal's language. The selected analyzer is used for
both indexing and querying the index. Also, the language is added to the index
by default. To enable multilingual support, just set the following option in
the index assembler spec:

<#indexLucene> a text:TextIndexLucene ;
    text:directory "mem" ;
    text:multilingualSupport true ;
    .

3) Explicit language field in the index
Even if linguistic analyzers are not needed, the literals' languages can be
stored in the index to extend query capabilities. For that, the new langField
parameter must be set in the EntityMap assembler:

<#entMap> a text:EntityMap ;
    text:entityField      "uri" ;
    text:defaultField     "text" ;
    text:langField        "lang" ;
    .

4) Usage
Once langField is present in the index, it can be taken into account in SPARQL
queries with clauses like:

?s text:query (rdfs:label 'word' 'lang:en')    # target English literals
?s text:query (rdfs:label 'word' 'lang:none')  # target unlocalized literals
?s text:query (rdfs:label 'word')              # ignore language

The "lang:xx" parameter is removed from the argument list before the
objectToStruct processing to avoid possible conflicts. Extra parameters should
be generalized in the same manner, e.g. "limit:10", "score:x", ... This would
allow parameters to be optional and would remove the order and size constraints.

5) Refactoring
To simplify the TextDatasetFactory class, the TextIndexConfig class has been
introduced. It avoids having to add new methods for each new parameter. This
class provides a setter for each desired variable. EntityDefinition has been
changed in the same way. Example code and unit tests have been updated
accordingly. However, the old methods could be re-introduced for backward
compatibility.

Alexis Miara
Analyst Programmer
Centre de recherche LICEF
Télé-université (TÉLUQ)
Montréal (Québec), Canada
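
To make the "lang:xx" handling described under Usage more concrete, here is a minimal, self-contained sketch of how named parameters could be separated from the positional arguments before the remaining list is processed further. It is not the code from the branch: the class QueryArgsSketch, its parse method and the set of recognized keys are hypothetical, and the arguments are modelled as plain strings rather than RDF nodes to keep the example runnable without Jena.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only: not the jena-text implementation.
final class QueryArgsSketch {

    // "lang" comes from the proposal; "limit" and "score" are its suggested generalization.
    public static final Set<String> KNOWN_KEYS = Set.of("lang", "limit", "score");

    // Arguments left over once the named parameters have been removed.
    public final List<String> positional = new ArrayList<>();
    // Named parameters, e.g. "lang" -> "en".
    public final Map<String, String> named = new LinkedHashMap<>();

    /** Splits "key:value" arguments with a known key from the positional ones. */
    public static QueryArgsSketch parse(List<String> args) {
        QueryArgsSketch result = new QueryArgsSketch();
        for (String arg : args) {
            int colon = arg.indexOf(':');
            String key = colon > 0 ? arg.substring(0, colon) : "";
            if (KNOWN_KEYS.contains(key)) {
                result.named.put(key, arg.substring(colon + 1));
            } else {
                // Unknown prefixes such as "rdfs:label" stay positional.
                result.positional.add(arg);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        QueryArgsSketch parsed = parse(List.of("rdfs:label", "word", "lang:en"));
        System.out.println(parsed.positional);
        System.out.println(parsed.named);
    }
}
```

Because only known keys are extracted, the named parameters can appear in any order and any of them can be omitted. One caveat of this approach is that a query string that itself starts with a recognized key followed by a colon would be interpreted as a parameter.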