[jira] [Commented] (JENA-1250) Upgrade text search to latest Lucene

Osma Suominen (JIRA) Sun, 30 Oct 2016 03:41:02 -0700

    [ 
https://issues.apache.org/jira/browse/JENA-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15619732#comment-15619732
 ]


Osma Suominen commented on JENA-1250:
-------------------------------------

Trust me, I know what I'm talking about. The multilingual functionality was 
implemented by Alexis Miara, but I helped him along the way. See 
https://github.com/apache/jena/pull/64 and JENA-928 for the details. The code 
relies on the exact pattern that was considered harmful ("a trappy API") by the 
Lucene developers and thus made impossible with Lucene 5. The API change was 
already made in Lucene 5.0.0, but the note about it was added to the 
migration guide only later, when people started complaining about their code 
breaking (see LUCENE-6212). So 5.5.2 vs 5.5.3 doesn't make any difference.

The jena-text code does use PerFieldAnalyzerWrapper but it's for reasons 
unrelated to multilingual indexing.

The current code uses just one field in the index to store values in multiple 
languages, with an additional `lang` field to store the language tag. So for 
example in the unit test that broke, the following values are stored in the 
index:

{noformat}
uri="TestEnglishStemming" label="i meet some engineer" lang="en"
{noformat}

(note that the text has been processed by EnglishStemmer, so it's "engineer", 
not "engineers" as in the original RDF - I'm guessing the stemming result here 
but it should be similar to the above)
 
If the resource had a German or French rdfs:label as well, it would be stored 
in a similar way, with the stemmed value in the `label` field.

What needs to be done instead is that the different language values should be 
stored in different fields such as `label_en`, `label_fr`, `label_de`. The 
field name can be used to select an appropriate language-specific Analyzer both 
at index time and query time.
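
As a rough sketch of the naming scheme described above: the helper below derives a per-language field name from a base field and a language tag. The class name `LanguageFieldNames`, the `_` separator handling, and the choice to keep only the primary language subtag (so `de-AT` maps to `label_de`) are assumptions made for illustration, not existing jena-text code.

```java
import java.util.Locale;

// Hypothetical helper, not part of jena-text: derives per-language field names.
final class LanguageFieldNames {
    private LanguageFieldNames() {}

    /** Returns e.g. "label_en" for ("label", "en"); the plain base field when there is no language tag. */
    static String fieldFor(String baseField, String langTag) {
        if (langTag == null || langTag.isEmpty()) {
            return baseField;
        }
        // Assumption: only the primary subtag matters, so "en-GB" and "en" share one field.
        String primary = langTag.split("-", 2)[0].toLowerCase(Locale.ROOT);
        return baseField + "_" + primary;
    }

    public static void main(String[] args) {
        System.out.println(fieldFor("label", "en"));
        System.out.println(fieldFor("label", "fr"));
        System.out.println(fieldFor("label", "de-AT"));
    }
}
```

The same function would have to be used at index time and at query time so that both sides agree on which field holds which language.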

I've already started designing a replacement for the current multilingual 
analyzer code. In practice using PerFieldAnalyzerWrapper in the way suggested 
by the Lucene migration notes is not so easy for jena-text, because it would 
require knowing the possible field name and language combinations in advance. 
Those could be computed based on EntityDefinition and knowledge of the 
available analyzers (currently in the Util class), but I think it would be more 
elegant and concise to make a MultilingualAnalyzer that works similarly to 
PerFieldAnalyzerWrapper but chooses the analyzer dynamically based on the field 
name suffix that indicates language. That needs a little bit of restructuring 
in the code (TextIndexLuceneMultilingual can be dropped and the Util class code 
for selecting language-specific analyzers moved into MultilingualAnalyzer). It 
shouldn't be very hard but it will take a while to sort out.
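
To illustrate the idea of choosing the analyzer from the field name suffix, here is a minimal, dependency-free sketch of the lookup logic only. `TextAnalyzer` is a stand-in for Lucene's `Analyzer`, and `SuffixAnalyzerSelector` with its methods is a hypothetical name; this is not the planned MultilingualAnalyzer itself. In a real Lucene implementation the same lookup would presumably sit behind the per-field hook that PerFieldAnalyzerWrapper uses (I believe that is `DelegatingAnalyzerWrapper.getWrappedAnalyzer(String fieldName)`, but treat that as an assumption to verify).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Stand-in for a Lucene Analyzer so the sketch has no external dependencies.
interface TextAnalyzer {
    String name();
}

// Hypothetical selector: picks an analyzer from the language suffix of a field name.
final class SuffixAnalyzerSelector {
    private final Map<String, TextAnalyzer> byLanguage = new HashMap<>();
    private final TextAnalyzer defaultAnalyzer;

    SuffixAnalyzerSelector(TextAnalyzer defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    void register(String lang, TextAnalyzer analyzer) {
        byLanguage.put(lang.toLowerCase(Locale.ROOT), analyzer);
    }

    /** "label_en" selects the analyzer registered for "en"; anything else falls back to the default. */
    TextAnalyzer analyzerFor(String fieldName) {
        int sep = fieldName.lastIndexOf('_');
        if (sep < 0 || sep == fieldName.length() - 1) {
            return defaultAnalyzer; // no suffix at all, e.g. "uri"
        }
        String lang = fieldName.substring(sep + 1).toLowerCase(Locale.ROOT);
        // An unregistered suffix (e.g. "pref_label") also falls back to the default.
        return byLanguage.getOrDefault(lang, defaultAnalyzer);
    }

    public static void main(String[] args) {
        SuffixAnalyzerSelector selector = new SuffixAnalyzerSelector(() -> "standard");
        selector.register("en", () -> "english");
        selector.register("de", () -> "german");
        System.out.println(selector.analyzerFor("label_en").name());
        System.out.println(selector.analyzerFor("label").name());
    }
}
```

Because the decision is made per lookup, nothing needs to know the field name and language combinations in advance; only the set of available language analyzers has to be registered.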

Please don't waste your time debugging the breaking test; it's quite obvious 
that it fails because of the commented-out code that I pointed out. Instead, if 
you want to continue with the upgrade, you could try, in a different branch, to 
update to Lucene 6 and see if there's any further breakage while I sort out the 
multilingual analyzer code.

> Upgrade text search to latest Lucene
> ------------------------------------
>
>                 Key: JENA-1250
>                 URL: https://issues.apache.org/jira/browse/JENA-1250
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: Jena
>            Reporter: Jean-Marc Vanel
>
> We are currently at Lucene 4.9.1,
> which is quite outdated compared to the latest Lucene, which is 6.2.1.
> Note that there is a project to add a simple completion feature in addition to 
> the existing simple search.
> But it would be better to do that on an updated Lucene dependency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
