[ https://issues.apache.org/jira/browse/JENA-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15619732#comment-15619732 ]
Osma Suominen commented on JENA-1250:
-------------------------------------
Trust me, I know what I'm talking about. The multilingual functionality was
implemented by Alexis Miara, but I helped him along the way. See
https://github.com/apache/jena/pull/64 and JENA-928 for the details. The code
relies on the exact pattern that was considered harmful ("a trappy API") by the
Lucene developers and thus made impossible with Lucene 5. The API change
happened already in Lucene 5.0.0, but the note about it was added to the
migration guide only later when people started complaining about their code
breaking (see LUCENE-6212). So 5.5.2 vs 5.5.3 doesn't make any difference.
The jena-text code does use PerFieldAnalyzerWrapper but it's for reasons
unrelated to multilingual indexing.
The current code uses just one field in the index to store values in multiple
languages, with an additional `lang` field to store the language tag. So for
example in the unit test that broke, the following values are stored in the
index:
{noformat}
uri="TestEnglishStemming" label="i meet some engineer" lang="en"
{noformat}
(Note that the text has been processed by EnglishStemmer, so it's "engineer",
not "engineers" as in the original RDF. I'm guessing at the exact stemming
result here, but it should be similar to the above.)
If the resource had a German or French rdfs:label as well, it would be stored
in a similar way, with the stemmed value in the `label` field.
What needs to be done instead is that the different language values should be
stored in different fields such as `label_en`, `label_fr`, `label_de`. The
field name can be used to select an appropriate language-specific Analyzer both
at index time and query time.
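
A minimal sketch of that field naming, in plain Java with no Lucene dependency. The class name `LangFields`, the method `fieldFor`, and the choice to keep only the primary subtag of the language tag (so "en-GB" maps to `label_en`) are assumptions made for illustration, not existing jena-text code:

```java
import java.util.Locale;

// Hypothetical helper: derives a per-language index field name
// such as "label_en" from a base field name and an RDF language tag.
final class LangFields {
    private LangFields() {}

    static String fieldFor(String baseField, String langTag) {
        // Literals without a language tag stay in the plain base field.
        if (langTag == null || langTag.isEmpty()) {
            return baseField;
        }
        // Assumption: only the primary subtag selects the analyzer ("en-GB" -> "en").
        String primary = langTag.split("-", 2)[0].toLowerCase(Locale.ROOT);
        if (primary.isEmpty()) {
            return baseField;
        }
        return baseField + "_" + primary;
    }
}
```

With this, an English rdfs:label would go to `label_en` and a French one to `label_fr`, while untagged literals would stay in `label`.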
I've already started designing a replacement for the current multilingual
analyzer code. In practice, using PerFieldAnalyzerWrapper in the way suggested
by the Lucene migration notes is not so easy for jena-text, because it would
require knowing the possible field name and language combinations in advance.
Those could be computed based on EntityDefinition and knowledge of the
available analyzers (currently in the Util class), but I think it would be more
elegant and concise to make a MultilingualAnalyzer that works similarly to
PerFieldAnalyzerWrapper but chooses the analyzer dynamically based on the field
name suffix that indicates the language. That needs a little bit of restructuring
in the code (TextIndexLuceneMultilingual can be dropped and the Util class code
for selecting language-specific analyzers moved into MultilingualAnalyzer). It
shouldn't be very hard but it will take a while to sort out.
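
To illustrate the suffix-based selection, here is a rough sketch of the lookup logic only. It is written in plain Java and is generic over the analyzer type so that it runs without Lucene; `SuffixSelector`, `register` and `select` are hypothetical names, and the eventual MultilingualAnalyzer may well look different. In a real implementation the same lookup could presumably sit inside Lucene's `DelegatingAnalyzerWrapper.getWrappedAnalyzer(String fieldName)`, but that is an assumption on my part at this point.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: picks an analyzer from the language suffix of a
// field name ("label_en" -> analyzer registered for "en"), with a fallback.
final class SuffixSelector<A> {
    private final Map<String, A> byLang = new HashMap<>();
    private final A fallback;

    SuffixSelector(A fallback) {
        this.fallback = fallback;
    }

    // Registers the analyzer to use for one language code.
    SuffixSelector<A> register(String lang, A analyzer) {
        byLang.put(lang, analyzer);
        return this;
    }

    // Returns the analyzer for the suffix after the last '_',
    // or the fallback when there is no suffix or it is unknown.
    A select(String fieldName) {
        int idx = fieldName.lastIndexOf('_');
        if (idx < 0 || idx == fieldName.length() - 1) {
            return fallback;
        }
        String lang = fieldName.substring(idx + 1);
        return byLang.getOrDefault(lang, fallback);
    }
}
```

The point of this design is that nothing has to be enumerated in advance: any field whose name ends in a known language suffix gets the matching analyzer at both index time and query time, and everything else falls back to the default.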
Please don't waste your time debugging the breaking test; it's quite obvious
that it fails because of the commented-out code that I pointed out. Instead, if
you want to continue with the upgrade, you could try, in a different branch, to
update to Lucene 6 and see if there's any further breakage while I sort out the
multilingual analyzer code.
> Upgrade text search to latest Lucene
> ------------------------------------
>
> Key: JENA-1250
> URL: https://issues.apache.org/jira/browse/JENA-1250
> Project: Apache Jena
> Issue Type: Improvement
> Components: Jena
> Reporter: Jean-Marc Vanel
>
> We are currently at Lucene 4.9.1,
> which is quite outdated compared to the latest Lucene, which is 6.2.1.
> Note that there is a project to add a simple completion feature in addition to
> the existing simple search.
> But it would be better to do that on an updated Lucene dependency.