[ https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16513774#comment-16513774 ]
ASF GitHub Bot commented on JENA-1556:
--------------------------------------

Github user kinow commented on a diff in the pull request:

    https://github.com/apache/jena/pull/436#discussion_r195658758

    --- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexLucene.java ---
    @@ -316,6 +326,13 @@ protected Document doc(Entity entity) {
                 if (this.isMultilingual) {
                     // add a field that uses a language-specific analyzer via MultilingualAnalyzer
                     doc.add(new Field(e.getKey() + "_" + lang, (String) e.getValue(), ftText));
    +                // add fields for any defined auxiliary indexes
    +                List<String> auxIndexes = Util.getAuxIndexes(lang);
    +                if (auxIndexes != null) {
    --- End diff --

    Never null, I believe. We return an empty list when `lang` is empty, and we use a `Hashtable` to keep the data. But happy to leave it if you prefer to double-check it anyway (I wonder if we should consider `@Nullable` and `@NotNull` in method signatures some day).

> text:query multilingual enhancements
> ------------------------------------
>
>                 Key: JENA-1556
>                 URL: https://issues.apache.org/jira/browse/JENA-1556
>             Project: Apache Jena
>          Issue Type: New Feature
>          Components: Text
>    Affects Versions: Jena 3.7.0
>            Reporter: Code Ferret
>            Assignee: Code Ferret
>            Priority: Major
>              Labels: pull-request-available
>
> This issue proposes two related enhancements to Jena Text. These enhancements
> have been implemented and a PR can be issued.
> There are two multilingual search situations that we want to support:
> # We want to be able to search in one encoding and retrieve results that may
> have been entered in other encodings. For example, searching via Simplified
> Chinese (Hans) and retrieving results that may have been entered in
> Traditional Chinese (Hant) or Pinyin. This simplifies applications by
> permitting encoding-independent retrieval without additional layers of
> transcoding and so on; it's all done under the covers in Lucene.
> # We want to search with queries entered in a lossy, e.g. phonetic,
> encoding and retrieve results entered with an accurate encoding. For example,
> searching via Pinyin without diacritics and retrieving all possible Hans and
> Hant triples.
>
> The first situation arises when entering triples that include languages with
> multiple encodings that, for various reasons, are not normalized to a single
> encoding. In this situation we want to be able to retrieve appropriate result
> sets without regard for the encodings used at the time the triples were
> inserted into the dataset.
> There are several such languages of interest in our application: Chinese,
> Tibetan, Sanskrit, Japanese and Korean. There are various Romanizations and
> ideographic variants.
> Encodings may not be normalized when inserting triples for a variety of
> reasons. A principal one is that the {{rdf:langString}} object often must be
> entered in the same encoding in which it occurs in some physical text that is
> being catalogued. Another is that metadata may be imported from sources that
> use different encoding conventions, and we want to preserve that form.
> The second situation arises because we want to provide simple support for
> phonetic or other forms of lossy search at the time that triples are indexed
> directly in the Lucene system.
> To handle the first situation we introduce a {{text}} assembler predicate,
> {{text:searchFor}}, that specifies the list of language variants that should
> be searched whenever a query string of a given encoding (language tag) is
> used.
> For example, the following
> {{text:TextIndexLucene/text:defineAnalyzers}} fragment:
> {code:java}
> [ text:addLang "bo" ;
>   text:searchFor ( "bo" "bo-x-ewts" "bo-alalc97" ) ;
>   text:analyzer [
>     a text:GenericAnalyzer ;
>     text:class "io.bdrc.lucene.bo.TibetanAnalyzer" ;
>     text:params (
>       [ text:paramName "segmentInWords" ;
>         text:paramValue false ]
>       [ text:paramName "lemmatize" ;
>         text:paramValue true ]
>       [ text:paramName "filterChars" ;
>         text:paramValue false ]
>       [ text:paramName "inputMode" ;
>         text:paramValue "unicode" ]
>       [ text:paramName "stopFilename" ;
>         text:paramValue "" ]
>       )
>     ] ;
>   ]
> {code}
> indicates that when using a search string such as "རྡོ་རྗེ་སྙིང་"@bo the
> Lucene index should also be searched for matches tagged as {{bo-x-ewts}} and
> {{bo-alalc97}}.
> This is made possible by a Tibetan {{Analyzer}} that tokenizes strings in all
> three encodings into Tibetan Unicode. This is feasible since the
> {{bo-x-ewts}} and {{bo-alalc97}} encodings are one-to-one with Unicode
> Tibetan. Since all fields with these language tags will have a common set of
> indexed terms, i.e., Tibetan Unicode, it suffices to arrange for the query
> analyzer to have access to the language tag for the query string along with
> the various fields that need to be considered.
> Supposing that the query is:
> {code:java}
> (?s ?sc ?lit) text:query ("rje"@bo-x-ewts)
> {code}
> then the query formed in {{TextIndexLucene}} will be:
> {code:java}
> label_bo:rje label_bo-x-ewts:rje label_bo-alalc97:rje
> {code}
> which is translated using a suitable {{Analyzer}},
> {{QueryMultilingualAnalyzer}}, via Lucene's {{QueryParser}} to:
> {code:java}
> +(label_bo:རྗེ label_bo-x-ewts:རྗེ label_bo-alalc97:རྗེ)
> {code}
> which reflects the underlying Tibetan Unicode term encoding.
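[Editor's note: the expansion step above, before analysis, is essentially string manipulation over the configured tags. A minimal sketch, assuming a hypothetical helper and a `searchFor` map standing in for the assembler configuration — this is NOT the actual {{TextIndexLucene}} code, and the `label_<tag>` field-name pattern is taken from the example above:]

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch of the searchFor query expansion -- NOT the actual
// Jena code. searchFor maps a language tag to the list of tags that
// text:searchFor says should be queried for it.
public class MultiFieldQueryExpander {

    static String expand(String queryText, String langTag,
                         Map<String, List<String>> searchFor) {
        // Fall back to the query's own tag when no searchFor list is defined.
        List<String> tags =
            searchFor.getOrDefault(langTag, Collections.singletonList(langTag));
        StringJoiner query = new StringJoiner(" ");
        for (String tag : tags) {
            query.add("label_" + tag + ":" + queryText);
        }
        return query.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> searchFor = new HashMap<>();
        searchFor.put("bo-x-ewts", Arrays.asList("bo", "bo-x-ewts", "bo-alalc97"));
        // "rje"@bo-x-ewts expands to one clause per configured tag:
        System.out.println(expand("rje", "bo-x-ewts", searchFor));
        // -> label_bo:rje label_bo-x-ewts:rje label_bo-alalc97:rje
    }
}
```

The expanded string would then be handed to the query analyzer, which tokenizes each clause per-field into the common (Tibetan Unicode) term encoding.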
> During {{IndexSearcher.search}} all documents with one of the three fields in
> the index for the term "རྗེ" will be returned, even though the value in the
> fields {{label_bo-x-ewts}} and {{label_bo-alalc97}} for the returned
> documents will be the original value "rje".
> This support simplifies applications by permitting encoding-independent
> retrieval without additional layers of transcoding and so on. It's all done
> under the covers in Lucene.
> Solving the second situation will simplify applications by adding appropriate
> fields and indexing via configuration in the
> {{text:TextIndexLucene/text:defineAnalyzers}}. For example, the following
> fragment:
> {code:java}
> [ text:addLang "zh-hans" ;
>   text:searchFor ( "zh-hans" "zh-hant" ) ;
>   text:auxIndex ( "zh-aux-han2pinyin" ) ;
>   text:analyzer [
>     a text:DefinedAnalyzer ;
>     text:useAnalyzer :hanzAnalyzer ] ;
>   ]
> [ text:addLang "zh-hant" ;
>   text:searchFor ( "zh-hans" "zh-hant" ) ;
>   text:auxIndex ( "zh-aux-han2pinyin" ) ;
>   text:analyzer [
>     a text:DefinedAnalyzer ;
>     text:useAnalyzer :hanzAnalyzer ] ;
>   ]
> [ text:addLang "zh-latn-pinyin" ;
>   text:searchFor ( "zh-latn-pinyin" "zh-aux-han2pinyin" ) ;
>   text:analyzer [
>     a text:DefinedAnalyzer ;
>     text:useAnalyzer :pinyin ] ;
>   ]
> [ text:addLang "zh-aux-han2pinyin" ;
>   text:searchFor ( "zh-latn-pinyin" "zh-aux-han2pinyin" ) ;
>   text:analyzer [
>     a text:DefinedAnalyzer ;
>     text:useAnalyzer :pinyin ] ;
>   text:indexAnalyzer :han2pinyin ;
>   ]
> {code}
> defines language tags for Traditional, Simplified, Pinyin and an _auxiliary_
> tag {{zh-aux-han2pinyin}} associated with an {{Analyzer}}, {{:han2pinyin}}.
> The purpose of the auxiliary tag is to define an {{Analyzer}} that will be
> used during indexing and to specify a list of tags that should be searched
> when the auxiliary tag is used with a query string.
> Searching is then done via the multi-encoding support discussed above.
> In this example the {{Analyzer}}, {{:han2pinyin}}, tokenizes strings in
> {{zh-hans}} and {{zh-hant}} as the corresponding pinyin, so that at search
> time a pinyin query will retrieve appropriate triples inserted in Traditional
> or Simplified Chinese. Such a query would appear as:
> {code}
> (?s ?sc ?lit ?g) text:query ("jīng"@zh-aux-han2pinyin)
> {code}
> The auxiliary field support is needed to accommodate situations such as
> pinyin or Soundex, which are not exact, i.e., one-to-many rather than
> one-to-one as in the case of Simplified and Traditional.
> {{TextIndexLucene}} adds a field for each of the auxiliary tags associated
> with the tag of the triple object being indexed. These fields are in addition
> to the un-tagged field and the field tagged with the language of the triple
> object literal.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
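[Editor's note: the field layout described in the final quoted paragraph can be sketched as follows. This is a hypothetical illustration, NOT the actual {{TextIndexLucene.doc}} code; the method and class names are invented for the example:]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the per-literal field layout -- NOT the actual
// Jena code. For one indexed literal it lists the un-tagged field, the
// language-tagged field, and one field per auxiliary index tag.
public class AuxFieldNames {

    static List<String> fieldNames(String key, String lang, List<String> auxIndexes) {
        List<String> names = new ArrayList<>();
        names.add(key);                  // un-tagged field
        names.add(key + "_" + lang);     // field tagged with the literal's language
        for (String aux : auxIndexes) {  // one extra field per auxiliary tag
            names.add(key + "_" + aux);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(fieldNames("label", "zh-hans",
                                      Arrays.asList("zh-aux-han2pinyin")));
        // -> [label, label_zh-hans, label_zh-aux-han2pinyin]
    }
}
```

Under the configuration quoted above, the {{label_zh-aux-han2pinyin}} field would be populated using the {{:han2pinyin}} index analyzer, which is what lets a pinyin query reach Hans/Hant content.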