[ https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16514250#comment-16514250 ]

ASF GitHub Bot commented on JENA-1556:
--------------------------------------

Github user xristy commented on the issue:

    https://github.com/apache/jena/pull/436
  
    @kinow I think the configuration and results reflect the `text:searchFor` 
functionality; however, in the analyzer definition for the tag `jp`:
    
             text:analyzer [
               a text:GenericAnalyzer ;
               text:class "org.apache.lucene.analysis.ja.JapaneseAnalyzer" ;
               text:tokenizer <#tokenizer> ;
            ]
    
    the `text:tokenizer <#tokenizer> ;` is not effective. Tokenizer specs work 
with `ConfigurableAnalyzer` and are ignored by `text:GenericAnalyzer`. Perhaps 
a warning should be logged, but that would mean checking for the presence of 
unsupported predicates?
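    
    For reference, a sketch (my reading of the jena-text assembler 
vocabulary, not part of this PR) of a definition where a tokenizer spec does 
take effect, using `text:ConfigurableAnalyzer`:
    
             text:analyzer [
                a text:ConfigurableAnalyzer ;
                text:tokenizer text:StandardTokenizer ;
                text:filters ( text:LowerCaseFilter )
             ]
    
    So the `jp` entry could either drop the `text:tokenizer` line or be 
rewritten as a `ConfigurableAnalyzer` if custom tokenization is actually 
wanted.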
    
    Re:
    > the complexity put on TextIndexLucene. A few methods are getting a 
boolean flag to change their behaviour. And when that happens too much, 
sometimes it may feel like the method has two behaviours, and writing tests or 
changing it may be challenging. Maybe it could extend it in some other way.
    
    I'm not sure how to improve this. The flag in `highlightResults` affects 
the value of the `effectiveField` in the context of a larger method, and the 
flag in `getQueryAnalyzer` conditions whether any useful work is done or not. I 
factored that as a method rather than leaving it inline in `query$` to reduce 
the clutter in that principal routine.
    
    Re:
    > it's not a batteries-included feature, if I understand correctly. You 
still need to prepare the other part of the solution, be it a tokenizer that 
gets a value such as "kinou", then searches some dictionary, and finally create 
tokens for :ex3 dc:title "昨日" and "きのう", or change the data a bit. Maybe this 
could be a separate project, or an extension of sorts.
    
    I'm not sure what you are recommending here. The `text:searchFor` and 
`text:auxIndex` functionalities are ways of configuring the _application_ of 
appropriate analyzers that have been separately defined. So yes, the features 
are not self-contained, in that the analyzers do have to be supplied.


> text:query multilingual enhancements
> ------------------------------------
>
>                 Key: JENA-1556
>                 URL: https://issues.apache.org/jira/browse/JENA-1556
>             Project: Apache Jena
>          Issue Type: New Feature
>          Components: Text
>    Affects Versions: Jena 3.7.0
>            Reporter: Code Ferret
>            Assignee: Code Ferret
>            Priority: Major
>              Labels: pull-request-available
>
> This issue proposes two related enhancements of Jena Text. These enhancements 
> have been implemented and a PR can be issued. 
> There are two multilingual search situations that we want to support:
>  # We want to be able to search in one encoding and retrieve results that may 
> have been entered in other encodings. For example, searching via Simplified 
> Chinese (Hans) and retrieving results that may have been entered in 
> Traditional Chinese (Hant) or Pinyin. This will simplify applications by 
> permitting encoding independent retrieval without additional layers of 
> transcoding and so on. It's all done under the covers in Lucene.
>  # We want to search with queries entered in a lossy, e.g., phonetic, 
> encoding and retrieve results entered with accurate encoding. For example, 
> searching via Pinyin without diacritics and retrieving all possible Hans and 
> Hant triples.
> The first situation arises when entering triples that include languages with 
> multiple encodings that for various reasons are not normalized to a single 
> encoding. In this situation we want to be able to retrieve appropriate result 
> sets without regard for the encodings used at the time that the triples were 
> inserted into the dataset.
> There are several such languages of interest in our application: Chinese, 
> Tibetan, Sanskrit, Japanese and Korean, each with various Romanizations and 
> ideographic variants.
> Encodings may not be normalized when inserting triples for a variety of reasons. 
> A principal one is that the {{rdf:langString}} object often must be entered 
> in the same encoding in which it occurs in some physical text that is being 
> catalogued. Another is that metadata may be imported from sources that use 
> different encoding conventions and we want to preserve that form.
> The second situation arises as we want to provide simple support for phonetic 
> or other forms of lossy search at the time that triples are indexed directly 
> in the Lucene system.
> To handle the first situation we introduce a {{text}} assembler predicate, 
> {{text:searchFor}}, that specifies the list of language variants that should 
> be searched whenever a query string with a given encoding (language tag) is 
> used. For example, the following 
> {{text:TextIndexLucene/text:defineAnalyzers}} fragment:
> {code:java}
>         [ text:addLang "bo" ; 
>           text:searchFor ( "bo" "bo-x-ewts" "bo-alalc97" ) ;
>           text:analyzer [ 
>             a text:GenericAnalyzer ;
>             text:class "io.bdrc.lucene.bo.TibetanAnalyzer" ;
>             text:params (
>                 [ text:paramName "segmentInWords" ;
>                   text:paramValue false ]
>                 [ text:paramName "lemmatize" ;
>                   text:paramValue true ]
>                 [ text:paramName "filterChars" ;
>                   text:paramValue false ]
>                 [ text:paramName "inputMode" ;
>                   text:paramValue "unicode" ]
>                 [ text:paramName "stopFilename" ;
>                   text:paramValue "" ]
>                 )
>             ] ; 
>           ]
> {code}
> indicates that when using a search string such as "རྡོ་རྗེ་སྙིང་"@bo the 
> Lucene index should also be searched for matches tagged as {{bo-x-ewts}} and 
> {{bo-alalc97}}.
> This is made possible by a Tibetan {{Analyzer}} that tokenizes strings in all 
> three encodings into Tibetan Unicode. This is feasible since the 
> {{bo-x-ewts}} and {{bo-alalc97}} encodings are one-to-one with Unicode 
> Tibetan. Since all fields with these language tags will have a common set of 
> indexed terms, i.e., Tibetan Unicode, it suffices to arrange for the query 
> analyzer to have access to the language tag for the query string along with 
> the various fields that need to be considered.
> Supposing that the query is:
> {code:java}
>     (?s ?sc ?lit) text:query ("rje"@bo-x-ewts) 
> {code}
> Then the query formed in {{TextIndexLucene}} will be:
> {code:java}
>     label_bo:rje label_bo-x-ewts:rje label_bo-alalc97:rje
> {code}
> which is translated using a suitable {{Analyzer}}, 
> {{QueryMultilingualAnalyzer}}, via Lucene's {{QueryParser}} to:
> {code:java}
>     +(label_bo:རྗེ label_bo-x-ewts:རྗེ label_bo-alalc97:རྗེ)
> {code}
> which reflects the underlying Tibetan Unicode term encoding. During 
> {{IndexSearcher.search}}, all documents that contain the term "རྗེ" in any of 
> the three fields will be returned, even though the value stored in the fields 
> {{label_bo-x-ewts}} and {{label_bo-alalc97}} of the returned documents will 
> be the original value "rje".
> The second situation is handled by adding appropriate fields and indexing 
> via configuration in the 
> {{text:TextIndexLucene/text:defineAnalyzers}}. For example, the following 
> fragment
> {code:java}
>         [ text:addLang "zh-hans" ; 
>           text:searchFor ( "zh-hans" "zh-hant" ) ;
>           text:auxIndex ( "zh-aux-han2pinyin" ) ;
>           text:analyzer [
>             a text:DefinedAnalyzer ;
>             text:useAnalyzer :hanzAnalyzer ] ; 
>           ]
>         [ text:addLang "zh-hant" ; 
>           text:searchFor ( "zh-hans" "zh-hant" ) ;
>           text:auxIndex ( "zh-aux-han2pinyin" ) ;
>           text:analyzer [
>             a text:DefinedAnalyzer ;
>             text:useAnalyzer :hanzAnalyzer ] ; 
>           ]
>         [ text:addLang "zh-latn-pinyin" ;
>           text:searchFor ( "zh-latn-pinyin" "zh-aux-han2pinyin" ) ;
>           text:analyzer [
>             a text:DefinedAnalyzer ;
>             text:useAnalyzer :pinyin ] ; 
>           ]        
>         [ text:addLang "zh-aux-han2pinyin" ;
>           text:searchFor ( "zh-latn-pinyin" "zh-aux-han2pinyin" ) ;
>           text:analyzer [
>             a text:DefinedAnalyzer ;
>             text:useAnalyzer :pinyin ] ; 
>           text:indexAnalyzer :han2pinyin ; 
>           ]
> {code}
> defines language tags for Traditional, Simplified, Pinyin and an _auxiliary_ 
> tag {{zh-aux-han2pinyin}} associated with an {{Analyzer}}, {{:han2pinyin}}. 
> The purpose of the auxiliary tag is to define an {{Analyzer}} that will be 
> used during indexing and to specify a list of tags that should be searched 
> when the auxiliary tag is used with a query string. 
> Searching is then done via the multi-encoding support discussed above. In this 
> example the {{Analyzer}}, {{:han2pinyin}}, tokenizes strings in {{zh-hans}} 
> and {{zh-hant}} as the corresponding pinyin, so that at search time a pinyin 
> query will retrieve appropriate triples inserted in Traditional or Simplified 
> Chinese. Such a query would appear as:
> {code}
>     (?s ?sc ?lit ?g) text:query ("jīng"@zh-aux-han2pinyin)
> {code}
> The auxiliary field support is needed to accommodate situations such as 
> pinyin or soundex, which are not exact, i.e., one-to-many rather than 
> one-to-one as is the case with Simplified and Traditional.
> {{TextIndexLucene}} adds a field for each of the auxiliary tags associated 
> with the tag of the triple object being indexed. These fields are in addition 
> to the un-tagged field and the field tagged with the language of the triple 
> object literal.
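> As a rough illustration (assuming an entity map that indexes {{rdfs:label}} 
> into the field {{label}}), indexing the triple
> {code}
>     :ex rdfs:label "北京"@zh-hans .
> {code}
> under the configuration above would yield a Lucene document with the 
> un-tagged field {{label}}, the tagged field {{label_zh-hans}} (analyzed by 
> {{:hanzAnalyzer}}), and the auxiliary field {{label_zh-aux-han2pinyin}} 
> (analyzed by {{:han2pinyin}}), so that a later pinyin query tagged 
> {{zh-latn-pinyin}} or {{zh-aux-han2pinyin}} can match the document.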



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
