[jira] [Commented] (JENA-1556) text:query multilingual enhancements

2018-06-15 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514250#comment-16514250
 ] 

ASF GitHub Bot commented on JENA-1556:
--

Github user xristy commented on the issue:

https://github.com/apache/jena/pull/436
  
@kinow I think the configuration and results reflect the `text:searchFor` 
functionality; however, in the analyzer defn for the tag `jp`:

 text:analyzer [
   a text:GenericAnalyzer ;
   text:class "org.apache.lucene.analysis.ja.JapaneseAnalyzer" ;
   text:tokenizer <#tokenizer> ;
]

the `text:tokenizer <#tokenizer> ;` has no effect. Tokenizer specs only work 
with `ConfigurableAnalyzer` and are ignored by `text:GenericAnalyzer`. Perhaps 
a warning should be logged, but that would mean checking for the presence of 
unsupported predicates?
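
If a warning were wanted, a rough sketch of such a check (illustrative only, not 
code from this PR; the namespace constant and the `spec` resource are assumptions) 
could be:

    import org.apache.jena.rdf.model.Property;
    import org.apache.jena.rdf.model.Resource;
    import org.apache.jena.rdf.model.ResourceFactory;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    class GenericAnalyzerChecks {
        private static final Logger log = LoggerFactory.getLogger(GenericAnalyzerChecks.class);
        // assumed: the jena-text vocabulary namespace
        private static final String TEXT_NS = "http://jena.apache.org/text#";
        private static final Property TOKENIZER = ResourceFactory.createProperty(TEXT_NS, "tokenizer");

        /** Warn when a text:GenericAnalyzer spec carries a predicate it will ignore. */
        static void warnOnIgnoredPredicates(Resource spec) {
            if (spec.hasProperty(TOKENIZER)) {
                log.warn("text:tokenizer on {} is ignored; tokenizers only apply to text:ConfigurableAnalyzer", spec);
            }
        }
    }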

Re:
> the complexity put on TextIndexLucene. A few methods are getting a 
boolean flag to change their behaviour. And when that happens too much, 
sometimes it may feel like the method has two behaviours, and writing tests or 
changing it may be challenging. Maybe it could extend it in some other way.

I'm not sure how to improve this. The flag in `highlightResults` affects 
the value of the `effectiveField` in the context of a larger method, and the 
flag in `getQueryAnalyzer` conditions whether any useful work is done or not. I 
factored that as a method rather than leaving it inline in `query$` to reduce 
the clutter in that principal routine.
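
For readers following the review, the extraction being described has roughly this 
shape (hypothetical names; a sketch of the pattern, not the PR's actual methods):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.core.KeywordAnalyzer;

    // Illustration of the flag-guarded helper only; field and method names are made up.
    class FlaggedAnalyzerSelection {
        private final Analyzer defaultQueryAnalyzer = new KeywordAnalyzer();

        Analyzer getQueryAnalyzer(boolean usingSearchFor, String lang) {
            if (!usingSearchFor) {
                return defaultQueryAnalyzer;      // flag off: no language-specific work at all
            }
            return buildSearchForAnalyzer(lang);  // flag on: do the useful work
        }

        private Analyzer buildSearchForAnalyzer(String lang) {
            // stand-in for constructing the searchFor-aware, per-language analyzer
            return defaultQueryAnalyzer;
        }
    }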

Re:
> it's not a batteries-included feature, if I understand correctly. You 
still need to prepare the other part of the solution, be it a tokenizer that 
gets a value such as "kinou", then searches some dictionary, and finally create 
tokens for :ex3 dc:title "昨日" and "きのう", or change the data a bit. Maybe this 
could be a separate project, or an extension of sorts.

I'm not sure what you are recommending here. The `text:searchFor` and 
`text:auxIndex` functionalities are ways of configuring the _application_ of 
appropriate analyzers that have been defined separately. So yes, the features 
are not self-contained, in that the analyzers do have to be supplied.


> text:query multilingual enhancements
> 
>
> Key: JENA-1556
> URL: https://issues.apache.org/jira/browse/JENA-1556
> Project: Apache Jena
>  Issue Type: New Feature
>  Components: Text
>Affects Versions: Jena 3.7.0
>Reporter: Code Ferret
>Assignee: Code Ferret
>Priority: Major
>  Labels: pull-request-available
>
> This issue proposes two related enhancements of Jena Text. These enhancements 
> have been implemented and a PR can be issued. 
> There are two multilingual search situations that we want to support:
>  # We want to be able to search in one encoding and retrieve results that may 
> have been entered in other encodings. For example, searching via Simplified 
> Chinese (Hans) and retrieving results that may have been entered in 
> Traditional Chinese (Hant) or Pinyin. This will simplify applications by 
> permitting encoding independent retrieval without additional layers of 
> transcoding and so on. It's all done under the covers in Lucene.
>  # We want to search with queries entered in a lossy, e.g., phonetic, 
> encoding and retrieve results entered with accurate encoding. For example, 
> searching via Pinyin without diacritics and retrieving all possible Hans and 
> Hant triples.
> The first situation arises when entering triples that include languages with 
> multiple encodings that for various reasons are not normalized to a single 
> encoding. In this situation we want to be able to retrieve appropriate result 
> sets without regard for the encodings used at the time that the triples were 
> inserted into the dataset.
> There are several such languages of interest in our application: Chinese, 
> Tibetan, Sanskrit, Japanese and Korean. There are various Romanizations and 
> ideographic variants.
> Encodings may not be normalized when inserting triples for a variety of reasons. 
> A principal one is that the {{rdf:langString}} object often must be entered 
> in the same encoding that it occurs in some physical text that is being 
> catalogued. Another is that metadata may be imported from sources that use 
> different encoding conventions and we want to preserve that form.
> The second situation arises as we want to provide simple support for phonetic 
> or other forms of lossy search at the time that triples are indexed directly 
> in the Lucene system.
> To handle the first situation we introduce a {{text}} assembler predicate, 
> {{text:searchFor}}, that specifies a list of language tags naming the language 
> variants that should be searched whenever a query string of 
> a 



Re: Contribution proposal for Jena: support of a datatype for quantity values

2018-06-15 Thread Maxime Lefrançois
Some comments:



> > Add use of the standard Java Service Provider API to load things
> automatically found in the classpath:
> >
> > - In TypeMapper --> a method that uses the Service Provider API to find
> more Datatypes
>
> Should this be a method, or rather additional behavior for getTypeByName,
> etc.? Are you thinking of something like "void getMoreMappings()" which
> would check for more available datatypes?
>

I don't know yet; what is your opinion? At any rate, some functionality like this would
be coded somewhere in the TypeMapper class.
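
Just to make the idea concrete, one possible shape is sketched below; the
`RDFDatatypeProvider` interface is an assumption for illustration, not an existing
Jena API, while `TypeMapper.registerDatatype` is the existing entry point:

    import java.util.ServiceLoader;
    import org.apache.jena.datatypes.RDFDatatype;
    import org.apache.jena.datatypes.TypeMapper;

    /** Hypothetical SPI: implementations would be listed in a META-INF/services file. */
    interface RDFDatatypeProvider {
        Iterable<RDFDatatype> datatypes();
    }

    class DatatypeLoading {
        /** Sketch: register every datatype contributed by providers found on the classpath. */
        static void loadContributedDatatypes(TypeMapper mapper) {
            for (RDFDatatypeProvider provider : ServiceLoader.load(RDFDatatypeProvider.class)) {
                for (RDFDatatype dt : provider.datatypes()) {
                    mapper.registerDatatype(dt);
                }
            }
        }
    }

Whether this runs eagerly when TypeMapper is constructed or lazily on a
getTypeByName miss is exactly the open question above.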


>
> > - Datatype subclasses are not for just one URI, but could be for a set
> of URIs
>
> Would that be true of Java types, as well?
>

I think it would be better to avoid this being true for Java types.
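
A simple way to stay close to the current Java model would be one RDFDatatype
instance per URI, all sharing one implementation class; a minimal sketch (the
`QuantityDatatype` name and the URI list are made up for illustration):

    import org.apache.jena.datatypes.BaseDatatype;
    import org.apache.jena.datatypes.TypeMapper;

    /** Illustrative datatype: one Java class, instantiated once per datatype URI. */
    class QuantityDatatype extends BaseDatatype {
        QuantityDatatype(String uri) {
            super(uri);
            // a real implementation would also override parse()/unparse()
        }
    }

    class QuantityDatatypes {
        /** Register the same implementation under each of the given URIs. */
        static void registerAll(TypeMapper mapper, String... uris) {
            for (String uri : uris) {
                mapper.registerDatatype(new QuantityDatatype(uri));
            }
        }
    }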


>
> > - ValueSpaceClassification should not be an enum any more --> maybe use
> a class ValueSpace ...
> > - should add some interface like NodeValueComparator, with some methods
> like:
> >  - canCompare(ValueSpace vs, ValueSpace vs)
> >  - sameAs(NodeValue nv, NodeValue nv)
> >  - compare(NodeValue nv, NodeValue nv)
>
> Should this return a Comparator instead? (Thinking of sorting.)
>

Could be, but I tried to mimic and externalize what's already there in the
NodeValue class.


> >  - add(NodeValue nv, NodeValue nv)
> >  - subtract(NodeValue nv, NodeValue nv)
> > - in NodeValue class, methods sameAs(NodeValue nv1, NodeValue nv2) and
> compare(...) should use the Service Provider API to find
> NodeValueComparators in the classpath
> > - in class NodeValueOps, methods divisionNV(NodeValue nv1, NodeValue
> nv2), multiplicationNV(...), additionNV(...), subtractionNV(...) should
> use the Service Provider API to find more NodeValueComparators in the
> classpath
>
> Hm. Is there some way this could happen via a lookup in TypeMapper? I'd
> rather not see too many paths to the same service impls...
>

I don't think so, as this would lead to mixing things between jena-core and
jena-arq.

Best,
Max


> > Any thoughts about this?
>
> Yes: thank you so much for doing this excellent work!


> > Best regards,
> > Maxime Lefrançois
> >
> >
> >
> > On Sat, Apr 7, 2018 at 15:13, ajs6f wrote:
> >
> >> We're (well, Andy is) working on 3.7.0 now. We've been trying to
> maintain
> >> a 6-month or so release cadence, so you've hit a really good time to
> begin
> >> this work. That having been said, I don't think anyone would say that we
> >> are especially stringent about it, so I wouldn't worry too much about
> the
> >> timing myself.
> >>
> >> ajs6f
> >>
> >>> On Apr 6, 2018, at 9:36 AM, Maxime Lefrançois <
> maxime.lefranc...@emse.fr>
> >> wrote:
> >>>
> >>> Well,
> >>>
> >>> I think I have a pretty clear idea how I would do this. We would end up
> >>> using a registry like the one for custom functions or datatypes.
> >>> That registry would contain an ordered list of SPARQL operator
> handlers,
> >>> pre-filled by one for handling XSD datatypes.
> >>>
> >>> I am currently requesting the right to fill the Apache individual
> >>> contributor license agreement.
> >>>
> >>> What would be the timeline if we wanted this shipped in the next
> release?
> >>>
> >>> Best,
> >>> Maxime
> >>>
> >>> On Tue, Apr 3, 2018 at 15:30, ajs6f wrote:
> >>>
>  I agree. I can imagine plenty of use cases for such a powerful pair of
>  extension points.
> 
>  Maxime, how can we help you attack that work? Is there a design that
> is
>  already clear to you? Are there any blockers we can help remove?
> 
>  ajs6f
> 
> > On Mar 28, 2018, at 5:08 AM, Rob Vesse  wrote:
> >
> > I think work towards Option 2 would be the most valuable to the
> >> community
> >
> >
> >
> > The SPARQL specification allows for the overloading of any
>  operator/expression where the spec currently defines the evaluation to
> >> be
>  an error so extending operators is a natural and valid extension point
> >> to
>  provide
> >
> >
> >
> > The Terms of Use for UCUM would probably need us to obtain a
> licensing
>  assessment from Apache Legal as it is a non-standard OSS license even
> if
>  the code that implements it is under BSD (which is fine from an Apache
>  perspective).  Therefore having a well defined extension mechanism and
> >> then
>  having UCUM support live outside Apache Jena as an extension
>  implementation maintained by yourself would be the easiest approach
> >
> >
> >
> > Rob
> >
> >
> >
> > From: Maxime Lefrançois 
> > Reply-To: 
> > Date: Wednesday, 28 March 2018 at 09:29
> > To: 
> > Subject: Re: Contribution proposal for Jena: support of a datatype
> for
>  quantity values
> >
> >
> >
> > Dear all,
> >
> >
> >
> > Happy to see you are interested in the UCUM datatypes!
> >
> >
> >
> > OK, so let's dive into the technical details.
> >
> >
> >
> > # Compare Jena 3.6.0 and Jena 3.6.0-ucum
> >
> 

[jira] [Commented] (JENA-1556) text:query multilingual enhancements

2018-06-15 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513857#comment-16513857
 ] 

ASF GitHub Bot commented on JENA-1556:
--

Github user xristy commented on a diff in the pull request:

https://github.com/apache/jena/pull/436#discussion_r195745831
  
--- Diff: 
jena-text/src/main/java/org/apache/jena/query/text/analyzer/QueryMultilingualAnalyzer.java
 ---
@@ -0,0 +1,75 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.jena.query.text.analyzer ;
+
+import org.apache.lucene.analysis.Analyzer ;
+import org.apache.lucene.analysis.DelegatingAnalyzerWrapper;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/** 
+ * Lucene Analyzer implementation that delegates to a language-specific
+ * Analyzer based on a field name suffix: e.g. field="label_en" will use
+ * an EnglishAnalyzer.
+ */
+
+public class QueryMultilingualAnalyzer extends DelegatingAnalyzerWrapper {
+    private static Logger log = LoggerFactory.getLogger(QueryMultilingualAnalyzer.class);
+    private Analyzer defaultAnalyzer;
+    private String langTag;
+
+    public QueryMultilingualAnalyzer(Analyzer defaultAnalyzer) {
+        super(PER_FIELD_REUSE_STRATEGY);
+        this.defaultAnalyzer = defaultAnalyzer;
+        this.langTag = null;
+    }
+
+    public QueryMultilingualAnalyzer(Analyzer defaultAnalyzer, String tag) {
+        super(PER_FIELD_REUSE_STRATEGY);
+        this.defaultAnalyzer = defaultAnalyzer;
+        this.langTag = tag;
+    }
+
+    @Override
+    /**
+     * The analyzer corresponding to the langTag supplied at instantiation
+     * is used to retrieve the analyzer to use regardless of the tag on the
+     * fieldName. If no langTag is supplied then the tag on fieldName is
+     * used to retrieve the analyzer as with the MultilingualAnalyzer
+     *
+     * @param fieldName
+     * @return the analyzer to use in the search
+     */
+    protected Analyzer getWrappedAnalyzer(String fieldName) {
+        int idx = fieldName.lastIndexOf("_");
+        if (idx == -1) { // not language-specific, e.g. "label"
+            return defaultAnalyzer;
+        }
+        String lang = langTag != null ? langTag : fieldName.substring(idx+1);
+        Analyzer analyzer = Util.getLocalizedAnalyzer(lang);
+        analyzer = analyzer != null ? analyzer : defaultAnalyzer;
--- End diff --

done. I need to become more familiar with ``StringUtils`` and 
``ObjectUtils``
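
For reference, the kind of simplification being hinted at with commons-lang3 might
look like the following; it is just one possible rewrite of the fallback logic in
`getWrappedAnalyzer`, not necessarily what was committed:

    import org.apache.commons.lang3.ObjectUtils;
    import org.apache.commons.lang3.StringUtils;
    import org.apache.jena.query.text.analyzer.Util;
    import org.apache.lucene.analysis.Analyzer;

    class AnalyzerFallback {
        /** substringAfterLast returns "" when there is no "_"; firstNonNull picks the first non-null value. */
        static Analyzer resolve(String fieldName, String langTag, Analyzer defaultAnalyzer) {
            String suffix = StringUtils.substringAfterLast(fieldName, "_");
            if (StringUtils.isEmpty(suffix)) {
                return defaultAnalyzer;   // not a language-specific field, e.g. "label"
            }
            String lang = ObjectUtils.firstNonNull(langTag, suffix);
            return ObjectUtils.firstNonNull(Util.getLocalizedAnalyzer(lang), defaultAnalyzer);
        }
    }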




Re: Contribution proposal for Jena: support of a datatype for quantity values

2018-06-15 Thread ajs6f
See comments in-line...

ajs6f

> On Jun 15, 2018, at 9:49 AM, Maxime Lefrançois  
> wrote:
> 
> Dear all,
> 
> Regarding our contribution proposal to enable extensions to override SPARQL 
> operators in Jena
> 
> We finally got the agreement from our institution to contribute to the Apache 
> foundation.
> Question 1: what is the procedure to upload the form?

If we're talking about:

https://www.apache.org/licenses/cla-corporate.txt

then you can just scan and email a PDF to secret...@apache.org. There are other 
means of submission at that URL.

> About the how, I would like to discuss first with you
> 
> In a nutshell this is what I was thinking about:
> 
> Add use of the standard Java Service Provider API to load things 
> automatically found in the classpath:
> 
> - In TypeMapper --> a method that uses the Service Provider API to find more 
> Datatypes

Should this be a method, or rather additional behavior for getTypeByName, etc.? 
Are you thinking of something like "void getMoreMappings()" which would check 
for more available datatypes?

> - Datatype subclasses are not for just one URI, but could be for a set of URIs

Would that be true of Java types, as well?

> - ValueSpaceClassification should not be an enum any more --> maybe use a 
> class ValueSpace ...
> - should add some interface like NodeValueComparator, with some methods like:
>  - canCompare(ValueSpace vs, ValueSpace vs)
>  - sameAs(NodeValue nv, NodeValue nv)
>  - compare(NodeValue nv, NodeValue nv)

Should this return a Comparator instead? (Thinking of sorting.)

>  - add(NodeValue nv, NodeValue nv)
>  - subtract(NodeValue nv, NodeValue nv)
> - in NodeValue class, methods sameAs(NodeValue nv1, NodeValue nv2) and 
> compare(...) should use the Service Provider API to find 
> NodeValueComparators in the classpath
> - in class NodeValueOps, methods divisionNV(NodeValue nv1, NodeValue nv2), 
> multiplicationNV(...), additionNV(...), subtractionNV(...) should use 
> the Service Provider API to find more NodeValueComparators in the classpath

Hm. Is there some way this could happen via a lookup in TypeMapper? I'd rather 
not see too many paths to the same service impls...

> Any thoughts about this?

Yes: thank you so much for doing this excellent work!

> Best regards,
> Maxime Lefrançois
> 
> 
> 
> On Sat, Apr 7, 2018 at 15:13, ajs6f wrote:
> 
>> We're (well, Andy is) working on 3.7.0 now. We've been trying to maintain
>> a 6-month or so release cadence, so you've hit a really good time to begin
>> this work. That having been said, I don't think anyone would say that we
>> are especially stringent about it, so I wouldn't worry too much about the
>> timing myself.
>> 
>> ajs6f
>> 
>>> On Apr 6, 2018, at 9:36 AM, Maxime Lefrançois 
>> wrote:
>>> 
>>> Well,
>>> 
>>> I think I have a pretty clear idea how I would do this. We would end up
>>> using a registry like the one for custom functions or datatypes.
>>> That registry would contain an ordered list of SPARQL operator handlers,
>>> pre-filled by one for handling XSD datatypes.
>>> 
>>> I am currently requesting the right to fill the Apache individual
>>> contributor license agreement.
>>> 
>>> What would be the timeline if we wanted this shipped in the next release?
>>> 
>>> Best,
>>> Maxime
>>> 
>>> On Tue, Apr 3, 2018 at 15:30, ajs6f wrote:
>>> 
 I agree. I can imagine plenty of use cases for such a powerful pair of
 extension points.
 
 Maxime, how can we help you attack that work? Is there a design that is
 already clear to you? Are there any blockers we can help remove?
 
 ajs6f
 
> On Mar 28, 2018, at 5:08 AM, Rob Vesse  wrote:
> 
> I think work towards Option 2 would be the most valuable to the
>> community
> 
> 
> 
> The SPARQL specification allows for the overloading of any
 operator/expression where the spec currently defines the evaluation to
>> be
 an error so extending operators is a natural and valid extension point
>> to
 provide
> 
> 
> 
> The Terms of Use for UCUM would probably need us to obtain a licensing
 assessment from Apache Legal as it is a non-standard OSS license even if
 the code that implements it is under BSD (which is fine from an Apache
 perspective).  Therefore having a well defined extension mechanism and
>> then
 having UCUM support live outside Apache Jena as an extension
 implementation maintained by yourself would be the easiest approach
> 
> 
> 
> Rob
> 
> 
> 
> From: Maxime Lefrançois 
> Reply-To: 
> Date: Wednesday, 28 March 2018 at 09:29
> To: 
> Subject: Re: Contribution proposal for Jena: support of a datatype for
 quantity values
> 
> 
> 
> Dear all,
> 
> 
> 
> Happy to see you are interested in the UCUM datatypes!
> 
> 
> 
> OK, so let's dive into the technical details.
> 
> 
> 
> # Compare Jena 3.6.0 and Jena 

[jira] [Commented] (JENA-1556) text:query multilingual enhancements

2018-06-15 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513848#comment-16513848
 ] 

ASF GitHub Bot commented on JENA-1556:
--

Github user xristy commented on a diff in the pull request:

https://github.com/apache/jena/pull/436#discussion_r195743299
  
--- Diff: 
jena-text/src/main/java/org/apache/jena/query/text/assembler/DefineAnalyzersAssembler.java
 ---
@@ -39,7 +44,46 @@
  text:analyzer [ . . . ]]
 )
 */
+    private static Logger  log  = LoggerFactory.getLogger(DefineAnalyzersAssembler.class) ;
--- End diff --

fixed. sloppy copy/paste. thanks




[jira] [Commented] (JENA-1556) text:query multilingual enhancements

2018-06-15 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513844#comment-16513844
 ] 

ASF GitHub Bot commented on JENA-1556:
--

Github user xristy commented on a diff in the pull request:

https://github.com/apache/jena/pull/436#discussion_r195742866
  
--- Diff: 
jena-text/src/main/java/org/apache/jena/query/text/analyzer/QueryMultilingualAnalyzer.java
 ---
@@ -0,0 +1,75 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.jena.query.text.analyzer ;
+
+import org.apache.lucene.analysis.Analyzer ;
+import org.apache.lucene.analysis.DelegatingAnalyzerWrapper;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/** 
+ * Lucene Analyzer implementation that delegates to a language-specific
+ * Analyzer based on a field name suffix: e.g. field="label_en" will use
+ * an EnglishAnalyzer.
+ */
+
+public class QueryMultilingualAnalyzer extends DelegatingAnalyzerWrapper {
+    private static Logger log = LoggerFactory.getLogger(QueryMultilingualAnalyzer.class);
+    private Analyzer defaultAnalyzer;
+    private String langTag;
+
+    public QueryMultilingualAnalyzer(Analyzer defaultAnalyzer) {
+        super(PER_FIELD_REUSE_STRATEGY);
+        this.defaultAnalyzer = defaultAnalyzer;
+        this.langTag = null;
+    }
+
+    public QueryMultilingualAnalyzer(Analyzer defaultAnalyzer, String tag) {
+        super(PER_FIELD_REUSE_STRATEGY);
+        this.defaultAnalyzer = defaultAnalyzer;
+        this.langTag = tag;
+    }
+
+    @Override
+    /**
+     * The analyzer corresponding to the langTag supplied at instantiation
+     * is used to retrieve the analyzer to use regardless of the tag on the
+     * fieldName. If no langTag is supplied then the tag on fieldName is
+     * used to retrieve the analyzer as with the MultilingualAnalyzer
+     *
+     * @param fieldName
+     * @return the analyzer to use in the search
+     */
+    protected Analyzer getWrappedAnalyzer(String fieldName) {
+        int idx = fieldName.lastIndexOf("_");
--- End diff --

the format is the result of copy/paste of ``MultilingualAnalyzer``. I'm 
_correcting_ the formatting of ``IndexMultilingualAnalyzer`` and 
``QueryMultilingualAnalyzer`` to match the majority of other classes in 
jena-text.

I'm not sure, when I see these sorts of differences, whether to correct the 
other files such as ``ConfigurableAnalyzer``, ``LowerCaseKeywordAnalyzer`` 
and ``MultilingualAnalyzer``; correcting creates larger noise diffs. I'm fine 
with normalizing formatting, but is there a preferred protocol?




Re: Contribution proposal for Jena: support of a datatype for quantity values

2018-06-15 Thread Maxime Lefrançois
Dear all,

Regarding our contribution proposal to enable extensions to override SPARQL
operators in Jena

We finally got the agreement from our institution to contribute to the
Apache foundation.
Question 1: what is the procedure to upload the form?

As for the how, I would like to discuss it with you first.

In a nutshell this is what I was thinking about:

Add use of the standard Java Service Provider API to load things
automatically found in the classpath:

- In TypeMapper --> a method that uses the Service Provider API to find
more Datatypes
- Datatype subclasses are not for just one URI, but could be for a set of
URIs
- ValueSpaceClassification should not be an enum any more --> maybe use a
class ValueSpace ...
- should add some interface like NodeValueComparator, with some methods
like:
  - canCompare(ValueSpace vs, ValueSpace vs)
  - sameAs(NodeValue nv, NodeValue nv)
  - compare(NodeValue nv, NodeValue nv)
  - add(NodeValue nv, NodeValue nv)
  - subtract(NodeValue nv, NodeValue nv)

- in NodeValue class, methods sameAs(NodeValue nv1, NodeValue nv2) and
compare(...) should use the Service Provider API to find
NodeValueComparators in the classpath
- in class NodeValueOps, methods divisionNV(NodeValue nv1, NodeValue nv2),
multiplicationNV(...), additionNV(...), subtractionNV(...) should use
the Service Provider API to find more NodeValueComparators in the classpath
(a rough sketch of this interface and lookup is given just below)
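
To make the NodeValueComparator part concrete, here is a rough sketch of the
interface and a Service Provider lookup; the interface follows the method list
above (with canCompare simplified to take the two NodeValues so the sketch stays
self-contained), and the lookup code is purely illustrative:

    import java.util.ServiceLoader;
    import org.apache.jena.sparql.expr.NodeValue;

    /** Proposed SPI (sketch): one implementation per family of value spaces. */
    interface NodeValueComparator {
        boolean canCompare(NodeValue nv1, NodeValue nv2);
        boolean sameAs(NodeValue nv1, NodeValue nv2);
        int compare(NodeValue nv1, NodeValue nv2);
        NodeValue add(NodeValue nv1, NodeValue nv2);
        NodeValue subtract(NodeValue nv1, NodeValue nv2);
    }

    class NodeValueComparators {
        /** Illustrative lookup: the first registered comparator that accepts the pair wins. */
        static NodeValueComparator forPair(NodeValue nv1, NodeValue nv2) {
            for (NodeValueComparator c : ServiceLoader.load(NodeValueComparator.class)) {
                if (c.canCompare(nv1, nv2)) {
                    return c;
                }
            }
            return null;   // caller falls back to the built-in XSD handling
        }
    }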


Any thoughts about this?

Best regards,
Maxime Lefrançois



On Sat, Apr 7, 2018 at 15:13, ajs6f wrote:

> We're (well, Andy is) working on 3.7.0 now. We've been trying to maintain
> a 6-month or so release cadence, so you've hit a really good time to begin
> this work. That having been said, I don't think anyone would say that we
> are especially stringent about it, so I wouldn't worry too much about the
> timing myself.
>
> ajs6f
>
> > On Apr 6, 2018, at 9:36 AM, Maxime Lefrançois 
> wrote:
> >
> > Well,
> >
> > I think I have a pretty clear idea how I would do this. We would end up
> > using a registry like the one for custom functions or datatypes.
> > That registry would contain an ordered list of SPARQL operator handlers,
> > pre-filled by one for handling XSD datatypes.
> >
> > I am currently requesting the right to fill the Apache individual
> > contributor license agreement.
> >
> > What would be the timeline if we wanted this shipped in the next release?
> >
> > Best,
> > Maxime
> >
> > On Tue, Apr 3, 2018 at 15:30, ajs6f wrote:
> >
> >> I agree. I can imagine plenty of use cases for such a powerful pair of
> >> extension points.
> >>
> >> Maxime, how can we help you attack that work? Is there a design that is
> >> already clear to you? Are there any blockers we can help remove?
> >>
> >> ajs6f
> >>
> >>> On Mar 28, 2018, at 5:08 AM, Rob Vesse  wrote:
> >>>
> >>> I think work towards Option 2 would be the most valuable to the
> community
> >>>
> >>>
> >>>
> >>> The SPARQL specification allows for the overloading of any
> >> operator/expression where the spec currently defines the evaluation to
> be
> >> an error so extending operators is a natural and valid extension point
> to
> >> provide
> >>>
> >>>
> >>>
> >>> The Terms of Use for UCUM would probably need us to obtain a licensing
> >> assessment from Apache Legal as it is a non-standard OSS license even if
> >> the code that implements it is under BSD (which is fine from an Apache
> >> perspective).  Therefore having a well defined extension mechanism and
> then
> >> having UCUM support live outside Apache Jena as an extension
> >> implementation maintained by yourself would be the easiest approach
> >>>
> >>>
> >>>
> >>> Rob
> >>>
> >>>
> >>>
> >>> From: Maxime Lefrançois 
> >>> Reply-To: 
> >>> Date: Wednesday, 28 March 2018 at 09:29
> >>> To: 
> >>> Subject: Re: Contribution proposal for Jena: support of a datatype for
> >> quantity values
> >>>
> >>>
> >>>
> >>> Dear all,
> >>>
> >>>
> >>>
> >>> Happy to see you are interested in the UCUM datatypes!
> >>>
> >>>
> >>>
> >>> OK, so let's dive into the technical details.
> >>>
> >>>
> >>>
> >>> # Compare Jena 3.6.0 and Jena 3.6.0-ucum
> >>>
> >>>
> >>>
> >>>
> >>
> https://github.com/apache/jena/compare/master...OpenSensingCity:jena-3.6.0-ucum
> >>>
> >>>
> >>>
> >>> # Modules, dependencies, licences
> >>>
> >>>
> >>>
> >>> Two modules forked so far: jena-core and jena-arq.
> >>>
> >>> One dependency added to jena-core (after a minor change I made today):
> >>>
> >>>
> >>>
> >>> systems.uom:systems-ucum-java8:0.7.2
> >>>
> >>> -> BSD license of systems-uom,
> >>>
> >>>and license of UCUM http://unitsofmeasure.org/trac/wiki/TermsOfUse
> >>>
> >>>
> >>>
> >>> --> this uses an implementation of JSR 363 indeed - the Units of Measurement
> API
> >>>
> >>> (see attached for the transitive dependencies, all from
> >> https://github.com/unitsofmeasurement )
> >>>
> >>>
> >>>
> >>> # External module ?
> >>>
> >>>
> >>>
> >>> I would have been happy to develop a separate extension of Jena for the
> 

[jira] [Commented] (JENA-1556) text:query multilingual enhancements

2018-06-15 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513829#comment-16513829
 ] 

ASF GitHub Bot commented on JENA-1556:
--

Github user xristy commented on a diff in the pull request:

https://github.com/apache/jena/pull/436#discussion_r195738825
  
--- Diff: 
jena-text/src/main/java/org/apache/jena/query/text/TextIndexLucene.java ---
@@ -549,56 +567,81 @@ private String frags2string(TextFragment[] frags, HighlightOpts opts) {
         }
         return results ;
     }
+
+    private Map<String, Analyzer> multilingualQueryAnalyzers = new HashMap<>();
--- End diff --

moved
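
For context, the point of having a map field here is to build one query-time
analyzer per language tag and reuse it across queries; a minimal sketch of that
idea (assuming the two-argument `QueryMultilingualAnalyzer` constructor shown
elsewhere in this PR and an externally supplied default analyzer) is:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.jena.query.text.analyzer.QueryMultilingualAnalyzer;
    import org.apache.lucene.analysis.Analyzer;

    /** Illustrative only: shows why a per-language cache is handy, not the PR's exact code. */
    class MultilingualQueryAnalyzerCache {
        private final Map<String, Analyzer> cache = new HashMap<>();
        private final Analyzer defaultAnalyzer;

        MultilingualQueryAnalyzerCache(Analyzer defaultAnalyzer) {
            this.defaultAnalyzer = defaultAnalyzer;
        }

        /** Build the per-language query analyzer once and reuse it on later queries. */
        Analyzer forLang(String lang) {
            return cache.computeIfAbsent(lang,
                    tag -> new QueryMultilingualAnalyzer(defaultAnalyzer, tag));
        }
    }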


> text:query multilingual enhancements
> 
>
> Key: JENA-1556
> URL: https://issues.apache.org/jira/browse/JENA-1556
> Project: Apache Jena
>  Issue Type: New Feature
>  Components: Text
>Affects Versions: Jena 3.7.0
>Reporter: Code Ferret
>Assignee: Code Ferret
>Priority: Major
>  Labels: pull-request-available
>
> This issue proposes two related enhancements of Jena Text. These enhancements 
> have been implemented and a PR can be issued. 
> There are two multilingual search situations that we want to support:
>  # We want to be able to search in one encoding and retrieve results that may 
> have been entered in other encodings. For example, searching via Simplified 
> Chinese (Hans) and retrieving results that may have been entered in 
> Traditional Chinese (Hant) or Pinyin. This will simplify applications by 
> permitting encoding independent retrieval without additional layers of 
> transcoding and so on. It's all done under the covers in Lucene.
>  # We want to search with queries entered in a lossy, e.g., phonetic, 
> encoding and retrieve results entered with accurate encoding. For example, 
> searching via Pinyin without diacritics and retrieving all possible Hans and 
> Hant triples.
> The first situation arises when entering triples that include languages with 
> multiple encodings that for various reasons are not normalized to a single 
> encoding. In this situation we want to be able to retrieve appropriate result 
> sets without regard for the encodings used at the time that the triples were 
> inserted into the dataset.
> There are several such languages of interest in our application: Chinese, 
> Tibetan, Sanskrit, Japanese and Korean. There are various Romanizations and 
> ideographic variants.
> Encodings may not be normalized when inserting triples for a variety of reasons. 
> A principal one is that the {{rdf:langString}} object often must be entered 
> in the same encoding that it occurs in some physical text that is being 
> catalogued. Another is that metadata may be imported from sources that use 
> different encoding conventions and we want to preserve that form.
> The second situation arises as we want to provide simple support for phonetic 
> or other forms of lossy search at the time that triples are indexed directly 
> in the Lucene system.
> To handle the first situation we introduce a {{text}} assembler predicate, 
> {{text:searchFor}}, that specifies a list of language tags naming the language 
> variants that should be searched whenever a query string of 
> a given encoding (language tag) is used. For example, the following 
> {{text:TextIndexLucene/text:defineAnalyzers}} fragment:
> {code:java}
> [ text:addLang "bo" ; 
>   text:searchFor ( "bo" "bo-x-ewts" "bo-alalc97" ) ;
>   text:analyzer [ 
> a text:GenericAnalyzer ;
> text:class "io.bdrc.lucene.bo.TibetanAnalyzer" ;
> text:params (
> [ text:paramName "segmentInWords" ;
>   text:paramValue false ]
> [ text:paramName "lemmatize" ;
>   text:paramValue true ]
> [ text:paramName "filterChars" ;
>   text:paramValue false ]
> [ text:paramName "inputMode" ;
>   text:paramValue "unicode" ]
> [ text:paramName "stopFilename" ;
>   text:paramValue "" ]
> )
> ] ; 
>   ]
> {code}
> indicates that when using a search string such as "རྡོ་རྗེ་སྙིང་"@bo the 
> Lucene index should also be searched for matches tagged as {{bo-x-ewts}} and 
> {{bo-alalc97}}.
> This is made possible by a Tibetan {{Analyzer}} that tokenizes strings in all 
> three encodings into Tibetan Unicode. This is feasible since the 
> {{bo-x-ewts}} and {{bo-alalc97}} encodings are one-to-one with Unicode 
> Tibetan. Since all fields with these language tags will have a common set of 
> indexed terms, i.e., Tibetan Unicode, it suffices to arrange for the query 
> analyzer to have access to the language tag for the query string along with 
> the 



[jira] [Commented] (JENA-1556) text:query multilingual enhancements

2018-06-15 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513822#comment-16513822
 ] 

ASF GitHub Bot commented on JENA-1556:
--

Github user xristy commented on a diff in the pull request:

https://github.com/apache/jena/pull/436#discussion_r195736987
  
--- Diff: 
jena-text/src/main/java/org/apache/jena/query/text/TextIndexLucene.java ---
@@ -316,6 +326,13 @@ protected Document doc(Entity entity) {
             if (this.isMultilingual) {
                 // add a field that uses a language-specific analyzer via MultilingualAnalyzer
                 doc.add(new Field(e.getKey() + "_" + lang, (String) e.getValue(), ftText));
+                // add fields for any defined auxiliary indexes
+                List<String> auxIndexes = Util.getAuxIndexes(lang);
+                if (auxIndexes != null) {
--- End diff --

``auxIndexes.get(tag)`` will return null if ``tag`` has no auxiliary 
indexes defined - which is not an unusual case. 

As I look at ``Util.getAuxIndexes`` I unnecessarily return an empty list 
that then does no work in the guarded ``for`` loop. I'm changing to ``null`` in 
the case of a ``null`` or empty ``tag``.
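
The trade-off being discussed, in isolation (a stand-in sketch, not the actual
jena-text Util code):

    import java.util.Collections;
    import java.util.List;

    class AuxIndexLookup {
        /** Convention chosen above: null when the tag is null/empty, so callers must guard. */
        static List<String> getAuxIndexesNullable(String tag) {
            if (tag == null || tag.isEmpty()) {
                return null;
            }
            return lookup(tag);
        }

        /** Alternative: an empty list lets callers iterate without a null check. */
        static List<String> getAuxIndexesNonNull(String tag) {
            if (tag == null || tag.isEmpty()) {
                return Collections.emptyList();
            }
            return lookup(tag);
        }

        private static List<String> lookup(String tag) {
            // stand-in for the real per-language table kept in jena-text's Util class
            return Collections.emptyList();
        }
    }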




[jira] [Commented] (JENA-1556) text:query multilingual enhancements

2018-06-15 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513774#comment-16513774
 ] 

ASF GitHub Bot commented on JENA-1556:
--

Github user kinow commented on a diff in the pull request:

https://github.com/apache/jena/pull/436#discussion_r195658758
  
--- Diff: 
jena-text/src/main/java/org/apache/jena/query/text/TextIndexLucene.java ---
@@ -316,6 +326,13 @@ protected Document doc(Entity entity) {
             if (this.isMultilingual) {
                 // add a field that uses a language-specific analyzer via MultilingualAnalyzer
                 doc.add(new Field(e.getKey() + "_" + lang, (String) e.getValue(), ftText));
+                // add fields for any defined auxiliary indexes
+                List<String> auxIndexes = Util.getAuxIndexes(lang);
+                if (auxIndexes != null) {
--- End diff --

Never null, I believe. We return an empty list when `lang` is empty, and use 
a `Hashtable` to keep the data. But happy to leave it if you prefer to double-check 
it anyway (I wonder if we should consider `@Nullable` and `@NotNull` in method 
signatures some day).



[jira] [Commented] (JENA-1556) text:query multilingual enhancements

2018-06-15 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513775#comment-16513775
 ] 

ASF GitHub Bot commented on JENA-1556:
--

Github user kinow commented on a diff in the pull request:

https://github.com/apache/jena/pull/436#discussion_r195659038
  
--- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexLucene.java ---
@@ -549,56 +567,81 @@ private String frags2string(TextFragment[] frags, HighlightOpts opts) {
 }
 return results ;
 }
+
+private Map multilingualQueryAnalyzers = new HashMap<>();
--- End diff --

Might be better declared at the top with other vars perhaps.
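
A small self-contained sketch of the pattern under discussion, for illustration only: the cache declared alongside the other instance fields and filled lazily per language tag. The `String`-keyed map type, the helper name and the `StandardAnalyzer` default are assumptions here, since the archived diff strips the generic parameters and the surrounding class:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.jena.query.text.analyzer.QueryMultilingualAnalyzer;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Sketch only: cache declared with the other fields, populated on first use.
public class AnalyzerCacheSketch {
    private final Analyzer defaultAnalyzer = new StandardAnalyzer();
    private final Map<String, Analyzer> multilingualQueryAnalyzers = new HashMap<>();

    Analyzer queryAnalyzerFor(String lang) {
        // computeIfAbsent keeps the lazy-initialisation behaviour in one place
        return multilingualQueryAnalyzers.computeIfAbsent(lang,
                tag -> new QueryMultilingualAnalyzer(defaultAnalyzer, tag));
    }
}
```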



[jira] [Commented] (JENA-1556) text:query multilingual enhancements

2018-06-15 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513777#comment-16513777
 ] 

ASF GitHub Bot commented on JENA-1556:
--

Github user kinow commented on a diff in the pull request:

https://github.com/apache/jena/pull/436#discussion_r195660039
  
--- Diff: jena-text/src/main/java/org/apache/jena/query/text/assembler/DefineAnalyzersAssembler.java ---
@@ -39,7 +44,46 @@
  text:analyzer [ . . . ]]
 )
 */
+private static Logger  log  = LoggerFactory.getLogger(DefineAnalyzersAssembler.class) ;
--- End diff --

Weird spacing here?


> text:query multilingual enhancements
> 
>
> Key: JENA-1556
> URL: https://issues.apache.org/jira/browse/JENA-1556
> Project: Apache Jena
>  Issue Type: New Feature
>  Components: Text
>Affects Versions: Jena 3.7.0
>Reporter: Code Ferret
>Assignee: Code Ferret
>Priority: Major
>  Labels: pull-request-available
>
> This issue proposes two related enhancements of Jena Text. These enhancements 
> have been implemented and a PR can be issued. 
> There are two multilingual search situations that we want to support:
>  # We want to be able to search in one encoding and retrieve results that may 
> have been entered in other encodings. For example, searching via Simplified 
> Chinese (Hans) and retrieving results that may have been entered in 
> Traditional Chinese (Hant) or Pinyin. This will simplify applications by 
> permitting encoding independent retrieval without additional layers of 
> transcoding and so on. It's all done under the covers in Lucene.
>  # We want to search with queries entered in a lossy, e.g., phonetic, 
> encoding and retrieve results entered with accurate encoding. For example, 
> searching via Pinyin without diacritics and retrieving all possible Hans and 
> Hant triples.
> The first situation arises when entering triples that include languages with 
> multiple encodings that for various reasons are not normalized to a single 
> encoding. In this situation we want to be able to retrieve appropriate result 
> sets without regard for the encodings used at the time that the triples were 
> inserted into the dataset.
> There are several such languages of interest in our application: Chinese, 
> Tibetan, Sanskrit, Japanese and Korean. There are various Romanizations and 
> ideographic variants.
> Encodings may not be normalized when inserting triples for a variety of reasons. 
> A principal one is that the {{rdf:langString}} object often must be entered 
> in the same encoding that it occurs in some physical text that is being 
> catalogued. Another is that metadata may be imported from sources that use 
> different encoding conventions and we want to preserve that form.
> The second situation arises as we want to provide simple support for phonetic 
> or other forms of lossy search at the time that triples are indexed directly 
> in the Lucene system.
> To handle the first situation we introduce a {{text}} assembler predicate, 
> {{text:searchFor}}, that specifies a list of language tags naming the language 
> variants that should be searched whenever a query string of a given encoding 
> (language tag) is used. For example, the following 
> {{text:TextIndexLucene/text:defineAnalyzers}} fragment:
> {code:java}
> [ text:addLang "bo" ; 
>   text:searchFor ( "bo" "bo-x-ewts" "bo-alalc97" ) ;
>   text:analyzer [ 
> a text:GenericAnalyzer ;
> text:class "io.bdrc.lucene.bo.TibetanAnalyzer" ;
> text:params (
> [ text:paramName "segmentInWords" ;
>   text:paramValue false ]
> [ text:paramName "lemmatize" ;
>   text:paramValue true ]
> [ text:paramName "filterChars" ;
>   text:paramValue false ]
> [ text:paramName "inputMode" ;
>   text:paramValue "unicode" ]
> [ text:paramName "stopFilename" ;
>   text:paramValue "" ]
> )
> ] ; 
>   ]
> {code}
> indicates that when using a search string such as "རྡོ་རྗེ་སྙིང་"@bo the 
> Lucene index should also be searched for matches tagged as {{bo-x-ewts}} and 
> {{bo-alalc97}}.
> This is made possible by a Tibetan {{Analyzer}} that tokenizes strings in all 
> three encodings into Tibetan Unicode. This is feasible since the 
> {{bo-x-ewts}} and {{bo-alalc97}} encodings are one-to-one with Unicode 
> Tibetan. Since all fields with these language tags will have a common set of 
> indexed terms, i.e., Tibetan Unicode, it suffices to arrange for the query 
> analyzer to have access to the language tag for the query string along with 
> the 
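
For illustration, a hedged sketch of how a search might be issued against a dataset assembled with a configuration like the one above. The assembler file name, the dataset URI and the indexed property ({{skos:prefLabel}}) are placeholders, and the language-tagged literal passed to {{text:query}} follows the search-string form used in the description:

{code:java}
import org.apache.jena.query.Dataset;
import org.apache.jena.query.DatasetFactory;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSetFormatter;

// Sketch only: "text-config.ttl" and the dataset URI stand in for an assembler
// file containing the text:defineAnalyzers fragment above.
public class SearchForSketch {
    public static void main(String[] args) {
        Dataset ds = DatasetFactory.assemble(
                "text-config.ttl", "http://localhost/jena_example/#text_dataset");
        String q = String.join("\n",
                "PREFIX text: <http://jena.apache.org/text#>",
                "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>",
                "SELECT ?s ?label WHERE {",
                "  ?s text:query (skos:prefLabel \"རྡོ་རྗེ་སྙིང་\"@bo) ;",
                "     skos:prefLabel ?label .",
                "}");
        try (QueryExecution qe = QueryExecutionFactory.create(q, ds)) {
            // with text:searchFor ("bo" "bo-x-ewts" "bo-alalc97") the result set
            // should include matches entered as bo-x-ewts or bo-alalc97 literals
            ResultSetFormatter.out(qe.execSelect());
        }
    }
}
{code}

With {{text:searchFor ( "bo" "bo-x-ewts" "bo-alalc97" )}} in effect, the single {{@bo}} query above should also match literals entered with the {{bo-x-ewts}} or {{bo-alalc97}} tags, without any client-side transcoding.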

[jira] [Commented] (JENA-1556) text:query multilingual enhancements

2018-06-15 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513773#comment-16513773
 ] 

ASF GitHub Bot commented on JENA-1556:
--

Github user kinow commented on a diff in the pull request:

https://github.com/apache/jena/pull/436#discussion_r195659628
  
--- Diff: jena-text/src/main/java/org/apache/jena/query/text/analyzer/QueryMultilingualAnalyzer.java ---
@@ -0,0 +1,75 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.jena.query.text.analyzer ;
+
+import org.apache.lucene.analysis.Analyzer ;
+import org.apache.lucene.analysis.DelegatingAnalyzerWrapper;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/** 
+ * Lucene Analyzer implementation that delegates to a language-specific
+ * Analyzer based on a field name suffix: e.g. field="label_en" will use
+ * an EnglishAnalyzer.
+ */
+
+public class QueryMultilingualAnalyzer extends DelegatingAnalyzerWrapper {
+private static Logger log = LoggerFactory.getLogger(QueryMultilingualAnalyzer.class);
+private Analyzer defaultAnalyzer;
+private String langTag;
+
+public QueryMultilingualAnalyzer(Analyzer defaultAnalyzer) {
+super(PER_FIELD_REUSE_STRATEGY);
+this.defaultAnalyzer = defaultAnalyzer;
+this.langTag = null;
+}
+
+public QueryMultilingualAnalyzer(Analyzer defaultAnalyzer, String tag) {
+super(PER_FIELD_REUSE_STRATEGY);
+this.defaultAnalyzer = defaultAnalyzer;
+this.langTag = tag;
+}
+
+@Override
+/**
+ * The analyzer corresponding to the langTag supplied at instantiation
+ * is used to retrieve the analyzer to use regardless of the tag on the
+ * fieldName. If no langTag is supplied then the tag on fieldName is
+ * used to retrieve the analyzer as with the MultilingualAnalyzer
+ * 
+ * @param fieldName
+ * @return the analyzer to use in the search
+ */
+protected Analyzer getWrappedAnalyzer(String fieldName) {
+int idx = fieldName.lastIndexOf("_");
--- End diff --

Weird formatting in this file, but no blocker.



[jira] [Commented] (JENA-1556) text:query multilingual enhancements

2018-06-15 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/JENA-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513776#comment-16513776
 ] 

ASF GitHub Bot commented on JENA-1556:
--

Github user kinow commented on a diff in the pull request:

https://github.com/apache/jena/pull/436#discussion_r195659829
  
--- Diff: jena-text/src/main/java/org/apache/jena/query/text/analyzer/QueryMultilingualAnalyzer.java ---
@@ -0,0 +1,75 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.jena.query.text.analyzer ;
+
+import org.apache.lucene.analysis.Analyzer ;
+import org.apache.lucene.analysis.DelegatingAnalyzerWrapper;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/** 
+ * Lucene Analyzer implementation that delegates to a language-specific
+ * Analyzer based on a field name suffix: e.g. field="label_en" will use
+ * an EnglishAnalyzer.
+ */
+
+public class QueryMultilingualAnalyzer extends DelegatingAnalyzerWrapper {
+private static Logger log = LoggerFactory.getLogger(QueryMultilingualAnalyzer.class);
+private Analyzer defaultAnalyzer;
+private String langTag;
+
+public QueryMultilingualAnalyzer(Analyzer defaultAnalyzer) {
+super(PER_FIELD_REUSE_STRATEGY);
+this.defaultAnalyzer = defaultAnalyzer;
+this.langTag = null;
+}
+
+public QueryMultilingualAnalyzer(Analyzer defaultAnalyzer, String tag) {
+super(PER_FIELD_REUSE_STRATEGY);
+this.defaultAnalyzer = defaultAnalyzer;
+this.langTag = tag;
+}
+
+@Override
+/**
+ * The analyzer corresponding to the langTag supplied at instantiation
+ * is used to retrieve the analyzer to use regardless of the tag on the
+ * fieldName. If no langTag is supplied then the tag on fieldName is
+ * used to retrieve the analyzer as with the MultilingualAnalyzer
+ * 
+ * @param fieldName
+ * @return the analyzer to use in the search
+ */
+protected Analyzer getWrappedAnalyzer(String fieldName) {
+int idx = fieldName.lastIndexOf("_");
+if (idx == -1) { // not language-specific, e.g. "label"
+return defaultAnalyzer;
+}
+String lang = langTag != null ? langTag : fieldName.substring(idx+1);
+Analyzer analyzer = Util.getLocalizedAnalyzer(lang);
+analyzer = analyzer != null ? analyzer : defaultAnalyzer;
--- End diff --

Maybe simplify statements like these with

```java
analyzer = ObjectUtils.defaultIfNull(analyzer, defaultAnalyzer);
```
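
(`defaultIfNull` would come from Apache Commons Lang's `ObjectUtils`; a tiny self-contained sketch of the suggested fallback, with a `KeywordAnalyzer` standing in for the index's real default analyzer:)

```java
import org.apache.commons.lang3.ObjectUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;

// Sketch only: fall back to the default when no localized analyzer was found.
public class DefaultIfNullSketch {
    public static void main(String[] args) {
        Analyzer defaultAnalyzer = new KeywordAnalyzer();
        Analyzer analyzer = null; // e.g. no localized analyzer exists for the tag
        analyzer = ObjectUtils.defaultIfNull(analyzer, defaultAnalyzer);
        System.out.println(analyzer.getClass().getSimpleName()); // KeywordAnalyzer
    }
}
```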


[GitHub] jena pull request #436: JENA-1556 implementation

2018-06-15 Thread kinow
Github user kinow commented on a diff in the pull request:

https://github.com/apache/jena/pull/436#discussion_r195658758
  
--- Diff: jena-text/src/main/java/org/apache/jena/query/text/TextIndexLucene.java ---
@@ -316,6 +326,13 @@ protected Document doc(Entity entity) {
 if (this.isMultilingual) {
 // add a field that uses a language-specific analyzer via MultilingualAnalyzer
 doc.add(new Field(e.getKey() + "_" + lang, (String) e.getValue(), ftText));
+// add fields for any defined auxiliary indexes
+List auxIndexes = Util.getAuxIndexes(lang);
+if (auxIndexes != null) {
--- End diff --

Never null I believe. We return an empty list when `lang` is empty, and use 
a `Hashtable` to keep the data. But happy to leave it if you prefer to double-check 
it anyway (wonder if we should consider `@Nullable` and `@NotNull` in method 
signatures some day).


---
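
To illustrate the point, a small self-contained sketch assuming `Util.getAuxIndexes` does return an empty list rather than null; the stand-in lookup, field names and values below are placeholders rather than the PR's actual code:

```java
import java.util.Collections;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;

// Sketch only: if the aux-index lookup never returns null, the null check can
// be dropped and the loop simply runs zero times when there is nothing to add.
public class AuxIndexFieldsSketch {
    static List<String> getAuxIndexes(String lang) {
        return Collections.emptyList(); // never null, per the review comment
    }

    public static void main(String[] args) {
        Document doc = new Document();
        FieldType ftText = TextField.TYPE_STORED;
        String key = "label", value = "example label", lang = "bo";

        doc.add(new Field(key + "_" + lang, value, ftText));
        for (String auxTag : getAuxIndexes(lang)) { // no null check needed
            doc.add(new Field(key + "_" + auxTag, value, ftText));
        }
        System.out.println(doc.getFields().size());
    }
}
```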

