Re: JenaText: support for explicit field names in text queries

Brian McBride Mon, 02 Sep 2019 06:24:03 -0700

Hi Chris,

Thanks for looking into this so promptly.


On 01/09/2019 21:12, Chris Tomlinson wrote:

Hi again Brian,

I looked a bit more and it’s not clear how to “fix” the issue after all. The 
change I suggested to TextIndexLucene uncovers a basic issue.


Such is life.

When using a query such as:

     (?s ?score ?lit) text:query ( "some query string" 3000000 ) .

The code currently inserts the primaryField, e.g., rdfs:label or what have you 
and then TextQueryPF binds the hit value from Lucene to the ?lit by looking up 
the matching field value in the result doc returned by Lucene; however, the 
change I suggested no longer defaults to the primaryField and so there’s an 
error during the result binding handling in TextQueryPF.

Yes - I tried running the unit tests and they barf. Most of them stop barfing if the bind to ?var is disabled when it would try to bind null.


The basic problem is that there’s an ambiguity with:

     … text:query ( "some query string" 3000000 ) .

The current code doesn’t know whether there are fields mentioned in the query 
string or not.
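
As a rough illustration of what telling the two cases apart might involve, here is a minimal sketch in plain Java. It is a heuristic only and not how Jena or Lucene's query parser decide this: the class name QueryFieldCheck, the method mentionsField, and both regular expressions are hypothetical, and the sketch only looks for an unescaped name: prefix outside double-quoted phrases.

```java
import java.util.regex.Pattern;

public class QueryFieldCheck {

    // Double-quoted phrases, allowing backslash-escaped characters inside.
    private static final Pattern QUOTED =
            Pattern.compile("\"(?:\\\\.|[^\"\\\\])*\"");

    // A name directly followed by ':' that is not preceded by a backslash
    // or by another name character. Assumption: field names start with a
    // letter or underscore; names starting with a digit are not detected.
    private static final Pattern FIELD_PREFIX =
            Pattern.compile("(?<![\\\\\\w.\\-])[A-Za-z_][\\w.\\-]*:");

    // Heuristic: true if the query string appears to name a field explicitly.
    static boolean mentionsField(String queryString) {
        // Drop quoted phrases first so a colon inside a phrase is ignored.
        String unquoted = QUOTED.matcher(queryString).replaceAll(" ");
        return FIELD_PREFIX.matcher(unquoted).find();
    }

    public static void main(String[] args) {
        System.out.println(mentionsField("some query string"));
        System.out.println(mentionsField("label:foo AND comment:bar"));
    }
}
```

A check like this could only ever be an approximation; a robust answer would have to come from the parsed query rather than from the raw string.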

If there are fields in the query string then the use of the

     (?s ?score ?lit) text:query …

form must be disallowed since there’s no way to know what field value to 
retrieve from the Lucene query result documents without further analysis of the 
query string.

Just so. If you try it in 3.9.0 (and presumably 3.12.0) a rather nasty error occurs. So this is not new. We don't get the error because we don't do that sort of query.

I have not understood the details of how the code finds the field value. But it might be arranged that if the document had just one field, that was the one returned and passed to the PF, and if there is more than one, then null is passed to the PF. Then the PF could throw a meaningful exception if the value it gets is null when it tries to bind it. That effectively disallows this query form on documents with more than one field and does not require any analysis of the query. I mention this tentatively as a possible approach - I don't feel I have a strong grip on the code. It is definitely ugly but might be a possible holding position whilst something better is worked out.
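
A minimal sketch of that holding position, in plain Java with no Jena or Lucene types. Everything named here is hypothetical: the class SingleFieldBinding, the methods soleFieldValue and bindLiteral, and the assumption that the map holds only the indexed literal fields of one result document (not the entity URI or graph fields). It is not the actual TextIndexLucene or TextQueryPF code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SingleFieldBinding {

    // Returns the value of the only literal field, or null when the
    // document has no literal field or more than one.
    static String soleFieldValue(Map<String, String> literalFields) {
        if (literalFields.size() == 1) {
            return literalFields.values().iterator().next();
        }
        return null;
    }

    // Stand-in for the binding step in the property function: refuse to
    // bind, with a meaningful message, when no single value is available.
    static String bindLiteral(Map<String, String> literalFields) {
        String value = soleFieldValue(literalFields);
        if (value == null) {
            throw new IllegalStateException(
                    "Cannot bind the literal variable: expected exactly one "
                    + "indexed field but found " + literalFields.size());
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, String> single = new LinkedHashMap<>();
        single.put("label", "some label");
        System.out.println(bindLiteral(single));

        Map<String, String> multi = new LinkedHashMap<>(single);
        multi.put("comment", "some comment");
        try {
            bindLiteral(multi);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of the sketch is only that the decision needs the field count of the result document, not any parsing of the query string.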

Apparently in your application there will generally be two or more matching 
fields in each result document

yes

  and it would be further complicated to figure out what matching field value 
to use - or invent another syntax for grabbing more than a single ?lit per 
result doc.


If there are no fields mentioned in the query string then the primaryField 
should be used explicitly and then ?lit can be bound to an appropriate match 
value as currently.

Perhaps you can raise a Jena issue and we can discuss and see what can be done.


Done - https://jira.apache.org/jira/browse/JENA-1749

I'm conscious that there is a release now in progress and you are not going to be available to look at this. Do you have any thoughts on whether 3.13.0 should go ahead without at least an interim resolution to this issue?


Regards
Brian


Regards,
Chris

On Sep 1, 2019, at 2:25 PM, Chris Tomlinson <chris.j.tomlin...@gmail.com> wrote:

Hi Brian,

On Sep 1, 2019, at 7:17 AM, Brian McBride <brian.mcbr...@epimorphics.com> wrote:

It used to be the case that JenaText supported querying of a Lucene text index 
where the index was created independently of Jena and then made available to 
JenaText via the dataset configuration.  Is this still the case?

That should still be the case, with the proviso that currently the field names 
must be handled via RDF properties outside the query string.

As you noted, it has been documented since 3.6.0 
<https://jena.apache.org/documentation/query/text-query.html> that:

No explicit use of Fields within the query string is supported.

This is based on the assumption that the indexes contain only a single property 
field in the documents as they are indexed and hence only a single field 
corresponding to an RDF property in a query. Evidently a poor assumption not 
caught until now.

Up until Jena 3.9.0 definitely, and I suspect 3.12.0 - I have not confirmed 
this yet, it was possible to express text queries with field names and they 
worked.

You’re correct, the change was introduced 
<https://github.com/apache/jena/blob/519c129ab2dfcb5eb43f1a337c618a8e69f88acd/jena-text/src/main/java/org/apache/jena/query/text/TextIndexLucene.java#L744>
 in the 3.13.0 code that breaks the previous behavior. I’m not able to explore fixing 
this for the next three weeks but may take a look at “fixing” this then. The basic 
change would be to replace the referenced line by:

     qstring = qs;

and that should be it. The results handling ( in simpleResults 
<https://github.com/apache/jena/blob/519c129ab2dfcb5eb43f1a337c618a8e69f88acd/jena-text/src/main/java/org/apache/jena/query/text/TextIndexLucene.java#L562>
 and highlightResults 
<https://github.com/apache/jena/blob/519c129ab2dfcb5eb43f1a337c618a8e69f88acd/jena-text/src/main/java/org/apache/jena/query/text/TextIndexLucene.java#L668>)
  should need no changes since Lucene:

     doc.get(null)

just returns null  which is already handled. Evidently your application doesn’t 
use the

      (?s ?score ?lit) text:query …

form: since there’s no information about which fields have been used in the 
query string, no bindings for ?lit can be made.

We needed an index where multiple properties of the same resource were indexed 
as a single document.  I would be happy to discuss this further - why the 
solution indicated in the JenaText documentation didn't work for us and whether 
there is a way to construct a general purpose JenaText solution that would.


More explanation would be interesting.

Sorry for the inconvenience,
Chris

--
------------------------------------------------------------------------

Brian McBride
brian.mcbr...@epimorphics.com

Epimorphics Ltd www.epimorphics.com
Court Lodge, 105 High Street, Portishead, Bristol BS20 6PT
Tel: 01275 399069

Epimorphics Ltd. is a limited company registered in England (number 7016688)

Registered address: Court Lodge, 105 High Street, Portishead, Bristol BS20 6PT, UK
