Hi Chris,
Thanks for looking into this so promptly.
On 01/09/2019 21:12, Chris Tomlinson wrote:
Hi again Brian,
I looked a bit more and it’s not clear how to “fix” the issue after all. The
change I suggested to TextIndexLucene uncovers a basic issue.
Such is life.
When using a query such as:
(?s ?score ?lit) text:query ( "some query string" 3000000 ) .
The code currently inserts the primaryField, e.g., rdfs:label or what have you
and then TextQueryPF binds the hit value from Lucene to the ?lit by looking up
the matching field value in the result doc returned by Lucene; however, the
change I suggested no longer defaults to the primaryField and so there’s an
error during the result binding handling in TextQueryPF.
Yes - I tried running the unit tests and they barf. Most of them stop
barfing if the bind to ?var is skipped whenever it would bind null.
The basic problem is that there’s an ambiguity with:
… text:query ( "some query string" 3000000 ) .
The current code doesn’t know whether there are fields mentioned in the query
string or not.
If there are fields in the query string then the use of the
(?s ?score ?lit) text:query …
form must be disallowed since there’s no way to know what field value to
retrieve from the Lucene query result documents without further analysis of the
query string.
Just so. If you try it in 3.9.0 (and presumably 3.12.0) a rather nasty
error occurs. So this is not new. We don't get the error because we
don't do that sort of query.
I have not understood the details of how the code finds the field
value. But if it could be arranged that if the document had just one
field - that was the one returned and passed to the PF and if there is
more than one, then null is passed to the PF. Then the PF could throw a
meaningful exception if the value it gets is null when it tries to bind
it. That effectively disallows this query form on documents with more
than one field and does not require any analysis of the query. I
mention this tentatively as a possible approach - I don't feel I have a
strong grip on the code. It is definitely ugly but might be a possible
holding position whilst something better is worked out.
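
A minimal sketch of that holding position, for illustration only. It does not
use the real jena-text or Lucene classes: the class and method names are
hypothetical, and a plain Map stands in for the property fields of one Lucene
result document (assumed to exclude the entity URI and graph fields).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: these names are illustrative, not taken from jena-text.
class SingleFieldBinding {

    // docFields stands in for the property fields of one Lucene result
    // document (field name -> stored value), excluding URI/graph fields.
    static String literalFor(Map<String, String> docFields) {
        if (docFields.size() == 1) {
            // Exactly one field, so its value is unambiguous.
            return docFields.values().iterator().next();
        }
        // Zero or several fields: no single value to hand to the PF.
        return null;
    }

    // Stand-in for the binding step in the property function.
    static String bindLiteral(Map<String, String> docFields) {
        String lit = literalFor(docFields);
        if (lit == null) {
            throw new IllegalStateException(
                "(?s ?score ?lit) form needs exactly one field per document, found "
                + docFields.size());
        }
        return lit;
    }

    public static void main(String[] args) {
        Map<String, String> one = new LinkedHashMap<>();
        one.put("label", "Bristol");
        System.out.println(bindLiteral(one));

        Map<String, String> two = new LinkedHashMap<>(one);
        two.put("notes", "Port city");
        try {
            bindLiteral(two);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The point of the sketch is only that the check needs no analysis of the query
string: the decision is made per result document, at bind time.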
Apparently in your application there will generally be two or more matching
fields in each result document
yes
and it would be further complicated to figure out what matching field value
to use - or invent another syntax for grabbing more than a single ?lit per
result doc.
If there are no fields mentioned in the query string then the primaryField
should be used explicitly and then ?lit can be bound to an appropriate match
value as currently.
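
The rule just described could be sketched roughly as follows. This is a naive
heuristic under stated assumptions, not what TextIndexLucene does: it looks
for an unescaped "name:" outside double-quoted phrases with a regular
expression instead of parsing the query with Lucene, and the class name, the
method names and the field:( ... ) wrapping are all hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: a regex heuristic, not the jena-text implementation.
class QueryFieldDefaulting {

    // A double-quoted phrase, allowing backslash escapes inside it.
    private static final Pattern PHRASE =
        Pattern.compile("\"(?:\\\\.|[^\"\\\\])*\"");

    // A field prefix such as label: that is not preceded by a word
    // character or a backslash. An escaped colon (foo\:bar) does not match.
    private static final Pattern FIELD =
        Pattern.compile("(?<![\\w\\\\])[A-Za-z_][\\w.\\-]*:");

    // True if the query string appears to name at least one field.
    static boolean mentionsField(String qs) {
        // Blank out phrase contents so colons inside quotes are ignored.
        String stripped = PHRASE.matcher(qs).replaceAll("\"\"");
        return FIELD.matcher(stripped).find();
    }

    // Apply the primary field only when the query names no field itself.
    static String withDefaultField(String qs, String primaryField) {
        return mentionsField(qs) ? qs : primaryField + ":(" + qs + ")";
    }

    public static void main(String[] args) {
        System.out.println(withDefaultField("some query string", "label"));
        System.out.println(withDefaultField("label:bristol AND notes:harbour", "label"));
    }
}
```

When mentionsField returns true, the (?s ?score ?lit) form would have to be
rejected or left unbound, since the sketch says nothing about which field
value to retrieve.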
Perhaps you can raise a Jena issue and we can discuss and see what can be done.
Done - https://jira.apache.org/jira/browse/JENA-1749
I'm conscious that there is a release now in progress and you are not
going to be available to look at this. Do you have any thoughts on whether
3.13.0 should go ahead without at least an interim resolution to this issue?
Regards
Brian
Regards,
Chris
On Sep 1, 2019, at 2:25 PM, Chris Tomlinson <chris.j.tomlin...@gmail.com> wrote:
Hi Brian,
On Sep 1, 2019, at 7:17 AM, Brian McBride <brian.mcbr...@epimorphics.com> wrote:
It used to be the case that JenaText supported querying of a Lucene text index
where the index was created independently of Jena and then made available to
JenaText via the dataset configuration. Is this still the case?
That should still be the case, with the proviso that currently the field names
must be handled via RDF properties outside the query string.
As you noted, it has been documented since 3.6.0
<https://jena.apache.org/documentation/query/text-query.html> that:
No explicit use of Fields within the query string is supported.
This is based on the assumption that the indexes contain only a single property
field in the documents as they are indexed and hence only a single field
corresponding to an RDF property in a query. Evidently a poor assumption not
caught until now.
Up until Jena 3.9.0 definitely, and I suspect 3.12.0 - I have not confirmed
this yet, it was possible to express text queries with field names and they
worked.
You’re correct, the change was introduced
<https://github.com/apache/jena/blob/519c129ab2dfcb5eb43f1a337c618a8e69f88acd/jena-text/src/main/java/org/apache/jena/query/text/TextIndexLucene.java#L744>
in the 3.13.0 code that breaks the previous behavior. I’m not able to explore fixing
this for the next three weeks but may take a look at “fixing” this then. The basic
change would be to replace the referenced line by:
qstring = qs;
and that should be it. The results handling ( in simpleResults
<https://github.com/apache/jena/blob/519c129ab2dfcb5eb43f1a337c618a8e69f88acd/jena-text/src/main/java/org/apache/jena/query/text/TextIndexLucene.java#L562>
and highlightResults
<https://github.com/apache/jena/blob/519c129ab2dfcb5eb43f1a337c618a8e69f88acd/jena-text/src/main/java/org/apache/jena/query/text/TextIndexLucene.java#L668>)
should need no changes since Lucene:
doc.get(null)
just returns null which is already handled. Evidently your application doesn’t
use the
(?s ?score ?lit) text:query …
form. Since there’s no information about which fields have been used in the
queryString, no bindings for ?lit can be made.
We needed an index where multiple properties of the same resource were indexed
as a single document. I would be happy to discuss this further - why the
solution indicated in the JenaText documentation didn't work for us and whether
there is a way to construct a general purpose JenaText solution that would.
More explanation would be interesting.
Sorry for the inconvenience,
Chris
--
------------------------------------------------------------------------
Brian McBride
brian.mcbr...@epimorphics.com
Epimorphics Ltd www.epimorphics.com