Hi All,

I’ve completed my proposed updates to the Jena text-query documentation. The documentation corresponds to 3.6.0-SNAPSHOT.

I’ve noted several instances where the current behavior may be considered an issue that will be corrected in a future release. I’ve separately created issues for these: JENA-1437 <https://issues.apache.org/jira/browse/JENA-1437>, JENA-1438 <https://issues.apache.org/jira/browse/JENA-1438>, and JENA-1439 <https://issues.apache.org/jira/browse/JENA-1439>.

Thank you,
Chris

> On Nov 22, 2017, at 8:25 AM, Chris Tomlinson <[email protected]>
> wrote:
>
> Hi Andy and Osma,
>
> I posted JENA-1426 <https://issues.apache.org/jira/browse/JENA-1426> since
> the “improve this page” facility didn’t seem to offer any way to add a commit
> message or more extensive explanation of the reasons for the proposed edits
> and they were somewhat extensive. So raising an issue seemed a way to
> proceed; however, after several days with no comments I thought perhaps I
> should follow the published protocol and I made the update as guest on the
> CMS.
>
> I had several motivations regarding updating the documentation: 1) I wanted
> to present how the current implementation functions in a way that might be
> more useful to users - for example clarifying what can be expected to work
> and what not in terms of using the native Lucene query language, e.g.,
> JENA-1388 <https://issues.apache.org/jira/browse/JENA-1388>; 2) identify
> areas that might indicate perhaps unintended aspects of the current
> implementation; and 3) understand the code in preparation for developing a
> proposal for adding jena-text highlighting support
> <http://apache.markmail.org/message/rlzzdd3yw7a7aoqc?q=jena-text+highlighting+support>.
>
> Based on Osma’s feedback I will be opening a few issues on JIRA and making
> corrections to the original submission. I assume that updates should just be
> made as further commits.
>
> Thanks,
> Chris
>
>> On Nov 22, 2017, at 6:41 AM, Andy Seaborne <[email protected]> wrote:
>>
>> How is this related to JENA-1426?
>>
>> Andy
>>
>> On 21/11/17 14:48, Osma Suominen wrote:
>>> ajs6f wrote on 20.11.2017 at 18:36:
>>>> Osma (or anyone else who knows text indexing better than do I, which
>>>> wouldn't take much)-- could you review this? It's got some great useful
>>>> detail about how the indexing works and can be used.
>>> Sure, will do.
>>> Comments about specific sections below. Generally this is a very good
>>> contribution to the jena-text documentation, which has stagnated a bit.
>>>>> +The following illustrates a Lucene document that Jena will create and
>>>>> +request Lucene to index:
>>>>> +
>>>>> +    Document<
>>>>> +      stored, indexed, indexOptions=DOCS <uri:http://example.org/SomeOne>
>>>>> +      indexed, omitNorms, indexOptions=DOCS <graph:urn:x-arq:DefaultGraphNode>
>>>>> +      stored, indexed, tokenized <label:zorn protégé a prés>
>>>>> +      stored, indexed, omitNorms, indexOptions=DOCS <lang:fr>
>>>>> +      stored, indexed, tokenized <label_fr:zorn protégé a prés>
>>>>> +      stored, indexed, omitNorms, indexOptions=DOCS <uid:28959d0130121b51e1459a95bdac2e04f96efa2e6518ff3c090dfa7a1e6dcf00>
>>>>> +      stored, indexed, tokenized <graph:urn:x-arq:DefaultGraphNode>
>>>>> +      stored, indexed, omitNorms, indexOptions=DOCS <lang:fr>
>>>>> +      stored, indexed, tokenized <graph_fr:urn:x-arq:DefaultGraphNode>
>>>>> +      stored, indexed, omitNorms, indexOptions=DOCS <uid:b668c6cb80475194a2ce3087c15afbcb4baf6150da98e95b590ceedd10cf242f>
>>>>> +      >
>>>>> +
>>>>> +It may be instructive to refer back to this example when considering the various
>>>>> +points below.
>>> Not sure if this is a perfect illustration. The level of detail is rather
>>> excessive. I know Lucene quite well and I still struggle to understand
>>> what's going on here. Is there another way of presenting this information,
>>> for example just a key-value list that shows the field values that get
>>> stored in the document? I think the field options stored, indexed,
>>> tokenized, omitNorms etc. are unnecessary here or at least should not be so
>>> prominent.
>>>>> +The `lang:xx` specification is an optional string, where _xx_ is
>>>>> +a BCP-47 language tag. This restricts searches to field values that were originally
>>>>> +indexed with the tag _xx_. Searches may be restricted to field values with no
>>>>> +language tag via `"lang:none"`. The use of the `lang:xx` is only effective if
>>>>> +[multilingual support](#linguistic-support-with-lucene-index) has been configured.
>>> The last sentence is not true. You can restrict by language even without
>>> enabling multilingual support, as long as langField has been set.
>>>>> +Further, if the `lang:xx` is used then the `property` URI must be supplied
>>>>> +in order for searches to work.
>>> Not true. The default property should be used if no property was specified.
>>>>> +When working with `rdf:langString`s It may be tempting to write:
>>>>> +
>>>>> +    ?s text:query "protégé"@fr
>>>>> +
>>>>> +However, the above will silently fail to return results since the
>>>>> +`query string` must be a simple `xsd:string` not an `rdf:langString`.
>>> This could be considered a bug - at least it shouldn't fail silently.
>>>>> +Even if the default _property_ is `skos:prefLabel` it is necessary
>>>>> +to use the above form rather than omitting the `property` argument
>>>>> +when restricting the Lucene search to a specific `lang:xx`; otherwise,
>>>>> +again there will be no results.
>>> Again, not true. I just tested this query against YSO:
>>>     ?s text:query ("cat" "lang:en")
>>> and it gave a single result, as expected.
>>>>> +For a non-default `Field` with no language restriction, the patterns:
>>>>> +
>>>>> +    ?s text:query (rdfs:label "protégé")
>>>>> +
>>>>> +or
>>>>> +
>>>>> +    ?s text:query "rdfsLabel:protégé"
>>>>> +
>>>>> +may be used (see [below](#entity-map-definition) for how RDF _property_ names
>>>>> +are mapped to Lucene `Field` names).
>>> I wouldn't recommend using a query form like "rdfsLabel:protégé" in the
>>> documentation at all. It violates the layered architecture of jena-text -
>>> the query should not be targeting named fields. If you want to target
>>> rdfs:label, use the first form.
>>>>> However, as mentioned earlier,
>>>>> +
>>>>> +    ?s text:query ("rdfsLabel:protégé" "lang:fr")
>>>>> +
>>>>> +will result in an error owing to the way in which the jena-text composes the
>>>>> +query string to Lucene in the presence of the `"lang:fr"` argument.
>>> Don't do that then. Remove this section. (see previous comment)
>>>>> +However, it is important to note that the apparently equivalent form:
>>>>> +
>>>>> +    (?s ?sc ?lit) text:query "rdfsLabel:protégé"
>>>>> +
>>>>> +will fail to produce a binding for `?lit` even though `?s` and `?sc` are
>>>>> +bound as expected.
>>> Again, don't do that. Use (rdfs:label "protégé") instead and let jena-text
>>> handle the translation from property to Lucene field.
>>>>> +So if the _literal_ matches are needed you **must use** the query arguments that
>>>>> +list the _property_ explicitly, except in the simple case of a query against
>>>>> +the default `Field`/_property_.
>>> Exactly. And those are the only supported query forms anyway.
>>>>> +#### Queries across multiple `Field`s
>>>>> +
>>>>> +It has been mentioned earlier that the text index uses the
>>>>> +[native Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description);
>>>>> +however, there are important constraints on how the Lucene query language is used within jena-text.
>>>>> +This is owing to the fact that jena-text composes the query string that is sent to Lucene so that
>>>>> +features such as `lang:xx` may be implemented. Other aspects of using the Lucene query language
>>>>> +reflect the fact that each triple is a separate document.
>>>>> +
>>>>> +This latter observation is important when considering queries that are intended to involve several fields.
>>>>>
>>>>> -The results include the subject URI, `?s`; the `?score` assigned by the
>>>>> -text search engine; and the entire matched `?literal` (if the index has
>>>>> -been [configured to store literal values](#text-dataset-assembler)).
>>>>> +For example, consider the following triples:
>>>>>
>>>>> -If the `query string` refers to more than one field, e.g.,
>>>>> +    ex:SomePrinter
>>>>> +        rdfs:label "laser printer" ;
>>>>> +        ex:description "includes a large capacity cartridge" .
>>>>>
>>>>> -    "label: printer AND description: \"large capacity cartridge\""
>>>>> + assuming an appropriate configuration we might expect to retrieve `ex:SomePrinter`
>>>>> + with the following query:
>>>>>
>>>>> -then the `?literal` in the results will not be bound since there is no
>>>>> -single field that contains the match – the match is separated over
>>>>> -two fields.
>>>>> +    ?s text:query "label:printer AND description:\"large capacity cartridge\""
>>>>>
>>>>> -If an output indexed term is already a known value, either as a constant
>>>>> -in the query or variable already set, then the index lookup becomes a
>>>>> -check that this is a match for the input arguments.
>>>>> +However, this query will fail to find the expected results since the `AND` is interpreted
>>>>> +by Lucene to indicate that all documents that contain a matching `label` field _and_
>>>>> +a matching `description` field are to be returned. Yet from the discussion above
>>>>> +regarding the [structure of Lucene documents in jena-text](#document-structure) it
>>>>> +is evident that there is not one but rather in fact two separate documents one with a
>>>>> +`label` field and one with a `description` field so an effective query is:
>>>>>
>>>>> +    ?s text:query "label:printer" .
>>>>> +    ?s text:query "description:\"large capacity cartridge\"" .
>>>>> +
>>>>> +which leads to `?s` being bound to `ex:SomePrinter`.
>>> Again this should instead be written as
>>>     ?s text:query (rdfs:label "printer") .
>>>     ?s text:query (ex:description "large capacity cartridge") .
>>>>> +In other words when a query is to involve two or more _properties_/`Field`s then it
>>>>> +expressed at the SPARQL level, as it were, versus in Lucene's query language.
>>>>> +
>>>>> +It is worth noting that:
>>>>> +
>>>>> +    ?s text:query "label:printer OR description:\"large capacity cartridge\""
>>>>> +
>>>>> +works simply because Lucene is required to do nothing more than a _union_ of
>>>>> +matching documents the same as if written:
>>>>> +
>>>>> +    { ?s text:query "label:printer" . }
>>>>> +    union
>>>>> +    { ?s text:query "description:\"large capacity cartridge\"" . }
>>> Don't do that, even if it happens to work.
>>>>> +On the other hand the various features of the [Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
>>>>> +are all available to be used for searches within a `Field`. For example:
>>>>> +
>>>>> +    ?s text:query "description:(large AND cartridge)"
>>>>> +
>>>>> +and
>>>>> +
>>>>> +    (?s ?sc ?lit) text:query (ex:description "(includes AND (large OR capacity))")
>>>>> +
>>>>> +will work as expected.
>>> The first one is better written as
>>>     ?s text:query (ex:description "(large AND cartridge)")
>>> -Osma

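For anyone following the thread, the point about each triple being indexed as a separate document can be sketched with a toy model in Python. This is not Lucene or jena-text code: the `docs` list, the `matches` and `subjects` helpers, and the substring matching are all assumptions made purely for illustration. It only shows why an `AND` across two fields finds nothing when no single document carries both fields, while joining two separate lookups on the subject does.

```python
# Toy model only: this is NOT Lucene or jena-text, just a sketch of the
# "one document per triple" layout discussed in the thread.
# All names here (docs, matches, subjects) are hypothetical.

# Each triple becomes its own document: a subject URI plus exactly one text field.
docs = [
    {"uri": "ex:SomePrinter", "label": "laser printer"},
    {"uri": "ex:SomePrinter", "description": "includes a large capacity cartridge"},
]


def matches(doc, field, phrase):
    """True if the document has `field` and its text contains `phrase`.

    Plain substring matching stands in for real analysis and scoring.
    """
    return phrase in doc.get(field, "")


def subjects(field, phrase):
    """Subjects of all documents whose `field` contains `phrase`."""
    return {d["uri"] for d in docs if matches(d, field, phrase)}


# "label:printer AND description:..." asks ONE document to satisfy both clauses.
and_in_one_doc = {
    d["uri"]
    for d in docs
    if matches(d, "label", "printer")
    and matches(d, "description", "large capacity cartridge")
}

# Two text:query patterns sharing ?s: intersect the per-field subject sets.
joined = subjects("label", "printer") & subjects(
    "description", "large capacity cartridge"
)

# "label:printer OR description:..." is just a union of matching documents.
either = subjects("label", "printer") | subjects(
    "description", "large capacity cartridge"
)

print(and_in_one_doc)  # set()
print(joined)          # {'ex:SomePrinter'}
print(either)          # {'ex:SomePrinter'}
```

In this sketch the intersection plays the role of the two `text:query` patterns joined on `?s`, which is the form recommended in the review comments; the real index of course applies analyzers and scoring that the toy model ignores.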