Re: CMS diff: Jena Full Text Search

Chris Tomlinson Mon, 27 Nov 2017 10:46:45 -0800

Hi All,

I’ve completed my proposed updates to the Jena text-query documentation. The 
documentation corresponds to 3.6.0-SNAPSHOT. I’ve noted several instances where 
the current behavior may be considered an issue that will be corrected in a 
future release. I’ve separately created issues for these: JENA-1437 
<https://issues.apache.org/jira/browse/JENA-1437>, JENA-1438 
<https://issues.apache.org/jira/browse/JENA-1438>, and JENA-1439 
<https://issues.apache.org/jira/browse/JENA-1439>.


Thank you,
Chris


> On Nov 22, 2017, at 8:25 AM, Chris Tomlinson <chris.j.tomlin...@gmail.com> 
> wrote:
> 
> Hi Andy and Osma,
> 
> I posted JENA-1426 <https://issues.apache.org/jira/browse/JENA-1426> since 
> the “improve this page” facility didn’t seem to offer any way to add a commit 
> message or a fuller explanation of the reasons for the proposed edits, which 
> were somewhat extensive. Raising an issue seemed a way to proceed; however, 
> after several days with no comments I thought perhaps I should follow the 
> published protocol, so I made the update as guest on the CMS.
> 
> I had several motivations regarding updating the documentation: 1) I wanted 
> to present how the current implementation functions in a way that might be 
> more useful to users - for example clarifying what can be expected to work 
> and what not in terms of using the native Lucene query language, e.g., 
> JENA-1388 <https://issues.apache.org/jira/browse/JENA-1388>; 2) identify 
> areas that might indicate perhaps unintended aspects of the current 
> implementation; and 3) understand the code in preparation for developing a 
> proposal for adding jena-text highlighting support 
> <http://apache.markmail.org/message/rlzzdd3yw7a7aoqc?q=jena-text+highlighting+support>.
> 
> Based on Osma’s feedback I will be opening a few issues on JIRA and making 
> corrections to the original submission. I assume that updates should just be 
> made as further commits.
> 
> Thanks,
> Chris
> 
> 
> 
>> On Nov 22, 2017, at 6:41 AM, Andy Seaborne <a...@apache.org> wrote:
>> 
>> How is this related to JENA-1426?
>> 
>>    Andy
>> 
>> On 21/11/17 14:48, Osma Suominen wrote:
>>> ajs6f wrote on 20.11.2017 at 18:36:
>>>> Osma (or anyone else who knows text indexing better than do I, which 
>>>> wouldn't take much)-- could you review this? It's got some great useful 
>>>> detail about how the indexing works and can be used.
>>> Sure, will do.
>>> Comments about specific sections below. Generally this is a very good 
>>> contribution to the jena-text documentation, which has stagnated a bit.
>>>>> +The following illustrates a Lucene document that Jena will create and
>>>>> +request Lucene to index:
>>>>> +
>>>>> +    Document<
>>>>> +        stored, indexed, indexOptions=DOCS <uri:http://example.org/SomeOne>
>>>>> +        indexed, omitNorms, indexOptions=DOCS <graph:urn:x-arq:DefaultGraphNode>
>>>>> +        stored, indexed, tokenized <label:zorn protégé a prés>
>>>>> +        stored, indexed, omitNorms, indexOptions=DOCS <lang:fr>
>>>>> +        stored, indexed, tokenized <label_fr:zorn protégé a prés>
>>>>> +        stored, indexed, omitNorms, indexOptions=DOCS <uid:28959d0130121b51e1459a95bdac2e04f96efa2e6518ff3c090dfa7a1e6dcf00>
>>>>> +        stored, indexed, tokenized <graph:urn:x-arq:DefaultGraphNode>
>>>>> +        stored, indexed, omitNorms, indexOptions=DOCS <lang:fr>
>>>>> +        stored, indexed, tokenized <graph_fr:urn:x-arq:DefaultGraphNode>
>>>>> +        stored, indexed, omitNorms, indexOptions=DOCS <uid:b668c6cb80475194a2ce3087c15afbcb4baf6150da98e95b590ceedd10cf242f>
>>>>> +        >
>>>>> +
>>>>> +It may be instructive to refer back to this example when considering the various
>>>>> +points below.
>>> Not sure if this is a perfect illustration. The level of detail is rather 
>>> excessive. I know Lucene quite well and I still struggle to understand 
>>> what's going on here. Is there another way of presenting this information, 
>>> for example just a key-value list that shows the field values that get 
>>> stored in the document? I think the field options stored, indexed, 
>>> tokenized, omitNorms etc. are unnecessary here or at least should not be so 
>>> prominent.
>>>>> +The `lang:xx` specification is an optional string, where _xx_ is
>>>>> +a BCP-47 language tag. This restricts searches to field values that were originally
>>>>> +indexed with the tag _xx_. Searches may be restricted to field values with no
>>>>> +language tag via `"lang:none"`. The use of the `lang:xx` is only effective if
>>>>> +[multilingual support](#linguistic-support-with-lucene-index) has been configured.
>>> The last sentence is not true. You can restrict by language even without 
>>> enabling multilingual support, as long as langField has been set.
>>>>> +Further, if the `lang:xx` is used then the `property` URI must be supplied
>>>>> +in order for searches to work.
>>> Not true. The default property should be used if no property was specified.
>>>>> +When working with `rdf:langString`s it may be tempting to write:
>>>>> +
>>>>> +    ?s text:query "protégé"@fr
>>>>> +
>>>>> +However, the above will silently fail to return results since the
>>>>> +`query string` must be a simple `xsd:string` not an `rdf:langString`.
>>> This could be considered a bug - at least it shouldn't fail silently.
>>>>> +Even if the default _property_ is `skos:prefLabel` it is necessary
>>>>> +to use the above form rather than omitting the `property` argument
>>>>> +when restricting the Lucene search to a specific `lang:xx`; otherwise,
>>>>> +again there will be no results.
>>> Again, not true. I just tested this query against YSO:
>>>  ?s text:query ("cat" "lang:en")
>>> and it gave a single result, as expected.
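>>> A complete query in this form might look like the following sketch. It
>>> assumes that langField is set in the entity map and that the labels being
>>> searched belong to the index's default property; only the text: prefix is
>>> declared because no other prefixed name is used:
>>>
>>>     PREFIX text: <http://jena.apache.org/text#>
>>>
>>>     SELECT ?s
>>>     WHERE {
>>>       ?s text:query ("cat" "lang:en") .
>>>     }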
>>>>> +For a non-default `Field` with no language restriction, the patterns:
>>>>> +
>>>>> +    ?s text:query (rdfs:label "protégé")
>>>>> +
>>>>> +or
>>>>> +
>>>>> +    ?s text:query "rdfsLabel:protégé"
>>>>> +
>>>>> +may be used (see [below](#entity-map-definition) for how RDF _property_ names
>>>>> +are mapped to Lucene `Field` names). 
>>> I wouldn't recommend using a query form like "rdfsLabel:protégé" in the 
>>> documentation at all. It violates the layered architecture of jena-text - 
>>> the query should not be targeting named fields. If you want to target 
>>> rdfs:label, use the first form.
>>>>> However, as mentioned earlier,
>>>>> +
>>>>> +    ?s text:query ("rdfsLabel:protégé" "lang:fr")
>>>>> +
>>>>> +will result in an error owing to the way in which jena-text composes the
>>>>> +query string to Lucene in the presence of the `"lang:fr"` argument.
>>> Don't do that then. Remove this section. (see previous comment)
>>>>> +However, it is important to note that the apparently equivalent form:
>>>>> +
>>>>> +    (?s ?sc ?lit) text:query "rdfsLabel:protégé"
>>>>> +
>>>>> +will fail to produce a binding for `?lit` even though `?s` and `?sc` are
>>>>> +bound as expected.
>>> Again, don't do that. Use (rdfs:label "protégé") instead and let jena-text 
>>> handle the translation from property to Lucene field.
>>>>> +So if the _literal_ matches are needed you **must use** the query arguments that
>>>>> +list the _property_ explicitly, except in the simple case of a query against
>>>>> +the default `Field`/_property_.
>>> Exactly. And those are the only supported query forms anyway.
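>>> As a sketch of the supported form, assuming rdfs:label is mapped in the
>>> entity map and the index is configured to store literal values (the prefix
>>> declarations are added here only for completeness):
>>>
>>>     PREFIX text: <http://jena.apache.org/text#>
>>>     PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
>>>
>>>     SELECT ?s ?sc ?lit
>>>     WHERE {
>>>       (?s ?sc ?lit) text:query (rdfs:label "protégé") .
>>>     }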
>>>>> +#### Queries across multiple `Field`s
>>>>> +
>>>>> +It has been mentioned earlier that the text index uses the
>>>>> +[native Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description);
>>>>> +however, there are important constraints on how the Lucene query language is used within jena-text.
>>>>> +This is owing to the fact that jena-text composes the query string that is sent to Lucene so that
>>>>> +features such as `lang:xx` may be implemented. Other aspects of using the Lucene query language
>>>>> +reflect the fact that each triple is a separate document.
>>>>> +
>>>>> +This latter observation is important when considering queries that are intended to involve several
>>>>> fields.
>>>>> 
>>>>> -The results include the subject URI, `?s`; the `?score` assigned by the
>>>>> -text search engine; and the entire matched `?literal` (if the index has
>>>>> -been [configured to store literal values](#text-dataset-assembler)).
>>>>> +For example, consider the following triples:
>>>>> 
>>>>> -If the `query string` refers to more than one field, e.g.,
>>>>> +    ex:SomePrinter
>>>>> +        rdfs:label     "laser printer" ;
>>>>> +        ex:description "includes a large capacity cartridge" .
>>>>> 
>>>>> -    "label: printer AND description: \"large capacity cartridge\""
>>>>> + assuming an appropriate configuration we might expect to retrieve `ex:SomePrinter`
>>>>> + with the following query:
>>>>> 
>>>>> -then the `?literal` in the results will not be bound since there is no
>>>>> -single field that contains the match &ndash; the match is separated over
>>>>> -two fields.
>>>>> +    ?s text:query "label:printer AND description:\"large capacity cartridge\""
>>>>> 
>>>>> -If an output indexed term is already a known value, either as a constant
>>>>> -in the query or variable already set, then the index lookup becomes a
>>>>> -check that this is a match for the input arguments.
>>>>> +However, this query will fail to find the expected results since the `AND` is interpreted
>>>>> +by Lucene to indicate that all documents that contain a matching `label` field _and_
>>>>> +a matching `description` field are to be returned. Yet from the discussion above
>>>>> +regarding the [structure of Lucene documents in jena-text](#document-structure) it
>>>>> +is evident that there is not one but rather in fact two separate documents, one with a
>>>>> +`label` field and one with a `description` field, so an effective query is:
>>>>> 
>>>>> +    ?s text:query "label:printer" .
>>>>> +    ?s text:query "description:\"large capacity cartridge\"" .
>>>>> +
>>>>> +which leads to `?s` being bound to `ex:SomePrinter`.
>>> Again this should instead be written as
>>>     ?s text:query (rdfs:label "printer") .
>>>     ?s text:query (ex:description "large capacity cartridge") .
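>>> Wrapped in a full query this might read as follows. The ex: namespace URI
>>> is a placeholder, and both properties are assumed to be mapped in the
>>> entity map. Add escaped inner quotes, as in the diff's version, if the
>>> description has to match as a phrase rather than as separate terms:
>>>
>>>     PREFIX text: <http://jena.apache.org/text#>
>>>     PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
>>>     PREFIX ex:   <http://example.org/>
>>>
>>>     SELECT ?s
>>>     WHERE {
>>>       ?s text:query (rdfs:label "printer") .
>>>       ?s text:query (ex:description "large capacity cartridge") .
>>>     }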
>>>>> +In other words when a query is to involve two or more _properties_/`Field`s then it
>>>>> +should be expressed at the SPARQL level, as it were, versus in Lucene's query language.
>>>>> +
>>>>> +It is worth noting that:
>>>>> +
>>>>> +    ?s text:query "label:printer OR description:\"large capacity cartridge\""
>>>>> +
>>>>> +works simply because Lucene is required to do nothing more than a _union_ of
>>>>> +matching documents the same as if written:
>>>>> +
>>>>> +    { ?s text:query "label:printer" . }
>>>>> +    union
>>>>> +    { ?s text:query "description:\"large capacity cartridge\"" . }
>>> Don't do that, even if it happens to work.
>>>>> +On the other hand the various features of the [Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
>>>>> +are all available to be used for searches within a `Field`. For example:
>>>>> +
>>>>> +    ?s text:query "description:(large AND cartridge)"
>>>>> +
>>>>> +and
>>>>> +
>>>>> +    (?s ?sc ?lit) text:query (ex:description "(includes AND (large OR capacity))")
>>>>> +
>>>>> +will work as expected.
>>> The first one is better written as
>>>     ?s text:query (ex:description "(large AND cartridge)")
>>> -Osma
>
