Re: CMS diff: Jena Full Text Search

Chris Tomlinson Wed, 22 Nov 2017 06:25:31 -0800

Hi Andy and Osma,

I posted JENA-1426 <https://issues.apache.org/jira/browse/JENA-1426> because the 
“improve this page” facility didn’t seem to offer any way to add a commit 
message or a fuller explanation of the reasons for the proposed edits, which 
were somewhat extensive. Raising an issue seemed a reasonable way to proceed; 
however, after several days with no comments, I thought I should follow the 
published protocol, so I made the update as guest on the CMS.


I had several motivations for updating the documentation: 1) to present how 
the current implementation functions in a way that might be more useful to 
users, for example by clarifying what can and cannot be expected to work when 
using the native Lucene query language, e.g., JENA-1388 
<https://issues.apache.org/jira/browse/JENA-1388>; 2) to identify areas that 
may reflect unintended aspects of the current implementation; and 3) to 
understand the code in preparation for developing a proposal for adding 
jena-text highlighting support 
<http://apache.markmail.org/message/rlzzdd3yw7a7aoqc?q=jena-text+highlighting+support>.

Based on Osma’s feedback I will be opening a few issues on JIRA and making 
corrections to the original submission. I assume that updates should just be 
made as further commits.

Thanks,
Chris



> On Nov 22, 2017, at 6:41 AM, Andy Seaborne <[email protected]> wrote:
> 
> How is this related to JENA-1426?
> 
>    Andy
> 
> On 21/11/17 14:48, Osma Suominen wrote:
>> ajs6f kirjoitti 20.11.2017 klo 18:36:
>>> Osma (or anyone else who knows text indexing better than do I, which 
>>> wouldn't take much)-- could you review this? It's got some great useful 
>>> detail about how the indexing works and can be used.
>> Sure, will do.
>> Comments about specific sections below. Generally this is a very good 
>> contribution to the jena-text documentation, which has stagnated a bit.
>>>> +The following illustrates a Lucene document that Jena will create and
>>>> +request Lucene to index:
>>>> +
>>>> +    Document<
>>>> +        stored, indexed, indexOptions=DOCS <uri:http://example.org/SomeOne>
>>>> +        indexed, omitNorms, indexOptions=DOCS <graph:urn:x-arq:DefaultGraphNode>
>>>> +        stored, indexed, tokenized <label:zorn protégé a prés>
>>>> +        stored, indexed, omitNorms, indexOptions=DOCS <lang:fr>
>>>> +        stored, indexed, tokenized <label_fr:zorn protégé a prés>
>>>> +        stored, indexed, omitNorms, indexOptions=DOCS <uid:28959d0130121b51e1459a95bdac2e04f96efa2e6518ff3c090dfa7a1e6dcf00>
>>>> +        stored, indexed, tokenized <graph:urn:x-arq:DefaultGraphNode>
>>>> +        stored, indexed, omitNorms, indexOptions=DOCS <lang:fr>
>>>> +        stored, indexed, tokenized <graph_fr:urn:x-arq:DefaultGraphNode>
>>>> +        stored, indexed, omitNorms, indexOptions=DOCS <uid:b668c6cb80475194a2ce3087c15afbcb4baf6150da98e95b590ceedd10cf242f>
>>>> +        >
>>>> +
>>>> +It may be instructive to refer back to this example when considering the 
>>>> various
>>>> +points below.
>> Not sure if this is a perfect illustration. The level of detail is rather 
>> excessive. I know Lucene quite well and I still struggle to understand 
>> what's going on here. Is there another way of presenting this information, 
>> for example just a key-value list that shows the field values that get 
>> stored in the document? I think the field options stored, indexed, 
>> tokenized, omitNorms etc. are unnecessary here or at least should not be so 
>> prominent.
>>>> +The `lang:xx` specification is an optional string, where _xx_ is
>>>> +a BCP-47 language tag. This restricts searches to field values that were 
>>>> originally
>>>> +indexed with the tag _xx_. Searches may be restricted to field values 
>>>> with no
>>>> +language tag via `"lang:none"`. The use of the `lang:xx` is only 
>>>> effective if
>>>> +[multilingual support](#linguistic-support-with-lucene-index) has been 
>>>> configured.
>> The last sentence is not true. You can restrict by language even without 
>> enabling multilingual support, as long as langField has been set.
>>>> +Further, if the `lang:xx` is used then the `property` URI must be supplied
>>>> +in order for searches to work.
>> Not true. The default property should be used if no property was specified.
>>>> +When working with `rdf:langString`s it may be tempting to write:
>>>> +
>>>> +    ?s text:query "protégé"@fr
>>>> +
>>>> +However, the above will silently fail to return results since the
>>>> +`query string` must be a simple `xsd:string` not an `rdf:langString`.
>> This could be considered a bug - at least it shouldn't fail silently.
>>>> +Even if the default _property_ is `skos:prefLabel` it is necessary
>>>> +to use the above form rather than omitting the `property` argument
>>>> +when restricting the Lucene search to a specific `lang:xx`; otherwise,
>>>> +again there will be no results.
>> Again, not true. I just tested this query against YSO:
>>  ?s text:query ("cat" "lang:en")
>> and it gave a single result, as expected.
>>>> +For a non-default `Field` with no language restriction, the patterns:
>>>> +
>>>> +    ?s text:query (rdfs:label "protégé")
>>>> +
>>>> +or
>>>> +
>>>> +    ?s text:query "rdfsLabel:protégé"
>>>> +
>>>> +may be used (see [below](#entity-map-definition) for how RDF _property_ 
>>>> names
>>>> +are mapped to Lucene `Field` names). 
>> I wouldn't recommend using a query form like "rdfsLabel:protégé" in the 
>> documentation at all. It violates the layered architecture of jena-text - 
>> the query should not be targeting named fields. If you want to target 
>> rdfs:label, use the first form.
>>>> However, as mentioned earlier,
>>>> +
>>>> +    ?s text:query ("rdfsLabel:protégé" "lang:fr")
>>>> +
>>>> +will result in an error owing to the way in which the jena-text composes 
>>>> the
>>>> +query string to Lucene in the presence of the `"lang:fr"` argument.
>> Don't do that then. Remove this section. (see previous comment)
>>>> +However, it is important to note that the apparently equivalent form:
>>>> +
>>>> +    (?s ?sc ?lit) text:query "rdfsLabel:protégé"
>>>> +
>>>> +will fail to produce a binding for `?lit` even though `?s` and `?sc` are
>>>> +bound as expected.
>> Again, don't do that. Use (rdfs:label "protégé") instead and let jena-text 
>> handle the translation from property to Lucene field.
>>>> +So if the _literal_ matches are needed you **must use** the query 
>>>> arguments that
>>>> +list the _property_ explicitly, except in the simple case of a query 
>>>> against
>>>> +the default `Field`/_property_.
>> Exactly. And those are the only supported query forms anyway.
>>>> +#### Queries across multiple `Field`s
>>>> +
>>>> +It has been mentioned earlier that the text index uses the
>>>> +[native Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description);
>>>> +however, there are important constraints on how the Lucene query language 
>>>> is used within jena-text.
>>>> +This is owing to the fact that jena-text composes the query string that 
>>>> is sent to Lucene so that
>>>> +features such as `lang:xx` may be implemented. Other aspects of using the 
>>>> Lucene query language
>>>> +reflect the fact that each triple is a separate document.
>>>> +
>>>> +This latter observation is important when considering queries that are 
>>>> intended to involve several
>>>> fields.
>>>> 
>>>> -The results include the subject URI, `?s`; the `?score` assigned by the
>>>> -text search engine; and the entire matched `?literal` (if the index has
>>>> -been [configured to store literal values](#text-dataset-assembler)).
>>>> +For example, consider the following triples:
>>>> 
>>>> -If the `query string` refers to more than one field, e.g.,
>>>> +    ex:SomePrinter
>>>> +        rdfs:label     "laser printer" ;
>>>> +        ex:description "includes a large capacity cartridge" .
>>>> 
>>>> -    "label: printer AND description: \"large capacity cartridge\""
>>>> + assuming an appropriate configuration we might expect to retrieve 
>>>> `ex:SomePrinter`
>>>> + with the following query:
>>>> 
>>>> -then the `?literal` in the results will not be bound since there is no
>>>> -single field that contains the match &ndash; the match is separated over
>>>> -two fields.
>>>> +    ?s text:query "label:printer AND description:\"large capacity cartridge\""
>>>> 
>>>> -If an output indexed term is already a known value, either as a constant
>>>> -in the query or variable already set, then the index lookup becomes a
>>>> -check that this is a match for the input arguments.
>>>> +However, this query will fail to find the expected results since the `AND` is interpreted
>>>> +by Lucene to indicate that all documents that contain a matching `label` field _and_
>>>> +a matching `description` field are to be returned. Yet from the discussion above
>>>> +regarding the [structure of Lucene documents in jena-text](#document-structure) it
>>>> +is evident that there are in fact two separate documents, one with a
>>>> +`label` field and one with a `description` field, so an effective query is:
>>>> 
>>>> +    ?s text:query "label:printer" .
>>>> +    ?s text:query "description:\"large capacity cartridge\"" .
>>>> +
>>>> +which leads to `?s` being bound to `ex:SomePrinter`.
>> Again this should instead be written as
>>     ?s text:query (rdfs:label "printer") .
>>     ?s text:query (ex:description "large capacity cartridge") .
>>>> +In other words, when a query is to involve two or more _properties_/`Field`s, it
>>>> +is expressed at the SPARQL level, as it were, rather than in Lucene's query language.
>>>> +
>>>> +It is worth noting that:
>>>> +
>>>> +    ?s text:query "label:printer OR description:\"large capacity cartridge\""
>>>> +
>>>> +works simply because Lucene is required to do nothing more than a _union_ 
>>>> of
>>>> +matching documents the same as if written:
>>>> +
>>>> +    { ?s text:query "label:printer" . }
>>>> +    union
>>>> +    { ?s text:query "description:\"large capacity cartridge\"" . }
>> Don't do that, even if it happens to work.
>>>> +On the other hand the various features of the [Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
>>>> +are all available to be used for searches within a `Field`. For example:
>>>> +
>>>> +    ?s text:query "description:(large AND cartridge)"
>>>> +
>>>> +and
>>>> +
>>>> +    (?s ?sc ?lit) text:query (ex:description "(includes AND (large OR capacity))")
>>>> +
>>>> +will work as expected.
>> The first one is better written as
>>     ?s text:query (ex:description "(large AND cartridge)")
>> -Osma
