Re: CMS diff: Jena Full Text Search

Osma Suominen Tue, 21 Nov 2017 06:49:17 -0800

ajs6f wrote on 20.11.2017 at 18:36:

Osma (or anyone else who knows text indexing better than do I, which wouldn't 
take much)-- could you review this? It's got some great useful detail about how 
the indexing works and can be used.


Sure, will do.

Comments about specific sections below. Generally this is a very good contribution to the jena-text documentation, which has stagnated a bit.

+The following illustrates a Lucene document that Jena will create and
+request Lucene to index:
+
+    Document<
+        stored, indexed, indexOptions=DOCS <uri:http://example.org/SomeOne>
+        indexed, omitNorms, indexOptions=DOCS <graph:urn:x-arq:DefaultGraphNode>
+        stored, indexed, tokenized <label:zorn protégé a prés>
+        stored, indexed, omitNorms, indexOptions=DOCS <lang:fr>
+        stored, indexed, tokenized <label_fr:zorn protégé a prés>
+        stored, indexed, omitNorms, indexOptions=DOCS <uid:28959d0130121b51e1459a95bdac2e04f96efa2e6518ff3c090dfa7a1e6dcf00>
+        stored, indexed, tokenized <graph:urn:x-arq:DefaultGraphNode>
+        stored, indexed, omitNorms, indexOptions=DOCS <lang:fr>
+        stored, indexed, tokenized <graph_fr:urn:x-arq:DefaultGraphNode>
+        stored, indexed, omitNorms, indexOptions=DOCS <uid:b668c6cb80475194a2ce3087c15afbcb4baf6150da98e95b590ceedd10cf242f>
+        >
+
+It may be instructive to refer back to this example when considering the various
+points below.

Not sure if this is a perfect illustration. The level of detail is rather excessive. I know Lucene quite well and I still struggle to understand what's going on here. Is there another way of presenting this information, for example just a key-value list that shows the field values that get stored in the document? I think the field options stored, indexed, tokenized, omitNorms etc. are unnecessary here or at least should not be so prominent.
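
As a rough sketch of what such a key-value presentation could look like, here is a small Python snippet. The field names and values are copied from the first group of fields in the example above; the plain-dict layout and the `format_fields` helper are assumptions made purely for illustration and are not part of jena-text or Lucene.

```python
# Field values copied from the first group of fields in the quoted example.
# The dict layout is an illustrative assumption, not jena-text's internal model.
doc_fields = {
    "uri": "http://example.org/SomeOne",
    "graph": "urn:x-arq:DefaultGraphNode",
    "label": "zorn protégé a prés",
    "lang": "fr",
    "label_fr": "zorn protégé a prés",
    "uid": "28959d0130121b51e1459a95bdac2e04f96efa2e6518ff3c090dfa7a1e6dcf00",
}


def format_fields(fields):
    """Render one 'name: value' line per field, in insertion order."""
    return "\n".join(f"{name}: {value}" for name, value in fields.items())


print(format_fields(doc_fields))
```

A listing like this shows only what ends up in the document and leaves out the per-field options (stored, indexed, tokenized, omitNorms).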

+The `lang:xx` specification is an optional string, where _xx_ is
+a BCP-47 language tag. This restricts searches to field values that were originally
+indexed with the tag _xx_. Searches may be restricted to field values with no
+language tag via `"lang:none"`. The use of the `lang:xx` is only effective if
+[multilingual support](#linguistic-support-with-lucene-index) has been configured.

The last sentence is not true. You can restrict by language even without enabling multilingual support, as long as langField has been set.

+Further, if the `lang:xx` is used then the `property` URI must be supplied
+in order for searches to work.


Not true. The default property should be used if no property was specified.

+When working with `rdf:langString`s It may be tempting to write:
+
+    ?s text:query "protégé"@fr
+
+However, the above will silently fail to return results since the
+`query string` must be a simple `xsd:string` not an `rdf:langString`.


This could be considered a bug - at least it shouldn't fail silently.

+Even if the default _property_ is `skos:prefLabel` it is necessary
+to use the above form rather than omitting the `property` argument
+when restricting the Lucene search to a specific `lang:xx`; otherwise,
+again there will be no results.


Again, not true. I just tested this query against YSO:
 ?s text:query ("cat" "lang:en")
and it gave a single result, as expected.

+For a non-default `Field` with no language restriction, the patterns:
+
+    ?s text:query (rdfs:label "protégé")
+
+or
+
+    ?s text:query "rdfsLabel:protégé"
+
+may be used (see [below](#entity-map-definition) for how RDF _property_ names
+are mapped to Lucene `Field` names).

I wouldn't recommend using a query form like "rdfsLabel:protégé" in the documentation at all. It violates the layered architecture of jena-text - the query should not be targeting named fields. If you want to target rdfs:label, use the first form.

+However, as mentioned earlier,
+
+    ?s text:query ("rdfsLabel:protégé" "lang:fr")
+
+will result in an error owing to the way in which the jena-text composes the
+query string to Lucene in the presence of the `"lang:fr"` argument.


Don't do that then. Remove this section. (see previous comment)

+However, it is important to note that the apparently equivalent form:
+
+    (?s ?sc ?lit) text:query "rdfsLabel:protégé"
+
+will fail to produce a binding for `?lit` even though `?s` and `?sc` are
+bound as expected.

Again, don't do that. Use (rdfs:label "protégé") instead and let jena-text handle the translation from property to Lucene field.

+So if the _literal_ matches are needed you **must use** the query arguments that
+list the _property_ explicitly, except in the simple case of a query against
+the default `Field`/_property_.


Exactly. And those are the only supported query forms anyway.

+#### Queries across multiple `Field`s
+
+It has been mentioned earlier that the text index uses the
+[native Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description);
+however, there are important constraints on how the Lucene query language is used within jena-text.
+This is owing to the fact that jena-text composes the query string that is sent to Lucene so that
+features such as `lang:xx` may be implemented. Other aspects of using the Lucene query language
+reflect the fact that each triple is a separate document.
+
+This latter observation is important when considering queries that are intended to involve several
+fields.

-The results include the subject URI, `?s`; the `?score` assigned by the
-text search engine; and the entire matched `?literal` (if the index has
-been [configured to store literal values](#text-dataset-assembler)).
+For example, consider the following triples:

-If the `query string` refers to more than one field, e.g.,
+    ex:SomePrinter
+        rdfs:label     "laser printer" ;
+        ex:description "includes a large capacity cartridge" .

-    "label: printer AND description: \"large capacity cartridge\""
+ assuming an appropriate configuration we might expect to retrieve `ex:SomePrinter`
+ with the following query:

-then the `?literal` in the results will not be bound since there is no
-single field that contains the match &ndash; the match is separated over
-two fields.
+    ?s text:query "label:printer AND description:\"large capacity cartridge\""

-If an output indexed term is already a known value, either as a constant
-in the query or variable already set, then the index lookup becomes a
-check that this is a match for the input arguments.
+However, this query will fail to find the expected results since the `AND` is interpreted
+by Lucene to indicate that all documents that contain a matching `label` field _and_
+a matching `description` field are to be returned. Yet from the discussion above
+regarding the [structure of Lucene documents in jena-text](#document-structure) it
+is evident that there is not one but rather in fact two separate documents one with a
+`label` field and one with a `description` field so an effective query is:

+    ?s text:query "label:printer" .
+    ?s text:query "description:\"large capacity cartridge\"" .
+
+which leads to `?s` being bound to `ex:SomePrinter`.


Again this should instead be written as

    ?s text:query (rdfs:label "printer") .
    ?s text:query (ex:description "large capacity cartridge") .
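
To illustrate why a single `AND` query finds nothing while two separate `text:query` patterns do, here is a toy Python model of the one-document-per-triple layout described in the diff. The `docs` list and both helper functions are hypothetical and use simple substring matching; this is only a sketch of the reasoning, not how Lucene or jena-text evaluate queries.

```python
# Toy model: one "document" per triple, as the quoted text describes.
# All names here are hypothetical; matching is a plain substring test.
docs = [
    {"uri": "ex:SomePrinter", "label": "laser printer"},
    {"uri": "ex:SomePrinter", "description": "includes a large capacity cartridge"},
]


def match(field, phrase):
    """Return the subject URIs having a document whose field contains phrase."""
    return {d["uri"] for d in docs if phrase in d.get(field, "")}


def match_all_in_one_doc(criteria):
    """Mimic AND inside one query: every field must match in the same document."""
    return {
        d["uri"]
        for d in docs
        if all(phrase in d.get(field, "") for field, phrase in criteria.items())
    }


criteria = {"label": "printer", "description": "large capacity cartridge"}

# AND within a single query string: no single document carries both fields.
print(match_all_in_one_doc(criteria))  # set()

# Two text:query patterns joined at the SPARQL level: intersect the subjects.
print(match("label", "printer") & match("description", "large capacity cartridge"))
# {'ex:SomePrinter'}
```

The join on `?s` happens outside the index, which is why expressing the conjunction as two patterns works regardless of how the fields are laid out in Lucene.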

+In other words when a query is to involve two or more _properties_/`Field`s then it
+expressed at the SPARQL level, as it were, versus in Lucene's query language.
+
+It is worth noting that:
+
+    ?s text:query "label:printer OR description:\"large capacity cartridge\""
+
+works simply because Lucene is required to do nothing more than a _union_ of
+matching documents the same as if written:
+
+    { ?s text:query "label:printer" . }
+    union
+    { ?s text:query "description:\"large capacity cartridge\"" . }


Don't do that, even if it happens to work.

+On the other hand the various features of the [Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
+are all available to be used for searches within a `Field`. For example:
+
+    ?s text:query "description:(large AND cartridge)"
+
+and
+
+    (?s ?sc ?lit) text:query (ex:description "(includes AND (large OR capacity))")
+
+will work as expected.


The first one is better written as

    ?s text:query (ex:description "(large AND cartridge)")


-Osma

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
[email protected]
http://www.nationallibrary.fi
