I went to review this diff and rediscovered (to my chagrin) that I really know very little about Jena's text indexing.
Osma (or anyone else who knows text indexing better than I do, which wouldn't take much): could you review this? It's got some great, useful detail about how the indexing works and can be used.

ajs6f

> On Nov 20, 2017, at 1:51 AM, Chris Tomlinson <anonym...@apache.org> wrote:
>
> Clone URL (Committers only):
> https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext
>
> Chris Tomlinson
>
> Index: trunk/content/documentation/query/text-query.mdtext
> ===================================================================
> --- trunk/content/documentation/query/text-query.mdtext (revision 1815762)
> +++ trunk/content/documentation/query/text-query.mdtext (working copy)
> @@ -1,5 +1,7 @@
>  Title: Jena Full Text Search
>
> +Title: Jena Full Text Search
> +
>  This extension to ARQ combines SPARQL and full text search via
>  [Lucene](https://lucene.apache.org) 6.4.1 or
>  [ElasticSearch](https://www.elastic.co) 5.2.1 (which is built on
> @@ -64,7 +66,20 @@
>  ## Table of Contents
>
>  - [Architecture](#architecture)
> +    - [External content](#external-content)
> +    - [External applications](#external-applications)
> +    - [Document structure](#document-structure)
>  - [Query with SPARQL](#query-with-sparql)
> +    - [Syntax](#syntax)
> +        - [Input arguments](#input-arguments)
> +        - [Output arguments](#output-arguments)
> +    - [Query strings](#query-strings)
> +        - [Simple queries](#simple-queries)
> +        - [Queries with language tags](#queries-with-language-tags)
> +        - [Queries that retrieve literals](#queries-that-retrieve-literals)
> +        - [Queries across multiple `Field`s](#queries-across-multiple-fields)
> +        - [Queries within a `Field`](#queries-within-a-field)
> +    - [Good practice](#good-practice)
>  - [Configuration](#configuration)
>      - [Text Dataset Assembler](#text-dataset-assembler)
>      - [Configuring an analyzer](#configuring-an-analyzer)
> @@ -134,6 +149,69 @@
>  By using Elasticsearch, other applications can share the text index with
>  SPARQL search.
>
> +### Document structure
> +
> +As mentioned above, text indexing of a triple involves associating a Lucene
> +document with the triple. How is this done?
> +
> +Lucene documents are composed of `Field`s. Indexing and searching are performed
> +over the contents of these `Field`s. For an RDF triple to be indexed in Lucene the
> +_property_ of the triple must be
> +[configured in the entity map of a TextIndex](#entity-map-definition).
> +This associates a Lucene analyzer with the _`property`_ which will be used
> +for indexing and search. The _`property`_ becomes the _searchable_ Lucene
> +`Field` in the resulting document.
> +
> +A Lucene index includes a _default_ `Field`, which is specified in the configuration,
> +that is the field to search if not otherwise named in the query. In jena-text
> +this field is configured via the `text:defaultField` property which is then mapped
> +to a specific RDF property via `text:predicate` (see [entity map](#entity-map-definition)
> +below).
> +
> +There are several additional `Field`s that will be included in the
> +document that is passed to the Lucene `IndexWriter` depending on the
> +configuration options that are used. These additional fields are used to
> +manage the interface between Jena and Lucene and are not generally
> +searchable per se.
> +
> +The most important of these additional `Field`s is the `text:entityField`.
> +This configuration property defines the name of the `Field` that will contain
> +the _URI_ or _blank node id_ of the _subject_ of the triple being indexed. This property does
> +not have a default and must be specified for most uses of `jena-text`. This
> +`Field` is often given the name, `uri`, in examples. It is via this `Field`
> +that `?s` is bound in a typical use such as:
> +
> +    select ?s
> +    where {
> +        ?s text:query "some text"
> +    }
> +
> +Other `Field`s that may be configured: `text:uidField`, `text:graphField`,
> +and so on are discussed below.
> +
> +Given the triple:
> +
> +    ex:SomeOne skos:prefLabel "zorn protégé a prés"@fr ;
> +
> +The following illustrates a Lucene document that Jena will create and
> +request Lucene to index:
> +
> +    Document<
> +        stored, indexed, indexOptions=DOCS <uri:http://example.org/SomeOne>
> +        indexed, omitNorms, indexOptions=DOCS <graph:urn:x-arq:DefaultGraphNode>
> +        stored, indexed, tokenized <label:zorn protégé a prés>
> +        stored, indexed, omitNorms, indexOptions=DOCS <lang:fr>
> +        stored, indexed, tokenized <label_fr:zorn protégé a prés>
> +        stored, indexed, omitNorms, indexOptions=DOCS <uid:28959d0130121b51e1459a95bdac2e04f96efa2e6518ff3c090dfa7a1e6dcf00>
> +        stored, indexed, tokenized <graph:urn:x-arq:DefaultGraphNode>
> +        stored, indexed, omitNorms, indexOptions=DOCS <lang:fr>
> +        stored, indexed, tokenized <graph_fr:urn:x-arq:DefaultGraphNode>
> +        stored, indexed, omitNorms, indexOptions=DOCS <uid:b668c6cb80475194a2ce3087c15afbcb4baf6150da98e95b590ceedd10cf242f>
> +    >
> +
> +It may be instructive to refer back to this example when considering the various
> +points below.
> +
>  ## Query with SPARQL
>
>  The URI of the text extension property function is
> @@ -143,63 +221,248 @@
>
>      ... text:query ...
>
> +### Syntax
>
>  The following forms are all legal:
>
> -    ?s text:query 'word'                   # query
> -    ?s text:query (rdfs:label 'word')      # query specific property if multiple
> -    ?s text:query ('word' 10)              # with limit on results
> -    (?s ?score) text:query 'word'          # query capturing also the score
> -    (?s ?score ?literal) text:query 'word' # ... and original literal value
> +    ?s text:query 'word'                              # query
> +    ?s text:query ('word' 10)                         # with limit on results
> +    ?s text:query (rdfs:label 'word')                 # query specific property if multiple
> +    ?s text:query (rdfs:label 'protégé' 'lang:fr')    # restrict search to French
> +    (?s ?score) text:query 'word'                     # query capturing also the score
> +    (?s ?score ?literal) text:query 'word'            # ... and original literal value
>
>  The most general form is:
>
> -    (?s ?score ?literal) text:query (property 'query string' limit)
> +    (?s ?score ?literal) text:query (property 'query string' limit 'lang:xx')
>
> -Only the query string is required, and if it is the only argument the
> -surrounding `( )` can be omitted.
> +#### Input arguments:
>
> -Input arguments:
> -
>  | Argument | Definition |
>  |-------------------|--------------------------------|
>  | property | (optional) URI (including prefix name form) |
> -| query string | The native query string |
> +| query string | Lucene query string fragment |
>  | limit | (optional) `int` limit on the number of results |
> +| lang:xx | (optional) language tag spec |
>
> -Output arguments:
> +The `property` URI is only necessary if multiple properties have been
> +indexed and the property being searched over is not the [default field
> +of the index](#entity-map-definition). Also the `property` URI **must
> +not** be used when the `query string` refers explicitly to one or more
> +fields. The optional `limit` indicates the maximum hits to be returned by Lucene.
>
> +The `lang:xx` specification is an optional string, where _xx_ is
> +a BCP-47 language tag. This restricts searches to field values that were originally
> +indexed with the tag _xx_. Searches may be restricted to field values with no
> +language tag via `"lang:none"`. The use of the `lang:xx` is only effective if
> +[multilingual support](#linguistic-support-with-lucene-index) has been configured.
> +Further, if the `lang:xx` is used then the `property` URI must be supplied
> +in order for searches to work.
> +
> +If both `limit` and `lang:xx` are present, then `limit` must precede
> +`lang:xx`.
> +
> +If only the query string is required, the surrounding `( )` can be omitted.
> +
> +#### Output arguments:
> +
>  | Argument | Definition |
>  |-------------------|--------------------------------|
> -| indexed term | The indexed RDF term. |
> +| subject URI | The subject of the indexed RDF triple. |
>  | score | (optional) The score for the match. |
> -| hit | (optional) The literal matched. |
> +| literal | (optional) The matched object literal. |
>
> -The `property` URI is only necessary if multiple properties have been
> -indexed and the property being searched over is not the [default field
> -of the index](#entity-map-definition). Also the `property` URI **must
> -not** be used when the `query string` refers explicitly to one or more
> +The results include the _subject URI_; the _score_ assigned by the
> +text search engine; and the entire matched _literal_ (if the index has
> +been [configured to store literal values](#text-dataset-assembler)).
> +The _subject URI_ may be a variable, e.g., `?s`, or a _URI_. In the
> +latter case the search is restricted to triples with the specified
> +subject. The _score_ and the _literal_ **must** be variables.
> +
> +If only the _subject_ variable, `?s`, or specific _`URI`_ is needed
> +then it must be written without surrounding `( )`; otherwise, an error
> +is signalled.
> +
> +### Query strings
> +
> +There are several points that need to be considered when formulating
> +SPARQL queries using the Lucene interface.
> +
> +#### Simple queries
> +
> +The simplest use of the jena-text Lucene integration is:
> +
> +    ?s text:query "some phrase"
> +
> +This will bind `?s` to each entity URI that is the subject of a triple
> +that has the default property and an object literal that matches
> +the argument string, e.g.:
> +
> +    ex:AnEntity skos:prefLabel "this is some phrase to match"
> +
> +This query form will indicate the subjects that have literals that match
> +without providing any information about the specific literals that matched.
> +If this use case is sufficient for your needs you can skip on to the
> +[sections on configuration](#configuration).
> +
> +#### Queries with language tags
> +
> +When working with `rdf:langString`s it may be tempting to write:
> +
> +    ?s text:query "protégé"@fr
> +
> +However, the above will silently fail to return results since the
> +`query string` must be a simple `xsd:string` not an `rdf:langString`.
> +
> +The effective form of the above query is expressed:
> +
> +    ?s text:query (skos:prefLabel "protégé" 'lang:fr')
> +
> +if the intent is to search only labels with French content.
> +
> +Even if the default _property_ is `skos:prefLabel` it is necessary
> +to use the above form rather than omitting the `property` argument
> +when restricting the Lucene search to a specific `lang:xx`; otherwise,
> +again there will be no results.
> +
> +If all one is interested in are _subjects_ with `skos:prefLabel` where
> +that is the `text:predicate` of the `text:defaultField` and without regard
> +for specified `lang:xx`s then:
> +
> +    ?s text:query "protégé"
> +
> +will do the job.
> +
> +For a non-default `Field` with no language restriction, the patterns:
> +
> +    ?s text:query (rdfs:label "protégé")
> +
> +or
> +
> +    ?s text:query "rdfsLabel:protégé"
> +
> +may be used (see [below](#entity-map-definition) for how RDF _property_ names
> +are mapped to Lucene `Field` names). However, as mentioned earlier,
> +
> +    ?s text:query ("rdfsLabel:protégé" "lang:fr")
> +
> +will result in an error owing to the way in which the jena-text composes the
> +query string to Lucene in the presence of the `"lang:fr"` argument.
> +
> +#### Queries that retrieve literals
> +
> +It is possible to retrieve the *literal*s that Lucene finds matches for
> +assuming that
> +
> +    <#TextIndex#> text:storeValues true ;
> +
> +has been specified in the `TextIndex` configuration. So
> +
> +    (?s ?sc ?lit) text:query (rdfs:label "protégé")
> +
> +will bind the matching literals to `?lit`, e.g.,
> +
> +    "zorn protégé a prés"@fr
> +
> +However, it is important to note that the apparently equivalent form:
> +
> +    (?s ?sc ?lit) text:query "rdfsLabel:protégé"
> +
> +will fail to produce a binding for `?lit` even though `?s` and `?sc` are
> +bound as expected.
> +
> +So if the _literal_ matches are needed you **must use** the query arguments that
> +list the _property_ explicitly, except in the simple case of a query against
> +the default `Field`/_property_.
> +
> +#### Queries across multiple `Field`s
> +
> +It has been mentioned earlier that the text index uses the
> +[native Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description);
> +however, there are important constraints on how the Lucene query language is used within jena-text.
> +This is owing to the fact that jena-text composes the query string that is sent to Lucene so that
> +features such as `lang:xx` may be implemented. Other aspects of using the Lucene query language
> +reflect the fact that each triple is a separate document.
> +
> +This latter observation is important when considering queries that are intended to involve several
>  fields.
>
> -The results include the subject URI, `?s`; the `?score` assigned by the
> -text search engine; and the entire matched `?literal` (if the index has
> -been [configured to store literal values](#text-dataset-assembler)).
> +For example, consider the following triples:
>
> -If the `query string` refers to more than one field, e.g.,
> +    ex:SomePrinter
> +        rdfs:label "laser printer" ;
> +        ex:description "includes a large capacity cartridge" .
>
> -    "label: printer AND description: \"large capacity cartridge\""
> + assuming an appropriate configuration we might expect to retrieve `ex:SomePrinter`
> + with the following query:
>
> -then the `?literal` in the results will not be bound since there is no
> -single field that contains the match – the match is separated over
> -two fields.
> +    ?s text:query "label:printer AND description:\"large capacity cartridge\""
>
> -If an output indexed term is already a known value, either as a constant
> -in the query or variable already set, then the index lookup becomes a
> -check that this is a match for the input arguments.
> +However, this query will fail to find the expected results since the `AND` is interpreted
> +by Lucene to indicate that all documents that contain a matching `label` field _and_
> +a matching `description` field are to be returned. Yet from the discussion above
> +regarding the [structure of Lucene documents in jena-text](#document-structure) it
> +is evident that there is not one but rather in fact two separate documents one with a
> +`label` field and one with a `description` field so an effective query is:
>
> +    ?s text:query "label:printer" .
> +    ?s text:query "description:\"large capacity cartridge\"" .
> +
> +which leads to `?s` being bound to `ex:SomePrinter`.
> +
> +In other words when a query is to involve two or more _properties_/`Field`s then it
> +is expressed at the SPARQL level, as it were, versus in Lucene's query language.
> +
> +It is worth noting that:
> +
> +    ?s text:query "label:printer OR description:\"large capacity cartridge\""
> +
> +works simply because Lucene is required to do nothing more than a _union_ of
> +matching documents the same as if written:
> +
> +    { ?s text:query "label:printer" . }
> +    union
> +    { ?s text:query "description:\"large capacity cartridge\"" . }
> +
> +Suppose the matching literals are required for the above then it should be clear
> +from the above that:
> +
> +    (?s ?sc1 ?lit1) text:query (skos:prefLabel "printer") .
> +    (?s ?sc2 ?lit2) text:query (ex:description "large capacity cartridge") .
> +
> +will be the appropriate form to retrieve the _subject_ and the associated literals.
> +
> +There is no loss of expressiveness of the Lucene query language versus the jena-text
> +integration of Lucene. Any cross-field `AND`s are replaced by concurrent SPARQL calls to
> +text:query as illustrated above and uses of Lucene `OR` can be converted to SPARQL
> +`union`s. Uses of Lucene `NOT` are converted to appropriate SPARQL `filter`s.
> +
> +#### Queries within a `Field`
> +
> +On the other hand the various features of the [Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
> +are all available to be used for searches within a `Field`. For example:
> +
> +    ?s text:query "description:(large AND cartridge)"
> +
> +and
> +
> +    (?s ?sc ?lit) text:query (ex:description "(includes AND (large OR capacity))")
> +
> +will work as expected.
> +
> +The key is to always surround the field query with `( )`s.
> +
> +
>  ### Good practice
>
> -The query engine does not have information about the selectivity of the
> +From the above it should be clear that best practice, except in the simplest cases
> +is to use explicit `text:query` forms such as:
> +
> +    (?s ?sc ?lit) text:query (ex:someProperty "a single Field query")
> +
> +possibly with _limit_ and `lang:xx` arguments.
> +
> +Further, the query engine does not have information about the selectivity of the
>  text index and so effective query plans cannot be determined
>  programmatically. It is helpful to be aware of the following two
>  general query patterns.
> @@ -394,7 +657,7 @@
>  is returned on a match. The value of the property is arbitrary so long as it is unique among the
>  defined names.
>
> -#### Automatic document deletion
> +#### UID Field and automatic document deletion
>
>  When the `text:uidField` is defined in the `EntityMap` then dropping a triple will result in the
>  corresponding document, if any, being deleted from the text index. The value, `"uid"`, is arbitrary
> @@ -632,7 +895,7 @@
>
>  #### SPARQL Linguistic Clause Forms
>
> -Once the `langField` is set, you can use it directly inside SPARQL queries, for that the `'lang:xx'`
> +Once the `langField` is set, you can use it directly inside SPARQL queries, for that the `lang:xx`
>  argument allows you to target specific localized values. For example:
>
>      //target english literals
> @@ -714,7 +977,7 @@
>  Hence, the result set of the query will contain "institute" related
>  subjects (institution, institutional,...) in French and in English.
>
> -**Note**: If the `text:langField` property is not set, the `text:langField` will default to"lang".
> +**Note**: If the `text:langField` property is not set, the `text:langField` will default to "lang".
>
>  ### Generic and Defined Analyzer Support
>
>
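
For whoever picks up the review: it may be easier to check the "Queries across multiple `Field`s" section against a test dataset with the cross-field pattern written out as one complete query. The following is only a sketch, not something taken from the patch. It assumes an index configured with `text:storeValues true`, an entity map that covers both `rdfs:label` and `ex:description`, and `ex:` bound to `http://example.org/` as the `Document<...>` example suggests. It also uses `rdfs:label` so that it matches the example triples; the patch's own literal-retrieving version of this query says `skos:prefLabel`, which may be a slip worth confirming.

```sparql
PREFIX text: <http://jena.apache.org/text#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/>   # assumed namespace, inferred from the Document<...> example

SELECT ?s ?lit1 ?lit2
WHERE {
  # One text:query call per property. Each indexed triple is its own Lucene
  # document, so the conjunction is made in SPARQL through the shared ?s
  # rather than with a Lucene AND across fields.
  (?s ?sc1 ?lit1) text:query (rdfs:label "printer") .
  (?s ?sc2 ?lit2) text:query (ex:description "large capacity cartridge") .
}
```

If the section is right, this should bind `?s` to `ex:SomePrinter` along with both literals, while the single-string `label:printer AND description:...` form should return nothing against the same data.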