Re: CMS diff: Jena Full Text Search

ajs6f Mon, 20 Nov 2017 08:37:31 -0800

I went to review this diff and rediscovered (to my chagrin) that I really know 
very little about Jena's text indexing.


Osma (or anyone else who knows text indexing better than I do, which wouldn't 
take much)-- could you review this? It has some great, useful detail about how 
the indexing works and can be used.

ajs6f

> On Nov 20, 2017, at 1:51 AM, Chris Tomlinson <anonym...@apache.org> wrote:
> 
> Clone URL (Committers only):
> https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext
> 
> Chris Tomlinson
> 
> Index: trunk/content/documentation/query/text-query.mdtext
> ===================================================================
> --- trunk/content/documentation/query/text-query.mdtext       (revision 
> 1815762)
> +++ trunk/content/documentation/query/text-query.mdtext       (working copy)
> @@ -1,5 +1,7 @@
> Title: Jena Full Text Search
> 
> +Title: Jena Full Text Search
> +
> This extension to ARQ combines SPARQL and full text search via
> [Lucene](https://lucene.apache.org) 6.4.1 or
> [ElasticSearch](https://www.elastic.co) 5.2.1 (which is built on
> @@ -64,7 +66,20 @@
> ## Table of Contents
> 
> -   [Architecture](#architecture)
> +    -   [External content](#external-content)
> +    -   [External applications](#external-applications)
> +    -   [Document structure](#document-structure)
> -   [Query with SPARQL](#query-with-sparql)
> +    -   [Syntax](#syntax)
> +        -   [Input arguments](#input-arguments)
> +        -   [Output arguments](#output-arguments)
> +    -   [Query strings](#query-strings)
> +        -   [Simple queries](#simple-queries)
> +        -   [Queries with language tags](#queries-with-language-tags)
> +        -   [Queries that retrieve literals](#queries-that-retrieve-literals)
> +        -   [Queries across multiple 
> `Field`s](#queries-across-multiple-fields)
> +        -   [Queries within a `Field`](#queries-within-a-field)
> +    -   [Good practice](#good-practice)
> -   [Configuration](#configuration)
>     -   [Text Dataset Assembler](#text-dataset-assembler)
>     -   [Configuring an analyzer](#configuring-an-analyzer)
> @@ -134,6 +149,69 @@
> By using Elasticsearch, other applications can share the text index with
> SPARQL search.
> 
> +### Document structure
> +
> +As mentioned above, text indexing of a triple involves associating a Lucene
> +document with the triple. How is this done?
> +
> +Lucene documents are composed of `Field`s. Indexing and searching are 
> performed 
> +over the contents of these `Field`s. For an RDF triple to be indexed in 
> Lucene the 
> +_property_ of the triple must be 
> +[configured in the entity map of a TextIndex](#entity-map-definition).
> +This associates a Lucene analyzer with the _`property`_ which will be used
> +for indexing and search. The _`property`_ becomes the _searchable_ Lucene 
> +`Field` in the resulting document.
> +
> +A Lucene index includes a _default_ `Field`, which is specified in the 
> configuration, 
> +that is the field to search if not otherwise named in the query. In 
> jena-text 
> +this field is configured via the `text:defaultField` property which is then 
> mapped 
> +to a specific RDF property via `text:predicate` (see [entity 
> map](#entity-map-definition) 
> +below).
> +
> +There are several additional `Field`s that will be included in the
> +document that is passed to the Lucene `IndexWriter` depending on the
> +configuration options that are used. These additional fields are used to
> +manage the interface between Jena and Lucene and are not generally 
> +searchable per se.
> +
> +The most important of these additional `Field`s is the `text:entityField`.
> +This configuration property defines the name of the `Field` that will contain
> +the _URI_ or _blank node id_ of the _subject_ of the triple being indexed. 
> This property does
> +not have a default and must be specified for most uses of `jena-text`. This
> +`Field` is often given the name, `uri`, in examples. It is via this `Field`
> +that `?s` is bound in a typical use such as:
> +
> +    select ?s
> +    where {
> +        ?s text:query "some text"
> +    }
> +
> +Other `Field`s that may be configured, such as `text:uidField` and
> +`text:graphField`, are discussed below.
> +
> +Given the triple:
> +
> +    ex:SomeOne skos:prefLabel "zorn protégé a prés"@fr .
> +
> +The following illustrates a Lucene document that Jena will create and
> +request Lucene to index:
> +
> +    Document<
> +        stored, indexed, indexOptions=DOCS <uri:http://example.org/SomeOne> 
> +        indexed, omitNorms, indexOptions=DOCS 
> <graph:urn:x-arq:DefaultGraphNode> 
> +        stored, indexed, tokenized <label:zorn protégé a prés> 
> +        stored, indexed, omitNorms, indexOptions=DOCS <lang:fr> 
> +        stored, indexed, tokenized <label_fr:zorn protégé a prés> 
> +        stored, indexed, omitNorms, indexOptions=DOCS 
> <uid:28959d0130121b51e1459a95bdac2e04f96efa2e6518ff3c090dfa7a1e6dcf00> 
> +        stored, indexed, tokenized <graph:urn:x-arq:DefaultGraphNode> 
> +        stored, indexed, omitNorms, indexOptions=DOCS <lang:fr> 
> +        stored, indexed, tokenized <graph_fr:urn:x-arq:DefaultGraphNode> 
> +        stored, indexed, omitNorms, indexOptions=DOCS 
> <uid:b668c6cb80475194a2ce3087c15afbcb4baf6150da98e95b590ceedd10cf242f>
> +        >
> +
> +It may be instructive to refer back to this example when considering the 
> various
> +points below.
> +
> ## Query with SPARQL
> 
> The URI of the text extension property function is
> @@ -143,63 +221,248 @@
> 
>     ...   text:query ...
> 
> +### Syntax
> 
> The following forms are all legal:
> 
> -    ?s text:query 'word'                   # query
> -    ?s text:query (rdfs:label 'word')      # query specific property if 
> multiple
> -    ?s text:query ('word' 10)              # with limit on results
> -    (?s ?score) text:query 'word'          # query capturing also the score
> -    (?s ?score ?literal) text:query 'word' # ... and original literal value
> +    ?s text:query 'word'                              # query
> +    ?s text:query ('word' 10)                         # with limit on results
> +    ?s text:query (rdfs:label 'word')                 # query specific 
> property if multiple
> +    ?s text:query (rdfs:label 'protégé' 'lang:fr')    # restrict search to 
> French
> +    (?s ?score) text:query 'word'                     # query capturing also 
> the score
> +    (?s ?score ?literal) text:query 'word'            # ... and original 
> literal value
> 
> The most general form is:
> 
> -     (?s ?score ?literal) text:query (property 'query string' limit)
> +     (?s ?score ?literal) text:query (property 'query string' limit 
> 'lang:xx')
> 
> -Only the query string is required, and if it is the only argument the
> -surrounding `( )` can be omitted.
> +#### Input arguments:
> 
> -Input arguments:
> -
> | &nbsp;Argument&nbsp;  | &nbsp; Definition&nbsp;    |
> |-------------------|--------------------------------|
> | property          | (optional) URI (including prefix name form) |
> -| query string      | The native query string        |
> +| query string      | Lucene query string fragment       |
> | limit             | (optional) `int` limit on the number of results       |
> +| lang:xx           | (optional) language tag spec       |
> 
> -Output arguments:
> +The `property` URI is only necessary if multiple properties have been
> +indexed and the property being searched over is not the [default field
> +of the index](#entity-map-definition).  Also the `property` URI **must
> +not** be used when the `query string` refers explicitly to one or more
> +fields. The optional `limit` indicates the maximum hits to be returned by 
> Lucene.
> 
> +The `lang:xx` specification is an optional string, where _xx_ is 
> +a BCP-47 language tag. This restricts searches to field values that were 
> originally 
> +indexed with the tag _xx_. Searches may be restricted to field values with 
> no 
> +language tag via `"lang:none"`. The use of the `lang:xx` is only effective 
> if 
> +[multilingual support](#linguistic-support-with-lucene-index) has been 
> configured.
> +Further, if the `lang:xx` is used then the `property` URI must be supplied
> +in order for searches to work.
> +
> +If both `limit` and `lang:xx` are present, then `limit` must precede
> +`lang:xx`.
> +
> +If the query string is the only argument, the surrounding `( )` can be omitted.
> +
> +#### Output arguments:
> +
> | &nbsp;Argument&nbsp;  | &nbsp; Definition&nbsp;    |
> |-------------------|--------------------------------|
> -| indexed term      | The indexed RDF term.          |
> +| subject URI       | The subject of the indexed RDF triple.          |
> | score             | (optional) The score for the match. |
> -| hit               | (optional) The literal matched. |
> +| literal           | (optional) The matched object literal. |
> 
> -The `property` URI is only necessary if multiple properties have been
> -indexed and the property being searched over is not the [default field
> -of the index](#entity-map-definition).  Also the `property` URI **must
> -not** be used when the `query string` refers explicitly to one or more
> +The results include the _subject URI_; the _score_ assigned by the
> +text search engine; and the entire matched _literal_ (if the index has
> +been [configured to store literal values](#text-dataset-assembler)).
> +The _subject URI_ may be a variable, e.g., `?s`, or a _URI_. In the
> +latter case the search is restricted to triples with the specified
> +subject. The _score_ and the _literal_ **must** be variables.
> +
> +If only the _subject_ variable, `?s`, or specific _`URI`_ is needed 
> +then it must be written without surrounding `( )`; otherwise, an error 
> +is signalled.
> +
> +### Query strings
> +
> +There are several points that need to be considered when formulating
> +SPARQL queries using the Lucene interface.
> +
> +#### Simple queries
> +
> +The simplest use of the jena-text Lucene integration is:
> +
> +    ?s text:query "some phrase"
> +
> +This will bind `?s` to each entity URI that is the subject of a triple
> +that has the default property and an object literal that matches
> +the argument string, e.g.:
> +
> +    ex:AnEntity skos:prefLabel "this is some phrase to match"
> +
> +This query form will indicate the subjects that have literals that match
> +without providing any information about the specific literals that matched.
> +If this use case is sufficient for your needs you can skip on to the 
> +[sections on configuration](#configuration).
> +
> +#### Queries with language tags
> +
> +When working with `rdf:langString`s it may be tempting to write:
> +
> +    ?s text:query "protégé"@fr
> +
> +However, the above will silently fail to return results since the
> +`query string` must be a simple `xsd:string`, not an `rdf:langString`.
> +
> +The effective form of the above query is expressed:
> +
> +    ?s text:query (skos:prefLabel "protégé" 'lang:fr')
> +
> +if the intent is to search only labels with French content.
> +
> +Even if the default _property_ is `skos:prefLabel` it is necessary
> +to use the above form rather than omitting the `property` argument
> +when restricting the Lucene search to a specific `lang:xx`; otherwise,
> +again there will be no results.
> +
> +If all one is interested in are _subjects_ with `skos:prefLabel` where
> +that is the `text:predicate` of the `text:defaultField` and without regard 
> +for specified `lang:xx`s then:
> +
> +    ?s text:query "protégé"
> +
> +will do the job.
> +
> +For a non-default `Field` with no language restriction, the patterns:
> +
> +    ?s text:query (rdfs:label "protégé")
> +
> +or
> +
> +    ?s text:query "rdfsLabel:protégé"
> +
> +may be used (see [below](#entity-map-definition) for how RDF _property_ 
> names 
> +are mapped to Lucene `Field` names). However, as mentioned earlier,
> +
> +    ?s text:query ("rdfsLabel:protégé" "lang:fr")
> +
> +will result in an error owing to the way in which jena-text composes the
> +query string sent to Lucene in the presence of the `"lang:fr"` argument.
> +
> +#### Queries that retrieve literals
> +
> +It is possible to retrieve the *literal*s that Lucene finds matches for
> +assuming that
> +
> +    <#TextIndex#> text:storeValues true ;
> +
> +has been specified in the `TextIndex` configuration. So
> +
> +    (?s ?sc ?lit) text:query (rdfs:label "protégé")
> +
> +will bind the matching literals to `?lit`, e.g.,
> +
> +    "zorn protégé a prés"@fr
> +
> +However, it is important to note that the apparently equivalent form:
> +
> +    (?s ?sc ?lit) text:query "rdfsLabel:protégé"
> +
> +will fail to produce a binding for `?lit` even though `?s` and `?sc` are
> +bound as expected.
> +
> +So if the _literal_ matches are needed you **must use** the query arguments 
> that
> +list the _property_ explicitly, except in the simple case of a query against
> +the default `Field`/_property_.
> +
> +#### Queries across multiple `Field`s
> +
> +It has been mentioned earlier that the text index uses the
> +[native Lucene query 
> language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description);
> +however, there are important constraints on how the Lucene query language is 
> used within jena-text.
> +This is owing to the fact that jena-text composes the query string that is 
> sent to Lucene so that
> +features such as `lang:xx` may be implemented. Other aspects of using the 
> Lucene query language
> +reflect the fact that each triple is a separate document.
> +
> +This latter observation is important when considering queries that are 
> intended to involve several
> fields.
> 
> -The results include the subject URI, `?s`; the `?score` assigned by the
> -text search engine; and the entire matched `?literal` (if the index has
> -been [configured to store literal values](#text-dataset-assembler)).
> +For example, consider the following triples:
> 
> -If the `query string` refers to more than one field, e.g.,
> +    ex:SomePrinter 
> +        rdfs:label     "laser printer" ;
> +        ex:description "includes a large capacity cartridge" .
> 
> -    "label: printer AND description: \"large capacity cartridge\""
> +Assuming an appropriate configuration, we might expect to retrieve 
> +`ex:SomePrinter` with the following query:
> 
> -then the `?literal` in the results will not be bound since there is no
> -single field that contains the match &ndash; the match is separated over
> -two fields.
> +    ?s text:query "label:printer AND description:\"large capacity 
> cartridge\""
> 
> -If an output indexed term is already a known value, either as a constant
> -in the query or variable already set, then the index lookup becomes a
> -check that this is a match for the input arguments.
> +However, this query will fail to find the expected results, since Lucene 
> +interprets the `AND` as requiring a single document that contains both a 
> +matching `label` field _and_ a matching `description` field. Yet from the 
> +discussion above regarding the 
> +[structure of Lucene documents in jena-text](#document-structure), it is 
> +evident that there are in fact two separate documents, one with a `label` 
> +field and one with a `description` field. An effective query is therefore:
> 
> +    ?s text:query "label:printer" .
> +    ?s text:query "description:\"large capacity cartridge\"" .
> +
> +which leads to `?s` being bound to `ex:SomePrinter`.
> +
> +In other words, when a query involves two or more _properties_/`Field`s, it is
> +expressed at the SPARQL level, as it were, rather than in Lucene's query language.
> +
> +It is worth noting that:
> +
> +    ?s text:query "label:printer OR description:\"large capacity cartridge\""
> +
> +works simply because Lucene is required to do nothing more than a _union_ of
> +matching documents the same as if written:
> +
> +    { ?s text:query "label:printer" . }
> +    union
> +    { ?s text:query "description:\"large capacity cartridge\"" . }
> +
> +If the matching literals are also required, then it should be clear
> +from the above that:
> +
> +    (?s ?sc1 ?lit1) text:query (rdfs:label "printer") .
> +    (?s ?sc2 ?lit2) text:query (ex:description "large capacity cartridge") .
> +
> +will be the appropriate form to retrieve the _subject_ and the associated 
> literals.
> +
> +There is no loss of expressiveness of the Lucene query language versus the 
> jena-text
> +integration of Lucene. Any cross-field `AND`s are replaced by concurrent 
> SPARQL calls to
> +text:query as illustrated above and uses of Lucene `OR` can be converted to 
> SPARQL 
> +`union`s. Uses of Lucene `NOT` are converted to appropriate SPARQL `filter`s.
> +
> +#### Queries within a `Field`
> +
> +On the other hand the various features of the [Lucene query 
> language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
> +are all available to be used for searches within a `Field`. For example:
> +
> +    ?s text:query "description:(large AND cartridge)"
> +
> +and
> +
> +    (?s ?sc ?lit) text:query (ex:description "(includes AND (large OR 
> capacity))")
> +
> +will work as expected.
> +
> +The key is to always surround the field query with `( )`s.
> +
> +
> ### Good practice
> 
> -The query engine does not have information about the selectivity of the
> +From the above it should be clear that best practice, except in the simplest 
> +cases, is to use explicit `text:query` forms such as:
> +
> +    (?s ?sc ?lit) text:query (ex:someProperty "a single Field query")
> +
> +possibly with _limit_ and `lang:xx` arguments.
> +
> +Further, the query engine does not have information about the selectivity of 
> the
> text index and so effective query plans cannot be determined
> programmatically.  It is helpful to be aware of the following two
> general query patterns.
> @@ -394,7 +657,7 @@
> is returned on a match. The value of the property is arbitrary so long as it 
> is unique among the
> defined names.
> 
> -#### Automatic document deletion
> +#### UID Field and automatic document deletion
> 
> When the `text:uidField` is defined in the `EntityMap` then dropping a triple 
> will result in the 
> corresponding document, if any, being deleted from the text index. The value, 
> `"uid"`, is arbitrary 
> @@ -632,7 +895,7 @@
> 
> #### SPARQL Linguistic Clause Forms
> 
> -Once the `langField` is set, you can use it directly inside SPARQL queries, 
> for that the `'lang:xx'`
> +Once the `langField` is set, you can use it directly inside SPARQL queries; 
> for that, the `lang:xx`
> argument allows you to target specific localized values. For example:
> 
>     //target english literals
> @@ -714,7 +977,7 @@
> Hence, the result set of the query will contain "institute" related
> subjects (institution, institutional,...) in French and in English.
> 
> -**Note**: If the `text:langField` property is not set, the `text:langField` 
> will default to"lang".
> +**Note**: If the `text:langField` property is not set, the `text:langField` 
> will default to "lang".
> 
> ### Generic and Defined Analyzer Support
> 
>
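
For anyone else reviewing, the point in "Queries across multiple `Field`s" can be sketched roughly in Python. This is only a toy model and not Jena or Lucene code: the field names (`uri`, `label`, `description`), the helper functions, and the substring matching are all assumptions made for illustration. It models each indexed triple as its own document and compares a cross-field `AND` inside one query string with two `text:query` patterns joined on `?s`.

```python
# Toy model: each indexed triple becomes its own document.
# Field names and matching rules are illustrative assumptions, not Jena's API.
docs = [
    {"uri": "ex:SomePrinter", "label": "laser printer"},
    {"uri": "ex:SomePrinter", "description": "includes a large capacity cartridge"},
]

def matches(doc, field, phrase):
    """Naive substring match standing in for a Lucene field query."""
    return phrase in doc.get(field, "")

def lucene_and(docs, clauses):
    """Cross-field AND in one query string: a single document must satisfy every clause."""
    return {d["uri"] for d in docs if all(matches(d, f, p) for f, p in clauses)}

def sparql_join(docs, clauses):
    """One text:query per clause, joined on ?s: intersect the per-clause subject sets."""
    subject_sets = [{d["uri"] for d in docs if matches(d, f, p)} for f, p in clauses]
    return set.intersection(*subject_sets)

clauses = [("label", "printer"), ("description", "large capacity cartridge")]
print(lucene_and(docs, clauses))   # set()
print(sparql_join(docs, clauses))  # {'ex:SomePrinter'}
```

No single document carries both fields, so the cross-field `AND` finds nothing, while intersecting the subjects of the two separate matches recovers `ex:SomePrinter`.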
