Re: CMS diff: Jena Full Text Search

Chris Tomlinson Fri, 01 Dec 2017 08:26:48 -0800

Hi Andy,

The current commit is cumulative. The commit just prior to this one addressed 
all of Osma’s comments - which included my raising JENA-1437 
<https://issues.apache.org/jira/browse/JENA-1437> and JENA-1438 
<https://issues.apache.org/jira/browse/JENA-1438>. This commit contains my 
changes to reflect those two issues as being fixed along with all my prior 
updates.


If there is a better procedure for making a series of updates to docs under CMS 
I’m happy to learn.

Thank you for your help with this,
Chris


> On Dec 1, 2017, at 6:10 AM, Andy Seaborne <a...@apache.org> wrote:
> 
> Chris - does this contain the previous one?
> 
> If so, are Osma's comments resolved?
> 
> If it's all good to go, I'll apply it because chnages can continue and are 
> not frozen on a release.  I haven't had the time to check through the 
> proposed changes and I'm not deeply familiar with the text indexing so I'm 
> relying on others to verify them.
> 
>    Andy
> 
> On 30/11/17 17:28, Chris Tomlinson wrote:
>> This commit updates the jena-text documentation to be consistent with the 
>> resolved JENA-1437 and JENA-1438 issues.
>> Regards,
>> Chris
>>> On Nov 30, 2017, at 5:00 PM, Chris Tomlinson <anonym...@apache.org> wrote:
>>> 
>>> Clone URL (Committers only):
>>> https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext
>>> 
>>> Chris Tomlinson
>>> 
>>> Index: trunk/content/documentation/query/text-query.mdtext
>>> ===================================================================
>>> --- trunk/content/documentation/query/text-query.mdtext     (revision 
>>> 1816662)
>>> +++ trunk/content/documentation/query/text-query.mdtext     (working copy)
>>> @@ -1,5 +1,7 @@
>>> Title: Jena Full Text Search
>>> 
>>> +Title: Jena Full Text Search
>>> +
>>> This extension to ARQ combines SPARQL and full text search via
>>> [Lucene](https://lucene.apache.org) 6.4.1 or
>>> [ElasticSearch](https://www.elastic.co) 5.2.1 (which is built on
>>> @@ -64,7 +66,21 @@
>>> ## Table of Contents
>>> 
>>> -   [Architecture](#architecture)
>>> +    -   [External content](#external-content)
>>> +    -   [External applications](#external-applications)
>>> +    -   [Document structure](#document-structure)
>>> -   [Query with SPARQL](#query-with-sparql)
>>> +    -   [Syntax](#syntax)
>>> +        -   [Input arguments](#input-arguments)
>>> +        -   [Output arguments](#output-arguments)
>>> +    -   [Query strings](#query-strings)
>>> +        -   [Simple queries](#simple-queries)
>>> +        -   [Queries with language tags](#queries-with-language-tags)
>>> +        -   [Queries that retrieve 
>>> literals](#queries-that-retrieve-literals)
>>> +        -   [Queries with graphs](#queries-with-graphs)
>>> +        -   [Queries across multiple 
>>> `Fields`](#queries-across-multiple-fields)
>>> +        -   [Queries with _Boolean Operators_ and _Term 
>>> Modifiers_](#queries-with-boolean-operators-and-term-modifiers)
>>> +    -   [Good practice](#good-practice)
>>> -   [Configuration](#configuration)
>>>     -   [Text Dataset Assembler](#text-dataset-assembler)
>>>     -   [Configuring an analyzer](#configuring-an-analyzer)
>>> @@ -108,6 +124,7 @@
>>> 
>>> The text index uses the native query language of the index:
>>> [Lucene query 
>>> language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
>>> +(with [restrictions](#input-arguments))
>>> or
>>> [Elasticsearch query 
>>> language](https://www.elastic.co/guide/en/elasticsearch/reference/5.2/query-dsl.html).
>>> 
>>> @@ -134,6 +151,64 @@
>>> By using Elasticsearch, other applications can share the text index with
>>> SPARQL search.
>>> 
>>> +### Document structure
>>> +
>>> +As mentioned above, text indexing of a triple involves associating a Lucene
>>> +document with the triple. How is this done?
>>> +
>>> +Lucene documents are composed of `Field`s. Indexing and searching are 
>>> performed
>>> +over the contents of these `Field`s. For an RDF triple to be indexed in 
>>> Lucene the
>>> +_property_ of the triple must be
>>> +[configured in the entity map of a TextIndex](#entity-map-definition).
>>> +This associates a Lucene analyzer with the _`property`_ which will be used
>>> +for indexing and search. The _`property`_ becomes the _searchable_ Lucene
>>> +`Field` in the resulting document.
>>> +
>>> +A Lucene index includes a _default_ `Field`, which is specified in the 
>>> configuration,
>>> +that is the field to search if not otherwise named in the query. In 
>>> jena-text
>>> +this field is configured via the `text:defaultField` property which is 
>>> then mapped
>>> +to a specific RDF property via `text:predicate` (see [entity 
>>> map](#entity-map-definition)
>>> +below).
>>> +
>>> +There are several additional `Field`s that will be included in the
>>> +document that is passed to the Lucene `IndexWriter` depending on the
>>> +configuration options that are used. These additional fields are used to
>>> +manage the interface between Jena and Lucene and are not generally
>>> +searchable per se.
>>> +
>>> +The most important of these additional `Field`s is the `text:entityField`.
>>> +This configuration property defines the name of the `Field` that will 
>>> contain
>>> +the _URI_ or _blank node id_ of the _subject_ of the triple being indexed. 
>>> This property does
>>> +not have a default and must be specified for most uses of `jena-text`. This
>>> +`Field` is often given the name, `uri`, in examples. It is via this `Field`
>>> +that `?s` is bound in a typical use such as:
>>> +
>>> +    select ?s
>>> +    where {
>>> +        ?s text:query "some text"
>>> +    }
>>> +
>>> +Other `Field`s that may be configured: `text:uidField`, `text:graphField`,
>>> +and so on are discussed below.
>>> +
>>> +Given the triple:
>>> +
>>> +    ex:SomeOne skos:prefLabel "zorn protégé a prés"@fr ;
>>> +
>>> +The following is an abbreviated illustration a Lucene document that Jena 
>>> will create and
>>> +request Lucene to index:
>>> +
>>> +    Document<
>>> +        <uri:http://example.org/SomeOne>
>>> +        <graph:urn:x-arq:DefaultGraphNode>
>>> +        <label:zorn protégé a prés>
>>> +        <lang:fr>
>>> +        
>>> <uid:28959d0130121b51e1459a95bdac2e04f96efa2e6518ff3c090dfa7a1e6dcf00>
>>> +        >
>>> +
>>> +It may be instructive to refer back to this example when considering the 
>>> various
>>> +points below.
>>> +
>>> ## Query with SPARQL
>>> 
>>> The URI of the text extension property function is
>>> @@ -143,63 +218,292 @@
>>> 
>>>     ...   text:query ...
>>> 
>>> +### Syntax
>>> 
>>> The following forms are all legal:
>>> 
>>> -    ?s text:query 'word'                   # query
>>> -    ?s text:query (rdfs:label 'word')      # query specific property if 
>>> multiple
>>> -    ?s text:query ('word' 10)              # with limit on results
>>> -    (?s ?score) text:query 'word'          # query capturing also the score
>>> -    (?s ?score ?literal) text:query 'word' # ... and original literal value
>>> +    ?s text:query 'word'                              # query
>>> +    ?s text:query ('word' 10)                         # with limit on 
>>> results
>>> +    ?s text:query (rdfs:label 'word')                 # query specific 
>>> property if multiple
>>> +    ?s text:query (rdfs:label 'protégé' 'lang:fr')    # restrict search to 
>>> French
>>> +    (?s ?score) text:query 'word'                     # query capturing 
>>> also the score
>>> +    (?s ?score ?literal) text:query 'word'            # ... and original 
>>> literal value
>>> 
>>> The most general form is:
>>> 
>>> -     (?s ?score ?literal) text:query (property 'query string' limit)
>>> +     (?s ?score ?literal) text:query (property 'query string' limit 
>>> 'lang:xx')
>>> 
>>> -Only the query string is required, and if it is the only argument the
>>> -surrounding `( )` can be omitted.
>>> +#### Input arguments:
>>> 
>>> -Input arguments:
>>> -
>>> | &nbsp;Argument&nbsp;  | &nbsp; Definition&nbsp;    |
>>> |-------------------|--------------------------------|
>>> | property          | (optional) URI (including prefix name form) |
>>> -| query string      | The native query string        |
>>> +| query string      | Lucene query string fragment       |
>>> | limit             | (optional) `int` limit on the number of results       
>>> |
>>> +| lang:xx           | (optional) language tag spec       |
>>> 
>>> -Output arguments:
>>> +The `property` URI is only necessary if multiple properties have been
>>> +indexed and the property being searched over is not the [default field
>>> +of the index](#entity-map-definition).
>>> 
>>> +The `query string` syntax conforms the underlying index 
>>> [Lucene](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
>>> +or
>>> +[Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/5.2/query-dsl.html).
>>>  In the case of Lucene the syntax is restricted to `Terms`, `Term 
>>> modifiers`, `Boolean Operators` applied to `Terms`, and `Grouping` of 
>>> terms. _No use of `Fields` within the `query string` is supported._
>>> +
>>> +The optional `limit` indicates the maximum hits to be returned by Lucene.
>>> +
>>> +The `lang:xx` specification is an optional string, where _xx_ is
>>> +a BCP-47 language tag. This restricts searches to field values that were 
>>> originally
>>> +indexed with the tag _xx_. Searches may be restricted to field values with 
>>> no
>>> +language tag via `"lang:none"`.
>>> +
>>> +If both `limit` and `lang:xx` are present, then `limit` must precede 
>>> `lang:xx`.
>>> +
>>> +If only the query string is required, the surrounding `( )` _may be_ 
>>> omitted.
>>> +
>>> +#### Output arguments:
>>> +
>>> | &nbsp;Argument&nbsp;  | &nbsp; Definition&nbsp;    |
>>> |-------------------|--------------------------------|
>>> -| indexed term      | The indexed RDF term.          |
>>> +| subject URI       | The subject of the indexed RDF triple.          |
>>> | score             | (optional) The score for the match. |
>>> -| hit               | (optional) The literal matched. |
>>> +| literal           | (optional) The matched object literal. |
>>> 
>>> -The `property` URI is only necessary if multiple properties have been
>>> -indexed and the property being searched over is not the [default field
>>> -of the index](#entity-map-definition).  Also the `property` URI **must
>>> -not** be used when the `query string` refers explicitly to one or more
>>> -fields.
>>> -
>>> -The results include the subject URI, `?s`; the `?score` assigned by the
>>> -text search engine; and the entire matched `?literal` (if the index has
>>> +The results include the _subject URI_; the _score_ assigned by the
>>> +text search engine; and the entire matched _literal_ (if the index has
>>> been [configured to store literal values](#text-dataset-assembler)).
>>> +The _subject URI_ may be a variable, e.g., `?s`, or a _URI_. In the
>>> +latter case the search is restricted to triples with the specified
>>> +subject. The _score_ and the _literal_ **must** be variables.
>>> 
>>> -If the `query string` refers to more than one field, e.g.,
>>> +If only the _subject_ variable, `?s` is needed then it **must be** written 
>>> without
>>> +surrounding `( )`; otherwise, an error is signalled.
>>> 
>>> -    "label: printer AND description: \"large capacity cartridge\""
>>> +### Query strings
>>> 
>>> -then the `?literal` in the results will not be bound since there is no
>>> -single field that contains the match &ndash; the match is separated over
>>> -two fields.
>>> +There are several points that need to be considered when formulating
>>> +SPARQL queries using the Lucene interface. As mentioned above, in the case 
>>> of Lucene the `query string` syntax is restricted to `Terms`, `Term 
>>> modifiers`, `Boolean Operators` applied to `Terms`, and `Grouping` of terms.
>>> 
>>> -If an output indexed term is already a known value, either as a constant
>>> -in the query or variable already set, then the index lookup becomes a
>>> -check that this is a match for the input arguments.
>>> +**No _explicit_ use of `Fields` within the `query string` is supported.**
>>> 
>>> +#### Simple queries
>>> +
>>> +The simplest use of the jena-text Lucene integration is:
>>> +
>>> +    ?s text:query "some phrase"
>>> +
>>> +This will bind `?s` to each entity URI that is the subject of a triple
>>> +that has the default property and an object literal that matches
>>> +the argument string, e.g.:
>>> +
>>> +    ex:AnEntity skos:prefLabel "this is some phrase to match"
>>> +
>>> +This query form will indicate the _subjects_ that have literals that match
>>> +for the _default property_ which is determined via the configuration of
>>> +the `text:predicate` of the [`text:defaultField`](#default-text-field)
>>> +(in the above this has been assumed to be `skos:prefLabel`.
>>> +
>>> +For a _non-default property_ it is necessary to specify the property as
>>> +an input argument to the `text:query`:
>>> +
>>> +    ?s text:query (rdfs:label "protégé")
>>> +
>>> +(see [below](#entity-map-definition) for how RDF _property_ names
>>> +are mapped to Lucene `Field` names).
>>> +
>>> +If this use case is sufficient for your needs you can skip on to the
>>> +[sections on configuration](#configuration).
>>> +
>>> +#### Queries with language tags
>>> +
>>> +When working with `rdf:langString`s it is necessary that the
>>> +[`text:langField`](#language-field) has been configured. Then it is
>>> +as simple as writing queries such as:
>>> +
>>> +    ?s text:query "protégé"@fr
>>> +
>>> +to return results where the given term or phrase has been
>>> +indexed under French in the [`text:defaultField`](#default-text-field).
>>> +
>>> +It is also possible to use the optional `lang:xx` argument, for example:
>>> +
>>> +    ?s text:query ("protégé" 'lang:fr') .
>>> +
>>> +In general, the presence of a language tag, `xx`, on the `query string` or
>>> +`lang:xx` in the `text:query` adds `AND lang:xx` to the query sent to 
>>> Lucene,
>>> +so the above example becomes the following Lucene query:
>>> +
>>> +    "label:protégé AND lang:fr"
>>> +
>>> +For _non-default properties_ the general form is used:
>>> +
>>> +    ?s text:query (skos:altLabel "protégé" 'lang:fr')
>>> +
>>> +Note that an explicit language tag on the `query string` takes precedence
>>> +over the `lang:xx`, so the following
>>> +
>>> +    ?s text:query ("protégé"@fr 'lang:none')
>>> +
>>> +will find French matches rather than matches indexed without a language 
>>> tag.
>>> +
>>> +#### Queries that retrieve literals
>>> +
>>> +It is possible to retrieve the *literal*s that Lucene finds matches for
>>> +assuming that
>>> +
>>> +    <#TextIndex#> text:storeValues true ;
>>> +
>>> +has been specified in the `TextIndex` configuration. So
>>> +
>>> +    (?s ?sc ?lit) text:query (rdfs:label "protégé")
>>> +
>>> +will bind the matching literals to `?lit`, e.g.,
>>> +
>>> +    "zorn protégé a prés"@fr
>>> +
>>> +Note it is necessary to include a variable to capture the Lucene _score_
>>> +even if this value is not otherwise needed since the _literal_ variable
>>> +is determined by position.
>>> +
>>> +#### Queries with graphs
>>> +
>>> +Assuming that the [`text:graphField`](#graph-field) has been configured,
>>> +then, when a triple is indexed, the graph that the triple resides in is
>>> +included in the document and may be used to restrict searches or to 
>>> retrieve the graph that a matching triple resides in.
>>> +
>>> +For example:
>>> +
>>> +    select ?s ?lit
>>> +    where {
>>> +      graph ex:G2 { (?s ?sc ?lit) text:query "zorn" } .
>>> +    }
>>> +
>>> +will restrict searches to triples with the _default property_ that reside
>>> +in graph, `ex:G2`.
>>> +
>>> +On the other hand:
>>> +
>>> +    select ?g ?s ?lit
>>> +    where {
>>> +      graph ?g { (?s ?sc ?lit) text:query "zorn" } .
>>> +    }
>>> +
>>> +will iterate over the graphs in the dataset, searching each in turn for
>>> +matches.
>>> +
>>> +Note that there is a known issue when a `lang:xx` argument is included in
>>> +the above pattern, so that the restriction to given language is not obeyed.
>>> +This will be corrected in a future release. However, use of a language tag
>>> +on the `query string` is not subject to this issue.
>>> +
>>> +If there is suitable structure to the graphs, e.g., a known `rdf:type` and
>>> +depending on the selectivity of the text query and number of graphs,
>>> +it may be more performant to express the query as follows:
>>> +
>>> +    select ?g ?s ?lit
>>> +    where {
>>> +      (?s ?sc ?lit) text:query "zorn" .
>>> +      graph ?g { ?s a ex:Item } .
>>> +    }
>>> +
>>> +Note that this form does not have any issue with `lang:xx` as described
>>> +above, since the graph is extracted after the text search.
>>> +
>>> +#### Queries across multiple `Field`s
>>> +
>>> +As mentioned earlier, the text index uses the
>>> +[native Lucene query 
>>> language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description);
>>> +however, there are important constraints on how the Lucene query language 
>>> is used within jena-text. In particular, _explicit_ references to Lucene 
>>> `Fields` with the `query string` **are not** supported. So how are Lucene 
>>> queries that would otherwise refer to multiple `Fields` expressed?
>>> +
>>> +The key is understanding that each triple is a separate document and so 
>>> queries across Lucene `Fields` need to be expressed as SPARQL queries 
>>> referring to the corresponding RDF _properties_. Note that there are 
>>> typically three `Fields` in a document that are used
>>> +during searching:
>>> +
>>> +1. the field corresponding to the property of the indexed triple,
>>> +2. the field for the language of the literal (if configured), and
>>> +3. the graph that the triple is in (if configured).
>>> +
>>> +Given these it should be clear from the above that the
>>> +Jena Text integration constructs a Lucene query from the _property_, 
>>> _query string_, `lang:xx`, and SPARQL graph arguments.
>>> +
>>> +For example, consider the following triples:
>>> +
>>> +    ex:SomePrinter
>>> +        rdfs:label     "laser printer" ;
>>> +        ex:description "includes a large capacity cartridge" .
>>> +
>>> + assuming an appropriate configuration, if we try to retrieve 
>>> `ex:SomePrinter`
>>> + with the following Lucene `query string`:
>>> +
>>> +    ?s text:query "label:printer AND description:\"large capacity 
>>> cartridge\""
>>> +
>>> +then this query can not find the expected results since the `AND` is 
>>> interpreted
>>> +by Lucene to indicate that all documents that contain a matching `label` 
>>> field _and_
>>> +a matching `description` field are to be returned; yet, from the 
>>> discussion above
>>> +regarding the [structure of Lucene documents in 
>>> jena-text](#document-structure) it
>>> +is evident that there is not one but rather in fact two separate documents 
>>> one with a
>>> +`label` field and one with a `description` field so an effective SPARQL 
>>> query is:
>>> +
>>> +    ?s text:query (rdfs:label "printer") .
>>> +    ?s text:query (ex:description "large capacity cartridge") .
>>> +
>>> +which leads to `?s` being bound to `ex:SomePrinter`.
>>> +
>>> +In other words when a query is to involve two or more _properties_ then it
>>> +expressed at the SPARQL level, as it were, versus in Lucene's query 
>>> language.
>>> +
>>> +It is worth noting that the equivalent of a Lucene `OR` of `Fields` is 
>>> expressed
>>> +simply via SPARQL `union`:
>>> +
>>> +    { ?s text:query (rdfs:label "printer") . }
>>> +    union
>>> +    { ?s text:query (ex:description "large capacity cartridge") . }
>>> +
>>> +Suppose the matching literals are required for the above then it should be 
>>> clear
>>> +from the above that:
>>> +
>>> +    (?s ?sc1 ?lit1) text:query (skos:prefLabel "printer") .
>>> +    (?s ?sc2 ?lit2) text:query (ex:description "large capacity cartridge") 
>>> .
>>> +
>>> +will be the appropriate form to retrieve the _subject_ and the associated 
>>> literals, `?lit1` and `?lit2`. (Obviously, in general, the _score_ 
>>> variables, `?sc1` and `?sc2`
>>> +must be distinct since it is very unlikely that the scores of the two 
>>> Lucene queries
>>> +will ever match).
>>> +
>>> +There is no loss of expressiveness of the Lucene query language versus the 
>>> jena-text
>>> +integration of Lucene. Any cross-field `AND`s are replaced by concurrent 
>>> SPARQL calls to
>>> +text:query as illustrated above and uses of Lucene `OR` can be converted 
>>> to SPARQL
>>> +`union`s. Uses of Lucene `NOT` are converted to appropriate SPARQL 
>>> `filter`s.
>>> +
>>> +#### Queries with _Boolean Operators_ and _Term Modifiers_
>>> +
>>> +On the other hand the various features of the [Lucene query 
>>> language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
>>> +are all available to be used for searches within a `Field`. For example, 
>>> _Boolean Operators_ on _Terms_:
>>> +
>>> +    ?s text:query (ex:description "(large AND cartridge)")
>>> +
>>> +and
>>> +
>>> +    (?s ?sc ?lit) text:query (ex:description "(includes AND (large OR 
>>> capacity))")
>>> +
>>> +or _fuzzy_ searches:
>>> +
>>> +    ?s text:query (ex:description "include~")
>>> +
>>> +and so on will work as expected.
>>> +
>>> +**Always surround the query string with `( )` if more than a single term 
>>> or phrase
>>> +are involved.**
>>> +
>>> +
>>> ### Good practice
>>> 
>>> -The query engine does not have information about the selectivity of the
>>> +From the above it should be clear that best practice, except in the 
>>> simplest cases
>>> +is to use explicit `text:query` forms such as:
>>> +
>>> +    (?s ?sc ?lit) text:query (ex:someProperty "a single Field query")
>>> +
>>> +possibly with _limit_ and `lang:xx` arguments.
>>> +
>>> +Further, the query engine does not have information about the selectivity 
>>> of the
>>> text index and so effective query plans cannot be determined
>>> programmatically.  It is helpful to be aware of the following two
>>> general query patterns.
>>> @@ -394,7 +698,7 @@
>>> is returned on a match. The value of the property is arbitrary so long as 
>>> it is unique among the
>>> defined names.
>>> 
>>> -#### Automatic document deletion
>>> +#### UID Field and automatic document deletion
>>> 
>>> When the `text:uidField` is defined in the `EntityMap` then dropping a 
>>> triple will result in the
>>> corresponding document, if any, being deleted from the text index. The 
>>> value, `"uid"`, is arbitrary
>>> @@ -470,7 +774,7 @@
>>> 
>>> Support for the new `LocalizedAnalyzer` has been introduced in Jena 3.0.0 to
>>> deal with Lucene language specific analyzers. See [Linguistic Support with
>>> -Lucene Index](#linguistic-support-with-lucene-index) part for details.
>>> +Lucene Index](#linguistic-support-with-lucene-index) for details.
>>> 
>>> Support for `GenericAnalyzer`s has been introduced in Jena 3.4.0 to allow
>>> the use of Analyzers that do not have built-in support, e.g., 
>>> `BrazilianAnalyzer`;
>>> @@ -607,15 +911,14 @@
>>> 
>>> ### Linguistic support with Lucene index
>>> 
>>> -It is now possible to take advantage of languages of triple literals to 
>>> enhance
>>> -index and queries. Sub-sections below detail different settings with the 
>>> index,
>>> -and use cases with SPARQL queries.
>>> +Language tags associated with `rdfs:langStrings` occuring as literals in 
>>> triples may
>>> +be used to enhance indexing and queries. Sub-sections below detail 
>>> different settings with the index, and use cases with SPARQL queries.
>>> 
>>> #### Explicit Language Field in the Index
>>> 
>>> -Literals' languages of triples can be stored (during triple addition 
>>> phase) into the
>>> -index to extend query capabilities.
>>> -For that, the new `text:langField` property must be set in the EntityMap 
>>> assembler :
>>> +The language tag for object literals of triples can be stored (during 
>>> triple insert/update)
>>> +into the index to extend query capabilities.
>>> +For that, the `text:langField` property must be set in the EntityMap 
>>> assembler :
>>> 
>>>     <#entMap> a text:EntityMap ;
>>>         text:entityField      "uri" ;
>>> @@ -629,10 +932,12 @@
>>>     EntityDefinition docDef = new EntityDefinition(entityField, 
>>> defaultField);
>>>     docDef.setLangField("lang");
>>> 
>>> +Note that configuring the `text:langField` does not determine a language 
>>> specific
>>> +analyzer. It merely records the tag associated with an indexed 
>>> `rdfs:langString`.
>>> 
>>> #### SPARQL Linguistic Clause Forms
>>> 
>>> -Once the `langField` is set, you can use it directly inside SPARQL 
>>> queries, for that the `'lang:xx'`
>>> +Once the `langField` is set, you can use it directly inside SPARQL 
>>> queries. For that the `lang:xx`
>>> argument allows you to target specific localized values. For example:
>>> 
>>>     //target english literals
>>> @@ -644,6 +949,7 @@
>>>     //ignore language field
>>>     ?s text:query (rdfs:label 'word')
>>> 
>>> +Refer [above](#queries-with-language-tags) for further discussion on 
>>> querying.
>>> 
>>> #### LocalizedAnalyzer
>>> 
>>> @@ -651,7 +957,7 @@
>>> specific analyzers (stemming, stop words,...). Like any other analyzers, it 
>>> can
>>> be done for default text indexing, for each different field or for query.
>>> 
>>> -With an assembler configuration, the `text:language` property needs to
>>> +Using an assembler configuration, the `text:language` property needs to
>>> be provided, e.g :
>>> 
>>>     <#indexLucene> a text:TextIndexLucene ;
>>> @@ -663,7 +969,7 @@
>>>         ]
>>>         .
>>> 
>>> -will configure the index to analyze values of the 'text' field using a
>>> +will configure the index to analyze values of the _default property_ field 
>>> using a
>>> FrenchAnalyzer.
>>> 
>>> To configure the same example via Java code, you need to provide the 
>>> analyzer to the
>>> @@ -678,15 +984,17 @@
>>> `Directory` classes.
>>> 
>>> **Note**: You do not have to set the `text:langField` property with a single
>>> -localized analyzer.
>>> +localized analyzer. Also note that the above configuration will use the
>>> +FrenchAnalyzer for all strings indexed under the _default property_ 
>>> regardless
>>> +of the language tag associated with the literal (if any).
>>> 
>>> #### Multilingual Support
>>> 
>>> Let us suppose that we have many triples with many localized literals in
>>> many different languages. It is possible to take all these languages
>>> -into account for future mixed localized queries.  Just set the
>>> -`text:multilingualSupport` property at `true` to automatically enable
>>> -the localized indexing (and also the localized analyzer for query) :
>>> +into account for future mixed localized queries.  Configure the
>>> +`text:multilingualSupport` property to enable indexing and search via 
>>> localized
>>> +analyzers based on the language tag:
>>> 
>>>     <#indexLucene> a text:TextIndexLucene ;
>>>         text:directory "mem" ;
>>> @@ -699,11 +1007,14 @@
>>>         config.setMultilingualSupport(true);
>>>         Dataset ds = TextDatasetFactory.createLucene(ds1, dir, config) ;
>>> 
>>> -Thus, this multilingual index combines dynamically all localized analyzers 
>>> of existing languages and
>>> -the storage of langField properties.
>>> +This multilingual index combines dynamically all localized analyzers of 
>>> existing
>>> +languages and the storage of langField properties.
>>> 
>>> -For example, it is possible to refer to different languages in the same 
>>> text search query :
>>> +The multilingual analyzer becomes the _default analyzer_ and the Lucene
>>> +`StandardAnalyzer` is the default analyzer used when there is no language 
>>> tag.
>>> 
>>> +It is straightforward to refer to different languages in the same text 
>>> search query:
>>> +
>>>     SELECT ?s
>>>     WHERE {
>>>         { ?s text:query ( rdfs:label 'institut' 'lang:fr' ) }
>>> @@ -714,7 +1025,9 @@
>>> Hence, the result set of the query will contain "institute" related
>>> subjects (institution, institutional,...) in French and in English.
>>> 
>>> -**Note**: If the `text:langField` property is not set, the 
>>> `text:langField` will default to"lang".
>>> +**Note** When multilingual indexing is enabled for a _property_, e.g., 
>>> rdfs:label,
>>> +there will actually be two copies of each literal indexed. One under the 
>>> `Field` name,
>>> +"label", and one under the name "label_xx", where "xx" is the language tag.
>>> 
>>> ### Generic and Defined Analyzer Support
>>> 
>>>

Re: CMS diff: Jena Full Text Search

Reply via email to