Re: CMS diff: Jena Full Text Search

2017-12-02 Thread Andy Seaborne



On 01/12/17 16:26, Chris Tomlinson wrote:

Hi Andy,

The current commit is cumulative. The commit just prior to this one addressed all of Osma’s 
comments - which included my raising JENA-1437 and JENA-1438. This commit contains my changes to 
reflect those two issues as being fixed along with all my prior updates.

If there is a better procedure for making a series of updates to docs under CMS 
I’m happy to learn.


This is fine - I was checking it was what it looked like before applying 
it due to limited familiarity with the text indexing.


Andy



Thank you for your help with this,
Chris



On Dec 1, 2017, at 6:10 AM, Andy Seaborne  wrote:

Chris - does this contain the previous one?

If so, are Osma's comments resolved?

If it's all good to go, I'll apply it because changes can continue and are not 
frozen on a release.  I haven't had the time to check through the proposed 
changes and I'm not deeply familiar with the text indexing so I'm relying on 
others to verify them.

Andy

On 30/11/17 17:28, Chris Tomlinson wrote:

This commit updates the jena-text documentation to be consistent with the 
resolved JENA-1437 and JENA-1438 issues.
Regards,
Chris

On Nov 30, 2017, at 5:00 PM, Chris Tomlinson  wrote:

Clone URL (Committers only):
https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext

Chris Tomlinson

Index: trunk/content/documentation/query/text-query.mdtext
===
--- trunk/content/documentation/query/text-query.mdtext (revision 1816662)
+++ trunk/content/documentation/query/text-query.mdtext (working copy)
@@ -1,5 +1,7 @@
Title: Jena Full Text Search

+Title: Jena Full Text Search
+
This extension to ARQ combines SPARQL and full text search via
[Lucene](https://lucene.apache.org) 6.4.1 or
[ElasticSearch](https://www.elastic.co) 5.2.1 (which is built on
@@ -64,7 +66,21 @@
## Table of Contents

-   [Architecture](#architecture)
+-   [External content](#external-content)
+-   [External applications](#external-applications)
+-   [Document structure](#document-structure)
-   [Query with SPARQL](#query-with-sparql)
+-   [Syntax](#syntax)
+-   [Input arguments](#input-arguments)
+-   [Output arguments](#output-arguments)
+-   [Query strings](#query-strings)
+-   [Simple queries](#simple-queries)
+-   [Queries with language tags](#queries-with-language-tags)
+-   [Queries that retrieve literals](#queries-that-retrieve-literals)
+-   [Queries with graphs](#queries-with-graphs)
+-   [Queries across multiple `Fields`](#queries-across-multiple-fields)
+-   [Queries with _Boolean Operators_ and _Term Modifiers_](#queries-with-boolean-operators-and-term-modifiers)
+-   [Good practice](#good-practice)
-   [Configuration](#configuration)
 -   [Text Dataset Assembler](#text-dataset-assembler)
 -   [Configuring an analyzer](#configuring-an-analyzer)
@@ -108,6 +124,7 @@

The text index uses the native query language of the index:
[Lucene query language](http://lucene.apache.org/core/6_4_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description)
+(with [restrictions](#input-arguments))
or
[Elasticsearch query language](https://www.elastic.co/guide/en/elasticsearch/reference/5.2/query-dsl.html).

@@ -134,6 +151,64 @@
By using Elasticsearch, other applications can share the text index with
SPARQL search.

+### Document structure
+
+As mentioned above, text indexing of a triple involves associating a Lucene
+document with the triple. How is this done?
+
+Lucene documents are composed of `Field`s. Indexing and searching are performed
+over the contents of these `Field`s. For an RDF triple to be indexed in Lucene the
+_property_ of the triple must be
+[configured in the entity map of a TextIndex](#entity-map-definition).
+This associates a Lucene analyzer with the _`property`_ which will be used
+for indexing and search. The _`property`_ becomes the _searchable_ Lucene
+`Field` in the resulting document.
+
+A Lucene index includes a _default_ `Field`, which is specified in the configuration,
+that is the field to search if not otherwise named in the query. In jena-text
+this field is configured via the `text:defaultField` property which is then mapped
+to a specific RDF property via `text:predicate` (see [entity map](#entity-map-definition)
+below).
+
+There are several additional `Field`s that will be included in the
+document that is passed to the Lucene `IndexWriter` depending on the
+configuration options that are used. These additional fields are used to
+manage the interface between Jena and Lucene and are not generally
+searchable per se.
+
+The most important of these additional `Field`s is the 

Re: IteratorIterator deprecated

2017-12-02 Thread Andy Seaborne
Abstract LazyIterator.create() is called when the first hasNext or next 
is called. Hence the expensive iterator is not made until first use.


Having
  LazyIterator(Supplier<Iterator<T>> supplier)
is a more modern style than subclassing.
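
As a rough illustration of the supplier style, here is a minimal sketch. It is not Jena's LazyIterator: the class name SupplierLazyIterator and the isCreated() helper are hypothetical, and it implements plain java.util.Iterator rather than Jena's ExtendedIterator. The supplier is only invoked on the first hasNext() or next() call.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Supplier;

/** Hypothetical sketch, not Jena's LazyIterator: defers creation to first use. */
final class SupplierLazyIterator<T> implements Iterator<T> {
    private final Supplier<Iterator<T>> supplier;
    private Iterator<T> delegate; // stays null until the first hasNext()/next()

    SupplierLazyIterator(Supplier<Iterator<T>> supplier) {
        this.supplier = supplier;
    }

    /** Creates the underlying iterator on first call, then reuses it. */
    private Iterator<T> delegate() {
        if (delegate == null) {
            delegate = supplier.get();
        }
        return delegate;
    }

    /** Illustration-only helper: has the expensive iterator been built yet? */
    boolean isCreated() {
        return delegate != null;
    }

    @Override public boolean hasNext() { return delegate().hasNext(); }

    @Override public T next() { return delegate().next(); }
}

class SupplierLazyIteratorDemo {
    public static void main(String[] args) {
        SupplierLazyIterator<String> it =
            new SupplierLazyIterator<>(() -> Arrays.asList("a", "b").iterator());
        System.out.println(it.isCreated()); // false: nothing built yet
        System.out.println(it.next());      // a
        System.out.println(it.isCreated()); // true: built on first use
    }
}
```

Taking a Supplier means callers can pass a lambda instead of writing a subclass that overrides create().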

Aside: our own "Creator" interface has an explicit "new each call" 
semantics, unlike strict Supplier.


On 02/12/17 15:06, ajs6f wrote:

Claude-- have you looked at the Guava machinery underneath 
Iterators.concat(...)? I _believe_ that it is fully lazy, although I haven't 
looked at it in a while.


Yes, for concat(Iterator<? extends Iterator<? extends T>> inputs)

And pass in a LazyIterator for the others.

The same will work for IteratorConcat.
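
To make the laziness concrete, here is a sketch of concatenation over an iterator of iterators in plain Java. It is neither Guava's Iterators.concat nor Jena's IteratorConcat; the names LazyConcat and loggingOuter are made up for the example. An inner iterator is only requested from the outer one once the previous inner iterator is exhausted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical sketch: flattens an iterator of iterators one inner iterator at a time. */
final class LazyConcat<T> implements Iterator<T> {
    private final Iterator<? extends Iterator<? extends T>> inputs;
    private Iterator<? extends T> current = Collections.emptyIterator();

    LazyConcat(Iterator<? extends Iterator<? extends T>> inputs) {
        this.inputs = inputs;
    }

    @Override public boolean hasNext() {
        // Only advance the outer iterator when the current inner one is used up.
        while (!current.hasNext() && inputs.hasNext()) {
            current = inputs.next();
        }
        return current.hasNext();
    }

    @Override public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }
}

class LazyConcatDemo {
    /** Outer iterator that records each time an inner iterator is handed out. */
    static Iterator<Iterator<Integer>> loggingOuter(List<String> log) {
        List<List<Integer>> data = Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3));
        Iterator<List<Integer>> base = data.iterator();
        return new Iterator<Iterator<Integer>>() {
            @Override public boolean hasNext() { return base.hasNext(); }
            @Override public Iterator<Integer> next() {
                List<Integer> chunk = base.next();
                log.add("opened " + chunk);
                return chunk.iterator();
            }
        };
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        LazyConcat<Integer> all = new LazyConcat<>(loggingOuter(log));
        System.out.println(all.next() + " after " + log.size() + " inner iterator(s) opened");
        System.out.println(all.next() + " after " + log.size() + " inner iterator(s) opened");
        System.out.println(all.next() + " after " + log.size() + " inner iterator(s) opened");
    }
}
```

If the inner iterators are themselves expensive to build, each one can additionally be wrapped in a lazy iterator so that it is not created until it is first read.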

Andy



https://google.github.io/guava/releases/19.0/api/docs/com/google/common/collect/Iterators.html#concat(java.util.Iterator)
https://google.github.io/guava/releases/19.0/api/docs/com/google/common/collect/Iterators.html#concat(java.util.Iterator...)
etc.

ajs6f


On Dec 2, 2017, at 9:42 AM, Claude Warren  wrote:

I know that the IteratorIterator class was deprecated some time ago in
favor of WrappedIterator.createIteratorIterator(). However they have
different performance characteristics.

The original IteratorIterator did not call iter.next() on the base
iterator until it was needed.  createIteratorIterator() creates the
iterators using an (iter.next()).andThen( iter.next() ) type construct,
which calls next() on the base iterator up front to create an iterator
over the results.

The problem is that when the base iterator is returning expensive (or very
memory hungry) iterators they are all loaded before the final iterator is
ready for use.

I know that there is a lazy iterator but I don't see how to use that to
solve the problem.

I propose we bring back the IteratorIterator and describe when to use it.

Claude

--
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren




Re: IteratorIterator deprecated

2017-12-02 Thread ajs6f
Claude-- have you looked at the Guava machinery underneath 
Iterators.concat(...)? I _believe_ that it is fully lazy, although I haven't 
looked at it in a while.

https://google.github.io/guava/releases/19.0/api/docs/com/google/common/collect/Iterators.html#concat(java.util.Iterator)
https://google.github.io/guava/releases/19.0/api/docs/com/google/common/collect/Iterators.html#concat(java.util.Iterator...)
etc.

ajs6f

> On Dec 2, 2017, at 9:42 AM, Claude Warren  wrote:
> 
> I know that the IteratorIterator class was deprecated some time ago in
> favor of WrappedIterator.createIteratorIterator(). However they have
> different performance characteristics.
> 
> The original IteratorIterator did not call iter.next() on the base
> iterator until it was needed.  createIteratorIterator() creates the
> iterators using an (iter.next()).andThen( iter.next() ) type construct,
> which calls next() on the base iterator up front to create an iterator
> over the results.
> 
> The problem is that when the base iterator is returning expensive (or very
> memory hungry) iterators they are all loaded before the final iterator is
> ready for use.
> 
> I know that there is a lazy iterator but I don't see how to use that to
> solve the problem.
> 
> I propose we bring back the IteratorIterator and describe when to use it.
> 
> Claude
> 
> -- 
> I like: Like Like - The likeliest place on the web
> 
> LinkedIn: http://www.linkedin.com/in/claudewarren



IteratorIterator deprecated

2017-12-02 Thread Claude Warren
I know that the IteratorIterator class was deprecated some time ago in
favor of WrappedIterator.createIteratorIterator(). However they have
different performance characteristics.

The original IteratorIterator did not call iter.next() on the base
iterator until it was needed.  createIteratorIterator() creates the
iterators using an (iter.next()).andThen( iter.next() ) type construct,
which calls next() on the base iterator up front to create an iterator
over the results.

The problem is that when the base iterator is returning expensive (or very
memory hungry) iterators they are all loaded before the final iterator is
ready for use.
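
A simplified stand-in for the behaviour described above (this is not the actual WrappedIterator.createIteratorIterator() code; EagerChainSketch, CountingBase and eagerChain are hypothetical names used only for illustration): the chain is built by draining the base iterator, so every inner iterator has been fetched before the first element is read.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class EagerChainSketch {
    /** Base iterator that counts how many inner iterators it has handed out. */
    static final class CountingBase implements Iterator<Iterator<String>> {
        private final Iterator<List<String>> source;
        int handedOut = 0;

        CountingBase(List<List<String>> data) {
            this.source = data.iterator();
        }

        @Override public boolean hasNext() { return source.hasNext(); }

        @Override public Iterator<String> next() {
            handedOut++;
            return source.next().iterator();
        }
    }

    /** Eager chaining: drains the base iterator before returning anything. */
    static <T> Iterator<T> eagerChain(Iterator<Iterator<T>> base) {
        List<Iterator<T>> parts = new ArrayList<>();
        while (base.hasNext()) {
            parts.add(base.next()); // every inner iterator is created here
        }
        return new Iterator<T>() {
            private int index = 0;

            @Override public boolean hasNext() {
                while (index < parts.size() && !parts.get(index).hasNext()) {
                    index++;
                }
                return index < parts.size();
            }

            @Override public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return parts.get(index).next();
            }
        };
    }

    public static void main(String[] args) {
        CountingBase base = new CountingBase(Arrays.asList(
            Arrays.asList("a"), Arrays.asList("b"), Arrays.asList("c")));
        Iterator<String> chained = eagerChain(base);
        // All three inner iterators exist before a single element has been read.
        System.out.println("inner iterators created before first next(): " + base.handedOut);
        System.out.println(chained.next());
    }
}
```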

I know that there is a lazy iterator but I don't see how to use that to
solve the problem.

I propose we bring back the IteratorIterator and describe when to use it.

Claude

-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren