CMS diff: Jena Full Text Search

2017-11-19 Thread Chris Tomlinson
Clone URL (Committers only):
https://cms.apache.org/redirect?new=anonymous;action=diff;uri=http://jena.apache.org/documentation%2Fquery%2Ftext-query.mdtext

Chris Tomlinson

Index: trunk/content/documentation/query/text-query.mdtext
===
--- trunk/content/documentation/query/text-query.mdtext (revision 1815762)
+++ trunk/content/documentation/query/text-query.mdtext (working copy)
@@ -1,5 +1,7 @@
 Title: Jena Full Text Search
 
+Title: Jena Full Text Search
+
 This extension to ARQ combines SPARQL and full text search via
 [Lucene](https://lucene.apache.org) 6.4.1 or
 [ElasticSearch](https://www.elastic.co) 5.2.1 (which is built on
@@ -64,7 +66,20 @@
 ## Table of Contents
 
 -   [Architecture](#architecture)
+-   [External content](#external-content)
+-   [External applications](#external-applications)
+-   [Document structure](#document-structure)
 -   [Query with SPARQL](#query-with-sparql)
+-   [Syntax](#syntax)
+-   [Input arguments](#input-arguments)
+-   [Output arguments](#output-arguments)
+-   [Query strings](#query-strings)
+-   [Simple queries](#simple-queries)
+-   [Queries with language tags](#queries-with-language-tags)
+-   [Queries that retrieve literals](#queries-that-retrieve-literals)
+-   [Queries across multiple `Field`s](#queries-across-multiple-fields)
+-   [Queries within a `Field`](#queries-within-a-field)
+-   [Good practice](#good-practice)
 -   [Configuration](#configuration)
 -   [Text Dataset Assembler](#text-dataset-assembler)
 -   [Configuring an analyzer](#configuring-an-analyzer)
@@ -134,6 +149,69 @@
 By using Elasticsearch, other applications can share the text index with
 SPARQL search.
 
+### Document structure
+
+As mentioned above, text indexing of a triple involves associating a Lucene
+document with the triple. How is this done?
+
+Lucene documents are composed of `Field`s. Indexing and searching are performed
+over the contents of these `Field`s. For an RDF triple to be indexed in Lucene the
+_property_ of the triple must be
+[configured in the entity map of a TextIndex](#entity-map-definition).
+This associates a Lucene analyzer with the _`property`_ which will be used
+for indexing and search. The _`property`_ becomes the _searchable_ Lucene
+`Field` in the resulting document.
+
+A Lucene index includes a _default_ `Field`, which is specified in the configuration,
+that is the field to search if not otherwise named in the query. In jena-text
+this field is configured via the `text:defaultField` property which is then mapped
+to a specific RDF property via `text:predicate` (see [entity map](#entity-map-definition)
+below).
+
+There are several additional `Field`s that will be included in the
+document that is passed to the Lucene `IndexWriter` depending on the
+configuration options that are used. These additional fields are used to
+manage the interface between Jena and Lucene and are not generally 
+searchable per se.
+
+The most important of these additional `Field`s is the `text:entityField`.
+This configuration property defines the name of the `Field` that will contain
+the _URI_ or _blank node id_ of the _subject_ of the triple being indexed. This property does
+not have a default and must be specified for most uses of `jena-text`. This
+`Field` is often given the name, `uri`, in examples. It is via this `Field`
+that `?s` is bound in a typical use such as:
+
+select ?s
+where {
+?s text:query "some text"
+}
+
+Other `Field`s that may be configured: `text:uidField`, `text:graphField`,
+and so on are discussed below.
+
+Given the triple:
+
+ex:SomeOne skos:prefLabel "zorn protégé a prés"@fr ;
+
+The following illustrates a Lucene document that Jena will create and
+request Lucene to index:
+
+Document<
+stored, indexed, indexOptions=DOCS  
+indexed, omitNorms, indexOptions=DOCS 
 
+stored, indexed, tokenized  
+stored, indexed, omitNorms, indexOptions=DOCS  
+stored, indexed, tokenized  
+stored, indexed, omitNorms, indexOptions=DOCS 
 
+stored, indexed, tokenized  
+stored, indexed, omitNorms, indexOptions=DOCS  
+stored, indexed, tokenized  
+stored, indexed, omitNorms, indexOptions=DOCS 

+>
+
+It may be instructive to refer back to this example when considering the various
+points below.
+
 ## Query with SPARQL
 
 The URI of the text extension property function is
@@ -143,63 +221,248 @@
 
 ...   text:query ...
 
+### Syntax
 
 The following forms are all legal:
 
-?s text:query 'word'   # query
-?s text:query (rdfs:label 'word')  # query specific property if multiple
-?s text:query ('word' 10)  # with limit on results
-(?s ?score) text:query 'word'  # query capturing also the score
-

[jira] [Updated] (JENA-1430) Quad loading for in-memory assemblers

2017-11-19 Thread A. Soroka (JIRA)

 [ 
https://issues.apache.org/jira/browse/JENA-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Soroka updated JENA-1430:

Summary: Quad loading for in-memory assemblers  (was: Assemblers )

> Quad loading for in-memory assemblers
> -
>
> Key: JENA-1430
> URL: https://issues.apache.org/jira/browse/JENA-1430
> Project: Apache Jena
>  Issue Type: Bug
>  Components: ARQ
>Reporter: A. Soroka
>Assignee: A. Soroka
>
> In-memory dataset Assemblers should support loading quad files.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (JENA-1430) Assemblers

2017-11-19 Thread A. Soroka (JIRA)
A. Soroka created JENA-1430:
---

 Summary: Assemblers 
 Key: JENA-1430
 URL: https://issues.apache.org/jira/browse/JENA-1430
 Project: Apache Jena
  Issue Type: Bug
  Components: ARQ
Reporter: A. Soroka
Assignee: A. Soroka


In-memory dataset Assemblers should support loading quad files.





Re: gitpubsub

2017-11-19 Thread Bruno P. Kinoshita
>Bruno (or anyone), do you know if it would be possible to publish site changes 
>for review out of Apache CI? (Something like the way we can set up to get 
>built artifacts from branches of the codebase without actually releasing them.)
Might be worth checking with INFRA if we request it, or even in that IRC-like 
channel they use (always forget its name). It would be an interesting feature.
>Is it okay with respect to Apache policy to only import the current state of 
>the site to Git (iow to leave behind that massive accumulation of Javadocs), 
>or do we need to maintain a complete history on whatever infrastructure we use?
As far as I can tell, it varies between components. Some release the latest 
javadocs, others a few past releases, others one release only. It might be 
useful for other users to have previous releases, but when working on projects 
with older versions of a library, I normally rely on Eclipse + open 
implementation shortcut to look at the docs.
B


  From: ajs6f 
 To: dev@jena.apache.org 
 Sent: Monday, 20 November 2017 2:48 AM
 Subject: Re: gitpubsub
   
Bruno (or anyone), do you know if it would be possible to publish site changes 
for review out of Apache CI? (Something like the way we can set up to get built 
artifacts from branches of the codebase without actually releasing them.)

Is it okay with respect to Apache policy to only import the current state of 
the site to Git (iow to leave behind that massive accumulation of Javadocs), or 
do we need to maintain a complete history on whatever infrastructure we use?

ajs6f

> On Nov 17, 2017, at 3:30 AM, Bruno P. Kinoshita 
>  wrote:
> 
>> What changes if we go for gitpubsub?
> 
> 
> Not much for end users. For developers, we would need to get used to 
> whichever tool we choose for static site generator.
> 
> 
>> If I read that right, no CMS because CMS is svnpubsub only.  Is it a "big 
>> bang" switch to Jekyll? That isn't too scary but it is a step-change.
> 
> Not much I think. Most of the Markdown can be easily ported with some 
> regex/shell script. When I helped porting OpenNLP's site, I used Jena website 
> as reference for parts of their new layout and general organization. If you 
> open both sites opennlp.apache.org and jena.apache.org, you may find they are 
> both very similar.
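
The regex/shell-script porting mentioned above might look roughly like the following sketch. It assumes that an `.mdtext` page starts with `Key: value` header lines (such as `Title: Jena Full Text Search`) ended by a blank line, and that the target generator reads YAML front matter the way Jekyll does. The function name `mdtext_to_front_matter` is made up for illustration, and real pages may need more cases handled.

```python
import re

# Assumption: header lines look like "Key: value" and sit at the very top
# of the page, before the first blank line.
HEADER_LINE = re.compile(r"^([A-Za-z][A-Za-z0-9_-]*):\s*(.*)$")


def mdtext_to_front_matter(text: str) -> str:
    """Turn leading 'Key: value' lines into a YAML front matter block.

    Hypothetical helper, not part of any Apache CMS or Jekyll tooling.
    Pages without a recognisable header are returned unchanged.
    """
    lines = text.splitlines()
    meta = []
    i = 0
    # Collect consecutive header lines from the top of the page.
    while i < len(lines):
        match = HEADER_LINE.match(lines[i])
        if not match:
            break
        meta.append((match.group(1).lower(), match.group(2)))
        i += 1
    if not meta:
        return text
    # Skip the blank line(s) separating the header from the body.
    while i < len(lines) and not lines[i].strip():
        i += 1
    front = ["---"]
    for key, value in meta:
        # Quote values so that colons or quotes inside them stay valid YAML.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        front.append(f'{key}: "{escaped}"')
    front.append("---")
    return "\n".join(front + [""] + lines[i:]) + "\n"


page = "Title: Jena Full Text Search\n\nThis extension to ARQ combines SPARQL and full text search.\n"
print(mdtext_to_front_matter(page))
```

A caveat of this approach: a page with no header whose first line happens to look like `Word: text` would be misread as metadata, so a real migration would want a review pass over the converted files.
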
> 
> And we don't have to necessarily use Jekyll. If the consensus is for another 
> tool (e.g. Pelican, Hexo, JBake, etc) we just need to confirm with Apache 
> Infra if they are able to run the same tool in their automation pipeline.
> 
> 
>> One thing we do benefit from currently is content fixes via CMS - we may 
>> have to change that. I guess there is no jena.staging.a.o? It becomes local 
>> Jekyll build?
> 
> As far as I know, that is right. However, users can run something like 
> `jekyll serve`. I like the current process, but if you have a large change, 
> it is hard to get feedback without committing to SVN and having a draft in 
> the staging area.
> 
> With gitpubsub plus some static site generator, we can even share a site 
> published from our own GitHub fork. The OpenNLP template has an issue with 
> extra paths, so this one is broken, but we can work to have the Jena website 
> working correctly, and send a pull request to opennlp's repo: 
> https://kinow.github.io/opennlp-site/.
> 
> So if we have a new repository like github.com/apache/jena-site, then I could 
> fork it under github.com/kinow/jena-site, work in my own fork, prepare pull 
> requests, and include a link like https://kinow.github.io/jena-site. I prefer 
> this approach to having to `svn commit` to preview in the staging area.
> 
> 
>> A project can have more than one git repo so I guess we can choose whether 
>> to use the main repo or not.  Our site .svn is 2.2G (probably all those 
>> javadoc changes). Or a separate repo git-include-submodule in the main one?
> 
> Oh, very good point. OpenNLP has/had the same issue. Not sure if that was 
> fixed. Their old docs are served here: 
> http://opennlp.apache.org/docs/legacy.html
> 
> I believe it's done here: 
> https://github.com/apache/opennlp-site/blob/0303866c56689f602dc9258b32e1a64f59ea82e4/pom.xml#L204
> 
> Though not entirely sure how it works. I can join the Slack channel next week 
> and check with them. The first version of the site included all the old 
> javadocs, and was quite slow to check out and build.
> 
> There was some service interruption during the Apache Infra automation 
> set-up. But given OpenNLP just went through the process, it would be simpler, 
> as we could just tell them to look at the job and instead of Maven/JBake, run 
> jekyll or whatever tool we choose. I would be happy to volunteer and create a 
> ticket to create a jena-site repository in GitHub. Then once we have the site 
> being generated there and we have validated it, I can create the ticket for 
> INFRA to set up the automation, and switch from svnpubsub to gitpubsub.
> 
> 
> Cheers
> Bruno
> 
> 

[jira] [Closed] (JENA-1429) Error with # comments in SPARQL

2017-11-19 Thread Karima Rafes (JIRA)

 [ 
https://issues.apache.org/jira/browse/JENA-1429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karima Rafes closed JENA-1429.
--
Resolution: Fixed

# is not supported in query Get

> Error with # comments in SPARQL
> ---
>
> Key: JENA-1429
> URL: https://issues.apache.org/jira/browse/JENA-1429
> Project: Apache Jena
>  Issue Type: Bug
>  Components: Fuseki
> Environment:  Fuseki 3.4.0 
>Reporter: Karima Rafes
>Priority: Trivial
>
> Comments in SPARQL queries take the form of '#', outside an IRI or string, 
> and continue to the end of line [1], but Fuseki sends a parse error (Fuseki 
> 3.4.0 (Build date: 2017-07-17T11:43:07+)).
> [1] https://www.w3.org/TR/rdf-sparql-query/#grammarComments
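
One plausible reading of the closing note above ("# is not supported in query Get") is that the query was placed in a GET URL without percent-encoding, in which case everything from `#` onward is treated as a URL fragment and is not part of the query string at all. The sketch below illustrates that effect with Python's standard library. The endpoint URL and the query text are hypothetical, and nothing here is taken from the reporter's actual request.

```python
from urllib.parse import quote, unquote, urlsplit

# Hypothetical endpoint and query, for illustration only.
endpoint = "http://localhost:3030/ds/query"
sparql = "SELECT * WHERE { ?s ?p ?o } # list everything"

# Naive concatenation: the '#' starts the URL fragment, so the part of the
# URL that counts as the query string stops just before the comment.
raw_url = endpoint + "?query=" + sparql
raw_parts = urlsplit(raw_url)
print("query part   :", repr(raw_parts.query))
print("fragment part:", repr(raw_parts.fragment))

# Percent-encoding the parameter value keeps the whole query together:
# '#' becomes %23 and the URL no longer has a fragment.
encoded_url = endpoint + "?query=" + quote(sparql, safe="")
encoded_parts = urlsplit(encoded_url)
print("encoded URL  :", encoded_url)
print("round trip ok:", unquote(encoded_parts.query.split("=", 1)[1]) == sparql)
```

HTTP client libraries usually do this encoding when parameters are passed separately from the URL, so the problem tends to appear only when a URL is assembled by hand.
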





[jira] [Commented] (JENA-1429) Error with # comments in SPARQL

2017-11-19 Thread A. Soroka (JIRA)

[ 
https://issues.apache.org/jira/browse/JENA-1429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16258615#comment-16258615
 ] 

A. Soroka commented on JENA-1429:
-

Can you provide an example of what is not working for you please?

> Error with # comments in SPARQL
> ---
>
> Key: JENA-1429
> URL: https://issues.apache.org/jira/browse/JENA-1429
> Project: Apache Jena
>  Issue Type: Bug
>  Components: Fuseki
> Environment:  Fuseki 3.4.0 
>Reporter: Karima Rafes
>Priority: Trivial
>
> Comments in SPARQL queries take the form of '#', outside an IRI or string, 
> and continue to the end of line [1], but Fuseki sends a parse error (Fuseki 
> 3.4.0 (Build date: 2017-07-17T11:43:07+)).
> [1] https://www.w3.org/TR/rdf-sparql-query/#grammarComments





[jira] [Created] (JENA-1429) Error with # comments in SPARQL

2017-11-19 Thread Karima Rafes (JIRA)
Karima Rafes created JENA-1429:
--

 Summary: Error with # comments in SPARQL
 Key: JENA-1429
 URL: https://issues.apache.org/jira/browse/JENA-1429
 Project: Apache Jena
  Issue Type: Bug
  Components: Fuseki
 Environment:  Fuseki 3.4.0 
Reporter: Karima Rafes
Priority: Trivial


Comments in SPARQL queries take the form of '#', outside an IRI or string, and 
continue to the end of line [1], but Fuseki sends a parse error (Fuseki 3.4.0 
(Build date: 2017-07-17T11:43:07+)).

[1] https://www.w3.org/TR/rdf-sparql-query/#grammarComments







Re: gitpubsub

2017-11-19 Thread ajs6f
Bruno (or anyone), do you know if it would be possible to publish site changes 
for review out of Apache CI? (Something like the way we can set up to get built 
artifacts from branches of the codebase without actually releasing them.)

Is it okay with respect to Apache policy to only import the current state of 
the site to Git (iow to leave behind that massive accumulation of Javadocs), or 
do we need to maintain a complete history on whatever infrastructure we use?

ajs6f

> On Nov 17, 2017, at 3:30 AM, Bruno P. Kinoshita 
>  wrote:
> 
>> What changes if we go for gitpubsub?
> 
> 
> Not much for end users. For developers, we would need to get used to 
> whichever tool we choose for static site generator.
> 
> 
>> If I read that right, no CMS because CMS is svnpubsub only.  Is it a "big 
>> bang" switch to Jekyll? That isn't too scary but it is a step-change.
> 
> Not much I think. Most of the Markdown can be easily ported with some 
> regex/shell script. When I helped porting OpenNLP's site, I used Jena website 
> as reference for parts of their new layout and general organization. If you 
> open both sites opennlp.apache.org and jena.apache.org, you may find they are 
> both very similar.
> 
> And we don't have to necessarily use Jekyll. If the consensus is for another 
> tool (e.g. Pelican, Hexo, JBake, etc) we just need to confirm with Apache 
> Infra if they are able to run the same tool in their automation pipeline.
> 
> 
>> One thing we do benefit from currently is content fixes via CMS - we may 
>> have to change that. I guess there is no jena.staging.a.o? It becomes local 
>> Jekyll build?
> 
> As far as I know, that is right. However, users can run something like 
> `jekyll serve`. I like the current process, but if you have a large change, 
> it is hard to get feedback without committing to SVN and having a draft in 
> the staging area.
> 
> With gitpubsub plus some static site generator, we can even share a site 
> published from our own GitHub fork. The OpenNLP template has an issue with 
> extra paths, so this one is broken, but we can work to have the Jena website 
> working correctly, and send a pull request to opennlp's repo: 
> https://kinow.github.io/opennlp-site/.
> 
> So if we have a new repository like github.com/apache/jena-site, then I could 
> fork it under github.com/kinow/jena-site, work in my own fork, prepare pull 
> requests, and include a link like https://kinow.github.io/jena-site. I prefer 
> this approach to having to `svn commit` to preview in the staging area.
> 
> 
>> A project can have more than one git repo so I guess we can choose whether 
>> to use the main repo or not.  Our site .svn is 2.2G (probably all those 
>> javadoc changes). Or a separate repo git-include-submodule in the main one?
> 
> Oh, very good point. OpenNLP has/had the same issue. Not sure if that was 
> fixed. Their old docs are served here: 
> http://opennlp.apache.org/docs/legacy.html
> 
> I believe it's done here: 
> https://github.com/apache/opennlp-site/blob/0303866c56689f602dc9258b32e1a64f59ea82e4/pom.xml#L204
> 
> Though not entirely sure how it works. I can join the Slack channel next week 
> and check with them. The first version of the site included all the old 
> javadocs, and was quite slow to check out and build.
> 
> There was some service interruption during the Apache Infra automation 
> set-up. But given OpenNLP just went through the process, it would be simpler, 
> as we could just tell them to look at the job and instead of Maven/JBake, run 
> jekyll or whatever tool we choose. I would be happy to volunteer and create a 
> ticket to create a jena-site repository in GitHub. Then once we have the site 
> being generated there and we have validated it, I can create the ticket for 
> INFRA to set up the automation, and switch from svnpubsub to gitpubsub.
> 
> 
> Cheers
> Bruno
> 
> 
> 
> 
> From: Andy Seaborne 
> To: dev@jena.apache.org 
> Sent: Sunday, 12 November 2017 4:56 AM
> Subject: gitpubsub
> 
> 
> 
> 
> On 09/11/17 20:51, Bruno P. Kinoshita wrote:
> ...
>> However, I'm +1 for moving our site to Git.
> 
> What changes if we go for gitpubsub?
> 
> All I know about it is the bullet point on 
> https://www.apache.org/dev/project-site.html.
> 
> If I read that right, no CMS because CMS is svnpubsub only.  Is it a 
> "big bang" switch to Jekyll? That isn't too scary but it is a step-change.
> 
> One thing we do benefit from currently is content fixes via CMS - we may 
> have to change that. I guess there is no jena.staging.a.o? It becomes 
> local Jekyll build?
> 
> A project can have more than one git repo so I guess we can choose 
> whether to use the main repo or not.  Our site .svn is 2.2G (probably 
> all those javadoc changes). Or a separate repo git-include-submodule in 
> the main one?
> 
> Andy