[jira] [Created] (JENA-1646) allow optional non-indexing of text:field

2018-12-04 Thread Code Ferret (JIRA)
Code Ferret created JENA-1646:
-

         Summary: allow optional non-indexing of text:field
             Key: JENA-1646
             URL: https://issues.apache.org/jira/browse/JENA-1646
         Project: Apache Jena
      Issue Type: Improvement
      Components: Jena
Affects Versions: Jena 3.9.0
        Reporter: Code Ferret
        Assignee: Code Ferret
         Fix For: Jena 3.10.0


When using the multilingual support, the field to search is generally the 
{{text:field}} with the {{text:lang}} field value appended, for example:
{code:java}
altLabel_fr
{code}
In this usage, if queries are never performed against the {{text:field}} 
without a _language tag_, some space and time can be saved by not indexing 
the bare {{text:field}}. This improvement adds a boolean option, 
{{text:noIndex}}, for use in those {{text:map}} entries whose 
{{text:field}} should not be indexed. This only makes sense in the 
context of {{text:multilingualSupport true}} in the {{TextIndex}}.
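
As a rough illustration of the effect (a sketch only, not the jena-text code; the class name, method name and boolean parameter are made up for this example), the set of Lucene fields written for one literal might be chosen as follows, assuming the localized field name is the {{text:field}} value, an underscore, and the language tag, as in altLabel_fr above:

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: NOT the jena-text implementation. */
final class MultilingualFieldNames {

    /**
     * Returns the index field names a literal would be written to.
     * 'noIndex' mirrors the proposed text:noIndex option (assumed semantics).
     */
    static List<String> fieldsToIndex(String field, String lang, boolean noIndex) {
        List<String> fields = new ArrayList<>();
        if (!noIndex) {
            fields.add(field);              // bare field, e.g. "altLabel"
        }
        if (lang != null && !lang.isEmpty()) {
            fields.add(field + "_" + lang); // localized field, e.g. "altLabel_fr"
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(fieldsToIndex("altLabel", "fr", false));
        System.out.println(fieldsToIndex("altLabel", "fr", true));
    }
}
```

Under these assumptions, an entry with noIndex set is written to the localized field only, so a query against the bare field would presumably match nothing for that entry.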

A pull request is available.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (JENA-1645) Poor performance with full text search (Lucene)

2018-12-04 Thread Code Ferret (JIRA)


[ 
https://issues.apache.org/jira/browse/JENA-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709241#comment-16709241
 ] 

Code Ferret commented on JENA-1645:
---

It would be helpful to see example queries and how you have used the subject 
URI. 

I agree that the {{concreteSubject}} *should* create Lucene queries that 
include a term of the form:

{code}
... AND uri:http://example.org/data/resource/R0123
{code}

Currently, the code for {{concreteSubject}} collects results for all possible 
subjects and, only after those results are returned, selects the ones 
corresponding to the provided {{subject}} and discards the rest. 
Quite inefficient!

This behavior is transparent to the user other than the performance; however, 
if there is some reason to keep this behavior then the _new_ behavior can be 
handled by adding a {{boolean}} {{TextIndex}} option in the configuration: 
{{text:useConcreteSubject}}.

The implementation involves threading the subject into the 
{{TextIndex.query(...)}}, adding a new query method to {{TextIndex}}, 
{{TextIndexLucene}} and {{TextIndexES}}. It should be rather straightforward.
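
A minimal sketch of the query-string side of this, in plain Java with no Lucene dependency. The {{uri}} field name is taken from the example term above; the class and method names, and the decision to quote the URI so that ':' and '/' are not read as query syntax, are assumptions of this sketch. Whether a quoted term matches depends on how the field is analyzed, and a real implementation might instead combine query objects rather than strings:

```java
/** Illustrative only: builds a query string, does not call Lucene. */
final class SubjectConstrainedQuery {

    /** Wraps the user's text query and ANDs it with a constraint on the subject URI. */
    static String withSubject(String textQuery, String subjectUri) {
        // Escape backslashes and double quotes so the URI stays one quoted term.
        String escaped = subjectUri.replace("\\", "\\\\").replace("\"", "\\\"");
        return "(" + textQuery + ") AND uri:\"" + escaped + "\"";
    }

    public static void main(String[] args) {
        System.out.println(
            withSubject("label:apple", "http://example.org/data/resource/R0123"));
    }
}
```

The point of the design is that the subject restriction is applied by the index during the search, instead of fetching hits for every subject and filtering afterwards.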

> Poor performance with full text search (Lucene)
> ---
>
>              Key: JENA-1645
>              URL: https://issues.apache.org/jira/browse/JENA-1645
>          Project: Apache Jena
>       Issue Type: Question
>       Components: Jena
> Affects Versions: Jena 3.9.0
>         Reporter: Vasyl Danyliuk
>         Priority: Major
>
> Situation: half a million documents (emails, actually) indexed by Lucene; 
> searching for emails by sender/receiver and some text.
> If the text filter is placed at the start of the SPARQL query, it executes 
> once, but for very common words there are a lot of results (100 000+), which 
> leads to poor performance; limiting the result count may end up with missed 
> results.
> If the text search is placed as the last condition, it executes once for each 
> already found subject. That's completely OK, but the text search completely 
> ignores the subject URI.
> I found two methods in the TextQueryPF class: variableSubject(...) for the 
> first case, and concreteSubject(...) for the second one.
> The question is: why can't the subject URI be used as a constraint in the 
> text search?





Re: Toward Jena 3.10.0

2018-12-04 Thread Marco Neumann
On Tue, Dec 4, 2018 at 1:04 PM Greg Albiston  wrote:

> Hi Marco,
>
> 2. The GeoSPARQL-Fuseki application has some options for convenience in
> setting up the Fuseki server. It looks like the two minute delay is
> caused by applying RDFS inferencing to the dataset and then writing the
> results into the dataset (i.e. Jena operations). The GeoSPARQL schema has
> a class and property hierarchy that a user can apply to their dataset for
> some of the functionality. The inferencing is applied by default when
> loading a file, but also when connecting to a TDB, in case it hasn't
> been done manually by the user. The other potentially costly operation
> is creating "hasDefaultGeometry" properties, which is switched off by
> default.
>

ah OK that's good to know


>
> The following line should lead to quicker loading the second time.
>
> ./geosparql-fuseki --loopback false --tdb TDB1 --inference false


this looks useful I will check it out tonight


>
> I could change the setup so that file loading applies inferencing by
> default and TDB does not, but I thought picking one would be better for
> consistent behaviour. Always true means less burden for users working
> out why they might have a problem after loading their dataset.
>
> There is probably a broader question as to how/if these options should
> be integrated in with Fuseki, whether it should be a separate
> application or they should be left out. I think they are useful to a
> user who is looking for a GeoSPARQL solution. Currently,
> GeoSPARQL-Fuseki is using the main/embedded server so doesn't have a GUI
> etc.


> 3. I get what you mean about the invalidity of the query now. The polygon
> is invalid because it is not closed. However, I'm unclear about how
> these errors and warnings are handled any different to if there was a
> SPARQL syntax error. A Query Parse Exception is thrown with full stack
> trace. The error highlights the specific problem while the warning shows
> the context of the error and stack trace. This made it easier to hunt
> down these kinds of problems when they could be coming from a query or
> the dataset. What would you be looking for instead?
>

it would be great if we could tie this closer into the query processor and
have the query cancelled on a spatial pre-processor error like the one above,
with something reported to the user. right now it seems to hit all WKT
literals in the dataset before finishing up (ignoring the LIMIT in the SPARQL
query, of course), while the user waits with no further information, only to
be finally presented with an empty results table.


Best,
Marco



>
> Thanks,
>
> Greg
>
> On 04/12/2018 12:01, Marco Neumann wrote:
> > comments inline
> >
> > On Mon, Dec 3, 2018 at 5:14 PM Greg Albiston  wrote:
> >
> >> Hi Marco,
> >>
> >> 1. As mentioned this shouldn't be too difficult to support.
> >>
> > indeed not difficult but needs a decision
> >
> > you could try with the following geonames dataset
> >
> > all-geonames_lotico.ttl.gz
> >
> >
> >
> >> 2. Yes, the indexing, or rather caching, is in-memory, but it is
> >> on-demand. There shouldn't be any delay at start-up beyond what Jena
> >> needs to do. The cost comes during query execution. The key invariant
> >> data produced for solutions is retained for a short period of time (but
> >> can be configured to be retained until termination). Some regularly
> >> re-used info is always kept until termination (e.g. any spatial
> >> reference system transformation that has been requested).
> >>
> > the following will create and populate the TDB dataset
> >
> > ./geosparql-fuseki --loopback false --rdf_file ./lm.ttl --tdb TDB1
> >
> > I presume this message refers to the creation of the spatial cache / index
> >
> > 6:05:46.685 INFO  Applying GeoSPARQL Schema - Started
> > 6:07:44.826 INFO  Applying GeoSPARQL Schema - Completed
> >
> > next time I can call TDB directly
> >
> > ./geosparql-fuseki --loopback false --tdb TDB1
> >
> > 6:08:38.665 INFO  Applying GeoSPARQL Schema - Started
> > 6:10:18.661 INFO  Applying GeoSPARQL Schema - Completed
> >
> > takes approximately 2m for a very small data set. the same fuseki with
> > tdb+jena-spatial restarts almost instantaneously even with reasonably large
> > data sets (see geonames).
> >
> >
> >> The main benefit of this is de-serialising geometry literals. The
> >> spatial relations arguments are between a pair of geometry literals, one
> >> of which is likely to be the same in the next solution, so keeping hold
> >> of both means in a lot of cases the de-serialisation can be avoided for
> >> one (and possibly both if still retained from a previous set of solutions).
> >>
> > might be a good idea to serialize the cache object of de-serialised
> > geometries to disk to speed up the boot process. maybe Andy could assist or
> > even align this with tdb
> >
> >
> >> The aim was to only do work that's needed, not do repeat work and to be
> >> generally quick (i.e. rely on JTS to be optimised for quick solutions
> >> between the geometry pair

Re: Toward Jena 3.10.0

2018-12-04 Thread Greg Albiston

Hi Marco,

2. The GeoSPARQL-Fuseki application has some options for convenience in 
setting up the Fuseki server. It looks like the two minute delay is 
caused by applying RDFS inferencing to the dataset and then writing the 
results into the dataset (i.e. Jena operations). The GeoSPARQL schema has 
a class and property hierarchy that a user can apply to their dataset for 
some of the functionality. The inferencing is applied by default when 
loading a file, but also when connecting to a TDB, in case it hasn't 
been done manually by the user. The other potentially costly operation 
is creating "hasDefaultGeometry" properties, which is switched off by 
default.
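
For readers unfamiliar with what the schema step involves: RDFS inferencing over a class hierarchy amounts, roughly, to adding every superclass of a resource's asserted type as a further type. The sketch below shows that closure in plain Java with no Jena API involved. The class and method names are invented for the example and ex:Well is a made-up class; geo:Feature and geo:SpatialObject are GeoSPARQL classes. It is not how GeoSPARQL-Fuseki implements the step.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: subclass closure without any RDF library. */
final class SubclassClosure {

    /** Returns the asserted class plus all of its direct and indirect superclasses. */
    static Set<String> typesFor(String assertedClass, Map<String, List<String>> subClassOf) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> todo = new ArrayDeque<>();
        todo.push(assertedClass);
        while (!todo.isEmpty()) {
            String cls = todo.pop();
            if (seen.add(cls)) { // skip classes already visited (guards against cycles)
                for (String parent : subClassOf.getOrDefault(cls, Collections.emptyList())) {
                    todo.push(parent);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        Map<String, List<String>> subClassOf = new HashMap<>();
        subClassOf.put("ex:Well", Collections.singletonList("geo:Feature")); // hypothetical
        subClassOf.put("geo:Feature", Collections.singletonList("geo:SpatialObject"));
        System.out.println(typesFor("ex:Well", subClassOf));
    }
}
```

Writing one such set of extra type statements per typed resource back into the store is the kind of work that grows with dataset size, which is consistent with the delay described above.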


The following line should lead to quicker loading the second time.

./geosparql-fuseki --loopback false --tdb TDB1 --inference false

I could change the setup so that file loading applies inferencing by 
default and TDB does not, but I thought picking one would be better for 
consistent behaviour. Always true means less burden for users working 
out why they might have a problem after loading their dataset.


There is probably a broader question as to how/if these options should 
be integrated in with Fuseki, whether it should be a separate 
application or they should be left out. I think they are useful to a 
user who is looking for a GeoSPARQL solution. Currently, 
GeoSPARQL-Fuseki is using the main/embedded server so doesn't have a GUI 
etc.


3. I get what you mean about the invalidity of the query now. The polygon 
is invalid because it is not closed. However, I'm unclear about how 
these errors and warnings are handled any different to if there was a 
SPARQL syntax error. A Query Parse Exception is thrown with full stack 
trace. The error highlights the specific problem while the warning shows 
the context of the error and stack trace. This made it easier to hunt 
down these kinds of problems when they could be coming from a query or 
the dataset. What would you be looking for instead?


Thanks,

Greg

On 04/12/2018 12:01, Marco Neumann wrote:

comments inline

On Mon, Dec 3, 2018 at 5:14 PM Greg Albiston  wrote:


Hi Marco,

1. As mentioned this shouldn't be too difficult to support.


indeed not difficult but needs a decision

you could try with the following geonames dataset

all-geonames_lotico.ttl.gz




2. Yes, the indexing, or rather caching, is in-memory, but it is
on-demand. There shouldn't be any delay at start-up beyond what Jena
needs to do. The cost comes during query execution. The key invariant
data produced for solutions is retained for a short period of time (but
can be configured to be retained until termination). Some regularly
re-used info is always kept until termination (e.g. any spatial
reference system transformation that has been requested).


the following will create and populate the TDB dataset

./geosparql-fuseki --loopback false --rdf_file ./lm.ttl --tdb TDB1

I presume this message refers to the creation of the spatial cache / index

6:05:46.685 INFO  Applying GeoSPARQL Schema - Started
6:07:44.826 INFO  Applying GeoSPARQL Schema - Completed

next time I can call TDB directly

./geosparql-fuseki --loopback false --tdb TDB1

6:08:38.665 INFO  Applying GeoSPARQL Schema - Started
6:10:18.661 INFO  Applying GeoSPARQL Schema - Completed

takes approximately 2m for a very small data set. the same fuseki with
tdb+jena-spatial restarts almost instantaneously even with reasonably large
data sets (see geonames).



The main benefit of this is de-serialising geometry literals. The
spatial relations arguments are between a pair of geometry literals, one
of which is likely to be the same in the next solution, so keeping hold
of both means in a lot of cases the de-serialisation can be avoided for
one (and possibly both if still retained from a previous set of solutions).
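
A minimal sketch of the idea of retaining de-serialised literals, keyed by their lexical form. Note the substitution: the text describes retention for a period of time, whereas this sketch uses a size-bounded least-recently-used cache built on java.util.LinkedHashMap because it is shorter to show. The class name, the parser function and the parse counter are all assumptions for illustration, and the class is not thread-safe.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Illustrative only: size-bounded LRU cache of parsed geometry literals. */
final class GeometryLiteralCache<G> {
    private final Map<String, G> cache;
    private final Function<String, G> parser;
    private int parseCount = 0;

    GeometryLiteralCache(final int maxEntries, Function<String, G> parser) {
        this.parser = parser;
        // accessOrder=true makes iteration order least-recently-used first.
        this.cache = new LinkedHashMap<String, G>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, G> eldest) {
                return size() > maxEntries; // evict once the bound is exceeded
            }
        };
    }

    /** Returns the cached geometry, parsing the literal only on a cache miss. */
    G get(String lexicalForm) {
        G geometry = cache.get(lexicalForm);
        if (geometry == null) {
            geometry = parser.apply(lexicalForm);
            parseCount++;
            cache.put(lexicalForm, geometry);
        }
        return geometry;
    }

    int parseCount() {
        return parseCount;
    }

    public static void main(String[] args) {
        // String::length stands in for a real WKT parser.
        GeometryLiteralCache<Integer> cache = new GeometryLiteralCache<>(100, String::length);
        cache.get("POINT(0 0)");
        cache.get("POINT(0 0)");
        System.out.println("parses: " + cache.parseCount());
    }
}
```

When the same literal appears in consecutive solutions, as with a constant polygon in a FILTER, only the first lookup pays the parsing cost.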


might be a good idea to serialize the cache object of de-serialised
geometries to disk to speed up the boot process. maybe Andy could assist or
even align this with tdb



The aim was to only do work that's needed, not do repeat work and to be
generally quick (i.e. rely on JTS to be optimised for quick solutions
between the geometry pairs and Jena to optimise queries). There are 24
spatial relations and about half a dozen other functions so
pre-computing every combination gets big quickly and produces data that
users might not want/use.

A rough check of most of the spatial relations only requires a bounding box
intersection or type check, so negative results can be quickly
discarded.  I looked into caching and storing to file, but there just
wasn't the benefit in my use case. It took longer to load the cache and then
execute than to just execute from fresh and cache. Also, the spatial
indexes implemented by JTS aren't designed/suited for the spatial
relations. If there is a use-case that gets more benefit from
pre-computing or storing between programme execution then I'm sure it
can be adapted for, but in the context of GeoSPARQL this approach was
effective.
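
The bounding-box rough check mentioned above can be sketched as follows. This is a generic axis-aligned box test in plain Java, not the JTS or GeoSPARQL-Jena code; the class and field names are invented for the example. If two boxes do not intersect, relations such as sfWithin or sfIntersects cannot hold between the geometries they enclose, so the exact (and more expensive) geometric test can be skipped.

```java
/** Illustrative only: axis-aligned bounding box with an intersection test. */
final class BoundingBox {
    final double minX, minY, maxX, maxY;

    BoundingBox(double minX, double minY, double maxX, double maxY) {
        this.minX = minX;
        this.minY = minY;
        this.maxX = maxX;
        this.maxY = maxY;
    }

    /** True if the two boxes overlap or touch on both axes. */
    boolean intersects(BoundingBox other) {
        return minX <= other.maxX && other.minX <= maxX
            && minY <= other.maxY && other.minY <= maxY;
    }

    public static void main(String[] args) {
        BoundingBox polygon = new BoundingBox(-10, 50, 2, 55);
        BoundingBox inside = new BoundingBox(0, 52, 0, 52);    // a point within the box
        BoundingBox farAway = new BoundingBox(10, 10, 10, 10); // a point well outside
        System.out.println(polygon.intersects(inside));
        System.out.println(polygon.intersects(farAway));
    }
}
```

A positive result from this check is only a candidate; the exact relation still has to be evaluated on the geometries themselves.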

3. If you coul

Re: Toward Jena 3.10.0

2018-12-04 Thread Marco Neumann
comments inline

On Mon, Dec 3, 2018 at 5:14 PM Greg Albiston  wrote:

> Hi Marco,
>
> 1. As mentioned this shouldn't be too difficult to support.
>

indeed not difficult but needs a decision

you could try with the following geonames dataset

all-geonames_lotico.ttl.gz



>
> 2. Yes, the indexing, or rather caching, is in-memory, but it is
> on-demand. There shouldn't be any delay at start-up beyond what Jena
> needs to do. The cost comes during query execution. The key invariant
> data produced for solutions is retained for a short period of time (but
> can be configured to be retained until termination). Some regularly
> re-used info is always kept until termination (e.g. any spatial
> reference system transformation that has been requested).
>

the following will create and populate the TDB dataset

./geosparql-fuseki --loopback false --rdf_file ./lm.ttl --tdb TDB1

I presume this message refers to the creation of the spatial cache / index

6:05:46.685 INFO  Applying GeoSPARQL Schema - Started
6:07:44.826 INFO  Applying GeoSPARQL Schema - Completed

next time I can call TDB directly

./geosparql-fuseki --loopback false --tdb TDB1

6:08:38.665 INFO  Applying GeoSPARQL Schema - Started
6:10:18.661 INFO  Applying GeoSPARQL Schema - Completed

takes approximately 2m for a very small data set. the same fuseki with
tdb+jena-spatial restarts almost instantaneously even with reasonably large
data sets (see geonames).


> The main benefit of this is de-serialising geometry literals. The
> spatial relations arguments are between a pair of geometry literals, one
> of which is likely to be the same in the next solution, so keeping hold
> of both means in a lot of cases the de-serialisation can be avoided for
> one (and possibly both if still retained from a previous set of solutions).
>

might be a good idea to serialize the cache object of de-serialised
geometries to disk to speed up the boot process. maybe Andy could assist or
even align this with tdb


>
> The aim was to only do work that's needed, not do repeat work and to be
> generally quick (i.e. rely on JTS to be optimised for quick solutions
> between the geometry pairs and Jena to optimise queries). There are 24
> spatial relations and about half a dozen other functions so
> pre-computing every combination gets big quickly and produces data that
> users might not want/use.
>
> A rough check of most of the spatial relations only requires a bounding box
> intersection or type check, so negative results can be quickly
> discarded.  I looked into caching and storing to file, but there just
> wasn't the benefit in my use case. It took longer to load the cache and then
> execute than to just execute from fresh and cache. Also, the spatial
> indexes implemented by JTS aren't designed/suited for the spatial
> relations. If there is a use-case that gets more benefit from
> pre-computing or storing between programme execution then I'm sure it
> can be adapted for, but in the context of GeoSPARQL this approach was
> effective.
>
> 3. If you could send me the dataset that causes these errors then I'll
> happily have a look into it.
>

you can use this simple list of point geometries here

http://www.lotico.com/lm.ttl.gz

this query will parse and execute

PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>

SELECT ?well
WHERE {
  ?well  ?geometry .
  FILTER(geof:sfWithin(?geometry,"POLYGON((-10 50,2 50,2 55,-10 55,-10 50))"^^geo:wktLiteral))
} LIMIT 10

this one will parse and fail

PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>

SELECT ?well
WHERE {
  ?well  ?geometry .
  FILTER(geof:sfWithin(?geometry,"POLYGON((-10 50,2 50,2 55,-10 55,-10 51))"^^geo:wktLiteral))
} LIMIT 10
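
The only difference between the two literals is the last point: the first ring returns to its starting point (-10 50), the second ends at (-10 51) and is therefore not closed. A minimal sketch of such a closure check in plain Java, assuming a single-ring POLYGON without holes and two numeric ordinates per point; the class and method names are invented for the example and this is not the WKT parser used by the server:

```java
/** Illustrative only: checks whether a single-ring WKT POLYGON is closed. */
final class RingClosureCheck {

    /** True if the first and last points of the ring coincide. */
    static boolean isClosed(String wkt) {
        int start = wkt.indexOf("((");
        int end = wkt.lastIndexOf("))");
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("Not a single-ring POLYGON: " + wkt);
        }
        // Points are comma-separated, ordinates are whitespace-separated.
        String[] points = wkt.substring(start + 2, end).trim().split("\\s*,\\s*");
        double[] first = parsePoint(points[0]);
        double[] last = parsePoint(points[points.length - 1]);
        return first[0] == last[0] && first[1] == last[1];
    }

    private static double[] parsePoint(String point) {
        String[] xy = point.trim().split("\\s+");
        return new double[] { Double.parseDouble(xy[0]), Double.parseDouble(xy[1]) };
    }

    public static void main(String[] args) {
        System.out.println(isClosed("POLYGON((-10 50,2 50,2 55,-10 55,-10 50))"));
        System.out.println(isClosed("POLYGON((-10 50,2 50,2 55,-10 55,-10 51))"));
    }
}
```

Because the polygon is a constant in the query, a check of this kind could in principle be run once before any solutions are evaluated.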

warn/error messages

6:17:45.887 ERROR Points of LinearRing do not form a closed linestring - Illegal WKT literal: POLYGON((-10 50,2 50,2 55,-10 55,-10 51))
6:17:45.887 WARN  General exception in (<http://www.opengis.net/def/function/geosparql/sfWithin> ?geometry "POLYGON((-10 50,2 50,2 55,-10 55,-10 51))"^^<http://www.opengis.net/ont/geosparql#wktLiteral>)
org.apache.jena.datatypes.DatatypeFormatException: Points of LinearRing do not form a closed linestring - Illegal WKT literal: POLYGON((-10 50,2 50,2 55,-10 55,-10 51))
at io.github.galbiston.geosparql_jena.implementation.datatype.WKTDatatype.parse(WKTDatatype.java:109)
at io.github.galbiston.geosparql_jena.implementation.GeometryWrapper.extract(GeometryWrapper.java:905)
at io.github.galbiston.geosparql_jena.implementation.GeometryWrapper.extract(GeometryWrapper.java:834)
at io.github.galbiston.geosparql_jena.geof.topological.GenericFilterFunction.exec(GenericFilterFunction.java:57)
at io.github.galbiston.geosparql_jena.geof.topological.GenericFilterFunction.exec(GenericFilterFunction.java:42)
at
org.apache.j

[jira] [Commented] (JENA-1645) Poor performance with full text search (Lucene)

2018-12-04 Thread Vasyl Danyliuk (JIRA)


[ 
https://issues.apache.org/jira/browse/JENA-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16708490#comment-16708490
 ] 

Vasyl Danyliuk commented on JENA-1645:
--

I have written some code that uses the subject URI as an additional 
constraint, and it works much faster in my case, but I am not sure whether it 
could cause problems in more general cases.



