Re: [Owlim-discussion] Poor performance for Sparql queries with property path optional elements

2012-06-19 Thread Krzysztof Sielski

Hello Ruslan,

Thanks for such an exhaustive answer! For now, we can simply avoid 
optional property path elements in our SPARQL queries and rewrite them 
with UNION instead. That seems to work properly for us and the conversion 
shouldn't be very time-consuming. It is still only a workaround, though, 
and it would be really nice if the Sesame developers fixed this.
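
A minimal sketch of that rewrite, assuming the path is given as a list of steps in which an optional step carries a trailing '?'. Each optional step doubles the number of branches: one with the step and one without it. The class and method names are hypothetical and not part of Sesame or Owlim, and the two query forms may differ in duplicate rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OptionalPathExpander {

    /**
     * Expands a sequence path whose steps may end in '?' into every plain
     * sequence path it stands for. Assumes at least one mandatory step.
     */
    static List<String> expand(List<String> steps) {
        List<String> paths = new ArrayList<>();
        paths.add("");
        for (String step : steps) {
            boolean optional = step.endsWith("?");
            String name = optional ? step.substring(0, step.length() - 1) : step;
            List<String> next = new ArrayList<>();
            for (String prefix : paths) {
                if (optional) {
                    next.add(prefix); // branch that skips the optional step
                }
                next.add(prefix.isEmpty() ? name : prefix + "/" + name);
            }
            paths = next;
        }
        return paths;
    }

    /** Builds a SELECT query with one UNION branch per expanded path. */
    static String toUnionQuery(String subject, List<String> steps, String object) {
        StringBuilder sb = new StringBuilder();
        sb.append("PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n");
        sb.append("SELECT * WHERE {\n");
        List<String> paths = expand(steps);
        for (int i = 0; i < paths.size(); i++) {
            if (i > 0) {
                sb.append("  UNION\n");
            }
            sb.append("  { ").append(subject).append(' ')
              .append(paths.get(i)).append(' ').append(object).append(" }\n");
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        // The optional-path query (q2) from this thread, rewritten as a UNION.
        System.out.println(toUnionQuery("<person://1>",
                Arrays.asList("foaf:knows", "foaf:knows?", "foaf:name"), "?name"));
    }
}
```

For the path foaf:knows/foaf:knows?/foaf:name this yields the two branches foaf:knows/foaf:name and foaf:knows/foaf:knows/foaf:name, which is the shape of the UNION query (q1) from the original report.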


--
Best regards,
Krzysztof Sielski
Poznan Supercomputing and Networking Center


On 2012-06-18 15:53, Ruslan Velkov wrote:

Hi Krzysztof,

Many thanks for reporting this and for providing a test class!

What we introduced in 5.1 is Sesame's QueryJoinOptimizer, which 
rearranges joins so that any sub-select clauses in the query are 
evaluated first and in the best possible order (in terms of the number 
of variables shared by their respective projections). This optimizer 
allows fast and efficient evaluation of nested SELECT clauses.


What I saw as a by-product of this optimizer on q2 was this:

[Original query plan without applying the QueryJoinOptimizer]
Projection
   ProjectionElemList
      ProjectionElem "name"
   Join
      Join
         StatementPattern
            Var (name=-const-1, value=person://1, anonymous)
            Var (name=-const-2, value=http://xmlns.com/foaf/0.1/knows, anonymous)
            Var (name=-const-2-0, anonymous)
         Union
            ZeroLengthPath
               Var (name=-const-2-0, anonymous)
               Var (name=-const-3-1, anonymous)
            StatementPattern
               Var (name=-const-2-0, anonymous)
               Var (name=-const-3, value=http://xmlns.com/foaf/0.1/knows, anonymous)
               Var (name=-const-3-1, anonymous)
      StatementPattern
         Var (name=-const-3-1, anonymous)
         Var (name=-const-4, value=http://xmlns.com/foaf/0.1/name, anonymous)
         Var (name=name)

[Query plan after applying the QueryJoinOptimizer]
Projection
   ProjectionElemList
      ProjectionElem "name"
   Join
      StatementPattern
         Var (name=-const-1, value=person://1, anonymous)
         Var (name=-const-2, value=http://xmlns.com/foaf/0.1/knows, anonymous)
         Var (name=-const-2-0, anonymous)
      Join
         StatementPattern
            Var (name=-const-3-1, anonymous)
            Var (name=-const-4, value=http://xmlns.com/foaf/0.1/name, anonymous)
            Var (name=name)
         Union
            ZeroLengthPath
               Var (name=-const-2-0, anonymous)
               Var (name=-const-3-1, anonymous)
            StatementPattern
               Var (name=-const-2-0, anonymous)
               Var (name=-const-3, value=http://xmlns.com/foaf/0.1/knows, anonymous)
               Var (name=-const-3-1, anonymous)

As you can see, with the first query model we evaluate 
<person://1> <http://xmlns.com/foaf/0.1/knows> -const-2-0, then 
-const-2-0 <http://xmlns.com/foaf/0.1/knows> -const-3-1 with -const-2-0 
already bound, and finally the last statement using the binding for 
-const-3-1. This is the correct ordering. With the second query model 
we evaluate <person://1> <http://xmlns.com/foaf/0.1/knows> -const-2-0, 
then the last statement -const-3-1 <http://xmlns.com/foaf/0.1/name> 
name, and only then the optional second statement. Because the first 
two patterns in this order share no variables, a Cartesian product is 
formed (the first pattern has 999 results and the second one has 23334 
results, hence 23,310,666 iterations, very few of which will succeed).
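
The reasoning above can be sketched in a few lines of Java, assuming each pattern is reduced to the set of variable names it mentions. The class and helper names are hypothetical and this is not how Sesame or Owlim implement join ordering. The check walks the patterns in evaluation order and reports the first one that shares no variable with anything evaluated before it, which is where a Cartesian product appears.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class JoinOrderCheck {

    /**
     * Returns the index of the first pattern that shares no variable with
     * any pattern evaluated before it, or -1 if every pattern is connected.
     */
    static int firstDisconnected(List<Set<String>> patternVars) {
        Set<String> bound = new HashSet<>();
        for (int i = 0; i < patternVars.size(); i++) {
            Set<String> vars = patternVars.get(i);
            if (i > 0 && Collections.disjoint(bound, vars)) {
                return i;
            }
            bound.addAll(vars);
        }
        return -1;
    }

    /** Number of pairs examined when two unconnected patterns are joined. */
    static long cartesianIterations(long leftResults, long rightResults) {
        return Math.multiplyExact(leftResults, rightResults);
    }

    static Set<String> vars(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    public static void main(String[] args) {
        // Variable sets of the three patterns, in the two orders shown above.
        List<Set<String>> original = Arrays.asList(
                vars("-const-2-0"),
                vars("-const-2-0", "-const-3-1"),
                vars("-const-3-1", "name"));
        List<Set<String>> reordered = Arrays.asList(
                vars("-const-2-0"),
                vars("-const-3-1", "name"),
                vars("-const-2-0", "-const-3-1"));

        System.out.println(firstDisconnected(original));      // prints -1
        System.out.println(firstDisconnected(reordered));     // prints 1
        System.out.println(cartesianIterations(999, 23334));  // prints 23310666
    }
}
```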


Unfortunately, there is currently no way to turn this optimizer off, and 
even if there were one, other queries (namely the ones with sub-selects) 
would become much slower.
We could introduce a parameter that switches the optimizer off as a 
workaround, but you may then run into problems with queries that use 
sub-selects. In that case you would have to arrange the sub-selects 
manually and place them first in the query (if you use such queries 
alongside the ones with problematic property path evaluation). Another 
workaround could be an ASK query that switches the optimizer on or off 
at runtime, but this would be rather clumsy: queries can be evaluated 
from multiple threads asynchronously, so you would have no guarantee 
about when exactly the optimizer is in use. The real solution is a fix 
in Sesame (we'll raise the issue with the Sesame developers).


So the fastest solution will be to provide a parameter that statically 
disables the optimizer (i.e. at initialization time). Will that be OK 
in your case?



Hth,
Ruslan




Re: [Owlim-discussion] Poor performance for Sparql queries with property path optional elements

2012-06-19 Thread Ruslan Velkov

Hi Marek,

The new behaviour after we introduced the QueryJoinOptimizer is a 
feature we had been waiting for. It took a long time to be implemented 
and stabilized in Sesame: in the beginning it was implemented as a 
separate query model node (SPARQLIntersection), then this node 
disappeared from the Sesame code base and the implementation moved into 
EvaluationStrategyImpl.evaluate(Join), see 
http://www.openrdf.org/issues/browse/SES-953. This issue was fixed in 
Sesame 2.6.5 and the fix was available in Owlim 5.0. Our release notes 
were already rather large because they covered all bug fixes and new 
features, so we did not list the individual Sesame bug fixes and 
improvements and only mentioned that we use Sesame 2.6.5 (that is why 
you can't find this issue in our release notes). Some of the sub-select 
handling code later went into the QueryJoinOptimizer, but this happened 
after the Owlim 5.0 release and became available in Sesame 2.6.6, from 
where we took that optimizer and introduced it into our code (version 
5.1).


What really improved was the handling of sub-selects. We experimented 
with the BSBM Business Intelligence benchmark in particular, and the 
optimizer gave wonderful results (some of the queries contained 
sub-selects, sub-selects nested in sub-selects, sub-selects connected 
only by a single filter, etc.). I hope this answers your question about 
what was actually improved by this optimizer.


The bad thing about this optimizer is that it touches the whole query 
model, not just the sub-selects, hence the side effects we are 
experiencing. After discussing the issue with Jeen it became clear 
that, apart from several small fixes in Sesame, we may be able to fix 
this in Owlim by supplying statistics to the QueryJoinOptimizer in 
order to influence its decisions when it reorders the query. 
However, the statistics we use in our own query optimization strategy 
are quite different from the ones Sesame uses, so this may take some 
time. If it proves achievable, we'll notify you about the fix.



Cheers,
Ruslan



On 06/18/2012 07:18 PM, Marek Šurek wrote:

Hi Ruslan,
we have a similar issue (see 
http://www.mail-archive.com/owlim-discussion@ontotext.com/msg01626.html), 
which is probably caused by the same thing. I don't understand from 
your response whether the new behaviour is a bug or a feature.
I looked carefully at the release notes and nothing as serious as a 
complete change of the query optimizer, which dramatically changes the 
behaviour of sub-selects, was mentioned! I see (and really appreciate) 
that you are looking for a short-term hotfix by introducing a parameter, 
but for now I don't know what to do. A lot of our business code uses 
sub-selects, as they are a powerful feature. We depend on your answer as 
to whether this feature/bug is a matter of a few days or a permanent 
state. The parameter is great as a hotfix, but I can't test and tune all 
our queries to find out in which case each runs faster, only to start 
again from the beginning after some minor bugfix changes things without 
warning.
As it is not well documented, I would like to ask: what are the positive 
aspects of the new query optimizer? Where can we see improvements? 
What kind of queries should run faster?


Thank you for your time; I am looking forward to your response.

Best regards,
Marek



Re: [Owlim-discussion] Poor performance for Sparql queries with property path optional elements

2012-06-19 Thread Marek Šurek
Thank you Ruslan,
I will revert to OWLIM 5.0 b5123 with Sesame 2.6.5, where everything seems 
fine (except for removeRepository, but I can live with that). I was just 
confused: I had done tons of query testing to learn which queries perform 
well and which perform badly, and suddenly that changed for the worse, so I 
was worried about the fate of the project.

Good luck with the bugfixing; I am looking forward to the next release.

Best regards,
Marek





[Owlim-discussion] Poor performance for Sparql queries with property path optional elements

2012-06-18 Thread Krzysztof Sielski

Hello,

We noticed that after migrating to Owlim SE 5.1.5183 our queries which 
use property paths with optional elements are evaluated very slowly (in 
contrast to previous releases). Their direct equivalents using UNION 
would return the same results much faster. This is an example:


[query for a particular person's friends' names and their friends' names]
(q1)
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
select * WHERE {
  { <person://1> foaf:knows/foaf:name ?name }
    UNION
  { <person://1> foaf:knows/foaf:knows/foaf:name ?name }
}

(q2)
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
select * WHERE {
  <person://1> foaf:knows/foaf:knows?/foaf:name ?name
}

Both queries seem to be equivalent and we used (q2) as it is more 
concise and elegant but now (q1) is much faster:

Executing query q1
Result count: 2 in 0,006000s.
Executing query q2
Result count: 2 in 7,034000s.

As before, I attached a simple class that creates a local repository, 
inserts some data and executes the queries to show you the problem.


--
Best regards,
Krzysztof Sielski
Poznan Supercomputing and Networking Center


import java.io.File;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import org.openrdf.model.BNode;
import org.openrdf.model.Graph;
import org.openrdf.model.Resource;
import org.openrdf.model.URI;
import org.openrdf.model.impl.GraphImpl;
import org.openrdf.model.impl.LiteralImpl;
import org.openrdf.model.impl.StatementImpl;
import org.openrdf.model.impl.URIImpl;
import org.openrdf.model.impl.ValueFactoryImpl;
import org.openrdf.model.vocabulary.RDF;
import org.openrdf.model.vocabulary.RDFS;
import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.config.RepositoryConfig;
import org.openrdf.repository.config.RepositoryConfigException;
import org.openrdf.repository.config.RepositoryConfigSchema;
import org.openrdf.repository.manager.LocalRepositoryManager;
import org.openrdf.repository.manager.RepositoryManager;
import org.openrdf.repository.sail.config.SailRepositorySchema;
import org.openrdf.sail.config.SailConfigSchema;

/**
 *
 * @author Krzysztof Sielski
 */
public class OwlimTestCaseOptionalPathElement {

    private static final String REPO_ID = "repo";
    private static URI foaf_knows = new URIImpl("http://xmlns.com/foaf/0.1/knows");
    private static URI foaf_name = new URIImpl("http://xmlns.com/foaf/0.1/name");
    private static final String q1 = ""
            + "PREFIX foaf: <http://xmlns.com/foaf/0.1/> "
            + "select * WHERE { "
            + "  {<person://1> foaf:knows/foaf:name ?name} "
            + "    UNION "
            + "  {<person://1> foaf:knows/foaf:knows/foaf:name ?name} "
            + "} ";
    private static final String q2 = ""
            + "PREFIX foaf: <http://xmlns.com/foaf/0.1/> "
            + "select * WHERE { "
            + "  <person://1> foaf:knows/foaf:knows?/foaf:name ?name "
            + "} ";

    public static void main(String[] args)
            throws Exception {
        RepositoryManager manager = new LocalRepositoryManager(new File("."));
        manager.initialize();
        try {
            initRepository(manager);
            insertInitialData(manager);
            executeQueries(manager);
        } finally {
            manager.shutDown();
        }
    }

    private static void executeQueries(RepositoryManager manager)
            throws Exception {
        Repository repo = manager.getRepository(REPO_ID);
        repo.initialize();
        RepositoryConnection con = repo.getConnection();
        con.setAutoCommit(false);

        System.out.println("Executing query q1");
        executeQuery(q1, con);
        System.out.println("Executing query q2");
        executeQuery(q2, con);

        con.close();
        repo.shutDown();
    }

    private static void executeQuery(String query, RepositoryConnection con) {
        try {
            TupleQueryResult result =
                    con.prepareTupleQuery(QueryLanguage.SPARQL, query).evaluate();
            int resultCount = 0;
            long time = System.currentTimeMillis();
            while (result.hasNext()) {
                result.next();
                resultCount++;
            }
            time = System.currentTimeMillis() - time;
            System.out.printf("Result count: %d in %fs.\n",
                    resultCount, time / 1000.0);
        } catch (Exception e) {
            e.printStackTrace();

Re: [Owlim-discussion] Poor performance for Sparql queries with property path optional elements

2012-06-18 Thread Marek Šurek
Hi Ruslan,
we have a similar issue (see 
http://www.mail-archive.com/owlim-discussion@ontotext.com/msg01626.html), which 
is probably caused by the same thing. I don't understand from your response 
whether the new behaviour is a bug or a feature.

I looked carefully at the release notes and nothing as serious as a complete 
change of the query optimizer, which dramatically changes the behaviour of 
sub-selects, was mentioned! I see (and really appreciate) that you are looking 
for a short-term hotfix by introducing a parameter, but for now I don't know 
what to do. A lot of our business code uses sub-selects, as they are a powerful 
feature. We depend on your answer as to whether this feature/bug is a matter of 
a few days or a permanent state. The parameter is great as a hotfix, but I 
can't test and tune all our queries to find out in which case each runs faster, 
only to start again from the beginning after some minor bugfix changes things 
without warning.

As it is not well documented, I would like to ask: what are the positive 
aspects of the new query optimizer? Where can we see improvements? What kind of 
queries should run faster?

Thank you for your time; I am looking forward to your response.

Best regards,
Marek



