Hi all,

Happy new year, Jena folks. I am working on the performance of regex in relation 
to triple stores. Since Jena is a good choice to inspect, I dug into the 
code and debugged how SPARQL evaluates regex. 

I'm using the latest source code 
https://svn.apache.org/repos/asf/jena/trunk/jena-arq/

I debugged the class com.hp.hpl.jena.sparql.expr.RegexJava and counted the 
calls ("visits") to its match() method by printing a counter inside it.
I found that the order of the triple patterns within a SPARQL query affects 
how many times match() is called.

For example, take this Turtle file, "example.ttl":
<http://www.example.org/s1><http://www.example.org/p1> "triple1@en".
<http://www.example.org/s1><http://www.example.org/p1> "triple2@en".
<http://www.example.org/s1><http://www.example.org/p2> "triple3@en".
<http://www.example.org/s2><http://www.example.org/p2> "triple4@en".
<http://www.example.org/s2><http://www.example.org/p3> "triple5@en".
<http://www.example.org/s2><http://www.example.org/p3> "triple6@en".


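With the usual model building:
import com.hp.hpl.jena.query.*;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;

// queryString is one of the SPARQL queries shown further down
Model model = ModelFactory.createDefaultModel();
model.read("/Users/user/example.ttl");
Query query = QueryFactory.create(queryString) ;
QueryExecution qexec = QueryExecutionFactory.create(query, model) ;
ResultSet results = qexec.execSelect() ;
ResultSetFormatter.out(System.out, results);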



And the SPARQL query:
String queryString = "select * " +
                "where {" +
                "<http://www.example.org/s1> ?p ?o ." +
                "?s ?p ?o ." +
                "filter regex (?o, \"triple\")" +
                "}";

yields 3 visits, which is what I expect.

However, the query:
String queryString = "select * " +
                "where {" +
                "?s ?p ?o ." +
                "<http://www.example.org/s1> ?p ?o ." +
                "filter regex (?o, \"triple\")" +
                "}";

returns the same result set, yet it prints 6 visits.

Even narrowing the query to something like the following still prints 6 visits:
String queryString = "select * " +
                "where {" +
                "?s ?p ?o ." +
                "<http://www.example.org/s1> <http://www.example.org/p1> ?o ." +
                "<http://www.example.org/s1> ?p ?o ." +
                "filter regex (?o, \"triple\")" +
                "}";



It looks like the regex filter is applied to the matches of the first triple 
pattern only, regardless of the rest of the BGP.
I'm not sure whether the same holds for filters other than regex.
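
To make my guess concrete, here is a small stand-alone Java sketch (no Jena 
involved) of what I think is happening. It assumes the filter is evaluated as 
soon as ?o is bound, i.e. right after the first triple pattern; that is only 
my assumption, not something I have confirmed in the ARQ code. The class and 
method names are made up for illustration, and subjects and predicates are 
abbreviated to s1, p1 and so on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical simulation, not Jena code: counts regex calls for both pattern orders.
final class FilterPlacementSketch {
    // Each row is {subject, predicate, object}, mirroring example.ttl.
    static final String[][] TRIPLES = {
        {"s1", "p1", "triple1@en"},
        {"s1", "p1", "triple2@en"},
        {"s1", "p2", "triple3@en"},
        {"s2", "p2", "triple4@en"},
        {"s2", "p3", "triple5@en"},
        {"s2", "p3", "triple6@en"},
    };
    static final Pattern REGEX = Pattern.compile("triple");
    static int visits = 0;

    // Stand-in for the instrumented match(): bump the counter, then test the regex.
    static boolean match(String object) {
        visits++;
        return REGEX.matcher(object).find();
    }

    // Returns all triples matching a pattern; a null slot acts as a variable.
    static List<String[]> scan(String s, String p, String o) {
        List<String[]> out = new ArrayList<>();
        for (String[] t : TRIPLES) {
            if ((s == null || s.equals(t[0]))
                    && (p == null || p.equals(t[1]))
                    && (o == null || o.equals(t[2]))) {
                out.add(t);
            }
        }
        return out;
    }

    // Evaluates the two-pattern query in the given order and returns the result count.
    // Assumption: the filter runs right after the first pattern binds ?o.
    static int run(boolean constantSubjectFirst) {
        visits = 0;
        String firstSubject = constantSubjectFirst ? "s1" : null;
        String secondSubject = constantSubjectFirst ? null : "s1";
        int results = 0;
        for (String[] t : scan(firstSubject, null, null)) {
            if (!match(t[2])) {
                continue;
            }
            // The second pattern reuses the ?p and ?o bound by the first one.
            results += scan(secondSubject, t[1], t[2]).size();
        }
        return results;
    }

    public static void main(String[] args) {
        int a = run(true);
        System.out.println("s1 pattern first:  " + a + " results, " + visits + " visits");
        int b = run(false);
        System.out.println("?s ?p ?o first:    " + b + " results, " + visits + " visits");
    }
}
```

Under that assumption the sketch gives 3 results for both orders, with 3 
visits when the s1 pattern comes first and 6 visits when "?s ?p ?o" comes 
first, which matches the counts I see in Jena.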


Ideally, I think, SPARQL should reduce the set of tested triples regardless of 
the order of the triple patterns. Since regex is slow by nature, the extra 
calls add further overhead to the overall performance.

Is this a bug, or am I missing something here?

Cheers,
Saud.

