grahamtriggs opened a new pull request #533: Feature/sdborder
URL: https://github.com/apache/jena/pull/533
 
 
   This patch combines a few enhancements to the way SQL queries are formed:
   
   1) The order of variables is retained in the ScopeBase (a HashSet is 
replaced with a LinkedHashSet). When selecting a subject / predicate / object 
from the Triples/Quads table, the order in which the columns are selected can 
make a difference as to whether an SQL engine uses an index or not, and this at 
least makes it predictable. In practice, the order will be subject - predicate - 
object, which allows the SubjPredObj index to be used. Previously, with the 
HashSet losing the order in which the scopes were added, you might get 
predicate - subject - object, which isn't in an existing index.
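
   As a minimal sketch of point 1 (not the actual ScopeBase code): a 
LinkedHashSet iterates in insertion order, so a column list built from it is 
predictable. The class name, the helper methods and the column names s / p / o 
are assumptions made for illustration only.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch, not Jena SDB code: shows why insertion order matters
// when a column list is generated from a set of scoped variables.
class ScopeOrderSketch {

    // Columns are added in subject, predicate, object order.
    // LinkedHashSet guarantees that iteration follows insertion order;
    // a plain HashSet gives no such guarantee.
    static Set<String> tripleColumns() {
        Set<String> columns = new LinkedHashSet<>();
        columns.add("s");
        columns.add("p");
        columns.add("o");
        return columns;
    }

    // Joins the columns in iteration order into a SELECT column list.
    static String selectColumns(Set<String> columns) {
        return String.join(", ", columns);
    }
}
```

   With the LinkedHashSet the generated column list is always "s, p, o", which 
lines up with an index on (subject, predicate, object).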
   
   2) Replace some DISTINCT clauses with GROUP BY clauses, and fix the order 
in which columns are grouped to match an index. SQL engines typically perform 
a GROUP BY faster than a DISTINCT (e.g. in MySQL the optimizer explicitly 
rewrites DISTINCT to GROUP BY when it knows that it can). Again, fixing the 
order of the columns in the GROUP BY improves the likelihood that an index will 
be used for that part of the query.
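
   The rewrite in point 2 can be sketched roughly as follows. This is an 
illustration under assumptions, not the SQL generator from the patch: the 
class, the method, and the table and column names are hypothetical.

```java
import java.util.List;

// Hypothetical sketch, not Jena SDB code: expresses a DISTINCT projection
// as a GROUP BY over the same columns, listed in index order.
class GroupBySketch {

    // "SELECT DISTINCT a, b FROM t" returns the same rows as
    // "SELECT a, b FROM t GROUP BY a, b"; the caller supplies the columns
    // already ordered to match the index it hopes the engine will use.
    static String distinctAsGroupBy(String table, List<String> indexOrderedColumns) {
        String columns = String.join(", ", indexOrderedColumns);
        return "SELECT " + columns + " FROM " + table + " GROUP BY " + columns;
    }
}
```

   Calling it with the table "Triples" and the columns s, p, o yields a 
GROUP BY whose column order matches a (subject, predicate, object) index.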
   
   3) Allow simple ORDER BY clauses to be pushed into the SQL - by simple, I 
mean ordering by a bound variable and without the use of a function.
   
   This is DISABLED by default, because the resulting order can differ from 
the order that the ordering iterator would return; it can be enabled with the 
"optimizeOrderClause" option. Whilst the SQL ordering does not apply the same 
comparisons as the iterator, the order generated by the database should still 
be consistent with the SPARQL definition of ordering.
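
   To make "simple" in point 3 concrete, here is a rough sketch of such a 
check. It is not the test used in the patch: it works on plain strings rather 
than ARQ expressions, and the class and method names are hypothetical. It is 
also deliberately conservative and treats ASC(...) / DESC(...) as not simple.

```java
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch, not Jena SDB code: decides whether every ORDER BY key
// is a plain variable that is bound in the current scope.
class SimpleOrderSketch {

    // A plain SPARQL variable such as ?s, with nothing wrapped around it.
    private static final Pattern PLAIN_VAR =
            Pattern.compile("\\?[A-Za-z_][A-Za-z0-9_]*");

    static boolean isSimple(List<String> orderKeys, Set<String> boundVars) {
        if (orderKeys.isEmpty()) {
            return false; // nothing to push down
        }
        for (String key : orderKeys) {
            String trimmed = key.trim();
            if (!PLAIN_VAR.matcher(trimmed).matches()) {
                return false; // a function call or other expression
            }
            if (!boundVars.contains(trimmed.substring(1))) {
                return false; // variable is not bound in this scope
            }
        }
        return true;
    }
}
```

   Under this sketch, ORDER BY ?s ?p qualifies when s and p are bound, while 
ORDER BY STR(?s) does not and would be left to the Java iterator.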
   
   If you are returning the entire set of results from a query, there is 
generally a negligible performance difference between the Java iterator and 
the SQL clause - it just shifts some of the execution time from the JVM to SQL.
   
   However, passing the ORDER BY into SQL means that any LIMIT / OFFSET can 
also be passed into SQL. So in cases where you are LIMITing the number of rows 
returned by SPARQL, the SQL version will be substantially faster. For example:
   
   SELECT ?s ?p ?o WHERE { ?s ?p ?o } ORDER BY ?s ?p ?o LIMIT 20
   
   On a 500,000-triple store, this takes 12 seconds using the Java iterator, 
and only 1.5 seconds using the SQL ordering.
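
   The shape of the pushed-down query can be sketched as below. This is a 
simplified illustration, not the SQL that SDB generates (the real query also 
has to resolve nodes to their values); the class, the method, the table and 
column names, and the MySQL / PostgreSQL style LIMIT ... OFFSET syntax are all 
assumptions.

```java
import java.util.List;

// Hypothetical sketch, not Jena SDB code: once ORDER BY is in the SQL,
// LIMIT / OFFSET can be appended too, so the database returns only the
// requested rows instead of the JVM sorting the full result set.
class LimitPushdownSketch {

    // limit < 0 means "no LIMIT"; OFFSET is only emitted together with LIMIT.
    static String orderedSql(String table, List<String> orderColumns, long limit, long offset) {
        String columns = String.join(", ", orderColumns);
        StringBuilder sql = new StringBuilder();
        sql.append("SELECT ").append(columns)
           .append(" FROM ").append(table)
           .append(" ORDER BY ").append(columns);
        if (limit >= 0) {
            sql.append(" LIMIT ").append(limit);
            if (offset > 0) {
                sql.append(" OFFSET ").append(offset);
            }
        }
        return sql.toString();
    }
}
```

   For the query above, this would correspond to something of the form 
SELECT s, p, o FROM Triples ORDER BY s, p, o LIMIT 20.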
