Dear all,
Now that 3.0.0 is released, I have started working on optimizing SPARQL
query performance. As you might have noticed, up to 3.0.0 SPARQL was
supported through the iterator interface of Sesame and hence effectively
evaluated in-memory. This could have a considerable performance impact,
especially in cases where the individual statement patterns were not very
selective.
The new implementation now tries to improve the performance by translating
important parts of a SPARQL query like JOIN and FILTER directly into SQL.
The implementation currently supports:
- all JOINs of statement patterns are translated to SQL inner joins (e.g.
?x foaf:knows ?y . ?y foaf:knows ?z)
- most FILTERs on joins and statement patterns, including regex,
langMatches, math expressions, and < > = != on terms, string values, and
numeric values, are translated into SQL WHERE conditions
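To make the two bullets above concrete, here is a minimal sketch in Python with SQLite. The single `triples` table and its column names are assumptions made for illustration and are not Marmotta's actual database schema; the point is only the shape of the generated SQL: one table alias per statement pattern, shared variables as join conditions, and a FILTER as an extra WHERE condition.

```python
import sqlite3

FOAF_KNOWS = "http://xmlns.com/foaf/0.1/knows"

# Hypothetical, simplified schema: one row per triple (not Marmotta's real schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("ex:alice", FOAF_KNOWS, "ex:bob"),
        ("ex:bob", FOAF_KNOWS, "ex:carol"),
        ("ex:bob", FOAF_KNOWS, "ex:alice"),
    ],
)

# ?x foaf:knows ?y . ?y foaf:knows ?z
# Each statement pattern becomes one alias of the triples table; the shared
# variable ?y becomes the join condition t1.object = t2.subject.
JOIN_SQL = """
    SELECT t1.subject AS x, t1.object AS y, t2.object AS z
    FROM triples t1
    INNER JOIN triples t2 ON t1.object = t2.subject
    WHERE t1.predicate = ? AND t2.predicate = ?
"""

# Same join with FILTER(?x != ?z) pushed down as an additional WHERE condition.
FILTER_SQL = JOIN_SQL + " AND t1.subject <> t2.object"

ORDER = " ORDER BY x, z"
join_rows = conn.execute(JOIN_SQL + ORDER, (FOAF_KNOWS, FOAF_KNOWS)).fetchall()
filter_rows = conn.execute(FILTER_SQL + ORDER, (FOAF_KNOWS, FOAF_KNOWS)).fetchall()
```

With the sample data, the plain join also returns the paths that lead back to the starting person, while the filtered variant keeps only the alice, bob, carol path. In both cases the database does the work instead of an in-memory iterator.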
Most constructs are translated into their SQL equivalents. I also try to
optimize very typical regular expressions by replacing them with a LIKE
query if this is possible. For comparisons and math expressions, I first
try to determine the data type to be used for the operator (with the
precedence string > double > int > any) and carry out the comparison on
that type.
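The two ideas in this paragraph can be sketched roughly as follows. This is an illustration, not the code in Marmotta: `regex_to_like` and `operator_type` are hypothetical helper names, the set of "typical" regular expressions handled here (literal text, `^`, `$`, `.*`) is an assumption, and so is reading `string > double > int > any` as a precedence order where the highest-ranked operand type wins.

```python
import re

# Characters with special meaning in a regular expression.
_REGEX_META = set(".^$*+?()[]{}|\\")


def regex_to_like(pattern):
    """Return an SQL LIKE pattern for a simple regex, or None if unsupported.

    Note: regex matching is case-sensitive by default, while LIKE is
    case-insensitive in some databases, so a real translation has to
    take the database dialect and the regex flags into account.
    """
    anchored_start = pattern.startswith("^")
    anchored_end = pattern.endswith("$") and not pattern.endswith("\\$")
    core = pattern[1:] if anchored_start else pattern
    core = core[:-1] if anchored_end else core

    # ".*" maps to the LIKE wildcard "%"; everything else must be literal.
    parts = core.split(".*")
    for part in parts:
        if any(ch in _REGEX_META or ch in "%_" for ch in part):
            return None  # too complex: fall back to a real regex match

    like = "%".join(parts)
    if not anchored_start:
        like = "%" + like
    if not anchored_end:
        like = like + "%"
    return re.sub(r"%+", "%", like)  # collapse repeated wildcards


# Assumed precedence order, highest first.
TYPE_PRECEDENCE = ["string", "double", "int", "any"]


def operator_type(left, right):
    """Pick the operand type that ranks highest in TYPE_PRECEDENCE."""
    return min(left, right, key=TYPE_PRECEDENCE.index)
```

For example, `regex_to_like("^foo")` yields `foo%` and `regex_to_like("^foo.*bar$")` yields `foo%bar`, while a pattern such as `fo+` returns `None` and would have to stay a regular expression match. Under the assumed precedence, `operator_type("int", "double")` picks `double`.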
The implementation currently does NOT optimize:
- value projections and aggregations (i.e. the SELECT part of the query)
- DISTINCT, LIMIT, ORDER BY
- FILTERs with aggregation constructs
- SPARQL 1.1 path expressions
- OPTIONAL statement patterns
However, even if a query cannot be fully optimized (e.g. because it
contains a LIMIT or an OPTIONAL pattern), the optimizer will still
translate to SQL those components of the query that can be optimized. For
example, a query like
    SELECT ?fn ?ln ?nick WHERE {
        ?p foaf:firstName ?fn .
        ?p foaf:lastName ?ln .
        OPTIONAL { ?p foaf:nick ?nick }
    }
would have the join of the first two triple patterns translated to SQL,
and only the OPTIONAL would be evaluated with a second query.
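One possible reading of this split, sketched in Python with SQLite: the two required patterns become a single SQL join, and the OPTIONAL part is looked up separately for each result row. The `triples` table, the sample data, and the per-row lookup strategy are all assumptions for illustration; the actual implementation may issue the second query differently.

```python
import sqlite3

FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical, simplified schema and sample data (not Marmotta's real schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("ex:p1", FOAF + "firstName", "Anna"),
        ("ex:p1", FOAF + "lastName", "Example"),
        ("ex:p1", FOAF + "nick", "anna"),
        ("ex:p2", FOAF + "firstName", "Ben"),
        ("ex:p2", FOAF + "lastName", "Sample"),
    ],
)

# The two required patterns share ?p, so they become one inner join.
REQUIRED_SQL = """
    SELECT t1.subject, t1.object, t2.object
    FROM triples t1
    INNER JOIN triples t2 ON t1.subject = t2.subject
    WHERE t1.predicate = ? AND t2.predicate = ?
    ORDER BY t1.subject
"""

# The OPTIONAL pattern is evaluated separately for each binding of ?p.
OPTIONAL_SQL = "SELECT object FROM triples WHERE subject = ? AND predicate = ?"


def evaluate():
    """Return (fn, ln, nick) rows; nick is None when the OPTIONAL has no match."""
    required = conn.execute(
        REQUIRED_SQL, (FOAF + "firstName", FOAF + "lastName")
    ).fetchall()
    results = []
    for person, first_name, last_name in required:
        nicks = [
            row[0]
            for row in conn.execute(OPTIONAL_SQL, (person, FOAF + "nick"))
        ]
        # OPTIONAL semantics: keep the row even if no nick exists.
        for nick in nicks or [None]:
            results.append((first_name, last_name, nick))
    return results


rows = evaluate()
```

Even in this form the expensive part, the join over the required patterns, is handled by the database, and only the rows that survive it trigger the extra lookup.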
This all sounds great, but unfortunately:
- it is not very well tested, especially not for more complex SPARQL
queries (and complex regular expressions)
- you might not really feel a performance increase in many cases (I have
examples where the performance basically remains the same or is even a bit
slower, and others where it is about 100 times faster)
So if you find the time, please test Marmotta 3.1.0-SNAPSHOT and tell me
how it behaves for you :-)
Greetings,
Sebastian