On 02/06/13 01:16, Paul Tyson wrote:
I'm seeking guidance for setting expectations for TDB sparql performance
as the size and complexity of the queries grows.
The dataset has about 600 million triples, around 200 million
non-literal nodes, about 500 predicates.
I generate sparql queries from logical rules, which as it turns out can
be complicated. In most cases the generated sparql performs acceptably.
But multiple (possibly nested) OPTIONAL and UNION clauses seem to tip
the scales toward poor performance.
Does anyone have stories from experience or pointers to literature on
the following points:
- Effect of multiple and nested OPTIONAL clauses on process growth.
It is hard to make a general comment, but one thing to watch,
especially for machine-generated queries, is the nested optional
pattern:
... ?x ...
OPTIONAL {
... no use of ?x ...
OPTIONAL { ... use of ?x ... }
}
where ARQ detects it can't use a flow-based execution and does it with a
fair amount of brute force.
Often that was intended to be:
... ?x ...
OPTIONAL { ... no use of ?x ... }
OPTIONAL { ... use of ?x ... }
which is, to the execution engine, a very different query.
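To see that the two forms really are different queries, here is a toy model of SPARQL left-join (OPTIONAL) semantics in Python. It is only a sketch, not ARQ code: the function names and the sample bindings p1, p2, p3 are made up for illustration, and FILTER scoping is ignored.

```python
def compatible(a, b):
    # Two solutions are compatible if they agree on every shared variable.
    return all(b[k] == v for k, v in a.items() if k in b)

def left_join(left, right):
    # OPTIONAL: extend each left solution with every compatible right
    # solution, or keep the left solution unchanged if there is none.
    out = []
    for l in left:
        matches = [{**l, **r} for r in right if compatible(l, r)]
        out.extend(matches if matches else [l])
    return out

# Hypothetical solutions of the three patterns.
p1 = [{"x": 1}, {"x": 2}]      # ... ?x ...
p2 = [{"y": "a"}]              # ... no use of ?x ...
p3 = [{"x": 1, "z": "p"}]      # ... use of ?x ...

# P1 OPTIONAL { P2 OPTIONAL { P3 } }
nested = left_join(p1, left_join(p2, p3))

# P1 OPTIONAL { P2 } OPTIONAL { P3 }
sequential = left_join(left_join(p1, p2), p3)

print(nested)
print(sequential)
```

In this toy data the ?x = 2 row keeps its ?y binding in the sequential form but loses it in the nested form, because the inner OPTIONAL has already fixed ?x = 1 before the outer join is attempted.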
- What sort of process growth do additional UNION clauses cause: linear,
logarithmic, exponential?
Linear - UNION is implemented as
execute-left-hand-side-then-execute-right-hand-side.
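As a rough picture of why the growth is linear, UNION can be modelled as concatenating the results of each branch in turn. This is a Python sketch under that assumption, not the ARQ implementation; union and make_branch are hypothetical names.

```python
from itertools import chain

def union(*branches):
    # Each branch is a zero-argument callable yielding solutions.
    # Branches are run one after another and their results concatenated;
    # duplicates are kept, as with SPARQL UNION.
    return chain.from_iterable(branch() for branch in branches)

calls = []  # records the order in which branches are evaluated

def make_branch(name, rows):
    def run():
        calls.append(name)
        yield from rows
    return run

results = list(union(
    make_branch("left", [{"x": 1}]),
    make_branch("right", [{"x": 2}]),
    make_branch("third", [{"x": 1}]),
))

print(calls)
print(results)
```

The total work is the sum of the work of the branches, so adding a branch adds the cost of that branch and nothing more.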
- Does the absolute number of variables or predicates mentioned in the
query affect performance as much as the complexity of the graph
patterns?
Highly unlikely.
- effect of complex, non-normalized FILTER expressions
That's quite hard to say without talking about a specific query, but
FILTERs on numbers, e.g. FILTER(?x < 5 && ?y > 6), are fast. FILTERs
involving strings need to fetch the lexical form of the value and are
slower.
The TDB dataset is on a fairly powerful machine. I'm not so much
interested in absolute performance numbers, as I am in relative
performance of different queries on the same dataset and platform.
Thanks,
--Paul
Hope that helps a bit - it's hard to make general statements without
seeing data and queries.
Andy