On 02/06/13 01:16, Paul Tyson wrote:
I'm seeking guidance for setting expectations for TDB sparql performance
as the size and complexity of the queries grows.

The dataset has about 600 million triples, around 200 million
non-literal nodes, about 500 predicates.

I generate sparql queries from logical rules, which as it turns out can
be complicated. In most cases the generated sparql performs acceptably.
But multiple (possibly nested) OPTIONAL and UNION clauses seem to tip
the scales toward poor performance.

Does anyone have stories from experience or pointers to literature on
the following points:

- Effect of multiple and nested OPTIONAL clauses on process growth.

It is hard to make a general comment on this.

One thing to watch, especially for machine-generated queries, is the nested optional pattern:

... ?x ...
  OPTIONAL {
    ... no use of ?x ...
      OPTIONAL { ... use of ?x ... }
  }


where ARQ detects it can't use a flow-based execution and does it with a fair amount of brute force.

Often that was intended to be:

... ?x ...
OPTIONAL { ... no use of ?x ... }
OPTIONAL { ... use of ?x ... }

which is, to the execution engine, a very different query.
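
To make the two shapes concrete, here is a made-up illustration. The ex: prefix and all the predicates are hypothetical, not taken from your data. In the nested form, ?x appears outside the OPTIONAL and in the inner OPTIONAL, but not in the outer OPTIONAL's own pattern:

```sparql
PREFIX ex: <http://example.org/>

# Nested form: the outer OPTIONAL block never mentions ?x,
# but the OPTIONAL nested inside it does.
SELECT * WHERE {
  ?x ex:partOf ?a .
  OPTIONAL {
    ?a ex:owner ?o .                  # no use of ?x
    OPTIONAL { ?x ex:supplier ?s }    # use of ?x
  }
}
```

If the rule really meant two independent optional parts, the flat form would look like this:

```sparql
PREFIX ex: <http://example.org/>

# Flat form: two sibling OPTIONALs, each applied to the
# solutions flowing in from the patterns before it.
SELECT * WHERE {
  ?x ex:partOf ?a .
  OPTIONAL { ?a ex:owner ?o }
  OPTIONAL { ?x ex:supplier ?s }
}
```

The two are not guaranteed to return the same results, so check which one the rule actually intends before changing the generator.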


- What sort of process growth do additional UNION clauses cause: linear,
logarithmic, exponential?

Linear - UNION is implemented as execute-left-hand-side-then-execute-right-hand-side.
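
As a sketch with hypothetical predicates (the ex: names are invented for illustration), each branch below is evaluated in turn, so adding a third branch adds roughly the cost of that branch on its own:

```sparql
PREFIX ex: <http://example.org/>

SELECT ?x ?name WHERE {
  { ?x ex:name ?name }       # branch 1
  UNION
  { ?x ex:altName ?name }    # branch 2
  UNION
  { ?x ex:code ?name }       # branch 3
}
```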

- Does the absolute number of variables or predicates mentioned in the
query affect performance as much as the complexity of the graph
patterns?

Highly unlikely.

- effect of complex, non-normalized FILTER expressions

That's quite hard to say without talking about a specific query, but FILTERs on numbers, e.g. FILTER(?x < 5 && ?y > 6), are fast. FILTERs involving strings need to fetch the lexical form of the value and are slower.
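
A made-up example of the two kinds of FILTER side by side (the ex: predicates are hypothetical):

```sparql
PREFIX ex: <http://example.org/>

SELECT ?item WHERE {
  ?item ex:weight ?w ;
        ex:label  ?label .
  # Numeric comparison: the fast kind of test.
  FILTER (?w < 5)
  # String test: needs the lexical form of ?label, so it is slower.
  FILTER (regex(str(?label), "^bolt", "i"))
}
```

How much the string test costs in practice depends on how many solutions reach it, so it is worth comparing timings with and without it on your own data.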

The TDB dataset is on a fairly powerful machine. I'm not so much
interested in absolute performance numbers, as I am in relative
performance of different queries on the same dataset and platform.

Thanks,
--Paul

Hope that helps a bit - it's hard to make general statements without seeing data and queries.

        Andy
