[
https://issues.apache.org/jira/browse/JENA-441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rob Vesse updated JENA-441:
---------------------------
Attachment: order-distinct.xml
order-distinct.txt
order-distinct.csv
Find attached figures for running a variety of variants on this query, this was
run against a 250k triple database (SP2B) using TDB via Fuseki with 4GB RAM
I covered ORDER BY + DISTINCT and ORDER BY + REDUCED using them both on the
same level and with the DISTINCT/REDUCED in a sub-query. I also added cases
which add a LIMIT, the LIMIT is on the sub-query for the sub-query forms.
For those who don't want to read all the attachments I'll summarize as follows:
ORDER BY + DISTINCT on same level = 5.7s
ORDER BY + DISTINCT in sub-query = 0.3s
ORDER BY + REDUCED on same level = 5.6s
ORDER BY + REDUCED in sub-query = 3.1s
Interestingly if we add LIMIT we get more marked differences in performance:
ORDER BY + DISTINCT and LIMIT on same level = 5.4s
ORDER BY + DISTINCT and LIMIT in sub-query = 0.007s
ORDER BY + REDUCED and LIMIT on same level = 5.4s
ORDER BY + REDUCED and LIMIT in sub-query = 0.005s
So clearly being able to do this optimization is valuable even if it applies in
a small number of cases. I will look at writing a transform and experiment
further.
> In some cases it may be useful to apply DISTINCT before applying ORDER BY
> -------------------------------------------------------------------------
>
> Key: JENA-441
> URL: https://issues.apache.org/jira/browse/JENA-441
> Project: Apache Jena
> Issue Type: Improvement
> Components: ARQ, Optimizer
> Affects Versions: Jena 2.10.0
> Reporter: Rob Vesse
> Assignee: Rob Vesse
> Priority: Minor
> Labels: distinct, optimization, order, project
> Attachments: order-distinct.csv, order-distinct.txt,
> order-distinct.xml
>
>
> One of our internal users highlighted an interesting query where changing the
> plan makes a big difference in performance.
> The query is essentially the following:
> SELECT DISTINCT ?p
> WHERE
> {
> ?s ?p ?o
> } ORDER BY ?p
> Leaving the fact that it is a fundamentally dumb query to write the user had
> an interesting suggestion about the query plan, currently this generates the
> following:
> (distinct
> (project (?predicate)
> (order (?predicate)
> (quadpattern (quad <urn:x-arq:DefaultGraphNode> ?s ?predicate ?o)))))
> For cases like this it may actually be much more performant to do the
> distinct first, because of the associated semantics of the various operators
> you can't just simply put distinct before the order but if you rewrite the
> query as follows:
> SELECT ?p
> WHERE
> {
> { SELECT DISTINCT ?p WHERE { ?s ?p ?o } }
> } ORDER BY ?p
> You get the likely much more performant plan:
> (project (?predicate)
> (order (?predicate)
> (distinct
> (project (?predicate)
> (quadpattern (quad <urn:x-arq:DefaultGraphNode> ?/s ?predicate
> ?/o))))))
> Clearly this optimization does not apply in the general case, I think it only
> applies in the case where you have a DISTINCT and all ORDER BY conditions are
> simple variables and only those variables are projected in which case I think
> you could produce a plan like the following:
> (order (?predicate)
> (project (?predicate)
> (distinct
> (quadpattern (quad <urn:x-arq:DefaultGraphNode> ?s ?predicate ?o)))))
> I am pretty certain this applies in only a few cases and I haven't reproduced
> this with TDB to see if the performance difference is noticeable yet.
> Andy - I will try and gather some more information and experiment with this
> so don't feel you have to look at this one for the time being.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira