[
https://issues.apache.org/jira/browse/JENA-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083102#comment-13083102
]
Paolo Castagna commented on JENA-90:
------------------------------------
Let's take, for example, this SPARQL query:
    SELECT DISTINCT *
    WHERE
      { ?s ?p ?o }
    ORDER BY ?p
    LIMIT 10
The corresponding algebra expression is:
    (slice _ 10
      (distinct
        (order (?p)
          (bgp (triple ?s ?p ?o)))))
Which is equivalent to:
    (slice _ 10
      (reduced
        (order (?p)
          (bgp (triple ?s ?p ?o)))))
However, the distinct or reduced operators prevent the optimization described
in JENA-89. Maybe we can modify the 'top' operator to yield only distinct
bindings, or add a new 'top_distinct' operator for that:
    (top_distinct (10 ?p ?s)
      (bgp (triple ?s ?p ?o)))
SPARQL queries of the type SELECT DISTINCT ... WHERE {...} ORDER BY ... LIMIT
10 are common when people want to display the 10 most 'something' things in
their dataset.
The implementation of a QueryIterTopNDistinct is almost the same as
QueryIterTopN (see: JENA-89) but we add bindings to the PriorityQueue if and
only if they are not already there (using .contains() to check).
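To make the idea concrete, here is a minimal sketch in plain Java. It is not ARQ code: the class TopNDistinct and the method topNDistinct are hypothetical names, bindings are modelled as a generic type T, and the comparator stands in for the ORDER BY conditions. It assumes a bounded max-heap of size N, as a top-N iterator would typically use, and skips any item that is already in the heap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; not the ARQ QueryIterTopN implementation.
final class TopNDistinct {

    /** Returns the n smallest distinct items according to cmp, in ascending order. */
    static <T> List<T> topNDistinct(Iterable<T> input, int n, Comparator<T> cmp) {
        List<T> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        // Max-heap: the head is the worst item currently kept.
        PriorityQueue<T> heap = new PriorityQueue<>(n, cmp.reversed());
        for (T item : input) {
            // Skip items already kept; contains() uses equals() and is a linear scan.
            if (heap.contains(item)) {
                continue;
            }
            if (heap.size() < n) {
                heap.add(item);
            } else if (cmp.compare(item, heap.peek()) < 0) {
                // The new item beats the current worst: replace it.
                heap.poll();
                heap.add(item);
            }
        }
        result.addAll(heap);
        result.sort(cmp);
        return result;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(5, 3, 5, 1, 3, 9, 1, 7);
        // prints [1, 3, 5]
        System.out.println(topNDistinct(input, 3, Comparator.<Integer>naturalOrder()));
    }
}
```

Checking only the heap is enough in this sketch: once the heap is full its worst element can only improve, so an item that was rejected or evicted earlier can never be admitted again later. The cost is that PriorityQueue.contains() is linear in N, which should be acceptable for small limits such as 10.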
Is it worth adding a top_distinct operator, or would it just pollute the algebra?
> Use OpReduce instead of OpDistinct for DISTINCT + ORDER BY queries
> ------------------------------------------------------------------
>
> Key: JENA-90
> URL: https://issues.apache.org/jira/browse/JENA-90
> Project: Jena
> Issue Type: Improvement
> Components: ARQ
> Reporter: Paolo Castagna
> Priority: Trivial
> Labels: arq, optimizer, sparql
>
> ARQ's optimizer could use an OpReduce instead of OpDistinct if a query is
> DISTINCT + ORDER BY.
> OpReduce removes adjacent duplicates and it does not require a set of already
> seen bindings as the current OpDistinct implementation does.