[jira] [Updated] (JENA-108) Optimise DISTINCT + ORDER BY + LIMIT queries

Paolo Castagna (JIRA) Fri, 02 Sep 2011 03:08:50 -0700

     [ 
https://issues.apache.org/jira/browse/JENA-108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Paolo Castagna updated JENA-108:
--------------------------------

    Description: 
When users what to find the top 10 most 'something' 'things' in their data 
write queries of this form: SELECT DISTINCT * { ... } ORDER BY ... LIMIT 10 
We can add a specific low level optimization for this type of queries using a 
new QueryIterTopNDistinct iterator.

This SPARQL query: 
SELECT * { ... } ORDER BY ... LIMIT 10
has this algebra:
(slice _ 10
  (order (...)
    (bgp (...))))
which is now transformed into (see JENA-89):
(top (10 ...)
  (bgp (...)))

This query: 
SELECT DISTINCT * { ... } ORDER BY ... 
has this algebra:
(distinct
  (order (...)
    (bgp (...))))
which is now transformed into (see JENA-90):
(reduced
  (order (...)
    (bgp (...))))

What about: 
SELECT DISTINCT * { ... } ORDER BY ... LIMIT 10 ?
The algebra expression is:
(slice _ 10
  (distinct
    (order (?p ?o)
      (bgp (triple ?s ?p ?o)))))
which is (currently) transformed into:
(slice _ 10
  (reduced
    (order (?p ?o)
      (bgp (triple ?s ?p ?o)))))

The optimization in JENA-90 kicks in, but it would be better to use the top 
operator for small limits. 
However, it must be a slightly different top implementation (which accepts only 
distinct bindings). 

As Andy suggested in a comment on JENA-90 (12 August 2011), we can transform 
the algebra into:
(top (10 ...)
  (distinct
    (bgp (...))))
and then use a QueryIterTopNDistinct when OpExecutor(OpTop) sees an 
opTop.getSubOp() instanceof OpDistinct.




  was:
This SPARQL query: 
SELECT * { ... } ORDER BY ... LIMIT 10
has this algebra:
(slice _ 10
  (order (...)
    (bgp (...))))
which is now transformed into (see JENA-89):
(top (10 ...)
  (bgp (...)))

This query: 
SELECT DISTINCT * { ... } ORDER BY ... 
has this algebra:
(distinct
  (order (...)
    (bgp (...))))
which is now transformed into (see JENA-90):
(reduced
  (order (...)
    (bgp (...))))

What about: 
SELECT DISTINCT * { ... } ORDER BY ... LIMIT 10 ?
The algebra expression is:
(slice _ 10
  (distinct
    (order (?p ?o)
      (bgp (triple ?s ?p ?o)))))
which is (currently) transformed into:
(slice _ 10
  (reduced
    (order (?p ?o)
      (bgp (triple ?s ?p ?o)))))

The optimization in JENA-90 kicks in, but it would be better to use the top 
operator for small limits. 
However, it must be a slightly different top implementation (which accepts only 
distinct bindings). 

As Andy suggested in a comment on JENA-90 (12 August 2011), we can transform 
the algebra into:
(top (10 ...)
  (distinct
    (bgp (...))))
and then use a QueryIterTopNDistinct when OpExecutor(OpTop) sees an 
opTop.getSubOp() instanceof OpDistinct.





> Optimise DISTINCT + ORDER BY + LIMIT queries
> --------------------------------------------
>
>                 Key: JENA-108
>                 URL: https://issues.apache.org/jira/browse/JENA-108
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: optimization, sparql
>
> When users what to find the top 10 most 'something' 'things' in their data 
> write queries of this form: SELECT DISTINCT * { ... } ORDER BY ... LIMIT 10 
> We can add a specific low level optimization for this type of queries using a 
> new QueryIterTopNDistinct iterator.
> This SPARQL query: 
> SELECT * { ... } ORDER BY ... LIMIT 10
> has this algebra:
> (slice _ 10
>   (order (...)
>     (bgp (...))))
> which is now transformed into (see JENA-89):
> (top (10 ...)
>   (bgp (...)))
> This query: 
> SELECT DISTINCT * { ... } ORDER BY ... 
> has this algebra:
> (distinct
>   (order (...)
>     (bgp (...))))
> which is now transformed into (see JENA-90):
> (reduced
>   (order (...)
>     (bgp (...))))
> What about: 
> SELECT DISTINCT * { ... } ORDER BY ... LIMIT 10 ?
> The algebra expression is:
> (slice _ 10
>   (distinct
>     (order (?p ?o)
>       (bgp (triple ?s ?p ?o)))))
> which is (currently) transformed into:
> (slice _ 10
>   (reduced
>     (order (?p ?o)
>       (bgp (triple ?s ?p ?o)))))
> The optimization in JENA-90 kicks in, but it would be better to use the top 
> operator for small limits. 
> However, it must be a slightly different top implementation (which accepts 
> only distinct bindings). 
> As Andy suggested in a comment on JENA-90 (12 August 2011), we can transform 
> the algebra into:
> (top (10 ...)
>   (distinct
>     (bgp (...))))
> and then use a QueryIterTopNDistinct when OpExecutor(OpTop) sees an 
> opTop.getSubOp() instanceof OpDistinct.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (JENA-108) Optimise DISTINCT + ORDER BY + LIMIT queries

Reply via email to