[jira] [Commented] (JENA-89) Avoid a total sort for ORDER BY + LIMIT queries

Andy Seaborne (JIRA) Sat, 13 Aug 2011 09:56:49 -0700

    [ 
https://issues.apache.org/jira/browse/JENA-89?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084641#comment-13084641
 ]


Andy Seaborne commented on JENA-89:
-----------------------------------

In creating the tests, a bug was discovered.

The heap needs to keep the least N items. So when a new item to keep comes in, 
the greatest item so far needs to be evicted, not the least.  In other words, 
the heap is kept in maximum order and the test to to look at the max of the N 
items (heap top) to see if the new binding is less than the max least N so far.

Patch attached.  heap uses a reversed comparator.

Also (minor) avoid a copy by Arrays.asList.


> Avoid a total sort for ORDER BY + LIMIT queries
> -----------------------------------------------
>
>                 Key: JENA-89
>                 URL: https://issues.apache.org/jira/browse/JENA-89
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: arq, optimization, scalability, sparql
>         Attachments: ARQ_JENA-89_r1156212.patch, 
> JENA-89-Top-FixHeapCollation.patch, JENA-89-TopTests.patch, 
> JENA-89-TopTests.patch
>
>
> In case of SPARQL queries with ORDER BY + LIMIT, ARQ sorts the entire result 
> set and then produces the first N, according to the specified LIMIT.
> As an alternative, discussed on jena-dev [1], we can use a PriorityQueue [2] 
> (which is based on a priority heap) to avoid a sort operation.
> ARQ's algebra package contains already a OpTopN [3] operator. The OpExecutor 
> [4] will need to use a new QueryIterTopN instead of QueryIterSort + 
> QueryIterSlice. A new TransformOrderByLimit to be used by Optimize is also 
> necessary.
> ORDER BY + LIMIT queries are typically used to construct the first page when 
> results are paginated. Then the following query is ORDER BY + OFFSET + LIMIT. 
> (Often users stop at the first page). Ideally, we could cache the ORDER BY 
> and implement the OFFSET|LIMIT using results from the cache. However, the 
> improvement described by this issue is limited to the ORDER BY + LIMIT case 
> for which a priority heap is a good enough solution.
> Hopefully, this would improve the scalability of ORDER BY + LIMIT queries in 
> case of small values specified on the LIMIT. 
>   [1] http://markmail.org/thread/5d2gtazkoxsa2ayv
>   [2] 
> http://download.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html
>   [3] 
> https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/algebra/op/OpTopN.java
>   [4] 
> https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/trunk/src/com/hp/hpl/jena/sparql/engine/main/OpExecutor.java

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (JENA-89) Avoid a total sort for ORDER BY + LIMIT queries

Reply via email to