[ https://issues.apache.org/jira/browse/DERBY-3002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521308 ]
Bryan Pendleton commented on DERBY-3002: ---------------------------------------- Hi Manish thanks for reading the diff. You are absolutely right that it is possible to compute the rollups with a single sort, but I haven't made that optimization yet. I shall work on improving the wiki to make this clearer. As I see it, there are basically 3 ways to perform grouped aggregation, and numerous ways to combine these three variants. 1) If the records to be grouped are already in sorted order, perhaps due to a scan of an index, or a previous merge-sort-join operation, or a duplicate-eliminating distinct sort, then the grouping operation can be implemented by comparing each pair of records on the grouping attributes; each time a new value of the grouping attributes is encountered, the computation of the previous group can be finalized and emitted. This is the algorithm discussed in the wiki page, and is currently implemented by Derby for a single set of grouping attributes; I have not yet constructed a patch to extend this logic for a ROLLUP set of groups. 2) An unsorted set of records can be grouped by feeding them into the sorter, configured for duplicate elimination, with a sort observer which aggregates each row's values into the retained row's columns during duplicate processing. Each such sort groups the records by a specific set of grouping attributes. The patch that I constructed uses N sorts to group the records by N different sets of grouping attributes. The input rows are read only once, but then are added to each of the N sorts. This is dramatically more efficient than the alternate query-rewriting algorithm of constructing N GROUP BY statements and then UNION'ing their results together, since in this implementation the input rows are materialized and read only once. 3) An unsorted set of records can also be grouped by feeding them into a hash table, whose keys are the grouping attributes, with a hash-collision observer which aggregates each row's values into the retained row's columns during duplicate key elimination. This is fundamentally similar to the sort-based technique, with the important difference that at the end of the processing, the rows will NOT be emitted in sorted order. Derby currently implements techniques (1) and (2) for grouped aggregate computation; I am not aware of any code which implements a hash-based technique for grouped aggregates in Derby. As Manish notes, techniques (1) and (2) can be combined into an algorithm which computes the N ROLLUP groups using a single sort, by first sorting (and aggregating) the rows on their finest level of grouping detail. Then, as the rows are being retrieved in sorted order, the code could detect when a new value of grouping attributes is being retrieved, and could automatically compute that higher level group at that time. Thus if we are computing the GROUP BY ABCD, ABC, AB, and A, we can: - sort (and group and aggregate) the data on ABCD - For each row we retrieve, we can propagate it up the pipeline to compute the appropriate ABC group, the appropriate AB group, the appropriate A group, and the appropriate all-rows group. This algorithm seems to hold the potential for being quite efficient, but I haven't investigated the actual coding yet, and for the time being as an experiment I investigated the easier-to-code but less efficient multiple-sorts algorithm. > Add support for GROUP BY ROLLUP > ------------------------------- > > Key: DERBY-3002 > URL: https://issues.apache.org/jira/browse/DERBY-3002 > Project: Derby > Issue Type: Sub-task > Components: SQL > Affects Versions: 10.4.0.0 > Reporter: Bryan Pendleton > Assignee: Bryan Pendleton > Priority: Minor > Attachments: prototypeChangeNoTests.diff > > > Provide an implementation of the ROLLUP form of multi-dimensional grouping > according to the SQL standard. > See http://wiki.apache.org/db-derby/OLAPRollupLists for some more detailed > information about this aspect of the SQL standard. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.