Jacques Nadeau created DRILL-3910:
-------------------------------------

             Summary: Leverage Calcite's Clustered Collation
                 Key: DRILL-3910
                 URL: https://issues.apache.org/jira/browse/DRILL-3910
             Project: Apache Drill
          Issue Type: Improvement
          Components: Query Planning & Optimization
            Reporter: Jacques Nadeau


Right now the streaming aggregate requires a full collation. I was just talking
to [~julianhyde] and he pointed out that Calcite has a Clustered variant of
Collation (similar to what MSSQL calls Segment). Realistically, the streaming
aggregate only requires a clustered collation, and we should switch to
requiring that. We should also go through the existing operators and track
whether or not each one maintains a clustered collation. We should then be
able to have flatten produce output that is clustered on the carry-through
fields. This will let us take better advantage of the clusteredness of data
in subsequent operations. Flatten should also produce data that exposes the
distribution trait on the carry-through fields. This means that a query like
this:

select a, count(b) from (
  select a, flatten(x) as b from t
)x
group by a

should be executed without redistributing the data.
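
To illustrate the idea outside of Drill, here is a minimal sketch in Python (not Drill's Java, and not the Calcite API). It assumes that "clustered" means rows with equal keys are adjacent but the groups themselves are in no particular order. The helper names flatten and streaming_count and the contents of table t are hypothetical, made up for this example.

```python
from itertools import groupby


def flatten(rows, carry, repeated):
    """Emit one row per element of row[repeated], copying the carry-through field.

    Output rows for one input row are emitted together, so if the input is
    clustered on `carry`, the output is clustered on `carry` as well.
    """
    for row in rows:
        for item in row[repeated]:
            yield {carry: row[carry], "b": item}


def streaming_count(rows, key, value):
    """Count non-null values per run of equal keys, one group at a time.

    Correct only when equal keys are adjacent (clustered). The runs do not
    need to arrive in sorted order.
    """
    for k, run in groupby(rows, key=lambda r: r[key]):
        yield k, sum(1 for r in run if r[value] is not None)


# Hypothetical table t: clustered on "a" (equal values adjacent), not sorted.
t = [
    {"a": 3, "x": [10, 11]},
    {"a": 3, "x": [12]},
    {"a": 1, "x": [20, 21, 22]},
    {"a": 2, "x": [30]},
]

# select a, count(b) from (select a, flatten(x) as b from t) group by a
result = list(streaming_count(flatten(t, "a", "x"), "a", "b"))
print(result)  # [(3, 3), (1, 3), (2, 1)]
```

The aggregate never sorts or buffers more than one group, yet each value of a appears exactly once in the result, because flatten kept the carry-through field clustered. If the input were not clustered on a, the same loop would emit a given key more than once, which is why the planner has to track the trait through each operator.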



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
