Dayue Gao created KYLIN-2501: -------------------------------- Summary: Stream Aggregate GTRecords at Query Server Key: KYLIN-2501 URL: https://issues.apache.org/jira/browse/KYLIN-2501 Project: Kylin Issue Type: Improvement Components: Query Engine, Storage - HBase Affects Versions: v1.6.0 Reporter: Dayue Gao Assignee: Dayue Gao
*Problem* When query server needs to handle millions of records from storage, CubeTupleConverter could become performance bottleneck. An experiment shows that converting 5 millions records takes ~11s, which accounts for 50% of the total query time. *Motivation* Records returned from each storage partition is guaranteed to be ordered. Therefore we could reduce the number of records passed to CubeTupleConverter by # merge sorted records from all partitions, similar to what we have done in KYLIN-1787 # use a [stream aggregate|https://blogs.msdn.microsoft.com/craigfr/2006/09/13/stream-aggregate/] algorithm on merged stream to aggregate those records with the same key *Proposal* # Add a new physical operator GTStreamAggregateScanner which implements the stream aggregate algorithm # Refine SortedIteratorMergerWithLimit that was used to merge sort records from different partitions. The previous implementation has performance issues (KYLIN-2483) due to expensive record clone # Leverage GTStreamAggregateScanner to aggregate records on merged stream *Scope* Stream aggregate has some good properties such as low memory usage and streamable ordered outputs, making it better than hash/sort based alternatives when input is already sorted. So I bet the new GTStreamAggregateScanner operator can also be used to accelerating cubing and coprocessor in certain cases. I will leave it as future works. -- This message was sent by Atlassian JIRA (v6.3.15#6346)