[ https://issues.apache.org/jira/browse/KYLIN-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929370#comment-15929370 ]
Dayue Gao commented on KYLIN-2501: ---------------------------------- Code pushed to KYLIN-2501 branch, ready to be review. [~mahongbin] > Stream Aggregate GTRecords at Query Server > ------------------------------------------ > > Key: KYLIN-2501 > URL: https://issues.apache.org/jira/browse/KYLIN-2501 > Project: Kylin > Issue Type: Improvement > Components: Query Engine, Storage - HBase > Affects Versions: v1.6.0 > Reporter: Dayue Gao > Assignee: Dayue Gao > > *Problem* > When query server needs to handle millions of records, CubeTupleConverter > could become performance bottleneck. > An experiment shows that converting 5 millions records takes ~11s, which > accounts for 50% of the total query time. > *Motivation* > Records returned from each storage partition is guaranteed to be ordered. > Therefore we could reduce the number of records passed to CubeTupleConverter > by > # merge sorted records from all partitions, similar to what we have done in > KYLIN-1787 > # use a [stream > aggregate|https://blogs.msdn.microsoft.com/craigfr/2006/09/13/stream-aggregate/] > algorithm on merged stream to aggregate those records with the same key > *Proposal* > # Add a new physical operator GTStreamAggregateScanner which implements the > stream aggregate algorithm > # Refine SortedIteratorMergerWithLimit that was used to merge sort records > from different partitions. The previous implementation has performance issues > (KYLIN-2483) due to expensive record clone > # Leverage GTStreamAggregateScanner to aggregate records on merged stream > *Scope* > Stream aggregate has some good properties such as low memory usage and > streamable ordered outputs, making it better than hash/sort based > alternatives when input is already sorted. So I bet the new > GTStreamAggregateScanner operator can also be used to accelerate cubing and > coprocessor aggregation in certain cases. I'll focus on query server in this > jira and leave those optimizations as future works. -- This message was sent by Atlassian JIRA (v6.3.15#6346)