Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/10771 )
Change subject: PREVIEW: IMPALA-110: Support for multiple DISTINCT ...................................................................... Patch Set 1: (7 comments) Did a quick initial pass over the backend part of this. http://gerrit.cloudera.org:8080/#/c/10771/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/10771/1//COMMIT_MSG@60 PS1, Line 60: - Ran hdfs/core tests Can we add some spilling tests too? http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/aggregation-node.cc File be/src/exec/aggregation-node.cc: http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/aggregation-node.cc@118 PS1, Line 118: // Separate input batch into mini batches destined for the different aggs. I feel like there's a lot of unnecessary overhead in the pre-partitioning into mini-batches (the approach in general and maybe some of the specifics of how this code uses the RowBatch APIs). I'd be interested to see how much time is spent here for a high-NDV aggregation to see if it's worth optimising. We could avoid the construction of the intermediate batches if we pushed down logic into AddBatch so that it consumed rows from 'batch' up until it saw one that didn't match its index. http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator-ir.cc File be/src/exec/grouping-aggregator-ir.cc: http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator-ir.cc@171 PS1, Line 171: ++streaming_idx_; Let's hoist the store to 'streaming_idx_' out of the loop. http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator-ir.cc@190 PS1, Line 190: out_batch->ClearRow(row); This seems expensive to include in the inner loop. Can we avoid doing this entirely in the non-multiagg case? And do it efficiently with a memset() before calling this function in the multiagg case. http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator-ir.cc@191 PS1, Line 191: agg_idx_ Let's hoist this load out of the loop too. Can we codegen this out for the non-multi-agg case? http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator.h File be/src/exec/grouping-aggregator.h: http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator.h@296 PS1, Line 296: int32_t streaming_idx_; Comments? http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator.cc File be/src/exec/grouping-aggregator.cc: http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator.cc@285 PS1, Line 285: row_batch->ClearRow(row); This seems like unnecessary overhead. -- To view, visit http://gerrit.cloudera.org:8080/10771 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I055402eaef6d81e5f70e850d9f8a621e766830a4 Gerrit-Change-Number: 10771 Gerrit-PatchSet: 1 Gerrit-Owner: Thomas Marshall <thomasmarsh...@cmu.edu> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Comment-Date: Sat, 21 Jul 2018 01:13:08 +0000 Gerrit-HasComments: Yes