Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10771 )

Change subject: PREVIEW: IMPALA-110: Support for multiple DISTINCT
......................................................................


Patch Set 1:

(7 comments)

Did a quick initial pass over the backend part of this.

http://gerrit.cloudera.org:8080/#/c/10771/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/10771/1//COMMIT_MSG@60
PS1, Line 60: - Ran hdfs/core tests
Can we add some spilling tests too?


http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/aggregation-node.cc
File be/src/exec/aggregation-node.cc:

http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/aggregation-node.cc@118
PS1, Line 118:     // Separate input batch into mini batches destined for the 
different aggs.
I feel like there's a lot of unnecessary overhead in the pre-partitioning into 
mini-batches (the approach in general and maybe some of the specifics of how 
this code uses the RowBatch APIs).

I'd be interested to see how much time is spent here for a high-NDV aggregation 
to see if it's worth optimising.

We could avoid the construction of the intermediate batches if we pushed down 
logic into AddBatch so that it consumed rows from 'batch' up until it saw one 
that didn't match its index.


http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator-ir.cc
File be/src/exec/grouping-aggregator-ir.cc:

http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator-ir.cc@171
PS1, Line 171:       ++streaming_idx_;
Let's hoist the store to 'streaming_idx_' out of the loop.


http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator-ir.cc@190
PS1, Line 190:         out_batch->ClearRow(row);
This seems expensive to include in the inner loop. Can we avoid doing this 
entirely in the non-multiagg case? And do it efficiently with a memset() before 
calling this function in the multiagg case.


http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator-ir.cc@191
PS1, Line 191: agg_idx_
Let's hoist this load out of the loop too. Can we codegen this out for the 
non-multi-agg case?


http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator.h
File be/src/exec/grouping-aggregator.h:

http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator.h@296
PS1, Line 296:   int32_t streaming_idx_;
Comments?


http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator.cc
File be/src/exec/grouping-aggregator.cc:

http://gerrit.cloudera.org:8080/#/c/10771/1/be/src/exec/grouping-aggregator.cc@285
PS1, Line 285:     row_batch->ClearRow(row);
This seems like unnecessary overhead.



--
To view, visit http://gerrit.cloudera.org:8080/10771
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I055402eaef6d81e5f70e850d9f8a621e766830a4
Gerrit-Change-Number: 10771
Gerrit-PatchSet: 1
Gerrit-Owner: Thomas Marshall <thomasmarsh...@cmu.edu>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Sat, 21 Jul 2018 01:13:08 +0000
Gerrit-HasComments: Yes

Reply via email to