[ https://issues.apache.org/jira/browse/ARROW-12728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380712#comment-17380712 ]
David Li commented on ARROW-12728:
----------------------------------

If I understand correctly, this should be a fairly straightforward hash aggregate kernel on top of ARROW-12759. Is anyone planning to work on this? Otherwise I'd like to take it, to get more familiar with the ExecNode/Grouper infrastructure.

> [C++][Compute] Aggregates: implement count distinct
> ---------------------------------------------------
>
>                 Key: ARROW-12728
>                 URL: https://issues.apache.org/jira/browse/ARROW-12728
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 4.0.0
>            Reporter: Michal Nowakiewicz
>            Priority: Major
>             Fix For: 6.0.0
>
>
> Implement a count distinct aggregate that internally reuses the hash table
> from hash group by.
> This brings support for SQL queries like:
>     select a, count(distinct b), count(distinct c) from t group by a
> For instance, to compute count(distinct b), the first group id mapping
> assigns a group id based on the value of column a; the second group id
> mapping is then done inside the count(distinct b) aggregate using the key
> (groupid(a), b) (and similarly for count(distinct c)).
> After all input rows are consumed, the final processing step scans the hash
> table keyed on (groupid(a), b) and updates an array of counts indexed by
> groupid(a).
> The resulting array of counts is the output of the count distinct
> aggregate.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
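
The two-level group id mapping described in the issue can be illustrated with a minimal, standalone sketch. It does not use the Arrow Grouper or any Arrow API: the names `CountDistinctPerGroup` and `PairHash` are hypothetical, columns are plain `std::vector<std::string>`, and standard-library hash containers stand in for the hash table of hash group by.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical hash for the composite key (groupid(a), b).
struct PairHash {
  std::size_t operator()(const std::pair<uint32_t, std::string>& k) const {
    return std::hash<uint32_t>()(k.first) * 31 + std::hash<std::string>()(k.second);
  }
};

// Sketch of: select a, count(distinct b) from t group by a
// Returns counts indexed by groupid(a); group_keys[i] is the value of a
// that was assigned group id i (ids are assigned in first-seen order).
std::vector<int64_t> CountDistinctPerGroup(const std::vector<std::string>& a,
                                           const std::vector<std::string>& b,
                                           std::vector<std::string>* group_keys) {
  assert(a.size() == b.size());

  // First mapping: value of a -> dense group id.
  std::unordered_map<std::string, uint32_t> group_ids;
  // Second mapping: the set of distinct (groupid(a), b) keys.
  std::unordered_set<std::pair<uint32_t, std::string>, PairHash> seen;

  group_keys->clear();
  for (std::size_t i = 0; i < a.size(); ++i) {
    auto inserted =
        group_ids.emplace(a[i], static_cast<uint32_t>(group_ids.size()));
    if (inserted.second) group_keys->push_back(a[i]);
    const uint32_t gid = inserted.first->second;
    seen.emplace(gid, b[i]);  // duplicates of (gid, b) are ignored
  }

  // Final step: scan the second hash table and update the counts
  // array indexed by groupid(a).
  std::vector<int64_t> counts(group_ids.size(), 0);
  for (const auto& key : seen) {
    ++counts[key.first];
  }
  return counts;
}
```

For the rows a = (x, y, x, x, y) and b = (1, 1, 2, 1, 1), this sketch assigns group id 0 to x and 1 to y and returns the counts (2, 1). Adding count(distinct c) would use a second `seen` set keyed on (groupid(a), c), sharing the first mapping.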