[ https://issues.apache.org/jira/browse/ARROW-12725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alessandro Molina updated ARROW-12725:
--------------------------------------
    Fix Version/s:     (was: 5.0.0)
                   6.0.0

> [C++][Compute] GroupBy: improve performance by encoding keys in row format only when they are inserted into hash table
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-12725
>                 URL: https://issues.apache.org/jira/browse/ARROW-12725
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 4.0.0
>            Reporter: Michal Nowakiewicz
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 6.0.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The previous implementation of hash group by converts input ExecBatches to
> row-oriented format, then hashes and compares rows as if they were a single
> column.
> It is more efficient (especially for a small number of key columns) to avoid
> the relatively costly encoding. Instead, hashes of the individual columns are
> computed in column-oriented format and mixed together, and column-oriented
> input data is compared directly against the row-oriented data in the hash
> table, without converting it first.
> Encoding only happens for the subset of input rows that are inserted into the
> hash table, that is, the rows that introduce new groups.
> Keys in the hash table remain stored in row-oriented format.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
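
The approach described in the issue can be illustrated with a minimal Python sketch. It is not the Arrow C++ implementation: the names `hash_columns`, `row_matches`, `encode_row`, and `Grouper` are hypothetical, a Python tuple stands in for Arrow's binary row format, and the hash-mixing step is an arbitrary choice made for illustration. It only shows the three ideas from the description: hashes are computed per column and mixed, column-oriented input is compared against stored row-oriented keys without converting it, and a row is encoded only when it introduces a new group.

```python
from typing import Dict, List, Sequence, Tuple

MASK = (1 << 64) - 1  # keep mixed hashes within 64 bits


def hash_columns(columns: Sequence[Sequence], num_rows: int) -> List[int]:
    """Hash each key column separately and mix the results per row.

    No row-oriented copy of the input is built here.
    """
    hashes = [0] * num_rows
    for col in columns:  # column-at-a-time pass over the batch
        for i in range(num_rows):
            # Illustrative mixing step (assumption, not Arrow's hash function).
            hashes[i] = (hashes[i] * 31 + (hash(col[i]) & MASK)) & MASK
    return hashes


def row_matches(stored_row: Tuple, columns: Sequence[Sequence], i: int) -> bool:
    """Compare a stored row-oriented key with input row i, read from columns."""
    return all(stored_row[c] == columns[c][i] for c in range(len(columns)))


def encode_row(columns: Sequence[Sequence], i: int) -> Tuple:
    """Convert input row i to the row-oriented form (a tuple in this sketch)."""
    return tuple(col[i] for col in columns)


class Grouper:
    """Toy hash group-by that encodes keys only for newly inserted groups."""

    def __init__(self) -> None:
        self.buckets: Dict[int, List[int]] = {}  # hash -> candidate group ids
        self.rows: List[Tuple] = []              # row-oriented keys, one per group
        self.encoded = 0                         # number of rows that were encoded

    def consume(self, columns: Sequence[Sequence]) -> List[int]:
        num_rows = len(columns[0]) if columns else 0
        hashes = hash_columns(columns, num_rows)
        group_ids = []
        for i, h in enumerate(hashes):
            candidates = self.buckets.setdefault(h, [])
            for gid in candidates:
                if row_matches(self.rows[gid], columns, i):
                    break  # existing group: nothing is encoded
            else:
                # New group: only now is the key converted to row format.
                gid = len(self.rows)
                self.rows.append(encode_row(columns, i))
                self.encoded += 1
                candidates.append(gid)
            group_ids.append(gid)
        return group_ids


if __name__ == "__main__":
    grouper = Grouper()
    # Two key columns, four input rows; only three distinct keys.
    ids = grouper.consume([["a", "b", "a", "b"], [1, 2, 1, 3]])
    print(ids)              # [0, 1, 0, 2]
    print(grouper.encoded)  # 3
```

In this sketch the cost of `encode_row` is paid once per distinct key rather than once per input row, which is where the saving described in the issue would come from when many input rows fall into few groups.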