[ https://issues.apache.org/jira/browse/ARROW-14479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated ARROW-14479:
-----------------------------------
    Labels: pull-request-available  (was: )

> [C++][Compute] Hash Join microbenchmarks
> ----------------------------------------
>
>                 Key: ARROW-14479
>                 URL: https://issues.apache.org/jira/browse/ARROW-14479
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 7.0.0
>            Reporter: Michal Nowakiewicz
>            Assignee: Sasha Krassovsky
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 7.0.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Implement a series of microbenchmarks giving a good picture of the
> performance of hash join implemented in Arrow across different sets of
> dimensions.
> Compare the performance against some other product(s).
> Add scripts for generating useful visual reports giving a good picture of the
> costs of hash join.
> Examples of dimensions to explore in microbenchmarks:
> * number of duplicate keys on build side
> * relative size of build side to probe side
> * selectivity of the join
> * number of key columns
> * number of payload columns
> * filtering performance for semi- and anti- joins
> * dense integer key vs sparse integer key vs string key
> * build size
> * scaling of build, filtering, probe
> * inner vs left outer, inner vs right outer
> * left semi vs right semi, left anti vs right anti, left outer vs right outer
> * non-uniform key distribution
> * monotonic key values in input, partitioned key values in input (with and
> without per batch min-max metadata)
> * chain of multiple hash joins
> * overhead of Bloom filter for non-selective Bloom filter

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
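
To make a few of the dimensions listed in the issue concrete (duplicate keys on the build side, relative build/probe size, and join selectivity), here is a minimal sketch of how such inputs could be parameterized and how build and probe time could be measured separately. It is not Arrow's hash join or its benchmark code: a plain Python dict stands in for the hash table, and the names make_inputs and hash_join_inner, along with all parameters and sizes, are assumptions made up for illustration.

```python
import random
import time
from collections import defaultdict


def make_inputs(build_rows, probe_rows, dups_per_key, selectivity, seed=0):
    """Generate integer join keys for the build and probe sides.

    dups_per_key: how many build rows share each distinct key.
    selectivity:  fraction of probe rows whose key exists on the build side.
    (Hypothetical helper; not part of Arrow.)
    """
    rng = random.Random(seed)
    distinct = max(1, build_rows // dups_per_key)
    build = [i % distinct for i in range(build_rows)]
    rng.shuffle(build)
    matching = int(probe_rows * selectivity)
    # Matching probe keys are drawn from [0, distinct); the remaining rows
    # use negative keys, which never occur on the build side.
    probe = [rng.randrange(distinct) for _ in range(matching)]
    probe += [-1 - i for i in range(probe_rows - matching)]
    rng.shuffle(probe)
    return build, probe


def hash_join_inner(build, probe):
    """Count inner-join output rows; return (rows, build_secs, probe_secs)."""
    t0 = time.perf_counter()
    table = defaultdict(list)
    for row_id, key in enumerate(build):
        table[key].append(row_id)
    t1 = time.perf_counter()
    rows = 0
    for key in probe:
        matches = table.get(key)  # .get() does not insert missing keys
        if matches:
            rows += len(matches)
    t2 = time.perf_counter()
    return rows, t1 - t0, t2 - t1


if __name__ == "__main__":
    # Sweep two of the dimensions; sizes are arbitrary illustration values.
    for dups in (1, 4, 16):
        for sel in (0.1, 0.5, 1.0):
            build, probe = make_inputs(20_000, 80_000, dups, sel)
            rows, build_s, probe_s = hash_join_inner(build, probe)
            print(f"dups={dups:>2} sel={sel:.1f} rows={rows:>8} "
                  f"build={build_s * 1e3:7.1f} ms probe={probe_s * 1e3:7.1f} ms")
```

A real microbenchmark for the C++ implementation would presumably drive the actual join node with Arrow record batches and a benchmarking framework. The sketch only shows how the input generator can expose each dimension as an independent knob, so that one dimension can be varied while the others stay fixed.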