[jira] [Commented] (DRILL-6845) Eliminate duplicates for Semi Hash Join
[ https://issues.apache.org/jira/browse/DRILL-6845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967216#comment-16967216 ]

Boaz Ben-Zvi commented on DRILL-6845:
-------------------------------------

The same problem can also be addressed at plan time with the use of statistics; see also DRILL-6949.

> Eliminate duplicates for Semi Hash Join
> ---------------------------------------
>
>                 Key: DRILL-6845
>                 URL: https://issues.apache.org/jira/browse/DRILL-6845
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Relational Operators
>    Affects Versions: 1.14.0
>            Reporter: Boaz Ben-Zvi
>            Assignee: Boaz Ben-Zvi
>            Priority: Minor
>
> Following DRILL-6735: The performance of the new Semi Hash Join may degrade
> if the build side contains an excessive number of rows with duplicate join
> keys; this is mainly a result of the need to store all those rows first,
> before the hash table is built.
>
> Proposed solution: For Semi, the Hash Agg would create a Hash-Table
> initially, and use it to eliminate key-duplicate rows as they arrive.
>
> Proposed extra: That Hash-Table has an added cost (e.g. resizing). So
> perform "runtime stats": check an initial number of incoming rows (e.g. 32k),
> and if the fraction of duplicates is below some threshold (e.g. 20%),
> cancel that "early" hash table.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
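The "proposed solution" and "proposed extra" in the issue description can be sketched roughly as below. This is only an illustration under stated assumptions, not Drill's implementation: the class EarlyDedupFilter, its methods, and the use of java.util.HashSet in place of Drill's own hash table are all hypothetical. The two constructor parameters stand in for the "e.g. 32k" sample size and the "e.g. 20%" duplicate threshold mentioned in the description.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical helper (not a Drill class): drops build-side rows whose join
 * key was already seen, and stops doing so if an initial sample shows that
 * duplicates are too rare to justify the extra hash table.
 */
final class EarlyDedupFilter<K> {
    private final int sampleSize;              // rows to inspect before deciding
    private final double minDuplicateFraction; // below this, cancel early dedup
    private Set<K> seenKeys = new HashSet<>(); // stand-in for the "early" hash table
    private long rowsSeen = 0;
    private long duplicatesSeen = 0;
    private boolean dedupActive = true;

    EarlyDedupFilter(int sampleSize, double minDuplicateFraction) {
        this.sampleSize = sampleSize;
        this.minDuplicateFraction = minDuplicateFraction;
    }

    /** Returns true if the build-side row with this key should be stored. */
    boolean accept(K key) {
        if (!dedupActive) {
            return true; // early table cancelled: keep every row, as before
        }
        rowsSeen++;
        boolean isNew = seenKeys.add(key); // add() returns false for a duplicate
        if (!isNew) {
            duplicatesSeen++;
        }
        // "Runtime stats": decide once, when the sample is complete.
        if (rowsSeen == sampleSize
                && duplicatesSeen < minDuplicateFraction * sampleSize) {
            dedupActive = false;
            seenKeys = null; // release the early table's memory
        }
        return isNew;
    }

    boolean isDedupActive() {
        return dedupActive;
    }
}
```

With the figures from the description, a caller would construct this as new EarlyDedupFilter<>(32 * 1024, 0.20) and store a build-side row only when accept(key) returns true. Cancelling the early table does not change the semi join result, since any duplicates kept afterwards are handled by the regular build as they were before this change.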
[jira] [Commented] (DRILL-6845) Eliminate duplicates for Semi Hash Join
[ https://issues.apache.org/jira/browse/DRILL-6845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16742389#comment-16742389 ]

Pritesh Maker commented on DRILL-6845:
--------------------------------------

[~ben-zvi] adding the link to the PR - https://github.com/apache/drill/pull/1606

> Eliminate duplicates for Semi Hash Join
> ---------------------------------------
>
>                 Key: DRILL-6845
>                 URL: https://issues.apache.org/jira/browse/DRILL-6845
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Relational Operators
>    Affects Versions: 1.14.0
>            Reporter: Boaz Ben-Zvi
>            Assignee: Boaz Ben-Zvi
>            Priority: Minor
>             Fix For: 1.16.0