[ https://issues.apache.org/jira/browse/DRILL-6845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967216#comment-16967216 ]
Boaz Ben-Zvi commented on DRILL-6845:
-------------------------------------

The same problem can also be addressed at plan time with the use of statistics; see also DRILL-6949.

> Eliminate duplicates for Semi Hash Join
> ---------------------------------------
>
>                 Key: DRILL-6845
>                 URL: https://issues.apache.org/jira/browse/DRILL-6845
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Relational Operators
>    Affects Versions: 1.14.0
>            Reporter: Boaz Ben-Zvi
>            Assignee: Boaz Ben-Zvi
>            Priority: Minor
>
> Following DRILL-6735: The performance of the new Semi Hash Join may degrade
> if the build side contains an excessive number of join-key duplicate rows;
> this is mainly a result of the need to store all those rows first, before
> the hash table is built.
>
> Proposed solution: For Semi, the Hash Agg would create a Hash-Table
> initially, and use it to eliminate key-duplicate rows as they arrive.
>
> Proposed extra: That Hash-Table has an added cost (e.g. resizing). So
> perform "runtime stats": check an initial number of incoming rows (e.g. 32k),
> and if the fraction of duplicates is less than some threshold (e.g. 20%),
> cancel that "early" hash table.
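
The proposal quoted above could be sketched roughly as follows. This is only an illustration of the idea, not Drill's operator code: the class `SemiBuildDedup`, its methods, and the use of plain `long` join keys with a `java.util.HashSet` are all assumptions made for the example. The sample size and threshold stand in for the "32k" and "20%" figures mentioned in the issue.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of early duplicate elimination on the build side of a semi join. */
final class SemiBuildDedup {
  private final int sampleSize;         // rows to inspect before deciding (e.g. 32k in the issue)
  private final double minDupFraction;  // keep the early table only at or above this (e.g. 0.20)
  private Set<Long> seenKeys = new HashSet<>();      // the "early" hash table; null once cancelled
  private final List<Long> kept = new ArrayList<>(); // build-side rows that will be stored
  private long rowsSeen = 0;
  private long dupsSeen = 0;

  SemiBuildDedup(int sampleSize, double minDupFraction) {
    this.sampleSize = sampleSize;
    this.minDupFraction = minDupFraction;
  }

  /** Feed one incoming build-side join key. */
  void add(long key) {
    rowsSeen++;
    if (seenKeys == null) {
      // Early table was cancelled: store every row as-is, duplicates included.
      kept.add(key);
      return;
    }
    if (seenKeys.add(key)) {
      kept.add(key);   // first occurrence of this key
    } else {
      dupsSeen++;      // duplicate key: dropped, a semi join only needs one match
    }
    // "Runtime stats": after the initial sample, cancel the early table
    // if too few duplicates were found to justify its cost.
    if (rowsSeen == sampleSize && dupsSeen < minDupFraction * sampleSize) {
      seenKeys = null;
    }
  }

  boolean dedupActive() { return seenKeys != null; }

  List<Long> keptKeys() { return kept; }
}
```

Dropping duplicates, or keeping them after the early table is cancelled, does not change the semi join result, since each probe-side row only needs to know whether at least one matching build-side key exists. The sketch trades memory and hashing cost against stored rows, which is the trade-off the "Proposed extra" describes.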