[ https://issues.apache.org/jira/browse/DRILL-6880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Hou closed DRILL-6880. ----------------------------- > Hash-Join: Many null keys on the build side form a long linked chain in the > Hash Table > -------------------------------------------------------------------------------------- > > Key: DRILL-6880 > URL: https://issues.apache.org/jira/browse/DRILL-6880 > Project: Apache Drill > Issue Type: Improvement > Components: Execution - Relational Operators > Affects Versions: 1.14.0 > Reporter: Boaz Ben-Zvi > Assignee: Boaz Ben-Zvi > Priority: Critical > Fix For: 1.16.0 > > > When building the Hash Table for the Hash-Join, each new key is matched with > an existing key (same bucket) by calling the generated method > `isKeyMatchInternalBuild`, which compares the two. However when both keys are > null, the method returns *false* (meaning not-equal; i.e. it is a new key), > thus the new key is added into the list following the old key. When a third > null key is found, it would be matched with the prior two, and added as well. > Etc etc ... > This way many null values would perform checks at order N^2 / 2. > _Suggested improvement_: The generated code should return a third result, > meaning "two null keys". Then in case of Inner or Left joins all the > duplicate nulls can be discarded. > Below is a simple example, note the time difference between non-null and the > all-nulls tables (also instrumentation showed that for nulls, the method > above was called 1249975000 times!!) > {code:java} > 0: jdbc:drill:zk=local> use dfs.tmp; > 0: jdbc:drill:zk=local> create table testNull as (select cast(null as int) > mycol from > dfs.`/data/test128M.tbl` limit 50000); > 0: jdbc:drill:zk=local> create table test1 as (select cast(1 as int) mycol1 > from > dfs.`/data/test128M.tbl` limit 60000); > 0: jdbc:drill:zk=local> create table test2 as (select cast(2 as int) mycol2 > from dfs.`/data/test128M.tbl` limit 50000); > 0: jdbc:drill:zk=local> select count(*) from test1 join test2 on test1.mycol1 > = test2.mycol2; > +---------+ > | EXPR$0 | > +---------+ > | 0 | > +---------+ > 1 row selected (0.443 seconds) > 0: jdbc:drill:zk=local> select count(*) from test1 join testNull on > test1.mycol1 = testNull.mycol; > +---------+ > | EXPR$0 | > +---------+ > | 0 | > +---------+ > 1 row selected (140.098 seconds) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)