Boaz Ben-Zvi created DRILL-6880: ----------------------------------- Summary: Hash-Join: Many null keys on the build side form a long linked chain in the Hash Table Key: DRILL-6880 URL: https://issues.apache.org/jira/browse/DRILL-6880 Project: Apache Drill Issue Type: Improvement Components: Execution - Relational Operators Affects Versions: 1.14.0 Reporter: Boaz Ben-Zvi Fix For: 1.16.0
When building the Hash Table for the Hash-Join, each new key is matched with an existing key (same bucket) by calling the generated method `isKeyMatchInternalBuild`, which compares the two. However when both keys are null, the method returns *false* (meaning not-equal; i.e. it is a new key), thus the new key is added into the list following the old key. When a third null key is found, it would be matched with the prior two, and added as well. Etc etc ... This way many null values would perform checks at order N^2 / 2. Suggested improvement: The generated code should return a third result, meaning "two null keys". Then in case of Inner or Left joins all the duplicate nulls can be discarded. Below is a simple example, note the time difference between non-null and the all-nulls tables (also instrumentation showed that for nulls, the method above was called 1249975000 times!!) {code:java} 0: jdbc:drill:zk=local> use dfs.tmp; 0: jdbc:drill:zk=local> create table test as (select cast(null as int) mycol from dfs.`/data/test128M.tbl` limit 50000); 0: jdbc:drill:zk=local> create table test1 as (select cast(1 as int) mycol1 from dfs.`/data/test128M.tbl` limit 60000); 0: jdbc:drill:zk=local> create table test2 as (select cast(2 as int) mycol2 from dfs.`/data/test128M.tbl` limit 50000); 0: jdbc:drill:zk=local> select count(*) from test1 join test2 on test1.mycol1 = test2.mycol2; +---------+ | EXPR$0 | +---------+ | 0 | +---------+ 1 row selected (0.443 seconds) 0: jdbc:drill:zk=local> create table test1 as (select cast(1 as int) mycol1 from dfs.`/data/test128M.tbl` limit 60000); +-----------+----------------------------+ | Fragment | Number of records written | +-----------+----------------------------+ | 0_0 | 60000 | +-----------+----------------------------+ 1 row selected (0.517 seconds) 0: jdbc:drill:zk=local> select count(*) from test1 join test on test1.mycol1 = test.mycol; +---------+ | EXPR$0 | +---------+ | 0 | +---------+ 1 row selected (140.098 seconds) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)