[ https://issues.apache.org/jira/browse/DRILL-3137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aman Sinha reassigned DRILL-3137: --------------------------------- Assignee: Aman Sinha (was: Chris Westin) > Joining tables with lots of duplicates and LIMIT does not do early termination > ------------------------------------------------------------------------------ > > Key: DRILL-3137 > URL: https://issues.apache.org/jira/browse/DRILL-3137 > Project: Apache Drill > Issue Type: Bug > Components: Execution - Relational Operators > Affects Versions: 0.9.0 > Reporter: Aman Sinha > Assignee: Aman Sinha > > Create a table with duplicate keys: > {code} > create table dfs.tmp.lineitem_dup as select 100 as key1, 200 as key2 from > cp.`tpch/lineitem.parquet`; > +-----------+----------------------------+ > | Fragment | Number of records written | > +-----------+----------------------------+ > | 0_0 | 60175 | > +-----------+----------------------------+ > {code} > Now do a self-join with a LIMIT. This query should do an early termination > because of the LIMIT but it runs for about 2 minutes: (note in this plan > there are no exchanges): > {code} > select * from dfs.tmp.lineitem_dup t1, dfs.tmp.lineitem_dup t2 where t1.key1 > = t2.key1 limit 1; > 0: jdbc:drill:zk=local> select * from dfs.tmp.lineitem_dup t1, > dfs.tmp.lineitem_dup t2 where t1.key1 = t2.key1 limit 1; > +-------+-------+--------+--------+ > | key1 | key2 | key10 | key20 | > +-------+-------+--------+--------+ > | 100 | 200 | 100 | 200 | > +-------+-------+--------+--------+ > 1 row selected (111.764 seconds) > {code} > Disabling hash join does not help in this case. > 0: jdbc:drill:zk=local> alter session set `planner.enable_hashjoin` = false; > +-------+-----------------------------------+ > | ok | summary | > +-------+-----------------------------------+ > | true | planner.enable_hashjoin updated. | > +-------+-----------------------------------+ > 1 row selected (0.094 seconds) > 0: jdbc:drill:zk=local> select * from dfs.tmp.lineitem_dup t1, > dfs.tmp.lineitem_dup t2 where t1.key1 = t2.key1 limit 1; > +-------+-------+--------+--------+ > | key1 | key2 | key10 | key20 | > +-------+-------+--------+--------+ > | 100 | 200 | 100 | 200 | > +-------+-------+--------+--------+ > 1 row selected (198.874 seconds) > {code} > However, forcing exchanges in the plan helps and the query terminates early: > {code} > 0: jdbc:drill:zk=local> alter session set `planner.slice_target` = 1; > 0: jdbc:drill:zk=local> select * from dfs.tmp.lineitem_dup t1, > dfs.tmp.lineitem_dup t2 where t1.key1 = t2.key1 limit 1; > +-------+-------+--------+--------+ > | key1 | key2 | key10 | key20 | > +-------+-------+--------+--------+ > | 100 | 200 | 100 | 200 | > +-------+-------+--------+--------+ > 1 row selected (0.765 seconds) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)