[ https://issues.apache.org/jira/browse/DRILL-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Boaz Ben-Zvi reassigned DRILL-2010:
-----------------------------------

    Assignee: Boaz Ben-Zvi

> merge join returns wrong number of rows with large dataset
> ----------------------------------------------------------
>
>                 Key: DRILL-2010
>                 URL: https://issues.apache.org/jira/browse/DRILL-2010
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators
>    Affects Versions: 0.8.0
>            Reporter: Chun Chang
>            Assignee: Boaz Ben-Zvi
>            Priority: Critical
>         Attachments: DRILL-2010-1.patch
>
> #Mon Jan 12 18:19:31 EST 2015
> git.commit.id.abbrev=5b012bf
> When the dataset is large enough (larger than one batch), merge join
> does not return the correct number of rows; hash join does. The data can be
> downloaded from:
> https://s3.amazonaws.com/apache-drill/files/complex100k.json.gz
> With this dataset, the following query should return 10,000,000.
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set `planner.enable_mergejoin` = true;
> +------------+------------+
> |     ok     |  summary   |
> +------------+------------+
> | true       | planner.enable_mergejoin updated. |
> +------------+------------+
> 1 row selected (0.024 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set `planner.enable_hashjoin` = false;
> +------------+------------+
> |     ok     |  summary   |
> +------------+------------+
> | true       | planner.enable_hashjoin updated. |
> +------------+------------+
> 1 row selected (0.024 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> select count(a.id) from `complex100k.json` a inner join `complex100k.json` b on a.gbyi=b.gbyi;
> +------------+
> |   EXPR$0   |
> +------------+
> | 9046760    |
> +------------+
> 1 row selected (6.205 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set `planner.enable_mergejoin` = false;
> +------------+------------+
> |     ok     |  summary   |
> +------------+------------+
> | true       | planner.enable_mergejoin updated. |
> +------------+------------+
> 1 row selected (0.026 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set `planner.enable_hashjoin` = true;
> +------------+------------+
> |     ok     |  summary   |
> +------------+------------+
> | true       | planner.enable_hashjoin updated. |
> +------------+------------+
> 1 row selected (0.024 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> select count(a.id) from `complex100k.json` a inner join `complex100k.json` b on a.gbyi=b.gbyi;
> +------------+
> |   EXPR$0   |
> +------------+
> | 10000000   |
> +------------+
> 1 row selected (4.453 seconds)
> {code}
> With a smaller dataset, both merge and hash join return the same correct number.
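As a side note for verification (not part of the report): for a self-join on a grouping key, the expected row count is the sum of squared group sizes, so the 10,000,000 figure can be cross-checked against the raw `gbyi` values independently of either join operator. A minimal sketch, using toy key values:

```python
from collections import Counter

def self_join_count(keys):
    """Expected row count of `t a INNER JOIN t b ON a.k = b.k`
    over a single table whose join-key column is `keys`."""
    counts = Counter(keys)
    return sum(n * n for n in counts.values())

# toy stand-in for the gbyi column: groups of size 2, 1, and 3
keys = [1, 1, 2, 3, 3, 3]
print(self_join_count(keys))  # 2*2 + 1*1 + 3*3 = 14
```

Running the same computation over the 100k-row dataset's `gbyi` column should reproduce the hash-join result of 10,000,000, confirming that the merge-join count of 9,046,760 is the incorrect one.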
> physical plan for merge join:
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> explain plan for select count(a.id) from `complex100k.json` a inner join `complex100k.json` b on a.gbyi=b.gbyi;
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      StreamAgg(group=[{}], EXPR$0=[COUNT($0)])
> 00-02        Project(id=[$1])
> 00-03          MergeJoin(condition=[=($0, $2)], joinType=[inner])
> 00-05            SelectionVectorRemover
> 00-07              Sort(sort0=[$0], dir0=[ASC])
> 00-09                Scan(groupscan=[EasyGroupScan [selectionRoot=/drill/testdata/complex_type/json/complex100k.json, numFiles=1, columns=[`gbyi`, `id`], files=[maprfs:/drill/testdata/complex_type/json/complex100k.json]]])
> 00-04            Project(gbyi0=[$0])
> 00-06              SelectionVectorRemover
> 00-08                Sort(sort0=[$0], dir0=[ASC])
> 00-10                  Scan(groupscan=[EasyGroupScan [selectionRoot=/drill/testdata/complex_type/json/complex100k.json, numFiles=1, columns=[`gbyi`], files=[maprfs:/drill/testdata/complex_type/json/complex100k.json]]])
> {code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)