[ https://issues.apache.org/jira/browse/DRILL-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327084#comment-14327084 ]
Aman Sinha commented on DRILL-2242: ----------------------------------- +1 > Wrong result (more rows) when outer query groups by subset of columns that > inner query groups by > ------------------------------------------------------------------------------------------------ > > Key: DRILL-2242 > URL: https://issues.apache.org/jira/browse/DRILL-2242 > Project: Apache Drill > Issue Type: Bug > Components: Query Planning & Optimization > Affects Versions: 0.7.0 > Reporter: Aman Sinha > Assignee: Jinfeng Ni > Priority: Critical > Attachments: > 0002-DRILL-2242-Propagate-distribution-trait-when-Project.patch > > > The following query has a subquery that groups on 2 columns and outer query > that queries on 1 of those columns. With slice_target = 1 to force > exchanges, it produces incorrect result: > {code} > alter session set `planner.slice_target` = 1; > select count(*) from > (select l_partksy from > (select l_partkey, l_suppkey from cp.`tpch/lineitem.parquet` > group by l_partkey, l_suppkey) > group by l_partkey > ); > +------------+ > | EXPR$0 | > +------------+ > | 6227 | > +------------+ > 1 row selected (1.522 seconds) > {code} > Correct result (from Postgres): > {code} > count > ------- > 2000 > (1 row) > {code} > The cause appears to be related to distribution trait propagation. Here's > the EXPLAIN plan: > {code} > +------------+------------+ > | text | json | > +------------+------------+ > | 00-00 Screen > 00-01 StreamAgg(group=[{}], EXPR$0=[$SUM0($0)]) > 00-02 UnionExchange > 01-01 StreamAgg(group=[{}], EXPR$0=[COUNT()]) > 01-02 Project($f0=[0]) > 01-03 HashAgg(group=[{0}]) > 01-04 Project(l_partkey=[$0]) > 01-05 HashAgg(group=[{0, 1}]) > 01-06 HashToRandomExchange(dist0=[[$0]], dist1=[[$1]]) > 02-01 HashAgg(group=[{0, 1}]) > 02-02 Project(l_partkey=[$1], l_suppkey=[$0]) > 02-03 Scan(groupscan=[ParquetGroupScan > [entries=[ReadEntryWithPath [path=/tpch/lineitem.parquet]], > selectionRoot=/tpch/lineitem.parquet, numFiles=1, columns=[`l_partkey`, > `l_suppkey`]]]) > {code} > Note that the HashExchange operator 06 does a distribute on 2 columns > l_partkey and l_suppkey in order to perform the 2phase aggregation. These are > the group-by columns. However, in the outer query's HashAgg, there is no > re-distribution being done. It assumes that data is already hash distributed > on l_partkey which is not correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332)