[
https://issues.apache.org/jira/browse/DRILL-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jinfeng Ni reassigned DRILL-1337:
---------------------------------
Assignee: DrillCommitter (was: Jinfeng Ni)
TPCH Q13 now gets correct result.
> TPCH Q13 may return incorrect rows : Drill may incorrectly pull up a local
> right filter in a left outer join condition.
> ------------------------------------------------------------------------------------------------------------------------
>
> Key: DRILL-1337
> URL: https://issues.apache.org/jira/browse/DRILL-1337
> Project: Apache Drill
> Issue Type: Bug
> Reporter: Jinfeng Ni
> Assignee: DrillCommitter
>
> For TPCH Q13, Drill may discard some qualified rows, since Drill pulls up a
> local RHS filter in a left outer join, and put it on top of Join. There means
> the local RHS is treated as a post-join condition, which will incorrectly
> discard some qualified rows.
> select
> c_count,
> count(*) as custdist
> from
> (
> select
> c.c_custkey,
> count(o.o_orderkey)
> from
> cp.`tpch/customer.parquet` c
> left outer join cp.`tpch/orders.parquet` o
> on c.c_custkey = o.o_custkey
> and o.o_comment not like '%special%requests%'
> group by
> c.c_custkey
> ) as orders (c_custkey, c_count)
> group by
> c_count
> order by
> custdist desc,
> c_count desc;
> Drill Physical :
> .......................
> 02-06 Filter(condition=[$3]): rowcount = 3750.0,
> cumulative cost = {79500.0 rows, 568512.0 cpu, 0.0 io, 1.90464E8 network,
> 264000.0 memory}, id = 2649
> 02-07 HashJoin(condition=[=($0, $1)],
> joinType=[left]): rowcount = 15000.0, cumulative cost = {64500.0 rows,
> 508512.0 cpu, 0.0 io, 1.90464E8 network, 264000.0 memory}, id = 2648
> 02-09 HashToRandomExchange(dist0=[[$0]]):
> rowcount = 1500.0, cumulative cost = {3000.0 rows, 25500.0 cpu, 0.0 io,
> 6144000.0 network, 0.0 memory}, id = 2644
> 03-01 Scan(groupscan=[ParquetGroupScan
> [entries=[ReadEntryWithPath [path=/tpch/customer.parquet]],
> selectionRoot=/tpch/customer.parquet, columns=[SchemaPath [`c_custkey`]]]]):
> rowcount = 1500.0, cumulative cost = {1500.0 rows, 1500.0 cpu, 0.0 io, 0.0
> network, 0.0 memory}, id = 2643
> 02-08 HashToRandomExchange(dist0=[[$0]]):
> rowcount = 15000.0, cumulative cost = {45000.0 rows, 285012.0 cpu, 0.0 io,
> 1.8432E8 network, 0.0 memory}, id = 2647
> 04-01 Project(o_custkey=[$1],
> o_orderkey=[$0], $f4=[NOT(LIKE($2, '%special%requests%'))]): rowcount =
> 15000.0, cumulative cost = {30000.0 rows, 45012.0 cpu, 0.0 io, 0.0 network,
> 0.0 memory}, id = 2646
> 04-02 Scan(groupscan=[ParquetGroupScan
> [entries=[ReadEntryWithPath [path=/tpch/orders.parquet]],
> selectionRoot=/tpch/orders.parquet, columns=[SchemaPath [`o_custkey`],
> SchemaPath [`o_orderkey`], SchemaPath [`o_comment`]]]]): rowcount = 15000.0,
> cumulative cost = {15000.0 rows, 45000.0 cpu, 0.0 io, 0.0 network, 0.0
> memory}, id = 2645
--
This message was sent by Atlassian JIRA
(v6.2#6252)