[
https://issues.apache.org/jira/browse/DRILL-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14109946#comment-14109946
]
Jinfeng Ni commented on DRILL-1337:
-----------------------------------
The proposed fixes :
1. When left outer join conditions contains a local RHS filter, the local RHS
filter should pushed down below JOIN.
Example :
T1 left join T2 on T1.key1 = T2.key2 and T2.C2 > 10
Local RHS filter T2.C2 > 10 should be pushed below JOIN, and evaluated before
JOIN operator.
2. When left outer join conditions contains a local LHS filter, currently Drill
will not be able to handle it. For such case, the fix should make sure Drill
planner throw CanNotPlan Exception.
T1 left join T2 on T1.key1 = T2.key2 and T1.C1 > 10
Expected : CanNotPlan Exception.
> TPCH Q13 may return incorrect rows : Drill may incorrectly pull up a local
> right filter in a left outer join condition.
> ------------------------------------------------------------------------------------------------------------------------
>
> Key: DRILL-1337
> URL: https://issues.apache.org/jira/browse/DRILL-1337
> Project: Apache Drill
> Issue Type: Bug
> Reporter: Jinfeng Ni
> Assignee: Jinfeng Ni
>
> For TPCH Q13, Drill may discard some qualified rows, since Drill pulls up a
> local RHS filter in a left outer join, and put it on top of Join. There means
> the local RHS is treated as a post-join condition, which will incorrectly
> discard some qualified rows.
> select
> c_count,
> count(*) as custdist
> from
> (
> select
> c.c_custkey,
> count(o.o_orderkey)
> from
> cp.`tpch/customer.parquet` c
> left outer join cp.`tpch/orders.parquet` o
> on c.c_custkey = o.o_custkey
> and o.o_comment not like '%special%requests%'
> group by
> c.c_custkey
> ) as orders (c_custkey, c_count)
> group by
> c_count
> order by
> custdist desc,
> c_count desc;
> Drill Physical :
> .......................
> 02-06 Filter(condition=[$3]): rowcount = 3750.0,
> cumulative cost = {79500.0 rows, 568512.0 cpu, 0.0 io, 1.90464E8 network,
> 264000.0 memory}, id = 2649
> 02-07 HashJoin(condition=[=($0, $1)],
> joinType=[left]): rowcount = 15000.0, cumulative cost = {64500.0 rows,
> 508512.0 cpu, 0.0 io, 1.90464E8 network, 264000.0 memory}, id = 2648
> 02-09 HashToRandomExchange(dist0=[[$0]]):
> rowcount = 1500.0, cumulative cost = {3000.0 rows, 25500.0 cpu, 0.0 io,
> 6144000.0 network, 0.0 memory}, id = 2644
> 03-01 Scan(groupscan=[ParquetGroupScan
> [entries=[ReadEntryWithPath [path=/tpch/customer.parquet]],
> selectionRoot=/tpch/customer.parquet, columns=[SchemaPath [`c_custkey`]]]]):
> rowcount = 1500.0, cumulative cost = {1500.0 rows, 1500.0 cpu, 0.0 io, 0.0
> network, 0.0 memory}, id = 2643
> 02-08 HashToRandomExchange(dist0=[[$0]]):
> rowcount = 15000.0, cumulative cost = {45000.0 rows, 285012.0 cpu, 0.0 io,
> 1.8432E8 network, 0.0 memory}, id = 2647
> 04-01 Project(o_custkey=[$1],
> o_orderkey=[$0], $f4=[NOT(LIKE($2, '%special%requests%'))]): rowcount =
> 15000.0, cumulative cost = {30000.0 rows, 45012.0 cpu, 0.0 io, 0.0 network,
> 0.0 memory}, id = 2646
> 04-02 Scan(groupscan=[ParquetGroupScan
> [entries=[ReadEntryWithPath [path=/tpch/orders.parquet]],
> selectionRoot=/tpch/orders.parquet, columns=[SchemaPath [`o_custkey`],
> SchemaPath [`o_orderkey`], SchemaPath [`o_comment`]]]]): rowcount = 15000.0,
> cumulative cost = {15000.0 rows, 45000.0 cpu, 0.0 io, 0.0 network, 0.0
> memory}, id = 2645
--
This message was sent by Atlassian JIRA
(v6.2#6252)