Incorrect alias filtering for predicates pushdown
-------------------------------------------------
Key: HIVE-2383
URL: https://issues.apache.org/jira/browse/HIVE-2383
Project: Hive
Issue Type: Bug
Components: Query Processor
Affects Versions: 0.6.0
Reporter: Charles Chen
Assignee: Charles Chen
Priority: Critical
The predicate pushdown optimizer starts at the topmost operators traverses the
operator tree, at each stage collecting predicates to be pushed down. At each
operator, ive.ql.ppd.OpProcFactory.DefaultPPD.mergeWithChildrenPred is called,
which merges the predicates of the children nodes into the current node. The
predicates are stored in hive.ql.ppd.ExprWalkerInfo.pushdownPreds as a map from
the alias a predicate refers to (a predicate may only refer to one alias at a
time as only such predicates can be pushed) to a list of such predicates.
Since at each stage the alias the predicate refers to may change (subqueries
may change aliases), this is updated for each operator
(hive.ql.ppd.ExprWalkerProcFactory.extractPushdownPreds is called which walks
the ExprNodeDesc for each predicate). When a JoinOperator is encountered,
mergeWithChildrenPred is passed an optional parameter "aliases" which contains
a set of aliases that can be pushed per ansi semantics (see
hive.ql.ppd.OpProcFactory.JoinPPD.getQualifiedAliases). The part that is
incorrect is that aliases are filtered in mergeWithChildrenPred before
extractPushdownPreds is called, which associates the predicates with the
correct alias in the current operator's context while the filtering should
happen after.
In test case Q2 below, when the predicate "a.bar=3" comes into the
JoinOperator, the alias is "a" coming in so it is accepted for pushdown. When
brought into the JoinOperator's context, however, since the predicate refers to
b.foo in the inner scope, we should not actually accept this for pushdown.
With the test cases
{noformat}
-- Q1: predicate should not be pushed on the right side of a left outer join
(this is correct in trunk)
explain
SELECT a.foo as foo1, b.foo as foo2, b.bar
FROM pokes a LEFT OUTER JOIN pokes2 b
ON a.foo=b.foo
WHERE b.bar=3;
-- Q2: predicate should not be pushed on the right side of a left outer join
(this is broken in trunk)
explain
SELECT * FROM
(SELECT a.foo as foo1, b.foo as foo2, b.bar
FROM pokes a LEFT OUTER JOIN pokes2 b
ON a.foo=b.foo) a
WHERE a.bar=3;
-- Q3: predicate should be pushed (this is correct in trunk)
explain
SELECT * FROM
(SELECT a.foo as foo1, b.foo as foo2, a.bar
FROM pokes a JOIN pokes2 b
ON a.foo=b.foo) a
WHERE a.bar=3;
{noformat}
The current output is
{noformat}
hive>
> -- Q1: predicate should not be pushed on the right side of a left outer
join
> explain
> SELECT a.foo as foo1, b.foo as foo2, b.bar
> FROM pokes a LEFT OUTER JOIN pokes2 b
> ON a.foo=b.foo
> WHERE b.bar=3;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF (TOK_TABNAME pokes) a)
(TOK_TABREF (TOK_TABNAME pokes2) b) (= (. (TOK_TABLE_OR_COL a) foo) (.
(TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR
TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1)
(TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (.
(TOK_TABLE_OR_COL b) bar))) (TOK_WHERE (= (. (TOK_TABLE_OR_COL b) bar) 3))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a
TableScan
alias: a
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 0
value expressions:
expr: foo
type: int
b
TableScan
alias: b
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 1
value expressions:
expr: foo
type: int
expr: bar
type: int
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0}
1 {VALUE._col0} {VALUE._col1}
handleSkewJoin: false
outputColumnNames: _col0, _col4, _col5
Filter Operator
predicate:
expr: (_col5 = 3)
type: boolean
Select Operator
expressions:
expr: _col0
type: int
expr: _col4
type: int
expr: _col5
type: int
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
Time taken: 0.113 seconds
hive>
> -- Q2: predicate should not be pushed on the right side of a left outer
join
> explain
> SELECT * FROM
> (SELECT a.foo as foo1, b.foo as foo2, b.bar
> FROM pokes a LEFT OUTER JOIN pokes2 b
> ON a.foo=b.foo) a
> WHERE a.bar=3;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN
(TOK_TABREF (TOK_TABNAME pokes) a) (TOK_TABREF (TOK_TABNAME pokes2) b) (= (.
(TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT
(TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (.
(TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2)
(TOK_SELEXPR (. (TOK_TABLE_OR_COL b) bar))))) a)) (TOK_INSERT (TOK_DESTINATION
(TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (=
(. (TOK_TABLE_OR_COL a) bar) 3))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a:a
TableScan
alias: a
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 0
value expressions:
expr: foo
type: int
a:b
TableScan
alias: b
Filter Operator
predicate:
expr: (bar = 3)
type: boolean
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 1
value expressions:
expr: foo
type: int
expr: bar
type: int
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0}
1 {VALUE._col0} {VALUE._col1}
handleSkewJoin: false
outputColumnNames: _col0, _col4, _col5
Select Operator
expressions:
expr: _col0
type: int
expr: _col4
type: int
expr: _col5
type: int
outputColumnNames: _col0, _col1, _col2
Select Operator
expressions:
expr: _col0
type: int
expr: _col1
type: int
expr: _col2
type: int
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
Time taken: 0.101 seconds
hive>
> -- Q3: predicate should be pushed
> explain
> SELECT * FROM
> (SELECT a.foo as foo1, b.foo as foo2, a.bar
> FROM pokes a JOIN pokes2 b
> ON a.foo=b.foo) a
> WHERE a.bar=3;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF
(TOK_TABNAME pokes) a) (TOK_TABREF (TOK_TABNAME pokes2) b) (= (.
(TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT
(TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (.
(TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2)
(TOK_SELEXPR (. (TOK_TABLE_OR_COL a) bar))))) a)) (TOK_INSERT (TOK_DESTINATION
(TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (=
(. (TOK_TABLE_OR_COL a) bar) 3))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a:a
TableScan
alias: a
Filter Operator
predicate:
expr: (bar = 3)
type: boolean
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 0
value expressions:
expr: foo
type: int
expr: bar
type: int
a:b
TableScan
alias: b
Reduce Output Operator
key expressions:
expr: foo
type: int
sort order: +
Map-reduce partition columns:
expr: foo
type: int
tag: 1
value expressions:
expr: foo
type: int
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col1}
1 {VALUE._col0}
handleSkewJoin: false
outputColumnNames: _col0, _col1, _col4
Select Operator
expressions:
expr: _col0
type: int
expr: _col4
type: int
expr: _col1
type: int
outputColumnNames: _col0, _col1, _col2
Select Operator
expressions:
expr: _col0
type: int
expr: _col1
type: int
expr: _col2
type: int
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
{noformat}
Note that Q2 is incorrect because the predicate "bar = 3" is incorrectly pushed
to the right side of the left outer join (Q1 and Q3 are correct).
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira