-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28930/
-----------------------------------------------------------
(Updated Dec. 18, 2014, 2:07 a.m.)
Review request for hive.
Changes
-------
Update more golden files. Ton of SMB joins got converted to mapjoin/bucket
mapjoin. Also, due to forward-walking some of the operator numbers are changed
for cross-product check.
Bugs: HIVE-8639
https://issues.apache.org/jira/browse/HIVE-8639
Repository: hive-git
Description
-------
In MapReduce for auto-SMB joins, SortedMergeJoinProc is run in the earlier
Optimizer layer to convert join to SMB join, and SortMergeJoinResolver is run
in later PhysicalOptimizer layer to convert it to MapJoin. For Spark, we have
an opportunity to make it cleaner by deciding putting both SMB and MapJoin
conversions in the logical layer and deciding which one to call.
This patch introduces a new unitied join processor called 'SparkJoinOptimizer'
in the logical layer. This will call 'SparkMapJoinOptimizer' and
'SparkSortMergeJoinOptimizer' in a certain order depending on the flags that
are set and which ever one is available fails. Thus no need to write a SMB ->
MapJoin path.
'SparkSortMergeJoinOptimizer' is a new class that wraps the logic of
SortedMergeJoinProc but for Spark. To put both MapJoin/SMB processor in the
same level, I had to do some fixes.
1. One fix is in 'NonBlockingOpDeDupProc', to fix the join context state, as
now its run before the SMB code that relies on it. For this I submitted a
trunk patch at HIVE-9060.
2. The second fix is that MapReduce's SMB code did two graph walks, one
processor to calculate all 'rejected' joins, and another processor to change
the non-rejected ones to SMB join. That would have made it so we do multiple
walks, so I refactored the 'rejected' join logic in the same join-operator
visit in SparkSortMergeJoinOptimizer.
Diffs (updated)
-----
ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java c2e643d
ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkJoinOptimizer.java
PRE-CREATION
ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java
680c6fd
ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkReduceSinkMapJoinProc.java
83625ef
ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkSortMergeJoinOptimizer.java
PRE-CREATION
ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkWork.java 5e432ac
ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java b6a7ac2
ql/src/test/results/clientpositive/spark/auto_join32.q.out 28c022e
ql/src/test/results/clientpositive/spark/auto_join_stats.q.out bccd246
ql/src/test/results/clientpositive/spark/auto_smb_mapjoin_14.q.out 842b4b3
ql/src/test/results/clientpositive/spark/auto_sortmerge_join_1.q.out 2e35c66
ql/src/test/results/clientpositive/spark/auto_sortmerge_join_12.q.out ee37010
ql/src/test/results/clientpositive/spark/auto_sortmerge_join_13.q.out b2e928f
ql/src/test/results/clientpositive/spark/auto_sortmerge_join_14.q.out 20ee657
ql/src/test/results/clientpositive/spark/auto_sortmerge_join_15.q.out 0a48d00
ql/src/test/results/clientpositive/spark/auto_sortmerge_join_2.q.out 5008a3f
ql/src/test/results/clientpositive/spark/auto_sortmerge_join_3.q.out 3b081af
ql/src/test/results/clientpositive/spark/auto_sortmerge_join_4.q.out 2a11fb2
ql/src/test/results/clientpositive/spark/auto_sortmerge_join_5.q.out 0d971d2
ql/src/test/results/clientpositive/spark/auto_sortmerge_join_6.q.out 9d455dc
ql/src/test/results/clientpositive/spark/auto_sortmerge_join_7.q.out 61eb6ae
ql/src/test/results/clientpositive/spark/auto_sortmerge_join_8.q.out 198d50d
ql/src/test/results/clientpositive/spark/auto_sortmerge_join_9.q.out f59e57f
ql/src/test/results/clientpositive/spark/bucketsortoptimize_insert_2.q.out
b58091c
ql/src/test/results/clientpositive/spark/bucketsortoptimize_insert_4.q.out
8ee392e
ql/src/test/results/clientpositive/spark/bucketsortoptimize_insert_6.q.out
9c119df
ql/src/test/results/clientpositive/spark/bucketsortoptimize_insert_7.q.out
b9ad92d
ql/src/test/results/clientpositive/spark/bucketsortoptimize_insert_8.q.out
ed4d03f
ql/src/test/results/clientpositive/spark/cross_product_check_2.q.out 6fb69a5
ql/src/test/results/clientpositive/spark/parquet_join.q.out 240989a
ql/src/test/results/clientpositive/spark/smb_mapjoin_17.q.out 268ae23
ql/src/test/results/clientpositive/spark/smb_mapjoin_25.q.out df66cc2
ql/src/test/results/clientpositive/spark/subquery_multiinsert.q.out f635949
Diff: https://reviews.apache.org/r/28930/diff/
Testing
-------
Most of the auto-smb tests give the same output with this change, the only
difference is now some SMB joins become MapJoin if
"hive.auto.convert.sortmerge.join.to.mapjoin" is on, as expected.
One failing test is auto_sortmerge_join_9. This was passing until yesterday
when bucket-map join is enabled in HIVE-8638. As expected, by choosing
MapJoins over SMB join if "hive.auto.convert.sortmerge.join.to.mapjoin" is on,
the MapJoin may become a bucket-mapjoin. Some of the more complicated queries
of auto_sortmerge_join_9 get converted to bucket mapjoin and fail. Can
probably file a new JIRA to fix this test.
Thanks,
Szehon Ho