-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28500/
-----------------------------------------------------------
(Updated Dec. 2, 2014, 1:34 a.m.)
Review request for hive, Chao Sun, Suhas Satish, and Xuefu Zhang.
Changes
-------
Fixed the algorithm and cleaned up the code after discussion with Xuefu. The
original code was too aggressive in incorporating connected mapjoins into its
size calculation; the new code only looks at the big table's connected mapjoins.
Bugs: HIVE-8943
https://issues.apache.org/jira/browse/HIVE-8943
Repository: hive-git
Description
-------
By default, SparkMapJoinOptimizer combines nested mapjoins into one work,
because the ReduceSink (RS) for the big table is removed. The mapjoin check
therefore needs to be enhanced to calculate whether all the MapJoins in that
work (Spark stage) will fit into memory; otherwise they might exhaust the
memory of that particular Spark executor.
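The check described above can be sketched roughly as follows. This is a
simplified illustration, not the code in the patch: MapJoinNode,
connectedMapJoinBytes, fitsInSameWork and the byte sizes are hypothetical
names and values, and the sketch assumes that every mapjoin reachable through
the big-table branch already lives in the same work.

```java
// Hypothetical model of the size check; these are not Hive's operator classes.
class MapJoinMemoryCheck {

    /** Simplified stand-in for a mapjoin operator. */
    static final class MapJoinNode {
        final String name;
        // Estimated combined size of this join's small tables, in bytes.
        final long smallTableBytes;
        // Nearest mapjoin upstream on the big-table branch, or null if none.
        final MapJoinNode bigTableParent;

        MapJoinNode(String name, long smallTableBytes, MapJoinNode bigTableParent) {
            this.name = name;
            this.smallTableBytes = smallTableBytes;
            this.bigTableParent = bigTableParent;
        }
    }

    /** Sums the small-table sizes of the mapjoins connected through the big-table branch. */
    static long connectedMapJoinBytes(MapJoinNode bigTableParent) {
        long total = 0L;
        for (MapJoinNode n = bigTableParent; n != null; n = n.bigTableParent) {
            total += n.smallTableBytes;
        }
        return total;
    }

    /**
     * Returns true if a candidate join, whose small tables take
     * candidateSmallTableBytes, can share a work with the mapjoins already
     * connected through its big-table branch without exceeding maxBytes.
     */
    static boolean fitsInSameWork(long candidateSmallTableBytes,
                                  MapJoinNode bigTableParent,
                                  long maxBytes) {
        return connectedMapJoinBytes(bigTableParent) + candidateSmallTableBytes <= maxBytes;
    }

    public static void main(String[] args) {
        MapJoinNode first = new MapJoinNode("mj1", 600L, null);
        // With a limit of 1000 bytes, 600 + 500 does not fit.
        System.out.println(fitsInSameWork(500L, first, 1000L));          // prints false
        // Without an effective limit, both joins fit in one work.
        System.out.println(fitsInSameWork(500L, first, Long.MAX_VALUE)); // prints true
    }
}
```

The two calls in main mirror the two cases the check has to separate: with a
memory limit the second join is kept out of the work, and without one both
joins share it.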
Diffs (updated)
-----
ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java
819eef1
ql/src/java/org/apache/hadoop/hive/ql/parse/spark/OptimizeSparkProcContext.java
0c339a5
ql/src/test/queries/clientpositive/auto_join_stats.q PRE-CREATION
ql/src/test/queries/clientpositive/auto_join_stats2.q PRE-CREATION
ql/src/test/results/clientpositive/auto_join_stats.q.out PRE-CREATION
ql/src/test/results/clientpositive/auto_join_stats2.q.out PRE-CREATION
ql/src/test/results/clientpositive/spark/auto_join_stats.q.out PRE-CREATION
ql/src/test/results/clientpositive/spark/auto_join_stats2.q.out PRE-CREATION
Diff: https://reviews.apache.org/r/28500/diff/
Testing
-------
Added two unit tests:
1. auto_join_stats, which sets a memory limit and checks that the algorithm
does not put more than one mapjoin in one BaseWork.
2. auto_join_stats2, which runs the same query without a memory limit and
checks that the algorithm puts all mapjoins in one BaseWork because it can.
Thanks,
Szehon Ho