Hello everyone,
I have two questions on join optimizations in Hive, one of which I
believe is a bug in the 0.11.0 release.
1. From Hive online documentation I see that multiple map joins can be
grouped together in one single MapReduce job if the
input tables are joined on the same joining key. From my experiments on
TPCH I see that multiple map joins can
be grouped inside the same MapReduce job also for the case that all the
input corresponding to the tables can be fit into
one single mapper. For instance, query 5 joins first nation with region,
then the result is joined with supplier. From the explain
plan I see that both joins are run in a single MapReduce job as "nested
map join operators". Explain plan is attached
(please see Stage 21 in "q5_explained_MJ.txt" for the nested MapJoin
operators).
Is it possible to disable this feature (i.e., to have a MR job for each
join)? The setting I use to trigger map joins is
"hive.auto.convert.join=true"
2. In my experiments I also evaluate skewed joins. I enable skew joins
through "hive.optimize.skewjoin" and I run the same
tpch query 5. The skew join is not actually triggered as the number of
rows with the same key is less than "hive.skewjoin.key".
Hence, the map join corresponding to the skewed join is filtered out at
runtime, but unfortunately all the other stages
are also filtered out. Thus, no result is actually generated. If I
disable the skew join optimization, the query running only with
common joins returns the result correctly.
I believe this is a bug when the skew join operator is enabled but not
triggered. Did anyone experienced the same problem with
skew joins on queries of multiple map reduce joins? I attach the explain
plan. Essentially only stage 6 and 22 are executed.
Everything else is skipped silently with no output result being
generated, nor error in "hive.log". Similar behaviour is observed
for other TPCH queries.
Many thanks,
Adrian
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF (TOK_TABNAME customer2) c)
(TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF (TOK_TABNAME orders2)
o) (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF (TOK_TABNAME
lineitem2) l) (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF
(TOK_TABNAME supplier2) s) (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_JOIN
(TOK_TABREF (TOK_TABNAME nation2) n) (TOK_TABREF (TOK_TABNAME region2) r) (and
(= (. (TOK_TABLE_OR_COL n) n_regionkey) (. (TOK_TABLE_OR_COL r) r_regionkey))
(= (. (TOK_TABLE_OR_COL r) r_name) 'ASIA' (TOK_INSERT (TOK_DESTINATION
(TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL n_name))
(TOK_SELEXPR (TOK_TABLE_OR_COL n_nationkey) n1) (= (. (TOK_TABLE_OR_COL s)
s_nationkey) (. (TOK_TABLE_OR_COL n1) n_nationkey (TOK_INSERT
(TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR
(TOK_TABLE_OR_COL n_name)) (TOK_SELEXPR (TOK_TABLE_OR_COL s_suppkey))
(TOK_SELEXPR (TOK_TABLE_OR_COL s_nationkey) s1) (= (. (TOK_TABLE_OR_COL l)
l_suppkey) (. (TOK_TABLE_OR_COL s1) s_suppkey (TOK_INSERT (TOK_DESTINATION
(TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL n_name))
(TOK_SELEXPR (TOK_TABLE_OR_COL l_extendedprice)) (TOK_SELEXPR (TOK_TABLE_OR_COL
l_discount)) (TOK_SELEXPR (TOK_TABLE_OR_COL l_orderkey)) (TOK_SELEXPR
(TOK_TABLE_OR_COL s_nationkey) l1) (and (and (= (. (TOK_TABLE_OR_COL l1)
l_orderkey) (. (TOK_TABLE_OR_COL o) o_orderkey)) (>= (. (TOK_TABLE_OR_COL o)
o_orderdate) '1994-01-01')) (< (. (TOK_TABLE_OR_COL o) o_orderdate)
'1995-01-01' (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE))
(TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL n_name)) (TOK_SELEXPR
(TOK_TABLE_OR_COL l_extendedprice)) (TOK_SELEXPR (TOK_TABLE_OR_COL l_discount))
(TOK_SELEXPR (TOK_TABLE_OR_COL s_nationkey)) (TOK_SELEXPR (TOK_TABLE_OR_COL
o_custkey) o1) (and (= (. (TOK_TABLE_OR_COL c) c_nationkey) (.
(TOK_TABLE_OR_COL o1) s_nationkey)) (= (. (TOK_TABLE_OR_COL c) c_custkey) (.
(TOK_TABLE_OR_COL o1) o_custkey) (TOK_INSERT (TOK_DESTINATION (TOK_TAB
(TOK_TABNAME q5_local_supplier_volume))) (TOK_SELECT (TOK_SELEXPR
(TOK_TABLE_OR_COL n_name)) (TOK_SELEXPR (TOK_FUNCTION sum (* (TOK_TABLE_OR_COL
l_extendedprice) (- 1 (TOK_TABLE_OR_COL l_discount revenue)) (TOK_GROUPBY
(TOK_TABLE_OR_COL n_name)) (TOK_ORDERBY (TOK_TABSORTCOLNAMEDESC
(TOK_TABLE_OR_COL revenue)
STAGE DEPENDENCIES:
Stage-27 is a root stage
Stage-21 depends on stages: Stage-27
Stage-19 depends on stages: Stage-21 , consists of Stage-25, Stage-26, Stage-8
Stage-25 has a backup stage: Stage-8
Stage-17 depends on stages: Stage-25
Stage-16 depends on stages: Stage-8, Stage-17, Stage-18 , consists of
Stage-23, Stage-24, Stage-1
Stage-23 has a backup stage: Stage-1
Stage-14 depends on stages: Stage-23
Stage-22 depends on stages: Stage-1, Stage-14, Stage-15
Stage-13 depends on stages: Stage-22
Stage-3 depends on stages: Stage-13
Stage-4 depends on stages: Sta