Re: handling joins in Hive 0.11.0

2013-12-13 Thread Adrian Popescu


Hello,

I found out that the dependency graph among task stages is incorrect for 
the skewed join optimized plan.


In particular, the conditional task in the optimized plan maintains no 
dependency with the child tasks
of the common join task in the original plan. The conditional task is 
composed of the map join task which
has all these dependencies, but for the case the map join task is 
filtered out, all these dependencies are removed.

Hence, all the other task stages of the query are skipped.

The bug resides in ql/optimizer/physical/GenMRSkewJoinProcessor.java, 
processSkewJoin() function,
immediately after the ConditionalTask is created and its dependencies 
are set.


I currently fixed the issue by adding dependencies among the 
ConditonalTask and all the child tasks of the common

join task of the original plan.

From the original design I see that only tasks included in the 
ConditionalTask are allowed to have dependencies,
so I am wondering what shall be the alternative correct implementation? 
Maybe adding an nop task inside the
ConditionalTask (in addition to the map join task), so that the 
dependencies are maintained for the case that the

map join task is filtered out?

Thanks,
Adrian



On 11/15/2013 10:20 PM, Adrian Popescu wrote:


2. In my experiments I also evaluate skewed joins. I enable skew joins 
through hive.optimize.skewjoin and I run the same
tpch query 5. The skew join is not actually triggered as the number of 
rows with the same key is less than hive.skewjoin.key.
Hence, the map join corresponding to the skewed join  is filtered out 
at runtime, but unfortunately all the other stages
are also filtered out. Thus, no result is actually generated. If I 
disable the skew join optimization, the query running only with

common joins returns the result correctly.

I believe this is a bug when the skew join operator is enabled but not 
triggered. Did anyone experienced the same problem with
skew joins on queries of multiple map reduce joins? I attach the 
explain plan. Essentially only stage 6 and 22 are executed.
Everything else is skipped silently with no output result being 
generated, nor error in hive.log. Similar behaviour is observed

for other TPCH queries.

Many thanks,
Adrian





--
Adrian



handling joins in Hive 0.11.0

2013-11-15 Thread Adrian Popescu


Hello everyone,

I have two questions on join optimizations in Hive, one of which I 
believe is a bug in the 0.11.0 release.


1. From Hive online documentation I see that multiple map joins can be 
grouped together in one single MapReduce job if the
input tables are joined on the same joining key. From my experiments on 
TPCH I see that multiple map joins can
be grouped inside the same MapReduce job also for the case that all the 
input corresponding to the tables can be fit into
one single mapper. For instance, query 5 joins first nation with region, 
then the result is joined with supplier. From the explain
plan I see that both joins are run in a single MapReduce job as nested 
map join operators. Explain plan is attached
(please see Stage 21 in q5_explained_MJ.txt for the nested MapJoin 
operators).


Is it possible to disable this feature (i.e., to have a MR job for each 
join)? The setting I use to trigger map joins is 
hive.auto.convert.join=true



2. In my experiments I also evaluate skewed joins. I enable skew joins 
through hive.optimize.skewjoin and I run the same
tpch query 5. The skew join is not actually triggered as the number of 
rows with the same key is less than hive.skewjoin.key.
Hence, the map join corresponding to the skewed join  is filtered out at 
runtime, but unfortunately all the other stages
are also filtered out. Thus, no result is actually generated. If I 
disable the skew join optimization, the query running only with

common joins returns the result correctly.

I believe this is a bug when the skew join operator is enabled but not 
triggered. Did anyone experienced the same problem with
skew joins on queries of multiple map reduce joins? I attach the explain 
plan. Essentially only stage 6 and 22 are executed.
Everything else is skipped silently with no output result being 
generated, nor error in hive.log. Similar behaviour is observed

for other TPCH queries.

Many thanks,
Adrian

ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF (TOK_TABNAME customer2) c) 
(TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF (TOK_TABNAME orders2) 
o) (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF (TOK_TABNAME 
lineitem2) l) (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF 
(TOK_TABNAME supplier2) s) (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_JOIN 
(TOK_TABREF (TOK_TABNAME nation2) n) (TOK_TABREF (TOK_TABNAME region2) r) (and 
(= (. (TOK_TABLE_OR_COL n) n_regionkey) (. (TOK_TABLE_OR_COL r) r_regionkey)) 
(= (. (TOK_TABLE_OR_COL r) r_name) 'ASIA' (TOK_INSERT (TOK_DESTINATION 
(TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL n_name)) 
(TOK_SELEXPR (TOK_TABLE_OR_COL n_nationkey) n1) (= (. (TOK_TABLE_OR_COL s) 
s_nationkey) (. (TOK_TABLE_OR_COL n1) n_nationkey (TOK_INSERT 
(TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR 
(TOK_TABLE_OR_COL n_name)) (TOK_SELEXPR (TOK_TABLE_OR_COL s_suppkey)) 
(TOK_SELEXPR (TOK_TABLE_OR_COL s_nationkey) s1) (= (. (TOK_TABLE_OR_COL l) 
l_suppkey) (. (TOK_TABLE_OR_COL s1) s_suppkey (TOK_INSERT (TOK_DESTINATION 
(TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL n_name)) 
(TOK_SELEXPR (TOK_TABLE_OR_COL l_extendedprice)) (TOK_SELEXPR (TOK_TABLE_OR_COL 
l_discount)) (TOK_SELEXPR (TOK_TABLE_OR_COL l_orderkey)) (TOK_SELEXPR 
(TOK_TABLE_OR_COL s_nationkey) l1) (and (and (= (. (TOK_TABLE_OR_COL l1) 
l_orderkey) (. (TOK_TABLE_OR_COL o) o_orderkey)) (= (. (TOK_TABLE_OR_COL o) 
o_orderdate) '1994-01-01')) ( (. (TOK_TABLE_OR_COL o) o_orderdate) 
'1995-01-01' (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) 
(TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL n_name)) (TOK_SELEXPR 
(TOK_TABLE_OR_COL l_extendedprice)) (TOK_SELEXPR (TOK_TABLE_OR_COL l_discount)) 
(TOK_SELEXPR (TOK_TABLE_OR_COL s_nationkey)) (TOK_SELEXPR (TOK_TABLE_OR_COL 
o_custkey) o1) (and (= (. (TOK_TABLE_OR_COL c) c_nationkey) (. 
(TOK_TABLE_OR_COL o1) s_nationkey)) (= (. (TOK_TABLE_OR_COL c) c_custkey) (. 
(TOK_TABLE_OR_COL o1) o_custkey) (TOK_INSERT (TOK_DESTINATION (TOK_TAB 
(TOK_TABNAME q5_local_supplier_volume))) (TOK_SELECT (TOK_SELEXPR 
(TOK_TABLE_OR_COL n_name)) (TOK_SELEXPR (TOK_FUNCTION sum (* (TOK_TABLE_OR_COL 
l_extendedprice) (- 1 (TOK_TABLE_OR_COL l_discount revenue)) (TOK_GROUPBY 
(TOK_TABLE_OR_COL n_name)) (TOK_ORDERBY (TOK_TABSORTCOLNAMEDESC 
(TOK_TABLE_OR_COL revenue)

STAGE DEPENDENCIES:
  Stage-27 is a root stage
  Stage-21 depends on stages: Stage-27
  Stage-19 depends on stages: Stage-21 , consists of Stage-25, Stage-26, Stage-8
  Stage-25 has a backup stage: Stage-8
  Stage-17 depends on stages: Stage-25
  Stage-16 depends on stages: Stage-8, Stage-17, Stage-18 , consists of 
Stage-23, Stage-24, Stage-1
  Stage-23 has a backup stage: Stage-1
  Stage-14 depends on stages: Stage-23
  Stage-22 depends on stages: Stage-1, Stage-14, Stage-15
  Stage-13 depends on stages: Stage-22
  Stage-3 depends on stages: Stage-13
  Stage-4 depends on stages: Stage-3