*Thanks for you warning.* *The first query is mapjoin and second query is reducejoin.The data format is all textInputFormat.* *I'll go to learn more about mapjoin of **hive on spark** anyway,But why** stage1 of first query in attachment is so slowly?*
*Explain first query:* hive (u_wsd)> explain insert overwrite table t_sd_ucm_cominfo_incremental partition (ds=20151202) > select t1.uin,t1.clientip from > (select uin,clientip from t_sd_ucm_cominfo_FinalResult where ds=20151202) t1 > left outer join (select uin,clientip from t_sd_ucm_cominfo_FinalResult where ds=20151201) t2 > on t1.uin=t2.uin > where t2.clientip is NULL; OK STAGE DEPENDENCIES: Stage-3 is a root stage Stage-1 depends on stages: Stage-3 Stage-0 depends on stages: Stage-1 Stage-2 depends on stages: Stage-0 STAGE PLANS: Stage: Stage-3 Spark DagName: mqq_20151204103210_fb913c60-c5ed-438f-8efb-950ee5639dd5:2 Vertices: Map 2 Map Operator Tree: TableScan alias: t_sd_ucm_cominfo_finalresult Statistics: Num rows: 108009 Data size: 2873665 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: uin (type: string), clientip (type: string) outputColumnNames: _col0, _col1 Statistics: Num rows: 108009 Data size: 2873665 Basic stats: COMPLETE Column stats: NONE Spark HashTable Sink Operator keys: 0 _col0 (type: string) 1 _col0 (type: string) Local Work: Map Reduce Local Work Stage: Stage-1 Spark DagName: mqq_20151204103210_fb913c60-c5ed-438f-8efb-950ee5639dd5:1 Vertices: Map 1 Map Operator Tree: TableScan alias: t_sd_ucm_cominfo_finalresult Statistics: Num rows: 103779 Data size: 2746785 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: uin (type: string), clientip (type: string) outputColumnNames: _col0, _col1 Statistics: Num rows: 103779 Data size: 2746785 Basic stats: COMPLETE Column stats: NONE Map Join Operator condition map: Left Outer Join0 to 1 keys: 0 _col0 (type: string) 1 _col0 (type: string) outputColumnNames: _col0, _col1, _col3 input vertices: 1 Map 2 Statistics: Num rows: 118809 Data size: 3161031 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: _col3 is null (type: boolean) Statistics: Num rows: 59404 Data size: 1580502 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col0 (type: string), _col1 (type: string) outputColumnNames: _col0, _col1 Statistics: Num rows: 59404 Data size: 1580502 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 59404 Data size: 1580502 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe name: u_wsd.t_sd_ucm_cominfo_incremental Local Work: Map Reduce Local Work Stage: Stage-0 Move Operator tables: partition: ds 20151202 replace: true table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe name: u_wsd.t_sd_ucm_cominfo_incremental Stage: Stage-2 Stats-Aggr Operator *Explain second query:* hive (u_wsd)> explain insert overwrite table t_sd_ucm_cominfo_incremental partition (ds=20151201) > select t1.uin,t1.clientip from > (select uin,clientip from t_sd_ucm_cominfo_FinalResult where ds=20151201) t1 > left outer join (select uin,clientip from t_sd_ucm_cominfo_FinalResult where ds=20151130) t2 > on t1.uin=t2.uin > where t2.clientip is NULL; OK STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 Stage-2 depends on stages: Stage-0 STAGE PLANS: Stage: Stage-1 Spark Edges: Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 100), Map 3 (PARTITION-LEVEL SORT, 100) DagName: mqq_20151204103243_3eab6e6c-941e-476a-897f-cae97657063e:3 Vertices: Map 1 Map Operator Tree: TableScan alias: t_sd_ucm_cominfo_finalresult Statistics: Num rows: 108009 Data size: 2873665 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: uin (type: string), clientip (type: string) outputColumnNames: _col0, _col1 Statistics: Num rows: 108009 Data size: 2873665 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: string) sort order: + Map-reduce partition columns: _col0 (type: string) Statistics: Num rows: 108009 Data size: 2873665 Basic stats: COMPLETE Column stats: NONE value expressions: _col1 (type: string) Map 3 Map Operator Tree: TableScan alias: t_sd_ucm_cominfo_finalresult Statistics: Num rows: 590130 Data size: 118026051 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: uin (type: string), clientip (type: string) outputColumnNames: _col0, _col1 Statistics: Num rows: 590130 Data size: 118026051 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: string) sort order: + Map-reduce partition columns: _col0 (type: string) Statistics: Num rows: 590130 Data size: 118026051 Basic stats: COMPLETE Column stats: NONE value expressions: _col1 (type: string) Reducer 2 Reduce Operator Tree: Join Operator condition map: Left Outer Join0 to 1 keys: 0 _col0 (type: string) 1 _col0 (type: string) outputColumnNames: _col0, _col1, _col3 Statistics: Num rows: 649143 Data size: 129828658 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: _col3 is null (type: boolean) Statistics: Num rows: 324571 Data size: 64914228 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col0 (type: string), _col1 (type: string) outputColumnNames: _col0, _col1 Statistics: Num rows: 324571 Data size: 64914228 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 324571 Data size: 64914228 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe name: u_wsd.t_sd_ucm_cominfo_incremental Stage: Stage-0 Move Operator tables: partition: ds 20151201 replace: true table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe name: u_wsd.t_sd_ucm_cominfo_incremental Stage: Stage-2 Stats-Aggr Operator *Thanks.* 2015-12-03 22:17 GMT+08:00 Xuefu Zhang <xzh...@cloudera.com>: > Can you also attach explain query result? What's your data format? > > --Xuefu > > On Thu, Dec 3, 2015 at 12:09 AM, Jone Zhang <joyoungzh...@gmail.com> > wrote: > >> Hive1.2.1 on Spark1.4.1 >> >> *The first query is:* >> set mapred.reduce.tasks=100; >> use u_wsd; >> insert overwrite table t_sd_ucm_cominfo_incremental partition (ds= >> 20151202) >> select t1.uin,t1.clientip from >> (select uin,clientip from t_sd_ucm_cominfo_FinalResult where ds=20151202) >> t1 >> left outer join (select uin,clientip from t_sd_ucm_cominfo_FinalResult >> where ds=20151201) t2 >> on t1.uin=t2.uin >> where t2.clientip is NULL; >> >> *The second query is:* >> set mapred.reduce.tasks=100; >> use u_wsd; >> insert overwrite table t_sd_ucm_cominfo_incremental partition (ds= >> 20151201) >> select t1.uin,t1.clientip from >> (select uin,clientip from t_sd_ucm_cominfo_FinalResult where ds=20151201) >> t1 >> left outer join (select uin,clientip from t_sd_ucm_cominfo_FinalResult >> where ds=20151130) t2 >> on t1.uin=t2.uin >> where t2.clientip is NULL; >> >> *The attachment show the two query's stages.* >> *Here is the partition info* >> 104.3 M >> /user/hive/warehouse/u_wsd.db/t_sd_ucm_cominfo_finalresult/ds=20151202 >> 110.0 M >> /user/hive/warehouse/u_wsd.db/t_sd_ucm_cominfo_finalresult/ds=20151201 >> 112.6 M >> /user/hive/warehouse/u_wsd.db/t_sd_ucm_cominfo_finalresult/ds=20151130 >> >> >> >> *Why there are two different stages?* >> *The stage1 in first query is very slowly.* >> >> *Thanks.* >> *Best wishes.* >> > >