Re: Why there are two different stages on the same query when i use hive on spark.

Jone Zhang Thu, 03 Dec 2015 18:53:06 -0800

*Thanks for you warning.*
*The first query is mapjoin and second query is reducejoin.The data format
is all textInputFormat.*
*I'll  go to learn more about  mapjoin of **hive on spark** anyway,But
why** stage1
of  first query in attachment is so slowly?*


*Explain first query:*
hive (u_wsd)> explain insert overwrite table t_sd_ucm_cominfo_incremental
partition (ds=20151202)
            > select t1.uin,t1.clientip from
            > (select uin,clientip from t_sd_ucm_cominfo_FinalResult where
ds=20151202) t1
            > left outer join (select uin,clientip from
t_sd_ucm_cominfo_FinalResult where ds=20151201) t2
            > on t1.uin=t2.uin
            > where t2.clientip is NULL;
OK
STAGE DEPENDENCIES:
  Stage-3 is a root stage
  Stage-1 depends on stages: Stage-3
  Stage-0 depends on stages: Stage-1
  Stage-2 depends on stages: Stage-0

STAGE PLANS:
  Stage: Stage-3
    Spark
      DagName: mqq_20151204103210_fb913c60-c5ed-438f-8efb-950ee5639dd5:2
      Vertices:
        Map 2
            Map Operator Tree:
                TableScan
                  alias: t_sd_ucm_cominfo_finalresult
                  Statistics: Num rows: 108009 Data size: 2873665 Basic
stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: uin (type: string), clientip (type: string)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 108009 Data size: 2873665 Basic
stats: COMPLETE Column stats: NONE
                    Spark HashTable Sink Operator
                      keys:
                        0 _col0 (type: string)
                        1 _col0 (type: string)
            Local Work:
              Map Reduce Local Work

  Stage: Stage-1
    Spark
      DagName: mqq_20151204103210_fb913c60-c5ed-438f-8efb-950ee5639dd5:1
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: t_sd_ucm_cominfo_finalresult
                  Statistics: Num rows: 103779 Data size: 2746785 Basic
stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: uin (type: string), clientip (type: string)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 103779 Data size: 2746785 Basic
stats: COMPLETE Column stats: NONE
                    Map Join Operator
                      condition map:
                           Left Outer Join0 to 1
                      keys:
                        0 _col0 (type: string)
                        1 _col0 (type: string)
                      outputColumnNames: _col0, _col1, _col3
                      input vertices:
                        1 Map 2
                      Statistics: Num rows: 118809 Data size: 3161031 Basic
stats: COMPLETE Column stats: NONE
                      Filter Operator
                        predicate: _col3 is null (type: boolean)
                        Statistics: Num rows: 59404 Data size: 1580502
Basic stats: COMPLETE Column stats: NONE
                        Select Operator
                          expressions: _col0 (type: string), _col1 (type:
string)
                          outputColumnNames: _col0, _col1
                          Statistics: Num rows: 59404 Data size: 1580502
Basic stats: COMPLETE Column stats: NONE
                          File Output Operator
                            compressed: false
                            Statistics: Num rows: 59404 Data size: 1580502
Basic stats: COMPLETE Column stats: NONE
                            table:
                                input format:
org.apache.hadoop.mapred.TextInputFormat
                                output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                                serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                                name: u_wsd.t_sd_ucm_cominfo_incremental
            Local Work:
              Map Reduce Local Work

  Stage: Stage-0
    Move Operator
      tables:
          partition:
            ds 20151202
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: u_wsd.t_sd_ucm_cominfo_incremental

  Stage: Stage-2
    Stats-Aggr Operator


*Explain second query:*
hive (u_wsd)> explain insert overwrite table t_sd_ucm_cominfo_incremental
partition (ds=20151201)
            > select t1.uin,t1.clientip from
            > (select uin,clientip from t_sd_ucm_cominfo_FinalResult where
ds=20151201) t1
            > left outer join (select uin,clientip from
t_sd_ucm_cominfo_FinalResult where ds=20151130) t2
            > on t1.uin=t2.uin
            > where t2.clientip is NULL;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
  Stage-2 depends on stages: Stage-0

STAGE PLANS:
  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 100), Map 3
(PARTITION-LEVEL SORT, 100)
      DagName: mqq_20151204103243_3eab6e6c-941e-476a-897f-cae97657063e:3
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: t_sd_ucm_cominfo_finalresult
                  Statistics: Num rows: 108009 Data size: 2873665 Basic
stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: uin (type: string), clientip (type: string)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 108009 Data size: 2873665 Basic
stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      key expressions: _col0 (type: string)
                      sort order: +
                      Map-reduce partition columns: _col0 (type: string)
                      Statistics: Num rows: 108009 Data size: 2873665 Basic
stats: COMPLETE Column stats: NONE
                      value expressions: _col1 (type: string)
        Map 3
            Map Operator Tree:
                TableScan
                  alias: t_sd_ucm_cominfo_finalresult
                  Statistics: Num rows: 590130 Data size: 118026051 Basic
stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: uin (type: string), clientip (type: string)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 590130 Data size: 118026051 Basic
stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      key expressions: _col0 (type: string)
                      sort order: +
                      Map-reduce partition columns: _col0 (type: string)
                      Statistics: Num rows: 590130 Data size: 118026051
Basic stats: COMPLETE Column stats: NONE
                      value expressions: _col1 (type: string)
        Reducer 2
            Reduce Operator Tree:
              Join Operator
                condition map:
                     Left Outer Join0 to 1
                keys:
                  0 _col0 (type: string)
                  1 _col0 (type: string)
                outputColumnNames: _col0, _col1, _col3
                Statistics: Num rows: 649143 Data size: 129828658 Basic
stats: COMPLETE Column stats: NONE
                Filter Operator
                  predicate: _col3 is null (type: boolean)
                  Statistics: Num rows: 324571 Data size: 64914228 Basic
stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: _col0 (type: string), _col1 (type: string)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 324571 Data size: 64914228 Basic
stats: COMPLETE Column stats: NONE
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 324571 Data size: 64914228
Basic stats: COMPLETE Column stats: NONE
                      table:
                          input format:
org.apache.hadoop.mapred.TextInputFormat
                          output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                          serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                          name: u_wsd.t_sd_ucm_cominfo_incremental

  Stage: Stage-0
    Move Operator
      tables:
          partition:
            ds 20151201
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: u_wsd.t_sd_ucm_cominfo_incremental

  Stage: Stage-2
    Stats-Aggr Operator

*Thanks.*

2015-12-03 22:17 GMT+08:00 Xuefu Zhang <xzh...@cloudera.com>:

> Can you also attach explain query result? What's your data format?
>
> --Xuefu
>
> On Thu, Dec 3, 2015 at 12:09 AM, Jone Zhang <joyoungzh...@gmail.com>
> wrote:
>
>> Hive1.2.1 on Spark1.4.1
>>
>> *The first query is:*
>> set mapred.reduce.tasks=100;
>> use u_wsd;
>> insert overwrite table t_sd_ucm_cominfo_incremental partition (ds=
>> 20151202)
>> select t1.uin,t1.clientip from
>> (select uin,clientip from t_sd_ucm_cominfo_FinalResult where ds=20151202)
>> t1
>> left outer join (select uin,clientip from t_sd_ucm_cominfo_FinalResult
>> where ds=20151201) t2
>> on t1.uin=t2.uin
>> where t2.clientip is NULL;
>>
>> *The second query is:*
>> set mapred.reduce.tasks=100;
>> use u_wsd;
>> insert overwrite table t_sd_ucm_cominfo_incremental partition (ds=
>> 20151201)
>> select t1.uin,t1.clientip from
>> (select uin,clientip from t_sd_ucm_cominfo_FinalResult where ds=20151201)
>> t1
>> left outer join (select uin,clientip from t_sd_ucm_cominfo_FinalResult
>> where ds=20151130) t2
>> on t1.uin=t2.uin
>> where t2.clientip is NULL;
>>
>> *The attachment show the two query's stages.*
>> *Here is the partition info*
>> 104.3 M
>>  /user/hive/warehouse/u_wsd.db/t_sd_ucm_cominfo_finalresult/ds=20151202
>> 110.0 M
>>  /user/hive/warehouse/u_wsd.db/t_sd_ucm_cominfo_finalresult/ds=20151201
>> 112.6 M
>>  /user/hive/warehouse/u_wsd.db/t_sd_ucm_cominfo_finalresult/ds=20151130
>>
>>
>>
>> *Why there are two different stages?*
>> *The stage1 in first query is very slowly.*
>>
>> *Thanks.*
>> *Best wishes.*
>>
>
>

Re: Why there are two different stages on the same query when i use hive on spark.

Reply via email to