Re: trying to figure out number of MR jobs from explain output

Nicholas Hakobian Fri, 11 Dec 2015 14:56:26 -0800

You can't determine it definitively from the plan, because it depends
on the nature of the data being processed, especially when it comes to
mapjoins. If the output of one stage is small enough to be mapjoined,
parts of a stage can be skipped, since the whole dataset is available
on every node.


I'm sure there are other conditions as well, but that is the general idea.
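
As a rough illustration of why only a range can be read off the plan, here
is a minimal Python sketch. It is not something Hive provides; the function
name mr_job_bounds and the abbreviated SAMPLE plan are made up for this
example. It counts the stages marked "Map Reduce" and assumes that a
conditional stage (the one with "consists of") runs exactly one of its listed
alternatives at runtime. It only looks at the first stage of each
alternative, so treat the result as an estimate.

```python
import re

def mr_job_bounds(explain_text):
    """Estimate (min, max) Map Reduce jobs from Hive EXPLAIN text.

    Assumption: each conditional stage runs exactly one of the
    alternatives listed after "consists of".
    """
    # Strip mail-quote prefixes ("> ") so quoted output parses too.
    lines = [re.sub(r"^>\s?", "", ln) for ln in explain_text.splitlines()]

    mr_stages = set()   # stages whose plan says "Map Reduce"
    branches = {}       # conditional stage -> its alternative stages
    current = None
    for ln in lines:
        cond = re.match(r"\s*(Stage-\d+)\b.*consists of (.+)", ln)
        if cond:
            branches[cond.group(1)] = re.findall(r"Stage-\d+", cond.group(2))
            continue
        plan = re.match(r"\s*Stage: (Stage-\d+)", ln)
        if plan:
            current = plan.group(1)
        elif current and ln.strip() == "Map Reduce":
            mr_stages.add(current)

    # MR stages outside any conditional are counted as always running.
    alternatives = {s for alts in branches.values() for s in alts}
    always = len(mr_stages - alternatives)

    # For each conditional, an alternative costs 1 job if it is an MR stage.
    per_cond = [[int(s in mr_stages) for s in alts] for alts in branches.values()]
    low = always + sum(min(c) for c in per_cond)
    high = always + sum(max(c) for c in per_cond)
    return low, high

# Abbreviated from the plan quoted in this thread (operator details omitted).
SAMPLE = """\
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-7 depends on stages: Stage-1 , consists of Stage-4, Stage-3, Stage-5
>   Stage-4
>   Stage-0 depends on stages: Stage-4, Stage-3, Stage-6
>   Stage-3
>   Stage-5
>   Stage-6 depends on stages: Stage-5
>
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>   Stage: Stage-7
>     Conditional Operator
>   Stage: Stage-4
>     Move Operator
>   Stage: Stage-3
>     Map Reduce
>   Stage: Stage-5
>     Map Reduce
>   Stage: Stage-6
>     Move Operator"""

print(mr_job_bounds(SAMPLE))  # prints (1, 2)
```

For the plan in the original message this gives a range of 1 to 2 jobs, which
is consistent with the 2 jobs that actually ran; which end of the range you
get is decided at runtime.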

-Nick

Nicholas Szandor Hakobian
Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com



On Fri, Dec 11, 2015 at 2:00 PM, Ophir Etzion <op...@foursquare.com> wrote:
> Hi,
>
> I've been trying to figure out how to determine the number of MR jobs that
> will be run for a Hive query from the EXPLAIN output.
>
> I haven't found a consistent method for determining that.
>
> For example, in one of my queries (a CTAS query):
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-7 depends on stages: Stage-1 , consists of Stage-4, Stage-3, Stage-5
>   Stage-4
>   Stage-0 depends on stages: Stage-4, Stage-3, Stage-6
>   Stage-8 depends on stages: Stage-0
>   Stage-2 depends on stages: Stage-8
>   Stage-3
>   Stage-5
>   Stage-6 depends on stages: Stage-5
>
> Stage-1, Stage-3, and Stage-5 are listed as Map Reduce stages.
>
> Eventually, 2 MR jobs ran.
>
> In other cases, only 1 job runs.
>
> I couldn't find a consistent rule for figuring this out.
>
> Can anyone help?
>
> Thank you!!
>
> below is full output
>
> explain CREATE TABLE beekeeper_results.test3 ROW FORMAT SERDE
> "com.foursquare.hadoop.hive.serde.lazycsv.LazySimpleCSVSerde" WITH
> SERDEPROPERTIES ('escape.delim'='\\', 'mapkey.delim'='\;',
> 'colelction.delim'='|') AS SELECT * FROM beekeeper_results.test2;
> OK
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-7 depends on stages: Stage-1 , consists of Stage-4, Stage-3, Stage-5
>   Stage-4
>   Stage-0 depends on stages: Stage-4, Stage-3, Stage-6
>   Stage-8 depends on stages: Stage-0
>   Stage-2 depends on stages: Stage-8
>   Stage-3
>   Stage-5
>   Stage-6 depends on stages: Stage-5
>
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Map Operator Tree:
>           TableScan
>             alias: test2
>             Statistics: Num rows: 112 Data size: 11690 Basic stats: COMPLETE
> Column stats: NONE
>             Select Operator
>               expressions: blasttag (type: string), actioncounts (type:
> array<struct<actiontype:string,count:int>>), detailedclicks (type:
> array<struct<linkindex:int,count:int,linkname:string>>), countsbyclient
> (type: array<struct<client:string,actiontype:string,count:int>>),
> totalactioncounts (type: array<struct<actiontype:string,count:int>>),
> actionsbydate (type:
> array<struct<datesent:string,actiontype:string,count:int>>)
>               outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
>               Statistics: Num rows: 112 Data size: 11690 Basic stats:
> COMPLETE Column stats: NONE
>               File Output Operator
>                 compressed: false
>                 Statistics: Num rows: 112 Data size: 11690 Basic stats:
> COMPLETE Column stats: NONE
>                 table:
>                     input format: org.apache.hadoop.mapred.TextInputFormat
>                     output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                     serde:
> com.foursquare.hadoop.hive.serde.lazycsv.LazySimpleCSVSerde
>                     name: beekeeper_results.test3
>
>   Stage: Stage-7
>     Conditional Operator
>
>   Stage: Stage-4
>     Move Operator
>       files:
>           hdfs directory: true
>           destination:
> hdfs://hadoop-alidoro-nn-vip/user/hive/warehouse/.hive-staging_hive_2015-12-11_21-52-35_063_8498858370292854265-1/-ext-10001
>
>   Stage: Stage-0
>     Move Operator
>       files:
>           hdfs directory: true
>           destination: ***
>
>   Stage: Stage-8
>       Create Table Operator:
>         Create Table
>           columns: blasttag string, actioncounts
> array<struct<actiontype:string,count:int>>, detailedclicks
> array<struct<linkindex:int,count:int,linkname:string>>, countsbyclient
> array<struct<client:string,actiontype:string,count:int>>, totalactioncounts
> array<struct<actiontype:string,count:int>>, actionsbydate
> array<struct<datesent:string,actiontype:string,count:int>>
>           input format: org.apache.hadoop.mapred.TextInputFormat
>           output format:
> org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
>           serde name:
> com.foursquare.hadoop.hive.serde.lazycsv.LazySimpleCSVSerde
>           serde properties:
>             colelction.delim |
>             escape.delim \
>             mapkey.delim ;
>           name: beekeeper_results.test3
>
>   Stage: Stage-2
>     Stats-Aggr Operator
>
>   Stage: Stage-3
>     Map Reduce
>       Map Operator Tree:
>           TableScan
>             File Output Operator
>               compressed: false
>               table:
>                   input format: org.apache.hadoop.mapred.TextInputFormat
>                   output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                   serde:
> com.foursquare.hadoop.hive.serde.lazycsv.LazySimpleCSVSerde
>                   name: beekeeper_results.test3
>
>   Stage: Stage-5
>     Map Reduce
>       Map Operator Tree:
>           TableScan
>             File Output Operator
>               compressed: false
>               table:
>                   input format: org.apache.hadoop.mapred.TextInputFormat
>                   output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                   serde:
> com.foursquare.hadoop.hive.serde.lazycsv.LazySimpleCSVSerde
>                   name: beekeeper_results.test3
>
>   Stage: Stage-6
>     Move Operator
>       files:
>           hdfs directory: true
>           destination: ***
>
