Re: It seems that result of Hive on Spark is mistake And result of Hive and Hive on Spark are not the same

Jone Zhang Tue, 22 Dec 2015 03:40:32 -0800

Hive 1.2.1 on Spark1.4.1

2015-12-22 19:31 GMT+08:00 Jone Zhang <joyoungzh...@gmail.com>:


> *select  * from staff;*
> 1 jone 22 1
> 2 lucy 21 1
> 3 hmm 22 2
> 4 james 24 3
> 5 xiaoliu 23 3
>
> *select id,date_ from trade union all select id,"test" from trade ;*
> 1 201510210908
> 2 201509080234
> 2 201509080235
> 1 test
> 2 test
> 2 test
>
> *set hive.execution.engine=spark;*
> *set spark.master=local;*
> *select /*+mapjoin(t)*/ * from staff s join *
> *(select id,date_ from trade union all select id,"test" from trade ) t on
> s.id <http://s.id>=t.id <http://t.id>;*
> 1 jone 22 1 1 201510210908
> 2 lucy 21 1 2 201509080234
> 2 lucy 21 1 2 201509080235
>
> *set hive.execution.engine=mr;*
> *select /*+mapjoin(t)*/ * from staff s join *
> *(select id,date_ from trade union all select id,"test" from trade ) t on
> s.id <http://s.id>=t.id <http://t.id>;*
> FAILED: SemanticException [Error 10227]: Not all clauses are supported
> with mapjoin hint. Please remove mapjoin hint.
>
> *I have two questions*
> *1.Why result of hive on spark not include the following record?*
> 1 jone 22 1 1 test
> 2 lucy 21 1 2 test
> 2 lucy 21 1 2 test
>
> *2.Why there are two different ways of dealing same query?*
>
>
> *explain 1:*
> *set hive.execution.engine=spark;*
> *set spark.master=local;*
> *explain *
> *select id,date_ from trade union all select id,"test" from trade;*
> OK
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
>
> STAGE PLANS:
>   Stage: Stage-1
>     Spark
>       DagName:
> jonezhang_20151222191643_5301d90a-caf0-4934-8092-d165c87a4190:1
>       Vertices:
>         Map 1
>             Map Operator Tree:
>                 TableScan
>                   alias: trade
>                   Statistics: Num rows: 6 Data size: 48 Basic stats:
> COMPLETE Column stats: NONE
>                   Select Operator
>                     expressions: id (type: int), date_ (type: string)
>                     outputColumnNames: _col0, _col1
>                     Statistics: Num rows: 6 Data size: 48 Basic stats:
> COMPLETE Column stats: NONE
>                     File Output Operator
>                       compressed: false
>                       Statistics: Num rows: 12 Data size: 96 Basic stats:
> COMPLETE Column stats: NONE
>                       table:
>                           input format:
> org.apache.hadoop.mapred.TextInputFormat
>                           output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                           serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>         Map 2
>             Map Operator Tree:
>                 TableScan
>                   alias: trade
>                   Statistics: Num rows: 6 Data size: 48 Basic stats:
> COMPLETE Column stats: NONE
>                   Select Operator
>                     expressions: id (type: int), 'test' (type: string)
>                     outputColumnNames: _col0, _col1
>                     Statistics: Num rows: 6 Data size: 48 Basic stats:
> COMPLETE Column stats: NONE
>                     File Output Operator
>                       compressed: false
>                       Statistics: Num rows: 12 Data size: 96 Basic stats:
> COMPLETE Column stats: NONE
>                       table:
>                           input format:
> org.apache.hadoop.mapred.TextInputFormat
>                           output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                           serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>       Processor Tree:
>         ListSink
>
>
> *explain 2:*
> *set hive.execution.engine=spark;*
> *set spark.master=local;*
> *explain *
> *select /*+mapjoin(t)*/ * from staff s join *
> *(select id,date_ from trade union all select id,"20999999999" from trade
> ) t on s.id <http://s.id>=t.id <http://t.id>;*
> OK
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-1 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-1
>
> STAGE PLANS:
>   Stage: Stage-2
>     Spark
>       DagName:
> jonezhang_20151222191716_be7eac84-b5b6-4478-b88f-9f59e2b1b1a8:3
>       Vertices:
>         Map 1
>             Map Operator Tree:
>                 TableScan
>                   alias: trade
>                   Statistics: Num rows: 6 Data size: 48 Basic stats:
> COMPLETE Column stats: NONE
>                   Filter Operator
>                     predicate: id is not null (type: boolean)
>                     Statistics: Num rows: 3 Data size: 24 Basic stats:
> COMPLETE Column stats: NONE
>                     Select Operator
>                       expressions: id (type: int), date_ (type: string)
>                       outputColumnNames: _col0, _col1
>                       Statistics: Num rows: 3 Data size: 24 Basic stats:
> COMPLETE Column stats: NONE
>                       Spark HashTable Sink Operator
>                         keys:
>                           0 id (type: int)
>                           1 _col0 (type: int)
>             Local Work:
>               Map Reduce Local Work
>
>   Stage: Stage-1
>     Spark
>       DagName:
> jonezhang_20151222191716_be7eac84-b5b6-4478-b88f-9f59e2b1b1a8:2
>       Vertices:
>         Map 2
>             Map Operator Tree:
>                 TableScan
>                   alias: s
>                   Statistics: Num rows: 1 Data size: 66 Basic stats:
> COMPLETE Column stats: NONE
>                   Filter Operator
>                     predicate: id is not null (type: boolean)
>                     Statistics: Num rows: 1 Data size: 66 Basic stats:
> COMPLETE Column stats: NONE
>                     Map Join Operator
>                       condition map:
>                            Inner Join 0 to 1
>                       keys:
>                         0 id (type: int)
>                         1 _col0 (type: int)
>                       outputColumnNames: _col0, _col1, _col2, _col3,
> _col7, _col8
>                       input vertices:
>                         1 Map 1
>                       Statistics: Num rows: 6 Data size: 52 Basic stats:
> COMPLETE Column stats: NONE
>                       Select Operator
>                         expressions: _col0 (type: int), _col1 (type:
> string), _col2 (type: int), _col3 (type: int), _col7 (type: int), _col8
> (type: string)
>                         outputColumnNames: _col0, _col1, _col2, _col3,
> _col4, _col5
>                         Statistics: Num rows: 6 Data size: 52 Basic stats:
> COMPLETE Column stats: NONE
>                         File Output Operator
>                           compressed: false
>                           Statistics: Num rows: 6 Data size: 52 Basic
> stats: COMPLETE Column stats: NONE
>                           table:
>                               input format:
> org.apache.hadoop.mapred.TextInputFormat
>                               output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                               serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>             Local Work:
>               Map Reduce Local Work
>
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>       Processor Tree:
>         ListSink
>
>
>
> *I can't find any information about union "test" in explain 2.*
>
> *Some properties on hive-site.xml is *
> <property>
> <name>hive.ignore.mapjoin.hint</name>
> <value>false</value>
> </property>
> <property>
> <name>hive.auto.convert.join</name>
> <value>true</value>
> </property>
> <property>
> <name>hive.auto.convert.join.noconditionaltask</name>
> <value>true</value>
>
>
>
>
> *Thanks.*
> *Best wishes.*
>

Re: It seems that result of Hive on Spark is mistake And result of Hive and Hive on Spark are not the same

Reply via email to