[jira] [Updated] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same
[ https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JoneZhang updated HIVE-12736: - Summary: It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same (was: It seems that result of Hive on Spark is mistake And result of Hive and Hive on Spark are not the same) > It seems that result of Hive on Spark be mistaken and result of Hive and Hive > on Spark are not the same > --- > > Key: HIVE-12736 > URL: https://issues.apache.org/jira/browse/HIVE-12736 > Project: Hive > Issue Type: Bug >Affects Versions: 1.1.1, 1.2.1 >Reporter: JoneZhang > > select * from staff; > 1 jone22 1 > 2 lucy21 1 > 3 hmm 22 2 > 4 james 24 3 > 5 xiaoliu 23 3 > select id,date_ from trade union all select id,"test" from trade ; > 1 201510210908 > 2 201509080234 > 2 201509080235 > 1 test > 2 test > 2 test > set hive.execution.engine=spark; > set spark.master=local; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > 1 jone22 1 1 201510210908 > 2 lucy21 1 2 201509080234 > 2 lucy21 1 2 201509080235 > set hive.execution.engine=mr; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > FAILED: SemanticException [Error 10227]: Not all clauses are supported with > mapjoin hint. Please remove mapjoin hint. > I have two questions > 1.Why result of hive on spark not include the following record? > 1 jone22 1 1 test > 2 lucy21 1 2 test > 2 lucy21 1 2 test > 2.Why there are two different ways of dealing same query? > explain 1: > set hive.execution.engine=spark; > set spark.master=local; > explain > select id,date_ from trade union all select id,"test" from trade; > OK > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Spark > DagName: jonezhang_20151222191643_5301d90a-caf0-4934-8092-d165c87a4190:1 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), date_ (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Map 2 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), 'test' (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Stage: Stage-0 > Fetch Operator > limit: -1 > Processor Tree: > ListSink > explain 2: > set hive.execution.engine=spark; > set spark.master=local; > explain > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > OK > STAGE DEPENDENCIES: > Stage-2 is a root stage > Stage-1 depends on stages: Stage-2 > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-2 > Spark > DagName: jon
[jira] [Updated] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same
[ https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-12736: - Attachment: HIVE-12736.1-spark.patch {{SparkMapJoinProcessor}} miss some validation during {{convertMapJoin}}, for Spark mode, the query should work the same way as MR. > It seems that result of Hive on Spark be mistaken and result of Hive and Hive > on Spark are not the same > --- > > Key: HIVE-12736 > URL: https://issues.apache.org/jira/browse/HIVE-12736 > Project: Hive > Issue Type: Bug >Affects Versions: 1.1.1, 1.2.1 >Reporter: JoneZhang >Assignee: Chengxiang Li > Attachments: HIVE-12736.1-spark.patch > > > select * from staff; > 1 jone22 1 > 2 lucy21 1 > 3 hmm 22 2 > 4 james 24 3 > 5 xiaoliu 23 3 > select id,date_ from trade union all select id,"test" from trade ; > 1 201510210908 > 2 201509080234 > 2 201509080235 > 1 test > 2 test > 2 test > set hive.execution.engine=spark; > set spark.master=local; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > 1 jone22 1 1 201510210908 > 2 lucy21 1 2 201509080234 > 2 lucy21 1 2 201509080235 > set hive.execution.engine=mr; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > FAILED: SemanticException [Error 10227]: Not all clauses are supported with > mapjoin hint. Please remove mapjoin hint. > I have two questions > 1.Why result of hive on spark not include the following record? > 1 jone22 1 1 test > 2 lucy21 1 2 test > 2 lucy21 1 2 test > 2.Why there are two different ways of dealing same query? > explain 1: > set hive.execution.engine=spark; > set spark.master=local; > explain > select id,date_ from trade union all select id,"test" from trade; > OK > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Spark > DagName: jonezhang_20151222191643_5301d90a-caf0-4934-8092-d165c87a4190:1 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), date_ (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Map 2 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), 'test' (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Stage: Stage-0 > Fetch Operator > limit: -1 > Processor Tree: > ListSink > explain 2: > set hive.execution.engine=spark; > set spark.master=local; > explain > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > OK > STAGE DEPENDENCIES: > Stage-2 is a root stage > Stage-1 depends on stages: Stage-2 > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage:
[jira] [Updated] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same
[ https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-12736: - Attachment: HIVE-12736.2-spark.patch > It seems that result of Hive on Spark be mistaken and result of Hive and Hive > on Spark are not the same > --- > > Key: HIVE-12736 > URL: https://issues.apache.org/jira/browse/HIVE-12736 > Project: Hive > Issue Type: Bug >Affects Versions: 1.1.1, 1.2.1 >Reporter: JoneZhang >Assignee: Chengxiang Li > Attachments: HIVE-12736.1-spark.patch, HIVE-12736.2-spark.patch > > > select * from staff; > 1 jone22 1 > 2 lucy21 1 > 3 hmm 22 2 > 4 james 24 3 > 5 xiaoliu 23 3 > select id,date_ from trade union all select id,"test" from trade ; > 1 201510210908 > 2 201509080234 > 2 201509080235 > 1 test > 2 test > 2 test > set hive.execution.engine=spark; > set spark.master=local; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > 1 jone22 1 1 201510210908 > 2 lucy21 1 2 201509080234 > 2 lucy21 1 2 201509080235 > set hive.execution.engine=mr; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > FAILED: SemanticException [Error 10227]: Not all clauses are supported with > mapjoin hint. Please remove mapjoin hint. > I have two questions > 1.Why result of hive on spark not include the following record? > 1 jone22 1 1 test > 2 lucy21 1 2 test > 2 lucy21 1 2 test > 2.Why there are two different ways of dealing same query? > explain 1: > set hive.execution.engine=spark; > set spark.master=local; > explain > select id,date_ from trade union all select id,"test" from trade; > OK > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Spark > DagName: jonezhang_20151222191643_5301d90a-caf0-4934-8092-d165c87a4190:1 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), date_ (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Map 2 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), 'test' (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Stage: Stage-0 > Fetch Operator > limit: -1 > Processor Tree: > ListSink > explain 2: > set hive.execution.engine=spark; > set spark.master=local; > explain > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > OK > STAGE DEPENDENCIES: > Stage-2 is a root stage > Stage-1 depends on stages: Stage-2 > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-2 > Spark > DagName: jonezhang_20151222191716_be7eac84-b5b6-4478-b88f-9f59e2b1b1a8:3 >
[jira] [Updated] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same
[ https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuefu Zhang updated HIVE-12736: --- Description: {code} select * from staff; 1 jone22 1 2 lucy21 1 3 hmm 22 2 4 james 24 3 5 xiaoliu 23 3 select id,date_ from trade union all select id,"test" from trade ; 1 201510210908 2 201509080234 2 201509080235 1 test 2 test 2 test set hive.execution.engine=spark; set spark.master=local; select /*+mapjoin(t)*/ * from staff s join (select id,date_ from trade union all select id,"test" from trade ) t on s.id=t.id; 1 jone22 1 1 201510210908 2 lucy21 1 2 201509080234 2 lucy21 1 2 201509080235 set hive.execution.engine=mr; select /*+mapjoin(t)*/ * from staff s join (select id,date_ from trade union all select id,"test" from trade ) t on s.id=t.id; FAILED: SemanticException [Error 10227]: Not all clauses are supported with mapjoin hint. Please remove mapjoin hint. {code} I have two questions 1.Why result of hive on spark not include the following record? {code} 1 jone22 1 1 test 2 lucy21 1 2 test 2 lucy21 1 2 test {code} 2.Why there are two different ways of dealing same query? explain 1: {code} set hive.execution.engine=spark; set spark.master=local; explain select id,date_ from trade union all select id,"test" from trade; OK STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Spark DagName: jonezhang_20151222191643_5301d90a-caf0-4934-8092-d165c87a4190:1 Vertices: Map 1 Map Operator Tree: TableScan alias: trade Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: id (type: int), date_ (type: string) outputColumnNames: _col0, _col1 Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 12 Data size: 96 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Map 2 Map Operator Tree: TableScan alias: trade Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: id (type: int), 'test' (type: string) outputColumnNames: _col0, _col1 Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 12 Data size: 96 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink {code} explain 2: {code} set hive.execution.engine=spark; set spark.master=local; explain select /*+mapjoin(t)*/ * from staff s join (select id,date_ from trade union all select id,"test" from trade ) t on s.id=t.id; OK STAGE DEPENDENCIES: Stage-2 is a root stage Stage-1 depends on stages: Stage-2 Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-2 Spark DagName: jonezhang_20151222191716_be7eac84-b5b6-4478-b88f-9f59e2b1b1a8:3 Vertices: Map 1 Map Operator Tree: TableScan alias: trade Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: id is not null (type: boolean) Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: id (type: int), date_ (type: string) outputColumnNames: _col0, _col1 Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE Column stats: NONE Spark HashTable Sink Operator
[jira] [Updated] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same
[ https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-12736: - Attachment: HIVE-12736.3-spark.patch Yes, [~xuefuz], {{Operator::opAllowedBeforeMapJoin()}} and {{Operator::opAllowedAfterMapJoin()}} are only used for {{MapJoinProcessor::validateMapJoinTypes()}}, For MR mode, if there are {{ReduceSinkOperator}} before {{MapJoinOperator}}, the {{ReduceSinkOperator}} would be removed from the operator tree, so {{ReduceSinkOperator::opAllowedBeforeMapJoin()}} would never be accessed in MR mode. For Spark mode, only one of two {{ReduceSinkOperator}}s before {{MapJoinOperator}} would be removed, if {{ReduceSinkOperator::opAllowedBeforeMapJoin()}} return false, all the mapjoin with hint would be failed in Spark mode, it actually does not make sense, it should only fail while it's {{UnionOperator}} before {{MapJoinOperator}}. So the change does not influence MR mode, and it's required by Spark mode. Besides, i add negative test for mapjoin with hint. > It seems that result of Hive on Spark be mistaken and result of Hive and Hive > on Spark are not the same > --- > > Key: HIVE-12736 > URL: https://issues.apache.org/jira/browse/HIVE-12736 > Project: Hive > Issue Type: Bug >Affects Versions: 1.1.1, 1.2.1 >Reporter: JoneZhang >Assignee: Chengxiang Li > Attachments: HIVE-12736.1-spark.patch, HIVE-12736.2-spark.patch, > HIVE-12736.3-spark.patch > > > {code} > select * from staff; > 1 jone22 1 > 2 lucy21 1 > 3 hmm 22 2 > 4 james 24 3 > 5 xiaoliu 23 3 > select id,date_ from trade union all select id,"test" from trade ; > 1 201510210908 > 2 201509080234 > 2 201509080235 > 1 test > 2 test > 2 test > set hive.execution.engine=spark; > set spark.master=local; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > 1 jone22 1 1 201510210908 > 2 lucy21 1 2 201509080234 > 2 lucy21 1 2 201509080235 > set hive.execution.engine=mr; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > FAILED: SemanticException [Error 10227]: Not all clauses are supported with > mapjoin hint. Please remove mapjoin hint. > {code} > I have two questions > 1.Why result of hive on spark not include the following record? > {code} > 1 jone22 1 1 test > 2 lucy21 1 2 test > 2 lucy21 1 2 test > {code} > 2.Why there are two different ways of dealing same query? > explain 1: > {code} > set hive.execution.engine=spark; > set spark.master=local; > explain > select id,date_ from trade union all select id,"test" from trade; > OK > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Spark > DagName: jonezhang_20151222191643_5301d90a-caf0-4934-8092-d165c87a4190:1 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), date_ (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Map 2 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), 'test' (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Ba
[jira] [Updated] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same
[ https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-12736: - Attachment: HIVE-12736.4-spark.patch > It seems that result of Hive on Spark be mistaken and result of Hive and Hive > on Spark are not the same > --- > > Key: HIVE-12736 > URL: https://issues.apache.org/jira/browse/HIVE-12736 > Project: Hive > Issue Type: Bug >Affects Versions: 1.1.1, 1.2.1 >Reporter: JoneZhang >Assignee: Chengxiang Li > Attachments: HIVE-12736.1-spark.patch, HIVE-12736.2-spark.patch, > HIVE-12736.3-spark.patch, HIVE-12736.4-spark.patch > > > {code} > select * from staff; > 1 jone22 1 > 2 lucy21 1 > 3 hmm 22 2 > 4 james 24 3 > 5 xiaoliu 23 3 > select id,date_ from trade union all select id,"test" from trade ; > 1 201510210908 > 2 201509080234 > 2 201509080235 > 1 test > 2 test > 2 test > set hive.execution.engine=spark; > set spark.master=local; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > 1 jone22 1 1 201510210908 > 2 lucy21 1 2 201509080234 > 2 lucy21 1 2 201509080235 > set hive.execution.engine=mr; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > FAILED: SemanticException [Error 10227]: Not all clauses are supported with > mapjoin hint. Please remove mapjoin hint. > {code} > I have two questions > 1.Why result of hive on spark not include the following record? > {code} > 1 jone22 1 1 test > 2 lucy21 1 2 test > 2 lucy21 1 2 test > {code} > 2.Why there are two different ways of dealing same query? > explain 1: > {code} > set hive.execution.engine=spark; > set spark.master=local; > explain > select id,date_ from trade union all select id,"test" from trade; > OK > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Spark > DagName: jonezhang_20151222191643_5301d90a-caf0-4934-8092-d165c87a4190:1 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), date_ (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Map 2 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), 'test' (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Stage: Stage-0 > Fetch Operator > limit: -1 > Processor Tree: > ListSink > {code} > explain 2: > {code} > set hive.execution.engine=spark; > set spark.master=local; > explain > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > OK > STAGE DEPENDENCIES: > Stage-2 is a root stage > Stage-1 depends on stages: Stage-2 > Stage-0 depends on stages: Stage-1 > STAGE PLANS: >
[jira] [Updated] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same
[ https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-12736: - Attachment: HIVE-12736.5-spark.patch [~xuefuz], Yes, it's related, i miss something here. Group By before MapJoin is not allowed, and in MR mode, it use {{ReduceSinkOperator}} to check whether there is Group By before MapJoin, it has conflict with Spark mode, as mentioned before. Instead of validate MapJoin compatibility with other Operators by through {{opAllowedBeforeMapJoin()}} and {{opAllowedAfterMapJoin()}}, i should be easier and proper to implement through pattern match, i didn't rewrite the validation for MR mode, just add new validation logic for Spark mode based on pattern match. > It seems that result of Hive on Spark be mistaken and result of Hive and Hive > on Spark are not the same > --- > > Key: HIVE-12736 > URL: https://issues.apache.org/jira/browse/HIVE-12736 > Project: Hive > Issue Type: Bug >Affects Versions: 1.1.1, 1.2.1 >Reporter: JoneZhang >Assignee: Chengxiang Li > Attachments: HIVE-12736.1-spark.patch, HIVE-12736.2-spark.patch, > HIVE-12736.3-spark.patch, HIVE-12736.4-spark.patch, HIVE-12736.5-spark.patch > > > {code} > select * from staff; > 1 jone22 1 > 2 lucy21 1 > 3 hmm 22 2 > 4 james 24 3 > 5 xiaoliu 23 3 > select id,date_ from trade union all select id,"test" from trade ; > 1 201510210908 > 2 201509080234 > 2 201509080235 > 1 test > 2 test > 2 test > set hive.execution.engine=spark; > set spark.master=local; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > 1 jone22 1 1 201510210908 > 2 lucy21 1 2 201509080234 > 2 lucy21 1 2 201509080235 > set hive.execution.engine=mr; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > FAILED: SemanticException [Error 10227]: Not all clauses are supported with > mapjoin hint. Please remove mapjoin hint. > {code} > I have two questions > 1.Why result of hive on spark not include the following record? > {code} > 1 jone22 1 1 test > 2 lucy21 1 2 test > 2 lucy21 1 2 test > {code} > 2.Why there are two different ways of dealing same query? > explain 1: > {code} > set hive.execution.engine=spark; > set spark.master=local; > explain > select id,date_ from trade union all select id,"test" from trade; > OK > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Spark > DagName: jonezhang_20151222191643_5301d90a-caf0-4934-8092-d165c87a4190:1 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), date_ (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Map 2 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), 'test' (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKey
[jira] [Updated] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same
[ https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengxiang Li updated HIVE-12736: - Attachment: HIVE-12736.5-spark.patch I can't reproduce the failed mapjoin_memcheck.q locally, upload the patch again to verify. > It seems that result of Hive on Spark be mistaken and result of Hive and Hive > on Spark are not the same > --- > > Key: HIVE-12736 > URL: https://issues.apache.org/jira/browse/HIVE-12736 > Project: Hive > Issue Type: Bug >Affects Versions: 1.1.1, 1.2.1 >Reporter: JoneZhang >Assignee: Chengxiang Li > Attachments: HIVE-12736.1-spark.patch, HIVE-12736.2-spark.patch, > HIVE-12736.3-spark.patch, HIVE-12736.4-spark.patch, HIVE-12736.5-spark.patch, > HIVE-12736.5-spark.patch > > > {code} > select * from staff; > 1 jone22 1 > 2 lucy21 1 > 3 hmm 22 2 > 4 james 24 3 > 5 xiaoliu 23 3 > select id,date_ from trade union all select id,"test" from trade ; > 1 201510210908 > 2 201509080234 > 2 201509080235 > 1 test > 2 test > 2 test > set hive.execution.engine=spark; > set spark.master=local; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > 1 jone22 1 1 201510210908 > 2 lucy21 1 2 201509080234 > 2 lucy21 1 2 201509080235 > set hive.execution.engine=mr; > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; > FAILED: SemanticException [Error 10227]: Not all clauses are supported with > mapjoin hint. Please remove mapjoin hint. > {code} > I have two questions > 1.Why result of hive on spark not include the following record? > {code} > 1 jone22 1 1 test > 2 lucy21 1 2 test > 2 lucy21 1 2 test > {code} > 2.Why there are two different ways of dealing same query? > explain 1: > {code} > set hive.execution.engine=spark; > set spark.master=local; > explain > select id,date_ from trade union all select id,"test" from trade; > OK > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Spark > DagName: jonezhang_20151222191643_5301d90a-caf0-4934-8092-d165c87a4190:1 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), date_ (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Map 2 > Map Operator Tree: > TableScan > alias: trade > Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE > Column stats: NONE > Select Operator > expressions: id (type: int), 'test' (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 6 Data size: 48 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 12 Data size: 96 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Stage: Stage-0 > Fetch Operator > limit: -1 > Processor Tree: > ListSink > {code} > explain 2: > {code} > set hive.execution.engine=spark; > set spark.master=local; > explain > select /*+mapjoin(t)*/ * from staff s join > (select id,date_ from trade union all select id,"test" from trade ) t on > s.id=t.id; >