[
https://issues.apache.org/jira/browse/HIVE-8859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xuefu Zhang updated HIVE-8859:
------------------------------
Summary: ColumnStatsTask fails because of SparkMapJoinResolver [Spark Branch]
    (was: ColumnStatsTask fails because of SparkMapJoinResolver)
> ColumnStatsTask fails because of SparkMapJoinResolver [Spark Branch]
> --------------------------------------------------------------------
>
> Key: HIVE-8859
> URL: https://issues.apache.org/jira/browse/HIVE-8859
> Project: Hive
> Issue Type: Sub-task
> Components: Spark
> Affects Versions: spark-branch
> Reporter: Chao
> Assignee: Chao
> Fix For: spark-branch
>
> Attachments: HIVE-8859.1-spark.patch, HIVE-8859.2-spark.patch
>
>
> The following query fails:
> {code}
> ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS key,value;
> {code}
> The plan looks like:
> {noformat}
> STAGE DEPENDENCIES:
>   Stage-0 is a root stage
>   Stage-2 is a root stage
> STAGE PLANS:
>   Stage: Stage-0
>     Spark
>       Edges:
>         Reducer 2 <- Map 1 (GROUP, 1)
>       DagName: chao_20141113105959_486b4bba-a2da-43c5-bf42-0ee69cd42576:1
>       Vertices:
>         Map 1
>             Map Operator Tree:
>                 TableScan
>                   alias: src
>                   Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE Column stats: NONE
>                   Select Operator
>                     expressions: key (type: string), value (type: string)
>                     outputColumnNames: key, value
>                     Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE Column stats: NONE
>                     Group By Operator
>                       aggregations: compute_stats(key, 16), compute_stats(value, 16)
>                       mode: hash
>                       outputColumnNames: _col0, _col1
>                       Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
>                       Reduce Output Operator
>                         sort order:
>                         Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
>                         value expressions: _col0 (type: struct<columntype:string,maxlength:bigint,sumlength:bigint,count:bigint,countnulls:bigint,bitvector:string,numbitvectors:int>), _col1 (type: struct<columntype:string,maxlength:bigint,sumlength:bigint,count:bigint,countnulls:bigint,bitvector:string,numbitvectors:int>)
>         Reducer 2
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: compute_stats(VALUE._col0), compute_stats(VALUE._col1)
>                 mode: mergepartial
>                 outputColumnNames: _col0, _col1
>                 Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
>                 Select Operator
>                   expressions: _col0 (type: struct<columntype:string,maxlength:bigint,avglength:double,countnulls:bigint,numdistinctvalues:bigint>), _col1 (type: struct<columntype:string,maxlength:bigint,avglength:double,countnulls:bigint,numdistinctvalues:bigint>)
>                   outputColumnNames: _col0, _col1
>                   Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
>                   File Output Operator
>                     compressed: false
>                     Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
>                     table:
>                         input format: org.apache.hadoop.mapred.TextInputFormat
>                         output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                         serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-2
>     Column Stats Work
>       Column Stats Desc:
>           Columns: key, value
>           Column Types: string, string
>           Table: src
> {noformat}
> This query fails because {{SparkMapJoinResolver#createSparkTask}} swaps
> the order of the two tasks in the root task list. This is surprising:
> if both are root tasks, their order should not matter.
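
One possible reading of the puzzle above, offered as an assumption rather than something this issue confirms: the Column Stats Work in Stage-2 consumes the rows that Stage-0 writes, but that dependency is not recorded in the plan, so it only holds while the root task list keeps Stage-0 first. The toy model below is not Hive code. The class `RootTaskOrderSketch` and the method `runInListOrder` are hypothetical names, and the rule "Stage-2 needs Stage-0's output" is hard-coded purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model (not Hive's classes): root tasks are launched in list order,
// and one of them silently relies on another having run first.
final class RootTaskOrderSketch {

    /** Runs the named tasks in list order and returns the tasks that failed. */
    static List<String> runInListOrder(List<String> rootTasks) {
        boolean statsRowsWritten = false;   // stands in for Stage-0's output file
        List<String> failed = new ArrayList<>();
        for (String task : rootTasks) {
            if (task.equals("Stage-0")) {
                // Assumed: the Spark stage writes the compute_stats result rows.
                statsRowsWritten = true;
            } else if (task.equals("Stage-2") && !statsRowsWritten) {
                // Assumed: the column stats work finds nothing to read.
                failed.add(task);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // Original order: nothing fails.
        System.out.println(runInListOrder(Arrays.asList("Stage-0", "Stage-2"))); // prints []
        // Swapped order: the stats stage runs before its input exists.
        System.out.println(runInListOrder(Arrays.asList("Stage-2", "Stage-0"))); // prints [Stage-2]
    }
}
```

If this reading holds, the order of the root list matters only because the dependency is implicit. Declaring Stage-2 as a child of Stage-0 would make the plan insensitive to how the root list is ordered.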