All,
I am executing the following query using Hadoop 2.2.0 and Hive 0.13.0.
/opt/hadoop/latest-hive/bin/beeline -u jdbc:hive2://server:10002/database
-n root --hiveconf hive.compute.query.using.stats=true -e "select
min(seconds), max(seconds), range from data where range > 1400204700 group
by range"
'range' above is our partition. I would expect that this would provide a
reasonably fast response time by simply looking at the metadata for each
file in a given partition (maybe one mapper per range). Instead we're
seeing 140+ mappers, and the query takes a long time.
Here is the explain plan:
/opt/hadoop/latest-hive/bin/beeline -u jdbc:hive2://server:10002/database
-n root --hiveconf hive.compute.query.using.stats=true -e "explain select
min(seconds), max(seconds), range from data where range > 1400204700 group
by range"
scan complete in 4ms
Connecting to jdbc:hive2://server:10002/database
Connected to: Apache Hive (version 0.13.0)
Driver: Hive JDBC (version 0.13.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+---------------------------------------------------------------------------------------------------------------------+
| Explain
|
+---------------------------------------------------------------------------------------------------------------------+
| STAGE DEPENDENCIES:
|
| Stage-1 is a root stage
|
| Stage-0 is a root stage
|
|
|
| STAGE PLANS:
|
| Stage: Stage-1
|
| Map Reduce
|
| Map Operator Tree:
|
| TableScan
|
| alias: data
|
| Statistics: Num rows: 4860251901 Data size: 38882015268 Basic
stats: PARTIAL Column stats: NONE |
| Select Operator
|
| expressions: range (type: int), seconds (type: bigint)
|
| outputColumnNames: range, seconds
|
| Statistics: Num rows: 4860251901 Data size: 38882015268
Basic stats: PARTIAL Column stats: NONE |
| Group By Operator
|
| aggregations: min(seconds), max(end_time_seconds)
|
| keys: range (type: int)
|
| mode: hash
|
| outputColumnNames: _col0, _col1, _col2
|
| Statistics: Num rows: 4860251901 Data size: 38882015268
Basic stats: PARTIAL Column stats: NONE |
| Reduce Output Operator
|
| key expressions: _col0 (type: int)
|
| sort order: +
|
| Map-reduce partition columns: _col0 (type: int)
|
| Statistics: Num rows: 4860251901 Data size: 38882015268
Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col1 (type: bigint), _col2 (type:
bigint) |
| Reduce Operator Tree:
|
| Group By Operator
|
| aggregations: min(VALUE._col0), max(VALUE._col1)
|
| keys: KEY._col0 (type: int)
|
| mode: mergepartial
|
| outputColumnNames: _col0, _col1, _col2
|
| Statistics: Num rows: 2430125950 Data size: 19441007630 Basic
stats: COMPLETE Column stats: NONE |
| Select Operator
|
| expressions: _col1 (type: bigint), _col2 (type: bigint),
_col0 (type: int) |
| outputColumnNames: _col0, _col1, _col2
|
| Statistics: Num rows: 2430125950 Data size: 19441007630 Basic
stats: COMPLETE Column stats: NONE |
| File Output Operator
|
| compressed: false
|
| Statistics: Num rows: 2430125950 Data size: 19441007630
Basic stats: COMPLETE Column stats: NONE |
| table:
|
| input format: org.apache.hadoop.mapred.TextInputFormat
|
| output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
|
| serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
|
|
|
| Stage: Stage-0
|
| Fetch Operator
|
| limit: -1
|
|
|
+---------------------------------------------------------------------------------------------------------------------+
50 rows selected (0.417 seconds)
Beeline version 0.13.0 by Apache Hive
Closing: 0: jdbc:hive2://viper:10002/intrepid
Can anyone enlighten me as to how this could be optimized?
Regards,
Bryan Jeffrey