Query Using Stats

Bryan Jeffrey Fri, 16 May 2014 06:58:52 -0700

All,

I am executing the following query using Hadoop 2.2.0 and Hive 0.13.0.


/opt/hadoop/latest-hive/bin/beeline -u jdbc:hive2://server:10002/database
-n root --hiveconf hive.compute.query.using.stats=true -e "select
min(seconds), max(seconds), range from data where range > 1400204700 group
by range"

'range' above is our partition.  I would expect that this would provide a
reasonably fast response time by simply looking at the metadata for each
file in a given partition (maybe one mapper per range).  Instead we're
seeing 140+ mappers, and the query takes a long time.

Here is the explain plan:

/opt/hadoop/latest-hive/bin/beeline -u jdbc:hive2://server:10002/database
-n root --hiveconf hive.compute.query.using.stats=true -e "explain select
min(seconds), max(seconds), range from data where range > 1400204700 group
by range"
scan complete in 4ms
Connecting to jdbc:hive2://server:10002/database
Connected to: Apache Hive (version 0.13.0)
Driver: Hive JDBC (version 0.13.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+---------------------------------------------------------------------------------------------------------------------+
|                                                       Explain
                                          |
+---------------------------------------------------------------------------------------------------------------------+
| STAGE DEPENDENCIES:
                                          |
|   Stage-1 is a root stage
                                          |
|   Stage-0 is a root stage
                                          |
|
                                          |
| STAGE PLANS:
                                           |
|   Stage: Stage-1
                                           |
|     Map Reduce
                                           |
|       Map Operator Tree:
                                           |
|           TableScan
                                          |
|             alias: data
                                         |
|             Statistics: Num rows: 4860251901 Data size: 38882015268 Basic
stats: PARTIAL Column stats: NONE         |
|             Select Operator
                                          |
|               expressions: range (type: int), seconds (type: bigint)
                                  |
|               outputColumnNames: range, seconds
                                 |
|               Statistics: Num rows: 4860251901 Data size: 38882015268
Basic stats: PARTIAL Column stats: NONE       |
|               Group By Operator
                                          |
|                 aggregations: min(seconds), max(end_time_seconds)
                                 |
|                 keys: range (type: int)
                                          |
|                 mode: hash
                                           |
|                 outputColumnNames: _col0, _col1, _col2
                                           |
|                 Statistics: Num rows: 4860251901 Data size: 38882015268
Basic stats: PARTIAL Column stats: NONE     |
|                 Reduce Output Operator
                                           |
|                   key expressions: _col0 (type: int)
                                           |
|                   sort order: +
                                          |
|                   Map-reduce partition columns: _col0 (type: int)
                                          |
|                   Statistics: Num rows: 4860251901 Data size: 38882015268
Basic stats: COMPLETE Column stats: NONE  |
|                   value expressions: _col1 (type: bigint), _col2 (type:
bigint)                                     |
|       Reduce Operator Tree:
                                          |
|         Group By Operator
                                          |
|           aggregations: min(VALUE._col0), max(VALUE._col1)
                                           |
|           keys: KEY._col0 (type: int)
                                          |
|           mode: mergepartial
                                           |
|           outputColumnNames: _col0, _col1, _col2
                                           |
|           Statistics: Num rows: 2430125950 Data size: 19441007630 Basic
stats: COMPLETE Column stats: NONE          |
|           Select Operator
                                          |
|             expressions: _col1 (type: bigint), _col2 (type: bigint),
_col0 (type: int)                              |
|             outputColumnNames: _col0, _col1, _col2
                                           |
|             Statistics: Num rows: 2430125950 Data size: 19441007630 Basic
stats: COMPLETE Column stats: NONE        |
|             File Output Operator
                                           |
|               compressed: false
                                          |
|               Statistics: Num rows: 2430125950 Data size: 19441007630
Basic stats: COMPLETE Column stats: NONE      |
|               table:
                                           |
|                   input format: org.apache.hadoop.mapred.TextInputFormat
                                           |
|                   output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
        |
|                   serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                |
|
                                          |
|   Stage: Stage-0
                                           |
|     Fetch Operator
                                           |
|       limit: -1
                                          |
|
                                          |
+---------------------------------------------------------------------------------------------------------------------+
50 rows selected (0.417 seconds)
Beeline version 0.13.0 by Apache Hive
Closing: 0: jdbc:hive2://viper:10002/intrepid

Can anyone enlighten me as to how this could be optimized?

Regards,

Bryan Jeffrey

Query Using Stats

Reply via email to