Re: Query Using Stats

Bryan Jeffrey Fri, 16 May 2014 19:16:36 -0700

Prasanth,

I had the correct flag enabled (see query in original email). Issue is that
it does not appear to be correctly using partition stats for the
calculation. Table is an orc table. It appears in the log that stats are
being calculated, but does not appear to be working when queries are run
against tables.


Regards,

Bryan Jeffrey
On May 16, 2014 8:19 PM, "Prasanth Jayachandran" <
pjayachand...@hortonworks.com> wrote:

> Bryan,
>
> The flag you are looking for is hive.compute.query.using.stats. By
> default this optimization is disabled. You might need to enable it to use
> it. Also the min/max/sum metadata are not looked up from the file but
> instead from metastore. Although file formats like ORC contains stats, they
> are not used to answer metadata only queries. Hive considers metastore as
> the only source of truth for answering such queries. You can look at this
> jira for further details https://issues.apache.org/jira/browse/HIVE-5483
>
> Thanks
> Prasanth Jayachandran
>
> On May 16, 2014, at 5:35 AM, Bryan Jeffrey <bryan.jeff...@gmail.com>
> wrote:
>
> All,
>
> I am executing the following query using Hadoop 2.2.0 and Hive 0.13.0.
>
> /opt/hadoop/latest-hive/bin/beeline -u jdbc:hive2://server:10002/database
> -n root --hiveconf hive.compute.query.using.stats=true -e "select
> min(seconds), max(seconds), range from data where range > 1400204700 group
> by range"
>
> 'range' above is our partition.  I would expect that this would provide a
> reasonably fast response time by simply looking at the metadata for each
> file in a given partition (maybe one mapper per range).  Instead we're
> seeing 140+ mappers, and the query takes a long time.
>
> Here is the explain plan:
>
> /opt/hadoop/latest-hive/bin/beeline -u jdbc:hive2://server:10002/database
> -n root --hiveconf hive.compute.query.using.stats=true -e "explain select
> min(seconds), max(seconds), range from data where range > 1400204700
> group by range"
> scan complete in 4ms
> Connecting to jdbc:hive2://server:10002/database
> Connected to: Apache Hive (version 0.13.0)
> Driver: Hive JDBC (version 0.13.0)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
>
> +---------------------------------------------------------------------------------------------------------------------+
> |                                                       Explain
>                                             |
>
> +---------------------------------------------------------------------------------------------------------------------+
> | STAGE DEPENDENCIES:
>                                             |
> |   Stage-1 is a root stage
>                                             |
> |   Stage-0 is a root stage
>                                             |
> |
>                                             |
> | STAGE PLANS:
>                                            |
> |   Stage: Stage-1
>                                            |
> |     Map Reduce
>                                            |
> |       Map Operator Tree:
>                                            |
> |           TableScan
>                                             |
> |             alias: data
>                                            |
> |             Statistics: Num rows: 4860251901 Data size: 38882015268
> Basic stats: PARTIAL Column stats: NONE         |
> |             Select Operator
>                                             |
> |               expressions: range (type: int), seconds (type: bigint)
>                                   |
> |               outputColumnNames: range, seconds
>                                    |
> |               Statistics: Num rows: 4860251901 Data size: 38882015268
> Basic stats: PARTIAL Column stats: NONE       |
> |               Group By Operator
>                                             |
> |                 aggregations: min(seconds), max(end_time_seconds)
>                                    |
> |                 keys: range (type: int)
>                                             |
> |                 mode: hash
>                                            |
> |                 outputColumnNames: _col0, _col1, _col2
>                                            |
> |                 Statistics: Num rows: 4860251901 Data size: 38882015268
> Basic stats: PARTIAL Column stats: NONE     |
> |                 Reduce Output Operator
>                                            |
> |                   key expressions: _col0 (type: int)
>                                            |
> |                   sort order: +
>                                             |
> |                   Map-reduce partition columns: _col0 (type: int)
>                                             |
> |                   Statistics: Num rows: 4860251901 Data size:
> 38882015268 Basic stats: COMPLETE Column stats: NONE  |
> |                   value expressions: _col1 (type: bigint), _col2 (type:
> bigint)                                     |
> |       Reduce Operator Tree:
>                                             |
> |         Group By Operator
>                                             |
> |           aggregations: min(VALUE._col0), max(VALUE._col1)
>                                            |
> |           keys: KEY._col0 (type: int)
>                                             |
> |           mode: mergepartial
>                                            |
> |           outputColumnNames: _col0, _col1, _col2
>                                            |
> |           Statistics: Num rows: 2430125950 Data size: 19441007630 Basic
> stats: COMPLETE Column stats: NONE          |
> |           Select Operator
>                                             |
> |             expressions: _col1 (type: bigint), _col2 (type: bigint),
> _col0 (type: int)                              |
> |             outputColumnNames: _col0, _col1, _col2
>                                            |
> |             Statistics: Num rows: 2430125950 Data size: 19441007630
> Basic stats: COMPLETE Column stats: NONE        |
> |             File Output Operator
>                                            |
> |               compressed: false
>                                             |
> |               Statistics: Num rows: 2430125950 Data size: 19441007630
> Basic stats: COMPLETE Column stats: NONE      |
> |               table:
>                                            |
> |                   input format: org.apache.hadoop.mapred.TextInputFormat
>                                            |
> |                   output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>         |
> |                   serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                 |
> |
>                                             |
> |   Stage: Stage-0
>                                            |
> |     Fetch Operator
>                                            |
> |       limit: -1
>                                             |
> |
>                                             |
>
> +---------------------------------------------------------------------------------------------------------------------+
> 50 rows selected (0.417 seconds)
> Beeline version 0.13.0 by Apache Hive
> Closing: 0: jdbc:hive2://viper:10002/intrepid
>
> Can anyone enlighten me as to how this could be optimized?
>
> Regards,
>
> Bryan Jeffrey
>
>
>
>
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity
> to which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.

Re: Query Using Stats

Reply via email to